November 12, 2013
My ideas often come on a grand scale, and many of the ambitious projects I want to tackle require a fundamental paradigm shift in some field/industry. I’ll list a few of them here. I didn’t invent most of them (even though I thought I did), and some of them are partially-solved (but none are in widespread use).
- In an earlier post I wrote about a future world where meetings are scheduled by a computer system that knows everyone’s time constraints. The system should be able to automatically shuffle around everyone’s schedules to make all kinds of crazy things possible. For example, suppose a musician gets sick right before a rehearsal. The system could reschedule the rehearsal for a later date, in a way that works for everyone in the band. If there is no time that works for the whole group, the system controls each member’s schedule and re-optimizes it to make a time. Or on a bigger scale: think about an entire university using this to schedule class times (dynamically). There’s a lot more to this idea, so check out that blog post for details.
- Unit tests are often used not to assert that a function is correct, but rather to detect when its behavior has been changed (possibly by a seemingly-unrelated change somewhere else in the codebase). It seems like these kinds of tests could be automatically generated. I could imagine a build system that automatically generates random inputs to every function and then records their outputs. Then whenever you build/deploy the application, the system runs the new code against these inputs and checks that the outputs match what was previously recorded. This doesn’t guarantee correctness, but it hopefully tells you when you’ve broken something.
- A type system that allows you to prove that function calls will not allocate memory and forget to release it. I’ve worked out a system where this is possible in another blog post. Similarly, a type system that lets you prove termination or get bounds on the asymptotic resource (CPU/memory) usage of a procedure. Or a type system that models latency in network requests (see my notes above about auto-generating AJAX calls). I have a lot of ideas for type systems. Some type system ideas that I thought I came up with (but didn’t): types that distinguish between unsafe vs. escaped strings (e.g. to prevent injection attacks), types that let you create ad-hoc subsets of types (e.g. even integers, sorted lists, etc.), and using monads for resource management.
- Software for life-critical systems, such as airplane autopilots, medical devices, etc. are often formally-verified—meaning that a computer program checks that the software matches the specification exactly (so it should be bug-free, so long as the specification is correct). Yet tons of life-critical software runs on unverified kernels, and so are fundamentally unsafe. Understandably, it would be near-impossible to verify a large kernel like Linux, but why can’t we have completely verified microkernels for use in these safety-critical applications? I know, easier said than done.
- Of course, it would be nice if every piece of software running on a machine was formally-verified. From the kernel, to the compiler, to the web browser. Pretty unrealistic, I know, but think for a second about what this means. Yes, your computer system would be bug-free—but there’s something else I want to draw attention to. If all software was formally-verified, it could all run in kernel mode. Similarly, all programs could share the same address space (no need for virtual memory). We could even eliminate the need for a task scheduler—all programs could implement cooperative multitasking, and we could be sure that no program would hang and bring the system down. Together, these facts would allow software to be vastly more performant than today’s computer systems. Modern hardware is very underutilized.
- Self-balancing electric unicycles. Unicycles very space efficient, but exhausting to ride (if you can even ride them at all). An electric one could be a very energy-efficient, compact mode of transportation, and the self-balancing system means that even non-unicyclers could ride it. Before you say it’s impossible, I built one and it worked really well before it was stolen from me. I could ride over 5 miles on a single charge, at about 15 mph. For longer trips I could easily take it on the subway with me. With only one wheel, they can be made cheaper than Segways.
- I have to remember a zillion different passwords for all my online accounts. If I could compute cryptographic hashes in my head, I could eliminate this need—I would simply have to remember a single master password, and my password for each website would be the hash of the master password concatenated with the name of the website. But no human is capable of this—but why should they be? That’s what computers are for. All web browsers should automatically implement this behavior for every password field—they should compute the hash of the password you enter concatenated with the website domain and send that to the server. The password you type in should NEVER make it to the server. Currently no browsers do this—why not? There are browser extensions that do this, but the fact that they are extensions means that most people aren’t benefitting from the idea.
- Algorithms theoreticians probably don’t spend a lot of time thinking about UTF-8 encodings for strings. We have a large theory of string algorithms, but they all assume indexing is constant-time. For a naive UTF-8 implementation of strings, indexing is linear-time. I bet we could come up with a clever string data structure for variable-width encodings. It should have worst-case logarithmic-time indexing (like a balanced tree), linear-time traversal, and its space usage should be better than just using fixed-width characters (that’s the whole point of variable-width encodings, after all). Or can we do even better, with average-case constant-time random access?
- Web applications depend on the browser to faithfully execute front-end code. Part of the reason that business logic typically resides on the server is because of security—front-end code can easily be modified by the user and shouldn’t be trusted. What if there was a way for the browser to keep track of the computation, and build a cryptographic proof that it was executed faithfully? Then the browser sends this “certificate of computation” along with the computed result back to the server, which checks that they match up (without actually re-executing the computation). Is this even possible? I recently discovered that I didn’t invent this idea—it’s called verifiable computing.
November 11, 2013
Yesterday I built Subjunct, a small web application that lets you share private messages.
You already have shared secrets (e.g. inside jokes) with your friends that can be used as passwords to secure information. Subjunct lets you use your existing shared secrets to communicate securely with other people, without requiring anyone to create an account.
As usual, the project is open source!
October 31, 2013
Note: All of the domain names mentioned below were unregistered at the time of this writing.
I recently wrote a script to find unregistered .com domains (see this post for details). It builds a simple statistical model of the English lexicon and generates word-like strings according to that model. I let the program run for a few hours, and it generated thousands of unregistered domain names. The results were unexpectedly entertaining.
Here are a few of the ones I especially liked:
antispicy.com brideweed.com dewdropping.com evilship.com honeysweetways.com milkgrower.com neckmold.com outjinx.com piewipe.com shakespearmaid.com speedboatman.com subpubic.com weirdlike.com
Could be real websites
The following domains seem to me like they could be actual startups, although I don’t know what these startups would do. Another description for this category is “domains that I think are worth more than their list price.”
blamework.com bordable.com cojustify.com factioner.com fribbly.com goosebeak.com harmproof.com hairsplice.com jawfall.com lowmeter.com mistag.com parabold.com pipewalk.com polycook.com punnable.com ruinproof.com songfulness.com stoplifted.com tipmost.com uglified.com voiled.com yardkeep.com
Found in a textbook
I must have trained the model on a rather technical dictionary, because many of the domains it produced sound like they came straight out of a textbook.
ganglionlike.com isoporic.com monozygous.com myotropism.com ozonomer.com phonozoa.com postasis.com protonometer.com scabiosis.com
Could be real words
A jillion of them seemed like they could be actual words, though they are not.
embowel.com introscopy.com isolatry.com pederate.com sejunctive.com varicate.com
Signal vs. noise
Some of these domains are better than others. For example, overfamed.com is probably worth more than subcollembolize.com. I would like to train a classifier, such as a support vector machine, on some hand-labeled examples and use it to filter out the bad names. What features would we give this classifier?
- Number of characters
- Number of vowels
- Number of Google search results
- Number of repeated characters
Hold on—I have an idea for a research project. Stay tuned!
October 30, 2013
In my last post, I wrote a small Python script called
domain_finder.pywhich finds unregistered .com domain names. It worked pretty well, so I decided to move it into a folder where I archive my small projects.
After moving it, I ran it again just for fun. But this time, it yielded a series of numbers before printing its real output:
$ ./domain_finder.py -102 -100 1 4 5 6 8 101 103 monial.com medinid.com <-- Available pluvious.com caribbed.com <-- Available infang.com ...
Where were those numbers coming from? I scanned the script for
domain = word + ".com" if available(domain): print domain + " <-- Available" else: print domain
It clearly wasn’t those. So if I wasn’t printing those numbers, who was? I took a look at my imports:
from collections import defaultdict from random import random, choice from string import ascii_lowercase from subprocess import Popen, PIPE from time import time, sleep
Those are all from the standard library—they should just work, no matter what, right? But I knew it had to be one of them. I started removing them one by one. The numbers disappeared when I commented this one out:
from collections import defaultdict from random import random, choice from string import ascii_lowercase # from subprocess import Popen, PIPE from time import time, sleep
subprocesswas printing those numbers? Weird. That’s a widely used Python library—surely people would have complained about this before me. Something wasn’t right. I opened a Python shell and tested it out:
>>> import subprocess -102 -100 1 4 5 6 8 101 103
Okay, at least there was nothing wrong with my script. So what’s
subprocessdoing? Is my installation of Python broken? I Google’d around—nothing came up. Why would moving the script cause the program to change behavior? I looked in the directory for clues.
$ ls domain_finder.py select.py merger.py
There were two other files, select.py (my linear-time implementation of the k-select algorithm) and merger.py (an algorithm I invented for a school project). It then dawned on me—select is a system call that is used to wait on file descriptors, and there is a corresponding Python module of the same name—and I bet it is imported by
To test my hypothesis, I renamed
kselect.py. Sure enough, that fixed it.
But this didn’t sit well with me. I never directly imported
select, and I only used modules in the standard library. My program should work anywhere. When you use the
subprocessmodule, there is an unwritten rule that you are not allowed to have a file named
select.pyin the same directory. Maybe I should have just not been stupid enough to name a file
select.py, but does that mean I’m supposed to know the name of every module in the standard library before naming a script?
October 29, 2013
Coming up with good product names is a weakness of mine, and finding names that are available as .com domains is especially difficult. I wrote a small Python script that trains an n-gram Markov chain on the English lexicon, generates word-like strings according to that model, and checks if the strings are available as .com domains.
Here is some sample output:
$ ./domain_finder.py sibility.com caliology.com mandarch.com hopefulness.com spinical.com <-- Available! opinate.com motion.com synous.com torrency.com <-- Available! mullock.com genetic.com apophysis.com intrated.com <-- Available! calycule.com <-- Available! amido.com swiveless.com <-- Available! bajury.com <-- Available! tendother.com <-- Available! embole.com argasia.com stroot.com ...
The code uses the
whoiscommand and the
/usr/share/dict/wordswordlist (both are available by default on Linux and Mac OS X installations).
#!/usr/bin/python -O from collections import defaultdict from random import random, choice from string import ascii_lowercase from subprocess import Popen, PIPE from time import time, sleep # get a list of words with only ASCII characters words = [w.strip().lower() for w in open("/usr/share/dict/words").readlines()] words = [w for w in words if all([c in ascii_lowercase for c in w])] words = ["^" + w + "$" for w in words if w != ""] # construct a discrete-time markov chain of n-grams n = 5 # this is the "n" in n-grams, try adjusting this for different results transitions = defaultdict(lambda: defaultdict(float)) for word in words: if len(word) >= n: transitions[""][word[:n]] += 1.0 for i in range(len(word) - n): gram = word[i : i + n] next = word[i + 1 : i + n + 1] transitions[gram][next] += 1.0 # normalize the probabilities for gram in transitions: total = sum([transitions[gram][next] for next in transitions[gram]]) for next in transitions[gram]: transitions[gram][next] /= total # sample a probability mass function (dict from elements to probabilities) def sample(pmf): sample = random() cdf = 0.0 for e in pmf: cdf += pmf[e] if cdf >= sample: return e return choice(pmf.keys()) # generate a word according to the markov chain def gen_word(): # start with a prefix word = sample(transitions[""]) # wait until the markov chain adds a terminator to the word while word[-1] != "$": # append a new letter chosen according to the markov chain gram = word[-n:] if gram in transitions: word += sample(transitions[gram])[-1:] else: word += choice(ascii_lowercase + "$") # optional: allow multi-word domains if word[-1] == "$" and random() > 0.7 and len(word) < 8: word += sample(transitions[""]) # remove the boundary markers and return the word return word.replace("^", "").replace("$", "") # check whether a domain is available (e.g., "example.com") # returns a bool indicating if the domain is available or None on timeout def available(domain): # use the "whois" command to determine availability process = Popen(["whois", domain], stdout=PIPE, stderr=PIPE) end_time = time() + 4.0 # timeout after 4 seconds while time() < end_time: if process.poll() is not None: return "No match for" in process.stdout.read() sleep(0.1) try: process.kill() except: pass return None # generate domain names forever while True: # generate a few words and pick the smallest word = sorted([gen_word() for i in range(3)], key=lambda x: len(x)) # report whether the domain is available domain = word + ".com" if available(domain): print domain + " <-- Available" else: print domain
As always, feel free to use this for anything. See the follow-up post.