
Some big ideas, and a few small ones

November 12, 2013
My ideas often come on a grand scale, and many of the ambitious projects I want to tackle would require a fundamental paradigm shift in some field or industry. I’ll list a few of them here. I didn’t invent most of them (even though I thought I did), and some of them are partially solved (but none of the solutions are in widespread use).
  • In an earlier post I wrote about a future world where meetings are scheduled by a computer system that knows everyone’s time constraints. The system should be able to automatically shuffle around everyone’s schedules to make all kinds of crazy things possible. For example, suppose a musician gets sick right before a rehearsal. The system could reschedule the rehearsal for a later date, in a way that works for everyone in the band. If there is no time that works for the whole group, the system, which controls each member’s schedule, can re-optimize everyone’s commitments to create one. Or on a bigger scale: imagine an entire university using this to schedule class times dynamically. There’s a lot more to this idea, so check out that blog post for details.
  • We live in the Dark Ages of web development. JavaScript is a sloppy, weakly-typed language that accidentally became the de facto front-end language of the web. CSS is rarely formally analyzed, and there is a lot of room for improvement there: I bet most web applications have hundreds or thousands of old, unused CSS rules/classes, because it’s hard to know what is being used and what isn’t. And, like JavaScript, CSS has a very sloppy feel—there are all kinds of browser hacks/extensions you need in order to write portable stylesheets. Back-end code is usually written in a different programming language, so writing a web application really means writing two applications—the front end and the back end—and making sure they speak the same language, which is very error-prone. Why can’t I write web applications in a single (possibly formally-verified) language that compiles to these ugly, error-prone technologies? I want a system that compiles my application into a front end and a back end, automatically generates the AJAX calls between them, and does the appropriate data marshaling. To me, the engineer, an AJAX call should look no different from a normal method call (maybe I don’t even care which of my methods run on the front end or the back end). I’ll only briefly mention that injection attacks, dead links, malformed HTML/CSS/JavaScript, etc. are all completely preventable—in the right system. All of these problems are solved in a web framework called Ur/Web (developed by one of my professors at MIT), but unfortunately you have to be an expert in type theory to understand it.
  • Unit tests are often used not to assert that a function is correct, but rather to detect when its behavior has changed (possibly due to a seemingly-unrelated change somewhere else in the codebase). It seems like these kinds of tests could be generated automatically. I could imagine a build system that generates random inputs to every function and records the outputs. Then, whenever you build/deploy the application, the system runs the new code against those inputs and checks that the outputs match what was previously recorded. This doesn’t guarantee correctness, but it hopefully tells you when you’ve broken something. (A minimal sketch of this idea appears after this list.)
  • A type system that allows you to prove that function calls will not allocate memory and then forget to release it; I worked out a system where this is possible in another blog post. Similarly, a type system that lets you prove termination, or one that gives bounds on the asymptotic resource (CPU/memory) usage of a procedure. Or a type system that models latency in network requests (see my notes above about auto-generating AJAX calls). I have a lot of ideas for type systems. Some type system ideas that I thought I came up with (but didn’t): types that distinguish between unsafe vs. escaped strings, e.g. to prevent injection attacks (illustrated after this list); types that let you create ad-hoc subsets of types (e.g. even integers, sorted lists, etc.); and using monads for resource management.
  • Software for life-critical systems, such as airplane autopilots and medical devices, is often formally verified—meaning that a computer program checks that the software matches its specification exactly (so it should be bug-free, as long as the specification is correct). Yet tons of life-critical software runs on unverified kernels, and so is fundamentally unsafe. Understandably, it would be near-impossible to verify a large kernel like Linux, but why can’t we have completely verified microkernels for use in these safety-critical applications? I know, easier said than done.
  • Of course, it would be nice if every piece of software running on a machine were formally verified: from the kernel, to the compiler, to the web browser. Pretty unrealistic, I know, but think for a second about what this would mean. Yes, your computer system would be bug-free—but there’s something else I want to draw attention to. If all software were formally verified, it could all run in kernel mode. Similarly, all programs could share the same address space (no need for virtual memory). We could even eliminate the need for a task scheduler: all programs could implement cooperative multitasking, and we could be sure that no program would hang and bring the system down. Together, these facts would allow computer systems to be vastly more performant than they are today. Modern hardware is very underutilized.
  • Self-balancing electric unicycles. Unicycles are very space-efficient, but exhausting to ride (if you can ride them at all). An electric one could be a very energy-efficient, compact mode of transportation, and the self-balancing system means that even non-unicyclists could ride it. Before you say it’s impossible: I built one, and it worked really well before it was stolen from me. I could ride over 5 miles on a single charge, at about 15 mph. For longer trips I could easily take it on the subway with me. And with only one wheel, they can be made more cheaply than Segways.
  • I have to remember a zillion different passwords for all my online accounts. If I could compute cryptographic hashes in my head, I could eliminate this need: I would simply have to remember a single master password, and my password for each website would be the hash of the master password concatenated with the name of the website. But no human is capable of this—and why should anyone be? That’s what computers are for. All web browsers should implement this behavior automatically for every password field: compute the hash of the password you enter concatenated with the website’s domain, and send that to the server. The password you type should NEVER make it to the server. Currently no browsers do this—why not? There are browser extensions that do it, but the fact that they are extensions means that most people aren’t benefiting from the idea. (A sketch of the hashing scheme appears after this list.)
  • Algorithms theoreticians probably don’t spend a lot of time thinking about UTF-8 encodings for strings. We have a large body of theory about string algorithms, but it all assumes that indexing is constant-time. For a naive UTF-8 implementation of strings, indexing is linear-time. I bet we could come up with a clever string data structure for variable-width encodings. It should have worst-case logarithmic-time indexing (like a balanced tree), linear-time traversal, and its space usage should be better than just using fixed-width characters (that’s the whole point of variable-width encodings, after all). Or can we do even better, with average-case constant-time random access? (A simplified sketch appears after this list.)
  • Web applications depend on the browser to faithfully execute front-end code. Part of the reason that business logic typically resides on the server is security: front-end code can easily be modified by the user and shouldn’t be trusted. What if there were a way for the browser to keep track of the computation and build a cryptographic proof that it was executed faithfully? The browser would send this “certificate of computation” along with the computed result back to the server, which would check that they match up (without actually re-executing the computation). Is this even possible? I recently discovered that I didn’t invent this idea—it’s called verifiable computing.
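To make the automatic-test-generation idea concrete, here is a minimal sketch in Python. The slugify function, its inputs, and the snapshot file name are all hypothetical; a real system would generate the inputs randomly and hook into the build.

import os, pickle

# record the output of f on each input, for comparison in future builds
def record(f, inputs, path):
  with open(path, "w") as snapshot:
    pickle.dump([(x, f(x)) for x in inputs], snapshot)

# re-run f on the recorded inputs and report whether every output still matches
def check(f, path):
  with open(path) as snapshot:
    return all([f(x) == y for x, y in pickle.load(snapshot)])

# example: guard a hypothetical function against accidental changes
def slugify(title):
  return title.strip().lower().replace(" ", "-")

inputs = ["Hello World", "  Markov Chains  "]
if not os.path.exists("slugify.snapshot"):
  record(slugify, inputs, "slugify.snapshot")
print check(slugify, "slugify.snapshot")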
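And here is a tiny illustration of the unsafe-vs.-escaped-strings idea. In plain Python the check can only happen at runtime; a real type system would reject the bad program at compile time.

import cgi

# wrapper types that distinguish raw user input from escaped text
class Unsafe(object):
  def __init__(self, s):
    self.s = s

class Escaped(object):
  def __init__(self, s):
    self.s = s

# the only way to obtain an Escaped value is to go through the escaper
def escape(unsafe):
  return Escaped(cgi.escape(unsafe.s))

# rendering demands an Escaped value, so raw input can't sneak into a page
def render(escaped):
  assert isinstance(escaped, Escaped)
  return escaped.s

print render(escape(Unsafe("<script>alert('pwned')</script>")))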
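The per-site password scheme is simple enough to sketch in a few lines. SHA-256 is an arbitrary choice here; a real implementation would use a deliberately slow key-derivation function such as PBKDF2, so that a leaked site password can't be brute-forced back to the master password.

import hashlib

# derive a per-site password from a single master password
def site_password(master, domain):
  return hashlib.sha256(master + domain).hexdigest()

# the server only ever sees the derived value, never the master password
print site_password("my master password", "example.com")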
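Finally, a simplified sketch of the variable-width indexing idea: recording the byte offset of every k-th character gives O(k)-time indexing with only about n/k extra space (a balanced tree over chunks would achieve the worst-case logarithmic bound mentioned above).

# a byte starts a character unless its top two bits are 10 (a continuation byte)
def is_start(byte):
  return (ord(byte) & 0xC0) != 0x80

# record the byte offset of every k-th character of a UTF-8 byte string
def build_index(data, k):
  offsets = []
  chars = 0
  for i in range(len(data)):
    if is_start(data[i]):
      if chars % k == 0:
        offsets.append(i)
      chars += 1
  return offsets

# fetch character j by jumping to the nearest sampled offset and scanning
def char_at(data, offsets, k, j):
  i = offsets[j // k]
  for _ in range(j % k):
    i += 1
    while not is_start(data[i]):
      i += 1
  end = i + 1
  while end < len(data) and not is_start(data[end]):
    end += 1
  return data[i:end]

data = "h\xc3\xa9llo"  # "héllo" encoded as UTF-8
offsets = build_index(data, 2)
print char_at(data, offsets, 2, 1)  # prints the two bytes of "é"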

Subjunct: share your secrets

November 11, 2013
Yesterday I built Subjunct, a small web application that lets you share private messages.
[Screenshot of Subjunct]
You already have shared secrets (e.g. inside jokes) with your friends that can be used as passwords to secure information. Subjunct lets you use your existing shared secrets to communicate securely with other people, without requiring anyone to create an account.
As usual, the project is open source!

Markov chain domain names follow-up

October 31, 2013
Note: All of the domain names mentioned below were unregistered at the time of this writing.
I recently wrote a script to find unregistered .com domains (see this post for details). It builds a simple statistical model of the English lexicon and generates word-like strings according to that model. I let the program run for a few hours, and it generated thousands of unregistered domain names. The results were unexpectedly entertaining.

My favorites

Here are a few of the ones I especially liked:
antispicy.com
brideweed.com
dewdropping.com
evilship.com
honeysweetways.com
milkgrower.com
neckmold.com
outjinx.com
piewipe.com
shakespearmaid.com
speedboatman.com
subpubic.com
weirdlike.com

Could be real websites

The following domains seem to me like they could be actual startups, although I don’t know what these startups would do. Another description for this category is “domains that I think are worth more than their list price.”
blamework.com
bordable.com
cojustify.com
factioner.com
fribbly.com
goosebeak.com
harmproof.com
hairsplice.com
jawfall.com
lowmeter.com
mistag.com
parabold.com
pipewalk.com
polycook.com
punnable.com
ruinproof.com
songfulness.com
stoplifted.com
tipmost.com
uglified.com
voiled.com
yardkeep.com

Found in a textbook

I must have trained the model on a rather technical dictionary, because many of the domains it produced sound like they came straight out of a textbook.
ganglionlike.com
isoporic.com
monozygous.com
myotropism.com
ozonomer.com
phonozoa.com
postasis.com
protonometer.com
scabiosis.com

Could be real words

A jillion of them seemed like they could be actual words, though they are not.
embowel.com
introscopy.com
isolatry.com
pederate.com
sejunctive.com
varicate.com

Signal vs. noise

Some of these domains are better than others. For example, overfamed.com is probably worth more than subcollembolize.com. I would like to train a classifier, such as a support vector machine, on some hand-labeled examples and use it to filter out the bad names. What features would we give this classifier? (A sketch of the classifier itself follows the list.)
  • Number of characters
  • Number of vowels
  • Number of Google search results
  • Number of repeated characters
  • ...
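Here is a minimal sketch of the classification step, assuming scikit-learn is installed. The features are the offline ones from the list above (search-result counts would require network calls), and the hand-labeled examples are made up purely for illustration.

from sklearn import svm

# map a candidate name to its offline features
def features(name):
  vowels = sum([1 for c in name if c in "aeiou"])
  repeats = sum([1 for a, b in zip(name, name[1:]) if a == b])
  return [len(name), vowels, repeats]

# hand-labeled examples (1 = good name, 0 = bad name)
examples = [("overfamed", 1), ("blamework", 1),
            ("subcollembolize", 0), ("phonozoa", 0)]
classifier = svm.SVC()
classifier.fit([features(name) for name, label in examples],
               [label for name, label in examples])
print classifier.predict([features("pipewalk")])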
Hold on—I have an idea for a research project. Stay tuned!

Leaky abstractions

October 30, 2013
In my last post, I wrote a small Python script called domain_finder.py which finds unregistered .com domain names. It worked pretty well, so I decided to move it into a folder where I archive my small projects.
After moving it, I ran it again just for fun. But this time, it yielded a series of numbers before printing its real output:
$ ./domain_finder.py 
-102
-100
1
4
5
6
8
101
103
monial.com
medinid.com <-- Available
pluvious.com
caribbed.com <-- Available
infang.com
...
Where were those numbers coming from? I scanned the script for print statements. There were only two:
domain = word + ".com"
if available(domain):
  print domain + " <-- Available"
else:
  print domain
It clearly wasn’t those. So if I wasn’t printing those numbers, who was? I took a look at my imports:
from collections  import defaultdict
from random       import random, choice
from string       import ascii_lowercase
from subprocess   import Popen, PIPE
from time         import time, sleep
Those are all from the standard library—they should just work, no matter what, right? But I knew it had to be one of them. I started removing them one by one. The numbers disappeared when I commented this one out:
from collections  import defaultdict
from random       import random, choice
from string       import ascii_lowercase
# from subprocess   import Popen, PIPE
from time         import time, sleep
So, subprocess was printing those numbers? Weird. That’s a widely used Python library—surely people would have complained about this before me. Something wasn’t right. I opened a Python shell and tested it out:
>>> import subprocess
-102
-100
1
4
5
6
8
101
103
Okay, at least there was nothing wrong with my script. So what’s subprocess doing? Is my installation of Python broken? I Googled around—nothing came up. Why would moving the script cause the program to change behavior? I looked in the directory for clues.
$ ls
domain_finder.py    select.py    merger.py
There were two other files: select.py (my linear-time implementation of the k-select algorithm) and merger.py (an algorithm I invented for a school project). Then it dawned on me: select is a system call used to wait on file descriptors, there is a Python module of the same name in the standard library, and I bet subprocess imports it.
To test my hypothesis, I renamed select.py to kselect.py. Sure enough, that fixed it.
But this didn’t sit well with me. I never directly imported select, and I only used modules from the standard library—my program should work anywhere. Apparently, when you use the subprocess module, there is an unwritten rule that you are not allowed to have a file named select.py in the same directory. Maybe I shouldn’t have been careless enough to name a file select.py in the first place, but does that mean I’m supposed to know the name of every module in the standard library before naming a script?
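For the curious, the mechanism is easy to see: Python prepends the running script’s directory to its module search path, so any local file named after a standard-library module shadows the real one for the entire process. A minimal demonstration (run it next to a file named select.py):

import sys
print sys.path[0]      # the script's directory, searched before the standard library
import select          # resolves to the local select.py, not the standard library module
print select.__file__  # points at the local file, confirming the shadowing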

Generating domain names with Markov chains

October 29, 2013
Coming up with good product names is a weakness of mine, and finding names that are available as .com domains is especially difficult. I wrote a small Python script that trains an n-gram Markov chain on the English lexicon, generates word-like strings according to that model, and checks if the strings are available as .com domains.
Here is some sample output:
$ ./domain_finder.py 
sibility.com
caliology.com
mandarch.com
hopefulness.com
spinical.com <-- Available
opinate.com
motion.com
synous.com
torrency.com <-- Available
mullock.com
genetic.com
apophysis.com
intrated.com <-- Available
calycule.com <-- Available
amido.com
swiveless.com <-- Available
bajury.com <-- Available
tendother.com <-- Available
embole.com
argasia.com
stroot.com
...
The code uses the whois command and the /usr/share/dict/words wordlist (both are present by default on Mac OS X and on most Linux installations).
#!/usr/bin/python -O
from collections  import defaultdict
from random       import random, choice
from string       import ascii_lowercase
from subprocess   import Popen, PIPE
from time         import time, sleep

# get a list of words with only ASCII characters
words = [w.strip().lower() for w in open("/usr/share/dict/words").readlines()]
words = [w for w in words if all([c in ascii_lowercase for c in w])]
words = ["^" + w + "$" for w in words if w != ""]

# construct a discrete-time markov chain of n-grams
n = 5 # this is the "n" in n-grams, try adjusting this for different results
transitions = defaultdict(lambda: defaultdict(float))
for word in words:
  if len(word) >= n:
    transitions[""][word[:n]] += 1.0
  for i in range(len(word) - n):
    gram = word[i : i + n]
    next = word[i + 1 : i + n + 1]
    transitions[gram][next] += 1.0

# normalize the probabilities
for gram in transitions:
  total = sum([transitions[gram][next] for next in transitions[gram]])
  for next in transitions[gram]:
    transitions[gram][next] /= total

# sample a probability mass function (dict from elements to probabilities)
def sample(pmf):
  sample = random()
  cdf = 0.0
  for e in pmf:
    cdf += pmf[e]
    if cdf >= sample:
      return e
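  # floating-point rounding can leave the running total just under 1.0; fall back to a uniform choice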
  return choice(pmf.keys())

# generate a word according to the markov chain
def gen_word():
  # start with a prefix
  word = sample(transitions[""])

  # wait until the markov chain adds a terminator to the word
  while word[-1] != "$":
    # append a new letter chosen according to the markov chain
    gram = word[-n:]
    if gram in transitions:
      word += sample(transitions[gram])[-1:]
    else:
      word += choice(ascii_lowercase + "$")

    # optional: allow multi-word domains
    if word[-1] == "$" and random() > 0.7 and len(word) < 8:
      word += sample(transitions[""])

  # remove the boundary markers and return the word
  return word.replace("^", "").replace("$", "")

# check whether a domain is available (e.g., "example.com")
# returns a bool indicating if the domain is available or None on timeout
def available(domain):
  # use the "whois" command to determine availability
  process = Popen(["whois", domain], stdout=PIPE, stderr=PIPE)
  end_time = time() + 4.0 # timeout after 4 seconds
  while time() < end_time:
    if process.poll() is not None:
      return "No match for" in process.stdout.read()
    sleep(0.1)
  try:
    process.kill()
  except OSError:
    # the process may have already exited between the poll and the kill
    pass
  return None

# generate domain names forever
while True:
  # generate a few words and keep the shortest
  word = min([gen_word() for i in range(3)], key=len)

  # report whether the domain is available
  domain = word + ".com"
  if available(domain):
    print domain + " <-- Available"
  else:
    print domain
As always, feel free to use this for anything. See the follow-up post.