## Unicode library for C++

October 30, 2012

I wrote a lightweight, low-level ASCII/Unicode processing library in C++ out of dissatisfaction with existing solutions. It’s very minimalistic and has no external dependencies. It supports the bare minimum features for dealing with text encodings:

// detect the encoding for a string encoding_type detect_encoding(const std::string &input); // determine whether a string is valid in a particular encoding bool is_valid(const std::string &input, encoding_type encoding); // convert a string from one encoding to another std::string convert_encoding(const std::string &input, encoding_type input_encoding, encoding_type output_encoding, bool include_bom); // get the number of code points in a string size_t get_length(const std::string &input, encoding_type encoding); // return the number of bytes of the code point at pos, or 0 if the byte index does not refer to a valid code point size_t get_char_size(const std::string &input, size_t pos, encoding_type encoding); // get the code point at a particular byte index uint32_t get_char(const std::string &input, size_t pos, encoding_type encoding); // set the code point at a particular byte index void set_char(std::string &input, size_t pos, uint32_t code_point, encoding_type encoding); // add a code point to the end of a string void add_char(std::string &input, uint32_t code_point, encoding_type encoding); // determine whether a code point is a letter bool is_alpha(uint32_t code_point); // determine whether a code point is uppercase bool is_upper(uint32_t code_point); // determine whether a code point is lowercase bool is_lower(uint32_t code_point); // determine whether a code point is titlecase bool is_title(uint32_t code_point); // determine whether a code point is a number bool is_numeric(uint32_t code_point); // determine whether a code point is whitespace bool is_whitespace(uint32_t code_point); // determine whether a code point is a line separator bool is_newline(uint32_t code_point); // convert a code point to uppercase (return the input if no uppercase form exists) uint32_t to_upper(uint32_t code_point); // convert a code point to lowercase (return the input if no lowercase form exists) uint32_t to_lower(uint32_t code_point); // convert a code point to titlecase (return the input if no titlecase form exists) uint32_t to_title(uint32_t code_point);

Get it here!

## Finite automata and regular expressions

October 28, 2012

Anyone who has taken a class on computability theory, compilers, or finite automata knows that regular expressions can be parsed in linear time by generating a nondeterministic finite automaton (NFA) from the regular expression and converting it into a deterministic one (DFA).

Unfortunately, regular expression libraries like Python’s

`re`

module frequently generate exponential-time parsers. Why? The reason is that regex libraries like `re`

produce backtracking parsers that can do more than just parse regular languages. They can parse some non-regular languages and support extra features like backreferences, which finite automata alone cannot do.To combat this, I’ve built the backbone of a true regular expressions library: a finite automata engine. It can be used to build a regex parser generator that takes worst-case linear time, as a true regex parser should do.

## Algebraic data types

August 21, 2012

### Primitive types

A type system starts with some primitive types. Some examples include:

`Int`

(the type of integers, perhaps limited to some range)`Real`

(the type of real numbers, perhaps represented as IEEE 754 floating-point numbers)`Char`

(the type of Unicode code points or ASCII characters)`Bool`

(logical`true`

and`false`

)

Any good programming language will let us construct compound types from simple ones like these. Let’s examine some of these operations, which are called type constructors.

### Product types

Most programmers will be familiar with product types, but in programming they are more commonly called

*tuples*(the name “tuple” is a generalization of the terms “double”, “triple”, etc.). A product type is a compound type that is formed by a sequence of types, and is commonly denoted`(T1, T2, ..., Tn)`

or `T1 × T2 × ... × Tn`

. In set theory parlance, tuples correspond to cartesian products, and that is why they’re called “product types.”As an example, we could represent a point in 3-space by a Cartesian triple:

`(Real, Real, Real)`

.Most programming languages will allow the elements of a tuple to be named (rather than indexed). This has the same type theoretic meaning, but named tuples are often called

*records*,*structs*, or*classes*, depending on the programming language and context.*Note:*In type theory, we usually do not distinguish between a 1-tuple (called a “singleton”) and the type of its element (however, most programming languages do make this distinction). This means that, for example,

`(Int)`

is usually equivalent to `Int`

in the literature.A special case of tuples is the 0-tuple, which is also called the unit type. The unit type can only hold one value (which means it stores zero information). It is sometimes called

`Void`

or `Unit`

, and its value is often called `null`

, `nil`

, `none`

, or `nothing`

.Common properties of the type product constructor:

- Closure: the product of two types is a type
- Associativity (sometimes):
`(A × B) × C = A × (B × C)`

- Identity element:
`Unit`

Depending on the type system, types and their constructors can form various algebraic structures, such as a monoid.

### Sum types

Sum types, also called

*tagged unions*,*disjoint unions*, or*variants*, are used to describe values that could be of one of several types (the corresponding set theoretic operation is union of disjoint sets). We usually write sum types as`T1 + T2 + .. + Tn`

, `T1 | T2 | ... | Tn`

, or `T1 ∪ T2 ∪ ... ∪ Tn`

.Many programmers often use sum types without knowing it. For example, what is the return type of the following Python function?

def f(flag): return 5 if flag else "Hello"

It seems that

`f`

could return an `int`

or a `str`

, so its type must be `int + str`

.As a special case, if we take the sum of zero types, we get a special type called

`Bottom`

or `⊥`

, which can take on no values. A function which does not return (i.e., does not halt) is said to have a return type of `Bottom`

. Likewise, the sum of all types is called `Top`

or `⊤`

, and it is sometimes called `Any`

, `Object`

, or `General`

, because it can take on any value.Note that, unlike product types, sum types can be defined inductively, as we shall see with lists and trees.

Properties of the union constructor:

- Closure: the union of two types is a type
- Associativity (sometimes):
`(A + B) + C = A + (B + C)`

- Commutativity:
`A + B = B + A`

- Identity element:
`Bottom`

You can see that types and the sum operation might form a commutative monoid, depending on whether

`+`

is associative.### Algebraic data types

We can represent a number of different data structures using

*algebraic data types*, which are simply sums of products of types. For example, we can represent a list of integers asIntList = Unit + Int × IntList

where the product operation (

`×`

) has a higher precedence than the sum operation (`+`

), and `Unit`

is the unit type (which represents an empty list here). In English, this means: *a list of integers is either the empty list or an integer paired with a list of integers.*Using a nominative type system, we can also represent enumerated types as the union of disjoint unit types:

Clubs = Unit Diamonds = Unit Hearts = Unit Spades = Unit Suit = Clubs + Diamonds + Hearts + Spades

This only works in a nominative type system because, for example,

`Clubs`

and `Diamonds`

need to be disjoint unit types or else the their union would still be the unit type (in type theory we talk about *the*unit type, but in a nominative type system we can have several unit types). In a structural type system, we would have`Suit = Clubs = Diamonds = Hearts = Spades`

, which is not what we want.We might define a binary tree of integers like this:

IntTree = Unit + Int × IntTree × IntTree

This reads:

*a binary tree can be the empty tree or an integer and two subtrees.*### Fun with algebra

The title of this article suggests that we can manipulate expressions involving types algebraically, as we do with numbers. To make this easier, let’s adopt the following notation:

`1`

is the unit type`T`

is a primitive type`X + Y`

is the sum of two types`X × Y`

is the product of two types (which has a higher syntactic precedence than sum)

Additionally, we can let the rest of the natural numbers represent sums of unit types (e.g.

`2 = 1 + 1 = bool`

). Repeated products can be written as exponentials (e.g. `X × X × X = X^3`

). Now let’s describe some common data structures.`1 + T`

is a nullable type (also called an option type or a maybe type)`L = 1 + T × L`

is the type of lists`B = 1 + T × B × B`

is the type of binary trees, like`IntTree`

defined above

Take a look at that second one (lists). With a bit of hand-waving (treating type addition and multiplication like the corresponding operations on reals), we can solve for

`L`

:L = 1 + T × L L - T × L = 1 L × (1 - T) = 1 L = 1 / (1 - T) L = 1 + T + T^2 + T^3 + ...

Interpreting that last line in set theoretic terms, we can translate it into English as:

*a list is either empty, or it contains one element, or two elements, or three elements, etc.*Pretty cool, huh? If we’re treating these symbols as numbers, then they correspond to the number of values that the corresponding types can take on (i.e. the sizes of the corresponding sets). For example, the unit type has 1 possible value, so in this notation, it becomes`1`

. The fact that `L = 1 + T + T^2 + T^3 + ...`

does not converge simply means that there are an infinite number of list topologies we can construct. In general, the algebraic forms of inductive datatypes do not converge.I want to mention one more thing you can do with this notation. In 1997, a computer scientist named Gérard Huet suggested that we can also take derivatives of types in an unpublished work on zipper types. Oddly, the derivative laws look very much like their classical counterparts from differential calculus! The derivative of a type describes the “one-hole context” of a type, which is a data structure that represents a pointer to an item in a larger container type. For example, one-hole contexts can be used to mark the position of an item in a linked list or a binary tree and can be used to efficiently implement such containers in a purely functional fashion.

## My electric unicycle

August 21, 2012

### Introduction

I recently built a partially self-balancing electric unicycle called “Bullet,” featuring:

- A custom MIG-welded steel chassis
- A 450 Watt electric motor
- Two 7 Ah 12 Volt batteries
- A 5DOF inertial measurement unit
- The OSMC H-bridge
- An ATmega328P microcontroller

I say “partially” self-balancing because it only balances along one axis (forward/backward), and the rider still needs to balance left and right (it’s analagous to riding a bicycle “no hands”). It operates much like a Segway—you lean forward to accelerate, and lean back to brake. The top speed is about 15 mph, and it easily goes 5 miles on a single charge. This was my primary mode of transportation on the MIT campus (see the “Press and Aftermath” section below).

The thing in my right hand (in the video at the bottom) is a “kill switch”—if I let go of it, the unicycle deactivates the motor, much like a treadmill safety key.

Also, you may notice that the circuit is fully exposed. When the unicycle falls forward, the circuit has about two inches of ground clearance (so it never hits the floor). However, I’m currently building a proper case for more protection.

### Details

To estimate its orientation, Bullet integrates readings from the gyro and accelerometer using a complementary filter. To balance, the angle estimate is fed through a PID loop (with no integral term). The loop runs at 625 Hz. The output from this stage determines the duty cycle of a 1.22 kHz PWM signal, which is connected to the H-bridge. The code was written in C, and is in the public domain.

The circuit itself is essentially a pretty standard microcontroller setup—filtering capacitors on the power rails, RESET pin held high (for AVR microcontrollers), external 20 MHz crystal oscillator, some indicator LEDs, and the IMU connected to the ADC pins. Two of the digital output pins are wired to the motor controller, which is connected to the motor and the batteries. The motor controller has an onboard switching voltage regulator that powers the logic subcircuit (and also the charge pump for the high-side MOSFETs of the H-bridge). I expect that most undergraduates in electrical engineering could probably reconstruct the circuit given this description, burn the code onto the microcontroller, and have a working electric unicycle controller. A trip to the machine shop to construct the chassis and voilà, you’ve made your very own self-balancing unicycle!

### Learning to ride

Unfortunately, one cannot simply pick up a self-balancing unicycle and ride it with ease. It took me several hours to be able to ride in a straight line without crashing, and it took several days to learn how to turn in a controlled manner. Many of my friends have tried riding it, usually with little success (including some actual unicyclers). However, my talented friend Harry can ride it about as well as I do.

Here are some riding tips, in case you ever build one:

- Don’t let go of the safety switch unless you crash. Bullet is designed to help you balance—let it do it’s job (even while mounting). If you ever let go of the safety switch while riding, you will fall (however, it was designed so that if you accidentally let go for a fraction of a second, you can usually recover).
- Bumps in the terrain are
*always*bigger than they appear. - Put most of your wait on the seat, not the pedals.
- Gain weight! An inverted pendulum with a high center of gravity is easier to balance than one where most of the weight comes from the motor and wheel. To illustrate this fact, try balancing a long broom on your hand (with the broomstick in your palm) vs. a short broom.
- When you first press the safety switch, Bullet will assume an upright position as fast as it can, even if you are in the way. So don’t be in the way (it can run into your ankles or run over your foot)!

### Practicality

As I said earlier, Bullet is the primary way I navigate MIT and the surrounding Cambridge area. I often zoom past students, faculty, custodians, and tourists, with generally positive reactions from everyone. I’ve been told one can be fined for riding a scooter in the Infinite Corridor. Fortunately, Bullet ain’t no scooter.

However, Bullet is

*heavy*. I live on the fifth floor of a dorm with no elevator, so carrying Bullet up and down the stairs can be quite tiresome. If I were to start over, I would try to minimize weight with a wheel hub motor, NiCad or NiMH batteries, and an aluminum frame (instead of steel).Bullet’s small size (compared to a bicycle, scooter, moped, etc.) means I can take it on the bus or train with no problem, so I can take public transportation to the stop nearest my destination and ride Bullet the rest of the way. With over 5 miles of range on a single-charge, battery life is usually not a problem in my urban hometown.

### Related work

I am certainly not the first person to build an electric unicycle. Perhaps the most well-known self-balancing unicycle is Trevor Blackwell’s Eunicycle, which also uses the OSMC. His design is similar to mine, but uses a much more expensive battery pack ($218 for his vs $44 for mine). Also, the Eunicycle’s motor and gearbox cost a grand total of $644, whereas Bullet’s drive system (including the wheel itself) was $195. Finally, the IMU he uses is about $100 more than mine. Overall, Bullet is several hundred dollars cheaper than the Eunicycle, but this comes at a price (mostly weight).

Focus Designs (of whom Adam Savage of MythBusters fame is a customer) manufactures a commercial electric unicycle called the SBU (“Self-Balancing Unicycle”). At $1,795, the SBU is much more expensive than mine, but it has extra safety features and a hub motor, which results in a smoother ride. For comparison, Bullet uses a toothed belt drive, and the Eunicycle features a metal gearbox. Because of the belt drive, I can often feel Bullet’s motor engaging the wheel with a slight jerk.

### Video

Here is a video of me riding it and some of my friends trying it out (no audio, sorry):

### Press and aftermath

Bullet has been featured on a number of media outlets, including:

- MIT Admissions
- Engadget
- MAKE
- Hack A Day
- MIT CSAIL
- Slice of MIT
- The Huffington Post
- Coolest Gadgets
- EcoFriend
- TheTechJournal
- PhysOrg.com
- TechMog
- Softpedia
- The Green Optimistic
- New Rising Media
- TrendHunter
- All Best Reviews
- DailyMail
- GizmoWatch
- BostInno
- The Register
- Ubergizmo
- DVICE
- Geekosystem
- Neatorama
- Cambridge Bicycle
- The Discovery Channel

## Monads, part 2: impure computations

July 9, 2012

### Refresher

In the last post, we learned about monads and how to use them to implement exception handling, vector operations, and debug output. To refresh your memory, a monad is defined by:

- A type constructor that defines a
*monadic type*(usually parameterized by some other type). - A
`unit`

function that takes a value and returns a value in the corresponding monadic type. - A
`bind`

operation that takes a monadic value and a function from the non-monadic type to the monadic type. It returns a value in the monadic type.

### The `IO`

monad

Today we’ll talk about how one can use monads to describe impure computations in a pure language. Let’s define a monad which represents an impure computation:

class Computation: UNIT = 0 BIND = 1 INPUT = 2 OUTPUT = 3 def __init__(self, type, data): self.type = type self.data = data def __repr__(self): if self.type == Computation.UNIT: type_str = 'Computation.UNIT' elif self.type == Computation.BIND: type_str = 'Computation.BIND' elif self.type == Computation.INPUT: type_str = 'Computation.INPUT' elif self.type == Computation.OUTPUT: type_str = 'Computation.OUTPUT' return 'Computation(' + type_str + ', ' + self.data.__repr__() + ')'

Note that we’ve defined

`__repr__`

so that we can print a computation and examine its structure. Now, let’s define `unit`

and `bind`

for this monad:def unit(x): return Computation(Computation.UNIT, [x]) def bind(x, f): return Computation(Computation.BIND, [x, f])

Okay, that’s the whole monad! Let’s rewrite some helpers:

input = Computation(Computation.INPUT, []) def output(text): return Computation(Computation.OUTPUT, [text])

We now have a way of describing computations! It’s almost like we have a programming language within a programming language. In order for this monad to be useful, though, we need a way of executing a computation. As you know by now, this is not possible in our functional subset of Python. Let’s break the rules and write just one function in imperative Python, and we’ll consider this to be part of the runtime:

def execute(computation): if computation.type == Computation.UNIT: return computation.data[0] elif computation.type == Computation.BIND: return execute(computation.data[1](execute(computation.data[0]))) elif computation.type == Computation.INPUT: return raw_input() elif computation.type == Computation.OUTPUT: print computation.data[0] return None

Sweet, let’s try it out! We’ll start with

*Hello, World*:main = output('Hello, World!')

What is

`main`

? Let’s find out:>>> main Computation(Computation.OUTPUT, ['Hello, World!'])

It’s a computation of type

`Computation.OUTPUT`

, just as we would expect. Now let’s run this computation:>>> execute(main) Hello, World!

**Sweet victory!**You can now see how monads can be used to do I/O in a pure functional language. Let me emphasize that the

`execute`

procedure isn’t written in our pure language, and it’s not referentially transparent. Rather, it is implemented as part of the runtime. The interpreter uses it to execute our program (when we call `execute(main)`

, we are basically playing the role of the runtime).### More advanced examples

Now that we understand the idea, let’s try to write some more useful programs than

*Hello, World*. First, we can try out the program from the previous example which reads a line of text and prints it back:main = bind(input, output)

This time, we get:

>>> main Computation(Computation.BIND, [Computation(Computation.INPUT, []), <function output at 0x00000000027E1198>])

And running the program:

>>> execute(main) This will get printed back! This will get printed back!

Note that I entered the first

`This will get printed back!`

, and the second one was printed by the program as we would expect.Now let’s write a program that responds to user input:

def respond(input): if input == 'yes': return output('You said YES!') else: return output('You said NO!') main = bind(input, respond)

Let’s test it out:

>>> execute(main) yes You said YES! >>> execute(main) no You said NO!

It works! Let’s modify the example so that it keeps asking the user for input until he/she says

`yes`

:def main_wrapper(dummy): return main def respond(input): if input == 'yes': return output('You said YES!') else: return bind(output('Try saying \'yes\' once in a while.'), main_wrapper) main = bind(input, respond)

Does it work?

>>> execute(main) no Try saying 'yes' once in a while. no Try saying 'yes' once in a while. no Try saying 'yes' once in a while. yes You said YES!

Sweet—recursion doesn’t seem to be a problem either. Notice how we defined

`main_wrapper`

simply because `bind`

expects a function of one argument. It turns out that this is a common monadic pattern, so let’s abstract it:def sequence(u, v): return bind(u, lambda x: v)

`sequence`

performs the computation `u`

, ignores its return value, and then performs `v`

and returns the result. Note that `execute`

must be eager when evaluating its argument. If it was lazy, the first computation (`u`

) might never get executed because the lambda expression doesn’t use it. Let’s rewrite our example using `sequence`

:def respond(input): if input == 'yes': return output('You said YES!') else: return sequence(output('Try saying \'yes\' once in a while.'), main) main = bind(input, respond)

Does it still work?

>>> execute(main) no Try saying 'yes' once in a while. no Try saying 'yes' once in a while. no Try saying 'yes' once in a while. yes You said YES!

Excellent. To conclude, let’s write a program that asks the user for two lines of input, concatenates them, and prints the result:

def respond2(input1): return lambda input2: output('You said "' + input1 + '" and "' + input2 + '".') def respond1(input1): return bind(input, respond2(input1)) main = bind(input, respond1)

And finally:

>>> execute(main) hello world You said "hello" and "world".

It works! Hopefully by now you have an idea of how to compose impure computations with monads. Let me know in the comments if anything wasn’t clear!