Recently, when working with CSS, I wanted to automate some deployment-related tasks, such as: combining a
number of CSS files into one; fixing image paths from production to deployment values; and stripping out
the comments from the production-ready file. So I wrote a script that I now call every time I need
to deploy CSS, whether I'm adding a new design or modifying an existing one.
While working on the comment-stripping part of the script (which was the meat of the task), I researched a
style of programming called Automata-Based Programming. I liked it so much that I'll discuss its
characteristics and merits here.
Motivation for a character-based text processing script
I wanted to achieve the de-commenting in a single pass, producing an output CSS file without modifying the input
file. (I also didn't want to resort to regular expressions immediately, wanting to develop a more fine-grained,
character-based approach. My assumption was that a large file should be processed as a character stream, rather than
read into memory as a whole.)
Maybe a solution
The first thing that came to mind was the naive, character-for-character procedural approach,
something along the lines of, in pseudocode:
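The original pseudocode listing is not reproduced here, so the following is only a plausible reconstruction of that naive approach, in Ruby rather than pseudocode (the method name is mine). It consumes the input two characters at a time, which is exactly where the divisible-by-two assumption creeps in: it only works when the comment delimiters happen to line up with the pairs being read.

```ruby
# Naive pass: read the input in fixed pairs of characters, toggling an
# inside_comment flag by hand. Brittle by design -- it only strips a
# comment whose "/*" and "*/" fall exactly on a pair boundary.
def strip_comments_naive(css)
  out = +""
  inside_comment = false
  chars = css.each_char
  loop do
    a = chars.next
    b = chars.next                              # always consumes a pair: the flaw
    if !inside_comment && a == "/" && b == "*"
      inside_comment = true                     # entered a comment
    elsif inside_comment && a == "*" && b == "/"
      inside_comment = false                    # left a comment
    elsif !inside_comment
      out << a << b                             # ordinary characters pass through
    end
  end
  out
end
```

With an even number of characters before the comment, this works; shift the comment by one character and it is silently left in place.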
This was bound to build up to a crescendo of nested "if"s and "while"s. There is massive duplication in the calls to
getchar, as well as in the referencing and setting of
inside_comment. More importantly, this code builds on
the assumption that the number of characters preceding the start of a comment is divisible by two. If that number
isn't even, the solution fails, and it would take additional conditional branching to make it work. All in all, a brittle,
hardly readable solution with not-so-subtle logical flaws.
Next, I considered using a string library to delegate the pattern matching to. I had seen the
StringScanner class being used for parsing JSON, so I thought about using
StringScanner#scan to move through
the string finding comments. However, all that ultimately boiled down to was using a library to find occurrences of just two
two-character sequences: /* and */. It was all about those two combinations of two characters.
No nested structures, no recursion, no well-formed language constructs. No grammar to validate against. This realization
made using a library look rather redundant.
Pragmatic Programmers manipulate text the same way woodworkers shape wood.
So I decided to develop a raw, from-scratch approach: processing the CSS character by character.
I visualized characters being pumped from the source file to the target file. Obviously, that alone wasn't enough:
the characters would need to step through some sort of transformation phase. How could the de-commenting function
be formulated? What if there were an intermediate destination for these characters, one that would act like a valve,
letting characters through when they weren't inside a comment and closing when they were?
This "valve" data structure would accumulate characters once a comment was entered. When it detected the end of that
comment, the accumulated characters would be cleared (meaning, the comment would be discarded). This process would go on
until the end of the file was reached.
Implementation-wise, this would be a queue with some helper methods for analyzing the top two characters. The Ruby
Array class, with its
#shift method, provided a foundation for that data structure.
This approach opened the door to completely linearizing the program: only one "while" loop (the outermost loop
that reads in characters as long as there are more) was needed. That, it turned out, is one of the characteristics
of Automata-Based Programming: a flat, "un-tangled" program without many indentation levels and with just one
loop, the input-reading one.
With each step of the program (which adds a character to the data structure), I check for the presence of a comment based
on what the top two characters on the queue are (the values I'm looking for are /* and */).
A boolean flag (the
@comment variable) is then set accordingly.
Based on a subset of all possible values of these three factors (the top two characters and
@comment), one of the following three steps is executed:
- Dequeue the first character to the output (we have just entered a comment) and set the flag to true;
- Clear the queue (we have reached the end of a comment) and set the flag to false;
- Dequeue the first character to the output (no comment in sight).
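Putting those three steps together, a minimal sketch of the single-loop pass might look like the following (method and variable names are mine, and the details here surely differ from the actual program on GitHub):

```ruby
require "stringio"

# One "while" loop; at each step a character is enqueued, the two most
# recent characters are inspected, and one of the three branches runs.
def strip_css_comments(input, output)
  queue = []
  in_comment = false
  while (char = input.getc)                    # the single input-reading loop
    queue << char
    top = queue.last(2).join                   # the two most recent characters
    if !in_comment && top == "/*"              # just entered a comment
      output << queue.shift if queue.size > 2  # emit the character before "/"
      in_comment = true
    elsif in_comment && top == "*/"            # just reached the comment's end
      queue.clear                              # discard the accumulated comment
      in_comment = false
    elsif !in_comment && queue.size > 2        # no comment in sight
      output << queue.shift                    # pass a character through
    end
  end
  output << queue.join unless in_comment       # flush leftovers at end of file
  output
end
```

Note the shape: the branches sit side by side at one indentation level, and each one reads or writes the same small set of state (the queue and the flag).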
Here we have another characteristic of Automata-Based Programming: with each step of the program (here, adding a
character to the queue), we query and/or modify the program's "state". ("State" is the value of a set of variables
at a particular step of the program.) When the program continues, a number of branches are possible, and the branch associated with the current state is executed. This has the effect of "flattening out"
the program's logical branches: they become conditional statements that often all appear at one level. Almost assembly-like.
(Here is the complete program on GitHub.)
Characteristics and advantages
I find Automata-Based Programming very useful because it allows me to build programs around
solid logical structures. Because ABP is based on logical anchors for what is happening at each step, it
allows me to write programs that are easy to reason about. For example, the following statements about the
helper data structure can be formalized:
- A precondition for querying the state of the queue: the queue must contain at least two characters.
- A precondition for removing the first character from the queue: the queue must contain at least one character.
- An invariant for the queue: when we are not inside a comment, the queue is never longer than 3 characters.
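These statements can be made executable as guards and an invariant check. A hypothetical wrapper around the queue (GuardedQueue and its method names are illustrative, not part of the original program):

```ruby
# A queue whose operations enforce the preconditions and invariant above.
class GuardedQueue
  attr_writer :in_comment   # the surrounding loop flips this flag

  def initialize
    @chars = []
    @in_comment = false
  end

  def push(char)
    @chars << char
    check_invariant
  end

  # Precondition: at least two characters before querying the top pair.
  def top_two
    raise "precondition: need at least 2 characters" if @chars.size < 2
    @chars.last(2).join
  end

  # Precondition: at least one character before dequeuing.
  def shift
    raise "precondition: queue is empty" if @chars.empty?
    @chars.shift
  end

  def clear
    @chars.clear
  end

  private

  # Invariant: outside a comment, the queue never grows past 3 characters.
  def check_invariant
    raise "invariant violated" if !@in_comment && @chars.size > 3
  end
end
```

Violations then surface at the step where they happen, rather than as mangled output later.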
ABP-style programs are also easy to visualize in terms of their underlying theoretical model: automata.
Specifically, the visualization of the queue acting as a physical valve was the crucial step in solving the problem.
All that remained was to write the program.
Programs created in this style are easy to read, a pleasure to write, and robust in structure, as they
build on a formal logical model (a Finite State Machine, or FSM). ABP-style programs allow one to program for
correctness and, therefore, testability. Beyond clean code, there are other reasons to be proficient in this
paradigm. Let's list a few of its applications:
Advanced text processing. The example in this article is very simple, but Finite State Machines are effective
at processing text input of varying complexity and structure, from markup to Turing-complete languages. An example
of this is the part of browser code called the tokenization algorithm. Its input is an HTML string and its output is a token
representation of that string. The whole algorithm is
expressed as a state machine.
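The real HTML tokenization algorithm has dozens of states; the following is just a toy state machine in its spirit, with two states and invented token names, to show the shape:

```ruby
# A toy tokenizer: state :data collects text, state :tag collects a tag's
# contents, and a token is emitted on each transition between them.
def toy_tokenize(html)
  state = :data
  buffer = +""
  tokens = []
  emit = lambda do |kind|
    tokens << [kind, buffer] unless buffer.empty?
    buffer = +""
  end
  html.each_char do |c|
    case state
    when :data
      if c == "<"
        emit.call(:text)
        state = :tag          # transition: data -> tag
      else
        buffer << c
      end
    when :tag
      if c == ">"
        emit.call(:tag)
        state = :data         # transition: tag -> data
      else
        buffer << c
      end
    end
  end
  emit.call(:text)            # flush trailing text at end of input
  tokens
end
```

Every character is handled by exactly one state, so adding a new construct means adding a state and its transitions, not another level of nesting.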
The logic behind Graphical User Interfaces. Because user interfaces are stateful, FSMs are very
well-suited for modeling them in States, Events, and Transitions. For a modern example, see Alex MacCaw's
library for building RIAs.
Making Object-Oriented programming better by encapsulating and controlling "state". In OO,
objects have state at runtime. State (defined as the total value of an object's instance variables) is one-third
of the OO trinity: state, behavior, and identity.
OO state, however, is less rigorously defined than FSM state, and is therefore easier to get wrong.
(See Rich Hickey's article on state.)
Especially in large OO programs, state often acquires somewhat of a life of its own, causing side effects
and unexpected behavior. Constructing OO programs that implement state in accordance with the state machine
model is useful in making stateful programs more manageable and predictable.
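One way to bring FSM discipline to OO state is to make the states and transitions explicit data. A sketch (the class and its transition table are invented for illustration):

```ruby
# An object whose state can only change along a declared transition,
# instead of any method mutating any instance variable freely.
class Connection
  TRANSITIONS = {
    closed: [:open],
    open:   [:closed, :error],
    error:  [:closed]
  }.freeze

  attr_reader :state

  def initialize
    @state = :closed
  end

  def transition_to(new_state)
    unless TRANSITIONS.fetch(@state).include?(new_state)
      raise ArgumentError, "no transition #{@state} -> #{new_state}"
    end
    @state = new_state
  end
end
```

An invalid state change fails loudly at the point of the mistake, which is exactly the predictability the FSM model buys.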
Some resources on state-based computation and Automata-Based Programming:
Wikipedia has a fantastic article with an example of how to transform a character-processing program from
imperative style into ABP style: Automata-Based Programming.
In "Programming Pearls", Jon Bentley describes the evolution of a program to find the maximal-value subarray
in an array of integers. The final solution combines scanning and state-saving to achieve a very short and fast
program. Although it doesn't have explicit "State" classes or transitions, it works by querying and (re)setting state
at each step; I like to think of it as the prototypal ABP-style program.
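Bentley's final solution is essentially what is now commonly called Kadane's algorithm. A sketch in Ruby (the method name is mine): one pass, with the entire "state" held in two variables that are queried and reset at every step.

```ruby
# Scanning and state-saving: one pass over the array, two state variables.
def max_subarray_sum(numbers)
  best = 0              # best sum seen so far (the empty subarray counts as 0)
  ending_here = 0       # best sum of a subarray ending at the current element
  numbers.each do |n|
    ending_here = [ending_here + n, 0].max   # extend the run, or abandon it
    best = [best, ending_here].max           # record a new maximum if reached
  end
  best
end
```

On Bentley's own example array the answer is 187, found without ever re-scanning a prefix.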