Literate Code — a responsive prototype for a programming essay

Update February 22nd, 2013:

The Literate Code project is undergoing a big remake. A new version, soon to be released, will offer the following improvements:

  • Better responsiveness across different resolutions.
  • The ability to accommodate code listings of different widths.
  • Control over line wrapping and horizontal scrolling.
  • Simplified, more efficient markup and responsiveness.

The new version will be at this URL.

The Computer Science curriculum is a curious thing. On one hand, certain things remain unchanged: the von Neumnann architecture, automata theory, compilation & interpretation, fundamental data structures and algorithms, etc. These are the foundations.

On the other hand, the discipline of software engineering is dynamic and evolving. A lot of it is happening in Open Source; it is discussed in the developer blogosphere; and often documented online. This makes it a tough field to capture in a static, "tetxbook" curriculum.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style.

Who is better qualified, then, to teach software engineering than the experienced programmer / developer working right in the trenches? And, which channel is more natural for the dissemination of his knowledge than the web?

Self–publishing of such expert content has never been easier. But the presentation, reader–friendliness, and consumability of it deserve special attention. Packaging and design play a great role towards increasing the accessibility of content.

Literate Code, the demo for this page, is a prototype of a possible design approach (open source and hosted on GitHub). The ingredients?

  • A static adaptive layout (inspired by the brilliant Frameless Grid by Joni Korpi);
  • Optimal use of screen real estate, whether in Desktop, Tablet, or Mobile modes;
  • Syntax highlighting through google–code–prettify;
  • Color and typography defaults to maximize readability.

It's very much a "minimum viable product"; something to prove the concept. Got feedback? I'd love to hear it.

PS. The essay used for the demo below is a modified version of an article originally written in 2011. In programmer maturity years, that's about a decade. Still, as a demo it does the job.

Problem–solving with Automata–Based Programming


Tags: programming ruby

Recently when working with CSS, I wanted to automate some deployment-related tasks, such as: combining a number of CSS files into one; fixing image paths from production to deployment values; and stripping out the comments from the production–ready file. So I wrote a script that I that I now call every time I need to deploy CSS, whether when adding a new design or modifying an existing one.

While working on the comment–stripping part of the script (which was the meat of the task) I researched a style of programming called Automata–Based Programming. I liked it so much that I'll discuss its characteristics and merits here.

Motivation for a character–based text processing script

I wanted to achieve the de–commenting in a single pass, producing an output CSS file without modifying the input file. (I also didn't want to resort to using regular expressions immediately, wanting to develop a more fine–grained, character–based approach. My assumption was that a large file should be processed as a character stream, rather than a large string).

Maybe a solution

The first thing that came to mind was the naive, character–for–character procedural approach, something along the lines of, in pseudocode:

inside_comment = false
while not end_of_file do
    one = getchar
    two = getchar
    if not inside_comment and one =='/' and two =='*'
        inside_comment = true
        if not inside_comment
            print one, two
        if inside_comment
            if one =='*' and two =='/'
                inside_comment = false
                one = getchar
                two = getchar
            if one =='*' and two =='/'
                inside_comment = false

This was bound to build up to a crescendo of nested "if"s and "while"s. There is massive duplication with calls to getchar, as well as referencing and setting inside_comment. More importantly, this code builds on the assumption that the number of characters preceding the start of a comment is divisible by two. If that number isn't even, the solution will fail and it would take additional conditional branching to make it work. All in all, a brittle, hardly readable solution with to not–so–subtle logical flaws.

Next, I considered using a string library to delegate pattern–matching to. I had seen the StringScanner class being used for parsing JSON, so I thought about using StringScanner#scan to move through the string finding comments. However, all that ultimately boiled down to was using a library to find occurrences of just two patterns: /* and */. It was all about these two combinations of these two characters. No nested structures, no recursion, no well–formed language constructs. No grammar to validate against. This realization made using a library look rather redundant.

Pragmaitic Programmers manipulate text the same way woodworkers shape wood.

So I decided to develop a raw, from–scratch approach — processing the CSS character–by–character. I visualized characters being pumped from the source file to the target file. Obviously, that alone wasn't enough — the characters would need to step through some sort of transformation phase. How could the de–commenting function be formulated? What if there was an intermediate destination for these characters — one that would act like a valve, letting characters through if they were'nt inside a comment and closing when they were?

This "valve" data structure would accumulate characters when a comment was entered. When it would detect the end of that comment, the accumulated characters would be cleared (meaning, the comment would be discarded). This process would go on until end of file was reached.

Automata–Based solution

Implementation–wise, this would be a queue with some helper methods for analyzing the top two characters. The Ruby Array class, with it's #shift method, provided a foundation for that data structure.

This approach opened the door for completely linearizing the program — only one "while" loop (the outermost loop that reads in characters as long as there are more) was needed. That, it turned out, is one of the characteristics of Automata–Based Programming: a flat, "un–tangled" program without a lot of indentation levels and just one loop: the input–reading one.

def process(char)
  @comment = true if comment_start
  @comment = false if comment_end
  c = filter
  c unless c.nil?

With each step of the program (which adds a character to the data structure) I check for a presence of a comment based on what the top 2 characters on the queue are (the values I'm looking for are /* and */). A boolean flag (the @comment variable) is then set accordingly.

Based on a subset of all possible values of these three factors (comment_start, comment_end, and @comment), one of the following three steps is executed:

  • Dequeue the first character to the output (we have just entered a comment), set flag as true;
  • Clear the stack (we have reached the end of a comment), set flag as false;
  • Dequeue the first character to the output (no comment in sight).

def filter
  return get_first if ready and comment_start #["c", "/", "*"] - pop "c"
  clear if comment_end #["/", "*", "a", "b", "c", "*", "/"]
  get_first if ready and !comment #["a", "b", "c"] or ["b", "c", "/"]

Here, we have another characteristic of Automata–Based Programming: with each step of the program (here, it is adding a character to the queue) we query and / or modify the the program's "state". ("State" is the value of a set of variables at a particular step of the program). When the program continues, a number of branches are possible. The branch that is associated with each state will be executed. This has the effect of "flattening out" the program's logical branches — they become conditional statements that often all appear at one level. Almost assembly language–like.

(Here is the complete program on GitHub).

Characteristics and advantages

I find that Automata–Based Programming is very useful, because it allows me to create programs around solid logical structures. Because ABP is based om logical anchors as to what is happening at each step, it allows me to write programs that are easy to reason about. For example, the following statements about the helper data structure can be formalized:

  • A precondition for querying the state of the queue: The queue must contain at least two characters.
  • A precondition for removing the first character from the queue: The queue must contain at least three characters.
  • An invariant for the queue: When we are not inside a comment the queue is never longer than 3 characters.

ABP–style programs are also easy to visualize as their underlying theoretical model: automata. Specifically, the visualization of the queue acting as a physical valve was the crucial step in solving the problem. All that remained was to write the program.

Programs created in this style are easy to read, a pleasure to write, and are robust in structure, as they build on a formal logical model (A Finite State Machine, or FSM). ABP–style programs allow one to program for correctness and therefore, testability. Beyond clean code, there are other reasons to be proficient in this paradigm. Let's list a few of its applications:

Advanced text processing. The example in this article is very simple, but Finite State Machines are effective at processing text input of varying complexity and structure, from markup to Turing–complete languages. An example of this is the part of browser code called the tokenization algorithm. Its input is an HTML string and the output is a token representation of that string. It's expressed as a state machine.

The logic behind Graphical User Interfaces. Because user interfaces are stateful, FSMs are very well–suited for modeling them in States, Events, and Transitions. For a modern example, see Alex MacCaw's Super.js — a modular, jQuery-based JavaScript library for building RIAs.

Making Object–Oriented better by encapsulating and controlling "state". In OO, objects have state at runtime. State (defined as the total value of an object's instance variables) is one–third of the OO trinity: state, behavior, & identity.

OO state, however, is less rigorously defined than FSM state, and is therefore easier to get wrong. (See Rich Hickey's article on state). Especially in large OO programs, state often acquires somewhat of a life of its own, causing side effects and unexpected behavior. Constructing OO programs that implement state in accordance with the state machine model is useful in making stateful programs more manageable and predictable.[1]

Relevant resources

Some resources on state–based computation and Automata–Based Programming:

The Wikipedia has a fantastic article with an example of how to transform a character — processing program from imperative style into ABP style: Automata–Based Programming.

In "Programming Pearls", Jon Bentley describes the evolution of a program to find the maximal–value subarray in an array of integers. The final solution combines scanning and state–saving to achieve a very short and fast program.[2] Although it doesn't have explicit "State" classes or transitions, it works by querying and (re)setting state at each step; I like to think of it as the prototypal ABP–style program.