→ Heliotrope Pajamas

2025-10-11

A Neat Trick

I was looking over the old parser code in the C implementation of Thun and, by digging into the source of the glibc C library, I found a neat trick that allowed me to simplify the parser, reducing the size of that code by about two-thirds.

strpbrk()

In the original parser code I used a library function called "strpbrk":

char *strpbrk(const char *s, const char *accept);
The strpbrk() function locates the first occurrence in the string s of any of the bytes in the string accept.

~ Manual page for strpbrk
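
In the spirit of the Python model further down, here's a rough sketch of what strpbrk() does (just the behaviour, not the glibc implementation):

def strpbrk(s, accept):
    # Return the suffix of s starting at the first character that
    # appears in accept, or None when none occurs (where the C
    # function would return NULL).
    for i, ch in enumerate(s):
        if ch in accept:
            return s[i:]
    return None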

This function turns out to be a thin wrapper around a function called strcspn.

#undef strpbrk

#ifndef STRPBRK
#define STRPBRK strpbrk
#endif

/* Find the first occurrence in S of any character in ACCEPT.  */
char *
STRPBRK (const char *s, const char *accept)
{
  s += strcspn (s, accept);
  return *s ? (char *)s : NULL;
}
→ strpbrk.c

strcspn()

size_t strcspn(const char *s, const char *reject);
The strcspn() function calculates the length of the initial segment of s which consists entirely of bytes not in reject.

~ Manual page for strcspn

This code is pretty gnarly; you can look for yourself if you're brave.

→ strcspn.c

It builds a table of 256 values and then indexes it by the bytes of the input string to classify them. It does some other neat tricks, but I think that's the main one.
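
A stripped-down Python sketch of that table trick (leaving out the unrolling and other cleverness in the glibc code) might look something like this:

def strcspn(s, reject):
    # One pass to mark the characters of reject in a 256-entry table,
    # then walk s, indexing the table with each character, until a
    # marked one (or the end of the string) is reached.
    table = 256 * [False]
    for ch in reject:
        table[ord(ch)] = True
    n = 0
    while n < len(s) and not table[ord(s[n])]:
        n += 1
    return n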

Between that and the fact that I was calling strlen() in the "trim leading blanks" function (leading to slightly quadratic behaviour), I felt that the parsing code was embarrassing because it was doing too much work. I rewrote it to use a single static table and indices into the input string rather than recursive calls. (I use the Joy list machinery to store the intermediate partial lists. Relying on the GC like that in C code feels a little naughty.)

New Parser

In order to be confident that the C code is correct I first wrote a model of it in Python:

WORD, LBRAK, RBRAK, SPACE = range(4)

table = 256 * [WORD]
table[ord('[')] = LBRAK
table[ord(']')] = RBRAK
table[ord(' ')] = SPACE  # Can add other space chars, eh?


def parse(s):
    s = s.encode('UTF_8')
    stack, lists = (), []
    i, j = len(s), 0  # j is both state (parsing a word or not) and offset to end of current word.
    while i:
        kind = table[s[i - 1]]
        if WORD == kind:
            if not j: j = i
        else:
            if j:
                assert j != i
                stack = s[i:j].decode('UTF_8'), stack
                j = 0
            if RBRAK == kind:
                lists.append(stack)
                stack = ()
            elif LBRAK == kind:
                assert lists, 'Unclosed bracket.'
                stack = stack, lists.pop()
        i -= 1
    if j:
        assert j != i
        stack = s[i:j].decode('UTF_8'), stack
    assert not lists, 'Extra closing bracket.'
    return stack

Note that this code doesn't look for integers and Booleans; it treats everything that isn't a space or a square bracket as part of a symbol.
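
For example, parsing a small Joy expression yields a cons-list built out of nested two-tuples:

>>> parse('dup [x] cons')
('dup', (('x', ()), ('cons', ())))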

It would be easy to add more SPACE entries to the table to handle other kinds of blank characters (and perhaps the ASCII control codes?). I haven't done that yet, but I will eventually. In the Python model it would just be a few more table entries, something like:
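
# Treat the other ASCII whitespace characters as SPACE as well.
for ch in '\t\n\r\v\f':
    table[ord(ch)] = SPACE

Anyway, here's a link to the C code: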

→ New parser source

Conclusion

Read the source. Reading library source code plus a straightforward application of partial evaluation resulted in a more concise and efficient parser for Thun.