→ Heliotrope Pajamas

2025-10-11

A Neat Trick

I was looking over the old parser code in the C implementation of Thun and, by digging into the source of the glibc C library, I found a neat trick that allowed me to simplify the parser, reducing the size of that code by about two-thirds.

strpbrk()

In the original parser code I used a library function called "strpbrk":

char *strpbrk(const char *s, const char *accept);
The strpbrk() function locates the first occurrence in the string s of any of the bytes in the string accept.

~ Manual page for strpbrk
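
In the spirit of the Python model further down, here's a rough sketch of what strpbrk() does (just the behaviour, not the glibc implementation):

def strpbrk(s, accept):
    # Return the suffix of s starting at the first character that
    # appears in accept, or None when none occurs (where the C
    # function would return NULL).
    for i, ch in enumerate(s):
        if ch in accept:
            return s[i:]
    return None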

This function turns out to be a thin wrapper around a function called strcspn.

#undef strpbrk

#ifndef STRPBRK
#define STRPBRK strpbrk
#endif

/* Find the first occurrence in S of any character in ACCEPT.  */
char *
STRPBRK (const char *s, const char *accept)
{
  s += strcspn (s, accept);
  return *s ? (char *)s : NULL;
}
→ strpbrk.c

strcspn()

size_t strcspn(const char *s, const char *reject);
The strcspn() function calculates the length of the initial segment of s which consists entirely of bytes not in reject.

~ Manual page for strcspn

This code is pretty gnarly; you can look for yourself if you're brave.

→ strcspn.c

It builds a table of 256 values and then indexes it by the bytes of the input string to classify them. It does some other neat tricks, but I think that's the main one.
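
A stripped-down Python sketch of that table trick (leaving out the unrolling and other cleverness in the glibc code) might look something like this:

def strcspn(s, reject):
    # One pass to mark the characters of reject in a 256-entry table,
    # then walk s, indexing the table with each character, until a
    # marked one (or the end of the string) is reached.
    table = 256 * [False]
    for ch in reject:
        table[ord(ch)] = True
    n = 0
    while n < len(s) and not table[ord(s[n])]:
        n += 1
    return n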

Between that and the fact that I was calling strlen() in the "trim leading blanks" function (leading to slightly quadratic behaviour), I felt that the parsing code was embarrassing because it was doing too much work. I rewrote it to use a single static table and indices into the input string rather than recursive calls. (I use the Joy list machinery to store the intermediate partial lists. Relying on the GC like that in C code feels a little naughty.)

New Parser

In order to be confident that the C code is correct I first wrote a model of it in Python:

WORD, LBRAK, RBRAK, SPACE = range(4)

table = 256 * [WORD]
table[ord('[')] = LBRAK
table[ord(']')] = RBRAK
table[ord(' ')] = SPACE  # Can add other space chars, eh?


def parse(s):
    s = s.encode('UTF_8')
    stack, lists = (), []
    i, j = len(s), 0  # j is both state (parsing a word or not) and offset to end of current word.
    while i:
        kind = table[s[i - 1]]
        if WORD == kind:
            if not j: j = i
        else:
            if j:
                assert j != i
                stack = s[i:j].decode('UTF_8'), stack
                j = 0
            if RBRAK == kind:
                lists.append(stack)
                stack = ()
            elif LBRAK == kind:
                assert lists, 'Unclosed bracket.'
                stack = stack, lists.pop()
        i -= 1
    if j:
        assert j != i
        stack = s[i:j].decode('UTF_8'), stack
    assert not lists, 'Extra closing bracket.'
    return stack

Note that this code doesn't look for integers and Booleans; it treats everything that isn't a space or a square bracket as part of a symbol.
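
For example, parsing a small Joy expression yields a cons-list built out of nested two-tuples:

>>> parse('dup [x] cons')
('dup', (('x', ()), ('cons', ())))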

It would be easy to add more SPACE entries to the table to handle other kinds of blank characters (and perhaps the ASCII control codes?). I haven't done that yet, but I will eventually. In the Python model it would just be a few more table entries, something like:
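
# Treat the other ASCII whitespace characters as SPACE as well.
for ch in '\t\n\r\v\f':
    table[ord(ch)] = SPACE

Anyway, here's a link to the C code: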

→ New parser source

Conclusion

Read the source. Reading library source code plus a straightforward application of partial evaluation resulted in a more concise and efficient parser for Thun.