A Self-Recognizing Grammar

The following is based on "A Concise Extensible Metalanguage for Translator Implementation", by Douglas L. Michels, 3rd USA-JAPAN Computer Conference, 1978.

The paper covers a cool method of "growing" translator (compiler) systems from a simple "seed" system, and in §6 "A Minimal Recognizer"...

Four Operations

There are four symbols involved representing the four operations necessary for minimal self-recognition. There are two binary prefix operators 'l' and '&' for alternation and concatination respectively, a '"' unary prefix recognizer operator, and the '.' operator for recursion on the whole grammar. To spice things up a bit I've switched to using some Unicode glyphs:

The Grammar

These twenty-seven characters represent a self-recognizing grammar, perhaps the simplest known?

In words, there is one rule, called here ●, which is a circle '●' representing a recursive "call" to the whole grammar, or one of two binary expressions '∧' or '∨' followed by two recursive sub-expressions, or the "quote" or "scan" operator '«' followed by its argument (one of the four symbols.)

(The '«' operator can be considered to quote the following symbol, changing it from it's typical meaning: an operation to perform, to it's quoted meaning: a data pattern to look for, a name. And it can be considered the scan operator, as it's meaning is to scan and consume if found it's quoted argument. Yin and Yang, Yang and Yin.)

These are the same twenty-seven symbols broken up into a tree structure. Each of the binary operators are followed by their two args indented one level. The terminal ops ('●' and '«'), uh, terminate the branches.

skip()

To understand how it works we can start with the `Skip()` function introduced in §6.3 of the paper. It's a simple auxilliary function the task of which is to "skip" over one sub-expression of the grammar to reach the next. It's used to find the second arg of a binary op.

def skip(grammar):
    '''
    Find the next sub-expression by skipping over the current one.
    '''
    assert grammar  # One cannot skip an empty grammar.
    ch, grammar = grammar[0], grammar[1:]
    if ch == '●': return grammar
    if ch == '∧': return skip(skip(grammar))
    if ch == '∨': return skip(skip(grammar))
    if ch == '«': return grammar[1:]
    raise ValueError

to_tree()

Most of the functions to follow have this skeleton, a switch over the symbols with recursion for the binary ops. We can illustrate this by implementing different but similar function that converts the linear string form of the grammar into a tree data structure.

def to_tree(grammar):
    '''
    Convert a grammar string to a nice data structure.
    '''
    assert grammar
    ch, grammar = grammar[0], grammar[1:]
    if ch == '●': return 'RECUR'
    if ch == '∧': return 'AND', to_tree(grammar), to_tree(skip(grammar))
    if ch == '∨': return 'OR',  to_tree(grammar), to_tree(skip(grammar))
    if ch == '«': return 'RECOGNIZE', grammar[0]
    raise ValueError

('OR',
 ('RECOGNIZE', '●'),
 ('OR',
  ('AND',
   ('OR', ('RECOGNIZE', '∧'), ('RECOGNIZE', '∨')),
   ('AND', 'RECUR', 'RECUR')),
  ('AND',
   ('RECOGNIZE', '«'),
   ('OR',
    ('RECOGNIZE', '∧'),
    ('OR',
     ('RECOGNIZE', '∨'),
     ('OR', ('RECOGNIZE', '«'), ('RECOGNIZE', '●')))))))

Neat, eh? It's a pretty straightforward linear-to-tree conversion. It should be obvious how this structure of tuples and strings aligns with the tree diagram of the grammar above, eh? It's the same tree (at least to some kind of isomorphism?)

Interpreter

In §6.3 of the paper Michels specifies a simple "Recognition Interpreter" for this grammar. It's a little clunky, as it's broken up into two functions, `Test()` and `Remaining()`, but I think that's only due to limitations of the language he's using to specify the implementation. It's also a bad algorithm in the sense that it repeats a lot of work (a lot) but that's because it's meant to be simple, not efficient. In practice the inefficiencies are easy to overcome with known methods (i.e. caching.)

In any event, I implemented a recognition interpreter in Python that combines the work of both functions. It is wrapped in a function that accepts a grammar and an input source string (in the self-recognizing case these are of course the same string!) and returns a Boolean value indicating whether the source is matched by the grammar.

def mo(grammar, source) -> bool:
    '''
    Does the source code match the grammar?
    '''
    # To replicate with fidelity to heredity we must keep a copy of our
    # "DNA" handy.  The first grammar is the program and it's consumed,
    # so we need the second (reference to the) grammar for the recursive
    # case.
    return not _mo(grammar, grammar, source)

The `_mo()` function returns the rest of the source, whatever is leftover after the program/grammar matches a prefix (which should hopefully be the whole input!) In order to signal failure on one branch of a '∨' (OR) op we could return a two-tuple of bool and string, or we can use a side-channel like the Python exception system, as we do here. When `_mo()` fails it raises this Backup exception:

class Backup(Exception): pass

The function itself is straightforward and follows the skeleton (a switch on the symbols):

def _mo(G, R, I) -> str:
    if not R: return I

    ch, R = R[0], R[1:]

    if ch == '●': return _mo(G, G, I)
    if ch == '∧': return _mo(G, skip(R), _mo(G, R, I))
    if ch == '∨':
        try: return _mo(G, R, I)
        except Backup: return _mo(G, skip(R), I)
    if ch == '«':
        if I[0] == R[0]: return I[1:]
        raise Backup

    raise ValueError

Compiler

We could write an interpreter function that used the data-structure form of the grammar from above to avoid scanning and re-scanning the grammar string for sub-expressions, but it would be even more fun to write a compiler for this grammar.

The idea is that instead of interpreting the grammar in the presence of an input string we emit functions that embody the actions of the operators. This is straightfoward (especially in a language like Python), the only tricky bit is handling the recursion. The functions we will be emitting cannot reference themselves before they are finished being created so we have to close the loop from outside. What this means is that the top-level compiled function for the grammar must be passed a reference to itself to handle the recursive cases, just as the interpreter above needs to carry around a complete copy of the grammar string to pass to itself when it reaches a recursive branch.

def compile_go(program):
    
    f = _compile_go(program)
    
    def go(source):
        # Start at index 0; f gets a reference to itself for recursion.
        return f(source, 0, f)
    
    return go

We compile the expression to a function `f()` that is the "binary" for or executable form of the grammar, and then wrap it in a helper function `go()` to give it an initial index (this function iterates through the source string with a index "pointer" rather than chopping off characters using Python's slice syntax) and a reference to itself which it needs for recursion, which is handled by this helper function. We only need one of these.

def RECUR(source, index, self):
    return self(source, index, self)

(That is a cool function. You don't get to write functions like that every day, eh? It's a "fixed-point" I think?)

What does this compiler (the output of the compiler-compiler we are discussing) return? It's a two-tuple of Boolean and integer representing the success or failure of the parse and how far it got in the input string. (This pattern is the other main way to handle this besides e.g. a Backup exception.) The two functions described so far (`compile_go()` and `RECUR()`) don't mention their return values. We first encounter them in a leaf function for success:

def SUCCESS(_source, index, _self):
    return True, index

This function is only used in one place so it could be defined inline (as a lamda expression or something) but I like to call it out as its own thing to emphasize that we're just using plain ol' functions here, nothing fancy. Anyway, let's get on with it.

Here's the `_compile_go()` function. Note that `skip()` is called during compilation, not during matching. This means that we pay the cost of skipping once and then avoid it during subsequent calls to the compiled function. Compilers are cool.

We wind up with a function call tree that is isomorphic to the data-structure emitted by `to_tree()` encoded into the internal references between the various sub-functions created by the compiler-compiler.

def _compile_go(program):
    '''
    Return a function that accepts a source string, an index into that
    string, and a reference to a function, and that returns a two-tuple of bool
    and an index integer.  The bool indicated success (or not) of the
    parse/match algorithm and the index integer tells how far the
    process went in the string.  (Typically it should be the length of
    the source string for a successful match.)
    '''

    # The null program matches anything.
    if not program:
        return SUCCESS 

    ch, program = program[0], program[1:]

    if ch == '●':
        return RECUR

    if ch == '«':
        scan_for = program[0]

        def RECOGNIZE(source, index, self):
            try: result = scan_for == source[index]
            except IndexError:
                return False, index
            # Use the bool/int pun to increment index on success.
            return result, index + result
        return RECOGNIZE

    left = _compile_go(program)
    right = _compile_go(skip(program))  # skip() at compile-time, eh?

    if ch == '∧':
        def AND(source, index, self):
            result, i = left(source, index, self)
            return right(source, i, self) if result else (result, index)
        return AND

    if ch == '∨':
        def OR(source, index, self):
            result, i = left(source, index, self)
            return (result, i) if result else right(source, index, self)
        return OR

    raise ValueError

>>> G = '∨«●∨∧∨«∧«∨∧●●∧««∨«∧∨«∨∨«««●'
>>> gogo = compile_go(G)
>>> gogo(G)
(True, 27)

I especially love the beautiful symmetry between the `AND()` and `OR()` functions:

def AND(source, index, self):
    result, i = left(source, index, self)
    return right(source, i, self) if result else (result, index)

def OR(source, index, self):
    result, i = left(source, index, self)
    return (result, i) if result else right(source, index, self)

No doubt they would be even prettier in, say, Haskell, but you can see the formal mathematical symmetry well enough in crude Python. Note how `AND()` passes along `i` while `OR()` discards it.

The Translator Interpreter

The point of the paper is to show how a more complex (but still simple) and actually useful translator can be "grown" from a "seed" like the above minimal self-recognizer. In §4.3 of the paper Michels specifies a translator that includes an output operator and named sub-expressions ("rules"). I transcribed it into Python code, available here:

It's a bit slow. There's another great paper that shows how to make a more efficent version, and implements it in a kind of low-level assembly for a little virtual machine: "Bootstrapping a Small Translator Writing System", Fay, Michael, UCSC technical report 77-3-002, March 1977.

These days we can just add a cache and get acceptable performance, but forty-five years ago that wasn't really an option. I haven't fully explored Fay's paper yet but I will after I finish writing this blog post. Maybe a part II? Anyway, by adding a cache in front of the `Test()` function the program gains ~100x speed-up in my crude tests.

Output and Named Rules

The translator adds two features to the minimal system: the ability to output symbols (to a tape, file, pipe, stream, buffer, whatever), and the ability to define named rules that can "call" each other (or themselves).

Output is handled by adding a new quote operator that emits its argument to the output rather than consuming it from the input, as a parse operator it always succeeds. The paper uses '>' but I like '»' for this (it's a nice symmetry with using '«' for the scan/quote op, eh?)

For named rules we let expressions be prefixed with a single-character name (uppercase ASCII chars) and extend expressions with a "call" operator that accepts a name arg. The paper uses ':', I'm using '⊦' (logical assertion symbol) as it lends the grammar a declarative air (and I couldn't find a good "call function" symbol in the commonly supported Unicode symbols.)

The "object language" that the translator interpreter supports looks like this (this is also it's own self-recognizing grammar, extended from the minimal.)

In a more conventional syntax, and using the original symbols (as this is taken directly from the paper):

Fusion of Functions

The main function of this interpreter is `Machine()` and it uses three separate functions: `Test()`, `Remainder()`, and `Emit()`, which do a lot of redundant work (which is why caching the `Test()` function gives such a dramatic effect.)

def Machine(G, I):
    if Test(G, rest(G), I) and not Remaining(G, rest(G), I):
        return True, Emit(G, rest(G), I)
    return False, ''

We can "fuse" those functions together and do most of the work at the same time. The `Machine()` function will look like this:

def Machine(G, I):
    result, remaining, out = fused(G, rest(G), I)
    if result and not remaining:
        return True, '', out
    return False, remaining, out

Note how everything is now done in `fused()`. I also return both the un-matched remaining input and the output so far in the case of a failed or incomplete parse. That's mostly for informational (not to say debugging) purposes.

def fused(G, R, I, O=''):
    ch, tail = first(R), rest(R)

    if '⊦' == ch:
        name = first(tail)
        rule = Find(G, name)
        return fused(G, rule, I, O)

    if '∧' == ch:
        result, remaining, out = fused(G, tail, I, O)
        if result:
            return fused(G, Skip(tail), remaining, out)
        return result, I, O

    if '∨' == ch:
        result, remaining, out = fused(G, tail, I, O)
        if result:
            return result, remaining, out
        return fused(G, Skip(tail), I, O)

    if '»' == ch:
        return True, I, Concat(O, first(tail))

    if '«' == ch:
        if not I:
            return False, I, O
        return first(tail) == first(I), rest(I), O

    raise ValueError  # should never get here

As you can see, the skeleton of the function is familiar, it's a switch on the symbol set. Output is unsurprising, it concatinates its arg onto the output buffer. The important thing is that the output is reset when backtracking. (In Python strings are immutable so each intermediate output string hangs around in the call stack during sub-expression evaluation and is only discarded if the sub-expression succeeds, so backtracking works fine.)

As for named rules, they are looked up in the grammar string (the "program" that the interpreter is interpreting) with an auxilliary function `Find()`. (The implementation in the paper extracts the name of the rule that it is looking for over and over again as it recurs (it's recursive), I extract the name once. I also rewrote it in iterative rather than recursive form, but that's just for kicks. I doubt it has a noticable effect on runtime, eh?)

def Find(G, name):
    while G:
        rule_name, G = G[0], G[1:]
        if name == rule_name:
            return G
        G = Skip(G)
    raise ValueError

And that's about it. The `Skip()` function is modified to deal with the two new operation symbols, of course:

def Skip(R):
    ch = first(R)
    if '⊦' == ch: return rest(rest(R))
    if '∧' == ch: return Skip(Skip(rest(R)))
    if '∨' == ch: return Skip(Skip(rest(R)))
    if '»' == ch: return rest(rest(R))
    if '«' == ch: return rest(rest(R))
    raise ValueError  # should never get here

The grammar matches itself. (There is no output because the grammar doesn't use the '»' operator. It can recognize it, but it doesn't use it. In other words if you examine the grammar string you'll find that '»' only ever appears in it after a '«' quote/scan operator symbol.)

>>> Machine(grammar, grammar)
(True, '', '')

That's pretty freakin' cool, eh? I mean, it's a little silly to find and re-find the rule bodies over and over again, but we already know how to deal with that, eh?

The thing to admire is how such a simple system can be extended easily to add more sophisticated features without sacrificing the elegance and simplicity which makes the whole thing easy to understand (and extend!)

Convert to compiler.

It is straightforward to apply the technique used above to this interpreter to convert it into a compiler. (I should note that it's also possible to emit the code for functions instead of the functions themselves to make e.g. a compiler to C or whatever.)

The only tricky bit is that now instead of a single self-mentioning grammar ('●') we will have an open-ended set of named rules that can "call" each other or themselves, a potential universe of circular evaluation paths, each of which will send our compiler into an infinite loop and blow out the Python call stack. To solve this I just put the functions into a dictionary as they are compiled and pass that (instead of a single function) to the top-level compiled function at the start of its run. Then in the bodies of the functions that call other functions we use (instead of a single `RECUR()` function) a function that looks up and calls that other function by name at runtime.

def call(name, rules):
    return lambda I, O: (rules[name](I, O))

It would be neat to dig into the rule functions after they were created and "link" them to each other to avoid the overhead of looking them up in the `rules` dict at runtime, and Python's facilities for introspection and dynamicism are such that it's probably possible (I tried but it involves too much brittle magic) but it's not really worth the effort I think?

def _compile(R, rules):
    ch, tail = first(R), rest(R)

    if '⊦' == ch:
        return call(first(tail), rules)

    if '∧' == ch:
        left, right = _compile(tail, rules), _compile(skip(tail), rules)
        def AND(I, O):
            result, remaining, out = left(I, O)
            return right(remaining, out) if result else (result, I, O)
        return AND

    if '∨' == ch:
        left, right = _compile(tail, rules), _compile(skip(tail), rules)
        def OR(I, O):
            result = left(I, O)
            return result if result[0] else right(I, O)
        return OR

    if '»' == ch:
        out = first(tail)
        return lambda I, O: (True, I, concat(O, out))

    if '«' == ch:
        scan_for = first(tail)
        def SCAN(I, O):
            if not I: return False, I, O
            return scan_for == first(I), rest(I), O
        return SCAN

    raise ValueError(f'{repr(ch)} {len(tail)}')  # should never get here

Really, there's nothing to it once the issue of inter-rule calling has been solved. And again, you can emit source code rather than functions and make a compiler that generates, uh, source code for a (new) compiler (or whatever.)

If you continue on in this way you eventually wind up with something like Val Shorre's Meta-II (which actually pre-dates this paper by several years.)

There's an awesome lecture by Bill McKeeman about it that he delivered at Stanford University in March, 2009:

Conclusion

As McKeeman freely admits towards the end of his lecture these machines are toys. They are fun and beautiful, but to the practially-minded person their main value is to illustrate how more complex compilers might work and to enthuse students (which is what McKeeman used it for.)

Appendix A: Translator Interpreter §4.3

For convenience here is the Python transliteration of the Translator Interpreter code from §4.3 of the paper.

from functools import cache


grammar = ''.join('''\

    G |&:L
        &:R
         :G
       &:L
        :R

    R |&":
        :L
       |&"&
         &:R
          :R
        |&"|
          &:R
           :R
         |&">
           :L
          &""
           :L

    L |"A|"B|"C|"D|"E|"F|"G|"H|"I|"J|"K|"L|"M
      |"N|"O|"P|"Q|"R|"S|"T|"U|"V|"W|"X|"Y|"Z
      |"&|"||":|""|">|"(|")|"'|"_|"[|"]|"=|";"*

'''.split()).replace('_', ' ')

# Here it is for reference:
# G|&:L&:R:G&:L:RR|&"::L|&"&&:R:R|&"|&:R:R|&">:L&"":LL|"A|"B|"C|"D|"E|"F|"G|"H|"I|"J|"K|"L|"M|"N|"O|"P|"Q|"R|"S|"T|"U|"V|"W|"X|"Y|"Z|"&|"||":|""|">|"(|")|"'|" |"[|"]|"=|";"*


# Some string handling utilities.


def first(string):
    return string[0]


def rest(string):
    return string[1:]


def Concat(left, right):
    return f'{left}{right}'


# The Interpreter


def Machine(G, I):
    if Test(G, rest(G), I) and not Remaining(G, rest(G), I):
        return True, Emit(G, rest(G), I)
    return False, ''


@cache
def Test(G, R, I):
    ch = first(R)
    if ':' == ch: return Test(G, Find(G, rest(R)), I)
    if '&' == ch:
        if Test(G, rest(R), I):
            return Test(G, Skip(rest(R)), Remaining(G, rest(R), I))
        return False
    if '|' == ch:
        if Test(G, rest(R), I):
            return True
        return Test(G, Skip(rest(R)), I)
    if '>' == ch: return True
    if '"' == ch: return bool(I) and first(rest(R)) == first(I)
    raise ValueError  # should never get here


def Remaining(G, R, I):
    ch = first(R)
    if ':' == ch: return Remaining(G, Find(G, rest(R)), I)
    if '&' == ch: return Remaining(G, Skip(rest(R)), Remaining(G, rest(R), I))
    if '|' == ch:
        if Test(G, rest(R), I):
            return Remaining(G, rest(R), I)
        return Remaining(G, Skip(rest(R)), I)
    if '>' == ch: return I
    if '"' == ch: return rest(I)
    raise ValueError  # should never get here


def Emit(G, R, I):
    ch = first(R)
    if ':' == ch: return Emit(G, Find(G, rest(R)), I)
    if '&' == ch: return Concat(
        Emit(G, rest(R), I),
        Emit(G, Skip(rest(R)), Remaining(G, rest(R), I))
        )
    if '|' == ch:
        if Test(G, rest(R), I):
            return Emit(G, rest(R), I)
        return Emit(G, Skip(rest(R)), I)
    if '>' == ch: return first(rest(R))
    if '"' == ch: return ''
    raise ValueError  # should never get here


def Skip(R):
    ch = first(R)
    if ':' == ch: return rest(rest(R))
    if '&' == ch: return Skip(Skip(rest(R)))
    if '|' == ch: return Skip(Skip(rest(R)))
    if '>' == ch: return rest(rest(R))
    if '"' == ch: return rest(rest(R))
    raise ValueError  # should never get here


def Find(G, R):
    if first(G) == first(R):
        return rest(G)
    return Find(Skip(rest(G)), R)


if __name__ == '__main__':
    p = Machine(grammar, grammar)
    assert p[0]