→ Heliotrope Pajamas

Thun

Thun is a dialect of the Joy programming language. Its main difference is that it's a little bit simpler (e.g. it has fewer data types and features than Joy.) This document is about the implementation of Thun; for details on the language itself and how to use it please see the Thun docs:

→ https://ariadne.systems/pub/~sforman/Thun/

Data Types

Thun has four fundamental types: integers, symbols, Booleans, and lists (which can contain values of any of these types, including other lists.)

Let's compare and contrast some implementations of Thun. For clarity and simplicity let's start by examining Elm and OCaml data structure declarations. In Elm:

type JoyType
    = JoySymbol String
    | JoyInt Integer
    | JoyList (List JoyType)
    | JoyTrue
    | JoyFalse

(Instead of a single Boolean type with two values I use two types, each with one value. I don't know why, and it probably doesn't matter much from a pragmatic point of view, but it's mathematically a little weird perhaps.)

The OCaml data structure is almost identical:

type joy_type =
  | JoySymbol of string
  | JoyInt of int
  | JoyList of joy_type list
  | JoyTrue
  | JoyFalse

In both languages integers are already arbitrary precision and lists are built-in so the datatype code is particularly succinct and clear.

In Python we don't write any datatype declarations at all. (Python has recently developed some typing systems, but they are of limited usefulness compared with languages like Haskell or OCaml.) Instead, we use Python ints, strings, and Booleans as themselves, and model lists as two-tuples of item and tail. E.g. a crude grammar:

    list = () | (item, list)

This reads, "A list is the empty tuple -or- a two-tuple where the first item is any value and the second item is a list." In other words, this is the venerable "cons list":

→ Cons List
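This grammar maps directly onto nested Python two-tuples. A minimal sketch, consing innermost-first and then walking the list back out:

```python
# A cons list built from Python two-tuples: cons 3, then 2, then 1
# onto the empty tuple (), innermost tuple first.
expression = ()
for item in (3, 2, 1):
    expression = item, expression
# expression == (1, (2, (3, ())))

# Walking it back out recovers the items in order.
items = []
node = expression
while node:
    head, node = node
    items.append(head)
# items == [1, 2, 3]
```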

In the C language we can use a "tagged union": an enum to define type "tags", and a struct with two members: the type tag and a value. The value is a union of three types: char* for symbols, mpz_t from GMP for integers, and JoyList for lists. (Booleans carry no value; the type tag itself holds the bit.)

→ Tagged Union

We use a simple linked list to model lists (the "cons list" again.)

enum JoyTypeType {
        joyInt,
        joyList,
        joySymbol,
        joyTrue,
        joyFalse
};

typedef struct list_node* JoyList;
typedef JoyList* JoyListPtr;

typedef struct {
        enum JoyTypeType kind;
        union {
                char *symbol;
                mpz_t i;
                JoyList el;
        } value;
} JoyType;

typedef JoyType* JoyTypePtr;

typedef struct list_node {
        JoyTypePtr head;
        JoyList tail;
} JoyListNode;

#define EMPTY_LIST (JoyList)NULL

It's crude but effective.

Nim is kind of in-between. It has a built-in notion of a list, so we don't have to implement that, but we still build the same sort of "tagged union" as we do in C (Nim calls this an "object variant"; it's a concept with many, many names):

type

  JoyListType* = List[JoyType]

  JoyTypeType* = enum
    joySymbol,
    joyTrue,
    joyFalse,
    joyInt,
    joyList

  JoyType* = ref object
    case kind*: JoyTypeType
    of joySymbol: symVal*: string
    of joyFalse, joyTrue: nil
    of joyInt: intVal*: BigInt
    of joyList: listVal*: JoyListType

Much nicer than C, but not as nice as Elm or OCaml, eh?

In Scheme, as in Python, we don't declare data types (although we could; there are several options); we reuse Scheme's data types, including the list (of course! Scheme is a Lisp, eh?)

Likewise, in Prolog we do not declare types; instead the resulting data structures are implied by the definition of the parser:

joy_parse([J|Js]) --> joy_term(J), !, joy_parse(Js).
joy_parse([]) --> [].

joy_term(list(J)) --> [lbracket], !, joy_parse(J), [rbracket].
joy_term(Term) --> [tok(Codes)], { joy_term(Term, Codes) }.

joy_term(     int(I),   Codes) :- numeric(Codes), !, atom_codes(I, Codes).
joy_term( bool(true),  "true") :- !.
joy_term(bool(false), "false") :- !.
joy_term(  symbol(S),   Codes) :- atom_codes(S, Codes).

We use Prolog integers, lists, and symbols directly; Boolean values are modeled by (Prolog) symbols, and all types are "encased" or "boxed" in compound terms that denote their type: int(I), bool(B), symbol(S), and list(L). It's worth noting that the above eight lines of Prolog code are the entire parser. There's a lexer phase, and that is five lines of code. There is also some character recognition cruft that is another dozen lines or so (not counting blank//0), but even so this is concise code, yes?

Parsing

This brings us to parsing, but first an aside on lexing. Lexing Thun is completely trivial. You can use the old Lisp lexing trick: substitute " [ " for "[" and " ] " for "]" and then split on whitespace. In Python:

tokens = source.replace('[', ' [ ').replace(']', ' ] ').split()

A more efficient lexer function is left as an exercise to the reader. (I've written some. It's fun! For extra credit, write a program that takes the above Python code as a specification and generates an efficient lexer function.)
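For example, the trick pads the brackets so they come out as their own tokens whether or not the source has spaces around them:

```python
# The one-line lexing trick in action.
source = '[one two] three'
tokens = source.replace('[', ' [ ').replace(']', ' ] ').split()
# tokens == ['[', 'one', 'two', ']', 'three']
```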

Anyway, parsing.

The idea is to build up a list-of-lists structure from a stream or sequence of tokens, where two of the possible tokens denote a "Dyck language":

In the theory of formal languages of computer science, mathematics, and linguistics, a Dyck word is a balanced string of brackets. The set of Dyck words forms a Dyck language. The simplest, D1, uses just two matching brackets, e.g. ( and ).
Dyck words and language are named after the mathematician Walther von Dyck. They have applications in the parsing of expressions that must have a correctly nested sequence of brackets, such as arithmetic or algebraic expressions.
→ Dyck language

By convention Thun (Joy) literals have the "top" of a list to the left, so the items in a list are parsed top-to-bottom as the stream of tokens is read left-to-right (although of course technically this is just a presentation issue; in RAM the tokens are stored all over the place, even their bits are not necessarily adjacent in real space on the chips!) Anyway, this means that you can't build lists in the right order in one pass. In a pure language you have to reverse the initial list, but in languages like C you can lay out the list in parse order and just link it in reverse! So that's neat.

In more detail, the Thun source code:

one two three

Denotes a list with the symbol "one" on top, "two" as the second item, and "three" last. E.g.:

>>> from joy import text_to_expression
>>> text_to_expression("one two three")
(one, (two, (three, ())))

Can you see the problem? In order to build the top-level two-tuple containing the symbol "one" we need the second tail two-tuple (the one containing "two"), which in turn needs the third two-tuple (containing "three"). In this situation there is no way around it, we have to build a temporary list in parse order and then reverse it when we get to the "]" closing bracket, converting it into a correctly ordered Thun cons list. (If it helps, you can think of Python tuples as containing immutable pointers to the values in them. In this case we are not allowed to use a NULL pointer for the tail and fill it in later. In Prolog, in contrast, lists can have logic variables as tails, which can be unified with values later on as needed.)

With all that in mind I hope the Python parser function will be pretty clear. It lexes the text in passing at the start of the parse loop.

def text_to_expression(text):
    '''
    Convert a string to a Joy expression.

    When supplied with a string this function returns a Python datastructure
    that represents the Joy datastructure described by the text expression.
    Any unbalanced square brackets will raise a ParseError.

    :param str text: Text to convert.
    :rtype: stack
    :raises ParseError: if the parse fails.
    '''
    frame = []
    stack = []

    for tok in text.replace('[', ' [ ').replace(']', ' ] ').split():

        if tok == '[':
            stack.append(frame)
            frame = []
            continue

        if tok == ']':
            thing = list_to_stack(frame)
            try:
                frame = stack.pop()
            except IndexError:
                raise ParseError('Extra closing bracket.') from None
        elif tok == 'true':
            thing = True
        elif tok == 'false':
            thing = False
        else:
            try:
                thing = int(tok)
            except ValueError:
                thing = Symbol(tok)

        frame.append(thing)

    if stack:
        raise ParseError('Unclosed bracket.')

    return list_to_stack(frame)

It makes use of an auxiliary utility function that takes care of the required reversal:

def list_to_stack(el, stack=()):
    '''
    Convert a Python list (or other sequence) to a Joy stack::

        [1, 2, 3] -> (1, (2, (3, ())))

    :param list el: A Python list or other sequence (iterators and generators
             won't work because ``reversed()`` is called on ``el``.)
    :param stack stack: A stack, optional, defaults to the empty stack. This
             allows for concatinating Python lists (or other sequence objects)
             onto an existing Joy stack.
    :rtype: stack

    '''
    for item in reversed(el):
        stack = item, stack
    return stack
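For example (the function is repeated here so the snippet stands alone), the optional second argument lets you splice a Python list onto an existing stack:

```python
def list_to_stack(el, stack=()):
    # Cons the items onto the stack in reverse, so they come out in order.
    for item in reversed(el):
        stack = item, stack
    return stack

# The basic conversion:
assert list_to_stack([1, 2, 3]) == (1, (2, (3, ())))
# Concatenating onto an existing stack:
assert list_to_stack([1, 2], (3, ())) == (1, (2, (3, ())))
```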

(As I was re-reading this I realized that the parser could be simplified by iterating through the tokens in reverse order, so that we no longer need to reverse the lists! D'oh! Yay!)
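That simplification might look like this. This is my sketch of the idea, not code from the joy library, and for brevity symbols are plain strings rather than a Symbol class:

```python
def text_to_expression(text):
    # Iterate over the tokens in REVERSE, consing as we go, so finished
    # cons lists come out in the right order with no reversal step.
    frame = ()
    stack = []
    for tok in reversed(text.replace('[', ' [ ').replace(']', ' ] ').split()):
        if tok == ']':     # seen first in reverse: start a sublist
            stack.append(frame)
            frame = ()
        elif tok == '[':   # close the sublist and cons it onto the outer frame
            if not stack:
                raise ValueError('Unclosed bracket.')
            frame = frame, stack.pop()
        elif tok == 'true':
            frame = True, frame
        elif tok == 'false':
            frame = False, frame
        else:
            try:
                frame = int(tok), frame
            except ValueError:
                frame = tok, frame  # a symbol (a plain str in this sketch)
    if stack:
        raise ValueError('Extra closing bracket.')
    return frame
```

Note that the bracket errors are mirrored: a `]` with no later `[` (in reverse order) only surfaces at the end of the loop, and vice versa.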

In contrast in a language like C that lets us monkey around however we like, we can link the tail after parsing the head, like so:

JoyList
text_to_expression(char *text)
{
        JoyList result, head, tail;

        result = parse_node(&text);
        head = result;
        tail = parse_node(&text);
        while (NULL != tail) {
                head->tail = tail;
                head = tail;
                tail = parse_node(&text);
        }
        return result;
}

Nothing to it.

That's the only interesting detail about parsing Thun code.

Before moving on, however, I should mention that in languages where you can't use a for loop (Elm, OCaml, Scheme) you can use a single-token look-ahead pattern. It's probably clearest in Scheme:

(define (text->expression text) (parse (tokenize text)))

(define (tokenize str)  ; Let's do the simple trick.
  (string-split (string-replace (string-replace str "]" " ] ") "[" " [ ")))

(define (parse tokens)
  (if (null? tokens) '()
    (receive (term rest_of_tokens)
      (one-token-lookahead (car tokens) (cdr tokens))
      (cons term (parse rest_of_tokens)))))

(define (one-token-lookahead token tokens)
  (match token
    ("]" (abort "Extra closing bracket."))
    ("[" (expect-right-bracket tokens))
    (_ (values (tokenator token) tokens))))

(define (expect-right-bracket tokens0)
  (if (null? tokens0) (abort "Missing closing bracket.")
    (receive (token tokens) (car+cdr tokens0)
      (match token
        ("]" (values '() tokens))
        ("[" (receive (sub_list rest) (expect-right-bracket tokens)
               (receive (el rrest) (expect-right-bracket rest)
                 (values (cons sub_list el) rrest))))
        (_ (receive (el rest) (expect-right-bracket tokens)
           (values (cons (tokenator token) el) rest)))))))

(define (tokenator token)
  (cond ((string->number token) (string->number token))
        ((string=? token "true") #t)
        ((string=? token "false") #f)
        (else (string->symbol token))))

Interpreter

The interpreter is very simple: it just iterates through the expression, putting values onto the stack and evaluating functions denoted by symbols.

In Python we just use a dict mapping strings (our symbols are a string class) to functions that take the whole state (stack, expression, dictionary) and return the next state.

def joy(stack, expression, dictionary):
    '''
    Evaluate a Joy expression on a stack.

    This function iterates through a sequence of terms.
    Literals are put onto the stack and Symbols are
    looked up in the dictionary and the functions they
    denote are executed.

    :param stack stack: The stack.
    :param stack expression: The expression to evaluate.
    :param dict dictionary: A ``dict`` mapping names to Joy functions.

    :rtype: (stack, dictionary)

    '''
    expr = push_quote(expression)  # We keep a stack-of-stacks, see below.
    while expr:
        term, expr = next_term(expr)
        if isinstance(term, Symbol):
            try:
                func = dictionary[term]
            except KeyError:
                raise UnknownSymbolError(term) from None
            stack, expr, dictionary = func(stack, expr, dictionary)
        else:
            stack = term, stack
    return stack, dictionary
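Stripped of the stack-of-stacks machinery behind push_quote and next_term, the core loop is just this. A sketch, with hypothetical dup and mul primitives standing in for the real basis functions and plain strings for symbols:

```python
def interpret(stack, expression, dictionary):
    # Walk the cons-list expression: push literals, execute symbols.
    while expression:
        term, expression = expression
        if isinstance(term, str):  # a symbol: look it up and run it
            func = dictionary[term]
            stack, expression, dictionary = func(stack, expression, dictionary)
        else:                      # a literal: push it onto the stack
            stack = term, stack
    return stack, dictionary

def dup(stack, expression, dictionary):
    head, _ = stack
    return (head, stack), expression, dictionary

def mul(stack, expression, dictionary):
    a, (b, rest) = stack
    return (a * b, rest), expression, dictionary

dictionary = {'dup': dup, 'mul': mul}
expression = (3, ('dup', ('mul', ())))  # the program "3 dup mul"
stack, dictionary = interpret((), expression, dictionary)
# stack == (9, ())
```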

For languages that are more crunchy than Python we have to do a bit more. We can implement a complete set of functions and combinators and then definitions can be handled by prepending the body of the definition to the pending expression.

The definition bodies can be stored in a hashtable.

If the interpreter checks the basis functions before looking at definitions then the user cannot override or "shadow" the basis functions.
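For example, a definition like "sqr == dup mul" can be executed by splicing its stored body onto the front of the pending expression. A sketch, with illustrative names (concat, 'sqr') that are not from the actual codebase:

```python
def list_to_stack(el, stack=()):
    # Convert a Python list to a cons list.
    for item in reversed(el):
        stack = item, stack
    return stack

def concat(quote, expression):
    # Splice the cons list quote onto the front of expression.
    if not quote:
        return expression
    head, tail = quote
    return head, concat(tail, expression)

# A hashtable of definition bodies: sqr == dup mul
definitions = {'sqr': list_to_stack(['dup', 'mul'])}

# The interpreter finds the symbol 'sqr' at the head of the expression...
expression = list_to_stack(['sqr', 'foo'])
head, tail = expression

# ...and replaces it with its body.
expression = concat(definitions[head], tail)
# expression == ('dup', ('mul', ('foo', ())))
```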