17 November 2024 (edited 9 December 2024)

Romain Doumenc - CTO

<aside> 🐲

Note as of Dec 9th, 2024:

This blog entry was mostly an experiment with coroutines, and did not cover all the edge cases required to build a solid library (e.g. handling panics).

The same result can be obtained trivially by using iter.Pull instead and sharing variables in the calling stack, relying on the work already done in the standard library (see the sketch just after this note).

The resulting code is much cleaner and just as expressive: this probably makes the experiment a warning against using coroutines directly in this case.

</aside>
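For reference, the iter.Pull version mentioned in the note looks roughly like this. This is a minimal sketch, not the repository's code: parse is a hypothetical name, and the scanner states mirror the snippets later in this post. The scanner becomes a plain iterator over tokens, the parser pulls from it, and the scanning mode lives on the parser's stack where both closures can share it.

import "iter"

type state int

const (
	matchAny state = iota // scanner states, as in the snippets below
	matchQuoted
)

func parse(input []byte) {
	mode := matchAny // shared between the parser and the scanner closure

	var tokens iter.Seq[[]byte] = func(yield func([]byte) bool) {
		// scan input according to the current mode, calling
		// yield once per token until it returns false
		_, _ = input, mode
	}

	next, stop := iter.Pull(tokens)
	defer stop()
	for tok, ok := next(); ok; tok, ok = next() {
		_ = tok            // recursive-descent parsing of the token,
		mode = matchQuoted // possibly switching the scanner mode
	}
}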

The new iter package in the Go standard library is implemented with coroutines, which are actually useful for a broader set of use cases than iterators alone.

This post explores one such case: lexical scanning and parsing. The focus is on parsing log lines instead of real programming languages, both because it makes the parser shorter and because it offers some challenges in parsing non-regular grammars.

The format of interest is a derivative of RFC 5424, with some readability improvements (for example, the level uses a single letter, and quotes are elided when possible):

[D]2024-11-03T11:11:06.446Z[dpath@60446 function=main.(*DPLAN).Reload process=265] Data plan up and running
[E]2024-11-03T11:11:08.474Z[dpath@60446] backend bk_01JBRVGWBM572XC5GKQB3ZNRDG not reachable!

The full code source is available on GitHub:

https://github.com/romaindoumenc/corolf

Typical problems with parsing / scanning

This experience report assumes familiarity with recursive descent parsers (if you are not familiar with them, see Nystrom, 2015, chapters 4 and 6 for a fantastic introduction). This is a very common, robust way of parsing data; a similar approach is used in the Go standard library's go/parser package.

The scanner is typically a big switch loop implementing a state machine (but see Pike, 2011 for an equivalent, threaded implementation):

switch {
case state == matchAny && char == '"':
	state, tkStart = matchDoubleQuoted, position+1
// ...
}

There is a lot of state to carry around (start of the current token, current state, error from the underlying reader, …), all of which is stored in the lexer. One must then carefully reset this state after the parser has consumed the current token.
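Concretely, a classic lexer carries state along these lines (a minimal sketch with hypothetical field names, not the repository's code):

type lexer struct {
	input   []byte
	pos     int   // cursor into the input
	tkStart int   // start of the token being scanned
	state   state // current state-machine state
	err     error // deferred error from the underlying reader
}

// next hands the current token to the parser and resets the
// per-token state: forgetting the reset corrupts the next token.
func (l *lexer) next() []byte {
	tok := l.input[l.tkStart:l.pos]
	l.tkStart = l.pos
	return tok
}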

Languages designed for human readers (e.g. log lines) present a more subtle problem: a single lookahead token is often not enough to know what to scan next. For example, a double quote in the final part of the message is not the sign of a quoted value. This means the scanner might depend on the parser (so the dependency goes both ways):

[W]2024-11-03T11:11:08.474Z[dpath@60446] message has $." characters
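With a classic lexer struct, this back-dependency typically shows up as the parser flipping the scanner state before requesting the next token. A hypothetical sketch (matchFreeText is an assumed extra state, not part of the snippets in this post):

const matchFreeText = matchQuoted + 1 // hypothetical free-text state

type parser struct {
	lex *lexer
	msg string
}

// once the structured header has been consumed, the parser tells the
// scanner to read the rest of the line as free text, so a '"' in the
// message is not mistaken for the start of a quoted value
func (p *parser) parseMessage() {
	p.lex.state = matchFreeText
	p.msg = string(p.lex.next())
}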

Finally, an often overlooked part of the parser is the ability to re-synchronize: given an invalid input, it is not acceptable to simply stop parsing. A simple strategy is instead to ignore the current message until a synchronisation marker is found, at which point parsing can restart (see §6.3, op. cit., or the Go internal parser for a concrete, if slightly more complex, implementation). Simple restart implementations rely on stack unwinding (the old encoding/json package used this), which comes with all the caveats of using exceptions as a control-flow mechanism.
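For newline-delimited log records, re-synchronisation can be as simple as skipping to the next line and resetting the scanner (a minimal sketch, reusing the hypothetical lexer above):

func (l *lexer) sync() {
	// drop the rest of the invalid record
	for l.pos < len(l.input) && l.input[l.pos] != '\n' {
		l.pos++
	}
	if l.pos < len(l.input) {
		l.pos++ // skip the synchronisation marker itself
	}
	// restart from a clean state on the next record
	l.tkStart = l.pos
	l.state = matchAny
}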

Experimenting with Coroutines

Squinting a bit, we can view the parser / scanner choreography as the synchronisation of two state machines: (1) the scanner loops over the characters of the input stream, looking for specific characters based on its current state; when they are found, the scanner state is updated and the loop continues. (2) the parser is fed tokens by the scanner, and updates the scanner state so that the next token is scanned correctly.

In this experiment with coroutines, the data flowing from the scanner to the parser goes through a function injected into the scanner loop. The state machine is then self-contained (note that the cursor position does not need to escape the function, since the scanner function does not return until the stream is entirely consumed):

func scan(next func(byte, []byte) state) {
	// ...
	switch {
	case state == matchAny && c == '"':
		state = matchQuoted
		curs = pos + 1
	case state == matchQuoted && c == '"':
		state = next(c, token)
		curs = pos + 1
	}
}
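The parser, in turn, is just a function receiving each completed token and returning the state the scanner should use next, which keeps the two state machines in lockstep without any shared lexer struct. A hypothetical sketch (the dispatch logic is illustrative, not the repository's):

func parseToken(c byte, token []byte) state {
	if c == ']' {
		// the structured header is done: read the rest of the
		// line as free text, where '"' has no special meaning
		return matchFreeText
	}
	// ... otherwise dispatch on token to build the log record ...
	return matchAny
}

The whole pipeline is then driven by a single call, scan(parseToken), which only returns once the input stream is exhausted.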