std.experimental.lexer

Summary

This module contains a range-based compile-time lexer generator.

Overview

The lexer generator consists of a template mixin, Lexer, along with several helper templates for generating such things as token identifiers.

To write a lexer using this API:

  1. Create the string array constants for your language: staticTokens, dynamicTokens, possibleDefaultTokens, and tokenHandlers.
  2. Create aliases for the various token and token identifier types specific to your language.
  3. Create a struct that mixes in the Lexer template mixin and implements the necessary functions.

Members

Functions

tokenStringRepresentation
string tokenStringRepresentation(IdType type)

Looks up the string representation of the given token type.

Mixin templates

Lexer
mixin template Lexer(Token, alias defaultTokenFunction, alias tokenSeparatingFunction, alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens, alias tokenHandlers)

The implementation of the lexer is contained within this mixin template.

Structs

LexerRange
struct LexerRange

Range structure that wraps the lexer's input.

TokenStructure
struct TokenStructure(IdType, string extraFields = "")

The token that is returned by the lexer.

Templates

TokenId
template TokenId(IdType, alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens, string symbol)

Generates the token type identifier for the given symbol.

TokenIdType
template TokenIdType(alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens)

Template for determining the type used for a token type.
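
As an illustration of what "determining the type used for a token type" can mean, here is a minimal sketch that picks the narrowest unsigned integer able to number every token name. The name SmallestIdType and the extra reserved value are assumptions made for this example; this is not presented as the module's own implementation of TokenIdType.

```d
// Hypothetical helper, not the module's code: chooses the narrowest unsigned
// integer type that can number every token name, plus one reserved value
// (assumed here to stand for "no token").
template SmallestIdType(alias staticTokens, alias dynamicTokens,
    alias possibleDefaultTokens)
{
    enum size_t tokenCount = staticTokens.length + dynamicTokens.length
        + possibleDefaultTokens.length + 1;

    static if (tokenCount <= ubyte.max)
        alias SmallestIdType = ubyte;
    else static if (tokenCount <= ushort.max)
        alias SmallestIdType = ushort;
    else static if (tokenCount <= uint.max)
        alias SmallestIdType = uint;
    else
        static assert(false, "too many token types");
}

enum string[] staticTokens = ["-", "+", "*", "/"];
enum string[] dynamicTokens = ["numberLiteral", "whitespace"];
enum string[] possibleDefaultTokens = [];

// Six names plus the reserved value fit in a ubyte.
static assert(is(SmallestIdType!(staticTokens, dynamicTokens,
    possibleDefaultTokens) == ubyte));

void main() {}
```

Keeping the identifier type small keeps each token small, which matters when a lexer produces millions of tokens.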

Examples

  • A lexer for D is available here.
  • A lexer for Lua is available here.
  • A lexer for JSON is available here.

Template Parameters

Template Parameter Definitions

defaultTokenFunction
A function that serves as the default token lexing function. For most languages this will be the identifier lexing function.
tokenSeparatingFunction
A function that is able to determine if an identifier/keyword has come to an end. This function must return bool and take a single size_t argument representing the number of bytes to skip over before looking for a separating character.
staticTokens
A listing of the tokens whose exact value never changes and which cannot possibly be a token handled by the default token lexing function. The most common example of this kind of token is an operator such as "*" or "-" in a programming language.
dynamicTokens
A listing of tokens whose value is variable, such as whitespace, identifiers, number literals, and string literals.
possibleDefaultTokens
A listing of tokens that could possibly be one of the tokens handled by the default token lexing function. A common example of this is a keyword such as "for", which looks like the beginning of the identifier "fortunate". tokenSeparatingFunction is called to determine if the character after the 'r' separates the identifier, indicating that the token is "for", or if lexing should be turned over to the defaultTokenFunction.
tokenHandlers
A mapping of prefixes to custom token handling function names. The generated lexer searches the input for the even-indexed elements of this array and, on a match, calls the function named by the element immediately following the matched one. This is used for lexing complex tokens whose prefix is fixed.
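
The interplay between possibleDefaultTokens and tokenSeparatingFunction can be sketched without the module itself. In the sketch below, isSeparating and startsWithKeyword are hypothetical names. Unlike the real tokenSeparatingFunction, which takes only the size_t offset, isSeparating also receives the input explicitly so that the example is self-contained. The rule for what counts as a separator is likewise an assumption.

```d
import std.ascii : isAlphaNum;

// Hypothetical stand-in for tokenSeparatingFunction. The offset is the number
// of bytes to skip before looking for a separating character.
bool isSeparating(string input, size_t offset)
{
    // The end of input separates; so does any byte that cannot continue an
    // identifier (assumed here: anything other than a letter, digit, or '_').
    if (offset >= input.length)
        return true;
    immutable char c = input[offset];
    return !(isAlphaNum(c) || c == '_');
}

// Returns true when input begins with keyword and the byte after it ends the
// token. A false result means lexing would fall back to defaultTokenFunction.
bool startsWithKeyword(string input, string keyword)
{
    return input.length >= keyword.length
        && input[0 .. keyword.length] == keyword
        && isSeparating(input, keyword.length);
}

// "for" followed by a space is the keyword; "fortunate" is an identifier.
static assert(startsWithKeyword("for (;;)", "for"));
static assert(!startsWithKeyword("fortunate", "for"));
static assert(startsWithKeyword("for", "for"));

void main() {}
```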

Here are some example constants for a simple calculator lexer:

// There are a near infinite number of valid number literals, so numbers are
// dynamic tokens.
enum string[] dynamicTokens = ["numberLiteral", "whitespace"];

// The operators are always the same, and cannot start a numberLiteral, so
// they are staticTokens
enum string[] staticTokens = ["-", "+", "*", "/"];

// In this simple example there are no keywords or other tokens that could
// look like dynamic tokens, so this is blank.
enum string[] possibleDefaultTokens = [];

// If any whitespace character or digit is encountered, pass lexing over to
// our custom handler functions. These will be demonstrated in an example
// later on.
enum string[] tokenHandlers = [
	 "0", "lexNumber",
	 "1", "lexNumber",
	 "2", "lexNumber",
	 "3", "lexNumber",
	 "4", "lexNumber",
	 "5", "lexNumber",
	 "6", "lexNumber",
	 "7", "lexNumber",
	 "8", "lexNumber",
	 "9", "lexNumber",
	 " ", "lexWhitespace",
	 "\n", "lexWhitespace",
	 "\t", "lexWhitespace",
	 "\r", "lexWhitespace"
];
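
To make the pairing of even-indexed prefixes with the names that follow them concrete, here is a minimal sketch of a lookup over such an array. handlerNameFor is a hypothetical function written for this illustration, and the array is a shortened copy of the one above. The generated lexer resolves handlers at compile time, so this runtime-style loop only demonstrates the layout of the array, not the generator's actual mechanism.

```d
// Shortened copy of the example array: prefix at even index, handler name
// at the following odd index.
enum string[] tokenHandlers = [
    "0", "lexNumber",
    " ", "lexWhitespace",
    "\n", "lexWhitespace"
];

// Hypothetical lookup, for illustration only: returns the handler name paired
// with the first prefix that input starts with, or null if no prefix matches.
string handlerNameFor(string input, string[] handlers)
{
    assert(handlers.length % 2 == 0, "prefixes and names must come in pairs");
    for (size_t i = 0; i + 1 < handlers.length; i += 2)
    {
        string prefix = handlers[i];
        if (input.length >= prefix.length
            && input[0 .. prefix.length] == prefix)
            return handlers[i + 1];
    }
    return null;
}

static assert(handlerNameFor("0 + 1", tokenHandlers) == "lexNumber");
static assert(handlerNameFor(" + 1", tokenHandlers) == "lexWhitespace");
// "+" is a static token, so no custom handler is associated with it.
static assert(handlerNameFor("+ 1", tokenHandlers).length == 0);

void main() {}
```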

Meta

Authors

Brian Schott, with ideas shamelessly stolen from Andrei Alexandrescu