Introduction

DynGenPar [1] is an innovative parser based on a new principle combining bottom-up and top-down features of traditional parsers. The most unique feature of the algorithm is the possibility to add rules to the grammar at almost any time, even during parsing. DynGenPar has the following characterizing properties:

Dynamic = The grammar is not hardcoded as in usual table-driven approaches, such as (Generalized) LR or Earley's algorithm. Instead, the algorithm works on a runtime representation of the grammar, which allows efficient handling of dynamic grammar changes. To decide when and how to shift or reduce, we use, instead of the usual LR tables, the initial graph, a graph which is easily updated as the grammar changes, along with some runtime top-down information.

Generalized = The algorithm exhaustively parses ambiguous grammars. In addition, epsilon productions are considered in order to support arbitrary CFGs. (Left recursion is naturally supported thanks to the bottom-up nature of the algorithm.) We use graph-structured (DAG-structured) stacks similar to the ones used in Tomita's Generalized LR algorithm. As additional generalizations, we also support Parallel Multiple Context-Free Grammars (PMCFGs) and next token constraints (useful, e.g. for scannerless parsing).

Due to the dynamic design, a parser generator is not needed. Instead, the parser can simply be used as a library.

DynGenPar supports dynamic grammar additions, incremental parsing, prediction, rule labels, custom parse actions, arbitrary token sources, hierarchical parsing, parallel multiple context-free grammars (PMCFGs), and next token constraints. See section Features for details.

Basic Usage

All the APIs provided by DynGenPar are in the DynGenPar namespace. You can use

using namespace DynGenPar;

to bring in the entire namespace.

The main class of DynGenPar is DynGenPar::Parser. To do any parsing, you will always have to instantiate at least one object of the DynGenPar::Parser class, or a subclass such as DynGenPar::PgfParser. Parsing is done through the DynGenPar::Parser::parse methods, either the fine-grained version with several arguments or the binding-friendly version which operates on a DynGenPar::ParseState object grouping all the arguments.

The parser operates on a stream of tokens. Those tokens have to be provided by a token source, i.e. a class implementing the DynGenPar::TokenSource interface. DynGenPar provides several ready-made token sources. You can also implement your own: Your class only has to derive from DynGenPar::TokenSource and implement the needed virtual methods, at least the pure virtual DynGenPar::TokenSource::readToken. In the following example, we will use the simplest token source, which operates directly on a list of tokens: the DynGenPar::ListTokenSource.

This is a basic example for how to use DynGenPar::Parser:

DynGenPar::ListTokenSource tokenSource;
DynGenPar::Parser parser(&tokenSource);
// We have to specify the grammar.
// In this example, the grammar is hardcoded.
parser.tokens << "a" << "b" << "c";
parser.rules["S"] << (Rule() << "A")
                  << (Rule() << "A" << "S");
parser.rules["A"] << (Rule() << "a")
                  << (Rule() << "b")
                  << (Rule() << "c");
parser.startCat = "S";
// Whenever we set the grammar directly, we have to call this function.
parser.initCaches();
// And here, we specify the sample input, also hardcoded.
tokenSource.inputTokens << "b" << "a";
// Now we parse the input, using the default arguments for
// DynGenPar::Parser::parse.
QList<DynGenPar::Match> matches = parser.parse();

The DynGenPar::Match class contains the results of the parsing process; in particular, the parsed tree(s) in a packed, i.e. DAG (directed acyclic graph) -structured, representation, see the DynGenPar::Match::tree field.

The following example shows how to iterate over a parse tree:

static void printSubtree(const DynGenPar::Node &node, int level,
                         QTextStream &stream);
// prints the parse tree given as an argument to stdout
static void printParseTree(const DynGenPar::Node &tree) {
  QTextStream stream(stdout);
  printSubtree(tree, 0, stream);
}
// recursive helper function for the above, which does the actual work
static void printSubtree(const DynGenPar::Node &node, int level,
                         QTextStream &stream) {
  for (int i=0; i<level; i++) stream << ' ';
  stream << node.cat << endl;
  switch (node.children.size()) {
    case 0: // no alternatives = error node
      for (int i=0; i<=level; i++) stream << ' ';
      stream << "ERROR" << endl;
      break;
    case 1: // 1 alternative, normal tree
      foreach (const DynGenPar::Node &child, node.children.first())
        printSubtree(child, level+1, stream);
      break;
    default: // multiple alternatives
      foreach (const DynGenPar::Alternative &subtree, node.children) {
        for (int i=0; i<=level; i++) stream << ' ';
        stream << "ALTERNATIVE" << endl;
        foreach (const DynGenPar::Node &child, subtree)
          printSubtree(child, level+2, stream);
      }
      break;
  }
}

You would call the above printParseTree function using e.g.

foreach (const DynGenPar::Match &m, matches) printParseTree(m.tree);

on the result of the previous example.

By default, category names, i.e. the DynGenPar::Cat typedef, are strings, which is convenient for debugging, but often not what is wanted in practice. For efficiency and/or interoperability, categories can be forced to be integers instead, by #defining DYNGENPAR_INTEGER_CATEGORIES. Some contexts such as the PGF support (pgf.h), the DynGenPar::FlexLexerTokenSource (flexlexertokensource.h) and the Java bindings always force this option.

Java Bindings

Java bindings for DynGenPar based on Qt Jambi are provided. Most of the DynGenPar API is available also to Java developers.

All the APIs provided by DynGenPar are in the concise.DynGenPar package. You can use

import concise.DynGenPar.*;

to import the entire package.

The following mappings and restrictions apply:

The DynGenPar namespace is mapped to the concise.DynGenPar package.
The global functions in the DynGenPar namespace are mapped to static methods of a concise.DynGenPar.Util class.
The global constants and enumerations in the DynGenPar namespace are mapped as follows:
- DynGenPar::PgfToken* to concise.DynGenPar.Pgf.Token*
- DynGenPar::PreludeBind to concise.DynGenPar.Pgf.PreludeBind
- DynGenPar::ByteToken* to concise.DynGenPar.ByteTokenSource.*
DYNGENPAR_INTEGER_CATEGORIES is always defined, i.e. category identifiers are always integers, not strings. See DynGenPar::Cat.
The internals of parse stacks cannot be accessed through Java. Parse stacks are only exposed in the form of opaque parse states (concise.DynGenPar.ParseState, i.e. DynGenPar::ParseState). The DynGenPar::ParseState::incrementalStacks member cannot be accessed from Java. In particular, only the overloads of DynGenPar::Parser::parse and of the Prediction methods in DynGenPar::Parser which take a DynGenPar::ParseState are available.
Likewise, the DynGenPar::PseudoCatScopeData class and the DynGenPar::PseudoCatScope::data method that returns a pointer to it are not available.
Public class/structure member fields such as DynGenPar::Parser::tokens are mapped to getters, e.g. concise.DynGenPar.Parser.tokens, and setters, e.g. concise.DynGenPar.Parser.setTokens. This is because Java only supports public member fields for Java code, it does not support native fields nor a property mechanism that would allow hiding the required function calls to access the C++ class's member field.
Qt lists, hash tables etc. are mapped to their Java equivalents.
Multi-valued hash tables are not directly supported in Java, therefore they are mapped to list-valued hash tables.
List classes (classes derived from QList) can be appended to using a Java-style add method, and converted to an iterable Java list using a toList method. (In C++, the toList method is just an alias for a trivial cast and returns a writable reference to a QList. In Java, due to the required conversion, the returned list can be used for reading only.)
Java does not support typedef, so unfortunately you have to spell out the actual types of things like DynGenPar::RuleSet.

This documentation documents the C++ API. With the exception of the above changes, the same interfaces can also be used from Java. In particular, it is transparently possible to implement virtual methods of C++ classes in derived Java classes, and thus interfaces required by the C++ code can also be implemented in Java.

In Java, the basic example for how to use DynGenPar::Parser would look as follows:

concise.DynGenPar.ListTokenSource tokenSource
  = new concise.DynGenPar.ListTokenSource();
concise.DynGenPar.Parser parser = new concise.DynGenPar.Parser(tokenSource);
// We have to specify the grammar.
// In this example, the grammar is hardcoded.
// Category identifiers must be integers, but char promotes to int.
// Unfortunately, the syntax is not quite as pretty as the C++ one.
parser.addToken('a');
parser.addToken('b');
parser.addToken('c');
java.util.HashMap<Integer, java.util.List<concise.DynGenPar.Rule> > rules
  = new java.util.HashMap<Integer,
                          java.util.List<concise.DynGenPar.Rule> >();
java.util.List<concise.DynGenPar.Rule> sRules
  = new java.util.ArrayList<concise.DynGenPar.Rule>();
concise.DynGenPar.Rule rule1 = new concise.DynGenPar.Rule();
rule1.add('A');
sRules.add(rule1);
concise.DynGenPar.Rule rule2 = new concise.DynGenPar.Rule();
rule2.add('A');
rule2.add('S');
sRules.add(rule2);
rules.put('S', sRules);
java.util.List<concise.DynGenPar.Rule> aRules
  = new java.util.ArrayList<concise.DynGenPar.Rule>();
concise.DynGenPar.Rule rule3 = new concise.DynGenPar.Rule();
rule3.add('a');
aRules.add(rule3);
concise.DynGenPar.Rule rule4 = new concise.DynGenPar.Rule();
rule4.add('b');
aRules.add(rule4);
concise.DynGenPar.Rule rule5 = new concise.DynGenPar.Rule();
rule5.add('c');
aRules.add(rule5);
rules.put('A', aRules);
parser.setRules(rules);
parser.setStartCat('S');
// Whenever we set the grammar directly, we have to call this function.
parser.initCaches();
// And here, we specify the sample input, also hardcoded.
java.util.List<Integer> inputTokens = new java.util.ArrayList<Integer>();
inputTokens.add('b');
inputTokens.add('a');
tokenSource.setInputTokens(inputTokens);
// Now we parse the input, using a default parse state.
concise.DynGenPar.ParseState parseState
  = new concise.DynGenPar.ParseState();
java.util.List<concise.DynGenPar.Match> matches = parser.parse(parseState);

The DynGenPar Java bindings are used in the FMathL Concise environment. [2]

Features

This section lists the advanced features supported by DynGenPar.

Dynamic Grammar Additions

The most unique feature of DynGenPar is the possibility to add rules to the grammar at almost any time, even during parsing. See DynGenPar::Parser::addRule.

Incremental Parsing

DynGenPar allows operating on its input incrementally, parsing interactively as input is produced and remembering its state. See DynGenPar::Parser::parse and DynGenPar::ParseState.

It is also possible to go back by resuming at a previous saved state.

Prediction

Possible continuations, i.e. tokens which can come next, can be predicted at any saved incremental parsing state, see DynGenPar::Parser::computePredictions and DynGenPar::Parser::computeConstrainedPredictions.

It is also possible to predict entire "literals", which are sequences of tokens appearing sequentially within a rule, see DynGenPar::Parser::computeMultiPredictions and DynGenPar::Parser::computeConstrainedMultiPredictions. This is particularly useful for scannerless parsing.

Rule Labels

CFG rules can have labels, which are reproduced in the produced parse tree. Those labels can be anything which can be stored in a QVariant, including strings, integers and pointers to arbitrary objects. See DynGenPar::Rule::Rule, DynGenPar::Rule::label and DynGenPar::Rule::setLabel.

Custom Parse Actions

It is also possible to attach an action to a rule, which will be executed when the rule is matched. Note that actions will currently only be executed for nonempty matches (and in particular, actions on epsilon rules will always be ignored) and that an action will be executed even if the detected match only appears in some of the considered parses and is later discarded due to the input that follows. An action must be a subclass of DynGenPar::Action implementing the pure virtual DynGenPar::Action::execute method. See DynGenPar::Action and DynGenPar::Rule::action.

Arbitrary Token Sources

DynGenPar accepts tokens from any source implementing the DynGenPar::TokenSource interface. Your class only has to derive from DynGenPar::TokenSource and implement the needed virtual methods, at least the pure virtual DynGenPar::TokenSource::readToken.

A number of ready-made token sources are also provided. See DynGenPar::ListTokenSource, DynGenPar::ByteTokenSource and DynGenPar::TextByteTokenSource.

Flex Lexer Token Source

In particular, it is possible to use a lexer generated by Flex as a token source. See DynGenPar::FlexLexerTokenSource.

Hierarchical Parsing

A token source can return, along with a token, a complete parse (sub)tree to use in lieu of a leaf node in the resulting parse tree, making it possible to call, from your token source, another parser, or another instance of DynGenPar (which is fully reentrant), and to return the result as a single token for the higher-level grammar. See DynGenPar::TokenSource::tree.

Parallel Multiple Context-Free Grammars (PMCFGs)

PMCFGs (Parallel Multiple Context-Free Grammars) are an extension of context-free grammars used for natural language, e.g. by the Grammatical Framework GF. They are natively supported by DynGenPar. See DynGenPar::Pmcfg, DynGenPar::Parser::loadPmcfg, DynGenPar::Parser::addPmcfgRule and DynGenPar::parseTreeToPmcfgSyntaxTree.

Grammatical Framework (GF) PGF Grammars

In particular, DynGenPar can import compiled grammars produced by the GF compiler, in the PGF (Portable Grammatical Framework) file format, automatically converting them to PMCFGs in standard form and providing a GF-compatible lexer. See DynGenPar::Pgf and DynGenPar::PgfParser.

Next Token Constraints

Rules can have constraints attached that require (expect) or forbid (taboo) certain tokens following the rule. This is particularly useful for scannerless parsing, where it allows one to implement the usual context-sensitive scannerless parsing primitives such as maximal matches. It can also be used to enforce grammatically correct word sequences, e.g. for the singular indefinite article ("a" vs. "an") in English. See DynGenPar::NextTokenConstraints and DynGenPar::Rule::nextTokenConstraints.

TODO

The following desirable features are currently not supported:

some PMCFGs with infinitely ambiguous context-free approximations
general context-sensitive constraints on rules (except PMCFG, next token)
triggering parse actions on empty matches (including epsilon rules)
error correction (we only have basic error detection)

(These features may or may not turn out to be required in practice.)

It is planned to add support for these features to the algorithm as needed, without changing the basic structure.

Copyright

DynGenPar: Dynamic Generalized Parser
Copyright (C) 2010-2013 Kevin Kofler <kevin.kofler@chello.at>
Copyright (C) 2014-2018 DAGOPT Optimization Technologies GmbH, written by Kevin Kofler <kofler@dagopt.com>

Acknowledgments

Support by the Austrian Science Fund FWF under contract numbers P20631, P23554 and P22239 is gratefully acknowledged.

Licensing and Warranty Disclaimer

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Citations

The publication [1] constitutes a suitable citation for academic works. Though not required by the software license, it is kindly requested to include that citation in any research papers using this program.