论文标题

仔细观察TDFA

A closer look at TDFA

论文作者

Borsotti, Angelo, Trafimovich, Ulya

论文摘要

我们提出了一种基于标记的确定性有限自动机的算法,用于正则表达式解析和征匹提取。该算法符合不同的歧义政策。我们为算法提供了详细的伪代码,涵盖了重要的实践优化。从正则表达式到优化自动机的所有转换都在逐步的示例上解释。我们考虑提前和即时确定化,并描述适合每种设置的算法的变体。我们提供的基准表明该算法在实践中非常快。我们的研究基于两个独立的实现:开源Lexer Generator RE2C和一个实验性Java库。

We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm, covering important practical optimizations. All transformations from a regular expression to an optimized automaton are explained on a step-by-step example. We consider both ahead-of-time and just-in-time determinization and describe variants of the algorithm suited to each setting. We provide benchmarks showing that the algorithm is very fast in practice. Our research is based on two independent implementations: an open-source lexer generator RE2C and an experimental Java library.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源