DFA 和 NFA 已经把“字符串识别”拆成了状态、输入字符、转移规则和接受状态。正则表达式站在更高一层:开发者写 a(b|c)*,机器负责把它变成可以执行的匹配过程。
这篇文章只处理传统正则表达式的核心结构:字面量、连接、选择和重复。现代正则 API 还包含捕获组、环视、反向引用、贪婪/非贪婪策略等扩展;这些扩展属于工程实现层,不影响本文要展示的主线。
核心链路很短:
1
| regex text -> regex AST -> NFA design -> accepts(text)
|
本文不写正则 parser,直接手工构造 AST。前面文章已经展示过“字符串到 AST”的方法,这里把注意力放在第二步:一个正则 AST 节点怎样编译成 NFA。
正则表达式先变成结构
a(b|c)* 不是一串神秘字符。按传统正则语义,它可以拆成四种结构。
| 正则片段 |
AST 节点 |
含义 |
a |
Literal("a") |
匹配一个字符 |
bc |
Concatenate(Literal("b"), Literal("c")) |
先匹配 b,再匹配 c |
| `b |
c` |
Choose(Literal("b"), Literal("c")) |
b* |
Repeat(Literal("b")) |
匹配零个或多个 b |
因此,a(b|c)* 可以写成:
1 2 3 4
| pattern = Concatenate( Literal("a"), Repeat(Choose(Literal("b"), Literal("c"))), )
|
这个写法看起来比原始正则长很多,但它把隐含结构完全摊开了。正则编译器真正需要处理的不是字符画,而是一棵树。树上的每个节点都知道怎样把自己变成一台小 NFA,再由父节点把小 NFA 组合起来。
模式提炼:文本先变结构,结构再变机器
text -> tree -> executable model
直接解释字符串通常会很快变乱。更稳的做法是先把文本解析成结构,再把结构翻译成可执行模型。
| 场景 |
文本 |
结构 |
可执行模型 |
| 正则表达式 |
`a(b |
c)*` |
正则 AST |
| SQL |
select ... |
查询树 / 逻辑计划 |
物理执行计划 |
| 模板 |
hello {{name}} |
模板 AST |
渲染函数 |
| 规则配置 |
JSON / YAML |
规则节点 |
规则执行器 |
听到“这段文本语法越来越复杂”时,应该先寻找 AST,而不是继续堆字符串分支。
复用 NFA 基础设施
前一篇 NFA 的代码可以原样复用。为了组合多台小 NFA,状态用递增整数生成,避免不同片段的状态名冲突。
1 2 3 4 5 6 7 8 9 10 11 12 13
| from __future__ import annotations
from dataclasses import dataclass from itertools import count
State = int Character = str | None _next_state = count()
def new_state() -> State: return next(_next_state)
|
FARule、NFARulebook、NFA 和 NFADesign 仍然保持前一篇的形状。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
| @dataclass(frozen=True) class FARule: state: State character: Character next_state: State
def applies_to(self, state: State, character: Character) -> bool: return self.state == state and self.character == character
def follow(self) -> State: return self.next_state
class NFARulebook: def __init__(self, rules: list[FARule]) -> None: self.rules = rules
def next_states(self, states: set[State], character: Character) -> set[State]: next_states = set()
for state in states: for rule in self.rules_for(state, character): next_states.add(rule.follow())
return next_states
def rules_for(self, state: State, character: Character) -> list[FARule]: return [rule for rule in self.rules if rule.applies_to(state, character)]
def follow_free_moves(self, states: set[State]) -> set[State]: more_states = self.next_states(states, None)
if more_states <= states: return states
return self.follow_free_moves(states | more_states)
|
运行态 NFA 负责读字符,设计态 NFADesign 负责反复创建运行态对象。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
| class NFA: def __init__( self, current_states: set[State], accept_states: set[State], rulebook: NFARulebook, ) -> None: self._current_states = current_states self.accept_states = accept_states self.rulebook = rulebook
@property def current_states(self) -> set[State]: return self.rulebook.follow_free_moves(self._current_states)
def accepting(self) -> bool: return bool(self.current_states & self.accept_states)
def read_character(self, character: str) -> None: self._current_states = self.rulebook.next_states( self.current_states, character, )
def read_string(self, text: str) -> None: for character in text: self.read_character(character)
@dataclass(frozen=True) class NFADesign: start_state: State accept_states: set[State] rulebook: NFARulebook
def to_nfa(self) -> NFA: return NFA({self.start_state}, self.accept_states, self.rulebook)
def accepts(self, text: str) -> bool: nfa = self.to_nfa() nfa.read_string(text) return nfa.accepting()
|
每个正则节点都编译成 NFA
正则 AST 节点统一实现 to_nfa_design。matches 只是一个便利方法:先编译成 NFA,再用 NFA 读取字符串。
1 2 3 4 5 6
| class Pattern: def to_nfa_design(self) -> NFADesign: raise NotImplementedError
def matches(self, text: str) -> bool: return self.to_nfa_design().accepts(text)
|
Literal:一条字符转移
字面量节点最简单。它生成两个状态和一条字符转移。
1 2 3 4 5 6 7 8 9
| @dataclass(frozen=True) class Literal(Pattern): character: str
def to_nfa_design(self) -> NFADesign: start_state = new_state() accept_state = new_state() rule = FARule(start_state, self.character, accept_state) return NFADesign(start_state, {accept_state}, NFARulebook([rule]))
|
Concatenate:把前一段的接受状态接到后一段
连接节点表示“先匹配左边,再匹配右边”。做法是先分别编译左右两段,然后从左边每个接受状态加一条空转移,指向右边的起始状态。
1
| first accept --ε--> second start
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
| @dataclass(frozen=True) class Concatenate(Pattern): first: Pattern second: Pattern
def to_nfa_design(self) -> NFADesign: first_design = self.first.to_nfa_design() second_design = self.second.to_nfa_design() extra_rules = [ FARule(state, None, second_design.start_state) for state in first_design.accept_states ] rules = first_design.rulebook.rules + second_design.rulebook.rules + extra_rules return NFADesign( first_design.start_state, second_design.accept_states, NFARulebook(rules), )
|
Choose:新起点分叉到两段
选择节点表示“匹配左边或右边”。做法是创建一个新起点,再用两条空转移分别连到左右两段的起点。接受状态是左右两段接受状态的并集。
1 2
| new start --ε--> first start new start --ε--> second start
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
| @dataclass(frozen=True) class Choose(Pattern): first: Pattern second: Pattern
def to_nfa_design(self) -> NFADesign: first_design = self.first.to_nfa_design() second_design = self.second.to_nfa_design() start_state = new_state() extra_rules = [ FARule(start_state, None, first_design.start_state), FARule(start_state, None, second_design.start_state), ] rules = first_design.rulebook.rules + second_design.rulebook.rules + extra_rules return NFADesign( start_state, first_design.accept_states | second_design.accept_states, NFARulebook(rules), )
|
Repeat:新起点接受空串,并回环到子模式
重复节点表示“零次或多次”。它需要满足两个条件:空字符串应该被接受;每匹配完一次子模式以后,还能再回到子模式开头继续匹配。
1 2
| new start --ε--> pattern start pattern accept --ε--> pattern start
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
| @dataclass(frozen=True) class Repeat(Pattern): pattern: Pattern
def to_nfa_design(self) -> NFADesign: design = self.pattern.to_nfa_design() start_state = new_state() extra_rules = [FARule(start_state, None, design.start_state)] extra_rules.extend( FARule(state, None, design.start_state) for state in design.accept_states ) rules = design.rulebook.rules + extra_rules return NFADesign( start_state, design.accept_states | {start_state}, NFARulebook(rules), )
|
模式提炼:组合节点只管连接边界
child_machine(s) + boundary_rules -> parent_machine
每个组合节点不需要理解子节点内部怎么工作。它只需要拿到子节点的起始状态、接受状态和规则表,再补几条边界规则。
| 节点 |
子机器数量 |
新增边界规则 |
接受状态 |
Literal |
0 |
字符转移 |
新接受状态 |
Concatenate |
2 |
左接受状态到右起点的空转移 |
右接受状态 |
Choose |
2 |
新起点到两边起点的空转移 |
两边接受状态并集 |
Repeat |
1 |
新起点到子起点、子接受状态回子起点 |
子接受状态加新起点 |
这个模式在 DSL 编译器里很常见:叶子节点生成最小执行单元,组合节点只负责把子单元粘起来。
完整示例
完整代码如下。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184
| from __future__ import annotations
from dataclasses import dataclass from itertools import count
State = int Character = str | None _next_state = count()
def new_state() -> State: return next(_next_state)
@dataclass(frozen=True) class FARule: state: State character: Character next_state: State
def applies_to(self, state: State, character: Character) -> bool: return self.state == state and self.character == character
def follow(self) -> State: return self.next_state
class NFARulebook: def __init__(self, rules: list[FARule]) -> None: self.rules = rules
def next_states(self, states: set[State], character: Character) -> set[State]: next_states = set()
for state in states: for rule in self.rules_for(state, character): next_states.add(rule.follow())
return next_states
def rules_for(self, state: State, character: Character) -> list[FARule]: return [rule for rule in self.rules if rule.applies_to(state, character)]
def follow_free_moves(self, states: set[State]) -> set[State]: more_states = self.next_states(states, None)
if more_states <= states: return states
return self.follow_free_moves(states | more_states)
class NFA: def __init__( self, current_states: set[State], accept_states: set[State], rulebook: NFARulebook, ) -> None: self._current_states = current_states self.accept_states = accept_states self.rulebook = rulebook
@property def current_states(self) -> set[State]: return self.rulebook.follow_free_moves(self._current_states)
def accepting(self) -> bool: return bool(self.current_states & self.accept_states)
def read_character(self, character: str) -> None: self._current_states = self.rulebook.next_states( self.current_states, character, )
def read_string(self, text: str) -> None: for character in text: self.read_character(character)
@dataclass(frozen=True) class NFADesign: start_state: State accept_states: set[State] rulebook: NFARulebook
def to_nfa(self) -> NFA: return NFA({self.start_state}, self.accept_states, self.rulebook)
def accepts(self, text: str) -> bool: nfa = self.to_nfa() nfa.read_string(text) return nfa.accepting()
class Pattern: def to_nfa_design(self) -> NFADesign: raise NotImplementedError
def matches(self, text: str) -> bool: return self.to_nfa_design().accepts(text)
@dataclass(frozen=True) class Literal(Pattern): character: str
def to_nfa_design(self) -> NFADesign: start_state = new_state() accept_state = new_state() rule = FARule(start_state, self.character, accept_state) return NFADesign(start_state, {accept_state}, NFARulebook([rule]))
@dataclass(frozen=True) class Concatenate(Pattern): first: Pattern second: Pattern
def to_nfa_design(self) -> NFADesign: first_design = self.first.to_nfa_design() second_design = self.second.to_nfa_design() extra_rules = [ FARule(state, None, second_design.start_state) for state in first_design.accept_states ] rules = first_design.rulebook.rules + second_design.rulebook.rules + extra_rules return NFADesign( first_design.start_state, second_design.accept_states, NFARulebook(rules), )
@dataclass(frozen=True) class Choose(Pattern): first: Pattern second: Pattern
def to_nfa_design(self) -> NFADesign: first_design = self.first.to_nfa_design() second_design = self.second.to_nfa_design() start_state = new_state() extra_rules = [ FARule(start_state, None, first_design.start_state), FARule(start_state, None, second_design.start_state), ] rules = first_design.rulebook.rules + second_design.rulebook.rules + extra_rules return NFADesign( start_state, first_design.accept_states | second_design.accept_states, NFARulebook(rules), )
@dataclass(frozen=True) class Repeat(Pattern): pattern: Pattern
def to_nfa_design(self) -> NFADesign: design = self.pattern.to_nfa_design() start_state = new_state() extra_rules = [FARule(start_state, None, design.start_state)] extra_rules.extend( FARule(state, None, design.start_state) for state in design.accept_states ) rules = design.rulebook.rules + extra_rules return NFADesign( start_state, design.accept_states | {start_state}, NFARulebook(rules), )
pattern = Concatenate( Literal("a"), Repeat(Choose(Literal("b"), Literal("c"))), )
for sample in ["", "a", "ab", "accc", "abcbc", "ad", "ba"]: print(f"{sample!r:7} -> {pattern.matches(sample)}")
|
运行结果:
1 2 3 4 5 6 7
| '' -> False 'a' -> True 'ab' -> True 'accc' -> True 'abcbc' -> True 'ad' -> False 'ba' -> False
|
a(b|c)* 要求第一个字符必须是 a,后面只能出现零个或多个 b / c。所以 a、ab、accc、abcbc 被接受;空字符串、ad、ba 被拒绝。
Java 程序员的迁移表
| Python 写法 |
Java 对照 |
作用 |
Pattern |
AST 基类 / sealed interface |
正则节点抽象 |
Literal |
record Literal(char ch) |
字面量节点 |
Concatenate |
record Concat(Pattern left, Pattern right) |
顺序组合 |
Choose |
record Choice(Pattern left, Pattern right) |
分支组合 |
Repeat |
record Repeat(Pattern inner) |
零次或多次 |
to_nfa_design() |
compile() |
从结构生成可执行模型 |
NFADesign.accepts |
预编译 matcher |
复用编译结果执行匹配 |
Java 的 Pattern.compile() 不等于本文这段玩具代码。真实实现要处理字符类、量词优先级、捕获组、Unicode、边界匹配、回溯策略和错误诊断。本文只保留一个工程直觉:正则文本不会直接“魔法执行”,它会先变成某种内部结构,再由匹配器执行。
如果只看传统正则核心,NFA 是一个很自然的中间模型。| 变成空转移分叉,连接变成空转移拼接,* 变成带回环的空转移结构。复杂模式由小机器组合出来。
从编译角度看 Pattern.compile
日常代码里直接写:
1 2 3
| Pattern pattern = Pattern.compile("a(b|c)*"); Matcher matcher = pattern.matcher(input); boolean matched = matcher.matches();
|
这个 API 暴露了两个阶段:编译和匹配。编译阶段把模式字符串变成内部结构;匹配阶段把这个结构应用到输入字符串。频繁匹配同一个 pattern 时,复用 Pattern 对象通常比每次重新编译更合理。
这个分层不只存在于正则表达式里。SQL 预编译、模板预编译、规则引擎预加载、表达式引擎缓存,都遵循同一个模式。
模式提炼:预编译可执行结构
source -> compiled_model; compiled_model.run(input)
当文本规则会被反复执行时,先把文本翻译成可执行结构,再复用这个结构。
| 场景 |
source |
compiled model |
run input |
| 正则 |
pattern 字符串 |
Pattern / NFA |
待匹配字符串 |
| SQL |
SQL 文本 |
查询计划 |
参数和表数据 |
| 模板 |
模板文本 |
渲染函数 |
渲染上下文 |
| 规则引擎 |
规则配置 |
规则图 / 决策树 |
事实集合 |
这个模式的收益来自两个地方:解析成本被摊薄,结构化模型也更容易检查、优化和缓存。
练习
第一,增加 Empty 节点,让它匹配空字符串。提示:起始状态本身就是接受状态。
第二,用现有节点手工构造 a|bc,测试 a、bc、b、abc 的结果。
第三,给 Repeat(Literal("a")) 写一组测试,确认它接受空字符串、a、aaa,拒绝 b。
第四,把 new_state() 的递增整数换成 UUID 或带前缀的字符串,观察规则表是否仍然能正常工作。
参考资料