DFA 和 NFA 已经把“字符串识别”拆成了状态、输入字符、转移规则和接受状态。正则表达式站在更高一层:开发者写 a(b|c)*,机器负责把它变成可以执行的匹配过程。

这篇文章只处理传统正则表达式的核心结构:字面量、连接、选择和重复。现代正则 API 还包含捕获组、环视、反向引用、贪婪/非贪婪策略等扩展;这些扩展属于工程实现层,不影响本文要展示的主线。

核心链路很短:

1
regex text -> regex AST -> NFA design -> accepts(text)

本文不写正则 parser,直接手工构造 AST。前面文章已经展示过“字符串到 AST”的方法,这里把注意力放在第二步:一个正则 AST 节点怎样编译成 NFA。

正则表达式先变成结构

a(b|c)* 不是一串神秘字符。按传统正则语义,它可以拆成四种结构。

正则片段 AST 节点 含义
a Literal("a") 匹配一个字符
bc Concatenate(Literal("b"), Literal("c")) 先匹配 b,再匹配 c
`b c` Choose(Literal("b"), Literal("c"))
b* Repeat(Literal("b")) 匹配零个或多个 b

因此,a(b|c)* 可以写成:

1
2
3
4
pattern = Concatenate(
Literal("a"),
Repeat(Choose(Literal("b"), Literal("c"))),
)

这个写法看起来比原始正则长很多,但它把隐含结构完全摊开了。正则编译器真正需要处理的不是字符画,而是一棵树。树上的每个节点都知道怎样把自己变成一台小 NFA,再由父节点把小 NFA 组合起来。

模式提炼:文本先变结构,结构再变机器

text -> tree -> executable model

直接解释字符串通常会很快变乱。更稳的做法是先把文本解析成结构,再把结构翻译成可执行模型。

场景 文本 结构 可执行模型
正则表达式 `a(b c)*` 正则 AST
SQL select ... 查询树 / 逻辑计划 物理执行计划
模板 hello {{name}} 模板 AST 渲染函数
规则配置 JSON / YAML 规则节点 规则执行器

听到“这段文本语法越来越复杂”时,应该先寻找 AST,而不是继续堆字符串分支。

复用 NFA 基础设施

前一篇 NFA 的代码可以原样复用。为了组合多台小 NFA,状态用递增整数生成,避免不同片段的状态名冲突。

1
2
3
4
5
6
7
8
9
10
11
12
13
from __future__ import annotations

from dataclasses import dataclass
from itertools import count


State = int
Character = str | None
_next_state = count()


def new_state() -> State:
return next(_next_state)

FARuleNFARulebookNFANFADesign 仍然保持前一篇的形状。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
@dataclass(frozen=True)
class FARule:
state: State
character: Character
next_state: State

def applies_to(self, state: State, character: Character) -> bool:
return self.state == state and self.character == character

def follow(self) -> State:
return self.next_state


class NFARulebook:
def __init__(self, rules: list[FARule]) -> None:
self.rules = rules

def next_states(self, states: set[State], character: Character) -> set[State]:
next_states = set()

for state in states:
for rule in self.rules_for(state, character):
next_states.add(rule.follow())

return next_states

def rules_for(self, state: State, character: Character) -> list[FARule]:
return [rule for rule in self.rules if rule.applies_to(state, character)]

def follow_free_moves(self, states: set[State]) -> set[State]:
more_states = self.next_states(states, None)

if more_states <= states:
return states

return self.follow_free_moves(states | more_states)

运行态 NFA 负责读字符,设计态 NFADesign 负责反复创建运行态对象。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
class NFA:
def __init__(
self,
current_states: set[State],
accept_states: set[State],
rulebook: NFARulebook,
) -> None:
self._current_states = current_states
self.accept_states = accept_states
self.rulebook = rulebook

@property
def current_states(self) -> set[State]:
return self.rulebook.follow_free_moves(self._current_states)

def accepting(self) -> bool:
return bool(self.current_states & self.accept_states)

def read_character(self, character: str) -> None:
self._current_states = self.rulebook.next_states(
self.current_states,
character,
)

def read_string(self, text: str) -> None:
for character in text:
self.read_character(character)


@dataclass(frozen=True)
class NFADesign:
start_state: State
accept_states: set[State]
rulebook: NFARulebook

def to_nfa(self) -> NFA:
return NFA({self.start_state}, self.accept_states, self.rulebook)

def accepts(self, text: str) -> bool:
nfa = self.to_nfa()
nfa.read_string(text)
return nfa.accepting()

每个正则节点都编译成 NFA

正则 AST 节点统一实现 to_nfa_designmatches 只是一个便利方法:先编译成 NFA,再用 NFA 读取字符串。

1
2
3
4
5
6
class Pattern:
def to_nfa_design(self) -> NFADesign:
raise NotImplementedError

def matches(self, text: str) -> bool:
return self.to_nfa_design().accepts(text)

Literal:一条字符转移

字面量节点最简单。它生成两个状态和一条字符转移。

1
start --a--> accept
1
2
3
4
5
6
7
8
9
@dataclass(frozen=True)
class Literal(Pattern):
character: str

def to_nfa_design(self) -> NFADesign:
start_state = new_state()
accept_state = new_state()
rule = FARule(start_state, self.character, accept_state)
return NFADesign(start_state, {accept_state}, NFARulebook([rule]))

Concatenate:把前一段的接受状态接到后一段

连接节点表示“先匹配左边,再匹配右边”。做法是先分别编译左右两段,然后从左边每个接受状态加一条空转移,指向右边的起始状态。

1
first accept --ε--> second start
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
@dataclass(frozen=True)
class Concatenate(Pattern):
first: Pattern
second: Pattern

def to_nfa_design(self) -> NFADesign:
first_design = self.first.to_nfa_design()
second_design = self.second.to_nfa_design()
extra_rules = [
FARule(state, None, second_design.start_state)
for state in first_design.accept_states
]
rules = first_design.rulebook.rules + second_design.rulebook.rules + extra_rules
return NFADesign(
first_design.start_state,
second_design.accept_states,
NFARulebook(rules),
)

Choose:新起点分叉到两段

选择节点表示“匹配左边或右边”。做法是创建一个新起点,再用两条空转移分别连到左右两段的起点。接受状态是左右两段接受状态的并集。

1
2
new start --ε--> first start
new start --ε--> second start
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
@dataclass(frozen=True)
class Choose(Pattern):
first: Pattern
second: Pattern

def to_nfa_design(self) -> NFADesign:
first_design = self.first.to_nfa_design()
second_design = self.second.to_nfa_design()
start_state = new_state()
extra_rules = [
FARule(start_state, None, first_design.start_state),
FARule(start_state, None, second_design.start_state),
]
rules = first_design.rulebook.rules + second_design.rulebook.rules + extra_rules
return NFADesign(
start_state,
first_design.accept_states | second_design.accept_states,
NFARulebook(rules),
)

Repeat:新起点接受空串,并回环到子模式

重复节点表示“零次或多次”。它需要满足两个条件:空字符串应该被接受;每匹配完一次子模式以后,还能再回到子模式开头继续匹配。

1
2
new start --ε--> pattern start
pattern accept --ε--> pattern start
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
@dataclass(frozen=True)
class Repeat(Pattern):
pattern: Pattern

def to_nfa_design(self) -> NFADesign:
design = self.pattern.to_nfa_design()
start_state = new_state()
extra_rules = [FARule(start_state, None, design.start_state)]
extra_rules.extend(
FARule(state, None, design.start_state)
for state in design.accept_states
)
rules = design.rulebook.rules + extra_rules
return NFADesign(
start_state,
design.accept_states | {start_state},
NFARulebook(rules),
)

模式提炼:组合节点只管连接边界

child_machine(s) + boundary_rules -> parent_machine

每个组合节点不需要理解子节点内部怎么工作。它只需要拿到子节点的起始状态、接受状态和规则表,再补几条边界规则。

节点 子机器数量 新增边界规则 接受状态
Literal 0 字符转移 新接受状态
Concatenate 2 左接受状态到右起点的空转移 右接受状态
Choose 2 新起点到两边起点的空转移 两边接受状态并集
Repeat 1 新起点到子起点、子接受状态回子起点 子接受状态加新起点

这个模式在 DSL 编译器里很常见:叶子节点生成最小执行单元,组合节点只负责把子单元粘起来。

完整示例

完整代码如下。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
from __future__ import annotations

from dataclasses import dataclass
from itertools import count


State = int
Character = str | None
_next_state = count()


def new_state() -> State:
return next(_next_state)


@dataclass(frozen=True)
class FARule:
state: State
character: Character
next_state: State

def applies_to(self, state: State, character: Character) -> bool:
return self.state == state and self.character == character

def follow(self) -> State:
return self.next_state


class NFARulebook:
def __init__(self, rules: list[FARule]) -> None:
self.rules = rules

def next_states(self, states: set[State], character: Character) -> set[State]:
next_states = set()

for state in states:
for rule in self.rules_for(state, character):
next_states.add(rule.follow())

return next_states

def rules_for(self, state: State, character: Character) -> list[FARule]:
return [rule for rule in self.rules if rule.applies_to(state, character)]

def follow_free_moves(self, states: set[State]) -> set[State]:
more_states = self.next_states(states, None)

if more_states <= states:
return states

return self.follow_free_moves(states | more_states)


class NFA:
def __init__(
self,
current_states: set[State],
accept_states: set[State],
rulebook: NFARulebook,
) -> None:
self._current_states = current_states
self.accept_states = accept_states
self.rulebook = rulebook

@property
def current_states(self) -> set[State]:
return self.rulebook.follow_free_moves(self._current_states)

def accepting(self) -> bool:
return bool(self.current_states & self.accept_states)

def read_character(self, character: str) -> None:
self._current_states = self.rulebook.next_states(
self.current_states,
character,
)

def read_string(self, text: str) -> None:
for character in text:
self.read_character(character)


@dataclass(frozen=True)
class NFADesign:
start_state: State
accept_states: set[State]
rulebook: NFARulebook

def to_nfa(self) -> NFA:
return NFA({self.start_state}, self.accept_states, self.rulebook)

def accepts(self, text: str) -> bool:
nfa = self.to_nfa()
nfa.read_string(text)
return nfa.accepting()


class Pattern:
def to_nfa_design(self) -> NFADesign:
raise NotImplementedError

def matches(self, text: str) -> bool:
return self.to_nfa_design().accepts(text)


@dataclass(frozen=True)
class Literal(Pattern):
character: str

def to_nfa_design(self) -> NFADesign:
start_state = new_state()
accept_state = new_state()
rule = FARule(start_state, self.character, accept_state)
return NFADesign(start_state, {accept_state}, NFARulebook([rule]))


@dataclass(frozen=True)
class Concatenate(Pattern):
first: Pattern
second: Pattern

def to_nfa_design(self) -> NFADesign:
first_design = self.first.to_nfa_design()
second_design = self.second.to_nfa_design()
extra_rules = [
FARule(state, None, second_design.start_state)
for state in first_design.accept_states
]
rules = first_design.rulebook.rules + second_design.rulebook.rules + extra_rules
return NFADesign(
first_design.start_state,
second_design.accept_states,
NFARulebook(rules),
)


@dataclass(frozen=True)
class Choose(Pattern):
first: Pattern
second: Pattern

def to_nfa_design(self) -> NFADesign:
first_design = self.first.to_nfa_design()
second_design = self.second.to_nfa_design()
start_state = new_state()
extra_rules = [
FARule(start_state, None, first_design.start_state),
FARule(start_state, None, second_design.start_state),
]
rules = first_design.rulebook.rules + second_design.rulebook.rules + extra_rules
return NFADesign(
start_state,
first_design.accept_states | second_design.accept_states,
NFARulebook(rules),
)


@dataclass(frozen=True)
class Repeat(Pattern):
pattern: Pattern

def to_nfa_design(self) -> NFADesign:
design = self.pattern.to_nfa_design()
start_state = new_state()
extra_rules = [FARule(start_state, None, design.start_state)]
extra_rules.extend(
FARule(state, None, design.start_state)
for state in design.accept_states
)
rules = design.rulebook.rules + extra_rules
return NFADesign(
start_state,
design.accept_states | {start_state},
NFARulebook(rules),
)


pattern = Concatenate(
Literal("a"),
Repeat(Choose(Literal("b"), Literal("c"))),
)

for sample in ["", "a", "ab", "accc", "abcbc", "ad", "ba"]:
print(f"{sample!r:7} -> {pattern.matches(sample)}")

运行结果:

1
2
3
4
5
6
7
''      -> False
'a' -> True
'ab' -> True
'accc' -> True
'abcbc' -> True
'ad' -> False
'ba' -> False

a(b|c)* 要求第一个字符必须是 a,后面只能出现零个或多个 b / c。所以 aabacccabcbc 被接受;空字符串、adba 被拒绝。

Java 程序员的迁移表

Python 写法 Java 对照 作用
Pattern AST 基类 / sealed interface 正则节点抽象
Literal record Literal(char ch) 字面量节点
Concatenate record Concat(Pattern left, Pattern right) 顺序组合
Choose record Choice(Pattern left, Pattern right) 分支组合
Repeat record Repeat(Pattern inner) 零次或多次
to_nfa_design() compile() 从结构生成可执行模型
NFADesign.accepts 预编译 matcher 复用编译结果执行匹配

Java 的 Pattern.compile() 不等于本文这段玩具代码。真实实现要处理字符类、量词优先级、捕获组、Unicode、边界匹配、回溯策略和错误诊断。本文只保留一个工程直觉:正则文本不会直接“魔法执行”,它会先变成某种内部结构,再由匹配器执行。

如果只看传统正则核心,NFA 是一个很自然的中间模型。| 变成空转移分叉,连接变成空转移拼接,* 变成带回环的空转移结构。复杂模式由小机器组合出来。

从编译角度看 Pattern.compile

日常代码里直接写:

1
2
3
Pattern pattern = Pattern.compile("a(b|c)*");
Matcher matcher = pattern.matcher(input);
boolean matched = matcher.matches();

这个 API 暴露了两个阶段:编译和匹配。编译阶段把模式字符串变成内部结构;匹配阶段把这个结构应用到输入字符串。频繁匹配同一个 pattern 时,复用 Pattern 对象通常比每次重新编译更合理。

这个分层不只存在于正则表达式里。SQL 预编译、模板预编译、规则引擎预加载、表达式引擎缓存,都遵循同一个模式。

模式提炼:预编译可执行结构

source -> compiled_model; compiled_model.run(input)

当文本规则会被反复执行时,先把文本翻译成可执行结构,再复用这个结构。

场景 source compiled model run input
正则 pattern 字符串 Pattern / NFA 待匹配字符串
SQL SQL 文本 查询计划 参数和表数据
模板 模板文本 渲染函数 渲染上下文
规则引擎 规则配置 规则图 / 决策树 事实集合

这个模式的收益来自两个地方:解析成本被摊薄,结构化模型也更容易检查、优化和缓存。

练习

第一,增加 Empty 节点,让它匹配空字符串。提示:起始状态本身就是接受状态。

第二,用现有节点手工构造 a|bc,测试 abcbabc 的结果。

第三,给 Repeat(Literal("a")) 写一组测试,确认它接受空字符串、aaaa,拒绝 b

第四,把 new_state() 的递增整数换成 UUID 或带前缀的字符串,观察规则表是否仍然能正常工作。

参考资料