Character-by-character string parsing library.
Features
- line number tracking (can be disabled for performance)
- supports CR, LF and CRLF line endings
- verbose exceptions
- many methods to navigate and operate the parser
- forward / backward peeking and seeking
- forward / backward character consumption
- state stack
- character types
- expectations
Requirements
- PHP 7.1+
Create a new parser instance with string input.
The parser begins at the first character.
<?php
use Kuria\Parser\Parser;
$input = 'foo bar baz';
$parser = new Parser($input);
The parser has several public properties that can be used to inspect its current state:
- $parser->i - current position
- $parser->char - current character (or NULL at the end of input)
- $parser->lastChar - last character (or NULL at the start of input)
- $parser->line - current line (or NULL if line tracking is disabled)
- $parser->end - end of input indicator (TRUE at the end, FALSE otherwise)
- $parser->vars - user-defined variables attached to the current state
Warning
All of the public properties (with the exception of $parser->vars) are read-only and must not be modified directly by the calling code.
Use the built-in parser methods to mutate the parser state. See Parser method overview.
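The public properties described above can be read at any point while the parser advances. The following is a minimal sketch that records them at every position of a short input. It assumes the library was installed with Composer (the autoloader path is an assumption), and the $trace variable is introduced only for illustration.

```php
<?php
// Assumption: the library is installed via Composer; adjust the autoloader path as needed.
require __DIR__ . '/vendor/autoload.php';

use Kuria\Parser\Parser;

$parser = new Parser("ab\ncd");

// record a snapshot of the public properties at every position
$trace = [];

while (!$parser->end) {
    $trace[] = [
        'i' => $parser->i,
        'char' => $parser->char,
        'lastChar' => $parser->lastChar,
        'line' => $parser->line,
    ];

    // move to the next character using a parser method
    // (the properties themselves are only read, never written)
    $parser->eat();
}

foreach ($trace as $step) {
    printf(
        "i=%d char=%s lastChar=%s line=%s\n",
        $step['i'],
        json_encode($step['char']),
        json_encode($step['lastChar']),
        json_encode($step['line'])
    );
}

// at the end of input, $parser->end is TRUE and $parser->char is NULL
var_dump($parser->end, $parser->char);
```

The loop only reads the properties and changes the position through eat(), which is the pattern the warning above asks for.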
Parser method overview
Refer to doc comments of the respective methods for more information.
Also see Character types.
- getCharType($char): int - determine character type
- getCharTypeName($charType): string - get human-readable character type name
- getInput(): string - get the input string
- setInput($input): void - replace the input string (this also resets the parser)
- getLength(): int - get length of the input string
- isTrackingLineNumbers(): bool - see if line number tracking is enabled
- type(): int - get type of the current character
- is(...$types): bool - check whether the current character is of one of the specified types
- atNewline(): bool - see if the parser is at the start of a newline sequence
- eat(): ?string - go to the next character and return the current one (returns NULL at the end)
- spit(): ?string - go to the previous character and return the current one (returns NULL at the beginning)
- shift(): ?string - go to the next character and return it (returns NULL at the end)
- unshift(): ?string - go to the previous character and return it (returns NULL at the beginning)
- peek($offset, $absolute = false): ?string - get character at the given offset or absolute position (does not affect state)
- seek($offset, $absolute = false): void - alter current position
- reset(): void - reset states, vars and rewind to the beginning
- rewind(): void - rewind to the beginning
- eatChar($char): ?string - consume specific character and return the next character
- tryEatChar($char): bool - attempt to consume specific character and return success state
- eatType($type): string - consume all characters of the specified type
- eatTypes($typeMap): string - consume all characters of the specified types
- eatWs(): string - consume whitespace, if any
- eatUntil($delimiterMap, $skipDelimiter = true, $allowEnd = false): string - consume all characters until the specified delimiters
- eatUntilEol($skip = true): string - consume all characters until end of line or input
- eatEol(): string - consume end of line sequence
- eatRest(): string - consume remaining characters
- getChunk($start, $end): string - get chunk of the input (does not affect state)
- detectEol(): ?string - find and return the next end of line sequence (does not affect state)
- countStates(): int - get number of stored states
- pushState(): void - store the current state
- revertState(): void - revert to the last stored state and pop it
- popState(): void - pop the last stored state without reverting to it
- clearStates(): void - throw away all stored states
- expectEnd(): void - ensure that the parser is at the end
- expectNotEnd(): void - ensure that the parser is not at the end
- expectChar($expectedChar): void - ensure that the current character matches the expectation
- expectCharType($expectedType): void - ensure that the current character is of the given type
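Several of the methods listed above are typically combined. The sketch below shows two common patterns: splitting input into tokens with eatWs(), is(), eatType() and eat(), and speculative parsing with pushState(), popState() and revertState(). The functions tokenize() and tryParsePixels() are hypothetical helpers written for this illustration and are not part of the library. The Composer autoloader path is also an assumption.

```php
<?php
// Assumption: the library is installed via Composer; adjust the autoloader path as needed.
require __DIR__ . '/vendor/autoload.php';

use Kuria\Parser\Parser;

/**
 * Split the input into [kind, text] pairs (hypothetical helper)
 */
function tokenize(string $input): array
{
    $parser = new Parser($input);
    $tokens = [];

    while (!$parser->end) {
        // skip whitespace between tokens
        $parser->eatWs();
        if ($parser->end) {
            break;
        }

        if ($parser->is(Parser::C_STR)) {
            // a run of string characters
            $tokens[] = ['word', $parser->eatType(Parser::C_STR)];
        } elseif ($parser->is(Parser::C_NUM)) {
            // a run of digits
            $tokens[] = ['number', $parser->eatType(Parser::C_NUM)];
        } else {
            // anything else is taken one character at a time
            $tokens[] = ['symbol', $parser->eat()];
        }
    }

    return $tokens;
}

/**
 * Try to read "<digits>px" at the current position (hypothetical helper)
 *
 * Returns NULL and restores the original position if the input does not match.
 */
function tryParsePixels(Parser $parser): ?int
{
    // remember where we started
    $parser->pushState();

    $digits = $parser->eatType(Parser::C_NUM);
    $unit = $parser->eatType(Parser::C_STR);

    if ($digits !== '' && $unit === 'px') {
        // success: drop the stored state and keep the new position
        $parser->popState();

        return (int) $digits;
    }

    // failure: go back to the stored state
    $parser->revertState();

    return null;
}

print_r(tokenize('width = 150 px'));

$parser = new Parser('150em');
var_dump(tryParsePixels($parser));
var_dump($parser->i);
```

Pairing every pushState() with exactly one popState() or revertState() keeps the state stack from growing while the parser backtracks.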
<?php
use Kuria\Parser\Parser;

/**
 * INI parser (example)
 */
class IniParser
{
    /**
     * Parse an INI string
     */
    public function parse(string $string): array
    {
        // create parser
        $parser = new Parser($string);

        // prepare variables
        $data = [];
        $currentSection = null;

        // parse
        while (!$parser->end) {
            // skip whitespace
            $parser->eatWs();
            if ($parser->end) {
                break;
            }

            // parse the current thing
            if ($parser->char === '[') {
                // a section
                $currentSection = $this->parseSection($parser);
            } elseif ($parser->char === ';') {
                // a comment
                $this->skipComment($parser);
            } else {
                // a key=value pair
                [$key, $value] = $this->parseKeyValue($parser);

                // add to output
                if ($currentSection === null) {
                    $data[$key] = $value;
                } else {
                    $data[$currentSection][$key] = $value;
                }
            }
        }

        return $data;
    }

    /**
     * Parse a section and return its name
     */
    private function parseSection(Parser $parser): string
    {
        // we should be at the [ character now, eat it
        $parser->eatChar('[');

        // eat everything until ]
        $sectionName = $parser->eatUntil(']');

        return $sectionName;
    }

    /**
     * Skip a commented-out line
     */
    private function skipComment(Parser $parser): void
    {
        // we should be at the ; character now, eat it
        $parser->eatChar(';');

        // eat everything until the end of line
        $parser->eatUntilEol();
    }

    /**
     * Parse a key=value pair
     */
    private function parseKeyValue(Parser $parser): array
    {
        // we should be at the first character of the key
        // eat characters until = is found
        $key = $parser->eatUntil('=');

        // eat everything until the end of line
        // that is our value
        $value = trim($parser->eatUntilEol());

        return [$key, $value];
    }
}
<?php
$iniParser = new IniParser();
$iniString = <<<INI
; An example comment
name=Foo
type=Bar
[options]
size=150x100
onload=
INI;
$data = $iniParser->parse($iniString);
print_r($data);
Output:
Array
(
    [name] => Foo
    [type] => Bar
    [options] => Array
        (
            [size] => 150x100
            [onload] => 
        )

)
Character types
The table below lists the default character types.
These types are available as constants on the Parser class:
- Parser::C_NONE - no character (NULL)
- Parser::C_WS - whitespace (tab, linefeed, vertical tab, form feed, carriage return and space)
- Parser::C_NUM - numeric character (0-9)
- Parser::C_STR - string character (a-z, A-Z, _ and any 8-bit char)
- Parser::C_CTRL - control character (ASCII 127 and ASCII < 32 except whitespace)
- Parser::C_SPECIAL - !"#$%&'()*+,-./:;<=>?@[\]^`{|}~
# | Character | Type |
---|---|---|
NULL | none | C_NONE |
0 | 0x00 | C_CTRL |
1 | 0x01 | C_CTRL |
2 | 0x02 | C_CTRL |
3 | 0x03 | C_CTRL |
4 | 0x04 | C_CTRL |
5 | 0x05 | C_CTRL |
6 | 0x06 | C_CTRL |
7 | 0x07 | C_CTRL |
8 | 0x08 | C_CTRL |
9 | \t | C_WS |
10 | \n | C_WS |
11 | \v | C_WS |
12 | \f | C_WS |
13 | \r | C_WS |
14 | 0x0e | C_CTRL |
15 | 0x0f | C_CTRL |
16 | 0x10 | C_CTRL |
17 | 0x11 | C_CTRL |
18 | 0x12 | C_CTRL |
19 | 0x13 | C_CTRL |
20 | 0x14 | C_CTRL |
21 | 0x15 | C_CTRL |
22 | 0x16 | C_CTRL |
23 | 0x17 | C_CTRL |
24 | 0x18 | C_CTRL |
25 | 0x19 | C_CTRL |
26 | 0x1a | C_CTRL |
27 | 0x1b | C_CTRL |
28 | 0x1c | C_CTRL |
29 | 0x1d | C_CTRL |
30 | 0x1e | C_CTRL |
31 | 0x1f | C_CTRL |
32 | 0x20 | C_WS |
33 | ! | C_SPECIAL |
34 | " | C_SPECIAL |
35 | # | C_SPECIAL |
36 | $ | C_SPECIAL |
37 | % | C_SPECIAL |
38 | & | C_SPECIAL |
39 | ' | C_SPECIAL |
40 | ( | C_SPECIAL |
41 | ) | C_SPECIAL |
42 | * | C_SPECIAL |
43 | + | C_SPECIAL |
44 | , | C_SPECIAL |
45 | - | C_SPECIAL |
46 | . | C_SPECIAL |
47 | / | C_SPECIAL |
48 | 0 | C_NUM |
49 | 1 | C_NUM |
50 | 2 | C_NUM |
51 | 3 | C_NUM |
52 | 4 | C_NUM |
53 | 5 | C_NUM |
54 | 6 | C_NUM |
55 | 7 | C_NUM |
56 | 8 | C_NUM |
57 | 9 | C_NUM |
58 | : | C_SPECIAL |
59 | ; | C_SPECIAL |
60 | < | C_SPECIAL |
61 | = | C_SPECIAL |
62 | > | C_SPECIAL |
63 | ? | C_SPECIAL |
64 | @ | C_SPECIAL |
65 | A | C_STR |
66 | B | C_STR |
67 | C | C_STR |
68 | D | C_STR |
69 | E | C_STR |
70 | F | C_STR |
71 | G | C_STR |
72 | H | C_STR |
73 | I | C_STR |
74 | J | C_STR |
75 | K | C_STR |
76 | L | C_STR |
77 | M | C_STR |
78 | N | C_STR |
79 | O | C_STR |
80 | P | C_STR |
81 | Q | C_STR |
82 | R | C_STR |
83 | S | C_STR |
84 | T | C_STR |
85 | U | C_STR |
86 | V | C_STR |
87 | W | C_STR |
88 | X | C_STR |
89 | Y | C_STR |
90 | Z | C_STR |
91 | [ | C_SPECIAL |
92 | \ | C_SPECIAL |
93 | ] | C_SPECIAL |
94 | ^ | C_SPECIAL |
95 | _ | C_STR |
96 | ` | C_SPECIAL |
97 | a | C_STR |
98 | b | C_STR |
99 | c | C_STR |
100 | d | C_STR |
101 | e | C_STR |
102 | f | C_STR |
103 | g | C_STR |
104 | h | C_STR |
105 | i | C_STR |
106 | j | C_STR |
107 | k | C_STR |
108 | l | C_STR |
109 | m | C_STR |
110 | n | C_STR |
111 | o | C_STR |
112 | p | C_STR |
113 | q | C_STR |
114 | r | C_STR |
115 | s | C_STR |
116 | t | C_STR |
117 | u | C_STR |
118 | v | C_STR |
119 | w | C_STR |
120 | x | C_STR |
121 | y | C_STR |
122 | z | C_STR |
123 | { | C_SPECIAL |
124 | \| | C_SPECIAL |
125 | } | C_SPECIAL |
126 | ~ | C_SPECIAL |
127 | 0x7f | C_CTRL |
128 | 0x80 | C_STR |
129 | 0x81 | C_STR |
130 | 0x82 | C_STR |
131 | 0x83 | C_STR |
132 | 0x84 | C_STR |
133 | 0x85 | C_STR |
134 | 0x86 | C_STR |
135 | 0x87 | C_STR |
136 | 0x88 | C_STR |
137 | 0x89 | C_STR |
138 | 0x8a | C_STR |
139 | 0x8b | C_STR |
140 | 0x8c | C_STR |
141 | 0x8d | C_STR |
142 | 0x8e | C_STR |
143 | 0x8f | C_STR |
144 | 0x90 | C_STR |
145 | 0x91 | C_STR |
146 | 0x92 | C_STR |
147 | 0x93 | C_STR |
148 | 0x94 | C_STR |
149 | 0x95 | C_STR |
150 | 0x96 | C_STR |
151 | 0x97 | C_STR |
152 | 0x98 | C_STR |
153 | 0x99 | C_STR |
154 | 0x9a | C_STR |
155 | 0x9b | C_STR |
156 | 0x9c | C_STR |
157 | 0x9d | C_STR |
158 | 0x9e | C_STR |
159 | 0x9f | C_STR |
160 | 0xa0 | C_STR |
161 | 0xa1 | C_STR |
162 | 0xa2 | C_STR |
163 | 0xa3 | C_STR |
164 | 0xa4 | C_STR |
165 | 0xa5 | C_STR |
166 | 0xa6 | C_STR |
167 | 0xa7 | C_STR |
168 | 0xa8 | C_STR |
169 | 0xa9 | C_STR |
170 | 0xaa | C_STR |
171 | 0xab | C_STR |
172 | 0xac | C_STR |
173 | 0xad | C_STR |
174 | 0xae | C_STR |
175 | 0xaf | C_STR |
176 | 0xb0 | C_STR |
177 | 0xb1 | C_STR |
178 | 0xb2 | C_STR |
179 | 0xb3 | C_STR |
180 | 0xb4 | C_STR |
181 | 0xb5 | C_STR |
182 | 0xb6 | C_STR |
183 | 0xb7 | C_STR |
184 | 0xb8 | C_STR |
185 | 0xb9 | C_STR |
186 | 0xba | C_STR |
187 | 0xbb | C_STR |
188 | 0xbc | C_STR |
189 | 0xbd | C_STR |
190 | 0xbe | C_STR |
191 | 0xbf | C_STR |
192 | 0xc0 | C_STR |
193 | 0xc1 | C_STR |
194 | 0xc2 | C_STR |
195 | 0xc3 | C_STR |
196 | 0xc4 | C_STR |
197 | 0xc5 | C_STR |
198 | 0xc6 | C_STR |
199 | 0xc7 | C_STR |
200 | 0xc8 | C_STR |
201 | 0xc9 | C_STR |
202 | 0xca | C_STR |
203 | 0xcb | C_STR |
204 | 0xcc | C_STR |
205 | 0xcd | C_STR |
206 | 0xce | C_STR |
207 | 0xcf | C_STR |
208 | 0xd0 | C_STR |
209 | 0xd1 | C_STR |
210 | 0xd2 | C_STR |
211 | 0xd3 | C_STR |
212 | 0xd4 | C_STR |
213 | 0xd5 | C_STR |
214 | 0xd6 | C_STR |
215 | 0xd7 | C_STR |
216 | 0xd8 | C_STR |
217 | 0xd9 | C_STR |
218 | 0xda | C_STR |
219 | 0xdb | C_STR |
220 | 0xdc | C_STR |
221 | 0xdd | C_STR |
222 | 0xde | C_STR |
223 | 0xdf | C_STR |
224 | 0xe0 | C_STR |
225 | 0xe1 | C_STR |
226 | 0xe2 | C_STR |
227 | 0xe3 | C_STR |
228 | 0xe4 | C_STR |
229 | 0xe5 | C_STR |
230 | 0xe6 | C_STR |
231 | 0xe7 | C_STR |
232 | 0xe8 | C_STR |
233 | 0xe9 | C_STR |
234 | 0xea | C_STR |
235 | 0xeb | C_STR |
236 | 0xec | C_STR |
237 | 0xed | C_STR |
238 | 0xee | C_STR |
239 | 0xef | C_STR |
240 | 0xf0 | C_STR |
241 | 0xf1 | C_STR |
242 | 0xf2 | C_STR |
243 | 0xf3 | C_STR |
244 | 0xf4 | C_STR |
245 | 0xf5 | C_STR |
246 | 0xf6 | C_STR |
247 | 0xf7 | C_STR |
248 | 0xf8 | C_STR |
249 | 0xf9 | C_STR |
250 | 0xfa | C_STR |
251 | 0xfb | C_STR |
252 | 0xfc | C_STR |
253 | 0xfd | C_STR |
254 | 0xfe | C_STR |
255 | 0xff | C_STR |
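The mapping in the table above can also be queried at runtime. The following sketch looks up a handful of characters with getCharType() and prints their names with getCharTypeName(). Both methods are called on a parser instance here, the sample characters are arbitrary, and the Composer autoloader path is an assumption.

```php
<?php
// Assumption: the library is installed via Composer; adjust the autoloader path as needed.
require __DIR__ . '/vendor/autoload.php';

use Kuria\Parser\Parser;

$parser = new Parser('x');

// a few characters picked from different rows of the table
$samples = ['a', '7', ' ', '-', "\x00", "\xff"];
$types = [];

foreach ($samples as $char) {
    $type = $parser->getCharType($char);
    $types[] = $type;

    // print the byte value and the human-readable type name
    printf("0x%02x => %s\n", ord($char), $parser->getCharTypeName($type));
}

// type() applies the same lookup to the current character ("x" here)
var_dump($parser->type() === Parser::C_STR);

// after consuming the only character the parser is at the end,
// where the current character is NULL and its type is C_NONE
$parser->eat();
var_dump($parser->type() === Parser::C_NONE);
```

Comparing against the Parser::C_* constants, rather than their integer values, keeps the code readable and independent of how the constants are numbered.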
Character types can be customized by extending the base Parser class.
The following example changes "-" and "." from C_SPECIAL to C_STR and inherits everything else.
<?php
use Kuria\Parser\Parser;

class CustomParser extends Parser
{
    const CHAR_TYPE_MAP = [
        '-' => self::C_STR,
        '.' => self::C_STR,
    ] + parent::CHAR_TYPE_MAP; // inherit everything else
}

// usage example
$parser = new CustomParser('foo-bar.baz');
var_dump($parser->eatType(CustomParser::C_STR));
Output:
string(11) "foo-bar.baz"