<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Pickle Tokenizer</title>
<meta charset="utf-8" />
<meta name="generator" content="Pelican" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" href="/static/css/main.css" />
<link rel="stylesheet" href="/static/css/theme.css" />
<link rel="icon" href="/images/yazani/yazani_1_extracted_bg_big_eyes_cropped.png" type="image/png" />
<link rel="apple-touch-icon" href="/images/yazani/yazani_1_extracted_bg_big_eyes_cropped.png" type="image/png" />
<script src="/static/misc.js"></script>
<script src="/blog/banner_image.js"></script>
<meta name="tags" content="programming, javascript, language-design" />
<meta property="og:site_name" content="dragoncoder047’s blog" />
<meta property="og:title" content="Pickle Tokenizer" />
<meta property="og:description" content="I’m starting to work on my Pickle programming language, this time in Javascript. After only a few days’ work, I’m surprised I got so much working. Currently I have both the tokenizer and the inheritance system working. The syntax of Pickle is pretty much in place now, and …" />
<meta property="og:image" content="/images/yazani/yazani_1_extracted_bg.png" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://dragoncoder047.github.io/blog/2023/pickle-tokenizer" />
<meta property="og:locale" content="en_US" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:title" content="dragoncoder047’s blog - Pickle Tokenizer" />
<meta name="twitter:description" content="I’m starting to work on my Pickle programming language, this time in Javascript. After only a few days’ work, I’m surprised I got so much working. Currently I have both the tokenizer and the inheritance system working. The syntax of Pickle is pretty much in place now, and …" />
<meta name="twitter:image" content="/images/yazani/yazani_1_extracted_bg.png" />
<!-- PrismJS -->
<script src="/static/prism.js" data-autoloader-path="https://cdn.jsdelivr.net/npm/prismjs@v1.x/components/"></script>
<script src="/static/prism-runbutton.js"></script>
<script src="/phoo/prism-phoo.js"></script> <!-- /PrismJS -->
<!-- Katex -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/katex.css" type="text/css" />
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/katex.js"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/contrib/mhchem.js"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/contrib/auto-render.js"></script>
<link href="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/contrib/copy-tex.css" rel="stylesheet" type="text/css" />
<script src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/contrib/copy-tex.js"></script>
<script>
window.addEventListener("DOMContentLoaded", () => {
renderMathInElement(document.body,
{
delimiters: [
{ left: "$$", right: "$$", display: true },
{ left: "$", right: "$", display: false },
{ left: "\\begin{align}", right: "\\end{align}", display: true },
]
}
);
});
</script> <!-- /Katex -->
</head>
<body class="match-braces rainbow-braces">
<header>
<a href="https://dragoncoder047.github.io/blog" class="flex-row"><div class="flex-row"><img src="/images/yazani/yazani_1_extracted_bg.png" style="max-height:10em" id="banner-image" /><div id="sitename-text"><h1>dragoncoder047’s blog</h1><h2>random thoughts about nonrandom things</h2></div></div></a>
<nav>
<ul>
<li><a href="https://dragoncoder047.github.io/blog/">Home</a></li>
<li><a href="https://dragoncoder047.github.io/blog/archives">Archives</a>
</li>
<li><a href="https://dragoncoder047.github.io/blog/tags">By tag</a>
</li>
<li><a href="/">Site root</a>
</li>
<li><a href="#">Projects</a>
<ul>
<li><a href="https://dragoncoder047.github.io/thuepaste">Thuepaste</a>
</li>
<li><a href="https://dragoncoder047.github.io/armdroid">Armdroid</a>
</li>
<li><a href="https://dragoncoder047.github.io/langton-music">Langton's Ant Music</a>
</li>
<li><a href="https://dragoncoder047.github.io/schemascii">Schemascii</a>
</li>
<li><a href="https://dragoncoder047.github.io/parasite">Parasite</a>
</li>
</ul>
</li>
<li>
<form action="https://www.google.com/search" method="GET">
<input name="q" type="search" placeholder="Search" />
<input type="hidden" name="as_sitesearch" value="https://dragoncoder047.github.io/blog" />
<input type="submit" value="Search" />
</form>
</li>
</ul>
</nav>
</header>
<main>
<h1><a href="https://dragoncoder047.github.io/blog/2023/pickle-tokenizer" rel="bookmark" title="Permalink to this page">Pickle Tokenizer</a></h1>
<div class="flex-row">
<span style="flex: 1">← Previous:
<a href="https://dragoncoder047.github.io/blog/2023/phooey-phooey-phooey">
Phooey! Phooey! Phooey!
</a>
</span>
<span>Next:
<a href="https://dragoncoder047.github.io/blog/2023/yet-another-garbage-collector">
Yet Another Garbage Collector
</a> →
</span>
</div>
<div class="post-info">
Posted <time class="published" datetime="2023-04-20T00:00:00-04:00">Thu 20 April 2023</time>
<br /> Modified <time class="modified" datetime="2023-04-21T00:00:00-04:00">Fri 21 April 2023</time>
<address>By
<a href="https://dragoncoder047.github.io/blog/">dragoncoder047</a>
</address>
<div class="tags">
Tags:
<a href="https://dragoncoder047.github.io/blog/tag/javascript">javascript</a>
<a href="https://dragoncoder047.github.io/blog/tag/language-design">language-design</a>
<a href="https://dragoncoder047.github.io/blog/tag/programming">programming</a>
</div>
</div>
<p><em>This post is part 3 of the pickle series:</em></p>
<ol class="series">
<li >
<a href='https://dragoncoder047.github.io/blog/2023/pickles'>Pickles!</a>
</li>
<li >
<a href='https://dragoncoder047.github.io/blog/2023/manual-memory-management-madness'>Manual Memory Management Madness</a>
</li>
<li class="active">
<a href='https://dragoncoder047.github.io/blog/2023/pickle-tokenizer'>Pickle Tokenizer</a>
</li>
<li >
<a href='https://dragoncoder047.github.io/blog/2023/yet-another-garbage-collector'>Yet Another Garbage Collector</a>
</li>
<li >
<a href='https://dragoncoder047.github.io/blog/2023/powerful-pickle-pattern-matching'>Powerful PICKLE Pattern Matching</a>
</li>
<li >
<a href='https://dragoncoder047.github.io/blog/2023/pickle-has-regular-expressions-apparently'>PICKLE Has Regular Expressions, Apparently</a>
</li>
<li >
<a href='https://dragoncoder047.github.io/blog/2023/its-september'>It's September!!</a>
</li>
<li >
<a href='https://dragoncoder047.github.io/blog/2023/continuations-and-the-thunk-queue'>Continuations and the thunk queue</a>
</li>
<li >
<a href='https://dragoncoder047.github.io/blog/2024/the-lesser-of-two-evils'>The Lesser of Two Evils</a>
</li>
<li >
<a href='https://dragoncoder047.github.io/blog/2024/a-hash-mapped-mess'>A Hash-Mapped Mess</a>
</li>
</ol>
<p>I’m starting to work on my Pickle programming language, this time in Javascript. After only a few days’ work, I’m surprised by how much I got working: both the tokenizer and the inheritance system are functional. The syntax of Pickle is pretty much in place now, and I just have a few tweaks left to make to the tokenizer, plus hooking it up to a parser, before I can write the evaluator.</p>
<h2 id="tokenizer">Tokenizer</h2>
<p>In contrast to my previous attempt in C, this time I actually wrote a tokenizer. In all honesty, though, the tokenizer really does 90% of the parsing – it just doesn’t recursively build up a tree of expressions using the parens (that will be done by the parser).</p>
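<p>As a rough sketch of what that tree-building step might look like (the parser isn’t written yet, so <code>nestParens</code> and the simplified token objects here are assumptions, not the real design), the paren tokens can drive a stack that folds the flat stream into nested arrays:</p>
<pre class="highlight"><code class="language-js">// Hypothetical parser step: fold a flat token list into nested arrays.
// Tokens are simplified to { type, content }; this is not Pickle's real parser.
function nestParens(tokens) {
    const root = [];
    const stack = [root]; // innermost open group is always stack.at(-1)
    for (const tok of tokens) {
        if (tok.type === "paren" && "([".includes(tok.content)) {
            // Opening paren: start a new group inside the current one.
            const group = [];
            stack.at(-1).push(group);
            stack.push(group);
        } else if (tok.type === "paren") {
            // Closing paren: finish the current group.
            if (stack.length === 1) throw new SyntaxError(`unmatched ${tok.content}`);
            stack.pop();
        } else {
            stack.at(-1).push(tok.content);
        }
    }
    if (stack.length !== 1) throw new SyntaxError("unclosed paren");
    return root;
}

// Simplified tokens for: print (fib 10)
const tree = nestParens([
    { type: "symbol", content: "print" },
    { type: "paren", content: "(" },
    { type: "symbol", content: "fib" },
    { type: "number", content: "10" },
    { type: "paren", content: ")" },
]);
// tree is ["print", ["fib", "10"]]</code></pre>
<p>This sketch doesn’t check that a <code>(</code> is closed by a <code>)</code> rather than a <code>]</code>; a real parser would presumably want to.</p>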
<p>I got the tokenizer working in a little demo page I dubbed the “seeder” for some reason. Currently it is able to tokenize this code:</p>
<pre id="picklecode">
defun fib(x):
    if $x == 0 or $x == 1:
        return 1
    else:
        return (fib $x - 1) + (fib $x - 2)
print (fib 10)
</pre>
<p>into this stream of tokens:</p>
<pre id="outputtokens">
</pre>
<p>[EDIT Apr. 21: The above is now a live editor using the tokenizer as it was when I wrote this post – write away and the tokens will update automatically. The tokenization may have changed since then.]</p>
<p>The tokenizer is also a bit unusual in that it can recover from a syntax error and keep scanning, which lets you see and fix multiple syntax errors at once. Conveniently, the <a href="https://ace.c9.io">Ace.js</a> code editor lets you place annotation markers in the gutter, which is how the “seeder” displays those errors.</p>
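<p>The recovery idea can be sketched in a few lines. This is a hypothetical mini-tokenizer, not the Pickle one: the <code>tokenizeAll</code> function and its three rules are made up for illustration. The point is only that an unrecognized character becomes an <code>error</code> token and scanning carries on, rather than throwing at the first problem.</p>
<pre class="highlight"><code class="language-js">// Hypothetical mini-tokenizer (not Pickle's): it only knows numbers, words, and spaces.
function tokenizeAll(source) {
    const rules = [
        { type: "number", re: /^[0-9]+/ },
        { type: "symbol", re: /^[a-z_]+/i },
        { type: "space", re: /^\s+/ },
    ];
    const tokens = [];
    let i = 0;
    while (source.length > i) {
        const rest = source.slice(i);
        const rule = rules.find(r => r.re.test(rest));
        if (rule) {
            const text = rule.re.exec(rest)[0];
            // Whitespace is consumed but not reported.
            if (rule.type !== "space") tokens.push({ type: rule.type, content: text, at: i });
            i += text.length;
        } else {
            // Recovery: report one bad character, skip it, and keep scanning.
            tokens.push({ type: "error", content: rest[0], at: i, message: `unexpected ${rest[0]}` });
            i += 1;
        }
    }
    return tokens;
}

const result = tokenizeAll("foo ? 12 ! bar");
const errors = result.filter(t => t.type === "error");</code></pre>
<p>Here both bad characters come back as <code>error</code> tokens, at offsets 4 and 9, alongside the three good tokens.</p>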
<p>The only bug I can see in this example is that the colon-block string consumes the newline at the end of the block, so the <code>print</code> is considered to be on the same logical line as the block, when it shouldn’t be. A simple <code>;</code> before the <code>print</code> would fix that. But I feel that an indented block where the unindent ends both the block and the line will be far more common than one where the line continues after the block. In other words, making continuation the default and requiring punctuation to end the line would cause more “punctuation overload” than making line-ending the default and requiring a special punctuation character to continue it. Unfortunately, the former appears to be what is implemented in the Javascript tokenizer.</p>
<h2 id="inheritance">Inheritance</h2>
<p>In <a href="https://dragoncoder047.github.io/blog/2023/pickles">my earlier post about Pickle</a>, I mentioned that Pickle would have a multiprototype-based inheritance system, a strange mix of Python and Javascript. Python supports multiple inheritance, but chokes on “ambiguous” inheritance trees, while Javascript only supports single inheritance through prototypes. But I think I’ve found a simple solution that implements multiprototype-based inheritance. Here’s a pared-down example:</p>
<pre class="highlight"><code class="language-js">class PickleObject {
    constructor(name, ...prototypes) {
        this.name = name;
        this.prototypes = prototypes;
    }
    toJSON() {
        // Serialize as just the name so the JSON.stringify() output stays readable.
        return this.name;
    }
    getMRO() {
        // Depth-first walk: each object, followed by the MRO of each of its prototypes in order.
        var fun = x => [x].concat(x.prototypes.map(fun));
        // flat(Infinity) collapses the nested arrays into one list; repeats are kept.
        return fun(this).flat(Infinity);
    }
}
var A = new PickleObject("A");
var B = new PickleObject("B");
var X = new PickleObject("X", A, B);
var Y = new PickleObject("Y", B, A);
var Crash = new PickleObject("Crash", X, Y);
alert(JSON.stringify(Crash.getMRO()));
// -> ["Crash","X","A","B","Y","B","A"]</code></pre>
<p>This is exactly the same example that I posted earlier that Python can’t handle – but here, <code class="language-js highlight">Crash.getMRO()</code> simply returns a flat array (repeats and all) that can be searched linearly. I’m not sure how fast this is, but I do have some optimization tricks that I could apply.</p>
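<p>To make “searched linearly” concrete, here is a minimal sketch of a property lookup over that flat array. The <code>ProtoObject</code> class, its <code>props</code> field, and the <code>lookup</code> method are hypothetical names for illustration, and dropping repeated entries with a <code>Set</code> is just one possible optimization, not necessarily the one Pickle will end up using.</p>
<pre class="highlight"><code class="language-js">// Hypothetical variant of the pared-down class: each object carries its own properties.
class ProtoObject {
    constructor(name, props, ...prototypes) {
        this.name = name;
        this.props = props;
        this.prototypes = prototypes;
    }
    getMRO() {
        const walk = x => [x].concat(x.prototypes.map(walk));
        // A Set keeps only the first occurrence of each object, in insertion order.
        return [...new Set(walk(this).flat(Infinity))];
    }
    lookup(key) {
        // Linear search: the first object in the MRO that owns the key wins.
        const owner = this.getMRO().find(o => Object.prototype.hasOwnProperty.call(o.props, key));
        return owner ? owner.props[key] : undefined;
    }
}

const A = new ProtoObject("A", { greet: "hello from A" });
const B = new ProtoObject("B", { greet: "hello from B" });
const X = new ProtoObject("X", {}, A, B);
const Y = new ProtoObject("Y", {}, B, A);
const Crash = new ProtoObject("Crash", {}, X, Y);

const order = Crash.getMRO().map(o => o.name);
// -> ["Crash","X","A","B","Y"]
const greeting = Crash.lookup("greet");
// -> "hello from A"</code></pre>
<p>Removing the later duplicates doesn’t change what a lookup finds, since the first owner in the list always wins; it only shortens the search.</p>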
<h2 id="whats-next">What’s next?</h2>
<p>I don’t know exactly what comes next, but Pickle is still only half-written. After I write the parser, I’ll need to write the evaluator. The evaluator is going to be extremely complicated and probably very slow, although I do hope it will be somewhat readable thanks to Javascript’s built-in functional programming constructs, which C doesn’t have natively.</p>
<p>Pickle certainly does look like it’s going to be simpler than Phoo, although Phoo did get complicated because I split everything into a zillion different files. One huge file for everything may be a bit much, but having a bazillion files with no more than 100 lines apiece is also a bit much. Aside from the weird operator semantics, I do hope Pickle’s flow will be easier to follow.</p>
<script src="https://cdn.jsdelivr.net/npm/ace-builds@1.10.0/src-noconflict/ace.min.js"></script>
<script>
(function() {
function pickleUnescapeChar(c) {
switch (c) {
case 'b': return '\b';
case 't': return '\t';
case 'n': return '\n';
case 'v': return '\v';
case 'f': return '\f';
case 'r': return '\r';
case 'a': return '\x07'; // Javascript has no '\a' escape, so spell out BEL explicitly
case 'o': return '{';
case 'c': return '}';
case '\n': return '';
default: return c;
}
}
class PickleToken {
constructor(type, content, start, end, filename = "", message = "") {
var types = type.split(".");
this.type = types[0];
this.subtypes = types.slice(1);
this.content = content;
this.start = start;
this.end = end;
this.filename = filename;
this.message = message;
}
toJSON() {
return {
type: this.type,
subtypes: this.subtypes,
content: this.content,
start: this.start,
end: this.end,
filename: this.filename,
message: this.message
};
}
}
class PickleTokenizer {
constructor(string, filename) {
this.string = string;
this.i = 0;
this.beginning = null;
this.bi = 0;
this.filename = filename;
}
lineColumn() {
var before = this.string.slice(0, this.i);
var doneLines = before.split("\n");
var line = doneLines.length;
var col = doneLines.at(-1).length + 1;
return { line, col };
}
test(what) {
if (typeof what === "string") return this.string.slice(this.i).startsWith(what);
else if (what instanceof RegExp) return what.test(this.string.slice(this.i));
else return false;
}
chomp(what) {
if (!this.test(what)) return undefined;
if (typeof what === "string") {
this.i += what.length;
return what;
}
else if (what instanceof RegExp) {
var match = what.exec(this.string.slice(this.i));
this.i += match[0].length;
return match;
}
else return undefined;
}
done() {
return this.i >= this.string.length;
}
peek(i = 0) {
var j = this.i + i;
if (j >= this.string.length) return undefined;
return this.string[j];
}
errorToken(message = "") {
if (this.bi == this.i) this.i++;
return this.makeToken("error", this.string.slice(this.bi, this.i), message || `unexpected ${this.peek(-1)}`);
}
makeToken(type, content, message = "") {
return new PickleToken(type, content, this.beginning, this.lineColumn(), this.filename, message);
}
nextToken() {
if (this.done()) return undefined;
this.beginning = this.lineColumn();
this.bi = this.i;
if (this.test(/^:\s*\n/)) {
var i = this.i;
var lines = [];
this.chomp(/^:\s*\n/);
var indent = this.chomp(/^\s+/);
if (!indent) {
this.i = i;
return this.makeToken("error", this.chomp(/^:\s*\n/)[0], "expected indent after colon");
}
indent = indent[0];
var ensure_same = /^([\t ])\1*$/.exec(indent); // anchor at the end so a tab/space mix is actually rejected
if (!ensure_same) return this.makeToken("error", indent, "mix of tabs and spaces indenting block");
while (true) {
var line = this.chomp(/^[^\n]*/);
lines.push(line[0] || "");
if (!this.chomp("\n")) break;
if (!this.chomp(indent)) {
var b = this.lineColumn();
var bi = this.i;
var badIndent = this.chomp(/^(((?!\n)\s)*)(?=\S)/);
if (badIndent) {
if (badIndent[1].length > 0) {
this.beginning = b;
this.bi = bi;
return this.makeToken("error", badIndent[1], "unexpected unindent");
}
else break;
}
}
}
return this.makeToken("string.block", lines.join("\n"));
}
const TOKEN_REGEXES = [
{ type: "comment.block", re: /^(?<!#)(###+)(?!#)[\s\S\n\r]*?(?<!#)\1(?!#)/, significant: false },
{ type: "comment.line", re: /^#[^\n]*/, significant: false },
{ type: "paren", re: /^[\(\)\[\]]/, significant: true, groupNum: 0 },
{ type: "space", re: /^(?!\n)\s+/, significant: false },
{ type: "eol", re: /^[;\n]/, significant: true, groupNum: 0 },
{ type: "singleton", re: /^(true|false|nil)/, significant: true, groupNum: 0 },
{ type: "number.complex", re: /^-?[0-9]+(\.[0-9]+)?e[+-]\d+[+-][0-9]+(\.[0-9]+)?e[+-]\d+j/, significant: true, groupNum: 0 },
{ type: "number.rational", re: /^-?[0-9]+\/[0-9]+/, significant: true, groupNum: 0 },
{ type: "number.integer", re: /^-?([1-9][0-9]*|0x[0-9a-f]+|0b[01]+)/i, significant: true, groupNum: 0 },
{ type: "number.float", re: /^-?[0-9]+(\.[0-9]+)?(e[+-]\d+)?/i, significant: true, groupNum: 0 },
{ type: "symbol", re: /^[a-z_][a-z0-9_]*\??/i, significant: true, groupNum: 0 },
{ type: "symbol.operator", re: /^[-~`!@$%^&*_+=[\]|\\:<>,.?/]+/, significant: true, groupNum: 0 },
]
for (var { type, re, significant, groupNum } of TOKEN_REGEXES) {
if (this.test(re)) {
var match = this.chomp(re);
if (significant) return this.makeToken(type, match[groupNum]);
else return this.nextToken();
}
}
// Try strings
if (this.test("{")) {
var j = 0, depth = 0, string = "";
do {
var ch = this.peek(j);
if (ch == undefined) return this.errorToken("unclosed {");
if (ch == "{") depth++;
else if (ch == "}") depth--;
string += ch;
j++;
} while (depth > 0);
this.i += j;
return this.makeToken("string.curly", string.slice(1, -1));
}
else if (this.test(/^['"]/)) {
var q = this.chomp(/^['"]/)[0];
var j = 0, string = "";
while (true) {
var ch = this.peek(j);
// newlines must be backslash escaped
if (ch == undefined || ch == "\n") {
this.i += j;
return this.errorToken("unterminated string");
}
else if (ch == "\\") {
ch = pickleUnescapeChar(this.peek(j + 1));
j++;
}
else if (ch == q) break;
string += ch;
j++;
}
this.i += j + 1;
return this.makeToken("string.quote", string);
}
return this.errorToken();
}
}
const SEL = s => document.querySelector(s);
ace.config.set('basePath', 'https://cdn.jsdelivr.net/npm/ace-builds@1.10.0/src-noconflict/');
var editor = ace.edit("picklecode", { maxLines: 20 });
function output(x) {
SEL("#outputtokens").innerHTML += x;
}
function clearOutput() {
SEL("#outputtokens").innerHTML = "";
}
function foobar() {
var tokenizer = new PickleTokenizer(editor.getValue());
var annotations = [];
clearOutput();
try {
while (!tokenizer.done()) {
var oldi = tokenizer.i;
var tok = tokenizer.nextToken();
if (tok) {
if (tok.type == "error") {
annotations.push({
row: tok.start.line - 1,
column: tok.start.col - 1, // Ace columns are 0-based, like rows
text: tok.message + (tok.content ? `: ${tok.content}` : ""),
type: "error",
});
}
output(`[${tok.start.line}:${tok.start.col} - ${tok.end.line}:${tok.end.col}]\t${tok.type} ${tok.subtypes.length > 0 ? "(" + tok.subtypes.join(",") + ")" : ""}\t${JSON.stringify(tok.content)}\t${tok.message}\n`);
}
if (tokenizer.i == oldi) throw new Error("Tokenizer error");
}
} catch (e) {
output(`<span style="color:red">${e}\n${e.stack}</span>`)
console.error(e);
}
editor.getSession().setAnnotations(annotations);
}
editor.getSession().on('change', foobar);
foobar();
// Dark/light theme
const dmmq = window.matchMedia('(prefers-color-scheme: dark)');
function darkLight() {
if (dmmq.matches) editor.setTheme("ace/theme/terminal");
else editor.setTheme("ace/theme/chrome");
}
darkLight();
dmmq.addEventListener("change", darkLight);
})();
</script>
<hr />
<p><strong>Related Posts</strong></p>
<ul>
<li><a href="https://dragoncoder047.github.io/blog/2023/continuations-and-the-thunk-queue">Continuations and the thunk queue</a></li>
<li><a href="https://dragoncoder047.github.io/blog/2025/one-hell-of-a-physics-engine">One Hell of a Physics Engine</a></li>
<li><a href="https://dragoncoder047.github.io/blog/2025/so-far-ahead-yet-so-far-behind">So far ahead, yet so far behind</a></li>
<li><a href="https://dragoncoder047.github.io/blog/2024/boy-have-i-been">Boy, Have I Been...</a></li>
<li><a href="https://dragoncoder047.github.io/blog/2024/pointer-soup">Pointer Soup</a></li>
</ul>
<script src="https://giscus.app/client.js"
data-repo="dragoncoder047/blog"
data-repo-id="R_kgDOHCL60w"
data-category="Post Comments"
data-category-id="DIC_kwDOHCL6084CRxCW"
data-mapping="og:title"
data-reactions-enabled="1"
data-input-position="top"
data-theme="dark"
data-lang="en"
crossorigin="anonymous"
async
></script>
<section id="extras">
<div class="blogroll">
<ul>
<li><a href="https://www.conwaylife.com/">Conwaylife.com Forums</a></li>
<li><a href="https://www.python.org/">Python</a></li>
<li><a href="http://www.ulisp.com/">uLisp</a></li>
</ul>
</div>
<div class="social">
<ul>
<li><a href="https://github.com/dragoncoder047">dragoncoder047 on GitHub</a></li>
<li><a href="https://youtube.com/@dragoncoder047">dragoncoder047 on YouTube</a></li>
<li><a href="https://instagram.com/dragoncoder047/">dragoncoder047 on Instagram</a></li>
</ul>
</div>
</section>
</main>
<footer>
<address>
Site built by <a href="https://getpelican.com/">Pelican</a>
</address>
<a href="#" onclick="window.scrollTo({top: 0, left: 0});">Back to top</a>
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-XR0F89CCGK"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag("js", new Date());
gtag("config", "G-XR0F89CCGK");
</script>
</footer>
</body>
</html>