September 02, 2021
MySQL LRU Flushing and I/O Capacity
InnoDB background LRU list flushing is not limited by innodb_io_capacity or innodb_io_capacity_max.
I’ll prove it in this blog post, but since MySQL experts disagree (or don’t know for sure), I’d like you to prove me wrong.
This is not an intro; you’ll need to know all the InnoDB details wrt page flushing.
August 31, 2021
Optimizing SQL with Query Statistics
August 27, 2021
NoneSQL All the DevEx
August 26, 2021
Writing a simple JSON library from scratch: a tour through modern C++
Modern C++ has a lot of cool features. Move semantics means passing around structs in functions is cheap. std::shared_ptr means I don't have to manage any memory; no more new/delete! (But try as I might to understand std::unique_ptr, I'm just not there yet.)
The syntax has also gotten some treatment with auto and tuple destructuring.
In order to test out this modern C++ I wanted a small but meaningful project that operates on very dynamic data. The two that always come to mind are JSON parsers or Lisp interpreters.
This post walks through writing a basic JSON library from scratch using only the standard library. The source code for the resulting library is available on Github.
The biggest simplification we'll make is that rather than full JSON numbers, we'll only allow integers.
Big caveat! I couldn't be farther from a C++ expert! Email or tweet me as you see mistakes, madness, lies.
API
The two big parts of the API will be about lexing (turning a string into an array of tokens) and parsing (turning an array of tokens into a JSON object-tree). A better implementation would have the lexer take a character stream rather than a string, but taking a string is simpler. So we'll stick with that.
Both of these functions can fail so we'll return a tuple in both cases with a string containing a possibly blank error message.
We will define the header in ./include/json.hpp.
#ifndef JSON_H
#define JSON_H
#include <tuple>
#include <vector>
#include <string>
namespace json {
std::tuple<std::vector<JSONToken>, std::string> lex(std::string);
std::tuple<JSONValue, int, std::string> parse(std::vector<JSONToken>, int index = 0);
} // namespace json
#endif
The token returned by lex will need to contain the token's string value, the location (offset) in the original source, a pointer to the full source (for debugging), and the token's type. The token type itself will be an enum of either string, number, syntax (colon, bracket, etc.), boolean, or null.
...
#include <string>
#include <memory>
namespace json {
enum class JSONTokenType { String, Number, Syntax, Boolean, Null };
struct JSONToken {
std::string value;
JSONTokenType type;
int location;
std::shared_ptr<std::string> full_source;
};
...
} // namespace json
...
This is the only place in the entire code we'll pass around a pointer. Using std::shared_ptr means we don't have to do any manual memory management either. No new or delete.
Next, JSONValue is a struct containing optional string, boolean, number, array, and object fields with a type enum to differentiate.
...
#include <map>
#include <optional>
namespace json {
enum class JSONValueType { String, Number, Object, Array, Boolean, Null };
struct JSONValue {
std::optional<std::string> string;
std::optional<double> number;
std::optional<bool> boolean;
std::optional<std::vector<JSONValue>> array;
std::optional<std::map<std::string, JSONValue>> object;
JSONValueType type;
};
enum class JSONTokenType { String, Number, Syntax, Boolean, Null };
...
Thanks to std::optional we can avoid using pointers to describe these fields. I did take a look at std::variant but it seemed like its API was overly complex.
Finally, we'll add two more functions: a high-level parse function that combines the job of lexing and parsing, and a deparse function for printing a JSONValue as a JSON string.
...
std::tuple<JSONValue, int, std::string> parse(std::vector<JSONToken>, int index = 0);
std::tuple<JSONValue, std::string> parse(std::string);
std::string deparse(JSONValue, std::string whitespace = "");
} // namespace json
...
Now we're ready to start on the implementation.
Lexing
First up is lexing; turning a JSON string into an array of tokens: a number, string, null keyword, boolean keyword, or syntax like comma or colon.
The main lex loop skips whitespace and calls helper functions for each kind of token. If a token is found, we accumulate it and move to the end of that token (some tokens like : are a single character, some tokens like "my great string" are multiple characters).
Each token we find gets a pointer to the original JSON source for use in error messages if parsing fails. Again, this will be the only time we explicitly pass around pointers in this implementation. We don't do any manual management because we're going to use std::shared_ptr.
#include "json.hpp"
namespace json {
std::tuple<std::vector<JSONToken>, std::string> lex(std::string raw_json) {
std::vector<JSONToken> tokens;
// All tokens will embed a pointer to the raw JSON for debugging purposes
auto original_copy = std::make_shared<std::string>(raw_json);
auto generic_lexers = {lex_syntax, lex_string, lex_number, lex_null, lex_true, lex_false};
for (int i = 0; i < raw_json.length(); i++) {
// Skip past whitespace
if (auto new_index = lex_whitespace(raw_json, i); i != new_index) {
i = new_index - 1;
continue;
}
auto found = false;
for (auto lexer : generic_lexers) {
if (auto [token, new_index, error] = lexer(raw_json, i); i != new_index) {
// Error while lexing, return early
if (error.length()) {
return {{}, error};
}
// Store reference to the original source
token.full_source = original_copy;
tokens.push_back(token);
i = new_index - 1;
found = true;
break;
}
}
if (found) {
continue;
}
return {{}, format_error("Unable to lex", raw_json, i)};
}
return {tokens, ""};
}
} // namespace json
Two neat things you'll notice in there are tuple literal syntax ({tokens, ""}) and how easy it is to type a value containing an array of function pointers using auto (generic_lexers).
format_error
Since we referenced format_error, let's define it. It needs to accept a message prefix, the full JSON string, and the index offset where the error should point.
Inside the function we'll iterate over the string until we find the entire line containing this index offset. We'll display that line and a pointer to the character that is causing/starting the error.
...
#include <sstream>
namespace json {
std::string format_error(std::string base, std::string source, int index) {
std::ostringstream s;
int counter = 0;
int line = 1;
int column = 0;
std::string lastline = "";
std::string whitespace = "";
for (auto c : source) {
if (counter == index) {
break;
}
if (c == '\n') {
line++;
column = 0;
lastline = "";
whitespace = "";
} else if (c == '\t') {
column++;
lastline += " ";
whitespace += " ";
} else {
column++;
lastline += c;
whitespace += " ";
}
counter++;
}
// Continue accumulating the lastline for debugging
while (counter < source.size()) {
auto c = source[counter];
if (c == '\n') {
break;
}
lastline += c;
counter++;
}
s << base << " at line " << line << ", column " << column << std::endl;
s << lastline << std::endl;
s << whitespace << "^";
return s.str();
}
...
The printf API is annoying and Clang 12 (the latest Clang on the latest Fedora) doesn't seem to support std::format. So we just use std::ostringstream to do string "formatting".
But ok, back to lexing! Next up: whitespace.
lex_whitespace
This function's job is to skip past whitespace. Thankfully we've got
std::isspace
to help.
int lex_whitespace(std::string raw_json, int index) {
  // Check bounds first, and cast to unsigned char so std::isspace is
  // well-defined for any byte value.
  while (index < (int)raw_json.length() &&
         std::isspace((unsigned char)raw_json[index])) {
    index++;
  }
  return index;
}
It's very simple!
lex_syntax
All of the generic lexers follow the same pattern. They return either a valid token and the index where the token ends, or they return an error string.
Since all the syntax elements in JSON (,, :, {, }, [ and ]) are single characters, we don't need to write a "longest substring" helper function. We simply check if the current character is one of these characters and return a syntax token if so.
std::tuple<JSONToken, int, std::string> lex_syntax(std::string raw_json, int index) {
JSONToken token{"", JSONTokenType::Syntax, index};
std::string value = "";
auto c = raw_json[index];
if (c == '[' || c == ']' || c == '{' || c == '}' || c == ':' || c == ',') {
token.value += c;
index++;
}
return {token, index, ""};
}
lex_string
This one manages state so it's a little more complex. We need to check if the current character is a double quote, then iterate over characters until we find the ending quote.
It's possible to hit EOF here so we need to handle that case. And handling nested quotes is left as an exercise for the reader. :)
std::tuple<JSONToken, int, std::string> lex_string(std::string raw_json,
int original_index) {
int index = original_index;
JSONToken token{"", JSONTokenType::String, index};
std::string value = "";
auto c = raw_json[index];
if (c != '"') {
return {token, original_index, ""};
}
index++;
// TODO: handle nested quotes
while (c = raw_json[index], c != '"') {
if (index == raw_json.length()) {
return {token, index, format_error("Unexpected EOF while lexing string", raw_json, index)};
}
token.value += c;
index++;
}
index++;
return {token, index, ""};
}
Nothing too special to discuss here. So on to lexing numbers.
lex_number
Since we're only supporting integers, this one has no internal state. We check characters until we stop seeing digits.
std::tuple<JSONToken, int, std::string> lex_number(std::string raw_json,
int original_index) {
int index = original_index;
JSONToken token = {"", JSONTokenType::Number, index};
std::string value = "";
// TODO: handle not just integers
while (true) {
if (index == raw_json.length()) {
break;
}
auto c = raw_json[index];
if (!(c >= '0' && c <= '9')) {
break;
}
token.value += c;
index++;
}
return {token, index, ""};
}
Done. On to keywords: null, false, true.
lex_keyword
This is a helper function that will check for a literal keyword.
std::tuple<JSONToken, int, std::string> lex_keyword(std::string raw_json,
std::string keyword,
JSONTokenType type,
int original_index) {
int index = original_index;
JSONToken token{"", type, index};
while (keyword[index - original_index] == raw_json[index]) {
if (index == raw_json.length()) {
break;
}
index++;
}
if (index - original_index == keyword.length()) {
token.value = keyword;
}
return {token, index, ""};
}
With this defined we can now implement lex_false, lex_true, and lex_null.
std::tuple<JSONToken, int, std::string> lex_null(std::string raw_json,
int index) {
return lex_keyword(raw_json, "null", JSONTokenType::Null, index);
}
std::tuple<JSONToken, int, std::string> lex_true(std::string raw_json,
int index) {
return lex_keyword(raw_json, "true", JSONTokenType::Boolean, index);
}
std::tuple<JSONToken, int, std::string> lex_false(std::string raw_json,
int index) {
return lex_keyword(raw_json, "false", JSONTokenType::Boolean, index);
}
And that's it for lexing! And although we defined all of these top-down, you'll want to write them mostly in reverse order or put in forward declarations.
If you wanted to you could now write a simple main.cpp like:
#include "json.hpp"
#include <iostream>
int main(int argc, char *argv[]) {
if (argc == 1) {
std::cerr << "Expected JSON input argument to parse" << std::endl;
return 1;
}
std::string in{argv[1]};
auto [tokens, error] = json::lex(in);
if (error.size()) {
std::cerr << error << std::endl;
return 1;
}
for (auto t : tokens) {
std::cout << t.value << std::endl;
}
}
Set up a Makefile:
main: *.cpp ./include/*.hpp
clang++ -g -Wall -std=c++2a -I./include *.cpp -o $@
Build with make and run ./main '{"a": 1}' to see the list of tokens printed out.
Now let's move on to parsing from the array of tokens.
Parsing
This process takes the array of tokens and turns them into a tree structure. The tree develops children as we spot [ or { tokens. The tree child ends when we spot ] or } tokens.
std::tuple<JSONValue, int, std::string> parse(std::vector<JSONToken> tokens,
int index) {
auto token = tokens[index];
switch (token.type) {
case JSONTokenType::Number: {
auto n = std::stod(token.value);
return {JSONValue{.number = n, .type = JSONValueType::Number}, index + 1, ""};
}
case JSONTokenType::Boolean:
return {JSONValue{.boolean = token.value == "true", .type = JSONValueType::Boolean}, index + 1, ""};
case JSONTokenType::Null:
return {JSONValue{.type = JSONValueType::Null}, index + 1, ""};
case JSONTokenType::String:
return {JSONValue{.string = token.value, .type = JSONValueType::String}, index + 1, ""};
case JSONTokenType::Syntax:
if (token.value == "[") {
auto [array, new_index, error] = parse_array(tokens, index + 1);
return {JSONValue{.array = array, .type = JSONValueType::Array}, new_index, error};
}
if (token.value == "{") {
auto [object, new_index, error] = parse_object(tokens, index + 1);
return {JSONValue{.object = std::optional(object), .type = JSONValueType::Object}, new_index, error};
}
}
return {{}, index, format_parse_error("Failed to parse", token)};
}
This in turn references format_parse_error on failure, which is an error-string-maker similar to format_error. It actually calls format_error with more details specific to parsing.
std::string JSONTokenType_to_string(JSONTokenType jtt) {
switch (jtt) {
case JSONTokenType::String:
return "String";
case JSONTokenType::Number:
return "Number";
case JSONTokenType::Syntax:
return "Syntax";
case JSONTokenType::Boolean:
return "Boolean";
case JSONTokenType::Null:
return "Null";
  }
  return ""; // unreachable since all enum values are handled, but silences missing-return warnings
}
std::string format_parse_error(std::string base, JSONToken token) {
std::ostringstream s;
s << "Unexpected token '" << token.value << "', type '"
  << JSONTokenType_to_string(token.type) << "', index " << token.location;
s << std::endl << base;
return format_error(s.str(), *token.full_source, token.location);
}
This function depended on a helper for turning the JSONTokenType enum into a string. As a user it's very annoying when languages don't give you stringifier methods for enums by default for debugging. I know there are some ways to do this with reflection in C++ but it seemed hairy.
But I digress.
parse_array
This function is called by parse when we find an opening bracket. It needs to recursively call parse, then check for a comma and call parse again ... until it finds the closing bracket.
It will fail if it ever finds something other than a comma or closing bracket following a successful call to parse.
std::tuple<std::vector<JSONValue>, int, std::string>
parse_array(std::vector<JSONToken> tokens, int index) {
std::vector<JSONValue> children = {};
while (index < tokens.size()) {
auto t = tokens[index];
if (t.type == JSONTokenType::Syntax) {
if (t.value == "]") {
return {children, index + 1, ""};
}
if (t.value == ",") {
index++;
t = tokens[index];
} else if (children.size() > 0) {
return {{},
index,
format_parse_error("Expected comma after element in array", t)};
}
}
auto [child, new_index, error] = parse(tokens, index);
if (error.size()) {
return {{}, index, error};
}
children.push_back(child);
index = new_index;
}
return {
    {},
    index,
    // index == tokens.size() here, so report the error against the last token
    format_parse_error("Unexpected EOF while parsing array", tokens[index - 1])};
}
And finally we need to implement parse_object
.
parse_object
This function is similar to parse_array but it needs to find $string COLON $parse() COMMA pattern pairs.
std::tuple<std::map<std::string, JSONValue>, int, std::string>
parse_object(std::vector<JSONToken> tokens, int index) {
std::map<std::string, JSONValue> values = {};
while (index < tokens.size()) {
auto t = tokens[index];
if (t.type == JSONTokenType::Syntax) {
if (t.value == "}") {
return {values, index + 1, ""};
}
if (t.value == ",") {
index++;
t = tokens[index];
} else if (values.size() > 0) {
return {
{},
index,
format_parse_error("Expected comma after element in object", t)};
} else {
return {{},
index,
format_parse_error(
"Expected key-value pair or closing brace in object", t)};
}
}
auto [key, new_index, error] = parse(tokens, index);
if (error.size()) {
return {{}, index, error};
}
if (key.type != JSONValueType::String) {
return {
{}, index, format_parse_error("Expected string key in object", t)};
}
index = new_index;
t = tokens[index];
if (!(t.type == JSONTokenType::Syntax && t.value == ":")) {
return {{},
index,
format_parse_error("Expected colon after key in object", t)};
}
index++;
t = tokens[index];
auto [value, new_index1, error1] = parse(tokens, index);
if (error1.size()) {
return {{}, index, error1};
}
values[key.string.value()] = value;
index = new_index1;
}
  return {{},
          index,
          // Running out of tokens without a closing brace is an error,
          // reported against the last token (mirrors parse_array)
          format_parse_error("Unexpected EOF while parsing object", tokens[index - 1])};
}
These parse functions are all slightly tedious but still very simple. And thankfully, we're done!
We can now implement the variation of parse that ties together lexing and parsing.
std::tuple<JSONValue, std::string> parse(std::string source) {
auto [tokens, error] = json::lex(... (truncated)
August 23, 2021
Automatically copy migration data in PlanetScale branches
August 21, 2021
Parser generators vs. handwritten parsers: surveying major language implementations in 2021
Developers often think parser generators are the sole legit way to build programming language frontends, possibly because compiler courses in university teach lex/yacc variants. But do any modern programming languages actually use parser generators anymore?
To find out, this post presents a non-definitive survey of the parsing techniques used by various major programming language implementations.
CPython: PEG parser
Until CPython 3.10 (which hasn't been released yet) the default parser was built using pgen, a custom parser generator; 3.10 switches to a new PEG parser, which the team thought was a better fit for expressing the language. At the time of the switch from pgen to the PEG parser, speed improved 10% but memory usage increased by 10% as well.
The PEG grammar is defined here. (It is getting renamed in 3.10 though so check the directory for a file of a similar name if you browse 3.10+).
This section was corrected by MegaIng on Reddit. Originally I mistakenly claimed the previous parser was handwritten. It was not. Thanks to J. Ryan Stinnett for a correction about the change in speed in the new PEG parser.
GCC: Handwritten
Source code for the C parser available here. It used to use Bison until GCC 4.1 in 2006. The C++ parser also switched from Bison to a handwritten parser 2 years earlier.
Clang: Handwritten
Not only handwritten but the same file handles parsing C, Objective-C and C++. Source code is available here.
Ruby: Yacc-like Parser Generator
Ruby uses Bison. The grammar for the language can be found here.
V8 JavaScript: Handwritten
Source code available here.
Zend Engine PHP: Yacc-like Parser Generator
Source code available here.
TypeScript: Handwritten
Source code available here.
Bash: Yacc-like Parser Generator
Source code for the grammar is available here.
Chromium CSS Parser: Handwritten
Source code available here.
Java (OpenJDK): Handwritten
You can find the source code here.
Some older commentary calls this implementation fragile. But a Java contributor suggests the situation has improved since Java 8.
Golang: Handwritten
Until Go 1.6 the compiler used a yacc-based parser. The source code for that grammar is available here.
In Go 1.6 they switched to a handwritten parser. You can find that change here. There was a reported 18% speed increase when parsing files and a reported 3% speed increase in building the compiler itself when switching.
You can find the source code for the compiler's parser here.
Roslyn: Handwritten
The C# parser source code is available here. The Visual Basic parser source code is here.
A C# contributor mentioned a few key reasons for using a handwritten parser here.
Lua: Handwritten
Source code available here.
Swift: Handwritten
Source code available here.
R: Yacc-like Parser Generator
I couldn't find it at first but Liorithiel showed me the parser source code is here.
Julia: Handwritten ... in Scheme
Julia's parser is handwritten but not in Julia. It's in Scheme! Source code available here.
PostgreSQL: Yacc-like Parser Generator
PostgreSQL uses Bison for parsing queries. Source code for the grammar available here.
MySQL: Yacc Parser Generator
Source code for the grammar available here.
SQLite: Yacc-like Parser Generator
SQLite uses its own parser generator called Lemon. Source code for the grammar is available here.
Summary
Of the 2021 Redmonk top 10 languages, 8 of them have a handwritten parser. Ruby and Python use parser generators.
Although parser generators are still used in major language implementations, maybe it's time for universities to start teaching handwritten parsing?
This tweet was published before I was corrected about Python's parser. It should say 8/10 but I cannot edit the tweet.
Let's actually survey the parsing techniques used by major programming languages in 2021 (with links to code 👾).
— Phil Eaton (@phil_eaton) August 21, 2021
In this post we discover that 9/10 of the top languages by @redmonk use a handwritten parser as opposed to a parser generator. 😱https://t.co/M69TqN78G5 pic.twitter.com/sGsdDmwshB