July 10, 2022
Implementing a simple jq clone in Go, and basics of Go memory profiling
In this post we'll build a basic jq clone in Go. It will only be able to pull a single path out of each object it reads. It won't be able to do filters, mapping, etc.
$ cat large-file.json | head -n2 | ./jqgo '.repo.url'
"https://api.github.com/repos/petroav/6.828"
"https://api.github.com/repos/rspt/rspt-theme"
We'll start by building a "control" implementation that uses Go's builtin JSON library with a JSON path tool on top.
Then we'll implement a basic path-aware JSON parser in 600 lines of Go. It's going to use a technique (that may have a better name but) I call "partial parsing" or "fuzzy parsing" where we fully parse what we care about and only sort of parse the rest.
Why partial parsing? There are two general reasons. One is to use less memory than parsers that must always turn all of a text into an object in your language. The other is for when the language has complexities you don't want or need to deal with. We'll basically have to deal with all the complexities of JSON so this post is about the former reason: using less memory. I've written about a case for the second reason though in building a simple, fast SCSS implementation.
This partial parser is more complex than a typical handwritten parser. If you are unfamiliar with handwritten JSON parsers, you may want to take a look at previous articles I've written about parsing JSON.
Once we get this partial parser working we'll turn to Go's builtin profiler to find what we can do to make it faster.
All code for this post is available on GitHub.
Machine specs, versions
Since we're going to be doing some rudimentary comparisons of performance, here are my details. I am running everything on a dedicated server, OVH Rise-1.
- RAM: 64 GB DDR4 ECC 2,133 MHz
- Disk: 2x450 GB SSD NVMe in Soft RAID
- Processor: Intel Xeon E3-1230v6 - 4c/8t - 3.5 GHz/3.9 GHz
And relevant versions:
$ jq --version
jq-1.6
$ go version
go version go1.18 linux/amd64
$ uname -a
Linux phil 5.18.10-100.fc35.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 7 17:41:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Now buckle up!
jq using Go's builtin JSON library
This is a very simple program. We just parse JSON data from stdin in a loop. After each parse we'll call an extractValueAtPath function to grab the value at the path the user asks for.
To keep our path "parser" very simple we'll treat array access the same as object access. So we'll look for x.0 instead of x[0], unlike jq.
package main

import (
    "encoding/json"
    "io"
    "log"
    "os"
    "strconv"
    "strings"
)

func extractValueAtPath(a map[string]any, path []string) (any, error) {
    // TODO
}

func main() {
    path := strings.Split(os.Args[1], ".")
    if path[0] == "" {
        path = path[1:]
    }

    dec := json.NewDecoder(os.Stdin)
    var a map[string]any
    enc := json.NewEncoder(os.Stdout)
    for {
        err := dec.Decode(&a)
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }

        v, err := extractValueAtPath(a, path)
        if err != nil {
            log.Fatal(err)
        }

        err = enc.Encode(v)
        if err != nil {
            log.Fatal(err)
        }
    }
}
Then we implement the extractValueAtPath function itself, entering into JSON arrays and objects until we reach the end of the path.
func extractValueAtPath(a map[string]any, path []string) (any, error) {
    if len(path) == 0 {
        return nil, nil
    }

    var v any = a
    for _, part := range path {
        if arr, ok := v.([]any); ok {
            n, err := strconv.Atoi(part)
            if err != nil {
                return nil, err
            }
            if n < 0 || n >= len(arr) {
                // Index out of range
                return nil, nil
            }
            v = arr[n]
            continue
        }

        m, ok := v.(map[string]any)
        if !ok {
            // Path into a non-map
            return nil, nil
        }

        v, ok = m[part]
        if !ok {
            // Path does not exist
            return nil, nil
        }
    }

    return v, nil
}
Alright, let's give it a go module and build and run it!
$ go mod init control
$ go mod tidy
$ go build
# Grab a test file
$ curl https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json | jq -c '.[]' > large-file.json
$ cat large-file.json | head -n2 | ./control '.repo.url'
"https://api.github.com/repos/petroav/6.828"
"https://api.github.com/repos/rspt/rspt-theme"
Sweet. Now let's make sure it produces the same thing as jq.
$ cat large-file.json | ./control '.repo.url' > control.test
$ cat large-file.json | jq '.repo.url' > jq.test
$ diff jq.test control.test
$ echo $?
0
Great! It's working for a basic query. Let's see how it performs.
$ hyperfine \
"cat large-file.json | ./control '.repo.url' > control.test" \
"cat large-file.json | jq '.repo.url' > jq.test"
Benchmark 1: cat large-file.json | ./control '.repo.url' > control.test
Time (mean ± σ): 310.0 ms ± 14.4 ms [User: 296.2 ms, System: 49.3 ms]
Range (min … max): 296.1 ms … 344.9 ms 10 runs
Benchmark 2: cat large-file.json | jq '.repo.url' > jq.test
Time (mean ± σ): 355.8 ms ± 1.1 ms [User: 348.8 ms, System: 27.7 ms]
Range (min … max): 354.8 ms … 358.5 ms 10 runs
Summary
'cat large-file.json | ./control '.repo.url' > control.test' ran
1.15 ± 0.05 times faster than 'cat large-file.json | jq '.repo.url' > jq.test'
Now that's surprising! This naive implementation in Go is a bit faster than standard jq. But our implementation supports a heck of a lot less than jq. So this benchmark on its own isn't incredibly meaningful.
However, it's a good base for comparing to our next implementation.
Astute readers may notice that this version doesn't use a buffered reader from stdin, while the next version will. I tried this version with and without wrapping stdin in a buffered reader but it didn't make a meaningful difference. It might be because Go's JSON decoder does its own buffering. I'm not sure.
Let's do the fun implementation.
Partial parsing
Unlike a typical handwritten parser this partial parser is going to contain almost two parsers. One parser will care exactly about the structure of JSON. The other parser will only care about reading past the current value (whether it be a number or string or array or object, etc.) The path we pass to the parser will be used to decide whether each value should be fully parsed or partially parsed.
I'll reiterate: this partial parser is more complex than a typical handwritten parser. If you are unfamiliar with handwritten JSON parsers, you may want to take a look at previous articles I've written about parsing JSON.
The shell of this partial parser is going to look similar to the shell of the first parser.
package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "os"
    "strconv"
    "strings"
)

type jsonReader struct {
    read []byte
}

... TO IMPLEMENT ...

func main() {
    path := strings.Split(os.Args[1], ".")
    if path[0] == "" {
        path = path[1:]
    }

    b := bufio.NewReader(os.Stdin)
    enc := json.NewEncoder(os.Stdout)
    var jr jsonReader
    var val any
    var err error
    for {
        jr.reset()
        val, err = jr.extractDataFromJsonPath(b, path)
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Println("Read", string(jr.read))
            log.Fatalln(err)
        }

        err = enc.Encode(val)
        if err != nil {
            log.Fatalln(err)
        }
    }
}
Except instead of using the builtin JSON parser we'll call our own extractDataFromJsonPath function that handles parsing and extraction all at once.
Before doing that we'll add a few helper functions. The first one grabs a byte from a reader and stores the read byte locally (so we can print out all read bytes if the program fails).
type jsonReader struct {
    read []byte
}

func (jr *jsonReader) readByte(r *bufio.Reader) (byte, error) {
    c, err := r.ReadByte()
    if err != nil {
        return byte(0), err
    }

    jr.read = append(jr.read, c)
    return c, nil
}
The reset member zeroes out the read bytes and gets called before each object is parsed in the main loop.
func (jr *jsonReader) reset() {
    jr.read = nil
}
Now let's get into extractDataFromJsonPath.
extractDataFromJsonPath
This is the real parser. It expects a JSON object and (almost) fully parses it.
func (jr *jsonReader) extractDataFromJsonPath(r *bufio.Reader, path []string) (any, error) {
    if len(path) == 0 {
        return nil, nil
    }

    err := jr.eatWhitespace(r)
    if err != nil {
        return nil, err
    }

    b, err := jr.readByte(r)
    if err != nil {
        return nil, err
    }

    // Make sure we're actually going into an object
    if b != '{' {
        return nil, fmt.Errorf("Expected opening curly brace, got: '%s'", string(b))
    }

    i := -1
    var result any
    for {
        i++

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

        bs, err := r.Peek(1)
        if err != nil {
            return nil, err
        }
        b := bs[0]

        // We found the end of the object
        if b == '}' {
            r.Discard(1)
            break
        }

        // Key-value pairs must be separated by commas
        if i > 0 {
            if b == ',' {
                r.Discard(1)
            } else {
                return nil, fmt.Errorf("Expected comma between key-value pairs, got: '%s'", string(b))
            }
        }

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

        // Grab the key
        s, err := jr.expectString(r)
        if err != nil {
            return nil, err
        }

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

        // Find a colon separating key from value
        b, err = jr.readByte(r)
        if err != nil {
            return nil, err
        }
        if b != ':' {
            return nil, fmt.Errorf("Expected colon, got: '%s'", string(b))
        }

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }
Up to this point it looks like any old handwritten parser. There are a few helpers in there (eatWhitespace, expectString) we'll implement shortly.
But once we see each key and are ready to look for a value we can decide if we need to fully parse the value (if the path goes into this key) or if we can partially parse the value (because the path does not go into this key).
        // If the key is not the start of this path, skip past this value
        if path[0] != s {
            err = jr.eatValue(r)
            if err != nil {
                return nil, err
            }
            continue
        }

        // Otherwise this is a path we want, grab the value
        result, err = jr.expectValue(r, path[1:])
        if err != nil {
            return nil, err
        }
    }

    return result, nil
}
And that's it! The core parsing loop is done. The meat now becomes 1) the eatValue function that partially parses JSON and 2) the expectValue function that either encounters a scalar value and returns it or recursively calls extractDataFromJsonPath to enter some new object.
Notes on helper naming
There are three main kinds of helpers you'll see. expectX helpers like expectString will return early with an error if they fail to find what they're looking for. eatX helpers like eatWhitespace will not return any value and will only move the read cursor forward. And tryX helpers like tryNumber will do the same thing as expectX helpers but return an additional boolean, so the caller can decide whether or not to make other attempts at parsing.
But first let's fill in the two helpers we skipped. First off, eatWhitespace.
eatWhitespace
This function peeks and reads bytes while the bytes are whitespace.
func (jr *jsonReader) eatWhitespace(r *bufio.Reader) error {
    for {
        bs, err := r.Peek(1)
        if err != nil {
            return err
        }
        b := bs[0]

        isWhitespace := b == ' ' ||
            b == '\n' ||
            b == '\t' ||
            b == '\r'
        if !isWhitespace {
            return nil
        }

        r.Discard(1)
    }
}
That's it! Next we need to fill in expectString.
expectString
This is a standard handwritten parser helper that looks for a double quote and keeps collecting bytes until it finds an ending double quote that is not escaped.
func (jr *jsonReader) expectString(r *bufio.Reader) (string, error) {
    var s []byte

    err := jr.eatWhitespace(r)
    if err != nil {
        return "", err
    }

    // Look for opening quote
    b, err := jr.readByte(r)
    if err != nil {
        return "", err
    }
    if b != '"' {
        return "", fmt.Errorf("Expected double quote to start string, got: '%s'", string(b))
    }

    var prev byte
    for {
        b, err := jr.readByte(r)
        if err != nil {
            return "", err
        }

        if b == '\\' && prev == '\\' {
            // Just skip
            prev = byte(0)
            continue
        } else if b == '"' {
            // Overwrite the backslash with the escaped double quote
            if prev == '\\' {
                s[len(s)-1] = '"'
                prev = b
                continue
            }

            // Otherwise it's the actual end
            break
        }

        s = append(s, b)
        prev = b
    }

    return string(s), nil
}
Standard stuff! Now let's get back to those meaty functions we introduced before, starting with expectValue.
expectValue
This function is called by extractDataFromJsonPath when it wants to fully parse a value.
If we see a left curly brace, we call extractDataFromJsonPath with it.
func (jr *jsonReader) expectValue(r *bufio.Reader, path []string) (any, error) {
    bs, err := r.Peek(1)
    if err != nil {
        return nil, err
    }
    c := bs[0]

    if c == '{' {
        return jr.extractDataFromJsonPath(r, path)
Otherwise if we see a left bracket we call a new helper, extractArrayDataFromJsonPath, which will be almost identical to extractDataFromJsonPath but for parsing array syntax.
    } else if c == '[' {
        return jr.extractArrayDataFromJsonPath(r, path)
    }
If the value we're trying to parse isn't an array or object and there's more of a path then we have to return null because we can't enter into a scalar value.
    // Can't go any further into a path
    if len(path) != 0 {
        // Reached the end of this object but more of
        // the path remains. So this object doesn't
        // contain this path.
        return nil, nil
    }
Then we try to parse a scalar (numbers, strings, booleans, null) and ultimately return an error if nothing worked.
    ok, val, err := jr.tryScalar(r)
    if err != nil {
        return nil, err
    }
    if !ok {
        return nil, fmt.Errorf("Expected scalar, got: '%s'", string(c))
    }
    return val, err
}
Let's implement tryScalar and its dependencies now. And we'll come back to extractArrayDataFromJsonPath afterward.
tryScalar
The tryScalar function is similar to expectValue. It's called tryScalar because it's allowed to fail.
We peek at the first byte and switch on a dedicated parsing helper based on it.
func (jr *jsonReader) tryScalar(r *bufio.Reader) (bool, any, error) {
    bs, err := r.Peek(1)
    if err != nil {
        return false, nil, err
    }
    c := bs[0]

    if c == '"' {
        val, err := jr.expectString(r)
        return true, string(val), err
    } else if c == 't' {
        val, err := jr.expectIdentifier(r, "true", true)
        return true, val, err
    } else if c == 'f' {
        val, err := jr.expectIdentifier(r, "false", false)
        return true, val, err
    } else if c == 'n' {
        val, err := jr.expectIdentifier(r, "null", nil)
        return true, val, err
    }

    return jr.tryNumber(r)
}
This passes control flow to two new functions, expectIdentifier and tryNumber. Let's do expectIdentifier next.
expectIdentifier
This function tries to match the reader on a string passed to it.
func (jr *jsonReader) expectIdentifier(r *bufio.Reader, ident string, value any) (any, error) {
    var s []byte

    for i := 0; i < len(ident); i++ {
        b, err := r.ReadByte()
        if err != nil {
            return nil, err
        }
        s = append(s, b)
    }

    if string(s) == ident {
        return value, nil
    }

    return nil, fmt.Errorf("Unknown value: '%s'", string(s))
}
Thanks to Michael Lynch for pointing out in an earlier version that expectIdentifier does not need to Peek/Discard but can just ReadByte instead.
tryNumber
This function tries to parse a number. We'll do a very lazy number parser that will most likely allow all valid numbers. Internally we'll call json.Unmarshal on the bytes we build up to do the conversion itself.
func (jr *jsonReader) tryNumber(r *bufio.Reader) (bool, any, error) {
    var number []byte

    // Loop trying to find all number-like characters in a row
    for {
        bs, err := r.Peek(1)
        if err != nil {
            return false, nil, err
        }
        c := bs[0]

        isNumberCharacter := (c >= '0' && c <= '9') ||
            c == '.' || c == '+' || c == '-' ||
            c == 'e' || c == 'E'
        if !isNumberCharacter {
            break
        }

        number = append(number, c)
        r.Discard(1)
    }

    if len(number) == 0 {
        return false, nil, nil
    }

    var n float64
    err := json.Unmarshal(number, &n)
    return true, n, err
}
If we can't find a number, that's ok. We'll just say so in the first return value by returning false.
Outstanding functions
Ok, we've come a long way building out helper functions. The last two remaining helpers are extractArrayDataFromJsonPath and eatValue. Let's finish the former, the last real parser function, before getting to eatValue, the primary partial parsing function.
extractArrayDataFromJsonPath
This function is almost identical to extractDataFromJsonPath but rather than parsing key-value pairs inside curly braces it parses values inside brackets.
func (jr *jsonReader) extractArrayDataFromJsonPath(r *bufio.Reader, path []string) (any, error) {
    // Path inside an array must be an integer
    n, err := strconv.Atoi(path[0])
    if err != nil {
        return nil, err
    }

    // Look for opening bracket. Make sure we're in an array
    b, err := jr.readByte(r)
    if err != nil {
        return nil, err
    }
    if b != '[' {
        return nil, fmt.Errorf("Expected opening bracket, got: '%s'", string(b))
    }

    var result any
    i := -1
    for {
        i++

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

        bs, err := r.Peek(1)
        if err != nil {
            return nil, err
        }
        b := bs[0]

        // Found closing bracket, exit the array
        if b == ']' {
            r.Discard(1)
            break
        }

        // Array values must be separated by a comma
        if i > 0 {
            if b == ',' {
                r.Discard(1)
            } else {
                return nil, fmt.Errorf("Expected comma between array values, got: '%s'", string(b))
            }
        }

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }
Just like extractDataFromJsonPath it either calls eatValue or expectValue depending on whether the current index matches the requested path.
        // If the index is not the start of this path, skip past this value
if i !=... (truncated)