a curated list of database news from authoritative sources

July 14, 2022

July 13, 2022

July 12, 2022

July 10, 2022

Implementing a simple jq clone in Go, and basics of Go memory profiling

In this post we'll build a basic jq clone in Go. It will only be able to pull a single path out of each object it reads. It won't be able to do filters, mapping, etc.

$ cat large-file.json | head -n2 | ./jqgo '.repo.url'
"https://api.github.com/repos/petroav/6.828"
"https://api.github.com/repos/rspt/rspt-theme"

We'll start by building a "control" implementation that uses Go's builtin JSON library with a JSON path tool on top.

Then we'll implement a basic path-aware JSON parser in 600 lines of Go. It's going to use a technique (that may have a better name but) I call "partial parsing" or "fuzzy parsing" where we fully parse what we care about and only sort of parse the rest.

Why partial parsing? There are two general reasons. One is to use less memory than parsers that must always turn all of a text into an object in your language. The other is for when the language has complexities you don't want or need to deal with. We'll basically have to deal with all the complexities of JSON so this post is about the former reason: using less memory. I've written about a case for the second reason though in building a simple, fast SCSS implementation.

This partial parser is more complex than a typical handwritten parser. If you are unfamiliar with handwritten JSON parsers, you may want to take a look at previous articles I've written about parsing JSON.

Once we get this partial parser working we'll turn to Go's builtin profiler to find what we can do to make it faster.

All code for this post is available on Github.

Machine specs, versions

Since we're going to be doing some rudimentary comparisons of performance, here are my details. I am running everything on a dedicated server, OVH Rise-1.

  • RAM: 64 GB DDR4 ECC 2,133 MHz
  • Disk: 2x450 GB SSD NVMe in Soft RAID
  • Processor: Intel Xeon E3-1230v6 - 4c/8t - 3.5 GHz/3.9 GHz

And relevant versions:

$ jq --version
jq-1.6
$ go version
go version go1.18 linux/amd64
$ uname -a
Linux phil 5.18.10-100.fc35.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 7 17:41:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Now buckle up!

jq using Go's builtin JSON library

This is a very simple program. We just parse JSON data from stdin in a loop. And after parsing each time we'll call a extractValueAtPath function to grab the value at the path the user asks for.

To keep our path "parser" very simple we'll treat array access the same as object access. So we'll look for x.0 instead of x[0], unlike jq.

package main

import (
    "encoding/json"
    "io"
    "log"
    "os"
    "strconv"
    "strings"
)

func extractValueAtPath(a map[string]any, path []string) (any, error) {
    // TODO
}

func main() {
    path := strings.Split(os.Args[1], ".")
    if path[0] == "" {
        path = path[1:]
    }

    dec := json.NewDecoder(os.Stdin)
    var a map[string]any

    enc := json.NewEncoder(os.Stdout)

    for {
        err := dec.Decode(&a)
        if err == io.EOF {
            break
        }

        if err != nil {
            log.Fatal(err)
        }

        v, err := extractValueAtPath(a, path)
        if err != nil {
            log.Fatal(err)
        }

        err = enc.Encode(v)
        if err != nil {
            log.Fatal(err)
        }
    }
}

Then we implement the extractValueAtPath function itself, entering into JSON arrays and objects until we reach the end of the path.

func extractValueAtPath(a map[string]any, path []string) (any, error) {
    if len(path) == 0 {
        return nil, nil
    }

    var v any = a

    for _, part := range path {
        if arr, ok := v.([]any); ok {
            n, err := strconv.Atoi(part)
            if err != nil {
                return nil, err
            }

            v = arr[n]
            continue
        }

        m, ok := v.(map[string]any)
        if !ok {
            // Path into a non-map
            return nil, nil
        }

        v, ok = m[part]
        if !ok {
            // Path does not exist
            return nil, nil
        }
    }

    return v, nil
}

Alright, let's give it a go module and build and run it!

$ go mod init control
$ go mod tidy
$ go build
# Grab a test file
$ curl https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json | jq -c '.[]' > large-file.json
$ cat large-file.json | head -n2 | ./control '.repo.url'
"https://api.github.com/repos/petroav/6.828"
"https://api.github.com/repos/rspt/rspt-theme"

Sweet. Now let's make sure it produces the same thing as jq.

$ cat large-file.json | ./control '.repo.url' > control.test
$ cat large-file.json | jq '.repo.url' > jq.test
$ diff jq.test control.test
$ echo $?
0

Great! It's working for a basic query. Let's see how it performs.

$ hyperfine \
  "cat large-file.json | ./control '.repo.url' > control.test" \
  "cat large-file.json | jq '.repo.url' > jq.test"
Benchmark 1: cat large-file.json | ./control '.repo.url' > control.test
  Time (mean ± σ):     310.0 ms ±  14.4 ms    [User: 296.2 ms, System: 49.3 ms]
  Range (min  max):   296.1 ms  344.9 ms    10 runs

Benchmark 2: cat large-file.json | jq '.repo.url' > jq.test
  Time (mean ± σ):     355.8 ms ±   1.1 ms    [User: 348.8 ms, System: 27.7 ms]
  Range (min  max):   354.8 ms  358.5 ms    10 runs

Summary
  'cat large-file.json | ./control '.repo.url' > control.test' ran
    1.15 ± 0.05 times faster than 'cat large-file.json | jq '.repo.url' > jq.test'

Now that's surprising! This naive implementation in Go is a bit faster than standard jq. But our implementation supports a heck of a lot less than jq. So this benchmark on its own isn't incredibly meaningful.

However, it's a good base for comparing to our next implementation.

Astute readers may notice that this version doesn't use a buffered reader from stdin, while the next version will. I tried this version with and without wrapping stdin in a buffered reader but it didn't make a meaningful difference. It might be because Go's JSON decoder does its own buffering. I'm not sure.

Let's do the fun implementation.

Partial parsing

Unlike a typical handwritten parser this partial parser is going to contain almost two parsers. One parser will care exactly about the structure of JSON. The other parser will only care about reading past the current value (whether it be a number or string or array or object, etc.) The path we pass to the parser will be used to decide whether each value should be fully parsed or partially parsed.

I'll reiterate: this partial parser is more complex than a typical handwritten parser. If you are unfamiliar with handwritten JSON parsers, you may want to take a look at previous articles I've written about parsing JSON.

The shell of this partial parser is going to look similar to the shell of the first parser.

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "os"
    "strconv"
    "strings"
)

type jsonReader struct {
    read []byte
}

... TO IMPLEMENT ...

func main() {
    path := strings.Split(os.Args[1], ".")
    if path[0] == "" {
        path = path[1:]
    }

    b := bufio.NewReader(os.Stdin)
    enc := json.NewEncoder(os.Stdout)

    var jr jsonReader
    var val any
    var err error

    for {
        jr.reset()

        val, err = jr.extractDataFromJsonPath(b, path)
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Println("Read", string(jr.read))
            log.Fatalln(err)
        }

        err = enc.Encode(val)
        if err != nil {
            log.Fatalln(err)
        }
    }
}

Except instead of using the builtin JSON parser we'll call our own extractDataFromJsonPath function that handles parsing and extraction all at once.

Before doing that we'll add a few helper functions. The first one grabs a byte from a reader and stores the read byte locally (so we can print out all read bytes if the program fails).

type jsonReader struct {
    read []byte
}

func (jr *jsonReader) readByte(r *bufio.Reader) (byte, error) {
    c, err := r.ReadByte()
    if err != nil {
        return byte(0), err
    }

    jr.read = append(jr.read, c)

    return c, nil
}

The reset member zeroes out the read bytes and gets called before each object is parsed in the main main loop.

func (jr *jsonReader) reset() {
    jr.read = nil
}

Now let's get into extractDataFromJsonPath.

extractDataFromJsonPath

This is the real parser. It expects a JSON object and fully parses the object, almost.

func (jr *jsonReader) extractDataFromJsonPath(r *bufio.Reader, path []string) (any, error) {
    if len(path) == 0 {
        return nil, nil
    }

    err := jr.eatWhitespace(r)
    if err != nil {
        return nil, err
    }

    b, err := jr.readByte(r)
    if err != nil {
        return nil, err
    }

    // Make sure we're actually going into an object
    if b != '{' {
        return nil, fmt.Errorf("Expected opening curly brace, got: '%s'", string(b))
    }

    i := -1

    var result any
    for {
        i++

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

        bs, err := r.Peek(1)
        if err != nil {
            return nil, err
        }
        b := bs[0]

        // We found the end of the object
        if b == '}' {
            r.Discard(1)
            break
        }

        // Key-value pairs must be separated by commas
        if i > 0 {
            if b == ',' {
                r.Discard(1)
            } else {
                return nil, fmt.Errorf("Expected comma between key-value pairs, got: '%s'", string(b))
            }
        }

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

        // Grab the key
        s, err := jr.expectString(r)
        if err != nil {
            return nil, err
        }

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

        // Find a colon separating key from value
        b, err = jr.readByte(r)
        if err != nil {
            return nil, err
        }

        if b != ':' {
            return nil, fmt.Errorf("Expected colon, got: '%s'", string(b))
        }

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

Up to this point it looks like any old handwritten parser. There are a few helpers in there (eatWhitespace, expectString) we'll implement shortly.

But once we see each key and are ready to look for a value we can decide if we need to fully parse the value (if the path goes into this key) or if we can partially parse the value (because the path does not go into this key).

        // If the key is not the start of this path, skip past this value
        if path[0] != s {
            err = jr.eatValue(r)
            if err != nil {
                return nil, err
            }

            continue
        }

        // Otherwise this is a path we want, grab the value
        result, err = jr.expectValue(r, path[1:])
        if err != nil {
            return nil, err
        }
    }

    return result, nil
}

And that's it! The core parsing loop is done. The meat now becomes 1) the eatValue function that partially parses JSON and 2) the expectValue function that either encounters a scalar value and returns it or recursively calls extractDataFromJsonPath to enter some new object.

Notes on helper naming

There are three main kinds of helpers you'll see. expectX helpers like expectString will return early with an error if they fail to find what they're looking for. eatX helpers like eatWhitespace will not return any value and will only move the read cursor forward. And tryX helpers like tryNumber will do the same thing as expectString but return an additional boolean argument. So the caller can decide whether or not to make other attempts at parsing.

But first let's fill in the two helpers we skipped. First off, eatWhitespace.

eatWhitespace

This function peeks and reads bytes while the bytes are whitespace.

func (jr *jsonReader) eatWhitespace(r *bufio.Reader) error {
    for {
        bs, err := r.Peek(1)
        if err != nil {
            return err
        }
        b := bs[0]

        isWhitespace := b == ' ' ||
            b == '\n' ||
            b == '\t' ||
            b == '\r'
        if !isWhitespace {
            return nil
        }

        r.Discard(1)
    }
}

That's it! Next we need to fill in expectString.

expectString

This is a standard handwritten parser helper that looks for a double quote and keeps collecting bytes until it finds an ending double quote that is not escaped.

func (jr *jsonReader) expectString(r *bufio.Reader) (string, error) {
    var s []byte

    err := jr.eatWhitespace(r)
    if err != nil {
        return "", err
    }

    // Look for opening quote
    b, err := jr.readByte(r)
    if err != nil {
        return "", err
    }

    if b != '"' {
        return "", fmt.Errorf("Expected double quote to start string, got: '%s'", string(b))
    }

    var prev byte
    for {
        b, err := jr.readByte(r)
        if err != nil {
            return "", err
        }

        if b == '\\' && prev == '\\' {
            // Just skip
            prev = byte(0)
            continue
        } else if b == '"' {
            // Overwrite the escaped double quote
            if prev == '\\' {
                s[len(s)-1] = '"'
            } else {
                // Otherwise it's the actual end
                break
            }
        }

        s = append(s, b)
        prev = b
    }

    return string(s), nil
}

Standard stuff! Now let's get back to those meaty functions we introduced before, starting with expectValue.

expectValue

This function is called by extractDataFromJsonPath when it wants to fully parse a value.

If we see a left curly brace, we call extractDataFromJsonPath with it.

func (jr *jsonReader) expectValue(r *bufio.Reader, path []string) (any, error) {
    bs, err := r.Peek(1)
    if err != nil {
        return nil, err
    }
    c := bs[0]

    if c == '{' {
        return jr.extractDataFromJsonPath(r, path)

Otherwise if we see a left bracket we call a new helper extractArrayDataFromJsonPath which will be almost identical to extractDataFromJsonPath but for parsing array syntax.

    } else if c == '[' {
        return jr.extractArrayDataFromJsonPath(r, path)
    }

If the value we're trying to parse isn't an array or object and there's more of a path then we have to return null because we can't enter into a scalar value.

    // Can't go any further into a path

    if len(path) != 0 {
        // Reached the end of this object but more of
        // the path remains. So this object doesn't
        // contain this path.
        return nil, nil
    }

Then we try to parse a scalar (numbers, strings, booleans, null) and ultimately return an error if nothing worked.

    ok, val, err := jr.tryScalar(r)
    if err != nil {
        return nil, err
    }
    if !ok {
        return nil, fmt.Errorf("Expected scalar, got: '%s'", string(c))
    }

    return val, err
}

Let's implement tryScalar and its dependencies now. And we'll come back to extractArrayDataFromJsonPath afterward.

tryScalar

The tryScalar is similar to expectValue. It's called tryScalar because it's allowed to fail.

We peek at the first byte and switch on a dedicated parsing helper based on it.

func (jr *jsonReader) tryScalar(r *bufio.Reader) (bool, any, error) {
    bs, err := r.Peek(1)
    if err != nil {
        return false, nil, err
    }
    c := bs[0]

    if c == '"' {
        val, err := jr.expectString(r)
        return true, string(val), err
    } else if c == 't' {
        val, err := jr.expectIdentifier(r, "true", true)
        return true, val, err
    } else if c == 'f' {
        val, err := jr.expectIdentifier(r, "false", false)
        return true, val, err
    } else if c == 'n' {
        val, err := jr.expectIdentifier(r, "null", nil)
        return true, val, err
    }

    return jr.tryNumber(r)
}

This passes control flow to two new functions, expectIdentifier and tryNumber. Let's do expectIdentifier next.

expectIdentifier

This function tries to match the reader on a string passed to it.

func (jr *jsonReader) expectIdentifier(r *bufio.Reader, ident string, value any) (any, error) {
    var s []byte

    for i := 0; i < len(ident); i++ {
        b, err := r.ReadByte()
        if err != nil {
            return nil, err
        }

        s = append(s, b)
    }

    if string(s) == ident {
        return value, nil
    }

    return nil, fmt.Errorf("Unknown value: '%s'", string(s))
}

Thanks Michael Lynch for pointing out in an earlier version that expectIdentifier does not need to Peek/Discard but can just ReadByte instead.

tryNumber

This function tries to parse a number. We'll do a very lazy number parser that will most likely allow all valid numbers. Internally we'll call json.Unmarshal on the bytes we build up to do the conversion itself.

func (jr *jsonReader) tryNumber(r *bufio.Reader) (bool, any, error) {
    var number []byte

    // Loop trying to find all number-like characters in a row
    for {
        bs, err := r.Peek(1)
        if err != nil {
            return false, nil, err
        }
        c := bs[0]

        isNumberCharacter := (c >= '0' && c <= '9') || c == 'e' || c == '-'
        if !isNumberCharacter {
            break
        }

        number = append(number, c)

        r.Discard(1)
    }

    if len(number) == 0 {
        return false, nil, nil
    }

    var n float64
    err := json.Unmarshal(number, &n)
    return true, n, err
}

If we can't find a number, that's ok. We'll just say so in the first argument by returning false.

Outstanding functions

Ok we've come a while building out helper functions. The last two remaining helpers are extractArrayDataFromJsonPath and eatValue. Let's finish up these real parser functions before getting to eatValue, the primary partial parsing function.

extractArrayDataFromJsonPath

This function is almost identical to extractDataFromJsonPath but rather than parsing key-value pairs inside curly braces it parses values inside brackets.

func (jr *jsonReader) extractArrayDataFromJsonPath(r *bufio.Reader, path []string) (any, error) {
    // Path inside an array must be an integer
    n, err := strconv.Atoi(string(path[0]))
    if err != nil {
        return nil, err
    }

    // Look for opening bracket. Make sure we're in an array
    b, err := jr.readByte(r)
    if err != nil {
        return nil, err
    }

    if b != '[' {
        return nil, fmt.Errorf("Expected opening bracket, got: '%s'", string(b))
    }

    var result any
    i := -1
    for {
        i++

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

        bs, err := r.Peek(1)
        if err != nil {
            return nil, err
        }
        b := bs[0]

        // Found closing bracket, exit the array
        if b == ']' {
            r.Discard(1)
            break
        }

        // Array values must be separated by a comma
        if i > 0 {
            if b == ',' {
                r.Discard(1)
            } else {
                return nil, fmt.Errorf("Expected comma between key-value pairs, got: '%s'", string(b))
            }
        }

        err = jr.eatWhitespace(r)
        if err != nil {
            return nil, err
        }

Just like extractDataFromJsonPath it either calls eatValue or expectValue depending on whether the current index matches the requested path.

        // If the key is not the start of this path, skip past this value
        if i !=... (truncated)
                                    

July 08, 2022

July 07, 2022

July 06, 2022