A curated list of database news from authoritative sources

July 08, 2021

Announcing Vitess Arewefastyet

Benchmarking is a critical technique for delivering high-performance software. The basic idea behind benchmarking is measuring and comparing the performance of one software version against another. Over the years, many benchmarking techniques have emerged, but we can broadly separate them into two categories: micro- and macro-benchmarks. Micro-benchmarks measure a small part of the codebase, usually by isolating a single function call and invoking it repeatedly, whereas macro-benchmarks measure the performance of the codebase as a whole and run in an environment similar to what end users experience.
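
As a minimal sketch of the micro side (illustrative only, not from the Vitess post; parse_row is a hypothetical function under test), Python's built-in timeit module captures the isolate-and-repeat pattern:

```python
import timeit

def parse_row(line: str) -> list:
    """Hypothetical function under test: split a CSV-ish line."""
    return line.split(",")

# Micro-benchmark: isolate the single call and invoke it repeatedly.
# Taking the best of several repeats reduces noise from the environment.
best = min(timeit.repeat(lambda: parse_row("a,b,c,d"), number=100_000, repeat=5))
print(f"best of 5 runs: {best:.4f} s per 100k calls")
```

A macro-benchmark, by contrast, would drive the whole system end to end, for example by replaying production-like query traffic against a full deployment.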

July 04, 2021

AWS EC2 Hardware Trends: 2015-2021


Over the past decade, AWS EC2 has introduced many new instance types with different hardware configurations and prices. This hardware zoo can make it hard to keep track of what is available. In this post we will look at how the EC2 hardware landscape has changed since 2015, which will hopefully make it easier to pick the best option for a given task.

In the cloud, one can trade money for hardware resources. It therefore makes sense to take an economic perspective and normalize each hardware resource by the instance price. For example, instead of looking at absolute network bandwidth, we will use network bandwidth per dollar. Such metrics also allow us to ignore virtualized slices, reducing the number of instances relevant to the analysis from hundreds to dozens. For example, c5n.9xlarge is a virtualized slice of c5n.18xlarge with half the network bandwidth and half the cost, so its per-dollar metrics are identical.
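
As a minimal sketch of that normalization (bandwidth and price for c5n.18xlarge come from the table below; the c5n.9xlarge figures are just the halved slice described above):

```python
# Network bandwidth per dollar: a virtualized slice with half the
# bandwidth at half the price scores exactly the same, so it can be
# dropped from the analysis.
instances = {
    "c5n.18xlarge": (100.0, 3.89),      # Gbit/s, $/h (from the table)
    "c5n.9xlarge":  (50.0,  3.89 / 2),  # half the bandwidth, half the cost
}

for name, (gbit, price) in instances.items():
    print(f"{name}: {gbit / price:.1f} Gbit/s per $/h")  # both ~25.7
```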

Data Set

We use historical data from https://instances.vantage.sh/ and only consider current-generation Intel machines without GPUs. All prices are for us-east-1 Linux instances. Using these constraints, in July 2021 we can pick from the following instances:

| name | vCPU | memory [GB] | network [Gbit/s] | storage | price [$/h] |
| --- | --- | --- | --- | --- | --- |
| m4.16x | 64 | 256 | 25 | - | 3.20 |
| h1.16x | 64 | 256 | 25 | 8x2TB disk | 3.74 |
| c5n.18x | 72 | 192 | 100 | - | 3.89 |
| d3.8x | 32 | 256 | 25 | 24x2TB disk | 4.00 |
| c5.24x | 96 | 192 | 25 | - | 4.08 |
| r4.16x | 64 | 488 | 25 | - | 4.26 |
| m5.24x | 96 | 384 | 25 | - | 4.61 |
| c5d.24x | 96 | 192 | 25 | 4x0.9TB NVMe | 4.61 |
| i3.16x | 64 | 488 | 25 | 8x1.9TB NVMe | 5.00 |
| m5d.24x | 96 | 384 | 25 | 4x0.9TB NVMe | 5.42 |
| d2.8x | 36 | 244 | 10 | 24x2TB disk | 5.52 |
| m5n.24x | 96 | 384 | 100 | - | 5.71 |
| r5.24x | 96 | 768 | 25 | - | 6.05 |
| d3en.12x | 48 | 192 | 75 | 24x14TB disk | 6.31 |
| m5dn.24x | 96 | 384 | 100 | 4x0.9TB NVMe | 6.52 |
| r5d.24x | 96 | 768 | 25 | 4x0.9TB NVMe | 6.91 |
| r5n.24x | 96 | 768 | 100 | - | 7.15 |
| r5b.24x | 96 | 768 | 25 | - | 7.15 |
| r5dn.24x | 96 | 768 | 100 | 4x0.9TB NVMe | 8.02 |
| i3en.24x | 96 | 768 | 100 | 8x7.5TB NVMe | 10.85 |
| x1e.32x | 128 | 3904 | 25 | 2x1.9TB SATA | 26.69 |
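
As a quick illustration of putting the table to work, here is a sketch ranking a few memory-heavy rows by DRAM capacity per dollar (all numbers copied from the table above):

```python
# DRAM capacity per dollar, in GB per ($/h).
rows = [
    ("m4.16x",  256,  3.20),
    ("r5.24x",  768,  6.05),
    ("r5b.24x", 768,  7.15),
    ("x1e.32x", 3904, 26.69),
]

for name, mem_gb, price in sorted(rows, key=lambda r: r[1] / r[2], reverse=True):
    print(f"{name}: {mem_gb / price:.0f} GB per $/h")
# x1e.32x leads with ~146 GB per $/h, ahead of r5.24x at ~127 and
# m4.16x at ~80, consistent with the DRAM section below.
```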


Compute

Using our six-year data set, let's first look at the cost of compute:


It is quite remarkable that from 2015 to 2021, the cost of compute barely changed. During that six-year time frame, the number of cores per server CPU grew significantly, which may imply that Intel compute power is currently overpriced in EC2. In the last couple of years EC2 has introduced cheaper AMD and ARM instances, but it is still surprising that AWS chose to keep Intel CPU prices fixed.

DRAM Capacity

For DRAM, the picture is also quite stagnant:



The introduction of the x1e instances improved the situation a bit, but there has been stagnation since 2018. However, this is less surprising than the CPU situation because commodity DRAM prices in general have not moved much.

Instance Storage

Let's look at instance storage next. EC2 offers instances with disks (about 0.2 GB/s of bandwidth each), SATA SSDs (about 0.5 GB/s), and NVMe SSDs (about 2 GB/s). The introduction of instances with up to 8 NVMe SSDs in 2017 clearly disrupted IO bandwidth, as the chart below shows (its y-axis unit may look odd for bandwidth, but it is correct once we normalize by hourly cost).
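
To unpack that unit, here is a rough sketch (the per-device bandwidths are the approximate figures from the text above; it assumes aggregate bandwidth scales linearly with drive count):

```python
# Aggregate instance-storage bandwidth per dollar, in (GB/s) per ($/h).
PER_DEVICE_GBPS = {"disk": 0.2, "sata": 0.5, "nvme": 2.0}

instances = [
    ("h1.16x",   8, "disk", 3.74),   # 8x2TB disk
    ("i3.16x",   8, "nvme", 5.00),   # 8x1.9TB NVMe
    ("i3en.24x", 8, "nvme", 10.85),  # 8x7.5TB NVMe
    ("x1e.32x",  2, "sata", 26.69),  # 2x1.9TB SATA
]

for name, drives, kind, price in instances:
    total_gbps = drives * PER_DEVICE_GBPS[kind]
    print(f"{name}: {total_gbps / price:.2f} (GB/s) per ($/h)")
# i3.16x reaches ~3.2, roughly 7.5x the disk-based h1.16x at ~0.43,
# which is the 2017 NVMe disruption visible in the chart.
```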



In terms of capacity per dollar, disk is still king, and the d3en instance (introduced in December 2020) totally changed the game:


Network Bandwidth

For network bandwidth, we see another major disruption, this time from the introduction of instances with 100 Gbit/s networking:



The c5n instances, in particular, are clearly a game changer: c5n is only marginally more expensive than c5, but its network is four times faster.
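
A quick back-of-the-envelope check of that claim, using numbers from the table above:

```python
# Network bandwidth per dollar, c5n.18x vs c5.24x (from the table).
c5n = 100 / 3.89  # ~25.7 Gbit/s per $/h
c5 = 25 / 4.08    # ~6.1 Gbit/s per $/h
print(f"c5n delivers {c5n / c5:.1f}x the network bandwidth per dollar")  # ~4.2x
```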

Conclusions

These results show that the hardware landscape is very fluid, and we regularly see major changes like the introduction of NVMe SSDs or 100 Gbit/s networking. Truisms like "in distributed systems, network bandwidth is the bottleneck" can become false! (Network latency is, of course, a different beast.) High-performance systems must therefore take hardware trends into account and adapt to the ever-evolving hardware landscape.



June 25, 2021

Starting with Kafka

I just want to share my thoughts on Kafka after using it for a few months, from a practical point of view. I don't know much beyond the basics ...

June 22, 2021

Leaders, you need to share organization success stories more frequently

This post goes out to anyone who leads a team: managers, directors, VPs, executives. You need to share organization success stories with your organization on a regular and frequent basis. Talk about sales wins, talk about new services released, talk about the positive impact of a recent organizational change. Just get in front of your entire organization and tell them how the organization is making a positive difference.

Do this at least every other week.

And in case it's not clear, by "success stories" I don't mean nonsense or opinions. I mean concrete, measurable things that moved the organization forward.

Everyone in your organization is contributing to these stories, and it's your job to feed them back.

Leaders have a tendency to hear about successes but don't always remember to propagate the stories down. I've been guilty of this myself. This post is your (and my own) friendly reminder.

If you don't keep reminding your folks that their organization is making a positive impact, they're going to forget. You'll miss a freely available chance to reassure your best people.

Talented folks want to be invested in an organization that is succeeding.