a curated list of database news from authoritative sources

November 22, 2022

Introducing VDiff V2

Vitess is a solution that allows you to scale MySQL horizontally while providing clients and apps with a single logical view of the fleet of MySQL instances comprising any number of Keyspaces and Shards. Vitess also provides the cluster and data management tools that make it possible to manage a massive cluster and perform complex workflows using VReplication, such as:

  • Moving tables into Vitess or between keyspaces
  • Resharding to adjust to changes in data size and load
  • Materialized views and rollups for data analytics and data locality
  • Online schema changes that are trackable, cancellable, revertible, and retryable

November 20, 2022

MySQL IOPS for Reads and Surprises

When you think about IOPS, you probably think about writes because MySQL write I/O has a long tradition of optimization, benchmarking, new algorithms, new storage engines, and so forth. There’s no shortage of material on MySQL write I/O; just two examples from Percona are Scaling IO-Bound Workloads for MySQL in the Cloud and Tuning MySQL/InnoDB Flushing for a Write-Intensive Workload. But in this short blog post I highlight two other, less common aspects of MySQL I/O: reads and surprises.

November 18, 2022

PlanetScale and HIPAA

PlanetScale can now enter into Business Associate Agreements (BAA) with customers on an Enterprise plan.

November 17, 2022

November 13, 2022

Writing a SQL database, take two: Zig and RocksDB

For my second project while learning Zig, I decided to port an old, minimal SQL database project from Go to Zig.

In this post, in ~1700 lines of code (yes, I'm sorry it's bigger than my usual), we'll create a basic embedded SQL database in Zig on top of RocksDB. Other than the RocksDB layer, it will not use third-party libraries.

The code for this project is available on GitHub.

Here are a few example interactions we'll support:

$ ./main --database data --script <(echo "CREATE TABLE y (year int, age int, name text)")
echo "CREATE TABLE y (year int, age int, name text)"
ok
$ ./main --database data --script <(echo "INSERT INTO y VALUES (2010, 38, 'Gary')")
echo "INSERT INTO y VALUES (2010, 38, 'Gary')"
ok
$ ./main --database data --script <(echo "INSERT INTO y VALUES (2021, 92, 'Teej')")
echo "INSERT INTO y VALUES (2021, 92, 'Teej')"
ok
$ ./main --database data --script <(echo "INSERT INTO y VALUES (1994, 18, 'Mel')")
echo "INSERT INTO y VALUES (1994, 18, 'Mel')"
ok

# Basic query
$ ./main --database data --script <(echo "SELECT name, age, year FROM y")
echo "SELECT name, age, year FROM y"
| name          |age            |year           |
+ ====          +===            +====           +
| Mel           |18             |1994           |
| Gary          |38             |2010           |
| Teej          |92             |2021           |

# With WHERE
$ ./main --database data --script <(echo "SELECT name, year, age FROM y WHERE age < 40")
echo "SELECT name, year, age FROM y WHERE age < 40"
| name          |year           |age            |
+ ====          +====           +===            +
| Mel           |1994           |18             |
| Gary          |2010           |38             |

# With operations
$ ./main --database data --script <(echo "SELECT 'Name: ' || name, year + 30, age FROM y WHERE age < 40")
echo "SELECT 'Name: ' || name, year + 30, age FROM y WHERE age < 40"
| unknown               |unknown                |age            |
+ =======               +=======                +===            +
| Name: Mel             |2024           |18             |
| Name: Gary            |2040           |38             |

This post is standalone (except for the RocksDB layer which I wrote about here) but it builds on a number of ideas I've explored that you may be interested in:

This project is mostly a port of my SQL database from scratch in Go project, but unlike that series this project will have persistent storage via RocksDB.

And unlike that post, this project is written in Zig!

Let's get started. :)

Components

We're going to split up the project into the following major components:

  • Lexing
  • Parsing
  • Storage
    • RocksDB
  • Execution
  • Entrypoint (main)

Lexing takes a query and breaks it into an array of tokens.

Parsing takes the lexed array of tokens and pattern-matches it into a syntax tree (AST).

Storage maps high-level SQL entities like tables and rows into bytes that can be easily stored on disk. And it handles recovering high-level tables and rows from bytes on disk.

Invisible to users of the Storage component is RocksDB, which is how the bytes are actually stored on disk. RocksDB is a persistent store that maps arbitrary byte keys to arbitrary byte values. We'll use it for storing and recovering both table metadata and actual row data.
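
As a purely hypothetical illustration (this is not the actual encoding, which comes later in the Storage section), a table and its rows might map onto RocksDB keys and values roughly like:

tbl_y          -> year,int;age,int;name,text    (table metadata)
row_y_<row id> -> 2010;38;Gary                  (one serialized row)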

Execution takes a query AST and executes it against Storage, potentially returning result rows.

These terms are a vast simplification of real-world database design. But they provide helpful structure even in a project this small.

Memory Management

Zig doesn't have a garbage collector. Mitchell Hashimoto wrote bindings to Boehm GC. But Zig also has a builtin Arena allocator, which is perfect for this simple project.

The main function will create the arena and pass it to each component, where they can do allocations as they please. At the end of main, the entire arena will be freed at once.
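
A minimal sketch of that setup (assuming a ~0.10-era Zig standard library, which is roughly what the rest of the post's code is written against):

const std = @import("std");

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    // This one call frees everything ever allocated from the arena.
    defer arena.deinit();

    const allocator = arena.allocator();

    // Components receive `allocator` and never free individual allocations.
    var tokens = std.ArrayList(u8).init(allocator);
    try tokens.append('x');
}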

The only other place where we must do manual memory management is in the RocksDB wrapper. But I've already covered that in a separate post.

Zig Specifics

I'm not going to cover the basics of Zig syntax. If you are new to Zig, read this first! (It's short.)

Now that we've got the basic idea, we can start coding!

Types (types.zig, 10 LoC)

Let's create a few helper types that we'll use in the rest of the code.

pub const String = []const u8;

pub const Error = String;

pub fn Result(comptime T: type) type {
    return union(enum) {
        val: T,
        err: Error,
    };
}

That's it. :) Makes things a little more readable.
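
To see how this gets used, here's a tiny sketch of a fallible function returning Result (divide is just a made-up example, not part of the project):

fn divide(a: i64, b: i64) Result(i64) {
    if (b == 0) return .{ .err = "Division by zero" };
    return .{ .val = @divTrunc(a, b) };
}

// Callers switch on the union:
// switch (divide(10, 2)) {
//     .val => |n| std.debug.print("Got {}\n", .{n}),
//     .err => |e| std.debug.print("Error: {s}\n", .{e}),
// }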

Lexing (lex.zig, 308 LoC)

Lexing turns a query string into an array of tokens.

There are a few kinds of tokens we'll define:

  • Keywords (like CREATE, true, false, null)
  • Syntax (commas, parentheses, operators, and all other builtin symbols)
  • Strings
  • Integers
  • Identifiers

And not listed there but important to skip past is whitespace.
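
So a query like SELECT age FROM y WHERE age < 40 should lex into a token stream along these lines:

select_keyword  identifier(age)  from_keyword  identifier(y)
where_keyword   identifier(age)  lt_operator   integer(40)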

Let's turn this into a Zig struct!

const std = @import("std");

const Error = @import("types.zig").Error;
const String = @import("types.zig").String;

pub const Token = struct {
    start: u64,
    end: u64,
    kind: Kind,
    source: String,

    pub const Kind = enum {
        // Keywords
        select_keyword,
        create_table_keyword,
        insert_keyword,
        values_keyword,
        from_keyword,
        where_keyword,

        // Operators
        plus_operator,
        equal_operator,
        lt_operator,
        concat_operator,

        // Other syntax
        left_paren_syntax,
        right_paren_syntax,
        comma_syntax,

        // Literals
        identifier,
        integer,
        string,
    };

    pub fn string(self: Token) String {
        return self.source[self.start..self.end];
    }

Using an enum helps us with type safety. And since we're storing location in the token, we can build a nice debug function for when lexing or parsing fails.

    fn debug(self: Token, msg: String) void {
        var line: usize = 0;
        var column: usize = 0;
        var lineStartIndex: usize = 0;
        var lineEndIndex: usize = 0;
        var i: usize = 0;
        var source = self.source;
        while (i < source.len) {
            if (source[i] == '\n') {
                line = line + 1;
                column = 0;
                lineStartIndex = i;
            } else {
                column = column + 1;
            }

            if (i == self.start) {
                // Find the end of the line
                lineEndIndex = i;
                while (source[lineEndIndex] != '\n') {
                    lineEndIndex = lineEndIndex + 1;
                }
                break;
            }

            i = i + 1;
        }

        std.debug.print(
            "{s}\nNear line {}, column {}.\n{s}\n",
            .{ msg, line + 1, column, source[lineStartIndex..lineEndIndex] },
        );
        while (column - 1 > 0) {
            std.debug.print(" ", .{});
            column = column - 1;
        }
        std.debug.print("^ Near here\n\n", .{});
    }
};

And similarly, let's add a debug helper for when we're dealing with an array of tokens.

pub fn debug(tokens: []Token, preferredIndex: usize, msg: String) void {
    var i = preferredIndex;
    while (i >= tokens.len) {
        i = i - 1;
    }

    tokens[i].debug(msg);
}
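
To make that concrete: lexing a script containing a stray character, say SELECT nam$ FROM y, ends up calling debug on the last good token (the identifier nam) and prints something along these lines, in addition to returning the "Bad token" error to the caller:

Last good token.

Near line 1, column 8.
SELECT nam$ FROM y
       ^ Near here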

Token <> String Mapping

Before we get too far from Token definition, let's define a mapping from the Token.kind enum to strings we can see in a query.

const Builtin = struct {
    name: String,
    kind: Token.Kind,
};

// These must be sorted by length of the name text, descending, for lexKeyword.
var BUILTINS = [_]Builtin{
    .{ .name = "CREATE TABLE", .kind = Token.Kind.create_table_keyword },
    .{ .name = "INSERT INTO", .kind = Token.Kind.insert_keyword },
    .{ .name = "SELECT", .kind = Token.Kind.select_keyword },
    .{ .name = "VALUES", .kind = Token.Kind.values_keyword },
    .{ .name = "WHERE", .kind = Token.Kind.where_keyword },
    .{ .name = "FROM", .kind = Token.Kind.from_keyword },
    .{ .name = "||", .kind = Token.Kind.concat_operator },
    .{ .name = "=", .kind = Token.Kind.equal_operator },
    .{ .name = "+", .kind = Token.Kind.plus_operator },
    .{ .name = "<", .kind = Token.Kind.lt_operator },
    .{ .name = "(", .kind = Token.Kind.left_paren_syntax },
    .{ .name = ")", .kind = Token.Kind.right_paren_syntax },
    .{ .name = ",", .kind = Token.Kind.comma_syntax },
};

We'll use this in a few lexing functions below.

Whitespace

Outside of tokens, we need to be able to skip past whitespace.

fn eatWhitespace(source: String, index: usize) usize {
    var res = index;
    while (source[res] == ' ' or
        source[res] == '\n' or
        source[res] == '\t' or
        source[res] == '\r')
    {
        res = res + 1;
        if (res == source.len) {
            break;
        }
    }

    return res;
}

All lexing functions will look like this. They'll take the source as one argument and a cursor to the current index in the source as another.

Keywords

Let's handle lexing keyword tokens next. Keywords are case insensitive. I don't think there's a builtin case insensitive string comparison function in Zig. So let's write that first.

fn asciiCaseInsensitiveEqual(left: String, right: String) bool {
    var min = left;
    if (right.len < left.len) {
        min = right;
    }

    for (min) |_, i| {
        var l = left[i];
        if (l >= 97 and l <= 122) {
            l = l - 32;
        }

        var r = right[i];
        if (r >= 97 and r <= 122) {
            r = r - 32;
        }

        if (l != r) {
            return false;
        }
    }

    return true;
}

Unfortunately it only supports ASCII for now.
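
A couple of quick checks of how it behaves:

// asciiCaseInsensitiveEqual("select", "SELECT") -> true   (97..122 is 'a'..'z'; subtracting 32 uppercases)
// asciiCaseInsensitiveEqual("Select", "SELECT") -> true
// asciiCaseInsensitiveEqual("INSERT", "SELECT") -> false

Note that it only compares up to the length of the shorter string; that's fine for our purposes because lexKeyword below always passes slices of equal length.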

Now we can write a simple longest-matching-substring function. It is simple because the keyword mapping we set up above is already ordered by length descending.

fn lexKeyword(source: String, index: usize) struct { nextPosition: usize, token: ?Token } {
    var longestLen: usize = 0;
    var kind = Token.Kind.select_keyword;
    for (BUILTINS) |builtin| {
        if (index + builtin.name.len >= source.len) {
            continue;
        }

        if (asciiCaseInsensitiveEqual(source[index .. index + builtin.name.len], builtin.name)) {
            longestLen = builtin.name.len;
            kind = builtin.kind;
            // First match is the longest match
            break;
        }
    }

    if (longestLen == 0) {
        return .{ .nextPosition = 0, .token = null };
    }

    return .{
        .nextPosition = index + longestLen,
        .token = Token{
            .source = source,
            .start = index,
            .end = index + longestLen,
            .kind = kind,
        },
    };
}

That's it!
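
For example (a sketch of the behavior, not code from the post):

// lexKeyword("SELECT * FROM y", 0)
//   -> token kind select_keyword covering "SELECT", nextPosition 6
// lexKeyword("age < 40", 0)
//   -> no keyword starts here, so .{ .nextPosition = 0, .token = null }

One subtlety: because of the index + builtin.name.len >= source.len check, a keyword that runs flush against the very end of the source won't match. In practice the scripts fed in via echo end with a newline, so this doesn't come up.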

Integers

For integers we read through the source until we stop seeing decimal digits. Obviously this is a subset of what people consider integers, but it will do for now!

fn lexInteger(source: String, index: usize) struct { nextPosition: usize, token: ?Token } {
    var start = index;
    var end = index;
    var i = index;
    while (source[i] >= '0' and source[i] <= '9') {
        end = end + 1;
        i = i + 1;
    }

    if (start == end) {
        return .{ .nextPosition = 0, .token = null };
    }

    return .{
        .nextPosition = end,
        .token = Token{
            .source = source,
            .start = start,
            .end = end,
            .kind = Token.Kind.integer,
        },
    };
}
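
For example:

// lexInteger("2010, 38", 0) -> token "2010" (kind integer), nextPosition 4
// lexInteger("name", 0)     -> not a digit, so .{ .nextPosition = 0, .token = null }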

Strings

Strings are enclosed in single quotes.

fn lexString(source: String, index: usize) struct { nextPosition: usize, token: ?Token } {
    var i = index;
    if (source[i] != '\'') {
        return .{ .nextPosition = 0, .token = null };
    }
    i = i + 1;

    var start = i;
    var end = i;
    while (source[i] != '\'') {
        end = end + 1;
        i = i + 1;
    }

    if (source[i] == '\'') {
        i = i + 1;
    }

    if (start == end) {
        return .{ .nextPosition = 0, .token = null };
    }

    return .{
        .nextPosition = i,
        .token = Token{
            .source = source,
            .start = start,
            .end = end,
            .kind = Token.Kind.string,
        },
    };
}
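
For example:

// lexString("'Gary', 'Teej'", 0) -> token "Gary" (kind string, quotes excluded), nextPosition 6
// lexString("Gary", 0)           -> no opening quote, so .{ .nextPosition = 0, .token = null }

(Note that an empty literal '' also hits the start == end case and lexes as no token.)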

Identifiers

Identifiers for this project are unquoted runs of ASCII letters (plus *, so that SELECT * can lex). We could support more by optionally checking for double-quote-enclosed strings. But I'll leave that as an exercise for the reader.

fn lexIdentifier(source: String, index: usize) struct { nextPosition: usize, token: ?Token } {
    var start = index;
    var end = index;
    var i = index;
    while ((source[i] >= 'a' and source[i] <= 'z') or
        (source[i] >= 'A' and source[i] <= 'Z') or
        (source[i] == '*'))
    {
        end = end + 1;
        i = i + 1;
    }

    if (start == end) {
        return .{ .nextPosition = 0, .token = null };
    }

    return .{
        .nextPosition = end,
        .token = Token{
            .source = source,
            .start = start,
            .end = end,
            .kind = Token.Kind.identifier,
        },
    };
}
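
For example:

// lexIdentifier("name, age", 0) -> token "name" (kind identifier), nextPosition 4
// lexIdentifier("* FROM y", 0)  -> token "*" (kind identifier), nextPosition 1
// lexIdentifier("42", 0)        -> no letters, so .{ .nextPosition = 0, .token = null }

Treating * as an identifier is what lets SELECT * lex at all.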

lex

Now we can pull together all these helper functions in a public entrypoint for lexing.

It will loop through a query string, eating whitespace and checking for tokens. It will continue until it hits the end of the query string. If it ever can't make progress, it fails.

pub fn lex(source: String, tokens: *std.ArrayList(Token)) ?Error {
    var i: usize = 0;
    while (true) {
        i = eatWhitespace(source, i);
        if (i >= source.len) {
            break;
        }

        const keywordRes = lexKeyword(source, i);
        if (keywordRes.token) |token| {
            tokens.append(token) catch return "Failed to allocate space for keyword token";
            i = keywordRes.nextPosition;
            continue;
        }

        const integerRes = lexInteger(source, i);
        if (integerRes.token) |token| {
            tokens.append(token) catch return "Failed to allocate space for integer token";
            i = integerRes.nextPosition;
            continue;
        }

        const stringRes = lexString(source, i);
        if (stringRes.token) |token| {
            tokens.append(token) catch return "Failed to allocate space for string token";
            i = stringRes.nextPosition;
            continue;
        }

        const identifierRes = lexIdentifier(source, i);
        if (identifierRes.token) |token| {
            tokens.append(token) catch return "Failed to allocate space for identifier token";
            i = identifierRes.nextPosition;
            continue;
        }

        if (tokens.items.len > 0) {
            debug(tokens.items, tokens.items.len - 1, "Last good token.\n");
        }
        return "Bad token";
    }

    return null;
}
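
Here's roughly how main will end up driving it, as a sketch (assuming lex.zig is imported as lex and allocator is the arena allocator from earlier):

var tokens = std.ArrayList(lex.Token).init(allocator);
const err = lex.lex("SELECT name FROM y\n", &tokens);
if (err) |e| {
    std.debug.print("Failed to lex: {s}\n", .{e});
} else {
    for (tokens.items) |token| {
        std.debug.print("{s}: '{s}'\n", .{ @tagName(token.kind), token.string() });
    }
}
// Prints:
// select_keyword: 'SELECT'
// identifier: 'name'
// from_keyword: 'FROM'
// identifier: 'y'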

That's it for lexing! Now we can do parsing.

Parsing (parse.zig, 407 LoC)

Parsing takes an array of tokens from the lexing stage and discovers the tree structure in them that maps to a predefined syntax tree (AST).

If it can't discover a valid tree from the array of tokens, it fails.

Let's set up the basics of the Parser struct:

const std = @import("std");

const lex = @import("lex.zig");
const Result = @import("types.zig").Result;

const Token = lex.Token;

pub const Parser = struct {
    allocator: std.mem.Allocator,

    pub fn init(allocator: std.mem.Allocator) Parser {
        return Parser{ .allocator = allocator };
    }

    fn expectTokenKind(tokens: []Token, index: usize, kind: Token.Kind) bool {
        if (index >= tokens.len) {
            return false;
        }

        return tokens[index].kind == kind;
    }

Expressions

Expressions are at the bottom of the syntax tree.

They can be:

  • Literals (like strings, integers, booleans, etc.)
  • Binary operations

Let's define these in Zig:

    pub const BinaryOperationAST = struct {
        operator: Token,
        left: *ExpressionAST,
        right: *ExpressionAST,

        fn print(self: BinaryOperationAST) void {
            self.left.print();
            std.debug.print(" {s} ", .{self.operator.string()});
            self.right.print();
        }
    };

    pub const ExpressionAST = union(enum) {
        literal: Token,
        binary_operation: BinaryOperationAST,

        fn print(self: ExpressionAST) void {
            switch (self) {
                .literal => |literal| switch (literal.kind) {
                    .string => std.debug.print("'{s}'", .{literal.string()}),
                    else => std.debug.print("{s}", .{literal.string()}),
                },
                .binary_operation => self.binary_operation.print(),
            }
        }
    };
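
For example, the expression age < 40 from the queries at the top ends up shaped roughly like this (shown as pseudo-Zig; the left and right sides are heap-allocated in practice, since the struct stores pointers):

// ExpressionAST{
//     .binary_operation = BinaryOperationAST{
//         .operator = (the `<` token),
//         .left = &ExpressionAST{ .literal = (the `age` identifier token) },
//         .right = &ExpressionAST{ .literal = (the `40` integer token) },
//     },
// }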

Now we can attempt to parse either of these from an array of tokens.

    fn parseExpression(self: Parser, tokens: []Token, index: usize) Result(struct {
        ast: ExpressionAST,
        nextPosition: usize,
    }) {
        var i = index;

        var e: ExpressionAST = undefined;

        if (expectTokenKind(tokens, i, Token.Kind.integer) or
            expectTokenKind(tokens, i, Token.Kind.identifier) or
            expectTokenKind(tokens, i, Token.Kind.string))
        {
            e = ExpressionAST{ .literal = tokens[i] };
            i = i + 1;
        } else {
            return .{ .err = "No expression" };
        }

        if (expectTokenKind(tokens, i, Token.Kind.equal_operator) or
            expectTokenKind(tokens, i, Token.Kind.lt_opera... (truncated)

November 11, 2022

November 08, 2022