a curated list of database news from authoritative sources

June 22, 2019

Writing a lisp compiler from scratch in JavaScript: 5. LLVM system calls

Previously in compiler basics:
1. lisp to assembly
2. user-defined functions and variables
3. LLVM
4. LLVM conditionals and compiling fibonacci
Next in compiler basics:
6. an x86 upgrade

In this post we'll extend the ulisp compiler's LLVM backend to support printing integers to stdout.

Exit code limitations

Until now we've validated program state by setting the exit code to the result of the program computation. But the exit code is an eight-bit integer. What if we want to validate a computation that produces a result larger than 255?

To do this we need a way to print integers. This is challenging because printing normally deals with byte arrays. libc's printf, for example, takes a byte array as its first argument.

The shortest path forward is to add support for system calls so we can print one character at a time. Here's a version of a print form that hacks around not having arrays by sending each digit of a number to stdout individually.

(def print-char (c)
     ; First argument is stdout
     ; Second argument is a pointer to a char array (of length one)
     ; Third argument is the length of the char array
     (syscall/sys_write 1 &c 1))

(def print (n)
     (if (> n 9)
         (print (/ n 10)))

     ; 48 is the ASCII code for '0'
     (print-char (+ 48 (% n 10))))

In order to support this we need to add the syscall/sys_write, >, %, and / builtin forms. We'll also need to add support for taking the address of a variable.
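
For intuition, here is the same digit-printing recursion in plain JavaScript (a sketch using Node's process.stdout rather than a raw syscall):

// Recurse to emit the most significant digits first, then emit
// this call's own digit.
function print(n) {
  if (n > 9) {
    print(Math.floor(n / 10));
  }

  // 48 is the ASCII code for '0'
  process.stdout.write(String.fromCharCode(48 + (n % 10)));
}

print(1234); // prints "1234"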

All code is available on GitHub, as is the particular commit related to this post.

References

The sys_write syscall requires us to pass the memory address of the byte array to write. We don't support arrays, but we can treat an individual variable as an array of length one by passing the variable's address.

If we were compiling to C we could just pass the address of a local variable. But LLVM doesn't allow us to take the address of variables directly. We need to push the variable onto the LLVM stack to get an address.

Under the hood LLVM will likely optimize this into a local variable reference instead of first pushing to the stack.

Since LLVM IR is typed, the value representing the address of a local variable will be a pointer type. We'll need to refer to all uses of this value as a pointer. So we'll need to modify ulisp to track local types rather than hard-coding i64 everywhere.

Scope

To begin we'll modify the Scope class to track types. We only need to do this on registration. Afterward, we'll have to find all uses of local variables to make sure they use the local's value and type fields appropriately.

class Scope {

  ...

  register(local) {
    let copy = local.replace('-', '_');
    let n = 1;
    while (this.locals[copy]) {
      copy = local.replace('-', '_') + n++;
    }

    this.locals[local] = {
      value: copy,
      type: 'i64',
    };
    return this.locals[local];
  }

  ...

}

We won't go through every use of a Scope variable in this post, but you can find it in the related commit to ulisp.
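
As a rough sketch of what those uses look like (the accessor name here is illustrative, not necessarily ulisp's), a call site changes from interpolating a bare string with a hard-coded i64 to reading both fields:

// Before: a local was a plain string and i64 was hard-coded.
this.emit(1, `%${destination} = add i64 %${local}, 0`);

// After: a local is { value, type }, so uses read both fields.
const local = context.scope.get(name); // hypothetical accessor
this.emit(1, `%${destination.value} = add ${local.type} %${local.value}, 0`);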

Reference

The long-term approach for handling a reference syntactically is probably to rewrite &x to (& x) in the parser. The lazy approach we'll take for now is to handle a reference as a special kind of identifier in compileExpression.

We'll use the LLVM alloca instruction to create space on the stack. This will return a pointer and will turn the destination variable into a pointer type. Then we'll use store to set the value at the address to the current value of the variable being referenced.

class Compiler {

  ...

  compileExpression(exp, destination, context) {

    ...

    // Is a reference, push onto the stack and return its address
    if (exp.startsWith('&')) {
      const symbol = exp.substring(1);
      const tmp = context.scope.symbol();
      this.compileExpression(symbol, tmp, context);
      this.emit(1, `%${destination.value} = alloca ${tmp.type}, align 4`);
      destination.type = tmp.type + '*';
      this.emit(1, `store ${tmp.type} %${tmp.value}, ${destination.type} %${destination.value}, align 4`);
      return;
    }

    ...

  }

  ...

}

And now we're set to take the address of any local variable.

System calls

LLVM IR provides no high-level means for making system calls. The only way is to use inline assembly. The syntax is based on GCC inline assembly and is confusing, sparsely documented by example, and quick to produce unhelpful error messages.

Thankfully the assembly code needed for a syscall is only one line, one word: the syscall assembly instruction. We use inline assembly's variable-to-register mapping functionality to line up all the parameters for the syscall. Here is an example:

%result = call i64 asm sideeffect "syscall", "=r,{rax},{rdi},{rsi},{rdx}" (i64 %raxArg, i64 %rdiArg, i64 %rsiArg, i64 %rdxArg)

This says to execute the inline assembly string, "syscall". The sideeffect flag means that this assembly should always be run even if the result isn't used. =r means the inline assembly returns a value, and the rest of the string is the list of registers that arguments should be mapped to. Finally we call the function with all the LLVM variables we want to be mapped.

Eventually we should also use the inline assembly syntax to list registers that are modified so that LLVM can know to save them before and after.

Code

We'll add a mapping for syscall/sys_write and a helper function for generating syscall code, using the example above as a template. We'll support 64-bit Darwin and Linux kernels.

const SYSCALL_TABLE = {
  darwin: {
    sys_write: 0x2000004,
    sys_exit: 0x2000001,
  },
  linux: {
    sys_write: 1,
    sys_exit: 60,
  },
}[process.platform];

class Compiler {
  constructor() {
    this.outBuffer = [];
    this.primitiveFunctions = {
      def: this.compileDefine.bind(this),
      begin: this.compileBegin.bind(this),
      'if': this.compileIf.bind(this),
      '+': this.compileOp('add'),
      '-': this.compileOp('sub'),
      '*': this.compileOp('mul'),
      '%': this.compileOp('urem'),
      '<': this.compileOp('icmp slt'),
      '=': this.compileOp('icmp eq'),
      'syscall/sys_write': this.compileSyscall(SYSCALL_TABLE.sys_write),
    };
  }

  ...

  compileSyscall(id) {
    return (args, destination, context) => {
      const argTmps = args.map((arg) => {
          const tmp = context.scope.symbol();
          this.compileExpression(arg, tmp, context);
          return tmp.type + ' %' + tmp.value;
      }).join(', ');
      const regs = ['rdi', 'rsi', 'rdx', 'r10', 'r8', 'r9'];
      const params = args.map((arg, i) => `{${regs[i]}}`).join(',');
      const idTmp = context.scope.symbol().value;
      this.emit(1, `%${idTmp} = add i64 ${id}, 0`);
      this.emit(1, `%${destination.value} = call ${destination.type} asm sideeffect "syscall", "=r,{rax},${params},~{dirflag},~{fpsr},~{flags}" (i64 %${idTmp}, ${argTmps})`);
    }
  }
}

>, /

Finally, we have a few new operations to add support for. But they'll be pretty simple using the compileOp helper function.

class Compiler {
  constructor() {
    this.outBuffer = [];
    this.primitiveFunctions = {
      def: this.compileDefine.bind(this),
      begin: this.compileBegin.bind(this),
      'if': this.compileIf.bind(this),
      '+': this.compileOp('add'),
      '-': this.compileOp('sub'),
      '*': this.compileOp('mul'),
      '/': this.compileOp('udiv'),
      '%': this.compileOp('urem'),
      '<': this.compileOp('icmp slt'),
      '>': this.compileOp('icmp sgt'),
      '=': this.compileOp('icmp eq'),
      'syscall/sys_write': this.compileSyscall(SYSCALL_TABLE.sys_write),
    };
  }

  ...

}

print

We're ready to give our print function a shot.

$ cat test.lisp
(def print-char (c)
     ; First argument is stdout
     ; Second argument is a pointer to a char array (of length one)
     ; Third argument is the length of the char array
     (syscall/sys_write 1 &c 1))

(def print (n)
     (if (> n 9)
         (print (/ n 10)))

     ; 48 is the ASCII code for '0'
     (print-char (+ 48 (% n 10))))

(def main ()
     (print 1234)
     0)
$ node ulisp.js test.lisp
$ ./build/a.out
1234

Looks good! In the next post we'll talk about tail call elimination.

June 17, 2019

The Benefits of Unsharded Vitess

For many large companies seeking help with horizontal scaling, Vitess' value proposition is easily understood; running stateful workloads at astronomical scale is a hard problem that Vitess has boldly solved in the past. However, for businesses that aren't hitting the performance limitations of standard MySQL, it may seem difficult to justify placing seemingly complex middleware in your data path with no immediate reward. I'm here to show you why unsharded Vitess is not just a pre-optimization for future horizontal scaling - it provides many upgrades to the MySQL experience.

June 04, 2019

Tinybird Changelog: Faster CSV Imports

During the last couple of weeks, we’ve made major improvements to our CSV import process, increasing its speed and reliability and achieving almost a 2x performance improvement.

May 21, 2019

Writing an x86 emulator from scratch in JavaScript: 1. a stack and register machine

Better yet, take a look at this post walking through emulating x86 ELF binaries in Go:
Emulating linux/AMD64 userland: interpreting an ELF binary

Next up in emulator basics:
2. system calls

In this post we'll create a small virtual machine in JavaScript and use it to run a simple C program compiled with GCC for an x86_64 (or AMD64) CPU running Linux.

All source code is available on GitHub.

Virtual machine data storage

Our virtual machine will have two means of storing data: registers and an integer stack. Each register can store a 64-bit integer. The stack is an array of 8-bit (or 1 byte) integers.

We'll make the following registers available for modification and use by the program(mer):

RDI, RSI, RSP, RBP, RAX, RBX, RCX, RDX, R8, R9, R10, R11, R12, R13, R14, R15

The RSP register is used by the virtual machine for tracking the location of the last entry in the stack. It will be modified by the virtual machine when it encounters the POP, PUSH, CALL and RET instructions we'll support. We'll get into the specifics shortly.

And we'll make the following registers available for use (but not modification) by the program(mer):

RIP, CS, DS, FS, SS, ES, GS, CF, ZF, PF, AF, SF, TF, IF, DF, OF

Each of these has a special meaning but we'll focus on RIP. The RIP register contains the address of the instruction currently being interpreted by our virtual machine. After every instruction the virtual machine will increment the value in this register -- except for a few special instructions like CALL and RET.

Memory addresses

It will become useful to provide direct access to memory with a special syntax. We'll focus just on 64-bit addresses that will look like this:

  MOV QWORD PTR [RBP - 8], 12

This asks for the value 12 to be written into the memory address at RBP - 8 bytes. The QWORD PTR part clarifies that we want to write 8 bytes worth of the value. Since 12 fits in fewer than 8 bytes, the remaining bytes will be filled with zeros.

  ADD RAX, QWORD PTR [RBP - 8]

This asks for eight bytes starting from the memory address RBP - 8 to be added to the value in RAX and stored back in RAX.
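
Since x86 is little-endian, reading those eight bytes amounts to something like the following sketch (assuming memory is an array of byte-sized BigInts, as our implementation below will set up):

// Read 8 bytes starting at RBP - 8, least significant byte first,
// then add the result into RAX.
const address = registers.RBP - 8n;
let value = 0n;
for (let i = 0n; i < 8n; i++) {
  value |= (memory[address + i] || 0n) << (i * 8n);
}
registers.RAX += value;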

Virtual machine instruction set

In our virtual machine we'll define support for the following instructions:

  • MOV $REGISTER, $REGISTER or $MEMORY ADDRESS or $LITERAL NUMBER
    • This instruction copies the second value into the first.
  • ADD $REGISTER, $REGISTER or $MEMORY ADDRESS
    • This instruction adds the second value into the first and stores the result into the first.
  • PUSH $REGISTER
    • This instruction will decrement the RSP register by 8 bytes and store the value at the bottom of the stack.
  • POP $REGISTER
    • This instruction will increment the RSP register by 8 bytes, remove the last element in the stack (at the bottom), and store it into the register.
  • CALL $LABEL
    • This instruction will push the value in the RIP register (plus one) onto the stack and set the RIP register to the line of code of the label. More on this later.
  • RET
    • This instruction will remove the value at the bottom of the stack and store it in the RIP register.

Now we have more than enough instructions to write some interesting programs for the virtual machine.
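
To make the stack instructions concrete before we implement them, here is the PUSH/POP pattern in isolation, written in terms of the memory helpers we'll define later (a sketch, not the final interpreter code):

// The stack grows downward: PUSH moves RSP toward lower addresses
// and writes the value at the new location.
function push(process, value) {
  process.registers.RSP -= 8n;
  writeMemoryBytes(process, process.registers.RSP, value, 8);
}

// POP reads the most recently pushed value and moves RSP back up.
function pop(process) {
  const value = readMemoryBytes(process, process.registers.RSP, 8);
  process.registers.RSP += 8n;
  return value;
}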

Virtual machine semantics

We'll make one last assumption before explaining further. In our programs, there must be a main label which must contain a RET instruction. Once we hit the terminal RET, we will exit the virtual machine and set the exit code to the value stored in the RAX register.

Let's look at a simple program:

main: ; the required main label
  MOV RAX, 1 ; store 1 in RAX
  MOV RDI, 2 ; store 2 in RDI
  ADD RAX, RDI ; store the result of adding RAX and RDI in RAX
  RET ; give control back to the virtual machine

When we run this program, first we initialize a stack (we'll give it 1000 elements) and set the RSP register to 1000 (the top of the stack). Then we look for the main label and set the RIP register to 1, the line number after the one where the label appears (0). Then, until the RIP register becomes 1000 (a sentinel value we plant on the stack for main's final RET to pop), we interpret the instruction at the line number stored in the RIP register. Once the RIP register hits 1000, we exit the program, setting RAX as the exit code.

One more example

Now let's look at one more program:

plus:
  ADD RDI, RSI
  MOV RAX, RDI
  RET

main:
  MOV RDI, 1
  MOV RSI, 2
  CALL plus
  RET

Our virtual machine will start at the line after the main label. Then it will store 1 into RDI and 2 into RSI. Then it will jump to the second line in the program to add RDI and RSI and store the result in RDI. Then it will copy RDI into RAX and return control to the final line. This last RET will in turn return control to the virtual machine. Then the program will exit with exit code 3.

Parsing

Now that we've finished up describing our virtual machine language and semantics, we need to parse the instructions into a format we can easily interpret.

To do this we'll iterate over the program, skipping any lines that start with a dot. These are assembler directives that are safe for us to ignore for now. We'll also remove any characters from a semi-colon or hash-tag to the end of the line. These are comments.

We'll store a dictionary mapping label names (without the colon) to the index of the instruction that follows the label.

And we'll store the instructions as an array of objects composed of an operation and optional operands.

Code

function parse(program) {
  const labels = {};
  const instructions = [];

  const lines = program.split('\n');

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace
    // TODO: handle each line
  }

  return { labels, instructions };
}

First let's handle the directives we want to ignore:

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.')) {
      continue;
    }
  }

And then comments:

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
      continue;
    }

    if (line.includes(';')) {
      line = line.split(';')[0];
    }

    if (line.includes('#')) {
      line = line.split('#')[0];
    }

    if (!line) {
      continue;
    }
  }

And then labels:

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
      continue;
    }

    if (line.includes(';')) {
      line = line.split(';')[0];
    }

    if (line.includes('#')) {
      line = line.split('#')[0];
    }

    if (!line) {
      continue;
    }

    if (line.includes(':')) {
      const label = line.split(':')[0];
      labels[label] = instructions.length;
      continue;
    }
  }

And finally instruction parsing plus the rest:

function parse(program) {
  const labels = {};
  const instructions = [];

  const lines = program.split('\n');

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
      continue;
    }

    if (line.includes(';')) {
      line = line.split(';')[0];
    }

    if (line.includes('#')) {
      line = line.split('#')[0];
    }

    if (!line) {
      continue;
    }

    if (line.includes(':')) {
      const label = line.split(':')[0];
      labels[label] = instructions.length;
      continue;
    }

    const operation = line.split(/\s/)[0].toLowerCase();
    const operands = line.substring(operation.length).split(',').map(t => t.trim());
    instructions.push({
      operation,
      operands,
    });
  }

  return { labels, instructions };
}

Hurray! A brittle parser.
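
As a quick sanity check, feeding the two-label program from earlier through parse produces:

const { labels, instructions } = parse(`plus:
  ADD RDI, RSI
  MOV RAX, RDI
  RET

main:
  MOV RDI, 1
  MOV RSI, 2
  CALL plus
  RET`);

console.log(labels);          // { plus: 0, main: 3 }
console.log(instructions[0]); // { operation: 'add', operands: [ 'RDI', 'RSI' ] }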

Interpreting

We've already described the semantics a few times. So let's get started with the foundation and initialization.

We'll use BigInts because JavaScript numbers can only represent integers precisely up to 53 bits. This isn't incredibly important in our basic programs, but it would quickly become painful without them.
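
A quick demonstration of the precision limit:

// Numbers silently lose integer precision past 2^53; BigInts do not.
console.log(2 ** 53 === 2 ** 53 + 1);      // true (!)
console.log(2n ** 53n === 2n ** 53n + 1n); // false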

And we'll make process memory available as an array of 8-bit integers. In order to make this easy to use, we'll also provide helper functions for writing to and reading from memory.

const fs = require('fs');

const REGISTERS = [
  'RDI', 'RSI', 'RSP', 'RBP', 'RAX', 'RBX', 'RCX', 'RDX', 'RIP', 'R8',
  'R9', 'R10', 'R11', 'R12', 'R13', 'R14', 'R15', 'CS', 'DS', 'FS',
  'SS', 'ES', 'GS', 'CF', 'ZF', 'PF', 'AF', 'SF', 'TF', 'IF', 'DF', 'OF',
];

function writeMemoryBytes(process, address, value, size) {
  // Write `size` bytes little-endian, least significant byte first.
  for (let i = 0n; i < size; i++) {
    process.memory[address + i] = (value >> (i * 8n)) & 0xFFn;
  }
}

function readMemoryBytes(process, address, size) {
  let value = 0n;
  for (let i = 0n; i < size; i++) {
    value |= (process.memory[address + i] || 0n) << (i * 8n);
  }
  return value;
}

function interpret(process) {
  // TODO: interpret
}

function main(file) {
  const memory = new Array(10000);
  const code = fs.readFileSync(file).toString();
  const { instructions, labels } = parse(code);

  const registers = REGISTERS.reduce((rs, r) => ({ ...rs, [r]: 0n }), {});
  registers.RIP = BigInt(labels.main === undefined ? labels._main : labels.main);
  registers.RSP = BigInt(memory.length - 8);

  const process = {
    registers,
    memory,
    instructions,
    labels,
  };

  // Plant a sentinel at the top of the stack: when main's final RET
  // pops it into RIP, the interpreter loop knows to stop.
  writeMemoryBytes(process, registers.RSP, registers.RSP, 8);

  interpret(process);
  return Number(registers.RAX);
}

process.exit(main(process.argv[2]));

We'll accept _main as an entry point as well as main to support our macOS users. If you know why our macOS users use _main I'd love to know.

To interpret, we grab the instruction pointed to in RIP and switch on the operation.

function interpret(process) {
  do {
    const instruction = process.instructions[process.registers.RIP];
    switch (instruction.operation.toLowerCase()) {
      case 'mov':
        break;
      case 'add':
        break;
      case 'call':
        break;
      case 'ret':
        break;
      case 'push':
        break;
      case 'pop':
        break;
    }
  } while (process.registers.RIP != readMemoryBytes(process, BigInt(process.memory.length - 8), 8));
}

Interpreting MOV

Example:

  MOV RAX, 1
  MOV RAX, RDI
  MOV QWORD PTR [RBP - 8], 8

This instruction will store a value into a register or address and increment RIP. If the left-hand side is a memory address we will write to memory.

      case 'mov': {
        const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
        const rhs = interpretValue(process, instruction.operands[1]);
        if (REGISTERS.includes(lhs)) {
          process.registers[lhs] = rhs;
        } else {
          writeMemoryBytes(process, lhs.address, rhs, lhs.size);
        }
        process.registers.RIP++;
        break;
      }

We're delegating to a helper function to handle registers vs. memory addresses vs. literals appropriately. Without memory addresses it's a simple function:

function interpretValue(process, value, { lhs } = { lhs: false }) {
  if (REGISTERS.includes(value)) {
    if (lhs) {
      return value;
    } else {
      return process.registers[value];
    }
  }

  return BigInt.asIntN(64, value);
}

We need to do some hacking to support memory addresses:

function interpretValue(process, value, { lhs } = { lhs: false }) {
  if (REGISTERS.includes(value)) {
    if (lhs) {
      return value;
    } else {
      return process.registers[value];
    }
  }

  if (value.startsWith('QWORD PTR [')) {
    const offsetString = value.substring('QWORD PTR ['.length, value.length - 1).trim();
    if (offsetString.includes('-')) {
      const [l, r] = offsetString.split('-').map(l => interpretValue(process, l.trim()));
      const address = l - r;
      const bytes = 8; // qword is 8 bytes
      if (lhs) {
        return { address, size: bytes };
      } else {
        return readMemoryBytes(process, address, bytes);
      }
    }

    throw new Error('Unsupported offset calculation: ' + value);
  }

  return BigInt.asIntN(64, value);
}

Interpreting ADD

Example:

  ADD RAX, RDI

This instruction will combine both registers and store the result in the first, then increment the RIP register.

      case 'add': {
        const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
        const rhs = interpretValue(process, instruction.operands[1]);
        process.registers[lhs] += rhs;
        process.registers.RIP++;
        break;
      }

Interpreting CALL

Example:

  CALL plus

This instruction stores RIP (plus one, to continue after the call instruction) on the stack and sets RIP to the location specified by the label.

      case 'call': {
        process.registers.RSP -= 8n;
        writeMemoryBytes(process, process.registers.RSP, process.registers.RIP + 1n, 8);
        const label = instruction.operands[0];
        process.registers.RIP = BigInt(process.labels[label]); // labels hold plain Numbers; keep RIP a BigInt
        break;
      }

Interpreting RET

Example:

  RET

This instruction removes the last element from the stack and stores it in the RIP register.

      case 'ret': {
        const value = readMemoryBytes(process, process.registers.RSP, 8);
        process.registers.RSP += 8n;
        process.registers.RIP = value;
        break;
      }

Interpreting PUSH

Example:

  PUSH RAX

This instruction stores the value in the register on the stack and increments RIP.

      case 'push': {
        const value = interpretValue(process, instruction.operands[0]);
        process.registers.RSP -= 8n;
        writeMemoryBytes(process, process.registers.RSP, value, 8);
        process.registers.RIP++;
        break;
      }

Interpreting POP

Example:

  POP RAX

This instruction removes the last element from the stack and stores it into the register specified. Then it increments RIP.

      case 'pop': {
        const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
        const value = readMemoryBytes(process, process.registers.RSP, 8);
        process.registers.RSP += 8n;
        process.registers[lhs] = value;
        process.registers.RIP++;
        break;
      }

All together

$ cat test1.asm
main: ; the required main label
  MOV RAX, 1 ; store 1 in RAX
  MOV RDI, 2 ; store 2 in RDI
  ADD RAX, RDI ; store the result of adding RAX and RDI in RAX
  RET ; give control back to the virtual machine
$ node emulator.js test1.asm
$ echo $?
3

And finally, let's see what we can do with a simple C program:

$ cat plus.c
long main() {
  long a = 5;
  long b = 6;
  return a + b;
}
$ gcc -S -masm=intel -o plus.s plus.c
$ node emulator.js plus.s
$ echo $?
11

And we've got the start of a working x86_64/AMD64 emulator.

Next steps

We aren't setting flags appropriately to support conditionals, so that's low-hanging fruit. Syscalls open up a new world, and we'll end up needing them since exit codes are limited to 8 bits of information. And our parsing is brittle; dealing with ELF files may be a better direction to go and also enables more. We'll explore these aspects and others in follow-up posts.

Human interest

I originally intended to build a GameBoy emulator because the hardware is simple and uniform. But I found it easiest to start hacking together an AMD64 emulator because AMD64 is well-documented and gcc is easy enough to use. I'm still interested in the GameBoy, though, unless/until I figure out how to emulate a graphics card for AMD64.

It's tricky! But not that tricky. I built a graphical debugger around this emulator to help out with the logic and off-by-one errors. But ultimately it's been surprising to me how easy it is to get started -- especially when I'm not concerned about absolute correctness (yet).

May 16, 2019

Why use learning when you can fit?

We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes. As I had mentioned before, I am more in favor of using suitably implemented b-trees, for reasons like update friendliness and distribution independence. But nevertheless, the talk made me curious: The model they are learning is in the end very primitive. It is a two-level linear model, i.e., they are using a linear function to select another linear function. But if that is enough, why do we need machine learning? A simple function fit should work just as well.

Thus, I tried the following:
1) we sort all data and keep it in an array, just like with learned indexes
2) we build the CDF
3) we fit a linear spline to the CDF minimizing the Chebyshev norm
4) we fit a polynomial function to the spline nodes
5) now we can lookup a value by evaluating first the polynomial function, then the spline, and then retrieving the values from the array. The previous step is always the seed to a local search in the next step.

As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes.

Now admittedly there was some weasel wording in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find the good fit. But just the same is true for the learned indexes, of course.

Now, can we find a good fit? In theory we know how to construct the optimal fit in O(n log n), but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good fit is much easier, and can also be done in O(n log n). Using that algorithm, we can construct a linear spline that bounds the maximum error efficiently, and we know what the maximum is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we then can perform an efficient local search on a small, known window of the data.

The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we would find these with binary search. Note that the spline is much smaller than the original data, but we still want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline knots, leading to the aforementioned time bound.
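
To make the lookup path concrete, here is a minimal JavaScript sketch of the three stages. All names are illustrative and the fitting itself is assumed to have happened elsewhere; this is not the implementation behind the experiments below:

// knots: [{x, y}] spline knots sorted by x, mapping key -> position.
// poly: polynomial coefficients (highest degree first) predicting a
// knot index from a key. posErr: the Chebyshev bound on the spline's
// position error.
function evalPoly(poly, x) {
  return poly.reduce((acc, c) => acc * x + c, 0); // Horner's rule
}

function lookup(data, knots, poly, posErr, key) {
  // Stage 1: the polynomial predicts a spline knot index...
  let k = Math.max(0, Math.min(knots.length - 2, Math.round(evalPoly(poly, key))));
  // ...and a local search snaps to the segment containing the key.
  while (k > 0 && knots[k].x > key) k--;
  while (k < knots.length - 2 && knots[k + 1].x < key) k++;
  // Stage 2: linear interpolation on that segment estimates a position.
  const a = knots[k], b = knots[k + 1];
  const est = Math.round(a.y + ((key - a.x) / (b.x - a.x)) * (b.y - a.y));
  // Stage 3: local search in the known error window around the estimate.
  const lo = Math.max(0, est - posErr);
  const hi = Math.min(data.length - 1, est + posErr);
  for (let i = lo; i <= hi; i++) {
    if (data[i] === key) return i;
  }
  return -1; // not present
}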

How well does this work in practice? On the map data set from the learned indexes paper and on a log normal data set we get the following results. (The learned index numbers are from the paper, the b-tree numbers are from here, and the spline numbers are from these experiments. I still do not really know what the averages mean for the learned indexes, but they are probably the errors averaged over all models.)


Map data                 size (MB)   avg error
Learned Index (10,000)   0.15        8 ± 45
Learned Index (100,000)  1.53        2 ± 36
B-tree (10,000)          0.15        225
B-tree (100,000)         1.53        22
Spline (10,000)          0.15        193
Spline (100,000)         1.53        22

Log normal data          size (MB)   avg error
Learned Index (10,000)   0.15        17,060 ± 61,072
Learned Index (100,000)  1.53        17,005 ± 60,959
B-tree (10,000)          0.15        1330
B-tree (100,000)         1.53        3
Spline (10,000)          0.15        153
Spline (100,000)         1.53        1


Somewhat surprisingly, the accuracy of the spline is nearly identical to that of the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contain many millions of data points in both cases, so the prediction accuracy is really high.

For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution.
But if one wants to go the read-only/read-mostly route, fitted functions could be an attractive alternative to machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.

May 14, 2019

Tail call elimination

In this post we'll explore what tail calls are, why they are useful, and how they can be eliminated in an interpreter, a compiler targeting C++, and a compiler targeting LLVM IR.

Tail calls

A tail call is a function call made at the end of a block that returns the value of the call (some languages do not force this return requirement). Here are a few examples.

function tailCallEx1() {
  // Loops forever but is a tail call.
  return tailCallEx1();
}

function tailCallEx2(x) {
  if (x) {
    return tailCallEx2(x - 1);
  }

  return 1;
}

function tailCallEx3(x) {
  return x && tailCallEx3(x - 1) || 1;
}

function tailCallEx4(x) {
  switch (x) {
    case 0:
      return 1;
    default:
      return tailCallEx4(x - 1);
  }
}

And here are some examples of non-tail calls.

function nonTailCallEx1(x) {
  if (x) {
    // Not a tail call because the call is not the value returned.
    return 1 + nonTailCallEx1(x - 1);
  }

  return 0;
}

function nonTailCallEx2(x) {
  if (x) {
    const r = nonTailCallEx2(x - 1);
    // Not a tail call because the value is not *immediately* returned.
    console.log(r);
    return r;
  }

  return 1;
}

Why is this important?

Some languages can rewrite a recursive tail call as a jump/branch/goto instead of a function call. This allows:

  1. Potential performance gain if function calls have large overhead
  2. No stack overflows due to no nested function call stacks
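
Benefit 2 is easy to demonstrate in a language that does not perform tail call elimination, like JavaScript on V8:

function count(n) {
  // A tail call, but V8 still allocates a stack frame for it.
  return n === 0 ? 0 : count(n - 1);
}

count(100000); // RangeError: Maximum call stack size exceeded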

Implementation 1: Interpreter

Given a tail call recursive fibonacci:

function fibonacci(a, b, n) {
  if (n === 0) {
    return a;
  }

  if (n === 1) {
    return b;
  }

  return fibonacci(b, a + b, n - 1);
}

Here is how we could transform this (by hand) into a version without a tail call.

function fibonacci(a, b, n) {
  while (true) {
    if (n === 0) {
      return a;
    }

    if (n === 1) {
      return b;
    }

    const a1 = b;
    const b1 = a + b;
    const n1 = n - 1;
    a = a1;
    b = b1;
    n = n1;
  }
}

If this were written in a language with labels and goto, we could simplify the code slightly by using them. But the effect is the same as a loop.
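
JavaScript has no goto, but labeled loops get us the same shape:

function fibonacci(a, b, n) {
  tail_recurse: while (true) {
    if (n === 0) {
      return a;
    }

    if (n === 1) {
      return b;
    }

    [a, b, n] = [b, a + b, n - 1];
    continue tail_recurse;
  }
}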

Since we're in an interpreter (that isn't JIT compiling), we cannot pick between these two and must merge them. So we put all function bodies in a loop and break if it isn't a tail call. Otherwise we line up the parameters and let the loop take us back.

Here is an example of this strategy used in a Scheme interpreter written in D.

// Define a new function with name `name` and add it to the context.
Value namedLambda(Value arguments, Context ctx, string name) {
  auto funArguments = car(arguments);
  auto funBody = cdr(arguments);

  Value defined(Value parameters, void** rest) {
    Context newCtx = ctx.dup;

    // Copy the runtime calling context to the new context.
    Context runtimeCtx = cast(Context)(*rest);
    auto runtimeCallingContext = runtimeCtx.callingContext;
    newCtx.callingContext = runtimeCallingContext.dup;

    Value result;
    // Loop forever, will break immediately if not a tail call
    bool tailCalling = false;
    while (true) {
      if (valueIsList(funArguments)) {
        auto keyTmp = valueToList(funArguments);
        auto valueTmp = valueToList(parameters);
        while (true) {
          auto key = valueToSymbol(keyTmp[0]);
          auto value = valueTmp[0];

          newCtx.set(key, value);

          // TODO: handle arg count mismatch
          if (valueIsList(keyTmp[1])) {
            keyTmp = valueToList(keyTmp[1]);
            valueTmp = valueToList(valueTmp[1]);
          } else {
            break;
          }
        }
      } else if (valueIsSymbol(funArguments)) {
        auto key = valueToSymbol(funArguments);
        newCtx.set(key, car(parameters));
      } else if (!valueIsNil(funArguments)) {
        error("Expected symbol or list in lambda formals", funArguments);
      }

      if (!tailCalling) {
        newCtx.callingContext.push(Tuple!(string, Delegate)(name, &defined));
      }

      result = eval(withBegin(funBody), cast(void**)[newCtx]);

      // In a tail call, let the loop carry us back.
      if (newCtx.doTailCall == &defined) {
        tailCalling = true;
        parameters = result;
        newCtx.doTailCall = null;
      } else {
        break; // Not in a tail call, we're done a regular function call.
      }
    }

    return result;
  }

  return makeFunctionValue(name, &defined, false);
}

We cannot eliminate mutually recursive tail calls with this method. We could use continuation-passing style, but that would not address the underlying concern: not making a function call.
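
For example, these tail calls bounce between two functions, so the per-function loop above cannot turn them into iteration:

function isEven(n) {
  return n === 0 ? true : isOdd(n - 1);
}

function isOdd(n) {
  return n === 0 ? false : isEven(n - 1);
}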

Implementation 2: Compiling to C++

The strategy here is the same as in the interpreter, except that since tail-call-recursive functions are known at compile time, we can generate specialized (rather than general) code in function bodies.

Here is how a JavaScript compiler transforms the above fibonacci implementation into C++:

void tco_fib(const FunctionCallbackInfo<Value> &args) {
  Isolate *isolate = args.GetIsolate();
  double tco_n = toNumber(args[0]);
  double tco_a = toNumber(args[1]);
  double tco_b = toNumber(args[2]);

tail_recurse_1:

    ;

  bool sym_if_test_58 = (tco_n == 0);
  if (sym_if_test_58) {
    args.GetReturnValue().Set(Number::New(isolate, tco_a));
    return;
  }

  bool sym_if_test_70 = (tco_n == 1);
  if (sym_if_test_70) {
    args.GetReturnValue().Set(Number::New(isolate, tco_b));
    return;
  }

  Local<Value> sym_arg_83 = Number::New(isolate, (tco_n - 1));
  Local<Value> sym_arg_92 = Number::New(isolate, (tco_a + tco_b));
  tco_n = toNumber(sym_arg_83);
  tco_a = tco_b;
  tco_b = toNumber(sym_arg_92);
  goto tail_recurse_1;
}

This is implemented by checking every function call. If the function call is in tail call position, we generate code for jumping to the beginning of the function. Otherwise, we generate a call as usual.

Here is how the tail call check and code-generation is done in the compiler:

function compileCall(
  context: Context,
  destination: Local,
  ce: ts.CallExpression,
) {
  let tcoLabel;
  let tcoParameters;
  if (ce.expression.kind === ts.SyntaxKind.Identifier) {
    const id = identifier(ce.expression as ts.Identifier);
    const safe = context.locals.get(mangle(context.moduleName, id));

    if (safe && context.tco[safe.getCode()]) {
      const safeName = safe.getCode();
      tcoLabel = context.tco[safeName].label;
      tcoParameters = context.tco[safeName].parameters;
    }
  }

  ...

  if (ce.expression.kind === ts.SyntaxKind.Identifier) {
    const id = identifier(ce.expression as ts.Identifier);
    const mangled = mangle(context.moduleName, id);
    const safe = context.locals.get(mangled);

    if (safe) {
      if (tcoLabel) {
        args.forEach((arg, i) => {
          compileParameter(
            context,
            tcoParameters[i],
            i,
            i === args.length - 1,
            arg,
          );
        });

        context.emitStatement(`goto ${tcoLabel}`);
        context.emit('', 0);
        destination.tce = true;
        return;
      }
    }
  }

  ...

  // Otherwise generate regular function call

This requires you to have been building up the state throughout the AST to know whether or not any particular call is in tail position.

Implementation 3: Compiling to LLVM IR

LLVM IR is the most boring because all you do is mark any tail call as being a tail call. Then so long as the call meets some requirements, the key one being that the result of the call must be returned immediately, LLVM will generate a jump instead of a call for you.

Given the following lisp-y implementation of the same tail call recursive fibonacci function (compiler here):

(def fib (a b n)
     (if (= n 0)
         a
       (fib b (+ a b) (- n 1))))

We generate the following LLVM IR:

define i64 @fib(i64 %a, i64 %b, i64 %n) {
  %ifresult13 = alloca i64, align 4
  %sym14 = add i64 %n, 0
  %sym15 = add i64 0, 0
  %sym12 = icmp eq i64 %sym14, %sym15
  br i1 %sym12, label %iftrue16, label %iffalse17
iftrue16:
  %sym18 = add i64 %a, 0
  store i64 %sym18, i64* %ifresult13, align 4
  br label %ifend19
iffalse17:
  %sym21 = add i64 %b, 0
  %sym23 = add i64 %a, 0
  %sym24 = add i64 %b, 0
  %sym22 = add i64 %sym23, %sym24
  %sym26 = add i64 %n, 0
  %sym27 = add i64 1, 0
  %sym25 = sub i64 %sym26, %sym27
  ; NOTE the `tail` before `call` here
  %sym20 = tail call i64 @fib(i64 %sym21, i64 %sym22, i64 %sym25)
  ret i64 %sym20
  store i64 %sym20, i64* %ifresult13, align 4
  br label %ifend19
ifend19:
  %sym11 = load i64, i64* %ifresult13, align 4
  ret i64 %sym11
}

The only difference between supporting tail call elimination or not is whether the call instruction is preceded by a tail directive. That makes the implementation very simple:

const isTailCall = module.exports.TAIL_CALL_ENABLED &&
                   context.tailCallTree.includes(validFunction.value);
const maybeTail = isTailCall ? 'tail ' : '';
this.emit(1, `%${destination.value} = ${maybeTail}call ${validFunction.type} @${validFunction.value}(${safeArgs})`);
if (isTailCall) {
  this.emit(1, `ret ${destination.type} %${destination.value}`);
}

Generated assembly

The resulting generated code (run through llc) for that call will be:

...
    add rax, rsi
    dec rdx
    mov rdi, rsi
    mov rsi, rax
    jmp _fib               ## TAILCALL
...

And if tail call elimination is disabled:

...
    add rax, rsi
    dec rdx
    mov rdi, rsi
    mov rsi, rax
    call    _fib
...

Summary

The last bit I haven't covered is how you track whether or not a call is in tail position. That is difficult to cover in a blog post because it's a matter of you propagating/not propagating at each syntax node type. But generally speaking, if the syntax node is not in tail position (e.g. not the last expression in a block), you drop the tail state you've built up. When you make a function call, you add the function name to the tail state.
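
Here is a minimal sketch of that propagation (the names mirror the earlier snippets, but this is not ulisp's exact code):

// When we begin compiling a function, its own name joins the tail state.
function compileDefine(name, body, context) {
  const inner = { ...context, tailCallTree: [...context.tailCallTree, name] };
  compileBody(body, inner);
}

// Only the last expression of a body keeps the tail state; the others
// compile with it dropped, since their values are not returned.
function compileBody(expressions, context) {
  expressions.forEach((exp, i) => {
    const isLast = i === expressions.length - 1;
    compileExpression(exp, isLast ? context : { ...context, tailCallTree: [] });
  });
}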

But I will be covering this in detail in the LLVM case in the next post in my compiler basics series.