The limitations of LLMs, or why are we doing RAG?
This is an external post of mine.
Some of the most interesting technical blog posts I read come from confusion, and confusion is a common reason for the posts I write. You're at work and you start asking questions that are difficult to answer. You spend a few hours or a day trying to get to the bottom of things.
If you ask a question to very experienced and successful developers at work, they have a tendency not to give context and to simplify things down to a single answer. This may be a good way to make business decisions. (One can't afford to waste an eternity considering everything indefinitely.) But accepting an answer you don't understand is actively harmful for building intuition.
Certainly, sometimes not accepting an answer can be irritating. You'll have to figure that out.
But beyond "go along to get along", another reason we don't pursue what we're confused about is that we're embarrassed that we're confused in the first place. What's worse, the embarrassment we feel naturally grows the more experienced we get. "I've got this job title, I don't want to seem like I don't know what you mean."
But if you fight the embarrassment and pursue your confusion regardless, you'll likely figure something very interesting out. Moreover, you will probably not have been the only person who was confused. At least personally it is quite rare that I am confused about something and no one else is.
So pay attention when you get confused, and consider why it happened. What did you expect to be the case, and how did reality differ? Explore the angles and the options. When you finally understand, think about what led you to that understanding.
Write it down. Put it into an internal Markdown doc, an internal Atlassian doc, an internal Google Slides page, whatever. The medium doesn't matter.
This entire process doesn't come easily. We feel embarrassed. We aren't used to lingering on something we're confused by. We aren't used to writing things down.
But if you can make yourself pause every once in a while and think about what you (or someone around you) got confused by, and if you can force yourself to stop getting embarrassed by what you got confused by, and if you can write down the background and the reasoning that led to your ultimate understanding, you're going to have something pretty interesting to talk about.
You'll contribute to the growth and intuition of your colleagues. And you'll never run out of things to write about.
Confusion is embarrassing. But fight that feeling, and dig into why you're confused. And write it down.
Database system developers have a complicated relationship with storage devices: on the one hand, they can store terabytes of data cheaply, and everything is still there after a system crash. On the other hand, storage can be a spoilsport by being slow when it matters most.
This blog post shows
When we colloquially talk about speed, we usually think in terms of throughput, i.e., how much data we can store or retrieve per second. Let’s use the fantastic bench-fio tool to measure how a consumer-grade Crucial T700 SSD used in one of our build servers performs for random reads:
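For illustration, here is a minimal sketch of how such a random-read measurement could be driven from Python by calling fio directly. The file path, block size, and queue depth are assumptions for the example, not the exact bench-fio configuration used on the build server:

```python
# Sketch: run a 4 KiB random-read benchmark with fio and print the measured
# throughput. Assumes fio is installed and /tmp/fio-testfile is writable;
# these parameters are illustrative, not the post's bench-fio setup.
import json
import subprocess

result = subprocess.run(
    [
        "fio",
        "--name=randread",
        "--filename=/tmp/fio-testfile",
        "--size=1G",
        "--rw=randread",            # random reads
        "--bs=4k",                  # 4 KiB blocks
        "--direct=1",               # bypass the page cache
        "--iodepth=32",
        "--ioengine=libaio",
        "--runtime=10",
        "--time_based",
        "--output-format=json",
    ],
    capture_output=True, text=True, check=True,
)

read_stats = json.loads(result.stdout)["jobs"][0]["read"]
# fio reports bandwidth in KiB/s in the JSON output.
print(f"throughput: {read_stats['bw'] / 1024:.1f} MiB/s, IOPS: {read_stats['iops']:.0f}")
```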
Due to better access locality, B-trees are faster than binary search trees in practice -- but are they also better in theory? To answer this question, let's look at the number of comparisons required for a search operation. Assuming we store n elements in a binary search tree, the lower bound for the number of comparisons is log2 n in the worst case. However, this is only achievable for a perfectly balanced tree. Maintaining such a tree's perfect balance during insert/delete operations requires O(n) time in the worst case.
Balanced binary search trees, therefore, leave some slack in terms of how balanced they are and have slightly worse bounds. For example, it is well known that an AVL tree guarantees at most 1.44 log2 n comparisons, and a Red-Black tree guarantees 2 log2 n comparisons. In other words, AVL trees require at most 1.44 times the minimum number of comparisons, and Red-Black trees require up to twice the minimum.
How many comparisons does a B-tree need? In B-trees with degree k, each node (except the root) has between k and 2k children. For k=2, a B-tree is essentially the same data structure as a Red-Black tree and therefore provides the same guarantee of 2 log2 n comparisons. So how about larger, more realistic values of k?
To analyze the general case, we start with a B-tree that has the highest possible height for n elements. The height is maximal when each node has only k children (for simplicity, this analysis ignores the special case of underfull root nodes). This implies that the worst-case height of a B-tree is logk n. During a lookup, one has to perform a binary search that takes log2 k comparisons in each of the logk n nodes. So in total, we have log2 k * logk n = log2 n comparisons.
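As a quick sanity check of this best-case argument, a few lines of Python (my own illustration, with arbitrary example values of n and k) confirm that the per-node binary search cost multiplied by the number of levels always comes out to log2 n:

```python
# Sketch: with n elements and every node holding exactly k children, the tree
# has about log_k(n) levels, each level costs a binary search of log2(k)
# comparisons, and the product equals log2(n). The values of n and k are
# arbitrary examples.
import math

n = 1_000_000
for k in (2, 16, 100):
    height = math.log(n, k)      # levels in the minimally filled tree
    per_node = math.log2(k)      # binary search inside one node
    total = height * per_node
    print(f"k={k:3d}: {height:5.2f} levels * {per_node:5.2f} comparisons "
          f"= {total:.2f}  (log2 n = {math.log2(n):.2f})")
```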
This actually matches the best case, and to construct the worst case, we have to modify the tree somewhat. On one (and only one) arbitrary path from the root to a single leaf node, we increase the number of children from k to 2k. In this situation, the tree height is still less than or equal to logk n, but we now have one worst-case path where we need log2 2k (instead of log2 k) comparisons. On this worst-case path, we have log2 2k * logk n = (log2 2k) / (log2 k) * log2 n comparisons.
Using this formula, we get the following bounds:
k=2: 2 log2 n
k=4: 1.5 log2 n
k=8: 1.33 log2 n
k=16: 1.25 log2 n
...
k=512: 1.11 log2 n
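The following short sketch (an illustration, not code from the original analysis) reproduces these multipliers directly from the formula log2(2k) / log2(k) and makes the comparison with the AVL and Red-Black constants explicit:

```python
# Sketch: the worst-case comparison count of a degree-k B-tree, expressed as a
# multiple of log2 n, is log2(2k) / log2(k). Compare against AVL (1.44) and
# Red-Black (2.0) trees from the text above.
import math

def btree_worst_case_factor(k: int) -> float:
    """Worst-case comparisons of a degree-k B-tree, as a multiple of log2 n."""
    return math.log2(2 * k) / math.log2(k)

for k in (2, 4, 8, 16, 32, 64, 128, 256, 512):
    print(f"k={k:3d}: {btree_worst_case_factor(k):.2f} * log2 n")

# AVL: 1.44 * log2 n, Red-Black: 2.00 * log2 n -- degree k >= 8 already beats AVL.
```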
We see that as k grows, B-trees get closer to the lower bound. For k>=8, B-trees are guaranteed to perform fewer comparisons than AVL trees in the worst case. As k increases, B-trees become more balanced. One intuition for this result is that for larger k values, B-trees become increasingly similar to sorted arrays, which achieve the log2 n lower bound. Practical B-trees often use fairly large values of k (e.g., 100) and therefore offer tight bounds -- in addition to being more cache-friendly than binary search trees.
(Caveat: For simplicity, the analysis assumes that log2 n and log2 2k are integers, and that the root has either k or 2k entries. Nevertheless, the observation that larger k values lead to tighter bounds should hold in general.)