Keep an in-memory hash map where every key is mapped to a byte offset in the data file - the location at which the value can be found.
Whenever you append a new key-value pair to the file, also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys).
When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value. (This is the approach used by Bitcask, the default storage engine in Riak.)
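A minimal sketch of the idea, assuming a single data file and a toy `key,value` text encoding where keys and values contain no commas or newlines (the class name and format are illustrative, not Bitcask's actual layout):

```python
import os

class HashIndexedLog:
    """Append-only data file plus an in-memory hash map of byte offsets."""

    def __init__(self, path):
        self.index = {}                      # key -> byte offset of latest record
        self.f = open(path, "ab+")           # append-only writes, random reads

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()               # where this record will start
        self.f.write(f"{key},{value}\n".encode("utf-8"))
        self.f.flush()
        self.index[key] = offset             # insert and update look the same

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)                  # jump straight to the record
        _, value = self.f.readline().decode("utf-8").rstrip("\n").split(",", 1)
        return value
```

Note that every read costs at most one disk seek: the hash map takes you straight to the record (and no seek at all if the data is already in the filesystem cache).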
NOTE:
A storage engine like Bitcask is well suited to situations where the value for each key is updated frequently. For example, the key might be the URL of a cat video, and the value might be the number of times it has been played (incremented every time someone hits the play button). In this kind of workload, there are a lot of writes, but there are not too many distinct keys - you have a large number of writes per key, but it’s feasible to keep all keys in memory.
The log is append-only; break it into segments by closing a segment file when it reaches a certain size, and making subsequent writes to a new segment file.
Perform compaction on these segments - Compaction means throwing away duplicate keys in the log, and keeping only the most recent update for each key. The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files. After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments - and then the old segment files can simply be deleted.
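A sketch of merging and compacting frozen segments, under the same toy `key,value` text format as above. Processing segments from oldest to newest means later records simply overwrite earlier ones in the dict:

```python
def compact_segments(segment_paths, out_path):
    """Merge frozen segments, keeping only the latest value for each key.

    segment_paths must be ordered oldest to newest so newer values win.
    """
    latest = {}
    for path in segment_paths:
        with open(path, "rb") as f:
            for line in f:
                key, value = line.decode("utf-8").rstrip("\n").split(",", 1)
                latest[key] = value          # later records overwrite earlier ones
    with open(out_path, "wb") as f:
        for key, value in latest.items():
            f.write(f"{key},{value}\n".encode("utf-8"))
```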
Lots of detail goes into making this simple idea work in practice. Briefly, some of the issues that are important in a real implementation are:
File format: CSV is not the best format for a log. It’s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string (without the need for escaping) - see the encoding sketch after this list.
Deleting records: If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
Crash recovery: If the database is restarted, the in-memory hash maps are lost. In principle, you can restore each segment’s hash map by reading the entire segment file from beginning to end and noting the offset of the most recent value for every key as you go along. However, that might take a long time if the segment files are large, which would make server restarts painful. Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.
Partially written records: The database may crash at any time, including halfway through appending a record to the log. Bitcask files include checksums, allowing such corrupted parts of the log to be detected and ignored.
Concurrency control: As writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread. Data file segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.
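A sketch pulling three of these points together: a length-prefixed binary record with a CRC-32 checksum, where a missing value marks a tombstone. The field layout here is an assumption for illustration, not Bitcask's actual on-disk format:

```python
import struct
import zlib

TOMBSTONE = 0xFFFFFFFF                       # reserved value_len marking a deletion
HEADER = struct.Struct("<III")               # crc32, key_len, value_len

def encode_record(key, value):
    """key: bytes; value: bytes, or None to write a tombstone."""
    value_len = TOMBSTONE if value is None else len(value)
    body = struct.pack("<II", len(key), value_len) + key + (value or b"")
    return struct.pack("<I", zlib.crc32(body)) + body

def decode_record(buf):
    """buf: the bytes of exactly one record. Raises on corruption."""
    crc, key_len, value_len = HEADER.unpack_from(buf)
    if zlib.crc32(buf[4:]) != crc:           # e.g. a partially written tail
        raise ValueError("corrupt record")
    key = buf[HEADER.size:HEADER.size + key_len]
    if value_len == TOMBSTONE:
        return key, None                     # tells the merge to drop this key
    return key, buf[HEADER.size + key_len:HEADER.size + key_len + value_len]
```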
Limitations:
The hash table must fit in memory.
Range queries are not efficient.
SSTables (Sorted String Tables) - the log-structured indexes above break the database down into variable-size segments, typically several megabytes or more in size, and always write a segment sequentially.
Change to the format of our segment files: require that the sequence of key-value pairs is sorted by key.
Pros: you can use a sparse in-memory index (no need to keep all keys in memory), and range queries are more efficient.
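A sketch of how the sparse index works, with the segment represented as an in-memory sorted list for brevity; in a real SSTable the index entries would point to byte offsets in the file:

```python
import bisect

def build_sparse_index(records, every=64):
    """records: list of (key, value) sorted by key. Index every Nth key only."""
    return [(records[i][0], i) for i in range(0, len(records), every)]

def sparse_get(records, index, key):
    keys = [k for k, _ in index]
    i = bisect.bisect_right(keys, key) - 1   # last indexed key <= target
    if i < 0:
        return None                          # smaller than every indexed key
    for k, v in records[index[i][1]:]:       # scan at most one index gap
        if k == key:
            return v
        if k > key:
            break                            # data is sorted: safe to give up
    return None
```

Range queries fall out of the same structure: find the start position via the sparse index, then scan forward until the key exceeds the upper bound.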
Constructing and maintaining SSTables:
How do you get your data to be sorted by key in the first place?
When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable.
When the memtable gets bigger than some threshold - typically a few megabytes - write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.
In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
NOTE: this scheme suffers from only one problem: if the database crashes, the most recent writes (which are in the memtable but not yet written out to disk) are lost. To avoid that, we can keep a separate log on disk to which every write is immediately appended; it is used only for recovery purposes.
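A sketch of the whole scheme, including the recovery log. A plain dict sorted at flush time stands in for the balanced tree (Python has no built-in red-black tree), and the file naming and JSON-lines segment format are assumptions:

```python
import json
import os

class MemtableStore:
    def __init__(self, dirpath, threshold=4 * 1024 * 1024):
        self.dir = dirpath
        self.threshold = threshold
        self.memtable, self.size = {}, 0
        self.segments = []                               # oldest first
        self.wal = open(os.path.join(dirpath, "wal.log"), "a")

    def set(self, key, value):
        self.wal.write(json.dumps([key, value]) + "\n")  # log before applying
        self.wal.flush()
        self.memtable[key] = value
        self.size += len(key) + len(value)
        if self.size > self.threshold:
            self.flush()

    def flush(self):
        path = os.path.join(self.dir, f"segment-{len(self.segments)}.sst")
        with open(path, "w") as f:
            for key in sorted(self.memtable):            # written out in key order
                f.write(json.dumps([key, self.memtable[key]]) + "\n")
        self.segments.append(path)
        self.memtable, self.size = {}, 0
        self.wal.truncate(0)                             # flushed: log can go

    def get(self, key):
        if key in self.memtable:                         # 1. check memtable
            return self.memtable[key]
        for path in reversed(self.segments):             # 2. newest segment first
            with open(path) as f:
                for k, v in (json.loads(line) for line in f):
                    if k == key:
                        return v
        return None
```

The read path here scans each segment linearly for simplicity; since segments are sorted, a real implementation would binary-search or use a sparse index as above.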
LSM-Trees (Log-Structured Merge-Trees) - making an LSM-tree out of SSTables
The basic idea of LSM-trees - keeping a cascade of SSTables that are merged in the background - is simple and effective. Even when the dataset is much bigger than the available memory, it continues to work well. Since data is stored in sorted order, you can efficiently perform range queries (scanning all keys above some minimum and up to some maximum), and because the disk writes are sequential, the LSM-tree can support remarkably high write throughput.
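A sketch of the mergesort-like merge step, with segments as in-memory sorted lists ordered oldest to newest; when the same key appears in several segments, the copy from the newest one wins, and tombstones (None values) are dropped:

```python
import heapq

def tagged(segment, age):
    # Tag each entry so that equal keys sort newest-segment-first.
    for key, value in segment:
        yield (key, -age), value

def merge_segments(segments):
    """segments: list of [(key, value), ...] sorted by key, oldest first."""
    iters = [tagged(seg, age) for age, seg in enumerate(segments)]
    merged, last_key = [], object()          # sentinel: matches no real key
    for (key, _), value in heapq.merge(*iters):
        if key != last_key:                  # first occurrence is the newest
            last_key = key
            if value is not None:            # discard deleted keys entirely
                merged.append((key, value))
    return merged
```

For example, `merge_segments([[("a", 1)], [("a", 2), ("b", None)]])` returns `[("a", 2)]`: the newer value for "a" wins and the tombstoned "b" disappears.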
Performance optimizations:
The LSM-tree algorithm can be slow when looking up keys that do not exist in the database: you have to check the memtable, then the segments all the way back to the oldest (possibly having to read from disk for each one) before you can be sure that the key does not exist.
In order to optimize this kind of access, storage engines often use additional Bloom filters. (A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key does not appear in the database, and thus saves many unnecessary disk reads for nonexistent keys.)
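A minimal Bloom filter sketch, using double hashing to derive several bit positions from one SHA-256 digest; the sizes and hash choice are illustrative only:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits  # double hashing

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent (skip all disk reads);
        # True means only "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```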