Database storage engines

Ajay Yadav
6 min readJan 4, 2022


Book Reference: DDIA

There are two categories of the storage engine

  • LSM storage engine
  • Page oriented storage engine (B-Tree based)

World’s simplest database


set(key, value) {    echo "${key},${value}" > test.db}get(key) {    grep "^${key}," test.db | sed -e "s/^${key},//" | tail -n 1}

What kind of values can be stored in the database defined by the above script?

It’s a key-value store where value can be of any type.

Write Performance

DB set is appending the record at the end of the file, Complexity O(1). It’s a very good performance.

Read Performance

In the worst case, to search for a record have to go through all the records in the database. Time complexity is O(n)

How to improve read performance?


The index is built on the primary content, doesn’t affect any content. But it impacts the performance of the query.

  • A well-chosen index improves the read performance
  • Every index reduces the write performance.

Hash Index

You want to meet your friend, you know his society name but you forgot the apartment number.

One way is to go to every home and ask if that house belongs to your friend. Worst-case O(N)

Another way is to go to the society’s office and ask for the house number of your friend based on his name. That’s how a hash-based index works.

Hash-based index maps the key to the location where the value is stored.

Real-life Example:

  • Bit cask (default storage engine of Riak DB) uses hash-based indexing.
  • Bit cask offers high-performance reads and writes subjected to all keys fits in available RAM.

Segmentation, Merge, and Compaction

Disk out of space

Write operation appending at the end of the file, it keeps getting larger and larger and it will run out of space after a time.


When file size increases a threshold, close the file for further writes and create a new segment file for write.


Throwing away the records corresponding to the duplicate one and keeping only the latest one.


Segmentation and compaction will keep on increasing the number of segments. So periodically merge them.

Compaction and Merging happen in the background thread.

Problems in real database Implementation

File Format: CSV v/s binary format

Deleting records: Tombstone is added to mark a record as deleted.

Crash recovery: Memtable crash can be restored from the segment logs on the crash, it won’t be sorted.

Partially written records: Partially written records can be avoided with the help of the checksum.

Concurrency control: Keep a single writer thread.

Advantage of the hash-based index

  • Writes are faster
  • Concurrency and Crash recovery is better

Hash-based index drawback

  • Range-based queries are inefficient
  • The hash-based index should fit in memory

SSTable and LSM

SSTable stands for Sorted String Table.

In SSTable, simple change to our segments file, keys are stored in the sorted order.

Construction of SSTable

Questions: Explain the following flows in SSTable

  • Write to memtable
  • Write to segment
  • Read
  • Crash recovery

Advantages of SS table

  • Merging and compaction are easier: Based on merge sort.
  • Index is sparse
  • Range queries and locality of reference

DB based on SSTable

LSM (Log structured merge tree)

Storage engines based on the concept of compaction and merging the sorted files are known as LSM (log-structured merge tree)

Performance optimization in LSM tree

Bloom filter

If a record doesn’t exist in the database, DB has to scan through all segment files. To improve on this bloom filters are used.

Compaction strategies: Leveled vs Sized-Tired compaction

Size-tired compaction

Leveled compaction


B Tree is a data structure used for disk access


Branching Factor

The numbers of child nodes can be referred from a node of the B-tree.

Height of B-tree

Log(N), with k (branching factor) as base

B-Tree node size

4KB (can be larger as well)


Given a binary tree of size 4, branching factor 500, and each node stores 4KB of data. What is the maximum of the data being stored by the binary tree?

Reliability in B- Tree

  • In-place update: Create a new data node and update the pointer.
  • Crash recovery: WAL (write-ahead log)
  • Concurrency: Latch

Working of WAL log

Optimization B-Tree

  • Copy on write
  • Abbreviated key
  • Sequential access
  • Sibling links

B-Tree vs LSM Tree

Thumb Rule

  • LSM tree faster for writes
  • B-Tree faster for the read

LSM Tree Advantages

  • Faster for the write as append-only.

LSM tree downside

  • The disk has a limited write capacity.
  • Compaction and merging utilize some writes and hence impact external writes.
  • LSM tree stores multiple copies of the same data.

Analytical processing

Commercial transactions

  • Sale
  • Banking
  • Processing an order

OLTP ( Online Transaction Processing)

  • Look for a few records and updates based on the user input.

OLAP (Online Analytical Processing)

Data analytics

  • Scan over a large number of records read a few fields and calculate aggregates.

OLTP databases are highly available, hence the database administrator doesn’t allow the analytics to run long-running queries on this database. As analytical queries are expensive, scanning a large dataset which impacts the performance.

Data warehouse

It is a separate database where analysts can run queries to their content.


The process of getting data into the data warehouse is known as ETL (Extract Transform and Load)

Star Schema

  • Fact table: very large
  • Dimension table

Column-oriented database

Column DB compression



Ajay Yadav

Believer of Distributed Systems