A table engine in ClickHouse controls how your data is stored, indexed, merged, and queried. Choosing the wrong one affects both performance and correctness.
When you create a table in ClickHouse, you have to declare an engine. This is different from most databases, where the storage mechanism is invisible. In ClickHouse, it is an explicit decision you make, and it shapes everything about how that table behaves.
This post covers the main ClickHouse engine types, what each one is designed for, and how to make the right choice for your workload. You do not need to memorise all of them. In practice, one engine covers the vast majority of real-world use cases.
What does a table engine actually do?
A table engine controls four things for every table you create:
- How data is stored on disk: The physical layout, file format, and compression strategy all come from the engine.
- How data is indexed: Which columns form the primary key, and how ClickHouse uses that key to skip data during queries.
- How data is merged: ClickHouse writes data in parts and merges them in the background. The engine controls the rules for how that merge happens.
- What operations are supported: Some engines support replication. Some support deduplication. Some are read-only. The engine defines the boundaries.
Think of a table engine as the contract between you and ClickHouse for how a specific table will behave.
How many table engines does ClickHouse have?
ClickHouse has a large number of table engines, which can feel overwhelming at first. But they fall into a small number of families, and most of them are for specialised situations you will rarely encounter when starting out.
The main ClickHouse engine types are:
- MergeTree family: the primary engines for production data storage
- Log family: lightweight engines for small, temporary datasets
- Integration engines: engines that connect ClickHouse to external systems like Kafka, S3, or PostgreSQL
- Special-purpose engines: engines for specific patterns like in-memory storage or distributed query routing
For most workloads, you will spend almost all of your time in the MergeTree family. The others exist for specific situations.
What is the MergeTree engine and why does almost everyone use it?
MergeTree is the default engine for production tables in ClickHouse. If you are storing real data that you will query repeatedly, you will almost always use MergeTree or one of its variants.
Here is what makes it the right choice for most situations:
- It handles high-volume inserts efficiently by writing data in sorted parts and merging them in the background
- It supports a primary key for fast data skipping during queries
- It supports partitioning so you can drop old data by time period cheaply
- It supports replication for high availability
- It supports TTL (time to live) so data expires automatically without manual cleanup
Every MergeTree table requires an ORDER BY clause. This is the sorting key that determines how data is physically arranged on disk. Your most frequent query filters should guide this choice.
A basic MergeTree table for an online store’s order events looks like this:
```sql
CREATE TABLE order_events (
    order_id UInt64,
    customer_id UInt64,
    country String,
    status String,
    revenue Float64,
    created_at DateTime
) ENGINE = MergeTree()
ORDER BY (country, created_at);
```

With this setup, ClickHouse sorts all data by country first, then by created_at. Queries that filter by country and date range will skip large portions of data without reading them, making them dramatically faster.
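For example, a query like this (the country code and date range are illustrative) lines up with the sorting key, so ClickHouse only reads the slices of data whose country and created_at ranges can match:

```sql
-- Reads only the parts of order_events covering 'DE' in March 2024,
-- because the data is physically sorted by (country, created_at).
SELECT
    status,
    count() AS orders,
    sum(revenue) AS total_revenue
FROM order_events
WHERE country = 'DE'
  AND created_at >= '2024-03-01'
  AND created_at < '2024-04-01'
GROUP BY status;
```

A query that filters only on status, by contrast, gains nothing from this sorting key and scans far more data.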
What are the MergeTree variants, and when do you use them?
The MergeTree family includes several specialised variants. Each one inherits everything from the base MergeTree engine and adds one specific behaviour during the background merge process.
| Engine | What it adds | When to use it |
|---|---|---|
| MergeTree | Nothing extra; the base engine | General-purpose storage |
| ReplacingMergeTree | Deduplicates rows with the same sorting key during merges | When you need to update records by inserting new versions and letting old ones be removed |
| SummingMergeTree | Sums numeric columns for rows with the same sorting key during merges | Pre-aggregating counters or totals at write time |
| AggregatingMergeTree | Stores and combines partial aggregation states during merges | Building materialised views with complex aggregations |
| CollapsingMergeTree | Cancels out rows using a sign column | Handling updates and deletes in an append-only system |
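As a quick illustration of the most commonly reached-for variant, a ReplacingMergeTree version of an orders table might look like this sketch (the table name and `updated_at` version column are assumptions; ClickHouse keeps the row with the highest version value when it merges duplicates):

```sql
CREATE TABLE order_latest (
    order_id UInt64,
    status String,
    revenue Float64,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY order_id;

-- Inserting a new row with the same order_id supersedes the old one
-- once a background merge runs. Until then, FINAL forces
-- deduplication at query time, at some performance cost:
SELECT * FROM order_latest FINAL WHERE order_id = 42;
```

Note that deduplication is eventual: it happens during background merges, not at insert time.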
The variants are covered in depth in a separate post in this series. For now, the important thing to know is that MergeTree is your starting point, and you move to a variant only when you have a specific need like deduplication or pre-aggregation.
What are the Log engines and when do you need them?
The Log family is a set of lightweight engines designed for small tables that do not need the full capabilities of MergeTree. They write data quickly with minimal overhead and support only basic reads and appends.
The three main Log engines are TinyLog, StripeLog, and Log. They differ mainly in how they handle concurrent reads and whether they use a separate index file.
When do you actually use them? Almost never for production data. The Log family is useful for:
- Temporary tables during testing or development
- Small lookup tables with fewer than one million rows
- Intermediate staging tables in a data pipeline where data passes through briefly
If you are storing data you intend to query repeatedly or at any meaningful volume, use MergeTree instead.
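For example, a throwaway lookup table during development might use TinyLog (the table and column names here are illustrative):

```sql
-- Minimal overhead: no sorting key, no background merges, no parts
-- to manage. Fine for small, short-lived data.
CREATE TABLE country_names (
    code String,
    name String
) ENGINE = TinyLog;

INSERT INTO country_names VALUES ('DE', 'Germany'), ('FR', 'France');
```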
What are the integration engines?
Integration engines let ClickHouse read from and write to external systems as if they were ClickHouse tables. You do not store data in ClickHouse with these engines. You query the external system through a ClickHouse interface.
Common integration engines include:
- Kafka: Consume messages from a Kafka topic as they arrive, typically paired with a materialised view that writes them into a MergeTree table.
- S3: Query files stored in Amazon S3 or compatible object storage without loading them into ClickHouse first.
- PostgreSQL: Read from a live PostgreSQL table inside a ClickHouse query, useful for joining operational data with analytical data.
- MySQL: Same as PostgreSQL, for MySQL sources.
These are particularly useful for building data pipelines where data flows from an operational system into ClickHouse for analytics.
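As one sketch of the pattern, the S3 engine points a table definition at files in object storage (the bucket URL, file layout, and column names below are placeholders):

```sql
-- Queries read the CSV files directly from S3; nothing is copied
-- into ClickHouse's own storage.
CREATE TABLE s3_orders (
    order_id UInt64,
    revenue Float64
) ENGINE = S3('https://example-bucket.s3.amazonaws.com/orders/*.csv', 'CSVWithNames');
```

Querying `s3_orders` then behaves like querying any other table, just slower, since every read goes over the network to S3.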
What is the Distributed engine?
The Distributed engine is not a storage engine. It does not store data. It is a query routing layer that sits in front of a cluster of ClickHouse nodes and distributes queries across them.
When you query a Distributed table, ClickHouse sends parts of that query to each node in the cluster, collects the results, and merges them into a single response. This is how ClickHouse scales horizontally across multiple machines.
You will encounter the Distributed engine when you move from a single ClickHouse instance to a cluster. For single-node deployments, you do not need it.
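A Distributed table is typically defined on top of a local MergeTree table that exists on every node. A sketch looks like this (the cluster name `my_cluster` is an assumption and must match a cluster defined in your server configuration):

```sql
-- Routes queries to the order_events table on every shard of
-- my_cluster, fans out the work, and merges the results.
-- rand() is the sharding key used to spread inserts across shards.
CREATE TABLE order_events_all AS order_events
ENGINE = Distributed(my_cluster, default, order_events, rand());
```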
Which engine should you start with?
The decision is simpler than the number of engines suggests. Here is a straightforward way to think about it:
- Storing production data you will query? Use MergeTree.
- Need to update rows by inserting new versions and deduplicating them over time? Use ReplacingMergeTree.
- Building pre-aggregated counters? Use SummingMergeTree.
- Small temporary table for testing? Use TinyLog or Log.
- Reading from Kafka, S3, or an external database? Use the relevant integration engine.
- Running a multi-node cluster? Add a Distributed table on top of your MergeTree tables.
If you are just getting started, pick MergeTree, define your ORDER BY based on your most common query filters, and move forward. You can always change the engine type later by migrating the table, but getting MergeTree right from the start covers the majority of what most teams need.
Looking for ClickHouse consulting or implementation support? Book a free ClickHouse consultation
