ClickHouse Primary Keys, Sorting Keys, and Partition Keys


These three concepts control how ClickHouse organises data on disk, how fast it can skip data during queries, and how cheaply you can drop old data. Getting them right is the difference between a schema that scales and one that does not.

Most ClickHouse performance problems trace back to one of three schema decisions: the wrong sorting key, a misunderstood primary key, or a partition key that creates more problems than it solves. This post explains all three, shows you how they interact, and gives you a practical framework for making the right choice the first time.

What are the three concepts, and what does each one do?

Before going into detail, here is the clearest way to think about all three at once:

Concept         Defined by      What it controls
Sorting key     ORDER BY        The physical order of rows on disk within each part
Primary key     PRIMARY KEY     Which columns are included in the sparse index for fast data skipping
Partition key   PARTITION BY    How data is grouped into separate chunks that can be dropped or queried independently

They work together but they are not the same thing. Confusing them is the most common source of slow queries and poor schema design in ClickHouse.

What scenario will we use throughout this post?

To keep things concrete, we will use one scenario throughout: a content delivery platform called Streamline. Streamline serves video content to users across multiple regions. Every time a user watches, pauses, or finishes a video, Streamline logs a playback event.

A simplified version of that events table looks like this:

CREATE TABLE playback_events (
    user_id       UInt64,
    content_id    UInt64,
    region        String,
    event_type    String,
    watch_seconds UInt32,
    event_ts      DateTime
) ENGINE = MergeTree()
ORDER BY (region, event_ts)
PARTITION BY toYYYYMM(event_ts);

We will refer back to this table as we unpack each concept.

What does the sorting key actually do?

The sorting key, defined by ORDER BY, determines the physical order of rows within each data part on disk. Every time ClickHouse writes a part, it sorts the rows by your ORDER BY expression before writing them.

In the Streamline table, ORDER BY (region, event_ts) means all rows are sorted by region first, then by timestamp within each region. On disk, all events from Europe sit together, sorted by time. All events from North America sit together, sorted by time. And so on.

Why does this matter for query speed? When you run a query like this:

SELECT count(), sum(watch_seconds)
FROM playback_events
WHERE region = 'Europe'
AND event_ts >= '2024-03-01'
AND event_ts < '2024-04-01';

ClickHouse knows that all European events are physically grouped together on disk. It skips every row that belongs to a different region without reading them. Within the European rows, it skips everything outside March 2024. The result is that ClickHouse reads a tiny fraction of the total data on disk to answer this query.

If instead you had ORDER BY user_id but always filtered by region, ClickHouse would have to scan far more data because events from Europe are scattered across the entire dataset, interleaved with events from every other region.

Following ClickHouse ORDER BY best practices means one thing above everything else: your ORDER BY columns should reflect your most common query filters, ordered from lowest cardinality to highest. That is why region (a handful of distinct values) comes before event_ts (millions of distinct values) in the Streamline table.
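
You can check how much data the sorting key lets ClickHouse skip without running the full query. EXPLAIN with the indexes setting reports how many parts and granules survive primary key analysis (a sketch against the Streamline table; the exact output format varies between ClickHouse versions):

EXPLAIN indexes = 1
SELECT count(), sum(watch_seconds)
FROM playback_events
WHERE region = 'Europe'
  AND event_ts >= '2024-03-01'
  AND event_ts < '2024-04-01';

-- The output includes a PrimaryKey section with lines such as
-- "Parts: 3/10" and "Granules: 120/5000": only the granules that
-- can contain European events in March 2024 will be read.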

What is the primary key in ClickHouse and how is it different from ORDER BY?

This is the concept that confuses most people coming from PostgreSQL or MySQL, where the primary key enforces uniqueness and drives the main index.

In ClickHouse, the primary key does not enforce uniqueness. It defines which columns are included in the sparse index that ClickHouse uses to skip data during reads.

Here is what sparse index means: instead of indexing every row, ClickHouse indexes one entry per granule. A granule is a block of 8,192 rows by default. So if your table has 100 million rows, the primary key index has roughly 12,000 entries, not 100 million. This makes the index small enough to keep in memory at all times.

When you run a query, ClickHouse scans the primary key index to find which granules might contain matching rows. It then reads only those granules from disk. Everything else is skipped.
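
You can see the sparse index reflected in the system tables. system.parts exposes one row per data part, including how many rows and index marks (granule entries) each part holds (column names as in recent ClickHouse versions):

SELECT name, rows, marks
FROM system.parts
WHERE table = 'playback_events' AND active;

-- marks is roughly rows / 8192 for each part: one sparse index
-- entry per granule, not one per row.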

Now here is the key distinction from ORDER BY:

  • The ORDER BY expression is the full sorting key. It controls the physical order of all rows on disk.
  • The PRIMARY KEY must be a prefix of the ORDER BY expression. It controls which columns appear in the sparse index.
  • If you do not specify a PRIMARY KEY separately, ClickHouse uses the full ORDER BY as the primary key. This is the default and works well for most tables.
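
The prefix rule is enforced at table creation time. A PRIMARY KEY that is not a prefix of the ORDER BY expression is rejected (sketch; the error wording is paraphrased):

-- Valid: (region) is a prefix of (region, event_ts)
PRIMARY KEY (region)
ORDER BY (region, event_ts)

-- Invalid: (event_ts) is not a prefix of (region, event_ts).
-- ClickHouse rejects this with an error saying the primary key
-- must be a prefix of the sorting key.
PRIMARY KEY (event_ts)
ORDER BY (region, event_ts)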

Here is a case where you would separate them. Streamline wants to sort data by region and timestamp for query performance, but only needs the index to cover region:

CREATE TABLE playback_events (
    user_id       UInt64,
    content_id    UInt64,
    region        String,
    event_type    String,
    watch_seconds UInt32,
    event_ts      DateTime
) ENGINE = MergeTree()
PRIMARY KEY (region)
ORDER BY (region, event_ts)
PARTITION BY toYYYYMM(event_ts);

With this setup, the sparse index covers only region. The data is still physically sorted by (region, event_ts) on disk. Queries that filter by region use the index to skip granules. Queries that filter by both region and timestamp benefit from the physical sort order even though event_ts is not in the index directly.

For most tables, you do not need to separate PRIMARY KEY and ORDER BY. Use the default behaviour where they are the same. Separate them only when you have a specific reason, such as needing a longer sort key for compression benefits while keeping the index lean.

What does the partition key do and when do you actually need it?

The partition key, defined by PARTITION BY, splits your table data into separate physical chunks called partitions. Each partition is stored independently.

In the Streamline table, PARTITION BY toYYYYMM(event_ts) creates one partition per month. All March 2024 events live in one partition. All April 2024 events live in another.

This serves two purposes:

  • Cheap data deletion. If Streamline needs to delete all events older than 12 months for compliance reasons, dropping a partition takes milliseconds. Deleting individual rows across a large table takes much longer and requires a mutation.
  • Partition pruning. When a query includes a filter that matches the partition expression, ClickHouse skips entire partitions that cannot contain matching rows. A query filtering for March events skips every partition except March.

One caveat: partition pruning only happens when your WHERE clause includes a filter on the partition expression itself. If you partition by toYYYYMM(event_ts) but your query filters only on region, ClickHouse cannot prune any partitions and reads all of them. Do not rely on partitioning to speed up queries that do not filter on the partition column.
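
Dropping a month of data is a single metadata operation. With toYYYYMM partitioning, partition IDs are month numbers like 202403, so removing March 2024 from the Streamline table looks like this (assuming the schema above):

-- Drops the entire March 2024 partition in milliseconds
ALTER TABLE playback_events DROP PARTITION 202403;

-- Compare with the row-level alternative, which runs as a heavy
-- asynchronous mutation over the whole table:
ALTER TABLE playback_events DELETE WHERE toYYYYMM(event_ts) = 202403;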

What are the most common partitioning mistakes?

Partitioning is where ClickHouse beginners make their costliest mistakes. These are the ones that appear most often:

Why does partitioning by day cause problems?

Partitioning by day creates a new partition every 24 hours. If you ingest data continuously, you end up with hundreds of small partitions. ClickHouse has to manage more parts, more files, and more background merges. Query performance drops because ClickHouse opens and reads many small files instead of fewer large ones. Partition by month unless you have a very strong reason to go finer.
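
A quick way to spot over-partitioning is to count active parts and rows per partition (a diagnostic sketch against system.parts):

SELECT partition, count() AS part_count, sum(rows) AS total_rows
FROM system.parts
WHERE table = 'playback_events' AND active
GROUP BY partition
ORDER BY partition;

-- Hundreds of partitions holding only a few thousand rows each is
-- a sign the partition key is too fine-grained.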

Why is partitioning by a high-cardinality column a bad idea?

If you partition by user_id or content_id in the Streamline example, you create one partition per user or per video. With millions of users, that is millions of partitions. ClickHouse was not designed for this. It breaks background merging, slows inserts, and makes queries slower, not faster. High-cardinality columns belong in ORDER BY, not PARTITION BY.

Why does partitioning not replace a good ORDER BY?

Partitioning prunes at the partition level, which is coarse. The ORDER BY and primary key index prune at the granule level, which is fine-grained. A good ORDER BY skips far more data than partitioning alone. If you are relying on partitioning to make queries fast, your ORDER BY probably needs work.

How do you choose all three correctly for a new table?

Work through these questions in order when designing a new ClickHouse table:

  1. What columns do your most common queries filter on? These columns, ordered from lowest cardinality to highest, form your ORDER BY. For Streamline, queries almost always filter on region and event_ts, so ORDER BY (region, event_ts).
  2. Do you need a separate PRIMARY KEY? Only if you want your sparse index to cover fewer columns than your full sort key. For most tables, skip this and let PRIMARY KEY default to ORDER BY.
  3. Do you need to drop old data by time period? If yes, add a PARTITION BY on your time column at monthly granularity. If not, you may not need partitioning at all.

Applied to the Streamline table, the final schema looks like this:

CREATE TABLE playback_events (
    user_id       UInt64,
    content_id    UInt64,
    region        String,
    event_type    String,
    watch_seconds UInt32,
    event_ts      DateTime
) ENGINE = MergeTree()
ORDER BY (region, event_ts)
PARTITION BY toYYYYMM(event_ts);

This schema gives Streamline fast queries on region and time filters, monthly partitions for cheap historical data deletion, and a primary key index that covers both sort columns by default.

Looking for ClickHouse consulting or implementation support? Book a free ClickHouse consultation

