ReplacingMergeTree, SummingMergeTree, and AggregatingMergeTree: When to Use Each

ReplacingMergeTree, SummingMergeTree, and AggregatingMergeTree: When to Use Each

The MergeTree variants are specialised versions of the base MergeTree engine that add specific behaviour during background merges, each solving a different data problem.

If you have used the base MergeTree engine and started hitting situations where you need deduplication, pre-aggregation, or update-like behaviour, you are ready for the ClickHouse MergeTree variants. This post explains what each one does, shows a concrete example of the problem it solves, and tells you exactly when to reach for it.

What do all MergeTree variants have in common?

Before going into each variant, it helps to understand what they share. Every MergeTree variant:

  • Inherits all the core MergeTree features: sorting key, partitioning, TTL, replication support
  • Requires an ORDER BY clause when creating a table
  • Writes data in parts on insert and merges those parts in the background
  • Applies its special behaviour during the merge process, not at insert time

That last point is critical and easy to get wrong. The special behaviour of each variant happens when ClickHouse merges parts in the background. It does not happen the moment you insert a row. This affects how you query the data and what you can rely on at any given moment.

Quick reference: which variant does what?

Engine What it does during merge Primary use case
ReplacingMergeTree Keeps only the latest version of rows with the same ORDER BY key Deduplication, storing latest state of a record
SummingMergeTree Sums numeric columns for rows with the same ORDER BY key Pre-aggregating counters and totals at write time
AggregatingMergeTree Stores partial aggregate states and combines them during merge Materialised views with complex aggregations like uniq counts
CollapsingMergeTree Cancels out row pairs using a sign column (+1 and -1) Handling updates and deletes in an append-only system

What is ReplacingMergeTree and when do you use it?

The problem it solves: You need to store the latest version of a record, but ClickHouse does not support traditional row updates efficiently.

In ClickHouse, updating a row the way you would in PostgreSQL is expensive. The recommended pattern is to insert a new version of the row and let the old one be removed. ReplacingMergeTree automates the removal part.

Imagine you run a delivery platform. Each order has a status that changes over time: pending, in transit, delivered. You want to always be able to query the current status of any order.

Here is the table:

CREATE TABLE order_status (
order_id      UInt64,
customer_id   UInt64,
status        String,
updated_at    DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY order_id;

The updated_at column passed into the engine is the version column. When two rows have the same order_id, ClickHouse keeps the one with the highest updated_at value and discards the older one during merges.

You insert the order as it progresses:

INSERT INTO order_status VALUES
(5001, 201, 'pending',   '2024-04-01 09:00:00'),
(5001, 201, 'in_transit', '2024-04-01 11:30:00'),
(5001, 201, 'delivered', '2024-04-01 14:15:00');

After ClickHouse merges these parts in the background, only the delivered row survives.

Important: deduplication happens during background merges, not immediately after insert. If you query the table right after inserting, you may still see all three rows. To force deduplication at query time, add FINAL to your SELECT:
SELECT * FROM order_status FINAL WHERE order_id = 5001;

Use ReplacingMergeTree when:

  • You need to store the current state of records that change over time
  • You are comfortable with eventual deduplication rather than immediate consistency
  • Your update volume is high enough that traditional mutations would be too slow

What is SummingMergeTree and when do you use it?

The problem it solves: You are storing counters or totals that you always query as aggregates, and you want ClickHouse to pre-sum them at write time instead of summing at query time.

Imagine you run a marketing agency and you ingest ad spend data every hour from multiple platforms. You store spend per campaign per day. Every hour, new rows arrive with that hour’s spend. By end of day, you have 24 rows per campaign, and you always query total daily spend.

Instead of summing at query time across 24 rows per campaign per day, SummingMergeTree does the summing for you during background merges:

CREATE TABLE campaign_spend (
campaign_id  UInt64,
platform     String,
spend_date   Date,
spend        Float64,
impressions  UInt64
) ENGINE = SummingMergeTree()
ORDER BY (campaign_id, platform, spend_date);

You insert hourly rows throughout the day:

INSERT INTO campaign_spend VALUES
(101, 'Google', '2024-04-01', 120.50, 45000),
(101, 'Google', '2024-04-01', 98.00, 38000),
(101, 'Google', '2024-04-01', 145.75, 52000);

After merging, ClickHouse collapses these into a single row:

campaign_id platform spend_date spend impressions
101 Google 2024-04-01 364.25 135000

All numeric columns get summed automatically. Non-numeric columns like platform and spend_date keep the value from one of the merged rows, so they should always be part of your ORDER BY key.

Use SummingMergeTree when:

  • Your queries always aggregate numeric columns and never need individual row values
  • You want to reduce the number of rows stored over time without running manual aggregations
  • Your ORDER BY columns represent the dimensions you always group by in queries

What is AggregatingMergeTree, and when do you use it?

The problem it solves: You need to pre-compute complex aggregations like unique user counts or averages across large datasets, and SummingMergeTree is not flexible enough for your aggregation logic.

AggregatingMergeTree stores partial aggregation states instead of raw values. During merges, it combines those partial states into a single result. You almost always use it together with a materialised view.

Imagine you want to track the number of unique users who triggered each event type each day, across hundreds of millions of events. Running COUNT(DISTINCT user_id) at query time across that volume is slow. AggregatingMergeTree lets you maintain a running count that updates as new data arrives.

CREATE TABLE event_unique_users (
event_type   String,
event_date   Date,
unique_users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree()
ORDER BY (event_type, event_date);

The AggregateFunction(uniq, UInt64) column stores a compact data structure that tracks unique values. When ClickHouse merges parts, it merges these structures together rather than summing raw numbers. The result is an accurate, unique user count without scanning the full events table at query time.

AggregatingMergeTree has a steeper learning curve than the other variants. If you are just getting started, use it only when you have a clear need for complex pre-aggregations that SummingMergeTree cannot handle. It is most powerful when paired with materialised views.

Use AggregatingMergeTree when:

  • You need to pre-compute aggregations like uniq counts, averages, or quantiles
  • Your data volume makes running these aggregations at query time too slow
  • You are building a materialised view that needs to stay up to date as data arrives

What is CollapsingMergeTree, and when do you use it?

The problem it solves: You need to handle updates and deletes in an append-only system, where inserting a cancellation row should undo a previous insert.

CollapsingMergeTree uses a sign column to mark rows as either active (1) or cancelled (-1). During merges, ClickHouse cancels out pairs of rows with opposite signs and the same ORDER BY key, keeping only the net result.

Imagine you track active subscriptions. When a user subscribes, you insert a row with sign 1. When they cancel, you insert the same row with sign -1. After a merge, both rows disappear, correctly reflecting that the subscription no longer exists.

CREATE TABLE active_subscriptions (
user_id   UInt64,
plan      String,
sign      Int8
) ENGINE = CollapsingMergeTree(sign)
ORDER BY (user_id, plan);

Use CollapsingMergeTree when:

  • You are modelling data where records are created and then cancelled or replaced
  • You need a delete-like behaviour without running expensive mutations
  • Your data naturally fits a sign-based model where state changes come in pairs

How do you choose the right variant for your situation?

The decision comes down to one question at a time:

  • Do you need to store the latest version of a changing record? Use ReplacingMergeTree.
  • Do you always query numeric columns as totals and never need individual rows? Use SummingMergeTree.
  • Do you need pre-computed unique counts, averages, or other complex aggregations? Use AggregatingMergeTree with a materialised view.
  • Do you need to cancel out rows to handle updates or deletes? Use CollapsingMergeTree.
  • None of the above? Use the base MergeTree. It covers most cases and is the easiest to reason about.

Looking for ClickHouse consulting or implementation support? Book a free ClickHouse consultation

End

Related Posts