With millions of API connections every month, we cannot analyze trends and product performance by running direct, slow queries on our huge analytics database. Instead, we need a simple way to:
- aggregate huge amounts of data quickly.
- produce periodic summaries.
This is where TimescaleDB’s continuous aggregates come in. At Mindee, we track and store every API call with TimescaleDB, a time-series SQL database, so we can understand how our APIs are performing and find ways to improve them.
TimescaleDB’s continuous aggregates store pre-computed data in PostgreSQL materialized views, which lets you run much faster queries. They can be refreshed automatically, either continuously or according to a policy, and either over the whole dataset or over a specific time range. Refreshes are very fast because they only process the data that has changed since the last refresh.
Prerequisites
- TimescaleDB 2+
- PostgreSQL 12+
Create a continuous aggregate
Let’s create a trivial continuous aggregate based on an api_responses table.
We will aggregate, per hour and per customer, the number of successful API responses, the number of errors, the average file size, and the maximum computation time.
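Here is a minimal sketch of such an aggregate. The view name api_responses_per_hour and the column names (customer_id, status_code, file_size, computation_time, created_at) are assumptions about the api_responses schema, not the exact production columns:

```sql
CREATE MATERIALIZED VIEW api_responses_per_hour
WITH (timescaledb.continuous, timescaledb.materialized_only = true) AS
SELECT
    customer_id,
    time_bucket('1 hour', created_at) AS bucket,
    -- CASE expressions instead of FILTER, which is not allowed in continuous aggregates
    sum(CASE WHEN status_code < 400 THEN 1 ELSE 0 END)  AS success_count,
    sum(CASE WHEN status_code >= 400 THEN 1 ELSE 0 END) AS error_count,
    avg(file_size)                                       AS avg_file_size,
    max(computation_time)                                AS max_computation_time
FROM api_responses
GROUP BY customer_id, bucket
WITH NO DATA;
```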
In the example above, the option timescaledb.materialized_only=true disables real-time aggregation, and the WITH NO DATA option creates the continuous aggregate instantly, without refreshing any data.
Note: JOIN, FILTER, ORDER BY, and DISTINCT cannot be used when creating the aggregate.
If you want to undo what we just did, the DROP MATERIALIZED VIEW command deletes the aggregate of your choice.
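For example, assuming the view name from the sketch above:

```sql
DROP MATERIALIZED VIEW api_responses_per_hour;
```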
Add a retention policy
You can add a policy to automatically drop your aggregate’s data after a certain amount of time. For instance, the following command creates a policy that drops data older than 6 months.
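A sketch of that call, keeping the hypothetical api_responses_per_hour view:

```sql
SELECT add_retention_policy('api_responses_per_hour', drop_after => INTERVAL '6 months');
```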
And the following will remove the retention policy.
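```sql
-- Remove the retention policy from the view assumed above
SELECT remove_retention_policy('api_responses_per_hour');
```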
Refresh the continuous aggregate
There are three ways to refresh the data for a continuous aggregate:
- manually over a selected time window.
- periodically following an automatic refresh policy.
- continuously with real-time aggregation.
Manually refresh a continuous aggregate
Right now, our continuous aggregate does not have any data. Let’s fill it with a manual refresh over the last 6 months.
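A sketch of such a refresh, keeping the assumed view name and computing the window from the current time:

```sql
CALL refresh_continuous_aggregate(
    'api_responses_per_hour',
    now() - INTERVAL '6 months',
    now()
);
```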
The two timestamps passed as parameters are the start and end of the time window over which the aggregate is computed. Also, keep in mind that even within this window, only the data that actually needs recomputation is processed.
Since you probably have a lot of data, I suggest running your first refresh over a small time window rather than over everything. In our tests, running several small refreshes over short windows was much faster than running a single refresh over a large window.
Create a periodic refresh policy
If we want the aggregate to update its data periodically, we can add a refresh policy. The policy has three arguments:
- start_offset: the beginning of the refresh window, relative to when the refresh runs.
- end_offset: the end of the refresh window, relative to when the refresh runs.
- schedule_interval: the interval between two refresh runs.
The following policy refreshes the data every 10 minutes over the last two hours.
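A sketch of the corresponding call, still using the hypothetical view. end_offset is left NULL so the window extends up to the current time; when both offsets are set, TimescaleDB expects them to span at least two bucket widths (two hours with our one-hour buckets):

```sql
SELECT add_continuous_aggregate_policy(
    'api_responses_per_hour',
    start_offset      => INTERVAL '2 hours',
    end_offset        => NULL,
    schedule_interval => INTERVAL '10 minutes'
);
```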
It can then be deleted with the following:
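```sql
-- Remove the refresh policy from the view assumed above
SELECT remove_continuous_aggregate_policy('api_responses_per_hour');
```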
Opt for real-time continuous aggregation
TimescaleDB offers a real-time aggregation option which, unlike a refresh policy, keeps your aggregate up to date at query time. The following command enables real-time aggregation:
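```sql
-- Sketch: turn real-time aggregation back on for the hypothetical view
ALTER MATERIALIZED VIEW api_responses_per_hour
SET (timescaledb.materialized_only = false);
```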
Query the aggregate
Since TimescaleDB is based on PostgreSQL, continuous aggregates can be queried with any standard SQL commands.
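For example, the following query (using the columns of the hypothetical view defined earlier) returns the number of errors per customer over the last 24 hours:

```sql
SELECT customer_id, sum(error_count) AS errors
FROM api_responses_per_hour
WHERE bucket > now() - INTERVAL '24 hours'
GROUP BY customer_id
ORDER BY errors DESC;
```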
Conclusion
When working with data that has already been aggregated into buckets, some metrics such as the mean, median, or percentiles can be hard to compute. Fortunately, the official TimescaleDB Toolkit provides many analytics tools to compute approximations of these metrics.
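As a rough illustration, assuming the timescaledb_toolkit extension is installed and the hypothetical schema used throughout this post, a percentile sketch can be stored per bucket and combined at query time with the toolkit’s percentile_agg, rollup, and approx_percentile functions:

```sql
-- Store a percentile sketch of computation_time per hourly bucket...
CREATE MATERIALIZED VIEW computation_time_per_hour
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', created_at) AS bucket,
    percentile_agg(computation_time)  AS computation_time_sketch
FROM api_responses
GROUP BY bucket
WITH NO DATA;

-- ...then combine the hourly sketches into, e.g., an approximate daily median.
SELECT
    time_bucket('1 day', bucket) AS day,
    approx_percentile(0.5, rollup(computation_time_sketch)) AS median_computation_time
FROM computation_time_per_hour
GROUP BY day;
```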