# Working with Time Series Data

A time series is a series of data points indexed in time order. In database terms, time series tables include a timestamp column plus additional data columns and each record represents an observation at a particular time.

# Equally-Spaced Time Series

At its simplest, consider a time series database table that records readings from an IoT sensor. Sensor readings are made every 10 minutes and stored in the table:

timestamp	sensorValue
...	...
2021-01-07T09:00:00Z	123.2
2021-01-07T09:10:00Z	111.6
2021-01-07T09:20:00Z	111.6
2021-01-07T09:30:00Z	102.5
2021-01-07T09:40:00Z	105.2
...	...

This table represents an equal-interval, single-dimension time series. That is, observations are recorded at equal intervals (10 minutes) over one dimension: time.

Consider an extended table that stores readings for multiple sensors:

timestamp	sensorID	sensorValue
...	...
2021-01-07T09:00:00Z	1	123.2
2021-01-07T09:00:00Z	2	80.1
2021-01-07T09:10:00Z	1	111.6
2021-01-07T09:10:00Z	2	80.1
2021-01-07T09:20:00Z	1	111.6
2021-01-07T09:20:00Z	2	90.2
2021-01-07T09:30:00Z	1	102.5
2021-01-07T09:30:00Z	2	100.2
2021-01-07T09:40:00Z	1	105.2
2021-01-07T09:40:00Z	2	94.3
...	...

This table represents an equal-interval, two-dimension time series. That is, observations are recorded at equal intervals (10 minutes) over two dimensions: time and sensorID.

# Variably-Spaced Time Series

In the tables above, observations are recorded every 10 minutes whether the sensor reading has changed or not. This can lead to redundancy when recorded values are slowly changing. An alternative strategy is to only store values when they change. For example, compare the following variably-spaced table with the first table above:

timestamp	sensorValue
...	...
2021-01-07T09:00:00Z	123.2
2021-01-07T09:10:00Z	111.6
2021-01-07T09:30:00Z	102.5
2021-01-07T09:40:00Z	105.2
...	...

The records in a variably-Spaced time series table represent a 'change-stream', recording when values change. Storing data in this way results in smaller database table at the expense of more complex querying.

# Query Strategies

# Equally-Spaced Time Series

Querying an equally-spaced time series is simple and can be achieved using the following Flo.w Dataset query definition:

{
  "where": {"timestamp": "<floored timestamp>"}
}

<floored timestamp> is an ISO8601 timestamp aligned with the clock interval used in the data.

WARNING

For performant querying, timestamp should be the table primary key (or be indexed separately).

For multi-dimensional time series you should specify a composite primary key (or other index) with timestamp as the first term - for example, (timestamp, sensorID).

# Variably-Spaced Time Series

With a variably-spaced time series, there is no guarantee that a record exists at a particular timestamp t. To retrieve the latest record at or before t the query must specify a time window: t_start to t and return the latest record within that window:

Time Window Query

The width of the time window is dependent on the data and your use case. It should be wide enough to ensure that a record will be found, which will depend on the expected frequency of change records. You should anticipate that the query may return no data.

Because of the increased complexity of the query, we recommend using a parameterized SQL-type Flo.w dataset to query the time series. For example, to query a single-dimension time series with a 15 minute window, create a Flo.w dataset with ID 'timeseries' and the following SQL as the dataset source:

SELECT * FROM "timeseries_table"
WHERE "timestamp" BETWEEN {{timestamp}}::timestamptz - INTERVAL '15 minutes'
  AND {{timestamp}}::timestamptz
ORDER BY "timestamp" DESC
LIMIT 1

To default to querying the current time when timestamp is null you can use the coalesce function:

SELECT * FROM "timeseries_table"
WHERE "timestamp" BETWEEN coalesce({{timestamp}}, now()) - INTERVAL '15 minutes'
  AND coalesce({{timestamp}}, now())
ORDER BY "timestamp" DESC
LIMIT 1

WARNING

For performant querying, timestamp should be the table primary key (or be indexed separately).

For multi-dimensional time series you should specify a composite primary key (or other index) with timestamp as the first term - for example, (timestamp, sensorID).

The query parameter timestamp is supplied when querying the dataset. For example (using the Flo.w Client Library):

// Query time series at 9:25 with a 15 minute time window
// Returns latest record in window
const results = await flowClient.datasets.query('timeseries', {}, {
    timestamp: '2021-01-07T09:25:00Z'
  });

Querying a two-dimensional time series requires an expanded SQL expression that returns a set of data representing the latest record for each unique entity. For example, to query a two-dimensional version of the sensor table above use the following SQL as the dataset source:

SELECT DISTINCT ON ("sensorID") * FROM "2d_timeseries"
WHERE "timestamp" BETWEEN {{timestamp}}::timestamptz - INTERVAL '15 minutes'
  AND {{timestamp}}::timestamptz
ORDER BY "sensorID", "timestamp" DESC

TIP

On balance, we recommend using variably-spaced rather than equally-spaced time series data.

Although querying is more complex, the complexity can be moved to a SQL-type dataset and is hidden from clients such as front-end web applications.

With the correct indexing the queries shown above are very performant and have the benefit that the query time t does not need to be aligned with the data. In addition, the time series data can also be used to retrieve latest values in 'live' visualizations by setting t to 'now'.

# Table Partitioning

Long-running time series ingests can lead to very large database tables. Consider using PostgreSQL table partitioning (opens new window) if a table is likely to exceed approximately 1 million rows.

We recommend that table partitioning requirements are determined during the planning phase of data ingest and partitioning is set up after designing the database table schema but before creating the ingest dataset or ingesting any data.

Partitioned master tables should be created with the PARTITION BY RANGE ("timestamp") option to partition data by timestamp. Partitions should be created using an appropriate timestamp range (for example, 1 day, 1 week, 1 month) to ensure that partitions do not exceed the recommended total record count. Determine the appropriate range by calculation or experiment during the planning phase of data ingest.

WARNING

Ensure that partition timestamp range is not too small. Many small partitions reduces query efficiency. Aim for partition sizes in the range of 100,000s of rows or around 1GB.

WARNING

Be aware of the total index size of a table or partition. Using indexes that are larger than available memory will cause increased paging to/from disk and will decrease query performance.

WARNING

Do not use calculated time ranges with partitioned tables when querying PostgreSQL v10 and below. The query planner cannot determine which partitions are within range and all partitions will be scanned.

Replace the calculated time range with an explicit time range using two parameters, timestampStart and timestampEnd. For example, replace:

WHERE "timestamp" BETWEEN {{timestamp}}::timestamptz - INTERVAL '15 minutes'
   AND {{timestamp}}::timestamptz

with:

WHERE "timestamp" BETWEEN {{timestampStart}}::timestamptz
  AND {{timestampEnd}}::timestamptz

← Datasets for Ingest Pipelines Library →