Benchmark

At EDF, data generated by automatons – sensors and actuators – are used with critical real-time constraints to operate power stations. Beside this operational usage, these data streams – mainly measurements from sensors – may be mined to extract useful information for failures anticipation, plant optimization, etc. To manage this data, EDF relies on dedicated data management systems called data historians.

An article published in TPCTC 2012 gives an overview of data historians (article also available on the LIRIS webpage). In this article, a benchmark is introduced to establish an objective basis for performance comparison between data historians and other data management systems. This benchmark focuses on EDF use cases.

You can refer to the original article for a more detailed description, along with experimental results with the data historian InfoPlus.21, the RDBMS MySQL and the key-value store Berkeley DB.

Scenario

To compare data historians and RDBMS performances, this benchmark is inspired by the scenario of nuclear power plants data historization. In this context, data generated by sensors distributed on the plant site are aggregated by a daemon communicating with the data historian. For insertions, the benchmark simulates this daemon and pseudo-randomly generate data to be inserted. This data is then accessible for remote users, who can send queries to update, retrieve or analyze this data. After the insertion phase, this benchmark proposes a simple yet representative set of such queries.

Database schema

This benchmark deals with data according to a minimal database schema, centered upon times series data and simplified from EDF nuclear power plants schema. For each variable type (analog or boolean), a description table is defined (ana_desc and bool_desc). Measurements are stored in separate tables (ana_val and bool_val). Each time series is associated with an identifier (id); a short textual description, or name, (label); a creation date (creation_date) and a destruction date (destruction_date). For analog values, the description table ana_desc also contains the unit of measurement (unit), which is usually described in a separate table discarded for this benchmark; a theoretical sampling interval (interval) and two thresholds indicating if the measured value is critically low (threshold_low) or critically high (threshold_high). For boolean values, the description table bool contains two short descriptions associated with values 0 (label_0) and 1 (label_1).

Times series are stored in tables ana_val and bool_val, which contains the time series identifier (id); the timestamp with millisecond precision (date); the value (value) and a small array of eight bits for meta-data, or data quality, (quality).

Logical relational schema:

ana_desc
id INT NOT NULL
label CHAR(40) NOT NULL
creation_date TIMESTAMP NOT NULL
destruction_date TIMESTAMP
unit INT NOT NULL
interval INT NOT NULL
threshold_low FLOAT NOT NULL
threshold_high FLOAT NOT NULL
ana_val
id INT NOT NULL
date TIMESTAMP NOT NULL
value FLOAT NOT NULL
quality TINYINT NOT NULL
bool_desc
id INT NOT NULL
label CHAR(40) NOT NULL
creation_date TIMESTAMP NOT NULL
destruction_date TIMESTAMP
label_0 CHAR(60) NOT NULL
label_1 CHAR(60) NOT NULL
bool_val
id INT NOT NULL
date TIMESTAMP NOT NULL
value BOOLEAN NOT NULL
quality TINYINT NOT NULL

For this benchmark to be compatible with hierarchical data models used by data historians, the relational model defined previously can not be mandatory. For instance, a hierarchical schema can represent the same data and allow functionally equivalent queries to be executed:

Hierarchical schema:

  • ana_desc
    • id
    • label
    • creation_date
    • destruction_date
    • unit
    • interval
    • threshold_low
    • threshold_high
    • ana_val
      • date
      • value
      • quality
  • bool_desc
    • id
    • label
    • creation_date
    • destruction_date
    • label_0
    • label_1
    • bool_val
      • date
      • value
      • quality

Query workload

This benchmark defines twelve queries, representative of EDF practices, to aim at giving an overview of data historians or RDBMS prevalence. Parameters generated at run time are written in brackets. These parameters are exactly the same between each benchmark execution, to obtain identical data and queries. Queries are executed one by one in a fixed order; interactions are currently not evaluated with this benchmark to keep its definition simple and alleviate performances analysis.

Insertion

Data insertion is a fundamental operation for data historians. To optimize these queries, the interface and language are not imposed (i.e. these queries can be translated from SQL to any language or API call, whichever maximizes performances).

Q0.1 Analog values insertions

INSERT INTO ana_val VALUES ([ID], [DATE], [VAL], [QUALITY])

Q0.2 Boolean values insertions

INSERT INTO bool_val VALUES ([ID], [DATE], [VAL], [QUALITY])

Update

Data updates, retrieval and analysis are usually performed by end-users; performance constraints are more flexible compared with insertions.

Q1.1 Update an analog value. The Quality attribute is updated to reflect a manual modification of the data.

UPDATE ana_val
SET value = [VAL], quality = (quality | 128)
WHERE id = [ID] AND date = [DATE]

Q1.2 Update a boolean value. The Quality attribute is updated to reflect a manual modification of the data.

UPDATE bool_value
SET value = [VAL], quality = (quality | 128)
WHERE id = [ID] AND date = [DATE]

Data retrieval and analysis

This benchmark defines nine such queries to evaluate the performances of each system, and identify specific optimizations for some types of queries. Queries without parameters (Q11.1 and Q11.2) are executed only once to refrain from using query caches (storing results in order not to re-evaluate the query).

Raw data extraction

Q2.1 Extract raw data for an analog time series between two Dates, sorted with increasing date values.

SELECT * FROM ana_val
WHERE id = [ID] AND date BETWEEN [START] AND [END]
ORDER BY date ASC

Q2.2 Extract raw data for a boolean time series between two Dates, sorted with increasing Date values.

SELECT * FROM bool_val
WHERE id = [ID] AND date BETWEEN [START] AND [END]
ORDER BY date ASC

Aggregation

Q3.1 Extract data quantity for an analog time series between two dates.

SELECT COUNT(*) FROM ana_val
WHERE id = [ID]
  AND date BETWEEN [START] AND [END]

Q3.2 Extract data quantity for a boolean time series between two dates.

SELECT COUNT(*) FROM bool_val
WHERE id = [ID]
  AND date BETWEEN [START] AND [END]

Q4 Extract the sum of an analog time series between two dates.

SELECT SUM(value) FROM ana_val
WHERE id = [ID]
  AND date BETWEEN [START] AND [END]

Q5 Extract the average of an analog time series between two dates.

SELECT AVG(value) FROM ana_val
WHERE id = [ID]
  AND date BETWEEN [START] AND [END]

Q6 Extract the minimum and maximum values of an analog time series between two dates.

SELECT MIN(value), MAX(value) FROM ana_val
WHERE id = [ID]
  AND date BETWEEN [START] AND [END]

Filtering

Q7 Extract analog values above the threshold indicated in its description (ana_desc.threshold_high).

SELECT date, value FROM ana_desc, ana_val
WHERE ana_desc.id = ana_val.id
  AND ana_desc.id = [ID]
  AND date BETWEEN [START] AND [END]
  AND value > ana_desc.threshold_high

Q8 Extract analog values above a given threshold.

SELECT date, value FROM ana_val
WHERE id = [ID]
  AND date BETWEEN [START] AND [END]
  AND value > [THRESHOLD]

Aggregation of filtered values on multiple time series

Q9 Identify the time series whose values most often do not fall between its high and low thresholds.

SELECT label , COUNT(*) AS count FROM ana_desc, ana_val
WHERE ana_desc.id = ana_val.id
  AND date BETWEEN [START] AND [END]
  AND (value < threshold_low OR value > threshold_high)
GROUP BY ana_desc.id, label ORDER BY count DESC LIMIT 1

Sampling period verification on multiple time series

Q10 Identify the time series whose sampling period do not, by the greatest margin, comply with its description.

SELECT values.id , COUNT(*) AS count FROM ana_desc,
(
  SELECT D1.id, D1.date, MIN(D2.date - D1.date ) AS interval
  FROM ana_val D1, ana_val D2
  WHERE D2.date > D1.date
    AND D1.id = D2.id
    AND D1.date BETWEEN [START] AND [END]
  GROUP BY D1.id, D1.date
) AS values
WHERE values.id = ana_desc.id
  AND values.interval > ana_desc.interval
GROUP BY values.id ORDER BY count DESC LIMIT 1

Current values extraction

Q11.1 Extract most recent values for each analog time series.

SELECT id, value FROM ana_val
WHERE (id, date) IN
(
  SELECT id, MAX(date) FROM ana_val
  GROUP BY id
)
ORDER BY id

Q11.2 Extract most recent values for each boolean time series.

SELECT id, value FROM bool_val
WHERE (id, date) IN
(
  SELECT id, MAX(date) FROM bool_val
  GROUP BY id
)
ORDER BY id