Aggregations
seq-db supports several types of aggregations: functional aggregations, histograms, and timeseries. Each of them relies on the inverted index, so to calculate an aggregation over a field, that field must be indexed.
Function aggregations
Aggregations allow computing statistical values over document fields that match a query, e.g. the number of logs written by each service in a given interval, or all unique values of a field.
seq-db supports various aggregation functions:
- AGG_FUNC_SUM — sum of field values
- AGG_FUNC_AVG — average value of the field
- AGG_FUNC_MIN — minimum value of the field
- AGG_FUNC_MAX — maximum value of the field
- AGG_FUNC_QUANTILE — quantile value for the field
- AGG_FUNC_UNIQUE — computation of unique field values (not supported in timeseries)
- AGG_FUNC_COUNT — number of documents for each unique value of the field
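To make the semantics concrete, here is a minimal in-memory Python sketch of what each function computes over the matching field values. This is illustrative only, not the seq-db implementation (which runs server-side over the inverted index), and the quantile uses a simple nearest-rank definition as an assumption:

```python
# Illustrative in-memory equivalents of the aggregation functions.
# Not the seq-db implementation; for semantics only.

def agg_sum(values):
    return sum(values)

def agg_avg(values):
    return sum(values) / len(values)

def agg_min(values):
    return min(values)

def agg_max(values):
    return max(values)

def agg_quantile(values, q):
    # Nearest-rank quantile over the sorted values (assumed definition).
    ordered = sorted(values)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def agg_unique(values):
    return set(values)

def agg_count(values):
    # Number of documents per unique value of the field.
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts

values = [120, 80, 200, 80]
print(agg_sum(values))            # 480
print(agg_avg(values))            # 120.0
print(agg_quantile(values, 0.5))  # 120
print(agg_count(values))          # {120: 1, 80: 2, 200: 1}
```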
For the API of these functions, please refer to the public API.
To better understand how aggregations work, each example below is illustrated with an equivalent SQL query.
Sum, average, minimum, maximum, quantile
Calculation of the aforementioned aggregations requires:
- AGG_FUNC — one of AGG_FUNC_SUM, AGG_FUNC_AVG, AGG_FUNC_MIN, AGG_FUNC_MAX, AGG_FUNC_QUANTILE
- aggregate_by_field — the field on which the aggregation will be applied
- group_by_field — the field by which values will be grouped (not used by all aggregations)
- filtering_query — query that filters only the logs relevant to the aggregation
- quantile — only for AGG_FUNC_QUANTILE
In general, this translates to the following SQL query:
SELECT <group_by_field>, AGG_FUNC(<aggregate_by_field>)
FROM db
WHERE <filtering_query>
GROUP BY <group_by_field>
Translating to our API:
grpcurl -plaintext -d '
{
"query": {
"from": "2000-01-01T00:00:00Z",
"to": "2077-01-01T00:00:00Z",
"query": "<filtering_query>"
},
"aggs": [
{
"field": "<aggregate_by_field>",
"func": "AGG_FUNC",
"group_by": "<group_by_field>"
}
]
}' localhost:9004 seqproxyapi.v1.SeqProxyApi/GetAggregation
As a real-world example, we may want to calculate the average response time for services that have a response_time field. We can write the following query:
SELECT service, AVG(response_time)
FROM db
WHERE response_time:* -- meaning that `response_time` field exists in logs
GROUP BY service
Using our API:
grpcurl -plaintext -d '
{
"query": {
"from": "2000-01-01T00:00:00Z",
"to": "2077-01-01T00:00:00Z",
"query": "response_time:*"
},
"aggs": [
{
"field": "response_time",
"func": "AGG_FUNC_AVG",
"group_by": "service"
}
]
}' localhost:9004 seqproxyapi.v1.SeqProxyApi/GetAggregation
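Conceptually, the server groups the matching documents by the group_by field and applies the function to each group. A rough in-memory Python equivalent of the request above (illustrative only; the sample documents and names are assumptions, not seq-db internals):

```python
# Rough in-memory equivalent of AVG(response_time) GROUP BY service.
# Illustrative only; the real work happens inside seq-db.
from collections import defaultdict

docs = [
    {"service": "auth", "response_time": 120},
    {"service": "auth", "response_time": 80},
    {"service": "billing", "response_time": 200},
]

groups = defaultdict(list)
for doc in docs:
    if "response_time" in doc:  # plays the role of the response_time:* filter
        groups[doc["service"]].append(doc["response_time"])

result = {service: sum(v) / len(v) for service, v in groups.items()}
print(result)  # {'auth': 100.0, 'billing': 200.0}
```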
Count, unique
Count and unique aggregations are very similar to the examples above, except that they do not need an additional group_by_field, since they already group by aggregate_by_field.
SQL query for the AGG_FUNC_COUNT
aggregation:
SELECT <aggregate_by_field>, COUNT (*)
FROM db
WHERE <filtering_query>
GROUP BY <aggregate_by_field>
Translating to our API:
grpcurl -plaintext -d '
{
"query": {
"from": "2000-01-01T00:00:00Z",
"to": "2077-01-01T00:00:00Z",
"query": "<filtering_query>"
},
"aggs": [
{
"field": "<aggregate_by_field>",
"func": "AGG_FUNC_COUNT"
}
]
}' localhost:9004 seqproxyapi.v1.SeqProxyApi/GetAggregation
As a real-world example, we may want to calculate the number of logs for each logging level (debug, info, etc.) of a particular service, e.g. seq-db. We can write the following query:
SELECT level, COUNT(*)
FROM db
WHERE service:seq-db
GROUP BY level
Using our API:
grpcurl -plaintext -d '
{
"query": {
"from": "2000-01-01T00:00:00Z",
"to": "2077-01-01T00:00:00Z",
"query": "service:seq-db"
},
"aggs": [
{
"field": "level",
"func": "AGG_FUNC_COUNT"
}
]
}' localhost:9004 seqproxyapi.v1.SeqProxyApi/GetAggregation
Histograms
Histograms allow users to visually interpret the distribution of logs satisfying a given query, e.g. the number of logs of a particular service over a given time interval.
Histograms can be queried separately using GetHistogram, or together with documents and functional aggregations using ComplexSearch.
For a more detailed API description and examples, please refer to the public API.
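The underlying idea can be sketched in a few lines of Python: assign each matching document's timestamp to a fixed-width time bucket and count per bucket. This is a conceptual sketch (bucket width and timestamps are made up), not how seq-db computes histograms internally:

```python
# Illustrative histogram: count matching logs per fixed-width time bucket.
from collections import Counter

BUCKET = 60  # bucket width in seconds (assumed for the example)

timestamps = [3, 45, 61, 62, 130]  # seconds since some epoch
histogram = Counter(ts // BUCKET * BUCKET for ts in timestamps)
print(sorted(histogram.items()))  # [(0, 2), (60, 2), (120, 1)]
```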
Timeseries
Timeseries allow calculating aggregations per time interval and visualizing them. They are something in between histograms and functional aggregations: they simultaneously calculate multiple histogram-like series for the given aggregation functions.
Consider the previous histogram example, where we visualized the number of logs over time for only one service at a time. Using timeseries, we can calculate the number of logs for every service simultaneously by applying AGG_FUNC_COUNT over the service field.
Another example of using timeseries is visualizing the number of logs for each log level over time. This can be exceptionally useful when debugging real-time problems: simply visualize the number of logs per level and look for unusual spikes and the logs associated with them.
Because timeseries are basically aggregations, they share the aggregation API, except that a new interval field is present, which determines the time buckets the aggregation is calculated over. For the details, please refer to the public API.
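Putting the pieces together, a timeseries is conceptually one histogram per group. For AGG_FUNC_COUNT over the service field, the idea can be sketched as follows (illustrative Python with made-up logs and an assumed bucket width, not seq-db internals):

```python
# Illustrative timeseries: AGG_FUNC_COUNT over "service", one count per
# (service, time bucket) pair -- effectively a histogram per service.
from collections import Counter

INTERVAL = 60  # bucket width in seconds (assumed for the example)

logs = [
    {"ts": 5,  "service": "auth"},
    {"ts": 20, "service": "billing"},
    {"ts": 70, "service": "auth"},
    {"ts": 75, "service": "auth"},
]

series = Counter((log["service"], log["ts"] // INTERVAL * INTERVAL)
                 for log in logs)
print(dict(series))
# {('auth', 0): 1, ('billing', 0): 1, ('auth', 60): 2}
```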