Parquet
Implementation Details
The Parquet backend is implemented as a PlainFileHandler within the PlainStoragePlugin. It uses the same partitioning logic as the PB backend (e.g., hourly, daily, or yearly partitions).
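To illustrate what hourly, daily, or yearly partitioning means for file layout, the sketch below derives a partition key from an event timestamp. The underscore-separated key formats are an assumption for illustration only; the actual file naming is decided by the storage plugin.

```python
from datetime import datetime, timezone

# Hypothetical partition-key formats, one per granularity discussed above.
# These strftime patterns are illustrative, not the plugin's exact naming.
FORMATS = {
    "hourly": "%Y_%m_%d_%H",
    "daily": "%Y_%m_%d",
    "yearly": "%Y",
}

def partition_key(ts: datetime, granularity: str) -> str:
    """Return the partition key an event with timestamp `ts` falls into."""
    return ts.strftime(FORMATS[granularity])

ts = datetime(2024, 6, 10, 13, 42, tzinfo=timezone.utc)
print(partition_key(ts, "hourly"))  # 2024_06_10_13
print(partition_key(ts, "yearly"))  # 2024
```

All events whose timestamps map to the same key end up in the same file, which is what lets retrieval open only the partitions overlapping a requested time range.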
Schema
The schema used in Parquet files is derived from the Protocol Buffers definitions in EPICSEvent.proto. This ensures consistency between the PB and Parquet backends.
Configuration and Management
For details on how to configure the Parquet backend, including compression settings and data conversion, please see the Storage Plugins page in the Sysadmin guide.
Tools for Analyzing Parquet Files
One of the main advantages of the Parquet backend is the ability to use standard industry tools to analyze the data.
Apache Parquet CLI
The parquet-cli (or parquet-tools) is a command-line utility for inspecting Parquet files. It allows you to view the schema, metadata, and the actual data stored in the files.
Viewing Metadata
The meta command displays detailed information about row groups, column statistics (min/max values), and compression.
$ parquet meta your_file.parquet
File path: your_file.parquet
Created by: parquet-mr version 1.17.0
Properties:
parquet.proto.descriptor: name: "VectorDouble"
...
Schema:
message EPICS.VectorDouble {
required int32 secondsintoyear = 1;
required int32 nano = 2;
repeated double val = 3;
optional int32 severity = 4;
optional int32 status = 5;
optional int32 repeatcount = 6;
repeated group fieldvalues = 7 {
required binary name (STRING) = 1;
required binary val (STRING) = 2;
}
optional boolean fieldactualchange = 8;
}
Row group 0: count: 2667650 25,09 B records start: 4 total(compressed): 63,833 MB total(uncompressed):1,694 GB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
secondsintoyear INT32 Z _ 2667650 2,68 B 0 "10368000" / "13046399"
nano INT32 Z _ 2667650 3,33 B 0 "30013" / "999946202"
val DOUBLE Z _ R_ F 266765000 0,19 B 0 "-0.0" / "16.818929577945916"
severity INT32 Z _ 2667650 0,00 B 2667650
status INT32 Z _ 2667650 0,00 B 2667650
repeatcount INT32 Z _ 2667650 0,00 B 2667650
fieldvalues.name BINARY Z _ 2667692 0,00 B 2667609 "DESC" / "startup"
fieldvalues.val BINARY Z _ 2667692 0,00 B 2667609 "" / "true"
fieldactualchange BOOLEAN Z _ 2667650 0,00 B 2667613 "false" / "true"
Viewing the Schema
The schema command shows the logical schema of the file.
$ parquet schema your_file.parquet
Example output:
{
"type": "record",
"name": "VectorDouble",
"namespace": "EPICS",
"fields": [
{
"name": "secondsintoyear",
"type": "int"
},
{
"name": "nano",
"type": "int"
},
{
"name": "val",
"type": {
"type": "array",
"items": "double"
},
"default": []
},
{
"name": "severity",
"type": ["null", "int"],
"default": null
},
{
"name": "status",
"type": ["null", "int"],
"default": null
},
{
"name": "repeatcount",
"type": ["null", "int"],
"default": null
},
{
"name": "fieldvalues",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "fieldvalues",
"namespace": "",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "val",
"type": "string"
}
]
}
},
"default": []
},
{
"name": "fieldactualchange",
"type": ["null", "boolean"],
"default": null
}
]
}
Viewing Data
The head command displays the first few records of a file.
$ parquet head -n 2 your_file.parquet
Example output:
secondsintoyear: 123456
nano: 1000000
val: 10.5
secondsintoyear: 123457
nano: 2000000
val: 10.6
Other Tools
Because Parquet is a standard format, you can also analyze the data with tools such as:
Python: pandas with the pyarrow or fastparquet engine.
DuckDB: query Parquet files directly with SQL.
Polars: a fast, multi-threaded DataFrame library for Rust and Python.
Apache Arrow: high-performance in-memory columnar processing.