Parquet

Implementation Details

The Parquet backend is implemented as a PlainFileHandler within the PlainStoragePlugin. It uses the same partitioning logic as the PB backend (e.g., hourly, daily, or yearly partitions).
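The exact partition-naming logic lives in the appliance source; as a rough illustration only, hourly, daily, and yearly partition keys of the kind used for PB files can be derived like this (the function name and date formats below are assumptions, not the plugin's actual code):

```python
from datetime import datetime, timezone

# Illustrative sketch only: real partition naming is handled by the
# PlainStoragePlugin; this just demonstrates the three granularities.
def partition_key(ts: datetime, granularity: str = "hour") -> str:
    if granularity == "hour":
        return ts.strftime("%Y_%m_%d_%H")
    if granularity == "day":
        return ts.strftime("%Y_%m_%d")
    return ts.strftime("%Y")  # yearly

print(partition_key(datetime(2024, 6, 1, 13, tzinfo=timezone.utc)))
```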

Schema

The schema used in Parquet files is derived from the Protocol Buffers definitions in EPICSEvent.proto. This ensures consistency between the PB and Parquet backends.

Configuration and Management

For details on how to configure the Parquet backend, including compression settings and data conversion, please see the Storage Plugins page in the Sysadmin guide.

Tools for Analyzing Parquet Files

One of the main advantages of the Parquet backend is the ability to use standard industry tools to analyze the data.

Apache Parquet CLI

parquet-cli (the successor to the older parquet-tools) is a command-line utility for inspecting Parquet files. It lets you view the schema, the metadata, and the actual data stored in a file.

Viewing Metadata

The meta command displays detailed information about row groups, column statistics (min/max values), and compression.

$ parquet meta your_file.parquet

File path:  your_file.parquet
Created by: parquet-mr version 1.17.0
Properties:
           parquet.proto.descriptor: name: "VectorDouble"
...
Schema:
message EPICS.VectorDouble {
  required int32 secondsintoyear = 1;
  required int32 nano = 2;
  repeated double val = 3;
  optional int32 severity = 4;
  optional int32 status = 5;
  optional int32 repeatcount = 6;
  repeated group fieldvalues = 7 {
    required binary name (STRING) = 1;
    required binary val (STRING) = 2;
  }
  optional boolean fieldactualchange = 8;
}

Row group 0:  count: 2667650  25.09 B records  start: 4  total(compressed): 63.833 MB total(uncompressed): 1.694 GB
--------------------------------------------------------------------------------
                   type      encodings count     avg size   nulls   min / max
secondsintoyear    INT32     Z   _     2667650   2.68 B     0       "10368000" / "13046399"
nano               INT32     Z   _     2667650   3.33 B     0       "30013" / "999946202"
val                DOUBLE    Z _ R_ F  266765000 0.19 B     0       "-0.0" / "16.818929577945916"
severity           INT32     Z   _     2667650   0.00 B     2667650
status             INT32     Z   _     2667650   0.00 B     2667650
repeatcount        INT32     Z   _     2667650   0.00 B     2667650
fieldvalues.name   BINARY    Z   _     2667692   0.00 B     2667609 "DESC" / "startup"
fieldvalues.val    BINARY    Z   _     2667692   0.00 B     2667609 "" / "true"
fieldactualchange  BOOLEAN   Z   _     2667650   0.00 B     2667613 "false" / "true"

Viewing the Schema

The schema command prints the file's logical schema as an Avro-style JSON representation.

$ parquet schema your_file.parquet

Example output:

{
  "type": "record",
  "name": "VectorDouble",
  "namespace": "EPICS",
  "fields": [
    {
      "name": "secondsintoyear",
      "type": "int"
    },
    {
      "name": "nano",
      "type": "int"
    },
    {
      "name": "val",
      "type": {
        "type": "array",
        "items": "double"
      },
      "default": []
    },
    {
      "name": "severity",
      "type": ["null", "int"],
      "default": null
    },
    {
      "name": "status",
      "type": ["null", "int"],
      "default": null
    },
    {
      "name": "repeatcount",
      "type": ["null", "int"],
      "default": null
    },
    {
      "name": "fieldvalues",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "fieldvalues",
          "namespace": "",
          "fields": [
            {
              "name": "name",
              "type": "string"
            },
            {
              "name": "val",
              "type": "string"
            }
          ]
        }
      },
      "default": []
    },
    {
      "name": "fieldactualchange",
      "type": ["null", "boolean"],
      "default": null
    }
  ]
}

Viewing Data

The head command prints the first few records in the file.

$ parquet head -n 2 your_file.parquet

Example output:

secondsintoyear: 123456
nano: 1000000
val: 10.5

secondsintoyear: 123457
nano: 2000000
val: 10.6

Other Tools

Because Parquet is a standard format, you can also use tools like:

  • Python: pandas with the pyarrow or fastparquet engine.

  • DuckDB: query Parquet files directly with SQL.

  • Polars: a fast, multi-threaded DataFrame library for Rust and Python.

  • Apache Arrow: high-performance, in-memory columnar processing.