# Parquet

## Implementation Details

The Parquet backend is implemented as a [PlainFileHandler](../_static/javadoc/edu/stanford/slac/archiverappliance/plain/PlainFileHandler.html) within the [PlainStoragePlugin](../_static/javadoc/edu/stanford/slac/archiverappliance/plain/PlainStoragePlugin.html). It uses the same partitioning logic as the PB backend (e.g., hourly, daily, or yearly partitions).

### Schema

The schema used in Parquet files is derived from the Protocol Buffers definitions in `EPICSEvent.proto`. This ensures consistency between the PB and Parquet backends.

## Configuration and Management

For details on how to configure the Parquet backend, including compression settings and data conversion, see the [Storage Plugins](../../sysadmin/references/storage_plugins.md#apache-parquet-backend) page in the Sysadmin guide.

## Tools for Analyzing Parquet Files

One of the main advantages of the Parquet backend is that the data can be analyzed with standard industry tools.

### Apache Parquet CLI

The [parquet-cli](https://github.com/apache/parquet-java/tree/master/parquet-cli) (or `parquet-tools`) is a command-line utility for inspecting Parquet files. It lets you view the schema, the metadata, and the actual data stored in the files.

#### Viewing Metadata

The `meta` command displays detailed information about row groups, column statistics (min/max values), and compression.

```text
$ parquet meta your_file.parquet

File path:  your_file.parquet
Created by: parquet-mr version 1.17.0
Properties:
  parquet.proto.descriptor: name: "VectorDouble"
  ...
Schema:
message EPICS.VectorDouble {
  required int32 secondsintoyear = 1;
  required int32 nano = 2;
  repeated double val = 3;
  optional int32 severity = 4;
  optional int32 status = 5;
  optional int32 repeatcount = 6;
  repeated group fieldvalues = 7 {
    required binary name (STRING) = 1;
    required binary val (STRING) = 2;
  }
  optional boolean fieldactualchange = 8;
}

Row group 0:  count: 2667650  25,09 B records  start: 4  total(compressed): 63,833 MB total(uncompressed):1,694 GB
--------------------------------------------------------------------------------
                   type      encodings  count      avg size  nulls    min / max
secondsintoyear    INT32     Z _        2667650    2,68 B    0        "10368000" / "13046399"
nano               INT32     Z _        2667650    3,33 B    0        "30013" / "999946202"
val                DOUBLE    Z _ R_ F   266765000  0,19 B    0        "-0.0" / "16.818929577945916"
severity           INT32     Z _        2667650    0,00 B    2667650
status             INT32     Z _        2667650    0,00 B    2667650
repeatcount        INT32     Z _        2667650    0,00 B    2667650
fieldvalues.name   BINARY    Z _        2667692    0,00 B    2667609  "DESC" / "startup"
fieldvalues.val    BINARY    Z _        2667692    0,00 B    2667609  "" / "true"
fieldactualchange  BOOLEAN   Z _        2667650    0,00 B    2667613  "false" / "true"
```

#### Viewing the Schema

The `schema` command shows the logical schema of the file.

```bash
parquet schema your_file.parquet
```

Example output:

```json
{
  "type": "record",
  "name": "VectorDouble",
  "namespace": "EPICS",
  "fields": [
    { "name": "secondsintoyear", "type": "int" },
    { "name": "nano", "type": "int" },
    { "name": "val", "type": { "type": "array", "items": "double" }, "default": [] },
    { "name": "severity", "type": ["null", "int"], "default": null },
    { "name": "status", "type": ["null", "int"], "default": null },
    { "name": "repeatcount", "type": ["null", "int"], "default": null },
    {
      "name": "fieldvalues",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "fieldvalues",
          "namespace": "",
          "fields": [
            { "name": "name", "type": "string" },
            { "name": "val", "type": "string" }
          ]
        }
      },
      "default": []
    },
    { "name": "fieldactualchange", "type": ["null", "boolean"], "default": null }
  ]
}
```

#### Viewing Data

The `head` command displays the first few records in JSON format.

```bash
parquet head -n 2 your_file.parquet
```

Example output:

```text
secondsintoyear: 123456
nano: 1000000
val: 10.5

secondsintoyear: 123457
nano: 2000000
val: 10.6
```

### Other Tools

Because Parquet is a standard format, you can also use tools like:

- **Python**: [pandas](https://pandas.pydata.org/) with [pyarrow](https://arrow.apache.org/docs/python/) or [fastparquet](https://fastparquet.readthedocs.io/).
- **DuckDB**: [DuckDB](https://duckdb.org/) queries Parquet files directly using SQL.
- **Polars**: [Polars](https://pola.rs/) is a fast, multi-threaded DataFrame library for Rust and Python.
- **Apache Arrow**: [Apache Arrow](https://arrow.apache.org/) provides high-performance in-memory processing.
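
Whichever tool you use, each record carries only `secondsintoyear` and `nano`, so an absolute timestamp has to be rebuilt from the year the file belongs to. The following is a minimal sketch of that step in Python, not the appliance's own code. It assumes that the year is known from outside the record (for example from the partition name), that `secondsintoyear` counts from 1 January 00:00:00 UTC, and that the `val` column can be read as a list of doubles. The function names `to_utc_datetime` and `load_events` and the year `2023` are hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_utc_datetime(year: int, secondsintoyear: int, nano: int) -> datetime:
    """Combine an (assumed known) year with one record's time fields."""
    # Assumption: the offset is counted from the start of the year in UTC.
    start_of_year = datetime(year, 1, 1, tzinfo=timezone.utc)
    # datetime resolves microseconds only, so sub-microsecond digits are dropped.
    return start_of_year + timedelta(
        seconds=secondsintoyear, microseconds=nano // 1000
    )


def load_events(path: str, year: int):
    """Read (timestamp, val) pairs from one file; needs the pyarrow package."""
    # Imported lazily so that to_utc_datetime stays usable without pyarrow.
    import pyarrow.parquet as pq

    table = pq.read_table(path, columns=["secondsintoyear", "nano", "val"])
    return [
        (to_utc_datetime(year, row["secondsintoyear"], row["nano"]), row["val"])
        for row in table.to_pylist()
    ]


if __name__ == "__main__":
    # 10368000 and 30013 are the minimum values from the `meta` output above;
    # the year 2023 is an arbitrary example, not taken from the file.
    print(to_utc_datetime(2023, 10368000, 30013).isoformat())
```

`load_events` materializes every row as a Python object, which is convenient for small files but slow for row groups with millions of records. For those, a vectorized conversion in pandas, Polars, or DuckDB is likely a better fit.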