# The .pb file format and the .pbraw binary protocol

## The `.pb` file format

The
[PlainStoragePlugin](../_static/javadoc/edu/stanford/slac/archiverappliance/plain/PlainStoragePlugin.html)
in the EPICS archiver appliance uses Google\'s
[ProtocolBuffers](https://developers.google.com/protocol-buffers) as the
serialization mechanism. The PB definitions mapping EPICS DBR types to
PB messages can be found in [EPICSEvent.proto](../../../EPICSEvent.proto). PB
files contain serialized PB messages; one per sample; a sample per line.
The first line in a PB file is a header (`PayloadInfo` PB message) that
contains some basic information like the PV name, its DBR type and so
on.

![image](../../images/pbfile.png)

As serialized PB messages are binary data; after serialization, newline
characters are escaped to maintain a \"sample per line\" constraint.

1. The ASCII escape character `0x1B` is escaped to the following two
   characters `0x1B 0x01`
2. The ASCII newline character `\n` or `0x0A` is escaped to the
   following two characters `0x1B 0x02`
3. The ASCII carriage return character `0x0D` is escaped to the
   following two characters `0x1B 0x03`

Because of the sample per line constraint, one can use `wc -l` to
determine the number of events in a PB file. The \"sample per line\"
constraint also lets us determine where a sample begins and ends at any
arbitrary location in the file.

## Configuration

For details on how to configure the Protocol Buffers backend, including time partitioning and storage stage setup, please see the [Storage Plugins](../../sysadmin/storage_plugins#protocol-buffers-pb-backend) page in the Sysadmin guide.

PB files try to optimize on storage consumption. On an average, an
`EPICS DBR_DOUBLE/PB ScalarDouble` consumes about 21 bytes per sample.
To save space, the record processing timestamps in the samples are split
into three parts

1. **year** - This is stored once in the PB file in the header.
2. **secondsintoyear** - This is stored with each sample.
3. **nano** - This is stored with each sample.

This leads to the side-effect that each PB file \"belongs\" to a year.
In addition, the record processing timestamps are guaranteed to be and
expected to be monotonically increasing. The \"monotonically increasing
timestamps\" constraint lets us use various search algorithms on PB
files without the need for an index. The
[PlainStoragePlugin](../_static/javadoc/edu/stanford/slac/archiverappliance/plain/PlainStoragePlugin.html)
handles the translation back and forth between DBR types and raw PB
messages and also enforces a strict partitioning.

The installation bundle also includes some utilities that manipulate PB
files. These can be found in the `install/pbutils` folder of the `mgmt`
webapp. These include

1. **printTimes.sh** - This utility prints the record processing
   timestamps of all the samples in the set of specified PB files.
2. **pb2json.sh** - This utility prints all the data in all the samples
   in the set of specified PB files as JSON that can potentially be
   loaded into Python or other languages.
3. **validate.sh** - This utility performs some simple validation of
   the set of specified PB files or PB files in the specified folders.
4. **repair.sh** - This utility performs some simple validation of the
   set of specified PB files or PB files in the specified folders. If
   errors are found in a PB file, the PB file is repaired by copying
   the valid samples into a new file and then renaming it to the old
   file name. It also support an option to make a backup of the
   original file before attempting to fix it.

## The PB/HTTP protocol

The `PB/HTTP` is a binary protocol over HTTP that is an extension of the
`.pb` file format. The main difference is that PB files contain only one
chunk (header+samples) while the PB/HTTP can contain many chunks. Chunks
are separated by empty lines (`\n` characters) similar to how HTTP
separates its headers from the body. While streaming data over the
PB/HTTP protocol, the server also uses [HTTP chunks](http://en.wikipedia.org/wiki/Chunked_transfer_encoding) to
transfer the data across. There is no strict formula on how many chunks
there will per data retrieval request; the server chunks data based on
data source/partition/other parameters. Both
[pbrawclient](https://github.com/slacmshankar/epicsarchiverap_pbrawclient/)
and [carchivetools](https://github.com/epicsdeb/carchivetools) handle
the multiple chunks in a seamless fashion and present the data to the
caller as a single event stream.

![image](../../images/pbhttp.png)