PARQUET-2310: implementation status (#34)
Add outline of implementation status tables.

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
alippai and alamb authored Jul 4, 2024
1 parent 2d38aa5 commit 19eb00f
Showing 1 changed file with 124 additions and 0 deletions.
124 changes: 124 additions & 0 deletions content/en/docs/File Format/implementationstatus.md
@@ -0,0 +1,124 @@
---
title: "Implementation status"
linkTitle: "Implementation status"
weight: 8
---

This page summarizes the features supported by different Parquet
implementations.

*Note*: This is a work in progress and we would welcome help expanding its scope.

### Legend
The value in each box means:
* ✅: supported
* ❌: not supported
* (blank): no data

Implementations:
* `C++`: [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)
* `Java`: [parquet-java](https://github.com/apache/parquet-java)
* `Go`: [parquet-go](https://github.com/apache/arrow/tree/main/go/parquet)
* `Rust`: [parquet-rs](https://github.com/apache/arrow-rs/blob/master/parquet/README.md)



### Physical types

| Data type | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| BOOLEAN | | | | |
| INT32 | | | | |
| INT64 | | | | |
| INT96 (1) | | | | |
| FLOAT | | | | |
| DOUBLE | | | | |
| BYTE_ARRAY | | | | |
| FIXED_LEN_BYTE_ARRAY | | | | |

* \(1) This type is deprecated, but as of 2024 it is still common in newly written Parquet files.


### Logical types

| Data type | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| STRING | | | | |
| ENUM | | | | |
| UUID | | | | |
| 8, 16, 32, 64 bit signed and unsigned INT | | | | |
| DECIMAL (INT32) | | | | |
| DECIMAL (INT64) | | | | |
| DECIMAL (BYTE_ARRAY) | | | | |
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | |
| DATE | | | | |
| TIME (INT32) | | | | |
| TIME (INT64) | | | | |
| TIMESTAMP (INT64) | | | | |
| INTERVAL | | | | |
| JSON | | | | |
| BSON | | | | |
| LIST | | | | |
| MAP | | | | |
| UNKNOWN (always null) | | | | |
| FLOAT16 | | | | |

### Encodings

| Encoding | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| PLAIN | | | | |
| PLAIN_DICTIONARY | | | | |
| RLE_DICTIONARY | | | | |
| RLE | | | | |
| BIT_PACKED (deprecated) | | | | |
| DELTA_BINARY_PACKED | | | | |
| DELTA_LENGTH_BYTE_ARRAY | | | | |
| DELTA_BYTE_ARRAY | | | | |
| BYTE_STREAM_SPLIT | | | | |

### Compressions

| Compression | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| UNCOMPRESSED | | | | |
| BROTLI | | | | |
| GZIP | | | | |
| LZ4 (deprecated) | | | | |
| LZ4_RAW | | | | |
| LZO | | | | |
| SNAPPY | | | | |
| ZSTD | | | | |

### Other format-level features

| Feature | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| xxHash-based bloom filters | | | | |
| Bloom filter length (1) | | | | |
| Statistics min_value, max_value | | | | |
| Page index | | | | |
| Page CRC32 checksum | | | | |
| Modular encryption | | | | |
| Size statistics (2) | | | | |


* \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length

* \(2) In parquet.thrift: ColumnMetaData->size_statistics
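
As an illustration of what supporting the "Page CRC32 checksum" row involves, here is a minimal Python sketch. It assumes the checksum is the standard CRC32 (the same one gzip uses), computed over the page bytes as written to disk and stored as a signed 32-bit integer. The authoritative definition is in parquet.thrift. The helper names `page_crc32` and `verify_page` are hypothetical and do not come from any Parquet library.

```python
import zlib


def page_crc32(page_bytes: bytes) -> int:
    """Return the CRC32 of page_bytes as a signed 32-bit integer.

    Assumption: the stored field is a signed 32-bit integer, so the
    unsigned result of zlib.crc32 is mapped into the signed range.
    """
    unsigned = zlib.crc32(page_bytes) & 0xFFFFFFFF
    return unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Check page bytes against a stored checksum.

    Both sides are compared modulo 2**32, so a signed or an unsigned
    stored value is accepted.
    """
    return (zlib.crc32(page_bytes) & 0xFFFFFFFF) == (stored_crc & 0xFFFFFFFF)


page = b"example page bytes"  # hypothetical stand-in for serialized page data
crc = page_crc32(page)
corrupted = bytes([page[0] ^ 0x01]) + page[1:]  # flip one bit in the first byte

print(verify_page(page, crc))       # True
print(verify_page(corrupted, crc))  # False
```

A reader that implements this feature would typically run such a check before decompressing or decoding a page and report corruption on a mismatch.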

### High-level data APIs for Parquet feature usage

| Feature | C++ | Java | Go | Rust |
| -------------------------------------------- | ----- | ------ | ----- | ----- |
| External column data (1) | | | | |
| Row group "Sorting column" metadata (2) | | | | |
| Row group pruning using statistics | | | | |
| Reading select columns only | | | | |
| Page pruning using statistics | | | | |
| Page pruning using bloom filter | | | | |


* \(1) In parquet.thrift: ColumnChunk->file_path

* \(2) In parquet.thrift: RowGroup->sorting_columns
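
To make "Row group pruning using statistics" concrete, the following Python sketch shows the general idea under simplifying assumptions: each row group carries optional `min_value`/`max_value` statistics for a single integer column, and the query is a closed range filter. The `RowGroupStats` class and the `prune_row_groups` function are hypothetical illustrations, not the API of any implementation listed above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column chunk's min/max statistics."""
    row_group: int
    min_value: Optional[int]  # None models missing statistics
    max_value: Optional[int]


def prune_row_groups(stats: List[RowGroupStats], lo: int, hi: int) -> List[int]:
    """Return indices of row groups that may contain a value in [lo, hi]."""
    keep = []
    for s in stats:
        if s.min_value is None or s.max_value is None:
            # Without statistics the row group cannot be ruled out.
            keep.append(s.row_group)
        elif s.max_value >= lo and s.min_value <= hi:
            # The [min, max] range overlaps the query range.
            keep.append(s.row_group)
    return keep


stats = [
    RowGroupStats(0, 1, 100),
    RowGroupStats(1, 101, 200),
    RowGroupStats(2, None, None),
    RowGroupStats(3, 201, 300),
]

print(prune_row_groups(stats, 150, 180))  # [1, 2]
```

Pruning is conservative: a row group is skipped only when its statistics prove that no row can match, so row groups without statistics are always read.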
