Merge pull request apache#37 from wgtmac/sync_format_2.10.0

Sync site with format release v2.10.0
vinooganesh · Jan 15, 2024 · 7bc0b28
2 parents 8f9954c + 7cf58a9
commit 7bc0b28
Showing 23 changed files with 1,250 additions and 92 deletions.
diff --git a/content/en/docs/Concepts/_index.md b/content/en/docs/Concepts/_index.md
@@ -5,19 +5,29 @@ weight: 4
 description: >
   Glossary of relevant terminology.
 ---
-*Block (hdfs block)*: This means a block in hdfs and the meaning is unchanged for describing this file format. The file format is designed to work well on top of hdfs.
+  - *Block (HDFS block)*: This means a block in HDFS and the meaning is
+    unchanged for describing this file format.  The file format is
+    designed to work well on top of HDFS.
 
-*File*: A hdfs file that must include the metadata for the file. It does not need to actually contain the data.
+  - *File*: An HDFS file that must include the metadata for the file.
+    It does not need to actually contain the data.
 
-*Row group*: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.
+  - *Row group*: A logical horizontal partitioning of the data into rows.
+    There is no physical structure that is guaranteed for a row group.
+    A row group consists of a column chunk for each column in the dataset.
 
-*Column chunk*: A chunk of the data for a particular column. These live in a particular row group and is guaranteed to be contiguous in the file.
+  - *Column chunk*: A chunk of the data for a particular column.  They live
+    in a particular row group and are guaranteed to be contiguous in the file.
 
-*Page*: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which is interleaved in a column chunk.
+  - *Page*: Column chunks are divided up into pages.  A page is conceptually
+    an indivisible unit (in terms of compression and encoding).  There can
+    be multiple page types which are interleaved in a column chunk.
 
-Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages.
+Hierarchically, a file consists of one or more row groups.  A row group
+contains exactly one column chunk per column.  Column chunks contain one or
+more pages.
 
 ## Unit of parallelization
-* MapReduce - File/Row Group
-* IO - Column chunk
-* Encoding/Compression - Page
+  - MapReduce - File/Row Group
+  - IO - Column chunk
+  - Encoding/Compression - Page
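Note on the glossary above: the containment rules (a file has one or more row groups, a row group has exactly one column chunk per column, a column chunk has one or more pages) can be written down as a small data model. The following is a minimal sketch only; the class names, field names, and the `layout_problems` helper are hypothetical and are not part of the Parquet format or of any library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    page_type: str  # hypothetical label, e.g. "DATA_PAGE"
    data: bytes


@dataclass
class ColumnChunk:
    pages: List[Page]


@dataclass
class RowGroup:
    # Keyed by column name: exactly one column chunk per column.
    column_chunks: Dict[str, ColumnChunk]


@dataclass
class FileLayout:
    columns: List[str]
    row_groups: List[RowGroup]


def layout_problems(layout: FileLayout) -> List[str]:
    """Return a list of violations of the hierarchy described in the glossary."""
    problems = []
    if not layout.row_groups:
        problems.append("file has no row groups")
    for i, row_group in enumerate(layout.row_groups):
        # Every column of the dataset must appear exactly once in the row group.
        if set(row_group.column_chunks) != set(layout.columns):
            problems.append(f"row group {i} does not have one chunk per column")
        for name, chunk in row_group.column_chunks.items():
            if not chunk.pages:
                problems.append(f"row group {i}, column {name!r} has no pages")
    return problems
```

An empty result from `layout_problems` means the layout satisfies the three rules in the glossary; it says nothing about the bytes on disk.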
diff --git a/content/en/docs/File Format/Data Pages/_index.md b/content/en/docs/File Format/Data Pages/_index.md
@@ -3,14 +3,24 @@ title: "Data Pages"
 linkTitle: "Data Pages"
 weight: 7
 ---
-For data pages, the 3 pieces of information are encoded back to back, after the page header. We have the
+For data pages, the 3 pieces of information are encoded back to back, after the page
+header. No padding is allowed in the data page.
+In order we have:
+ 1. repetition levels data
+ 1. definition levels data
+ 1. encoded values
 
-* definition levels data,
-* repetition levels data,
-* encoded values. The size of specified in the header is for all 3 pieces combined.
+The value of `uncompressed_page_size` specified in the header is for all 3 pieces combined.
 
-The data for the data page is always required. The definition and repetition levels are optional, based on the schema definition. If the column is not nested (i.e. the path to the column has length 1), we do not encode the repetition levels (it would always have the value 1). For data that is required, the definition levels are skipped (if encoded, it will always have the value of the max definition level).
+The encoded values for the data page are always required.  The definition and repetition levels
+are optional, based on the schema definition.  If the column is not nested (i.e.
+the path to the column has length 1), we do not encode the repetition levels (it would
+always have the value 1).  For data that is required, the definition levels are
+skipped (if encoded, it will always have the value of the max definition level).
 
-For example, in the case where the column is non-nested and required, the data in the page is only the encoded values.
+For example, in the case where the column is non-nested and required, the data in the
+page is only the encoded values.
 
 The supported encodings are described in Encodings.md
+
+The supported compression codecs are described in Compression.md
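Note on the Data Pages text above: the rule for which of the 3 pieces appear in a data page, and in what order, can be sketched as a small function. This is an illustration under assumptions: the function name is hypothetical, and the two parameters are the maximum repetition and definition levels of the column as derived from the schema (0 meaning the corresponding levels are never needed), which this page does not itself define.

```python
from typing import List


def data_page_sections(max_repetition_level: int, max_definition_level: int) -> List[str]:
    """List the pieces present in a data page, in on-disk order.

    Assumption: a non-nested column has max_repetition_level == 0 and a
    required column has max_definition_level == 0.
    """
    sections = []
    if max_repetition_level > 0:
        sections.append("repetition levels")
    if max_definition_level > 0:
        sections.append("definition levels")
    # The encoded values are always required.
    sections.append("encoded values")
    return sections
```

For a non-nested, required column the function returns only `["encoded values"]`, matching the example in the text.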
diff --git a/content/en/docs/File Format/Data Pages/checksumming.md b/content/en/docs/File Format/Data Pages/checksumming.md
@@ -3,4 +3,7 @@ title: "Checksumming"
 linkTitle: "Checksumming"
 weight: 7
 ---
-Pages of all kinds can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups. Checksums are calculated using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary representation of a page (not including the page header itself).
+Pages of all kinds can be individually checksummed. This allows disabling of checksums
+at the HDFS file level, to better support single row lookups. Checksums are calculated
+using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary
+representation of a page (not including the page header itself).
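Note on the checksumming text above: Python's standard `zlib.crc32` implements the same CRC32 that GZip uses, so a per-page checksum can be sketched as follows. The function names are hypothetical, and `page_bytes` is assumed to be the serialized page as it follows the page header (the header itself is excluded, as the text states).

```python
import zlib


def page_crc32(page_bytes: bytes) -> int:
    """Compute the standard CRC32 of a serialized page (header excluded)."""
    # Mask to an unsigned 32-bit value for portability across platforms.
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def page_is_intact(page_bytes: bytes, expected_crc: int) -> bool:
    """Compare a freshly computed checksum with the stored one."""
    return page_crc32(page_bytes) == (expected_crc & 0xFFFFFFFF)
```

A reader that finds a mismatch can reject just that page, which is what makes per-page checksums usable when file-level checksums are disabled.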
diff --git a/content/en/docs/File Format/Data Pages/columnchunks.md b/content/en/docs/File Format/Data Pages/columnchunks.md
@@ -3,4 +3,17 @@ title: "Column Chunks"
 linkTitle: "Column Chunks"
 weight: 7
 ---
-Column chunks are composed of pages written back to back. The pages share a common header and readers can skip over page they are not interested in. The data for the page follows the header and can be compressed and/or encoded. The compression and encoding is specified in the page metadata.
+Column chunks are composed of pages written back to back.  The pages share a common
+header and readers can skip over pages they are not interested in.  The data for the
+page follows the header and can be compressed and/or encoded.  The compression and
+encoding are specified in the page metadata.
+
+A column chunk might be partly or completely dictionary encoded. This means that
+dictionary indexes are saved in the data pages instead of the actual values. The
+actual values are stored in the dictionary page. See details in Encodings.md.
+The dictionary page must be placed at the first position of the column chunk. At
+most one dictionary page can be placed in a column chunk.
+
+Additionally, files can contain an optional column index to allow readers to
+skip pages more efficiently. See PageIndex.md for details and
+the reasoning behind adding these to the format.
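Note on the Column Chunks text above: the two dictionary page constraints (first position only, at most one per column chunk) are easy to check over the sequence of page types in a chunk. A minimal sketch, assuming page types are represented as the strings `"DICTIONARY_PAGE"` and `"DATA_PAGE"`; that representation and the function name are hypothetical choices for illustration.

```python
from typing import List


def dictionary_page_placement_ok(page_types: List[str]) -> bool:
    """Check that a chunk has at most one dictionary page and that it comes first."""
    positions = [
        index
        for index, page_type in enumerate(page_types)
        if page_type == "DICTIONARY_PAGE"
    ]
    # Valid cases: no dictionary page at all, or exactly one at position 0.
    return positions == [] or positions == [0]
```

A chunk with no dictionary page passes, which covers column chunks that are not dictionary encoded at all.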
diff --git a/content/en/docs/File Format/Data Pages/compression.md b/content/en/docs/File Format/Data Pages/compression.md
@@ -0,0 +1,84 @@
+---
+title: "Compression"
+linkTitle: "Compression"
+weight: 1
+---
+
+## Overview
+
+Parquet allows the data block inside dictionary pages and data pages to
+be compressed for better space efficiency. The Parquet format supports
+several compression codecs covering different areas in the compression ratio /
+processing cost spectrum.
+
+The detailed specifications of compression codecs are maintained externally
+by their respective authors or maintainers, which we reference hereafter.
+
+For all compression codecs except the deprecated `LZ4` codec, the raw data
+of a (data or dictionary) page is fed *as-is* to the underlying compression
+library, without any additional framing or padding.  The information required
+for precise allocation of compressed and decompressed buffers is written
+in the `PageHeader` struct.
+
+## Codecs
+
+### UNCOMPRESSED
+
+No-op codec.  Data is left uncompressed.
+
+### SNAPPY
+
+A codec based on the
+[Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt).
+If any ambiguity arises when implementing this format, the implementation
+provided by the Google Snappy [library](https://github.com/google/snappy/)
+is authoritative.
+
+### GZIP
+
+A codec based on the GZIP format (not the closely-related "zlib" or "deflate"
+formats) defined by [RFC 1952](https://tools.ietf.org/html/rfc1952).
+If any ambiguity arises when implementing this format, the implementation
+provided by the [zlib compression library](https://zlib.net/) is authoritative.
+
+Readers should support reading pages containing multiple GZIP members.  However,
+as this has historically not been supported by all implementations, it is recommended
+that writers refrain from creating such pages by default for better interoperability.
+
+### LZO
+
+A codec based on or interoperable with the
+[LZO compression library](http://www.oberhumer.com/opensource/lzo/).
+
+### BROTLI
+
+A codec based on the Brotli format defined by
+[RFC 7932](https://tools.ietf.org/html/rfc7932).
+If any ambiguity arises when implementing this format, the implementation
+provided by the [Brotli compression library](https://github.com/google/brotli)
+is authoritative.
+
+### LZ4
+
+A **deprecated** codec loosely based on the LZ4 compression algorithm,
+but with an additional undocumented framing scheme.  The framing is part
+of the original Hadoop compression library and was historically copied
+first in parquet-mr, then emulated with mixed results by parquet-cpp.
+
+It is strongly suggested that implementors of Parquet writers deprecate
+this compression codec in their user-facing APIs, and advise users to
+switch to the newer, interoperable `LZ4_RAW` codec.
+
+### ZSTD
+
+A codec based on the Zstandard format defined by
+[RFC 8478](https://tools.ietf.org/html/rfc8478).  If any ambiguity arises
+when implementing this format, the implementation provided by the
+[Zstandard compression library](https://facebook.github.io/zstd/)
+is authoritative.
+
+### LZ4_RAW
+
+A codec based on the [LZ4 block format](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).
+If any ambiguity arises when implementing this format, the implementation
+provided by the [LZ4 compression library](http://www.lz4.org/) is authoritative.
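Note on the GZIP section of compression.md above: the reader-side requirement to accept pages made of multiple GZIP members can be illustrated with Python's standard `gzip` module, whose `gzip.decompress` handles concatenated members. This is a sketch for illustration only; the function names are hypothetical, and real readers would take the compressed and uncompressed sizes from the `PageHeader` rather than from the data.

```python
import gzip


def decompress_gzip_page(compressed: bytes) -> bytes:
    """Decompress a GZIP-coded page body, including multi-member input."""
    return gzip.decompress(compressed)


def looks_like_gzip(compressed: bytes) -> bool:
    """Check for the RFC 1952 magic bytes, which zlib and raw deflate streams lack."""
    return compressed[:2] == b"\x1f\x8b"


# Two independently compressed members written back to back.
multi_member_page = gzip.compress(b"first half, ") + gzip.compress(b"second half")
```

Calling `decompress_gzip_page(multi_member_page)` yields the concatenation of both payloads. Writers that want the broadest compatibility would emit a single member per page, as the text recommends.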