Merge pull request apache#37 from wgtmac/sync_format_2.10.0
Sync site with format release v2.10.0
shangxinli authored Jan 15, 2024
2 parents 8f9954c + 7cf58a9 commit 7bc0b28
Showing 23 changed files with 1,250 additions and 92 deletions.
28 changes: 19 additions & 9 deletions content/en/docs/Concepts/_index.md
@@ -5,19 +5,29 @@ weight: 4
description: >
Glossary of relevant terminology.
---
- *Block (HDFS block)*: A block in HDFS; the term keeps its usual meaning
when describing this file format. The file format is designed to work well
on top of HDFS.

- *File*: An HDFS file that must include the metadata for the file.
It does not need to actually contain the data.

- *Row group*: A logical horizontal partitioning of the data into rows.
No physical structure is guaranteed for a row group.
A row group consists of a column chunk for each column in the dataset.

- *Column chunk*: A chunk of the data for a particular column. They live
in a particular row group and are guaranteed to be contiguous in the file.

- *Page*: Column chunks are divided up into pages. A page is conceptually
an indivisible unit (in terms of compression and encoding). There can
be multiple page types which are interleaved in a column chunk.

Hierarchically, a file consists of one or more row groups. A row group
contains exactly one column chunk per column. Column chunks contain one or
more pages.
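
Assuming the Python `pyarrow` library, a minimal sketch of walking this hierarchy through the file metadata (the file name is a placeholder):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")  # placeholder file name
meta = pf.metadata
for rg in range(meta.num_row_groups):         # one or more row groups per file
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):  # exactly one column chunk per column
        chunk = row_group.column(col)
        print(rg, chunk.path_in_schema, chunk.num_values, chunk.compression)
```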

## Unit of parallelization
- MapReduce - File/Row Group
- IO - Column chunk
- Encoding/Compression - Page
22 changes: 16 additions & 6 deletions content/en/docs/File Format/Data Pages/_index.md
@@ -3,14 +3,24 @@ title: "Data Pages"
linkTitle: "Data Pages"
weight: 7
---
For data pages, the 3 pieces of information are encoded back to back, after the page
header. No padding is allowed in the data page.
In order we have:
1. repetition levels data
1. definition levels data
1. encoded values

The value of `uncompressed_page_size` specified in the header is for all 3 pieces combined.

The encoded values for the data page are always required. The definition and repetition levels
are optional, based on the schema definition. If the column is not nested (i.e.
the path to the column has length 1), we do not encode the repetition levels (it would
always have the value 1). For data that is required, the definition levels are
skipped (if encoded, it will always have the value of the max definition level).

For example, in the case where the column is non-nested and required, the data in the
page is only the encoded values.
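
A sketch of that rule in Python; treating the column's max repetition and definition levels as the test for nesting and requiredness is a convention assumed by this illustration, and the function name is made up:

```python
def expected_page_sections(max_repetition_level: int, max_definition_level: int):
    """Illustrative: which of the 3 pieces a data page carries, in on-disk order."""
    sections = []
    if max_repetition_level > 0:       # nested column, i.e. path length > 1
        sections.append("repetition levels")
    if max_definition_level > 0:       # column is not required all the way down
        sections.append("definition levels")
    sections.append("encoded values")  # always present
    return sections

# Non-nested and required: the page holds only the encoded values.
assert expected_page_sections(0, 0) == ["encoded values"]
```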

The supported encodings are described in Encodings.md.

The supported compression codecs are described in Compression.md.
5 changes: 4 additions & 1 deletion content/en/docs/File Format/Data Pages/checksumming.md
@@ -3,4 +3,7 @@ title: "Checksumming"
linkTitle: "Checksumming"
weight: 7
---
Pages of all kinds can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups. Checksums are calculated using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary representation of a page (not including the page header itself).
Pages of all kinds can be individually checksummed. This allows disabling of checksums
at the HDFS file level, to better support single row lookups. Checksums are calculated
using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary
representation of a page (not including the page header itself).
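
For illustration, the same CRC32 used by GZip is available in the Python standard library; a minimal sketch, assuming the input is a page's serialized bytes without the header:

```python
import zlib

def page_crc32(serialized_page: bytes) -> int:
    """Standard CRC32 (the algorithm GZip uses) over the serialized
    binary representation of a page, excluding the page header."""
    return zlib.crc32(serialized_page) & 0xFFFFFFFF
```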
15 changes: 14 additions & 1 deletion content/en/docs/File Format/Data Pages/columnchunks.md
@@ -3,4 +3,17 @@ title: "Column Chunks"
linkTitle: "Column Chunks"
weight: 7
---
Column chunks are composed of pages written back to back. The pages share a common
header and readers can skip over pages they are not interested in. The data for the
page follows the header and can be compressed and/or encoded. The compression and
encoding are specified in the page metadata.

A column chunk might be partly or completely dictionary encoded. This means that
dictionary indexes are saved in the data pages instead of the actual values. The
actual values are stored in the dictionary page. See details in Encodings.md.
The dictionary page must be placed at the first position of the column chunk. At
most one dictionary page can be placed in a column chunk.
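
A minimal Python sketch of the idea (the real index encoding is defined in Encodings.md; this only shows values being replaced by dictionary indexes):

```python
def dictionary_encode(values):
    """Store each distinct value once; data pages then carry small indexes."""
    dictionary, index_of, indexes = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(index_of[v])
    return dictionary, indexes

# ["a", "b", "a"] -> dictionary ["a", "b"], indexes [0, 1, 0]
```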

Additionally, files can contain an optional column index to allow readers to
skip pages more efficiently. See PageIndex.md for details and
the reasoning behind adding these to the format.
84 changes: 84 additions & 0 deletions content/en/docs/File Format/Data Pages/compression.md
@@ -0,0 +1,84 @@
---
title: "Compression"
linkTitle: "Compression"
weight: 1
---

## Overview

Parquet allows the data block inside dictionary pages and data pages to
be compressed for better space efficiency. The Parquet format supports
several compression codecs covering different points in the compression
ratio / processing cost spectrum.

The detailed specifications of compression codecs are maintained externally
by their respective authors or maintainers, which we reference hereafter.

For all compression codecs except the deprecated `LZ4` codec, the raw data
of a (data or dictionary) page is fed *as-is* to the underlying compression
library, without any additional framing or padding. The information required
for precise allocation of compressed and decompressed buffers is written
in the `PageHeader` struct.
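
As a reader-side sketch of this contract, assuming GZIP as the example codec and illustrative names throughout:

```python
import zlib

def decompress_page(raw: bytes, codec: str, uncompressed_page_size: int) -> bytes:
    """Illustrative: the page's compressed bytes are passed as-is to the library."""
    if codec == "UNCOMPRESSED":
        data = raw
    elif codec == "GZIP":
        data = zlib.decompress(raw, wbits=16 + zlib.MAX_WBITS)  # gzip framing
    else:
        raise NotImplementedError(codec)
    # The PageHeader's uncompressed_page_size gives the exact output size.
    assert len(data) == uncompressed_page_size
    return data
```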

## Codecs

### UNCOMPRESSED

No-op codec. Data is left uncompressed.

### SNAPPY

A codec based on the
[Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt).
If any ambiguity arises when implementing this format, the implementation
provided by Google Snappy [library](https://github.com/google/snappy/)
is authoritative.

### GZIP

A codec based on the GZIP format (not the closely-related "zlib" or "deflate"
formats) defined by [RFC 1952](https://tools.ietf.org/html/rfc1952).
If any ambiguity arises when implementing this format, the implementation
provided by the [zlib compression library](https://zlib.net/) is authoritative.

Readers should support reading pages containing multiple GZIP members. However,
as this has historically not been supported by all implementations, it is
recommended that writers refrain from creating such pages by default for better
interoperability.
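
A sketch of multi-member reading with Python's standard `zlib` module, looping until the page's bytes are consumed:

```python
import zlib

def gunzip_all_members(data: bytes) -> bytes:
    """Decode a concatenation of one or more GZIP members."""
    out = bytearray()
    while data:
        d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # gzip framing
        out += d.decompress(data)
        out += d.flush()
        data = d.unused_data  # bytes after the current member, if any
    return bytes(out)
```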

### LZO

A codec based on or interoperable with the
[LZO compression library](http://www.oberhumer.com/opensource/lzo/).

### BROTLI

A codec based on the Brotli format defined by
[RFC 7932](https://tools.ietf.org/html/rfc7932).
If any ambiguity arises when implementing this format, the implementation
provided by the [Brotli compression library](https://github.com/google/brotli)
is authoritative.

### LZ4

A **deprecated** codec loosely based on the LZ4 compression algorithm,
but with an additional undocumented framing scheme. The framing is part
of the original Hadoop compression library and was historically copied
first in parquet-mr, then emulated with mixed results by parquet-cpp.

It is strongly suggested that implementors of Parquet writers deprecate
this compression codec in their user-facing APIs, and advise users to
switch to the newer, interoperable `LZ4_RAW` codec.

### ZSTD

A codec based on the Zstandard format defined by
[RFC 8478](https://tools.ietf.org/html/rfc8478). If any ambiguity arises
when implementing this format, the implementation provided by the
[ZStandard compression library](https://facebook.github.io/zstd/)
is authoritative.

### LZ4_RAW

A codec based on the [LZ4 block format](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).
If any ambiguity arises when implementing this format, the implementation
provided by the [LZ4 compression library](http://www.lz4.org/) is authoritative.
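
A decoding sketch assuming the third-party Python `lz4` package; since the LZ4 block format carries no length header, the decompressed size from the `PageHeader` has to be supplied explicitly:

```python
import lz4.block  # third-party: pip install lz4

def decompress_lz4_raw(raw: bytes, uncompressed_page_size: int) -> bytes:
    # The block format stores no decompressed size, so pass it in.
    return lz4.block.decompress(raw, uncompressed_size=uncompressed_page_size)
```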