[optimize] readRleBitPackedHybrid costs too much time; TypedArray for column results? #7
I like the idea of returning TypedArrays, but will need to think more about how to actually implement that. Suggestions and PRs welcome! I've found the parsing to be quite fast in general. Can you share an example parquet file which is particularly slow?
A more than 20x faster example: the TPC-H lineitem table (sf1, almost 250 MB), without reading `columnMetadata.codec`. The columnMetadata:

```
{
  type: 'INT64',
  encodings: [ 'PLAIN' ],
  path_in_schema: [ 'l_extendedprice' ],
  codec: 'SNAPPY',
  num_values: 122880n,
  total_uncompressed_size: 983071n,
  total_compressed_size: 634625n,
  key_value_metadata: undefined,
  data_page_offset: 1580630n,
  index_page_offset: undefined,
  dictionary_page_offset: undefined,
  statistics: {
    max: '^\x10�\x00\x00\x00\x00\x00',
    min: '�i\x01\x00\x00\x00\x00\x00',
    null_count: 0n,
    distinct_count: undefined
  },
  encoding_stats: undefined
}
```

and the data page header:

```
{
  num_values: 122880,
  encoding: 'PLAIN',
  definition_level_encoding: 'RLE',
  repetition_level_encoding: 'RLE',
  statistics: undefined
}
```

But the codec is SNAPPY. I can do more, but I'm not good at the parquet format yet.
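For reference, PLAIN-encoded INT64 values are just consecutive little-endian 8-byte integers, so once the Snappy layer is decompressed the page decodes as in this minimal sketch (the function name and offset handling here are illustrative, not hyparquet's actual code):

```js
// Illustrative sketch: decode PLAIN INT64 values from a decompressed data page.
// Parquet PLAIN encoding stores fixed-width values back to back, little-endian.
function decodePlainInt64(buffer, byteOffset, numValues) {
  const view = new DataView(buffer, byteOffset)
  const out = new Array(numValues)
  for (let i = 0; i < numValues; i++) {
    out[i] = view.getBigInt64(i * 8, true) // true = little-endian
  }
  return out
}
```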
Just pushed commit 488bfa7f9 which increases performance by 11.7% when parsing TPC-H on Node.js! There is still much room for improvement. Nice job finding optimizations!
Yeah, I see the commit, great job. And here are some small tests that might help:

```js
function concatArrayBuffers(buffer1, buffer2) {
const newBuffer = new ArrayBuffer(buffer1.byteLength + buffer2.byteLength)
const view = new Uint8Array(newBuffer)
view.set(new Uint8Array(buffer1), 0)
view.set(new Uint8Array(buffer2), buffer1.byteLength)
console.log(buffer1.byteLength + buffer2.byteLength)
return newBuffer
}
const buffer1 = new ArrayBuffer(40000000)
const buffer2 = new ArrayBuffer(40000000)
console.time('concatArrayBuffers')
var r = new Int32Array(concatArrayBuffers(buffer1, buffer2))
console.timeEnd('concatArrayBuffers')
function concatTypedArrays(array1, array2) {
const newTypedArray = new array1.constructor(array1.length + array2.length)
newTypedArray.set(array1, 0)
newTypedArray.set(array2, array1.length)
return newTypedArray
}
const typedArray1 = new Int32Array(10000000)
const typedArray2 = new Int32Array(10000000)
console.time('concatTypedArrays')
var r = concatTypedArrays(typedArray1, typedArray2)
console.timeEnd('concatTypedArrays')
var src1 = Array(10000000).fill(0)
var src2 = Array(10000000).fill(0)
console.time('array concat')
var dst = src1.concat(src2)
console.timeEnd('array concat')
// dst.push(...src) might throw an error (too many arguments) for large arrays
var dst = []
// warm up: pre-fill dst so the timed loop pushes onto an already-large array
for (let i = 0; i < 10000000; i++) {
  dst.push(i)
}
console.time('array push')
for (let i = 0; i < 10000000; i++) {
  dst.push(i)
}
console.timeEnd('array push')
```

Node.js v22 on my machine: casting an ArrayBuffer to a TypedArray is almost zero cost, and `TypedArray.set` is what should be used for concatenation, NOT `Array.prototype.concat` or repeated `push`.
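The "almost zero cost" part holds because a TypedArray constructed over an ArrayBuffer is a view, not a copy; a small illustration:

```js
// Views share the buffer's memory: construction is O(1), no bytes move.
const buf = new ArrayBuffer(16)
const whole = new Int32Array(buf)      // 4 int32 elements over all 16 bytes
const tail = new Int32Array(buf, 8, 2) // byteOffset 8, 2 elements
whole[2] = 42
console.log(tail[0]) // 42 — same underlying memory
```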
In v0.7.8 there is another performance issue, but I didn't test anything for it yet.
I'll go to sleep; here is a possible answer for `readData`:

```js
// assumes readVarInt and readBitPacked from the same encoding module
export function readData(reader, encoding, count, bitWidth) {
  /** @type {any[]} */
  const values = []
  if (encoding === 'RLE') {
    let seen = 0
    while (seen < count) {
      seen = readRleBitPackedHybrid(reader, bitWidth, 0, count, values, seen)
    }
  } else {
    throw new Error(`parquet encoding not supported ${encoding}`)
  }
  return values
}

export function readRleBitPackedHybrid(reader, width, length, numValues, values, seen = 0) {
  if (!values) throw new Error('parquet readRleBitPackedHybrid requires an output array')
  if (!length) {
    length = reader.view.getInt32(reader.offset, true)
    reader.offset += 4
    if (length < 0) throw new Error(`parquet invalid rle/bitpack length ${length}`)
  }
  const startOffset = reader.offset
  while (reader.offset - startOffset < length && seen < numValues) {
    const [header, newOffset] = readVarInt(reader.view, reader.offset)
    reader.offset = newOffset
    if ((header & 1) === 0) {
      // rle: one value repeated count times
      const [val, count] = readRle(reader, header, width)
      for (let i = 0; i < count; i++) values[seen++] = val
    } else {
      // bit-packed
      const res = readBitPacked(reader, header, width, numValues - seen)
      for (let i = 0; i < res.length; i++) values[seen++] = res[i]
    }
  }
  return seen
}

function readRle(reader, header, bitWidth) {
  const count = header >>> 1
  const width = (bitWidth + 7) >> 3 // round bits up to whole bytes
  let value
  if (width === 1) value = reader.view.getUint8(reader.offset)
  else if (width === 2) value = reader.view.getUint16(reader.offset, true)
  else if (width === 4) value = reader.view.getUint32(reader.offset, true)
  else throw new Error(`parquet invalid rle width ${width}`)
  reader.offset += width
  return [value, count]
}
```
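For context, the hybrid header's low bit selects the run type per the Parquet spec; a minimal sketch of the two cases (the `header` value here is illustrative, as if already decoded by `readVarInt`):

```js
// Parquet RLE/bit-packed hybrid: the varint header's low bit picks the run type.
const header = 0b1001 // illustrative value
if ((header & 1) === 0) {
  const runLength = header >>> 1 // RLE: one value repeated runLength times
} else {
  const numGroups = header >>> 1 // bit-packed: groups of 8 values follow
  const packedCount = numGroups * 8
}
```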
I just pushed some more optimizations in v0.7.10. It is about 37% faster than v0.7.8 on the TPC-H dataset! I tried switching to TypedArrays but things got slower for me. Thanks for the investigations @simline! Let me know if you have more suggestions!
Yes I have, with `readPlainInt32` / `readPlainInt64` returning TypedArray views:

```js
/**
 * Read `count` int32 values.
 *
 * @param {DataReader} reader - buffer to read data from
 * @param {number} count - number of values to read
 * @returns {Int32Array} array of int32 values
 */
function readPlainInt32(reader, count) {
  const start = reader.view.byteOffset + reader.offset
  reader.offset += count * 4
  // zero-copy view over the current read position (assumes 4-byte alignment)
  return new Int32Array(reader.view.buffer, start, count)
}

/**
 * Read `count` int64 values.
 *
 * @param {DataReader} reader - buffer to read data from
 * @param {number} count - number of values to read
 * @returns {BigInt64Array} array of int64 values
 */
function readPlainInt64(reader, count) {
  const start = reader.view.byteOffset + reader.offset
  reader.offset += count * 8
  // zero-copy view over the current read position (assumes 8-byte alignment)
  return new BigInt64Array(reader.view.buffer, start, count)
}
```
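One caveat with such views: `new Int32Array(buffer, byteOffset, count)` throws a RangeError when byteOffset is not a multiple of the element size, so an unaligned page needs a copy. A minimal sketch (the helper name is illustrative, not hyparquet's API):

```js
// Return a zero-copy Int32Array view when aligned, else fall back to a copy.
function int32View(buffer, byteOffset, count) {
  if (byteOffset % 4 === 0) {
    return new Int32Array(buffer, byteOffset, count) // zero-copy view
  }
  // slice() copies the bytes into a fresh, aligned ArrayBuffer
  return new Int32Array(buffer.slice(byteOffset, byteOffset + count * 4))
}
```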
For data analytics, another thing is letting the user filter column data with an expression, along the lines of the sketch below.
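A minimal sketch of what such filtering could look like (the helper and predicate are illustrative, not a proposed hyparquet API):

```js
// Illustrative only: scan a decoded column and collect matching row indices.
function filterColumn(column, predicate) {
  const indices = []
  for (let i = 0; i < column.length; i++) {
    if (predicate(column[i])) indices.push(i)
  }
  return indices
}

// e.g. rows where l_extendedprice exceeds a threshold (BigInt64Array column)
// const hits = filterColumn(prices, v => v > 5000000n)
```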
I made those changes to try it. I tested on Node, Chrome, and Firefox and saw the same slowdown with TypedArrays. What environment are you testing in?
@platypii sorry, I got the result from a quick standalone test. If the whole result must be returned at once, something like the example above applies.
Got TypedArrays working! Published v0.7.11 with a TPC-H parse time of 8,951 ms, compared to 19,358 ms for v0.7.8. A speedup of 53.7%!
A fix for my test:

```js
function concatTypedArrays(array1, array2) {
  const newTypedArray = new array1.constructor(array1.length + array2.length)
  newTypedArray.set(array1, 0)
  newTypedArray.set(array2, array1.length)
  return newTypedArray
}

var dst = new BigInt64Array(0)
for (let i = 0; i < 49; i++) {
  const newA = new BigInt64Array(122880).fill(BigInt(i + 1))
  console.time('concatTypedArrays')
  dst = concatTypedArrays(dst, newA)
  console.timeEnd('concatTypedArrays')
}
```

The time increased on every loop iteration, from 0.7 ms up to 19.6 ms, because each concat re-copies the whole destination. Preallocating the full size instead:

```js
var dst = new BigInt64Array(122880 * 49)
for (let i = 0; i < 49; i++) {
  const newA = new BigInt64Array(122880).fill(BigInt(i + 1))
  console.time('concatTypedArrays')
  dst.set(newA, 122880 * i)
  console.timeEnd('concatTypedArrays')
}
```

costs <1 ms every time!
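When the final length is not known up front (as when decoding pages one at a time), a doubling buffer amortizes the copies; a minimal sketch with illustrative names, not hyparquet code:

```js
// Illustrative amortized append: grow capacity geometrically so total
// copying stays O(n) instead of O(n^2) for repeated concatenation.
function appendBigInt64(dst, used, src) {
  let capacity = dst.length
  while (used + src.length > capacity) capacity = Math.max(capacity * 2, 1024)
  if (capacity !== dst.length) {
    const grown = new BigInt64Array(capacity)
    grown.set(dst.subarray(0, used)) // copy only the bytes in use
    dst = grown
  }
  dst.set(src, used)
  return [dst, used + src.length] // caller tracks the used length
}
```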
Another place to look: the readDefinitionLevels bitWidth condition.
And if I can do something for this, that would be great.