
Investigate support for large(r) files #15

Open
herrfugbaum opened this issue Nov 28, 2018 · 4 comments
Assignees
Labels
help wanted (Extra attention is needed)

Comments

@herrfugbaum
Owner

Streams should be helpful to handle larger files.
In Papa Parse, the csv parser used in this project, there is already support for streams built in.

Files could be handled row by row instead of "all or nothing".

herrfugbaum added the help wanted label Nov 28, 2018
@herrfugbaum
Owner Author

First test with the results of the 2018 StackOverflow survey (~186 MB, 98,856 rows) took 383.422 seconds
😅

```javascript
const Papa = require('papaparse')
const fs = require('fs')

const file = fs.createReadStream('./huge.csv')
const start = Date.now()

Papa.parse(file, {
  header: true,
  skipEmptyLines: true,
  step: function (row) {
    console.log('Row:', row.data)
  },
  complete: function () {
    const duration = (Date.now() - start) / 1000
    console.log('Reading the file took ' + duration + ' seconds')
  }
})
```

@herrfugbaum herrfugbaum self-assigned this Nov 29, 2018
@herrfugbaum
Owner Author

Using `chunk` instead of `step` took 360.858 seconds.

```javascript
const Papa = require('papaparse')
const streamFile = require('./streamFile')

const start = Date.now()
const file = streamFile('./huge.csv')

Papa.parse(file, {
  header: true,
  skipEmptyLines: true,
  /* step: function (row) {
    console.log('Row:', row.data)
  }, */
  chunk: function (chunk) {
    console.log(chunk)
  },
  complete: function () {
    const duration = (Date.now() - start) / 1000
    console.log('Reading the file took ' + duration + ' seconds')
  }
})
```

@ankush981

Ah, I just created #23. Maybe remove that one and copy over the ideas here?

@ankush981

> Files could be handled row by row instead of "all or nothing".

Maybe we can try, say, 200 lines at a time. Row by row is still going to be slow, especially on non-SSD disks. 😓
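A batching layer like that could sit between the parser and whatever consumes the rows. The `makeBatcher` helper below is a hypothetical sketch (not Papa Parse API) that buffers rows and flushes them in groups of a configurable size, e.g. 200:

```javascript
// Hypothetical helper: buffer rows and emit them in fixed-size batches
// instead of one at a time. Not part of Papa Parse; just a sketch.
function makeBatcher (size, onBatch) {
  let batch = []
  return {
    push (row) {
      batch.push(row)
      if (batch.length >= size) {
        onBatch(batch)
        batch = []
      }
    },
    // Emit whatever is left over (e.g. call this from Papa's complete callback).
    flush () {
      if (batch.length > 0) {
        onBatch(batch)
        batch = []
      }
    }
  }
}

// Usage sketch with Papa Parse's step/complete callbacks:
// const batcher = makeBatcher(200, rows => handleRows(rows))
// Papa.parse(file, {
//   step: row => batcher.push(row.data),
//   complete: () => batcher.flush()
// })
```

This keeps the per-row callback cheap (a single array push) and moves the expensive work to one call per 200 rows.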
