
Investigate support for large(r) files #15

Open
herrfugbaum opened this issue Nov 28, 2018 · 4 comments
Assignees
Labels
help wanted (Extra attention is needed)

Comments

@herrfugbaum
Owner

Streams should be helpful to handle larger files.
In Papa Parse, the csv parser used in this project, there is already support for streams built in.

Files could be handled row by row instead of "all or nothing".

herrfugbaum added the help wanted label Nov 28, 2018
@herrfugbaum
Owner Author

First test with the results of the 2018 StackOverflow survey (~186 MB, 98,856 rows) took 383.422 seconds
😅

```javascript
const Papa = require('papaparse')
const fs = require('fs')

const file = fs.createReadStream('./huge.csv')
const start = Date.now()

Papa.parse(file, {
  header: true,
  skipEmptyLines: true,
  step: function (row) {
    console.log('Row:', row.data)
  },
  complete: function () {
    const duration = (Date.now() - start) / 1000
    console.log('Reading the file took ' + duration + ' seconds')
  }
})
```

@herrfugbaum herrfugbaum self-assigned this Nov 29, 2018
@herrfugbaum
Owner Author

Using `chunk` instead of `step` took 360.858 seconds.

```javascript
const Papa = require('papaparse')
const streamFile = require('./streamFile')

const start = Date.now()
const file = streamFile('./huge.csv')

Papa.parse(file, {
  header: true,
  skipEmptyLines: true,
  /* step: function (row) {
    console.log('Row:', row.data)
  }, */
  chunk: function (chunk) {
    console.log(chunk)
  },
  complete: function () {
    const duration = (Date.now() - start) / 1000
    console.log('Reading the file took ' + duration + ' seconds')
  }
})
```

@ankush981

Ah, I just created #23. Maybe remove that one and copy over the ideas here?

@ankush981

> Files could be handled row by row instead of "all or nothing".

Maybe we can try, say, 200 lines at a time. Row by row is still going to be slow, especially on non-SSD disks. 😓
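A batching layer like that could sit between the parser and whatever consumes the rows. The `makeBatcher` helper below is a hypothetical sketch (not Papa Parse API) that buffers rows and flushes them in groups of a configurable size, e.g. 200:

```javascript
// Hypothetical helper: buffer rows and emit them in fixed-size batches
// instead of one at a time. Not part of Papa Parse; just a sketch.
function makeBatcher (size, onBatch) {
  let batch = []
  return {
    push (row) {
      batch.push(row)
      if (batch.length >= size) {
        onBatch(batch)
        batch = []
      }
    },
    // Emit whatever is left over (e.g. call this from Papa's complete callback).
    flush () {
      if (batch.length > 0) {
        onBatch(batch)
        batch = []
      }
    }
  }
}

// Usage sketch with Papa Parse's step/complete callbacks:
// const batcher = makeBatcher(200, rows => handleRows(rows))
// Papa.parse(file, {
//   step: row => batcher.push(row.data),
//   complete: () => batcher.flush()
// })
```

This keeps the per-row callback cheap (a single array push) and moves the expensive work to one call per 200 rows.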
