
[IMPROVEMENT] Thoughts on DDI Data Loader #22

Open
TheCedarPrince opened this issue May 6, 2024 · 2 comments
Labels
enhancement New feature or request help wanted Extra attention is needed question Further information is requested

Comments

@TheCedarPrince
Member

Hey @00krishna,

These are my thoughts on your recent implementation for #21! As I mentioned over Slack, the functionality is sufficient, so I went ahead and merged the PR, but as we discussed, there do seem to be some places for improvement. I am going to copy some of our discussion here first so as not to lose those thoughts, and I will add my own comments along the way.

Basic Benchmarking

For a dataset with about 70 million elements, I see this time and allocation count:

julia> @btime load_ipums_extract(ddi, "usa_00001.dat");
  130.577 s (475777044 allocations: 15.20 GiB) 

For a dataset with only 60,000 elements, I see this time and allocation count:

julia> @btime load_ipums_extract(ddi, "cps_00157.dat");
  118.328 ms (518521 allocations: 17.71 MiB)

Interestingly, when I tried using ipumsr to do this, loading both files was nearly instantaneous. After some digging, I found that hipread seems to do much of the heavy lifting within ipumsr when parsing these files.

Thoughts on Slowdowns

Per Krishna:

Most of the allocations must be happening in this part of the code, since this is where the data is getting loaded.

  for line in eachline(extract_filepath)
      data = map((x, p, d) -> _parse_data(x, p, d),
                 [strip(line[r]) for r in range_vec],
                 [eltype(a) for a in dtype_vec],
                 [d for d in dcml_vec])
      push!(df, data)
  end
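One likely contributor to the allocation count is that the comprehensions over `dtype_vec` and `dcml_vec` are rebuilt on every line even though they never change. A minimal, self-contained sketch of hoisting them out of the loop (with a stub `_parse_data` and toy inputs standing in for the package's actual code; all names here are illustrative assumptions, not IPUMS.jl internals):

```julia
# Stub stand-ins for the package's _parse_data and column metadata.
_parse_data(x, ::Type{Int}, d) = parse(Int, x)
_parse_data(x, ::Type{String}, d) = String(x)

range_vec = [1:4, 6:8]          # fixed-width column ranges
dtype_vec = [Int[], String[]]   # per-column element-type carriers
dcml_vec  = [0, 0]              # per-column decimal counts

# Hoisted: computed once, not once per line.
types = [eltype(a) for a in dtype_vec]
dcmls = collect(dcml_vec)

rows = []
for line in ["0042 abc", "0007 xyz"]   # stands in for eachline(extract_filepath)
    data = map((r, t, d) -> _parse_data(strip(line[r]), t, d),
               range_vec, types, dcmls)
    push!(rows, data)
end
```

The per-line `[strip(line[r]) for r in range_vec]` allocation is also folded into the `map` itself, so only one result vector is built per line.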

After doing some further analysis using Profile.jl and PProf.jl, I was able to find the following flamegraph profiling on allocations in the code:

[image: allocation flame graph]

Although the image is not great, one can see that the majority of the allocations occur in the map function call, as you suspected, Krishna. Here's the code I ran to generate this flame graph:

using IPUMS
using Profile
using PProf

ddi = parse_ddi("cps_00157.xml");

Profile.Allocs.clear()
Profile.Allocs.@profile load_ipums_extract(ddi, "cps_00157.dat");
PProf.Allocs.pprof()

I attached the allocation profile file here as well.

alloc-profile.pb.gz

I also did a quick analysis of the computation time. It appears that much of the function's runtime is spent on type inference. This flame graph is a bit inscrutable, but I figured I would include it here as well:

[image: runtime flame graph]

Concluding Thoughts

In short, there seems to be a huge opportunity to optimize the loading functionality to take better advantage of the fixed-width file format, though I am not entirely sure how to do that yet. I'd be curious to hear your thoughts, @00krishna.
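One common Julia pattern for this kind of problem is a function barrier: capture the column ranges and column types in tuples so the compiler can specialize the parsing loop and avoid per-line dynamic dispatch. A hedged, self-contained sketch under that assumption (`parse_field` and `read_fixed_width` are hypothetical names, not part of IPUMS.jl):

```julia
# Hypothetical type-stable fixed-width reader sketch.
# Dispatching on the column's type parameter keeps each parse call concrete.
parse_field(s, ::Type{Int})     = parse(Int, s)
parse_field(s, ::Type{Float64}) = parse(Float64, s)
parse_field(s, ::Type{String})  = String(s)

# Because `ranges` and `types` are tuples, `map` over them returns a tuple
# and the compiler can infer a concrete row type for the whole loop.
function read_fixed_width(lines, ranges::Tuple, types::Tuple)
    map(lines) do line
        map((r, T) -> parse_field(strip(line[r]), T), ranges, types)
    end
end

# Toy usage: two columns, Int in 1:4 and Float64 in 5:8.
rows = read_fixed_width(["0042 3.5", "0007 1.0"], (1:4, 5:8), (Int, Float64))
```

Whether this closes the gap with hipread would need benchmarking, but it targets the type-inference cost visible in the runtime flame graph.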

Otherwise, great stuff!

@TheCedarPrince TheCedarPrince added enhancement New feature or request help wanted Extra attention is needed question Further information is requested labels May 6, 2024
@00krishna
Collaborator

So I have created PR #30 to help improve performance. @TheCedarPrince is reviewing it, and we will work on further improvements as well.

@TheCedarPrince
Member Author

That was merged @00krishna -- thanks for adding this in. Are you still able to post that minimal working example to Discourse? I'd be curious to see how we could keep hacking at it someday to make it as fast as tidyr's reader.
