Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support local paths for InputDataset.source #30

Merged
merged 5 commits into from
Aug 20, 2024

Conversation

dafyddstephenson
Copy link
Contributor

@dafyddstephenson dafyddstephenson commented Aug 17, 2024

closes #4 .
Summary of changes:

Core:

  • add _get_source_type method in utils (returns 'url' or 'path')
  • add call to _get_source_type in InputDataset.__init__() and set local_path and exists_locally attributes if source attribute is a local path
  • add logic to InputDataset.get() which creates a local symbolic link to the dataset path if it is elsewhere on the system
  • update InputDataset.check_exists_locally() to reflect above changes

CI:

  • add a second section to tests/test_roms_marbl_example.py. The first section creates and runs the case roms_marbl_remote_case where all input datasets are URLs, creating a blueprint test_blueprint.yaml as before. The second section first moves all the fetched datasets to an unrelated directory and then modifies the test_blueprint.yaml file to replace URLs with the path to this dir, before creating and running another case roms_marbl_local_case from this modified yaml file.

Other/bugfixes:

Comment on lines +65 to +67
if _get_source_type(source) == "path":
self.exists_locally = True
self.local_path = source
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I wonder if this could all be abstracted behind a class. e.g.

self.source = DataSource(source)

then later use self.source.exists_locally and self.source.path.

Copy link
Contributor Author

@dafyddstephenson dafyddstephenson Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I began implementing this for this PR but it got hairier than expected. I think it requires a lot of design considerations and should be returned to once the core feature set is complete. AdditionalCode and BaseModel also have a source_repo attribute (that I think is about to be renamed source to accommodate local files) that could be replaced with an instance of whatever this class will be. I think this could be part of the tidy-up that brings in pathlib. I'll raise an issue for now.

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I have one main comment about the idea of abstracting out the "source" of the data. It's related to #9 and drivendataorg/cloudpathlib#455.

Comment on lines +105 to +108
If InputDataset.source is...
- ...a local path: create a symbolic link to the file in `local_dir/input_datasets`.
- ...a URL: fetch the file to `local_dir/input_datasets` using Pooch
(updating the `local_path` attribute of the calling InputDataset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is another argument for abstracting out the concept of a "source" into a standalone class / type.

This is a similar idea to using cloudpathlib, but maybe we need to attach more info, in which case we would need our own Source class with a .path attribute.

to_fetch.fetch(os.path.basename(self.source), downloader=downloader)
self.exists_locally = True
self.local_path = tgt_dir + "/" + os.path.basename(self.source)
# If the file is somewhere else on the system, make a symbolic link where we want it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could all go behind a Source interface.

“Dafydd added 5 commits August 20, 2024 14:16
- add _get_source_type method in utils (returns 'url' or 'path')
- add call to _get_source_type in InputDataset.__init__() which updates local_path and exists_locally if 'path'
- add logic to InputDataset.get() which creates a local symbolic link to the dataset path if it is elsewhere on the system
- update InputDataset.check_exists_locally() to reflect above changes
- Modifies the blueprint created by the first case to use local paths to input datasets where available
- Creates and runs a second case `roms_marbl_local_case` from this blueprint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants