Support local paths for InputDataset.source
#30
if _get_source_type(source) == "path":
    self.exists_locally = True
    self.local_path = source
Again I wonder if this could all be abstracted behind a class, e.g. `self.source = DataSource(source)`, then later use `self.source.exists_locally` and `self.source.path`.
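To make the suggestion concrete, here is a minimal sketch of what such a class might look like. Everything in it is an assumption for illustration: the `DataSource` name comes from the comment above, but the `source_type` property, the set of URL schemes, and the behaviour of `path` for URLs are hypothetical and not taken from this PR. The 'url' / 'path' classification mirrors what `_get_source_type` is described as returning.

```python
from pathlib import Path
from urllib.parse import urlparse


class DataSource:
    """Hypothetical wrapper around a dataset location (URL or local path)."""

    # Assumption: only these schemes count as remote sources.
    _URL_SCHEMES = {"http", "https", "ftp"}

    def __init__(self, location):
        self.location = str(location)

    @property
    def source_type(self):
        # Anything without a recognised URL scheme is treated as a local path
        # (this also covers Windows drive letters such as "C:\\data").
        scheme = urlparse(self.location).scheme.lower()
        return "url" if scheme in self._URL_SCHEMES else "path"

    @property
    def path(self):
        # Only local sources have a filesystem path in this sketch;
        # a URL has none until something fetches it.
        return Path(self.location) if self.source_type == "path" else None

    @property
    def exists_locally(self):
        return self.path is not None and self.path.exists()
```

With something like this, `InputDataset.__init__()` could store `self.source = DataSource(source)` and read `self.source.exists_locally` and `self.source.path` instead of tracking separate attributes.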
Yes, I began implementing this for this PR but it got hairier than expected. I think it requires a lot of design considerations and should be returned to once the core feature set is complete. `AdditionalCode` and `BaseModel` also have a `source_repo` attribute (that I think is about to be renamed `source` to accommodate local files) that could be replaced with an instance of whatever this class will be. I think this could be part of the tidy-up that brings in `pathlib`. I'll raise an issue for now.
Nice. I have one main comment about the idea of abstracting out the "source" of the data. It's related to #9 and drivendataorg/cloudpathlib#455.
If InputDataset.source is...
   - ...a local path: create a symbolic link to the file in `local_dir/input_datasets`.
   - ...a URL: fetch the file to `local_dir/input_datasets` using Pooch
     (updating the `local_path` attribute of the calling InputDataset)
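The two branches in this docstring can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `stage_input` and its `fetch` argument are hypothetical names, and the `fetch` callable stands in for the Pooch download so that the sketch needs only the standard library.

```python
import os
from pathlib import Path
from urllib.parse import urlparse


def stage_input(source, local_dir, fetch=None):
    """Hypothetical helper: make `source` available in local_dir/input_datasets."""
    target_dir = Path(local_dir) / "input_datasets"
    target_dir.mkdir(parents=True, exist_ok=True)

    parsed = urlparse(str(source))
    if parsed.scheme.lower() in {"http", "https", "ftp"}:
        # URL branch: delegate the download to the caller-supplied callable
        # (an assumption standing in for the Pooch fetch).
        if fetch is None:
            raise ValueError("a fetch callable is required for URL sources")
        target = target_dir / os.path.basename(parsed.path)
        fetch(str(source), target)
        return target

    # Local-path branch: link the file into place unless it is already there.
    src = Path(source).resolve()
    target = target_dir / src.name
    if src.parent == target_dir.resolve():
        return target
    if not os.path.lexists(target):  # lexists also detects broken symlinks
        target.symlink_to(src)
    return target
```

The returned path is what the calling `InputDataset` would record as its `local_path`. Using `pathlib` here also avoids building paths by string concatenation.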
Yeah, this is another argument for abstracting out the concept of a "source" into a standalone class / type.
This is a similar idea to using `cloudpathlib`, but maybe we need to attach more info, in which case we would need our own `Source` class with a `.path` attribute.
to_fetch.fetch(os.path.basename(self.source), downloader=downloader)
self.exists_locally = True
self.local_path = tgt_dir + "/" + os.path.basename(self.source)
# If the file is somewhere else on the system, make a symbolic link where we want it
This could all go behind a `Source` interface.
- add _get_source_type method in utils (returns 'url' or 'path')
- add call to _get_source_type in InputDataset.__init__() which updates local_path and exists_locally if 'path'
- add logic to InputDataset.get() which creates a local symbolic link to the dataset path if it is elsewhere on the system
- update InputDataset.check_exists_locally() to reflect above changes
… yaml created with Case.persist()
- Modifies the blueprint created by the first case to use local paths to input datasets where available
- Creates and runs a second case `roms_marbl_local_case` from this blueprint
18fd968 to b032dcc
Closes #4.
Summary of changes:
Core:
- Add `_get_source_type` method in `utils` (returns 'url' or 'path')
- Add call to `_get_source_type` in `InputDataset.__init__()` and set `local_path` and `exists_locally` attributes if `source` attribute is a local path
- Add logic to `InputDataset.get()` which creates a local symbolic link to the dataset path if it is elsewhere on the system
- Update `InputDataset.check_exists_locally()` to reflect above changes
CI:
- `tests/test_roms_marbl_example.py`: the first section creates and runs the case `roms_marbl_remote_case` where all input datasets are URLs, creating a blueprint `test_blueprint.yaml` as before. The second section first moves all the fetched datasets to an unrelated directory and then modifies the `test_blueprint.yaml` file to replace URLs with the path to this directory, before creating and running another case `roms_marbl_local_case` from this modified yaml file.
Other/bugfixes:
- Bugs in `Case.persist()` and `Case.from_blueprint()` found when modifying the test routine above:
  - `valid_end_date` was overwriting `valid_start_date` entry in blueprint output
  - `start_date` and `end_date` associated with input datasets were not being written to blueprint output
- `Case.check_is_setup()` was not correctly checking for the local presence of additional code
- Check for `tree` on system before calling it (closes "bash tree: command not found" #18)
- … `Case.end_date` if they want quick results (closes "simplest example should run really fast" #21)
- Update `ci/environment.yml` to include `netCDF4` and `xarray` (closes "Missing dependencies for analysis" #23) and remove `nco` and `ncview` (issue not yet raised but was mentioned in Tom's notes)