Pull data from DaRUS, the Data Repository of the University of Stuttgart, to a local folder (./data) with Python3.
Use it to organize the data of multiple datasets locally on your computer and to integrate your open data in git repositories.
- Install Python3 + pip (Python package manager)
- Clone this repository to the place where you need it. If that place is itself a git repository, add this repository as a submodule via
    git submodule add https://github.com/iswunistuttgart/darus_data_download.git
- Install the required package requests (to make HTTP requests to the REST API):
    # cd to directory of this repository, then:
    pip install --user -r requirements.txt
- Create the configuration file (scripts/darus_config.json) template by running
    python scripts/get_data.py
- If the dataset(s) you want to use are not (yet) public, get your API token at https://darus.uni-stuttgart.de/dataverseuser.xhtml?selectTab=apiTokenTab and store it in a file named .darus_apikey. Warning: never check in your API key via git! Within this repository the file is listed in .gitignore.
- Configure the data to download in scripts/darus_config.json. The DOI of each dataset has the format doi:10.18419/darus-???? (find your own data on https://darus.uni-stuttgart.de/dataverseuser.xhtml?selectTab=dataRelatedToMe).
- If you are using this module as a submodule: move the darus_config.json file to the directory above this repository and check it in with your parent git project to keep the data configuration reproducible.
- Download/update all data by running
    python scripts/get_data.py
  The metadata is also downloaded as info.json in each folder.
- in ./scripts/ (directory of get_data.py)
- in ./ (the parent directory, where Readme.md is located)
- in ../ (one directory above this project)
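The three locations above can be checked one after another. The following is a minimal sketch, assuming they are the places where darus_config.json is looked up (the setup steps create it in scripts/ and suggest moving it one directory above). The function name find_config and the search order are assumptions of this sketch, not taken from get_data.py.

```python
from pathlib import Path
from typing import Optional


def find_config(repo_root: Path, filename: str = "darus_config.json") -> Optional[Path]:
    """Return the first existing candidate file, or None if none exists.

    Hypothetical helper: the name and the search order are assumptions.
    """
    candidates = [
        repo_root / "scripts" / filename,  # ./scripts/ (directory of get_data.py)
        repo_root / filename,              # ./ (where Readme.md is located)
        repo_root.parent / filename,       # ../ (one directory above this project)
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Calling find_config(Path("path/to/this/repository")) would return the path of the first file found, so a copy in scripts/ would take precedence over one in the parent project under this assumed order.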
For downloading two datasets:
{
  "dataverse_url": "https://darus.uni-stuttgart.de/",
  "datasets": [
    {"id": "doi:10.18419/darus-1234", "version": "latest"},
    {"id": "doi:10.18419/darus-1235", "version": "2.0"}
  ]
}
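To illustrate how such a configuration can be read, here is a minimal sketch that parses the example above and falls back to "latest" when an entry has no version. The function name parse_config and the DOI prefix check are assumptions made for illustration; get_data.py may handle this differently.

```python
import json
from typing import List, Tuple

EXAMPLE_CONFIG = """
{
  "dataverse_url": "https://darus.uni-stuttgart.de/",
  "datasets": [
    {"id": "doi:10.18419/darus-1234", "version": "latest"},
    {"id": "doi:10.18419/darus-1235", "version": "2.0"}
  ]
}
"""


def parse_config(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the Dataverse URL and a list of (doi, version) pairs.

    Hypothetical helper, not part of get_data.py.
    """
    config = json.loads(text)
    datasets = []
    for entry in config["datasets"]:
        dataset_id = entry["id"]
        # Assumed sanity check: ids are expected in the form doi:10.18419/darus-????
        if not dataset_id.startswith("doi:"):
            raise ValueError(f"dataset id must start with 'doi:': {dataset_id!r}")
        # "latest" is the documented default when no version is given
        datasets.append((dataset_id, entry.get("version", "latest")))
    return config["dataverse_url"], datasets
```

With the example above, parse_config(EXAMPLE_CONFIG) yields the DaRUS URL and the two datasets, the second one pinned to version "2.0".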
To have reproducible results you should refer to a fixed version after the repository was published. Otherwise the data may change on repository updates.
Latest ("version": "latest") is the default setting and will also use unpublished versions. You can also use a version number or any string of the version descriptions given here
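As a sketch of how a version setting could translate into a request against the Dataverse native API, the helper below builds the URL, query parameters, and headers for the dataset version endpoint. The function name build_version_request and the mapping of "latest" to the API's ":latest" identifier are assumptions of this sketch; the endpoint path and the X-Dataverse-key header are those of the Dataverse native API.

```python
from typing import Dict, Optional, Tuple


def build_version_request(
    base_url: str,
    dataset_id: str,
    version: str = "latest",
    api_key: Optional[str] = None,
) -> Tuple[str, Dict[str, str], Dict[str, str]]:
    """Return (url, params, headers) for fetching one dataset version.

    Hypothetical helper, not part of get_data.py.
    """
    # Assumption: the config value "latest" corresponds to ":latest" in the API;
    # other values (e.g. "2.0") are passed through unchanged.
    api_version = ":latest" if version == "latest" else version
    url = f"{base_url.rstrip('/')}/api/datasets/:persistentId/versions/{api_version}"
    params = {"persistentId": dataset_id}
    # The API token is only needed for datasets that are not (yet) public.
    headers = {"X-Dataverse-key": api_key} if api_key else {}
    return url, params, headers
```

The result can be passed to requests, for example requests.get(url, params=params, headers=headers). Pinning "version": "2.0" in the configuration then always addresses the same published version, which is what makes results reproducible.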
- Handle ENV variables (especially for the API key) so the script can be used in Docker, etc.
- Make it more robust against failure/misconfiguration
- Allow upload of files (maybe use pyDaRUS)
- Allow specifying version
- You are welcome to contribute bugfixes directly as pull requests
- For new features or changed functionality please open an issue first, or feel free to discuss it directly.