A model for working with Data Packages.
pip install datapackage
import datapackage
dp = datapackage.DataPackage('http://data.okfn.org/data/core/gdp/datapackage.json')
brazil_gdp = [{'Year': int(row['Year']), 'Value': float(row['Value'])}
              for row in dp.resources[0].data if row['Country Code'] == 'BRA']
max_gdp = max(brazil_gdp, key=lambda x: x['Value'])
min_gdp = min(brazil_gdp, key=lambda x: x['Value'])
# Ratio of the highest to the lowest GDP (a multiple, not a percentage)
gdp_ratio = max_gdp['Value'] / min_gdp['Value']
msg = (
    'The highest Brazilian GDP occurred in {max_gdp_year}, when it peaked at US$ '
    '{max_gdp:1,.0f}. This was {gdp_ratio:1,.2f} times its '
    'minimum GDP in {min_gdp_year}.'
).format(max_gdp_year=max_gdp['Year'],
         max_gdp=max_gdp['Value'],
         gdp_ratio=gdp_ratio,
         min_gdp_year=min_gdp['Year'])
print(msg)
# The highest Brazilian GDP occurred in 2011, when it peaked at US$ 2,615,189,973,181. This was 172.44 times its minimum GDP in 1960.
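The same filtering and max/min selection can be tried offline on a few hand-written rows. The sketch below is only an illustration: the rows are made-up figures (not real GDP data), the country codes `XXX` and `YYY` are placeholders, and `gdp_extremes` is a hypothetical helper that is not part of the datapackage library. It also shows how a max/min ratio differs from a percentage increase.

```python
# Made-up rows shaped like the GDP resource; the values are NOT real data.
rows = [
    {'Country Code': 'XXX', 'Year': '2000', 'Value': '100.0'},
    {'Country Code': 'XXX', 'Year': '2001', 'Value': '250.0'},
    {'Country Code': 'YYY', 'Year': '2000', 'Value': '999.0'},
    {'Country Code': 'XXX', 'Year': '2002', 'Value': '200.0'},
]


def gdp_extremes(rows, country_code):
    """Return (lowest_row, highest_row, ratio, percent_increase) for one country."""
    series = [{'Year': int(r['Year']), 'Value': float(r['Value'])}
              for r in rows if r['Country Code'] == country_code]
    lowest = min(series, key=lambda x: x['Value'])
    highest = max(series, key=lambda x: x['Value'])
    # The ratio says how many times larger the maximum is than the minimum.
    ratio = highest['Value'] / lowest['Value']
    # The percentage increase is the relative growth from minimum to maximum.
    percent_increase = (ratio - 1) * 100
    return lowest, highest, ratio, percent_increase


lowest, highest, ratio, percent_increase = gdp_extremes(rows, 'XXX')
print(lowest['Year'], highest['Year'], ratio, percent_increase)
```

With these sample rows the maximum is 2.5 times the minimum, which corresponds to a 150% increase.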
import datapackage
dp = datapackage.DataPackage('http://data.okfn.org/data/core/gdp/datapackage.json')
try:
    dp.validate()
except datapackage.exceptions.ValidationError as e:
    # Handle the ValidationError
    pass
import datapackage
# This descriptor has two errors:
# * It has no "name", which is required;
# * Its resource has no "data", "path" or "url".
descriptor = {
    'resources': [
        {},
    ]
}
dp = datapackage.DataPackage(descriptor)
for error in dp.iter_errors():
    # Handle error (here we simply print it)
    print(error)
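To make the idea of iterating over every error (rather than stopping at the first one) more concrete, here is a minimal standalone sketch. It does not use the datapackage library: `iter_descriptor_errors` is a hypothetical helper with two hand-written checks that mirror the two errors listed in the comments above, and the messages are invented for illustration. The library's own checks and error objects may differ.

```python
def iter_descriptor_errors(descriptor):
    """Yield one message per problem found instead of raising on the first."""
    # Check 1: a descriptor needs a "name".
    if 'name' not in descriptor:
        yield 'descriptor has no "name"'
    # Check 2: every resource needs one of "data", "path" or "url".
    for index, resource in enumerate(descriptor.get('resources', [])):
        if not any(key in resource for key in ('data', 'path', 'url')):
            yield 'resource {} has no "data", "path" or "url"'.format(index)


descriptor = {'resources': [{}]}
errors = list(iter_descriptor_errors(descriptor))
for error in errors:
    print(error)
```

Because the helper is a generator, the caller decides whether to stop at the first problem or collect all of them.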
import datapackage
dp = datapackage.DataPackage()
dp.descriptor['name'] = 'my_sleep_duration'
dp.descriptor['resources'] = [
    {'name': 'data'}
]
resource = dp.resources[0]
resource.descriptor['data'] = [
    7, 8, 5, 6, 9, 7, 8
]
with open('datapackage.json', 'w') as f:
    f.write(dp.to_json())
# {"name": "my_sleep_duration", "resources": [{"data": [7, 8, 5, 6, 9, 7, 8], "name": "data"}]}
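Since the saved descriptor is plain JSON, it can be read back with nothing more than the standard library. The following sketch assumes a descriptor with the same content as the one above; it writes to a temporary directory (an arbitrary choice, to avoid overwriting an existing `datapackage.json`) and uses only the `json` module, not the datapackage library.

```python
import json
import os
import tempfile

# Same content as the descriptor built above, written out by hand.
descriptor = {
    'name': 'my_sleep_duration',
    'resources': [{'name': 'data', 'data': [7, 8, 5, 6, 9, 7, 8]}],
}

# Write to a temporary directory so no existing datapackage.json is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'datapackage.json')
with open(path, 'w') as f:
    json.dump(descriptor, f, sort_keys=True)

# Read the file back and use the inline data of the first resource.
with open(path) as f:
    loaded = json.load(f)

hours = loaded['resources'][0]['data']
average = sum(hours) / len(hours)
print(loaded['name'], average)
```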
import datapackage
import datapackage.registry
# This constant points to the official registry URL
# You can use any URL or path that points to a registry CSV
registry_url = datapackage.registry.Registry.DEFAULT_REGISTRY_URL
registry = datapackage.registry.Registry(registry_url)
descriptor = {} # The datapackage.json file
schema = registry.get('tabular') # Change to your schema ID
dp = datapackage.DataPackage(descriptor, schema)
The package provides the push_datapackage and pull_datapackage utilities to push a Data Package to, and pull it from, a storage backend. This functionality requires a jsontableschema storage plugin to be installed; see the plugins section of the jsontableschema docs for more information. Suppose we have installed the jsontableschema-mystorage plugin (not a real name). We could then push and pull a Data Package to and from that storage. All parameters should be passed as keyword arguments:
from datapackage import push_datapackage, pull_datapackage
# Push
# (<mystorage_options> is a placeholder for the plugin-specific options)
push_datapackage(
    descriptor='descriptor_path',
    backend='mystorage', **<mystorage_options>)
# Pull
pull_datapackage(
    descriptor='descriptor_path', name='datapackage_name',
    backend='mystorage', **<mystorage_options>)
The options could be, for example, a SQLAlchemy engine or a BigQuery project and dataset name. Each plugin's documentation describes its options in detail. See the plugins section of the jsontableschema docs for concrete examples.
These notes are intended for people who want to contribute to this package itself. If you just want to use it, you can safely ignore them.
We cache the schemas from https://github.com/dataprotocols/schemas using git-subtree. To update the cached copy, use:
git subtree pull --prefix datapackage/schemas https://github.com/dataprotocols/schemas.git master --squash