Generate Pandas frames, load and extract data, based on JSON Table Schema descriptors.

frictionlessdata/tableschema-pandas-py

tableschema-pandas-py

Generate and load Pandas data frames based on Table Schema descriptors.

Features

  • implements tableschema.Storage interface

Getting Started

Installation

The package uses semantic versioning, which means that major versions may include breaking changes. It is highly recommended to specify a version range for the package in your setup/requirements file, e.g. package>=1.0,<2.0.

$ pip install tableschema-pandas
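
To make the recommended range concrete, here is a minimal sketch of which versions a constraint like `>=1.0,<2.0` accepts. It is a simplified illustration, not how pip resolves versions: the helper names `parse_version` and `in_range` are hypothetical, and only plain numeric versions are handled (no pre-releases or other PEP 440 forms).

```python
# Simplified illustration only: handles plain numeric versions such as "1.1"
# or "1.0.3", not the full version grammar that pip implements.

def parse_version(text, width=3):
    """Turn '1.1' into a zero-padded tuple of ints, e.g. (1, 1, 0)."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def in_range(version, lower="1.0", upper="2.0"):
    """Return True when lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Versions inside the 1.x series are accepted; the next major version is not.
for candidate in ["0.9", "1.0", "1.1", "2.0"]:
    print(candidate, in_range(candidate))
```

With such a range, minor and patch releases of the 1.x series are picked up automatically, while a 2.0 release with possible breaking changes is not.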

Documentation

# pip install datapackage tableschema-pandas
from datapackage import Package

# Save to Pandas

package = Package('http://data.okfn.org/data/core/country-list/datapackage.json')
storage = package.save(storage='pandas')

print(type(storage['data']))
#  <class 'pandas.core.frame.DataFrame'>

print(storage['data'].head())
#               Name   Code
#  0     Afghanistan   AF
#  1   Åland Islands   AX
#  2         Albania   AL
#  3         Algeria   DZ
#  4  American Samoa   AS

# Load from Pandas

package = Package(storage=storage)
print(package.descriptor)
print(package.resources[0].read())

Storage works as a container for Pandas data frames. You can define a new data frame inside the storage using the storage.create method:

>>> from tableschema_pandas import Storage

>>> storage = Storage()
>>> storage.create('data', {
...     'primaryKey': 'id',
...     'fields': [
...         {'name': 'id', 'type': 'integer'},
...         {'name': 'comment', 'type': 'string'},
...     ]
... })

>>> storage.buckets
['data']

>>> storage['data'].shape
(0, 0)
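
In a Table Schema descriptor, `primaryKey` may be a single field name, as in the descriptor above, or a list of names for a composite key. The following standard-library sketch shows one way to normalize both forms before passing a descriptor around. The helper names `primary_key_fields` and `field_names`, and the `lang` field in the second descriptor, are hypothetical and not part of this package.

```python
def primary_key_fields(descriptor):
    """Return the primary key as a list of field names (possibly empty)."""
    key = descriptor.get("primaryKey", [])
    # Table Schema allows either a single field name or a list of names.
    return [key] if isinstance(key, str) else list(key)


def field_names(descriptor):
    """Return the declared field names in order."""
    return [field["name"] for field in descriptor["fields"]]


# The descriptor used with storage.create above.
single = {
    "primaryKey": "id",
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "comment", "type": "string"},
    ],
}

# A hypothetical descriptor with a composite primary key.
composite = {
    "primaryKey": ["id", "lang"],
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "lang", "type": "string"},
        {"name": "comment", "type": "string"},
    ],
}

for descriptor in (single, composite):
    key = primary_key_fields(descriptor)
    # Every key field should also be declared in "fields".
    missing = [name for name in key if name not in field_names(descriptor)]
    print(key, "missing:", missing)
```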

Use storage.write to populate the data frame with data:

>>> storage.write('data', [(1, 'a'), (2, 'b')])

>>> storage['data']
id comment
1        a
2        b

You can also use tabulator to populate the data frame from an external data file. As you can see, subsequent writes simply append new data after the existing rows:

>>> import tabulator

>>> with tabulator.Stream('data/comments.csv', headers=1) as stream:
...     storage.write('data', stream)

>>> storage['data']
id comment
1        a
2        b
1     good
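
Note that in the output above the `id` value 1 appears twice after the second write. If repeated keys matter for your data, you may want to check the rows before writing them. A minimal standard-library sketch, assuming rows are tuples with the key in the first position (`duplicate_keys` is a hypothetical helper, not part of this package):

```python
from collections import Counter


def duplicate_keys(rows, key_index=0):
    """Return the key values that occur more than once, in first-seen order."""
    counts = Counter(row[key_index] for row in rows)
    return [key for key, count in counts.items() if count > 1]


# Rows from the first write, then the row read from the external file.
existing = [(1, 'a'), (2, 'b')]
incoming = [(1, 'good')]

print(duplicate_keys(existing + incoming))  # [1]
```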

API Reference

Storage

Storage(self, dataframes=None)

Pandas storage

The package implements the Tabular Storage interface (see the full documentation at the link):

Storage

Only the additional API is documented here.

Arguments

  • dataframes (object[]): list of storage dataframes

Contributing

The project follows the Open Knowledge International coding standards.

The recommended way to get started is to create and activate a project virtual environment. To install the package and development dependencies into the active environment:

$ make install

To run tests with linting and coverage:

$ make test

Changelog

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.

v1.1

  • Added support for composite primary keys (loading to pandas)

v1.0

  • Initial driver implementation
