
Rough draft implementation #1

Closed
wants to merge 3 commits into from


geowurster
Member

@sgillies et al.

This PR and the associated draft branch are intended to start a discussion, not necessarily to be merged.

For those not aware of the background, this was spurred by Toblerity/fio-buffer#4. Fiona has a fio cat CLI command that reads an OGR-supported vector datasource and prints each feature on its own line. Being able to stream vector features and geometries around is very unixy and makes it easier to develop standalone tools that are good at doing one thing, but there is no standard method for reading and writing these streams.

This PR aims to make reading GeoJSON feature sequences just like reading a text file with open().

import geojseq

with geojseq.open('coutwildrnp.geojson') as src, geojseq.open('-', 'w', use_rs=True) as dst:
    for feat in src:
        dst.write(feat)
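
For readers unfamiliar with the stream format, here is a minimal standard-library sketch of reading and writing a feature sequence. It is not geojseq's implementation: `write_features()` and `read_features()` are hypothetical names, and the sketch assumes each record is a JSON text followed by a newline, optionally prefixed with the ASCII record separator (0x1E), which is what the `use_rs=True` flag above suggests.

```python
import io
import json

RS = "\x1e"  # ASCII record separator (assumed optional prefix for each record)


def write_features(features, stream, use_rs=False):
    """Write one JSON feature per line, optionally prefixed with RS."""
    prefix = RS if use_rs else ""
    for feat in features:
        stream.write(prefix + json.dumps(feat) + "\n")


def read_features(stream):
    """Yield features from newline-delimited or RS-delimited text."""
    buffer = None  # text of the RS-delimited record being accumulated
    for line in stream:
        if line.startswith(RS):
            # A new record starts here, so emit the previous one first.
            if buffer is not None and buffer.strip():
                yield json.loads(buffer)
            buffer = line[len(RS):]
        elif buffer is not None:
            # Continuation of a record that spans several lines.
            buffer += line
        elif line.strip():
            # Plain newline-delimited input: one record per line.
            yield json.loads(line)
    if buffer is not None and buffer.strip():
        yield json.loads(buffer)


# Round trip through an in-memory text stream.
example = [{"type": "Feature", "id": "0", "properties": {}, "geometry": None}]
sink = io.StringIO()
write_features(example, sink, use_rs=True)
sink.seek(0)
round_tripped = list(read_features(sink))
```

The RS prefix is what lets a reader recover records that were pretty-printed across several lines; without it the sketch falls back to one record per line.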

With that in place, a tool that supports both OGR datasources (via Fiona) and feature streams could look like:

import sys

import fiona
import geojseq

infile = sys.argv[1]
# Treat the input as a feature sequence when a second argument is given.
input_is_sequence = len(sys.argv) > 2

with geojseq.open(infile) if input_is_sequence else fiona.open(infile) as src:
    for feat in src:
        pass

@sgillies
Contributor

@geowurster I'm back in my home office after a work trip, have shipped Rasterio 0.26, and am 👀 on this at last.

@sgillies
Contributor

@geowurster I suggest we put geojseq.core.open() aside for now and just consider the stream class.

I feel like FeatureStream in read mode is right on. Using fio cat --rs, I wrote Fiona's test dataset out as an RS-delimited sequence of GeoJSON features and FeatureStream handles these exactly as I'd like. The analogy to csv.reader is going to be super useful for Python programmers.

>>> from geojseq.core import FeatureStream
>>> with open('/tmp/coutwldrnp.jseq', 'r') as f, FeatureStream(f) as src:
...     for ftr in src:
...         print((ftr['id'], ftr['properties']['NAME']))
...
('0', 'Mount Naomi Wilderness')
('1', 'Wellsville Mountain Wilderness')
('2', 'Mount Zirkel Wilderness')
('3', 'High Uintas Wilderness')
('4', 'Rawah Wilderness')
('5', 'Mount Olympus Wilderness')
('6', 'Comanche Peak Wilderness')
('7', 'Cache La Poudre Wilderness')
...

Let's consider separate classes for the reading and writing duties.

@geowurster
Member Author

@sgillies While the csv module is certainly very similar to what we're trying to accomplish, I'm not convinced it's what we should use as a model. I have been using a bunch of different I/O libraries lately and have found that modules lacking something similar to open() feel antiquated. I can get behind focusing on the core class or classes first, but I think the module.open() pattern is important to a modern I/O library.

I didn't use it for this project because our core file-like object will need to have GeoJSON-specific properties etc., but I already have a NewlineJSON project (that needs a bit of work). It started out with Reader() and Writer() classes that were intended to be drop-in replacements for csv.DictReader() and csv.DictWriter(), but I found that a central file-like object and newlinejson.open() worked a lot better.

Are there more compelling reasons that I'm missing for the csv model? My guess is that csv doesn't have an open() because of the additional complications that headers and DictReader/Writer() introduce, but we don't have that problem.

Some examples of this change in the stdlib:

| Library | Python 2.7 | Python 3 |
| --- | --- | --- |
| gzip | gzip.open() for file paths, but gzip.GzipFile(filename=None, fileobj=None) does both. | gzip.open() transparently handles file paths and file-like objects. |
| bz2 | bz2.BZ2File() only opens file paths. To decompress a file-like object, BZ2Decompressor().decompress() must be used. The same applies to compressing. | Introduced bz2.open() for transparently reading file paths and file-like objects. |
| lzma | External library | lzma.open() that behaves like gzip and bz2. |
| tarfile | tarfile.open(name=None, fileobj=None) | Same as Python 2 |
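
To make the Python 3 column concrete, here is a small sketch of `gzip.open()` accepting either a file path or an already open file-like object. It assumes Python 3; the file name and payload are made up for illustration.

```python
import gzip
import io
import os
import tempfile

payload = b'{"type": "Feature"}\n'  # illustrative content only

# Case 1: gzip.open() given a file path.
path = os.path.join(tempfile.mkdtemp(), "features.json.gz")
with gzip.open(path, "wb") as dst:
    dst.write(payload)
with gzip.open(path, "rb") as src:
    from_path = src.read()

# Case 2: the same call given an existing binary file-like object.
buf = io.BytesIO()
with gzip.open(buf, "wb") as dst:
    dst.write(payload)
buf.seek(0)  # the underlying buffer stays open after the gzip wrapper closes
with gzip.open(buf, "rb") as src:
    from_fileobj = src.read()
```

The calling code is identical in both cases, which is the property the module.open() pattern is after.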

In contrast, libraries like csv and MsgPack just aren't as streamlined:

import csv

with open('data.csv', newline='') as f:
    for line in csv.DictReader(f):
        pass  # Do something

import msgpack

# MsgPack is a binary format, so the files are opened in binary mode.
with open('data.msg', 'rb') as f:
    for msg in msgpack.Unpacker(f):
        pass  # Do something

with open('data.msg', 'wb') as f:
    packer = msgpack.Packer()
    for item in something_else:  # any iterable of serializable objects
        f.write(packer.pack(item))

Objects like BZ2Compressor() are still useful so you can do stuff like compress data before sending it across the network, but they seem antiquated when reading and writing files.

@sgillies
Contributor

Sold. I'm game to start with one class and add an open().
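
One possible shape for "one class plus an open()" is sketched below. This is not the code in the PR: the `FeatureStream` here is a simplified, hypothetical stand-in that only handles one feature per line, `open_stream()` is a hypothetical name (a package would likely expose it as its module-level `open()`), and treating `'-'` as stdin or stdout is an assumption based on the first example in this thread.

```python
import json
import sys


class FeatureStream:
    """Hypothetical stand-in for the core class: one JSON feature per line."""

    def __init__(self, f, mode="r", owns_file=False):
        if mode not in ("r", "w"):
            raise ValueError("mode must be 'r' or 'w'")
        self._f = f
        self.mode = mode
        self._owns_file = owns_file  # only close files this object opened

    def __iter__(self):
        for line in self._f:
            if line.strip():
                yield json.loads(line)

    def write(self, feature):
        self._f.write(json.dumps(feature) + "\n")

    def close(self):
        if self._owns_file:
            self._f.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


def open_stream(path_or_file, mode="r"):
    """Return a FeatureStream for a path, '-' (stdin/stdout), or a file object."""
    if path_or_file == "-":
        return FeatureStream(sys.stdin if mode == "r" else sys.stdout, mode)
    if isinstance(path_or_file, str):
        return FeatureStream(open(path_or_file, mode), mode, owns_file=True)
    # Anything else is assumed to be an open text file-like object.
    return FeatureStream(path_or_file, mode)
```

Callers who already hold a file object can still construct `FeatureStream` directly, csv.reader style, while `open_stream()` covers the common path-based case.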
