Requirement: Include both a format and a data/publication version number #13

Open
krischer opened this issue Jan 3, 2018 · 13 comments

Comments

@krischer
Contributor

krischer commented Jan 3, 2018

Include both a format and a data/publication version number.

@chad-earthscope
Member

I support the addition of both a data format and data publication version.

The motivation for the format version is to make the format self describing and for identification, i.e. as a signature to match. It would also allow future evolution of the fundamental portions of the format.

The data publication version was discussed during the previous evaluation. With miniSEED 2.x there is no versioning built into the format. Some data centers used the "data quality" identifier as a crude form of versioning, but this is extremely limited with only 4 "levels" and a vague implication of "quality". Tracking data versions, which are a reality in modern data management and use, is especially important for scientific data. Including the capability to identify versions directly in the format allows for basic versioning and can be used by systems external to the format for extended, version-specific metadata.
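As a rough illustration of the two ideas above (a format version that doubles as a signature to match, and a small data publication version carried in each record), here is a minimal sketch. The byte layout, the `SIGNATURE` value, and the function names are assumptions made up for this example; nothing in this thread fixes them.

```python
import struct

# Hypothetical layout, for illustration only (not taken from any specification):
#   2 bytes  ASCII signature
#   1 byte   format version (unsigned)
#   1 byte   data publication version (unsigned)
SIGNATURE = b"MS"
HEADER = struct.Struct("<2sBB")


def pack_header(format_version, publication_version):
    """Build the fixed leading bytes of a record."""
    return HEADER.pack(SIGNATURE, format_version, publication_version)


def parse_header(buf):
    """Identify a record by its signature and return both version numbers."""
    signature, format_version, publication_version = HEADER.unpack_from(buf)
    if signature != SIGNATURE:
        raise ValueError("signature mismatch: not a recognised record")
    return format_version, publication_version


# The values 3 and 7 are arbitrary example numbers.
record = pack_header(3, 7) + b"payload bytes"
print(parse_header(record))
```

With a layout like this, a reader can reject unknown input from the first few bytes and can pass the publication version to external systems that hold the version-specific metadata.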

@andres-h

andres-h commented Jan 6, 2018

Data/publication version number should be an optional (IRIS) extension. Linear version numbers do not support "forks" where data has been modified in multiple datacentres. I would not hardcode this feature into the standard, because something more clever might be needed in future.

@crotwell

crotwell commented Jan 8, 2018

Format version is critical for sure.

Andres has a good point about linear version numbers not working well with forks. If two data centers both receive version 7 of the data, each does something and then has a different version 8.

The alternatives are to either name-space the data version (perhaps within the additional headers) or to declare that the data version has no meaning beyond the context of the datacenter where it was created.
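The fork problem and the name-spacing alternative can be sketched as follows. The `PublicationVersion` class and the data center labels `DC-A` and `DC-B` are hypothetical names used only for this example, not a proposed header design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicationVersion:
    namespace: str  # hypothetical label for the data center that assigned it
    number: int     # linear version, meaningful only inside that namespace

    def is_newer_than(self, other):
        # Ordering is defined only between versions from the same namespace.
        if self.namespace != other.namespace:
            raise ValueError("versions from different namespaces are not comparable")
        return self.number > other.number


# Both centers receive version 7 and each publishes its own modified version 8.
received = 7
center_a = received + 1
center_b = received + 1
print(center_a == center_b)  # True: bare numbers collide although the data differ

a8 = PublicationVersion("DC-A", 8)
b8 = PublicationVersion("DC-B", 8)
print(a8 == b8)  # False: the namespace keeps the two forks apart
```

Declaring that a version has no meaning outside its data center corresponds to the `ValueError` branch: comparing across namespaces is refused rather than answered.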

@chad-earthscope
Member

chad-earthscope commented Jan 8, 2018

... or to declare that the data version has no meaning beyond the context of the datacenter where it was created.

That is exactly the conclusion we got to in the previous conversation last July on this topic. In the case of the IRIS DMC, I think we would work with those that contribute data so that the version is done by the owner whenever possible.

A system that identifies relative relationships between versions across forks and data centers would require some sort of central registry or much more complexity.

I suspect a data publication version in each record would be useful for many data centers, which would justify the fixed 1 byte it would use, but it would be OK to use an optional header for this if that's where the consensus lands.

@krischer
Contributor Author

Summary

(Please let me know if I missed a point or misunderstood something)

Please vote on:

  1. Do we want to include the actual data format version to enable self-identification and versioning of the data format? (Yes/No)
  2. Do we want a single byte "data publication version" somewhere in each record? This would be a linear version number without a lot of additional semantics largely useful internally for data centers. (Yes/No)
  3. Do we want a more complex "data publication version" which must include things like namespaces. (Yes/No)

@crotwell

1 yes
2 yes
3 no, or at least not as a required header field. No objection to a standardized key that could be used in the optional part of the header as in #14

@chad-earthscope
Member

  1. Yes
  2. Yes
  3. Not as a requirement.

@kaestli

kaestli commented Jan 30, 2018

  1. Yes
  2. No (a data stream that was modified should get a different streamID, not a different version number under a streamID that pretends to be the same. What if the "version" tag varies between records of the "same" stream? Using the streamID to point to metadata allows the version/modification to be described further there.)
  3. (this should be answered in the streamID discussion)

@ozym

ozym commented Jan 30, 2018

  1. Yes
  2. Yes, although I could see this used as a mechanism to determine data provenance within the collection systems (e.g. daisy-chained data feeds) rather than as a version per se
  3. No

@claudiodsf

  1. Do we want to include the actual data format version to enable self-identification and versioning of the data format? (Yes/No)

Yes

  2. Do we want a single byte "data publication version" somewhere in each record? This would be a linear version number without a lot of additional semantics largely useful internally for data centers. (Yes/No)

Yes

  3. Do we want a more complex "data publication version" which must include things like namespaces. (Yes/No)

No

@ihenson-bsl

  1. Yes
  2. Yes
  3. No

@ValleeMartin

  1. Yes
  2. Yes
  3. No

@JoseAntonioJara

  1. Yes
  2. Yes, adding an identifier (namespace or other) that indicates the datacenter where it was created.
  3. Yes
