History Improvement #111
Replies: 2 comments
-
The main purpose of RSS, in general, is to provide a way to tell when and how a dataset (or any other published entity) was last changed. The intention (in general and in ERDDAP) was never to provide a complete list of changes to the dataset. So if you concatenate all of the current RSS-noted changes for a dataset in ERDDAP, you won't necessarily get a list of all the changes to the dataset. Even if you did, it wouldn't be as precise or succinct as repeatedly running diff against the dataset's metadata (e.g., the .das or /info/ response). That relates to your comment about parsing the RSS description: the description was meant to be human-readable, not necessarily parseable by software. Diff output is less human-readable, but a more parseable and reliable way to detect specific changes.

Note that ERDDAP doesn't, and can't, detect or report many important changes. For example, if a data value between the first and last time point is changed/edited, that will in most cases not be detected or reported (via RSS or any other metadata) unless the dataset owner or ERDDAP administrator explicitly changes some metadata value. An exception is EDDTableFromHttpGet datasets, which maintain a log of changes to the data values that users can request (including with constraints, e.g., all changes made within a given range of time).

My point is: if you want a true log of all changes to a dataset, concatenating RSS messages won't give you what you want, and users who think this concatenated RSS log provides that will be mistaken. I see some value in this, e.g., it might be interesting for a human to peruse a list of changes. But given that it is incomplete, it seems of limited use. I'm also not sure I see a practical use for this that isn't solved better in some other way.
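As the paragraph above suggests, repeatedly diffing a dataset's .das metadata is a more reliable way to detect specific changes than parsing RSS descriptions. A minimal Python sketch of that polling-and-diffing approach (the dataset URL is a placeholder; point it at any ERDDAP dataset's .das):

```python
import difflib
import urllib.request

# Placeholder URL -- substitute your ERDDAP's base URL and dataset ID.
DAS_URL = "https://example.org/erddap/griddap/myDatasetID.das"

def fetch_das(url: str) -> list[str]:
    """Download the dataset's .das metadata and return it as a list of lines."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace").splitlines()

def das_diff(old: list[str], new: list[str]) -> str:
    """Return a unified diff between two .das snapshots (empty if unchanged)."""
    return "\n".join(
        difflib.unified_diff(old, new, "old.das", "new.das", lineterm="")
    )
```

A caller would keep the previous snapshot, fetch a new one on each poll, and act only when `das_diff` returns a non-empty string.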
A problem/complication with your system: some datasets change very often, e.g., every minute. That would quickly lead to a very large RSS history file; over a few years, it would be impractically large. A related problem: if you put the histories of all the datasets in a given ERDDAP into one file (or generate it on the fly, as in your example above), it will immediately be a very large file on many ERDDAPs (e.g., the CoastWatch WCN ERDDAP has 3000+ datasets). Moreover, most people wouldn't want that information for all datasets, or even for a large number of datasets; normally people focus on one dataset at a time.

I pushed back on much of what you suggested, but I suspect I'm not seeing what you see. Please provide specific examples of how you see this being used, how it solves problems that can't already be solved by the current features, and how to avoid the large-response problems I identified above. Best wishes.
-
I should have said: your email mostly talks about RSS, but RSS is an inefficient way to detect changes to a given ERDDAP dataset. The most efficient way to detect changes to a dataset, by far, is ERDDAP's subscription system, because your system is notified the instant a given dataset changes (not whenever you next poll, e.g., every x minutes). When the subscription system notifies your system of a change to a dataset, you can request follow-up information (e.g., the new .das, or the new .ttl if you add that, etc.). So your system can stay perfectly up-to-date with minimal requests from your system to ERDDAP and minimal data transfer between the systems. I hope that helps.
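To illustrate the notify-then-follow-up flow described above: a sketch of a minimal endpoint that could be registered as a subscription "action" URL, under the assumption that the subscription is configured to contact this URL via HTTP GET when the dataset changes (the base URL, port, and dataset ID are all placeholders):

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder values -- point these at your own ERDDAP and dataset.
ERDDAP = "https://example.org/erddap"
DATASET_ID = "myDatasetID"

class SubscriptionHandler(BaseHTTPRequestHandler):
    """Handles the HTTP GET that the subscription system sends on change."""

    def do_GET(self):
        # Follow-up request: fetch the fresh .das right after the notification,
        # so only changed datasets trigger any real data transfer.
        das = urllib.request.urlopen(f"{ERDDAP}/griddap/{DATASET_ID}.das").read()
        # ...compare `das` with the previously stored snapshot here...
        self.send_response(200)
        self.end_headers()

def run(port: int = 8080) -> None:
    """Start the endpoint; register its public URL as the subscription action."""
    HTTPServer(("", port), SubscriptionHandler).serve_forever()
```

This is only a sketch: a production endpoint would validate the request and handle multiple datasets, but it shows why subscriptions beat polling, since the .das is fetched only when a change actually occurred.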
-
Hey there,
In ERDDAP we only have access to the latest modification date and description, via the RSS file. Every time a dataset is updated, the old RSS file is completely overwritten.
So right now I'm building a
{bigParentDirectory}/history.json
which contains the history. When ERDDAP starts up, the RSS file is built from the information available in the history file. If there's no information for a dataset in the history file, ERDDAP initializes it and creates the first record for it.
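The actual record layout used in the fork isn't shown here, so the following Python sketch uses hypothetical field names purely to illustrate the append-on-change, initialize-on-first-use idea described above:

```python
import json
import time

# Hypothetical record layout -- the fork's real history.json schema may differ.
def append_history(history: dict, dataset_id: str, description: str) -> dict:
    """Append one change record for a dataset, creating its list on first use."""
    record = {
        "pubDate": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "description": description,
    }
    # setdefault handles the "first record for this dataset" case.
    history.setdefault(dataset_id, []).append(record)
    return history

history: dict = {}
append_history(history, "myDatasetID", "metadata updated")
print(json.dumps(history, indent=2))
```

On startup, the RSS entry for each dataset could then be regenerated from the last record in its list.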
With this new way of handling history, we could also think about exposing each dataset's history file as a new file format.
All changes can be found in this ERDDAP fork: https://github.com/vliz-be-opsci/FAIR-EASE-erddap
There's also a complete Docker build and runnable environment inside, but that's not the main topic.
I'm not sure we can see this as the final or optimal solution, but I really think it could be useful for the ERDDAP community.