History Improvement #111
Replies: 2 comments
-
The main purpose of RSS, in general, is to provide a way to tell when and how a dataset (or any other published entity) was last changed. The intention (in general and in ERDDAP) was never to provide a complete list of changes to the dataset. So if you concatenate all of the current RSS-noted changes for a dataset in ERDDAP, you won't necessarily get a list of all the changes to the dataset. Even if you did, it wouldn't be as precise or succinct as repeatedly running diff against the dataset's metadata (e.g., the .das or /info/ response). That relates to your comment about parsing the RSS description: the description was meant to be human-readable, not necessarily parseable by software. Diff output is less human-readable, but a more parseable and reliable way to detect specific changes.

Note that ERDDAP doesn't, and can't, detect or report many important changes. For example, if a data value between the first and last time point is changed/edited, that will in most cases not be detected or reported (via RSS or any other metadata) unless the dataset owner or ERDDAP administrator explicitly changes some metadata value. An exception is EDDTableFromHttpGet datasets, which maintain a log of changes to the data values that users can request (including with constraints, e.g., all changes made within a given range of time).

My point is: if you want a true log of all changes to a dataset, concatenating RSS messages won't give you what you want, and users who think this concatenated RSS log provides that will be mistaken. I see some value in this, e.g., it might be interesting for a human to peruse a list of changes. But given that it is incomplete, it seems of limited use. I'm also not sure I see a practical use for this that isn't solved better in some other way.
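As the paragraph above suggests, repeatedly diffing a dataset's .das metadata is a more reliable way to detect specific changes than parsing RSS descriptions. A minimal Python sketch of that polling-and-diffing approach (the dataset URL is a placeholder; point it at any ERDDAP dataset's .das):

```python
import difflib
import urllib.request

# Placeholder URL -- substitute your ERDDAP's base URL and dataset ID.
DAS_URL = "https://example.org/erddap/griddap/myDatasetID.das"

def fetch_das(url: str) -> list[str]:
    """Download the dataset's .das metadata and return it as a list of lines."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace").splitlines()

def das_diff(old: list[str], new: list[str]) -> str:
    """Return a unified diff between two .das snapshots (empty if unchanged)."""
    return "\n".join(
        difflib.unified_diff(old, new, "old.das", "new.das", lineterm="")
    )
```

A caller would keep the previous snapshot, fetch a new one on each poll, and act only when `das_diff` returns a non-empty string.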
A problem/complication with your system: some datasets change very often, e.g., every minute. That would quickly lead to a very large RSS history file; over a few years, it would be impractically large. A related problem: if you put the histories of all the datasets in a given ERDDAP into one file (or generate it on the fly, as in your example above), it will immediately be a very large file on many ERDDAPs (e.g., the CoastWatch WCN ERDDAP has 3000+ datasets). Moreover, most people wouldn't want that information for all datasets, or even for a large number of datasets; normally people focus on one dataset at a time.

I pushed back on much of what you suggested, but I suspect I'm not seeing what you see. Please provide specific examples of how you see this being used, how it solves problems that can't already be solved by the current features, and how to avoid the large-response problems I identified above. Best wishes.
-
I should have said: your email mostly talks about RSS, but RSS is an inefficient way to detect changes to a given ERDDAP dataset. The most efficient way to detect changes to a dataset, by far, is ERDDAP's subscription system, because your system is notified the instant a given dataset changes (not whenever you next poll, e.g., every x minutes). When the subscription system notifies your system of a change to a dataset, you can request follow-up information (e.g., the new .das, or the new .ttl if you add that, etc.). So your system can stay perfectly up-to-date with minimal requests from your system to ERDDAP and minimal data transfer between the systems. I hope that helps.
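To illustrate the notify-then-follow-up flow described above: a sketch of a minimal endpoint that could be registered as a subscription "action" URL, under the assumption that the subscription is configured to contact this URL via HTTP GET when the dataset changes (the base URL, port, and dataset ID are all placeholders):

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder values -- point these at your own ERDDAP and dataset.
ERDDAP = "https://example.org/erddap"
DATASET_ID = "myDatasetID"

class SubscriptionHandler(BaseHTTPRequestHandler):
    """Handles the HTTP GET that the subscription system sends on change."""

    def do_GET(self):
        # Follow-up request: fetch the fresh .das right after the notification,
        # so only changed datasets trigger any real data transfer.
        das = urllib.request.urlopen(f"{ERDDAP}/griddap/{DATASET_ID}.das").read()
        # ...compare `das` with the previously stored snapshot here...
        self.send_response(200)
        self.end_headers()

def run(port: int = 8080) -> None:
    """Start the endpoint; register its public URL as the subscription action."""
    HTTPServer(("", port), SubscriptionHandler).serve_forever()
```

This is only a sketch: a production endpoint would validate the request and handle multiple datasets, but it shows why subscriptions beat polling, since the .das is fetched only when a change actually occurred.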
-
Hey there,
In ERDDAP we only have access to the latest modification date and description, via the RSS file. Every time a dataset is updated, the old RSS file is completely overwritten.
So right now I'm building a
{bigParentDirectory}/history.json
which contains the history. When ERDDAP starts up, the RSS file is built from the information available in the history file. If there's no information for a dataset in the history file, ERDDAP initializes it and creates the first record for it.
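The actual record layout used in the fork isn't shown here, so the following Python sketch uses hypothetical field names purely to illustrate the append-on-change, initialize-on-first-use idea described above:

```python
import json
import time

# Hypothetical record layout -- the fork's real history.json schema may differ.
def append_history(history: dict, dataset_id: str, description: str) -> dict:
    """Append one change record for a dataset, creating its list on first use."""
    record = {
        "pubDate": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "description": description,
    }
    # setdefault handles the "first record for this dataset" case.
    history.setdefault(dataset_id, []).append(record)
    return history

history: dict = {}
append_history(history, "myDatasetID", "metadata updated")
print(json.dumps(history, indent=2))
```

On startup, the RSS entry for each dataset could then be regenerated from the last record in its list.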
With this new way of handling history, we could also think about exposing each dataset's history file as a new file format.
All changes can be found in this ERDDAP fork: https://github.com/vliz-be-opsci/FAIR-EASE-erddap
There's also a complete Docker build and runnable environment inside, but that's not the main topic.
I'm not sure we can see this as the final or optimal solution, but I really think it could be useful for the ERDDAP community.