Collection crawling seems to be broken #79

Closed
christophfriedrich opened this issue Sep 15, 2020 · 9 comments

@christophfriedrich
Collaborator

@m-mohr informed me that there's something wrong with how the Earth Engine driver is listed in the Hub:

[Screenshot]

The backend is reported as being unavailable, but it doubtlessly is available.

This bug seems to be due to the collection crawling: More than 900 collections are being reported, but GEE actually has "only" around 480, of which quite a few are rather new, so probably both the old and the new ones (~440 + ~480 = ~920) are floating around the database and causing errors...

@christophfriedrich christophfriedrich added the "bug" and "high priority" labels Sep 15, 2020
@christophfriedrich christophfriedrich added this to the v0.7 milestone Sep 15, 2020
@christophfriedrich christophfriedrich self-assigned this Sep 16, 2020
@christophfriedrich
Collaborator Author

I'm relatively sure that I did not expand the "All collections" section before taking the screenshot yesterday. Interestingly, the behaviour now is that 434 collections are listed initially, and only after expanding the section is that number replaced with 921. That's another sign that this is probably due to inconsistent database content (see also #78 (comment)).

@christophfriedrich
Collaborator Author

Okay, the problem is the primary key of the collections table: It's set on service + api_version + id. So a collection is identified e.g. by https://earthengine.openeo.org + 1.0.0-rc.2 + COPERNICUS/S2.

This was done to minimise changes (see also #56), because the service URL most likely never changes, and I thought the same of the api_version field. But now it has happened: the GEE driver's api_version was changed from 1.0.0-rc.2 to 1.0.0, causing duplicate entries.

But the grouping of all raw documents into the individual backend entries is done on the backend field, causing the old 1.0.0-rc.2 documents that were previously crawled to end up in the same aggregation as the fresh 1.0.0 ones, because they both belong to the https://earthengine.openeo.org/v1.0 backend.
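A minimal sketch of how this duplication can arise, assuming an upsert keyed on service + api_version + id. The `upsert` helper and the in-memory `Map` are hypothetical stand-ins for the real database, not the Hub's code.

```javascript
// Hypothetical in-memory stand-in for the collections table.
const table = new Map();

// Upsert keyed on service + api_version + id (the primary key described above).
function upsert(doc) {
  const key = [doc.service, doc.api_version, doc.id].join('|');
  table.set(key, doc);
}

const service = 'https://earthengine.openeo.org';
const backend = 'https://earthengine.openeo.org/v1.0';

// First crawl, while the driver still reported 1.0.0-rc.2.
upsert({ service, backend, api_version: '1.0.0-rc.2', id: 'COPERNICUS/S2' });
// Later crawl, after the driver switched to 1.0.0.
upsert({ service, backend, api_version: '1.0.0', id: 'COPERNICUS/S2' });

// Both rows survive because their keys differ ...
const rowCount = table.size;
// ... yet both carry the same backend value, so grouping on backend sees the collection twice.
const perBackend = [...table.values()].filter(d => d.backend === backend).length;

console.log(rowCount, perBackend); // 2 2
```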

And because one unsuccessful endpoint* is enough to deem the whole backend unsuccessful, GEE was flagged as such.

unsuccessfulCrawls: { $max: '$unsuccessfulCrawls' }, // use `max` to get the largest (-> "worst") number

* of any of the endpoints /, /collections, /processes, /service_types, /output_formats, /file_formats, /udf_runtimes
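What that `$max` accumulator does to a mixed group can be sketched in plain JavaScript. The sample documents and the flagging rule (any value above 0 counts as unsuccessful) are assumptions for illustration only.

```javascript
// Hypothetical raw documents that end up in the same backend group.
const docs = [
  { path: '/', api_version: '1.0.0-rc.2', unsuccessfulCrawls: 1 }, // stale, no longer refreshed
  { path: '/', api_version: '1.0.0', unsuccessfulCrawls: 0 },
  { path: '/collections', api_version: '1.0.0', unsuccessfulCrawls: 0 },
];

// Plain-JS equivalent of { $max: '$unsuccessfulCrawls' } over the group.
const worst = Math.max(...docs.map(d => d.unsuccessfulCrawls));

// Assumed rule: any value above 0 marks the whole backend as unsuccessful.
const flagged = worst > 0;

console.log(worst, flagged); // 1 true
```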

Questions arising from this:

  1. Was a changing api_version field a one-time issue in the current dev phase or should that be treated as a use case that could happen regularly?
  2. Should the grouping be changed to service+api_version?
  3. Should it need more than just a single failed endpoint to cause flagging?

For 2. I'd say yes, it kinda would've prevented this bug (it should've been changed anyway when the service+api_version change was introduced, I probably just overlooked it).

For 3. I'd say no, but how crawling errors are communicated to the user should be discussed anyway, which is why #23 exists.

@m-mohr
Member

m-mohr commented Sep 16, 2020

Thanks for investigating.

  1. Was a changing api_version field a one-time issue in the current dev phase or should that be treated as a use case that could happen regularly?

That can happen regularly (like every x months or so)

  2. Should the grouping be changed to service+api_version?

I don't fully understand that yet. Can you use the https://earthengine.openeo.org/v1.0 URL?

  3. Should it need more than just a single failed endpoint to cause flagging?

Fine with "no".

@christophfriedrich
Collaborator Author

I don't fully understand that yet.

Assume a backend changes its api_version. After crawling there will be two documents for the / endpoint in the database's raw table: one with 1.0.0-rc.2 (old, now has unsuccessfulCrawls=1) and one with 1.0.0 (new).

Now the difference is:

Grouping on backend (i.e. https://earthengine.openeo.org/v1.0):

[Screenshot]

-> 1 backend

Grouping on service+api_version:

[Screenshot]

-> 2 backends
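The two groupings can be compared with a small in-memory sketch. The `groupBy` helper is hypothetical; the URLs and versions are the ones from this thread, and the counter values follow the example above.

```javascript
const service = 'https://earthengine.openeo.org';
const backend = 'https://earthengine.openeo.org/v1.0';

// Two raw documents for the "/" endpoint after an api_version change.
const raw = [
  { service, backend, api_version: '1.0.0-rc.2', path: '/', unsuccessfulCrawls: 1 }, // old
  { service, backend, api_version: '1.0.0', path: '/', unsuccessfulCrawls: 0 },      // new
];

// Group documents by an arbitrary key, keeping the worst crawl counter per group.
function groupBy(docs, keyOf) {
  const groups = new Map();
  for (const doc of docs) {
    const key = keyOf(doc);
    const prev = groups.has(key) ? groups.get(key) : 0;
    groups.set(key, Math.max(prev, doc.unsuccessfulCrawls));
  }
  return groups;
}

const byBackend = groupBy(raw, d => d.backend);
const byServiceVersion = groupBy(raw, d => `${d.service}|${d.api_version}`);

console.log(byBackend.size);        // 1
console.log(byServiceVersion.size); // 2
```

With grouping on backend, the single group inherits the stale counter of 1; with grouping on service+api_version, the 1.0.0 group keeps its counter of 0.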

@m-mohr
Member

m-mohr commented Sep 16, 2020

Grouping on backend seems correct, but I guess the question is why crawling it doesn't drop old collections? It sounds like that's the original issue that on crawling the old data doesn't get removed or correctly updated, right?

@christophfriedrich
Collaborator Author

That's right, and the cause was the same: Old data was removed based on the backend field -- and because both old and new data had the same backend value, nothing was deleted. I now changed the deletion step to service+api_version too, so this bug is fixed.
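A rough sketch of why the key choice matters for the cleanup. It assumes the deletion step drops every stored document whose key did not appear in the latest crawl; that rule, the `removeStale` helper and the sample data are guesses for illustration, not the Hub's actual implementation.

```javascript
const service = 'https://earthengine.openeo.org';
const backend = 'https://earthengine.openeo.org/v1.0';

// Stored documents: one left over from the old version, one from the latest crawl.
const stored = [
  { service, backend, api_version: '1.0.0-rc.2', id: 'COPERNICUS/S2' },
  { service, backend, api_version: '1.0.0', id: 'COPERNICUS/S2' },
];
// What the latest crawl actually returned.
const crawled = [stored[1]];

// Assumed cleanup rule: keep only documents whose key was seen in the latest crawl.
function removeStale(docs, fresh, keyOf) {
  const seen = new Set(fresh.map(keyOf));
  return docs.filter(d => seen.has(keyOf(d)));
}

const keptByBackend = removeStale(stored, crawled, d => d.backend);
const keptByServiceVersion = removeStale(stored, crawled, d => `${d.service}|${d.api_version}`);

console.log(keptByBackend.length);        // 2, the stale document survives
console.log(keptByServiceVersion.length); // 1, only the 1.0.0 document remains
```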

I tested the crawling several times and it worked both for the GEE case and for EODC. They changed to 1.0.0 today (at least I believe so, as the deployed Hub still lists 1.0.0-rc.2 but the live backend reports 1.0.0). So I guess after tonight's crawl the deployed Hub will list EODC incorrectly too. But as soon as this fix is deployed and the next crawl has run, it will go away :)

@christophfriedrich
Collaborator Author

I'm confident this works fine, so I merged it onto master; feel free to deploy it whenever you've got the time (it's not super urgent IMO).

@m-mohr
Member

m-mohr commented Sep 21, 2020

It seems fixed. I restarted the server this morning and couldn't reproduce any longer (although the server is on the dev branch, I think).

@christophfriedrich
Collaborator Author

Right now, dev and master are identical. As long as you only pull when I tell you to do so you can leave it on dev :D
