nthieberger edited this page Jul 27, 2023 · 3 revisions

Catalog Proxy Thoughts

Background

The PARADISEC catalog is stored on disk. It has a fairly simple structure of collections which contain items which contain files.

For example

    /srv
        /catalog
            /NT1
                /001
                    NT1-001-001A-checksum-PDSC_ADMIN.txt
                    NT1-001-001A.mp3
                    NT1-001-001A.wav
                    NT1-001-001B-checksum-PDSC_ADMIN.txt
                    NT1-001-001B.mp3
                    NT1-001-001B.wav
                    NT1-001-CAT-PDSC_ADMIN.xml
                    NT1-001-df-PDSC_ADMIN.pdf

Currently, Nabu accesses the catalog through direct file access over an NFS mount.

The catalog also needs to be accessed by

  • Ingestion process - Currently NFS
  • Mod PARADISEC - HTTPS via Nabu

There are a number of projects currently being worked on or proposed:

  • Migrate the catalog to AWS S3
  • Convert the catalog to support OCFL

These two items make access to the catalog more complicated, OCFL in particular, given its complex file layout and the introduction of versioning.

Given the existing projects which need 'disk level' access to the catalog, and potential future projects, it would make sense to abstract away the internal workings of S3 and OCFL and expose the catalog through a simple REST interface.

This approach will also assist with the migration process: we change every project that needs catalog access once, so that it talks to the proxy, and from then on only the proxy needs updating for new access patterns.

It would also be advantageous to generalise the proxy so that it can be used by other projects with similar requirements.

Proposal

Write a simple REST-based Node.js Express application. It will support simple get and put operations on the catalog.

Use a plugin interface for backend access. It should support:

  • Standard layout on disk
  • Standard layout on S3
  • OCFL layout on disk
  • OCFL layout on S3
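
A minimal sketch of what such a plugin interface might look like, assuming every backend exposes the same small set of async methods. The method names (`listObjects`, `listFiles`, `exists`, `read`, `write`, `remove`), the `MemoryBackend` class and the `createBackend` registry are all hypothetical; the in-memory backend only stands in for the four real layouts listed above.

```javascript
// Hypothetical backend contract: each plugin (disk, S3, OCFL on disk, OCFL on S3)
// would implement these methods so the REST layer never sees storage details.
class MemoryBackend {
  constructor() {
    // Map of identifier -> Map of filename -> Buffer
    this.objects = new Map();
  }

  async listObjects() {
    return [...this.objects.keys()].sort();
  }

  async listFiles(identifier) {
    const files = this.objects.get(identifier);
    return files ? [...files.keys()].sort() : [];
  }

  async exists(identifier, filename) {
    const files = this.objects.get(identifier);
    return files !== undefined && files.has(filename);
  }

  async read(identifier, filename) {
    if (!(await this.exists(identifier, filename))) {
      throw new Error(`not found: ${identifier}/${filename}`);
    }
    return this.objects.get(identifier).get(filename);
  }

  async write(identifier, filename, data) {
    if (!this.objects.has(identifier)) this.objects.set(identifier, new Map());
    this.objects.get(identifier).set(filename, Buffer.from(data));
  }

  async remove(identifier, filename) {
    const files = this.objects.get(identifier);
    if (files) files.delete(filename);
  }
}

// Registry keyed by a backend name taken from configuration (names are assumptions).
const backends = { memory: () => new MemoryBackend() };

function createBackend(name) {
  const factory = backends[name];
  if (!factory) throw new Error(`unknown backend: ${name}`);
  return factory();
}
```

With this shape, supporting a new layout would mean adding one entry to the registry; the REST handlers would only ever call the shared methods.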

We assume a paradigm which takes two keys: an object identifier and a filename. Files are stored inside objects.

We will use configuration (possibly short JavaScript functions) which, given an identifier, determines where the object lives in the disk or OCFL layout.

For example, for PARADISEC we would pass NT1 for collections and NT1-001 for items, and the config would translate each identifier into the corresponding location in the backend layout.
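
A minimal sketch of that translation for the standard disk layout, assuming the `/srv/catalog` root shown in the Background example. The `config` object, the regular expression and the function name `identifierToDiskPath` are assumptions for illustration, and the identifier shape (`COLLECTION` or `COLLECTION-ITEM`) is inferred from the examples in this page only.

```javascript
const path = require('node:path');

// Hypothetical configuration: a catalog root plus a regex describing identifiers.
const config = {
  root: '/srv/catalog', // taken from the Background example
  // Assumed shape: COLLECTION, optionally followed by -ITEM
  pattern: /^(\w+)(?:-(\w+))?$/,
};

// Translate an object identifier into a directory on the standard disk layout.
function identifierToDiskPath(identifier, { root, pattern } = config) {
  const match = pattern.exec(identifier);
  if (!match) throw new Error(`unrecognised identifier: ${identifier}`);
  const [, collection, item] = match;
  return item === undefined
    ? path.posix.join(root, collection)
    : path.posix.join(root, collection, item);
}

console.log(identifierToDiskPath('NT1'));     // /srv/catalog/NT1
console.log(identifierToDiskPath('NT1-001')); // /srv/catalog/NT1/001
```

An OCFL or S3 backend would swap in a different function with the same input, which is what keeps the translation a matter of configuration.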

The proxy is designed to be as simple as possible. It will never require a database; it simply translates S3, disk and OCFL access into a simple REST endpoint. Configuration, either via regex or a simple JavaScript function, will aid in the translation.

Possible interface spec

GET /object/:identifier

  • Lists all the filenames in the object
  • e.g. GET /object/NT1-001, should include metadata like checksums

GET /object

  • Lists all the identifiers in the catalog
  • e.g. GET /object, should include metadata like checksums
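
For the listing endpoints, one possible response shape is sketched below. The field names (`identifier`, `files`, `size`, `checksum`) and the choice of SHA-256 are assumptions; the page only says that listings should include metadata like checksums.

```javascript
const crypto = require('node:crypto');

// Build a listing for one object from an array of { filename, data } entries.
// In a real backend the checksum would more likely be read from stored metadata
// than recomputed on every request; hashing here keeps the sketch self-contained.
function describeObject(identifier, files) {
  return {
    identifier,
    files: files.map(({ filename, data }) => ({
      filename,
      size: Buffer.byteLength(data),
      checksum: {
        algorithm: 'sha256', // assumed algorithm
        value: crypto.createHash('sha256').update(data).digest('hex'),
      },
    })),
  };
}
```

A handler for GET /object/NT1-001 could then return `describeObject('NT1-001', entries)` serialised as JSON.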

HEAD /object/:identifier/:filename

  • Checks if the file exists
  • e.g. HEAD /object/NT1-001/NT1-001-001A.mp3 would return 200 OK if the mp3 exists

GET /object/:identifier/:filename

  • Retrieves a file from the catalog
  • This could be a 200 or a 301 redirect to a signed URL
  • e.g. GET /object/NT1-001/NT1-001-001A.mp3 would download the mp3
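
The choice between a direct response and a redirect could be sketched as below, assuming a backend advertises signed-URL support through an optional method. Both `signedUrl` and `createReadStream` are hypothetical backend methods, not part of any existing API.

```javascript
// Decide how to answer GET /object/:identifier/:filename.
// If the backend can hand out a signed URL, redirect to it;
// otherwise plan a direct 200 response streamed by the proxy itself.
function planDownload(backend, identifier, filename) {
  if (typeof backend.signedUrl === 'function') {
    return {
      status: 301, // status taken from the spec above
      headers: { Location: backend.signedUrl(identifier, filename) },
    };
  }
  return {
    status: 200,
    stream: backend.createReadStream(identifier, filename),
  };
}
```

Since signed URLs expire, a temporary redirect (302 or 307) may be worth considering in place of 301, which clients are allowed to cache.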

POST /object/:identifier/:filename

  • Creates a new file in the catalog
  • Should fail if the file already exists in the current version
  • e.g. POST /object/NT1-001/NT1-001-001A.mp3 would create the mp3

PUT /object/:identifier/:filename

  • Replaces the file in the catalog
  • Should fail if the file does not already exist in the current version
  • If it exists creates a new version where versioning is supported in the backend
  • e.g. PUT /object/NT1-001/NT1-001-001A.mp3 would overwrite the mp3

DELETE /object/:identifier/:filename

  • Deletes the file from the catalog
  • Should fail if the file does not exist
  • Should create a new version without the file where versioning is supported
  • e.g. DELETE /object/NT1-001/NT1-001-001A.mp3 would delete the mp3
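
The create, replace and delete rules above can be summarised in one small function. This is a sketch only: the `Map` stands in for a real backend, and the status codes (201, 200, 204, 404, 409, 405) are assumptions, since the spec only says when each operation should fail.

```javascript
// store: Map of "identifier/filename" -> contents (stand-in for a real backend).
function applyWrite(store, method, identifier, filename, body) {
  const key = `${identifier}/${filename}`;
  const exists = store.has(key);
  switch (method) {
    case 'POST': // create: must not already exist
      if (exists) return { status: 409 };
      store.set(key, body);
      return { status: 201 };
    case 'PUT': // replace: must already exist
      if (!exists) return { status: 404 };
      store.set(key, body);
      return { status: 200 };
    case 'DELETE': // delete: must already exist
      if (!exists) return { status: 404 };
      store.delete(key);
      return { status: 204 };
    default:
      return { status: 405 };
  }
}

const catalog = new Map();
console.log(applyWrite(catalog, 'PUT', 'NT1-001', 'NT1-001-001A.mp3', 'v1'));  // { status: 404 }
console.log(applyWrite(catalog, 'POST', 'NT1-001', 'NT1-001-001A.mp3', 'v1')); // { status: 201 }
console.log(applyWrite(catalog, 'POST', 'NT1-001', 'NT1-001-001A.mp3', 'v2')); // { status: 409 }
```

On a versioned backend such as OCFL, the `store.set` and `store.delete` calls would instead write a new version of the object.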

GET /object_url/:identifier/:filename

  • Returns a signed pre-authenticated URL that the file can be fetched from
  • e.g. GET /object_url/NT1-001/NT1-001-001A.mp3 would return one of:
      • Disk-based: /object/NT1-001/NT1-001-001A.mp3
      • S3: a signed S3 URL
  • Note: we still need to define some keys for expiry and disposition for the URL
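
A sketch of how this endpoint might pick between the two cases. The `type` field, the injected `sign` function and the option names `expires` and `disposition` are all hypothetical; in a real S3 backend, `sign` would wrap whatever presigning call the AWS SDK provides.

```javascript
// Resolve the URL a client should fetch the file from.
// backend.type and backend.sign are assumed fields, not an existing API.
function objectUrl(backend, identifier, filename, options = {}) {
  const { expires = 3600, disposition = 'inline' } = options; // assumed defaults
  if (backend.type === 'disk') {
    // Disk-based catalogs are served back through the proxy itself.
    const id = encodeURIComponent(identifier);
    const file = encodeURIComponent(filename);
    return `/object/${id}/${file}`;
  }
  // Other backends delegate to their own signer.
  return backend.sign({ key: `${identifier}/${filename}`, expires, disposition });
}

console.log(objectUrl({ type: 'disk' }, 'NT1-001', 'NT1-001-001A.mp3'));
// /object/NT1-001/NT1-001-001A.mp3
```

Keeping the signer injectable means the proxy itself never needs to know how a given storage service authenticates URLs.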

Questions

  • What is the best way to support direct responses vs redirects to signed URLs?
  • Do we need POST /object/:identifier?

Future features

  • Support for listing versions of an object
  • Support for retrieving a specific version of an object
  • Authentication (something that can be configured and provided by the source system)
  • Pagination on lists