Skip to content

Grab data and objects from CONTENTdm API, with an eye toward preparing it for migration into Islandora

License

Notifications You must be signed in to change notification settings

lyrasis/cdmtools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Cdmtools

Set of commands/tools to grab metadata and objects from a CONTENTdm instance via the server and web APIs.

These files are organized in such a way that the source-system-agnostic MDMM be used to further prepare them for migration.

Installation

Dependencies

  • You will need Ruby

  • You will need a recent-ish version of bundler (gem install bundler will install if you don’t have it, update it if you do)

  • You will need cdminspect installed somewhere this application can find it

Steps

Clone this repo.

cd into the resulting directory

bundle install

Configuration

If you will only work on one project and/or don’t plan on contributing code back to this repo…​ You can edit config/config.yaml in place to set up your project. When you run commands, this default config location will be used.

If you will be working on multiple projects, need to keep your config(s) in a place where they can be backed up, or you want to avoid contributing your configs back to this repo…​

Copy config/config.yaml to your desired location and edit the copy. Specify the path to the desired config when you run a command, like this:

exe/cdm --settings -c path/to/your/cdm_config.yaml

The example config.yaml included with the repo is heavily commented and intends to be self-documenting.

Usage

For the available commands:

exe/cdm help

For details on exactly what each command does:

exe/cdm help [COMMAND]

This command currently is the best documentation for each step. Later, I will add some more info to the repo wiki.

Conceptual outline

CDM has collections.

Collections contain objects.

Objects are simple or compound.

Simple objects will have metadata record plus one associated object file.

Compound objects have child objects. The top-level compound object will have a metadata record. Child objects each have a metadata record (usually sparser than the parent record) and one associated object file.

Document-PDF is treated as a compound object in CDM, but if `cdmprintpdf`=1 in the parent record, we treat this as a simple object, with the print pdf file as the object file.

_cdmobjectinfo contains the original CDM JSON object info downloaded for each compound object.

_cdmrecords contains the original CDM JSON metadata downloaded for each object (top level and child mixed together).

_migrecords contains the "migration records", or migrecords, generated for each item. Migrecords are the original record, augmented with information in new fields added to support the processing and migration of the data. Do exe/cdm help get_top_records to see the details on how migrecords are generated.

_objects contains the object files downloaded for each collection.

MDMM expects collection directories containing _migrecords and _objects directories.

Data/metadata profiling

  • get_coll_data

  • get_dc_mappings

  • get_field_data

At this point, you can see the number of collections you are working with and the way the metadata fields have been defined for them.

  • get_pointers

  • get_top_records

Check your logfile for any errors or warnings at this point and resolve them before continuing, or your problems will just be compounded.

The tool creates a local JSON file for each successful API call, and tries to be pretty good at not making additional API calls to replace info you already have.

This means you can just re-run the above command to try to re-grab any records that could not be retrieved before. It will work fast and only grab what is missing. It also currently means that, if you want to refresh the records from the source, you will need to delete the local copies in collalias/_cdmrecords

Use --force=true to re-download records and compoundobject info you already retrieved.

  • get_child_records

Use --force=true to re-download records you already retrieved.

Again, check your logfile for warnings or errors after this step

  • report_error_records

This will write a CSV report of pointers for which a valid/usable record could not be retrieved from the CDM API.

  • delete_error_records

Optionally, get rid of these records so they don’t cause trouble in the rest of the migration. If restrictions are removed or other problems taken care of, you can re-download just the missing records using get_top_records and get_child_records with --force=true. If new objects have been added you’ll need to re-run get_pointers before re-grabbing records.

  • finalize_migration_records

Again, check your logfile for warnings or errors.

report_mig_error_records and delete_mig_error_records work similarly to the aforementioned CDM-record specific commands, but flag objects with issues like unknown islandora content model, or orphaned pdfpage with no associated file.

Run exe/cdm to see some other report options

At this point, you can use MDMM to handle metadata reporting, cleanup, and remapping.

Object data

  • harvest_objects

Check logfile for errors/warnings after this step.

Simple objects: harvested file size is compared against cdmfilesize and warning is logged if the values do not match

Document-PDF objects: the single PDF is harvested. We don’t have that filesize in the CDM record, so best practice will be to validate these objects outside this process

Use exe/cdm help and exe/cdm help [COMMAND] to get more details on helper functions for harvesting and working with objects

Contributing

Bug reports and pull requests are welcome in the GitHub repo.

License

The gem is available as open source under the terms of the MIT License.

About

Grab data and objects from CONTENTdm API, with an eye toward preparing it for migration into Islandora

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published