Set of commands/tools to grab metadata and objects from a CONTENTdm instance via the server and web APIs.
These files are organized in such a way that the source-system-agnostic MDMM can be used to further prepare them for migration.
- You will need Ruby
- You will need a recent-ish version of bundler (`gem install bundler` will install it if you don’t have it, update it if you do)
- You will need cdminspect installed somewhere this application can find it
If you will only work on one project and/or don’t plan on contributing code back to this repo…
You can edit `config/config.yaml` in place to set up your project. When you run commands, this default config location will be used.
If you will be working on multiple projects, need to keep your config(s) in a place where they can be backed up, or you want to avoid contributing your configs back to this repo…
Copy `config/config.yaml` to your desired location and edit the copy. Specify the path to the desired config when you run a command, like this:
exe/cdm --settings -c path/to/your/cdm_config.yaml
The example `config.yaml` included with the repo is heavily commented and is intended to be self-documenting.
For the available commands:
exe/cdm help
For details on exactly what each command does:
exe/cdm help [COMMAND]
This command currently is the best documentation for each step. Later, I will add some more info to the repo wiki.
CDM has collections.
Collections contain objects.
Objects are simple or compound.
Simple objects have a metadata record plus one associated object file.
Compound objects have child objects. The top-level compound object will have a metadata record. Child objects each have a metadata record (usually sparser than the parent record) and one associated object file.
Document-PDF is treated as a compound object in CDM, but if `cdmprintpdf`=1 in the parent record, we treat this as a simple object, with the print pdf file as the object file.
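The simple/compound rule above can be sketched in a few lines of Ruby. This is only an illustration, not code from this repo: the method name `migration_object_type` and the `compound:` argument are hypothetical, and it assumes the parent record is a parsed JSON Hash in which `cdmprintpdf` holds `1` or `"1"` when a print PDF exists.

```ruby
# Hypothetical helper, not part of this repo.
# record:   parsed CDM metadata record for the top-level object (a Hash)
# compound: true if CDM reports the object as compound (assumed to be
#           determined by the caller)
def migration_object_type(record, compound:)
  return :simple unless compound

  # Document-PDF with a print PDF is handled as a simple object,
  # with the print PDF as its object file.
  record['cdmprintpdf'].to_s == '1' ? :simple : :compound
end
```

For example, `migration_object_type({ 'cdmprintpdf' => '1' }, compound: true)` returns `:simple`, while a compound record without that flag returns `:compound`.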
`_cdmobjectinfo` contains the original CDM JSON object info downloaded for each compound object.
`_cdmrecords` contains the original CDM JSON metadata downloaded for each object (top level and child mixed together).
`_migrecords` contains the "migration records", or migrecords, generated for each item. Migrecords are the original record, augmented with information in new fields added to support the processing and migration of the data. Run `exe/cdm help get_top_records` to see the details on how migrecords are generated.
`_objects` contains the object files downloaded for each collection.
MDMM expects collection directories containing `_migrecords` and `_objects` directories.
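Before handing a collection directory to MDMM, it may be worth confirming that both expected subdirectories exist. The following is a minimal sketch, assuming each collection lives in its own directory; `missing_mdmm_dirs` and `REQUIRED_DIRS` are hypothetical names, not part of this repo or of MDMM.

```ruby
# Hypothetical check, not part of this repo: report which of the
# directories MDMM expects are missing from a collection directory.
REQUIRED_DIRS = %w[_migrecords _objects].freeze

def missing_mdmm_dirs(collection_dir)
  REQUIRED_DIRS.reject do |name|
    File.directory?(File.join(collection_dir, name))
  end
end
```

An empty array means the collection directory has the layout described above.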
- `get_coll_data`
- `get_dc_mappings`
- `get_field_data`
At this point, you can see the number of collections you are working with and the way the metadata fields have been defined for them.
- `get_pointers`
- `get_top_records`
Check your logfile for any errors or warnings at this point and resolve them before continuing, or your problems will just be compounded.
The tool creates a local JSON file for each successful API call, and tries to avoid making additional API calls to replace info you already have.
This means you can just re-run the above command to try to re-grab any records that could not be retrieved before. It will work fast and only grab what is missing. It also currently means that, if you want to refresh the records from the source, you will need to delete the local copies in `collalias/_cdmrecords`.
Use `--force=true` to re-download records and compound object info you already retrieved.
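The skip-if-already-downloaded behavior described here follows a common caching pattern, sketched below. This is not the tool's actual implementation: `fetch_with_cache` is a hypothetical name, and the block stands in for whatever API call produces the data (assumed to return a Hash or Array).

```ruby
require 'json'
require 'fileutils'

# Hypothetical sketch of the caching pattern, not this repo's code.
# Returns the locally saved JSON if present; otherwise runs the block
# (the stand-in for an API call) and saves its result to `path`.
# With force: true, the block always runs and the file is overwritten.
def fetch_with_cache(path, force: false)
  return JSON.parse(File.read(path)) if File.exist?(path) && !force

  data = yield
  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, JSON.generate(data))
  data
end
```

Under this pattern a re-run only does work for files that are missing, which is why repeating a command is cheap.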
- `get_child_records`
Use `--force=true` to re-download records you already retrieved.
Again, check your logfile for warnings or errors after this step.
- `report_error_records`
This will write a CSV report of pointers for which a valid/usable record could not be retrieved from the CDM API.
- `delete_error_records`
Optionally, get rid of these records so they don’t cause trouble in the rest of the migration. If restrictions are removed or other problems are taken care of, you can re-download just the missing records using `get_top_records` and `get_child_records` with `--force=true`. If new objects have been added, you’ll need to re-run `get_pointers` before re-grabbing records.
- `finalize_migration_records`
Again, check your logfile for warnings or errors.
`report_mig_error_records` and `delete_mig_error_records` work similarly to the aforementioned CDM-record-specific commands, but flag objects with issues like an unknown Islandora content model, or an orphaned pdfpage with no associated file.
Run `exe/cdm` to see some other report options.
At this point, you can use MDMM to handle metadata reporting, cleanup, and remapping.
- `harvest_objects`
Check logfile for errors/warnings after this step.
Simple objects: the harvested file size is compared against cdmfilesize, and a warning is logged if the values do not match.
Document-PDF objects: the single PDF is harvested. We don’t have that filesize in the CDM record, so best practice will be to validate these objects outside this process.
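The size comparison for simple objects can be illustrated as follows. This is a sketch under assumptions, not the tool's code: `filesize_warning` is a hypothetical name, and it assumes `cdmfilesize` in the record is a byte count stored as a number or numeric string. Records with no `cdmfilesize` are skipped rather than flagged.

```ruby
# Hypothetical check, not part of this repo.
# Returns nil when the harvested file matches the recorded size (or when
# the record has no cdmfilesize), and a warning string otherwise.
def filesize_warning(record, path)
  expected = record['cdmfilesize']
  return nil if expected.nil? || expected.to_s.empty?

  actual = File.size(path)
  return nil if actual == expected.to_i

  "#{path}: harvested size #{actual} does not match cdmfilesize #{expected}"
end
```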
Use `exe/cdm help` and `exe/cdm help [COMMAND]` to get more details on helper functions for harvesting and working with objects.
Bug reports and pull requests are welcome in the GitHub repo.
The gem is available as open source under the terms of the MIT License.