seed-data generation script #1
Comments
Working on harvesting DC metadata using the fetch-metadata script and on tag analysis right now. |
Possible issue with ruby on the server (aka am I missing something?). As I try to work with the ruby interpreter, it gives me the following message: The program 'ruby' can be found in the following packages:
Currently, I am working on my machine and then sftp-ing to the server. |
RVM (https://rvm.beginrescueend.com/) is now installed and the problem is fixed: maphub@samos:~$ ruby -v
In general, I recommend developing on your local machine (using the same Ruby version) and using the server only for running longer jobs.
Best,
Bernhard Haslhofer (in reply to Shion's message of Mar 7, 2012, 10:07 PM) |
Basic script has been uploaded along with a screenshot showing terminal run and the sample yaml output file. Waiting for bug fixing + server running... |
The script is run as follows:
ruby generate-loc-seeddata.rb -i mapdir -m metadatadir -n numofsamples
For example:
ruby generate-loc-seeddata.rb -i maps -m mods_metadata_8800 -n 170 |
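For readers who want to see how the three flags in that usage line could be handled, here is a minimal sketch using Ruby's standard `optparse` library. It is not the code of the uploaded script; the method name `parse_options` and the option keys (`:maps_dir`, `:metadata_dir`, `:samples`) are assumptions made for illustration.

```ruby
require 'optparse'

# Parse the -i / -m / -n flags from the usage line above.
# The key names (:maps_dir, :metadata_dir, :samples) are hypothetical.
def parse_options(argv)
  options = { samples: nil }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: ruby generate-loc-seeddata.rb -i MAPDIR -m METADATADIR [-n NUM]'
    opts.on('-i DIR', 'Directory containing the map image files') { |v| options[:maps_dir] = v }
    opts.on('-m DIR', 'Directory containing the metadata XML files') { |v| options[:metadata_dir] = v }
    opts.on('-n NUM', Integer, 'Number of sample records to generate') { |v| options[:samples] = v }
  end
  parser.parse!(argv.dup) # work on a copy so the caller's array is untouched

  # Both directories are required; -n is optional (nil means "all maps").
  missing = %i[maps_dir metadata_dir].reject { |key| options[key] }
  raise OptionParser::MissingArgument, missing.join(', ') unless missing.empty?

  options
end

options = parse_options(%w[-i maps -m mods_metadata_8800 -n 170])
puts options.inspect
```

Passing `Integer` to `opts.on` lets `optparse` reject a non-numeric `-n` value before the script starts reading files.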
Cool, thank you! I will have a look at it later today and get back to you.
Bernhard Haslhofer (in reply to Shion's message of Monday, April 9, 2012, 12:49 AM)
|
Hi Shion, the code looks good. There is quite a bit of IO overhead, but on Friday I will give you some tips on how you can fix this and speed up the script.
When I tried to execute the script on the server I had the following problem: maphub@samos:~/maphub-seeddata/scripts$ ruby generate-loc-seeddata.rb -i ../../data/maps/ -m ../../data/metadata/ -n 5
I will add this to the issues list, just to track progress.
It would also be great if you could summarize the usage of the three scripts in the README file. Please keep it short: just explain what the scripts do and how to use them.
Regarding GitHub: avoid checking in non-source files (output, screenshots, etc.). GitHub is really just for code…
Thanks again and talk to you on Friday,
Bernhard Haslhofer (in reply to Shion's message of Monday, April 9, 2012, 12:49 AM)
|
Thanks. I just updated the README. Is this fine or were you thinking about something else? |
I made some changes to the README file and added some comments to the script code...
Bernhard Haslhofer (in reply to Shion's message of Wednesday, April 11, 2012, 4:29 PM)
|
We need a script that generates a seed data file for the Library of Congress Maphub instance.
Each "map record" in the seed data file includes:
We already have the maps in place and scripts to download metadata from the LoC's GMD collection (see scripts directory). The script has to read these map identifiers, iterate over the harvested metadata records, identify matching records (based on the map identifier), and output a maphub map record for each match.
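The matching step described above could be sketched roughly as follows. This is only an illustration under a strong assumption: that a map's identifier is the file's basename without its extension, and that each metadata record is stored in a file named after the same identifier. The issue does not specify how identifiers are encoded, and the method names and sample file names below are hypothetical.

```ruby
# Identifier of a file: its basename without extension (an assumption).
def identifier_for(path)
  File.basename(path, '.*')
end

# Returns { identifier => [map_path, metadata_path] } for every map that
# has a metadata record, optionally stopping after `limit` matches.
def match_records(map_paths, metadata_paths, limit = nil)
  # Index the metadata files once so each map needs only a hash lookup.
  metadata_by_id = metadata_paths.to_h { |path| [identifier_for(path), path] }

  matches = {}
  map_paths.sort.each do |map_path|
    break if limit && matches.size >= limit

    id = identifier_for(map_path)
    matches[id] = [map_path, metadata_by_id[id]] if metadata_by_id.key?(id)
  end
  matches
end

# Hypothetical file names; in practice these would come from Dir.glob.
maps = %w[maps/g3700.jp2 maps/g3800.jp2]
metadata = %w[metadata/g3700.xml metadata/g9999.xml]
p match_records(maps, metadata)
```

Building the lookup table up front means the metadata directory is listed once instead of being rescanned for every map.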
The challenging part of this script is to select the appropriate metadata fields from the OAI-PMH records. We want only those that carry "relevant" semantics about the map. Also some data cleansing (whitespace, special chars, etc.) steps might be necessary. At the end the metadata need to be indexed by Apache Solr / Lucene.
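As a rough sketch of field selection and cleansing, the following uses REXML (the XML library bundled with Ruby) to pull a few Dublin Core elements out of a record and normalize their whitespace. The list of "relevant" fields is a placeholder, since the issue does not fix it, and the sample record and the method names `clean` and `extract_fields` are made up for illustration.

```ruby
require 'rexml/document'

# Dublin Core element namespace.
DC_NS = { 'dc' => 'http://purl.org/dc/elements/1.1/' }.freeze
# Hypothetical choice of "relevant" fields; the issue does not fix this list.
RELEVANT_FIELDS = %w[title subject description date].freeze

# Replace control characters with spaces, collapse whitespace runs, trim.
def clean(text)
  text.to_s.gsub(/[[:cntrl:]]/, ' ').gsub(/\s+/, ' ').strip
end

# Returns { field_name => [cleaned values] } for the selected fields only.
def extract_fields(xml)
  doc = REXML::Document.new(xml)
  RELEVANT_FIELDS.each_with_object({}) do |field, record|
    values = REXML::XPath.match(doc, "//dc:#{field}", DC_NS)
                         .map { |node| clean(node.text) }
                         .reject(&:empty?)
    record[field] = values unless values.empty?
  end
end

# Made-up sample record, for illustration only.
sample = <<~XML
  <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>  Map of   the
      harbor </dc:title>
    <dc:subject>Harbors</dc:subject>
    <dc:format>image/jp2</dc:format>
  </oai_dc:dc>
XML
p extract_fields(sample)
```

Fields outside the whitelist (here `dc:format`) are dropped, and empty values are skipped so they never reach the index.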
The result should be a script, generate-loc-seeddata, which takes a directory of map image files, a directory of XML files (= the metadata records), and a set of identifiers (probably a TXT file) as input and generates an output file, loc-seeddata.yaml.
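Writing the output file could look something like the sketch below, which relies on Ruby's standard `yaml` library. The record keys (`identifier`, `title`) are placeholders, since the fields of a map record are not listed here, and the method names are hypothetical.

```ruby
require 'yaml'

# Serialize an array of map records (plain hashes) to YAML text.
def seeddata_yaml(records)
  records.to_yaml
end

# Write the records to the seed data file (default name from the issue).
def write_seeddata(records, path = 'loc-seeddata.yaml')
  File.write(path, seeddata_yaml(records))
end

# Placeholder record; the real keys depend on the map record definition.
records = [{ 'identifier' => 'g3700', 'title' => 'Map of the harbor' }]
puts seeddata_yaml(records)
```

Keeping serialization separate from file writing makes it easy to inspect the YAML in a terminal before anything is written to disk.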
Possible execution:
generate-loc-seeddata maps/ metadata/*.xml
generate-loc-seeddata -n 10 maps/ metadata/*.xml
for only 10 maps