seed-data generation script #1

Open
behas opened this issue Mar 2, 2012 · 9 comments

@behas
Member

behas commented Mar 2, 2012

We need a script that generates a seed data file for the Library of Congress Maphub instance.

Each "map record" in the seed data file includes:

  • pointers to the map image file URIs
  • selected metadata fields (title, description, subject, creator)

We already have the maps in place and scripts to download metadata from the LoC's GMD collection (see scripts directory). The script has to read these map identifiers, iterate over the harvested metadata records, identify matching records (based on the map identifier), and output a maphub map record for each match.
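The matching step described above could be sketched roughly as follows. This is only an illustration: it assumes that a map image and its metadata record share the same base file name (for example `map001.jp2` and `map001.xml`), which may not be how the LoC identifiers are actually laid out, and the helper names `identifier_for` and `match_records` are made up for this sketch.

```ruby
# Hypothetical helper: derive an identifier from a file name,
# e.g. "maps/map001.jp2" -> "map001". Assumes the identifier is the base name.
def identifier_for(path)
  File.basename(path, File.extname(path))
end

# Pair every map image with the metadata file that shares its identifier.
# Returns a hash of the form { identifier => { image: ..., metadata: ... } }.
def match_records(map_paths, metadata_paths)
  # Index the metadata files once so each map lookup is a hash access
  # instead of a scan over all metadata files.
  metadata_by_id = metadata_paths.each_with_object({}) do |path, index|
    index[identifier_for(path)] = path
  end

  map_paths.each_with_object({}) do |path, matches|
    id = identifier_for(path)
    next unless metadata_by_id.key?(id) # skip maps without a metadata record
    matches[id] = { image: path, metadata: metadata_by_id[id] }
  end
end
```

Building the index up front keeps the work proportional to the number of files rather than to the product of maps and metadata records.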

The challenging part of this script is selecting the appropriate metadata fields from the OAI-PMH records. We want only those that carry "relevant" semantics about the map. Some data cleansing steps (whitespace, special characters, etc.) might also be necessary. In the end, the metadata needs to be indexed by Apache Solr / Lucene.
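As a rough idea of what the cleansing step might look like, here is a minimal sketch. Which characters actually need cleaning depends on the harvested records, so the rules below (replace control characters, collapse whitespace, drop empty values) are assumptions, and `clean_field` is a hypothetical name.

```ruby
# Normalize a single metadata value before it goes into the seed data.
# Returns nil for missing or whitespace-only values so callers can skip them.
def clean_field(value)
  return nil if value.nil?

  # Turn control characters (tabs, newlines, etc.) into spaces.
  without_control = value.gsub(/[[:cntrl:]]/, " ")
  # Collapse runs of whitespace and trim both ends.
  cleaned = without_control.gsub(/\s+/, " ").strip

  cleaned.empty? ? nil : cleaned
end
```

Returning nil for empty values makes it easy to leave such fields out of the record instead of indexing blank strings.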

The result should be a script, generate-loc-seeddata, which takes three inputs: the directory of map image files, a directory of XML files (= the metadata records), and a set of identifiers (probably a TXT file). It generates an output file, loc-seeddata.yaml.

Possible execution:

generate-loc-seeddata maps/ metadata/*.xml

generate-loc-seeddata -n 10 maps/ metadata/*.xml (for only 10 maps)
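For the output file, a sketch of how one map record could be written with Ruby's standard `yaml` library is shown here. The key names (`identifier`, `image_uris`, and so on) and the flat layout are assumptions made for illustration; the actual maphub seed-data schema is not defined in this issue.

```ruby
require "yaml"

# Build one seed-data record from the fields listed in this issue
# (title, description, subject, creator). Key names are hypothetical.
def build_record(identifier, image_uris, metadata)
  {
    "identifier"  => identifier,
    "image_uris"  => image_uris,
    "title"       => metadata["title"],
    "description" => metadata["description"],
    "subject"     => metadata["subject"],
    "creator"     => metadata["creator"]
  }
end

# Serialize a list of records into a YAML document string.
def seed_data_yaml(records)
  records.to_yaml
end
```

The resulting string could then be written out with something like `File.write("loc-seeddata.yaml", seed_data_yaml(records))`.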

@ghost ghost assigned shionguha Mar 2, 2012
@shionguha
Member

Working on harvesting DC metadata using the fetch-metadata script, and on tag analysis, right now.

@shionguha
Member

Possible issue with ruby on the server (aka am I missing something?). As I try to work with the ruby interpreter, it gives me the following message:

The program 'ruby' can be found in the following packages:

  • ruby1.8
  • ruby1.9.1
    Ask your administrator to install one of them

Currently, I am working on my machine and then sftp-ing to the server.

@behas
Member Author

behas commented Mar 8, 2012

RVM (https://rvm.beginrescueend.com/) is now installed and the problem is fixed:

maphub@samos:~$ ruby -v
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-linux]

In general, I recommend developing on your local machine (using the same Ruby version) and using the server only for running longer jobs.

Best,
Bernhard


Bernhard Haslhofer
Postdoc Associate
Cornell Information Science
301 College Ave.
Ithaca, NY 14850
WWW: http://www.cs.cornell.edu/~bh392/
Skype: bernhard.haslhofer

@shionguha
Member

Basic script has been uploaded along with a screenshot showing terminal run and the sample yaml output file. Waiting for bug fixing + server running...

@shionguha
Member

The script is run as follows:

ruby generate-loc-seeddata.rb -i mapdir -m metadatadir -n numofsamples

e.g. ruby generate-loc-seeddata.rb -i maps -m mods_metadata_8800 -n 170
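For reference, flags like these can be parsed with Ruby's standard `optparse` library. The sketch below is not the code in generate-loc-seeddata.rb; it only shows one way the `-i`, `-m`, and `-n` options could be handled. The long option names (`--maps`, `--metadata`, `--limit`) and the `parse_options` helper are assumptions added for illustration.

```ruby
require "optparse"

# Parse the command-line arguments into an options hash.
# -i and -m are treated as required; -n is an optional sample limit.
def parse_options(argv)
  options = { limit: nil }

  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ruby generate-loc-seeddata.rb -i MAPDIR -m METADATADIR [-n COUNT]"
    opts.on("-i", "--maps DIR", "Directory of map image files") { |v| options[:maps] = v }
    opts.on("-m", "--metadata DIR", "Directory of metadata XML files") { |v| options[:metadata] = v }
    opts.on("-n", "--limit COUNT", Integer, "Number of sample records") { |v| options[:limit] = v }
  end
  parser.parse(argv)

  # Fail early with a clear message if a required directory is missing.
  missing = [:maps, :metadata].reject { |key| options[key] }
  raise OptionParser::MissingArgument, missing.join(", ") unless missing.empty?

  options
end
```

Calling `parse_options(ARGV)` at the top of the script would then give a single hash to pass around instead of reading `ARGV` in several places.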

@behas
Member Author

behas commented Apr 9, 2012

Cool, thank you! I will have a look at it later today and get back to you.


@behas
Member Author

behas commented Apr 11, 2012

Hi Shion,

The code looks good. There is quite a bit of IO overhead, but on Friday I will give you some tips on how to fix this and speed up the script.

When I tried to execute the script on the server I had the following problem:

maphub@samos:~/maphub-seeddata/scripts$ ruby generate-loc-seeddata.rb -i ../../data/maps/ -m ../../data/metadata/ -n 5
Finding all the map files...
Finding all the metadata files ...
Creating output YAML File
generate-loc-seeddata.rb:177:in `<main>': undefined method `parent' for nil:NilClass (NoMethodError)

I will add this to the issues list, just to track progress.
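An error of this kind usually means that a lookup returned nil and a method was then called on the result. Without knowing what line 177 does, the following is only a generic sketch of guarding against that case. `Node`, `find_node`, and `parent_name` are hypothetical stand-ins, not names from the script or from any XML library.

```ruby
# Hypothetical stand-in for a parsed XML node with a name and a parent.
Node = Struct.new(:name, :parent)

# Hypothetical lookup that returns nil when no node has the given name.
def find_node(nodes, name)
  nodes.find { |node| node.name == name }
end

# Return the parent's name, or nil if the node or its parent is missing,
# instead of raising NoMethodError on nil.
def parent_name(nodes, name)
  node = find_node(nodes, name)
  return nil if node.nil? || node.parent.nil?

  node.parent.name
end
```

With a guard like this, a record that lacks the expected element can be skipped or logged rather than aborting the whole run.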

It would also be great if you could summarize the usage of the three scripts in the README file. Please keep it short, just explain what the scripts do and how to use them.

Regarding GitHub: avoid checking in non-source files (output, screenshots, etc.). GitHub is really just for code…

Thanks again and talk to you on Friday,

Bernhard


@shionguha
Member

Thanks. I just updated the README. Is this fine or were you thinking about something else?

@behas
Member Author

behas commented Apr 13, 2012

I made some changes to the README file and added some comments to the script code...

