Problem with analysis specific command line parameters #500
I think configuration files (JSON or YAML...) should be used instead of options. |
Yes, it would be elegant, but it requires rewriting a number of things, both in the scripts and on the Java side (the input for Java would then be a file name, so we would have to parse the file, marshal it into Java objects, manage the irrelevant parameters, etc.). What about the documentation of the parameters at the script level? Validation at the script level? These are some of the questions that come to mind. But whether we use options or a config file, I think the first step would be the separation of general and specific parameters. For K10plus it would look like this:

```yaml
general:
  - schemaType: PICA
  - marcFormat: PICA_NORMALIZED
  - emptyLargeCollectors: true
  - groupBy: 001@\$0
  - groupListFile: src/main/resources/k10plus-libraries-by-unique-iln.txt
  - ignorableFields: 001@,001E,001L,001U,001U,001X,001X,002V,003C,003G,003Z,008G,017N,020F,027D,031B,037I,039V,042@,046G,046T,101@,101E,101U,102D,201E,201U,202D,1...,2...
  - allowableRecords: '002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]" && (002@.0 !~ "^.v" || 021A.a?)'
  - solrForScoresUrl: http://localhost:8983/solr/k10plus_pica_grouped_validation
index:
  - indexWithTokenizedField: true
  - indexFieldCounts: true
```

I need to think this option over. |
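As a rough illustration of how such a file could be wired into the existing scripts without rewriting the Java side, here is a minimal sketch. The helper name `config_to_options` and the handling of list markers are my own assumptions, not part of qa-catalogue; it merely turns `key: value` lines into the `--key value` options the analysis scripts already accept:

```shell
# Hypothetical helper (not part of qa-catalogue): convert "key: value"
# lines of a YAML-ish config into "--key value" CLI options.
# Section headers ("general:", "index:"), comments and blank lines are
# skipped; values of "true" become bare boolean flags.
config_to_options() {
  while IFS= read -r line; do
    # strip leading whitespace and a "- " list marker
    line=$(printf '%s' "$line" | sed 's/^[[:space:]]*//; s/^- //')
    case "$line" in
      ''|'#'*) continue ;;                  # blank line or comment
      *': '*)                               # a key with a value
        key=${line%%:*}
        value=${line#*: }
        if [ "$value" = "true" ]; then
          printf -- '--%s ' "$key"          # boolean flag, no argument
        else
          printf -- '--%s %s ' "$key" "$value"
        fi ;;
      *) continue ;;                        # section headers like "general:"
    esac
  done
}
```

Fed the `general:` section above, this would emit options such as `--schemaType PICA --marcFormat PICA_NORMALIZED --emptyLargeCollectors …`, which the current scripts could consume unchanged.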
The current configuration format is a set of parameters passed to the script.
As we are free to choose, I'd prefer TOML. Here is an example:

```toml
# general configuration
# can be overridden by same-name arguments to the qa-catalogue script
# can also be read from .env file (this should then be removed when proper configuration files exist)
name = "MyCat"
mask = "*.dat"
schema = "PICA"

# some general configuration is currently set by --params
ignorableFields = ["001@","001E","001L","001U","001U","001X","001X","002V","003C","003G","003Z",
                   "008G","017N","020F","027D","031B","037I","039V","042@","046G","046T",
                   "101@","101E","101U","102D","201E","201U","202D","1...","2..."]

# additional configuration for individual analysis tasks
[validate]
indexWithTokenizedField = true
indexFieldCounts = true
```

In any case it's important to keep the configuration names used by … |
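For simple top-level keys like the ones above, even a dependency-free sketch in shell could get the scripts started before a real parser is adopted. This is a minimal, assumption-laden illustration (the helper `toml_get` is made up; arrays, tables, and string escaping would need a proper parser such as stoml):

```shell
# Hypothetical sketch, NOT a TOML parser: read one simple top-level
# key = "value" pair from TOML-formatted text on stdin.
# Arrays, [tables] and escapes are out of scope here.
toml_get() {  # usage: toml_get KEY < file
  key=$1
  while IFS= read -r line; do
    case "$line" in
      "$key "*|"$key="*)
        value=${line#*=}                                    # drop 'key ='
        value=$(printf '%s' "$value" | sed 's/^[[:space:]]*//; s/^"//; s/"$//')
        printf '%s\n' "$value"
        return 0 ;;
    esac
  done
  return 1
}
```

For example, `toml_get schema < config.toml` would print `PICA` for the file above. A real integration should still use a tested parser; this only shows how few of the keys the Bash layer actually needs.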
@nichtich I like this kind of configuration. The only open question I see now is the following: there is a list of TOML parsers at https://github.com/toml-lang/toml/wiki, and it includes a Bash parser as well: stoml. So we should include it as a dependency. I do not see a Linux package that can be easily installed; we would have to fetch it from the GitHub project page's release section. |
@nichtich I found an alternative, still reliable option that does not introduce new technology into the existing stack. It requires three steps. First, describe the allowed parameters of each analysis in a JSON file:

```json
{
  "common": [
    {"short": "n", "long": "name", "hasArg": true},
    {"short": "x", "long": "marcxml", "hasArg": false},
    ... // other parameters
  ],
  "validate": [
    {"short": "R", "long": "format", "hasArg": true},
    {"short": "W", "long": "emptyLargeCollectors", "hasArg": false},
    ... // other parameters
  ],
  "completeness": [
    {"short": "R", "long": "format", "hasArg": true},
    {"short": "V", "long": "advanced", "hasArg": false},
    ... // other parameters
  ],
  ... // other analyses
}
```

Second, filter the parameters with a small PHP script. For example,

```shell
php parameter-filter.php validate --name KBR --emptyLargeCollectors --advanced
```

returns `--name KBR --emptyLargeCollectors`, because `--advanced` is not declared for `validate` in the JSON. Third, use the filtered list in the analysis scripts:

```shell
PARAMS=$(php parameter-filter.php validate ${TYPE_PARAMS})
./validate ${GENERAL_PARAMS} ${OUTPUT_PARAMS} ${PARAMS} ${MARC_DIR}/$MASK 2> ${LOG_DIR}/validate.log
```
|
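The core of the filtering idea can be sketched in shell as well. This is a hypothetical re-implementation for illustration only, with the allow-list hard-coded instead of read from the JSON file (the real `parameter-filter.php` gets `hasArg` from the JSON, so it does not have to guess whether an unknown option carries a value):

```shell
# Hypothetical sketch of the parameter-filter idea for the "validate"
# task: keep only declared options; drop unknown options together with
# their values. The allow-list below mirrors the JSON example.
filter_validate_params() {
  out=''
  while [ $# -gt 0 ]; do
    case "$1" in
      --name|--format)                    # declared option with argument
        out="$out $1 $2"; shift 2 ;;
      --marcxml|--emptyLargeCollectors)   # declared boolean flag
        out="$out $1"; shift ;;
      --*)                                # unknown option: drop it and,
        case "${2:-}" in                  # if the next token is not an
          ''|--*) shift ;;                # option, drop it as its value
          *) shift 2 ;;
        esac ;;
      *) shift ;;                         # stray non-option token
    esac
  done
  printf '%s\n' "${out# }"
}
```

Called as `filter_validate_params --name KBR --emptyLargeCollectors --advanced`, it prints `--name KBR --emptyLargeCollectors`, matching the PHP example above.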
I don't understand why you did not use Java directly instead of Java => JSON => PHP. I'd prefer to entirely replace PHP in the backend (the existing scripts use the language for simple tasks only), as we already have three other languages (95% Java, 2% Bash, 2% R). But in the end it works, so thanks! |
@nichtich You are right, I should have created it in Java. But I had something else in mind that I did not write down: the JSON is also intended to be reused in the web UI to interpret the parameters at the bottom of the analysis pages. |
With the current approach of CLI scripts there is a problem. In the specific analysis shell scripts (such as `./validate`) we list all the allowed parameters as short and long options. But when we run commands such as "all", we put every parameter into a `TYPE_PARAMS` variable. When `./validate` parses the parameters with `getopt`, it removes the unknown parameter names but keeps their values. For example: the parameters (`$@`) are … and after `getopt` parsing they become …: `--unknown_param3` is removed, but its value `value3` remains there! We have two types of parameters: i) those that take a value and ii) those that don't. This is not a problem for type ii, but for type i the values of the unknown parameters are moved into the file list, and that causes errors.
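The leak is easy to reproduce with util-linux's enhanced `getopt`. In this minimal sketch the option names are made up: `--name` stands for a declared option, `--unknown_param3` for one the script does not list:

```shell
# Reproduce the leak described above: util-linux getopt drops an unknown
# long option but keeps its value, which then appears among the
# positional ("file") arguments after "--".
demo_getopt_leak() {
  # only -n/--name is declared; --unknown_param3 is not; --quiet
  # suppresses the "unrecognized option" message
  getopt --quiet --options n: --longoptions name: -- "$@" || true
}
```

Running `demo_getopt_leak --name KBR --unknown_param3 value3` prints something like ` --name 'KBR' -- 'value3'`: the unknown option name is gone, but `value3` survives where the file mask is expected.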
How to solve this issue? I could imagine the following solutions:

1. … `./common-script` does it in a limited fashion.
2. … in `./common-script` and in `./qa-catalogue` we handle those. It is the most elegant solution, but it is not backward compatible: the current scripts should be revised to check the usage of `TYPE_PARAMS` (those that call `./common-script` directly) and `--params` (those that call `./qa-catalogue`).
3. … `getopt`, allow every parameter, but handle the unknown parameters on the Java side (I am not sure if it is possible).

@nichtich What do you think?
Based on the comment at #issuecomment-2475769005, the (first) implementation follows these steps: