Files organization
The organization of template files for Chatette is very permissive.
First, as we said, template files don't need to have a specific extension, as long as they contain text. That said, all the examples in this repo use the extension .chatette.
After that, we need to define which files should be created and how they should be organized in order for the program to work.
Once again, this is very permissive: the simplest solution is to put all your unit declarations, in any order, in a single file and execute Chatette on that file. If some templates or unit declarations are incorrect, errors will be printed on the error output of the terminal (stderr); warnings will be printed on stdout.
We will call the file on which Chatette is executed the master file (or root file).
As you keep adding intents and templates, this master file will grow to the point where it becomes very hard to work with and to find things in it. For this reason, it is possible to break the master file into multiple files. To be exact, you can take any number of unit declarations out of the master file and put them in another file. To get the same behavior as before, you then need to include that file back into the master file (at any point you want inside it). When the parser encounters a file inclusion, it pauses reading the current file, parses the included file, then resumes parsing the first file where it left off. It is worth noting that included files can themselves include other files (and so on), making any "file inclusion tree" possible, with the root always being the master file.
As explained earlier, a file is included by using an unindented line that starts with a pipe symbol | and is followed by the path to the included file, relative to the file currently being parsed.
For example, a file located next to a directory named included could contain the line |included/included-file.chatette in order to include the file included-file.chatette that is located in that directory.
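The inclusion mechanism can be pictured with a small Python sketch. This is not Chatette's actual parser: the function name expand_includes is hypothetical, and the sketch only illustrates the order in which lines are read, with each included path resolved relative to the file currently being parsed.

```python
from pathlib import Path


def expand_includes(path):
    """Return the lines of `path`, with every inclusion line replaced by
    the lines of the included file (hypothetical helper for illustration,
    not part of Chatette)."""
    path = Path(path)
    lines = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith("|"):
            # An unindented line starting with a pipe: the rest of the line
            # is a path relative to the file currently being parsed.
            included = path.parent / line[1:].strip()
            # Read the included file completely before resuming this one.
            lines.extend(expand_includes(included))
        else:
            lines.append(line)
    return lines
```

The sketch does no cycle detection, so two files that include each other would recurse until Python raises an error.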
A usual way of organizing files is to have the intents inside the master file, and other files for aliases and slots (or even several files for aliases that belong together and several files for slots that belong together).
Starting with v1.5.0, when using the Rasa adapter, you can provide a base file that Chatette will extend. This is convenient when you want to define regexes and lookup tables for the subsequently trained Rasa NLU model.
Providing a base file is completely optional; if no base file is given, no regexes or lookup tables will be defined.
Such a base file thus contains a JSON object with the following format:
{
  "rasa_nlu_data": {
    "common_examples": null,
    "entity_synonyms": null,
    "regex_features": [],
    "lookup_tables": []
  }
}
with regex_features and lookup_tables containing anything you want. On the other hand, the contents of common_examples and entity_synonyms will be overwritten by the data generated by Chatette.
This JSON object must be the content of a file whose path is given to the script with the command line option --base-file. This path can be absolute or relative to the current working directory.
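To make the "extend" behavior concrete, here is a minimal Python sketch of how a base file's contents could be combined with generated data. It is an illustration under assumptions, not Chatette's code: the function extend_base is hypothetical, and the sample regex, lookup table, example and synonym entries are made-up values written in the JSON layout that Rasa NLU used for its training data.

```python
import json


def extend_base(base, common_examples, entity_synonyms):
    """Keep regex_features and lookup_tables from the base object and
    overwrite the two keys that hold generated data (hypothetical helper)."""
    data = dict(base["rasa_nlu_data"])
    data["common_examples"] = common_examples
    data["entity_synonyms"] = entity_synonyms
    return {"rasa_nlu_data": data}


# Made-up base file contents: only the last two keys are hand-written.
base = {
    "rasa_nlu_data": {
        "common_examples": None,
        "entity_synonyms": None,
        "regex_features": [{"name": "zipcode", "pattern": "[0-9]{5}"}],
        "lookup_tables": [{"name": "city", "elements": ["Paris", "London"]}],
    }
}

# Made-up stand-ins for the data generated from the template files.
generated_examples = [
    {"text": "fly to Paris", "intent": "book_flight",
     "entities": [{"start": 7, "end": 12, "value": "Paris", "entity": "city"}]}
]
generated_synonyms = [{"value": "Paris", "synonyms": ["paris", "Paname"]}]

merged = extend_base(base, generated_examples, generated_synonyms)
print(json.dumps(merged, indent=2))
```

Whatever was stored under common_examples and entity_synonyms in the base object is discarded, which is why null is a reasonable placeholder for those two keys.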
After the program has been executed on correct template files, output files should have been generated. By default, the outputs are placed in a directory called output (unless you specified otherwise). The training examples will be in a sub-directory called train, while the testing examples (if any) will be found in a sub-directory called test.
The exact organization of the files in those directories, and even which files are generated, depends on the adapter that was chosen: rasa (the default), rasa-md or jsonl.
The rasa adapter generates JSON files that can be used as inputs for Rasa NLU. The JSON-formatted examples are simply put in files in the output/train and output/test directories.
If there are more than 10,000 training examples (or, respectively, test examples), several output files will be created, each containing at most 10,000 examples. Those files will be named output.X.json, where X is a number.
Each of those files will contain the synonym information, inside a JSON object, as shown in the documentation of Rasa NLU.
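The splitting rule can be sketched in a few lines of Python. This is only an illustration: the function name split_into_files is hypothetical, and numbering the files from 1 is an assumption, since the text above only says that X is a number.

```python
def split_into_files(examples, max_per_file=10_000):
    """Group examples into chunks of at most `max_per_file` items and map
    each chunk to a file name of the form output.X.json.

    Assumption: X starts at 1; the real numbering may differ.
    """
    chunks = [
        examples[start:start + max_per_file]
        for start in range(0, len(examples), max_per_file)
    ]
    return {
        f"output.{number}.json": chunk
        for number, chunk in enumerate(chunks, start=1)
    }
```

With 25,000 examples, this yields three chunks: two of 10,000 examples and one of 5,000.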
The rasa-md adapter generates a Markdown file that can be used as input for Rasa NLU. As defined in the documentation of that project, a section is associated with each intent, and each subsequent line is a generated example for that intent, with entities (slots) denoted as links. Moreover, the information related to each set of synonyms is given its own section in that file.
As Rasa NLU only expects one such file, called nlu.md, only one file will be generated in the output/train directory (and at most one in the output/test directory). As expected, this file is named nlu.md in both cases.
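As an illustration of the layout described above, the following Python sketch renders one intent section in the Markdown training-data format of Rasa NLU, where a section starts with "## intent:" and each entity is written as a link, [value](slot). The function render_intent_section and its input shape are hypothetical; the file Chatette actually writes may differ in details.

```python
def render_intent_section(intent, examples):
    """Render one intent section.

    `examples` is a list of (text, entities) pairs, where `entities` is a
    list of (start, end, slot_name) character spans (assumed input shape).
    """
    lines = [f"## intent:{intent}"]
    for text, entities in examples:
        # Replace spans from right to left so earlier offsets stay valid.
        for start, end, slot in sorted(entities, reverse=True):
            text = f"{text[:start]}[{text[start:end]}]({slot}){text[end:]}"
        lines.append(f"- {text}")
    return "\n".join(lines)


print(render_intent_section(
    "book_flight",
    [("fly to Paris", [(7, 12, "city")]), ("book a flight", [])],
))
```

Synonym sections follow the same pattern in that format, with a "## synonym:" header followed by one list item per synonym.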
The jsonl adapter generates JSONL files (which cannot be used with Rasa NLU): each line of the output file is a JSON object directly mapping to the internal representation of an example in Chatette.
As with the rasa adapter, files named output.X.jsonl (note the extension .jsonl), where X is a number, will be generated in the output/train directory (and in the output/test directory if asked). Each such file will contain at most 10,000 examples.
Moreover, a file named synonyms.json will be generated in both of those directories; it contains a JSON object that corresponds to the internal representation of this information inside Chatette.
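Since every line of a .jsonl file is a standalone JSON object, such a file can be read back with the standard library alone. The sketch below makes no assumption about the keys of each object, as they mirror Chatette's internal representation; the function name load_jsonl is hypothetical.

```python
import json


def load_jsonl(path):
    """Parse a JSONL file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The synonyms.json file, by contrast, holds a single JSON object and can be read with one call to json.load.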
If you read this wiki in order, you should now be able to make your own template files.
You can find some illustrative examples on this page or in the repository, and learn about the command line interface of the program here.