In the following, we suppose that the LabNbook database and the versioning files (of the form `id_report.gzip`) are available on your local machine. Note that the project can be run without the Prefect orchestrator: remove the Prefect tags (`@task`, `@flow` and `@logger`) from all Python flow files (such as `flow_0.py`) and ignore steps 1 and 2 below. Here is a Prefect overview of all flows executed on a subset of missions.
1. Create a Prefect account following this link.
2. Configure Prefect Cloud following this link.
3. Run the following command in your terminal in order to clone this git repository:
   ```shell
   git clone https://github.com/anismhaddouche/Indicators.git
   ```
4. Paste the `versionning` folder into the `data` folder.
5. Create a virtual environment with conda (if conda is not installed, follow this link):
   ```shell
   conda env create -f python_env.yml
   ```
6. Modify these sections in the `pyproject.toml` file:
   - `database`: in order to connect to the database (which is assumed to be installed on your machine):
     ```toml
     user = "your_user"
     password = "your_password"
     host = "localhost"
     database_name = "your_database_name"
     ```
   - `missions`: choose whether to run the project on all missions or only on a subset:
     ```toml
     all = false # true to take all missions in the versionning folder
     subset = ["1376","453","1559","1694","556","534","1640","1694","451","1237","533","647"]
     ```
7. You have two options for running all flows:
   - Local run with Prefect: open a terminal, navigate to the `Indicators` repository and run the following commands:
     ```shell
     conda activate ml
     python scripts/run_flows.py
     ```
   - Cloud or local run with the Prefect UI:
     - Run these commands in your terminal:
       ```shell
       prefect server start
       prefect deployment build scripts/run_flows.py:run_flows -n "labnbook" && prefect deployment apply run_flows-deployment.yaml && prefect agent start -q default
       ```
     - Open the Prefect UI (cloud or local) and click `RUN` in the `Deployment` menu.
8. In order to get some reports, run this command:
   ```shell
   streamlit run scripts/dashboard.py
   ```
We describe here the Python scripts (flows) in the `scripts` folder.
The purpose of this flow is to connect to the previously installed LabNbook database and prepare LabDocs for the next flow, which computes contribution matrices.

- Dependencies:
  - The dictionary `[database]` in the `pyproject.toml` file.
- Returns:
  - The file `data/tmp/0_labdocs_texts_init.json.gz`
- Dependencies:
  - The dictionary `[regex_text_patterns]` in the `pyproject.toml` file.
- Returns:
  - The folder `data/tmp/0_missions_texts`
The purpose of this flow is to compute contribution matrices and some variables that describe LabDocs, such as the number of tokens, segments, etc.

- Dependencies:
  - The folder `data/tmp/0_missions_texts`
  - The NLP model in the config section `[nlp][spacy_model]` of the `pyproject.toml` file
- Returns:
  - The folder `data/tmp/1_missions_contrib`
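The actual contribution matrices are computed in `flow_1.py` with the spaCy model; purely to illustrate the idea, here is a hypothetical sketch that counts the tokens each user adds between successive LabDoc versions, with a naive whitespace tokenizer standing in for spaCy:

```python
from collections import Counter


def contribution_matrix(versions: list[tuple[str, str]]) -> dict[str, Counter]:
    """Count tokens added per user across LabDoc versions.

    `versions` is a list of (id_user, text) pairs, one per saved version.
    A naive whitespace split stands in for the spaCy tokenizer.
    """
    contrib: dict[str, Counter] = {}
    previous_tokens: Counter = Counter()
    for user, text in versions:
        tokens = Counter(text.split())
        added = tokens - previous_tokens  # tokens new in this version
        contrib.setdefault(user, Counter()).update(added)
        previous_tokens = tokens
    return contrib
```

For example, if user `"10917"` only appends `"c"` to a version containing `"a b"`, their row of the matrix counts exactly that one token.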
The purpose of this flow is to compute all indicators.

- Dependencies:
  - The dictionary `[missions]` in the `pyproject.toml` file
- Returns:
  - The file `data/tmp/2_collab.json.gz`
- Dependencies:
  - The two dictionaries `[nlp][model]` and `[missions]` in the `pyproject.toml` file
  - The NLP model in the config section `[config_nlp]` of the `pyproject.toml` file
- Returns:
  - The file `data/tmp/reports/2_semantic.json`
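The flows persist their results as gzipped JSON (e.g. `data/tmp/2_collab.json.gz`). A minimal save/load round-trip sketch, with hypothetical helper names rather than the flow's actual functions:

```python
import gzip
import json
from pathlib import Path


def save_indicators(indicators: dict, path: str) -> None:
    """Serialize an indicator dictionary as gzipped JSON."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(indicators, f)


def load_indicators(path: str) -> dict:
    """Read a gzipped JSON indicator file back into a dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```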
The purpose of this flow is to generate some reports.

- Dependencies:
  - The file `data/tmp/2_collab.json.gz`
- Returns:
  - The file `data/tmp/reports/3_summary_nonsemantic_indicators.csv` and its corresponding Pandas DataFrame `df_nonsemantic`
- Dependencies:
  - The file `data/tmp/reports/2_semantic.json`
- Returns:
  - The file `data/tmp/reports/3_summary_semantic_indicator.csv` and its corresponding Pandas DataFrame `df_semantic`
- Dependencies:
  - The Pandas DataFrames `df_nonsemantic` and `df_semantic`
- Returns:
  - The file `data/tmp/reports/3_times.csv`
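As a sketch of what combining the two summary tables might look like, here are miniature, entirely hypothetical DataFrames (the real ones are built by `flow_3.py` from the files above) joined on the LabDoc identifier:

```python
import pandas as pd

# Hypothetical miniature versions of the two DataFrames produced by flow_3.py.
df_nonsemantic = pd.DataFrame(
    {"id_labdoc": ["340270", "340978"], "n_tokens": [120, 45]}
)
df_semantic = pd.DataFrame(
    {"id_labdoc": ["340270", "340978"], "mean_similarity": [0.57, 1.0]}
)

# Join the non-semantic and semantic indicators on the LabDoc identifier,
# as a per-report summary might do before writing a CSV.
summary = df_nonsemantic.merge(df_semantic, on="id_labdoc")
```

A `summary.to_csv(...)` call would then produce a combined report file.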
Besides improvements to the quality of the Python code, I propose to improve the `all-MiniLM-L6-v2` model used in the task `semantic_indicator` of `flow_2.py`. To this end, we give some suggestions below.
Improve the `all-MiniLM-L6-v2` NLP model

What does this model do?

As mentioned before, this model is used in the task `semantic_indicator` of `flow_2.py`. To get an idea of how this model is used, let's suppose that we have a LabDoc that evolves from a version $v_1$ to a version $v_2$.
It is worth noticing that this model is applied sequentially between consecutive LabDoc versions. For instance, given $v_1$, $v_2$ and $v_3$, the results are of the form

$similarity(v_1,v_1) = s_1 = 1$, $similarity(v_1,v_2) = s_2$, $similarity(v_2,v_3) = s_3$,

where each score $s_i$ measures the semantic similarity between the two compared versions.
As a concrete example, here is the output for the LabDoc `340270`, which is a dictionary of the form `{"id_labdoc": {"id_trace": ["id_user", score]}}` saved in the file `data/tmp/2_semantic.json`:

```json
"340270": {"5866822": ["10893", 1], "5869856": ["10917", 0.57]}, "340978": {"5885737": ["10893", 1]}
```

Note that the first score is always equal to 1.
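To make this structure concrete, here is a small sketch (the helper name is mine, not part of the project) that flattens the nested dictionary into per-user score lists:

```python
def scores_per_user(semantic: dict) -> dict[str, list[float]]:
    """Flatten {id_labdoc: {id_trace: [id_user, score]}} into per-user scores."""
    out: dict[str, list[float]] = {}
    for traces in semantic.values():
        for id_user, score in traces.values():
            out.setdefault(id_user, []).append(score)
    return out


# The example output shown above, as a Python dictionary.
example = {
    "340270": {"5866822": ["10893", 1], "5869856": ["10917", 0.57]},
    "340978": {"5885737": ["10893", 1]},
}
```

Here `scores_per_user(example)` gathers two scores for user `"10893"` (one per LabDoc) and one for user `"10917"`.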
How does it work?
To compute the similarity between two versions of the same LabDoc, the process is done in two steps (see Figure 2 below).
- The first step involves computing a vector of numbers in $R^p$ (a tensor) for each version, denoted as $v_1$ and $v_2$, respectively. This is known as the embedding step in natural language processing (NLP).
- Then, we calculate the cosine similarity between these two vectors, $similarity(v_1, v_2)$. You can refer to the Python script `flow_2.py`, from line 104 to line 123, to see how this calculation is performed.
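The second step is the standard cosine similarity; a self-contained sketch follows (the real vectors come from the `all-MiniLM-L6-v2` embeddings, while the ones used here are toy values):

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors in R^p."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

Identical vectors give a similarity of 1 (which is why the first score of each LabDoc is 1), and orthogonal vectors give 0.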
How can we improve this model?

The objective is to improve the semantic interpretation of LabDocs by the NLP model `all-MiniLM-L6-v2` by improving its embedding. Note that I used this model in this project for its implementation simplicity, in order to have a first draft. It is not well adapted to our dataset, since we have a lot of mathematical formulas. For future work, I suggest using a better-adapted model such as MathBert, since it is trained on scientific texts containing mathematical formulas.

In order to improve the embedding of our NLP model, we have to train (fine-tune) our pre-trained model on a task using our set of LabDocs. A well-adapted task here is Masked Language Modeling (MLM). It is an unsupervised learning technique that involves masking tokens in a text sequence and training a model to predict the missing tokens. This creates an improved embedding that better captures the semantics of the text (see this tutorial).
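The masking step of the MLM objective can be sketched as follows. This is a toy illustration only; an actual fine-tune would use a framework such as Hugging Face `transformers`, as in the linked tutorial, and the function name and parameters here are my own:

```python
import random


def mask_tokens(tokens: list[str], mask_rate: float = 0.15,
                mask_token: str = "[MASK]", seed: int = 1):
    """Randomly mask tokens, returning the corrupted sequence and the targets.

    The model is then trained to recover `targets` (position -> original
    token) from the `masked` sequence; this is the MLM objective.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets
```

On very short sequences few or no tokens may be selected at a 15% rate; over a whole corpus of LabDocs, roughly 15% of tokens become prediction targets.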