Medicines text mining tool
Repository owner: NHS Digital Analytical Services
Email: datascience@nhs.net
To contact us, raise an issue on GitHub or send us an email and we will respond promptly.
NHS Digital's Interoperable Medicines Programme has been involved in establishing a flow of medicine data from secondary care to improve medication safety, gain insights into overprescribing, understand the overuse of antibiotics, and improve the treatments related to COVID-19. Initial investigatory work with trusts and suppliers concluded that the standard for describing and coding medicines, the dictionary of medicines and devices (dm+d), is only partially adopted by secondary care organisations.

To make medicines data collected from individual hospitals comparable across England, the programme has developed text mining functionality that maps a hospital medicine description to the closest match in the dm+d standard. This functionality is known as the Medicines Text Mining Tool (MTMT). Where the medicine description differs from the closest dm+d description by more than a set threshold, the description is reported as unmapped. The mapped outputs can only be used for secondary uses of data and must not be used for direct care (e.g. they must not be used to map a hospital drug dictionary for direct care use).
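The idea of "closest match, otherwise unmapped" can be illustrated with a minimal sketch. This is not the tool's actual matching pipeline: it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and the function names, the `min_similarity` value and the example strings are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence, Tuple


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not affect the score."""
    return " ".join(text.lower().split())


def closest_match(
    description: str,
    reference_descriptions: Sequence[str],
    min_similarity: float = 0.8,  # hypothetical threshold, not the tool's value
) -> Tuple[Optional[str], float]:
    """Return (best reference description, score), or (None, score) if unmapped."""
    target = normalise(description)
    best: Optional[str] = None
    best_score = 0.0
    for candidate in reference_descriptions:
        score = SequenceMatcher(None, target, normalise(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score < min_similarity:
        # Too dissimilar to any reference description: report as unmapped.
        return None, best_score
    return best, best_score


# Illustrative reference descriptions only, not an extract of dm+d.
reference = [
    "Paracetamol 500mg tablets",
    "Amoxicillin 250mg capsules",
]

print(closest_match("paracetamol  500mg tabs", reference))
print(closest_match("unknown compound xyz", reference))
```

A purely lexical score like this one is prone to the kinds of error listed in appendix A, for example when the closest lexical match is not the correct medicine.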
- The Medicines Text Mining Tool has been developed from data derived from CareFlow Medicines Management electronic Prescribing and Medicines Administration (ePMA) systems utilised in 25 trusts in England. The data set contained 50,841,362 prescribed items with 49,844 unique descriptions. From the list of prescribed items 89.3% were mapped to the dm+d standard.
- There were 24,081,702 prescribed items where it was possible to compare the Medicines Text Mining Tool against manual dm+d mapping performed by Trusts. The results matched exactly 60.8% of the time and identified the same active ingredient (that is had a common VTM or VMP code) 99.3% of the time.
- Assurances cannot be provided around the coverage of patient episodes or bed days included in the data that was used to develop the "tool".
- All reasonable endeavours have been undertaken to clinically assure the Medicines Text Mining Tool and reasonable attempts were undertaken to rectify issues and errors that were identified.
- A list of issues that were identified but could not be rectified is provided in appendix A. Please note that this list is not comprehensive: there may be further issues and incorrect mappings that have not yet been identified.
- Therefore, for reasons outlined above, NHS Digital cannot accept clinical responsibility for use of the "tool" and any outputs from use of the mapped data. The use of both the mapping tool and mapped data is the clinical responsibility of the user.
| No | Issue | Examples |
|---|---|---|
| 1 | Differences in dm+d and ePMA naming conventions | |
| 2 | Similar names may yield the wrong match | |
| 3 | Multiple ingredient ePMA terms being matched inaccurately due to lexically similar dm+d entries | |
| 4 | ePMA typos | |
| 5 | Ambiguous ePMA term | |
| 6 | ePMA term contains additional pre/post-fixed information, abbreviations, additional spacing or brevity of the ePMA term | |
| 7 | Closest lexical match is not the correct match | |
| 8 | Mismatching due to flavours | |
| 9 | Non-medicinal product entries in ePMA | |
This code runs on a Databricks cluster (Spark version 2.4.5) with the following packages:
- collections
- datetime
- enum
- functools
- operator
- os
- pandas
- pyspark.broadcast
- pyspark.sql
- random
- re
- time
- traceback
- typing
- unittest
- uuid
- warnings
The following dm+d tables are required as inputs to the pipeline:
- amp
- vmp
- vtm
- amp_parsed
- vmp_parsed
- form
- route
- unit_of_measure
Locate and run the init_schemas notebook. You will need to specify the following parameters, which give the database and the table names to which the outputs will be written:
- db
- match_lookup_final_name
- unmappable_table_name
- accuracy_table_name
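As a rough sketch, the four parameters above could be assembled and sanity-checked before the notebook is started. The helper name and the example values are hypothetical. The commented call at the end assumes that init_schemas reads its parameters as notebook widgets and that it sits next to the calling notebook, neither of which is confirmed here.

```python
from typing import Dict


def init_schemas_arguments(
    db: str,
    match_lookup_final_name: str,
    unmappable_table_name: str,
    accuracy_table_name: str,
) -> Dict[str, str]:
    """Collect the init_schemas parameters and reject any that are blank."""
    args = {
        "db": db,
        "match_lookup_final_name": match_lookup_final_name,
        "unmappable_table_name": unmappable_table_name,
        "accuracy_table_name": accuracy_table_name,
    }
    blank = [name for name, value in args.items() if not value.strip()]
    if blank:
        raise ValueError(f"Missing values for: {', '.join(blank)}")
    return args


# Placeholder names only; substitute your own database and table names.
args = init_schemas_arguments(
    db="my_database",
    match_lookup_final_name="match_lookup_final",
    unmappable_table_name="unmappable",
    accuracy_table_name="accuracy",
)

# Inside Databricks, assuming the parameters are exposed as widgets:
# dbutils.notebook.run("./init_schemas", 600, args)
```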
Locate and run the run_notebooks notebook. You will need to specify:
- source_dataset: Whether the input data is of type source_A or source_B
- raw_input_table: The input table. If the input is of type source_B, this must contain string fields named "medication_name_value" and "form_in_text". If the input is of type source_A, this must contain a string field named "Drug"
- db: The database
- batch_size: The number of rows you'd like to process (i.e. the number of rows in raw_input_table)
- notebook_root: The location of this notebook
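Before running the pipeline, it can help to confirm that raw_input_table has the fields required for the chosen source_dataset. The sketch below encodes the requirements listed above. The helper name `missing_columns` is hypothetical and not part of the codebase, and the check covers field names only, not whether the fields are strings.

```python
from typing import Iterable, List

# Required fields per source type, as listed in the parameter descriptions.
REQUIRED_COLUMNS = {
    "source_A": {"Drug"},
    "source_B": {"medication_name_value", "form_in_text"},
}


def missing_columns(source_dataset: str, columns: Iterable[str]) -> List[str]:
    """Return the required field names that are absent from `columns`."""
    if source_dataset not in REQUIRED_COLUMNS:
        raise ValueError(
            f"source_dataset must be one of {sorted(REQUIRED_COLUMNS)}, "
            f"got {source_dataset!r}"
        )
    return sorted(REQUIRED_COLUMNS[source_dataset] - set(columns))


print(missing_columns("source_A", ["Drug", "Dose"]))
print(missing_columns("source_B", ["medication_name_value"]))
```

On a Databricks cluster the column names could be taken from `spark.table(raw_input_table).columns`, and an empty result would mean all required fields are present.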
The Medicines Text Mining Tool codebase is released under the MIT License.
The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.