---
title: Harnessing generative AI to annotate the severity of all phenotypic abnormalities within the Human Phenotype Ontology
author:
  - name: Kitty B Murphy
    orcid: 0000-0002-8669-3076
    corresponding: true
    email: kitty.murphy@ukdri.ac.uk
    roles:
      - Investigation
      - Project administration
      - Software
      - Visualization
    affiliations:
      - Department of Brain Sciences, Imperial College London, UK
      - UK Dementia Research Institute at Imperial College London, UK
  - name: Brian M Schilder
    orcid: 0000-0001-5949-2191
    corresponding: false
    email: brian_schilder@alumni.brown.edu
    roles:
      - Investigation
      - Project administration
      - Software
      - Visualization
    affiliations:
      - Department of Brain Sciences, Imperial College London, UK
      - UK Dementia Research Institute at Imperial College London, UK
  - name: Nathan G Skene
    orcid: 0000-0002-6807-3180
    email: n.skene@imperial.ac.uk
    corresponding: true
    roles:
      - Investigation
      - Project administration
      - Software
      - Visualization
    affiliations:
      - Department of Brain Sciences, Imperial College London, UK
      - UK Dementia Research Institute at Imperial College London, UK
keywords:
  - Ontology
  - Human Phenotype Ontology
  - Large Language Model
  - GPT-4
  - Artificial intelligence
  - Generative AI
  - Rare disease
  - Automation
  - Medicine
bibliography: references.bib
date: last-modified
citation:
  container-title: medRxiv
---
```{r setup, echo=FALSE, message=FALSE, warning=FALSE, cache=FALSE}
library(dplyr)
library(ggplot2)
library(ggtext)
library(patchwork)
library(HPOExplorer)
library(KGExplorer)
library(data.table)
library(Hmisc)
library(corrplot)
library(ggstatsplot)
options("knitr.table.format"="pipe")
```
```{r import-data}
#### Load HPO data
tag <- "v2024-02-08"
save_dir <- "data"
keep_descendants <- "Phenotypic abnormality"
p2g <- HPOExplorer::load_phenotype_to_genes(tag = tag,
save_dir=save_dir)
hpo <- HPOExplorer::get_hpo(method="github",
tag=tag,
save_dir=save_dir)
hpo <- KGExplorer::filter_ontology(ont = hpo,
keep_descendants = keep_descendants)
#### Read HPO GPT annotations
gpt_annot <- HPOExplorer::gpt_annot_read(hpo = hpo)
#### Compute IDs counts
ids_matched <- intersect(c(hpo@terms,names(hpo@alternative_terms)),
gpt_annot$hpo_id)
names_matched <- intersect(hpo@elementMetadata$name,
trimws(gsub("obsolete","",gpt_annot$hpo_name, ignore.case = TRUE)))
n_ids_hpo <- hpo@n_terms
n_ids_gpt <- length(unique(data.table::fcoalesce(gpt_annot$hpo_name,gpt_annot$hpo_id)))
```
```{r gpt_annot_check}
query_hits <- HPOExplorer::search_hpo(search_cols = c("name"),
hpo = hpo)
#### Run annotation checks
checks <- HPOExplorer::gpt_annot_check(annot = gpt_annot,
query_hits = query_hits)
# checks$true_pos_rate
```
```{r gpt_annot_codify}
res_coded <- HPOExplorer::gpt_annot_codify(annot = gpt_annot)
annot_melt <- HPOExplorer::gpt_annot_melt(res_coded = res_coded)
```
```{r gpt_annot_class, results='hide'}
res_class <- HPOExplorer::gpt_annot_class(res_coded = res_coded)
fig_severity_class <- ggstatsplot::ggbetweenstats(res_class,
x="severity_class", y="severity_score_gpt")
fig_severity_class_dt <- MSTExplorer:::get_ggstatsplot_stats(fig_severity_class)
```
```{r gpt_annot_plot}
gpt_annot_plot_out <- HPOExplorer::gpt_annot_plot(
annot = gpt_annot[hpo_name!=keep_descendants],
keep_descendants = keep_descendants,
width = 10)
```
```{r weights_dict}
weights_dict <- eval(formals(gpt_annot_codify)$weights_dict)
annotation_order <- gsub("_"," ",names(sort(unlist(weights_dict), decreasing = TRUE)))
```
## Abstract
There are thousands of human phenotypes which are linked to genetic variation. These range from the benign (white eyelashes) to the deadly (respiratory failure). The Human Phenotype Ontology (HPO) has categorised all human phenotypic variation into a unified framework that defines the relationships between phenotypes (e.g. missing arms and missing legs are both abnormalities of the limb). This has made it possible to perform phenome-wide analyses, e.g. to prioritise which phenotypes make the best candidates for gene therapies. However, there is currently limited metadata describing the clinical characteristics and severity of these phenotypes. With \>`r format(round(n_ids_hpo/100)*100, scientific = FALSE)` phenotypic abnormalities across \>`r format(round(length(unique(p2g$disease_id))/100)*100, scientific = FALSE)` rare diseases, manual curation of such phenotypic annotations by experts would be exceedingly labour-intensive and time-consuming. Leveraging advances in artificial intelligence, we employed the OpenAI GPT-4 large language model (LLM) to systematically annotate the severity of all phenotypic abnormalities in the HPO. Phenotypic severity was defined using a set of clinical characteristics and their frequency of occurrence. First, we benchmarked the LLM-generated clinical characteristic annotations against ground-truth labels within the HPO (e.g. phenotypes in the 'Cancer' HPO branch were annotated as causing cancer by GPT-4). True positive recall rates across different clinical characteristics ranged from `r round(min(checks$true_pos_rate)*100)`-`r round(max(checks$true_pos_rate)*100)`% (mean=`r round(mean(checks$true_pos_rate)*100)`%), clearly demonstrating the ability of GPT-4 to automate the curation process with a high degree of fidelity. Using a novel approach, we developed a severity scoring system that incorporates both the nature of each clinical characteristic and the frequency of its occurrence. These clinical characteristic severity metrics will enable efforts to systematically prioritise which human phenotypes are most detrimental to human health, and which are the best targets for therapeutic intervention.
## Introduction
Ontologies provide a common language with which to communicate concepts. In medicine, ontologies for phenotypic abnormalities are invaluable for defining, diagnosing, prognosing, and treating human disease. Since 2008, the Human Phenotype Ontology (HPO) has been instrumental in healthcare and biomedical research by providing a framework for comprehensively describing human phenotypes and the relationships between them [@kohlerHumanPhenotypeOntology2021; @garganoHumanPhenotypeOntology2024]. By expanding its depth and breadth over time, the HPO now contains \>`r format(round(n_ids_hpo/100)*100, scientific = FALSE)` phenotypic abnormalities that are associated with \>`r format(round(length(unique(p2g$disease_id))/100)*100, scientific = FALSE)` diseases as captured in HPOA [REF]. Some HPO phenotypes also contain information related to typical age of onset, frequency, triggers, time course, mortality rate and typical severity. Describing the severity-related attributes of a disease is crucial for both research and the clinical care of individuals with rare diseases. When researchers or clinicians are presented with phenotypes that fall outside of their expertise, resources that quickly and reliably retrieve summaries of relevant information about these phenotypes are essential. In the clinic, this can help in reaching a differential diagnosis or prioritising the treatment of some phenotypes over others. In research, this information is useful for prioritising candidate causal disease mechanisms, performing large-scale analyses of phenotypic data, and guiding funding agencies when assessing the potential impact of, and need for, research in a given disease area. To date, the HPO has largely relied on manual curation by domain experts. While this approach can improve annotation quality and accuracy, it is both time-consuming and labour-intensive. As a result, less than 1% of terms within the HPO contain metadata such as time course and severity.
Artificial intelligence (AI) capabilities have advanced considerably in recent years, presenting new opportunities to apply natural language processing technologies to assist with the curation process. Specifically, there have recently been considerable advances in large language models (LLMs) and their application to biomedical problems, in some cases performing as well as or better than human clinicians on standardised medical exams and patient diagnosis tasks [@vanveenAdaptedLargeLanguage2024; @boltonBioMedLM7BParameter2024; @zhangBiomedGPTUnifiedGeneralist2023; @labrakBioMistralCollectionOpenSource2024; @guDomainSpecificLanguageModel2021; @singhalLargeLanguageModels2023; @shinBioMegatronLargerBiomedical2020; @chengExploringPotentialGPT42023; @oneilPhenomicsAssistantInterface2024; @luoBioGPTGenerativePretrained2022; @mcduffAccurateDifferentialDiagnosis2023; @singhalExpertLevelMedicalQuestion2023]. Recent work has demonstrated that the Generative Pre-trained Transformer 4 (GPT-4) foundation model [@openaiGPT4TechnicalReport2024], when combined with strategic prompt engineering, can outperform even specialist LLMs that are explicitly fine-tuned for biomedical tasks [@noriCanGeneralistFoundation2023]. In a landmark achievement, GPT-4 was the first LLM to surpass a score of 90% on the United States Medical Licensing Examination (USMLE) [@noriCanGeneralistFoundation2023].
Here, we have used GPT-4 to systematically annotate the severity of `r n_ids_gpt` / `r n_ids_hpo` (`r round(n_ids_gpt/n_ids_hpo*100,1)`%) phenotypic abnormalities within the HPO. Our severity annotation framework was adapted from previously defined criteria developed through consultation with clinicians [@Lazarin2014-rz]. The authors consulted 192 healthcare professionals for their opinions on the relative severity of various clinical characteristics, and used the results to create a system for categorising the severity of diseases. Briefly, each healthcare professional was sent a survey asking them to first rate how important a disease characteristic was for determining disease severity, and then to rate the severity of a set of given diseases. Using the responses, the authors were able to categorise clinical characteristics into four 'severity tiers'. While characteristics such as shortened lifespan in infancy and intellectual disability were identified as highly severe and placed into tier 1, sensory impairment and reduced lifespan were categorised as less severe and placed into tier 4. Standardised metrics of severity allow clinicians to quickly assess the urgency of treating a given phenotype, as well as prognose what outcomes might be expected.
To evaluate the consistency of responses generated by GPT-4, `r checks$consistency_count[[1]]` phenotypes were annotated multiple times. For a subset of phenotypes with known expected clinical characteristics, true positive rates were calculated to assess recall. Additionally, based on the clinical characteristics and their occurrence, we quantified the severity of each phenotype, providing an example of how these clinical characteristic annotations can be used to guide the prioritisation of gene therapy trials. Ultimately, we hope that our resource will be of utility to those working on rare diseases, as well as the wider healthcare community.
## Results
### Annotating the HPO using GPT-4
```{r fig-occurrence}
#| label: fig-occurrence
#| fig-cap: GPT-4 was able to annotate all human phenotypes based on whether they are always/often/rarely/never associated with different clinical characteristics. **a** An example of the prompt input given to GPT-4. The phenotypes listed in the second to last sentence (*italicised*) were changed to allow all HPO phenotypes to be annotated. **b** Stacked bar plot showing the proportion of the occurrence of each clinical characteristic across all annotated HPO phenotypes. The terms shown on the x-axis are the clinical characteristics for which GPT-4 was asked to determine whether each phenotype caused them.
#| fig-height: 6
#| fig-width: 10
# subset to clinical characteristics
occurr_df <- gpt_annot[,c("hpo_name",gsub(" ", "_", annotation_order)),with=FALSE]
cols_of_interest <- names(occurr_df)[-1]
# Create an empty dataframe to store the counts
count_df <- data.frame(matrix(0, nrow = length(cols_of_interest), ncol = 4))
colnames(count_df) <- c("always", "often", "rarely", "never")
rownames(count_df) <- cols_of_interest
## Loop through each column of interest and count the occurrences of:
## 'always', 'often', 'rarely', 'never'
for (col in cols_of_interest) {
counts <- table(occurr_df[[col]])
count_df[col, ] <- counts[match(colnames(count_df), names(counts))]
}
count_df_long <- tidyr::gather(count_df,
key = "condition",
value = "count",
always:never)
count_df_long$category <- rownames(count_df)
count_df_long$category <- gsub("_", " ", count_df_long$category)
count_df_long$condition <- factor(count_df_long$condition,
levels=c("always", "often", "rarely", "never"))
count_df_long <- count_df_long |>
dplyr::group_by(category) |>
dplyr::mutate(percentage = count / sum(count) * 100)
# set annotation order for plots
count_df_long$category <- factor(count_df_long$category,
levels = annotation_order)
# Load the prompt generation function
source("code/prompt_gen.R")
prompts <- prompt_gen_table(hpo = hpo)
txt_prompt <- prompts$prompt[1688]
for(x in prompts$terms[[1688]]){
txt_prompt <- sub(x,paste0("*",x,"*"),txt_prompt)
}
for(x in c(
eval(formals(prompt_gen_table)$effects),
eval(formals(prompt_gen_table)$responses),
"justification")){
txt_prompt <- sub(x,paste0("**",x,"**"),txt_prompt)
}
## Subplot a: Example prompt
f1a <- ggplot() +
labs(title="Example GPT-4 prompt") +
ggtext::geom_textbox(aes(x=.5,y=.6),
box.padding = margin(20, 10, 20, 10),
box.margin = margin(0, 0, 0, 0),
box.r = grid::unit(30, "pt"),
width = unit(.8, "npc"),
size=3.5,
halign=.5,
fill="grey90",
label=txt_prompt) +
ylim(c(0,1)) +
theme_void() +
theme(plot.title = element_text(hjust=0.5))
## Subplot b: Stacked bar plot of clinical characteristic occurrence.
f1b <- ggplot(count_df_long, aes(x = category, y = count, fill = condition)) +
geom_bar(stat = "identity") +
labs(title = "Clinical characteristic occurrence",
x = NULL, y = "HPO phenotypes (n)", fill = NULL) +
theme_minimal() +
theme(legend.position = "right",
plot.title = element_text(hjust=0.5, vjust=-1),
axis.text.x = element_text(angle=45, hjust=1),
) +
scale_fill_brewer(palette = "GnBu", direction = -1)
(
f1a|
(f1b/plot_spacer() ) + plot_layout(heights = c(1, .1))
) +
patchwork::plot_layout(widths = c(1,1)) +
patchwork::plot_annotation(tag_levels = letters)
```
We employed the OpenAI GPT-4 model with Python to annotate `r length(unique(gpt_annot$hpo_name))` terms within the HPO (`r tag`) [@kohlerHumanPhenotypeOntology2021; @garganoHumanPhenotypeOntology2024]. Our annotation framework was developed based on previously defined criteria for classifying disease severity [@Lazarin2014-rz]. We sought to evaluate the impact of phenotypes on factors including intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility, and congenital onset. Through prompt design, we found that the performance of GPT-4 improved when we incorporated a scale associated with each clinical characteristic and required a justification for each response. For each clinical characteristic, we asked about the frequency of its occurrence: whether it never, rarely, often, or always occurred. Framing the queries in this way served two purposes. First, it helped to constrain the responses of GPT-4 to a specific range of values, making answers more consistent and amenable to downstream data analysis. Second, it served to overcome one of the main limitations noted by @Lazarin2014-rz, as they did not collect information on how the frequency of each clinical characteristic affected their decision-making when generating severity annotations.
```{r always_death}
never_thresh <- 50
never_dt <- subset(count_df_long, condition=="never" & percentage>=never_thresh)
always_death <- gpt_annot[,list(always_death=all(death=="always"),
hpo_name=hpo_name[1]),
by="hpo_id"][always_death==TRUE,]
```
Clinical characteristic occurrence varied across annotation categories. More than `r never_thresh`% of phenotypes never caused `r paste(never_dt$category[-1], collapse=", ")` or `r never_dt$category[1]`. Only a minority of phenotypes (`r round(sum(subset(count_df_long, condition %in% c("never") & category=="congenital onset")$percentage),1)`%) never had a congenital onset, which is expected as rare disorders tend to be early-onset genetic conditions ([Fig. @fig-occurrence]).
Less than `r ceiling(subset(count_df_long, condition=="always" & category=="death")$percentage)`% of phenotypes always directly resulted in death (n=`r uniqueN(gpt_annot[death=="always"]$hpo_name)`), such as 'Stillbirth', 'Anencephaly' and 'Bilateral lung agenesis'. Meanwhile, `r uniqueN(gpt_annot[death %in% c("often","rarely")]$hpo_name)` phenotypes were annotated as often or rarely causing death, and `r uniqueN(gpt_annot[death %in% c("never")]$hpo_name)` phenotypes were annotated as never causing death. Examples of phenotypes that never cause death included `r sum(grepl("syndactyly",gpt_annot[death %in% c("never")]$hpo_name, ignore.case = TRUE))` unique forms of syndactyly, a non-lethal condition that causes fused or webbed fingers (occurring in 1 in 1,200--15,000 live births). While not life-threatening itself, syndactyly is a symptom of genetic disorders that can cause life-threatening cardiovascular and neurodevelopmental defects, such as Apert syndrome [@garagnaniSyndromesAssociatedSyndactyly2013]. This example highlights the ability of GPT-4 to successfully distinguish between phenotypes that directly cause lethality, and those that are often associated with diseases that cause lethality.
### Annotation consistency and recall
```{r ontLvl-vs-consistency}
consist <- checks$annot_stringent_mean
consist <- HPOExplorer::add_hpo_id(consist, hpo = hpo)
consist <- HPOExplorer::add_ont_lvl(consist, hpo = hpo)
consist_vs_ontLvl <- consist |>
  data.table::melt.data.table(id.vars = c("hpo_name","hpo_id","ontLvl"),
                              variable.name = "metric",
                              value.name = "Consistent\n(stringent)") |>
  ggstatsplot::ggbarstats(x = "Consistent\n(stringent)",
                          y = "ontLvl") +
  ggplot2::labs(x="Human Phenotype Ontology level",
                title = "Consistency of GPT-4 annotations by HPO level")
consist_vs_ontLvl_dt <- MSTExplorer:::get_ggstatsplot_stats(consist_vs_ontLvl)
```
To assess annotation consistency, we queried GPT-4 with a subset of the HPO phenotypes multiple times (n=`r checks$consistency_count[[1]]` unique phenotypes). We employed two different metrics to determine the *consistency rate*. The first, less stringent metric defined consistency as the duplicate annotations being either 'always' and 'often', or 'never' and 'rarely'. The second, more stringent metric required exact agreement between annotation occurrences, e.g. 'always' and 'always'. For the less stringent metric, duplicated phenotypes were annotated consistently at a rate of at least `r floor(min(checks$consistency_rate)*100)`%; for the more stringent metric, the lowest consistency rate was `r round(min(checks$consistency_stringent_rate)*100)`%, for congenital onset. An example of inconsistent annotations was the HPO term 'Acute leukaemia'. In one instance, GPT-4 annotated it as often causing impaired mobility, giving the justification that 'weakness and fatigue from leukaemia and its treatment can impair mobility'. In another, GPT-4 annotated it as rarely causing impaired mobility, giving the justification that 'acute leukaemia rarely impairs mobility directly'. Despite the prompt instructing GPT-4 not to take indirect effects into consideration, this is an example of where it failed to do so.
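For illustration, both consistency metrics can be computed for a single set of repeated annotations as in the minimal sketch below. This is a simplified re-implementation for clarity only; the actual metrics are computed by `HPOExplorer::gpt_annot_check`, and the helper function here is hypothetical.
```{r, eval=FALSE}
## Minimal sketch of the two consistency metrics for one
## phenotype/characteristic pair (hypothetical helper; illustration only).
consistency_metrics <- function(annots){
  ## annots: character vector of repeated annotations, e.g. c("always","often")
  groups <- ifelse(annots %in% c("always","often"), "always/often", "never/rarely")
  c(lenient   = length(unique(groups)) == 1, # all within the same collapsed group
    stringent = length(unique(annots)) == 1) # all exactly identical
}
consistency_metrics(c("always","often")) # lenient=TRUE, stringent=FALSE
consistency_metrics(c("never","never"))  # lenient=TRUE, stringent=TRUE
```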
We also reasoned that GPT-4 would be better able to give consistent answers for more specific phenotypes lower in the ontology, as they are more likely to have a single cause. We found that the stringent consistency rate did indeed significantly improve with greater HPO ontology depth ($\chi_{Pearson}^2$=`r round(consist_vs_ontLvl_dt$summary_data$statistic,2)`, $\hat{V}_{Cramer}$=`r round(consist_vs_ontLvl_dt$summary_data$estimate,2)`, p=`r format(consist_vs_ontLvl_dt$summary_data$p.value, digits = 1)`). See @fig-consist-vs-ontLvl for a visual representation of this relationship.
```{r annot_checks}
# Define items to be checked
items <- c("consistency_count", "consistency_rate",
"consistency_stringent_count", "consistency_stringent_rate",
"true_pos_count", "true_pos_rate")
metric_types <- c("Rate", "Count")[1]
# Get data frame with checks values
check_df <- lapply(checks[items], data.table::as.data.table, keep.rownames = TRUE) |>
data.table::rbindlist(idcol = "metric") |>
data.table::setnames(c("metric", "annotation", "value"))
check_df$metric <- factor(check_df$metric,
levels = items,
ordered = TRUE)
check_df$annotation <- factor(check_df$annotation,
levels = unique(check_df$annotation),
ordered = TRUE)
check_df[, metric_type := ifelse(grepl("count", metric),
"Count", ifelse(grepl("rate", metric), "Rate", NA))]
check_df[, metric_category := gsub("_count|_rate", "", metric)]
check_df$metric_type <- factor(check_df$metric_type, ordered = TRUE)
check_df <- check_df[annotation != "pheno_count"]
check_df[, n := value[metric_type == "Count"],
by = c("metric_category", "annotation")]
check_df$annotation <- gsub("_", " ", check_df$annotation)
annotation_order <- unique(check_df$annotation)
check_df$annotation <- factor(check_df$annotation, levels = annotation_order)
check_df$metric_category <- gsub("_", " ", check_df$metric_category)
check_df[metric_category=="consistency",metric_category:="consistency lenient"]
```
```{r fig-checks}
#| label: fig-checks
#| fig-cap: GPT-4 annotations are consistent and accurate across clinical characteristics. **a** Bar plot showing the annotation consistency within phenotypes that were annotated more than once. In the lenient metric, annotations were collapsed into two groups ('always'/'often' and 'never'/'rarely'). For a given clinical characteristic within a given phenotype, if an annotation was always within the same group it was considered consistent. In the stringent metric, all four annotation categories were considered to be different from one another. Thus, annotations were only defined as consistent if they were all identical. The blue dashed line indicates the probability of two annotations being consistent by chance in the lenient metric (~1/2). The gold dashed line indicates the probability of two annotations being consistent by chance in the stringent metric (~1/4). **b** Bar plot of the true positive rate for each annotation. The labels above each bar indicate the number of phenotypes tested.
#| fig-height: 5
#| fig-width: 6
lab_size <- 3
consistency_plot <- ggplot(check_df[metric_type %in% metric_types &
grepl("consistency",metric_category)],
aes(x = annotation, y = value,
fill = stringr::str_to_sentence(
gsub("consistency ", "", metric_category)
),
label = round(value, 2))) +
geom_bar(stat = "identity",position = "dodge") +
labs(title="Annotation consistency",
x = NULL, y = "Mean consistency",
fill=NULL) +
scale_fill_manual(values = c("#C3DAEAFF", "#ECE28BFF")) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top",
strip.background = element_rect(fill="white"),
plot.title = element_text(hjust = 0.5)) +
geom_text(aes(x = 1, y = 1, label = paste0("n=",n," phenotypes")),
size=lab_size,
check_overlap=TRUE, hjust=0) +
## Compute the empirical probability of annotations being consistent by chance
## with 2 options and N duplicates
geom_hline(yintercept = mean(1/2^checks$annot$pheno_count),
color="dodgerblue", linetype="dashed")+
## Compute the empirical probability of annotations being consistent by chance
## with 4 options and N duplicates
geom_hline(yintercept = mean(1/4^checks$annot$pheno_count),
color="gold3", linetype="dashed")
recall_plot <- ggplot(check_df[metric_type %in% metric_types &
metric_category %in% c("true pos"), ],
aes(x = annotation, y = value,
fill = metric_category,
label = n
)
) +
geom_bar(stat = "identity", show.legend = TRUE) +
theme_bw() +
labs(x = NULL, y = "Recall") +
ggtitle("True positive rate") +
scale_fill_manual(values = c("#B1D5BBFF")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
geom_label(fill = alpha("white",1),
size = lab_size)
consistency_plot + recall_plot +
plot_layout(widths = c(1, 1)) +
plot_annotation(tag_levels = 'a')
```
In order to evaluate the validity of the annotations, we calculated a true positive rate. This involved identifying specific branches within the HPO containing phenotypes that should reliably indicate the presence of certain conditions. For instance, the phenotypes 'Decreased fertility in females' and 'Decreased fertility in males' should often or always cause reduced fertility. We observed an encouraging true positive rate exceeding `r floor(min(checks$true_pos_rate)*100)`% in every clinical characteristic, with perfect recall (100%) in `r sum(checks$true_pos_rate==1)`/`r length(checks$true_pos_rate)` characteristics.
```{r}
lowest_recall <- names(checks$true_pos_rate)[checks$true_pos_rate==min(checks$true_pos_rate)]
lowest_recall_dt <- checks$annot[get(lowest_recall)=="never" & hpo_id %in% query_hits[[lowest_recall]]]
```
The lowest true positive rate was observed for `r gsub("_"," ",lowest_recall)`, with `r round(checks$true_pos_rate[[lowest_recall]]*100,1)`% recall across `r checks$true_pos_count[[lowest_recall]]` HPO phenotypes. Cases in which the GPT-4 annotations disagreed with the HPO ground truth included: `r paste(shQuote(head(lowest_recall_dt$hpo_name, 6)), collapse=", ")`. In the case of `r shQuote(lowest_recall_dt$hpo_name[1])`, GPT-4 provided the justification that `r shQuote(lowest_recall_dt$physical_malformations_justification[1])`. In another instance, GPT-4 noted that `r shQuote(lowest_recall_dt$hpo_name[2])` is `r shQuote(lowest_recall_dt$physical_malformations_justification[2])`. This indicates that, while technically incorrect according to our predefined benchmarks, a case could in fact be made that mild skin conditions do not rise to the level of physical malformations.
This high level of recall underscores the robustness of our annotations and the reliability of the HPO framework in capturing clinically relevant phenotypic information. However, we acknowledge that the number of testable true positive phenotypes for some of these categories is low, especially `r shQuote(names(checks$checkable_count)[checks$checkable_count==min(checks$checkable_count)])`, for which there is only `r min(checks$checkable_count)` phenotype in the HPO (after excluding terms pertaining to colour or night blindness). Furthermore, some of the true positive phenotypes are lexically similar to the name of the clinical characteristic itself. In these cases, annotating 'Severe intellectual disability' as always causing intellectual disability is a relatively trivial task. Nevertheless, even these scenarios provide a clear and interpretable benchmark for the model's performance. In addition, there were numerous phenotypes with lexically non-obvious relationships to the clinical characteristic that were annotated correctly by GPT-4. For example, 'Molar tooth sign on MRI' (a neurodevelopmental pathology observed in radiological scans) was correctly annotated as causing intellectual disability.
### Quantifying phenotypic severity
```{r anenceph_dt}
anenceph_dt <- res_coded$annot_weighted[hpo_name=='Anencephaly',]
```
While individual annotations are informative, we wanted to be able to distil the severity of each phenotype into a single score. Quantifying the overall severity of phenotypes can have important implications for diagnosis, prognosis, and treatment. It may also guide the prioritisation of gene therapy trials towards phenotypes with the most severe clinical characteristics and thus the most urgent need. First, we created a dictionary to map the frequency (always, often, rarely, never) of each clinical characteristic (e.g. blindness) to numeric values from 0-3. These values were then multiplied by per-characteristic weights. Next, we computed an average score for each phenotype by aggregating the multiplied values across all clinical characteristics and taking the mean. This was then normalised by the theoretical maximum severity score, so that all phenotypes lie on a 0-100 severity scale (where 100 is the most severe phenotype possible). This average normalised score represents the overall severity of the phenotype based on the severity of its individual clinical characteristics. Importantly, the resulting values reflect the severity of each clinical characteristic based on both the type of characteristic itself and its frequency within a particular phenotype. For instance, a phenotype always causing death would have a higher weighted value than a phenotype often causing reduced fertility (see @tbl-metric-weights).
Based on these scores, we evaluated the top 50 most severe phenotypes. One of the most severe phenotypes was `r shQuote(anenceph_dt$hpo_name)` (`r anenceph_dt$hpo_id`), with a composite severity score of `r round(anenceph_dt$severity_score_gpt, 1)`. Anencephaly is a birth defect in which the baby is born without a portion of its brain and skull; these babies are often stillborn. In fact, many of the most severe phenotypes were related to developmental brain and neural tube defects. Comparing the severity scores for each response across the annotated clinical characteristics revealed consistent trends: as the response for a clinical characteristic increased (from never to always), the severity score also increased ([Supplementary Fig. @fig-severity-boxplot]). We also evaluated the severity score distribution by HPO branch and calculated the mean severity score using all phenotypes within each major HPO branch ([Fig. @fig-severity-histo]). The HPO branch with the greatest mean severity score was `r shQuote(unique(gpt_annot_plot_out$data$dat2$ancestor_name)[1])` (mean=`r round(unique(gpt_annot_plot_out$data$dat2$mean_severity_score_gpt)[1],1)`), followed by `r shQuote(unique(gpt_annot_plot_out$data$dat2$ancestor_name)[2])` (mean=`r round(unique(gpt_annot_plot_out$data$dat2$mean_severity_score_gpt)[2],1)`), which would include the highly ranked phenotypes seen in @fig-top-phenos.
```{r fig-top-phenos}
#| label: fig-top-phenos
#| fig-cap: Quantifying the severity of HPO phenotype annotations highlights the most impactful conditions. Heatmap of 10 representative phenotypes from each severity class (Profound, Severe, Moderate, Mild), stratified by whether the phenotypes are often/always congenital (**a**-**b**) or rarely/never congenital (**c**-**d**). Continuous severity scores are shown as bars (**b**,**d**) and were calculated by multiplying the numeric values assigned to each clinical characteristic according to @tbl-metric-weights. The average normalised score, representing overall phenotype severity on a 0-100 scale, was calculated by aggregating the multiplied values and normalising by the theoretical maximum severity score. The x-axes show each of the clinical characteristics. All data for this figure, as well as justifications for each annotation, can be found in @tbl-annotations.
#| fig-height: 10
#| fig-width: 13
plot_top_phenos_out <- HPOExplorer::plot_top_phenos(res_class = res_class, axis.text.x=c(TRUE,TRUE))
plot_top_phenos_out$plot
```
### Severity classes
While the continuous severity score is a helpful metric, there may be some use cases where a categorical classification of severity is more immediately useful. In work by @Lazarin2014-rz, the authors defined severity classes using a simple decision tree based on the individual severity annotations. We approximated this approach using our GPT-4 annotations. This categorical approach showed a strong degree of positive correspondence with the continuous severity score ($\hat{\omega}_{p}^{2}$=`r round(fig_severity_class_dt$summary_data$estimate,2)`, p\<`r format(.Machine$double.xmin, digits = 2)`). In other words, severity score increased with severity class level (mild \< moderate \< severe \< profound), as expected. The distribution of severity classes is shown in @fig-severity-class.
### Correlations between clinical characteristic severity metrics
```{r cor-metrics}
cor_metrics <- Hmisc::rcorr(as.matrix(res_coded$annot_weighted[,-c("hpo_id","hpo_name")]))
cor_composite <- sort(cor_metrics$r["severity_score_gpt",], decreasing = TRUE)[-1]
```
We found that some clinical characteristic severity metrics were correlated with one another, with a mean Pearson correlation of `r round(mean(cor_metrics$r[-nrow(cor_metrics$r),-ncol(cor_metrics$r)]),2)` across all individual metrics (see @fig-metric-corplot). In particular, blindness and sensory impairment were highly correlated with one another (r=`r round(cor_metrics$r["blindness","sensory_impairments"],2)`, p=`r round(cor_metrics$P["blindness","sensory_impairments"],4)`). Some metrics drove the composite severity score more than others, reflecting our per-metric weighting scheme, the response type frequencies, and the correlation structure between metrics. Overall, `r gsub("_"," ",names(cor_composite[1]))` was the strongest driver of the composite severity score, with a Pearson correlation of `r round(cor_composite[1],2)`, followed by `r gsub("_"," ",names(cor_composite[2]))` (r=`r round(cor_composite[2],2)`) and `r gsub("_"," ",names(cor_composite[3]))` (r=`r round(cor_composite[3],2)`).
```{r, eval=FALSE}
### Correlation with QALY
ihme <- data.table::fread("~/Downloads/IHME-GBD_2019_DATA-e678e00e-1/IHME-GBD_2019_DATA-e678e00e-1.csv")
ihme[,nm:=tolower(cause_name)]
res_coded$annot_weighted[,nm:=tolower(hpo_name)]
ihme_merged <- merge(data.table::dcast.data.table(ihme,
formula = "nm~measure_name+metric_name",
value.var = "val",
fun.aggregate = mean, na.rm=TRUE),
res_coded$annot_weighted)
metric_cols <- list(
IHME=grep(paste(paste0("_",
"Percent",
# unique(ihme$metric_name)[1],
"$"),collapse = "|"),names(ihme_merged), value = TRUE),
GPT=setdiff(names(res_coded$annot_weighted),c("hpo_id","hpo_name","nm"))
)
Xcor <- cor(ihme_merged[,unlist(metric_cols),with=FALSE], use="complete.obs")
heatmaply::heatmaply(Xcor[metric_cols$IHME,metric_cols$GPT])
```
### Congenital onset by HPO branch
```{r gpt_annot_plot_branches}
gpt_annot_plot_branches_out <- HPOExplorer::gpt_annot_plot_branches(
gpt_annot = gpt_annot,
hpo = hpo,
metric = "congenital_onset",
fill_lab = "Congenital\nonset",
show_plot = FALSE)
top_congenital <- unique(gpt_annot_plot_branches_out$data[!is.na(congenital_onset),c("ancestor_name","proportion_always")])
```
```{r fig-congenital-branches}
#| label: fig-congenital-branches
#| fig-cap: Distribution of congenital onset across HPO branches. The y-axis shows the proportion of phenotypes that are always/often/rarely/never congenital. The x-axis shows the HPO branch, ordered from highest to lowest proportion of always-congenital phenotypes.
#| fig-height: 6
#| fig-width: 6
gpt_annot_plot_branches_out$plot
```
Next, we assessed the distribution of congenital onset across HPO branches ([Fig. @fig-congenital-branches]). We found that the `r top_congenital$ancestor_name[1]` branch contained the greatest proportion of phenotypes that were always congenital (`r round(top_congenital$proportion_always[1]*100,2)`%), followed by `r top_congenital$ancestor_name[2]` (`r round(top_congenital$proportion_always[2]*100,2)`%) and `r top_congenital$ancestor_name[3]` (`r round(top_congenital$proportion_always[3]*100,2)`%). This is concordant with the expectation that these phenotypes should largely be congenital. The HPO branches with the least commonly congenital phenotypes were `r rev(top_congenital$ancestor_name)[1]` (`r round(rev(top_congenital$proportion_always)[1]*100,2)`%), `r rev(top_congenital$ancestor_name)[2]` (`r round(rev(top_congenital$proportion_always)[2]*100,2)`%), and `r rev(top_congenital$ancestor_name)[3]` (`r round(rev(top_congenital$proportion_always)[3]*100,2)`%). ['Constitutional symptom'](https://hpo.jax.org/app/browse/term/HP:0025142) is a fairly broad term, defined as *'A symptom or manifestation indicating a systemic or general effect of a disease and that may affect the general well-being or status of an individual.'* Examples include 'Fatigue', 'Exercise intolerance', 'Hot flashes' and 'Sneeze'.
## Discussion
Phenotype severity annotations have utility across a wide variety of applications in both the clinic and research. In clinical settings, severity annotations can be used to prioritise the treatment of some phenotypes over others in patients with complex presentations, to avoid administering contraindicated drugs, and to prognose potential health outcomes. In research settings, severity annotations can be used to identify phenotypes that have a large impact on patient outcomes yet are currently understudied. They may also be used to help design new experiments and studies, or even provide new insights into the underlying aetiology of disease by making expert-level summaries more immediately accessible to the wider research community.
The creation and annotation of biomedical knowledge has traditionally relied on manual or semi-manual curation by human experts [@putmanMonarchInitiative20242024; @ochoaOpenTargetsPlatform2021; @mungallMonarchInitiativeIntegrative2017a; @kohlerHumanPhenotypeOntology2021; @garganoHumanPhenotypeOntology2024]. Performing such manual curation and review tasks at scale is often infeasible for human biomedical experts given limited time and resources. LLMs have the capacity to effectively encode, retrieve, and synthesise vast amounts of diverse information in a highly scalable manner [@vanveenAdaptedLargeLanguage2024; @singhalLargeLanguageModels2023; @openaiGPT4TechnicalReport2024]. This makes them powerful tools that can be applied in a rapidly expanding variety of scenarios, including medical practice, research and data curation [@singhalLargeLanguageModels2023; @toroDynamicRetrievalAugmented2023; @panLargeLanguageModels2023; @oneilPhenomicsAssistantInterface2024; @caufieldStructuredPromptInterrogation2023].
Here, we introduce a novel framework that leverages the current best-in-class LLM, GPT-4 [@openaiGPT4TechnicalReport2024], to systematically annotate the severity of `r n_ids_gpt` phenotypic abnormalities within the HPO. By employing advanced AI capabilities, we have demonstrated the feasibility of automating this process, significantly enhancing efficiency without substantially compromising accuracy. Our validation approach yielded a high true positive rate exceeding `r floor(min(checks$true_pos_rate)*100)`% across the phenotypes tested. Furthermore, our approach can be readily adapted and scaled to accommodate the growing volume of phenotypic data. In total, the entire study cost \$296.27 in queries to the OpenAI API. While we do not have a direct comparison, this likely represents an extremely small fraction of the total cost of such a study if performed manually by human experts charging at an hourly rate. Even if all human annotations were provided on a volunteer basis, this would still require hundreds if not thousands of hours of cumulative manual human labour. Using our approach, severity annotations for the entire HPO can be generated automatically at a rate of \~100 phenotypes/hour. Further optimisation of the annotation process and increased API rate limits could potentially accelerate this even further.
Throughout this study, we observed that GPT-4 was capable of reliably recovering deep semantic relationships from the medical domain, far beyond making superficial inferences based on lexical similarities. An excellent example of this is the phenotype 'Molar tooth sign on MRI' (`r res_coded$annot_weighted[hpo_name=='Molar tooth sign on MRI']$hpo_id`; severity score=`r round(res_coded$annot_weighted[hpo_name=='Molar tooth sign on MRI']$severity_score_gpt,2)`), which GPT-4 annotated as causing intellectual disability. At first glance, we ourselves assumed this was a false positive as the term appeared to be related to dentition. However, upon further inspection we realised that molar tooth sign is in fact a pattern of abnormal brain morphology that happens to bear some resemblance to molar dentition when observed in radiological scans. This phenotype is a known sign of neurodevelopmental defects that can indeed cause severe intellectual disability [@gleesonMolarToothSign2004].
In addition to rapidly synthesising and summarising vast amounts of information, LLMs can also be steered to provide justifications for each particular response. This makes LLMs amenable to direct interrogation as a means of recovering explainability, especially when designed to retain information about previous requests and interactions, as they use these to iteratively improve and update their predictions [@janikAspectsHumanMemory2024]. This represents a categorical advance over traditional natural language processing models based on more shallow forms of statistical or machine learning (e.g. Term Frequency-Inverse Document Frequency [@jonesStatisticalInterpretationTerm1972], Word2vec [@mikolovEfficientEstimationWord2013]), which lack the ability to provide chains of causal reasoning to justify their predictions. This highlights the fundamental trade-off between simpler models with high explainability (the ability of humans to understand the inner workings of the model) but low interpretability (the ability of humans to trace the decision process of the model, analogous to human 'reasoning'), and deeper, more complex models with low explainability but high interpretability [@marcinkevicsInterpretabilityExplainabilityMachine2023].
A key contribution of our study is the introduction of a quantitative severity scoring system that integrates both the nature of the clinical characteristic and the frequency of its occurrence. By encoding the concept of severity in this way, we are able to prioritise phenotypes based on their impact on patients. This methodology allowed us to transition from low-throughput qualitative assessments of severity (e.g. @Lazarin2014-rz) to high-throughput quantitative assessments of severity. One of the most severe phenotypes in the HPO is 'Fetal akinesia sequence' (FAS; `r annot_melt[hpo_name=='Fetal akinesia sequence']$hpo_id[1]`, severity score = `r round(annot_melt[hpo_name=='Fetal akinesia sequence']$severity_score_gpt[1],1)`), an extremely rare condition that is almost always lethal. FAS is a complex, multi-system phenotype that can be caused by at least `r uniqueN(p2g[hpo_name=='Fetal akinesia sequence']$disease_id)` different genetic disorders. Despite the complex and heterogeneous aetiology of this phenotype, GPT-4 was able to provide accurate annotations alongside explainable justifications for those annotations (see @tbl-fas). For example, this phenotype almost always results in death, either *in utero* or shortly after birth. Not only did GPT-4 correctly annotate death as `r shQuote(gpt_annot[hpo_name=='Fetal akinesia sequence']$death)`, but when asked whether FAS causes sensory impairments it also provided the response `r shQuote(gpt_annot[hpo_name=='Fetal akinesia sequence']$sensory_impairments)` with the justification `r shQuote(gpt_annot[hpo_name=='Fetal akinesia sequence']$sensory_impairments_justification)` Neurodevelopmental disruption is indeed a hallmark component of FAS (e.g. hydrocephalus, cerebellar hypoplasia) that causes severe impairments across multiple sensory systems [@chenPrenatalDiagnosisGenetic2012]. This demonstrates that GPT-4 was able to recover the correct chain of causality from phenotype to clinical characteristic.
Our findings highlight the potential of this next generation of natural language processing technologies to contribute significantly to the automation and refinement of data curation in biomedical research. These results have a large number of useful real-world applications, such as prioritising gene therapy candidates [@murphyIdentificationCellTypespecific2023] and guiding clinical decision-making in rare diseases. They may also be used as a tool to help inform policy decisions and funding allocation by healthcare or governmental institutions. This, of course, would need to be done in consultation with subject-matter medical experts, patients, advocates and biomedical ethicists before reaching a final decision. Nevertheless, access to succinct, interpretable, and semi-quantitative severity annotations may encourage key decision makers with limited time to review individual proposals to pay heed to phenotypes and diseases that would otherwise be overlooked. As the HPO and the broader literature continue to grow over time, our automated AI-based approach can easily be repeated to keep pace with the rapidly evolving biomedical landscape. Furthermore, it can be extended to produce different sets of annotations or be used with any other ontology. Additional use cases include gathering data on the prevalence of each phenotype to approximate their social and financial costs.
One key limitation of our study is the fact that we did not explicitly interrogate GPT-4 to assess how the availability of treatments affected the annotations it produced. For example, there are some very severe conditions for which highly effective treatments and early detection screens are widely available (e.g. syphilis, some forms of melanoma), thus rendering them fully treatable or even curable provided access to modern healthcare. It would therefore be useful to further interrogate GPT-4 to uncover how the availability of treatments influences its responses. Many of our findings here seem to indicate that GPT-4 does take into account quality of care, to the extent that health services increase the likelihood of desired outcomes. For example, many of the cancer phenotypes are justified as always or often causing death unless detected and treated early in the disease course. On the other hand, some cancers are justified as rarely causing death if appropriate treatment is provided, which may not always be the case for individuals or populations with less access to quality healthcare services. Future efforts could more explicitly ask GPT-4 whether the phenotype would cause death with no or suboptimal treatment.
Another limitation of the present dataset is that phenotypes themselves can manifest with different degrees of severity, in the sense that they are more pronounced or intense. For example, sensitivity to light could range from a mild inconvenience to a severe disability that prevents the individual from leaving their home during the day. The effects of onset (beyond congenital vs. non-congenital) and time course (acute, slowly progressive, relapsing-remitting) were also not explicitly considered. Finally, we did not ask GPT-4 to consider phenotypes as they present within particular diseases. For example, while the phenotype 'Hypertension' may be mild to moderate in the general population and not present until middle age, it can also present early in life as very severe in the context of a rare genetic disorder such as Liddle syndrome. Future work could explore these nuances in more detail.
In addition to these technical challenges, there are multiple factors that need to be considered when trying to prioritise phenotypes for their suitability for gene therapy development. First, while we have attempted to formalise severity here, this is an inherently subjective concept that may vary considerably across different individuals and contexts. For instance, one could ask whether a condition that always causes death is worse than a condition that causes a lifetime of severe disability (e.g. paralysis, blindness, intellectual disability). Metrics such as quality-adjusted life years (QALYs) have been proposed in the past to address these dilemmas by defining health as a function of both the length and quality of life [@prietoProblemsSolutionsCalculating2003]. With regards to the financial burden of diseases, in some situations phenotypes which require many years of expensive medical care may be prioritised over those that result in extremely early onset lethality and little opportunity for therapeutic intervention. Another factor that affects the viability of a therapeutic program is the speed, cost and other practical considerations of a clinical trial. For instance, measuring risk of ageing-related respiratory failure over a ten-year period may be impractical in some cases. However, testing for total reversal of an existing severe phenotype could potentially yield faster and more immediately impactful results. If performed in close collaboration with medical ethicists, governmental organisations, advocacy groups and patient families, such cost/benefit assessments could be aided by LLMs through the scalable gathering of relevant data. As AI capabilities continue to advance, the range of applications in which they can be used effectively will continue to grow.
While our study demonstrates the feasibility and utility of AI-driven phenotypic annotation, several limitations must be acknowledged. The reliance on computational algorithms may introduce biases or inaccuracies inherent to the training data, necessitating ongoing validation and refinement of our approach. Additionally, our severity scoring system, while comprehensive, may not capture the full spectrum of phenotypic variability or account for complex gene-environment interactions. Future research should focus on further optimising AI-driven annotation methodologies, incorporating additional data modalities such as genomic and clinical data to enhance accuracy.
In conclusion, our study represents a significant step towards harnessing the power of AI to advance phenotypic annotation and severity assessment across all rare diseases. This resource aims to provide researchers and clinicians with actionable insights that can inform rare disease research and improve the lives of individuals affected by rare diseases.
## Methods
### Annotating the HPO using OpenAI GPT-4
We wrote a Python script to iteratively query GPT-4 via the OpenAI application programming interface (API). This ultimately yielded consistently formatted annotations for `r n_ids_gpt` terms within the HPO. Our annotation framework was developed based on previously defined criteria for classifying disease severity [@Lazarin2014-rz]. We sought to evaluate whether each phenotype directly caused a given severity-related clinical characteristic, including intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, and reduced fertility, and/or whether it had a congenital onset. Through prompt engineering we found that the performance of GPT-4 improved when we incorporated a scale associated with each clinical characteristic and required a justification for each response. We asked how frequently the given phenotype directly causes each clinical characteristic: whether it never, rarely, often, or always occurs. This design helps to constrain the potential responses of GPT-4 and thus make them more amenable to machine-readable post-processing. It also serves to address one of the key limitations of the @Lazarin2014-rz survey, namely the lack of information on how clinical characteristic frequency affected the clinicians' severity annotations. Here, we can instead use the frequency values to generate more precise annotations and downstream severity ranking scores.
Furthermore, our prompt design revealed that the optimal trade-off between the number of phenotypes and performance (in terms of producing the desired annotations, and adhering to the formatting requirements) was achieved when inputting no more than two or three phenotypes per prompt. An example prompt can be seen in @fig-occurrence. Thus, only two phenotypes were included per prompt in order to 1) avoid exceeding per-query token limits, and 2) prevent the breakdown of GPT-4 performance due to long-form text input, which is presently a known limitation common to many LLMs including GPT-4 [@weiLongformFactualityLarge2024].
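For reference, a single query of this kind can be issued as in the minimal sketch below. The annotation script used in this study was written in Python; this R equivalent (kept in R for consistency with the rest of this manuscript) calls the public OpenAI chat completions endpoint via `httr2`. The model name, temperature, and helper function are illustrative assumptions rather than the exact parameters used in this study.
```{r, eval=FALSE}
## Minimal sketch of one GPT-4 query via the OpenAI REST API (illustrative;
## the study used a Python script with batched prompts).
library(httr2)
query_gpt4 <- function(prompt,
                       model = "gpt-4", # assumed model identifier
                       api_key = Sys.getenv("OPENAI_API_KEY")){
  resp <- request("https://api.openai.com/v1/chat/completions") |>
    req_headers(Authorization = paste("Bearer", api_key)) |>
    req_body_json(list(
      model = model,
      temperature = 0, # lower temperature encourages consistent formatting
      messages = list(list(role = "user", content = prompt))
    )) |>
    req_perform()
  ## Extract the text of the first (and only) completion
  resp_body_json(resp)$choices[[1]]$message$content
}
```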
### Calculating the true positive rate
```{r tbl-true-positives}
#| label: tbl-true-positives
#| tbl-cap: The HPO branches and their descendants used as true positives for each clinical characteristic.
#| tbl-colwidths: [40,40,20]
queries <- eval(formals(HPOExplorer::search_hpo)$queries)
query_hits_dt <- data.table(
"Clinical characteristic"=stringr::str_to_sentence(gsub("_"," ",names(queries))),
"HPO queries"=trimws(gsub("^","",
sapply(queries, function(x){paste(shQuote(x),collapse="; ")}),
fixed = TRUE)),
"True positive HPO IDs"=sapply(query_hits[names(queries)], data.table::uniqueN)
)
knitr::kable(as.data.frame(query_hits_dt))
```
A true positive rate was calculated as a measure of the recall of the GPT-4 annotations. This was achieved by identifying specific branches within the HPO containing phenotypes that reliably indicate the occurrence of certain clinical characteristics, and using all descendants of each branch as true positives. For example, all descendants of the terms 'Intellectual disability' (HP:0001249) or 'Mental deterioration' (HP:0001268) should be annotated as always or often causing intellectual disability (@tbl-true-positives). A minimal sketch of this calculation is shown below.
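This sketch is for illustration only; the implementation used in this study is `HPOExplorer::gpt_annot_check`, and the column layout of `gpt_annot` is assumed.
```{r, eval=FALSE}
## Minimal sketch of the true positive rate for one clinical characteristic.
library(data.table)
recall_for <- function(annot, true_pos_ids, characteristic){
  ## annot: data.table of GPT-4 annotations, one row per hpo_id
  ## true_pos_ids: HPO IDs of all descendants of the ground-truth branch
  responses <- annot[hpo_id %in% true_pos_ids][[characteristic]]
  mean(responses %in% c("always","often")) # proportion annotated as expected
}
## e.g. recall_for(gpt_annot, query_hits$intellectual_disability,
##                 "intellectual_disability")
```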
### Quantifying phenotypic severity
The GPT-4 generated clinical characteristic occurrences were converted into a semi-quantitative scoring system, with 'always' corresponding to 3, 'often' to 2, 'rarely' to 1, and 'never' to 0. These scores were then weighted by a severity metric on a scale of 1-5, with 5 representing the highest severity, as determined by the provided clinical characteristics (@tbl-metric-weights). Subsequently, the weighted scores underwent normalisation to yield a final quantitative severity score ranging from 0-100, with 100 signifying the maximum severity score attainable.
Let us denote:
- $p$ : a phenotype in the HPO.
- $j$ : the identity of a given annotation metric (i.e. clinical characteristic, such as 'intellectual disability' or 'congenital onset').
- $W_j$: the assigned weight of metric $j$.
- $\max\{F_j\}$: the maximum possible encoded value of metric $j$ (equivalent across all $j$).
- $F_{pj}$ : the numerically encoded value of annotation metric $j$ for phenotype $p$.
- $NSS_p$: the final composite severity score for phenotype $p$ after applying normalisation to align values to a 0-100 scale and ensure equivalent meaning regardless of which other phenotypes are being analysed in addition to $p$. This allows for direct comparability of severity scores across studies with different sets of phenotypes.
{{< pagebreak >}}
::: {#eq-gpt .content-hidden unless-format="html"}
![](figures/eq5.png){height="300px"}
Computing normalised severity score from encoded GPT-4 annotations.
:::
::: {.content-visible unless-format="html"}
```{=tex}
\begin{equation*}
\eqnmarkbox[Brown4]{nss}{NSS_p}
=
\frac{
\eqnmarkbox[Goldenrod]{nss2}{\sum_{j=1}^{m}}
(
\eqnmarkbox[Goldenrod4]{nss3}{F_{pj}}
\times
\eqnmarkbox[IndianRed4]{nss4}{W_j}
)
}{
\eqnmarkbox[Tan]{nss5}{\sum_{j=1}^{m}(\max\{F_j\} \times W_j)}
} \times 100
\end{equation*}
\annotate[yshift=1em]{left}{nss}{Normalised Severity Score \\for each phenotype}
\annotate[yshift=3em]{left}{nss2}{Sum of weighted annotation values \\across all metrics}
\annotate[yshift=3em]{right}{nss3}{Numerically encoded annotation value \\of metric $j$ for phenotype $p$}
\annotate[yshift=1em]{right}{nss4}{Weight for metric $j$}
\annotate[yshift=-1em]{below,right}{nss5}{Theoretical maximum severity score}
```
:::
\
```{r tbl-metric-weights}
#| label: tbl-metric-weights
#| tbl-cap: Weighted scores for each clinical characteristic and GPT-4 response category.
## Get weights from HPOExplorer::gpt_annot_codify function defaults
args <- formals(HPOExplorer::gpt_annot_codify)
code_dict <- eval(args$code_dict)|>sort(decreasing = TRUE)
names(code_dict) <- paste0(stringr::str_to_sentence(names(code_dict)),
" (",code_dict,")")
weights_dict <- eval(args$weights_dict)|>unlist()|>sort(decreasing = TRUE)
names(weights_dict) <- paste0(stringr::str_to_sentence(gsub("_"," ",names(weights_dict))),
" (",weights_dict,")")
## Create the outer product of the two weight vectors
tbl_metric_weights <- as.data.table(outer(weights_dict,code_dict, "*"),
keep.rownames = "Clinical characteristic")
knitr::kable(as.data.frame(tbl_metric_weights))
```
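As a worked illustration of the equation above, the following sketch reimplements the score for a single phenotype. The `compute_nss` helper and the toy weights in the usage example are hypothetical; the canonical implementation is `HPOExplorer::gpt_annot_codify`.

```r
## Minimal sketch of the normalised severity score (NSS).
## `annotations` is a named character vector of GPT-4 frequency responses,
## one per clinical characteristic, with names matching `weights_dict`.
compute_nss <- function(annotations,
                        weights_dict,
                        code_dict = c(never = 0, rarely = 1,
                                      often = 2, always = 3)){
  F_pj <- code_dict[annotations]                     # encode responses (0-3)
  W_j  <- unlist(weights_dict)[names(annotations)]   # metric weights (1-5)
  100 * sum(F_pj * W_j) / sum(max(code_dict) * W_j)  # normalise to 0-100
}

## Toy example with two hypothetical metrics and weights:
# compute_nss(c(death = "rarely", blindness = "often"),
#             weights_dict = list(death = 5, blindness = 3))
## = 100 * (1*5 + 2*3) / (3*5 + 3*3) = 45.8
```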
### Severity classes
```{r tiers_dict}
tiers_dict <- unlist(eval(formals(HPOExplorer::gpt_annot_class)$tiers_dict))
```
The decision tree algorithm used in @Lazarin2014-rz was adapted here for use with the GPT-4 clinical characteristic annotations. This algorithm first assigned each clinical characteristic to a tier, where Tier 1 indicated the most severe clinical characteristics and Tier 4 the least severe (`r paste(paste0(shQuote(gsub("_"," ",names(tiers_dict))),"=",tiers_dict), collapse = ", ")`). If a phenotype often or always caused more than one Tier 1 clinical characteristic, it was assigned a severity class of "Profound". If the phenotype often or always caused only one Tier 1 clinical characteristic, it was assigned a severity class of "Severe". A "Severe" class was also assigned if the phenotype often or always caused three or more Tier 2 and Tier 3 clinical characteristics. If the phenotype often or always caused at least one Tier 2 clinical characteristic, it was assigned a severity class of "Moderate". All remaining phenotypes were assigned a severity class of "Mild". In cases where a phenotype mapped to more than one class, only the most severe class was used. This procedure is implemented within the function `HPOExplorer::gpt_annot_class`.
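A minimal sketch of this decision tree is shown below. The `assign_severity_class` helper is hypothetical; the canonical implementation is `HPOExplorer::gpt_annot_class`.

```r
## Sketch of the adapted severity class decision tree.
## `annotations` is a named character vector of GPT-4 frequency responses;
## `tiers_dict` maps each clinical characteristic to its tier
## (1 = most severe), as extracted in the chunk above.
assign_severity_class <- function(annotations, tiers_dict){
  ## Characteristics that the phenotype often or always causes
  hits <- names(annotations)[annotations %in% c("often","always")]
  ## Count the often/always characteristics within each tier
  n_tier <- table(factor(tiers_dict[hits], levels = 1:4))
  if (n_tier[["1"]] > 1)                  return("Profound")
  if (n_tier[["1"]] == 1)                 return("Severe")
  if (n_tier[["2"]] + n_tier[["3"]] >= 3) return("Severe")
  if (n_tier[["2"]] >= 1)                 return("Moderate")
  "Mild"
}
```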
### Correlations between clinical characteristic severity metrics
To assess the correlation structure between each clinical characteristic severity metric, as well as between the composite severity score and each metric, we computed Pearson correlation coefficients for all pairwise combinations of these variables using the numerically encoded metric values. The correlation matrix was visualised using a heatmap, with the colour intensity representing the strength of the correlation (@fig-metric-corplot).
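The `$r`/`$P` structure consumed by @fig-metric-corplot is consistent with the output of `Hmisc::rcorr`; below is a minimal sketch under that assumption, where `annot_encoded` is an illustrative name for a numeric matrix holding the encoded metrics plus the composite score.

```r
## Sketch: pairwise Pearson correlations between all encoded severity
## metrics and the composite severity score. `annot_encoded` (illustrative
## name) is a numeric matrix with one column per variable.
cor_metrics <- Hmisc::rcorr(as.matrix(annot_encoded), type = "pearson")
cor_metrics$r  # Pearson correlation coefficients
cor_metrics$P  # corresponding p-values (used to mask insignificant cells)
```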
## Data and code availability statement
All code and data used in this study are available on GitHub at:
<https://github.com/neurogenomics/gpt_hpo_annotations>
The GPT-4 clinical characteristic annotations for all HPO phenotypes are made available through the R function `HPOExplorer::gpt_annot_read` or in CSV format at:
<https://github.com/neurogenomics/gpt_hpo_annotations/tree/master/data>
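For example, assuming the `HPOExplorer` package is installed (e.g. from the neurogenomics GitHub organisation), the annotations can be loaded directly in R:

```r
## Load the GPT-4 clinical characteristic annotations as a data.table
gpt_annot <- HPOExplorer::gpt_annot_read()
head(gpt_annot)
```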
A fully reproducible version of this Quarto manuscript can be found at:
<https://github.com/neurogenomics/gpt_hpo_annotations/blob/master/manuscript.qmd>
## Acknowledgements
We would like to thank members of the Monarch Initiative, in particular Peter Robinson, for their insight and feedback throughout this project.
### Funding
This work was supported by a UK Dementia Research Institute (UK DRI) Future Leaders Fellowship \[MR/T04327X/1\] and the UK DRI which receives its funding from UK DRI Ltd, funded by the UK Medical Research Council, Alzheimer's Society and Alzheimer's Research UK.
## References {.unnumbered}
::: {#refs}
:::
{{< pagebreak >}}
## Supplementary Materials
### Supplementary Figures
```{r fig-consist-vs-ontLvl}
#| label: fig-consist-vs-ontLvl
#| fig-cap: Relationship between the consistency of GPT-4 clinical characteristic annotations (using the stringent criterion) and the level of each phenotype within the HPO ontology (with the number of phenotypes in parentheses). Greater ontology levels (x-axis) indicate more specific phenotypes. The subtitle indicates summary statistics for the overall relationship between HPO level and the proportion of phenotypes that were annotated consistently. The p-values above each bar indicate whether the distribution of consistent/inconsistent annotations within a given HPO level significantly deviates from the expected null distribution.
#| fig-height: 6
#| fig-width: 12
consist_vs_ontLvl$labels$caption <- NULL
(
consist_vs_ontLvl /
HPOExplorer::plot_arrow(x=0,xend=1, y=1, yend=1, labels_x=c(.8,.2), labels_y=c(1,1),
labels=c("Specific\nphenotypes","Broad\nphenotypes"))
) + patchwork::plot_layout(heights = c(1, 0.1))
```
```{r fig-severity-histo}
#| label: fig-severity-histo
#| fig-cap: Distribution of the composite GPT-4 severity scores for all HPO terms.
#| fig-height: 8
#| fig-width: 10
gpt_annot_plot_out$gp3 +
labs(x="Severity score")
```
```{r fig-severity-boxplot}
#| label: fig-severity-boxplot
#| fig-cap: Boxplot showing the relationship between composite severity score (y-axis) and the frequency response categories within each clinical characteristic type.
#| fig-height: 5
#| fig-width: 8
gpt_annot_plot_out$gp2 +
scale_fill_brewer(palette = "GnBu", direction = -1) +
labs(y="Severity score")
```
```{r fig-metric-corplot}
#| label: fig-metric-corplot
#| fig-cap: Pearson correlations between each individual clinical characteristic severity metric and the composite severity score ('severity_score_gpt').
#| fig-height: 8
#| fig-width: 8
## Treat the diagonal as significant so self-correlations are displayed
diag(cor_metrics$P) <- 0
corrplot::corrplot(cor_metrics$r,
                   method = "square",
                   tl.srt = 45,
                   tl.col = "grey20",
                   addCoef.col = 'grey20', number.cex = 0.8,
                   order = "hclust",
                   addCoefasPercent = TRUE,
                   p.mat = cor_metrics$P)
```
```{r fig-severity-class}
#| label: fig-severity-class
#| fig-cap: Distribution of the composite GPT-4 severity score introduced in this paper (y-axis) by an approximation of the severity class system introduced in @Lazarin2014-rz (x-axis). While these are different schemes for ranking phenotype severity, there is a strong correspondence between them (see summary statistics in subtitle). The sample size (number of phenotypes) is shown in parentheses along the x-axis.
#| fig-height: 6
#| fig-width: 8
fig_severity_class
```
{{< pagebreak >}}
### Supplementary Tables
```{r top_phenos_path}
top_phenos_path <- here::here("data","top_phenos_annotations.csv.gz")
## Use the phenotypes shown in @fig-top-phenos to subset the full annotations
top_phenos_ids <- data.table::rbindlist(plot_top_phenos_out$data)$hpo_id
top_phenos <- checks$annot[hpo_id %in% top_phenos_ids]
data.table::fwrite(top_phenos, top_phenos_path)
```
::: {#tbl-annotations}
[**Top phenotype annotations table**](https://github.com/neurogenomics/gpt_hpo_annotations/raw/master/data/%60r%20basename(top_phenos_path)%60)
Table of GPT-4 clinical characteristic annotations for all Human Phenotype Ontology (HPO) phenotypes in @fig-top-phenos. For each phenotype, this includes the name of the phenotype ('hpo_name'), the ID of the phenotype ('hpo_id'), the frequency response for each annotation (always, often, rarely, never), and the justification for each annotation ('...\_justification'). These results can also be downloaded programmatically using the R function `HPOExplorer::gpt_annot_check`.
:::
```{r tbl-fas}
#| label: tbl-fas
#| tbl-cap: Severity annotations generated by GPT-4 for the clinical characteristics of the HPO phenotype 'Fetal akinesia sequence' (HP:0001989).
#| tbl-colwidths: [20,20,60]
## Identify the annotation columns (excluding justification and metadata columns)
annot_cols <- setdiff(grep("justification",names(gpt_annot),
                           value = TRUE, invert = TRUE),
                      c("hpo_name","hpo_id","pheno_count"))
tbl_fas <- gpt_annot[hpo_name=='Fetal akinesia sequence',]|>
data.table::melt.data.table(id.vars = c("hpo_name","hpo_id"),
measure.vars = list(
Annotation=annot_cols,
Justification=grep("justification",names(gpt_annot),
value = TRUE)
)
)
tbl_fas[,"Clinical characteristic":=stringr::str_to_sentence(gsub("_"," ",annot_cols))]
knitr::kable(
as.data.frame(tbl_fas[,c("Clinical characteristic","Annotation","Justification")])
)
```