Intelligent Antibodies

Therapeutic (monoclonal) antibodies are among the most effective therapies available today for chronic inflammatory diseases such as Crohn's disease, lupus and multiple sclerosis. They target specific proteins involved in these pathologies in order to neutralize them. In cancer, they can also be used to limit the supply of factors essential to tumor growth or of factors that disrupt the tumor microenvironment. Monoclonal antibody-based serotherapy can also compensate for treatment shortfalls during fulminant epidemics caused by pathogens with a high mutation rate, as with COVID-19.

Although monoclonal antibodies are promising and represent a major product on the pharmaceutical market, only around thirty are currently available for chronic inflammatory diseases, and around ten for the treatment of cancer. This limited range stems from the many difficulties inherent in the in-vitro and in-silico design of these therapeutic molecules. Antibody design and optimization remain a real challenge, not least because the molecules must be effective, target-specific and deliverable to the organs being treated. Long and costly development times add to the difficulty.

To accelerate the development of therapeutic antibodies, in-silico methods have been developed that shorten modeling times while exploring design possibilities more exhaustively. Although advantageous, these methods currently rely mainly on estimating the affinity between the antibody and its target through the binding energy, which remains difficult to estimate computationally and extremely time-consuming to measure experimentally.

Data

Two data sets are available: one covering multiple species, from SAbDab, and one about COVID, from Cov-AbDab (https://opig.stats.ox.ac.uk/webapps/covabdab/).

All the data mentioned above are freely accessible.

SAbDab dataset

The data used here are part of SAbDab. They relate to immune complexes characterized by X-ray crystallography.

Two files have been collected:

  • All_PDB_files.txt: contains the id of each antigen-antibody (protein-protein) structure. The first four characters of an id are the RCSB PDB code of the structure; the rest of the id lists the chains involved within that structure.

  • Positive_samples.txt: contains all pairs of positively interacting proteins, from the same or from different complexes. For instance, 4gms_J_N_E 2vir_B_A_C records an interaction between 4gms and 2vir. The order of the two partners carries no meaning, so either one can act as the antigen or as the antibody. The relation is reciprocal: the antigenic chains of 4gms can form immune complexes with the antibody chains of 2vir, and the antigenic chains of 2vir can form immune complexes with the antibody chains of 4gms. The antigenic and antibody chains of a single RCSB id also form a complex with each other.

Therefore, two complex ids that are not paired in Positive_samples.txt are considered unable to form a complex and serve as negative samples.
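The negative-sample rule above can be sketched as follows. This is a minimal illustration, not the repository's own code: it assumes that Positive_samples.txt holds one whitespace-separated pair of ids per line, and the helper names (split_id, parse_positive_line, negative_pairs) as well as the third id in the toy example are hypothetical.

```python
from itertools import combinations


def split_id(complex_id):
    """Split an id such as '4gms_J_N_E' into its PDB code and chain labels."""
    pdb_code, *chains = complex_id.split("_")
    return pdb_code, chains


def parse_positive_line(line):
    """Read one line of positive samples as an unordered pair of ids.

    Assumption: each line holds exactly two ids separated by whitespace.
    """
    first, second = line.split()
    return frozenset((first, second))


def negative_pairs(all_ids, positive_pairs):
    """Return every unordered pair of distinct ids that is not listed as positive."""
    return [
        (a, b)
        for a, b in combinations(sorted(set(all_ids)), 2)
        if frozenset((a, b)) not in positive_pairs
    ]


# Toy example: the first two ids come from the text, the third is made up.
ids = ["4gms_J_N_E", "2vir_B_A_C", "xxxx_A_B_C"]
positives = {parse_positive_line("4gms_J_N_E 2vir_B_A_C")}
negatives = negative_pairs(ids, positives)
print(negatives)
```

Storing each pair as a frozenset makes the lookup independent of partner order, which matches the reciprocity described above. Pairs of an id with itself are never generated, since a complex always counts as positive with its own chains.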

All the sequences are collected directly from SAbDab as FASTA files, which look like this:

>1A2Y_1|Chain A|IGG1-KAPPA D1.3 FV (LIGHT CHAIN)|Mus musculus (10090)
DIVLTQSPASLSASVGETVTITCRASGNIHNYLAWYQQKQGKSPQLLVYYTTTLADGVPSRFSGSGSGTQYSLKINSLQPEDFGSYYCQHFWSTPRTFGGGTKLEIK
>1A2Y_2|Chain B|IGG1-KAPPA D1.3 FV (HEAVY CHAIN)|Mus musculus (10090)
QVQLQESGPGLVAPSQSLSITCTVSGFSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLHTDDTARYYCARERDYRLDYWGQGTTLTVSS
>1A2Y_3|Chain C|LYSOZYME|Gallus gallus (9031)
KVFGRCELAAAMKRHGLANYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL

Lines starting with > carry the sequence information; the line that follows each of them gives the chain of residues (amino acids).
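As an illustration of this layout, here is a minimal parser sketch. It assumes the header fields are separated by | in the order shown above (entity id, chain, description, species); the function name parse_fasta is hypothetical, and the sequences in the example are truncated to ten residues for readability.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header_fields, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].split("|")
            chunks = []
        elif header is not None:
            # Sequence lines may span several lines; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Headers taken from the example above; sequences truncated to ten residues.
example = """>1A2Y_1|Chain A|IGG1-KAPPA D1.3 FV (LIGHT CHAIN)|Mus musculus (10090)
DIVLTQSPAS
>1A2Y_3|Chain C|LYSOZYME|Gallus gallus (9031)
KVFGRCELAA
"""
records = parse_fasta(example)
for fields, sequence in records:
    print(fields[0], fields[3], sequence)
```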

Cov-AbDab

The data used here are part of Cov-AbDab. They relate to immune complexes.

Three files are available:

  • positive dataset.txt: a three-column file giving the SARS-CoV identifier, the antibody sequence and the antigen sequence. These chains can form immune complexes.

  • negative dataset.txt: a three-column file giving the SARS-CoV identifier, the antibody sequence and the antigen sequence. These chains do not form immune complexes.

  • Independant test.txt :

Get the data

For convenience, some scripts have been written to parse the SAbDab database, collect the FASTA files of all complexes, and build the interaction and sequences tables.

To get all the sequences, run ./scripts/download_pdb.py:

python download_pdb.py

All FASTA files will be saved in the ./data/SabDab/fasta folder.

To get the interaction table, use:

python get_interaction_table.py

This will create data.csv, a three-column, semicolon-separated file containing the antibody identifier, the antigen identifier and 0 or 1 depending on the ability of the pair to form an immune complex. Identifiers are suffixed with |ag or |ab to refer to the antigenic or the antibody part of the complex.

The data look like this:

ab;ag;interaction
5kel|ab;5kel|ag;1
5kel|ab;6cwt|ag;0
...
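A minimal sketch of how such a table could be loaded with the Python standard library, assuming the header and semicolon separator shown above. The function name read_interactions is hypothetical, and the sample reuses the two rows from the excerpt.

```python
import csv
import io


def read_interactions(handle):
    """Read the semicolon-separated table into (antibody, antigen, label) tuples."""
    reader = csv.DictReader(handle, delimiter=";")
    return [(row["ab"], row["ag"], int(row["interaction"])) for row in reader]


# In-memory sample built from the two rows shown above.
sample = io.StringIO("ab;ag;interaction\n5kel|ab;5kel|ag;1\n5kel|ab;6cwt|ag;0\n")
pairs = read_interactions(sample)
n_positive = sum(label for _, _, label in pairs)
print(pairs, n_positive)
```

For the real file, the same function can be given a handle opened with open("data.csv", newline=""), assuming the script wrote the file to the current directory.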

To get the sequences table, use:

python get_seq_table.py

This will create sequences.csv, a three-column, semicolon-separated file containing the extended identifier (such as abcd|ag or abcd|ab), the species of origin and the chain of residues. The data look like this:

seq_id;specie;sequence
5kel|ag;Zaire ebolavirus (strain Mayinga-76) (128952);IPLGVIHNSTLQVSDVDKLVCRDKLSSTNQLRSVGLNLEGNGVATDVPSATKRWGFRSGVPPKVVNYEAGEWAENCYNLEIKKPDGSECLPAAPDGIRGFPRCRYVHKVSGTGPCAGDFAFHKEGAFFLYDRLASTVIYRGTTFAEGVVAFLILPQAKKDFFSSHPLREPVNATEDPSSGYYSTTIRYQATGFGTNETEYLFEVDNLTYVQLESRFTPQFLLQLNETIYTSGKRSNTTGKLIWKVNPEIDTTIGEWAFWETKKNLTRKIRSEELSFTVVSNGAKNISGQSPARTSSDPGTNTTTEDHKIMASENSSAMVQVHSQGREAAVSHLTTLATISTSPQSLTTKPGPDNSTHNTPVYKLDISEATQVEQHHRRTDNDSTASDTPSATTAAGPPKAENTNTSKSTDFLDPATTTSPQNHSETAGNNNTHHQDTGEESASSGKLGLITNTIAGVAGLITGGRRTRR
5kel|ag;Zaire ebolavirus (128952);EAIVNAQPKCNPNLHYWTTQDEGAAIGLAWIPYFGPAAEGIYTEGLMHNQDGLICGLRQLANETTQALQLFLRATTELRTFSILNRKAIDFLLQRWGGTCHILGPDCCIEPHDWTKNITDKIDQIIHDFVDKTLPDLEVDDDD
...
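Because one identifier can appear on several rows (5kel|ag has two chains in the excerpt above), it may help to group the chains per identifier when loading the table. The following is a minimal sketch under that assumption; read_sequences is a hypothetical name, the column names are those of the header shown above, and the sequences are truncated to ten residues.

```python
import csv
import io
from collections import defaultdict


def read_sequences(handle):
    """Group chain sequences by identifier; one identifier may own several chains."""
    chains = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=";"):
        chains[row["seq_id"]].append((row["specie"], row["sequence"]))
    return dict(chains)


# Rows taken from the excerpt above, sequences truncated for readability.
sample = io.StringIO(
    "seq_id;specie;sequence\n"
    "5kel|ag;Zaire ebolavirus (strain Mayinga-76) (128952);IPLGVIHNST\n"
    "5kel|ag;Zaire ebolavirus (128952);EAIVNAQPKC\n"
)
table = read_sequences(sample)
for seq_id, entries in table.items():
    print(seq_id, len(entries))
```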

Strategy

Data encoding

Neural networks

https://github.com/sebgra/Tensorflow_Advanced_Specialization/blob/main/C1/week_1/C1_W1_Lab_3_siamese-network.ipynb

Draft

All the data that can be used for the challenge can be found on SAbDab.

To get access to all the data, the search module is used.

To get data containing both antibody and protein antigen sequences with affinity use this - 468 entries.

To get data containing both antibody and protein antigen sequences without affinity use this - 5092 entries.

To get data containing both antibody and not necessarily protein antigen sequences with affinity use this - 737 entries.

To get data containing both antibody and not necessarily protein antigen sequences without affinity use this - 7825 entries.

More criteria can be applied to select data from here

Backup data can be found here

Covid data

Here

Resources

Article

Benchmark

Siamese Network https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2022.1053617/full#h6x

Data

https://github.com/emersON106/AbAgIntPre/tree/main

  • Get this data to have all the useful PDB entries, then collect all the corresponding FASTA files.
  • Parse all the FASTA files, grepping for "heavy chain", "light chain", "antibody" and "antigene", to create a dataset of sequences for both Ag and Ab.
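The keyword-based split described in the last bullet could be sketched as below. This is only an illustration of the idea: the function name classify_header and the returned labels are hypothetical, and the keyword "antigen" is used as a substring so that it also matches "antigene".

```python
AB_KEYWORDS = ("heavy chain", "light chain", "antibody")
AG_KEYWORDS = ("antigen",)  # substring match, so "antigene" is covered too


def classify_header(header):
    """Label a FASTA header as 'ab', 'ag', or None when no keyword matches."""
    text = header.lower()
    if any(keyword in text for keyword in AB_KEYWORDS):
        return "ab"
    if any(keyword in text for keyword in AG_KEYWORDS):
        return "ag"
    return None


# Headers copied from the 1A2Y FASTA example in the SAbDab section.
headers = [
    ">1A2Y_1|Chain A|IGG1-KAPPA D1.3 FV (LIGHT CHAIN)|Mus musculus (10090)",
    ">1A2Y_2|Chain B|IGG1-KAPPA D1.3 FV (HEAVY CHAIN)|Mus musculus (10090)",
    ">1A2Y_3|Chain C|LYSOZYME|Gallus gallus (9031)",
]
labels = [classify_header(header) for header in headers]
print(labels)
```

On these three headers the lysozyme chain stays unlabelled, because its header contains none of the keywords. Keyword matching alone therefore does not identify every antigen, and such chains need another rule.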