
IDENTIFYING INDICATIONS FOR NOVEL DRUGS USING ELECTRONIC HEALTH RECORDS
Unsupervised methods that read large EHR databases to predict which diseases a drug under development could treat.
A peer-reviewed study extending and benchmarking unsupervised computational methods that identify candidate indications for drugs in development directly from electronic health records — including a matrix-factorisation variant tuned for first-in-class molecules, where the obvious comparators do not yet exist.
WHAT THE PAPER SHOWS
Choosing which disease to develop a drug for is one of the earliest and highest-stakes decisions in research and development. This study extends and tests several unsupervised computational methods that use electronic health records to identify candidate indications — diseases a drug could plausibly treat — for molecules still in development. The methods are phenotypic-similarity driven: they reason from the patterns of disease that co-occur across millions of real patients, rather than from literature or expert intuition alone.
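As a rough illustration of what "phenotypic-similarity driven" can mean in practice, the sketch below scores diseases by how often they are recorded in the same patients as a known indication. It is a minimal example under stated assumptions, not the paper's method: the patient records are invented, the function names (`patients_by_disease`, `rank_by_cooccurrence`) are hypothetical, and cosine similarity over binary patient-incidence vectors is just one simple choice of similarity measure.

```python
from collections import defaultdict
from math import sqrt

# Invented toy records: one set of recorded diagnoses per patient.
patients = [
    {"atopic dermatitis", "asthma"},
    {"atopic dermatitis", "asthma", "nasal polyps"},
    {"asthma", "nasal polyps"},
    {"atopic dermatitis", "nasal polyps"},
    {"type 2 diabetes", "hypertension"},
    {"type 2 diabetes", "hypertension"},
    {"hypertension", "asthma"},
]

def patients_by_disease(records):
    """Map each disease to the set of patient ids that carry it."""
    index = defaultdict(set)
    for patient_id, diagnoses in enumerate(records):
        for disease in diagnoses:
            index[disease].add(patient_id)
    return dict(index)

def rank_by_cooccurrence(anchor_disease, records):
    """Rank every other disease by cosine similarity to the anchor disease.

    Each disease is treated as a binary vector over patients, so the cosine
    reduces to |A and B| / sqrt(|A| * |B|).
    """
    index = patients_by_disease(records)
    if anchor_disease not in index:
        raise KeyError(f"no patient has a record of {anchor_disease!r}")
    anchor = index[anchor_disease]
    scores = {}
    for disease, members in index.items():
        if disease != anchor_disease:
            scores[disease] = len(anchor & members) / sqrt(len(anchor) * len(members))
    # Highest similarity first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranked = rank_by_cooccurrence("atopic dermatitis", patients)
for disease, similarity in ranked:
    print(f"{disease:20s} {similarity:.3f}")
```

In this toy data, diseases that share patients with the anchor rise to the top and diseases that never co-occur with it score zero. A real analysis would need to handle confounding, coding practices and disease prevalence, none of which this sketch attempts.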
To benchmark the approach, the methods were tested on known drugs that already have multiple approved indications, so predictions could be checked against ground truth. A variant of matrix factorisation gave the best performance for first-in-class drugs, the hardest and most valuable case because no established comparator exists yet. It improved on earlier methods built for well-characterised, established drugs. Applied beyond the benchmark, the methods surfaced novel predictions for key immunology and oncology drugs.
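The benchmarking idea can be sketched in a few lines: hide one known indication of a multi-indication drug, fit a factorisation on what remains, and see where the hidden indication lands among the drug's unlabelled diseases. The example below uses plain matrix factorisation trained by stochastic gradient descent. The source does not describe the specific variant used in the paper, so this should be read as a generic stand-in. The drug names, the toy matrix and every hyperparameter (`rank`, `epochs`, `lr`, `reg`) are assumptions chosen for illustration.

```python
import random

# Hypothetical toy matrix: rows are drugs, columns are diseases, 1 = known indication.
drugs = ["drug_A", "drug_B", "drug_C", "drug_D", "drug_E"]
diseases = ["asthma", "atopic dermatitis", "nasal polyps", "psoriasis", "Crohn's disease"]
known = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

def score(P, Q, i, j):
    """Predicted affinity of drug i for disease j: dot product of latent vectors."""
    return sum(p * q for p, q in zip(P[i], Q[j]))

def factorise(matrix, masked, rank=2, epochs=300, lr=0.05, reg=0.01, seed=0):
    """Fit matrix ~= P Q^T by SGD on squared error, never training on masked cells."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols) if (i, j) not in masked]
    for _ in range(epochs):
        rng.shuffle(cells)
        for i, j in cells:
            err = matrix[i][j] - score(P, Q, i, j)
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step with L2 regularisation on both factors.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q

def rank_candidates(P, Q, drug_idx, matrix, masked):
    """Order the drug's unlabelled or held-out diseases, best predicted score first."""
    candidates = [j for j in range(len(matrix[0]))
                  if matrix[drug_idx][j] == 0 or (drug_idx, j) in masked]
    return sorted(candidates, key=lambda j: score(P, Q, drug_idx, j), reverse=True)

# Hide drug_A's "nasal polyps" indication, train, then see where it is ranked.
held_out = (0, 2)
P, Q = factorise(known, masked={held_out})
ranking = rank_candidates(P, Q, 0, known, {held_out})
position = ranking.index(held_out[1]) + 1
print(f"held-out indication {diseases[held_out[1]]!r} ranked {position} of {len(ranking)}")
```

Repeating the hold-out over many drugs and indications, and averaging the resulting ranks, would give a benchmark score in the spirit described above. Treating unlabelled cells as zeros is a simplification, since an unlabelled drug-disease pair may simply be untested.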
KEY FINDINGS
- A matrix-factorisation variant gives the best performance for first-in-class drugs — improving on prior methods that were built for established, well-characterised molecules.
- The approach is phenotypic-similarity driven: it learns from the patterns of disease that co-occur across large patient populations, not from literature review alone.
- Applied beyond the benchmark, the methods produced novel indication predictions for key immunology and oncology drugs.
- Performance differs sharply by therapeutic area — stronger in inflammation and immunology than in oncology, likely because many chemotherapies are not targeted therapies, so phenotypic signal is weaker.
- The implementation is released as open-source code, so the methods can be inspected, reproduced and extended.
Methods for treating digitally-identified IL-4/IL-13 related disorders
The same line of work also produced a patent. Applying unsupervised machine learning (bisecting K-means clustering with Multiple Correspondence Analysis) across electronic health records for roughly 94 million patients, the team digitally identified novel candidate indications for dupilumab, an anti-IL-4Rα antibody that blocks signalling through the shared IL-4/IL-13 pathway. Beyond dupilumab's approved indications, the analysis surfaced candidate disorders for evaluation in skin, blood (including sickle-cell disease), lung fibrosis and eye disease. These candidates form the basis for the patent's treatment claims.
Inventors: Cliona Marie Molony · Paul Bryce · Emanuele De Rinaldis · Ramon Antonio Hernandez Vecino · Francisco Javier Jimenez Jimenez
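For readers unfamiliar with bisecting K-means, the sketch below shows the core loop: start with one cluster and repeatedly split the cluster with the largest within-cluster error in two. It is a simplified illustration, not the patented pipeline. The patient fields and values are invented, the function names are hypothetical, and the Multiple Correspondence Analysis step is not implemented. The code clusters the raw one-hot indicator coding that MCA would normally take as input and reduce in dimension.

```python
# Invented categorical records, one dict per patient.
records = [
    {"skin": "eczema", "airway": "asthma", "eosinophils": "high"},
    {"skin": "eczema", "airway": "asthma", "eosinophils": "high"},
    {"skin": "eczema", "airway": "none", "eosinophils": "high"},
    {"skin": "none", "airway": "copd", "eosinophils": "normal"},
    {"skin": "none", "airway": "copd", "eosinophils": "normal"},
    {"skin": "none", "airway": "none", "eosinophils": "normal"},
]

def one_hot(rows):
    """Encode categorical records as 0/1 indicator vectors, one column per (field, value)."""
    columns = sorted({(field, value) for row in rows for field, value in row.items()})
    return [[1.0 if row.get(f) == v else 0.0 for f, v in columns] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    centre = centroid(points)
    return sum(sq_dist(p, centre) for p in points)

def two_means(vectors, members, iterations=20):
    """Split one cluster in two with Lloyd's algorithm and a deterministic start."""
    first = members[0]
    second = max(members, key=lambda i: sq_dist(vectors[i], vectors[first]))
    centres = [vectors[first], vectors[second]]
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for i in members:
            nearer_left = sq_dist(vectors[i], centres[0]) <= sq_dist(vectors[i], centres[1])
            (left if nearer_left else right).append(i)
        if not left or not right:
            break  # cluster cannot be split further
        centres = [centroid([vectors[i] for i in left]),
                   centroid([vectors[i] for i in right])]
    return left, right

def bisecting_kmeans(vectors, k):
    """Return up to k clusters as lists of row indices."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        worst = max(clusters, key=lambda members: sse([vectors[i] for i in members]))
        left, right = two_means(vectors, worst)
        if not left or not right:
            break
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters

vectors = one_hot(records)
clusters = bisecting_kmeans(vectors, 2)
print(clusters)
```

The deterministic start (first member plus the member farthest from it) keeps the example reproducible. Production implementations typically use randomised or k-means++ initialisation and run on the reduced coordinates produced by MCA.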

WHERE THE METHOD CAME FROM
Before chAIron existed as a company, the methods now at the centre of its platform were being built, tested and published in the peer-reviewed literature. This paper — co-authored by Flavio Dormont, today a co-founder and chief scientific officer of chAIron — is one of the earliest pieces of that story: an EHR-based, mechanism-grounded way to find the indications a molecule could treat, validated against drugs whose answers were already known.
It is one of the first engagements that shaped how chAIron works today. The platform operationalises this lineage for clients — pairing real-world evidence with a biomedical knowledge graph, and keeping clinical experts in the loop — to turn indication finding from a months-long, intuition-led exercise into a ranked, testable, evidence-backed shortlist.
AT A GLANCE
- Type: Peer-reviewed research article
- Journal: Computers in Biology and Medicine (Elsevier)
- Published: October 2024 · vol. 183, 109158
- Access: Open access, no form
- Methods: Unsupervised ML · matrix factorisation
- Code: Open-source on GitHub
FREQUENTLY ASKED QUESTIONS
Is the paper free to read?
Yes. It is available openly through the publisher. You can read it on ScienceDirect, resolve it via its DOI, or read the mirrored record on the ACM Digital Library, with no form and no charge. The implementation code is also released openly on GitHub.
What does the paper do?
It extends and benchmarks unsupervised computational methods that read electronic health records to predict candidate indications for drugs in development. The methods are tested on known drugs with multiple approved indications so that predictions can be checked against ground truth, and a matrix-factorisation variant is shown to work best for first-in-class molecules.
How does the paper relate to chAIron?
It is part of chAIron's founding story. The EHR-based, mechanism-grounded approach to indication finding described here is co-authored by chAIron co-founder Flavio Dormont, and the same lineage of methods underpins the chAIron platform today. The published work was carried out with the Data & Computational Science and Clinical Real-World Evidence teams at Sanofi R&D.
Who owns the patent?
The patent (WO2021119028A1) covering digitally-identified IL-4/IL-13 disorders is owned by Sanofi Biotechnology SAS, the pharma partner, not by chAIron. It is referenced here because it is a concrete outcome of the same kind of EHR-based indication-finding work, not because chAIron holds any rights in it.
What did the patent work find?
Using unsupervised machine learning across electronic health records for roughly 94 million patients, the work digitally identified novel candidate indications for dupilumab, an anti-IL-4Rα antibody acting on the IL-4/IL-13 pathway. These candidates go beyond its approved indications and span skin, blood (including sickle-cell disease), lung fibrosis and eye disorders. They formed the basis of the patent's treatment claims.
Do the methods work equally well in every therapeutic area?
No. Performance differs sharply by therapeutic area. The methods perform better for inflammation and immunology than for oncology, likely because many chemotherapies are not targeted therapies, so the phenotypic signal the methods rely on is weaker. The paper is transparent about this limitation.
THE METHOD BEHIND CHAIRON — APPLIED TO YOUR ASSET
Read the paper, then talk to us about finding the right indication for a specific molecule — peer to peer.
BOOK AN INDICATION-STRATEGY CALL
© 2026 chAIron SA. The referenced publication and patent are the property of their respective owners; the patent (WO2021119028A1) is owned by Sanofi Biotechnology SAS, not chAIron. Provided for informational purposes only; not legal, regulatory, financial or investment advice.


