Background + Problem

Many patients have conditions for which the current standard of care simply hasn’t produced results. People suffer every day because their condition isn’t responding to existing methods. Enter clinical trials, which test emerging treatments that could potentially change the standard of care entirely. However, the existing system for finding clinical trials is outdated and nearly impossible to navigate for someone who isn’t medically trained.

Currently, finding applicable clinical trials for a patient is a task done predominantly by medical professionals because the existing systems are so complex. However, unless your doctor is scouring upcoming clinical trials every day, there is a high chance that you will never hear about potentially life-saving treatments for your condition.

If you were to manually search these trials, it would take over 400 days.

Our client ClinicalNet is a company that seeks to transform the healthcare landscape by connecting patients with the clinical trials most relevant to them. As such, it was critical for them to offer the best possible clinical trial matching experience by providing users with accurate and timely results.

However, their existing trial search algorithm leveraged old school text matching techniques. These worked in some cases, but often failed to capture the variety of synonyms and related conditions when patients attempted to search for their condition by disease. We needed to consider over 65,000 active clinical trials, each with 20+ inclusion/exclusion criteria (not to mention the hundreds of thousands of completed trials to use as a baseline) and create an accurate (critical for sensitive medical applications) and timely (critical for user experience) search engine. This seems like a job for Natural Language Processing AI!

High Level Solution

In order to tackle this problem, we leveraged a powerful NLP technique called Semantic Search. This is actually the same technique that powers the “Retrieval” step in “Retrieval Augmented Generation” (RAG) used in many Large Language Model applications from enterprise search to custom chatbots, and more.

To describe the technique succinctly, we encode all of our “target” data, the clinical trials, via “embedding” the important text as “vectors.” These vectors are fixed length lists of numbers that represent the actual meaning of the text they were embedded from. In order to facilitate this, we utilized a custom built embedding model trained specifically on medical terminology from various popular medical ontologies.

A simplified visual of the embedding process. Credit: haystack

We can then create a vector from any user input query and perform the actual Semantic Search, which involves computing a similarity measure, such as cosine similarity, between the query vector and each target vector to retrieve the most relevant targets on a per-query basis. This prevents us from missing out on tons of potentially qualifying conditions!
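To make the retrieval step concrete, here is a minimal sketch of ranking targets by cosine similarity. It is not ClinicalNet's implementation: the three-dimensional vectors are made up for illustration (a real embedding model produces much longer vectors), and the names `cosine_similarity`, `semantic_search`, and `TRIAL_CONDITION_VECTORS` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector carries no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, target_vectors, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in target_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Assumption: toy 3-dimensional "embeddings" invented for this example.
TRIAL_CONDITION_VECTORS = {
    "small cell lung cancer": [0.9, 0.1, 0.0],
    "non-small cell lung carcinoma": [0.8, 0.2, 0.1],
    "type 2 diabetes": [0.0, 0.1, 0.9],
}

# Assumption: pretend this is the embedding of the query "lung cancer".
query = [1.0, 0.0, 0.0]
results = semantic_search(query, TRIAL_CONDITION_VECTORS, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this toy data both lung conditions rank above diabetes even though neither string exactly matches the query text, which is the behavior plain text matching tends to miss.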

Example “qualifying conditions” for users who have Lung Cancer vs. the previous system

However, just this problem formulation wasn’t nearly enough to get us where we needed to go. Our final approach involved a variety of mapping algorithms, term clustering approaches, and clever optimizations to cross the accuracy and timing thresholds we were looking for.


Condition Family Disambiguation

Medical conditions are related to each other much like relatives in a family tree. There are broader conditions that encompass more specific conditions which in turn encompass even more specific conditions. We refer to this relationship as “parent” and “child.”

At the surface level, it’s pretty clear that “lung cancer” is a child concept of the condition “cancer.” However, as we venture deeper into specific medical conditions, it becomes less clear how “oat cell carcinoma” relates to “small cell lung cancer.” Which is the parent and which is the child? Trick question - they’re actually sibling concepts! Thus, we needed a specialized method for determining the relationship between a user’s input query and the various conditions the semantic search returned.
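As an illustration only, the sketch below classifies the relationship between two conditions once a parent mapping is available. The article does not describe the actual method used, so everything here is an assumption: the tiny `PARENT_OF` tree is a toy (placing both specific conditions directly under “lung cancer” is a simplification), and the `ancestors` and `relationship` helpers are hypothetical names.

```python
# Assumption: a toy ontology where each condition points at its parent (None = root).
PARENT_OF = {
    "cancer": None,
    "lung cancer": "cancer",
    "small cell lung cancer": "lung cancer",
    "oat cell carcinoma": "lung cancer",
}


def ancestors(term, parent_of):
    """Walk up the tree and return every ancestor of term, nearest first."""
    chain = []
    current = parent_of.get(term)
    while current is not None:
        chain.append(current)
        current = parent_of.get(current)
    return chain


def relationship(a, b, parent_of):
    """Describe how condition a relates to condition b."""
    if a == b:
        return "same"
    if a in ancestors(b, parent_of):
        return "ancestor"  # a is a parent, grandparent, ... of b
    if b in ancestors(a, parent_of):
        return "descendant"  # a is a child, grandchild, ... of b
    parent_a, parent_b = parent_of.get(a), parent_of.get(b)
    if parent_a is not None and parent_a == parent_b:
        return "sibling"
    return "unrelated"


print(relationship("cancer", "lung cancer", PARENT_OF))
print(relationship("oat cell carcinoma", "small cell lung cancer", PARENT_OF))
```

The hard part in practice is building the parent mapping in the first place; once it exists, classification reduces to cheap lookups like these.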

After much thought and consideration, our chosen method ended up solving this problem rather elegantly, and provided us with a complete ontology of familial relationships that we could then leverage to make our search even better. Not only that, but we devised a way to “pre-solve” the problem for many branches of the condition tree in order to avoid heavy computation on every user query.

System Responsiveness

This system needed to provide fast responses over a vast array of clinical trials and medical terminology. It was necessary to sort over 65,000 clinical trials by relevance on a per-query basis, while also mapping into existing medical term ontologies composed of millions of terms to ensure the accuracy of our results. All of this needed to be done in just a few seconds, even for queries the system had never seen before.

In order to facilitate this, we leveraged a few approaches:

  • Adopted a statistics-based approach to precomputing key condition families and mapping those back to common queries

  • Created different tiers of search, allowing a more in-depth search to be conducted in the background while returning the most “obvious” results immediately

  • Tuned the dimensions of the custom embedding model to our use case. Larger vectors mean more info captured, but also more search time!
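The first two ideas above can be sketched together: answer immediately from precomputed condition families, and run a slower search in the background. This is a rough illustration under stated assumptions, not the production design. `PRECOMPUTED_FAMILIES`, `ALL_CONDITIONS`, `fast_tier`, `deep_tier`, and `tiered_search` are hypothetical names, and the word-overlap check merely stands in for the real, more expensive search.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: families precomputed offline for common queries (toy data).
PRECOMPUTED_FAMILIES = {
    "lung cancer": ["small cell lung cancer", "oat cell carcinoma"],
}

# Assumption: a tiny stand-in for the full condition vocabulary.
ALL_CONDITIONS = [
    "small cell lung cancer",
    "oat cell carcinoma",
    "lung adenocarcinoma",
    "type 2 diabetes",
]


def fast_tier(query):
    """Instant lookup of precomputed results; empty list if the query is not cached."""
    return PRECOMPUTED_FAMILIES.get(query.strip().lower(), [])


def deep_tier(query):
    """Placeholder for the slower, exhaustive search (here: simple word overlap)."""
    words = set(query.lower().split())
    return [c for c in ALL_CONDITIONS if words & set(c.split())]


def tiered_search(query, executor):
    """Return the immediate results plus a future for the in-depth results."""
    immediate = fast_tier(query)
    pending = executor.submit(deep_tier, query)
    return immediate, pending


with ThreadPoolExecutor(max_workers=1) as executor:
    immediate, pending = tiered_search("Lung Cancer", executor)
    print("shown right away:", immediate)
    deeper = pending.result()  # blocks until the background search finishes
    print("arrives later:", deeper)
```

The caller can render `immediate` at once and merge in `deeper` when the future completes, so a cache miss costs extra time only for the background tier.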


ClinicalNet now has a search engine that matches patients with highly relevant clinical trials across over a million unique conditions in 5 seconds or less.

Our overhaul of the existing search system resulted in significant improvements to the relevance of the clinical trial results to the user input queries, while in the general case introducing little to no additional processing time.

On top of that, our method for building the condition trees automatically saved over 1000 hours of manual effort, in addition to automatically incorporating any new conditions.

Through a combination of creating custom Artificial Intelligence models and employing principled software engineering solutions, we were able to solve this difficult problem in an elegant way!

Looking to stay connected?
Get access to the only newsletter you need to implement AI in your business.
