I’m Alfie (more on my bio), Evariste Technologies’ resident medicinal chemist. I decided to write this blog to give a bit of detail on the processes we use when working on projects and how we come up with novel compound designs. What’s described below is the sort of output we generate when working with our partners, written in a way that is hopefully accessible for anyone with a scientific background.
This post is going to detail some of the work behind the contributions Evariste has made (so far) to the Covid Moonshot effort coordinated by PostEra. All the work done here was in a Jupyter notebook using Python 3.7 and a number of open source and proprietary modules.
I won’t give too much background on the Moonshot effort itself, as there are some excellent explanations on the PostEra website and a paper that was published in late 2020. The project is a collaboration between many groups, looking to identify a drug targeting the SARS-CoV-2 main protease (MPro). Suffice to say it’s a worthy effort and one to which Evariste is happy to dedicate some time and effort, as have many others. As for our contribution, this has involved taking the publicly available datasets and running them through Evariste’s hit-to-candidate optimisation platform, Frobenius.
The workflow when using Frobenius is not too hard, but takes some thought if you’re going to get sensible results, especially when working with a dataset composed of multiple series. Outlined below are the basic steps for this project:
1. Decide what data you’re going to use
This is not as easy as it sounds; the old adage of ‘garbage in, garbage out’ is as true as it ever was. Happily, the Moonshot data is really well curated, so it’s very easy to select the compounds and data we think are most useful. In this case, there are two assays measuring the potency of the Moonshot compounds, and they tend to correlate well across most of the dataset.
To build our model, we chose to use the IC50 data generated by the fluorescence assay. There isn’t a brilliant justification for this, other than the fact that a number of compounds which were relatively more potent in the RapidFire assay contained isatin groups. These are quite reactive under a range of conditions, and can also be highly coloured, so we initially focused on the fluorescence data.
We also chose to remove the compounds flagged as acrylamides or chloroacetamides to reduce the number of irreversible inhibitors in the dataset. There’s a significant amount of discussion in the literature around what constitutes the correct way to compare reversible and irreversible inhibitors. I won’t add to that here, other than to say that this assay probably isn’t capturing some of the intricacies of irreversible interactions, so we (initially) chose to remove them from the data. This isn’t a criticism of the approach; quick and dirty was definitely the way to go here. That leaves us with a set of 524 compounds, with a reasonably standard distribution of potency.
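The filtering step itself is simple once the flags are in the dataset. A minimal sketch, assuming the flags live in a boolean column called `covalent_warhead` (the actual Moonshot column names may differ):

```python
import pandas as pd

# Toy stand-in for the Moonshot activity data; the real dataset has many
# more columns, and the flag column name here is an assumption.
df = pd.DataFrame({
    "SMILES": ["CCO", "C=CC(=O)N(C)C", "ClCC(=O)Nc1ccccc1", "c1ccccc1O"],
    "covalent_warhead": [False, True, True, False],
})

# Keep only the (putatively) reversible inhibitors
reversible = df[~df["covalent_warhead"]].reset_index(drop=True)
print(len(reversible))  # 2 of the 4 toy compounds survive
```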
The large number of compounds with pIC50 = 4 are all below the sensitivity of the assay, and it’s actually impossible to represent this data correctly. All we know is that the value is > 50 µM, according to the experimental information in the bioRxiv paper. You can assign practically any value <= 4 to this data, and this is a constant issue when trying to build machine learning models on biological datasets. In this case, 4 is fine. It’s several orders of magnitude lower than the most potent compounds, so the model isn’t going to try and optimise towards any of the molecules found here.
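For anyone unfamiliar with the units, pIC50 is just the negative log of the molar IC50, and the censored values can be clipped to the floor. A quick sketch (the floor of 4 is the value used above):

```python
import math

def pic50(ic50_molar, floor=4.0):
    """Convert a molar IC50 to pIC50, clipping censored values to a floor."""
    value = -math.log10(ic50_molar)
    return max(value, floor)

# A 1 uM compound has pIC50 = 6; anything much weaker hits the floor
# and is reported as 4 regardless of its true (unmeasurable) value.
print(pic50(1e-6))  # 6.0
print(pic50(5e-4))  # 4.0 (true value ~3.3, below the assay floor)
```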
2. Generate some fingerprints
We now have to somehow represent the molecules we’ve chosen in a way that makes sense to a computer. One of the more common ways of doing this is by generating Morgan fingerprints for each molecule. I won’t go into detail about how these work but this nice little function does the job once you’ve converted your SMILES strings into a set of molecules.
```python
from rdkit import Chem

df['mol'] = [Chem.MolFromSmiles(smi) for smi in df['SMILES']]
fps = quick_morgan(df.mol, n_bits=128).toarray().tolist()
```
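`quick_morgan` is one of our helpers and isn’t shown here; a minimal stand-in built on RDKit’s `GetMorganFingerprintAsBitVect`, returning a SciPy sparse matrix so the `.toarray()` call above still works, might look like this (the radius of 2 is an assumption, not necessarily what we used):

```python
import numpy as np
from scipy.sparse import csr_matrix
from rdkit import Chem
from rdkit.Chem import AllChem

def quick_morgan(mols, n_bits=128, radius=2):
    """Morgan fingerprints for an iterable of RDKit mols, as a sparse matrix."""
    rows = []
    for mol in mols:
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        rows.append(np.array(list(fp), dtype=np.uint8))
    return csr_matrix(np.vstack(rows))

mols = [Chem.MolFromSmiles(s) for s in ["CCO", "c1ccccc1"]]
fps = quick_morgan(mols, n_bits=128).toarray().tolist()
print(len(fps), len(fps[0]))  # 2 rows of 128 bits each
```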
If you know a bit about fingerprints, you may notice that we’re using 128-bit rather than 1024-bit fingerprints. I’m in no way an expert on this, but I’m told by Evariste’s mathematicians that 1024-bit fingerprints are not always the best choice. The main problem is that if you don’t have enough compounds in your dataset, any model you build will be prone to overfitting and will generalise poorly. It doesn’t mean you’re doomed to fail, but it does mean you need to think about how best to build the model and test it thoroughly on a sensible out-of-sample set.
The choice of 128 vs 1024 bits has wider implications for Frobenius because we filter our designs by the Jaccard/Tanimoto distance, and two compounds will (almost always) have different similarities when compared using different fingerprints. For the reasons above, we used the 128-bit fingerprints to build our models.
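To see why the bit count changes the similarity, here’s a small pure-Python illustration (toy bit positions, not real fingerprints): folding a long fingerprint down to 128 bits causes bit collisions, which change the Jaccard/Tanimoto value for the same pair of molecules.

```python
def jaccard(a, b):
    """Jaccard/Tanimoto similarity between two sets of on-bit positions."""
    return len(a & b) / len(a | b)

def fold(bits, n_bits):
    """Fold on-bit positions down to a shorter fingerprint (modulo hashing)."""
    return {b % n_bits for b in bits}

# Toy 1024-bit fingerprints, represented as sets of on-bit positions
a = {5, 300, 700}
b = {5, 172, 900}

print(jaccard(a, b))                        # 0.2 at 1024 bits
print(jaccard(fold(a, 128), fold(b, 128)))  # 0.5 after folding: bits 300 and 172 collide
```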
3. Build a model and find your best compounds
If you’ve done everything up until now correctly, this is the easy bit. We built a random forest model, which Frobenius then uses to score every molecule in the dataset and rank them according to their likelihood of leading to a molecule with a pIC50 > 8. Choosing the target you aim for is really important. You obviously have to pick a value which would be a genuine improvement on the vast majority of compounds in the dataset. However, if you aim too high, the model will only explore unknown areas of chemical space where there’s higher error in the predictions. Not a great strategy.
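The scoring step is straightforward in scikit-learn. A minimal sketch with synthetic data standing in for the real fingerprints and potencies (the hyperparameters are illustrative, not the ones Frobenius uses):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-ins for 128-bit fingerprints and measured pIC50 values
X = rng.integers(0, 2, size=(200, 128))
pic50 = rng.uniform(4.0, 9.0, size=200)

# Binarise against the target: did the compound hit pIC50 > 8?
y = pic50 > 8.0

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Score every molecule by its predicted probability of beating the target
scores = model.predict_proba(X)[:, 1]
ranking = np.argsort(scores)[::-1]  # best candidates first
```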
The top two molecules (244 and 448) are the most potent molecules in the dataset (which is a relief!), but also not very useful. 244 is peptidic with a slightly unusual carbon bonded to both a sulfonic acid and an alcohol. For many reasons, this is going to be hard to turn into a drug. 448 is actually an irreversible inhibitor but wasn’t removed from the data in the cleaning step. This is a reasonable starting point, but as I said earlier I’m not 100% sure that the potency of irreversible inhibitors is correctly represented by this assay.
The next four molecules all look reasonable; two of them (384 and 151) are clearly closely related. Digging into the data shows them to be close to equipotent, but 151 is going to be substantially more lipophilic (probably at least 100-fold), so we selected 105, 384, and 291 to be the starting points for further optimisation.
4. Design some new analogues
Frobenius has three design algorithms, all of which are proprietary so we can’t go into too much detail. Suffice to say they cover different areas of chemical space and between them produce about 10^6 molecules for every input. However, there’s no point suggesting things we can’t predict well on and so after filtering for distance from the dataset, we end up with about 10^4 molecules to be scored using the models built earlier.
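The distance filter itself is a simple idea: keep only the designs whose nearest neighbour in the training data is within some Jaccard/Tanimoto distance cutoff, so the model is never asked to score something it has no basis to predict. A toy sketch (the 0.4 cutoff is made up for illustration):

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity, on sets of fingerprint on-bit positions."""
    return 1.0 - len(a & b) / len(a | b)

def within_reach(design, dataset, cutoff):
    """True if the design's nearest training compound is within the cutoff."""
    return min(tanimoto_distance(design, known) for known in dataset) <= cutoff

# Toy fingerprints: sets of on-bit positions
dataset = [{1, 2, 3, 4}, {2, 3, 5, 8}]
designs = [{1, 2, 3, 9}, {10, 11, 12, 13}]

kept = [d for d in designs if within_reach(d, dataset, cutoff=0.4)]
print(len(kept))  # only the first design is close enough to score
```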
The top-scoring designs are then filtered again by a medicinal chemist, and the best of these are submitted to the Moonshot website.
We didn’t include the scores associated with any of these; they were all broadly in the same range of 0.01–0.02. This represents a 1–2% chance of achieving the pIC50 > 8 we specified earlier. This doesn’t sound particularly high, but the model is broadly bullish about the prospects of each of these compounds as starting points. Part of the output of Frobenius is the fraction success score. This is a measure of the likelihood of finding a molecule that meets the desired endpoint/s within a ball of chemical space 0.2 distant from the starting point. The scores here range from 48–83% success, indicating that, based on the data gathered so far, there is a very good likelihood of finding a compound with pIC50 > 8 in each of these series.
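The fraction success calculation itself is proprietary, but the flavour of the idea can be illustrated empirically: look at everything within distance 0.2 of the starting point and ask how often the endpoint is met. A toy version (emphatically not the real computation):

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity, on sets of fingerprint on-bit positions."""
    return 1.0 - len(a & b) / len(a | b)

def fraction_success(start, compounds, endpoint_met, radius=0.2):
    """Toy estimate: of the known compounds within `radius` of the
    starting point, what fraction meet the desired endpoint?"""
    neighbours = [i for i, c in enumerate(compounds)
                  if tanimoto_distance(start, c) <= radius]
    if not neighbours:
        return 0.0
    return sum(endpoint_met[i] for i in neighbours) / len(neighbours)

# Toy fingerprints as sets of on-bit positions
start = {1, 2, 3, 4, 5}
compounds = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 4, 5, 6}]
endpoint_met = [True, True, False]  # e.g. whether pIC50 > 8 was hit

print(fraction_success(start, compounds, endpoint_met))  # 0.5
```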
5. Some tweaks to the model
The highest-scoring changes suggested for 105 were in a limited area of chemical space, and one thing it would be great to change is the very lipophilic ethyl-linked fluorobenzene group, which the model leaves unchanged because it thinks it’s essential for potency. However, there are plenty of fairly potent molecules in the dataset which don’t contain this group, many of which contain irreversible warheads. We cleaned these molecules out of the dataset due to the potentially very different mechanism of action. However, that biases the model, particularly with respect to this series.
If we add these molecules back in and re-run the notebook, the designs are slightly different, albeit with broadly similar scores of 1–2%. These were submitted separately.
Unfortunately, not all of these are obviously going to reduce the logP; if we were to investigate that thoroughly, we’d need to add it as a parameter for Frobenius to optimise towards. This is completely doable but not a focus of the project at the minute. Running this new model with the other starting points resulted in a similar set of suggestions, with some changes in rank order, so we didn’t update with new suggestions.
We also decided to make some suggestions based on 151. There’s clearly scope to reduce the lipophilicity and, once the models have been built, running Frobenius is trivial. These designs were actually our most successful so far. Three of them were predicted to have a 7% chance of meeting the criteria and all of the others were > 2%. This poses an interesting problem because there are tens (if not hundreds) of analogues that could be submitted. Future design ideas will be scored by synthetic accessibility, as assessed by a number of different computational methods.
6. Scoring all the submissions
The PostEra website lets you download a CSV of all the submissions so far (well done them, for this and the design of the Moonshot interface generally, which is extremely intuitive). There are about 14,000 molecules in this dataset once you remove everything that has been tested. A summary of these scores is below:
• The top molecules (other than our submissions) are, unsurprisingly, analogues of the top compounds picked out by Frobenius. One is a close relative of 244, and two are closely related to 291.
• Frobenius scores its own designs quite highly; almost all the compound designs we submitted are in the top 1%. This isn’t really a fair comparison, because many of the other suggestions were generated using structure-based design, which is (somewhat) ab initio and so allows more accurate predictions further from the dataset. Those designs are a long way from anything our model has seen before, so it essentially predicts the mean value with significant error bars, resulting in a low score overall.
• There are a number of highly scoring molecules designed by Frobenius that we didn’t submit because they had already been suggested by others.
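Checking where the submissions land in the overall score distribution is a couple of lines of Python once the scores are in hand. A sketch with made-up scores and names (the real list has ~14,000 rows):

```python
# Hypothetical scores keyed by submission ID, purely for illustration
scores = {"our_design": 0.018, "other_a": 0.002, "other_b": 0.009,
          "other_c": 0.001, "other_d": 0.015}

# Rank descending and compute each molecule's percentile position
ranked = sorted(scores, key=scores.get, reverse=True)
percentile = {name: 100 * (i + 1) / len(ranked) for i, name in enumerate(ranked)}

print(ranked[0], percentile["our_design"])  # our_design sits in the top 20%
```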
Hopefully this was a legible and vaguely engaging description of how we go from a dataset through to a series of novel designs. I’ve tried to acknowledge the PostEra team and other contributors to the Moonshot throughout, but it really should be emphasised what an excellent job they’ve done of coordinating and accelerating this effort. For IP reasons I haven’t been able to discuss a lot of the nuts and bolts of the modelling and statistical analysis, but if anyone has any questions, feel free to get in touch with us, as we’d be more than happy to chat!