Tuesday, February 28, 2017

Biotechnology and drug-design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal Chemical Science. It's ultimately related to biotechnology and drug design, so first some background.

Background
Most biotechnology and medicine involves proteins in some way. Many diseases involve mutations that alter the function of proteins, most drugs are molecules that bind to proteins and inhibit their actions, and a large part of industrial biotechnology involves making new types of proteins such as enzymes. As with everything else, it is easier to deal with proteins if you know what they look like, but protein structure determination can be very difficult and we don't know what many important proteins actually look like.

The most popular way of determining protein structure is a technique called x-ray crystallography, where you basically take an x-ray of a crystal made from the protein. Unfortunately, it can be very difficult or impossible to grow crystals of some proteins, and if you can't get the protein to crystallise you can't use x-ray crystallography to find the structure.  The other main way of determining protein structure is a technique called NMR spectroscopy, where you basically take an MRI of a solution containing the protein. The advantage is that there is no need for crystallisation, but the disadvantage is that it is difficult to extract enough information from the "NMR-MRI" to get a good structure.

The "NMR-MRI" of a protein actually provides a unique fingerprint of each protein so in principle all one has to do is generate a lot of possible structures of a protein, compute the NMR fingerprint for each, and compare to the measured fingerprint. The structure with the best fingerprint match should be the correct protein structure.  The questions are how to best generate the structure and how to best predict the NMR fingerprint using the structure.
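The generate-and-score idea described above can be sketched in a few lines of Python. This is only an illustration of the idea, not our actual code: `predict_shifts` stands in for whatever NMR prediction method is used, and a simple RMSD is used as the match score (real protocols typically use a more sophisticated measure).

```python
import math

def rmsd(predicted, measured):
    """Root-mean-square deviation between predicted and measured chemical shifts."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured))

def best_structure(candidates, measured_shifts, predict_shifts):
    """Return the candidate structure whose predicted NMR fingerprint
    best matches the measured one."""
    return min(candidates, key=lambda s: rmsd(predict_shifts(s), measured_shifts))
```

With millions of candidate structures one would of course batch and parallelise this, but the scoring logic stays the same.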

The New Study
In 2015 we published a new method for predicting NMR fingerprints and in the paper that just got published we combined it with a method for generating a lot of protein structures. We started with known x-ray structures and generated millions of relatively small variations of the structure and found the structure with the best match.  We started from a known structure to answer the question: what is the best match we can hope for? The answer is: not perfect but good enough.  Now that we know this the next step will be to start with a structure we know is wrong and see if the program can find the right structure.  Also, our NMR fingerprint method does not generate fingerprints for all parts of the protein so we need to improve the model as well.

Monday, February 13, 2017

What are the most important reactions in drug synthesis and stability?

Synthesis
1. Aromatic electrophilic substitution of heteroaromatics
2. Suzuki coupling of heteroaryl halides
3. Diels-Alder reaction (of what)?
4. Michael reaction (of what)?
5. Friedel-Crafts alkylation (of what)?
6. Nazarov cyclization reaction (of what)?
7. ...

Stability (probably solvolysis and oxidative degradation, but can we be more specific?)
1. Autooxidation of C-H bonds?
2. Ester hydrolysis?
3. ..

Sunday, February 5, 2017

Preprints and the speed of publishing

When I talk about preprints with colleagues some of them say "Oh, what's the rush?" or "Publishing is so fast these days. Why, my last paper was online 6 weeks after submission don't you know" and then go on to clean their pipe with a thoughtful smile or get up to stoke the coal fire.

I'm currently writing a couple of proposals, and as I was updating the reference list on one of them I noticed that two preprints I cited apparently still haven't been "published" by a journal.  One of them first appeared on arXiv on October 7th and the other on October 27th.

Both papers are important to the proposal in the sense that they changed my thinking on what is possible and I think the proposal would be less ambitious if I hadn't read them. So I am very happy these authors chose to deposit them as preprints.  I wish more people would do this and I wish fewer journals/journal editors would stand in the way of them doing so.

Saturday, January 28, 2017

Drug design: My latest paper explained without the jargon

Our latest paper has just appeared in the Journal of Physical Chemistry A.  If you don't have access to this journal you can find a free version of an earlier draft here. It's ultimately related to making better drugs, so first some background.

Background
Designing new drugs currently involves a lot of trial-and-error, so you have to pay a lot of smart scientists a lot of money for a long time to design new drugs - a cost that is ultimately passed on to you and me as consumers.  There are many, many reasons why drug design is so difficult. One of them is that we often don't know fundamental properties of drug candidates such as the charge of the molecule at a given pH. Obviously, it is hard to figure out whether or how a drug candidate interacts with the body if you don't even know whether it is positive, negative or neutral.

It is not too difficult to measure the charge at a given pH, but modern-day drug design involves the screening of hundreds of thousands of molecules and it is simply not feasible to measure them all. Besides, you have to make the molecules to do the measurement, which may be a waste of time if they turn out to have the wrong charge. There are several computer programs that can predict the charge at a given pH very quickly, but they have been known to fail quite badly from time to time.  The main problem is that these programs rely on a database of experimental data, and if the molecule of interest doesn't resemble anything in the database this approach will fail.
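For a single basic site, the link between pKa and charge at a given pH is just the Henderson-Hasselbalch relation. A minimal illustration (this is a textbook formula, not our method, which has to handle molecules with several interacting sites):

```python
def fraction_protonated(pka, ph):
    """Henderson-Hasselbalch: fraction of a basic site that is protonated
    (i.e. carries a +1 charge) at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

# An amine with pKa 9.4 at physiological pH 7.4 is ~99% protonated,
# so the molecule is essentially positively charged:
print(fraction_protonated(9.4, 7.4))  # ~0.99
```

The hard part, of course, is predicting the pKa itself, which is what the programs mentioned above (and our method) do.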

Last year we developed a "new" method for predicting the charge of a molecule that relies less on experimental data but is fast enough to be of practical use in drug design. We showed that the basic approach works reasonably well for small prototypical molecules, and we even tested one drug-like molecule where one of the commercial programs fails and showed that our new method performs better (but not great).

The New Study
We test the method on 48 drug molecules and show that it works reasonably well.  It is not quite as accurate as the methods that rely on experimental data, but this is probably because many of the molecules we test are in the databases that those programs use.  But we felt we had to test these molecules first because they are some of the first molecules other users will try when testing the method. The next step is to test the method on molecules where some of the existing methods perform poorly. We also have to think about how best to make this method available to researchers who are actually doing the drug design.

Saturday, January 21, 2017

Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods

Art Winter tweeted this paper by Morten Jørgensen and co-workers last year and I decided to see if semi-empirical methods could help here.  The paper uses ChemDraw's chemical shift predictor to predict where a bromine atom will be added to a heteroaromatic molecule using electrophilic aromatic substitution reactions. They tested this on 132 different compounds and achieved an 80% success rate, which is very good.

Googling a bit led me to this paper by Wang and Streitwieser, where they show a correlation between the rate of electrophilic aromatic substitution reactions and the lowest proton affinity of the protonated species.  This suggests that the protonated carbon with the lowest proton affinity (or pKa if solvent is included) should be the reacting carbon.  So I tested this using semi-empirical QM methods for these 132 compounds.  When I say "I" I should say that +Jimmy Charnley Kromann ran many of the calculations and Monika Kruszyk provided most of the structures as ChemDraw files, which I could convert to SMILES strings using OpenBabel. These are preliminary results and may contain errors.
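The selection rule itself is trivial once the pKa's are in hand: protonate each candidate carbon, compute the pKa of the resulting species, and predict the carbon with the lowest value as the reacting site. A sketch, assuming a hypothetical dict mapping atom index to computed pKa (the actual values come from the QM calculations):

```python
def predicted_bromination_site(pka_by_atom):
    """Predict the electrophilic aromatic substitution site as the atom
    whose protonated form has the lowest computed pKa."""
    return min(pka_by_atom, key=pka_by_atom.get)

# Toy example: atom 4 has the lowest pKa, so it is the predicted site.
print(predicted_bromination_site({2: 4.1, 4: 2.7, 6: 5.0}))
```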

The reactions for the 132 compounds are not all run in the same solvent, so I first tested gas phase, chloroform (i.e. dielectric 4.8) and DMF (dimethylformamide, dielectric 37) using PM3 and COSMO in MOPAC. I chose PM3/COSMO because that gave the best results in a previous pKa study. The most representative choice of solvent seems to be chloroform, where PM3/COSMO predicted the correct bromination site in 95% of the cases, i.e. it fails for 7 cases. Gas phase and DMF fail for 14 and 8 cases, so it's important to include solvent, but the value of the dielectric constant is not all that important.  Using chloroform as a solvent, I then tested AM1, PM6, PM6-DH+, PM7 and DFTB3/SMD (using GAMESS for the last one), which resulted in 12, 12, 12, 9, and 13 wrong predictions. One of the compounds includes an Si atom, which the DFTB3 parameter set I used couldn't handle, so the 13 wrong predictions are out of 131 compounds.  Anyway, PM3/COSMO/chloroform works best.
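The percentages above follow directly from the failure counts (out of 132 compounds, or 131 for DFTB3 because of the Si-containing one):

```python
def success_rate(wrong, total=132):
    """Percent correct predictions, rounded to the nearest integer."""
    return round(100 * (total - wrong) / total)

print(success_rate(7))        # PM3/COSMO/chloroform: 95
print(success_rate(14))       # gas phase: 89
print(success_rate(13, 131))  # DFTB3/SMD: 90
```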

In some cases the lowest pKa value is quite close to some of the other pKa values, so I took an approach similar to that of Jørgensen and co-workers: if the correct bromination site is included in the set of atoms with pKa values within 0.74 pH units (corresponding to 1 kcal/mol at room temperature) then I counted it as correct.  For PM3/COSMO/chloroform this occurred 10 times. In 9 cases the set included 2 atoms and in 1 case, 3 atoms.  In one of the 9 cases (15) there are only two possible bromination sites, so this case is not a successful prediction and PM3/COSMO/chloroform actually gets 8 wrong. However, in all other cases there are more possibilities than those predicted. Furthermore, in all but 2 of these 10 cases the atom with the lowest pKa is the "correct" atom.
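The cutoff of roughly 0.74 pH units comes from ΔG = RT ln(10) × ΔpKa: at room temperature RT ln(10) ≈ 1.36 kcal/mol, so 1 kcal/mol corresponds to about 0.73-0.74 pKa units. Collecting the near-degenerate sites is then straightforward (a sketch, assuming a hypothetical dict mapping atom index to computed pKa):

```python
import math

R = 1.987e-3   # gas constant, kcal/(mol K)
T = 298.15     # room temperature, K

# pKa units corresponding to 1 kcal/mol: ~0.73
WINDOW = 1.0 / (R * T * math.log(10))

def near_degenerate_sites(pka_by_atom, window=WINDOW):
    """All atoms whose computed pKa lies within `window` of the minimum."""
    lowest = min(pka_by_atom.values())
    return {atom for atom, pka in pka_by_atom.items() if pka - lowest <= window}

# Toy example: atoms 4 and 6 are within 1 kcal/mol of each other.
print(near_degenerate_sites({2: 4.1, 4: 2.7, 6: 3.2}))
```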

Bromination, or more generally halogenation, is often a first step towards adding an aryl group, usually via a Suzuki reaction.  Often there is more than one halogen of the same type, so there is also interest in predicting where the aryl group will go.  I tried the PM3/COSMO/chloroform approach on the six molecules in this paper by Houk and co-workers. Computing pKa's of the halogenated carbon atoms led to correct predictions in 4 of the 6 cases, while computing proton affinities of the carbon atoms in the non-halogenated parent compounds led to correct predictions in 2 of the 6 cases. The former approach seems promising but needs to be tested on a much larger set of molecules.

The next step is to write this up and get the set-up and analysis code into a shape we can distribute. I've also started thinking about how to make the approach more generally available and usable for non-experts. A grant proposal is also in the works, so if we're successful that should definitely be possible to achieve.

Monday, January 16, 2017

Open access chemistry publishing options in 2017

I just noticed that my go-to journal increased its APC again.*  Now there's a flat fee of $1095, so I am re-evaluating my options for impact-neutral OA publishing. I don't think PeerJ is greedy, so I think the most likely explanation is that their old model was not sustainable. I now feel I have been a bit too hard on some other OA publishers (e.g. here and here, but not here). While price and impact-neutrality are the main considerations, open peer review is a nice bonus that I became accustomed to from PeerJ. In my experience it makes for much better reviews and keeps the tone civil.

Impact neutral journals

$0 Royal Society Open Science. Still has an APC waiver and open peer review. (The RSC manages "the journal’s chemistry section by commissioning articles and overseeing the peer-review process")

€750 Research Ideas and Outcomes (disclaimer: I am subject editor). Open peer review.

$1000 F1000Research. Open peer review.

$1095 PeerJ. Open peer review.

$1350 Cogent Chemistry. Has a "pay what you can" policy. Closed peer review. HT +Stephan P. A. Sauer

$1495 PLoS ONE. Closed peer review.

$1675 Scientific Reports. Closed peer review.

$2000 ACS Omega. Price for CC-BY by ACS member ($140/year). Closed peer review.

So it looks like Royal Society Open Science is the next thing for me to try, as long as the APC waiver is in place.

Free or reasonably priced journals that judge perceived impact

$0 Chemical Science. Closed peer review.

$0 Beilstein Journal of Organic Chemistry. Closed peer review. HT +Wendy Patterson

$0 Beilstein Journal of Nanotechnology. Closed peer review. HT +Wendy Patterson

$500 ACS Central Science. Price for CC-BY by ACS member ($140/year). Closed peer review.

£500 RSC Advances. Closed peer review. (Normally £750)

Let me know if I have missed anything.

Last update: 2017.03.05

*I just noticed that the membership model still exists though the price has increased. I already have a premium membership, so this may still be a viable option for me. If you are a single author or have only one co-author this is still the way to go.

Sunday, January 15, 2017

Making your computational protocol available to the non-expert

I recently read this paper by Jonathan Goodman and co-workers which I learned about through this highlight by Steven Bachrach.  The DP4 method is a protocol for computing chemical shifts of organic molecules using DFT and comparing the chemical shifts to experimental values.  This paper automates the method, switches to free software packages (NWCHEM instead of Gaussian and TINKER instead of Macromodel), and tests the applicability for drug like molecules.  The python and Java code is made available on Github under the MIT license.

The method is clearly aimed at organic chemists who use NMR to figure out what they made or isolated. Let's say they want to try DP4 to see how well it works on some molecule they are currently working on.

What's needed to get started
1. Access to a multicore Linux computer.  The method requires quite a few B3LYP/6-31G(d,p) NMR calculations and, given the typical size of organic molecules, it will probably not be practical to even test this method on a desktop computer.  Even if it is, the instructions for PyDP4 assume you are using Linux, so you'd have to somehow deal with that if you, like many, have a Windows machine.

2. Installation. You have to install NWCHEM, Tinker, OpenBabel and configure PyDP4.

3. Coordinates. PyDP4 requires an sdf file as input.  You have to figure out what that is and how to make one.

4. Familiarity with Linux.  All this assumes that you are familiar with Linux. How many synthetic organic chemists are?

If you'll be using DP4 a lot, all of this may be worth doing but perhaps not just to try it?  If you don't have access to a Linux cluster, buying one for the occasional NMR calculation may be hard to justify. If one is convinced/determined enough, the most likely solution would probably be to find and pay an undergrad to do all this using an older computer you were gonna throw out anyway.  Or maybe your department has a shared cluster and a sysadmin who could handle the installation.

Alternative 1: Web server
One alternative is to make DP4 available as a web server, where the user can upload the sdf file and other data.  If one includes a GUI all 4 problems are solved ... for the user.  The problem for the developer is that this could eat up a lot of your own computational resources. One could probably do something smart to only use idle cycles, but the best case scenario (lots of users) also becomes the worst case scenario.  Perhaps there's a way to crowdsource this?

Alternative 2: Virtual machine
Another alternative is to make DP4 available as a virtual machine (VM).

This mostly solves the installation issue. The main problem here is that the user still needs to find a reasonably powerful computer to run this on. The other problem is that the developer needs to test the VM installation on various operating systems and keep it up to date as new ones appear. Perhaps there's a way to crowdsource all this?

Alternative 3: Amazon Web Services or Google Compute Engine
Another alternative is to make DP4 available as a VM image for AWS or GCE.  This mostly solves the CPU and installation issues. The user creates an AWS or GCE account, imports the VM image, and then pays Amazon or Google for computer time using a credit card. For reasonably sized molecules the cost would probably be less than $10/molecule as far as I can tell. I don't have any direct experience with AWS or GCE so I don't know how slick the interface can be made. All examples I have seen have involved ssh to the AWS/GCE console, so some Linux knowledge is required.

Alternative 4: AWS/GCE-based web server
Another alternative is to combine 2 and 3. The problem here is how to bill the individual user for their CPU usage. There are probably ways to do this, but it's starting to sound like a lot of work to set up and manage. Perhaps by adding a surcharge one could pay someone to handle this on a part-time basis. Perhaps existing companies would be interested in offering such a service?

Licensing issues
As far as I can tell the licenses of NWCHEM, TINKER, and OpenBabel allow for all four alternatives.

The bigger issue
A key step in making a computational chemistry-based method such as DP4 usable to the non-expert is clearly automation and careful testing. Another is using free software (I have access to Gaussian but I am not going to buy Macromodel just to try out DP4!). Kudos to Goodman and co-workers for doing this. But if we want to target the non-experts, I think we should try to go a bit further. One could even imagine something like this in the impact/dissemination section of a proposal:

The computational methodology is based on free software packages, and the code needed for automation and analysis that is written as part of the proposed work will be made available on Github under an open source license. Furthermore, Company X will make the approach available on the AWS cloud computing platform, which will allow the non-expert to use the approach without installation or investment in in-house computational resources and greatly increase usage. Company X handles the set-up, billing for on-demand CPU-time, usage statistics, and provides a rudimentary GUI for the approach for a one-time fee of $2000, which is included in the budget.
Anyway, just some thoughts.  Have I missed other ways of getting a relatively CPU-intensive computational chemistry method in the hands of non-experts?