## Monday, January 16, 2017

### Open access publishing options in 2017

I just noticed that my go-to journal increased its APC again.  Now there's a flat fee of $1095 so I am re-evaluating my options for impact-neutral OA publishing. I don't think PeerJ is greedy, so I think the most likely explanation is that their old model was not sustainable. I now feel I have been a bit too hard on some other OA publishers (e.g. here and here, but not here). While price and impact-neutrality are the main considerations, open peer review is a nice bonus that I became accustomed to from PeerJ. In my experience it makes for much better reviews and keeps the tone civil.

$0. Royal Society Open Science still has an APC waiver and open peer review.

€750. Research Ideas and Outcomes (disclaimer: I am subject editor), open peer review.

$1000 F1000Research. Open peer review.

$1095 PeerJ. Open peer review.

$1495 PLoS ONE. Closed peer review.

$1675 Scientific Reports. Closed peer review.

$2000 ACS Omega. Price for CC-BY by ACS member ($140/year). Closed peer review.

So it looks like Royal Society Open Science is the next thing for me to try, as long as the APC waiver is in place.  I should also say that I had a very good experience publishing in Chemical Science ($0, closed peer review) recently, but not all my papers are appropriate for that journal. Similarly, the cost for publishing in ACS Central Science under the CC-BY licence is $500 for non-members.

Let me know if I have missed anything.

## Sunday, January 15, 2017

### Making your computational protocol available to the non-expert

I recently read this paper by Jonathan Goodman and co-workers which I learned about through this highlight by Steven Bachrach.  The DP4 method is a protocol for computing chemical shifts of organic molecules using DFT and comparing the chemical shifts to experimental values.  This paper automates the method, switches to free software packages (NWCHEM instead of Gaussian and TINKER instead of Macromodel), and tests the applicability for drug-like molecules.  The Python and Java code is made available on GitHub under the MIT license.

The method is clearly aimed at organic chemists who use NMR to figure out what they made or isolated. Let's say they want to try DP4 to see how well it works on some molecule they are currently working on.

What's needed to get started
1. Access to a multicore Linux computer.  The method requires quite a few B3LYP/6-31G(d,p) NMR calculations and, given the typical size of organic molecules, it will probably not be practical to even test this method on a desktop computer.  Even if it is, the instructions for PyDP4 assume you are using Linux, so you'd have to somehow deal with that if you, like many, have a Windows machine.

2. Installation. You have to install NWCHEM, Tinker, OpenBabel and configure PyDP4.

3. Coordinates. PyDP4 requires an sdf file as input.  You have to figure out what that is and how to make one.

4. Familiarity with Linux.  All this assumes that you are familiar with Linux. How many synthetic organic chemists are?

If you'll be using DP4 a lot, all of this may be worth doing but perhaps not just to try it?  If you don't have access to a Linux cluster, buying one for the occasional NMR calculation may be hard to justify. If one is convinced/determined enough, the most likely solution would probably be to find and pay an undergrad to do all this using an older computer you were gonna throw out anyway.  Or maybe your department has a shared cluster and a sysadmin who could handle the installation.

Alternative 1: Web server
One alternative is to make DP4 available as a web server, where the user can upload the sdf file and other data.  If one includes a GUI all 4 problems are solved ... for the user.  The problem for the developer is that this could eat up a lot of your own computational resources. One could probably do something smart to only use idle cycles, but the best case scenario (lots of users) also becomes the worst case scenario.  Perhaps there's a way to crowdsource this?

Alternative 2: VM Virtual box
Another alternative is to make DP4 available as a virtual machine (VM).

This mostly solves the installation issue. The main problem here is that the user still needs to find a reasonably powerful computer to run this on. The other problem is that the developer needs to test the VM installation on various operating systems and keep it up to date as new ones appear. Perhaps there's a way to crowdsource all this?

Alternative 3: Amazon Web Services or Google Compute Engine
Another alternative is to make DP4 available as a VM image for AWS or GCE.  This mostly solves the CPU and installation issues. The user creates an AWS or GCE account, imports the VM image, and then pays Amazon or Google for computer time using a credit card. For reasonably sized molecules the cost would probably be less than $10/molecule as far as I can tell. I don't have any direct experience with AWS or GCE so I don't know how slick the interface can be made. All examples I have seen have involved ssh to the AWS/GCE console, so some Linux knowledge is required.

Alternative 4: AWS/GCE-based Web server
Another alternative is to combine 1 and 3. The problem here is how to bill the individual user for their CPU usage. There are probably ways to do this but it's starting to sound like a lot of work to set up and manage. Perhaps by adding a surcharge one could pay someone to handle this on a part-time basis. Perhaps existing companies would be interested in offering such a service?

Licensing issues
As far as I can tell the licenses of NWCHEM, TINKER, and OpenBabel allow for all 4 alternatives.

The bigger issue
A key step in making a computational chemistry-based method such as DP4 usable to the non-expert is clearly automation and careful testing. Another is using free software (I have access to Gaussian but I am not going to buy Macromodel just to try out DP4!). Kudos to Goodman and co-workers for doing this. But if we want to target the non-experts, I think we should try to go a bit further. One could even imagine something like this in the impact/dissemination section of a proposal:

"The computational methodology is based on free software packages and the code needed for automatisation and analysis, which is written as part of the proposed work, will be made available on GitHub under an open source license. Furthermore, Company X will make the approach available on the AWS cloud computing platform, which will allow the non-expert to use the approach without installation or investment in in-house computational resources and greatly increase usage. Company X handles the set-up, billing for on-demand CPU time, usage statistics, and provides a rudimentary GUI for the approach for a one-time fee of $2000, which is included in the budget."

Anyway, just some thoughts.  Have I missed other ways of getting a relatively CPU-intensive computational chemistry method in the hands of non-experts?

## Saturday, January 7, 2017

### Planned papers for 2017

A year ago I thought I'd probably publish three papers in 2016:

Listed as probable in 2016
1. Benchmarking of PM6 and DFTB3 for barrier heights computed using enzyme active site models.
2. pKa prediction using PM6 - part 1
3. Protein structure refinement using ProCS15 - starting from x-ray structure

and this basically turned out to be correct, as you can see from the links, except that paper number 3 is officially published in 2017 because Chemical Science still uses issues. So I will have to list it as a 2017 paper, meaning I published two papers in 2016.  Not my best year.

Here's the plan for 2017

Accepted
1. Protein structure refinement using a quantum mechanics-based chemical shielding predictor
2. Prediction of pKa values for drug-like molecules using semiempirical quantum chemical methods

Probable
3. Intermolecular Interactions in the Condensed Phase: Evaluation of Semi-empirical Quantum Mechanical Methods
4. Fast Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods
5. Benchmarking cost vs. accuracy for computation of NMR shielding constants by quantum mechanical methods
6. Improved prediction of chemical shifts using machine learning
7. PM6 for all elements in GAMESS, including PCM interface

Probably not in 2017
8. Protonator: an open source program for the rapid prediction of the dominant protonation states of organic molecules in aqueous solution
9. pKa prediction using semi-empirical methods: difficult cases
10. Prediction of C-H pKa values and homolytic bond strengths using semi-empirical methods
11. High throughput transition state determination using semi-empirical methods

### Reviews of Prediction of pKa values for drug-like molecules using semiempirical quantum chemical methods

I have been remiss in posting reviews of my papers. I submitted the paper to Journal of Physical Chemistry A on November 2, 2016, received first round of reviews November 29, and second round of reviews December 12.  The paper was accepted January 5, 2017 and has appeared online.

Round 1

Reviewer: 1

Recommendation: This paper is not recommended because it does not provide new physical insights.

This is an interesting study on very important subject - prediction of pKa for drug-like molecules. Standard free energy of a molecule is determined as the sum of heat of formation/electronic energy and solvation free energy and these terms are obtained by various semiempirical QM (SQM) methods and two continuous solvent models. Author used SQM methods as a black box and compared them on the basis of their performance to predict pKa. This is, however, not justified since the SQM methods used described differently system under study. For example, PM6-DH+ describes well H-bonding and dispersion energy contrary to e.g. PM3 and AM1. Consequently, structures stabilized by H-bonding and dispersion will be described much better by the former method. Further, PM7 was parametrized to cover dispersion in core parametrization, contrary to PM6 (and PM3) where it should be included a posteriori by e.g. DH+ term. Consequently, PM7 should be also better suited than, e.g. PM6. The question arises how good those methods work and here performance of these methods should be compared with some higher-level method like DFT.

Further, SQM methods were in the last 5 years already used for protein - ligand interactions but these papers were not mentioned at all.

On the basis of above-mentioned arguments I cannot recommend the paper for publication in JPC.

Reviewer: 2

Recommendation: This paper is publishable subject to minor revisions noted.  Further review is not needed.

This is simply excellent work on an important topic. The only thing is that the author could put the importance of his work in an even greater perspective. Semi-empirical methods are becoming increasingly important also in materials science and the pKa is of high importance also in this field, as it is a good indicator of general chemical stability (like it is used in organic chemistry) of molecular (especially organic) materials for technical applications. A recent example is the search for new organic electrolyte solvents for Lithium-air battery devices, where current design principles strongly rely on pKa values (see for instance http://pubs.rsc.org/en/Content/ArticleLanding/2015/CP/C5CP02937F#!divAbstract ).

Round 2

Reviewer: 1

Recommendation: This paper is not recommended because it does not provide new physical insights.

Since the ms was not modified according my comments I cannot recommend it for publication.

Reviewer: 3

Recommendation: This paper is publishable subject to minor revisions noted.  Further review is not needed.

This paper evaluates a number of semi-empirical quantum mechanical (SQM) methods for their suitability in calculating the pKa’s of amine groups in drug-like molecules, with the hope that these methods can be used for high-throughput screening.  This paper is suitable for publication in the special issue, subject to minor revision.

(a) The paper shows that pKa’s calculated by some SQM methods is sufficiently accurate for high-throughput screening.

(b) Indicate the accuracy of related QM calculations (e.g. Eckert and Klamt) and the relative cost of QM vs SQM calculations (order of magnitude will do)

(c) How much better is the SQM approach than the empirical methods cited by the author? (add a comparison in the tables)

(d) The need for 26 reference compounds for 53 amine groups in 48 molecules is disturbingly high (so much so that the null hypothesis has errors only a factor of 2 larger than the best results). What are the errors in the SQM calculated pKa’s if a much smaller number of reference compounds are used? (e.g. 6 or less)  If the errors are acceptable, this could make it possible to automate the procedure so that it could be used to screen larger sets of molecules extracted from typical industrial databases (10,000 – 10,000,000 compounds).

### Reviews of Protein structure refinement using a quantum mechanics-based chemical shielding predictor

I have been remiss in posting reviews of my papers. I received this review on November 11, 2016 of a manuscript I submitted to Chemical Science on September 29, 2016.  The paper was accepted November 17 and has appeared online.

REVIEWER REPORT(S):
Referee: 1

Recommendation: Accept

Review of 'Protein Structure Refinement Using a Quantum Mechanics-Based
Chemical Shielding Predictor'

The authors present a method to refine protein structures with respect to
chemical shifts evaluated by their QM-based ProCS15 method. First applications
to a set of different protein structures showed that small structural changes
lead to a significant reduction of the RMSD.

Empirical methods to predict NMR shifts have shown to be able to deliver
results that correlate well with experimental at almost no computational cost,
in particular in comparison with quantum chemical methods. However, these
methods are also insensitive with respect to structural changes of the
molecular structure. In this work, the authors analyse their empirical ProCS15
method, which is parametrized based on quantum chemical reference calculations,
with respect to structural changes in the molecular geometry. First examples
show that their method has a similar high sensitivity with respect to structure
changes as quantum chemical methods. The results indicate that ProCS15 can
hold a 'predictive power' beyond previous empirical methods, i.e., in
applications to more exotic molecular geometries and conformations.

The manuscript is well written and of appropriate length, and certainly of
great interest for the readers of Chemical Science. The presented applications
have been thoroughly analyzed and results are well outlined for the reader.
Since I've have only a few comments/suggestions, no further revision prior to
publication is necessary. However, I would strongly suggest to consider my
suggestion on the ordering of sections (see below).

+ My main point is actually regarding to the order of sections in the
manuscript.  Since the different methods used are constantly refered to in
the result-section, I would recommend to first outline the
theory/computational methodology and then present the results of the
illustrative calculations on the test systems.

+ In the summary, the authors mention that their method might be used to
improve the accuracy of QM or QM/MM calculations of NMR chemical shifts.
It is certainly difficult to judge the quality of the ProCS15-optimized
structures objectively, i.e., without refering to secondary properties like
NMR shifts. However, it would be interesting to see the impact of the
structural changes in quantum chemical calculations.
This point might be beyond the scope of this work, but is certainly worthwile
to be considered by the authors as a future project.

+ Just a comment on the DFT-based reference calculations used to parametrize
the ProCS15 method: It might be worthwile considering the use of the KT2
functional by Keal and Tozer [JCP 119, 3015 (2003)] and the basis sets
pcS-x/pcSseg-x by Frank Jensen [JCTC 4, 719 (2008):JCTC 10, 1074 (2014)].
Both functional and basis sets are optimized for NMR chemical shift
calculations. A benchmark of those method was done by Flaig et al. [JCTC 10,
572 (2014)].

## Thursday, December 15, 2016

### The Open Access Reviewer Rewards Program or "Reviews for OA"

It's almost New Year so I'll soon get e-mails from journals saying how much they appreciate my reviews.  I might even get advertisement material thinly disguised as a calendar.

I've decided I want something more. Well, different - they can keep the calendar and they can keep the emails.  I want a partial APC voucher that I can use to publish open access in the journal. And I mean "CC-license open access", not your "pay-us-but-we-own-your-work-anyway" license.

How big a voucher, you ask?  I would say 5-10% of the CC-BY APC is reasonable.  Certainly, if I have reviewed 20 papers for a journal, I should be able to publish an accepted paper as OA free of charge there.

Obviously, it will take time to implement such a scheme.  I'll give them a year.  In 2018 I'll start saying "no" to journals that don't offer some kind of scheme like this. Or give them another year, I dunno.

And just as obviously, this won't happen by itself. Here's an example of my reply to the usual post-review thank-you email:

Dear Gus

You’re very welcome.  As you know reviewing takes a lot of time.  Would it be possible for JCTC to reward reviewers with a partial APC voucher along the lines described here: http://proteinsandwavefunctions.blogspot.dk/2016/12/the-open-access-reviewer-rewards.html?  This would be a tangible demonstration of how much you value your reviewers and increase open access publication, which is good for science.  If you like the idea perhaps you could pass this suggestion along to the ACS.

Best regards, Jan

2017.01.01 Update
Journal/publishers who do something similar (may not be current)
Announcing a New Benefit for PeerJ Peer Reviewers  (HT @chanin_nanta)
Reviewing for MDPI Just Became More Rewarding (HT @chanin_nanta)
Reviewer Discount for BMC journals (HT @chanin_nanta)

To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighbouring rights to this work.

## Thursday, November 24, 2016

### Which method is more accurate? or Errors have error bars

2017.01.10 update: this blogpost is now available as a citeable preprint

This post is my attempt at distilling some of the information in two papers published by Anthony Nicholls (here and here). Anthony also very kindly provided some new equations, not found in the papers, in response to my questions.

Errors also have error bars
Say you have two methods, $A$ and $B$, for predicting some property and you want to determine which method is more accurate by computing the property using both methods for the same set of $N$ different molecules for which reference values are available. You evaluate the error (for example the RMSE) of each method relative to the reference values and compare. The point of this post is that these errors have uncertainties (error bars) that depend on the number of data points ($N$, more data less uncertainty) and you have to take these uncertainties into consideration when you compare errors.

The most common error bars reflect 95% confidence and that's what I'll use here.

The expressions for the error bars assume a large $N$, where in practice "large" in this context means roughly 10 or more data points.  If you use fewer points or would like more accurate estimates please see the Nicholls papers for what to do.

Root-Mean-Square-Error (RMSE)
The error bars for the RMSE are asymmetric.  The lower and upper error bars on the RMSE for method $X$ $(RMSE_X)$ are
$$L_X = RMSE_X - \sqrt {RMSE_X^2 - \frac{{1.96\sqrt 2 RMSE_X^2}}{{\sqrt {N - 1} }}}$$
$$= RMSE_X \left( 1- \sqrt{ 1- \frac{1.96\sqrt{2}}{\sqrt{N-1}}} \right)$$

$$U_X = RMSE_X \left( \sqrt{ 1+ \frac{1.96\sqrt{2}}{\sqrt{N-1}}}-1 \right)$$
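
As a rough illustration, the two RMSE expressions above can be evaluated with a few lines of Python. This is only a sketch: the function names `rmse` and `rmse_error_bars` are made up for this example, and the check on the square root is my own addition (the lower bar is undefined when $1.96\sqrt{2}/\sqrt{N-1} > 1$, which happens for fewer than 9 points).

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error of predictions relative to reference values."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)


def rmse_error_bars(rmse_value, n, z=1.96):
    """Return (lower, upper) 95% error bars on an RMSE computed from n points.

    Uses the large-N expressions from the text; z=1.96 corresponds to 95%.
    """
    f = z * math.sqrt(2.0) / math.sqrt(n - 1)
    if f > 1.0:
        # Assumption of this sketch: refuse to evaluate sqrt of a negative number.
        raise ValueError("too few data points for the large-N lower error bar")
    lower = rmse_value * (1.0 - math.sqrt(1.0 - f))
    upper = rmse_value * (math.sqrt(1.0 + f) - 1.0)
    return lower, upper


if __name__ == "__main__":
    lower, upper = rmse_error_bars(1.0, 51)
    print(f"RMSE = 1.00 (-{lower:.2f}/+{upper:.2f})")
```

For an RMSE of 1.0 from 51 data points this gives a lower bar of roughly 0.22 and an upper bar of roughly 0.18, so the interval is noticeably asymmetric even for a fairly large data set.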

Mean Absolute Error (MAE)
The error bars for the MAE are also asymmetric. The lower and upper error bars on the MAE for method $X$ $(MAE_X)$ are

$$L_X = MAE_X \left( 1- \sqrt{ 1- \frac{1.96\sqrt{2}}{\sqrt{N-1}}} \right)$$

$$U_X = MAE_X \left( \sqrt{ 1+ \frac{1.96\sqrt{2}}{\sqrt{N-1}}}-1 \right)$$

Mean Error (ME)
The error bars for the mean error are symmetric and given by
$$L_X = U_X = \frac{1.96 s_N}{\sqrt{N}}$$

where $s_N$ is the population standard deviation (e.g. STDEVP in Excel).
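
A minimal Python sketch of the mean-error expression is shown here, using `statistics.pstdev` from the standard library as the counterpart of Excel's STDEVP. The function name `mean_error_bar` is just an illustrative choice.

```python
import math
import statistics


def mean_error_bar(errors, z=1.96):
    """Symmetric 95% error bar on the mean of a list of signed errors."""
    n = len(errors)
    s_n = statistics.pstdev(errors)  # population standard deviation
    return z * s_n / math.sqrt(n)


if __name__ == "__main__":
    errors = [0.3, -0.1, 0.4, 0.0, -0.2, 0.5, 0.1, -0.3, 0.2, 0.1]
    me = statistics.mean(errors)
    bar = mean_error_bar(errors)
    print(f"ME = {me:.2f} +/- {bar:.2f}")
```

If the interval ME ± bar includes zero, the data do not show a systematic over- or underestimation at the 95% level.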

Pearson’s correlation coefficient, $\textbf{r}$
The first thing to check is whether your $r$ values themselves are statistically significant, i.e. $r_X > r_{significant}$ where

$$r_{significant} = \frac{1.96}{\sqrt{N-2+1.96^2}}$$

The error bars for the Pearson's $r$ value are asymmetric and given by
$$L_X = r_X - \frac{e^{2F_-}-1}{e^{2F_-}+1}$$
$$U_X = \frac{e^{2F_+}-1}{e^{2F_+}+1} - r_X$$

where

$$F_{\pm} = \frac{1}{2} \ln \frac{1+r_X}{1-r_X} \pm r_{significant}$$
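
The expressions for $r$ can be sketched in Python as follows. Note that $(e^{2F}-1)/(e^{2F}+1)$ is simply $\tanh F$ and $\frac{1}{2}\ln\frac{1+r}{1-r}$ is $\tanh^{-1} r$, so the standard library functions `math.tanh` and `math.atanh` are used. The function names are illustrative assumptions, not taken from the Nicholls papers.

```python
import math


def r_significant(n, z=1.96):
    """Smallest r that is statistically significant for n data points."""
    return z / math.sqrt(n - 2 + z ** 2)


def pearson_error_bars(r, n, z=1.96):
    """Return (lower, upper) 95% error bars on a Pearson r from n points."""
    r_sig = r_significant(n, z)
    fisher = math.atanh(r)  # 0.5 * ln((1 + r) / (1 - r))
    lower = r - math.tanh(fisher - r_sig)
    upper = math.tanh(fisher + r_sig) - r
    return lower, upper


if __name__ == "__main__":
    n = 50
    r = 0.9
    lower, upper = pearson_error_bars(r, n)
    print(f"r_significant = {r_significant(n):.2f}")
    print(f"r = {r:.2f} (-{lower:.2f}/+{upper:.2f})")
```

For $r$ close to 1 the lower bar is larger than the upper bar, since the interval cannot extend past 1.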

Comparing two methods
If $error_X$ is some measure of the error, RMSE, MAE, etc, and $error_A > error_B$ then the difference is statistically significant only if

$$error_A - error_B > \sqrt {L_A^2 + U_B^2 - 2{r_{AB}}{L_A}{U_B}}$$

where $r_{AB}$ is the Pearson's $r$ value of method $A$ compared to $B$, not to be confused with $r_A$ which compares $A$ to the reference value.  Conversely, if this condition is not satisfied then you cannot say that method $B$ is more accurate than method $A$ with 95% confidence because the error bars are too large.

Note also that if there is a high degree of correlation between the predictions ($r_{AB} \approx$ 1) and the error bars are similar in size $L_A \approx U_B$ then even small differences in error could be significant.

Usually one can assume that $r_{AB} > 0$ so if $error_A - error_B > \sqrt {L_A^2 + U_B^2}$ or $error_A - error_B > L_A + U_B$ then the difference is statistically significant, but it is better to evaluate $r_{AB}$ to be sure.
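
The comparison can be sketched in Python like this. The helper names are hypothetical, and the default `r_ab=0.0` is only the cautious fallback discussed above for when $r_{AB}$ has not been evaluated; `pearson_r` is a plain implementation of Pearson's $r$ for computing $r_{AB}$ from the two sets of predictions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def significance_threshold(lower_a, upper_b, r_ab=0.0):
    """Smallest error_A - error_B that counts as significant at 95%."""
    radicand = lower_a ** 2 + upper_b ** 2 - 2.0 * r_ab * lower_a * upper_b
    return math.sqrt(max(0.0, radicand))  # guard against round-off below zero


def is_significant(error_a, error_b, lower_a, upper_b, r_ab=0.0):
    """True if method B is significantly more accurate than method A.

    Assumes error_a > error_b, i.e. A is the method with the larger error.
    """
    return (error_a - error_b) > significance_threshold(lower_a, upper_b, r_ab)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    print(is_significant(1.0, 0.8, 0.2, 0.15))            # r_AB ignored
    print(is_significant(1.0, 0.8, 0.2, 0.15, r_ab=0.9))  # correlated predictions
```

With these made-up numbers the threshold is 0.25 when $r_{AB}$ is ignored, so a difference of 0.2 is not significant, but it drops to about 0.09 for $r_{AB} = 0.9$, at which point the same difference is significant.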

The meaning of 95% confidence
Say you compute errors for some property for 50 molecules using method $A$ ($error_A$) and $B$ ($error_B$) and observe that the significance condition in the previous section (Eq 11) is true.

Assuming no prior knowledge on the performance of $A$ and $B$, if you repeat this process an additional 40 times using all new molecules each time then in 38 cases (38/40 = 0.95) the errors observed for method $A$ will likely be between $error_A - L_A$ and $error_A + U_A$ and similarly for method $B$. For one of the remaining two cases the error is expected to be larger than this range, while for the other remaining case it is expected to be smaller. Furthermore, in 39 of the 40 cases $error_A$ is likely larger than $error_B$, while $error_A$ is likely smaller than $error_B$ in the remaining case.