Creating a cheminformatics workflow
Using the command-line interface to ChemAxon tools, Aabel and AppleScript.
You may have used Marvin, a collection of tools for drawing, displaying and characterizing chemical structures, substructures and reactions. Most of the time you would access these tools via the GUI provided by ChemAxon; however, it is also possible to access them from the command line. Open a Terminal window and type cxcalc -h and you should see the following options available.
ChrisMacBookPro:~ username$ cxcalc -h
Calculator, (C) 1998-2009 ChemAxon Ltd.
version 5.2.2
Runs various molecule calculations: charge, pKa, logP, etc.
Usage:
cxcalc [general options] [input files/strings]
[plugin options] [input files/strings]
cxcalc [general options] [input files]
[plugin1 options] [input files/strings]
[plugin2 options] [input files/strings]
...
cxcalc [training options] [input file (the training set)]
General options:
cxcalc -h, --help this help message,
list of available calculations
cxcalc -h, --help plugin specific help message
-o, --output output file path (default: stdout)
-t, --tag name of the SDFile tag to store the
calculation results, tag name prefix
to default tag names in case of multiple
plugins (default: see plugin help)
-i, --id SDFile tag that stores the molecule ID
if no such tag exists in the input molecule
then molecule ID is the molecule itself
converted to the specified format
(default: ID = molecule index)
-N, --do-not-display do not display molecule ID and/or
table header (in table output form):
i - no molecule ID
h - no table header
ih - neither molecule ID nor table header
-S, --sdf-output SDF output with results in SDF tags
-M, --mrv-output result molecule output in MRV format
(if neither -S nor -M is specified then
plugin results are written in table form)
-g, --ignore-error continue with next molecule on error
-v, --verbose print calculation warnings to the console
Training options:
-T, --train-knowledge-base [logP|pKa]
generate knowledge base for the specified
calculation
-o, --output logP: output file path
pKa: output directory path
-t, --tag name of the SDFile tag that stores the
experimental values (logP only)
-a, --add-built-in-training-set
add built-in training set (logP only)
Available calculations:
atomcount, composition, dotdisconnectedformula,
dotdisconnectedisotopeformula, elemanal, elementalanalysistable,
exactmass, formula, icomposition, iformula, isotopecomposition,
isotopeformula, mass
Charge
atomicpolarizability, atompol, averagemolecularpolarizability,
averagepol, avgpol, axxpol, ayypol, azzpol, charge, formalcharge,
ioncharge, molecularpolarizability, molpol, oen,
orbitalelectronegativity, pol, polarizability, tholepolarizability,
tpol, tpolarizability
Conformation
conformers, hasvalidconformer, leconformer, lowestenergyconformer,
moldyn, moleculardynamics
Geometry
aliphaticatom, aliphaticatomcount, aliphaticbondcount,
aliphaticringcount, aliphaticringcountofsize, angle, aromaticatom,
aromaticatomcount, aromaticbondcount, aromaticringcount,
aromaticringcountofsize, asa, asymmetricatom, asymmetricatomcount,
balabanindex, bondcount, bondtype, carboaromaticringcount,
carboringcount, chainatom, chainatomcount, chainbond, chainbondcount,
chiralcenter, chiralcentercount, connected, connectedgraph,
cyclomaticnumber, dihedral, distance, distancedegree, dreidingenergy,
eccentricity, fragmentcount, fusedaliphaticringcount,
fusedaromaticringcount, fusedringcount, hararyindex,
heteroaromaticringcount, heteroringcount, hindrance, hyperwienerindex,
largestatomringsize, largestringsize, largestringsystemsize,
maximalprojectionarea, maximalprojectionradius, minimalprojectionarea,
minimalprojectionradius, molecularsurfacearea, msa, plattindex,
polarsurfacearea, psa, randicindex, ringatom, ringatomcount, ringbond,
ringbondcount, ringcount, ringcountofatom, ringcountofsize,
ringsystemcount, ringsystemcountofsize, rotatablebond,
rotatablebondcount, shortestpath, smallestatomringsize,
smallestringsize, smallestringsystemsize, stereodoublebondcount,
stericeffectindex, sterichindrance, szegedindex, topanal,
topologyanalysistable, vdwsa, wateraccessiblesurfacearea, wienerindex,
wienerpolarity
Isomers
canonicaltautomer, dominanttautomerdistribution,
doublebondstereoisomercount, doublebondstereoisomers, generictautomer,
majortautomer, moststabletautomer, stereoisomercount, stereoisomers,
tautomercount, tautomers, tetrahedralstereoisomercount,
tetrahedralstereoisomers
Markush Enumerations
enumerationcount, enumerations, markushenumerationcount,
markushenumerations, randommarkushenumerations
Name
name
Partitioning
logd, logp
Protonation
averagemicrospeciescharge, chargedistribution, isoelectricpoint,
majormicrospecies, majorms, microspeciesdistribution, msdistr, pi, pka
Other
acc, acceptor, acceptorcount, acceptorsitecount, acceptortable,
accsitecount, aromaticelectrophilicityorder,
aromaticnucleophilicityorder, canonicalresonant, chargedensity, don,
donor, donorcount, donorsitecount, donortable, donsitecount,
electrondensity, electrophilicityorder,
electrophiliclocalizationenergy, energy, frameworks, hbda,
hbonddonoracceptor, huckel, huckeleigenvalue, huckeleigenvector,
huckelorbitals, huckeltable, localizationenergy, msacc, msdon,
nucleophilicityorder, nucleophiliclocalizationenergy, order,
pichargedensity, pienergy, refractivity, resonantcount, resonants,
totalchargedensity
Examples:
cxcalc mols.sdf charge
cxcalc -i smiles mols.sdf logP pKa
cxcalc -S -t myLOGP mols.sdf logp -t increments,logP -p 3
cxcalc -t my mols.sdf logd -l 3 -u 7 -s 0.5 logp -t increments,logP -p 3
cxcalc -T logP -t LOGP -o logPparameters.txt trainingset.sdf
We can use these tools to provide a cheminformatics computation engine, either for quick calculations from the command line or as part of a workflow. For example, the following quickly calculates the logP of the input SMILES string.
ChrisMacBookPro:~ username$ cxcalc 'c1(c(cccc1)Br)C(=O)C' logp
id logP
1 2.30
Alternatively, you might want to calculate the logD at a particular pH (usually physiological pH 7.4).
ChrisMacBookPro:~ username$ cxcalc 'c1(c(cccc1)Br)C(=O)C' logd -H 7.4
id logD[pH=7.4]
1 2.30
Several calculations can be combined in a single command, for example if you wanted to calculate the Lipinski "Rule of Five" (Ro5) properties (doi:10.1016/S0169-409X(00)00129-0).
ChrisMacBookPro:~ username$ cxcalc 'c1(c(cccc1)Br)C(=O)C' logp mass acceptorcount donorcount
id logP Mass acceptorcount donorcount
1 2.30 199.045 2 0
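As a rough sketch of how such a table might be used, the awk filter below counts Rule of Five violations for each row. It assumes tab-delimited output with a single header line and the column order shown above (id, logP, Mass, acceptorcount, donorcount), and it uses cxcalc's acceptorcount and donorcount as stand-ins for Lipinski's acceptor and donor counts. The function name ro5_violations is hypothetical and is not part of cxcalc.

```shell
# Hypothetical helper: reads a cxcalc table on stdin and prints
# "id<TAB>violations" for every molecule.
# Assumed column order: id, logP, Mass, acceptorcount, donorcount.
ro5_violations() {
    awk -F '\t' '
        NR == 1 { next }            # skip the header line
        {
            v = 0
            if ($2 > 5)   v++       # logP above 5
            if ($3 > 500) v++       # molecular mass above 500
            if ($4 > 10)  v++       # more than 10 hydrogen-bond acceptors
            if ($5 > 5)   v++       # more than 5 hydrogen-bond donors
            printf "%s\t%d\n", $1, v
        }
    '
}

# Possible usage, assuming cxcalc is on the PATH:
# cxcalc mols.smiles logp mass acceptorcount donorcount | ro5_violations
```

A compound is usually flagged when it has more than one violation, so the second column can be filtered further if required.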
These commands can also be used to process files, so to calculate the Ro5 properties for a file use:
ChrisMacBookPro:~ username$ cxcalc /Users/username/Desktop/acetophenones.smiles -o /Users/username/Desktop/results.tab logp mass acceptorcount donorcount
Or, if you want the results added to an SDFile, use:
ChrisMacBookPro:~ username$ cxcalc /Users/username/Desktop/acetophenones.sdf -S -o /Users/username/Desktop/results.sdf logp mass acceptorcount donorcount
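Before plotting, the tab-delimited table written by the first of the two file commands can be given a quick sanity check from the shell. The sketch below is one possible way to do that with awk. It assumes a single header line whose column names match those in the table output shown earlier (for example logP and Mass); the function name column_stats is hypothetical.

```shell
# Hypothetical helper: column_stats FILE COLUMN_NAME
# Prints the count, minimum, maximum and mean of one numeric column
# of a tab-delimited table that has a single header line.
column_stats() {
    awk -F '\t' -v name="$2" '
        NR == 1 {
            # locate the requested column by its header text
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
            next
        }
        $col != "" {
            x = $col + 0
            if (n == 0 || x < min) min = x
            if (n == 0 || x > max) max = x
            sum += x
            n++
        }
        END { if (n) printf "n=%d min=%.2f max=%.2f mean=%.2f\n", n, min, max, sum / n }
    ' "$1"
}

# Possible usage:
# column_stats /Users/username/Desktop/results.tab logP
```

Values far outside the expected range are easier to spot here than after the data have been binned into a histogram.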
The attraction of the command-line options is that they can be included in an AppleScript to automate processing a chemical structure file. However, when using AppleScript to run UNIX commands there are a few things you need to bear in mind. The AppleScript command "do shell script" always uses /bin/sh to interpret your command, not your default shell. It also ignores the configuration files that an interactive shell would read, so commands you use in the Terminal may need modifying to work in an AppleScript. In particular, you will probably have to give full paths to commands, and it is a good idea to enclose paths in single quotes to avoid problems with spaces in folder and file names. The other thing to note is that AppleScript uses a colon ":" as the separator for directories, whereas UNIX uses POSIX file paths in which the slash "/" is the directory separator. However, one of the additions in AppleScript 1.8 was the ability to inter-convert the two file reference systems. You can demonstrate this using the simple AppleScript below.
set this_file to choose file
set this_file_text to (this_file as text)
display dialog this_file_text
set posix_this_file to POSIX path of this_file
display dialog posix_this_file
set this_file_back to (POSIX file posix_this_file) as string
display dialog this_file_back
Using quotes and backslashes in the shell command
Strings in AppleScript go from an opening double quote to a closing double quote. To put a literal double quote in your string you must "escape" it with a backslash character. Some punctuation has special meanings in shell so use quoted form to avoid punctuation being interpreted by the shell.
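What quoted form hands to the shell can be sketched in sh itself: the text is wrapped in single quotes, and any single quote inside it is rewritten as '\'' (close the quote, add an escaped quote, reopen the quote). The function below is only an illustration of that idea; quoted_form is a hypothetical name, and the example path is made up.

```shell
# Hypothetical helper mimicking the effect of AppleScript's "quoted form of":
# wrap the argument in single quotes and rewrite each embedded
# single quote as '\'' so that /bin/sh reads it back as one word.
quoted_form() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Possible usage with a made-up path containing a space and an apostrophe:
# eval "ls -l $(quoted_form "/Users/username/My Folder/chris's mols.sdf")"
```

Because everything between single quotes is taken literally by the shell, spaces, dollar signs and other punctuation in a file name are passed through unchanged.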
set the_text to "this is a test."
do shell script "echo " & quoted form of the_text & " | perl -n -e 'print \"\\U$_\"'"
-- shell sees, echo 'this is a test.' | perl -n -e 'print "\U$_"'
--result: "THIS IS A TEST."
So a script to calculate the Ro5 properties on a file might look like this:
set this_file to choose file
set this_file_text to (this_file as text)
--get the posix path to chosen file
set posix_this_file to quoted form of POSIX path of this_file
set shell_script to "'/Applications/ChemAxon/MarvinBeans/bin/cxcalc' " & posix_this_file & " -o '/Users/username/Desktop/results.tab' logp mass acceptorcount donorcount polarsurfacearea"
do shell script shell_script
This script creates a tab delimited file on the desktop called results.tab. Hard coding the path to the desktop will fail if the user has renamed the hard drive, so we use a short script snippet to get the path to the desktop.
set user_path to (path to desktop) as text
--file for calculated results
set result_file to user_path & "results.tab" as text
set posix_result_file to quoted form of POSIX path of result_file
Now that we have a file containing the data in tab-delimited format, we can use a plotting application to plot the distribution of molecular properties for the compounds in the file. I use Aabel. Using AppleScript to create a chart in Aabel is efficient but perhaps not straightforward at first sight, so let's take it in small steps. You will also need to download the latest patched version of Aabel: during the course of developing this script I identified a couple of bugs in the AppleScript support, which the developers rapidly fixed.
First we define the tab-delimited file to be imported and then import the data into a new worksheet. The next part simply defines the working directory to save files to. Aabel separates data from charts, so we then need to create a new viewer to plot the data onto.
SelectChart selects a chart type; its parameters are the Viewer button and menu coordinates as tab-delimited text. So SelectChart "4 3" means select the fourth button from the top row of the viewer (which is pie charts) and then select the third item from the dropdown menu (Square Pie). In our case we choose button 8 (histograms) and menu item 1 (continuous data). We then define the fill colour (the numbers are available from the "Edit colour palette" menu) and select the variables; here we use the second column of data from the topmost worksheet. Then we position the chart. Dimensions are given in the current real-world page units (inches, cm, etc.) in the order horizontal start point, horizontal end point, vertical start point, vertical end point. The coordinates are 0,0 at the upper-left corner of the page, and the positive direction is down and to the right.
tell application "Aabel_3"
    Run
    set thetabdelimitedfile to (result_file as text) as alias
    ImportDataIntoNewWorksheet thetabdelimitedfile
    set currentdirectory to alias "Macintosh HD:Public"
    SetCurrentDirectory currentdirectory
    CreateNewViewer "1"
    activate
    SelectChart "8 1"
    SelectDefaultFillColor "120"
    SelectVariables "1 2"
    SetChartInstanceDimensions "0.2 3.2 0.2 3.2"
end tell
This is the simplest type of script, and it is possible to refine the display. With SelectChart you can add an additional 15 parameters. These parameters define the settings that correspond to those defined using the Variables & Plot Options palette controls, which can include:
(1) 5 popup menu values
(2) 5 checkbox values
(3) 5 slider values
So SelectChart "8 1 1 3 1 1 -1 0 0 0 0 1 20 0.2 -1 -1 -1" corresponds to:
[Figure: the corresponding Variables & Plot Options palette settings]
Finally, if we want to add text, we use these commands.
tell application "Aabel_3"
    Run
    SetDefaultTextLineFont "Helvetica"
    SetDefaultTextLineFontStyle "Bold"
    SetDefaultTextLineFontSize "14"
    SelectDefaultLineColor "1"
    CreateTextLine the_text
end tell
The final complete script, which includes export of the final chart as a PDF, is shown below, and the actual script can be downloaded here.
property obgrepPath : "'/usr/local/bin/obgrep'"
set user_path to (path to desktop) as text
--file for calculated results
set result_file to user_path & "results.tab" as text
set posix_result_file to quoted form of POSIX path of result_file
set this_file to choose file
set this_file_text to (this_file as text)
tell application "Finder" to set file_name to (name of this_file)
--get the posix path to chosen file
set posix_this_file to quoted form of POSIX path of this_file
--use openbabel to get number of structures
--(posix_this_file is already in quoted form, so no further quotes are added)
set obgrep_command to obgrepPath & " -v -c \"NNNNNN\" " & posix_this_file
set obgrep_command_shell to obgrep_command & " |cut -d \" \" -f2"
set count_lines to (do shell script obgrep_command_shell) as string
set the_text to " The file " & file_name & " contains " & count_lines & " structures."
--get molecular properties
set shell_script to "'/Applications/ChemAxon/MarvinBeans/bin/cxcalc' " & posix_this_file & " -o " & posix_result_file & " logp mass acceptorcount donorcount polarsurfacearea rotatablebondcount"
do shell script shell_script
--Use Aabel to create histograms
tell application "Aabel_3"
    Run
    set thetabdelimitedfile to (result_file as text) as alias
    ImportDataIntoNewWorksheet thetabdelimitedfile
    set currentdirectory to alias "Macintosh HD:Public"
    SetCurrentDirectory currentdirectory
    CreateNewViewer "1"
    activate
    SelectChart "8 1"
    SelectDefaultFillColor "120"
    SelectVariables "1 2"
    SetChartInstanceDimensions "0.2 3.2 0.2 3.2"
    CreateNewChartInstance " "
    SelectDefaultFillColor "21"
    SelectChart "8 1"
    SelectVariables "1 3"
    SetChartInstanceDimensions "3.2 6.2 0.2 3.2"
    CreateNewChartInstance " "
    SelectDefaultFillColor "64"
    SelectChart "8 1 1 3 1 1 -1 0 0 0 0 1 10 0.2 -1 -1 -1"
    SelectVariables "1 4"
    SetChartInstanceDimensions "0.2 3.2 3.2 6.2"
    CreateNewChartInstance " "
    SelectDefaultFillColor "140"
    SelectChart "8 1 1 3 1 1 -1 0 0 0 0 1 10 0.2 -1 -1 -1"
    SelectVariables "1 5"
    SetChartInstanceDimensions "3.2 6.2 3.2 6.2"
    CreateNewChartInstance " "
    SelectDefaultFillColor "45"
    SelectChart "8 1"
    SelectVariables "1 6"
    SetChartInstanceDimensions "6.2 9.2 0.2 3.2"
    CreateNewChartInstance " "
    SelectDefaultFillColor "64"
    SelectChart "8 1 1 3 1 1 -1 0 0 0 0 1 20 0.2 -1 -1 -1"
    SelectVariables "1 7"
    SetChartInstanceDimensions "6.2 9.2 3.2 6.2"
    CreateNewChartInstance " "
    SelectDefaultFillColor "140"
    SetDefaultObjectLocation "1 7"
    SetDefaultTextLineFont "Helvetica"
    SetDefaultTextLineFontStyle "Bold"
    SetDefaultTextLineFontSize "14"
    SelectDefaultLineColor "1"
    CreateTextLine the_text
    ExportVisibleViewerContent "MyfileNew.pdf"
    Quit
end tell
And the result is shown here.
This script could be stored in the AppleScript menu folder or converted into a droplet and left on the desktop. Many more properties could be calculated using the ChemAxon tools. I've only tested this with SMILES and SDF format files, but you should be able to use Open Babel to convert most file formats to these formats.
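For the two formats tested, the structure count that the complete script obtains from obgrep could also be approximated with standard tools, which may be handy where Open Babel is not installed. This is only a sketch under stated assumptions: an SDF is taken to contain one $$$$ record terminator per structure, any other file is treated as SMILES with one structure per non-blank line, and count_structures is a hypothetical name.

```shell
# Hypothetical helper: count_structures FILE
# Prints the number of structures, chosen by file extension.
count_structures() {
    case "$1" in
        *.sdf|*.sd)
            # each SDF record ends with a line that starts with $$$$
            grep -c '^\$\$\$\$' "$1" ;;
        *)
            # otherwise assume SMILES: one structure per non-blank line
            grep -c '[^[:space:]]' "$1" ;;
    esac
}

# Possible usage:
# count_structures /Users/username/Desktop/acetophenones.sdf
```

Called from AppleScript with do shell script, the helper's body would need to be written out in full in the command string, since /bin/sh will not have read any file in which the function is defined.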