Analysis of High-Throughput Protein Expression in Escherichia coli
Submitted for publication.
Analyzing your set of proteins
To create Table 2 with your own sequences, you will need Python, Biopython, and ZODB. The sequence analysis modules I wrote are available through Biopython, but using them requires programming skills. I will post my own custom scripts and detailed instructions soon, but they will still be somewhat complicated for those without any programming knowledge. In parallel, I am working on an easy-to-use web interface. In the meantime, please feel free to send me your sequences in FASTA format and I will create Table 2 with your data. Make sure you include the group number in the title of each sequence.
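Before sending sequences, it may help to check that every FASTA title carries a group number. The sketch below is only an illustration: the `group=N` title convention, the sample records, and the function name `parse_fasta_with_groups` are assumptions made for this example, not a format required here. It uses only the Python standard library.

```python
import re

# Made-up FASTA records; the "group=N" title convention is an assumption.
SAMPLE_FASTA = """>example_protein_1 group=1
MKTAYIAKQR
QISFVKSHFS
>example_protein_2 group=3
MSDNELLK
"""

GROUP_RE = re.compile(r"group=(\d+)")


def parse_fasta_with_groups(text):
    """Return a list of dicts with title, group number, and sequence."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            title = line[1:]
            match = GROUP_RE.search(title)
            if match is None:
                raise ValueError(f"no group number in title: {title!r}")
            records.append({"title": title, "group": int(match.group(1)), "seq": ""})
        elif records:
            # Sequence lines are appended to the most recent record.
            records[-1]["seq"] += line
        else:
            raise ValueError("sequence data found before the first title line")
    return records


for record in parse_fasta_with_groups(SAMPLE_FASTA):
    print(record["group"], record["title"], len(record["seq"]))
```

A title without a recognizable group number raises an error, so a missing annotation is caught before the file is sent.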
Highly expressed genes in E. coli
This is a set of 121 proteins that are highly expressed in E. coli. The proteins were selected from 2D gel analysis using the SWISS-2DPAGE database. I downloaded all the E. coli 2D gels in Melanie format (Melanie is a software package for 2D gel analysis) from the FTP server, which allowed me to make my own selection of spots based on the gel analysis.
This Excel file contains the data I gathered from all the Melanie-analyzed gels of E. coli. The file is sorted by %Vol, and there are 121 spots with a %Vol above 0.2 (marked with a light orange background). The DNA and protein sequences of these genes were fetched and are available in FASTA format.
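The selection step, keeping spots with a %Vol above 0.2 and sorting by %Vol, can be sketched as follows. This is a minimal illustration that assumes the spreadsheet has been exported to CSV; the column names `spot_id` and `pct_vol` and all values in `SAMPLE` are invented for the example and do not come from the actual file.

```python
import csv
import io

THRESHOLD = 0.2  # %Vol cutoff stated in the text

# Hypothetical CSV export; column names and values are made up.
SAMPLE = """spot_id,pct_vol
spot_A,1.35
spot_B,0.20
spot_C,0.07
spot_D,0.42
"""


def select_high_volume_spots(csv_text, threshold=THRESHOLD):
    """Keep rows whose %Vol is strictly above the threshold, highest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kept = [row for row in rows if float(row["pct_vol"]) > threshold]
    return sorted(kept, key=lambda row: float(row["pct_vol"]), reverse=True)


selected = select_high_volume_spots(SAMPLE)
print([row["spot_id"] for row in selected])
```

The comparison is strict, so a spot with a %Vol of exactly 0.2 is left out, matching the "above 0.2" wording.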
The codon adaptation index (CAI) was calculated based on these 121 highly expressed genes. The calculation was described by Sharp and Li and is performed using the codon usage module I submitted to Biopython. Here is a chart showing the difference between these CAI values and the original CAI values.
These are the values per codon calculated from the 121 highly expressed E. coli genes:
| Amino Acid | Codon | CAI | Amino Acid | Codon | CAI | Amino Acid | Codon | CAI |
|---|---|---|---|---|---|---|---|---|
| Ala | GCG | 1.000 | Gly | GGC | 1.000 | Pro | CCG | 1.000 |
| Ala | GCA | 0.690 | Gly | GGT | 0.994 | Pro | CCA | 0.277 |
| Ala | GCT | 0.677 | Gly | GGG | 0.186 | Pro | CCT | 0.189 |
| Ala | GCC | 0.632 | Gly | GGA | 0.113 | Pro | CCC | 0.090 |
| Arg | CGT | 1.000 | His | CAC | 1.000 | Ser | AGC | 1.000 |
| Arg | CGC | 0.692 | His | CAT | 0.735 | Ser | TCT | 0.910 |
| Arg | CGG | 0.053 | Ile | ATC | 1.000 | Ser | TCC | 0.784 |
| Arg | CGA | 0.039 | Ile | ATT | 0.714 | Ser | TCG | 0.404 |
| Arg | AGA | 0.023 | Ile | ATA | 0.033 | Ser | AGT | 0.302 |
| Arg | AGG | 0.011 | Leu | CTG | 1.000 | Ser | TCA | 0.301 |
| Asn | AAC | 1.000 | Leu | CTC | 0.131 | Thr | ACC | 1.000 |
| Asn | AAT | 0.396 | Leu | TTG | 0.120 | Thr | ACT | 0.437 |
| Asp | GAT | 1.000 | Leu | CTT | 0.114 | Thr | ACG | 0.347 |
| Asp | GAC | 0.856 | Leu | TTA | 0.107 | Thr | ACA | 0.159 |
| Cys | TGC | 1.000 | Leu | CTA | 0.023 | Trp | TGG | 1.000 |
| Cys | TGT | 0.676 | Lys | AAA | 1.000 | Tyr | TAC | 1.000 |
| Gln | CAG | 1.000 | Lys | AAG | 0.243 | Tyr | TAT | 0.822 |
| Gln | CAA | 0.345 | Met | ATG | 1.000 | Val | GTT | 1.000 |
| Glu | GAA | 1.000 | Phe | TTC | 1.000 | Val | GTG | 0.967 |
| Glu | GAG | 0.347 | Phe | TTT | 0.691 | Val | GTA | 0.497 |
| | | | | | | Val | GTC | 0.466 |
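For readers who want to see how per-codon values of this kind arise and how they combine into a CAI for one gene, here is a minimal sketch in the spirit of Sharp and Li. It is not the Biopython codon usage module or the code used for the article. It assumes each codon's value is its count in the reference set divided by the count of the most frequent synonymous codon, and that a gene's CAI is the geometric mean of those values over its codons. The function names and the three-family `SYNONYMOUS` subset are chosen for illustration, and codons with a zero or missing value are simply skipped, which is a simplification.

```python
import math
from collections import Counter

# Illustrative subset of synonymous codon families (not the full genetic code).
SYNONYMOUS = {
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
    "Gln": ["CAG", "CAA"],
}


def codons(seq):
    """Split an in-frame DNA sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs):
    """Per-codon value: count divided by the count of the most used synonym."""
    counts = Counter(c for s in reference_seqs for c in codons(s))
    w = {}
    for family in SYNONYMOUS.values():
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # family never observed in the reference set
        for codon in family:
            w[codon] = counts[codon] / top
    return w


def cai(seq, w):
    """Geometric mean of per-codon values; unknown or zero-valued codons are skipped."""
    logs = [math.log(w[c]) for c in codons(seq) if w.get(c, 0) > 0]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Four values taken from the table above, applied to a toy sequence.
w_table = {"AAA": 1.000, "AAG": 0.243, "GAA": 1.000, "GAG": 0.347}
print(round(cai("AAAAAGGAAGAG", w_table), 3))
```

Working in log space keeps the geometric mean numerically stable for long genes, where a direct product of many small values would underflow.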
Computing protein attributes
The DNA and protein sequence analysis modules were all written in Python. The modules I used for the article are more advanced than the ones I submitted to Biopython, mostly because they are embedded in the ZODB infrastructure. However, I am currently working on upgrading the Biopython module. If you need to see it sooner rather than later, please contact me.
