PaperBLAST
PaperBLAST Hits for CCNA_02589 (78 a.a., MPMSHAALEA...)
>CCNA_02589
MPMSHAALEARLVAAFPDSEIVLTDLVGDNDHWKARIVSPAFKGLPRVRQHQLVNRALAD
VLGGTLHALALETAAPAE
Found 26 similar proteins in the literature:
Saro_2520 BolA-like protein from Novosphingobium aromaticivorans DSM 12444
55% identity, 97% coverage
ATU_RS09055 BolA/IbaG family iron-sulfur metabolism protein from Agrobacterium fabrum str. C58
49% identity, 97% coverage
5nfmA / A0A452CS62 Crystal structure of YrbA from Sinorhizobium meliloti in complex with copper (see paper)
46% identity, 95% coverage
- Ligand: copper(II) ion (5nfmA)
BAB1_0856 ATP/GTP-binding site motif A (P-loop):BolA-like protein from Brucella melitensis biovar Abortus 2308
43% identity, 97% coverage
H375_7110 BolA/IbaG family iron-sulfur metabolism protein from Rickettsia prowazekii str. Breinl
41% identity, 96% coverage
ECH_0303 BolA family protein from Ehrlichia chaffeensis str. Arkansas
33% identity, 97% coverage
- Impact of Three Different Mutations in Ehrlichia chaffeensis in Altering the Global Gene Expression Patterns
Kondethimmanahalli, Scientific reports 2018 - “...protein RibD ECH_0230 991 4109 4.15 putative membrane protein ECH_0251 1042 2185 2.1 hypothetical protein ECH_0303 1018 2856 2.80 BolA family protein ECH_0367 849 1274 2.49 ATP-dependent Clp protease, ATP-binding subunit ClpB ECH_0450 1261 3710 2.94 conserved hypothetical protein ECH_0531 1363 11788 8.65 hypothetical protein ECH_0630...”
- “...were also upregulated. Transcripts for two genes coding for iron sulfur proteins {BolA family protein (ECH_0303) and FeS cluster assembly scaffold (IscU) (ECH_0630)} were similarly up-regulated. We observed differential expression of six hypothetical protein genes, which included ECH_0166, ECH_0251, ECH_0450, ECH_0531, ECH_0753, and ECH_0878. Mutation in...”
RSP_2952 BolA-like protein from Rhodobacter sphaeroides 2.4.1
41% identity, 95% coverage
- Convergence of the transcriptional responses to heat shock and singlet oxygen stresses
Dufour, PLoS genetics 2012 - “...RSP_1684, RSP_1743, RSP_1852, RSP_2121, RSP_2125, RSP_2214, RSP_2219, RSP_2387, RSP_2638, RSP_2640, RSP_2641, RSP_2739, RSP_2763, RSP_2764, RSP_2816, RSP_2952, RSP_2953, RSP_3067, RSP_3068, RSP_3378, RSP_3426, RSP_3552, RSP_3597, RSP_3598, RSP_3634, RSP_3809, RSP_3810, RSP_4244, RSP_4245, RSP_4248, RSP_4305 RpoHII regulon (99 genes) Energy metabolism Biosynthesis and degradation of polysaccharides RSP_0482 Electron transport RSP_0108,...”
- Identification, functional studies, and genomic comparisons of new members of the NnrR regulon in Rhodobacter sphaeroides
Hartsock, Journal of bacteriology 2010 - “...site is located upstream of a gene designated bolA (gene RSP_2952 in 2.4.1) (47). While this gene is conserved in strain 2.4.3, the NnrR binding site is not...”
ssr3122 hypothetical protein from Synechocystis sp. PCC 6803
47% identity, 81% coverage
- Deep Proteogenomics of a Photosynthetic Cyanobacterium
Spät, Journal of proteome research 2023 - “...0.01). By focusing on the subset of uncharacterized proteins, we observed that Slr1419, Slr1846, and Ssr3122 were significantly decreased in abundance during resuscitation compared to chlorosis ( Figure S2f ). Instead, Sll1735 (increased), Sll1783, Sll7086, Slr5111, and Slr5127 (decreased) were connected to the low CO 2...”
- Structural Determinants and Their Role in Cyanobacterial Morphogenesis
Springstein, Life (Basel, Switzerland) 2020 - “...Synpcc7942_0299 (WP_011243525.1) All0086 (WP_010994263.1) Cell elongation MreD N/A Synpcc7942_0298 (ABB56330.1) All0085 (BAB77609.1) Cell elongation BolA Ssr3122 (WP_010871705.1) Synpcc7942_1146 (ABB57176.1) Asr0798 (WP_010994972.1) Cell elongation CikA Slr1969 (WP_010872820.1) Synpcc7942_0644 (WP_011243194.1) All1688 (WP_010995857.1) Circadian rhythm PBP1 Sll0002 (WP_010873436.1) Synpcc7942_2000 (WP_011378270.1) Alr5101 (WP_010999227.1) Cell wall synthesis PBP2 Slr1710 (WP_010871874.1) Synpcc7942_0785...”
- Proteomic analysis reveals resistance mechanism against biofuel hexane in Synechocystis sp. PCC 6803
Liu, Biotechnology for biofuels 2012 - “...Slr1846, Slr1847, Slr2101, Ssl0242, Ssl0352, Ssl0467, Ssl0832, Ssl1690, Ssl1707, Ssl1972, Ssl2717, Ssl3364, Ssr1528, Ssr1853, Ssr2554, Ssr3122, Ssr3304, Ssr3402 Hypothetical proteins *Proteins with1.5 fold change and p - value less than 0.05. **Hypothetical proteins listed with gene ID only, full information in Additional file 1 : Table...”
ZMO1874 BolA family protein from Zymomonas mobilis subsp. mobilis ZM4
38% identity, 97% coverage
- Systems-Level Analysis of Oxygen Exposure in Zymomonas mobilis: Implications for Isoprenoid Production
Martien, mSystems 2019 - “...were also upregulated. ZMO1872 and ZMO1875 are in an operon with glutaredoxin (ZMO1873) and BolA (ZMO1874), and all four genes were upregulated at both the protein and transcript levels ( Fig.4 ). Interestingly, overexpression of ZMO1875 has also been found to increase growth and ethanol production...”
- Dissecting a complex chemical stress: chemogenomic profiling of plant hydrolysates
Skerker, Molecular systems biology 2013 - “...2008 ). We identified four tolerance genes in Z. mobilis ( ZMO0429 , ZMO1067 , ZMO1874 , and ZMO1875 ) that provide a causal link between growth in hydrolysate, oxidative stress, and Fe-S clusters. In Z. mobilis , ZMO0429 and ZMO1067 encode predicted Fe-S assembly proteins,...”
- “...; Shakamuri et al, 2012 ; Willems et al, 2012 ). Consistent with this role, ZMO1874 is in the same operon with a predicted monothiol glutaredoxin ( ZMO1873 ) and an Fe-S containing enzyme quinolinate synthetase ( ZMO1871 ), which is involved in NAD biosynthesis. The...”
Synpcc7942_1146 conserved hypothetical protein from Synechococcus elongatus PCC 7942
39% identity, 78% coverage
- Structural Determinants and Their Role in Cyanobacterial Morphogenesis
Springstein, Life (Basel, Switzerland) 2020 - “...All0086 (WP_010994263.1) Cell elongation MreD N/A Synpcc7942_0298 (ABB56330.1) All0085 (BAB77609.1) Cell elongation BolA Ssr3122 (WP_010871705.1) Synpcc7942_1146 (ABB57176.1) Asr0798 (WP_010994972.1) Cell elongation CikA Slr1969 (WP_010872820.1) Synpcc7942_0644 (WP_011243194.1) All1688 (WP_010995857.1) Circadian rhythm PBP1 Sll0002 (WP_010873436.1) Synpcc7942_2000 (WP_011378270.1) Alr5101 (WP_010999227.1) Cell wall synthesis PBP2 Slr1710 (WP_010871874.1) Synpcc7942_0785 (ABB56817.1) Alr4579...”
SPV1_10194 hypothetical protein from Mariprofundus ferrooxydans PV-1
46% identity, 81% coverage
- Mariprofundus ferrooxydans PV-1 the first genome of a marine Fe(II) oxidizing Zetaproteobacterium
Singer, PloS one 2011 - “...be rather narrow for M. ferrooxydans [1] , however PV-1 possesses a predicted operon (SPV1_t10271, SPV1_10194, SPV1_10199, SPV1_10204, SPV1_10209, SPV1_10214, SPV1_10219, SPV1_10224, SPV1_10229, SPV1_10234, SPV1_, SPV1_10239) encoding for a phosphoenolypyruvate-dependent sugar phosphotransferase system (PTS), which is the major carbohydrate transport system in bacteria [37] . The...”
JUK32_RS25875 BolA family protein from Halomicronema sp. CCY15110
44% identity, 71% coverage
asr0798 hypothetical protein from Nostoc sp. PCC 7120
40% identity, 86% coverage
- β-N-Methylamino-L-Alanine (BMAA) Causes Severe Stress in Nostoc sp. PCC 7120 Cells under Diazotrophic Conditions: A Proteomic Study
Koksharova, Toxins 2021 - “...STRING. The protein network is represented with the following 10 protein partners: arl0045 is ferredoxin; asr0798 is a hypothetical protein; alr0799 is monothiol glutaredoxin; all3791 is Ribonuclease D; gshB is Glutathione synthetase (all3859); alr3798 is Glutathione S-transferase; alr2204 and all0737 are Thioredoxin reductases; all4873 is Glutaredoxin-3;...”
- Structural Determinants and Their Role in Cyanobacterial Morphogenesis
Springstein, Life (Basel, Switzerland) 2020 - “...Cell elongation MreD N/A Synpcc7942_0298 (ABB56330.1) All0085 (BAB77609.1) Cell elongation BolA Ssr3122 (WP_010871705.1) Synpcc7942_1146 (ABB57176.1) Asr0798 (WP_010994972.1) Cell elongation CikA Slr1969 (WP_010872820.1) Synpcc7942_0644 (WP_011243194.1) All1688 (WP_010995857.1) Circadian rhythm PBP1 Sll0002 (WP_010873436.1) Synpcc7942_2000 (WP_011378270.1) Alr5101 (WP_010999227.1) Cell wall synthesis PBP2 Slr1710 (WP_010871874.1) Synpcc7942_0785 (ABB56817.1) Alr4579 (WP_010998711.1) Cell...”
HVO_2899 hypothetical protein from Haloferax volcanii DS2
37% identity, 87% coverage
BPSL3142 BolA-like protein from Burkholderia pseudomallei K96243
39% identity, 86% coverage
CNAG_03927 hypothetical protein from Cryptococcus neoformans var. grubii H99
38% identity, 57% coverage
- The monothiol glutaredoxin GrxD is essential for sensing iron starvation in Aspergillus fumigatus
Misslinger, PLoS genetics 2019 - “...Fe-S cluster transfer to mitochondrial clients 8.60 5.53 5.65 - - - Bol3 (Aim1) Fra3 CNAG_03927 Afu4g10100 - (Methionine synthase reductase) 7.62 4.11 2.49 - - - nd SPAC1783.01 nd Afu7g02320 - Unknown 7.60 4.39 5.91 5.03 4.04 3.49 nd nd nd Afu8g04090 CodA Choline oxidase...”
- Leu1 plays a role in iron metabolism and is required for virulence in Cryptococcus neoformans
Do, Fungal genetics and biology : FG & B 2015 - “...for homologs in the C. neoformans genome. The identified homologs, CNAG_01137 (Aco1), CNAG_03395 (Nfu1) and CNAG_03927 (Fra2), were fused with the 3FLAG epitope tag for protein expression analysis. To construct the Aco1-FLAG fusion protein, the ACO1 gene was amplified by PCR from the wild-type genomic DNA...”
- “...the leu1 mutant and the wild-type strain grown in low- and high-iron media. The homolog (CNAG_03927) of S. cerevisiae Fra2, which is a cytosolic Fe-S containing protein that is involved in regulation of iron uptake and homeostasis, was also selected and fused with the same epitope...”
DP16_RS16825 BolA family protein from Stenotrophomonas maltophilia
44% identity, 69% coverage
Rta_20200 BolA family protein from Ramlibacter tataouinensis TTB310
51% identity, 55% coverage
LIC11808 BolA-like protein from Leptospira interrogans serovar Copenhageni str. Fiocruz L1-130
32% identity, 86% coverage
bolA / CAB45536.1 BolA protein from Pseudomonas fluorescens (see paper)
49% identity, 52% coverage
SM2011_c00698 BolA family protein from Sinorhizobium meliloti 2011
49% identity, 55% coverage
- Sinorhizobium meliloti YrbA binds divalent metal cations using two conserved histidines
Roret, Bioscience reports 2020 - “...mutants were constructed by inserting an antibiotic-resistance cartridge in one or both open reading frames (SM2011_c00698 and SM2011_c00487). First, 1-kb fragments covering the entire gene sequence and the upstream and downstream flanking regions were amplified by PCR with the primer pairs bolA-F GGAGAGGCCGGAAAAATAGT/bolA-R TGAAGAACCGGATCACCAAG and yrbA-F...”
BOLA1_HUMAN / Q9Y3E2 BolA-like protein 1; hBolA from Homo sapiens (Human) (see 4 papers)
NP_001307955 bolA-like protein 1 from Homo sapiens
43% identity, 37% coverage
- function: Acts as a mitochondrial iron-sulfur (Fe-S) cluster assembly factor that facilitates (Fe-S) cluster insertion into a subset of mitochondrial proteins (By similarity). Probably acts together with the monothiol glutaredoxin GLRX5 (PubMed:27532772). May protect cells against oxidative stress (PubMed:22746225).
subunit: Interacts with GLRX5 (PubMed:22746225, PubMed:27532772, PubMed:27532773).
- Cluster exchange reactivity of [2Fe-2S]-bridged heterodimeric BOLA1-GLRX5.
Sen, The FEBS journal 2021 (PubMed) - GeneRIF: Cluster exchange reactivity of [2Fe-2S]-bridged heterodimeric BOLA1-GLRX5.
- BOLA1 is an aerobic protein that prevents mitochondrial morphology changes induced by glutathione depletion.
Willems, Antioxidants & redox signaling 2013 - GeneRIF: is an aerobic, mitochondrial protein that prevents mitochondrial morphology aberrations induced by GSH depletion and reduces the associated oxidative shift of the mitochondrial thiol redox potential
- Genetic variants in nuclear-encoded mitochondrial genes influence AIDS progression.
Hendrickson, PloS one 2010 - GeneRIF: Observational study of gene-disease association. (HuGE Navigator)
- hBolA, novel non-classical secreted proteins, belonging to different BolA family with functional divergence.
Zhou, Molecular and cellular biochemistry 2008 (PubMed) - GeneRIF: This study reported that all three human BolA proteins (hBolA1, hBolA2, and hBolA3) are novel non-classical secreted proteins identified with bioinformatics and molecular biology experiments.
- Solution structure of a BolA-like protein from Mus musculus.
Kasai, Protein science : a publication of the Protein Society 2004 - GeneRIF: BOLA-like proteins are widely conserved from prokaryotes to eukaryotes and may be involved in cell proliferation or cell-cycle regulation.
- Multicenter proteome-wide Mendelian randomization study identifies causal plasma proteins in melanoma and non-melanoma skin cancers.
Li, Communications biology 2024 - “...- 1.4014 (1.2891 1.5235) 2.38E-15 1.60E-12 12.87% 0.02% 2.01E-200 Pass 1.2840 (1.1275, 1.4770),0.0012 1064.99 BOLA1 Q9Y3E2 Wald ratio 1 - 1.1053 (1.0534 1.1596) 4.41E-05 7.39E-03 16.83% 0.01% 5.19E-282 Pass - 1459.03 CLMP Q9H6B4 IVW 2 0.2626, 0.6083 1.1120 (1.0495 1.1783) 3.25E-04 3.63E-02 10.09% 0.00% 1.41E-160 Pass...”
- Adaptation of the late ISC pathway in the anaerobic mitochondrial organelles of Giardia intestinalis
Motyčková, PLoS pathogens 2023 - “...alignment of the identified Gi BolA1 with the homologues from, Saccharomyces cerevisiae (Q3E793), Homo sapiens (Q9Y3E2), Plasmodium falciparum (Q8I3V0), Naegleria gruberi (D2V472) and Trypanosoma brucei (Q57YM0). BolA signature V/I/LHAL/I motif is highlighted. (D) Structure of Gi BolA1 as predicted by AlphaFold2 [ 75 ], predicted structure...”
- Proteomic Characterization of Acute Kidney Injury in Patients Hospitalized with SARS-CoV2 Infection.
Paranjpe, medRxiv : the preprint server for health sciences 2022 - “...Coefficient P Value Adjusted P Value Uniprot ID BolA-like protein 1 18.022031 1.3810 5 0.0008 Q9Y3E2 Cystatin-C 17.63355 3.4010 7 2.1410 5 P01034 Desmocollin-2 16.597692 1.6310 9 1.03 10 7 Q02487 Transmembrane emp24 domain-containing protein 10 15.819807 1.3610 8 8.5410 7 P49755 Ephrin-B1 15.167647 0.0002 0.01...”
- A Highly Conserved Iron-Sulfur Cluster Assembly Machinery between Humans and Amoeba Dictyostelium discoideum: The Characterization of Frataxin.
Olmos, International journal of molecular sciences 2020 - “...Q8IWL3 21.3/39 80/1.00 10 25 PRK05014 (co-chaperone HscB; Provisional) 136290/ 1.22 10 24 BolA1 DDB_G0274169/XP_644296.1 Q9Y3E2 36.8/50.7 74/1.00 10 31 pfam01722 (BolA, BolA-like protein; morphoprotein BolA from E. coli ) 1589/ 1.03 10 35 BolA1 DDB_G0290319/XP_635799.1 Q9Y3E2 32.9/50.7 72/2.00 10 24 pfam01722 (BolA-like protein; morphoprotein BolA...”
- NMR as a Tool to Investigate the Processes of Mitochondrial and Cytosolic Iron-Sulfur Cluster Biosynthesis.
Cai, Molecules (Basel, Switzerland) 2018 - “...Multiple mitochondrial dysfunction syndrome 4 (MMDS4) [ 79 ] [2Fe-2S], [4Fe-4S] [4Fe-4S] cluster assembly BOLA1 Q9Y3E2 Bol1 Iron sensing/[2Fe-2S] delivery BOLA3 Q53S33 Aim1 Multiple mitochondrial dysfunction syndrome 2 (MMDS2) [ 48 ] Fe-S delivery to specific recipients IBA57 Q5T440 Iba57 Multiple mitochondrial dysfunction syndrome 3 (MMDS3)...”
C0J56_08545 BolA family protein from Pseudomonas fluorescens
47% identity, 52% coverage
NE0913 BolA-like protein from Nitrosomonas europaea ATCC 19718
47% identity, 45% coverage
Q9D8S9 BolA-like protein 1 from Mus musculus
37% identity, 37% coverage
- Functional, proteomic and bioinformatic analyses of Nrf2- and Keap1- null skeletal muscle
Gao, The Journal of physiology 2020 - “...3.15 Nlrx1 Q3TL44 NLR family member X1 0.02 2.6 Fxyd1 Q9Z239 Phospholemman 0.02 2.64 Bola1 Q9D8S9 BolA-like protein 1 0.02 2.64 Rpl15 Q9CZM2 60S ribosomal protein L15 0.02 2.52 Hddc2 Q3SXD3 HD domain-containing protein 2 0.02 2.28 Smc3 Q9CW03 Structural maintenance of chromosomes protein 3 0.02...”
- Integrated analysis of proteome and transcriptome changes in the mucopolysaccharidosis type VII mouse hippocampus
Parente, Molecular genetics and metabolism 2016 - “...domain containing 2; predicted gene 13202; similar to coiled-coil-helix-coiled-coil-helix domain containing 2; predicted gene 12350 Q9D8S9 Bola1 bolA-like 1 (E. coli) Q9D8T7 Slirp RIKEN cDNA 1810035L17 gene Q9ERG2 Strn3 striatin, calmodulin binding protein 3 Q9ERU9 Ranbp2 RAN binding protein 2 Q9JKL4 Ndufaf3 NADH dehydrogenase (ubiquinone) 1...”
- “...Mrps36 0.028 -3.13 28S ribosomal protein S36, mitochondrial P28271 Aco1 0.020 -3.61 Cytoplasmic aconitate hydratase Q9D8S9 Bola1 BolA-like protein 1 Q9R1Z7 Pts 6-pyruvoyl tetrahydrobiopterin synthase P58059 Mrps21 28S ribosomal protein S21, mitochondrial Q9D1L0 Chchd2 Coiled-coil-helix-domain-containing protein 2, mitochondrial Q9CQV7 Dnajc19 Mitochondrial import inner membrane translocase subunit...”
- Identification of astrocyte secreted proteins with a combination of shotgun proteomics and bioinformatics
Dowell, Journal of proteome research 2009 - “...4 Q9QYB1 0.563 4 2 Sulfhydryl oxidase 1 Q8BND5 Yes 4 2 BolA-like protein 1 Q9D8S9 0.864 3 2 LIM and SH3 domain protein 1 Q61792 0.722 3 2 3'(2'),5'-bisphosphate nucleotidase 1 Q9Z0S1 0.652 3 2 Probable ATP-dependent RNA helicase DDX56 Q9D0R4 0.534 3 2 SH3...”
XAC2966 conserved hypothetical protein from Xanthomonas axonopodis pv. citri str. 306
40% identity, 69% coverage
- Proteome of the phytopathogen Xanthomonas citri subsp. citri: a global expression profile
Soares, Proteome science 2010 - “...-19 (see Additional file 4 ). Examples of proteins previously proposed as hypothetical are XAC3997, XAC2966, XAC3898 and XAC1756, which now are categorized as proteins with putative functions, such as ABC transporter permease, BolA superfamily transcriptional regulator, membrane-bound metalloendopeptidase and PhoH-like protein, respectively. ORFs with undefined...”
For advice on how to use these tools together, see
Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC
and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot,
BRENDA, CAZy (as made available by dbCAN), BioLiP, CharProtDB, MetaCyc,
EcoCyc, TCDB, REBASE, the Fitness Browser,
and a subset of the European Nucleotide Archive with the /experiment tag.
Given this database and a protein sequence query,
PaperBLAST uses protein-protein BLAST
to find similar sequences with E < 0.001.
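The E-value cutoff can be illustrated with a minimal sketch. It assumes the hits are available in BLAST+ tabular format (-outfmt 6, whose default columns end with evalue and bitscore); the function name and the sample rows are hypothetical and are not taken from PaperBLAST's code.

```python
from typing import NamedTuple

E_CUTOFF = 1e-3  # threshold stated above: E < 0.001


class Hit(NamedTuple):
    subject: str
    identity: float
    evalue: float


def filter_hits(tabular: str, cutoff: float = E_CUTOFF) -> list[Hit]:
    """Keep hits whose E-value is strictly below the cutoff.

    Assumes the default -outfmt 6 column order:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    hits = []
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        hit = Hit(subject=fields[1],
                  identity=float(fields[2]),
                  evalue=float(fields[10]))
        if hit.evalue < cutoff:
            hits.append(hit)
    return hits


# Hypothetical rows, for illustration only.
sample = "\n".join([
    "\t".join(["query", "hitA", "55.0", "76", "34", "0",
               "1", "76", "3", "78", "2e-25", "90.1"]),
    "\t".join(["query", "hitB", "30.0", "40", "28", "0",
               "5", "44", "9", "48", "0.5", "25.0"]),
])
significant = filter_hits(sample)
```

Here only hitA would pass, because 0.5 is not below 0.001.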
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
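The "locus_tag AND genus_name" query form can be sketched as follows. This is an illustration only: taking the genus as the first word of the organism name, and sending the query to Europe PMC's public REST search endpoint, are assumptions rather than a description of PaperBLAST's pipeline. The sketch builds the URL but does not send a request.

```python
from urllib.parse import urlencode

# Assumption: Europe PMC's public REST search endpoint.
EUROPEPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(locus_tag: str, organism: str) -> str:
    """Combine a locus tag with the genus, as in "locus_tag AND genus_name"."""
    genus = organism.split()[0]  # assumption: genus is the first word
    return f"{locus_tag} AND {genus}"


def build_url(locus_tag: str, organism: str) -> str:
    """Return a search URL for the query (no network access is performed)."""
    params = {"query": build_query(locus_tag, organism), "format": "json"}
    return f"{EUROPEPMC_SEARCH}?{urlencode(params)}"


query = build_query("ZMO1874", "Zymomonas mobilis subsp. mobilis ZM4")
url = build_url("ZMO1874", "Zymomonas mobilis subsp. mobilis ZM4")
```

Requiring the genus alongside the locus tag reduces false matches when the same string is used for something unrelated in another paper.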
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
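Snippet selection might look roughly like the sketch below. It is not PaperBLAST's actual heuristic: the window size, the whole-identifier matching rule, and the function name are all assumptions made for illustration.

```python
import re


def select_snippets(text: str, identifier: str,
                    max_snippets: int = 2,
                    context_words: int = 20) -> list[str]:
    """Return up to max_snippets word windows centred on the identifier."""
    words = text.split()
    # Match the identifier only when it is not part of a longer token,
    # so that ZMO1874 does not match ZMO18740.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(identifier) + r"(?![A-Za-z0-9_])"
    )
    snippets = []
    last_end = -1  # index of the last word already covered by a snippet
    for i, word in enumerate(words):
        if i > last_end and pattern.search(word):
            start = max(0, i - context_words)
            end = min(len(words), i + context_words + 1)
            snippets.append(" ".join(words[start:end]))
            last_end = end - 1
            if len(snippets) == max_snippets:
                break
    return snippets


example = select_snippets(
    "a b ZMO1874 c d ZMO18740 e (ZMO1874) f", "ZMO1874", context_words=1
)
```

Skipping mentions that fall inside an earlier window keeps the two snippets from overlapping.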
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a GeneRIF entry links
the gene to an article in PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of BioLiP, a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the
Protein Data Bank.
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether it is characterized.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transport Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Classification number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally-characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally-determined DNA binding sites are included from the
PRODORIC database of gene regulation in prokaryotes.
- Putative transcription factors from RegPrecise are included
if they have manually-curated predictions for their binding sites. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
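The ENA inclusion rules in the last item above can be sketched as a filter. The record layout and field names are hypothetical. The handling of the 25-protein limit is one possible reading: a CDS is kept only if its nucleotide entry is within the limit and at least one of its linked papers is within the limit.

```python
from collections import Counter
from dataclasses import dataclass, field

MAX_PROTEINS = 25  # entries or papers with more than 25 such proteins are excluded


@dataclass
class CdsFeature:
    # Hypothetical record layout, for illustration only.
    protein_id: str
    entry_id: str             # nucleotide entry containing the CDS
    data_class: str           # e.g. "STD"
    has_experiment_tag: bool  # True if the /experiment tag is set
    pubmed_ids: list[str] = field(default_factory=list)


def select_cds(features: list[CdsFeature]) -> list[CdsFeature]:
    """Apply the stated rules: /experiment tag, PubMed link, STD class, size limits."""
    candidates = [
        f for f in features
        if f.has_experiment_tag and f.pubmed_ids and f.data_class == "STD"
    ]
    # Count candidate proteins per nucleotide entry and per paper.
    per_entry = Counter(f.entry_id for f in candidates)
    per_paper = Counter(p for f in candidates for p in set(f.pubmed_ids))
    kept = []
    for f in candidates:
        if per_entry[f.entry_id] > MAX_PROTEINS:
            continue  # entry annotates too many proteins
        if not any(per_paper[p] <= MAX_PROTEINS for p in f.pubmed_ids):
            continue  # every linked paper covers too many proteins
        kept.append(f)
    return kept
```

The size limits act as a proxy for large-scale expression studies, where many genes are detected but few are functionally characterized.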
Except for GeneRIF and ENA,
the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
For more information see the PaperBLAST paper (mSystems 2017) or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubMed Central or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.
by Morgan Price, Arkin group, Lawrence Berkeley National Laboratory