PaperBLAST
PaperBLAST Hits for VIMSS7447876 hypothetical protein (119 a.a., MFRSLILAAV...)
>VIMSS7447876 hypothetical protein
MFRSLILAAVLLAAGPLVANAGEITLLPSVKLQIGDRDNYGNYWDGGSWRDRDYWRRHYE
WRDNRWHRHDNGWHKGWYKGRDKAWERGYRAGWNDRDDHRGGWGRGPGGRGHGHGHGHH
Found 11 similar proteins in the literature:
KP1_3995 hypothetical protein from Klebsiella pneumoniae NTUH-K2044
KPN_02742 hypothetical protein from Klebsiella pneumoniae subsp. pneumoniae MGH 78578
100% identity, 100% coverage
YpeC / b2390 DUF2502 domain-containing protein YpeC from Escherichia coli K-12 substr. MG1655 (see paper)
ECs3270 hypothetical protein from Escherichia coli O157:H7 str. Sakai
b2390 hypothetical protein from Escherichia coli str. K-12 substr. MG1655
Z3656 orf, hypothetical protein from Escherichia coli O157:H7 EDL933
74% identity, 97% coverage
STM2407 putative periplasmic protein from Salmonella typhimurium LT2
t0450 conserved hypothetical protein from Salmonella enterica subsp. enterica serovar Typhi Ty2
T_RS02280 DUF2502 domain-containing protein YpeC from Salmonella enterica subsp. enterica serovar Typhi str. Ty2
75% identity, 97% coverage
YaaX / b0005 DUF2502 domain-containing protein YaaX from Escherichia coli K-12 substr. MG1655 (see paper)
b0005 hypothetical protein from Escherichia coli str. K-12 substr. MG1655
63% identity, 57% coverage
- Global analysis of extracytoplasmic stress signaling in Escherichia coli
Bury-Moné, PLoS genetics 2009 - “...H wcaB H b2062-58 Colanic acid biosynthesis and secretion 1.6; -; 1.9; -; - yaaX b0005 Putative periplasmic protein protein 1.9 wcaC F H wcaD F H wcaE F H wcaF FH gmd F fcl F nudD F H wcaI FH cpsB H cpsG H wcaJ...”
- The BaeSR two-component regulatory system mediates resistance to condensed tannins in Escherichia coli
Zoetendal, Applied and environmental microbiology 2008 - “...15.3 27.6 22.5 b1532 b1530 marB marR 2.6 2.6 21.5 27.4 b0005 b2390 b0618 b3687 b3686 b3238 b2675 yaaX ypeC citC ibpA ibpB yhcN nrdE 4.9 2.8 34.6 6.9 9.5 3.1...”
- A functional update of the Escherichia coli K-12 genome
Serres, Genome biology 2001 - “...leader peptide b0050 ec_0078 apaG o Conserved protein b0081 ec_0123 mraZ o Conserved hypothetical protein b0005 ec_G0005 yaaX o Unknown CDS * Gene product type: c, carrier; e, enzyme; f, factor; h, extrachromosomal origin; l, leader peptide; m, membrane component; n, RNA; o, ORF of unknown...”
- Web-based visualization tools for bacterial genome alignments
Florea, Nucleic acids research 2000 - “...same, contiguous fragment from the other genome. For example, the gene b0005 (at ~5 kb in ECO) is simply deleted from STM. A blue vertical stripe at the end of...”
AAF13_RS18930 DUF2502 domain-containing protein from Escherichia coli O104:H4 str. C227-11
63% identity, 55% coverage
S0005 hypothetical protein from Shigella flexneri 2a str. 2457T
63% identity, 55% coverage
Z0005 orf, hypothetical protein from Escherichia coli O157:H7 EDL933
62% identity, 55% coverage
- Genome mining of novel rubiginones from Streptomyces sp. CB02414 and characterization of the post-PKS modification steps in rubiginone biosynthesis
Zhang, Microbial cell factories 2021 - “...by in-frame deletion [ 26 ], to generate mutant strains Z0004 (i.e., CB02414 rubN1 ), Z0005 (i.e., CB02414 rubN2 ), and Z0008 (i.e., CB02414 rubM4 ) (Additional file 1 : Fig. S53). The respective genes were cloned into the pSET152 plasmid under the ermE * promoter...”
- “...that RubN1 is responsible for the introduction of the -hydroxyl group at C-2. Similarly, the Z0005 mutant only produced compounds 3 , 6 , and 7 , which lack the C-4 hydroxyl group, and the production of the other five compounds (rubiginones J, K, A 2...”
- Disruption of rcsB by a duplicated sequence in a curli-producing Escherichia coli O157:H7 results in differential gene expression in relation to biofilm formation, stress responses and metabolism
Sharma, BMC microbiology 2017 - “...protein +2.83 0.04 Z3305 Bacteriophage CP-933V encoded protein 1.62 0.01 Z4330 Transposase +1.94 0.03 Hypothetical Z0005 Unknown 7.49 4.7E-08 yagU Z0353 Putative acid resistance 2.29 0.004 yagY Z0359 Predicted pilus chaperone (cryptic) 2.13 0.04 yagZ Z0360 Predicted pilus major subunit 2.83 0.001 yaiY Z0475 Predicted inner...”
YPTB2704 putative exported protein from Yersinia pseudotuberculosis IP 32953
YPO2981 putative exported protein from Yersinia pestis CO92
52% identity, 42% coverage
SMDB11_2817 DUF2502 domain-containing protein from Serratia marcescens subsp. marcescens Db11
40% identity, 55% coverage
- The Short-chain Fatty Acid Propionic Acid Activates the Rcs Stress Response System Partially through Inhibition of d-Alanine Racemase
Harshaw, mSphere 2023 - “...-gfp 73 pMQ747 pMQ713 with SMDB11_1637 ( osmB ) promoter-luxCDABE 29 pMQ748 pMQ713 with SMDB11_2817 promoter-luxCDABE 29 pMQ749 pMQ713 with umoD promoter luxCDABE 29 pMQ802 pMQ414 with pHLuorin2 replacing tdtomato , codon optimized for S. marcescens This study pKD4 Source of kanamycin resistance marker...”
- “...recently shown to activate Rcs in this strain ( 29 ). These promoters, namely, P SMDB11_2817 and P umoD , are also highly responsive to propionic acid in the WT background but are minimally or not activated in the rcsB mutant ( Fig.3 ). FIG3 Propionic...”
- Antibiotics Used in Empiric Treatment of Ocular Infections Trigger the Bacterial Rcs Stress Response System Independent of Antibiotic Susceptibility
Harshaw, Antibiotics (Basel, Switzerland) 2021 - “...vector ( Figure 2 A and Figure S1 ). The promoters were for the SMDB11_1637, SMDB11_2817, and SMDB11_1194 open reading frames. All of these previously uncharacterized open reading frames bear high similarity to Rcs-regulated genes in other bacteria. SMDB11_1637 is similar to osmotically inducible lipoprotein B...”
- “...in P. mirabilis [ 27 ], as is its ortholog YPO1624 in Y. pseudotuberculosis . SMDB11_2817 has similarity to yaaX from E. coli with the DUF2502 domain of unknown function and was identified as an RcsB-regulated gene in E. coli [ 25 ]. In addition, the...”
CU052_07910 penicillin-binding protein 1A from Vibrio harveyi
34% identity, 9% coverage
VP_RS13510 penicillin-binding protein 1A from Vibrio parahaemolyticus RIMD 2210633
VP2751 penicillin-binding protein 1A from Vibrio parahaemolyticus RIMD 2210633
33% identity, 9% coverage
- The Influence of Outer Membrane Protein on Ampicillin Resistance of Vibrio parahaemolyticus
Meng, The Canadian journal of infectious diseases & medical microbiology = Journal canadien des maladies infectieuses et de la microbiologie medicale 2023 - “...β-lactamase genes, PG synthesis-related genes, stress-regulation-related genes, and lipid A synthesis genes. VP_RS17515 expresses β-lactamase. VP_RS13510 , mrcB , mrdA , VP_RS02165 , VP_RS03450 , VP_RS22785 , dacB , VP_RS22200 , VP_RS09310 , VP_RS15980 were selected as PG synthesis-related genes through BLASTp in NCBI, and the...”
- “...-F CGTAAGCGATTTTCTGTGC RT- VP_RS11205 -R AAAGCGGCTGGGATTGG RT- VP_RS17515 -F GCTTGTCCGTTTGTGTATCCC RT- VP_RS17515 -R TGCTCAACTGTTAGTTACGCCTC RT- VP_RS13510 -F AATCATTGCTCGTTACCACAG RT- VP_RS13510 -R CCGACGTATAGGCTTTCTCTTC RT- mrcB -F GCGACAGAAGACCGAGAT RT- mrcB -R CGTTAAGGTACTGCCACCT RT- VP_RS02165 -F TCGCTTACCGTGCCATC RT- VP_RS02165 -R TTTTACATCCAGCATCACCAC RT- mrdA -F GTTTTGATGGGCTTGCTG RT- mrd -R CCACTTTGATGCGGTTGT...”
- Sensor histidine kinase is a β-lactam receptor and induces resistance to β-lactam antibiotics
Li, Proceedings of the National Academy of Sciences of the United States of America 2016 - “...penicillin binding protein 3, vp0545 encoding β-hexosaminidase, vp2751 encoding penicillin binding protein 1A, and vpa0477 encoding a class A β-lactamase. The...”
- Association of a D-alanyl-D-alanine carboxypeptidase gene with the formation of aberrantly shaped cells during the induction of viable but nonculturable Vibrio parahaemolyticus
Hung, Applied and environmental microbiology 2013 - “...VP0722 VP1385 VP1485 VP2369 VP2463 VP2468 VP2471 VP2497 VP2658 VP2751 VPA0517 VPA1649 a Gene lspA mraY tagE mepA rodA nlpC mltA ftsH dacB mrcB murA tagE Product...”
- “...expression at 0 h was determined by RT-qPCR. VP2497, VP2751, and VPA0517) in the genome of V. parahaemolyticus RIMD2210633 have been identified and selected in...”
For advice on how to use these tools together, see Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 789,361 different protein sequences to 1,256,019 scientific articles. Searches against EuropePMC were last performed on January 10, 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC and from manually curated information
from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by
dbCAN), BioLiP, CharProtDB, MetaCyc, EcoCyc, TCDB, REBASE, the Fitness
Browser, and a subset of the European Nucleotide Archive with the
/experiment tag.
Given this database and a protein sequence query, PaperBLAST uses
protein-protein BLAST to find similar sequences with E < 0.001.
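The E-value cutoff described above can be sketched as a filter over BLAST output. This is a minimal illustration, not PaperBLAST's own code: it assumes hits are available in the standard 12-column BLAST+ tabular format (outfmt 6), and the names `Hit` and `parse_hits` are hypothetical. Coverage here is the aligned fraction of the query, which is one plausible reading of the "coverage" figures shown in the hit list.

```python
from typing import List, NamedTuple

E_VALUE_CUTOFF = 0.001  # threshold stated in the text (E < 0.001)


class Hit(NamedTuple):
    """One retained BLAST hit (hypothetical record type)."""
    subject: str
    identity: float        # percent identity over the alignment
    query_coverage: float  # percent of the query covered by the alignment
    evalue: float


def parse_hits(tabular: str, query_length: int,
               max_evalue: float = E_VALUE_CUTOFF) -> List[Hit]:
    """Parse outfmt-6 rows and keep hits with E-value below the cutoff.

    Assumed columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    hits = []
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        identity = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        if evalue >= max_evalue:
            continue  # not significant enough
        coverage = 100.0 * (qend - qstart + 1) / query_length
        hits.append(Hit(fields[1], identity, coverage, evalue))
    return hits
```

For the 119-residue query on this page, a hit aligned over query positions 1 to 119 would be reported with 100% coverage under this definition.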
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
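As a rough sketch of the query form described above, the following builds a "locus_tag AND genus_name" string and a search URL for it. The function names are hypothetical, taking the genus as the first word of the organism name is an assumption, and the URL targets Europe PMC's public REST search endpoint, which may differ from what PaperBLAST's pipeline actually calls.

```python
from urllib.parse import urlencode

# Europe PMC's public REST search endpoint (assumed here for illustration).
EUROPEPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(locus_tag: str, organism: str) -> str:
    """Return a query of the form 'locus_tag AND genus_name'.

    Assumes the genus is the first whitespace-separated word of the
    organism name, e.g. 'Escherichia' for 'Escherichia coli K-12'.
    """
    genus = organism.split()[0]
    return f"{locus_tag} AND {genus}"


def build_search_url(locus_tag: str, organism: str) -> str:
    """Return a search URL for the query, requesting JSON results."""
    params = {"query": build_query(locus_tag, organism), "format": "json"}
    return f"{EUROPEPMC_SEARCH}?{urlencode(params)}"
```

Requiring the genus alongside the locus tag is what filters out coincidental matches, such as the Z0005 hits above that refer to a Streptomyces mutant strain rather than the E. coli gene.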
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a GeneRIF entry links
the gene to an article in PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of BioLiP, a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the
Protein Data Bank (PDB).
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether it is characterized.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transport Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Commission (EC) number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally determined DNA binding sites from the
PRODORIC database of gene regulation in prokaryotes are included.
- Putative transcription factors from RegPrecise
that have manually curated predictions for their binding sites are included. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
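The ENA rule above, which drops nucleotide entries or papers linked to more than 25 proteins, can be sketched as follows. This is one interpretation rather than PaperBLAST's actual implementation: the record type `CdsFeature` and the function `filter_cds` are hypothetical, and the sketch assumes that a protein is kept only if its entry passes the limit and at least one of its linked papers also passes it.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Set

MAX_PROTEINS = 25  # limit stated in the text


class CdsFeature(NamedTuple):
    """A CDS feature with the /experiment tag (hypothetical record type)."""
    protein_id: str
    entry_id: str          # nucleotide entry the CDS belongs to
    pubmed_ids: Set[str]   # papers linked from that entry


def filter_cds(features: List[CdsFeature],
               max_proteins: int = MAX_PROTEINS) -> Dict[str, Set[str]]:
    """Map each retained protein to its retained papers.

    Entries and papers associated with more than max_proteins proteins
    are treated as bulk annotation and excluded.
    """
    per_entry = Counter(f.entry_id for f in features)
    per_paper = Counter(p for f in features for p in f.pubmed_ids)
    kept: Dict[str, Set[str]] = {}
    for f in features:
        if per_entry[f.entry_id] > max_proteins:
            continue  # entry annotates too many proteins
        papers = {p for p in f.pubmed_ids if per_paper[p] <= max_proteins}
        if papers:
            kept[f.protein_id] = papers
    return kept
```

The intent of such a filter is to separate targeted functional studies from large surveys in which expression was detected for many genes but function was not examined.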
Except for GeneRIF and ENA, the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
For more information, see the PaperBLAST paper (mSystems 2017) or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubMed Central or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al. (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al. (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al. (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al. (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al. (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al. (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al. (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al. (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al. (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al. (2016). bioRxiv, 10.1101/072470.
by Morgan Price, Arkin group, Lawrence Berkeley National Laboratory