PaperBLAST
PaperBLAST Hits for MPMX20_02743 (79 a.a., MTKTSVRIGA...)
>MPMX20_02743
MTKTSVRIGAFEIDDAELRGEAQGERTLSIPCKSDPDLCMQLDAWDADTSVPAILDGEHS
VLYREHYDSQSDAWVMRLA
Found 9 similar proteins in the literature:
KPN_02352 hypothetical protein from Klebsiella pneumoniae subsp. pneumoniae MGH 78578
91% identity, 100% coverage
STY1982 conserved hypothetical protein from Salmonella enterica subsp. enterica serovar Typhi str. CT18
86% identity, 100% coverage
- Transcriptomic study of Salmonella enterica subspecies enterica serovar Typhi biofilm
Chin, BMC genomics 2017 - “...yaiA Hypothetical protein 2.48198 5.00E-05 0.001837 hupA Histone like DNA-binding protein HU-alpha 2.59176 5.00E-05 0.001837 STY1982 Hypothetical protein 2.61628 5.00E-05 0.001837 STY1323 Hypothetical protein 2.67081 5.00E-05 0.001837 ompC Outer membrane protein C 2.75117 5.00E-05 0.001837 lppA Major outer membrane lipoprotein 2.93078 5.00E-05 0.001837 STY1938 Hypothetical protein...”
SEN1186 hypothetical protein from Salmonella enterica subsp. enterica serovar Enteritidis str. P125109
STM1851 putative cytoplasmic protein from Salmonella typhimurium LT2
86% identity, 100% coverage
- Rapid identification of novel antigens of Salmonella Enteritidis by microarray-based immunoscreening
Danckert, Mikrochimica acta 2014 - “...a highly conserved DNA primase that shows homology in all bacteria. Moreover, two hypothetical proteins (SEN1186 and SEN2464) were detected with no known function rendering these candidates highly attractive for further investigations. SEN2464 is additionally described as a methionine tRNA cytidine acetyltransferase providing precise recognition of...”
- “...as a virulence factor contributing to enteric infection [ 29 ]. Last but not least, SEN1186 is a DNA mismatch endonuclease located in the cytoplasm. Table 1 List of all identified immunogenic proteins from cDNA expression library screening. The candidates are listed according to their locus...”
- Stress response, amino acid biosynthesis and pathogenesis genes expressed in Salmonella enterica colonizing tomato shoot and root surfaces
Han, Heliyon 2020 - “...Stress-induced bacterial acidophilic repeat motif 3.3 0.002 STM1808 STM1808 Putative cytoplasmic protein; NsrR regulon 0.008 STM1851 STM1851 hypothetical protein 1.3 0.003 1.6 0.003 STM05615 STM05615 hypothetical protein 2.2 0.002 2.0 0.003 STM4271 STM4271 murein hydrolase regulator LrgA 1.4 0.006 2.2 0.003 STM4552 STM4552 putative inner membrane...”
- A connecter-like factor, CacA, links RssB/RpoS and the CpxR/CpxA two-component system in Salmonella
Kato, BMC microbiology 2012 - “...no homology to any protein of known function, as well as the 3 region of STM1851 and the 5 region of pphA (Figure 1B ). Expression of STM1852 from tetracycline- (Figure 1C ) or L-arabinose- (Figure 1D ) inducible promoters recapitulated the increase in the -galactosidase...”
- “...the PcacA-lac 1 strain contains a DNA fragment encompassing the 3 region (80 bp) of STM1851 and the intergenic region (110 bp) between STM1851 and cacA , whereas the PcacA-lac 2 strain harbors only the intergenic region (110 bp) between STM1851 and cacA preceding the lacZ...”
- Genomic profiling of iron-responsive genes in Salmonella enterica serovar typhimurium by high-throughput screening of a random promoter library
Bjarnason, Journal of bacteriology 2003 - “...12, 2017 by University of California, Berkeley yfcZ STM1851 yhcO STM4448 Function(s) 4982 BJARNASON ET AL. 31. Masse, E., N. Majdalani, and S. Gottesman....”
SENTW_1340 YebV family protein from Salmonella enterica subsp. enterica serovar Weltevreden str.
87% identity, 97% coverage
SL1344_1780, STM14_2239 YebV family protein from Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S
87% identity, 97% coverage
- The Crystal Structure of the Domain of Unknown Function 1480 (DUF1480) From Klebsiella pneumoniae
Patel, Proteins 2025 (no snippet)
- Fatty Acid Homeostasis Tunes Flagellar Motility by Activating Phase 2 Flagellin Expression, Contributing to Salmonella Gut Colonization
Hoshino, Infection and immunity 2022 (secret)
- Proteome remodelling by the stress sigma factor RpoS/σS in Salmonella: identification of small proteins and evidence for post-transcriptional regulation
Lago, Scientific reports 2017 - “...of identified homologs in other bacterial genomes, prompted us to re-annotate the start codon of STM14_2239, STM14_2409, and STM14_5481 (Table 1 , Supplementary Fig. S5 , Supplementary Dataset S1 ). The uncharacterized ORFs ymdF and STM14_1829 are paralogous to yciG (Table 1 , Supplementary Fig. S6...”
- “...3TM fragments 84 ymgE 1114 Bacteria and Archaea STM14_2188 45 Yes Enterobacteriaceae STM14_2189 32 Salmonella STM14_2239 Start codon re-annotated, 88% identity with E . coli YebV, DUF1480 IPR009950 78 Yes yebV 13, 14 Enterobacteriaceae STM14_2405 DUF2525 IPR019669 75 Yes yodD 1114 Enterobacteriaceae STM14_2409 Start codon re-annotated,...”
- A Highly Effective Component Vaccine against Nontyphoidal Salmonella enterica Infections
Ferreira, mBio 2015 - “...and lysed in buffer A (20mM Tris, 500mM NaCl, 2mM MgCl 2 ; pH8) for SL1344_1780 and SL1344_2251 or buffer B (40mM Tris, 150mM NaCl, 3mM MgCl 2 , 10mM imidazole, 0.02% NaN 3 ; pH8) for all proteins except for SL1780 and SL2251. Buffers A...”
ECs2546 hypothetical protein from Escherichia coli O157:H7 str. Sakai
82% identity, 93% coverage
YebV / b1836 DUF1480 domain-containing protein YebV from Escherichia coli K-12 substr. MG1655 (see paper)
b1836 orf, hypothetical protein from Escherichia coli str. K-12 substr. MG1655
DR76_3007 YebV family protein from Escherichia coli ATCC 25922
82% identity, 97% coverage
- 18th Congress of the European Hematology Association, Stockholm, Sweden, June 13–16, 2013
Haematologica 2013
- The HU regulon is composed of genes responding to anaerobiosis, acid stress, high osmolarity and SOS induction
Oberto, PloS one 2009 - “...1.06 1 1.25 3.82 0.89 1 1.37 1.44 0.52 a activator of ntrL gene yebV b1836 yebV 1 0.81 1.17 1.7 1 1.1 1.35 1.73 1 2 1.93 0.64 a, h hypothetical protein otsA b1896 otsBA 1 2.53 3.08 0.75 1 1.34 1.81 0.25 1 1.71...”
- Parallel adaptive evolution cultures of Escherichia coli lead to convergent growth phenotypes with different gene expression states
Fong, Genome research 2005 - “...n b1045, b1471, b1784, b2086, b2327, b3293, b1141, b1566, b1836, b2263, b2742, b4078, metU, alaW, thrV, flgN, wrbA, sodC, otsB, yehV, yjhH, flgM, flgD, ycfN,...”
- Gene expression profiling of the pH response in Escherichia coli
Tucker, Journal of bacteriology 2002 - “...cbpA b3510 hdeAc b0329 yahO b1003 yccJ b1597 asrc b1724 b1836 b3491 b1004 b2885 b4186 b0485 b0486 b0897 b1493 b3517 b3507 b3512 b3515 c ydiZ yebV yhiMc wrbA...”
- Coupling next-generation sequencing to dominant positive screens for finding antibiotic cellular targets and resistance mechanisms in Escherichia coli
Gingras, Microbial genomics 2018 - “...the lipoprotein NplE (which was also highlighted by the CRO screen) and a hypothetical protein (DR76_3007), YebV (Fig. S2). The product of the yebV gene has the Pfam motif DUF1480 of unknown function and is part of a family of enterobacterial proteins of about 80 amino...”
- “...DR76_2706 DR76_2709 2894751..2896439 DR76_2709 nlpE Lipoprotein NlpE 2 1 GEN 1 7277 DR76_3002 DR76_3009 3214843..3222301 DR76_3007 yebV Hypothetical protein 2 4 2 6206 DR76_2705 DR76_2716 2894138..2902771 DR76_2709 nlpE Lipoprotein NlpE 2 1 LEV 1 330375 DR76_2505 DR76_2509 2666131..2669603 DR76_2506 rob Right origin-binding protein 2 2 2...”
YPTB2387 hypothetical protein from Yersinia pseudotuberculosis IP 32953
YPO1694 conserved hypothetical protein from Yersinia pestis CO92
63% identity, 100% coverage
PMI_RS04910 DUF1480 family protein from Proteus mirabilis HI4320
PMI1011 hypothetical protein from Proteus mirabilis HI4320
40% identity, 99% coverage
For advice on how to use these tools together, see
Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC
and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot,
BRENDA,
CAZy (as made available by dbCAN),
BioLiP,
CharProtDB,
MetaCyc,
EcoCyc,
TCDB,
REBASE,
the Fitness Browser,
and a subset of the European Nucleotide Archive with the /experiment tag.
Given this database and a protein sequence query,
PaperBLAST uses protein-protein BLAST
to find similar sequences with E < 0.001.
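The E-value cutoff can be illustrated with a minimal sketch that filters tabular BLAST output. It assumes the hits are in the standard 12-column tabular format of NCBI BLAST+ (`-outfmt 6`), where the eleventh column is the E-value. The function name and the sample rows (including their values) are hypothetical and are not taken from PaperBLAST's code.

```python
E_VALUE_CUTOFF = 0.001  # threshold stated above: hits must have E < 0.001


def filter_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Keep (subject_id, percent_identity, evalue) for hits with E < cutoff."""
    kept = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]               # column 2: subject sequence ID
        percent_identity = float(fields[2])  # column 3: percent identity
        evalue = float(fields[10])           # column 11: E-value
        if evalue < cutoff:
            kept.append((subject_id, percent_identity, evalue))
    return kept


# Hypothetical rows, invented for illustration only.
example = [
    "query1\thitA\t91.0\t79\t7\t0\t1\t79\t1\t79\t2e-45\t150",
    "query1\thitB\t30.0\t40\t28\t0\t5\t44\t10\t49\t0.5\t25.0",
]
print(filter_hits(example))
```

Only the first row passes, because its E-value is below 0.001.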
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
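As a rough sketch of the "locus_tag AND genus_name" query form, the following builds such a query and a search URL for it. The endpoint shown is the public Europe PMC REST search service; whether PaperBLAST uses this endpoint, and whether it quotes the terms, are assumptions made here for illustration. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint (assumed here; PaperBLAST's own code
# may query EuropePMC differently).
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(locus_tag, genus_name):
    """Combine identifier and genus in the 'locus_tag AND genus_name' form.

    Quoting each term is an assumption, not something the text specifies.
    """
    return f'"{locus_tag}" AND "{genus_name}"'


def build_search_url(locus_tag, genus_name):
    """Return a URL-encoded search URL; no network request is made."""
    params = {"query": build_query(locus_tag, genus_name), "format": "json"}
    return SEARCH_URL + "?" + urlencode(params)


print(build_query("STM1851", "Salmonella"))
print(build_search_url("STM1851", "Salmonella"))
```

Requiring the genus name alongside the locus tag reduces false matches when the same string appears in an unrelated context.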
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
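Snippet selection can be sketched as taking a window of text around each mention of the identifier. This is only an illustration of the idea: the window size, the whole-word matching rule, and the function name are assumptions, and PaperBLAST's actual heuristics may differ.

```python
import re


def select_snippets(text, identifier, max_snippets=2, context=100):
    """Return up to max_snippets windows of text around whole-word matches."""
    # Whole-word match, so that "STM1851" does not match inside "STM18510".
    pattern = re.compile(r"(?<!\w)" + re.escape(identifier) + r"(?!\w)")
    snippets = []
    last_end = -1
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        if start < last_end:
            continue  # skip mentions that overlap the previous snippet
        snippets.append("..." + text[start:end].strip() + "...")
        last_end = end
        if len(snippets) == max_snippets:
            break
    return snippets


# Hypothetical article text, invented for illustration only.
article = (
    "The gene STM1851 is upstream of cacA. "
    + "x " * 200
    + "Expression of STM1851 was induced."
)
for snippet in select_snippets(article, "STM1851", context=20):
    print(snippet)
```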
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a
GeneRIF
entry links the gene to an article in
PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of
BioLiP,
a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the
Protein Data Bank.
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether they are characterized or not.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transporter Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Classification number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally-characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally-determined DNA binding sites from the
PRODORIC database of gene regulation in prokaryotes are included.
- Putative transcription factors from RegPrecise
are included if they have manually-curated predictions for their binding sites. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
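The ENA size filter described in the last item above can be sketched as follows. This is one possible reading of the rule, not PaperBLAST's code: it assumes the /experiment and STD data class checks have already been applied, and the record keys (`protein`, `entry`, `papers`) and function name are hypothetical.

```python
from collections import Counter

MAX_PROTEINS = 25  # entries or papers with more proteins than this are excluded


def filter_cds(records, max_proteins=MAX_PROTEINS):
    """Drop proteins from oversized nucleotide entries or oversized papers.

    records: iterable of dicts with hypothetical keys 'protein',
    'entry' (nucleotide entry ID) and 'papers' (list of PubMed IDs).
    """
    records = list(records)
    per_entry = Counter(r["entry"] for r in records)
    per_paper = Counter(p for r in records for p in set(r["papers"]))
    kept = []
    for r in records:
        if per_entry[r["entry"]] > max_proteins:
            continue  # the nucleotide entry has too many such proteins
        # Keep only papers that are not linked to too many proteins.
        papers = [p for p in r["papers"] if per_paper[p] <= max_proteins]
        if papers:  # the protein must still link to at least one paper
            kept.append({**r, "papers": papers})
    return kept


# Hypothetical data: one entry with 26 proteins and two small entries.
big = [{"protein": f"p{i}", "entry": "E1", "papers": ["P1"]} for i in range(26)]
small = [
    {"protein": "q1", "entry": "E2", "papers": ["P2"]},
    {"protein": "q2", "entry": "E3", "papers": ["P1", "P3"]},
]
print([r["protein"] for r in filter_cds(big + small)])
```

In this example, all 26 proteins from entry E1 are dropped, and paper P1 is removed from q2 because it is linked to more than 25 proteins.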
Except for GeneRIF and ENA,
the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
For more information see the
PaperBLAST paper (mSystems 2017)
or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubMed Central or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.
by Morgan Price,
Arkin group
Lawrence Berkeley National Laboratory