PaperBLAST
PaperBLAST Hits for SO0334 (70 a.a., MRVFPVYAPK...)
>SO0334
MRVFPVYAPKLIVKHARIFLTGVIWVKDLGRLEFEKGRFLLPRKSLPKVKQAILELNELI
EAQNHQTKTA
Found 6 similar proteins in the literature:
VP2585 conserved hypothetical protein from Vibrio parahaemolyticus RIMD 2210633
37% identity, 86% coverage
M892_13365 DUF1107 domain-containing protein from Vibrio campbellii ATCC BAA-1116
37% identity, 86% coverage
Q9KPA0 DUF1107 domain-containing protein from Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961)
VC2473 conserved hypothetical protein from Vibrio cholerae O1 biovar eltor str. N16961
35% identity, 78% coverage
- Comprehensive in silico analyses of fifty-one uncharacterized proteins from Vibrio cholerae
Basu, PloS one 2024 - “...Q9KPZ1 Q9KL73 Q9KNG0 Q9K2J6 Q9KNG0 Q9KPZ1 Q9KS64 Q9KPZ1 Q9KNI6 Q9KN40 Q9K9I6 Q9KVT0 Q9KVW5 Q9KVT0 Q9KL81 Q9KPA0 Q9KL73 Q9KNG0 Q9KSJ4 Q9KPZ1 Q9KNI6 Q9KVT0 Q9KST0 Based on antigenicity, allergenicity, toxicity B-cell and T-cell epitope prediction, there are 20 proteins (potential vaccine candidates) that are predicted to elicit immunogenic...”
- cAMP Receptor Protein Controls Vibrio cholerae Gene Expression in Response to Host Colonization
Manneh-Roussel, mBio 2018 - “...3.7E03 GATGAATTTATTCATC VC2390 Yes 2610382 2610307.5 9.8E03 GCTGATTCGCGTCTTG VC2435<> tolC No 2653838 2653780.5 7.5E04 CGCGAGTCTCTTCAAA VC2473 Yes 2667326 2667406.5 8.8E03 TAATATTCACGTCAAA VC2486 No 2699390 2699329.5 1.5E03 GGTGATGGTCGCCACT pyrB No 2743349 2743361.5 8.1E04 ATCGCGTCACATCACA VC2561<> cpdB No 2787939 2787903.5 3.0E04 TGAGATAAACCCCACA VC2618 Yes 2845246 2845280.5 5.7E07 TGTGATTTTCATCACG...”
- The LonA Protease Regulates Biofilm Formation, Motility, Virulence, and the Type VI Secretion System in Vibrio cholerae
Rogers, Journal of bacteriology 2016 - “...by cellular role category VC1191 VC1262 VC1510 VC2473 VCA0883 VCA1065 a Fold change (lon mutant/WT) Hypothetical protein Hypothetical protein Hypothetical...”
- A Cytosine Methyltransferase Modulates the Cell Envelope Stress Response in the Cholera Pathogen [corrected]
Chao, PLoS genetics 2015 - “...of vc2437 leads to increased accumulation of E , we hypothesize that increased levels of Vc2473 in the vchM mutant will reduce stimuli activating the E stress response, as outlined in our model ( Fig 5D ). In E . coli , two signals are required...”
- The transcriptional regulator, CosR, controls compatible solute biosynthesis and transport, motility and biofilm formation in Vibrio cholerae
Shikuma, Environmental microbiology 2013 - “...protein 0.30 VCA0849 Hypothetical protein 2.90 VCA0973 Hypothetical protein 0.36 VC1645 Conserved hypothetical protein 0.47 VC2473 Conserved hypothetical protein 0.50 Regulatory Functions VC2692 cpxR Transcriptional regulator CpxR 2.01 Transport and Binding Proteins VC0173 peptide ABC transporter permease protein 3.45 VC1092 oppB oligopeptide ABC transporter permease protein...”
- Identification of the TcpP-binding site in the toxT promoter of Vibrio cholerae and the role of ToxR in TcpP-mediated activation
Goss, Infection and immunity 2010 - “...the hypothetical open reading frames (ORFs) VC2472 and VC2473. Microarray analysis of gene expression in V. cholerae N16961 grown under AKI (virulence-inducing)...”
SENTW_4510, SL1344_4339, STM14_5292 DUF1107 domain-containing protein from Salmonella enterica subsp. enterica serovar Typhimurium str. SL1344
STM4406.S putative cytoplasmic protein from Salmonella typhimurium LT2
37% identity, 61% coverage
- Structure-based analyses of Salmonella RcsB variants unravel new features of the Rcs regulon
Huesa, Nucleic acids research 2021 - “...1.42 437145 SL1344_0379 psiF gene 5UTR + 57 437138 7 to 21 TGGGATATTTCCCA 1.46 4663803 SL1344_4339 ytfK gene promoter + 165 4663853 50 to 36 TTAGATAATTCTGA 1.72 1601764 SL1344_1493 osmC gene promoter 243 1601743 21 to 35 TGAGATTTATCCTA 2.67 2968595 SL1344_2780 gene promoter 161 2968491 104...”
- Proteome remodelling by the stress sigma factor RpoS/σS in Salmonella: identification of small proteins and evidence for post-transcriptional regulation
Lago, Scientific reports 2017 - “...Enterobacteriaceae STM14_5097 CsbD like IPR008462, pdb1RYK 70 Yes yjbJ 7, 1114 Bacteria, Archaea and Eukaryota STM14_5292 DUF1107 IPR009491 68 ytfK 7, 1214 - Proteobacteria STM14_5469 65% identity with E . coli YjjZ, 3TM fragments, DUF1435 IPR009885 78 yjjZ 13, 14 Enterobacteriaceae STM14_5479 Not translated under the...”
- “...of the uncharacterized small ORFs, for which no protein was identified by LC-MS (STM14_2173, STM14_2189, STM14_5292, STM14_5469, STM14_5479, STM14_5481), was assessed by immunodetection of the corresponding 3xFlag-tagged proteins. Proteins were detected for all of them, except STM14_5469 and STM14_5479 (Fig. 1b ). In the case of...”
- Transcriptional profile of Salmonella enterica subsp. enterica serovar Weltevreden during alfalfa sprout colonization
Brankatschk, Microbial biotechnology 2014 - “...SENTW_0580 4.55 Putative aminotransferase yjgF SENTW_4564 7.03 Protein TdcF yiaL SENTW_3768 7.86 Protein YiaL ytfK SENTW_4510 8.24 Uncharacterized protein YtfK yhcN3 SENTW_3490 5.82 Protein YdgH ygaU SENTW_2883 5.51 Uncharacterized protein YgaU yggE SENTW_3181 6.70 Uncharacterized protein YggE yfeK SENTW_2617 5.38 Uncharacterized protein YfeK yjfN SENTW_4481 6.70...”
- Identification of HilD-regulated genes in Salmonella enterica serovar Typhimurium
Petrone, Journal of bacteriology 2014 - “...(STM14_5117) STM14_5184 rtsA (STM14_5188)/STM14_5189 ytfJ (STM14_5290)/STM14_5291/ytfK (STM14_5292) a SPI-1 regions are shaded in gray. Position (bp) in the...”
- “...described previously: STM14_1282, STM14_2342, lpxR (STM14_1612), and ytfK (STM14_5292). HilC and RtsA are homologues of HilD with 62% and 61% identity with...”
- Deep sequencing analysis of small noncoding RNA and mRNA targets of the global post-transcriptional regulator, Hfq
Sittka, PLoS genetics 2008 - “...198 4.3 needle complex inner membrane lipoprotein STM2884 sipC 96 192 2.0 translocation machinery component STM4406.S ytfK 6 191 31.8 putative cytoplasmic protein STM2867 hilC 3 187 62.3 invasion regulatory protein STM2869 orgB 8 182 22.8 needle complex export protein STM2878 sptP 20 177 8.9 protein...”
YtfK / b4217 stringent response modulator YtfK from Escherichia coli K-12 substr. MG1655 (see 3 papers)
ECK4213, NP_418638 stringent response modulator YtfK from Escherichia coli str. K-12 substr. MG1655
NP_313222 hypothetical protein from Escherichia coli O157:H7 str. Sakai
b4217 orf, hypothetical protein from Escherichia coli str. K-12 substr. MG1655
37% identity, 61% coverage
- YtfK activates the stringent response by triggering the alarmone synthetase SpoT in Escherichia coli.
Germain, Nature communications 2019 - GeneRIF: Study shows that the protein YtfK promotes SpoT-dependent accumulation of (p)ppGpp in E. coli and is required for activation of the stringent response during phosphate and fatty acid starvation. Results indicate that YtfK can interact with SpoT and propose that YtfK activates the stringent response by tilting the catalytic balance of SpoT toward (p)ppGpp synthesis.
- Involvement of the ytfK gene from the PhoB regulon in stationary-phase H2O2 stress tolerance in Escherichia coli.
Iwadate, Microbiology (Reading, England) 2017 (PubMed) - GeneRIF: ytfK disruption results in reduced viability of stationary-phase cells under phosphate starvation.
- Global regulation by the seven-component Pi signaling system
Hsieh, Current opinion in microbiology 2010 - “...subunit, membrane component [ 1 ] yibD ECK3605 predicted glycosyl transferase [ 7 ] ytfK ECK4213 conserved protein [ 7 ] a ECK numbers are in accordance with Riley et al. [ 15 ] b Product descriptions are in accordance with Riley et al. [ 15...”
- Identification of PhoB binding sites of the yibD and ytfK promoter regions in Escherichia coli.
Yoshida, Journal of microbiology (Seoul, Korea) 2011 (PubMed) - GeneRIF: The authors determined the binding regions of PhoB in the promoter regions of yibD and ytfK by DNase I footprinting.
- NtrBC and Nac contribute to efficient Shigella flexneri intracellular replication
Waddell, Journal of bacteriology 2014 - “...Conserved protein b1407 b1418 b1422 b1446 b1447 b1450 b1847 b4217 P value 0.04 6.15E07 0.12 0.0007 0.16 0.26 0.0052 0.0308 0.11 0.24 0.17 0.0007 0.0371 0.0050...”
- “...upregulated and expressed at optimal levels. Only one gene, b4217 or ytfK, was overexpressed in the nac mutant relative to the wild type. ytfK encodes a...”
- The HU regulon is composed of genes responding to anaerobiosis, acid stress, high osmolarity and SOS induction
Oberto, PloS one 2009 - “...1.1 1.67 1 1.94 1.32 1.73 1 1.52 1.06 0.47 a, b hypothetical protein ytfK b4217 ytfK 1 0.36 0.81 0.93 1 0.62 0.66 0.36 1 1.45 1.84 0.46 a hypothetical protein osmY b4376 osmY 1 1.1 1.41 2.55 1 0.6 0.67 1.78 1 1.93 1.54...”
- Autoinducer 2 controls biofilm formation in Escherichia coli through a novel motility quorum-sensing regulator (MqsR, B3022)
González, Journal of bacteriology 2006 - “...yidS ybbY yciF yahO yjcH ydcV b3717 b1511 b2252 b3143 b4217 b1160 b1257 b3690 b0513 b1258 b0329 b4068 b1443 yncG spf yjcO ymgB ymgC yeaQ amyA gatC b1454 b3864...”
- SigmaS-dependent gene expression at the onset of stationary phase in Escherichia coli: function of sigmaS-dependent genes and identification of their promoter sequences
Lacour, Journal of bacteriology 2004 - “...(membrane) protein (b1582) Hypothetical (periplasmic) protein (b3097) Hypothetical protein (b4217) CHM CHM CHM CHM NM CB MF (MF) CD, Lrp, DHCP MMC, GadX Lrp PQ,...”
- DNA microarray-mediated transcriptional profiling of the Escherichia coli response to hydrogen peroxide
Zheng, Journal of bacteriology 2001 - “...ygaQ yaiA yceP glgS ydcH tnaL b3708 b4326 b2414 b4322 b2365 b4217 b2616 b4062 b2366 b3924 b1166 b2012 b2654 b0389 b1060 b3049 b1426 b3707 30 29 25 23 22 20 20...”
- Genome-wide transcriptional profiling of the Escherichia coli responses to superoxide stress and sodium salicylate
Pomposiello, Journal of bacteriology 2001 - “...b3321 b3316 b0724 b3908 b4062 b0729 b3708 b0850 b2523 b3520 b4217 b1852 acrA ahpC aldA artI artP b0710 b1378 b1452 b2351 b2962 cadA cadC ccmD cyoD cysD cysK...”
c5315 Hypothetical protein ytfK from Escherichia coli CFT073
37% identity, 48% coverage
- Logic Synthesis of Recombinase-Based Genetic Circuits
Chiu, Scientific reports 2017 - “...19 400 1002 9 c3540 50/22 490 1179 (223) 566 1473 36 553 1649 14 c5315 178/123 581 1726 (313) 942 2202 25 908 2333 12 c6288 32/32 32 2384 1825 3709 89 1502 3995 38 c7552 207/108 876 2636 (534) 1149 2496 59 1084 2754...”
For advice on how to use these tools together, see
Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC
and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot,
BRENDA,
CAZy (as made available by dbCAN),
BioLiP,
CharProtDB,
MetaCyc,
EcoCyc,
TCDB,
REBASE,
the Fitness Browser,
and a subset of the European Nucleotide Archive with the /experiment tag.
Given this database and a protein sequence query,
PaperBLAST uses protein-protein BLAST
to find similar sequences with E < 0.001.
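As a rough illustration of the E-value cutoff, the sketch below filters BLAST tabular output (the standard 12-column `-outfmt 6` layout) and keeps hits with E < 0.001. The names `Hit`, `parse_hits`, and `query_coverage` are hypothetical, and the coverage formula (aligned query span divided by query length) is an assumption, not necessarily how PaperBLAST computes the coverage it reports.

```python
from typing import Iterable, NamedTuple

E_VALUE_CUTOFF = 1e-3  # the E < 0.001 threshold described above


class Hit(NamedTuple):
    subject: str
    percent_identity: float
    query_start: int
    query_end: int
    e_value: float


def parse_hits(lines: Iterable[str]) -> list[Hit]:
    """Parse BLAST tabular (-outfmt 6) lines and keep hits below the cutoff."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hit = Hit(
            subject=fields[1],
            percent_identity=float(fields[2]),
            query_start=int(fields[6]),
            query_end=int(fields[7]),
            e_value=float(fields[10]),
        )
        if hit.e_value < E_VALUE_CUTOFF:
            hits.append(hit)
    return hits


def query_coverage(hit: Hit, query_length: int) -> float:
    """Percent of the query spanned by the alignment (assumed definition)."""
    return 100.0 * (hit.query_end - hit.query_start + 1) / query_length
```

In practice the input lines would come from a BLAST run against the PaperBLAST database; any table parsed this way should be checked against the column order of the BLAST invocation that produced it.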
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
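The query form described above can be sketched as follows. This is a minimal illustration, not PaperBLAST's code: the helper names `genus_of` and `build_query` are hypothetical, and taking the first word of the organism name as the genus is an assumption that will not hold for every organism name.

```python
def genus_of(organism: str) -> str:
    """Return the first word of an organism name (assumed to be the genus)."""
    parts = organism.split()
    if not parts:
        raise ValueError("empty organism name")
    return parts[0]


def build_query(identifier: str, organism: str) -> str:
    """Build a text query of the form 'locus_tag AND genus_name'."""
    return f"{identifier} AND {genus_of(organism)}"


# Example using a locus tag and organism name from the hit list above.
query = build_query("VC2473", "Vibrio cholerae O1 biovar eltor str. N16961")
```

Pairing the identifier with the genus is what filters out papers that happen to contain the same string in an unrelated context.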
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
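One simple way to pick such snippets is to take a window of words around each mention of the identifier. The sketch below is only an assumption about how this could be done; PaperBLAST's actual heuristics are not described here, and `select_snippets`, `context_words`, and `max_snippets` are hypothetical names.

```python
import re


def select_snippets(
    text: str,
    identifier: str,
    context_words: int = 10,
    max_snippets: int = 2,
) -> list[str]:
    """Return up to max_snippets word windows centered on the identifier."""
    words = text.split()
    # Match the identifier as a whole token, so VC2473 does not match VC24730.
    pattern = re.compile(
        rf"(?<![A-Za-z0-9_]){re.escape(identifier)}(?![A-Za-z0-9_])"
    )
    snippets = []
    next_allowed = 0  # skip mentions already covered by the previous window
    for i, word in enumerate(words):
        if i < next_allowed or not pattern.search(word):
            continue
        start = max(0, i - context_words)
        end = min(len(words), i + context_words + 1)
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(words) else ""
        snippets.append(prefix + " ".join(words[start:end]) + suffix)
        next_allowed = end
        if len(snippets) == max_snippets:
            break
    return snippets
```

When only the abstract is available, the same function can be applied to the abstract text; it returns an empty list if the identifier never appears, which is the common case noted above.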
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a
GeneRIF
entry links the gene to an article in
PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of
BioLiP,
a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the
Protein Data Bank.
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether they are characterized or not.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transport Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Classification number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally-characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally-determined DNA binding sites are included from the
PRODORIC database of gene regulation in prokaryotes.
- Putative transcription factors from RegPrecise
are included if they have manually-curated predictions for their binding sites. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
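The ENA size filter in the list above (dropping nucleotide entries or papers linked to more than 25 such proteins) can be sketched as below. The record layout and the names `CdsRecord` and `filter_records` are hypothetical, and the choice to keep a protein only if at least one of its papers survives the filter is an assumption.

```python
from collections import Counter
from typing import NamedTuple

MAX_PROTEINS = 25  # entries or papers with more proteins than this are excluded


class CdsRecord(NamedTuple):
    protein_id: str
    entry_id: str                # nucleotide entry the CDS comes from
    pubmed_ids: tuple[str, ...]  # papers linked from that entry


def filter_records(records: list[CdsRecord]) -> list[CdsRecord]:
    """Drop proteins from oversized entries and links to oversized papers."""
    per_entry = Counter(r.entry_id for r in records)
    per_paper = Counter(p for r in records for p in set(r.pubmed_ids))
    kept = []
    for r in records:
        if per_entry[r.entry_id] > MAX_PROTEINS:
            continue  # entry annotates too many proteins to be a targeted study
        papers = tuple(p for p in r.pubmed_ids if per_paper[p] <= MAX_PROTEINS)
        if papers:  # assumption: keep the protein only if a paper remains
            kept.append(r._replace(pubmed_ids=papers))
    return kept
```

The intent of such a cutoff is to separate papers that study a few proteins in depth from large-scale surveys that only detect expression.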
Except for GeneRIF and ENA,
the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
For more information see the
PaperBLAST paper (mSystems 2017)
or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubMed Central or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.
by Morgan Price,
Arkin group
Lawrence Berkeley National Laboratory