PaperBLAST
PaperBLAST Hits for P77804 Protein YdgA (Escherichia coli (strain K12)) (502 a.a., MNKSLVAVGV...)
>P77804 Protein YdgA (Escherichia coli (strain K12))
MNKSLVAVGVIVALGVVWTGGAWYTGKKIETHLEDMVAQANAQLKLTAPESNLEVSYQNY
HRGVFSSQLQLLVKPIAGKENPWIKSGQSVIFNESVDHGPFPLAQLKKLNLIPSMASIQT
TLVNNEVSKPLFDMAKGETPFEINSRIGYSGDSSSDISLKPLNYEQKDEKVAFSGGEFQL
NADRDGKAISLSGEAQSGRIDAVNEYNQKVQLTFNNLKTDGSSTLASFGERVGNQKLSLE
KMTISVEGKELALLEGMEISGKSDLVNDGKTINSQLDYSLNSLKVQNQDLGSGKLTLKVG
QIDGEAWHQFSQQYNAQTQALLAQPEIANNPELYQEKVTEAFFSALPLMLKGDPVITIAP
LSWKNSQGESALNLSLFLKDPATTKEAPQTLAQEVDRSVKSLDAKLTIPVDMATEFMTQV
AKLEGYQEDQAKKLAKQQVEGASAMGQMFRLTTLQDNTITTSLQYANGQITLNGQKMSLE
DFVGMFAMPALNVPAVPAIPQQ
Found 11 similar proteins in the literature:
YdgA / b1614 DUF945 domain-containing protein YdgA from Escherichia coli K-12 substr. MG1655 (see 3 papers)
YDGA_ECOLI / P77804 Protein YdgA from Escherichia coli (strain K12) (see paper)
ydgA / RF|NP_416131 protein ydgA from Escherichia coli K12 (see paper)
b1614 hypothetical protein from Escherichia coli str. K-12 substr. MG1655
100% identity, 100% coverage
- subunit: Homodimer.
- Biodistribution of 89Zr-DFO-labeled avian pathogenic Escherichia coli outer membrane vesicles by PET imaging in chickens
Li, Poultry science 2023 - “...and vesicular transport Cell inner membrane 355 P0CK95 ACFD Function unknown Cell inner membrane 356 P77804 YDGA Function unknown Cell inner membrane 357 P0C0S1 MSCS Cell wall/membrane/envelope biogenesis Cell inner membrane 358 P0AB98 ATP6 Energy production and conversion Cell inner membrane 359 P0ADA3 NLPD Cell wall/membrane/envelope...”
- Comparative Analysis of Outer Membrane Vesicle Isolation Methods With an Escherichia coli tolA Mutant Reveals a Hypervesiculating Phenotype With Outer-Inner Membrane Vesicle Content
Reimer, Frontiers in microbiology 2021 - “...Increased in tolA P0AC41 sdhA Succinate dehydrogenase flavoprotein subunit tolA 0.0013 INF Increased in tolA P77804 ydgA Protein YdgA tolA 0.015 INF Increased in tolA OUTER MEMBRANE P77774 bamB Outer membrane protein assembly factor BamB BOTH <0.00010 23 Increased in tolA P0A903 bamC Outer membrane protein...”
- The Escherichia coli proteome: past, present, and future prospects
Han, Microbiology and molecular biology reviews : MMBR 2006 - “...YdhR YdiA YdiJ YdiY YdjA YeaD YeaZ P77318 P77804 P76177 P0AC69 P77552 P0ACX3 P0A8A4 P77748 P76206 P0ACY1 P39173 P76256 5.38/59,928.8 5.07/54,689 9.1/31,910.83...”
- 18th Congress of the European Hematology Association, Stockholm, Sweden, June 13–16, 2013
Haematologica 2013
- Genome-wide analysis of the general stress response network in Escherichia coli: sigmaS-dependent genes, promoters, and sigma factor selectivity
Weber, Journal of bacteriology 2005 - “...b1003 b1050 b1188 b1164 b1258 b1259 b1341 b1547 b1614 b1783 b1784 b1847 b1999 b2080 b2086 b2602 b2660 b2672 b2665 Putative acyl-CoA dehydrogenase...”
SSON_1546 hypothetical protein from Shigella sonnei Ss046
100% identity, 100% coverage
- High yield production process for Shigella outer membrane particles
Berlanda, PloS one 2012 - “...protein SF1022 [S. flexneri 2a str. 301] SF1022 gi|24112431 Inner membrane 56 1 hypothetical protein SSON_1546 [S. sonnei Ss046] ydgA gi|74312061 Unknown 57 3 putative receptor [S. sonnei Ss046] SSON_1681 gi|74312191 58 2 hypothetical protein SSON_1556 [S. sonnei Ss046] ydgH gi|74312071 59 2 hypothetical protein SSON_3340...”
STM1466 putative periplasmic protein from Salmonella typhimurium LT2
80% identity, 100% coverage
t1334 conserved hypothetical protein from Salmonella enterica subsp. enterica serovar Typhi Ty2
80% identity, 100% coverage
- Prevalence and Diversity of Staphylococcus aureus and Staphylococcal Enterotoxins in Raw Milk From Northern Portugal
Oliveira, Frontiers in microbiology 2022 - “...0.07.6%) distinct isolates, while t002, t108, t117, t127, t189, t208, t267, t843, t899, t1200, t1207, t1334, t2383, t3585, t9216, and t19272, were associated to one (1.6%, 95% CI: 0.04.7%) S. aureus isolate ( Figure 2 ). FIGURE 2 Minimum spanning tree of the spa typing for...”
- “...and one t2383. S. aureus t1403, t2802, t571, t108, t189. t208, t267, t843, t1200, t1207, t1334, and t19272 were exclusively associated with strains that did not contain any of the virulence/resistance genes evaluated. In total, S. aureus t1403-none (16.1%, 95% CI: 7.025.3%) is the predominant molecular...”
YPO2262 putative exported protein from Yersinia pestis CO92
43% identity, 95% coverage
y2104 hypothetical protein from Yersinia pestis KIM
43% identity, 91% coverage
- Integral and peripheral association of proteins and protein complexes with Yersinia pestis inner and outer membranes
Pieper, Proteome science 2009 - “...system subunits (ClpB, #78, y3673, #172; y3674, #89; y3675, #211) and a putative phospholipid-binding protein (y2104, #207). While we only infer the association of aforementioned proteins with distinct membrane protein complexes, this experiment was in support of the notion that many proteins assigned to the membrane...”
- “...which has been linked to pathogenicity of Vibrio cholerae [ 58 ], and the protein y2104, a putative phospholipid-binding protein whose ortholog YdgA also formed oligomeric structures in E. coli [ 50 ]. OM-associated proteins and protein complexes of the OM Thirty-one proteins were designated Y....”
YihF / b3861 DUF945 domain-containing protein YihF from Escherichia coli K-12 substr. MG1655
b3861 putative GTP-binding protein from Escherichia coli str. K-12 substr. MG1655
34% identity, 95% coverage
PP1230 conserved hypothetical protein from Pseudomonas putida KT2440
27% identity, 93% coverage
NTHI1930 hypothetical protein from Haemophilus influenzae 86-028NP
24% identity, 95% coverage
APL_0889 hypothetical protein from Actinobacillus pleuropneumoniae L20
26% identity, 97% coverage
HI1236 conserved hypothetical protein from Haemophilus influenzae Rd KW20
P44132 Uncharacterized protein HI_1236 from Haemophilus influenzae (strain ATCC 51907 / DSM 11121 / KW20 / Rd)
26% identity, 61% coverage
- Identification and functional analysis of 'hypothetical' genes expressed in Haemophilus influenzae
Kolker, Nucleic acids research 2004 - “...the original list: HI0246, HI0668, HI0700, HI0847, HI1168, HI1236 and HI1709. For two more, HI0370 and HI1681, a conserved gene neighborhood (e.g. co-expression...”
- Identification of the exported proteins of the oral opportunistic pathogen Actinobacillus actinomycetemcomitans by using alkaline phosphatase fusions
Ward, Infection and immunity 2001 - “...(367) (371) 1301 (136) 748 (185) HI1085 HI1603 HI1701 HI1150 HI0693 HI0370 HI1236 HI1126.1 HI1628 HI0389 89 89 66 74 89 74 24 87 58 65 79 80 49 62 78 51 44 80...”
- Identification of genes coding for exported proteins of Actinobacillus actinomycetemcomitans
Mintz, Infection and immunity 1999 - “...pVT1063 pVT1064 pVT1067 Hypothetical protein of H. influenzae HI1236 (8) Outer membrane protein A precursor of Serratia marcescens (4) H. influenzae chaperone...”
- Functional annotation of conserved hypothetical proteins from Haemophilus influenzae Rd KW20
Shahbaaz, PloS one 2013 - “...23. P45074 Yes Cellular process 24. P45077 Yes Cellular process 25. P71373 Yes Yes 26. P44132 Yes Metabolism molecule 27. P44138 Yes Cellular process 28. P44140 Yes Yes 29. P44165 Yes Yes 30. P45182 Yes Yes 31. P44169 Yes Yes 32. P44183 Yes Yes 33. P56507...”
For advice on how to use these tools together, see
Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 789,361 different protein sequences to 1,256,019 scientific articles. Searches against EuropePMC were last performed on January 10 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC
and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot,
BRENDA,
CAZy (as made available by dbCAN),
BioLiP,
CharProtDB,
MetaCyc,
EcoCyc,
TCDB,
REBASE,
the Fitness Browser,
and a subset of the European Nucleotide Archive with the /experiment tag.
Given this database and a protein sequence query,
PaperBLAST uses protein-protein BLAST
to find similar sequences with E < 0.001.
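As a rough illustration of the E < 0.001 cutoff, the sketch below filters BLAST hits that are already in tabular form. It assumes the standard 12-column tabular layout (-outfmt 6); the function name significant_hits and the sample rows are hypothetical, and this is not PaperBLAST's own code.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-3  # threshold stated in the text: E < 0.001


def significant_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return (subject_id, percent_identity, evalue) for rows with E < cutoff.

    Assumes the default 12-column BLAST tabular layout (-outfmt 6):
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip comment lines and anything that is not a full hit row.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        evalue = float(row[10])
        if evalue < cutoff:
            hits.append((row[1], float(row[2]), evalue))
    return hits


# Hypothetical rows: one strong hit and one hit above the cutoff.
example = (
    "P77804\thitA\t100.0\t502\t0\t0\t1\t502\t1\t502\t0.0\t1000\n"
    "P77804\thitB\t25.0\t100\t70\t5\t1\t100\t1\t100\t0.5\t30\n"
)
print(significant_hits(example))
```

Only the first sample row passes, because its E-value is below the cutoff.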
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
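The query form "locus_tag AND genus_name" can be sketched as a small string-building helper. The name build_query and the choice to take the genus as the first word of the organism name are assumptions made for illustration; how PaperBLAST actually derives the genus is not described here.

```python
def build_query(locus_tag, organism_name):
    """Build a text-search query of the form 'locus_tag AND genus_name'.

    Assumption: the genus is the first whitespace-separated word of the
    organism name.
    """
    genus = organism_name.split()[0]
    return f"{locus_tag} AND {genus}"


# Locus tag and organism name taken from the hit list above.
print(build_query("b1614", "Escherichia coli str. K-12 substr. MG1655"))
```

Pairing the identifier with the genus narrows the search, but a short locus tag can still coincide with an unrelated token in a paper that also mentions the genus.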
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
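One simple way to pick such snippets is to take a window of words around each mention of the identifier. The sketch below is only an illustration under that assumption: the function select_snippets, the window size, and the whole-token matching rule are hypothetical and are not PaperBLAST's actual heuristics.

```python
import re


def select_snippets(text, identifier, context_words=10, max_snippets=2):
    """Return up to max_snippets word windows centred on the identifier."""
    words = text.split()
    # Match the identifier only as a whole token, so "b1614" does not
    # match inside "b16140".
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(identifier) + r"(?![A-Za-z0-9_])"
    )
    snippets = []
    last_end = -1  # index of the last word already covered by a snippet
    for i, word in enumerate(words):
        if i <= last_end or not pattern.search(word):
            continue
        start = max(0, i - context_words)
        end = min(len(words), i + context_words + 1)
        snippets.append("..." + " ".join(words[start:end]) + "...")
        last_end = end - 1
        if len(snippets) == max_snippets:
            break
    return snippets


print(select_snippets("We found that b1614 is induced", "b1614", context_words=2))
```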
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a
GeneRIF
entry links the gene to an article in
PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of
BioLiP,
a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the
Protein Data Bank.
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether it is characterized.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transport Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Classification number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally-characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally-determined DNA binding sites from the
PRODORIC database of gene regulation in prokaryotes are included.
- Putative transcription factors from RegPrecise
are included if they have manually-curated predictions for their binding sites. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
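The ENA inclusion rules in the last item of the list above can be sketched as a filter. The CdsFeature record and its field names are hypothetical stand-ins for whatever the ENA parser produces. The sketch also makes two assumptions: the 25-protein limit is counted over features that pass the other three tests, and a feature is dropped if any of its papers exceeds the limit.

```python
from collections import Counter
from dataclasses import dataclass, field

MAX_PROTEINS = 25  # entries or papers with more than this many proteins are excluded


@dataclass
class CdsFeature:
    protein_id: str
    entry_id: str             # nucleotide entry the CDS belongs to
    has_experiment_tag: bool  # True if the /experiment tag is set
    data_class: str           # e.g. "STD"
    pubmed_ids: list = field(default_factory=list)


def select_ena_features(features):
    """Keep CDS features that satisfy the inclusion rules described above."""
    # Rules 1-3: /experiment tag set, linked to PubMed, STD data class.
    candidates = [
        f for f in features
        if f.has_experiment_tag and f.pubmed_ids and f.data_class == "STD"
    ]
    # Rule 4: count candidate proteins per nucleotide entry and per paper.
    per_entry = Counter(f.entry_id for f in candidates)
    per_paper = Counter(p for f in candidates for p in set(f.pubmed_ids))
    return [
        f for f in candidates
        if per_entry[f.entry_id] <= MAX_PROTEINS
        and all(per_paper[p] <= MAX_PROTEINS for p in f.pubmed_ids)
    ]
```

For example, a nucleotide entry carrying 26 tagged proteins would be dropped as a whole, while a single tagged CDS from an STD entry with one PubMed link would be kept.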
Except for GeneRIF and ENA,
the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
For more information see the
PaperBLAST paper (mSystems 2017)
or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubMed Central or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.
by Morgan Price,
Arkin group
Lawrence Berkeley National Laboratory