PaperBLAST

PaperBLAST – Find papers about a protein or its homologs

PaperBLAST Hits for P77804 Protein YdgA (Escherichia coli (strain K12)) (502 a.a., MNKSLVAVGV...)

Other sequence analysis tools:

Find functional residues: SitesBLAST

Search for conserved domains

Find the best match in UniProt

Compare to protein structures

Predict transmembrane helices: Phobius

Predict protein localization: PSORTb

Find homologs in fast.genomics

Found 11 similar proteins in the literature:

YdgA / b1614 DUF945 domain-containing protein YdgA from Escherichia coli K-12 substr. MG1655 (see 3 papers)
YDGA_ECOLI / P77804 Protein YdgA from Escherichia coli (strain K12) (see paper)
ydgA / RF|NP_416131 protein ydgA from Escherichia coli K12 (see paper)
b1614 hypothetical protein from Escherichia coli str. K-12 substr. MG1655
100% identity, 100% coverage

subunit: Homodimer.
Biodistribution of ⁸⁹Zr-DFO-labeled avian pathogenic Escherichia coli outer membrane vesicles by PET imaging in chickens
Li, Poultry science 2023
- “...and vesicular transport Cell inner membrane 355 P0CK95 ACFD Function unknown Cell inner membrane 356 P77804 YDGA Function unknown Cell inner membrane 357 P0C0S1 MSCS Cell wall/membrane/envelope biogenesis Cell inner membrane 358 P0AB98 ATP6 Energy production and conversion Cell inner membrane 359 P0ADA3 NLPD Cell wall/membrane/envelope...”
Comparative Analysis of Outer Membrane Vesicle Isolation Methods With an Escherichia coli tolA Mutant Reveals a Hypervesiculating Phenotype With Outer-Inner Membrane Vesicle Content
Reimer, Frontiers in microbiology 2021
- “...Increased in tolA P0AC41 sdhA Succinate dehydrogenase flavoprotein subunit tolA 0.0013 INF Increased in tolA P77804 ydgA Protein YdgA tolA 0.015 INF Increased in tolA OUTER MEMBRANE P77774 bamB Outer membrane protein assembly factor BamB BOTH <0.00010 23 Increased in tolA P0A903 bamC Outer membrane protein...”
The Escherichia coli proteome: past, present, and future prospects
Han, Microbiology and molecular biology reviews : MMBR 2006
- “...YdhR YdiA YdiJ YdiY YdjA YeaD YeaZ P77318 P77804 P76177 P0AC69 P77552 P0ACX3 P0A8A4 P77748 P76206 P0ACY1 P39173 P76256 5.38/59,928.8 5.07/54,689 9.1/31,910.83...”
18th Congress of the European Hematology Association, Stockholm, Sweden, June 13–16, 2013
Haematologica 2013
Genome-wide analysis of the general stress response network in Escherichia coli: sigmaS-dependent genes, promoters, and sigma factor selectivity
Weber, Journal of bacteriology 2005
- “...b1003 b1050 b1188 b1164 b1258 b1259 b1341 b1547 b1614 b1783 b1784 b1847 b1999 b2080 b2086 b2602 b2660 b2672 b2665 Putative acyl-CoA dehydrogenase...”

SSON_1546 hypothetical protein from Shigella sonnei Ss046
100% identity, 100% coverage

High yield production process for Shigella outer membrane particles
Berlanda, PloS one 2012
- “...protein SF1022 [S. flexneri 2a str. 301] SF1022 gi|24112431 Inner membrane 56 1 hypothetical protein SSON_1546 [S. sonnei Ss046] ydgA gi|74312061 Unknown 57 3 putative receptor [S. sonnei Ss046] SSON_1681 gi|74312191 58 2 hypothetical protein SSON_1556 [S. sonnei Ss046] ydgH gi|74312071 59 2 hypothetical protein SSON_3340...”

STM1466 putative periplasmic protein from Salmonella typhimurium LT2
80% identity, 100% coverage

Media ion composition controls regulatory and virulence response of Salmonella in spaceflight
Wilson, PloS one 2008
- “...flgG flagellar biosynthesis, cell-distal portion of basal-body rod STM1196 0.59 x acpP acyl carrier protein STM1466 0.59 ydgA putative periplasmic protein STM1916 0.55 x cheY chemotaxis regulator, transmits chemoreceptor signals to flagellar motor STM1959 0.44 x fliC flagellar biosynthesis; flagellin, filament structural protein STM1962 0.54 x...”

t1334 conserved hypothetical protein from Salmonella enterica subsp. enterica serovar Typhi Ty2
80% identity, 100% coverage

Prevalence and Diversity of Staphylococcus aureus and Staphylococcal Enterotoxins in Raw Milk From Northern Portugal
Oliveira, Frontiers in microbiology 2022
- “...0.0–7.6%) distinct isolates, while t002, t108, t117, t127, t189, t208, t267, t843, t899, t1200, t1207, t1334, t2383, t3585, t9216, and t19272, were associated to one (1.6%, 95% CI: 0.0–4.7%) S. aureus isolate ( Figure 2 ). FIGURE 2 Minimum spanning tree of the spa typing for...”
- “...and one t2383. S. aureus t1403, t2802, t571, t108, t189. t208, t267, t843, t1200, t1207, t1334, and t19272 were exclusively associated with strains that did not contain any of the virulence/resistance genes evaluated. In total, S. aureus t1403-none (16.1%, 95% CI: 7.0–25.3%) is the predominant molecular...”

YPO2262 putative exported protein from Yersinia pestis CO92
43% identity, 95% coverage

Cell membrane is impaired, accompanied by enhanced type III secretion system expression in Yersinia pestis deficient in RovA regulator
Yang, PloS one 2010
- “...exported sulfate-binding protein 3.C.1 Cell envelop YPO0917 2.57 yggE putative exported protein 3.C.1 Cell envelop YPO2262 3.82 ----- putative exported protein 3.C.1 Cell envelop YPO2315 3.34 ----- putative exported protein 3.C.1 Cell envelop YPO2670 4.12 ureG urease accessory protein 3.C.1 Cell envelop YPO4070 3.16 yiaF putative...”

y2104 hypothetical protein from Yersinia pestis KIM
43% identity, 91% coverage

Integral and peripheral association of proteins and protein complexes with Yersinia pestis inner and outer membranes
Pieper, Proteome science 2009
- “...system subunits (ClpB, #78, y3673, #172; y3674, #89; y3675, #211) and a putative phospholipid-binding protein (y2104, #207). While we only infer the association of aforementioned proteins with distinct membrane protein complexes, this experiment was in support of the notion that many proteins assigned to the membrane...”
- “...which has been linked to pathogenicity of Vibrio cholerae [ 58 ], and the protein y2104, a putative phospholipid-binding protein whose ortholog YdgA also formed oligomeric structures in E. coli [ 50 ]. OM-associated proteins and protein complexes of the OM Thirty-one proteins were designated Y....”

YihF / b3861 DUF945 domain-containing protein YihF from Escherichia coli K-12 substr. MG1655
b3861 putative GTP-binding protein from Escherichia coli str. K-12 substr. MG1655
34% identity, 95% coverage

Combined, functional genomic-biochemical approach to intermediary metabolism: interaction of acivicin, a glutamine amidotransferase inhibitor, with Escherichia coli K-12
Smulski, Journal of bacteriology 2001
- “...b3548 b3555 b3574 b3581 b3596 b3655 b3698 b3818 b3827 b3861 b3875 b3923 b3928 b3937 b3995 b4030 b4126 b4127 b4135 b4178 b4189 b4199 b4206 b4234 b4255 b4311...”

PP1230 conserved hypothetical protein from Pseudomonas putida KT2440
27% identity, 93% coverage

UEG Week 2024 Poster Presentations
United European gastroenterology journal 2024
UEG Week 2023 Poster Presentations
United European gastroenterology journal 2023

NTHI1930 hypothetical protein from Haemophilus influenzae 86-028NP
24% identity, 95% coverage

Antisera Against Certain Conserved Surface-Exposed Peptides of Nontypeable Haemophilus influenzae Are Protective
Whitby, PloS one 2015
- “...NTHI1957 lppC HI1655 Lipoprotein LppC Amorphous NTHI1954 spr HI1652 Lipoprotein Spr, probable murein endopeptidase Amorphous NTHI1930 HI1236m Conserved hypothetical protein β-barrel NTHI1668 tdeA HI1462 Outer membrane efflux porin TdeA β-barrel NTHI1794m HI1369 Probable TonB-dependent transporter β-barrel NTHI1473 lpp HI1579 15 kDa peptidoglycan-associated lipoprotein α-helix NTHI1435 lolB...”

APL_0889 hypothetical protein from Actinobacillus pleuropneumoniae L20
26% identity, 97% coverage

Host-pathogen interactions of Actinobacillus pleuropneumoniae with porcine lung and tracheal epithelial cells
Auger, Infection and immunity 2009
- “...APL_1437 APL_0116 APL_0389 APL_0704 APL_1365 APL_0110 APL_1396 APL_0756 APL_0889 Gene 1438 AUGER ET AL. INFECT. IMMUN. FIG. 6. Adherence of 12 members of the...”

HI1236 conserved hypothetical protein from Haemophilus influenzae Rd KW20
P44132 Uncharacterized protein HI_1236 from Haemophilus influenzae (strain ATCC 51907 / DSM 11121 / KW20 / Rd)
26% identity, 61% coverage

Identification and functional analysis of 'hypothetical' genes expressed in Haemophilus influenzae
Kolker, Nucleic acids research 2004
- “...the original list: HI0246, HI0668, HI0700, HI0847, HI1168, HI1236 and HI1709. For two more, HI0370 and HI1681, a conserved gene neighborhood (e.g. co-expression...”
Identification of the exported proteins of the oral opportunistic pathogen Actinobacillus actinomycetemcomitans by using alkaline phosphatase fusions
Ward, Infection and immunity 2001
- “...(367) (371) 1301 (136) 748 (185) HI1085 HI1603 HI1701 HI1150 HI0693 HI0370 HI1236 HI1126.1 HI1628 HI0389 89 89 66 74 89 74 24 87 58 65 79 80 49 62 78 51 44 80...”
Identification of genes coding for exported proteins of Actinobacillus actinomycetemcomitans
Mintz, Infection and immunity 1999
- “...pVT1063 pVT1064 pVT1067 Hypothetical protein of H. influenzae HI1236 (8) Outer membrane protein A precursor of Serratia marcescens (4) H. influenzae chaperone...”
Functional annotation of conserved hypothetical proteins from Haemophilus influenzae Rd KW20
Shahbaaz, PloS one 2013
- “...23. P45074 Yes Cellular process 24. P45077 Yes Cellular process 25. P71373 Yes Yes 26. P44132 Yes Metabolism molecule 27. P44138 Yes Cellular process 28. P44140 Yes Yes 29. P44165 Yes Yes 30. P45182 Yes Yes 31. P44169 Yes Yes 32. P44183 Yes Yes 33. P56507...”

For advice on how to use these tools together, see Interactive tools for functional annotation of bacterial genomes.

Statistics

The PaperBLAST database links 789,361 different protein sequences to 1,256,019 scientific articles. Searches against EuropePMC were last performed on January 10, 2025.

How It Works

PaperBLAST builds a database of protein sequences that are linked to scientific articles. These links come from automated text searches against the articles in EuropePMC and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by dbCAN), BioLiP, CharProtDB, MetaCyc, EcoCyc, TCDB, REBASE, the Fitness Browser, and a subset of the European Nucleotide Archive with the /experiment tag. Given this database and a protein sequence query, PaperBLAST uses protein-protein BLAST to find similar sequences with E < 0.001.
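
As a rough illustration of the E-value cutoff, the sketch below filters BLAST hits given in the standard 12-column tabular format (`-outfmt 6`) and derives percent identity and query coverage of the kind reported for each hit. The function name, the sample rows, and the coverage formula (aligned query span divided by query length) are assumptions made for this example; PaperBLAST's own code may compute these values differently.

```python
import csv
import io

# Column order of BLAST's default tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(tabular_text, query_length, max_evalue=1e-3):
    """Keep hits with E < max_evalue; report identity and query coverage."""
    hits = []
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        evalue = float(rec["evalue"])
        if evalue >= max_evalue:
            continue  # too weak to count as a similar sequence
        qstart, qend = int(rec["qstart"]), int(rec["qend"])
        # Assumed definition: share of the query covered by the alignment.
        coverage = 100.0 * (qend - qstart + 1) / query_length
        hits.append({
            "subject": rec["sseqid"],
            "identity": float(rec["pident"]),
            "coverage": round(coverage),
            "evalue": evalue,
        })
    return hits


# Made-up rows for a 502-residue query; hitC fails the E-value cutoff.
sample = "\n".join([
    "query\thitA\t80.0\t502\t100\t0\t1\t502\t1\t502\t1e-150\t900",
    "query\thitB\t24.0\t480\t300\t12\t10\t486\t5\t470\t2e-5\t60.1",
    "query\thitC\t30.0\t50\t35\t0\t100\t149\t20\t69\t0.5\t25.0",
])
hits = filter_hits(sample, query_length=502)
for hit in hits:
    print(f'{hit["subject"]}: {hit["identity"]:.0f}% identity, '
          f'{hit["coverage"]}% coverage')
```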

To build the database, we query EuropePMC with locus tags, with RefSeq protein identifiers, and with UniProt accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use queries of the form "locus_tag AND genus_name" to try to ensure that the paper is actually discussing that gene. Because EuropePMC indexes most recent biomedical papers, even if they are not open access, some of the links may be to papers that you cannot read or that our computers cannot read. We query each of these identifiers that appears in the open access part of EuropePMC, as well as every locus tag that appears in the 500 most-referenced genomes, so that a gene may appear in the PaperBLAST results even though none of the papers that mention it are open access. We also incorporate text-mined links from EuropePMC that link open access articles to UniProt or RefSeq identifiers. (This yields some additional links because EuropePMC uses different heuristics for their text mining than we do.)
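
A minimal sketch of how a query of the form "locus_tag AND genus_name" might be assembled. The helper names are hypothetical, taking the genus as the first word of the organism name is an assumption, and the URL uses Europe PMC's public REST search endpoint only as one plausible way to submit such a query; none of this is taken from PaperBLAST's actual pipeline.

```python
from urllib.parse import urlencode

# Europe PMC's public REST search endpoint (assumed here for illustration).
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(locus_tag, organism):
    """Pair a locus tag with the genus (assumed: first word of the organism)."""
    genus = organism.split()[0]
    return f"{locus_tag} AND {genus}"


def build_search_url(locus_tag, organism):
    """Return a search URL for the query; no request is sent here."""
    params = {"query": build_query(locus_tag, organism), "format": "json"}
    return f"{SEARCH_URL}?{urlencode(params)}"


query = build_query("b1614", "Escherichia coli K-12")
url = build_search_url("b1614", "Escherichia coli K-12")
print(query)
print(url)
```

Requiring the genus name alongside the locus tag is what guards against short identifiers that happen to match unrelated strings in other papers.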

For every article that mentions a locus tag, a RefSeq protein identifier, or a UniProt accession, we try to select one or two snippets of text that refer to the protein. If we cannot get access to the full text, we try to select a snippet from the abstract, but unfortunately, unique identifiers such as locus tags are rarely provided in abstracts.
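
One simple way to pick such snippets is to take a window of words around each mention of the identifier. The sketch below does that; the window size, the limit of two snippets, and the function name are assumptions for illustration, and PaperBLAST's real selection heuristics are not described here and may differ.

```python
import re


def select_snippets(text, identifier, window=12, max_snippets=2):
    """Return up to max_snippets word windows around mentions of identifier."""
    words = text.split()
    # Whole-token match so that "y2104" does not match "y21045".
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b")
    snippets = []
    last_end = 0  # index just past the previous snippet
    for i, word in enumerate(words):
        if i < last_end or not pattern.search(word):
            continue  # already covered by a snippet, or not a mention
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        snippet = " ".join(words[start:end])
        if start > 0:
            snippet = "..." + snippet
        if end < len(words):
            snippet = snippet + "..."
        snippets.append(snippet)
        last_end = end
        if len(snippets) == max_snippets:
            break
    return snippets


text = ("The protein y2104 is a putative phospholipid-binding protein "
        "whose ortholog YdgA formed oligomers")
snippets = select_snippets(text, "y2104", window=2)
print(snippets)
```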

PaperBLAST also incorporates manually-curated protein functions:

Proteins from NCBI's RefSeq are included if a GeneRIF entry links the gene to an article in PubMed®. GeneRIF also provides a short summary of the article's claim about the protein, which is shown instead of a snippet.
Proteins from Swiss-Prot (the curated part of UniProt) are included if the curators identified experimental evidence for the protein's function (evidence code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that describe the protein's function are shown (with bold headings).
Proteins from BRENDA, a curated database of enzymes, are included if they are linked to a paper in PubMed and their full sequence is known.
Every protein from the non-redundant subset of BioLiP, a database of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself does not include descriptions of the proteins, those are taken from the Protein Data Bank. Descriptions from PDB rely on the original submitter of the structure and cannot be updated by others, so they may be less reliable. (For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every ligand is represented among a group of structures with similar sequences, but for PaperBLAST, we use the non-redundant set provided by BioLiP.)
Every protein from EcoCyc, a curated database of the proteins in Escherichia coli K-12, is included, regardless of whether they are characterized or not.
Proteins from the MetaCyc metabolic pathway database are included if they are linked to a paper in PubMed and their full sequence is known.
Proteins from the Transport Classification Database (TCDB) are included if they have known substrate(s), have reference(s), and are not described as uncharacterized or putative. (Some of the references are not visible on the PaperBLAST web site.)
Every protein from CharProtDB, a database of experimentally characterized protein annotations, is included.
Proteins from the CAZy database of carbohydrate-active enzymes are included if they are associated with an Enzyme Classification number. Even though CAZy does not provide links from individual protein sequences to papers, these should all be experimentally-characterized proteins.
Proteins from the REBASE database of restriction enzymes are included if they have known specificity.
Every protein with an evidence-based reannotation (based on mutant phenotypes) in the Fitness Browser is included.
Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators) are included if they have experimentally-determined DNA binding sites in the PRODORIC database of gene regulation in prokaryotes.
Putative transcription factors from RegPrecise are included if they have manually-curated predictions for their binding sites. These predictions are based on conserved putative regulatory sites across genomes that contain similar transcription factors, so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
Coding sequence (CDS) features from the European Nucleotide Archive (ENA) are included if the /experiment tag is set (implying that there is experimental evidence for the annotation), the nucleotide entry links to paper(s) in PubMed, and the nucleotide entry is from the STD data class (implying that these are targeted annotated sequences, not from shotgun sequencing). Also, to filter out genes whose transcription or translation was detected, but whose function was not studied, nucleotide entries or papers with more than 25 such proteins are excluded. Descriptions from ENA rely on the original submitter of the sequence and cannot be updated by others, so they may be less reliable.
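
The ENA inclusion rules in the last item can be sketched as a two-pass filter. The record type and field names below are hypothetical, PubMed links are attached to each feature for simplicity (the rule above places them on the nucleotide entry), and the reading that a protein is dropped when its entry or any of its papers has more than 25 qualifying proteins is an assumption about how the limit is applied.

```python
from collections import Counter
from dataclasses import dataclass

MAX_PROTEINS = 25  # entries or papers with more proteins than this are excluded


@dataclass
class CdsFeature:
    protein_id: str
    entry_id: str          # accession of the nucleotide entry
    data_class: str        # ENA data class, e.g. "STD"
    has_experiment: bool   # True if the /experiment tag is set
    pubmed_ids: tuple = ()


def select_cds(features):
    """Apply the three inclusion criteria, then the 25-protein limit."""
    # Pass 1: /experiment tag set, linked to PubMed, and from the STD class.
    candidates = [f for f in features
                  if f.has_experiment and f.pubmed_ids and f.data_class == "STD"]
    # Pass 2: count qualifying proteins per nucleotide entry and per paper.
    per_entry = Counter(f.entry_id for f in candidates)
    per_paper = Counter(p for f in candidates for p in f.pubmed_ids)
    return [f for f in candidates
            if per_entry[f.entry_id] <= MAX_PROTEINS
            and all(per_paper[p] <= MAX_PROTEINS for p in f.pubmed_ids)]


# Made-up records: only P1 passes every rule.
features = [
    CdsFeature("P1", "E1", "STD", True, ("111",)),
    CdsFeature("P2", "E1", "STD", False, ("111",)),   # no /experiment tag
    CdsFeature("P3", "E3", "WGS", True, ("222",)),    # not the STD class
]
# Entry E2 carries 30 qualifying proteins, so all of them are dropped.
features += [CdsFeature(f"G{i}", "E2", "STD", True, ("333",)) for i in range(30)]

kept = [f.protein_id for f in select_cds(features)]
print(kept)
```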

Except for GeneRIF and ENA, the curated entries include a short curated description of the protein's function. For entries from BioLiP, the protein's function may not be known beyond binding to the ligand. Many of these entries also link to articles in PubMed.

For more information see the PaperBLAST paper (mSystems 2017) or the code. You can download PaperBLAST's database here.

Changes to PaperBLAST since the paper was written:

November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
June 2022: incorporated some coding sequences from ENA with the /experiment tag.
March 2022: incorporated BioLiP.
April 2020: incorporated TCDB.
April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
January 2018: incorporated BRENDA.
December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.

Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.

Secrets

PaperBLAST cannot provide snippets for many of the papers that are published in non-open-access journals. This limitation applies even if the paper is marked as "free" on the publisher's web site and is available in PubMed Central or EuropePMC. If a journal that you publish in is marked as "secret," please consider publishing elsewhere.

Omissions from the PaperBLAST Database

Many important articles are missing from PaperBLAST, either because the article's full text is not in EuropePMC (as for many older articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an article that characterizes a protein's function but is missing from PaperBLAST, please notify the curators at UniProt or add an entry to GeneRIF. Entries in either of these databases will eventually be incorporated into PaperBLAST. Note that to add an entry to UniProt, you will need to find the UniProt identifier for the protein. If the protein is not already in UniProt, you can ask them to create an entry. To add an entry to GeneRIF, you will need an NCBI Gene identifier, but unfortunately many prokaryotic proteins in RefSeq do not have corresponding Gene identifiers.

References

PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.

Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.

Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.

UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.

BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.

The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.

The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.

CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.

The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.

The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.

REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.

Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.

by Morgan Price, Arkin group
Lawrence Berkeley National Laboratory