PaperBLAST
PaperBLAST Hits for DZA65_RS05275 (72 a.a., MRFRIKTAVA...)
>DZA65_RS05275
MRFRIKTAVALTFVVLLSGCASHYVIATKDGQMLLTRGKPALDPATGLLSYTDEEGVKRQ
INNNNISQVIER
Found 17 similar proteins in the literature:
Ent638_3274 protein of unknown function DUF903 from Enterobacter sp. 638
59% identity, 94% coverage
t2916 possible lipoprotein from Salmonella enterica subsp. enterica serovar Typhi Ty2
59% identity, 92% coverage
B5XUP5 Putative lipoprotein from Klebsiella pneumoniae (strain 342)
58% identity, 92% coverage
YgdR / b2833 DUF903 domain-containing lipoprotein YgdR from Escherichia coli K-12 substr. MG1655 (see 2 papers)
Z4151 orf, hypothetical protein from Escherichia coli O157:H7 EDL933
ECs_3690 lipoprotein from Escherichia coli O157:H7 str. Sakai
P65294 Uncharacterized lipoprotein YgdR from Escherichia coli (strain K12)
b2833 hypothetical protein from Escherichia coli str. K-12 substr. MG1655
S3041 hypothetical protein from Shigella flexneri 2a str. 2457T
57% identity, 94% coverage
- Disruption of rcsB by a duplicated sequence in a curli-producing Escherichia coli O157:H7 results in differential gene expression in relation to biofilm formation, stress responses and metabolism
Sharma, BMC microbiology 2017 - “...23.74 1.9E-17 ypfG Z3722 Unknown 4.58 9.0E-09 Z3965 Unknown 2.33 1.5E-03 Z4126 Unknown 4.31 3.0E-04 Z4151 Unknown 3.84 7.4E-07 Z4267 Unknown +5.52 2.3E-03 Z4268 Unknown +4.92 4.3E-06 yggG Z4280 Unknown 1.71 0.02 Z4318 Unknown +1.30 0.04 ygjT Z4441 Unknown 2.25 5.0E-03 yhaM Z4462 Unknown +2.34 0.03...”
- Characterizing the Escherichia coli O157:H7 proteome including protein associations with higher order assemblies
Pieper, PloS one 2011 - “...19440 L (IM or OM) PP 3878 91 LAAC 19 A 2 uncharacterized lipoprotein YgdR Z4151 (ygdR) 7877 L (IM or OM) unkn 3742 376 VSGC 19 S 2 putative OM lipoprotein YiaD Z4977 yiaD 22196 L (IM or OM) OM 3322 45* VSGC 21 T...”
- Antibacterial efficacy interference of the photocatalytic TiO2 nanoparticle and the lytic bacteriophage vb_EcoS_bov25_1D on the Enterohaemorragic Escherichia coli strain Sakai
Steinbach, Heliyon 2024 - “...ECs_1037 rmf ribosome modulation factor protein_coding 913590 2,41264487 ECs_5190 ytfH transcriptional regulator protein_coding 913947 2,41048321 ECs_3690 ygdR lipoprotein protein_coding 916497 2,38729769 ECs_2382 ydhZ hypothetical protein protein_coding 914156 2,35933833 ECs_5729 yneM inner membrane-associated protein protein_coding 62675925 2,33294951 ECs_2695 yodC hypothetical protein protein_coding 913057 2,30330373 ECs_2819 hisL his...”
- A community resource of experimental data for NMR / X-ray crystal structure pairs.
Everett, Protein science : a publication of the Protein Society 2016
- Genome-wide analysis of lipoprotein expression in Escherichia coli MG1655
Brokx, Journal of bacteriology 2004 - “...b2432 b2477 b2512 b2593 b2595 b2605 b2701 b2742 b2809 b2813 b2833 b2865 b2963 b3150 b3163 b3267 b3369 b3661 b4149 b4189 b4288 1,177.4 P 2,142.5 P 75 P 45.55 P...”
- mRNA expression profiles for Escherichia coli ingested by normal and phagocyte oxidase-deficient human neutrophils
Staudinger, The Journal of clinical investigation 2002 - “...1. With the exception of trxC, the fold change metabolism, b2833, napD and napA, trpE, and yadI. estimates by PCR agreed well with estimates of One gene codes...”
- “...b0605 b2597 3.8 3.8 0.0004 0.004 b1172 trxC yfiG soxS katG b2833 b1172 b2582 b4062 b3942 b2833 3.7 3.3 2.8 2.6 2.6 0.02 0.001 0.02 0.001 0.02 osmB b1283 2.4...”
- Acidic pH sensing in the bacterial cytoplasm is required for Salmonella virulence
Choi, Molecular microbiology 2016 - “...et al. , 1972 ). All S. bongori strains were derived from the wild-type strain S3041 (SARC11) ( Boyd et al. , 1996 ). Bacteria were grown at 37C in Luria-Bertani (LB) broth or N-minimal media (pH 7.7) ( Snavely et al. , 1991 ) supplemented...”
- “...the strain JC369 was replaced with the PCR products amplified from the S. bongori strain S3041 using primers 14890/14891. To generate S. enterica with variant phoQ allele (D233E, H409N, Q460H), the tetRA fragment was amplified using primers 14835/14836 and used to electroporate the S. enterica strain...”
- Analysis of the type 1 pilin gene cluster fim in Salmonella: its distinct evolutionary histories in the 5' and 3' regions
Boyd, Journal of bacteriology 1999 - “...2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 S3041 S3044 2 2 2 2 2 2 2 2 2 2 2 2 1, positive hybridization signal; 2, indicates no hybridization signal. bp and the fimY-fimW...”
- Comparative genetics of the inv-spa invasion gene complex of Salmonella enterica
Boyd, Journal of bacteriology 1997 - “...S3049 S3043 S3046 S3041 S3047 S3042 S3044 S3040 48:z39:-60:z41:-66:z35:-61:z35:-44:r:-1,13,22:i:-48:z35:-44:z39:-66:z41:-1,40:z35:-48:a:-48:z41:-66:z65 Human...”
- “...could not be PCR amplified from subspecies V isolates S3041 and S3044. In order to determine whether invH is absent from this subspecies, 13 strains from this...”
- Intergeneric transfer and recombination of the 6-phosphogluconate dehydrogenase gene (gnd) in enteric bacteria
Nelson, Proceedings of the National Academy of Sciences of the United States of America 1994 - “...S2976 S So297I 53013 S2983 . ~~~~~VII II S83047 S3041 S3044 V FIG. 2. Neighbor-joining tree for gnd sequences of strains of S. enterica, based on synonymous...”
STM1585 putative outer membrane lipoprotein from Salmonella typhimurium LT2
44% identity, 96% coverage
A6T9R7 Lipoprotein YgdI/YgdR-like SH3-like domain-containing protein from Klebsiella pneumoniae subsp. pneumoniae (strain ATCC 700721 / MGH 78578)
52% identity, 86% coverage
KPN_03160 hypothetical protein from Klebsiella pneumoniae subsp. pneumoniae MGH 78578
44% identity, 82% coverage
UTI89_C3182 hypothetical protein from Escherichia coli UTI89
SSON_2966 hypothetical protein from Shigella sonnei Ss046
ECs3669 hypothetical protein from Escherichia coli O157:H7 str. Sakai
43% identity, 88% coverage
YgdI / b2809 DUF903 domain-containing lipoprotein YgdI from Escherichia coli K-12 substr. MG1655 (see 3 papers)
P65293 Uncharacterized lipoprotein YgdI from Escherichia coli O6:H1 (strain CFT073 / ATCC 700928 / UPEC)
P65292 Uncharacterized lipoprotein YgdI from Escherichia coli (strain K12)
b2809 orf, hypothetical protein from Escherichia coli str. K-12 substr. MG1655
43% identity, 89% coverage
- First report on the physicochemical and proteomic characterization of <i>Proteus mirabilis</i> outer membrane vesicles under urine-mimicking growth conditions: comparative analysis with <i>Escherichia coli</i>
González, Frontiers in microbiology 2024 - “...1.36 0.001 LB OM Lipoproteins A0A0H2VF34 Blc Outer membrane lipoprotein ED p <0.05 LB OM P65293 YgdI Uncharacterized lipoprotein ED p <0.05 LB U A0A0H2V4U4 YajI Hypothetical lipoprotein ED p <0.05 LB U A0A0H2V7I8 SlyB Outer membrane lipoprotein 2.40 0.001 LB OM A0A0H2V7T2 OsmE Osmotically inducible...”
- Biodistribution of 89Zr-DFO-labeled avian pathogenic Escherichia coli outer membrane vesicles by PET imaging in chickens
Li, Poultry science 2023 - “...transport and metabolism Cell membrane 305 P64429 YPFJ General function prediction only Cell membrane 306 P65292 YGDI Function unknown Cell membrane 307 P67244 YQHA Function unknown Cell membrane 308 P0ACE0 MBHM Energy production and conversion Cell membrane 309 P0AFB1 NLPI Function unknown Cell membrane 310 P69741...”
- Semiconductor Nanomaterials-Based Fluorescence Spectroscopic and Matrix-Assisted Laser Desorption/Ionization (MALDI) Mass Spectrometric Approaches to Proteome Analysis
Kailasa, Materials (Basel, Switzerland) 2013 - “...proteins ecnB (P56549), lpp (P69776), and osmE (P23933); hypothetical membrane proteins yifL (P39166) and ygdI (P65292); acetylacyl carrier protein ( ydhI ; acetyl-ACP , P0A6A8) and lipoproteins ( ecnB , lpp , osmE , yifL , ygdI ) and water-insoluble ATPase proteolipid (at m / z...”
- Global transcriptomic analysis of an engineered Escherichia coli strain lacking the phosphoenolpyruvate: carbohydrate phosphotransferase system during shikimic acid production in rich culture medium
Cortés-Tolalpa, Microbial cell factories 2014 - “...Conserved protein Other metabolic process 18.9020 ydiS b1699 Putative flavoprotein Other metabolic process 20.0646 ygdI b2809 Putative lipoprotein Out of OGs 15.8477 slp b3506 Starvation lipoprotein Out of OGs 16.3895 ybaY b0453 Predicted OM lipoprotein Out of OGs 31.4852 yhjR b3535 Conserved protein Out of OGs...”
- Reconfiguring the quorum-sensing regulator SdiA of Escherichia coli to control biofilm formation via indole and N-acylhomoserine lactones
Lee, Applied and environmental microbiology 2009 - “...b2080 b2112 b2135 b2165 b2660 b2654 b2655 b2658 b2672 b2809 b2856 b2857 b3099 b3170 b3184 b3260 b3512 b4068 b4126 b4127 Change in expression (fold) SdiA vs no...”
- Autoinducer 2 controls biofilm formation in Escherichia coli through a novel motility quorum-sensing regulator (MqsR, B3022)
González, Journal of bacteriology 2006 - “...dsdX ygdI b3254 yiaG tra5_3 yhjS rbsC b2365 b2809 b3254 b3555 b0372 b3536 b3750 Hypothetical protein Transcriptional regulator of cryptic csgA gene for curli...”
- Genome-wide analysis of lipoprotein expression in Escherichia coli MG1655
Brokx, Journal of bacteriology 2004 - “...b2214 b2346 b2432 b2477 b2512 b2593 b2595 b2605 b2701 b2742 b2809 b2813 b2833 b2865 b2963 b3150 b3163 b3267 b3369 b3661 b4149 b4189 b4288 1,177.4 P 2,142.5 P 75...”
- Parallel changes in gene expression after 20,000 generations of evolution in Escherichia coli
Cooper, Proceedings of the National Academy of Sciences of the United States of America 2003 - “...cdsA ybaC yaeT yafU ybcU b0762 b1044 b1490 b2445 b2462 b2772 b2809 syd yaeJ ydfC ydfE yjgF yjiD yjjY ykfB yohH nrdA yliJ b1168 ynbD yqhC yhcJ yidJ yjcP hsdR...”
- A microarray-based antibiotic screen identifies a regulatory role for supercoiling in the osmotic stress response of Escherichia coli
Cheung, Genome research 2003 - “...for Supercoiling-Dependent Gene Regulation Gene Product acnA b1664 b1724 b2809 bax btuE dps gcd grxB nlpD osmE otsA otsB poxB proV Aconitate hydrase 1 Possible...”
Z4126 hypothetical protein from Escherichia coli O157:H7 EDL933
40% identity, 87% coverage
SF2823 orf, conserved hypothetical protein from Shigella flexneri 2a str. 301
41% identity, 88% coverage
- Virulence and Stress Responses of Shigella flexneri Regulated by PhoP/PhoQ
Lin, Frontiers in microbiology 2017 - “...0.43 0.0016 ND Chromosome Putative carnitine operon oxidoreductase SF3143 0.45 0.0002 ND Chromosome Hypothetical protein SF2823 0.47 0.0092 ND Chromosome Hypothetical protein yqjE 0.48 0.0065 0.58 0.21 Chromosome Hypothetical protein yqjD 0.49 0.0031 ND Chromosome Hypothetical protein SF0551 0.50 0.0197 ND Chromosome Putative homeobox protein ipgB1...”
STM2983 putative lipoprotein from Salmonella typhimurium LT2
STM14_3597 YgdI/YgdR family lipoprotein from Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S
42% identity, 88% coverage
- Salmonella enterica serovar typhimurium colonizing the lumen of the chicken intestine grows slowly and upregulates a unique set of virulence and metabolism genes
Harvey, Infection and immunity 2011 - “...2.3 0.02 Not in COGs STM0471 STM1059 STM1092 STM1601 STM2983 ylaC ycbW ugtL orfX Putative Putative Putative Putative Putative 2.06 2.71 2.32 2.1 2.09 0.003 0.03...”
- The Rcs phosphorelay system is specific to enteric pathogens/commensals and activates ydeI, a gene important for persistent Salmonella infection of mice
Erickson, Molecular microbiology 2006 - “...STM1174 STM3445 STM3433 STM4064 STM4561b STM4239b STM4222 STM2983 STM4336b STM4240 STM3269 STM1285b STM1491 STM3443 STM1515 STM1492 STM3363 STM2311 STM2795...”
- Proteome remodelling by the stress sigma factor RpoS/σS in Salmonella: identification of small proteins and evidence for post-transcriptional regulation
Lago, Scientific reports 2017 - “.... coli YodC, DUF2158 IPR019226 61 Yes yodC 11, 13, 14 -, -, - Proteobacteria STM14_3597 Sm-like protein IPR010920, pdb 2RA2, SP, Lipo, DUF903 IPR010305 75 Yes ygdI 1214 Bacteria STM14_4398 Putative transcriptional regulator, DNA-binding domain IPR010982 96 Yes yiaG 7, 1114 Bacteria STM14_4446 79% identity...”
KPN_00042 hypothetical protein from Klebsiella pneumoniae subsp. pneumoniae MGH 78578
38% identity, 87% coverage
PMI1737 lipoprotein from Proteus mirabilis HI4320
B4EZ34 Lipoprotein from Proteus mirabilis (strain HI4320)
46% identity, 75% coverage
Ent638_0594 protein of unknown function DUF903 from Enterobacter sp. 638
40% identity, 83% coverage
STM0080 putative outer membrane lipoprotein from Salmonella typhimurium LT2
SEN0081 putative lipoprotein from Salmonella enterica subsp. enterica serovar Enteritidis str. P125109
SPA0081 putative lipoprotein from Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150
36% identity, 93% coverage
- Peeling back the many layers of competitive exclusion
Maurer, Frontiers in microbiology 2024 - “...; STM3156 ; STM3604 + + Outer membrane pgtE + yhcN ; STM3361-2 + + STM0080 + yhfL + + Cell wall cidAB + Oxidative stress srgA + yciGFE + Osmotic stress yehY + + Antimicrobials emrD + yabI + ydhE + CRISPR STM2938-93 ; STM2937-43...”
- Adaptation of multilocus sequencing for studying variation within a major clone: evolutionary relationships of Salmonella enterica serovar Typhimurium
Hu, Genetics 2006 - “...TM3124_2 TM3211_1 TM3211_2d TM3230_1 TM3230_2 TM3230_3d TM3275_1d TM3275_2 G to A (ns, STM0080) C to A T to C C to T (s, dsbG) G to A G to T G to A A to...”
- RNA sequencing reveals differences between the global transcriptomes of Salmonella enterica serovar enteritidis strains with high and low pathogenicities
Shah, Applied and environmental microbiology 2014 - “...particular interest are genes encoding putative lipoproteins (SEN0081, yfbK, and ygdI). Bacterial lipoproteins perform various roles, including nutrient uptake,...”
- Comparative profiling of Pseudomonas aeruginosa strains reveals differential expression of novel unique and conserved small RNAs
Ferrara, PloS one 2012 - “...bkdR 5070 SPA0078 I 2421/2422 33370/33380 + 2040 SPA0079 I 2763/2764 28350/28360 + 50 70 SPA0081 I 3069/3070 moxR /24440 + 90 SPA0084 I 3535/3536 + 18620/18630 7050 SPA0085 c I rsmZ rpoS / fdxA + 120 ; RsmZ SPA0086 I 3919/1920 13170/13190 + 70 430...”
STM1673 putative outer membrane lipoprotein from Salmonella typhimurium LT2
38% identity, 69% coverage
- Salmonella serovar identification using PCR-based detection of gene presence and absence
Arrach, Journal of clinical microbiology 2008 - “...STM1035, STM1045, STM1521, STM1534, STM1549, STM1579, STM1673, STM1869, STM1869A, STM2030, STM2033, STM2055, STM2094, STM2177, STM2438, STM2591, STM2601,...”
- Adaptation of multilocus sequencing for studying variation within a major clone: evolutionary relationships of Salmonella enterica serovar Typhimurium
Hu, Genetics 2006 - “...1766735-1766881 8 bp (dp) IG (STM1672, STM1673) 3062413-3062595 (Ec) 791460 456364 720484-720485 481632-481650 907035 6 (Ec) 1096771-1096774 1766792-1766799...”
- Fluorescent amplified fragment length polymorphism analysis of Salmonella enterica serovar typhimurium reveals phage-type- specific markers and potential for microarray typing
Hu, Journal of clinical microbiology 2002 - “...re- spectively, in an intergenic region between STM1672 and STM1673 at serovar Typhimurium LT2 genome sequence bases 1766792 to 1766799 (26). For both pairs of...”
For advice on how to use these tools together, see
Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC
and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot,
BRENDA,
CAZy (as made available by dbCAN),
BioLiP,
CharProtDB,
MetaCyc,
EcoCyc,
TCDB,
REBASE,
the Fitness Browser,
and a subset of the European Nucleotide Archive with the /experiment tag.
Given this database and a protein sequence query,
PaperBLAST uses protein-protein BLAST
to find similar sequences with E < 0.001.
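As a rough illustration of the E-value cutoff, the sketch below filters hits given in BLAST's 12-column tabular format. The sample rows, the subject names `hitA` and `hitB`, and the helper `filter_hits` are made up for illustration; this is not PaperBLAST's own code.

```python
E_VALUE_CUTOFF = 1e-3  # threshold stated above: E < 0.001

# Hypothetical hits in BLAST tabular format (12 tab-separated columns):
# query, subject, %identity, aln_len, mismatches, gap_opens,
# q_start, q_end, s_start, s_end, evalue, bitscore
SAMPLE = "\n".join([
    "query1\thitA\t59.0\t68\t28\t0\t1\t68\t1\t68\t2e-20\t80.1",
    "query1\thitB\t31.0\t40\t27\t1\t5\t44\t9\t47\t0.5\t25.0",
])


def filter_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return (subject, percent_identity, evalue) for hits with E < cutoff."""
    kept = []
    for line in tabular_text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue < cutoff:
            kept.append((fields[1], float(fields[2]), evalue))
    return kept


print(filter_hits(SAMPLE))
```

Only `hitA` passes the cutoff here; `hitB` has E = 0.5 and is dropped.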
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
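The "locus_tag AND genus_name" query form can be sketched as follows. The helper names `genus_of` and `build_query` are hypothetical, and quoting the identifier is an assumption; the exact string sent to EuropePMC is not specified here.

```python
def genus_of(organism):
    """Take the first word of an organism name as the genus."""
    return organism.split()[0]


def build_query(identifier, organism):
    """Combine an identifier with the genus, as in "locus_tag AND genus_name"."""
    # Quoting the identifier is an assumption made here so that it is
    # treated as a single search term.
    return f'"{identifier}" AND {genus_of(organism)}'


# Locus tags and organism names taken from the hit list above.
examples = [
    ("b2833", "Escherichia coli str. K-12 substr. MG1655"),
    ("STM1585", "Salmonella typhimurium LT2"),
]
for identifier, organism in examples:
    print(build_query(identifier, organism))
```

Requiring the genus name alongside the locus tag is what reduces false matches for identifiers that also occur as strain names or other labels.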
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
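One simple way to pick snippets is to take a window of text around each whole-word occurrence of the identifier and keep at most two non-overlapping windows. The sketch below does that; the function `select_snippets`, the window width, and the sample text are assumptions for illustration, and PaperBLAST's actual heuristics may differ.

```python
import re


def select_snippets(text, identifier, width=60, max_snippets=2):
    """Return up to max_snippets windows of text around the identifier."""
    # Match the identifier only as a whole word, so that "b2833" does not
    # match inside "b28330".
    pattern = re.compile(r"(?<!\w)" + re.escape(identifier) + r"(?!\w)")
    snippets = []
    last_end = -1
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        if start < last_end:
            continue  # overlaps the previous snippet
        snippets.append("..." + text[start:end].strip() + "...")
        last_end = end
        if len(snippets) == max_snippets:
            break
    return snippets


# Made-up text, for illustration only.
sample = "gene b2833 up. " + "x " * 50 + "then b2833 again"
for snippet in select_snippets(sample, "b2833", width=5):
    print(snippet)
```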
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a
GeneRIF
entry links the gene to an article in
PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of
BioLiP,
a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the
Protein Data Bank.
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether they are characterized or not.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transport Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Classification number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally-characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally-determined DNA binding sites from the
PRODORIC database of gene regulation in prokaryotes are included.
- Putative transcription factors from RegPrecise
are included if they have manually-curated predictions for their binding sites. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
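The ENA rule that drops nucleotide entries or papers with more than 25 such proteins could look roughly like the sketch below. The record layout (`protein`, `entry`, `papers`) and the function `filter_cds` are hypothetical, and how the two exclusions interact is an assumption: here a protein is dropped if its entry is over the limit or if none of its papers remain.

```python
from collections import Counter

MAX_PROTEINS = 25  # limit stated in the ENA item above


def filter_cds(records, limit=MAX_PROTEINS):
    """Drop CDS records from over-populated nucleotide entries or papers.

    records: list of dicts with keys 'protein', 'entry' (nucleotide entry id)
    and 'papers' (list of PubMed ids).
    """
    per_entry = Counter(r["entry"] for r in records)
    per_paper = Counter(p for r in records for p in r["papers"])
    kept = []
    for r in records:
        if per_entry[r["entry"]] > limit:
            continue  # entry has too many such proteins
        papers = [p for p in r["papers"] if per_paper[p] <= limit]
        if papers:
            kept.append({**r, "papers": papers})
    return kept


# Made-up records: one entry with 26 proteins and two small entries.
records = [{"protein": f"p{i}", "entry": "E1", "papers": ["P1"]} for i in range(26)]
records.append({"protein": "q1", "entry": "E2", "papers": ["P2"]})
records.append({"protein": "q2", "entry": "E3", "papers": ["P1", "P3"]})
print(filter_cds(records))
```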
Except for GeneRIF and ENA,
the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
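The RegPrecise item in the list above mentions clustering at 70% identity and keeping one member per cluster. A minimal greedy sketch of that idea follows, assuming pairwise percent identities have already been computed. Which clustering tool or algorithm PaperBLAST uses is not stated here; the names `tfA`, `tfB`, `tfC` and the identity values are invented.

```python
def pick_representatives(names, identity, threshold=70.0):
    """Greedy clustering: keep a protein only if it is below the identity
    threshold to every representative kept so far."""
    representatives = []
    for name in names:
        if all(
            identity.get(frozenset((name, rep)), 0.0) < threshold
            for rep in representatives
        ):
            representatives.append(name)
    return representatives


# Hypothetical pairwise percent identities (missing pairs count as 0).
identity = {
    frozenset(("tfA", "tfB")): 85.0,
    frozenset(("tfA", "tfC")): 40.0,
    frozenset(("tfB", "tfC")): 42.0,
}
print(pick_representatives(["tfA", "tfB", "tfC"], identity))
```

In this example `tfB` is 85% identical to `tfA` and is absorbed into its cluster, so only `tfA` and `tfC` are retained. The result of a greedy scheme depends on the input order.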
For more information see the
PaperBLAST paper (mSystems 2017)
or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubMed Central or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.
by Morgan Price,
Arkin group
Lawrence Berkeley National Laboratory