PaperBLAST
PaperBLAST Hits for VIMSS10097433 proline-rich family protein (602 a.a., MGCCQSRIDS...)
>VIMSS10097433 proline-rich family protein
MGCCQSRIDSKEIVSRCKARKRYLKHLVKARQTLSVSHALYLRSLRAVGSSLVHFSSKET
PLHLHHNPPSPSPPPPPPPRPPPPPLSPGSETTTWTTTTTSSVLPPPPPPPPPPPPPSST
WDFWDPFIPPPPSSSEEEWEEETTTATRTATGTGSDAAVTTAPTTATPQASSVVSGFSKD
TMTTTTTGSELAVVVSRNGKDLMEIIKEVDEYFLKAADSGAPLSSLLEISTSITDFSGHS
KSGKMYSSSNYECNLNPTSFWTRGFAPSKLSEYRNAGGVIGGNCIVGSHSSTVDRLYAWE
KKLYQEVKYAESIKMDHEKKVEQVRRLEMKRAEYVKTEKAKKDVEKLESQLSVSSQAIQS
ASNEIIKLRETELYPQLVELVKGSMYESHQVQTHIVQQLKYLNTIPSTEPTSELHRQSTL
QLELEFSKNPLVRSSYESKIYSFCEEWHLAIDRIPDKVASEGIKSFLTAVHGIVAQQADE
HKQKKRTESMLKDFEKKSASLRALESKYSPYSVPESRKKNPVIEKRVKVEMLKGKAEEEK
SKHEKSVSVTRAMTLNNLQMGFPHVFQAMVGFSSVCMQAFESVYNQAKSIGEDQEEVKRL
LP
Found 26 similar proteins in the literature:
AT3G51290 proline-rich family protein from Arabidopsis thaliana
100% identity, 100% coverage
- Comprehensive Analysis of Subcellular Localization, Immune Function and Role in Bacterial wilt Disease Resistance of Solanum lycopersicum Linn. ROP Family Small GTPases
Wang, International journal of molecular sciences 2022 - “...CA00g82910, CA00g84620, CA01g27430, CA02g04310, CA02g05500, CA02g21300, CA03g28070, CA04g05500, and CA08g19280; AtAPSR1 and AtRop1 11 : At3g51290, At3g51300, At1g20090, At2g17800, At1g75840, At4g35950, At4g35020, At5g45970, At2g44690, At4g28950, At3g48040, and At5g62880. Acknowledgments We are very grateful to Tsuyoshi Nakagawa and Rosa Lozano-Duran for kindly providing the Gateway-pGWB vectors. We...”
- Braving the attitude of altitude: Caragana jubata at work in cold desert of Himalaya
Bhardwaj, Scientific reports 2013 - “...3, 6, 12, and 24h for aerial tissues. Microarray data for Arabidopsis genes At4g04720 and At3g51290, homologues of CjCDPK and CjMce family , respectively were not available in database and hence could not be included in the analysis ( Supplementary Fig. S4 and Supplementary Table S4...”
APSR1_ARATH / A0A178VBJ0 Protein ALTERED PHOSPHATE STARVATION RESPONSE 1 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
94% identity, 94% coverage
- function: Required for the coordination of cell differentiation and cell elongation in the root tip (PubMed:23498857). Required for the coordination of cell processes necessary for correct root growth in response to phosphate starvation, through the modulation of the auxin transporter protein PIN7 (PubMed:23498857).
disruption phenotype: Defects in primary root elongation and enhanced root hair elongation.
NP_001190052 pyridoxal-phosphate-dependent serine hydroxymethyltransferase, putative (DUF632) from Arabidopsis thaliana
94% identity, 75% coverage
LOC115717108 nitrate regulatory gene2 protein from Cannabis sativa
39% identity, 49% coverage
NRG2_ARATH / Q93YU8 Nitrate regulatory gene2 protein; NRG2 protein from Arabidopsis thaliana (Mouse-ear cress) (see paper)
AT3G60320 DNA binding from Arabidopsis thaliana
38% identity, 52% coverage
- function: Required for nitrate signaling. Regulates expression of the nitrate-responsive genes NIA1, NIR1, NRT2.1 and NPF6.3/NRT1.1.
subunit: Interacts with NLP7.
disruption phenotype: Under high nitrate concentration, seedlings are slightly smaller and display later flowering than wild-type. Under low nitrate concentration, seedlings appear normal. Nitrate accumulation of the seedlings and roots is significantly lower than in wild-type, however no difference of nitrate accumulation in leaves. Nitrate content differences in roots and leaves may be due to the reduced expression of NPF6.3/NRT1.1 in roots and the increased expression of NPF7.2/NRT1.8 in leaves. After nitrate treatment, altered expression of many genes involved in nitrogen-related clusters including nitrate transport and response to nitrate.
- Alkaloid production and response to natural adverse conditions in Peganum harmala: in silico transcriptome analyses
Jazayeri, Biotechnologia 2022 - “...most abundant proteins found as UniProtKB hits with more than 20 top-matched hits were Q9SZL8, Q93YU8, Q9ZT94, A0A1P8AUY4, P0C2F6, Q6NQJ8, Q9S7I6, and O23372, as presented in Table 2 . These proteins were chosen because they were the most abundant in the generated integrated transcriptome. These proteins...”
- “...processes (BP) such as growth and development and stress resistance (Ma and Li, 2018 ). Q93YU8 or nitrate regulatory gene 2 protein is involved in nitrate signaling and regulation (Xu et al., 2016 ). It causes nitrate accumulation in plants by modulating nitrate uptake by roots...”
- Comprehensive Analysis of the Membrane Phosphoproteome Regulated by Oligogalacturonides in Arabidopsis thaliana
Mattei, Frontiers in plant science 2016 - “...NU Q93ZG7 25 ADTVEKVPTVVES(0.005)S(0.004)S(0.007)S(0.013)S(0.011)T(0.011)VEAS(0.186)N S (0.762)AEK Tr Tr Tr ns Putative bZIP protein At3g60320 NU Q93YU8 147 IPHIIS(0.15)ES(0.649)S(0.189)PS(0.019)S(0.08) S (0.912)PR Tr Tr Tr 3.0/ns Nuclear/nucleolar GTPase (AtNug2) At1g52980 NU Q9C923 116 ERKIPM S (1)IIT(1)DNK Tr Tr Tr ns Nuclear/nucleolar GTPase (AtNug2) At1g52980 NU Q9C923 119 ERKIPMS(1)II...”
- Alkaloid production and response to natural adverse conditions in Peganum harmala: in silico transcriptome analyses
Jazayeri, Biotechnologia 2022 - “...FAR1-RELATED SEQUENCE 5, (FAR1: FAR-RED IMPAIRED RESPONSE 1) AT4G38180 Q93YU8 25 nitrate regulatory gene2 protein AT3G60320 Q9ZT94 24 retrovirus-related Pol polyprotein from transposon RE2 (retro element 2) (AtRE2) [includes: protease RE2 (EC 3.4.23.-); reverse transcriptase RE2 (EC 2.7.7.49); endonuclease RE2] AT4G02960 A0A1P8AUY4 23 midasin (AtMDN1) (dynein-related...”
- Mapping of a novel clubroot disease resistance locus in Brassica napus and related functional identification
Jiang, Frontiers in plant science 2022 - “...WRKY family transcription factor LOC106441903 AT2G44770 91.61 ELMO/CED-12 family LOC106438229 AT2G44790 77.99 uclacyanin 2 LOC106438230 AT3G60320 80.50 bZIP domain class transcription factor (DUF630 and DUF632) LOC106438231 AT2G44940 78.01 Ethylene-responsive transcription factor 34 LOC106441904 AT2G45060 88.52 alanine-tRNA ligase LOC106438232 AT2G45060 85.02 alanine-tRNA ligase LOC106438234 AT2G45070 85.88 Preprotein...”
- WGCNA Analysis of Salt-Responsive Core Transcriptome Identifies Novel Hub Genes in Rice
Zhu, Genes 2019 - “...AT2G41010 Calmodulin (CAM)-binding protein of 25 kDa LOC_Os02g13800 AT3G24520 Heat shock transcription factor C1 LOC_Os02g43770 AT3G60320 Protein of unknown function (DUF630 and DUF632) Grey LOC_Os02g51080 AT1G74470 Pyridine nucleotide-disulphide oxidoreductase family protein LOC_Os03g20700 AT5G13630 Magnesium-chelatase subunit chlH, chloroplast, putative/Mg-protoporphyrin IX chelatase, LOC_Os01g17170 AT3G56940 Dicarboxylate diiron protein CRD1...”
- Comprehensive Analysis of the Membrane Phosphoproteome Regulated by Oligogalacturonides in Arabidopsis thaliana
Mattei, Frontiers in plant science 2016 - “...At3g53110 CY; NU Q93ZG7 25 ADTVEKVPTVVES(0.005)S(0.004)S(0.007)S(0.013)S(0.011)T(0.011)VEAS(0.186)N S (0.762)AEK Tr Tr Tr ns Putative bZIP protein At3g60320 NU Q93YU8 147 IPHIIS(0.15)ES(0.649)S(0.189)PS(0.019)S(0.08) S (0.912)PR Tr Tr Tr 3.0/ns Nuclear/nucleolar GTPase (AtNug2) At1g52980 NU Q9C923 116 ERKIPM S (1)IIT(1)DNK Tr Tr Tr ns Nuclear/nucleolar GTPase (AtNug2) At1g52980 NU Q9C923...”
- Unraveling the early molecular and physiological mechanisms involved in response to phenanthrene exposure
Dumas, BMC genomics 2016 - “...b Glycine cleavage T-protein family 0.00 1.00E+00 0.16 1.00E+00 0.32 1.00E+00 1.17 7.45E-12 1.14 0.00E+00 AT3G60320 Protein of unknown function (DUF630 and DUF632) 0.03 1.00E+00 0.18 1.00E+00 0.72 1.00E+00 1.10 3.72E-10 0.77 6.14E-04 AT5G24760 GroES-like zinc-binding dehydrogenase family protein 0.04 1.00E+00 0.50 1.00E+00 0.34 1.00E+00 1.08...”
- Coronatine-insensitive 1 (COI1) mediates transcriptional responses of Arabidopsis thaliana to external potassium supply
Armengaud, Molecular plant 2010 - “...protein 10.9 0.6 0.6 0.7 0.7 At1g48610 Regulatory protein HAL3B 2.9 0.7 0.9 1.5 1.6 At3g60320 bZIP protein 11.1 0.6 0.5 1.4 1.7 At4g17980 NAM-TF NAM (no apical meristem)-like 11.5 0.5 0.7 1.3 1.6 At5g14260 Putative protein 2.1 0.5 0.8 1.4 1.5 At2g31680 RABA5D GTP-binding protein,...”
- Conservation of microstructure between a sequenced region of the genome of rice and multiple segments of the genome of Arabidopsis thaliana
Mayer, Genome research 2001 - “...C AT2g44830 (1.4e-170) W AT4g17510 (3.8e-71) C AT3g60320 (9.6e-74) C AT4g17500 (6.6e-35) W; AT5g47220 (4.7e-31) C; AT2g44840 (4.1e-30) W AT2g44850 (8.9e-63)...”
REL2_ORYSJ / Q9AQW1 Protein ROLLING AND ERECT LEAF 2 from Oryza sativa subsp. japonica (Rice) (see paper)
37% identity, 54% coverage
- function: Involved in the regulation of leaf shape formation (PubMed:27473144). May function by coordinating the expression of genes associated with leaf and bulliform cell development (PubMed:27473144).
disruption phenotype: Adaxially rolling and erect leaves which leads to plants with erect architecture and reduced lamina joint angle (PubMed:27473144). Abnormal bulliform cell number, size and arrangement in leaf blades (PubMed:27473144). Dark-green leaves with increased levels of chlorophylls (PubMed:27473144). Reduced number of tillers, altered grain morphology, and reduced number of grains per main panicle (PubMed:27473144). Reduced number and length of adventitious roots (PubMed:27473144).
Q8BGH1 Uncharacterized protein from Mus musculus
23% identity, 88% coverage
AT1G52320 hypothetical protein from Arabidopsis thaliana
26% identity, 48% coverage
- IQD1 Involvement in Hormonal Signaling and General Defense Responses Against Botrytis cinerea
Barda, Frontiers in plant science 2022 - “...(Spindly) 2.026 Brassinosteroids AT4G25420 GA20OX1 (Gibberellin 20-Oxidase 1) 2.309 AT3G20730 BIN3 (Brassinosteroid Insensitive 3) 2.064 AT1G52320 2OG-Fe(ll)-dependent oxygenase 2.565 AT1G74360 Leucine-rich repeat transmembrane protein kinase 3.282 AT3G10185 Gibberellin-regulated family protein 2.795 Hormone biosynthesis or metabolism genes are in red and hormone response genes are in black....”
AT1G21740 hypothetical protein from Arabidopsis thaliana
26% identity, 41% coverage
AT4G39790 hypothetical protein from Arabidopsis thaliana
26% identity, 61% coverage
AT4G35240 hypothetical protein from Arabidopsis thaliana
25% identity, 46% coverage
- Transcriptome-wide high-throughput deep m(6)A-seq reveals unique differential m(6)A methylation patterns between three organs in Arabidopsis thaliana
Wan, Genome biology 2015 - “...AT1G77680 Nucleotide binding, regulation of transcription AT3G08940, AT3G07650, AT5G66570, AT5G12400, AT2G42270, AT1G70060, AT2G40770, AT5G04290, AT1G33700, AT4G35240, AT1G14790 [ 18 ] sn (o) RNA or other ncRNA AT4G13495, AT5G09585 ATP binding, ATPase or kinase activity AT2G20850, AT2G42270, AT2G40770, AT1G17750 Signaling transduction AT5G22690, AT2G20850, AT5G13000, AT1G64060, AT1G17750 [...”
bZIP107 uncharacterized protein LOC778196 from Glycine max
25% identity, 49% coverage
- Polyamines Interaction with Gaseous Signaling Molecules for Resilience Against Drought and Heat Stress in Plants
Nidhi, Plants (Basel, Switzerland) 2025 - “...leads to H 2 S signaling. Furthermore, signal transduction regulates transcription factors (such as bZIP37, bZIP107, DREB2, DREB4, and WRKY108715) that play a role in white clover leaves drought response and antioxidant defense. The control of DREB2 protein by Spd via H 2 S signaling could...”
- Hydrogen Sulfide in Plants: Crosstalk with Other Signal Molecules in Response to Abiotic Stresses
Wang, International journal of molecular sciences 2021 - “..., CsH8 , CsH9 and CsHA10 [ 83 ] dehydration Trifolium repens seedlings bZIP37 , bZIP107 , DREB2 , DREB4 and WRKY108715 [ 84 ] H 2 S and ETH osmotic stress S. lycopersicum seedlings LeACO1 and LeACO2 [ 94 ] H 2 S and Pro...”
AT1G77500 hypothetical protein from Arabidopsis thaliana
25% identity, 44% coverage
AT2G27090 hypothetical protein from Arabidopsis thaliana
27% identity, 40% coverage
- The family of LSU-like proteins
Sirko, Frontiers in plant science 2014 - “...GU066886 Joka 38 144 DUF248/methyltransferase At4g18030 At1g26850 GU066887 Joka 39 119 DUF632/Function unknown, leucine zipper At2g27090 GU066888 Joka 40 515 Function unknown, nucleoporin-like At4g37130 GU066889 Joka 41 99 Poly A binding At1g49760 At4g34110 At2g23350 At1g22760 At1g71770 GU066890 Joka 42 77 FtsH protease At2g26140 GU066891 Joka 43...”
AT2G19090 hypothetical protein from Arabidopsis thaliana
30% identity, 34% coverage
- Genome-Wide Association Studies and Transcriptome Changes during Acclimation and Deacclimation in Divergent Brassica napus Varieties
Horvath, International journal of molecular sciences 2020 - “...to 9387697 AT5G04120 Phosphoglycerate mutase family protein S1_38158858 C03 57859505 57759505-5799505 BnaC03g68090D 57742606 to 57744811 AT2G19090 Protein of unknown function (DUF630 and DUF632) BnaC03g68100D 57751536 to 57752049 AT4G30074 low-molecular-weight cysteine-rich 19 BnaC03g68110D 57761207 to 57764215 AT4G30060 Core-2/I-branching beta-1,6-N-acetylglucosaminyltransferase protein BnaC03g68120D 57765406 to 57766578 AT4G30010 LOCATED IN:...”
- Identification of Loci and Candidate Genes Responsible for Pod Dehiscence in Soybean via Genome-Wide Association Analysis Across Multiple Environments
Hu, Frontiers in plant science 2019 - “...small nucleolar RNA-associated protein Glyma09g06460 AT3G20250 Arabidopsis Pumilio (APUM) protein Glyma09g06470 AT2G19080 Metaxin-like protein Glyma09g06480 AT2G19090 DUF630 family protein Glyma09g06491 ATCG00905 / Glyma09g06500 / Chloroplast gene encoding ribosomal protein s12 Glyma09g06521 AT5G54780 Gyp1p superfamily protein Expression Patterns of Putative Genes According to the Soybase 7 ,...”
- Tandem quadruplication of HMA4 in the zinc (Zn) and cadmium (Cd) hyperaccumulator Noccaea caerulescens
Ó, PloS one 2011 - “...5 region of NcHMA4 -3 in addition to four orthologues to At2g19060, At2g19070, At2g19080 and At2g19090, which were syntenic to this region in A. thaliana ( Figure S4 , Data S4 ). As indicated through locus specific PCR analysis ( Figure 2 ), sequence data from...”
- “...the 5 end of NcHMA4 -3. Brown arrows illustrate flanking genes At2g19060, At2g19070, At2g19080 and At2g19090 and their transcriptional directions. Flanking genes are labelled according to their A. thaliana orthologues. Blue script and lines highlight sites in the fosmid which were 100% specific for that primer....”
AT1G20530 hypothetical protein from Arabidopsis thaliana
33% identity, 36% coverage
- Identification of immunity-related genes in Arabidopsis and cassava using genomic data
Leal, Genomics, proteomics & bioinformatics 2013 - “...CLV2, AT4G09150, AT1G63280, AT2G33600, AT3G43890, AT1G24650 RPS4 AT5G45250 TIR-NBS-LRR class disease resistance protein APX3, BAM1, AT1G20530, ATKDSA2 ER AT2G26330 Homologous to receptor protein kinases; contains a cytoplasmic protein kinase catalytic domain, a transmembrane region and an extracellular LRR AT2G20110, TUBG1, TSD2, AT4G04170, AT2G38000, AT1G17210, AT3G06540, SWI2,...”
AT4G30130 hypothetical protein from Arabidopsis thaliana
32% identity, 36% coverage
LOC18042616 nitrate regulatory gene2 protein from Citrus x clementina
22% identity, 63% coverage
For advice on how to use these tools together, see
Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC
and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot,
BRENDA,
CAZy (as made available by dbCAN),
BioLiP,
CharProtDB,
MetaCyc,
EcoCyc,
TCDB,
REBASE,
the Fitness Browser,
and a subset of the European Nucleotide Archive with the /experiment tag.
Given this database and a protein sequence query,
PaperBLAST uses protein-protein BLAST
to find similar sequences with E < 0.001.
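As a rough illustration of the E < 0.001 cutoff, the following Python sketch filters BLAST hits by E-value. It assumes the hits are in the standard 12-column tabular format of BLAST+ (-outfmt 6), where the subject identifier, percent identity, and E-value are columns 2, 3, and 11. The names Hit and filter_hits are hypothetical, and this is not PaperBLAST's own code.

```python
from typing import Iterable, List, NamedTuple

E_VALUE_CUTOFF = 0.001  # threshold stated in the text above


class Hit(NamedTuple):
    """One BLAST hit (hypothetical container, for illustration only)."""
    subject_id: str
    percent_identity: float
    e_value: float


def filter_hits(tabular_lines: Iterable[str],
                cutoff: float = E_VALUE_CUTOFF) -> List[Hit]:
    """Keep hits whose E-value is below the cutoff.

    Assumes tab-separated BLAST+ "-outfmt 6" lines:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    hits = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hit = Hit(fields[1], float(fields[2]), float(fields[10]))
        if hit.e_value < cutoff:
            hits.append(hit)
    return hits
```

The cutoff is strict (E < 0.001), so a hit with an E-value of exactly 0.001 would be dropped under this reading.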
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
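The query form described above can be sketched in a few lines of Python. Only the "locus_tag AND genus_name" pattern comes from the text; the function names are hypothetical, and how the query is sent to EuropePMC (endpoint, paging, rate limits) is deliberately left out.

```python
from typing import Dict, Iterable


def build_query(locus_tag: str, genus_name: str) -> str:
    """Build a search string of the form "locus_tag AND genus_name".

    Requiring the genus name alongside the locus tag is what helps ensure
    that the paper is actually discussing that gene.
    """
    return f"{locus_tag.strip()} AND {genus_name.strip()}"


def build_queries(locus_tags: Iterable[str], genus_name: str) -> Dict[str, str]:
    """Map each locus tag of one genome to its search string."""
    return {tag: build_query(tag, genus_name) for tag in locus_tags}


if __name__ == "__main__":
    # Example with two locus tags that appear in the results above.
    for tag, query in build_queries(["AT3G51290", "AT3G60320"], "Arabidopsis").items():
        print(tag, "->", query)
```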
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
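One simple way to pick such snippets is to take a window of words around each mention of the identifier. The Python sketch below does this; it is an illustrative assumption, not PaperBLAST's actual heuristic. The function name select_snippets, the window size, and the case-insensitive matching are all choices made for this example.

```python
import re
from typing import List


def select_snippets(text: str, identifier: str,
                    max_snippets: int = 2,
                    context_words: int = 15) -> List[str]:
    """Return up to max_snippets windows of text around an identifier.

    Matching is case-insensitive because papers write locus tags in
    varying case (for example At3g51290 and AT3G51290).
    """
    words = text.split()
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b", re.IGNORECASE)
    snippets: List[str] = []
    last_end = -1  # index of the last word already covered by a snippet
    for i, word in enumerate(words):
        if i <= last_end or not pattern.search(word):
            continue
        start = max(0, i - context_words)
        end = min(len(words), i + context_words + 1)
        snippets.append("..." + " ".join(words[start:end]) + "...")
        last_end = end - 1
        if len(snippets) == max_snippets:
            break
    return snippets
```

Mentions that fall inside a window already emitted are skipped, so the two snippets do not overlap.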
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a
GeneRIF
entry links the gene to an article in
PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of
BioLiP,
a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the
Protein Data Bank.
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether they are characterized or not.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transporter Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Classification number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally-characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally-determined DNA binding sites are included from the
PRODORIC database of gene regulation in prokaryotes.
- Putative transcription factors from RegPrecise are included
if they have manually-curated predictions for their binding sites. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
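The ENA inclusion rules in the last item of the list above can be read as a filter over CDS features. The Python sketch below is one interpretation, not PaperBLAST's code: the CdsFeature record and its field names are hypothetical, and it assumes that a protein is dropped when either its nucleotide entry or any of its linked papers carries more than 25 qualifying proteins.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

MAX_PROTEINS_PER_SOURCE = 25  # limit stated in the text above


@dataclass(frozen=True)
class CdsFeature:
    """Hypothetical record for one CDS feature from an ENA nucleotide entry."""
    protein_id: str
    entry_id: str                 # nucleotide entry the CDS belongs to
    data_class: str               # ENA data class, e.g. "STD"
    has_experiment_tag: bool      # True if the /experiment tag is set
    pubmed_ids: Tuple[str, ...]   # papers linked from the nucleotide entry


def select_cds(features: List[CdsFeature]) -> List[CdsFeature]:
    """Apply the inclusion rules described above (one possible reading)."""
    # Step 1: /experiment tag, at least one PubMed link, STD data class.
    candidates = [f for f in features
                  if f.has_experiment_tag and f.pubmed_ids and f.data_class == "STD"]
    # Step 2: count qualifying proteins per nucleotide entry and per paper.
    per_entry = Counter(f.entry_id for f in candidates)
    per_paper = Counter(pmid for f in candidates for pmid in f.pubmed_ids)
    # Step 3: drop proteins from entries or papers with too many proteins.
    return [f for f in candidates
            if per_entry[f.entry_id] <= MAX_PROTEINS_PER_SOURCE
            and all(per_paper[p] <= MAX_PROTEINS_PER_SOURCE for p in f.pubmed_ids)]
```

Counting only the qualifying proteins in step 2 follows the phrase "more than 25 such proteins"; counting all CDS features in an entry would be a stricter alternative.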
Except for GeneRIF and ENA,
the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
For more information see the
PaperBLAST paper (mSystems 2017)
or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubMed Central or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.
by Morgan Price,
Arkin group
Lawrence Berkeley National Laboratory