PaperBLAST

PaperBLAST – Find papers about a protein or its homologs

PaperBLAST Hits for VIMSS10097433 proline-rich family protein (602 a.a., MGCCQSRIDS...)

Other sequence analysis tools:

Find functional residues: SitesBLAST

Search for conserved domains

Find the best match in UniProt

Compare to protein structures

Predict transmembrane helices: Phobius

Predict protein localization: PSORTb

Find homologs in fast.genomics

Found 26 similar proteins in the literature:

AT3G51290 proline-rich family protein from Arabidopsis thaliana
100% identity, 100% coverage

Comprehensive Analysis of Subcellular Localization, Immune Function and Role in Bacterial wilt Disease Resistance of Solanum lycopersicum Linn. ROP Family Small GTPases
Wang, International journal of molecular sciences 2022
- “...CA00g82910, CA00g84620, CA01g27430, CA02g04310, CA02g05500, CA02g21300, CA03g28070, CA04g05500, and CA08g19280; AtAPSR1 and AtRop1 11 : At3g51290, At3g51300, At1g20090, At2g17800, At1g75840, At4g35950, At4g35020, At5g45970, At2g44690, At4g28950, At3g48040, and At5g62880. Acknowledgments We are very grateful to Tsuyoshi Nakagawa and Rosa Lozano-Duran for kindly providing the Gateway-pGWB vectors. We...”
Braving the attitude of altitude: Caragana jubata at work in cold desert of Himalaya
Bhardwaj, Scientific reports 2013
- “...3, 6, 12, and 24h for aerial tissues. Microarray data for Arabidopsis genes At4g04720 and At3g51290, homologues of CjCDPK and CjMce family , respectively were not available in database and hence could not be included in the analysis ( Supplementary Fig. S4 and Supplementary Table S4...”

APSR1_ARATH / A0A178VBJ0 Protein ALTERED PHOSPHATE STARVATION RESPONSE 1 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
94% identity, 94% coverage

function: Required for the coordination of cell differentiation and cell elongation in the root tip (PubMed:23498857). Required for the coordination of cell processes necessary for correct root growth in response to phosphate starvation, through the modulation of the auxin transporter protein PIN7 (PubMed:23498857).
disruption phenotype: Defects in primary root elongation and enhanced root hair elongation.

NP_001190052 pyridoxal-phosphate-dependent serine hydroxymethyltransferase, putative (DUF632) from Arabidopsis thaliana
94% identity, 75% coverage

APSR1, a novel gene required for meristem maintenance, is negatively regulated by low phosphate availability.
González-Mendoza, Plant science : an international journal of experimental plant biology 2013 (PubMed)
- GeneRIF: APSR1 is required for the coordination of cell processes necessary for correct root growth in response to phosphate starvation conceivably by direct or indirect modulation of PIN7. [APSR1]

LOC115717108 nitrate regulatory gene2 protein from Cannabis sativa
39% identity, 49% coverage

Sink strength, nutrient allocation, cannabinoid yield, and associated transcript profiles vary in two drug-type Cannabis chemovars
Jost, Journal of experimental botany 2025
- “...al. , 2016 ), were detected in the RNA-seq data. The homologue encoded by the LOC115717108 gene was expressed in both chemovars and at similar levels across all organs of the THC-dominant chemovar. For the CBD-dominant chemovar, its expression was higher in the roots (LR) and...”

NRG2_ARATH / Q93YU8 Nitrate regulatory gene2 protein; NRG2 protein from Arabidopsis thaliana (Mouse-ear cress) (see paper)
AT3G60320 DNA binding from Arabidopsis thaliana
38% identity, 52% coverage

function: Required for nitrate signaling. Regulates expression of the nitrate-responsive genes NIA1, NIR1, NRT2.1 and NPF6.3/NRT1.1.
subunit: Interacts with NLP7.
disruption phenotype: Under high nitrate concentration, seedlings are slightly smaller and display later flowering than wild-type. Under low nitrate concentration, seedlings appear normal. Nitrate accumulation of the seedlings and roots is significantly lower than in wild-type, however no difference of nitrate accumulation in leaves. Nitrate content differences in roots and leaves may be due to the reduced expression of NPF6.3/NRT1.1 in roots and the increased expression of NPF7.2/NRT1.8 in leaves. After nitrate treatment, altered expression of many genes involved in nitrogen-related clusters including nitrate transport and response to nitrate.
Alkaloid production and response to natural adverse conditions in Peganum harmala: in silico transcriptome analyses
Jazayeri, Biotechnologia 2022
- “...most abundant proteins found as UniProtKB hits with more than 20 top-matched hits were Q9SZL8, Q93YU8, Q9ZT94, A0A1P8AUY4, P0C2F6, Q6NQJ8, Q9S7I6, and O23372, as presented in Table 2 . These proteins were chosen because they were the most abundant in the generated integrated transcriptome. These proteins...”
- “...processes (BP) such as growth and development and stress resistance (Ma and Li, 2018 ). Q93YU8 or nitrate regulatory gene 2 protein is involved in nitrate signaling and regulation (Xu et al., 2016 ). It causes nitrate accumulation in plants by modulating nitrate uptake by roots...”
Comprehensive Analysis of the Membrane Phosphoproteome Regulated by Oligogalacturonides in Arabidopsis thaliana
Mattei, Frontiers in plant science 2016
- “...NU Q93ZG7 25 ADTVEKVPTVVES(0.005)S(0.004)S(0.007)S(0.013)S(0.011)T(0.011)VEAS(0.186)N S (0.762)AEK Tr Tr Tr ns Putative bZIP protein At3g60320 NU Q93YU8 147 IPHIIS(0.15)ES(0.649)S(0.189)PS(0.019)S(0.08) S (0.912)PR Tr Tr Tr 3.0/ns Nuclear/nucleolar GTPase (AtNug2) At1g52980 NU Q9C923 116 ERKIPM S (1)IIT(1)DNK Tr Tr Tr ns Nuclear/nucleolar GTPase (AtNug2) At1g52980 NU Q9C923 119 ERKIPMS(1)II...”
Alkaloid production and response to natural adverse conditions in Peganum harmala: in silico transcriptome analyses
Jazayeri, Biotechnologia 2022
- “...FAR1-RELATED SEQUENCE 5, (FAR1: FAR-RED IMPAIRED RESPONSE 1) AT4G38180 Q93YU8 25 nitrate regulatory gene2 protein AT3G60320 Q9ZT94 24 retrovirus-related Pol polyprotein from transposon RE2 (retro element 2) (AtRE2) [includes: protease RE2 (EC 3.4.23.-); reverse transcriptase RE2 (EC 2.7.7.49); endonuclease RE2] AT4G02960 A0A1P8AUY4 23 midasin (AtMDN1) (dynein-related...”
Mapping of a novel clubroot disease resistance locus in Brassica napus and related functional identification
Jiang, Frontiers in plant science 2022
- “...WRKY family transcription factor LOC106441903 AT2G44770 91.61 ELMO/CED-12 family LOC106438229 AT2G44790 77.99 uclacyanin 2 LOC106438230 AT3G60320 80.50 bZIP domain class transcription factor (DUF630 and DUF632) LOC106438231 AT2G44940 78.01 Ethylene-responsive transcription factor 34 LOC106441904 AT2G45060 88.52 alanine-tRNA ligase LOC106438232 AT2G45060 85.02 alanine-tRNA ligase LOC106438234 AT2G45070 85.88 Preprotein...”
WGCNA Analysis of Salt-Responsive Core Transcriptome Identifies Novel Hub Genes in Rice
Zhu, Genes 2019
- “...AT2G41010 Calmodulin (CAM)-binding protein of 25 kDa LOC_Os02g13800 AT3G24520 Heat shock transcription factor C1 LOC_Os02g43770 AT3G60320 Protein of unknown function (DUF630 and DUF632) Grey LOC_Os02g51080 AT1G74470 Pyridine nucleotide-disulphide oxidoreductase family protein LOC_Os03g20700 AT5G13630 Magnesium-chelatase subunit chlH, chloroplast, putative/Mg-protoporphyrin IX chelatase, LOC_Os01g17170 AT3G56940 Dicarboxylate diiron protein CRD1...”
Comprehensive Analysis of the Membrane Phosphoproteome Regulated by Oligogalacturonides in Arabidopsis thaliana
Mattei, Frontiers in plant science 2016
- “...At3g53110 CY; NU Q93ZG7 25 ADTVEKVPTVVES(0.005)S(0.004)S(0.007)S(0.013)S(0.011)T(0.011)VEAS(0.186)N S (0.762)AEK Tr Tr Tr ns Putative bZIP protein At3g60320 NU Q93YU8 147 IPHIIS(0.15)ES(0.649)S(0.189)PS(0.019)S(0.08) S (0.912)PR Tr Tr Tr 3.0/ns Nuclear/nucleolar GTPase (AtNug2) At1g52980 NU Q9C923 116 ERKIPM S (1)IIT(1)DNK Tr Tr Tr ns Nuclear/nucleolar GTPase (AtNug2) At1g52980 NU Q9C923...”
Unraveling the early molecular and physiological mechanisms involved in response to phenanthrene exposure
Dumas, BMC genomics 2016
- “...b Glycine cleavage T-protein family 0.00 1.00E+00 0.16 1.00E+00 0.32 1.00E+00 1.17 7.45E-12 1.14 0.00E+00 AT3G60320 Protein of unknown function (DUF630 and DUF632) 0.03 1.00E+00 0.18 1.00E+00 0.72 1.00E+00 1.10 3.72E-10 0.77 6.14E-04 AT5G24760 GroES-like zinc-binding dehydrogenase family protein 0.04 1.00E+00 0.50 1.00E+00 0.34 1.00E+00 1.08...”
Coronatine-insensitive 1 (COI1) mediates transcriptional responses of Arabidopsis thaliana to external potassium supply
Armengaud, Molecular plant 2010
- “...protein 10.9 0.6 0.6 0.7 0.7 At1g48610 Regulatory protein HAL3B 2.9 0.7 0.9 1.5 1.6 At3g60320 bZIP protein 11.1 0.6 0.5 1.4 1.7 At4g17980 NAM-TF NAM (no apical meristem)-like 11.5 0.5 0.7 1.3 1.6 At5g14260 Putative protein 2.1 0.5 0.8 1.4 1.5 At2g31680 RABA5D GTP-binding protein,...”
Conservation of microstructure between a sequenced region of the genome of rice and multiple segments of the genome of Arabidopsis thaliana
Mayer, Genome research 2001
- “...C AT2g44830 (1.4e-170) W AT4g17510 (3.8e-71) C AT3g60320 (9.6e-74) C AT4g17500 (6.6e-35) W; AT5g47220 (4.7e-31) C; AT2g44840 (4.1e-30) W AT2g44850 (8.9e-63)...”

REL2_ORYSJ / Q9AQW1 Protein ROLLING AND ERECT LEAF 2 from Oryza sativa subsp. japonica (Rice) (see paper)
37% identity, 54% coverage

function: Involved in the regulation of leaf shape formation (PubMed:27473144). May function by coordinating the expression of genes associated with leaf and bulliform cell development (PubMed:27473144).
disruption phenotype: Adaxially rolling and erect leaves which leads to plants with erect architecture and reduced lamina joint angle (PubMed:27473144). Abnormal bulliform cell number, size and arrangement in leaf blades (PubMed:27473144). Dark-green leaves with increased levels of chlorophylls (PubMed:27473144). Reduced number of tillers, altered grain morphology, and reduced number of grains per main panicle (PubMed:27473144). Reduced number and length of adventitious roots (PubMed:27473144).

Q8BGH1 Uncharacterized protein from Mus musculus
23% identity, 88% coverage

Differential proteomic and behavioral effects of long-term voluntary exercise in wild-type and APP-overexpressing transgenics.
Rao, Neurobiology of disease 2015
- “...<.05 MAPK/PI3K signaling Q9CQQ8 U6 snRNA-associated Sm-like LSm7 Lsm7 11.6 2 9.3 <.01 RNA degradation Q8BGH1 Uncharacterized 73.3 2 3.6 <.05 Unknown All proteins are listed that had significantly different main or interaction effects using 2 2 ANOVAs followed by Fisher's protected LSD; df = 13...”

AT1G52320 hypothetical protein from Arabidopsis thaliana
26% identity, 48% coverage

IQD1 Involvement in Hormonal Signaling and General Defense Responses Against Botrytis cinerea
Barda, Frontiers in plant science 2022
- “...(Spindly) 2.026 Brassinosteroids AT4G25420 GA20OX1 (Gibberellin 20-Oxidase 1) 2.309 AT3G20730 BIN3 (Brassinosteroid Insensitive 3) 2.064 AT1G52320 2OG-Fe(ll)-dependent oxygenase 2.565 AT1G74360 Leucine-rich repeat transmembrane protein kinase 3.282 AT3G10185 Gibberellin-regulated family protein 2.795 Hormone biosynthesis or metabolism genes are in red and hormone response genes are in black....”

AT1G21740 hypothetical protein from Arabidopsis thaliana
26% identity, 41% coverage

Floral heteromorphy in Primula vulgaris: progress towards isolation and characterization of the S locus
Li, Annals of botany 2011
- “...Primula S locus BAC-end Contig 51M17-F 46L18-F 1 1 AT1G21740 AT1G21750 1e-20 9e-24 20E12-R 20L16-R 1 1 AT1G44414 AT1G77390 2e-25 9e-24 82I1-R 28H15-F 2 2...”
Prediction of the plant beta-barrel proteome: a case study of the chloroplast outer envelope
Schleiff, Protein science : a publication of the Protein Society 2003
- “...soluble test protein, the transcription factor encoded by At1g21740 containing a -sheet structure and a calculated BBS of 0.662 by using the standard algorithm,...”

AT4G39790 hypothetical protein from Arabidopsis thaliana
26% identity, 61% coverage

Integrating Genome-Wide Association Study, Transcriptome and Metabolome Reveal Novel QTL and Candidate Genes That Control Protein Content in Soybean
Zhao, Plants (Basel, Switzerland) 2024
- “...1.20 1.12 1.65 AT5G10250 Phototropic-responsive NPH3 family protein Glyma.08G137900 rs8158 8 10,476,797 1.32 1.58 1.09 AT4G39790 Protein of unknown function (DUF630 and DUF632) Glyma.12G114100 rs12338 12 11,269,928 6.26 NA NA AT4G28350 Concanavalin A-like lectin protein kinase family protein Glyma.08G135800 rs8158 8 104,76,797 2.89 1.97 2.46 AT5G65000...”

AT4G35240 hypothetical protein from Arabidopsis thaliana
25% identity, 46% coverage

Transcriptome-wide high-throughput deep m(6)A-seq reveals unique differential m(6)A methylation patterns between three organs in Arabidopsis thaliana
Wan, Genome biology 2015
- “...AT1G77680 Nucleotide binding, regulation of transcription AT3G08940, AT3G07650, AT5G66570, AT5G12400, AT2G42270, AT1G70060, AT2G40770, AT5G04290, AT1G33700, AT4G35240, AT1G14790 [ 18 ] sn (o) RNA or other ncRNA AT4G13495, AT5G09585 ATP binding, ATPase or kinase activity AT2G20850, AT2G42270, AT2G40770, AT1G17750 Signaling transduction AT5G22690, AT2G20850, AT5G13000, AT1G64060, AT1G17750 [...”

bZIP107 uncharacterized protein LOC778196 from Glycine max
25% identity, 49% coverage

Polyamines Interaction with Gaseous Signaling Molecules for Resilience Against Drought and Heat Stress in Plants
Nidhi, Plants (Basel, Switzerland) 2025
- “...leads to H 2 S signaling. Furthermore, signal transduction regulates transcription factors (such as bZIP37, bZIP107, DREB2, DREB4, and WRKY108715) that play a role in white clover leaves drought response and antioxidant defense. The control of DREB2 protein by Spd via H 2 S signaling could...”
Hydrogen Sulfide in Plants: Crosstalk with Other Signal Molecules in Response to Abiotic Stresses
Wang, International journal of molecular sciences 2021
- “..., CsH8 , CsH9 and CsHA10 [ 83 ] dehydration Trifolium repens seedlings bZIP37 , bZIP107 , DREB2 , DREB4 and WRKY108715 [ 84 ] H 2 S and ETH osmotic stress S. lycopersicum seedlings LeACO1 and LeACO2 [ 94 ] H 2 S and Pro...”

AT1G77500 hypothetical protein from Arabidopsis thaliana
25% identity, 44% coverage

The Arabidopsis wall associated kinase-like 10 gene encodes a functional guanylyl cyclase and is co-expressed with pathogen defense related genes
Meier, PloS one 2010
- “...finger (C2H2 type) family protein 22 AT5G67080 0.852 Similar to mitogen-activated PKKK 20 (MAPKKK20) 23 AT1G77500 0.850 N-terminal protein myristoylation 24 AT5G48400 0.848 Glutamate receptor family protein (GLR1.2) 25 AT4G17500 0.845 Ethylene-Response-Factor -1A ( ERF-1A ) 26 AT5G64890 0.844 Elicitor peptide 2 precursor (PROPEP2) 27 AT1G66090...”

AT2G27090 hypothetical protein from Arabidopsis thaliana
27% identity, 40% coverage

The family of LSU-like proteins
Sirko, Frontiers in plant science 2014
- “...GU066886 Joka 38 144 DUF248/methyltransferase At4g18030 At1g26850 GU066887 Joka 39 119 DUF632/Function unknown, leucine zipper At2g27090 GU066888 Joka 40 515 Function unknown, nucleoporin-like At4g37130 GU066889 Joka 41 99 Poly A binding At1g49760 At4g34110 At2g23350 At1g22760 At1g71770 GU066890 Joka 42 77 FtsH protease At2g26140 GU066891 Joka 43...”

AT2G19090 hypothetical protein from Arabidopsis thaliana
30% identity, 34% coverage

Genome-Wide Association Studies and Transcriptome Changes during Acclimation and Deacclimation in Divergent Brassica napus Varieties
Horvath, International journal of molecular sciences 2020
- “...to 9387697 AT5G04120 Phosphoglycerate mutase family protein S1_38158858 C03 57859505 57759505-5799505 BnaC03g68090D 57742606 to 57744811 AT2G19090 Protein of unknown function (DUF630 and DUF632) BnaC03g68100D 57751536 to 57752049 AT4G30074 low-molecular-weight cysteine-rich 19 BnaC03g68110D 57761207 to 57764215 AT4G30060 Core-2/I-branching beta-1,6-N-acetylglucosaminyltransferase protein BnaC03g68120D 57765406 to 57766578 AT4G30010 LOCATED IN:...”
Identification of Loci and Candidate Genes Responsible for Pod Dehiscence in Soybean via Genome-Wide Association Analysis Across Multiple Environments
Hu, Frontiers in plant science 2019
- “...small nucleolar RNA-associated protein Glyma09g06460 AT3G20250 Arabidopsis Pumilio (APUM) protein Glyma09g06470 AT2G19080 Metaxin-like protein Glyma09g06480 AT2G19090 DUF630 family protein Glyma09g06491 ATCG00905 / Glyma09g06500 / Chloroplast gene encoding ribosomal protein s12 Glyma09g06521 AT5G54780 Gyp1p superfamily protein Expression Patterns of Putative Genes According to the Soybase 7 ,...”
Tandem quadruplication of HMA4 in the zinc (Zn) and cadmium (Cd) hyperaccumulator Noccaea caerulescens
Ó, PloS one 2011
- “...5 region of NcHMA4 -3 in addition to four orthologues to At2g19060, At2g19070, At2g19080 and At2g19090, which were syntenic to this region in A. thaliana ( Figure S4 , Data S4 ). As indicated through locus specific PCR analysis ( Figure 2 ), sequence data from...”
- “...the 5 end of NcHMA4 -3. Brown arrows illustrate flanking genes At2g19060, At2g19070, At2g19080 and At2g19090 and their transcriptional directions. Flanking genes are labelled according to their A. thaliana orthologues. Blue script and lines highlight sites in the fosmid which were 100% specific for that primer....”

AT1G20530 hypothetical protein from Arabidopsis thaliana
33% identity, 36% coverage

Identification of immunity-related genes in Arabidopsis and cassava using genomic data
Leal, Genomics, proteomics & bioinformatics 2013
- “...CLV2, AT4G09150, AT1G63280, AT2G33600, AT3G43890, AT1G24650 RPS4 AT5G45250 TIR-NBS-LRR class disease resistance protein APX3, BAM1, AT1G20530, ATKDSA2 ER AT2G26330 Homologous to receptor protein kinases; contains a cytoplasmic protein kinase catalytic domain, a transmembrane region and an extracellular LRR AT2G20110, TUBG1, TSD2, AT4G04170, AT2G38000, AT1G17210, AT3G06540, SWI2,...”

AT4G30130 hypothetical protein from Arabidopsis thaliana
32% identity, 36% coverage

Genome-wide association studies identify heavy metal ATPase3 as the primary determinant of natural variation in leaf cadmium in Arabidopsis thaliana
Chao, PLoS genetics 2012
- “...Heavy Metal ATPase 2 12074 AT4G30120 14730401 14733510 Reverse HMA3, Heavy Metal ATPase 3 3148 AT4G30130 14734819 14737978 Forward unknown protein 1839 AT4G30140 14738387 14740676 Reverse CDEF1, Cuticle Destructing Factor 1 4018 AT4G30150 14742452 14749987 Forward unknown protein 5794 AT4G30160 14753432 14760189 Forward VLN4,Villin-Like actin-bindingprotein 4...”

LOC18042616 nitrate regulatory gene2 protein from Citrus x clementina
22% identity, 63% coverage

Transcriptome and Metabolome Comparison of Smooth and Rough Citrus limon L. Peels Grown on Same Trees and Harvested in Different Seasons
Liu, Frontiers in plant science 2021
- “...(TF) (ERF003, LOC18053793 ), GPI-anchored protein LLG1 ( LOC18055492 ), nitrate regulatory gene 2 ( LOC18042616 ), protein GRAVITROPIC IN THE LIGHT 1 (GIL1, LOC18046532 ), syntaxin-related protein KNOLLE ( LOC18049273 ), TF bHLH62 ( LOC18033163 ), and zinc transporter 1 ( LOC18032278 ) ( Supplementary...”

For advice on how to use these tools together, see Interactive tools for functional annotation of bacterial genomes.

Statistics

The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13 2025.

How It Works

PaperBLAST builds a database of protein sequences that are linked to scientific articles. These links come from automated text searches against the articles in EuropePMC and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by dbCAN), BioLiP, CharProtDB, MetaCyc, EcoCyc, TCDB, REBASE, the Fitness Browser, and a subset of the European Nucleotide Archive with the /experiment tag. Given this database and a protein sequence query, PaperBLAST uses protein-protein BLAST to find similar sequences with E < 0.001.
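
The E-value cutoff described above can be sketched as a filter over BLAST tabular output. This is a minimal illustration, not PaperBLAST's own code: it assumes the hits are available in the standard 12-column tabular format (`-outfmt 6`), the names `Hit` and `filter_hits` are hypothetical, and coverage is computed relative to the query, which is an assumption because the page does not say which sequence the coverage refers to. The two example rows are made up.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    subject: str
    identity: float   # percent identity reported by BLAST
    coverage: float   # percent of the query covered (assumed definition)
    evalue: float


def filter_hits(lines: Iterable[str], query_length: int,
                max_evalue: float = 1e-3) -> list[Hit]:
    """Keep hits from BLAST tabular output whose E-value is below max_evalue."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        # Standard columns: qseqid sseqid pident length mismatch gapopen
        #                   qstart qend sstart send evalue bitscore
        evalue = float(f[10])
        if evalue >= max_evalue:
            continue
        qstart, qend = int(f[6]), int(f[7])
        coverage = 100.0 * (qend - qstart + 1) / query_length
        hits.append(Hit(f[1], float(f[2]), coverage, evalue))
    return hits


# Made-up rows for illustration only.
example = [
    "query\thitA\t38.0\t320\t180\t6\t1\t313\t5\t330\t2e-50\t180",
    "query\thitB\t25.0\t60\t45\t0\t100\t159\t10\t69\t0.5\t30.1",
]
for hit in filter_hits(example, query_length=602):
    print(f"{hit.subject}: {hit.identity:.0f}% identity, "
          f"{hit.coverage:.0f}% coverage")
```

Only the first row passes the filter, because the second has an E-value above 0.001.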

To build the database, we query EuropePMC with locus tags, with RefSeq protein identifiers, and with UniProt accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use queries of the form "locus_tag AND genus_name" to try to ensure that the paper is actually discussing that gene. Because EuropePMC indexes most recent biomedical papers, even if they are not open access, some of the links may be to papers that you cannot read or that our computers cannot read. We query each of these identifiers that appears in the open access part of EuropePMC, as well as every locus tag that appears in the 500 most-referenced genomes, so that a gene may appear in the PaperBLAST results even though none of the papers that mention it are open access. We also incorporate text-mined links from EuropePMC that link open access articles to UniProt or RefSeq identifiers. (This yields some additional links because EuropePMC uses different heuristics for their text mining than we do.)
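
The query form "locus_tag AND genus_name" can be sketched as a small helper. This is an illustration under assumptions: the function name `build_query` is hypothetical, and taking the genus as the first word of the organism name is a guess at how the genus is obtained. Whether PaperBLAST quotes or escapes the terms is not stated, so the sketch leaves them bare.

```python
def build_query(locus_tag: str, organism: str) -> str:
    """Combine a locus tag with the genus, as in 'locus_tag AND genus_name'."""
    words = organism.split()
    if not words:
        raise ValueError("organism name is empty")
    genus = words[0]  # assumption: the genus is the first word of the name
    return f"{locus_tag} AND {genus}"


print(build_query("AT3G60320", "Arabidopsis thaliana"))
# prints: AT3G60320 AND Arabidopsis
```

Requiring the genus alongside the locus tag reduces false matches when the same string appears in a paper in some unrelated context.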

For every article that mentions a locus tag, a RefSeq protein identifier, or a UniProt accession, we try to select one or two snippets of text that refer to the protein. If we cannot get access to the full text, we try to select a snippet from the abstract, but unfortunately, unique identifiers such as locus tags are rarely provided in abstracts.
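
Snippet selection of this kind can be sketched as a search for the identifier followed by extraction of the surrounding text. The sketch below is a guess at the general approach rather than PaperBLAST's actual heuristics: the function name `select_snippets`, the fixed character window, and the rule for skipping overlapping mentions are all assumptions. Matching is case-insensitive because the snippets on this page show identifiers in mixed case (for example At3g51290 for AT3G51290).

```python
import re


def select_snippets(text: str, identifier: str,
                    window: int = 100, max_snippets: int = 2) -> list[str]:
    """Return up to max_snippets excerpts of text around each mention."""
    # Require non-word characters on both sides so that a search for
    # AT3G5129 does not match inside AT3G51290.
    pattern = re.compile(r"(?<!\w)" + re.escape(identifier) + r"(?!\w)",
                         re.IGNORECASE)
    snippets = []
    last_end = -1
    for m in pattern.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        if start < last_end:
            continue  # this mention overlaps the previous snippet
        snippets.append("..." + text[start:end].strip() + "...")
        last_end = end
        if len(snippets) == max_snippets:
            break
    return snippets


# Shortened from a snippet shown in the results above.
text = ("Microarray data for Arabidopsis genes At4g04720 and At3g51290 "
        "were not available in database.")
for snippet in select_snippets(text, "AT3G51290", window=30):
    print(snippet)
```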

PaperBLAST also incorporates manually-curated protein functions:

Proteins from NCBI's RefSeq are included if a GeneRIF entry links the gene to an article in PubMed®. GeneRIF also provides a short summary of the article's claim about the protein, which is shown instead of a snippet.
Proteins from Swiss-Prot (the curated part of UniProt) are included if the curators identified experimental evidence for the protein's function (evidence code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that describe the protein's function are shown (with bold headings).
Proteins from BRENDA, a curated database of enzymes, are included if they are linked to a paper in PubMed and their full sequence is known.
Every protein from the non-redundant subset of BioLiP, a database of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself does not include descriptions of the proteins, those are taken from the Protein Data Bank. Descriptions from PDB rely on the original submitter of the structure and cannot be updated by others, so they may be less reliable. (For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every ligand is represented among a group of structures with similar sequences, but for PaperBLAST, we use the non-redundant set provided by BioLiP.)
Every protein from EcoCyc, a curated database of the proteins in Escherichia coli K-12, is included, regardless of whether they are characterized or not.
Proteins from the MetaCyc metabolic pathway database are included if they are linked to a paper in PubMed and their full sequence is known.
Proteins from the Transport Classification Database (TCDB) are included if they have known substrate(s), have reference(s), and are not described as uncharacterized or putative. (Some of the references are not visible on the PaperBLAST web site.)
Every protein from CharProtDB, a database of experimentally characterized protein annotations, is included.
Proteins from the CAZy database of carbohydrate-active enzymes are included if they are associated with an Enzyme Classification number. Even though CAZy does not provide links from individual protein sequences to papers, these should all be experimentally-characterized proteins.
Proteins from the REBASE database of restriction enzymes are included if they have known specificity.
Every protein with an evidence-based reannotation (based on mutant phenotypes) in the Fitness Browser is included.
Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators) with experimentally-determined DNA binding sites from the PRODORIC database of gene regulation in prokaryotes are included.
Putative transcription factors from RegPrecise are included if they have manually-curated predictions for their binding sites. These predictions are based on conserved putative regulatory sites across genomes that contain similar transcription factors, so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
Coding sequence (CDS) features from the European Nucleotide Archive (ENA) are included if the /experiment tag is set (implying that there is experimental evidence for the annotation), the nucleotide entry links to paper(s) in PubMed, and the nucleotide entry is from the STD data class (implying that these are targeted annotated sequences, not from shotgun sequencing). Also, to filter out genes whose transcription or translation was detected, but whose function was not studied, nucleotide entries or papers with more than 25 such proteins are excluded. Descriptions from ENA rely on the original submitter of the sequence and cannot be updated by others, so they may be less reliable.
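
The ENA inclusion rules in the last item of the list above can be sketched as a two-pass filter. This is an illustration under stated assumptions, not PaperBLAST's code: the record fields (`entry`, `pmids`, `has_experiment`, `data_class`) and the function name `filter_ena_features` are hypothetical, and the handling of a protein linked to both an over-represented and a normal paper (keep the protein, drop the over-represented paper) is one possible reading of the rule.

```python
from collections import Counter


def filter_ena_features(features: list[dict],
                        max_per_group: int = 25) -> list[dict]:
    """Apply the /experiment, PubMed, STD, and more-than-25 filters."""
    # Pass 1: features with experimental evidence, a paper, and STD class.
    candidates = [f for f in features
                  if f["has_experiment"] and f["pmids"]
                  and f["data_class"] == "STD"]

    # Count candidate proteins per nucleotide entry and per paper.
    entry_counts = Counter(f["entry"] for f in candidates)
    paper_counts = Counter(p for f in candidates for p in set(f["pmids"]))

    # Pass 2: drop over-represented entries and over-represented papers.
    kept = []
    for f in candidates:
        if entry_counts[f["entry"]] > max_per_group:
            continue
        pmids = [p for p in f["pmids"] if paper_counts[p] <= max_per_group]
        if pmids:  # assumption: keep the protein if any paper remains
            kept.append({**f, "pmids": pmids})
    return kept
```

Counting only the features that pass the first three checks follows the wording "more than 25 such proteins"; counting every feature in the entry would be a stricter variant.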

Except for GeneRIF and ENA, the curated entries include a short curated description of the protein's function. For entries from BioLiP, the protein's function may not be known beyond binding to the ligand. Many of these entries also link to articles in PubMed.

For more information see the PaperBLAST paper (mSystems 2017) or the code. You can download PaperBLAST's database here.

Changes to PaperBLAST since the paper was written:

November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
June 2022: incorporated some coding sequences from ENA with the /experiment tag.
March 2022: incorporated BioLiP.
April 2020: incorporated TCDB.
April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
January 2018: incorporated BRENDA.
December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.

Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.

Secrets

PaperBLAST cannot provide snippets for many of the papers that are published in non-open-access journals. This limitation applies even if the paper is marked as "free" on the publisher's web site and is available in PubMed Central or EuropePMC. If a journal that you publish in is marked as "secret," please consider publishing elsewhere.

Omissions from the PaperBLAST Database

Many important articles are missing from PaperBLAST, either because the article's full text is not in EuropePMC (as for many older articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an article that characterizes a protein's function but is missing from PaperBLAST, please notify the curators at UniProt or add an entry to GeneRIF. Entries in either of these databases will eventually be incorporated into PaperBLAST. Note that to add an entry to UniProt, you will need to find the UniProt identifier for the protein. If the protein is not already in UniProt, you can ask them to create an entry. To add an entry to GeneRIF, you will need an NCBI Gene identifier, but unfortunately many prokaryotic proteins in RefSeq do not have corresponding Gene identifiers.

References

PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.

Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.

Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.

UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.

BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.

The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.

The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.

CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.

The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.

The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.

REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.

Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.

by Morgan Price, Arkin group
Lawrence Berkeley National Laboratory