PaperBLAST

PaperBLAST – Find papers about a protein or its homologs

PaperBLAST Hits for NP_171829.1 agenet domain protein (DOMAIN OF UNKNOWN FUNCTION 724 1) (Arabidopsis thaliana) (670 a.a., MAIFKDCEVE...)

Other sequence analysis tools:

Find functional residues: SitesBLAST

Search for conserved domains

Find the best match in UniProt

Compare to protein structures

Predict transmembrane helices: Phobius

Predict protein localization: PSORTb

Find homologs in fast.genomics

Found 29 similar proteins in the literature:

DUF1_ARATH / Q9ZVT1 DUF724 domain-containing protein 1; AtDUF1 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
NP_171829 agenet domain protein (DOMAIN OF UNKNOWN FUNCTION 724 1) from Arabidopsis thaliana
AT1G03300 agenet domain-containing protein from Arabidopsis thaliana
100% identity, 100% coverage

function: May be involved in the polar growth of plant cells via transportation of RNAs.
disruption phenotype: No visible phenotype under normal growth conditions.
Characterization of DUF724 gene family in Arabidopsis thaliana.
Cao, Plant molecular biology 2010 (PubMed)
- GeneRIF: AtDuf4 were found to express in the root tips. They were localized in nucleus.
Functional Insight of Nitric-Oxide Induced DUF Genes in Arabidopsis thaliana
Nabi, Frontiers in plant science 2020
- “...proteins in 107 species: Archae0; Bacteria4; Metazoa91; Fungi93; Plants473; Viruses0; Other Eukaryotes14 (source: NCBI BLink). AT1G03300 2.99000 0.24000 12.45833 0.00000 3.62457 Member of the plant-specific DUF724 protein family. Arabidopsis has 10 DUF724 proteins. Loss of function mutant has a WT phenotype AT3G15310 25.97000 2.10000 12.36667 0.00000...”

DUF6_ARATH / O22897 DUF724 domain-containing protein 6; AtDUF6 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
AT2G47230 agenet domain-containing protein from Arabidopsis thaliana
NP_182245 agenet domain protein (DOMAIN OF UNKNOWN FUNCTION 724 6) from Arabidopsis thaliana
56% identity, 95% coverage

function: May be involved in the polar growth of plant cells via transportation of RNAs.
Methodological implementation of mixed linear models in multi-locus genome-wide association studies
Wen, Briefings in bioinformatics 2018
- “...0.460 1.199 [ 31 ] At2g27380 2 11703876 4.744 0.043 0.323 1.122 [ 33 ] At2g47230 2 19396129 4.208 0.038 0.298 0.911 [ 31 ] At3g56900 3 21079518 3.081 0.032 0.311 0.661 [ 31 ] At3g57000 3 21079518 3.081 0.032 0.311 0.661 [ 31 ] At5g06550...”
- “...al. [ 29 ]. For example, among seven known genes ( At1g03457 , At2g27380 , At2g47230 , At3g56900 , At3g57000 , At5g06550 and At5g06590 ) for 8W GH FT in this study, no genes were within the 133 candidate genes in Atwell et al. [ 29...”
Characterization of DUF724 gene family in Arabidopsis thaliana.
Cao, Plant molecular biology 2010 (PubMed)
- GeneRIF: Data show that AtDuf6 genes were expressed in roots, leaves, shoot apical meristems, anthers and pollen grains. They were localized in nucleus.

DUF3_ARATH / Q9FZD9 DUF724 domain-containing protein 3; AtDUF3 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
AT1G26540 agenet domain-containing protein from Arabidopsis thaliana
55% identity, 96% coverage

function: May be involved in the polar growth of plant cells via transportation of RNAs.
subunit: Homodimer.
disruption phenotype: No visible phenotype under normal growth conditions.
Fine Mapping and Identification of a Candidate Gene for the Glossy Green Trait in Cabbage (Brassica oleracea var. capitata)
Wang, Plants (Basel, Switzerland) 2023
- “...15769050 reverse AT1G54680.4 unknown protein Bol026948 15773097 15774439 reverse unknown protein Bol026949 15776395 15780942 forward AT1G26540 Agenet domain-containing protein-related Bol026950 15803605 15804112 reverse AT1G20340 plastocyanin (petE) Bol026952 15815490 15817138 reverse AT1G20370 tRNA pseudouridylate synthase...”
Effect of thermospermine on expression profiling of different gene using massive analysis of cDNA ends (MACE) and vascular maintenance in Arabidopsis
Sagor, Physiology and molecular biology of plants : an international journal of functional plant biology 2021
- “...- 2.0116 Cell division control 6 19 AT1G26540 - 2.0101 Agenet domain-containing protein 20 AT4G30860 - 2.0042 Histone-lysine N-methyltransferase ASHR3...”

DUF2_ARATH / F4I8W1 DUF724 domain-containing protein 2; AtDUF2 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
NP_172609 agenet domain protein (DOMAIN OF UNKNOWN FUNCTION 724 2) from Arabidopsis thaliana
40% identity, 100% coverage

function: May be involved in the polar growth of plant cells via transportation of RNAs.
Characterization of DUF724 gene family in Arabidopsis thaliana.
Cao, Plant molecular biology 2010 (PubMed)
- GeneRIF: AtDuf2 was detected in trichomes and the cells at the base of trichomes.

NP_001331103 agenet domain protein (DOMAIN OF UNKNOWN FUNCTION 724 9) from Arabidopsis thaliana
37% identity, 78% coverage

Characterization of DUF724 gene family in Arabidopsis thaliana.
Cao, Plant molecular biology 2010 (PubMed)
- GeneRIF: AtDuf9 RNAs were mostly detected in flowers, while undetectable or extremely low in other organs. They were localized in nucleus.

DUF7_ARATH / Q8H0V4 DUF724 domain-containing protein 7; AtDUF7; ABAP1-interacting protein 1 from Arabidopsis thaliana (Mouse-ear cress) (see 2 papers)
AT3G62300 agenet domain-containing protein from Arabidopsis thaliana
29% identity, 93% coverage

function: May act as a link between DNA replication, transcription and chromatin remodeling during flower development. May participate in the repression of LHP1-targeted genes during flower development by direct interaction with LHP1 (PubMed:26538092). May be involved in the polar growth of plant cells via transportation of RNAs (Probable).
subunit: Homodimer (PubMed:19795213, PubMed:26538092). Interacts with ABAP1, ARIA and LHP1 (PubMed:26538092). Interacts with the non-modified histones H1, H2B, H3 and H4 (PubMed:26538092).
AIP1 is a novel Agenet/Tudor domain protein from Arabidopsis that interacts with regulators of DNA replication, transcription and chromatin remodeling
Brasil, BMC plant biology 2015
- “...sativa Os05g04180; Angiosperm Eudicot Populus trichocarpa Potri_018G030500_5, Brassica rapa Bra022578, Manihot esculenta cassava4_1_003152, A. thaliana AT3G62300, AT5G13020. The two sequences of Agenet/Tudor repetitions from AIP1 were used (AT3G62300.1 and AT3G62300.2). c Overlapping Agenet/Tudor models generated in the I-TASSER server. The structures are colored in white (B_MA_20337g0010),...”
- “...division in leaves [ 17 ]. Among the ABAP1-interacting proteins (AIPs) identified, there was AIP1 (At3G62300), an unknown protein predicted with 722 amino acids and approximately 80,9kDa. It harbors two repeats of Agenet/Tudor domain in its N-terminal region (amino acids 1384, and 161224) as well as...”

DUF9_ARATH / Q9FFA2 DUF724 domain-containing protein 9; AtDUF9 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
AT5G23780 agenet domain-containing protein from Arabidopsis thaliana
37% identity, 78% coverage

function: May be involved in the polar growth of plant cells via transportation of RNAs.
Feedback inhibition by thiols outranks glutathione depletion: a luciferase-based screen reveals glutathione-deficient γ-ECS and glutathione synthetase mutants impaired in cadmium-induced sulfate assimilation
Jobe, The Plant journal : for cell and molecular biology 2012
- “...5, between the F18A17 and F15F15 markers. Candidate gene sequencing identified a point mutation in At5g23780, which causes an Ala Val (A404V) change in the glutathione synthetase protein. (b) Expression of genomic GS in the nrc2 mutant behind a constitutive promoter restores root elongation on Cd....”

NP_001190163 agenet domain protein (DOMAIN OF UNKNOWN FUNCTION 724 7) from Arabidopsis thaliana
29% identity, 93% coverage

Characterization of DUF724 gene family in Arabidopsis thaliana.
Cao, Plant molecular biology 2010 (PubMed)
- GeneRIF: was expressed in many different tissues, but with the highest level of expression in seedlings, roots, leaves and flowers. They were localized in nucleus.

DUF8_ARATH / F4KEA4 DUF724 domain-containing protein 8; AtDUF8 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
37% identity, 78% coverage

function: May be involved in the polar growth of plant cells via transportation of RNAs.
disruption phenotype: No visible phenotype under normal growth conditions.

DUF10_ARATH / Q9FFA0 DUF724 domain-containing protein 10; AtDUF10 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
NP_197769 agenet domain protein (DOMAIN OF UNKNOWN FUNCTION 724 10) from Arabidopsis thaliana
41% identity, 42% coverage

function: May be involved in the polar growth of plant cells via transportation of RNAs.
disruption phenotype: No visible phenotype under normal growth conditions.
Characterization of DUF724 gene family in Arabidopsis thaliana.
Cao, Plant molecular biology 2010 (PubMed)
- GeneRIF: Data show that AtDuf10 genes were expressed in roots, stems, and in other tissues. They were also localized in nucleus.

AT2G47220 3' exoribonuclease family domain 1 protein-related from Arabidopsis thaliana
NP_182244 polyribonucleotide phosphorylase, putative (DOMAIN OF UNKNOWN FUNCTION 724 5) from Arabidopsis thaliana
38% identity, 46% coverage

Genome-based analysis of Chlamydomonas reinhardtii exoribonucleases and poly(A) polymerases predicts unexpected organellar and exosomal features
Zimmer, Genetics 2008
- “...as the Mtr3 homolog and do not identify At2g47220. They further note that At3g60500, but not At3g12990, copurify with the TAP-tagged exosome. RNB1, which,...”
Characterization of DUF724 gene family in Arabidopsis thaliana.
Cao, Plant molecular biology 2010 (PubMed)
- GeneRIF: AtDuf5 was expressed in many different organs with the highest level of expression in seedlings, leaves, flowers and siliques. They were also localized in nucleus.

DUF5_ARATH / Q0WNB1 DUF724 domain-containing protein 5; AtDUF5 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
46% identity, 29% coverage

function: May be involved in the polar growth of plant cells via transportation of RNAs.
subunit: Homodimer.
disruption phenotype: No visible phenotype under normal growth conditions.

NP_001318629 agenet domain protein (DOMAIN OF UNKNOWN FUNCTION 724 8) from Arabidopsis thaliana
39% identity, 44% coverage

Characterization of DUF724 gene family in Arabidopsis thaliana.
Cao, Plant molecular biology 2010 (PubMed)
- GeneRIF: Data show that AtDuf8 genes were expressed in the vascular bundles of roots. They were localized in nucleus.

D7U2L4 Agenet domain-containing protein from Vitis vinifera
33% identity, 27% coverage

Grape ASR-Silencing Sways Nuclear Proteome, Histone Marks and Interplay of Intrinsically Disordered Proteins
Atanassov, International journal of molecular sciences 2022
- “...1.53 D7T3I0 (D7T3I0_VITVI) CBI25061.3 VIT_00s0179g00340.t01 Histone H2A.1 1.71 F6GV41 (F6GV41_VITVI) CBI16181.3 VIT_06s0004g04230.t01 Histone H2B 1.74 D7U2L4 (D7U2L4_VITVI) CBI36980.3 VIT_07s0005g01810.t01 Agenet domain-containing protein 1.87 D7TCM4 (D7TCM4_VIT CBI27882.3 VIT_11s0016g01890.t01 Single myb histone 1.33 D7TED8 (D7TED8_VITVI) CBI28861.3 VIT_12s0059g01310.t01 SUMO protein 1.43 D7TUZ2 (D7TUZ2_VITVI) CBI34317.3 VIT_14s0030g00480.t01 RNA recognition motif family...”

AGDP1_ARATH / Q500V5 Protein AGENET DOMAIN (AGD)-CONTAINING P1; Protein ONE AGENET DOMAIN-CONTAINING PROTEIN from Arabidopsis thaliana (Mouse-ear cress) (see 2 papers)
AT1G09320 agenet domain-containing protein from Arabidopsis thaliana
NP_172403 agenet domain-containing protein from Arabidopsis thaliana
27% identity, 48% coverage

function: Heterochromatin-binding protein that preferentially occupies long transposons and specifically recognizes the histone H3 'Lys-9' methylation (H3K9me) marks, with a stronger affinity for dimethylated H3K9 (H3K9me2) (PubMed:30382101, PubMed:30425322). Required for transcriptional silencing, non-CG DNA methylation (e.g. CHG and CHH regions), and H3K9 dimethylation (H3K9me2) at some loci (PubMed:30382101, PubMed:30425322). Mediates heterochromatin phase separation and chromocenter formation (PubMed:30425322).
disruption phenotype: Abnormal transcription up-regulation of some transposable elements (TEs) and of hypermethylated loci (including MU1, GP1, SN1 and ERT7) (PubMed:30382101, PubMed:30425322). Hypomethylated DNA CHG and CHH regions (PubMed:30382101, PubMed:30425322). Reduced H3K9me2 levels (PubMed:30382101). Increased ratio of decondensed nuclei (PubMed:30425322).
Plant HP1 protein ADCP1 links multivalent H3K9 methylation readout to heterochromatin formation
Zhao, Cell research 2019
- “...in Arabidopsis through a 3D-carbene based SPRi platform. 17 , 18 One Agenet domain-containing protein AT1G09320 (abbreviated as ADCP1) showed a significant signal towards H3K9me2 peptide on the SPRi platform (Fig. 1a ). ADCP1 contains three conserved tandem Agenet domains, which are labelled as Agenet 1/2,...”
AIP1 is a novel Agenet/Tudor domain protein from Arabidopsis that interacts with regulators of DNA replication, transcription and chromatin remodeling
Brasil, BMC plant biology 2015
- “...belonging to Agenet/Tudor domain family in plants, we used an Agenet/Tudor sequence from the gene At1g09320 to perform TBLASTN query against available genome sequences in Phytozome, NCBI, TAIR and Congenie databases [ 18 21 ]. The search included genomes of unicellular green algae (4 species), nonvascular...”
Plant HP1 protein ADCP1 links multivalent H3K9 methylation readout to heterochromatin formation.
Zhao, Cell research 2019
- GeneRIF: The authors report on the discovery of ADCP1 (Agenet Domain Containing Protein 1) as a multivalent histone H3K9 methylation reader in plants, and outline its functional roles in mediating heterochromatin phase separation, histone H3K9 and DNA methylation maintenance, as well as transposon silencing.
Arabidopsis AGDP1 links H3K9me2 to DNA methylation in heterochromatin.
Zhang, Nature communications 2018
- GeneRIF: AGDP1 links histone H3 lysine 9 dimethylation to DNA methylation in heterochromatin regions.[AGDP1]

7ytaB / A0A1S4CD95 Crystal structure of ntagdp3 agd1-2 in complex with an h3k9me2 peptide (see paper)
39% identity, 21% coverage

Ligand: peptide (7ytaB)

DUF4_ARATH / O81039 DUF724 domain-containing protein 4; AtDUF4 from Arabidopsis thaliana (Mouse-ear cress) (see paper)
AT2G46840 hypothetical protein from Arabidopsis thaliana
NP_182207 hypothetical protein (DOMAIN OF UNKNOWN FUNCTION 724 4) from Arabidopsis thaliana
32% identity, 18% coverage

function: May be involved in the polar growth of plant cells via transportation of RNAs.
disruption phenotype: No visible phenotype under normal growth conditions.
Overexpression of the nuclear protein gene AtDUF4 increases organ size in Arabidopsis thaliana and Brassica napus
Chen, Journal of genetics and genomics = Yi chuan xue bao 2018 (PubMed)
- “...understood. Previously, we identified a DUF724 family gene, AtDUF4 (At2g46840 ), which encodes a protein of unknown function that is 205 amino acids in length...”
Overexpression of the nuclear protein gene AtDUF4 increases organ size in Arabidopsis thaliana and Brassica napus.
Chen, Journal of genetics and genomics = Yi chuan xue bao 2018 (PubMed)
- GeneRIF: Overexpression of AtDUF4 increases plant organ size, possibly by influencing the expression of the cell wall-formation and auxin transporter genes that regulate cell size in Arabidopsis and the oilseed rape B. napus.
Characterization of DUF724 gene family in Arabidopsis thaliana.
Cao, Plant molecular biology 2010 (PubMed)
- GeneRIF: AtDuf4 were found to express in the root tips.

6ie6A / Q500V5 Crystal structure of adcp1 tandem agenet domain 3-4 in complex with h3k9me2
31% identity, 20% coverage

Ligand: peptide (6ie6A)

6ie4A / Q500V5 Crystal structure of adcp1 tandem agenet domain 1-2 in complex with h3k9me1
33% identity, 19% coverage

Ligand: peptide (6ie4A)

5zwxA / A0A493R6M0 Crystal structure of raphanus sativus agdp1 agd12 in complex with an h3k9me2 peptide (see paper)
31% identity, 20% coverage

Ligand: peptide (5zwxA)

AT1G06340 agenet domain-containing protein from Arabidopsis thaliana
30% identity, 21% coverage

PARylation of the forkhead-associated domain protein DAWDLE regulates plant immunity
Feng, EMBO reports 2016
- “...(DDL, AT3G20550), PLANT TUDOR-LIKE PROTEIN (AT1G06340), HYALURONAN/mRNA-BINDING PROTEIN (AT5G47210), and METHYL-CPG-BINDING DOMAIN 11 (MBD11, AT3G15790) were...”
- “...failed to identify the homozygous mutants for AT5G47210 and AT1G06340 that were PARylated in vivo (Fig 1E). The transcript of MBD11 and UBC13B was not altered...”

AT4G32440 agenet domain-containing protein from Arabidopsis thaliana
36% identity, 10% coverage

Transcriptome profiling of genes and pathways associated with arsenic toxicity and tolerance in Arabidopsis
Fu, BMC plant biology 2014
- “...2.27 AT1G24090 RNase H family protein 2.90 2.24 AT2G16900 Arabidopsis phospholipase-like protein family 2.80 2.08 AT4G32440 Plant Tudor-like RNA-binding protein 2.40 2.28 AT1G67360 Rubber elongation factor protein 2.24 2.73 a Col-0 200/Ws-100 refers to pair-wise comparison of expression ratio (Col-0 200M As vs Col-0 Control) /...”
Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel
Horton, Nature genetics 2012
- “...signal peaks on a SNP at 15.66 Mb in a gene of unknown function ( AT4G32440 ). Because PHS, CLR and F ST identify loci at different stages in the selection process, or loci that are experiencing different modes of adaptation 38 , one might not...”

For advice on how to use these tools together, see Interactive tools for functional annotation of bacterial genomes.

Statistics

The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13, 2025.

How It Works

PaperBLAST builds a database of protein sequences that are linked to scientific articles. These links come from automated text searches against the articles in EuropePMC and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by dbCAN), BioLiP, CharProtDB, MetaCyc, EcoCyc, TCDB, REBASE, the Fitness Browser, and a subset of the European Nucleotide Archive with the /experiment tag. Given this database and a protein sequence query, PaperBLAST uses protein-protein BLAST to find similar sequences with E < 0.001.
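
The search step described above can be sketched as a filter over BLAST results. This is a minimal illustration, not PaperBLAST's actual code: it assumes the hits are available in BLAST's default 12-column tabular format, and the names `Hit`, `parse_hits`, and the two sample rows are hypothetical. Only the E < 0.001 cutoff comes from the text; coverage is computed here as the aligned fraction of the query, which is one plausible reading of the "coverage" figures shown in the results.

```python
from typing import NamedTuple

E_VALUE_CUTOFF = 1e-3  # threshold stated in the text (E < 0.001)


class Hit(NamedTuple):
    subject: str
    identity: float  # percent identity over the aligned region
    coverage: float  # percent of the query spanned by the alignment
    evalue: float


def parse_hits(tabular: str, query_length: int) -> list[Hit]:
    """Parse BLAST tabular lines and keep hits with E below the cutoff."""
    hits = []
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        # Default columns: qseqid sseqid pident length mismatch gapopen
        #                  qstart qend sstart send evalue bitscore
        pident = float(f[2])
        qstart, qend = int(f[6]), int(f[7])
        evalue = float(f[10])
        if evalue < E_VALUE_CUTOFF:
            coverage = 100.0 * (qend - qstart + 1) / query_length
            hits.append(Hit(f[1], pident, coverage, evalue))
    # Most significant hits first
    return sorted(hits, key=lambda h: h.evalue)


# Two made-up rows for a 670-residue query: one strong hit, one weak hit.
SAMPLE = "\n".join([
    "\t".join(["query", "hypoA", "56.0", "640", "270", "6",
               "10", "646", "5", "640", "1e-150", "520"]),
    "\t".join(["query", "hypoB", "30.0", "60", "40", "2",
               "100", "159", "1", "60", "0.5", "28.0"]),
])

if __name__ == "__main__":
    for hit in parse_hits(SAMPLE, query_length=670):
        print(f"{hit.subject}: {hit.identity:.0f}% identity, "
              f"{hit.coverage:.0f}% coverage, E={hit.evalue:g}")
```

The weak hit is dropped because its E-value is above the cutoff, regardless of its identity.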

To build the database, we query EuropePMC with locus tags, with RefSeq protein identifiers, and with UniProt accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use queries of the form "locus_tag AND genus_name" to try to ensure that the paper is actually discussing that gene. Because EuropePMC indexes most recent biomedical papers, even if they are not open access, some of the links may be to papers that you cannot read or that our computers cannot read. We query each of these identifiers that appears in the open access part of EuropePMC, as well as every locus tag that appears in the 500 most-referenced genomes, so that a gene may appear in the PaperBLAST results even though none of the papers that mention it are open access. We also incorporate text-mined links from EuropePMC that link open access articles to UniProt or RefSeq identifiers. (This yields some additional links because EuropePMC uses different heuristics for their text mining than we do.)
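
The query form above can be illustrated with a short sketch. The helper names `build_query` and `search_url` are hypothetical, and the REST endpoint and its `query` and `format` parameters are assumptions about Europe PMC's public search service rather than details taken from PaperBLAST. Deriving the genus as the first word of the organism name is likewise an assumption.

```python
from urllib.parse import urlencode

# Assumed Europe PMC search endpoint; check the current service documentation.
EUROPEPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(locus_tag: str, organism: str) -> str:
    """Return a query of the form 'locus_tag AND genus_name'."""
    parts = organism.split()
    if not parts:
        raise ValueError("organism name is empty")
    genus = parts[0]  # assumption: the genus is the first word
    return f"{locus_tag} AND {genus}"


def search_url(query: str, base: str = EUROPEPMC_SEARCH) -> str:
    """URL-encode the query for a GET request (nothing is sent here)."""
    return base + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    q = build_query("AT1G03300", "Arabidopsis thaliana")
    print(q)
    print(search_url(q))
```

Requiring the genus alongside the locus tag is what guards against identifiers that happen to match unrelated strings in other papers.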

For every article that mentions a locus tag, a RefSeq protein identifier, or a UniProt accession, we try to select one or two snippets of text that refer to the protein. If we cannot get access to the full text, we try to select a snippet from the abstract, but unfortunately, unique identifiers such as locus tags are rarely provided in abstracts.
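
Snippet selection could look roughly like the sketch below. PaperBLAST's real heuristics are not described here, so everything in it is an assumption: the function name `select_snippets`, the fixed word window, and the rule that a second snippet must not overlap the first. Matching is case-insensitive because the results above show identifiers such as At2g47230 matched against AT2G47230.

```python
import re


def select_snippets(text: str, identifier: str,
                    max_snippets: int = 2, window: int = 20) -> list[str]:
    """Return up to max_snippets word windows centred on the identifier."""
    words = text.split()
    pattern = re.compile(re.escape(identifier), re.IGNORECASE)
    snippets = []
    last_end = -1  # index of the last word already used in a snippet
    for i, word in enumerate(words):
        if not pattern.search(word):
            continue
        if i <= last_end:
            continue  # this mention is already inside the previous snippet
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        snippets.append("..." + " ".join(words[start:end]) + "...")
        last_end = end - 1
        if len(snippets) == max_snippets:
            break
    return snippets


if __name__ == "__main__":
    text = "Candidate gene sequencing identified a point mutation in At5g23780, which alters the protein."
    print(select_snippets(text, "AT5G23780", window=5))
```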

PaperBLAST also incorporates manually-curated protein functions:

Proteins from NCBI's RefSeq are included if a GeneRIF entry links the gene to an article in PubMed®. GeneRIF also provides a short summary of the article's claim about the protein, which is shown instead of a snippet.
Proteins from Swiss-Prot (the curated part of UniProt) are included if the curators identified experimental evidence for the protein's function (evidence code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that describe the protein's function are shown (with bold headings).
Proteins from BRENDA, a curated database of enzymes, are included if they are linked to a paper in PubMed and their full sequence is known.
Every protein from the non-redundant subset of BioLiP, a database of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself does not include descriptions of the proteins, those are taken from the Protein Data Bank. Descriptions from PDB rely on the original submitter of the structure and cannot be updated by others, so they may be less reliable. (For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every ligand is represented among a group of structures with similar sequences, but for PaperBLAST, we use the non-redundant set provided by BioLiP.)
Every protein from EcoCyc, a curated database of the proteins in Escherichia coli K-12, is included, regardless of whether it is characterized.
Proteins from the MetaCyc metabolic pathway database are included if they are linked to a paper in PubMed and their full sequence is known.
Proteins from the Transporter Classification Database (TCDB) are included if they have known substrate(s), have reference(s), and are not described as uncharacterized or putative. (Some of the references are not visible on the PaperBLAST web site.)
Every protein from CharProtDB, a database of experimentally characterized protein annotations, is included.
Proteins from the CAZy database of carbohydrate-active enzymes are included if they are associated with an Enzyme Classification number. Even though CAZy does not provide links from individual protein sequences to papers, these should all be experimentally-characterized proteins.
Proteins from the REBASE database of restriction enzymes are included if they have known specificity.
Every protein with an evidence-based reannotation (based on mutant phenotypes) in the Fitness Browser is included.
Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators) from the PRODORIC database of gene regulation in prokaryotes are included if they have experimentally-determined DNA binding sites.
Putative transcription factors from RegPrecise are included if they have manually-curated predictions for their binding sites. These predictions are based on conserved putative regulatory sites across genomes that contain similar transcription factors, so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
Coding sequence (CDS) features from the European Nucleotide Archive (ENA) are included if the /experiment tag is set (implying that there is experimental evidence for the annotation), the nucleotide entry links to paper(s) in PubMed, and the nucleotide entry is from the STD data class (implying that these are targeted annotated sequences, not from shotgun sequencing). Also, to filter out genes whose transcription or translation was detected, but whose function was not studied, nucleotide entries or papers with more than 25 such proteins are excluded. Descriptions from ENA rely on the original submitter of the sequence and cannot be updated by others, so they may be less reliable.
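
The ENA inclusion rules in the last item can be sketched as a filter. This is one possible reading, not PaperBLAST's implementation: the record type `CdsFeature`, its field names, and the function `filter_ena` are hypothetical, and the sketch assumes that an over-limit paper is dropped from a feature while an over-limit nucleotide entry drops the feature entirely. Only the three conditions and the limit of 25 come from the text.

```python
from collections import Counter
from dataclasses import dataclass, replace

MAX_PROTEINS = 25  # limit stated in the text


@dataclass(frozen=True)
class CdsFeature:
    protein_id: str
    entry_id: str         # accession of the nucleotide entry
    data_class: str       # ENA data class, e.g. "STD"
    has_experiment: bool  # True if the /experiment tag is set
    pubmed_ids: tuple     # PubMed IDs linked from the nucleotide entry


def filter_ena(features: list) -> list:
    """Keep CDS features that pass the inclusion rules described above."""
    # Step 1: the three per-feature conditions.
    candidates = [f for f in features
                  if f.has_experiment and f.pubmed_ids and f.data_class == "STD"]
    # Step 2: count qualifying proteins per nucleotide entry and per paper.
    per_entry = Counter(f.entry_id for f in candidates)
    per_paper = Counter(p for f in candidates for p in set(f.pubmed_ids))
    kept = []
    for f in candidates:
        if per_entry[f.entry_id] > MAX_PROTEINS:
            continue  # entry likely reports detection, not function
        papers = tuple(p for p in f.pubmed_ids if per_paper[p] <= MAX_PROTEINS)
        if papers:  # assumption: a feature needs at least one remaining paper
            kept.append(replace(f, pubmed_ids=papers))
    return kept
```

Counting is done over the candidates rather than over all features, so that unannotated proteins in a large entry do not push a small set of characterized proteins over the limit.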

Except for GeneRIF and ENA, the curated entries include a short curated description of the protein's function. For entries from BioLiP, the protein's function may not be known beyond binding to the ligand. Many of these entries also link to articles in PubMed.

For more information see the PaperBLAST paper (mSystems 2017) or the code. You can download PaperBLAST's database here.

Changes to PaperBLAST since the paper was written:

November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
June 2022: incorporated some coding sequences from ENA with the /experiment tag.
March 2022: incorporated BioLiP.
April 2020: incorporated TCDB.
April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
January 2018: incorporated BRENDA.
December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.

Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.

Secrets

PaperBLAST cannot provide snippets for many of the papers that are published in non-open-access journals. This limitation applies even if the paper is marked as "free" on the publisher's web site and is available in PubMed Central or EuropePMC. If a journal that you publish in is marked as "secret," please consider publishing elsewhere.

Omissions from the PaperBLAST Database

Many important articles are missing from PaperBLAST, either because the article's full text is not in EuropePMC (as for many older articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an article that characterizes a protein's function but is missing from PaperBLAST, please notify the curators at UniProt or add an entry to GeneRIF. Entries in either of these databases will eventually be incorporated into PaperBLAST. Note that to add an entry to UniProt, you will need to find the UniProt identifier for the protein. If the protein is not already in UniProt, you can ask them to create an entry. To add an entry to GeneRIF, you will need an NCBI Gene identifier, but unfortunately many prokaryotic proteins in RefSeq do not have corresponding Gene identifiers.

References

PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.

Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.

Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.

UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.

BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.

The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.

The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.

CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.

The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.

The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.

REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.

Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.

by Morgan Price, Arkin group
Lawrence Berkeley National Laboratory