PaperBLAST – Find papers about a protein or its homologs

PaperBLAST Hits for 58 a.a. (ERKRMRNRIA...)

Other sequence analysis tools:

Find functional residues: SitesBLAST

Search for conserved domains

Find the best match in UniProt

Compare to protein structures

Predict transmembrane helices: Phobius

Predict protein localization: PSORTb

Find homologs in fast.genomics

Found 50 similar proteins in the literature:

NP_001185876 transcription factor Jun from Oryctolagus cuniculus
98% identity, 17% coverage

NP_001252779 transcription factor AP-1 from Macaca mulatta
98% identity, 17% coverage

XP_005620302 transcription factor AP-1 from Canis lupus familiaris
98% identity, 17% coverage

JUN_MOUSE / P05627 Transcription factor Jun; AH119; Activator protein 1; AP1; Proto-oncogene c-Jun; Transcription factor AP-1 subunit Jun; V-jun avian sarcoma virus 17 oncogene homolog; Jun A from Mus musculus (Mouse) (see 9 papers)
NP_034721 transcription factor Jun from Mus musculus
98% identity, 17% coverage

XP_011283269 transcription factor AP-1 from Felis catus
98% identity, 17% coverage

JUN_RAT / P17325 Transcription factor Jun; Activator protein 1; AP1; Proto-oncogene c-Jun; Transcription factor AP-1 subunit Jun; V-jun avian sarcoma virus 17 oncogene homolog from Rattus norvegicus (Rat) (see 3 papers)
NP_068607 transcription factor Jun from Rattus norvegicus
98% identity, 17% coverage

JUN_HUMAN / P05412 Transcription factor Jun; Activator protein 1; AP1; Proto-oncogene c-Jun; Transcription factor AP-1 subunit Jun; V-jun avian sarcoma virus 17 oncogene homolog; p39 from Homo sapiens (Human) (see 30 papers)
NP_002219 transcription factor Jun from Homo sapiens
98% identity, 18% coverage

P56432 Transcription factor Jun from Sus scrofa
NP_999045 transcription factor AP-1 from Sus scrofa
98% identity, 18% coverage

NP_001071295 transcription factor Jun from Bos taurus
98% identity, 19% coverage

NP_001026460 transcription factor AP-1 from Gallus gallus
98% identity, 18% coverage

JUN_CHICK / P18870 Transcription factor Jun; Proto-oncogene c-Jun; Transcription factor AP-1 subunit Jun from Gallus gallus (Chicken) (see paper)
98% identity, 18% coverage

P54864 Transcription factor Jun from Serinus canaria
98% identity, 18% coverage

LOC100703685 transcription factor AP-1 from Oreochromis niloticus
95% identity, 18% coverage

5t01A / P05412 Human c-jun DNA binding domain homodimer in complex with methylated DNA (see paper)
100% identity, 94% coverage

NP_956281 transcription factor AP-1 from Danio rerio
93% identity, 19% coverage

NP_001084266 jun proto-oncogene L homeolog from Xenopus laevis
91% identity, 18% coverage

NP_001087435 jun D proto-oncogene S homeolog from Xenopus laevis
88% identity, 19% coverage

JUND_HUMAN / P17535 Transcription factor JunD; Transcription factor AP-1 subunit JunD from Homo sapiens (Human) (see 3 papers)
NP_005345 transcription factor JunD isoform JunD-FL from Homo sapiens
86% identity, 17% coverage

P52909 Transcription factor JunD from Rattus norvegicus
NP_620230 transcription factor JunD isoform JunD-FL from Rattus norvegicus
86% identity, 17% coverage

JUND_MOUSE / P15066 Transcription factor JunD; Transcription factor AP-1 subunit JunD from Mus musculus (Mouse) (see paper)
NP_034722 transcription factor JunD isoform JunD-FL from Mus musculus
86% identity, 17% coverage

NP_001096723 transcription factor JunD from Bos taurus
86% identity, 17% coverage

NP_068608 transcription factor JunB from Rattus norvegicus
81% identity, 17% coverage

5vpeD / P17535 Transcription factor fosb/jund bzip domain in complex with cognate DNA, type-i crystal (see paper)
86% identity, 87% coverage

JUNB_MOUSE / P09450 Transcription factor JunB; MyD21; Transcription factor AP-1 subunit JunB from Mus musculus (Mouse) (see paper)
NP_032442 transcription factor JunB from Mus musculus
79% identity, 17% coverage

NP_001069124 transcription factor JunB from Bos taurus
81% identity, 17% coverage

JUNB_HUMAN / P17275 Transcription factor JunB; Transcription factor AP-1 subunit JunB from Homo sapiens (Human) (see paper)
Q5U079 Transcription factor JunB from Homo sapiens
NP_002220 transcription factor JunB from Homo sapiens
81% identity, 17% coverage

XP_017208972 junB proto-oncogene, AP-1 transcription factor subunit a isoform X1 from Danio rerio
74% identity, 20% coverage

XP_010795740 transcription factor AP-1-like from Notothenia coriiceps
74% identity, 18% coverage

Q800B3 JunB protein from Takifugu rubripes
76% identity, 18% coverage

YP_007003813 protein ORF151D from Cyprinid herpesvirus 1
71% identity, 66% coverage

A0A221I039 AP-1 transcription factor subunit from Macrobrachium nipponense
69% identity, 20% coverage

NP_997915 JunB proto-oncogene, AP-1 transcription factor subunit b from Danio rerio
72% identity, 19% coverage

LOC101164062 transcription factor jun-B from Oryzias latipes
72% identity, 18% coverage

JRA_DROME / P18289 Transcription factor Jra; Jun-related antigen; Transcription factor AP-1 subunit Jra; dJRA; dJun from Drosophila melanogaster (Fruit fly) (see 4 papers)
NP_476586 Jun-related antigen, isoform A from Drosophila melanogaster
NP_724882 Jun-related antigen, isoform B from Drosophila melanogaster
66% identity, 20% coverage

NP_001037955 transcription factor jun-B from Xenopus tropicalis
70% identity, 19% coverage

NP_001090504 jun B proto-oncogene S homeolog from Xenopus laevis
68% identity, 19% coverage

LOC726289 transcription factor AP-1 from Apis mellifera
64% identity, 21% coverage

Jun / CAA73154.1 Jun from Drosophila melanogaster (see 2 papers)
62% identity, 20% coverage

LOC101736835 transcription factor JunD from Bombyx mori
62% identity, 24% coverage

5fv8E / P05412 Structure of cjun-fosw coiled coil complex.
97% identity, 55% coverage

XP_011394512 ascospore lethal-1, variant from Neurospora crassa OR74A
45% identity, 9% coverage

NP_001035403 transcription regulator protein BACH1a from Danio rerio
43% identity, 8% coverage

An02g07070 uncharacterized protein from Aspergillus niger
43% identity, 9% coverage

NP_001296999 cyclic AMP-dependent transcription factor ATF-7 isoform 2 from Mus musculus
40% identity, 14% coverage

ATF7_MOUSE / Q8R0S1 Cyclic AMP-dependent transcription factor ATF-7; cAMP-dependent transcription factor ATF-7; Activating transcription factor 7; Transcription factor ATF-A from Mus musculus (Mouse) (see 5 papers)
40% identity, 14% coverage

Q3US59 Predicted gene, 28047 from Mus musculus
40% identity, 14% coverage

AFUA_3G11330, Afu3g11330 bZIP transcription factor (AtfA), putative from Aspergillus fumigatus Af293
43% identity, 9% coverage

atfA AtfA from Emericella nidulans (see 2 papers)
43% identity, 10% coverage

AO090003000685, XP_001819834 uncharacterized protein from Aspergillus oryzae RIB40
43% identity, 10% coverage

CCM_09124 bZIP transcription factor (AtfA), putative from Cordyceps militaris CM01
45% identity, 10% coverage

For advice on how to use these tools together, see Interactive tools for functional annotation of bacterial genomes.

Statistics

The PaperBLAST database links 798,070 different protein sequences to 1,261,478 scientific articles. Searches against EuropePMC were last performed on May 12 2025.

How It Works

PaperBLAST builds a database of protein sequences that are linked to scientific articles. These links come from automated text searches against the articles in EuropePMC and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by dbCAN), BioLiP, CharProtDB, MetaCyc, EcoCyc, TCDB, REBASE, the Fitness Browser, and a subset of the European Nucleotide Archive with the /experiment tag. Given this database and a protein sequence query, PaperBLAST uses protein-protein BLAST to find similar sequences with E < 0.001.
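The E-value cutoff described above can be illustrated with a minimal sketch. It assumes the BLAST hits are available as tab-separated text in the standard 12-column layout produced by BLAST+ with `-outfmt 6` (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The `Hit` type, the `parse_hits` function, and the example rows are hypothetical and are not taken from PaperBLAST's code.

```python
from typing import List, NamedTuple

# Cutoff stated in the text: hits are kept only if E < 0.001.
E_VALUE_CUTOFF = 0.001


class Hit(NamedTuple):
    """A hypothetical container for the fields used in this sketch."""
    subject_id: str
    percent_identity: float
    alignment_length: int
    e_value: float


def parse_hits(tabular_text: str, cutoff: float = E_VALUE_CUTOFF) -> List[Hit]:
    """Parse 12-column tabular BLAST output and keep hits with E < cutoff."""
    hits = []
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hit = Hit(
            subject_id=fields[1],
            percent_identity=float(fields[2]),
            alignment_length=int(fields[3]),
            e_value=float(fields[10]),
        )
        if hit.e_value < cutoff:  # strict inequality, as in "E < 0.001"
            hits.append(hit)
    return hits


# Made-up rows for illustration only; they are not real search results.
EXAMPLE_ROWS = "\n".join([
    "query\tsubjA\t98.2\t57\t1\t0\t1\t57\t250\t306\t2e-30\t110",
    "query\tsubjB\t40.0\t50\t30\t0\t5\t54\t100\t149\t0.5\t25.0",
    "query\tsubjC\t45.0\t40\t22\t0\t3\t42\t10\t49\t0.001\t30.0",
])
```

With the example rows, only `subjA` passes the filter: `subjB` has a larger E-value, and `subjC` sits exactly at the cutoff, which the strict inequality excludes.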

To build the database, we query EuropePMC with locus tags, with RefSeq protein identifiers, and with UniProt accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use queries of the form "locus_tag AND genus_name" to try to ensure that the paper is actually discussing that gene. Because EuropePMC indexes most recent biomedical papers, even if they are not open access, some of the links may be to papers that you cannot read or that our computers cannot read. We query each of these identifiers that appears in the open access part of EuropePMC, as well as every locus tag that appears in the 500 most-referenced genomes, so that a gene may appear in the PaperBLAST results even though none of the papers that mention it are open access. We also incorporate text-mined links from EuropePMC that link open access articles to UniProt or RefSeq identifiers. (This yields some additional links because EuropePMC uses different heuristics for their text mining than we do.)
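The query form "locus_tag AND genus_name" can be sketched as a small string-building step. This is only an illustration under assumptions: the function names are hypothetical, the genus is taken to be the first word of the organism name, and the actual call to EuropePMC is not shown.

```python
def genus_of(organism_name: str) -> str:
    """Return the first word of an organism name, assumed here to be the genus."""
    parts = organism_name.split()
    if not parts:
        raise ValueError("organism name is empty")
    return parts[0]


def build_query(locus_tag: str, organism_name: str) -> str:
    """Build a text query of the form 'locus_tag AND genus_name'."""
    return f"{locus_tag} AND {genus_of(organism_name)}"
```

For the hit `An02g07070` from Aspergillus niger listed above, `build_query("An02g07070", "Aspergillus niger")` gives `An02g07070 AND Aspergillus`. Requiring the genus name alongside the locus tag reduces the chance of matching an unrelated string that merely looks like the identifier.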

For every article that mentions a locus tag, a RefSeq protein identifier, or a UniProt accession, we try to select one or two snippets of text that refer to the protein. If we cannot get access to the full text, we try to select a snippet from the abstract, but unfortunately, unique identifiers such as locus tags are rarely provided in abstracts.
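Snippet selection could look roughly like the sketch below. The text does not describe PaperBLAST's actual heuristics, so this is a simple stand-in: it takes a window of words around each mention of an identifier and returns at most two windows. The function name, the window size, and the rule for skipping mentions already covered by an earlier snippet are all assumptions.

```python
import re
from typing import List


def select_snippets(text: str, identifier: str,
                    max_snippets: int = 2,
                    words_each_side: int = 10) -> List[str]:
    """Return up to max_snippets word windows centered on mentions of identifier."""
    words = text.split()
    # Match the identifier only when it is not embedded in a longer token,
    # so that surrounding punctuation such as "(An02g07070)" still matches.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(identifier) + r"(?![A-Za-z0-9_])"
    )
    snippets: List[str] = []
    last_end = -1  # index of the last word covered by the previous snippet
    for i, word in enumerate(words):
        if not pattern.search(word):
            continue
        if i <= last_end:
            continue  # this mention is already inside the previous snippet
        start = max(0, i - words_each_side)
        end = min(len(words), i + words_each_side + 1)
        snippets.append(" ".join(words[start:end]))
        last_end = end - 1
        if len(snippets) == max_snippets:
            break
    return snippets
```

The same function could be applied to an abstract when the full text is unavailable, although, as noted above, identifiers such as locus tags rarely appear there, so it would often return an empty list.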

PaperBLAST also incorporates manually-curated protein functions:

Except for GeneRIF and ENA, the curated entries include a short curated description of the protein's function. For entries from BioLiP, the protein's function may not be known beyond binding to the ligand. Many of these entries also link to articles in PubMed.

For more information see the PaperBLAST paper (mSystems 2017) or the code. You can download PaperBLAST's database here.

Changes to PaperBLAST since the paper was written:

Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.

Secrets

PaperBLAST cannot provide snippets for many of the papers that are published in non-open-access journals. This limitation applies even if the paper is marked as "free" on the publisher's web site and is available in PubMed Central or EuropePMC. If a journal that you publish in is marked as "secret," please consider publishing elsewhere.

Omissions from the PaperBLAST Database

Many important articles are missing from PaperBLAST, either because the article's full text is not in EuropePMC (as for many older articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an article that characterizes a protein's function but is missing from PaperBLAST, please notify the curators at UniProt or add an entry to GeneRIF. Entries in either of these databases will eventually be incorporated into PaperBLAST. Note that to add an entry to UniProt, you will need to find the UniProt identifier for the protein. If the protein is not already in UniProt, you can ask them to create an entry. To add an entry to GeneRIF, you will need an NCBI Gene identifier, but unfortunately many prokaryotic proteins in RefSeq do not have corresponding Gene identifiers.

References

PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.

Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.

Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.

UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.

BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.

The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.

The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.

CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.

The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.

The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.

REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.

Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.

by Morgan Price, Arkin group
Lawrence Berkeley National Laboratory