PaperBLAST – Find papers about a protein or its homologs

 

PaperBLAST Hits for 62 a.a. (TTNADRRKAA...)

Other sequence analysis tools:

Find functional residues: SitesBLAST

Search for conserved domains

Find the best match in UniProt

Compare to protein structures

Predict transmembrane helices: Phobius

Predict protein localization: PSORTb

Find homologs in fast.genomics

Found 76 similar proteins in the literature:

1mdyA / P10085 Crystal structure of MyoD bHLH domain bound to DNA: perspectives on DNA recognition and implications for transcriptional activation (see paper)
100% identity, 91% coverage

P21572 Myoblast determination protein 1 homolog from Coturnix japonica
94% identity, 21% coverage

P16075 Myoblast determination protein 1 homolog from Gallus gallus
94% identity, 21% coverage

NP_989545 myoblast determination protein 1 homolog from Gallus gallus
94% identity, 21% coverage

NP_788268 myoblast determination protein 1 from Rattus norvegicus
98% identity, 19% coverage

MYOD1_MOUSE / P10085 Myoblast determination protein 1 from Mus musculus (Mouse) (see 10 papers)
NP_034996 myoblast determination protein 1 from Mus musculus
98% identity, 19% coverage

MYOD1_RAT / Q02346 Myoblast determination protein 1 from Rattus norvegicus (Rat) (see paper)
98% identity, 19% coverage

NP_001002824 myoblast determination protein 1 from Sus scrofa
P49811 Myoblast determination protein 1 from Sus scrofa
98% identity, 19% coverage

NP_001079366 myoblast determination protein 1 homolog A from Xenopus laevis
92% identity, 21% coverage

Q7YS82 Myoblast determination protein 1 from Bos taurus
98% identity, 19% coverage

MYOD1_HUMAN / P15172 Myoblast determination protein 1; Class C basic helix-loop-helix protein 1; bHLHc1; Myogenic factor 3; Myf-3 from Homo sapiens (Human) (see 8 papers)
NP_002469 myoblast determination protein 1 from Homo sapiens
98% identity, 19% coverage

D2SP11 Myogenic factor from Bubalus bubalis
98% identity, 19% coverage

NP_001035568 myoblast determination protein 1 from Bos taurus
98% identity, 19% coverage

MYOD1_DANRE / Q90477 Myoblast determination protein 1 homolog; Myogenic factor 1 from Danio rerio (Zebrafish) (Brachydanio rerio) (see 2 papers)
NP_571337 myoblast determination protein 1 homolog from Danio rerio
92% identity, 22% coverage

XP_021155627 myoblast determination protein 1 from Columba livia
94% identity, 21% coverage

NP_988972 myoblast determination protein 1 from Xenopus tropicalis
93% identity, 21% coverage

MYOD1_TAKRU / Q6Q2A8 Myoblast determination protein 1 homolog; Myogenic factor 1; TmyoD1 from Takifugu rubripes (Japanese pufferfish) (Fugu rubripes) (see paper)
89% identity, 20% coverage

MYF5_MOUSE / P24699 Myogenic factor 5; Myf-5 from Mus musculus (Mouse) (see paper)
NP_032682 myogenic factor 5 from Mus musculus
82% identity, 24% coverage

NP_001100253 myogenic factor 5 from Rattus norvegicus
82% identity, 24% coverage

MYF5_HUMAN / P13349 Myogenic factor 5; Myf-5; Class C basic helix-loop-helix protein 2; bHLHc2 from Homo sapiens (Human) (see 2 papers)
NP_005584 myogenic factor 5 from Homo sapiens
82% identity, 24% coverage

NP_001025534 myogenic factor 5 from Gallus gallus
82% identity, 24% coverage

P17667 Myogenic factor 5 from Bos taurus
81% identity, 24% coverage

NP_001265704 myogenic factor 5 from Sus scrofa
82% identity, 24% coverage

NP_001095249 myogenic factor 5 from Xenopus laevis
P24700 Myogenic factor 5 from Xenopus laevis
79% identity, 24% coverage

NP_988932 myogenic factor 5 from Xenopus tropicalis
77% identity, 24% coverage

NP_001163702 nautilus, isoform B from Drosophila melanogaster
84% identity, 18% coverage

7z5iA / P13349 Transcription factor MYF5 bound to symmetrical site
89% identity, 85% coverage

LOC118407176 transcription factor SUM-1-like from Branchiostoma floridae
84% identity, 22% coverage

LOC109480322 transcription factor SUM-1-like from Branchiostoma belcheri
79% identity, 24% coverage

NP_571651 myogenic factor 5 from Danio rerio
75% identity, 26% coverage

Smp_167400 myogenic factor, putative from Schistosoma mansoni
75% identity, 7% coverage

Q91154 Myogenic factor 5 from Notophthalmus viridescens
79% identity, 24% coverage

MYOD_DROME / P22816 Myogenic-determination protein; Protein nautilus; dMyd from Drosophila melanogaster (Fruit fly) (see 2 papers)
79% identity, 19% coverage

NP_001012406 myogenin from Sus scrofa
77% identity, 27% coverage

LOC109480329 transcription factor SUM-1-like from Branchiostoma belcheri
76% identity, 27% coverage

LOC118407021 transcription factor SUM-1-like from Branchiostoma floridae
76% identity, 27% coverage

NP_571081 myogenin from Danio rerio
82% identity, 22% coverage

NP_001167580 myogenin from Ovis aries
D3YKV7 Myogenin from Ovis aries
77% identity, 27% coverage

P49812 Myogenin from Sus scrofa
77% identity, 27% coverage

MYOG_HUMAN / P15173 Myogenin; Class C basic helix-loop-helix protein 3; bHLHc3; Myogenic factor 4; Myf-4 from Homo sapiens (Human) (see paper)
NP_002470 myogenin from Homo sapiens
77% identity, 27% coverage

MYOG_RAT / P20428 Myogenin from Rattus norvegicus (Rat) (see 3 papers)
77% identity, 21% coverage

LOC100303673 myogenin from Meleagris gallopavo
77% identity, 26% coverage

P17920 Myogenin from Gallus gallus
NP_989515 myogenin from Gallus gallus
77% identity, 26% coverage

XP_002717630 myogenin isoform X1 from Oryctolagus cuniculus
77% identity, 24% coverage

NP_001104795 myogenin from Bos taurus
77% identity, 27% coverage

NP_032683 myogenic factor 6 from Mus musculus
79% identity, 24% coverage

Q7YS81 Myogenin from Bos taurus
A7L034 Myogenin from Bubalus bubalis
77% identity, 27% coverage

MYF6_XENLA / Q92020 Myogenic factor 6; Myf-6; Muscle-specific regulatory factor 4 from Xenopus laevis (African clawed frog) (see paper)
NP_001081477 myogenic factor 6 from Xenopus laevis
79% identity, 24% coverage

NP_058811 myogenin from Rattus norvegicus
77% identity, 27% coverage

MYF6_RAT / P19335 Myogenic factor 6; Myf-6; Muscle-specific regulatory factor 4 from Rattus norvegicus (Rat) (see paper)
NP_037304 myogenic factor 6 from Rattus norvegicus
79% identity, 24% coverage

MYOG_MOUSE / P12979 Myogenin; MYOD1-related protein from Mus musculus (Mouse) (see 15 papers)
NP_112466 myogenin from Mus musculus
77% identity, 27% coverage

P34060 Myogenin from Coturnix japonica
73% identity, 27% coverage

XP_972025 transcription factor SUM-1 from Tribolium castaneum
72% identity, 29% coverage

NP_861527 myogenic factor 6 from Bos taurus
Q7YS80 Myogenic factor 6 from Bos taurus
80% identity, 23% coverage

NP_001231601 myogenic factor 6 from Sus scrofa
80% identity, 23% coverage

P23409 Myogenic factor 6 from Homo sapiens
NP_002460 myogenic factor 6 from Homo sapiens
80% identity, 23% coverage

MYOD1_CAEEL / P22980 Myoblast determination protein 1 homolog; MyoD protein 1; Helix-loop-helix protein 1 from Caenorhabditis elegans (see 4 papers)
NP_001021892 Myoblast determination protein 1 homolog from Caenorhabditis elegans
74% identity, 18% coverage

NP_001003982 myogenic factor 6 from Danio rerio
77% identity, 23% coverage

XP_009860817 transcription factor protein isoform X1 from Ciona intestinalis
70% identity, 9% coverage

LOC118406750 myogenic factor 5-like from Branchiostoma floridae
74% identity, 25% coverage

LOC109480333 myogenic factor 5-like from Branchiostoma belcheri
70% identity, 27% coverage

LOC118406741 myogenic factor 5-like from Branchiostoma floridae
74% identity, 26% coverage

LOC109480315 myogenic factor 5-like from Branchiostoma belcheri
70% identity, 20% coverage

LOC118406791 myogenic factor 6-like from Branchiostoma floridae
67% identity, 25% coverage

LOC109480330 myoblast determination protein 1 homolog B-like from Branchiostoma belcheri
68% identity, 26% coverage

NP_731326 salivary gland-expressed bHLH, isoform C from Drosophila melanogaster
45% identity, 22% coverage

PTF1A_MOUSE / Q9QX98 Pancreas transcription factor 1 subunit alpha; Pancreas-specific transcription factor 1a; bHLH transcription factor p48; p48 DNA-binding subunit of transcription factor PTF1; PTF1-p48 from Mus musculus (Mouse) (see 7 papers)
NP_061279 pancreas transcription factor 1 subunit alpha from Mus musculus
46% identity, 17% coverage

PTF1A_RAT / Q64305 Pancreas transcription factor 1 subunit alpha; Pancreas-specific transcription factor 1a; bHLH transcription factor p48; p48 DNA-binding subunit of transcription factor PTF1; PTF1-p48 from Rattus norvegicus (Rat) (see 3 papers)
NP_446416 pancreas transcription factor 1 subunit alpha from Rattus norvegicus
46% identity, 17% coverage

PTF1A_HUMAN / Q7RTS3 Pancreas transcription factor 1 subunit alpha; Class A basic helix-loop-helix protein 29; bHLHa29; Pancreas-specific transcription factor 1a; bHLH transcription factor p48; p48 DNA-binding subunit of transcription factor PTF1; PTF1-p48 from Homo sapiens (Human) (see 3 papers)
NP_835455 pancreas transcription factor 1 subunit alpha from Homo sapiens
46% identity, 17% coverage

PTF1A_DANRE / Q7ZSX3 Pancreas transcription factor 1 subunit alpha; Pancreas-specific transcription factor 1a; bHLH transcription factor p48 from Danio rerio (Zebrafish) (Brachydanio rerio) (see paper)
NP_997524 pancreas transcription factor 1 subunit alpha from Danio rerio
46% identity, 21% coverage

XP_973186 helix-loop-helix protein delilah from Tribolium castaneum
43% identity, 28% coverage

PTF1A_XENLA / Q4ZHW1 Pancreas transcription factor 1 subunit alpha; Pancreas-specific transcription factor 1a; Transcription factor Ptf1a/p48 from Xenopus laevis (African clawed frog)
NP_001167491 pancreas transcription factor 1 subunit alpha from Xenopus laevis
45% identity, 20% coverage

DEI_DROME / P41894 Helix-loop-helix protein delilah; Protein taxi from Drosophila melanogaster (Fruit fly) (see paper)
NP_001287543 taxi, isoform B from Drosophila melanogaster
42% identity, 15% coverage

XP_969845 pancreas transcription factor 1 subunit alpha from Tribolium castaneum
44% identity, 24% coverage

TWIST_BRABE / O96642 Twist-related protein; BBtwist from Branchiostoma belcheri (Amphioxus) (see paper)
39% identity, 29% coverage

TWIST_DROME / P10627 Protein twist from Drosophila melanogaster (Fruit fly) (see 3 papers)
NP_001286752 twist, isoform C from Drosophila melanogaster
35% identity, 12% coverage

For advice on how to use these tools together, see Interactive tools for functional annotation of bacterial genomes.

Statistics

The PaperBLAST database links 798,070 different protein sequences to 1,261,478 scientific articles. Searches against EuropePMC were last performed on May 12, 2025.

How It Works

PaperBLAST builds a database of protein sequences that are linked to scientific articles. These links come from automated text searches against the articles in EuropePMC and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by dbCAN), BioLiP, CharProtDB, MetaCyc, EcoCyc, TCDB, REBASE, the Fitness Browser, and a subset of the European Nucleotide Archive with the /experiment tag. Given this database and a protein sequence query, PaperBLAST uses protein-protein BLAST to find similar sequences with E < 0.001.
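As a rough illustration of the E < 0.001 cutoff, the sketch below filters BLAST hits by E-value. It assumes the hits are available in BLAST's standard 12-column tabular output (subject identifier in column 2, percent identity in column 3, E-value in column 11). The names `Hit`, `parse_tabular_hits`, and `significant_hits`, and the sample rows, are hypothetical and are not taken from PaperBLAST's own code.

```python
from typing import List, NamedTuple

E_VALUE_CUTOFF = 1e-3  # threshold stated in the text: E < 0.001


class Hit(NamedTuple):
    subject_id: str
    percent_identity: float
    evalue: float


def parse_tabular_hits(text: str) -> List[Hit]:
    """Parse tab-separated BLAST rows (assumed 12-column tabular layout)."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(Hit(fields[1], float(fields[2]), float(fields[10])))
    return hits


def significant_hits(hits: List[Hit], cutoff: float = E_VALUE_CUTOFF) -> List[Hit]:
    """Keep only hits whose E-value is strictly below the cutoff."""
    return [h for h in hits if h.evalue < cutoff]


# Made-up rows for illustration only; these are not real search results.
SAMPLE = "\n".join([
    "\t".join(["query", "hitA", "98.0", "60", "1", "0", "1", "60", "100", "159", "1e-30", "120"]),
    "\t".join(["query", "hitB", "35.0", "40", "26", "0", "5", "44", "10", "49", "0.5", "30"]),
])

kept = significant_hits(parse_tabular_hits(SAMPLE))
print([h.subject_id for h in kept])
```

Only the first made-up row passes the filter, because its E-value is below 0.001 while the second row's is not.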

To build the database, we query EuropePMC with locus tags, with RefSeq protein identifiers, and with UniProt accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use queries of the form "locus_tag AND genus_name" to try to ensure that the paper is actually discussing that gene. Because EuropePMC indexes most recent biomedical papers, even if they are not open access, some of the links may be to papers that you cannot read or that our computers cannot read. We query each of these identifiers that appears in the open access part of EuropePMC, as well as every locus tag that appears in the 500 most-referenced genomes, so that a gene may appear in the PaperBLAST results even though none of the papers that mention it are open access. We also incorporate text-mined links from EuropePMC that link open access articles to UniProt or RefSeq identifiers. (This yields some additional links because EuropePMC uses different heuristics for their text mining than we do.)
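The "locus_tag AND genus_name" query form described above can be sketched as a small string-building helper. This is a minimal sketch: the function name `build_query` is hypothetical, the genus is assumed to be the first word of the organism name, and the code only constructs the query text without sending it to EuropePMC.

```python
def build_query(locus_tag: str, organism_name: str) -> str:
    """Combine a locus tag with the genus so that a match is more likely
    to be a paper that actually discusses that gene."""
    # Assumption: the genus is the first whitespace-separated word
    # of the organism name.
    genus_name = organism_name.split()[0]
    return f"{locus_tag} AND {genus_name}"


# Illustrative input only.
print(build_query("b0002", "Escherichia coli"))
```

Pairing the locus tag with the genus narrows the search, since a short locus tag on its own may match unrelated text.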

For every article that mentions a locus tag, a RefSeq protein identifier, or a UniProt accession, we try to select one or two snippets of text that refer to the protein. If we cannot get access to the full text, we try to select a snippet from the abstract, but unfortunately, unique identifiers such as locus tags are rarely provided in abstracts.
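The source does not describe how snippets are chosen, so the following is only one plausible sketch of the idea: take a fixed window of words around each mention of the identifier and stop after two snippets. The function `select_snippets`, the window size, and the punctuation handling are all assumptions made for illustration, not PaperBLAST's actual heuristics.

```python
from typing import List


def select_snippets(text: str, identifier: str,
                    window: int = 5, max_snippets: int = 2) -> List[str]:
    """Return up to max_snippets word windows centered on the identifier."""
    words = text.split()
    snippets: List[str] = []
    last_end = -1  # index of the last word already used in a snippet
    for i, word in enumerate(words):
        # Compare after stripping common surrounding punctuation.
        if word.strip(".,;:()[]") != identifier:
            continue
        if i <= last_end:
            continue  # this mention is already inside the previous snippet
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        snippets.append(" ".join(words[start:end]))
        last_end = end - 1
        if len(snippets) == max_snippets:
            break
    return snippets


# Made-up sentence for illustration only.
example = "We deleted b0002 and saw a growth defect."
print(select_snippets(example, "b0002", window=2))
```

A real implementation would probably also need to respect sentence boundaries and handle identifiers that are split or hyphenated in the article text.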

PaperBLAST also incorporates manually-curated protein functions:

Except for GeneRIF and ENA, the curated entries include a short curated description of the protein's function. For entries from BioLiP, the protein's function may not be known beyond binding to the ligand. Many of these entries also link to articles in PubMed.

For more information see the PaperBLAST paper (mSystems 2017) or the code. You can download PaperBLAST's database here.

Changes to PaperBLAST since the paper was written:

Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.

Secrets

PaperBLAST cannot provide snippets for many of the papers that are published in non-open-access journals. This limitation applies even if the paper is marked as "free" on the publisher's web site and is available in PubMed Central or EuropePMC. If a journal that you publish in is marked as "secret," please consider publishing elsewhere.

Omissions from the PaperBLAST Database

Many important articles are missing from PaperBLAST, either because the article's full text is not in EuropePMC (as for many older articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an article that characterizes a protein's function but is missing from PaperBLAST, please notify the curators at UniProt or add an entry to GeneRIF. Entries in either of these databases will eventually be incorporated into PaperBLAST. Note that to add an entry to UniProt, you will need to find the UniProt identifier for the protein. If the protein is not already in UniProt, you can ask them to create an entry. To add an entry to GeneRIF, you will need an NCBI Gene identifier, but unfortunately many prokaryotic proteins in RefSeq do not have corresponding Gene identifiers.

References

PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.

Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.

Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.

UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.

BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.

The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.

The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.

CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.

The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.

The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.

REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.

Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.

by Morgan Price, Arkin group
Lawrence Berkeley National Laboratory