PaperBLAST – Find papers about a protein or its homologs

 


PaperBLAST Hits for IAI46_19690 (86 a.a., MLAVATQTPS...)

Other sequence analysis tools:

Find functional residues: SitesBLAST

Search for conserved domains

Find the best match in UniProt

Compare to protein structures

Predict transmembrane helices: Phobius

Predict protein localization: PSORTb

Find homologs in fast.genomics


Found 46 similar proteins in the literature:

ESA_00625 helix-turn-helix transcriptional regulator from Cronobacter sakazakii ATCC BAA-894
58% identity, 97% coverage

NP_042041 excisionase and transcriptional regulator from Enterobacteria phage P4
NP_042041 transcriptional regulator from Enterobacteria phage P4
48% identity, 93% coverage

t2651 hypothetical protein from Salmonella enterica subsp. enterica serovar Typhi Ty2
56% identity, 84% coverage

t4525 phage DNA binding protein from Salmonella enterica subsp. enterica serovar Typhi Ty2
61% identity, 60% coverage

ESA_00625 hypothetical protein from Enterobacter sakazakii ATCC BAA-894
63% identity, 70% coverage

BHW77_22420 helix-turn-helix transcriptional regulator from Escherichia coli
58% identity, 74% coverage

VFA_000468 helix-turn-helix transcriptional regulator from Vibrio furnissii CIP 102972
53% identity, 69% coverage

VV2261 predicted transcriptional regulator from Vibrio vulnificus YJ016
55% identity, 70% coverage

ASA_2928 probable prophage regulatory protein from Aeromonas salmonicida subsp. salmonicida A449
52% identity, 74% coverage

A55_2025 transcriptional regulator, putative from Vibrio cholerae 1587
53% identity, 67% coverage

VC1809 transcriptional regulator, putative from Vibrio cholerae O1 biovar eltor str. N16961
53% identity, 67% coverage

VCB_002857, VCG_002259, VIF_000799 helix-turn-helix transcriptional regulator from Vibrio cholerae TMA 21
53% identity, 67% coverage

MED222_15529 predicted transcriptional regulator from Vibrio sp. MED222
51% identity, 66% coverage

Reut_A2193 Prophage CP4-57 regulatory from Ralstonia eutropha JMP134
53% identity, 67% coverage

VCJ_000314 AlpA family transcriptional regulator from Vibrio metoecus
53% identity, 69% coverage

G5B91_27730 helix-turn-helix transcriptional regulator from Pseudomonas nitroreducens
53% identity, 69% coverage

VV0515 predicted transcriptional regulator from Vibrio vulnificus YJ016
52% identity, 69% coverage

VCB_000197, VC_0497 AlpA family transcriptional regulator from Vibrio cholerae TMA 21
A51_B0476 transcriptional regulator from Vibrio cholerae MZO-3
VC0497 transcriptional regulator from Vibrio cholerae O1 biovar eltor str. N16961
52% identity, 69% coverage

VV0810 predicted transcriptional regulator from Vibrio vulnificus YJ016
44% identity, 69% coverage

VIC_001987 helix-turn-helix transcriptional regulator from Vibrio coralliilyticus ATCC BAA-450
48% identity, 71% coverage

A1Q_2003 conserved domain protein from Vibrio harveyi HY01
48% identity, 69% coverage

V12B01_05053 predicted transcriptional regulator from Vibrio splendidus 12B01
49% identity, 66% coverage

AL066_26360 helix-turn-helix transcriptional regulator from Pseudomonas nunensis
47% identity, 69% coverage

VEA_004310 phage transcriptional regulator AlpA from Vibrio sp. Ex25
54% identity, 58% coverage

XF1786 phage-related protein from Xylella fastidiosa 9a5c
39% identity, 83% coverage

VCG_003160 AlpA family transcriptional regulator from Vibrio cholerae 12129(1)
48% identity, 58% coverage

SEN1998 putative phage regulatory protein from Salmonella enterica subsp. enterica serovar Enteritidis str. P125109
SEN_RS10410 helix-turn-helix transcriptional regulator from Salmonella enterica subsp. enterica serovar Enteritidis str.
43% identity, 70% coverage

VC1785 transcriptional regulator from Vibrio cholerae O1 biovar eltor str. N16961
48% identity, 58% coverage

FTN_0372 regulatory protein, AlpA family from Francisella tularensis subsp. novicida U112
38% identity, 71% coverage

ACG06_13180 AlpA family transcriptional regulator from Pseudomonas aeruginosa
39% identity, 80% coverage

WP_019750501 helix-turn-helix transcriptional regulator from Pseudomonas juntendi
45% identity, 64% coverage

VFA_001914 helix-turn-helix transcriptional regulator from Vibrio furnissii CIP 102972
43% identity, 72% coverage

YPO1904 putative transcriptional regulator from Yersinia pestis CO92
40% identity, 67% coverage

KPK_1789 transcriptional regulator, AlpA family from Klebsiella pneumoniae 342
40% identity, 67% coverage

LPC_0208 prophage CP4-57 regulatory protein AlpA from Legionella pneumophila str. Corby
43% identity, 69% coverage

KPNJ2_RS09325 helix-turn-helix transcriptional regulator from Klebsiella pneumoniae 30684/NJST258_2
37% identity, 63% coverage

AlpA / b2624 CP4-57 prophage; DNA-binding transcriptional activator AlpA from Escherichia coli K-12 substr. MG1655 (see 5 papers)
ALPA_ECOLI / P33997 DNA-binding transcriptional activator AlpA; Prophage CP4-57 regulatory protein AlpA from Escherichia coli (strain K12) (see 2 papers)
AlpA / EW|b2624 CP4-57 prophage; DNA-binding transcriptional activator from Escherichia coli K12 (see paper)
NP_417113 DNA-binding transcriptional activator AlpA from Escherichia coli str. K-12 substr. MG1655
b2624 CP4-57 prophage; DNA-binding transcriptional activator from Escherichia coli str. K-12 substr. MG1655
37% identity, 62% coverage

J417_04000 AlpA family transcriptional regulator from Dickeya zeae MS1
33% identity, 73% coverage

C1O30_RS04025, HJ580_03985 AlpA family transcriptional regulator from Dickeya zeae
33% identity, 73% coverage

AchV4_0082 excisionase and transcriptional regulator from Achromobacter phage vB_AchrS_AchV4
36% identity, 64% coverage

PMI2608 prophage regulatory protein from Proteus mirabilis HI4320
45% identity, 66% coverage

ECs1575 putative DNA binding protein from Escherichia coli O157:H7 str. Sakai
46% identity, 58% coverage

RSc1897 HYPOTHETICAL PROTEIN from Ralstonia solanacearum GMI1000
38% identity, 62% coverage

A79_5463 conserved domain protein from Vibrio parahaemolyticus AQ3810
A79_2541 conserved domain protein from Vibrio parahaemolyticus AQ3810
39% identity, 63% coverage

SO_4821 helix-turn-helix transcriptional regulator from Shewanella oneidensis MR-1
35% identity, 63% coverage

Z1124 putative prophage regulatory protein from Escherichia coli O157:H7 EDL933
Z1563 putative prophage regulatory protein from Escherichia coli O157:H7 EDL933
38% identity, 51% coverage


For advice on how to use these tools together, see Interactive tools for functional annotation of bacterial genomes.

Statistics

The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13, 2025.

How It Works

PaperBLAST builds a database of protein sequences that are linked to scientific articles. These links come from automated text searches against the articles in EuropePMC and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by dbCAN), BioLiP, CharProtDB, MetaCyc, EcoCyc, TCDB, REBASE, the Fitness Browser, and a subset of the European Nucleotide Archive with the /experiment tag. Given this database and a protein sequence query, PaperBLAST uses protein-protein BLAST to find similar sequences with E < 0.001.
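The E-value cutoff described above can be illustrated with a minimal sketch. It assumes the hits are available as BLAST+ tabular output (`-outfmt 6`, with the standard 12 tab-separated columns); the `Hit` type, the `filter_hits` function, and the sample rows are hypothetical and are not taken from PaperBLAST's code.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str
    pct_identity: float
    evalue: float


def filter_hits(tabular_text: str, max_evalue: float = 1e-3) -> List[Hit]:
    """Keep hits with E below the cutoff, best (lowest E) first."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Standard 12-column layout: sseqid=1, pident=2, evalue=10 (0-based).
        hit = Hit(fields[1], float(fields[2]), float(fields[10]))
        if hit.evalue < max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h.evalue)


# Made-up rows for illustration only; not real PaperBLAST output.
sample = "\n".join([
    "query1\tsubjA\t58.0\t83\t35\t0\t1\t83\t1\t83\t2e-30\t105",
    "query1\tsubjB\t30.0\t40\t28\t0\t5\t44\t10\t49\t0.5\t25.0",
])
for hit in filter_hits(sample):
    print(hit.subject, hit.pct_identity, hit.evalue)
```

With the default cutoff, only the first sample row passes, because the second row has E = 0.5.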

To build the database, we query EuropePMC with locus tags, with RefSeq protein identifiers, and with UniProt accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use queries of the form "locus_tag AND genus_name" to try to ensure that the paper is actually discussing that gene. Because EuropePMC indexes most recent biomedical papers, even if they are not open access, some of the links may be to papers that you cannot read or that our computers cannot read. We query each of these identifiers that appears in the open access part of EuropePMC, as well as every locus tag that appears in the 500 most-referenced genomes, so that a gene may appear in the PaperBLAST results even though none of the papers that mention it are open access. We also incorporate text-mined links from EuropePMC that link open access articles to UniProt or RefSeq identifiers. (This yields some additional links because EuropePMC uses different heuristics for their text mining than we do.)
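As a rough illustration of the "locus_tag AND genus_name" query form, the sketch below builds such a string from a locus tag and an organism name. Taking the genus to be the first word of the organism name is an assumption made here for simplicity, and `build_query` is a hypothetical helper, not a function from PaperBLAST or EuropePMC.

```python
def build_query(locus_tag: str, organism: str) -> str:
    """Return a query of the form 'locus_tag AND genus_name'."""
    words = organism.split()
    if not locus_tag.strip() or not words:
        raise ValueError("need a locus tag and an organism name")
    genus = words[0]  # assumption: the genus is the first word of the name
    return f"{locus_tag.strip()} AND {genus}"


print(build_query("VC1809", "Vibrio cholerae O1 biovar eltor str. N16961"))
```

Requiring the genus name alongside the locus tag reduces spurious matches for short identifiers that could also appear in unrelated contexts.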

For every article that mentions a locus tag, a RefSeq protein identifier, or a UniProt accession, we try to select one or two snippets of text that refer to the protein. If we cannot get access to the full text, we try to select a snippet from the abstract, but unfortunately, unique identifiers such as locus tags are rarely provided in abstracts.
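Snippet selection can be sketched as follows. PaperBLAST's actual heuristics are not described here, so this is only an assumed approach: find whole-token occurrences of the identifier and keep a fixed window of characters around the first one or two non-overlapping matches. The function name, the window size, and the sample text are all hypothetical.

```python
import re
from typing import List


def select_snippets(text: str, identifier: str,
                    max_snippets: int = 2, context: int = 60) -> List[str]:
    """Return up to max_snippets windows of text around the identifier."""
    # Match the identifier only as a whole token, so that VC1809
    # does not match inside VC18090.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(identifier) + r"(?![A-Za-z0-9_])"
    )
    snippets = []
    last_end = -1
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        if start < last_end:
            continue  # skip matches that overlap the previous snippet
        snippets.append(text[start:end].strip())
        last_end = end
        if len(snippets) == max_snippets:
            break
    return snippets


# Invented text for illustration only.
article = "The gene VC1809 was deleted. " + "x" * 200 + " VC18090 is unrelated."
for snippet in select_snippets(article, "VC1809"):
    print(snippet)
```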

PaperBLAST also incorporates manually-curated protein functions.

Except for GeneRIF and ENA, the curated entries include a short curated description of the protein's function. For entries from BioLiP, the protein's function may not be known beyond binding to the ligand. Many of these entries also link to articles in PubMed.

For more information see the PaperBLAST paper (mSystems 2017) or the code. You can download PaperBLAST's database here.

PaperBLAST has changed since the paper was written. Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.

Secrets

PaperBLAST cannot provide snippets for many of the papers that are published in non-open-access journals. This limitation applies even if the paper is marked as "free" on the publisher's web site and is available in PubMed Central or EuropePMC. If a journal that you publish in is marked as "secret," please consider publishing elsewhere.

Omissions from the PaperBLAST Database

Many important articles are missing from PaperBLAST, either because the article's full text is not in EuropePMC (as for many older articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an article that characterizes a protein's function but is missing from PaperBLAST, please notify the curators at UniProt or add an entry to GeneRIF. Entries in either of these databases will eventually be incorporated into PaperBLAST. Note that to add an entry to UniProt, you will need to find the UniProt identifier for the protein. If the protein is not already in UniProt, you can ask them to create an entry. To add an entry to GeneRIF, you will need an NCBI Gene identifier, but unfortunately many prokaryotic proteins in RefSeq do not have corresponding Gene identifiers.

References

PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.

Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.

Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.

UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.

BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.

The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.

The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.

CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.

The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.

The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.

REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.

Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.

by Morgan Price, Arkin group
Lawrence Berkeley National Laboratory