PaperBLAST

PaperBLAST – Find papers about a protein or its homologs

PaperBLAST Hits for MCAODC_17600 (51 a.a., MKYPTGVENH...)

Other sequence analysis tools:

Find functional residues: SitesBLAST

Search for conserved domains

Find the best match in UniProt

Compare to protein structures

Predict transmembrane helices: Phobius

Predict protein localization: PSORTb

Find homologs in fast.genomics

Found 8 similar proteins in the literature:

t1929 integrase from Salmonella enterica subsp. enterica serovar Typhi Ty2
STY1011 integrase from Salmonella enterica subsp. enterica serovar Typhi str. CT18
86% identity, 7% coverage

Benign form of acute osteitis of the spine in young children
BREMNER, British medical journal 1953
The pKO2 linear plasmid prophage of Klebsiella oxytoca
Casjens, Journal of bacteriology 2004
- “...(11, 70) as well as prophages Sti1 and Sti5 (genes STY1011 to STY1077 and STY2467 to STY2470, respectively [13], in the genome of S. enterica serovar Typhi CT18...”
Differences in gene content among Salmonella enterica serovar typhi isolates
Boyd, Journal of clinical microbiology 2003
- “...the reference genome. Upstream of region II, ORFs STY1011 to STY1044 show significant homology to the lambdoid phage Gifsy-2 from S. enterica serovar...”

STM1005 Gifsy-2 prophage; integrase from Salmonella typhimurium LT2
86% identity, 7% coverage

Genome archaeology of two laboratory Salmonella enterica enterica sv Typhimurium
Zaworski, G3 (Bethesda, Md.) 2021
- “...LT2 prophages located between 2,728,977 and 2,776,819 (STM2584 to STM2636) and between 1,098,231 and 1,143,702 (STM1005 to STM1056), respectively. These two related prophages are patchily similar to each other; while overall DNA identity is 44.1%, this 42kb region is interspersed with highly divergent segments and unrelated...”
Efficient inter-species conjugative transfer of a CRISPR nuclease for targeted bacterial killing
Hamilton, Nature communications 2019
- “...per minute). Conjugations were performed with (filled circles) or without (filled diamonds) sgRNA targeting the STM1005 locus cloned into pNuc- cis and pNuc- trans . Both plasmids encoded the TevSpCas9 nuclease. Data are plotted on a log10 as boxplots with data points from independent biological replicates....”
Identification of novel factors involved in modulating motility of Salmonella enterica serotype typhimurium
Bogomolnaya, PloS one 2014
- “...Transcription tctD invF, STM2912, STM3696, STM4417, arcA STM0859, ydiP, torR, STM4315 Replication, recombination and repair STM1005 STM1861 Translation, ribosomal structure and biogenesis valS STM1552 Posttranslational modification, protein turnover, chaperones STM2743, sspA Energy production and conversion STM0762, STM0858 Cell cycle control, cell division, chromosome partitioning STM2594 Defense...”
Host gene expression changes and DNA amplification during temperate phage induction
Frye, Journal of bacteriology 2005
- “...of California, Berkeley Gifsy-2 xis::Kan), JF108 (LT2 STM1005 Gifsy-2 int::Kan), and JF109 (STM2635-2636 Gifsy-1 xis-int::Kan). Peroxide and mitomycin C assays....”
The genome of Salmonella enterica serovar gallinarum: distinct insertions/deletions and rare rearrangements
Wu, Journal of bacteriology 2005
- “...genomic order to Gifsy-1 (STM2584 to STM2636), Gifsy-2 (STM1005 to STM1056), Fels-1 (STM893 to STM929), and Fels-2 (STM2694 to STM2772). (The genes of serovar...”
Genomic comparisons of Salmonella enterica serovar Dublin, Agona, and Typhimurium strains recently isolated from milk filters and bovine samples from Ireland, using a Salmonella microarray
Reen, Applied and environmental microbiology 2005
- “...lost the central regions (STM2702 to STM2730). Gifsy-2 (STM1005 to STM1056) (region X) was present in all serovar Typhimurium strains and all serovar Dublin...”

STM14_RS05480 FimB-specific integrase Int from Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S
86% identity, 7% coverage

Salmonella enterica Serovar Typhimurium 14028s Genomic Regions Required for Colonization of Lettuce Leaves
Montano, Frontiers in microbiology 2020
- “...to STM14_RS08175 Mut9 K_71/72_E4 14627431511389 48,646 STM14_RS07705 to STM14_RS07965 Mut10 C_02_G4 10515141057804 6,290 STM14_RS05455 to STM14_RS05480 The plate-well position corresponds to the original MGD collection reported by Porwollik et al. (2014) . Full annotation of deleted genes is listed in Supplementary Table S3 . To confirm...”

ECs3013 putative integrase from Escherichia coli O157:H7 str. Sakai
86% identity, 7% coverage

Variability in the Occupancy of Escherichia coli O157 Integration Sites by Shiga Toxin-Encoding Prophages
Henderson, Toxins 2021
- “...junction yehV -outF 607 bp yehV -inR ECs2939 ECSP2932 ACAGCACCGTTTGGCATTTTAG yehV right junction yehV -inF ECs3013 ECSP2997 GAAGCAGAGGGAATATGGGAGAACT 654 bp yehV -outR yehV with Stx1a phage yehV long-F ECs2974 ( stx1a ) Absent CAAACAAATTATCCCCTGTGCCACTA 19.4 kb yehV long-R ECs3015 ( yeh W) ECSP3000 ( yeh W)...”
Genomic diversity of pathogenic Escherichia coli of the EHEC 2 clonal complex
Abu-Ali, BMC genomics 2009
- “...positive for the stx1 gene, except in O111:H8 strains. Excisionase and integrase genes (ECs3012 and ECs3013) were divergent/absent in most of the EHEC 2 strains. Overall, the gene content of Sp15 in strains negative for the stx1 gene was different from those in stx1 positive strains...”

Q859D2 Putative integrase from Escherichia coli phage EH297
86% identity, 7% coverage

ICEEc2, a new integrative and conjugative element belonging to the pKLC102/PAGI-2 family, identified in Escherichia coli strain BEN374
Roche, Journal of bacteriology 2010
- “...Q8XCM9, Q9KR84, A8CG97, O22009, P27077, Q6HA01, Q77WA5, Q859D2, Q9T205, YP_001700534, and YP_001718763. A multiple alignment was performed using MUSCLE (12),...”

Z1424 integrase for bacteriophage BP-933W from Escherichia coli O157:H7 EDL933
81% identity, 6% coverage

Genomic anatomy of Escherichia coli O157:H7 outbreaks
Eppinger, Proceedings of the National Academy of Sciences of the United States of America 2011
- “...at the wrbA locus (Z1423/Z1504) harboring phage integrase intW (Z1424). The wrbA gene locus is unoccupied in lineage I/II strains carrying this toxin type, but...”
Genome evolution in major Escherichia coli O157:H7 lineages
Zhang, BMC genomics 2007
- “...100% 100% 0% S-loop#16/OI#8 ECs0280 (Z0317) putative tail fiber protein 100% 100% 0% S-loop#69/OI#45 ECs1160 (Z1424) 1 putative integrase 93% 0% 0% S-loop#69/OI#45 ECs1161 (Z1425) 1 putative excisionase 93% 0% 0% S-loop#69/OI#45 ECs1162 (Z1426) 1 hypothetical protein 93% 0% 0% S-loop#69/OI#45 ECs1163 (Z1428) 1 hypothetical protein...”

ECs1160 putative integrase from Escherichia coli O157:H7 str. Sakai
81% identity, 6% coverage

Variability in the Occupancy of Escherichia coli O157 Integration Sites by Shiga Toxin-Encoding Prophages
Henderson, Toxins 2021
- “...bp wrbA -outR ECs1252 ECSP1175 CTTTGGCGCGATTTTACTCAATG wrbA left junction wrbA -outF 605 bp wrbA -inR ECs1160 Absent CCAGCGCCAGCATGGTCTAC wrbA right junction wrbA -inF1 ECs1251 Absent TTTTTCCCTCGCCCATAACCTAT 1445 bp wrbA -outR wrbA with Stx2a phage wrbA long-F ECs1158 ( agp ) ECSP1172 ( agp) GAAAGTCGGCAACTCGCTGGTAGA 22.4 kb...”
- “...genes in the Sakai genome. 2 The ECs1159.5 primer targeted the sequence between ECs1159 and ECs1160 corresponding to the N -terminal portion of the interrupted wrbA gene. References 1. Karmali M.A. Petric M. Lim C. McKeough P.C. Arbus G.S. Lior H. The association between idiopathic hemolytic...”
Genes essential for the morphogenesis of the Shiga toxin 2-transducing phage from Escherichia coli O157:H7
Mondal, Scientific reports 2016
- “...36 . Among 91 protein-coding genes (open reading frames ORFs) identified on the Sp5 genome (ECs1160 to ECs1251 in the O157 Sakai genome annotation; for convenience, these genes are referred to as ORF1 to ORF91 in this article; note that five small ORFs annotated in the...”
Genomic diversity of pathogenic Escherichia coli of the EHEC 2 clonal complex
Abu-Ali, BMC genomics 2009
- “...mostly representing serotype O111:H8, had more Sp5 ( stx2 -phage) genes. Integrase and excisionase genes (ECs1160 and ECs1161), and the block of genes at the beginning of the phage, ECs11601187, were missing from most strains. The rest of Sp5 genes, which encode replication proteins O and...”
Genomic regions conserved in lineage II Escherichia coli O157:H7 strains
Steele, Applied and environmental microbiology 2009
- “...Sakai (14) intergenic region between ORFs ECs1159 and ECs1160 (98% homology over 365 nt), encoding a hypothetical protein and a putative integrase, adjacent to...”
Rapid determination of Escherichia coli O157:H7 lineage types and molecular subtypes by using comparative genomic fingerprinting
Laing, Applied and environmental microbiology 2008
- “...coli Sakai intergenic region between ECs1159 and ECs1160 A07 ....................E. coli Sakai ORF ECs1928, encoding a hypothetical protein, adjacent to...”
Genome evolution in major Escherichia coli O157:H7 lineages
Zhang, BMC genomics 2007
- “...of lineage and LSPA type divergent ORFs in S-loop#69. The first cluster, consisting of ORFs ECs1160 to ECs1163 located upstream of the stx2 genes in E. coli Sakai, was missing in all four lineage I/II and the 12 lineage II strains but was conserved in all...”
- “...protein 100% 100% 0% S-loop#16/OI#8 ECs0280 (Z0317) putative tail fiber protein 100% 100% 0% S-loop#69/OI#45 ECs1160 (Z1424) 1 putative integrase 93% 0% 0% S-loop#69/OI#45 ECs1161 (Z1425) 1 putative excisionase 93% 0% 0% S-loop#69/OI#45 ECs1162 (Z1426) 1 hypothetical protein 93% 0% 0% S-loop#69/OI#45 ECs1163 (Z1428) 1 hypothetical...”

STM0893 Fels-1 prophage; putative integrase from Salmonella typhimurium LT2
82% identity, 6% coverage

Population structure, case clusters, and genetic lesions associated with Canadian Salmonella 4,[5],12:i:- isolates
Clark, PloS one 2021
- “...instead of being completely absent. U.S. clone isolates have lost the entire Fels-1 prophage from STM0893 to STM0929 (Cluster II in Soyer et al., 2009 [ 7 ]), leaving STM0892 adjacent to STM0930 (see closed genome PNCS015054, GenBank accession no. CP037877). Fels-1 is absent in U.S....”
MassCode liquid arrays as a tool for multiplexed high-throughput genetic profiling
Richmond, PloS one 2011
- “...GAG AAG AT Probe 446 TTT GTT TAC CTC GCT CAC GCT CTA 146/28 Typhimurium STM0893 F 394 CAG CGT TTC TTT ATT AGG AG 220 R P TGG GTT TTG TGG AAT GTA Probe 450 ACG GGC AGC AAA CTG AAA TAA TCC 196/24 Agona...”
Salmonella enterica serotype 4,5,12:i:-, an emerging Salmonella serotype that represents multiple distinct clones
Soyer, Journal of clinical microbiology 2009
- “...previously characterized by genomic microarrays (18). Cluster II (STM0893 to STM0929), which includes 35 Fels-1 prophage genes and two adjacent genes, was...”
Genomic comparisons of Salmonella enterica serovar Dublin, Agona, and Typhimurium strains recently isolated from milk filters and bovine samples from Ireland, using a Salmonella microarray
Reen, Applied and environmental microbiology 2005
- “...pSLT. In the present study, we found that Fels-1 (STM0893 to STM0929) (region IX) was missing in all 18 strains examined (Table 2). The Fels-2 prophage (STM2694...”
Host gene expression changes and DNA amplification during temperate phage induction
Frye, Journal of bacteriology 2005
- “...JF105 (LT2 STM0894 Fels-1 xis::Kan), JF106 (LT2 STM0893 Fels-1 int::Cm), JF107 (LT2 STM1006 * Corresponding author. Mailing address: Sidney Kimmel Cancer...”
Host restriction of Salmonella enterica serotype Typhimurium pigeon isolates does not correlate with loss of discrete genes
Andrews-Polymenis, Journal of bacteriology 2004
- “...lacked a third chromosomal region. Region I included genes STM0893 to STM929 that encompassed the entire genome of the Fels-1 prophage present in LT2 (Fig. 1)....”
DNA microarray-based typing of an atypical monophasic Salmonella enterica serovar
Garaizar, Journal of clinical microbiology 2002
- “...I II III IV V VI a Gene no.a STM0517 STM0893 STM2616 STM2694 STM2758 STM2440 to to to to to STM0529 STM0929 STM2617 STM2740 STM2773 STM numbers correspond to...”

For advice on how to use these tools together, see Interactive tools for functional annotation of bacterial genomes.

Statistics

The PaperBLAST database links 798,070 different protein sequences to 1,261,478 scientific articles. Searches against EuropePMC were last performed on May 12 2025.

PaperBLAST builds a database of protein sequences that are linked to scientific articles. These links come from automated text searches against the articles in EuropePMC and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by dbCAN), BioLiP, CharProtDB, MetaCyc, EcoCyc, TCDB, REBASE, the Fitness Browser, and a subset of the European Nucleotide Archive with the /experiment tag. Given this database and a protein sequence query, PaperBLAST uses protein-protein BLAST to find similar sequences with E < 0.001.
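
The E-value cutoff described above can be illustrated with a minimal sketch. It assumes the BLAST hits are available in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name `filter_hits` is hypothetical, and this is not PaperBLAST's own code.

```python
E_VALUE_CUTOFF = 0.001  # threshold stated in the text: E < 0.001


def filter_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Keep hits whose E-value is below the cutoff.

    Assumes tab-separated BLAST tabular output with the E-value
    in the 11th column (index 10).
    """
    hits = []
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue < cutoff:
            hits.append({
                "subject": fields[1],
                "identity": float(fields[2]),
                "evalue": evalue,
            })
    return hits
```

A hit with E = 1e-5 would be kept, while a hit with E = 0.5 would be dropped.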

To build the database, we query EuropePMC with locus tags, with RefSeq protein identifiers, and with UniProt accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use queries of the form "locus_tag AND genus_name" to try to ensure that the paper is actually discussing that gene. Because EuropePMC indexes most recent biomedical papers, even if they are not open access, some of the links may be to papers that you cannot read or that our computers cannot read. We query each of these identifiers that appears in the open access part of EuropePMC, as well as every locus tag that appears in the 500 most-referenced genomes, so that a gene may appear in the PaperBLAST results even though none of the papers that mention it are open access. We also incorporate text-mined links from EuropePMC that link open access articles to UniProt or RefSeq identifiers. (This yields some additional links because EuropePMC uses different heuristics for their text mining than we do.)
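
As a rough illustration of the "locus_tag AND genus_name" query form, the sketch below builds such a query string from a locus tag and an organism name. It assumes the genus is the first word of the organism name. The function name `build_query` is hypothetical, and the sketch does not contact EuropePMC.

```python
def build_query(locus_tag, organism):
    """Build a text query of the form 'locus_tag AND genus_name'.

    Assumption: the genus is the first whitespace-separated word
    of the organism name.
    """
    words = organism.split()
    if not words:
        raise ValueError("organism name is empty")
    genus = words[0]
    return f"{locus_tag} AND {genus}"
```

For example, `build_query("STM1005", "Salmonella typhimurium LT2")` gives the query `STM1005 AND Salmonella`, which requires the genus name to appear alongside the locus tag.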

For every article that mentions a locus tag, a RefSeq protein identifier, or a UniProt accession, we try to select one or two snippets of text that refer to the protein. If we cannot get access to the full text, we try to select a snippet from the abstract, but unfortunately, unique identifiers such as locus tags are rarely provided in abstracts.
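
Snippet selection could look something like the sketch below. This is only one plausible approach, not PaperBLAST's actual heuristic: it finds whole-word occurrences of an identifier and returns up to two non-overlapping windows of surrounding text. The function name, the window size, and the overlap rule are all assumptions.

```python
import re


def select_snippets(text, identifier, max_snippets=2, window=80):
    """Return up to max_snippets text windows around an identifier.

    Assumptions: matches must be whole words, each window extends
    `window` characters on either side, and windows that would
    overlap an earlier snippet are skipped.
    """
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b")
    snippets = []
    last_end = -1
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        if start < last_end:
            continue  # would overlap the previous snippet
        end = min(len(text), match.end() + window)
        snippets.append("..." + text[start:end].strip() + "...")
        last_end = end
        if len(snippets) == max_snippets:
            break
    return snippets
```

The whole-word requirement keeps a search for one locus tag from matching a longer tag that merely begins with the same characters.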

PaperBLAST also incorporates manually-curated protein functions:

Proteins from NCBI's RefSeq are included if a GeneRIF entry links the gene to an article in PubMed®. GeneRIF also provides a short summary of the article's claim about the protein, which is shown instead of a snippet.
Proteins from Swiss-Prot (the curated part of UniProt) are included if the curators identified experimental evidence for the protein's function (evidence code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that describe the protein's function are shown (with bold headings).
Proteins from BRENDA, a curated database of enzymes, are included if they are linked to a paper in PubMed and their full sequence is known.
Every protein from the non-redundant subset of BioLiP, a database of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself does not include descriptions of the proteins, those are taken from the Protein Data Bank. Descriptions from PDB rely on the original submitter of the structure and cannot be updated by others, so they may be less reliable. (For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every ligand is represented among a group of structures with similar sequences, but for PaperBLAST, we use the non-redundant set provided by BioLiP.)
Every protein from EcoCyc, a curated database of the proteins in Escherichia coli K-12, is included, regardless of whether it is characterized.
Proteins from the MetaCyc metabolic pathway database are included if they are linked to a paper in PubMed and their full sequence is known.
Proteins from the Transport Classification Database (TCDB) are included if they have known substrate(s), have reference(s), and are not described as uncharacterized or putative. (Some of the references are not visible on the PaperBLAST web site.)
Every protein from CharProtDB, a database of experimentally characterized protein annotations, is included.
Proteins from the CAZy database of carbohydrate-active enzymes are included if they are associated with an Enzyme Commission (EC) number. Even though CAZy does not provide links from individual protein sequences to papers, these should all be experimentally-characterized proteins.
Proteins from the REBASE database of restriction enzymes are included if they have known specificity.
Every protein with an evidence-based reannotation (based on mutant phenotypes) in the Fitness Browser is included.
Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators) with experimentally-determined DNA binding sites from the PRODORIC database of gene regulation in prokaryotes are included.
Putative transcription factors from RegPrecise are included if they have manually-curated predictions for their binding sites. These predictions are based on conserved putative regulatory sites across genomes that contain similar transcription factors, so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
Coding sequence (CDS) features from the European Nucleotide Archive (ENA) are included if the /experiment tag is set (implying that there is experimental evidence for the annotation), the nucleotide entry links to paper(s) in PubMed, and the nucleotide entry is from the STD data class (implying that these are targeted annotated sequences, not from shotgun sequencing). Also, to filter out genes whose transcription or translation was detected, but whose function was not studied, nucleotide entries or papers with more than 25 such proteins are excluded. Descriptions from ENA rely on the original submitter of the sequence and cannot be updated by others, so they may be less reliable.
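
The ENA inclusion rules in the last item above can be sketched as a filter. The record type `CdsFeature` and its field names are hypothetical stand-ins for whatever the ENA parser produces. The sketch also assumes that a feature is dropped if its nucleotide entry, or any one of its linked papers, has more than 25 qualifying proteins; the text does not spell out how features with several papers are handled.

```python
from collections import Counter
from dataclasses import dataclass, field

MAX_PROTEINS = 25  # entries or papers with more than this many are excluded


@dataclass
class CdsFeature:
    """Hypothetical record for one CDS feature from ENA."""
    protein_id: str
    entry_id: str
    data_class: str
    has_experiment_tag: bool
    pubmed_ids: list = field(default_factory=list)


def filter_ena_features(features):
    """Apply the inclusion rules described above for ENA."""
    # Rule 1: /experiment tag set, linked to PubMed, STD data class.
    candidates = [
        f for f in features
        if f.has_experiment_tag and f.pubmed_ids and f.data_class == "STD"
    ]
    # Rule 2: count qualifying proteins per nucleotide entry and per paper.
    per_entry = Counter(f.entry_id for f in candidates)
    per_paper = Counter(p for f in candidates for p in set(f.pubmed_ids))
    kept = []
    for f in candidates:
        if per_entry[f.entry_id] > MAX_PROTEINS:
            continue
        if any(per_paper[p] > MAX_PROTEINS for p in f.pubmed_ids):
            continue  # assumption: one oversized paper excludes the feature
        kept.append(f)
    return kept
```

Counting only the features that pass the first rule matches the phrase "more than 25 such proteins", since the limit refers to proteins with the /experiment tag rather than to every CDS in the entry.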

Except for GeneRIF and ENA, the curated entries include a short curated description of the protein's function. For entries from BioLiP, the protein's function may not be known beyond binding to the ligand. Many of these entries also link to articles in PubMed.

For more information see the PaperBLAST paper (mSystems 2017) or the code. You can download PaperBLAST's database here.

Changes to PaperBLAST since the paper was written:

November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
June 2022: incorporated some coding sequences from ENA with the /experiment tag.
March 2022: incorporated BioLiP.
April 2020: incorporated TCDB.
April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
January 2018: incorporated BRENDA.
December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.

Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.

Secrets

PaperBLAST cannot provide snippets for many of the papers that are published in non-open-access journals. This limitation applies even if the paper is marked as "free" on the publisher's web site and is available in PubMed Central or EuropePMC. If a journal that you publish in is marked as "secret," please consider publishing elsewhere.

Omissions from the PaperBLAST Database

Many important articles are missing from PaperBLAST, either because the article's full text is not in EuropePMC (as for many older articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an article that characterizes a protein's function but is missing from PaperBLAST, please notify the curators at UniProt or add an entry to GeneRIF. Entries in either of these databases will eventually be incorporated into PaperBLAST. Note that to add an entry to UniProt, you will need to find the UniProt identifier for the protein. If the protein is not already in UniProt, you can ask them to create an entry. To add an entry to GeneRIF, you will need an NCBI Gene identifier, but unfortunately many prokaryotic proteins in RefSeq do not have corresponding Gene identifiers.

References

PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.

Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.

Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.

UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.

BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.

The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.

The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.

CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.

The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.

The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.

REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.

Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.

by Morgan Price, Arkin group
Lawrence Berkeley National Laboratory
