PaperBLAST
PaperBLAST Hits for MCAODC_17600 (51 a.a., MKYPTGVENH...)
>MCAODC_17600
MKYPTGVENHGGKLRIWFVYKGVRVRENLGFLTQQKTGALQVSYAPLFVTQ
Found 8 similar proteins in the literature:
t1929 integrase from Salmonella enterica subsp. enterica serovar Typhi Ty2
STY1011 integrase from Salmonella enterica subsp. enterica serovar Typhi str. CT18
86% identity, 7% coverage
STM1005 Gifsy-2 prophage; integrase from Salmonella typhimurium LT2
86% identity, 7% coverage
- Genome archaeology of two laboratory Salmonella enterica enterica sv Typhimurium
Zaworski, G3 (Bethesda, Md.) 2021 - “...LT2 prophages located between 2,728,977 and 2,776,819 (STM2584 to STM2636) and between 1,098,231 and 1,143,702 (STM1005 to STM1056), respectively. These two related prophages are patchily similar to each other; while overall DNA identity is 44.1%, this 42kb region is interspersed with highly divergent segments and unrelated...”
- Efficient inter-species conjugative transfer of a CRISPR nuclease for targeted bacterial killing
Hamilton, Nature communications 2019 - “...per minute). Conjugations were performed with (filled circles) or without (filled diamonds) sgRNA targeting the STM1005 locus cloned into pNuc- cis and pNuc- trans . Both plasmids encoded the TevSpCas9 nuclease. Data are plotted on a log10 as boxplots with data points from independent biological replicates....”
- Identification of novel factors involved in modulating motility of Salmonella enterica serotype typhimurium
Bogomolnaya, PloS one 2014 - “...Transcription tctD invF, STM2912, STM3696, STM4417, arcA STM0859, ydiP, torR, STM4315 Replication, recombination and repair STM1005 STM1861 Translation, ribosomal structure and biogenesis valS STM1552 Posttranslational modification, protein turnover, chaperones STM2743, sspA Energy production and conversion STM0762, STM0858 Cell cycle control, cell division, chromosome partitioning STM2594 Defense...”
- Host gene expression changes and DNA amplification during temperate phage induction
Frye, Journal of bacteriology 2005 - “...of California, Berkeley Gifsy-2 xis::Kan), JF108 (LT2 STM1005 Gifsy-2 int::Kan), and JF109 (STM2635-2636 Gifsy-1 xis-int::Kan). Peroxide and mitomycin C assays....”
- The genome of Salmonella enterica serovar gallinarum: distinct insertions/deletions and rare rearrangements
Wu, Journal of bacteriology 2005 - “...genomic order to Gifsy-1 (STM2584 to STM2636), Gifsy-2 (STM1005 to STM1056), Fels-1 (STM893 to STM929), and Fels-2 (STM2694 to STM2772). (The genes of serovar...”
- Genomic comparisons of Salmonella enterica serovar Dublin, Agona, and Typhimurium strains recently isolated from milk filters and bovine samples from Ireland, using a Salmonella microarray
Reen, Applied and environmental microbiology 2005 - “...lost the central regions (STM2702 to STM2730). Gifsy-2 (STM1005 to STM1056) (region X) was present in all serovar Typhimurium strains and all serovar Dublin...”
STM14_RS05480 FimB-specific integrase Int from Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S
86% identity, 7% coverage
ECs3013 putative integrase from Escherichia coli O157:H7 str. Sakai
86% identity, 7% coverage
Q859D2 Putative integrase from Escherichia coli phage EH297
86% identity, 7% coverage
- ICEEc2, a new integrative and conjugative element belonging to the pKLC102/PAGI-2 family, identified in Escherichia coli strain BEN374
Roche, Journal of bacteriology 2010 - “...Q8XCM9, Q9KR84, A8CG97, O22009, P27077, Q6HA01, Q77WA5, Q859D2, Q9T205, YP_001700534, and YP_001718763. A multiple alignment was performed using MUSCLE (12),...”
Z1424 integrase for bacteriophage BP-933W from Escherichia coli O157:H7 EDL933
81% identity, 6% coverage
- Genomic anatomy of Escherichia coli O157:H7 outbreaks
Eppinger, Proceedings of the National Academy of Sciences of the United States of America 2011 - “...at the wrbA locus (Z1423/Z1504) harboring phage integrase intW (Z1424). The wrbA gene locus is unoccupied in lineage I/II strains carrying this toxin type, but...”
- Genome evolution in major Escherichia coli O157:H7 lineages
Zhang, BMC genomics 2007 - “...100% 100% 0% S-loop#16/OI#8 ECs0280 (Z0317) putative tail fiber protein 100% 100% 0% S-loop#69/OI#45 ECs1160 (Z1424) 1 putative integrase 93% 0% 0% S-loop#69/OI#45 ECs1161 (Z1425) 1 putative excisionase 93% 0% 0% S-loop#69/OI#45 ECs1162 (Z1426) 1 hypothetical protein 93% 0% 0% S-loop#69/OI#45 ECs1163 (Z1428) 1 hypothetical protein...”
ECs1160 putative integrase from Escherichia coli O157:H7 str. Sakai
81% identity, 6% coverage
- Variability in the Occupancy of Escherichia coli O157 Integration Sites by Shiga Toxin-Encoding Prophages
Henderson, Toxins 2021 - “...bp wrbA -outR ECs1252 ECSP1175 CTTTGGCGCGATTTTACTCAATG wrbA left junction wrbA -outF 605 bp wrbA -inR ECs1160 Absent CCAGCGCCAGCATGGTCTAC wrbA right junction wrbA -inF1 ECs1251 Absent TTTTTCCCTCGCCCATAACCTAT 1445 bp wrbA -outR wrbA with Stx2a phage wrbA long-F ECs1158 ( agp ) ECSP1172 ( agp) GAAAGTCGGCAACTCGCTGGTAGA 22.4 kb...”
- “...genes in the Sakai genome. 2 The ECs1159.5 primer targeted the sequence between ECs1159 and ECs1160 corresponding to the N -terminal portion of the interrupted wrbA gene. References 1. Karmali M.A. Petric M. Lim C. McKeough P.C. Arbus G.S. Lior H. The association between idiopathic hemolytic...”
- Genes essential for the morphogenesis of the Shiga toxin 2-transducing phage from Escherichia coli O157:H7
Mondal, Scientific reports 2016 - “...36 . Among 91 protein-coding genes (open reading frames ORFs) identified on the Sp5 genome (ECs1160 to ECs1251 in the O157 Sakai genome annotation; for convenience, these genes are referred to as ORF1 to ORF91 in this article; note that five small ORFs annotated in the...”
- Genomic diversity of pathogenic Escherichia coli of the EHEC 2 clonal complex
Abu-Ali, BMC genomics 2009 - “...mostly representing serotype O111:H8, had more Sp5 ( stx2 -phage) genes. Integrase and excisionase genes (ECs1160 and ECs1161), and the block of genes at the beginning of the phage, ECs1160–1187, were missing from most strains. The rest of Sp5 genes, which encode replication proteins O and...”
- Genomic regions conserved in lineage II Escherichia coli O157:H7 strains
Steele, Applied and environmental microbiology 2009 - “...Sakai (14) intergenic region between ORFs ECs1159 and ECs1160 (98% homology over 365 nt), encoding a hypothetical protein and a putative integrase, adjacent to...”
- Rapid determination of Escherichia coli O157:H7 lineage types and molecular subtypes by using comparative genomic fingerprinting
Laing, Applied and environmental microbiology 2008 - “...coli Sakai intergenic region between ECs1159 and ECs1160 A07 ....................E. coli Sakai ORF ECs1928, encoding a hypothetical protein, adjacent to...”
- Genome evolution in major Escherichia coli O157:H7 lineages
Zhang, BMC genomics 2007 - “...of lineage and LSPA type divergent ORFs in S-loop#69. The first cluster, consisting of ORFs ECs1160 to ECs1163 located upstream of the stx2 genes in E. coli Sakai, was missing in all four lineage I/II and the 12 lineage II strains but was conserved in all...”
- “...protein 100% 100% 0% S-loop#16/OI#8 ECs0280 (Z0317) putative tail fiber protein 100% 100% 0% S-loop#69/OI#45 ECs1160 (Z1424) 1 putative integrase 93% 0% 0% S-loop#69/OI#45 ECs1161 (Z1425) 1 putative excisionase 93% 0% 0% S-loop#69/OI#45 ECs1162 (Z1426) 1 hypothetical protein 93% 0% 0% S-loop#69/OI#45 ECs1163 (Z1428) 1 hypothetical...”
STM0893 Fels-1 prophage; putative integrase from Salmonella typhimurium LT2
82% identity, 6% coverage
- Population structure, case clusters, and genetic lesions associated with Canadian Salmonella 4,[5],12:i:- isolates
Clark, PloS one 2021 - “...instead of being completely absent. U.S. clone isolates have lost the entire Fels-1 prophage from STM0893 to STM0929 (Cluster II in Soyer et al., 2009 [ 7 ]), leaving STM0892 adjacent to STM0930 (see closed genome PNCS015054, GenBank accession no. CP037877). Fels-1 is absent in U.S....”
- MassCode liquid arrays as a tool for multiplexed high-throughput genetic profiling
Richmond, PloS one 2011 - “...GAG AAG AT Probe 446 TTT GTT TAC CTC GCT CAC GCT CTA 146/28 Typhimurium STM0893 F 394 CAG CGT TTC TTT ATT AGG AG 220 R P TGG GTT TTG TGG AAT GTA Probe 450 ACG GGC AGC AAA CTG AAA TAA TCC 196/24 Agona...”
- Salmonella enterica serotype 4,5,12:i:-, an emerging Salmonella serotype that represents multiple distinct clones
Soyer, Journal of clinical microbiology 2009 - “...previously characterized by genomic microarrays (18). Cluster II (STM0893 to STM0929), which includes 35 Fels-1 prophage genes and two adjacent genes, was...”
- Genomic comparisons of Salmonella enterica serovar Dublin, Agona, and Typhimurium strains recently isolated from milk filters and bovine samples from Ireland, using a Salmonella microarray
Reen, Applied and environmental microbiology 2005 - “...pSLT. In the present study, we found that Fels-1 (STM0893 to STM0929) (region IX) was missing in all 18 strains examined (Table 2). The Fels-2 prophage (STM2694...”
- Host gene expression changes and DNA amplification during temperate phage induction
Frye, Journal of bacteriology 2005 - “...JF105 (LT2 STM0894 Fels-1 xis::Kan), JF106 (LT2 STM0893 Fels-1 int::Cm), JF107 (LT2 STM1006 * Corresponding author. Mailing address: Sidney Kimmel Cancer...”
- Host restriction of Salmonella enterica serotype Typhimurium pigeon isolates does not correlate with loss of discrete genes
Andrews-Polymenis, Journal of bacteriology 2004 - “...lacked a third chromosomal region. Region I included genes STM0893 to STM929 that encompassed the entire genome of the Fels-1 prophage present in LT2 (Fig. 1)....”
- DNA microarray-based typing of an atypical monophasic Salmonella enterica serovar
Garaizar, Journal of clinical microbiology 2002 - “...I II III IV V VI a Gene no.a STM0517 STM0893 STM2616 STM2694 STM2758 STM2440 to to to to to STM0529 STM0929 STM2617 STM2740 STM2773 STM numbers correspond to...”
For advice on how to use these tools together, see
Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 798,070 different protein sequences to 1,261,478 scientific articles. Searches against EuropePMC were last performed on May 12 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC
and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot,
BRENDA,
CAZy (as made available by dbCAN),
BioLiP,
CharProtDB,
MetaCyc,
EcoCyc,
TCDB,
REBASE,
the Fitness Browser,
and a subset of the European Nucleotide Archive with the /experiment tag.
Given this database and a protein sequence query,
PaperBLAST uses protein-protein BLAST
to find similar sequences with E < 0.001.
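As a rough illustration of the E < 0.001 cutoff and of the "% identity, % coverage" figures shown for each hit, here is a minimal Python sketch. It assumes the BLAST results are available in the standard 12-column tabular format (-outfmt 6), and it assumes that coverage means the aligned span divided by the length of the hit; neither detail is stated in the text. The names Hit, parse_tabular_hits, significant and subject_coverage are hypothetical and are not taken from PaperBLAST's code.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-3  # threshold stated in the text (E < 0.001)


@dataclass
class Hit:
    subject_id: str
    percent_identity: float
    subject_start: int
    subject_end: int
    e_value: float


def parse_tabular_hits(text):
    """Parse BLAST tabular output (default 12 columns of -outfmt 6).

    Columns: qseqid sseqid pident length mismatch gapopen
             qstart qend sstart send evalue bitscore
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(
            Hit(
                subject_id=fields[1],
                percent_identity=float(fields[2]),
                subject_start=int(fields[8]),
                subject_end=int(fields[9]),
                e_value=float(fields[10]),
            )
        )
    return hits


def significant(hits, cutoff=E_VALUE_CUTOFF):
    """Keep only hits whose E-value is strictly below the cutoff."""
    return [h for h in hits if h.e_value < cutoff]


def subject_coverage(hit, subject_length):
    """Percent of the hit sequence spanned by the alignment (assumed definition)."""
    span = abs(hit.subject_end - hit.subject_start) + 1
    return 100.0 * span / subject_length
```

With this definition, an alignment spanning residues 1 to 28 of a 400-residue hit has a coverage of 7%.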
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
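The query form described above can be sketched in Python as follows. This only builds the query string and a search URL; it does not contact any server. The endpoint URL and the format and pageSize parameters are assumptions about the public Europe PMC REST search service, and the function names are hypothetical. None of them come from PaperBLAST's own code.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Europe PMC REST search service (not named in the text).
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(locus_tag, genus_name):
    """Combine a locus tag with its genus, as in "locus_tag AND genus_name"."""
    return f"{locus_tag} AND {genus_name}"


def build_search_url(locus_tag, genus_name, page_size=25):
    """Return a search URL for one identifier (parameters are assumptions)."""
    params = {
        "query": build_query(locus_tag, genus_name),
        "format": "json",
        "pageSize": page_size,
    }
    return SEARCH_URL + "?" + urlencode(params)
```

For example, build_query("STM1005", "Salmonella") gives the string STM1005 AND Salmonella. Requiring the genus name alongside the locus tag reduces the chance of matching an unrelated string that happens to look like the tag.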
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
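One simple way to pick snippets is to take a window of words around each mention of the identifier. The Python sketch below does that. It is a hypothetical heuristic for illustration only; the window size, the limit of two snippets, and the function name select_snippets are assumptions and may not match what PaperBLAST actually does.

```python
import re


def select_snippets(text, identifier, max_snippets=2, words_each_side=15):
    """Return up to max_snippets windows of words around the identifier."""
    words = text.split()
    # Match the identifier only when it is not part of a longer token,
    # so that STM1005 does not match inside STM10050.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(identifier) + r"(?![A-Za-z0-9_])"
    )
    snippets = []
    last_end = -1  # index of the last word already included in a snippet
    for i, word in enumerate(words):
        if not pattern.search(word):
            continue
        if i <= last_end:
            continue  # this mention is already inside the previous snippet
        start = max(0, i - words_each_side)
        end = min(len(words), i + words_each_side + 1)
        snippets.append("..." + " ".join(words[start:end]) + "...")
        last_end = end - 1
        if len(snippets) == max_snippets:
            break
    return snippets
```

The same function can be run on the abstract when the full text is unavailable; it returns an empty list if the identifier never appears, which is the common case for abstracts.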
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a
GeneRIF
entry links the gene to an article in
PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of
BioLiP,
a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the
Protein Data Bank.
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether it has been characterized.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transport Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Commission (EC) number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally determined DNA binding sites are included from the
PRODORIC database of gene regulation in prokaryotes.
- Putative transcription factors from RegPrecise
are included if they have manually curated predictions for their binding sites. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
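The ENA inclusion rules in the last item of the list above can be sketched as a filter in Python. The record layout (CdsFeature and its fields) and the function name are hypothetical. The sketch also assumes that a feature is dropped if any one of its linked papers exceeds the limit of 25 proteins; the text does not say how papers are combined.

```python
from collections import Counter
from dataclasses import dataclass

MAX_PROTEINS = 25  # entries or papers with more than this many proteins are excluded


@dataclass(frozen=True)
class CdsFeature:
    protein_id: str
    entry_id: str          # the nucleotide entry that carries this CDS
    data_class: str        # for example "STD"
    has_experiment: bool   # True if the /experiment tag is set
    pubmed_ids: tuple      # PubMed identifiers linked from the nucleotide entry


def filter_ena_features(features):
    """Apply the three inclusion rules, then the per-entry and per-paper limits."""
    candidates = [
        f
        for f in features
        if f.has_experiment and f.pubmed_ids and f.data_class == "STD"
    ]
    per_entry = Counter(f.entry_id for f in candidates)
    per_paper = Counter(p for f in candidates for p in set(f.pubmed_ids))
    return [
        f
        for f in candidates
        if per_entry[f.entry_id] <= MAX_PROTEINS
        and all(per_paper[p] <= MAX_PROTEINS for p in f.pubmed_ids)
    ]
```

Counting is done after the first three rules are applied, so that only proteins that would otherwise qualify contribute to the limit of 25.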
Except for GeneRIF and ENA,
the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
For more information see the
PaperBLAST paper (mSystems 2017)
or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubMed Central or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.
by Morgan Price,
Arkin group
Lawrence Berkeley National Laboratory