PaperBLAST
PaperBLAST Hits for MCAODC_24440 (70 a.a., MNGKSRLASY...)
>MCAODC_24440
MNGKSRLASYVPKGKEKQAMKQQKAMLIALIVICLTVIVTALVTRKDLCEVRVRTGQTEV
AVFTAYEPEE
Found 18 similar proteins in the literature:
Z1342 putative cell killing protein encoded within cryptic prophage CP-933M from Escherichia coli O157:H7 EDL933
100% identity, 100% coverage
SEN1387 regulator of hokC from Salmonella enterica subsp. enterica serovar Enteritidis str. P125109
93% identity, 100% coverage
ECs1520 prophage maintenance protein from Escherichia coli O157:H7 str. Sakai
90% identity, 100% coverage
ECs2198 MokW from Escherichia coli O157:H7 str. Sakai
87% identity, 100% coverage
STY2054A host cell-killing modulation protein from Salmonella enterica subsp. enterica serovar Typhi str. CT18
92% identity, 93% coverage
Z2054 putative killer protein encoded by prophage CP-933O; paralogous proteins disrupt host membranes when produced in excess from Escherichia coli O157:H7 EDL933
95% identity, 83% coverage
- Development of a High Resolution Virulence Allelic Profiling (HReVAP) Approach Based on the Accessory Genome of Escherichia coli to Characterize Shiga-Toxin Producing E. coli (STEC)
Michelacci, Frontiers in microbiology 2016 - “...AGCTTGCCAATGTCGCAGGA 1856345-1856517 173 TTCATTGTTCAACCGCCCCG Z2048 TGGCTTTGCCGGAGACAGAA 1857700-1857842 143 TTTAACCTGCGCCCTGACGT adfO Z2053 AACTGTCGCCGCAATCCGAA 1860616-1860778 163 GTCTGGCGCTATTTCCACGACA Z2054 AAGGTCAAGGAGAAGCAGGCT 1861231-1861334 104 TCTTTCCTCGTTACCAGTGCCGT Z2056 TGACTGGCTGTTGCGTCATGT 1862484-1862594 111 TGCCAGCACAACACCATTGC Z2057 TATCAAAAGCCGGGGAGCGT 1863371-1863465 95 TTTTATTGCCAGCCGTCCGGA Z2060 ATGCGGAGCTGCAGAGTGAA 1865516-1865652 137 TTCTGCCGGTTTTTCGCACG Z2065 CACAGCAACCTGCGCTTGTT 1867369-1867445 77 TGGCACTGCGCGTTAAACAC Z2066 TTACGGTGCGGCATCGAGAA 1867659-1867735 77 TGCGCGCCCATGAACTGAAA Z2069...”
- “...OI-57 (Figure 2 and Supplementary Table 1 , Sheet 3 ). In particular, two ORFs, Z2054, and Z2101, were positive in more than 95% of the strains tested, while genes Z2037, Z2039, Z2056, Z2057, Z2060, Z2069, Z2071, Z2084, Z2086, Z2096, Z2118, Z2131, and Z2146, were positive...”
- OI-57, a genomic island of Escherichia coli O157, is present in other seropathotypes of Shiga toxin-producing E. coli associated with severe human disease
Imamovic, Infection and immunity 2010 - “...namely, the putative virulence factor adfO and ORF Z2054, encoding a putative bacterial cell killing factor, according to the EDL933 strain sequence (GenBank...”
- “...Z1611 Fwd Z1611 Z1780 Z1780 Z1914 Z1914 Z2053 Z2053 Z2054 Z2054 Z2066 Z2066 Z2096 Z2096 Z2097 Z2097 Z2098 Z2098 Z2104 Z2104 Z2105 Z2105 Z2121 Z2121 Z2148 Z2148...”
HokD / b1562 Qin prophage; toxic protein HokD from Escherichia coli K-12 substr. MG1655 (see 8 papers)
HOKD_ECOLI / P0ACG6 Toxic protein HokD; Protein RelF from Escherichia coli (strain K12) (see paper)
b1562 Qin prophage; small toxic polypeptide from Escherichia coli str. K-12 substr. MG1655
S1668 prophage maintenance protein from Shigella flexneri 2a str. 2457T
EC042_1336, LF82_1023, NRG857_07840 type I toxin-antitoxin system toxin HokD from Escherichia coli O83:H1 str. NRG 857C
98% identity, 73% coverage
- function: Toxic component of a type I toxin-antitoxin (TA) system (Probable). When overexpressed kills cells within 2 minutes; causes collapse of the transmembrane potential and arrest of respiration (PubMed:3019679).
- 18th Congress of the European Hematology Association, Stockholm, Sweden, June 13–16, 2013
Haematologica 2013
- Escherichia coli toxin/antitoxin pair MqsR/MqsA regulate toxin CspD
Kim, Environmental microbiology 2010 - “...1.2 Small toxic membrane polypeptide hokA b4455 2.1 4.9 1.1 Small toxic membrane polypeptide hokD b1562 1.5 9.8 1.0 Polypeptide destructive to membrane potential Metabolism related gltL b0652 4.0 1.1 1.1 ATP-binding protein of glutamate/aspartate transport system gltK b0653 5.3 1.1 1.1 Glutamate/aspartate transport system permease...”
- Sxy induces a CRP-S regulon in Escherichia coli
Sinha, Journal of bacteriology 2009 - “...75 76 77 78 79 79 b3863 b0294 b3334 b0245 b4326 b1562 b0235 b0325 b3554 b4327 b0032 b2821 b2819 polA matA gspM ykfI yjiD hokD ykfJ yahK yiaF yjiE carA ptrA recD...”
- RNA-seq analysis of the influence of anaerobiosis and FNR on Shigella flexneri
Vergara-Irigaray, BMC genomics 2014 - “...-3.40 SF1231 conserved hypothetical protein -3.71 -1.60 SF5M90T_427 ybaA conserved hypothetical protein -3.88 Phage related S1668 relF prophage maintenance protein 1.75 SF5M90T_1793 putative phage integrase protein 1.45 -1.60 SF5M90T_1056 hypothetical bacteriophage protein 1.14 SF5M90T_740 putative bacteriophage protein -1.93 a Genomes used as reference are: S. flexneri...”
- Phage production is blocked in the adherent-invasive Escherichia coli LF82 upon macrophage infection
Misson, PLoS pathogens 2023 - “...shock-like protein CspB 59 Tritos LF82_280 ImmA/IrrE metallo-endopeptidase 14 Tritos LF82_281 hypothetical protein 30 Tritos LF82_1023 HokD toxin 3792 Cartapus LF82_413 CI repressor protein 5 Cartapus LF82_783 putative exonuclease protein 15 Cartapus LF82_789 hypothetical protein 7 Cyrano CYRAN_45 putative repressor 10 Cyrano CYRAN_26 KacT Acetyltransferase-type toxin...”
- Repertoire and Diversity of Toxin - Antitoxin Systems of Crohn's Disease-Associated Adherent-Invasive Escherichia coli. New Insight of This Emergent E. coli Pathotype
Bustamante, Frontiers in microbiology 2020 - “...Not annotated c1,466,912.1,467,052 HOK_GEF Superfamily (cl27487, PRK09738) Hok-3 Homologous to hokB in K-12 I TA6 NRG857_07840 c1,626,517.1,626,672 $ HOK_GEF Superfamily (cl27487, pfam01848) Hok-4 Homologous to hokD in K-12 I TA7 Not annotated c2,175,707.2,175,763 NI Ibs-1 Homologous to ibsA in K-12 I TA8 Not annotated c2,176,038.2,176,091 NI...”
- Gene duplications in the E. coli genome: common themes among pathotypes
Bernabeu, BMC genomics 2019 - “...Region 1 21 EC042_1328 2.1 6 23 EC042_1330 2.6 5 24 EC042_1333 2.5 4.2 25 EC042_1336 1.8 0.6 26 EC042_1342 3.5 4.8 27 EC042_1343 2.8 4.8 28 EC042_1344 1.7 4 29 EC042_1349 2.4 3.8 30 EC042_1353 4.3 3.3 31 EC042_1371 1.5 6.1 32 EC042_1372 1.5 3.8...”
EC958_RS07365 type I toxin-antitoxin system toxin HokD from Escherichia coli O25b:H4-ST131
98% identity, 73% coverage
UTI89_C0018 gef membrane toxin from Escherichia coli UTI89
68% identity, 99% coverage
MokC / b0018 regulatory protein MokC from Escherichia coli K-12 substr. MG1655 (see 2 papers)
b0018 regulatory protein for HokC, overlaps CDS of hokC from Escherichia coli str. K-12 substr. MG1655
79% identity, 81% coverage
ECs0016 Gef protein from Escherichia coli O157:H7 str. Sakai
77% identity, 81% coverage
STY2054 Similar to cell-killing genes toxin/antitoxin system comprised of two overlapping transcriptional units (hok/mok). This CDS is equivalent to hok (host cell killing) and is similar to Bacteriophage 933W host killer protein hokW TR:Q9T212 (EMBL:AF125520) (51 aa) fasta scores: E(): 1.3e-20, 96.1% id in 51 aa and Escherichia coli O157:H7 host killer protein hokW TR:Q9KXA2 (EMBL:AP000422) (51 aa) fasta scores: E(): 1.3e-20, 96.1% id in 51 aa from Salmonella enterica subsp. enterica serovar Typhi str. CT18
90% identity, 73% coverage
Gef / b4412 protein HokC from Escherichia coli K-12 substr. MG1655 (see 7 papers)
HOKC_ECOLI / P0ACG4 Toxic protein HokC; Protein Gef from Escherichia coli (strain K12) (see 2 papers)
TC 1.E.53.1.1 / P0ACG4 Toxic protein, HokC or Gef of the Hok/Gef family. When injected into melanoma cells, gef caused the appearance of pore-like structures in the cell membrane from Escherichia coli (see 6 papers)
K8B90_RS03850, NRG857_00085, NRG857_RS00075 type I toxin-antitoxin system toxin MokC from Escherichia coli O83:H1 str. NRG 857C
YP_025292 protein HokC from Escherichia coli str. K-12 substr. MG1655
80% identity, 71% coverage
- function: Toxic component of a type I toxin-antitoxin (TA) system. When overexpressed kills cells within minutes; causes collapse of the transmembrane potential and arrest of respiration (PubMed:10361310). Its toxic effect is probably neutralized by antisense antitoxin RNA SokC (PubMed:10361310).
subunit: Homodimer; disulfide-linked.
- substrates: endolysin
- Contribution of Toxin–Antitoxin Systems to Adherent-Invasive E. coli Pathogenesis
Bustamante, Microorganisms 2024 - “...TA ID Toxin Antitoxin Family/Domain Comments TA214828 K8B90_RS03460 (symE) -(symR) symER/SymE (toxin) Type I TA214832 K8B90_RS03850 (hokC) -(sokC) hok-sok/- Type I TA214852 K8B90_RS22020 (ldrD) -(rdlD) ldrD-rdlD/Ldr (toxin) Type I TA214826 K8B90_RS00850 (higB) K8B90_RS00855 (higA) higBA (relBE)/HTH (antitoxin) Type II TA214834 K8B90_RS04020 (ccdB) K8B90_RS04015 (ccdA) ccdAB/CcdA (antitoxin)...”
- “...strain according to TADB 3.0 [ 40 ]. TA ID Toxin Antitoxin Family/Domain Comments TA027329 NRG857_RS00075 (hokC) -(sokC) hok-sok/- Type I; TA1 at [ 9 ] TA027349 NRG857_RS17965 (ldrD) -(rdlD) ldrD-rdlD/Ldr (toxin) Type I; TA14 at [ 9 ] TA027353 NRG857_RS22450 (symE) -(symR) symER/SymE (toxin) Type...”
- Repertoire and Diversity of Toxin - Antitoxin Systems of Crohn's Disease-Associated Adherent-Invasive Escherichia coli. New Insight of This Emergent E. coli Pathotype
Bustamante, Frontiers in microbiology 2020 - “...Type TA no. Locus Location # Conserved Domain (Accession) & Given name Comments I TA1 NRG857_00085 c15,471.15,623 HOK_GEF Superfamily (cl27487, pfam01848) Hok-1 Homologous to hokC in E. coli K12; close to GI I TA2 NRG857_02625 577,301.577,453 $ HOK_GEF Superfamily (cl27487, pfam01848) Hok-2.1 Homologous to hokE in...”
- Saturation Mutagenesis of the Transmembrane Region of HokC in Escherichia coli Reveals Its High Tolerance to Mutations.
Lara, International journal of molecular sciences 2021 - GeneRIF: Saturation Mutagenesis of the Transmembrane Region of HokC in Escherichia coli Reveals Its High Tolerance to Mutations.
HOK_ECOLX / P11895 Protein Hok from Escherichia coli (see paper)
42% identity, 64% coverage
- function: Toxic component of a type I toxin-antitoxin (TA) system. Part of the plasmid-stabilizing activity of plasmid R1; when R1 is lost cells die rapidly. When overexpressed kills cells within 5 minutes; causes collapse of the transmembrane potential and arrest of respiration (PubMed:3019679). Its toxic effect is partially neutralized by antisense RNA Sok.
NRG857_RS23320 type I toxin-antitoxin system Hok family toxin from Escherichia coli O83:H1 str. NRG 857C
49% identity, 56% coverage
- Contribution of Toxin–Antitoxin Systems to Adherent-Invasive E. coli Pathogenesis
Bustamante, Microorganisms 2024 - “...at [ 9 ] TA027597 NRG857_RS23200 (srnB) -(sok) hok-sok/- Type I; on plasmid pO83_CORR TA027602 NRG857_RS23320 (hok) -(sok) hok-sok/- Type I; on plasmid pO83_CORR TA027331 NRG857_RS00245 (ccdB) NRG857_RS00240 (ccdA) ccdAB/CcdA (antitoxin) Type II; TA17 at [ 9 ] TA027332 NRG857_RS01300 (yafO) NRG857_RS01295 (yafN) yafN-yafO (relBE)/YafO-YafN Type...”
FLMA_ECOLI / P62670 Stable plasmid inheritance protein; F leading maintenance protein from Escherichia coli (strain K12) (see paper)
UTI89_P098 stable plasmid inheritance protein from Escherichia coli UTI89
42% identity, 64% coverage
- function: Toxic component of a type I toxin-antitoxin (TA) system (Probable). Part of the plasmid maintenance system, encodes a toxic protein that collapses the transmembrane potential and arrests respiration (By similarity). When the adjacent non-translated flmB (sok) gene is disrupted FlmA no longer functions in plasmid maintenance (i.e. FlmB probably encodes an antisense antitoxin RNA) (PubMed:3070354). The flmA and flmB RNAs interact, and presence of flmB on a separate plasmid from one encoding flmA and flmB allows loss of the latter plasmid (PubMed:3049248). Translation of FlmA may be coupled to the upstream flmC gene (PubMed:3049248).
- Plasmid parB contributes to uropathogenic Escherichia coli colonization in vivo by acting on biofilm formation and global gene regulation
Song, Frontiers in molecular biosciences 2022 (no snippet)
ASA_P5G151 hypothetical protein from Aeromonas salmonicida subsp. salmonicida A449
43% identity, 67% coverage
SEN1135 phage encoded Hok-like membrane protein from Salmonella enterica subsp. enterica serovar Enteritidis str. P125109
64% identity, 56% coverage
For advice on how to use these tools together, see
Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC and from manually-curated information
from GeneRIF, UniProtKB/Swiss-Prot, BRENDA, CAZy (as made available by
dbCAN), BioLiP, CharProtDB, MetaCyc, EcoCyc, TCDB, REBASE, the Fitness
Browser, and a subset of the European Nucleotide Archive with the
/experiment tag. Given this database and a protein sequence query,
PaperBLAST uses protein-protein BLAST to find similar sequences with
E < 0.001.
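The E-value cutoff above can be illustrated with a minimal sketch. It assumes the hits are available in BLAST's standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The function name `filter_hits`, the two example rows, and the definition of coverage as the aligned fraction of the query are illustrative assumptions, not PaperBLAST's own code.

```python
# Hypothetical helper: keep tabular BLAST hits below an E-value cutoff.
E_CUTOFF = 0.001  # cutoff stated in the text above


def filter_hits(tabular_text, query_len, e_cutoff=E_CUTOFF):
    """Return hits with E < e_cutoff, with percent identity and query coverage."""
    hits = []
    for line in tabular_text.strip().splitlines():
        fields = line.split("\t")
        pident = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        if evalue < e_cutoff:
            # Assumption: coverage is the share of the query spanned by the alignment.
            coverage = 100.0 * (qend - qstart + 1) / query_len
            hits.append({"subject": fields[1], "identity": pident,
                         "coverage": coverage, "evalue": evalue})
    return hits


# Invented rows for illustration only (not real BLAST output).
example = "\n".join([
    "query\thitA\t100.0\t70\t0\t0\t1\t70\t1\t70\t1e-40\t150",
    "query\thitB\t35.0\t20\t13\t0\t5\t24\t3\t22\t0.5\t20.1",
])
kept = filter_hits(example, query_len=70)
```

Only `hitA` survives here, because the second row has an E-value above the cutoff.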
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
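As a rough illustration of the "locus_tag AND genus_name" query form, the sketch below builds such a query string and a search URL for the Europe PMC REST service. The helper names `build_query` and `build_url` are hypothetical, and the exact request parameters PaperBLAST sends are not given in this text, so the URL layout is an assumption.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public Europe PMC REST search service.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(locus_tag, organism):
    """Combine a locus tag with the genus (first word of the organism name)."""
    genus = organism.split()[0]
    return f"{locus_tag} AND {genus}"


def build_url(locus_tag, organism):
    """Return a search URL for the query; no network request is made here."""
    params = {"query": build_query(locus_tag, organism), "format": "json"}
    return f"{BASE_URL}?{urlencode(params)}"


query = build_query("Z2054", "Escherichia coli O157:H7 EDL933")
url = build_url("Z2054", "Escherichia coli O157:H7 EDL933")
```

Requiring the genus name alongside the locus tag is what reduces false matches when a short tag happens to appear in an unrelated paper.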
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
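Snippet selection of this kind could look roughly like the following sketch, which takes up to two non-overlapping windows of text around mentions of an identifier. The function name, the window width, and the overlap rule are assumptions for illustration; PaperBLAST's actual heuristics are not described here.

```python
import re


def select_snippets(text, identifier, width=60, max_snippets=2):
    """Return up to max_snippets text windows centred on mentions of identifier."""
    snippets = []
    last_end = -1
    pattern = r"\b" + re.escape(identifier) + r"\b"
    for match in re.finditer(pattern, text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        if start < last_end:
            continue  # skip mentions that fall inside the previous window
        snippets.append("..." + text[start:end].strip() + "...")
        last_end = end
        if len(snippets) == max_snippets:
            break
    return snippets


sample = "In particular, two ORFs, Z2054 and Z2101, were positive in most strains."
found = select_snippets(sample, "Z2054")
missing = select_snippets(sample, "Z9999")
```

An identifier that never appears in the available text yields an empty list, which corresponds to a paper being listed without a snippet.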
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a GeneRIF entry links the
gene to an article in PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of BioLiP, a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the Protein Data Bank.
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether it is characterized.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transport Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Classification number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally-characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally-determined DNA binding sites from the
PRODORIC database of gene regulation in prokaryotes are included.
- Putative transcription factors from RegPrecise
that have manually-curated predictions for their binding sites are included. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
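The ENA inclusion rules in the last item above can be sketched as a filter. This is a minimal illustration under stated assumptions: the `CdsFeature` record and its field names are hypothetical, and the choice to keep a feature when at least one of its papers stays under the limit is one possible reading of the rule, not necessarily what PaperBLAST does.

```python
from collections import Counter
from dataclasses import dataclass

MAX_PROTEINS = 25  # limit stated in the text above


@dataclass(frozen=True)
class CdsFeature:
    """Hypothetical record for one ENA coding sequence feature."""
    entry_id: str
    has_experiment_tag: bool
    data_class: str
    pubmed_ids: tuple = ()


def select_ena_features(features, max_proteins=MAX_PROTEINS):
    """Keep features with /experiment, STD data class, and PubMed links,
    dropping nucleotide entries or papers that carry too many such proteins."""
    candidates = [f for f in features
                  if f.has_experiment_tag and f.data_class == "STD" and f.pubmed_ids]
    per_entry = Counter(f.entry_id for f in candidates)
    per_paper = Counter(p for f in candidates for p in set(f.pubmed_ids))
    kept = []
    for f in candidates:
        if per_entry[f.entry_id] > max_proteins:
            continue  # entry has more than the allowed number of proteins
        # Assumption: one paper under the limit is enough to keep the feature.
        if any(per_paper[p] <= max_proteins for p in f.pubmed_ids):
            kept.append(f)
    return kept


# Invented records for illustration only.
features = [CdsFeature("E1", True, "STD", ("1",)) for _ in range(26)]
features.append(CdsFeature("E2", True, "STD", ("2",)))
features.append(CdsFeature("E3", False, "STD", ("3",)))
features.append(CdsFeature("E4", True, "WGS", ("4",)))
features.append(CdsFeature("E5", True, "STD", ()))
selected = select_ena_features(features)
```

In this example only the feature from entry `E2` is retained: `E1` exceeds the 25-protein limit, and the others lack the tag, the STD data class, or a linked paper.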
Except for GeneRIF and ENA,
the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
For more information see the
PaperBLAST paper (mSystems 2017)
or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubMed Central or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.
by Morgan Price,
Arkin group
Lawrence Berkeley National Laboratory