PaperBLAST
Full List of Papers Linked to VIMSS7417180
ethA / P9WNF9 ethionamide monooxygenase EthA from Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) (see 7 papers)
ETHA_MYCTU / P9WNF9 FAD-containing monooxygenase EthA; Baeyer-Villiger monooxygenase EtaA; BVMO; Prodrug activator EtaA; EC 1.14.13.- from Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) (see 4 papers)
ETHA_MYCBO / Q7TVI2 FAD-containing monooxygenase EthA; Baeyer-Villiger monooxygenase; BVMO; Prodrug activator EtaA; EC 1.14.13.- from Mycobacterium bovis (strain ATCC BAA-935 / AF2122/97) (see paper)
YP_003033907 monooxygenase ethA from Mycobacterium tuberculosis KZN 1435
NP_218371 monooxygenase EthA from Mycobacterium tuberculosis H37Rv
MT3969 monooxygenase, flavin-binding family from Mycobacterium tuberculosis CDC1551
Rv3854c MONOOXYGENASE ETHA from Mycobacterium tuberculosis H37Rv
JTY_3919 monooxygenase from Mycobacterium bovis BCG str. Tokyo 172
MT49_RS20315 FAD-containing monooxygenase EthA from Mycobacterium tuberculosis 49-02
- function: Monooxygenase able to convert a wide range of ketones to the corresponding esters or lactones via a Baeyer-Villiger oxidation reaction. Can act on long-chain aliphatic ketones (2-hexanone to 2-dodecanone) and on aromatic ketones (phenylacetone and benzylacetone). Is also able to catalyze enantioselective sulfoxidation of methyl-p-tolylsulfide. In vivo, likely functions as a BVMO, but the exact nature of the physiological substrate(s) remains to be established.
function: Is responsible for the activation of several thiocarbamide-containing pro-drugs into cytotoxic species. Thus, catalyzes the oxidation of the antitubercular pro-drug ethionamide (ETH) to the corresponding sulfoxide, which is further oxidized by EthA to 2-ethyl-4-amidopyridine, presumably via the unstable doubly oxidized sulfinic acid intermediate; the final metabolite 2-ethyl-4-amidopyridine has no antitubercular activity, so the cytotoxic species is a metabolite intermediate formed by EthA. Also oxidizes thiacetazone (TAC), thiobenzamide, and isothionicotinamide and therefore is probably responsible, as suggested by the observation of crossover resistance, for the oxidative activation of these other thioamide antitubercular drugs.
catalytic activity: ethionamide + NADPH + O2 + H(+) = ethionamide S-oxide + NADP(+) + H2O (RHEA:47616)
cofactor: FAD (Binds 1 FAD per subunit.)
subunit: Exists as a mixture of relatively large homooligomers ranging from 200 to 600 kDa.
disruption phenotype: Inactivation of this gene leads to a strong resistance to ETH.
- function: Monooxygenase able to convert a wide range of ketones to the corresponding esters or lactones via a Baeyer-Villiger oxidation reaction. Can act on long-chain aliphatic ketones (2-hexanone to 2-dodecanone) and on aromatic ketones (phenylacetone and benzylacetone). Is also able to catalyze enantioselective sulfoxidation of methyl-p-tolylsulfide. In vivo, likely functions as a BVMO, but the exact nature of the physiological substrate(s) remains to be established.
function: Is responsible for the activation of several thiocarbamide-containing pro-drugs, such as ethionamide (ETH), isoxyl (ISO) and thiacetazone (TAC), into reactive species.
catalytic activity: ethionamide + NADPH + O2 + H(+) = ethionamide S-oxide + NADP(+) + H2O (RHEA:47616)
cofactor: FAD (Binds 1 FAD per subunit.)
disruption phenotype: Deletion of this gene leads to a strong resistance to ETH, ISO and TAC.
- Escherichia coli Overexpressing a Baeyer-Villiger Monooxygenase from Acinetobacter radioresistens Becomes Resistant to Imipenem
Minerdi, Antimicrobial agents and chemotherapy 2016 - “...tuberculosis KZN 1435, ethionamide monooxygenase (YP_003033907); EtaA Mycobacterium tuberculosis SUMu001, ethionamide monooxygenase (ZP_07416557.1); EtaA...”
- Identification of a novel Baeyer-Villiger monooxygenase from Acinetobacter radioresistens: close relationship to the Mycobacterium tuberculosis prodrug activator EtaA
Minerdi, Microbial biotechnology 2012 - “...monooxygenase (NP_218371.1); EtaA Mycobacterium tuberculosis H37Ra=ethionamide monooxygenase (YP_001285245.1); EtaA Mycobacterium tuberculosis KZN1435=ethionamide monooxygenase (YP_003033907); EtaA Mycobacterium tuberculosis GM1503=ethionamide monooxygenase (ZP_03534438.1), MtmOIV Streptomyces argillaceus=mithramycin monooxygenase (3FMW_A); (B) Acinetobacter baumannii AB900=terminal alkane1monooxygenase (ZP_04661203.1); Acinetobacter baumannii ACICU=terminal alkane1monooxygenase (YP_001846325.1); Acinetobacter sp. 6013113=alkane1monooxygenase (ZP_06781771.1); Acinetobacter...”
- Whole-Transcriptome and -Genome Analysis of Extensively Drug-Resistant Mycobacterium tuberculosis Clinical Isolates Identifies Downregulation of ethA as a Mechanism of Ethionamide Resistance.
de, Antimicrobial agents and chemotherapy 2017 - GeneRIF: Whole-transcriptome and -genome analysis of extensively drug-resistant Mycobacterium tuberculosis clinical isolates identifies downregulation of ethA as a mechanism of ethionamide resistance.
- Genotypic Analysis of Genes Associated with Independent Resistance and Cross-Resistance to Isoniazid and Ethionamide in Mycobacterium tuberculosis Clinical Isolates.
Rueda, Antimicrobial agents and chemotherapy 2015 - GeneRIF: The greatest rate of mutations causing ethionamide resistance were observed in katG, ethA, in mshA.
- Transcriptional Profiling of Mycobacterium tuberculosis Exposed to In Vitro Lysosomal Stress
Lin, Infection and immunity 2016 - “...MT2401 MT3194 MT3423 MT3424 MT3426 MT3427 MT3849 MT3850 MT3969 Rv0032 Rv0252 Rv0711 Rv1405c Rv1552 Rv1553 Rv1555 Rv1736c Rv1856c Rv2007c Rv2029c Rv2338c Rv3111...”
- The Mycobacterium tuberculosis Rv2745c plays an important role in responding to redox stress
McGillivray, PloS one 2014 - “...1.108 1.405 1.696 1.403 2.645 Intermediary metabolism MT3949 bfrB Rv3841 1.440 1.332 1.177 1.316 2.490 MT3969 ethA Rv3854c 1.173 1.226 1.200 2.297 Cell wall associated MT0870 lpqS hypothetical protein Rv0847 1.258 1.157 1.404 1.273 2.416 MT1379 murI glutamate racemase Rv1338 3.322 2.855 3.729 3.302 9.863 Genes...”
- “...1.483 2.794 Intermediary metabolism MT3949 bfrB ferritin family protein Rv3841 3.178 3.401 1.809 2.796 6.944 MT3969 ethA monooxygenase, flavin-binding family Rv3854c 3.630 2.818 3.209 3.219 9.310 MT3349 rubA rubredoxin Rv3251c 2.793 2.330 2.688 2.604 6.078 MT3348 rubB rubredoxin Rv3250c 2.064 1.773 2.262 2.033 4.093 Cell wall...”
- Discordance Between Phenotypic and WGS-Based Drug Susceptibility Testing Results for Some Anti-Tuberculosis Drugs: A Snapshot Study of Paired Mycobacterium tuberculosis Isolates with Small Genetic Distance
Sadovska, Infection and drug resistance 2024 - “...two more phenotypically resistant isolate pairs belonging to the SIT42, the Ile338Ser variant at locus Rv3854c ( ethA gene) was simultaneously detected, and its association with Mtb ETO resistance has not yet been clarified. Table 8 Ethionamide Resistance-Conferring Variants, and Comparison with Phenotypic Drug Susceptibility Testing...”
- “...No. of isolates Match/Total No. of isolate pairs No pDST data available (No. of isolates/pairs) Rv3854c ethA 110del Associated with resistance 1 2/4 1/2 768del Associated with resistance (interim) 1/2 0/1 1029del 0/2 0/1 1152del 4/2 1290del No data 190 1/2 0/1 Rv1483 fabG1 C-15T Associated...”
- Universal Lineage-Independent Markers of Multidrug Resistance in Mycobacterium tuberculosis
Hlanze, Microorganisms 2024 - “...and SM Rv0045c Possible hydrolase 83 AM, CM, EMB , MFX, OFX, PTO, and PZA Rv3854c Monooxygenase EthA 337 AM, CM, EMB , OFX, PTO, PZA , and SM Rv0823c Transcriptional regulatory protein 322 AM, CM, EMB , MFX, OFX, PTO, and PZA Rv3919c Glucose-inhibited division...”
- Characteristic SNPs defining the major multidrug-resistant Mycobacterium tuberculosis clusters identified by EuSeqMyTB to support routine surveillance, EU/EEA, 2017 to 2019
de, Euro surveillance : bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin 2024 - “...histidine kinase TrcS Rv1032c Leu213Leu (ctg/ctA) 29 32 Mainly T (4.8) 4327088_C 2 Monooxygenase EthA Rv3854c Leu129Arg (ctc/cGc) 20 20 Ural (4.2.1) 130881_G 3 NA 16 16 Euro-American (4.6.2) 1485300_G 4 NA 14 15 Mainly T (4.8) 1208477_G 5 Hypothetical protein Rv1084 Ala281Gly (gcg/gGg) 13 15...”
- Whole-genome sequencing-based genetic diversity, transmission dynamics, and drug-resistant mutations in Mycobacterium tuberculosis isolated from extrapulmonary tuberculosis patients in western Ethiopia
Chekesa, Frontiers in public health 2024 - “...embB Rv3795 4248003 Gln497Arg Missense_variant 1(1.12) Capreomycin tlyA Rv1694 1918647 Asn236Lys Missense_variant 2(2.25) Ethionamide ethA Rv3854c 4326765 708delC Frameshift_variant 2(2.25) Remarkably, one isolate (EN068) presented with mixed infection and multidrug resistance, featuring resistance mutations in the rpoB (Asn437Thr, Ser441Ala, Leu464Met) and katG (Asn218Lys) genes ( Supplementary...”
- Identification of Mycobacterium tuberculosis transcriptional repressor EthR inhibitors: Shape-based search and machine learning studies
Chikhale, Heliyon 2024 - “...very well understood for ETH. Besra and Baulard [ 18 ], explained the ETH activator Rv3854c and termed it EthA, it is analogous to the various monooxygenases and induces ETH sensitivity on overexpression in Mycobacteria. Later, Montellano and coworkers established the EtaA as a FAD-containing enzyme...”
- The efflux pumps Rv1877 and Rv0191 play differential roles in the protection of Mycobacterium tuberculosis against chemical stress
Sao, Frontiers in microbiology 2024 - “..., rv1520 and monooxygenases ( rv2378c, rv0385 and rv0793 in our study but not specifically rv3854c and rv1393c ) that were found to be downregulated during EI ( Rodriguez et al., 2002 ), were also downregulated in the mutants of our study ( Supplementary Tables S3S5...”
- Insight into Population Structure and Drug Resistance of Pediatric Tuberculosis Strains from China and Russia Gained through Whole-Genome Sequencing
Zhdanova, International journal of molecular sciences 2023 - “...amikacin aftA, embC Rv3792,Rv3793 ethambutol embA Rv3794 ethambutol embB Rv3795 ethambutol ubiA Rv3806c ethambutol ethA Rv3854c ethionamide gidB Rv3919c streptomycin Table 5 Performance characteristics of the molecular detection of drug resistance. Drugs Country Sensitivity Specificity PPV Rifampicin Russia 0.89 (0.75-0.96) 1.00 (0.85-1.00) 1.00...”
- Drug Degradation Caused by mce3R Mutations Confers Contezolid (MRX-I) Resistance in Mycobacterium tuberculosis
Pi, Antimicrobial agents and chemotherapy 2022 (secret)
- Drug degradation caused by mce3R mutations confers contezolid (MRX-I) resistance in Mycobacterium tuberculosis
Pi, 2022
- Bioinformatic Mining and Structure-Activity Profiling of Baeyer-Villiger Monooxygenases from Mycobacterium tuberculosis
Tomas, mSphere 2022 - “...a comprehensive bioinformatic analysis that identified six BVMOs in M. tuberculosis , including Rv3083 (MymA), Rv3854c (EthA), Rv0565c, and Rv0892, which were selected for further characterization. Homology modeling and substrate docking analysis, performed on this subset, suggested that Rv0892 is closer to the cyclohexanone BVMO, while...”
- “...Indeed, the M. tuberculosis genome includes at least three BVMO genes that can activate ETH: rv3854c (also known as ethA for ETH activator) ( 4 ), rv3083 (also known as mymA ) ( 5 ), and rv0565c ( 6 ). EthA and MymA are both type...”
- Anti-tuberculosis drug development via targeting the cell envelope of Mycobacterium tuberculosis
Xu, Frontiers in microbiology 2022 - “...structure to INH and is also a prodrug. It is activated by the enzyme ethA (Rv3854c, a monooxygenase), and binds NAD + to form an ETH-NAD adduct which inhibits the same target site as INH ( Vilcheze and Jacobs Jr., 2014 ). Aside adduct-forming compounds, there...”
- The rate and role of pseudogenes of the Mycobacterium tuberculosis complex
Soler-Camargo, Microbial genomics 2022 - “...genes previously associated with antibiotic resistance carrying frameshifts in selected M. tuberculosis genomes: Rv3083, ethA (Rv3854c), gid (Rv3919c), Rv2752c, mmpL5 (Rv0676c), Rv0678, pncA (Rv2043c), and tlyA (Rv1694). Three resistance-related genes were also detected with a frameshift in few M. bovis strains: rpoB (Rv0667), eis (Rv2416), and...”
- Transcriptional regulation and drug resistance in Mycobacterium tuberculosis
Miotto, Frontiers in cellular and infection microbiology 2022 - “...regulon; ( ii ) Rv3855 (EthR) is well-known for its regulatory role on ethA ( rv3854c ), which encodes a Baeyer-Villager monooxygenase involved in the activation of ETO ( Engohang-Ndong etal., 2004 ). Despite the role of mutations affecting ethA in ETO resistance is evident (...”
- "Upcycling" known molecules and targets for drug-resistant TB
Roubert, Frontiers in cellular and infection microbiology 2022 - “...25101874 0.38 2015 CID, pubchem compound identification number; DprE1, decaprenylphosphoryl-β-D-ribose 2-epimerase (Rv3790); EthA, monooxygenase EthA (Rv3854c); DnaN, DNA polymerase III DnaN (Rv0002); ClpC1, ATP-dependent Clp protease ATP-binding subunit ClpC (Rv3596c); InhA, NADH-dependent enoyl-[acyl-carrier-protein] reductase (Rv1484); MmpL3, transmembrane transporter (Rv0206c); LeuRS, leucyl-tRNA synthetase (Rv0041); DnaE1, DNA polymerase...”
- NSC19723, a Thiacetazone-Like Benzaldehyde Thiosemicarbazone Improves the Efficacy of TB Drugs In Vitro and In Vivo
Singh, Microbiology spectrum 2022 - “...( 26 28 ). TAC is converted to its sulfenic acid form by monooxygenase (EthA, Rv3854c) ( 29 ). Subsequently, sulfenic acid binds to FAS-II systems dehydratase HadAB via a disulphide bound with a cysteine (Cys61) residue of HadA ( 30 , 31 ). This disulphide...”
- The In Vivo Transcriptomic Blueprint of Mycobacterium tuberculosis in the Lung
Coppola, Frontiers in immunology 2021 - “...17 (Rv0005, Rv0284, Rv1161, Rv1297, Rv1398c, Rv1611, Rv1783, Rv1925, Rv2031c, Rv3051c, Rv3219, Rv3248c, Rv3583c, Rv3841, Rv3854c, Rv3874, and Rv3875) were shared with the RT-PCR datasets. Furthermore, a partial validation was performed by using a Mtb RNA-Seq dataset from seven human active TB sputum samples ( Data...”
- “...abundant transcripts have previously been described as Mtb antigens and four (Rv0005, Rv1305, Rv3601c, and Rv3854c) as targets of approved TB drugs ( 7 , 19 , 20 ) ( Figure1B , left panel). To the best of our knowledge, 35 of the top ranked Mtb...”
- Application of Computational Methods in Understanding Mutations in Mycobacterium tuberculosis Drug Resistance
Mugumbate, Frontiers in molecular biosciences 2021 - “...promoter. On the contrary, ETH is activated by the enzyme EthA encoded by the gene Rv3854c to the toxic S-oxide then to 2-ethyl-4-aminopyrimidine ( DeBarber et al., 2000 ; Baulard et al., 2000 ). The transcription of the FAD-containing monooxygenase, EthA, is controlled by another gene...”
- The multi-target aspect of an MmpL3 inhibitor: The BM212 series of compounds bind EthR2, a transcriptional regulator of ethionamide activation
Moorey, Cell surface (Amsterdam, Netherlands) 2021 - “...region ( Blondiaux et al., 2017 ), in analogy to EthR and EthA (Rv3855 and Rv3854c). EthA is regulated by the transcriptional repressor EthR ( Engohang-Ndong et al., 2004 ). Inhibitors of EthR stimulate the transcription of the ethA gene, which improves the bioactivation of the...”
- “...has been extensively researched. The genes responsible for encoding these proteins, ethA and ethR ( rv3854c and rv3855 ) , share a promotor region which is repressed by EthR ( Baulard et al., 2000 , Engohang-Ndong et al., 2004 ). Inhibition of EthR improves ETA potency...”
- An explainable machine learning platform for pyrazinamide resistance prediction and genetic feature identification of Mycobacterium tuberculosis
Zhang, Journal of the American Medical Informatics Association : JAMIA 2021 - “...Drug Resistance Database23 are Rv3795, Rv1267c, Rv0341, Rv3854c, Rv0006, Rv0005, Rv1694, Rv1908c, Rv2245, Rv1854c, Rv2427A, Rv2428, Rv1483, Rv1484, Rv3919c,...”
- Characterization of Drug-Resistant Lipid-Dependent Differentially Detectable Mycobacterium tuberculosis
Mesman, Journal of clinical medicine 2021 - “...Rv1599* hisD A259V non-synonymous Probable histidinol dehydrogenase HisD (HDH) intermediary metabolism and respiration 0 1 Rv3854c a ethA L48F non-synonymous Monooxygenase EthA intermediary metabolism and respiration 0 1 Rv0092 a ctpA S678P non-synonymous Cation transporter P-type ATPase a CtpA cell wall and cell processes 0 1...”
- HigB1 Toxin in Mycobacterium tuberculosis Is Upregulated During Stress and Required to Establish Infection in Guinea Pigs
Sharma, Frontiers in microbiology 2021 - “...ahpC 2.31 Down Alkyl hydroperoxide reductase subunit AhpC Rv3086 adhD 2.31 Down Alcohol dehydrogenase D Rv3854c ethA 2.24 Down Monooxygenase EthA Rv0079 Rv0079 2.23 Down Unknown protein Rv0311 Rv0311 2.19 Down Unknown protein Rv3084 lipR 2.19 Down Acetyl-hydrolase LipR Rv1956 Rv1956 2.18 Down Antitoxin HigA Rv2729c...”
- Insertion and deletion evolution reflects antibiotics selection pressure in a Mycobacterium tuberculosis outbreak
Godfroid, PLoS pathogens 2020 - “...Table 2 ). (f) Single base-pair deletion in the beginning of the ethA gene (MT49_RS20315, Rv3854c). This deletion occurs at position 110 of the coding sequence (7.5% of the CDS length), which results in a frameshift where the resulting protein is truncated with a length of...”
- “...MT49_RS09595 Rv1843c guaB1 GuaB1 family IMP dehydrogenase-related protein nsSNP MT49_RS12315 Rv2337c hypothetical protein nsSNP MT49_RS20315 Rv3854c ethA FAD-containing monooxygenase EthA ABR-conferring 1bp deletion in mmaA3 sSNP MT49_RS03380 Rv0645c mmaA1 mycolic acid methyltransferase MmaA1 Involved in membrane biogenesis iSNP MT49_RS08155 Rv1535 hypothetical protein IS6110 insertion MT49_RS16445 Rv3126c...”
- Tuberculosis in Liberia: high multidrug-resistance burden, transmission and diversity modelled by multiple importation events
López, Microbial genomics 2020 - “...(%) Drug embB Rv3795 4249583 G/A D1024N LR56 17 3+ 100 EMB LR81 97.44 ethA Rv3854c 4327378 G/T Y32STOP LR02 1 2 100 ETH gid Rv3919c 4407816 GC/G 1bp del LR07 10 3+ 100 SM LR11 100 LR13 100 LR14 100 4407982 A/G L74S LR41 4...”
- Novel target and cofactor repertoire for the transcriptional regulator JTY_0672 from Mycobacterium bovis BCG
Wang, Frontiers in microbiology 2024 - “...ESX-1 secretion-associated protein EspE 1.5877 JTY_0720 30S ribosomal protein S17 0.42136 JTY_1747 Transmembrane protein 3.0037 JTY_3919 Monooxygenase EthA 1.5871 JTY_2023 Hypothetical protein 1.2117 JTY_0816 Monooxygenase 1.1253 JTY_1752 Hypothetical protein 3.7301 JTY_2008 Universal stress protein 1.6244 JTY_2019 Ferredoxin FdxA 3.1965 JTY_3150 Two component sensor histidine kinase DevS...”
- “...JTY_2044 , JTY_1747 ), while down-regulation occurred in five genes ( JTY_0673 , JTY_3929 , JTY_3919 , JTY_0696 , JTY_3931 ) ( Figure 1C ). These results demonstrated that JTY_0672 can act as an activator as well as a repressor. JTY_0672 bound to the JTY_3148 promoter...”
For advice on how to use these tools together, see Interactive tools for functional annotation of bacterial genomes.
The PaperBLAST database links 793,807 different protein sequences to 1,259,118 scientific articles. Searches against EuropePMC were last performed on March 13 2025.
PaperBLAST builds a database of protein sequences that are linked
to scientific articles. These links come from automated text searches
against the articles in EuropePMC
and from manually-curated information from GeneRIF, UniProtKB/Swiss-Prot,
BRENDA,
CAZy (as made available by dbCAN),
BioLiP,
CharProtDB,
MetaCyc,
EcoCyc,
TCDB,
REBASE,
the Fitness Browser,
and a subset of the European Nucleotide Archive with the /experiment tag.
Given this database and a protein sequence query,
PaperBLAST uses protein-protein BLAST
to find similar sequences with E < 0.001.
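As a rough illustration of the E-value cutoff, the sketch below filters BLAST hits in the standard 12-column tabular format (`-outfmt 6`) and keeps those with E < 0.001. This is not PaperBLAST's own code: the function name `filter_hits` and the sample rows are made up for the example.

```python
# Illustrative only: keep BLAST tabular hits (-outfmt 6) with E < 0.001.
# filter_hits and the sample rows below are hypothetical, not PaperBLAST code.

E_VALUE_CUTOFF = 0.001
EVALUE_COLUMN = 10  # 0-based index of the evalue field in -outfmt 6


def filter_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Return (subject_id, evalue) pairs for hits with evalue below cutoff."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[EVALUE_COLUMN])
        if evalue < cutoff:
            kept.append((fields[1], evalue))
    return kept


# Columns: qseqid sseqid pident length mismatch gapopen
#          qstart qend sstart send evalue bitscore
sample = [
    "query\thitA\t98.5\t489\t7\t0\t1\t489\t1\t489\t0.0\t990",
    "query\thitB\t31.2\t120\t70\t5\t10\t125\t4\t118\t2e-05\t48.1",
    "query\thitC\t25.0\t60\t40\t3\t5\t64\t200\t256\t0.5\t27.3",
]
print(filter_hits(sample))
```

In this made-up sample, hitC (E = 0.5) is dropped and the other two hits are kept.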
To build the database, we query EuropePMC with locus tags, with RefSeq protein
identifiers, and with UniProt
accessions. We obtain the locus tags from RefSeq or from MicrobesOnline. We use
queries of the form "locus_tag AND genus_name" to try to ensure that
the paper is actually discussing that gene. Because EuropePMC indexes
most recent biomedical papers, even if they are not open access, some
of the links may be to papers that you cannot read or that our
computers cannot read. We query each of these identifiers that
appears in the open access part of EuropePMC, as well as every locus
tag that appears in the 500 most-referenced genomes, so that a gene
may appear in the PaperBLAST results even though none of the papers
that mention it are open access. We also incorporate text-mined links
from EuropePMC that link open access articles to UniProt or RefSeq
identifiers. (This yields some additional links because EuropePMC
uses different heuristics for their text mining than we do.)
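The query form "locus_tag AND genus_name" can be sketched as a small string-building helper. This is only an illustration: `build_query` is a hypothetical name, and quoting each term and URL-encoding the result under a `query` parameter are assumptions rather than a description of how PaperBLAST actually calls EuropePMC.

```python
# Illustrative only: compose a "locus_tag AND genus_name" search string.
# build_query is a hypothetical helper; quoting the terms is an assumption.
from urllib.parse import urlencode


def build_query(locus_tag, organism):
    """Pair a locus tag with the genus (first word of the organism name)."""
    genus = organism.split()[0]
    return f'"{locus_tag}" AND "{genus}"'


query = build_query("Rv3854c", "Mycobacterium tuberculosis H37Rv")
print(query)                         # "Rv3854c" AND "Mycobacterium"
print(urlencode({"query": query}))   # URL-encoded form of the same query
```

Requiring the genus name alongside the locus tag is what reduces false matches when a short identifier happens to occur in an unrelated paper.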
For every article that mentions a locus tag, a RefSeq protein
identifier, or a UniProt accession, we try to select one or two
snippets of text that refer to the protein. If we cannot get access to
the full text, we try to select a snippet from the abstract, but
unfortunately, unique identifiers such as locus tags are rarely
provided in abstracts.
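One way to picture the snippet step is a fixed-width window around each mention of the identifier. The sketch below is a guess at the general idea, not PaperBLAST's actual heuristic: `select_snippets`, the window size, and the "..." markers are all assumptions made for illustration.

```python
# Illustrative only: pull up to two short snippets around an identifier.
# select_snippets, the window size, and the "..." markers are assumptions.
import re


def select_snippets(text, identifier, window=60, max_snippets=2):
    """Return up to max_snippets excerpts of text surrounding identifier."""
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b", re.IGNORECASE)
    snippets = []
    last_end = -1
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        if start < last_end:
            continue  # would overlap the previous snippet, so skip it
        end = min(len(text), match.end() + window)
        snippets.append("..." + text[start:end].strip() + "...")
        last_end = end
        if len(snippets) == max_snippets:
            break
    return snippets


full_text = (
    "Ethionamide is a prodrug. It is activated by the enzyme EthA "
    "(Rv3854c), a monooxygenase. Abstracts often omit such identifiers."
)
print(select_snippets(full_text, "Rv3854c", window=30))
```

If the identifier never appears in the available text, the function returns an empty list, which mirrors the case where only an abstract without locus tags can be read.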
PaperBLAST also incorporates manually-curated protein functions:
- Proteins from NCBI's RefSeq are included if a
GeneRIF
entry links the gene to an article in
PubMed®.
GeneRIF also provides a short summary of the article's claim about the
protein, which is shown instead of a snippet.
- Proteins from Swiss-Prot (the curated part of UniProt)
are included if the curators
identified experimental evidence for the protein's function (evidence
code ECO:0000269). For these proteins, the fields of the Swiss-Prot entry that
describe the protein's function are shown (with bold headings).
- Proteins from BRENDA,
a curated database of enzymes, are included if they are linked to a paper in PubMed
and their full sequence is known.
- Every protein from the non-redundant subset of
BioLiP,
a database
of ligand-binding sites and catalytic residues in protein structures, is included. Since BioLiP itself
does not include descriptions of the proteins, those are taken from the
Protein Data Bank.
Descriptions from PDB rely on the original submitter of the
structure and cannot be updated by others, so they may be less reliable.
(For SitesBLAST and Sites on a Tree, we use a larger subset of BioLiP so that every
ligand is represented among a group of structures with similar sequences, but for
PaperBLAST, we use the non-redundant set provided by BioLiP.)
- Every protein from EcoCyc, a curated
database of the proteins in Escherichia coli K-12, is included, regardless
of whether they are characterized or not.
- Proteins from the MetaCyc metabolic pathway database
are included if they are linked to a paper in PubMed and their full sequence is known.
- Proteins from the Transport Classification Database (TCDB)
are included if they have known substrate(s), have reference(s),
and are not described as uncharacterized or putative.
(Some of the references are not visible on the PaperBLAST web site.)
- Every protein from CharProtDB,
a database of experimentally characterized protein annotations, is included.
- Proteins from the CAZy database of carbohydrate-active enzymes
are included if they are associated with an Enzyme Classification number.
Even though CAZy does not provide links from individual protein sequences to papers,
these should all be experimentally-characterized proteins.
- Proteins from the REBASE database
of restriction enzymes are included if they have known specificity.
- Every protein with an evidence-based reannotation (based on mutant phenotypes)
in the Fitness Browser is included.
- Sequence-specific transcription factors (including sigma factors and DNA-binding response regulators)
with experimentally-determined DNA binding sites from the
PRODORIC database of gene regulation in prokaryotes are included.
- Putative transcription factors from RegPrecise
are included if they have manually-curated predictions for their binding sites. These predictions are based on
conserved putative regulatory sites across genomes that contain similar transcription factors,
so PaperBLAST clusters the TFs at 70% identity and retains just one member of each cluster.
- Coding sequence (CDS) features from the
European Nucleotide Archive (ENA)
are included if the /experiment tag is set (implying that there is experimental evidence for the annotation),
the nucleotide entry links to paper(s) in PubMed,
and the nucleotide entry is from the STD data class
(implying that these are targeted annotated sequences, not from shotgun sequencing).
Also, to filter out genes whose transcription or translation was detected, but whose function
was not studied, nucleotide entries or papers with more than 25 such proteins are excluded.
Descriptions from ENA rely on the original submitter of the
sequence and cannot be updated by others, so they may be less reliable.
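The ENA inclusion rules in the last item above can be read as a two-pass filter. The sketch below is one interpretation under stated assumptions: `CdsFeature` and its field names are hypothetical (not ENA's or PaperBLAST's schema), and a feature is dropped if its nucleotide entry or any of its linked papers has more than 25 qualifying proteins.

```python
# Illustrative only: apply the ENA inclusion rules to simplified records.
# CdsFeature and its fields are hypothetical, not ENA's or PaperBLAST's schema.
from collections import Counter
from dataclasses import dataclass, field

MAX_PROTEINS_PER_SOURCE = 25


@dataclass
class CdsFeature:
    protein_id: str
    entry_id: str             # nucleotide entry the CDS belongs to
    data_class: str           # e.g. "STD"
    has_experiment_tag: bool
    pubmed_ids: list = field(default_factory=list)


def select_ena_features(features):
    """Keep CDS features that pass the rules described in the text."""
    # Pass 1: /experiment tag set, linked to PubMed, STD data class.
    candidates = [
        f for f in features
        if f.has_experiment_tag and f.pubmed_ids and f.data_class == "STD"
    ]
    # Pass 2: count candidates per nucleotide entry and per paper, then
    # drop those from sources with more than 25 such proteins.
    per_entry = Counter(f.entry_id for f in candidates)
    per_paper = Counter(p for f in candidates for p in set(f.pubmed_ids))
    return [
        f for f in candidates
        if per_entry[f.entry_id] <= MAX_PROTEINS_PER_SOURCE
        and all(per_paper[p] <= MAX_PROTEINS_PER_SOURCE for p in f.pubmed_ids)
    ]
```

The second pass is what removes bulk annotations, where many proteins share one entry or paper and the function of each was probably not studied individually.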
Except for GeneRIF and ENA,
the curated entries include a short curated
description of the protein's function.
For entries from BioLiP, the protein's function may not be known beyond binding to the ligand.
Many of these entries also link to articles in PubMed.
For more information see the
PaperBLAST paper (mSystems 2017)
or the code.
You can download PaperBLAST's database here.
Changes to PaperBLAST since the paper was written:
- November 2023: incorporated PRODORIC and RegPrecise. Many PRODORIC entries were not linked to a protein sequence (no UniProt identifier), so we added this information.
- February 2023: BioLiP changed their download format. PaperBLAST now includes their non-redundant subset. SitesBLAST and Sites on a Tree use a larger non-redundant subset that ensures that every ligand is represented within each cluster. This should ensure that every binding site is represented.
- June 2022: incorporated some coding sequences from ENA with the /experiment tag.
- March 2022: incorporated BioLiP.
- April 2020: incorporated TCDB.
- April 2019: EuropePMC now returns table entries in their search results. This has expanded PaperBLAST's database, but most of the new entries are of low relevance, and the resulting snippets are often just lists of locus tags with annotations.
- February 2018: the alignment page reports the conservation of the hit's functional sites (if available from Swiss-Prot or UniProt).
- January 2018: incorporated BRENDA.
- December 2017: incorporated MetaCyc, CharProtDB, CAZy, REBASE, and the reannotations from the Fitness Browser.
- September 2017: EuropePMC no longer returns some table entries in their search results. This has shrunk PaperBLAST's database, but has also reduced the number of low-relevance hits.
Many of these changes are described in Interactive tools for functional annotation of bacterial genomes.
PaperBLAST cannot provide snippets for many of the papers that are
published in non-open-access journals. This limitation applies even if
the paper is marked as "free" on the publisher's web site and is
available in PubmedCentral or EuropePMC. If a journal that you publish
in is marked as "secret," please consider publishing elsewhere.
Many important articles are missing from PaperBLAST, either because
the article's full text is not in EuropePMC (as for many older
articles), or because the paper does not mention a protein identifier such as a locus tag, or because of PaperBLAST's heuristics. If you notice an
article that characterizes a protein's function but is missing from
PaperBLAST, please notify the curators at UniProt
or add an entry to GeneRIF.
Entries in either of these databases will eventually be incorporated
into PaperBLAST. Note that to add an entry to UniProt, you will need
to find the UniProt identifier for the protein. If the protein is not
already in UniProt, you can ask them to create an entry. To add an
entry to GeneRIF, you will need an NCBI Gene identifier, but
unfortunately many prokaryotic proteins in RefSeq do not have
corresponding Gene identifiers.
References
PaperBLAST: Text-mining papers for information about homologs.
M. N. Price and A. P. Arkin (2017). mSystems, 10.1128/mSystems.00039-17.
Europe PMC in 2017.
M. Levchenko et al (2017). Nucleic Acids Research, 10.1093/nar/gkx1005.
Gene indexing: characterization and analysis of NLM's GeneRIFs.
J. A. Mitchell et al (2003). AMIA Annu Symp Proc 2003:460-464.
UniProt: the universal protein knowledgebase.
The UniProt Consortium (2016). Nucleic Acids Research, 10.1093/nar/gkw1099.
BRENDA in 2017: new perspectives and new tools in BRENDA.
S. Placzek et al (2017). Nucleic Acids Research, 10.1093/nar/gkw952.
The EcoCyc database: reflecting new knowledge about Escherichia coli K-12.
I. M. Keseler et al (2016). Nucleic Acids Research, 10.1093/nar/gkw1003.
The MetaCyc database of metabolic pathways and enzymes.
R. Caspi et al (2018). Nucleic Acids Research, 10.1093/nar/gkx935.
CharProtDB: a database of experimentally characterized protein annotations.
R. Madupu et al (2012). Nucleic Acids Research, 10.1093/nar/gkr1133.
The carbohydrate-active enzymes database (CAZy) in 2013.
V. Lombard et al (2014). Nucleic Acids Research, 10.1093/nar/gkt1178.
The Transporter Classification Database (TCDB): recent advances.
M. H. Saier, Jr. et al (2016). Nucleic Acids Research, 10.1093/nar/gkv1103.
REBASE - a database for DNA restriction and modification: enzymes, genes and genomes.
R. J. Roberts et al (2015). Nucleic Acids Research, 10.1093/nar/gku1046.
Deep annotation of protein function across diverse bacteria from mutant phenotypes.
M. N. Price et al (2016). bioRxiv, 10.1101/072470.
by Morgan Price,
Arkin group
Lawrence Berkeley National Laboratory