TremaGene : tremagene faq

TremaGene FAQ

TremaGene is our repository of available trematode genes & transcripts. Data is pulled from multiple sources: collaborators, published datasets, and/or internal assemblies that have been annotated for genes & gene-related features. Typically we host only the most current geneset available for each species; as versions change, older data is removed. Almost all included gene-specific annotation (e.g. IPR ids, KEGG identifiers, expression data, etc.) is run locally and stored in a backend MySQL db. Annotations are integrated with other site resources: genes annotated with KEGG identifiers provide links into our TremaPath viewer, GO term annotations are linked to the GO Consortium term definitions, and so forth. If the user drills down to view a single gene, that gene's sequence can be forwarded into our TremaBlast tool to search against our other hosted datasets.

One of the more useful features of TremaGene is the ability to define a slice of the hosted data and then download its protein and/or nucleotide sequence (as available). Using the filters described below, a user can define the set of genes they want and then dump them into a fasta file for download. Users can download subsets defined by specific annotations (e.g. all F. hepatica genes annotated with the IPR id IPR006548), or download protein fasta for an entire species (or multiple species) at once.

Searching TremaGene

TremaGene can be searched using IPR, GO and/or KO id filters. First click on the [+] Expand label for the Species selection section and select 1 or more species to start your query. Note that if you select no species from the list, your query will be applied to all species in TremaGene (depending on the complexity of your query, this may take a long time to complete). After selecting the species to focus on, open the sections below to set the specific filters you'd like to apply. You can request a specific gene name, an orthologous group name (if one is available), an IPR id, a GO term and/or a KO id. Each of these fields also accepts a comma-delimited list. Note that filtering on multiple ids of a single type will return genes/transcripts annotated by any of those ids. However, if you set filters using 2 or more id types (e.g. IPR id + KO id), each gene or transcript returned must have at least one id from each list you supplied.
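The way the filters combine (any id within one type, at least one id from every type used) can be illustrated with a minimal Python sketch. This is not TremaGene's actual backend code: the function names, the dictionary layout, and the example gene names and their annotations are all hypothetical and serve only to show the logic.

```python
from typing import Dict, Iterable, List, Set

def matches(annotations: Dict[str, Set[str]],
            filters: Dict[str, Iterable[str]]) -> bool:
    """Return True if a gene passes every non-empty filter list.

    Within one id type the ids are alternatives (OR); across id types
    every supplied list must be satisfied (AND).
    """
    for id_type, wanted in filters.items():
        wanted_ids = set(wanted)
        if not wanted_ids:
            continue  # an empty list places no constraint
        if not (annotations.get(id_type, set()) & wanted_ids):
            return False
    return True

def run_query(genes: Dict[str, Dict[str, Set[str]]],
              filters: Dict[str, Iterable[str]]) -> List[str]:
    """Return the sorted names of all genes that pass the filters."""
    return sorted(name for name, ann in genes.items() if matches(ann, filters))

# Hypothetical genes and annotations, for illustration only.
genes = {
    "gene_A": {"IPR": {"IPR006548"}, "KO": {"K00001"}},
    "gene_B": {"IPR": {"IPR006548"}, "KO": set()},
    "gene_C": {"IPR": {"IPR000001"}, "KO": {"K00002"}},
}

# Two ids of one type: a gene needs only one of them.
print(run_query(genes, {"IPR": ["IPR006548", "IPR000001"]}))
# ['gene_A', 'gene_B', 'gene_C']

# Two id types: a gene needs at least one id from each list.
print(run_query(genes, {"IPR": ["IPR006548"], "KO": ["K00001", "K00002"]}))
# ['gene_A']
```

In the second query gene_B is dropped because it has no KO id at all, and gene_C is dropped because its IPR id is not in the IPR list.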

After running your query, you will arrive at a page showing the slice of data you've retrieved from TremaGene. The Query Definition section displays the query that produced the results below it. The Data Download section allows you to download the full fasta for all the genes/transcripts you requested. Note that if the type of fasta you request (nucleotide or protein) is not available for some of the genes or transcripts in your list, the output file will still include their names as headers, but the sequence for those records will be empty. The Results section lists all the genes and/or transcripts in your return set, organized by species, then by group if available (typically groups built by software such as OrthoMCL or InParanoid). Each gene or transcript name is a link to a detail page showing the available annotations for that entity. From that page you can also download that single entity, or forward its sequence into TremaBlast for further analysis.
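Because a downloaded file can contain headers with no sequence, it may be worth checking for empty records before further analysis. The following Python sketch shows one way to do that. It assumes standard fasta layout (a ">" header line followed by zero or more sequence lines) and takes the first whitespace-delimited token of the header as the record name; the record names and sequences in the example are made up.

```python
from typing import Dict, List

def parse_fasta(text: str) -> Dict[str, str]:
    """Parse fasta text into {record name: sequence}, keeping empty records."""
    records: Dict[str, str] = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the record name is the first token of the header.
            name = header.split()[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence may span several lines
    return records

def empty_records(records: Dict[str, str]) -> List[str]:
    """Return the names of records that have a header but no sequence."""
    return [name for name, seq in records.items() if not seq]

# Made-up example: gene_2 has a header but no sequence.
example = """>gene_1
MKTAYIAK
QRQISFVK
>gene_2
>gene_3
MSLLT
"""

records = parse_fasta(example)
print(empty_records(records))  # ['gene_2']
```

Records reported by `empty_records` can then be excluded, or requested again in the other sequence type.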

Index of prefixes

Here is an index showing the version and organism gene name prefixes referenced in TremaGene:
Clonorchis sinensis (50HG assembly version: 3.5)            CS1
Echinostoma caproni (50HG assembly version: 1.5.4)          ECPE
Fasciola hepatica (50HG assembly version: 1.0)              D915
Opisthorchis viverrini (NCBI BioProject id: PRJNA222628)    KER
Schistosoma curassoni (50HG assembly version: 1.0.4)        SCUD
Schistosoma haematobium (50HG assembly version: 3.0)        MS3
Schistosoma japonicum (50HG assembly version: 1.0)          SJC
Schistosoma mansoni (50HG assembly version: 5.2)            SMP
Schistosoma margrebowiei (50HG assembly version: 1.5.4)     SMRZ
Schistosoma mattheei (50HG assembly version: 1.0.4)         SMTD
Schistosoma rodhaini (50HG assembly version: 1.0.4)         SROB
Trichobilharzia regenti (50HG assembly version: 1.0.4)      TRE
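
When working with a downloaded fasta file that mixes species, the prefixes above can be used to assign each gene to its organism. The Python sketch below assumes that a gene name starts with its prefix and that matching can ignore case; both are assumptions about the naming scheme, and the gene names in the example are hypothetical.

```python
from typing import Optional

# Prefix-to-organism mapping copied from the index above.
PREFIX_TO_SPECIES = {
    "CS1": "Clonorchis sinensis",
    "ECPE": "Echinostoma caproni",
    "D915": "Fasciola hepatica",
    "KER": "Opisthorchis viverrini",
    "SCUD": "Schistosoma curassoni",
    "MS3": "Schistosoma haematobium",
    "SJC": "Schistosoma japonicum",
    "SMP": "Schistosoma mansoni",
    "SMRZ": "Schistosoma margrebowiei",
    "SMTD": "Schistosoma mattheei",
    "SROB": "Schistosoma rodhaini",
    "TRE": "Trichobilharzia regenti",
}

def species_for_gene(gene_name: str) -> Optional[str]:
    """Return the organism for a gene name, or None if no prefix matches.

    Assumption: the gene name begins with its prefix (case-insensitive).
    """
    upper_name = gene_name.upper()
    # Check longer prefixes first so a short prefix cannot shadow a longer one.
    for prefix in sorted(PREFIX_TO_SPECIES, key=len, reverse=True):
        if upper_name.startswith(prefix):
            return PREFIX_TO_SPECIES[prefix]
    return None

# Hypothetical gene names, for illustration only.
print(species_for_gene("D915_00001"))  # Fasciola hepatica
print(species_for_gene("Smp_000010"))  # Schistosoma mansoni
print(species_for_gene("XYZ_1"))       # None
```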

TremaGene annotations

Annotations provided in TremaGene are:

InterPro id (IPR): "InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium." (description lifted from the InterPro website)

Gene Ontology term (GO): GO is at the heart of a (largely successful) effort to standardize the descriptions of gene products across databases. The effort was founded by FlyBase, Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD), but has now been adopted by many databases. GO terms form a controlled vocabulary that is a useful annotation for disseminating the function & localization of a gene.

KEGG identifier (KO): "KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies." (description lifted from the KEGG website)

RNAseq-based expression in FPKM: Gene expression per stage and/or tissue is described in terms of FPKM (Fragments Per Kilobase of exon per Million fragments mapped). Illumina RNAseq reads are mapped against a genome, and the htseq-count program is used to count the reads hitting each coding exon in the reference. The read counts for all coding exons of a gene are summed, and then this formula is applied to get FPKM:

FPKM = number_reads_hitting_gene / (gene_length_in_Kbp * library_size)
where library_size is the sum of all reads mapped to the coding exons of all genes in all samples for the given stage or tissue in millions (i.e. the sum of all reads / 1000000)
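
As a worked example, the formula can be written as a short Python function. This is only an illustrative sketch, not the code used to populate TremaGene; the function name, parameter names, and input numbers are hypothetical, and the gene length is assumed to be supplied in base pairs and converted to Kbp.

```python
def fpkm(reads_hitting_gene: int,
         gene_length_bp: int,
         total_mapped_reads: int) -> float:
    """Compute FPKM = reads / (gene length in Kbp * library size in millions)."""
    gene_length_kbp = gene_length_bp / 1000
    library_size = total_mapped_reads / 1_000_000  # reads, in millions
    if gene_length_kbp <= 0 or library_size <= 0:
        raise ValueError("gene length and library size must be positive")
    return reads_hitting_gene / (gene_length_kbp * library_size)

# Hypothetical numbers: 500 reads hit a 2,000 bp gene, and 10,000,000 reads
# in total mapped to coding exons for this stage or tissue.
# FPKM = 500 / (2 * 10) = 25.0
print(fpkm(500, 2000, 10_000_000))  # 25.0
```

Doubling the gene length or the library size while holding the read count fixed halves the FPKM, which is why values are comparable across genes of different lengths and libraries of different depths.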

Larger values mean more expression. This value represents the FPKM detected across all data we have for that gene and stage (or tissue) combination. Be aware that these values will be recalculated when additional data for the given gene and stage (or tissue) combination becomes available.
The Genome Institute Washington University School of Medicine