| Online Publication | Bioinformatics | Report | http://werner.yellowcouch.org/Papers/tablemapping/ |
Werner Van Belle1* - werner@yellowcouch.org, werner.van.belle@gmail.com
1- Department of Medical Genetics (MEDGEN); University Hospital Northern Norway (UNN); Tromsų; Norway
Abstract: This document describes cross joining of gene tables between Celera's mouse genome identifiers, Celera human genome identifiers and the more useful Ensembl identifiers. The context in which this research is set are the genes FKRP and TAF4. By using siRNA's we interfered with the transcription and measured their effect upon the transcriptome. The Applied Biosystems 1700 micro-array scanner measured and reported the transcription quantities. Two micro-array types were used: the human genome survey v2.0 micro-arrays and the mouse genome survey v2.0 micro-arrays. Based on the different micro-array measurements we wanted to predict which proteins would be influenced in a cell system if we know the up/down regulation of the measured probes. To this end we wanted to use the human protein interaction map (as defines by Rual'06), which uses Ensembl annotated genes. This of course formed a major problem. First, the Applied Biosystems scanner does not export Ensembl gene annotations. Secondly, the human protein interaction map might not be a good model for a mouse micro-array, so we needed to go through various orthologs. This document tells two stories: first, and most annoyingly: how to get Ensembl identifiers into an Applied Biosystems micro-array. Secondly, and slightly more interesting, how to retrieve a mouse to human ortholog mapping from Ensembl.
Keywords: ortholog gene mapping celere gene identifiers applied biosystems gene identifiers probe sequences
Reference: Werner Van Belle; (Ortholog) Mapping the Applied Biosystems Human/Mouse Survey v2.0 Micro-arrays to Ensembl Gene Identifiers; Bioinformatics; Yellowcouch Scientific; 23 pages; May 2007
Files: mcg2ensemusg.csv.gz, hcg2ensg.csv.gz, ensmus2ensg.csv.gz, genemappings.pdf, sql-source.tgz
In this document we work with three forms of information. First there is a local database, which we call FkrpTaf4. It will contain all the information we need to perform further analysis. Secondly there is the public accessible ensembldb. Most people know this through the web-interface. What is probably lesser known is that this database is also immediately accessible through SQL, which makes it optimal for our purposes (http://ensembldb.ensembl.org). The third set of information are local comma separated file (CSV), which are used a) to transfer information from the Ensembl database into our own FkrpTaf4 database and b) to import data from the Applied Biosystems into our FkrpTaf4 database).
For each query we do not mention whether something is a temporary table neither do we drop tables if they already exist. This is properly done in the actual SQL files, but since it is of little relevance to understanding what is happening we omit this information here. There is however one catch: if a table is a temporary table it cannot be reused within the same query. In that case, one might need to make a copy into a second temporary table. For each created table we also list a small example output that illustrates the contents of each table.
To execute a query on a database one can do the following mysql -user=werner -D FkrpTaf4 -A -batch <import-ab.sql
For the Ensembl database one can use mysql -user=anonymous
-h ensembldb.ensembl.org -D homo_sapiens_core_44_36f -A -batch
<swissprot2ensembl.sql >imports/swissprot2ensg.csv
or mysql -user=anonymous -h ensembldb.ensembl.org -D mus_musculus_core_44_36e
-A
-batch <swissprot2ensembl.sql >imports/swissprot2ensmusg.csv
We measured the influence of FKRP and TAF4 on various cell systems through measuring the transcriptional changes with human and mouse genome survey arrays. Once this was done, we aimed to integrate this data into a protein interaction map as to find the proteins that are likely influenced mostly by the proteins of interest. We encountered some major obstacles to this approach. First, the Applied Biosystems 1700 scanner does not provide Ensembl annotated genes. Secondly, the Applied Biosystems 1700 scanner does not provide the probe sequences, making an automatic mapping to either the mouse or human genome more complicated than it should be and thirdly: exporting the Unigene/Swissprot annotated genes from the Applied Biosystems 1700 machine was prohibitively slow. In the end we exported all tables using a tedious 10 columns at a time approach. This lead to 6 tables that we could join afterward. One conducting high throughput proteomics will find this 'small bug' a major issue, since the export times become prohibitively long. In other words: aside from the fact that Applied Biosystems provides only 56% useful measurements, the Applied Biosystems 1700 scanner also seems rather unusable in high throughput settings.
To import the Applied Biosystems tables, we exported three different experiments in 6 separate files. We also had to clean out some spaces that were added in the allset1 data. When this operation is performed, we have three tables: FkrpSiRna1, FkrpSiRna2 and FkrpScrambled. The file import-ab contains all the details on the import process.
To illustrate the mechanism, we elaborate somewhat on allset1. The target table must first be created, thereby reflecting the columns as they occur in the original Applied Biosystems CSV tables. We also introduce assay_name, probe_id, gene_name and sample_name as indices since we later need to join on these columns. If we don't do this, most operations will be extremely slow. The query below is ran on the FkrpTaf4 schema.
(Assay_Name VARCHAR(128),
INDEX (Assay_Name, Probe_ID, Gene_ID, Sample_Name),
Row FLOAT,
Col FLOAT,
Probe_ID VARCHAR(128),
Probe_Type TEXT,
Gene_ID VARCHAR(128),
X FLOAT,
Y FLOAT,
Assay_Normalized_Signal FLOAT,
Signal FLOAT,
CL_Sig FLOAT,
CL_Raw FLOAT,
SDEV FLOAT,
CV FLOAT,
S_N FLOAT,
CL_Sig_Error FLOAT,
CL_Raw_SDEV FLOAT,
Flags INT,
Sample_Name VARCHAR(128),
id INT AUTO_INCREMENT PRIMARY KEY);
INTO TABLE allset1;
Of course, it would be nice if the sample_name could be compared between table1 and table6. In practice, this could not be done since alslet1 included a 10 character at the end of each sample name. To get rid of those we needed the following update. The query below is ran on the FkrpTaf4 schema.
Assay_name,row,col,probe_id,probe_type,
gene_id,x,y,assay_normalized_signal,
signal,cl_sig,cl_raw,sdev,cv,s_n,
cl_sig_error,cl_raw _sdev,flags,
Trim('\r' FROM sample_name),id
FROM allset1;
| Assay_name | Row | Col | Probe_id | Probe_type | gene_id |
+-------------+---+---+-----+------+-------+
| HB00588 3/1/07 12:12 PM | 189 | 77 | 100002 | probe | hCG1643199.4 |
| HB00588 3/1/07 12:12 PM | 45 | 39 | 100003 | probe | hCG2041918 |
| HB00588 3/1/07 12:12 PM | 109 | 152 | 100027 | probe | hCG31426.2 |
| HB00588 3/1/07 12:12 PM | 70 | 7 | 100036 | probe | hCG1979099.1 |
| HB00588 3/1/07 12:12 PM | 173 | 129 | 100037 | probe | hCG42687.4 |
| HB00588 3/1/07 12:12 PM | 114 | 54 | 100039 | probe | hCG2015782 |
| HB00588 3/1/07 12:12 PM | 68 | 51 | 100044 | probe | hCG36953.3 |
| HB00588 3/1/07 12:12 PM | 153 | 123 | 100045 | probe | hCG1776836.3 |
| HB00588 3/1/07 12:12 PM | 75 | 157 | 100051 | probe | hCG1642464.3 |
| HB00588 3/1/07 12:12 PM | 146 | 139 | 100052 | probe | hCG22993.3 |
+-----+-----+-------------+-----+-----+-----+-----+
| x | y | assay_normalized_signal | signal | cl_sig | cl_raw | sdev |
+-----+-----+-------------+-----+-----+-----+-----+
| 991.21 | 1135.54 | 161.22 | 171638 | 170314 | 177617 | 1233.32 |
| 575.93 | 875.26 | 0.15 | 194.56 | 140.64 | 3674.4 | 194.56 |
| 1792.77 | 1560.84 | 0.2 | 251.37 | -38.76 | 4600.85 | 251.37 |
| 232.78 | 1145.18 | 8.32 | 10512.5 | 11065.4 | 16749.1 | 691.04 |
| 1549.67 | 960.67 | 44.04 | 46886.9 | 45509.2 | 51538.9 | 590.25 |
| 740.1 | 1617.52 | 11.22 | 14174.1 | 13962.3 | 20199.8 | 579.15 |
| 706.61 | 1122.22 | 0.17 | 212.63 | -75.6 | 4337.4 | 212.63 |
| 1484.97 | 746.02 | 0.26 | 277.94 | -626.52 | 4945.81 | 277.94 |
| 1846.74 | 1194.41 | 0.23 | 291.73 | -496.82 | 4405.79 | 291.73 |
| 1655.39 | 670.39 | 0.58 | 621.79 | -64.09 | 6495.53 | 621.79 |
+----+----+-------+-------+----+-------+--+
| cv | s_n | CL_sig_error | CL_Raw_sdev | Flags | sample_name | id |
+----+----+-------+-------+----+-------+--+
| 0.05 | 139.17 | 296.57 | 0 | 0 | II-1 | 2 |
| 1.14 | 0.88 | 103.51 | 0 | 1 | II-1 | 3 |
| 31.09 | -0.03 | 131.81 | 0 | 1 | II-1 | 4 |
| 0.08 | 15.21 | 173.15 | 0 | 0 | II-1 | 5 |
| 0.05 | 79.44 | 174.7 | 0 | 0 | II-1 | 6 |
| 0.06 | 24.47 | 189.76 | 0 | 0 | II-1 | 7 |
| 4.14 | -0.24 | 125.93 | 0 | 1 | II-1 | 8 |
| 0.51 | -1.96 | 98.86 | 0 | 1 | II-1 | 9 |
| 0.61 | -1.65 | 138.64 | 0 | 1 | II-1 | 10 |
| 5.63 | -0.18 | 113.36 | 0 | 1 | II-1 | 11 |
The following query imports, among other things, the important mCG, Swissprot and Unigene identifiers into the FkrpTaf4 database. The query below is ran on the FkrpTaf4 schema.
(Assay_Name VARCHAR(128),
Probe_ID VARCHAR(128),
Gene_ID VARCHAR(128),
Sample_Name VARCHAR(128),
INDEX (Assay_Name, Probe_ID, Gene_ID, Sample_Name),
Status TEXT,
SwissProt TEXT,
UniGene TEXT,
dbEST TEXT,
hCG TEXT,
hCP TEXT,
hCT TEXT,
mCG TEXT,
mCP TEXT,
mCT TEXT,
rCG TEXT,
rCP TEXT,
rCT TEXT);
LOAD DATA LOCAL INFILE 'imports/all-set6.ab.csv'
INTO TABLE allset6;
| assay_name | probe_id | gene_id | sample_name | status |
+-------------+-----+-------+-------+------+
| HB00588 3/1/07 12:12 PM | 100002 | hCG1643199.4 | II-1 | pseudogene |
| HB00588 3/1/07 12:12 PM | 100003 | hCG2041918 | II-1 | current |
| HB00588 3/1/07 12:12 PM | 100027 | hCG31426.2 | II-1 | current |
| HB00588 3/1/07 12:12 PM | 100036 | hCG1979099.1 | II-1 | current |
| HB00588 3/1/07 12:12 PM | 100037 | hCG42687.4 | II-1 | current |
| HB00588 3/1/07 12:12 PM | 100039 | hCG2015782 | II-1 | current |
| HB00588 3/1/07 12:12 PM | 100044 | hCG36953.3 | II-1 | current |
| HB00588 3/1/07 12:12 PM | 100045 | hCG1776836.3 | II-1 | current |
| HB00588 3/1/07 12:12 PM | 100051 | hCG1642464.3 | II-1 | current |
| HB00588 3/1/07 12:12 PM | 100052 | hCG22993.3 | II-1 | current |
+------------------+------+--------------------------+
| swissprot | unigene | hcg |
+------------------+------+--------------------------+
| | | hCG1643199.4 |
| | | hCG2041918 |
| O95201;P13682;P17027;P51523;Q15776 | Hs.57679 | hCG31426.2 |
| Q92610 | Hs.368756 | hCG1820838.1;hCG1979099.1;hCG1989348;hCG1994281.1 |
| | Hs.302903 | hCG42687.4 |
| Q9UJX3 | Hs.530379 | hCG2015782 |
| | Hs.278954 | hCG36953.3 |
| | | hCG1776836.3 |
| | | hCG1642464.3 |
| | Hs.199068 | hCG22993.3 |
We extract 3 different tables: FkrpSiRna1, FkrpSiRna2 and FkrpScrambled using the following SQL statements. Each table will have duplicate rows for specific genes. This is because they have also been measured multiple times. FkrpSiRna2 is smaller than the two others since we only had two slides and not three.
Allset1 and allset6 are imported in 2.1.1 and 2.1.2. The query below is ran on the FkrpTaf4 schema.
SELECT allset1.Gene_ID, Assay_Normalized_Signal
FROM allset6, allset1
WHERE (allset6.Sample_Name="1-1"
or allset6.Sample_Name="2-1"
or allset6.Sample_Name="3-1")
and allset6.sample_name=allset1.sample_name
and allset1.Assay_Name=allset6.Assay_Name
and allset1.gene_id=allset6.gene_id
and allset1.Probe_ID=allset6.Probe_ID;
| Gene_ID | Assay_Normalized_Signal |
+-------+-------------+
| AK079773.1 | 1.58 |
| mCG9222.3 | 19.46 |
| mCG5316.2 | 0.31 |
| mCG1036527.1 | 6.28 |
| mCG1050139 | 0.2 |
| mCG121612 | 36.77 |
| mCG142727 | 0.14 |
| mCG130331.1 | 0.35 |
| mCG1045481.1 | 1.99 |
| mCG141353 | 0.34 |
Allset1 and allset6 are imported in 2.1.1 and 2.1.2. The query below is ran on the FkrpTaf4 schema.
SELECT allset1.Gene_ID, Assay_Normalized_Signal
FROM allset6, allset1
WHERE (allset6.Sample_Name="1-2"
or allset6.Sample_Name="2-2")
and allset6.sample_name=allset1.sample_name
and allset1.Assay_Name=allset6.Assay_Name
and allset1.gene_id=allset6.gene_id
and allset1.Probe_ID=allset6.Probe_ID;
+-------+-------------+
| Gene_ID | Assay_Normalized_Signal |
+-------+-------------+
| AK079773.1 | 1.33 |
| mCG9222.3 | 15.48 |
| mCG5316.2 | 0.19 |
| mCG1036527.1 | 4 |
| mCG1050139 | 0.2 |
| mCG121612 | 40.8 |
| mCG142727 | 0.21 |
| mCG130331.1 | 0.45 |
| mCG1045481.1 | 6.6 |
| mCG141353 | 0.33 |
Allset1 and allset6 are imported in 2.1.1 and 2.1.2. The query below is ran on the FkrpTaf4 schema.
SELECT allset1.Gene_ID, Assay_Normalized_Signal
FROM allset6, allset1
WHERE (allset6.Sample_Name="1-3"
or allset6.Sample_Name="2-3"
or allset6.Sample_Name="3-3")
and allset6.sample_name=allset1.sample_name
and allset1.Assay_Name=allset6.Assay_Name
and allset1.gene_id=allset6.gene_id
and allset1.Probe_ID=allset6.Probe_ID;
| Gene_ID | Assay_Normalized_Signal |
+-------+-------------+
| AK079773.1 | 1.67 |
| mCG9222.3 | 14.78 |
| mCG5316.2 | 0.96 |
| mCG1036527.1 | 2.03 |
| mCG1050139 | 0.12 |
| mCG121612 | 40.13 |
| mCG142727 | 0.13 |
| mCG130331.1 | 0.26 |
| mCG1045481.1 | 1.93 |
| mCG141353 | 0.23 |
The query below is ran on the FkrpTaf4 schema.
SELECT allset1.Gene_ID, Assay_Normalized_Signal
FROM allset6, allset1
WHERE (allset6.Sample_Name="I-1"
or allset6.Sample_Name="I-2"
or allset6.Sample_Name="I-3"
or allset6.Sample_Name="I-4")
and allset6.sample_name=allset1.sample_name
and allset1.Assay_Name=allset6.Assay_Name
and allset1.gene_id=allset6.gene_id
and allset1.Probe_ID=allset6.Probe_ID;
| Gene_ID | Assay_Normalized_Signal |
+-------+-------------+
| hCG2041918 | 0.08 |
| hCG31426.2 | 0.39 |
| hCG1979099.1 | 7.13 |
| hCG42687.4 | 40.41 |
| hCG2015782 | 9.79 |
| hCG36953.3 | 0.08 |
| hCG1776836.3 | 0.37 |
| hCG1642464.3 | 0.21 |
| hCG22993.3 | 0.17 |
| hCG1793655.1 | 3.59 |
The query below is ran on the FkrpTaf4 schema.
SELECT allset1.Gene_ID, Assay_Normalized_Signal
FROM allset6, allset1
WHERE (allset6.Sample_Name="II-1"
or allset6.Sample_Name="II-2"
or allset6.Sample_Name="II-3")
and allset6.sample_name=allset1.sample_name
and allset1.Assay_Name=allset6.Assay_Name
and allset1.gene_id=allset6.gene_id
and allset1.Probe_ID=allset6.Probe_ID;
| Gene_ID | Assay_Normalized_Signal |
+-------+-------------+
| hCG2041918 | 0.15 |
| hCG31426.2 | 0.2 |
| hCG1979099.1 | 8.32 |
| hCG42687.4 | 44.04 |
| hCG2015782 | 11.22 |
| hCG36953.3 | 0.17 |
| hCG1776836.3 | 0.26 |
| hCG1642464.3 | 0.23 |
| hCG22993.3 | 0.58 |
| hCG1793655.1 | 4.2 |
The query below is ran on the FkrpTaf4 schema.
SELECT allset1.Gene_ID, Assay_Normalized_Signal
FROM allset6, allset1
WHERE (allset6.Sample_Name="Si I"
or allset6.Sample_Name="Si II"
or allset6.Sample_Name="Si III"
or allset6.Sample_Name="Si IV")
and allset6.sample_name=allset1.sample_name
and allset1.Assay_Name=allset6.Assay_Name
and allset1.gene_id=allset6.gene_id
and allset1.Probe_ID=allset6.Probe_ID;
| Gene_ID | Assay_Normalized_Signal |
+-------+-------------+
| hCG2041918 | 0.16 |
| hCG31426.2 | 0.12 |
| hCG1979099.1 | 8.83 |
| hCG42687.4 | 53.12 |
| hCG2015782 | 14.13 |
| hCG36953.3 | 0.11 |
| hCG1776836.3 | 0.22 |
| hCG1642464.3 | 0.14 |
| hCG22993.3 | 0.54 |
| hCG1793655.1 | 1.02 |
The query below is ran on the FkrpTaf4 schema.
SELECT allset1.Gene_ID, Assay_Normalized_Signal as signal
FROM allset6, allset1
WHERE (allset6.Sample_Name="Scr I"
or allset6.Sample_Name="Scr II"
or allset6.Sample_Name="Scr III"
or allset6.Sample_Name="Scr IV")
and allset6.sample_name=allset1.sample_name
and allset1.Assay_Name=allset6.Assay_Name
and allset1.gene_id=allset6.gene_id
and allset1.Probe_ID=allset6.Probe_ID;
| Gene_ID | Assay_Normalized_Signal |
+-------+-------------+
| hCG2041918 | 0.55 |
| hCG31426.2 | 0.16 |
| hCG1979099.1 | 10.38 |
| hCG42687.4 | 56.67 |
| hCG2015782 | 11.33 |
| hCG36953.3 | 0.3 |
| hCG1776836.3 | 0.31 |
| hCG1642464.3 | 1.23 |
| hCG22993.3 | 0.32 |
| hCG1793655.1 | 1.61 |
To integrate the human protein interaction map into the previously mentioned network, we need to go through the external annotated identifiers. The Applied Biosystems 1700 scanner can produce tables that include the Unigene and Swissprot identifiers. Based on these we can find the Ensembl genes and thus link them back to our hCG identifiers.
To make this work we first need to find various sources of data in the Ensembl database.
The query below can be ran on the homo_sapiens_core_44_36f ensembldb schema or on the mus_musculus_core_44_36e schema. Depending on the choice of database we map ENSG... or ENSMUSG... identifiers to their description.
FROM gene_stable_id JOIN gene using (gene_id);
| stable_id | description |
+---------+----------------------------------------------------------------------------------------------------------+
| ENSG00000129824 | 40S ribosomal protein S4, Y isoform 1. [Source:Uniprot/SWISSPROT;Acc:P22090] |
| ENSG00000067646 | Zinc finger Y-chromosomal protein. [Source:Uniprot/SWISSPROT;Acc:P08048] |
| ENSG00000176679 | Homeobox protein TGIF2LY (TGFB-induced factor 2-like protein, Y- linked) (TGF(beta)induced transcription factor 2-like protein) (TGIF- like on the Y). [Source:Uniprot/SWISSPROT;Acc:Q8IUE0] | | ENSG00000099715 | Protocadherin-11 Y-linked precursor (Protocadherin-11) (Protocadherin- 22) (Protocadherin on the Y chromosome) (PCDH-Y) (Protocadherin prostate cancer) (Protocadherin-PC). [Source:Uniprot/SWISSPROT;Acc:Q9BZA8] | | ENSG00000173394 | NULL |
| ENSG00000168757 | testis specific protein, Y-linked 2 [Source:RefSeq_peptide;Acc:NP_072095] |
| ENSG00000186406 | RNA binding motif (Fragment). [Source:Uniprot/SPTREMBL;Acc:Q13381] |
| ENSG00000129816 | testis-specific transcript, Y-linked 1 (TTTY1) on chromosome Y [Source:RefSeq_dna;Acc:NR_001538] |
| ENSG00000197285 | testis-specific transcript, Y-linked 2 (TTTY2) on chromosome Y [Source:RefSeq_dna;Acc:NR_001536] |
| ENSG00000206198 | testis-specific transcript, Y-linked 21 (TTTY21) on chromosome Y [Source:RefSeq_dna;Acc:NR_001535] |
| stable_id | description |
+----------+------------------------------------------------------------+
| ENSMUSG00000053211 | zinc finger protein 2, Y linked [Source:MarkerSymbol;Acc:MGI:99213] |
| ENSMUSG00000068457 | ubiquitously transcribed tetratricopeptide repeat gene, Y chromosome [Source:MarkerSymbol;Acc:MGI:894810] |
| ENSMUSG00000069053 | Ubiquitin-activating enzyme E1 Y (Ubiquitin-activating enzyme E1). [Source:Uniprot/SWISSPROT;Acc:P31254] |
| ENSMUSG00000056673 | jumonji, AT rich interactive domain 1D (Rbp2 like) [Source:MarkerSymbol;Acc:MGI:99780] |
| ENSMUSG00000069049 | eukaryotic translation initiation factor 2, subunit 3, structural gene Y-linked [Source:MarkerSymbol;Acc:MGI:1349430] |
| ENSMUSG00000069045 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked [Source:MarkerSymbol;Acc:MGI:1349406] |
| ENSMUSG00000069044 | ubiquitin specific peptidase 9, Y chromosome [Source:MarkerSymbol;Acc:MGI:1313274] |
| ENSMUSG00000069618 | RIKEN cDNA 1700012B15 gene [Source:MarkerSymbol;Acc:MGI:1921423] |
| ENSMUSG00000020671 | RAB10, member RAS oncogene family [Source:MarkerSymbol;Acc:MGI:105066] |
| ENSMUSG00000075505 | NULL |
To create a mapping from a Swissprot identifier to an Ensembl identifier requires us to join the stable_gene_id, gene and xref tables. In addition, it seems that sometimes multiple mappings are necessary. Namely, an external identifier can refer to a transcript or to the gene immediately. To resolve this we need the union of 2 queries.
This is a query to map the external id immediately onto the gene. The query below can be ran on the homo_sapiens_core_44_36f ensembldb schema or on the mus_musculus_core_44_36e schema.
FROM xref, gene, gene_stable_id
WHERE external_db_id=2200
and xref_id=display_xref_id
and gene.gene_id=gene_stable_id.gene_id;
| dbprimary_acc | stable_id | description |
+--------+----------+--------------------------------------------------------------------------------------------------------------------------------+
| P46425 | ENSMUSG00000038155 | Glutathione S-transferase P 2 (EC 2.5.1.18) (GST YF-YF) (GST-piA) (GST class-pi) (Gst P2). [Source:Uniprot/SWISSPROT;Acc:P46425] |
| Q8K2L9 | ENSMUSG00000033450 | T-cell activation Rho GTPase-activating protein (T-cell activation GTPase-activating protein). [Source:Uniprot/SWISSPROT;Acc:Q8K2L9] |
| P83882 | ENSMUSG00000049751 | 60S ribosomal protein L36a (60S ribosomal protein L44). [Source:Uniprot/SWISSPROT;Acc:P83882] |
| P05531 | ENSMUSG00000054626 | X-linked lymphocyte-regulated protein PM1. [Source:Uniprot/SWISSPROT;Acc:P05531] |
| Q9WV98 | ENSMUSG00000021079 | Mitochondrial import inner membrane translocase subunit Tim9. [Source:Uniprot/SWISSPROT;Acc:Q9WV98] |
| Q6IE32 | ENSMUSG00000060201 | Serine protease inhibitor Kazal-type 7 precursor (Esophagus cancer- related gene 2 protein) (ECRG-2). [Source:Uniprot/SWISSPROT;Acc:Q6IE32] |
| O88574 | ENSMUSG00000031609 | Histone deacetylase complex subunit SAP30 (Sin3-associated polypeptide, 30 kDa) (Sin3 corepressor complex subunit SAP30). [Source:Uniprot/SWISSPROT;Acc:O88574] |
| Q9JI46 | ENSMUSG00000024213 | Diphosphoinositol polyphosphate phosphohydrolase 1 (EC 3.6.1.52) (DIPP-1) (muDIPP1) (Diadenosine 5',5'''-P1,P6-hexaphosphate hydrolase 1) (EC 3.6.1.-) (Nucleoside diphosphate-linked moiety X motif 3) (Nudix motif 3). [Source:Uniprot/SWISSPROT;Acc:Q9JI46] |
| Q03740 | ENSMUSG00000070870 | Gamma crystallin E. [Source:Uniprot/SWISSPROT;Acc:Q03740] |
| P27545 | ENSMUSG00000055694 | LAG1 longevity assurance homolog 1 (UOG-1 protein). [Source:Uniprot/SWISSPROT;Acc:P27545] |
This query maps the external id to its transcript, which is then mapped onto its producing gene. The query below can be ran on the homo_sapiens_core_44_36f ensembldb schema or on the mus_musculus_core_44_36e schema.
FROM xref, transcript, gene, gene_stable_id
WHERE external_db_id=2200
and xref_id=transcript.display_xref_id
and gene.gene_id=transcript.gene_id
and gene.gene_id=gene_stable_id.gene_id;
| dbprimary_acc | stable_id | description |
+--------+----------+-----------------------------------------------------------------------+
| P46425 | ENSMUSG00000038155 | Glutathione S-transferase P 2 (EC 2.5.1.18) (GST YF-YF) (GST-piA) (GST class-pi) (Gst P2). [Source:Uniprot/SWISSPROT;Acc:P46425] |
| Q8BKE5 | ENSMUSG00000066307 | RIKEN cDNA E130016E03 gene [Source:MarkerSymbol;Acc:MGI:2444973] |
| Q8K2L9 | ENSMUSG00000033450 | T-cell activation Rho GTPase-activating protein (T-cell activation GTPase-activating protein). [Source:Uniprot/SWISSPROT;Acc:Q8K2L9] |
| P83882 | ENSMUSG00000049751 | 60S ribosomal protein L36a (60S ribosomal protein L44). [Source:Uniprot/SWISSPROT;Acc:P83882] |
| P05531 | ENSMUSG00000054626 | X-linked lymphocyte-regulated protein PM1. [Source:Uniprot/SWISSPROT;Acc:P05531] |
| Q9WV98 | ENSMUSG00000021079 | Mitochondrial import inner membrane translocase subunit Tim9. [Source:Uniprot/SWISSPROT;Acc:Q9WV98] |
| Q9WV98 | ENSMUSG00000021079 | Mitochondrial import inner membrane translocase subunit Tim9. [Source:Uniprot/SWISSPROT;Acc:Q9WV98] |
| Q9WV98 | ENSMUSG00000021079 | Mitochondrial import inner membrane translocase subunit Tim9. [Source:Uniprot/SWISSPROT;Acc:Q9WV98] |
| Q6IE32 | ENSMUSG00000060201 | Serine protease inhibitor Kazal-type 7 precursor (Esophagus cancer- related gene 2 protein) (ECRG-2). [Source:Uniprot/SWISSPROT;Acc:Q6IE32] |
| Q80WG5 | ENSMUSG00000007476 | phytanoyl-CoA dioxygenase domain containing 1 [Source:MarkerSymbol;Acc:MGI:3612860] |
dbprimary_acc as swissprot,
stable_id as ensembl,
gene.description as description
FROM xref, gene, gene_stable_id
WHERE external_db_id=2200
AND xref_id=display_xref_id
AND gene.gene_id=gene_stable_id.gene_id
UNION DISTINCT
SELECT DISTINCT
dbprimary_acc as swissprot,
stable_id as ensembl,
gene.description as description
FROM xref, transcript, gene, gene_stable_id
WHERE external_db_id=2200
AND xref_id=transcript.display_xref_id
AND gene.gene_id=transcript.gene_id
AND gene.gene_id=gene_stable_id.gene_id
Unigene identifiers are mapped as well through their transcripts but in a slightly different manner. By executing the following statement on the Ensembl database homo_sapiens_core_44_36f we obtain the unigene2ensg mapping (All stable_id's will be of the form ENSG000...). If we execute the query on the mus_musculus_core_44_36e database we obtain the unigene2ensmusg mapping (all id's will be of the form ENSMUSG000...).
dbprimary_acc as unigene,
stable_id as ensembl
FROM xref, object_xref, transcript, gene_stable_id
WHERE external_db_id=4100
and object_xref.xref_id=xref.xref_id
and transcript.transcript_id=object_xref.ensembl_id
and gene_stable_id.gene_id=transcript.gene_id;
| unigene | ensembl |
+------+----------+
| Mm.27038 | ENSMUSG00000036083 |
| Mm.39752 | ENSMUSG00000021573 |
| Mm.41636 | ENSMUSG00000002733 |
| Mm.76494 | ENSMUSG00000028528 |
| Mm.253378 | ENSMUSG00000043556 |
| Mm.260194 | ENSMUSG00000020474 |
| Mm.268582 | ENSMUSG00000071256 |
| Mm.317248 | ENSMUSG00000066443 |
| Mm.317248 | ENSMUSG00000041453 |
| Mm.390885 | ENSMUSG00000014077 |
One might assume now that we can simply link the mCG identifiers through their Swissprot or Unigene link to the Ensembl human gene using one of the above tables. That is however incorrect. Such a table will be empty since no Swissprot identifier nor Unigene identifier from the human genome is reused in the mouse genome. Instead we need to go through an ortholog mapping. We start out with creating a mCG2ensmusg table first. This is of course again more tricky than it looks at first sight. The problem that we have now is that the list of Swissprot identifiers linked to each mCG gene is separated with semicolons (;). To find them we must thus work the other way around: create a set of all the Swissprot fuckers we have and find them back inside the mCG tables. To reduce calculation time we first create a small table containing all the mCG|Swissprot links
By appending a ';' to the end of each Swissprot identifier list, we are sure that every one of our Swissprot identifiers will be found when we attach a ; to it as well. If we don't do this we might find MM123 back in the list 'MM1234; MM44'. All that is left now is to make the joins of the swissport2ensmusg vs mcg2ensmusg and unigen2ensmusg vs mcg2ensmusg. The query below is ran in the FkrpTaf4 database.
SELECT DISTINCT
Gene_ID,
Concat(Trim(SwissProt),';') as swissprot,
Concat(Trim(UniGene),';') as unigene
FROM allset6;
| Gene_ID | swissprot | unigene |
+-------+-------------------+------+
| hCG31426.2 | O95201;P13682;P17027;P51523;Q15776; | Hs.57679; |
| hCG1979099.1 | Q92610; | Hs.368756; |
| hCG42687.4 | ; | Hs.302903; |
| hCG2015782 | Q9UJX3; | Hs.530379; |
| hCG36953.3 | ; | Hs.278954; |
| hCG22993.3 | ; | Hs.199068; |
| hCG33215.3 | ; | Hs.30011; |
| hCG22998.3 | ; | Hs.294009; |
| hCG2039675 | Q8IUX4;Q9UH17; | Hs.337667; |
| hCG14966.2 | O14944; | Hs.115263; |
This table creates a mapping from the various mCG identifiers we found to their associated ensmusg identifier. It relies on the mCGBlurb table (3.2.1) created before (that table contains multiple Swissprot/Unigene identifiers in a semicolon separated/terminated list. To find back which Swissprot/Unigene identifiers occur in each mCGBlurb field we search for each of them in turn. This is less than optimal and could be optimized. However, the query only takes 7 minutes, so I don't care that much (at the moment). The Swissprot2Ensmusg table was imported in section 5.3.1. The Unigene2Ensmusg table was imported in section 5.3.2. The query below is ran in the FkrpTaf4 database.
SELECT DISTINCT
Gene_ID,
Ensembl
FROM Unigene2Ensmusg u2m, CG2Blurb blurb1
WHERE InStr(blurb1.UniGene,CONCAT(u2m.UniGene,";"))
UNION DISTINCT
SELECT DISTINCT Gene_ID, Ensembl
FROM Swissprot2Ensmusg s2m, CG2Blurb blurb2
WHERE InStr(blurb2.UniGene,CONCAT(s2m.swissprot,";"))
| Gene_ID | Ensembl |
+-------+----------+
| mCG126572.1 | ENSMUSG00000062203 |
| mCG141162 | ENSMUSG00000049152 |
| mCG113184.1 | ENSMUSG00000050876 |
| mCG132220.1 | ENSMUSG00000030137 |
| mCG19273.2 | ENSMUSG00000028465 |
| mCG5925.1 | ENSMUSG00000053560 |
| mCG141342 | ENSMUSG00000051048 |
| mCG1031868.1 | ENSMUSG00000048355 |
| mCG1036470.1 | ENSMUSG00000038077 |
| mCG49016.1 | ENSMUSG00000025795 |
In order to integrate the FKRP results we needed to map the mouse genes to the human genome, thereby respecting the function the different genes preform. inter-species genes with the same function are called orthologs. Ensembl provides a comparative database of genes between different species. However automatically mapping one onto the other was not as straightforward as one would expect. Theoretically one could write a query that would take all the stably annotated mouse genes, find them back in the homology table, determine the homology family and then find the human gene within that same family. The major problem that we encountered was that the Ensembl database has over 31'000'000 homology members, making straightforward joins of various tables a less than optimal solution. We optimized the querying using the following tricks.
Below we assume that we have a list of gene identifiers (form ENSMUSG...123) in the tomap.mouse_id column. The goal now is to create a new mapping from all these mouse_id's to human gene_ids. The tomap table is in our case defined as all potential mouse Ensembl id. Joining the stable_id's from Swissprot2mid (3.1.2) and Unigene2mid (3.1.3) provides us with the necessary things.
(mouse_id VARCHAR(32) PRIMARY KEY)
SELECT DISTINCT ensembl as mouse_id
FROM SwissProt2Mid
UNION DISTINCT
SELECT DISTINCT ensembl as mouse_id
FROM UniGene2Mid;
| mouse_id |
+----------+
| ENSMUSG00000000028 |
| ENSMUSG00000000031 |
| ENSMUSG00000000037 |
| ENSMUSG00000000056 |
| ENSMUSG00000000058 |
| ENSMUSG00000000078 |
| ENSMUSG00000000088 |
| ENSMUSG00000000093 |
| ENSMUSG00000000103 |
This table will list all the homolog members that belong to the mouse family for which we are interested in the mapping. The tomap table is created in 4.1.
(member_id int UNIQUE,
stable_id VARCHAR(32) UNIQUE)
SELECT DISTINCT member_id, map.stable_id
FROM Tomap
JOIN mus_musculus_core_44_36e.gene_stable_id mus
ON mus.stable_id=Tomap.ensembl
JOIN ensembl_compara_44.member map
ON mus.stable_id=map.stable_id;
| member_id | stable_id |
+------+----------+
| 828231 | ENSMUSG00000000028 |
| 338749 | ENSMUSG00000000031 |
| 682679 | ENSMUSG00000000037 |
| 1073599 | ENSMUSG00000000056 |
| 157512 | ENSMUSG00000000058 |
| 780907 | ENSMUSG00000000078 |
| 877703 | ENSMUSG00000000088 |
| 1052453 | ENSMUSG00000000093 |
| 667029 | ENSMUSG00000000103 |
| 1056719 | ENSMUSG00000000120 |
This table will be used as a starting point to find the homologies we are interested in. Afterward we will compare all the members of the homologies we like to a second table of human_homologs.
This table lists all possible target homologs (in our case, all stable gene members of the human genome)
(member_id int UNIQUE,
stable_id VARCHAR(32) UNIQUE)
SELECT DISTINCT member_id, map.stable_id
FROM homo_sapiens_core_44_36f.gene_stable_id hum
JOIN ensembl_compara_44.member map
ON hum.stable_id=map.stable_id;
| member_id | stable_id |
+------+---------+
| 3 | ENSG00000198763 |
| 5 | ENSG00000198804 |
| 7 | ENSG00000198712 |
| 9 | ENSG00000198744 |
| 11 | ENSG00000198899 |
| 13 | ENSG00000198938 |
| 15 | ENSG00000198840 |
| 17 | ENSG00000198868 |
| 19 | ENSG00000198886 |
| 21 | ENSG00000198786 |
(member_id int PRIMARY KEY,
stable_id VARCHAR(32) UNIQUE)
SELECT * FROM mouse_homolog
UNION DISTINCT
SELECT * FROM human_homolog;
| member_id | stable_id |
+------+----------+
| 338098 | ENSG00000168394 |
| 338108 | ENSMUSG00000025147 |
| 338125 | ENSG00000204261 |
| 338142 | ENSG00000204259 |
| 338145 | ENSMUSG00000043186 |
| 338185 | ENSMUSG00000037887 |
| 338194 | ENSG00000204258 |
| 338215 | ENSG00000204257 |
| 338272 | ENSG00000204256 |
| 338283 | ENSMUSG00000073786 |
The homologs that could be interesting are those that are no paralogs and those that have members in either the human gene or in the mouse gene. The first constraint is implemented by going through the homology compara table. The second constraint is implemented using the homolog_members table we made before.
(homology_id int,
member_id int,
INDEX (homology_id),
INDEX (member_id))
SELECT h.homology_id, a.member_id
FROM homology h
JOIN ensembl_compara_44.homology_member b USING (homology_id)
JOIN homolog_member a USING (member_id)
WHERE description!="between_species_paralog"
AND description!="within_species_paralog";
| homology_id | member_id |
+-------+------+
| 3097693 | 1 |
| 3097793 | 1 |
| 3097840 | 1 |
| 3097935 | 1 |
| 3097966 | 1 |
| 3097987 | 1 |
| 3098140 | 1 |
| 3098283 | 1 |
| 3098381 | 1 |
| 3098403 | 1 |
| 3098478 | 1 |
| 3098500 | 1 |
| 5290964 | 3 |
| 5291000 | 3 |
| 5291061 | 3 |
| 5291114 | 3 |
| 5291145 | 3 |
| 5291260 | 3 |
| 5291274 | 3 |
| 5291346 | 3 |
| 5291503 | 3 |
| 5291563 | 3 |
| 5291571 | 3 |
| 712679 | 5 |
| 712702 | 5 |
| 712865 | 5 |
| 712903 | 5 |
| 712931 | 5 |
| 712987 | 5 |
| 713089 | 5 |
| 713115 | 5 |
We are only interested in orthologs (4.5) that include a mouse ho