FAQ
Experiment
What arrays are currently available?
The Microarray Core Facility offers arrays for gene expression, CGH, ChIP-on-chip, and other tiling applications from:
How much RNA do I need for a gene expression experiment?
How do I check the status of my microarray experiment request?
The status of your microarray experiments can be checked at the GNomEx web site. To check the status of your experiment:
- Log on to GNomEx using your UNID and password.
- Click the "Track" link on the left side of the window, under "Microarray Hybridization Requests". The "Track Microarray Hybridization Requests" window should appear.
- If you click the "Find Requests" button, this window will show the status of all requests ever made by your lab. You can use the form at the top of the window to restrict your view to requests made by a particular person, requests associated with a particular project, made within some date range, or having a completed or not completed status. After entering one or more values in the form, click the "Find Requests" button to check the status of matching requests.
- The lower part of the window will display the matching requests in a table, with one row per hybridization. For each hybridization you will see the date each sample was labeled and the date on which the hybridization and data extraction were performed.
Analysis
How do I run the TiMAT2 CorrelationMap application on ChIP-chip promoter array data?
## Here's an example:

# Make all of the possible intervals from a T2 run serialized window file
java -jar -Xmx1000M ~/Apps/IntervalMaker -s -50 -i 1 -o 2 -g 250 -z 60 -f \
  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win

# Make a text file containing chromosome, start, and stop for each interval,
# this represents all of the interrogated promoters on the zebrafish array
java -jar -Xmx1000M ~/Apps/IntervalReportPrinter -c -f \
  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win1Indx86672

# Find the best window within each promoter (if desired use the -m option to
# find the lowest scoring window for identifying reduced regions from a diff analysis)
java -jar -Xmx1000M ~/Apps/BestWindowScoreExtractor -w \
  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win \
  -r /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/all_Win1Indx86672.xls \
  -z 60 -i 1 > /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/bestWin.xls

# Parse the output file printing the row number as the first column, the
# first four columns, and skipping the first four lines
java -jar ~/Apps/PrintSelectColumns -i 0,1,2,3 -n 4 -r -f \
  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/bestWin.xls

# Run the CorrelationMap application on the parsed file
java -jar ~/Apps/CorrelationMaps -w 1000000 -g zv7 -f \
  /Users/nix/HCI/PIs/Cairns/ZebraFish/Results/H3K4K27me3Combine/Win/bestWin.PSC.xls
How do I convert TiMAT2 xxx.bar files to text files?
There is a converter called Bar2Gr in the T2 package. Download it from SourceForge or use the installed version on hci-bio.
nixlaptop:~ nix$ ssh u0028003@hci-bio
u0028003@hci-bio's password:
Last login: Wed Aug 8 09:36:36 2007 from 155.100.234.87
[u0028003@hci-bio ~]$ java -jar /home/BioApps/T2/Apps/Bar2Gr
**************************************************************************************
**                                 Bar2Gr: Nov 2006                                 **
**************************************************************************************
Converts xxx.bar to text xxx.gr files.

-f The full path directory/file name for your xxx.bar file(s).

Example: java -Xmx1500M -jar pathTo/T2/Apps/Bar2Gr -f /affy/BarFiles/

**************************************************************************************
How do I download my Agilent microarray data?
Results from completed microarray experiments are available via the GNomEx web site. To collect your results:
- Log on to GNomEx using your UNID and password.
- Click the "Fetch" link on the left side of the window, beneath "Microarray Hybridization Requests".
- The "Fetch Microarray Hybridization Results" page should appear. To view every request available for your lab, simply click the "Find Results" button. To limit your view to one particular request or the requests that a particular person made, enter a request number in the request id # box (e.g. "5054R"), or select the requestor's name on the "Submitted by:" menu, and click the "Find Results" button. You can also limit your view by date range or project.
- A new table of information will appear which is organized by microarray request. Each line represents either the bioanalyzer results for the request, or the results for a single microarray in the request. A check box appears next to each array or bioanalyzer result that is available. Check the boxes for the arrays or bioanalyzer results you want, and click the "Download" button.
- All the data files associated with the selected items will be collected and compressed into a Zip file, and will be sent to your computer. Your web browser should offer you the choice of opening or saving the Zip file.
- When you open the Zip file (which may happen automatically on some computers) you'll find a folder for each request named with the request's id (e.g. "5054R"). Each request folder will contain a subfolder for each selected microarray. These subfolders are named with the experiment number for the array (e.g. "5054E1", "5054E2", and so on). Bioanalyzer results will be in a subfolder named "bioanalysis".
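The folder layout described above can be inspected without unpacking the archive by hand. The following is a minimal Python sketch, assuming the Zip file has the request folder at its top level and one subfolder per array (or "bioanalysis") beneath it. The function name and the file names in the stand-in archive are hypothetical.

```python
import io
import zipfile
from collections import defaultdict

def group_by_experiment(zip_file):
    """Map (request id, subfolder) -> list of file names inside a results Zip."""
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            parts = name.strip("/").split("/")
            # Expect request/subfolder/file; skip directory entries.
            if len(parts) >= 3 and not name.endswith("/"):
                request_id, subfolder = parts[0], parts[1]
                groups[(request_id, subfolder)].append(parts[-1])
    return dict(groups)

# Build a tiny stand-in archive in memory (file names are hypothetical).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("5054R/5054E1/5054E1_example.txt", "data")
    archive.writestr("5054R/5054E2/5054E2_example.txt", "data")
    archive.writestr("5054R/bioanalysis/report.pdf", "data")
demo.seek(0)

for (request_id, subfolder), files in sorted(group_by_experiment(demo).items()):
    print(request_id, subfolder, files)
```

For a real download, pass the path of the saved Zip file to `group_by_experiment` instead of the in-memory stand-in.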
What are all these Agilent data files?
- The data files produced during an Agilent microarray experiment (whether gene expression, CGH, or ChIP-chip) include raw data in the form of TIFF images, low-resolution JPEG images for visual examination, quality control files, and numerical data in the form of a text file. All the files for a particular array will have names that begin with the array's experiment number. For example, in request number 5054R, the names of all the data files for the first array will begin with "5054E1".
- The Microarray Core Facility uses Agilent's Extended Dynamic Range Scan, a technique that reduces the number of saturated spots while increasing the dynamic range of the instrument. In this procedure the array is scanned at two different intensities, which produces two TIFF images. Data from these two images are combined into a single data file.
- Here is a list of files produced from a typical gene expression experiment:
- experiment_number_251486815301_S01_GE2-v5_95_Aug07_1_1.jpg: JPEG image of array
- experiment_number_251486815301_S01_GE2-v5_95_Aug07_1_1.pdf: quality control report
- experiment_number_251486815301_S01_GE2-v5_95_Aug07_1_1.txt: Text file with the numeric data for each spot on the array
- request_number.fep: XML document that contains the parameters used for this run of the Agilent Feature Extraction software (the software that translates the image of the array into numerical data)
- request_number_200708151242.rtf: RTF (Word) document with the Project Run Summary for the array, a brief report describing when the array was scanned, the array's format, and the number of saturated spots
- request_number_251486815301_S01_H.tif: TIFF from high-intensity scan
- request_number_251486815301_S01_L.tif: TIFF from low-intensity scan
- H002334_LastBatchReport.rtf: same as the Project Run Summary
- QCReport_Graphs: A folder that contains files used in the Quality Control Report
Which expression microarray gene selection method should I use?
A comparison of three methods used to select differentially expressed genes from a microarray experiment: SAM, RankProd, LIMMA.
Cody Olsen, Huntsman Cancer Institute, University of Utah, July 2006
You have some microarray data (or you are planning on getting some) and you want to know how to find those interesting genes, the ones that are consistently differentially expressed in your experiment. Not only must you decide how to clean and process the raw data and do background correction and normalization, you must also choose a method with which to analyze your data. Once the data is analyzed, you can compile a list of possibly differentially expressed genes. Common methods of identifying differentially expressed genes in microarray experiments include Significance Analysis of Microarrays (SAM), linear models (such as limma in R), and non-parametric methods such as Rank Product Analysis (RankProd in R). This quick comparison and overview is meant to help you make that decision.
Data from Ed Levine’s lab at the Moran Eye Center at the University of Utah were used to compare these three methods. Three 2-color Agilent microarrays were hybridized with unique pooled samples of P0Het on the Cy3 channel and P0Null on the Cy5 channel. Microarrays were processed in the University of Utah Microarray Shared Resource Lab. The goal was a list of genes that were differentially expressed between the two classes, P0Het and P0Null.
The three methods were applied to the same set of normalized data, and three lists of 300 genes were obtained. Two important factors used to estimate whether a gene is differentially expressed are the magnitude of the difference between groups and the variability of the gene. Since each of the three methods estimates and deals with these quantities differently, we will not get the same list of differentially expressed genes from all three methods.
Linear Models
The R package “limma” is used to create and test linear models for microarray data. Limma uses a moderated t-statistic to test the average difference in log expression levels between the two groups for each gene. The moderated t-statistic is the average log ratio divided by a standard error that is calculated using information from the replicates of the given gene and information from across all genes. Once all possible tests have been done, a variety of multiple comparison procedures are available to control the false discovery rate of the experiment (Smyth, 2004).
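To make the idea of a moderated t-statistic concrete, here is a minimal Python sketch for a single gene's replicate log ratios. It follows the general form described by Smyth (2004), in which the gene's own sample variance is shrunk toward a prior variance. The values `prior_var` and `prior_df` are assumptions supplied by hand here, whereas limma estimates them from all genes on the array. The function names are illustrative and are not part of limma.

```python
import math
from statistics import mean, variance

def moderated_t(log_ratios, prior_var, prior_df):
    """Moderated t-statistic for one gene's replicate log ratios.

    prior_var and prior_df are assumed inputs; limma would estimate
    them from the variances of all genes (empirical Bayes).
    """
    n = len(log_ratios)
    df = n - 1                          # residual degrees of freedom
    s2 = variance(log_ratios)           # this gene's sample variance
    # Shrink the gene-wise variance toward the prior variance.
    s2_post = (prior_df * prior_var + df * s2) / (prior_df + df)
    return mean(log_ratios) / math.sqrt(s2_post / n)

def ordinary_t(log_ratios):
    """Traditional one-sample t-statistic, for comparison."""
    n = len(log_ratios)
    return mean(log_ratios) / math.sqrt(variance(log_ratios) / n)

consistent = [1.0, 1.1, 0.9]   # moderate but stable difference
variable = [3.0, 0.2, 2.8]     # larger but erratic difference
for name, gene in [("consistent", consistent), ("variable", variable)]:
    print(name, round(ordinary_t(gene), 2), round(moderated_t(gene, 0.05, 4.0), 2))
```

In this toy example the consistent gene scores higher than the variable one under both statistics, even though its average difference is smaller. The moderation mainly tempers the very large ordinary t-statistic that results from a very small sample variance.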
The t-statistic used in limma ensures that the final list of genes includes genes that are consistently different between groups. Limma will choose a gene that is moderately different and consistent before it will choose a gene that shows a very large difference but whose expression is highly variable. A profile plot of the top 300 genes selected by limma is shown in the figure below.
Profile of LIMMA's Top 300 Genes
Some of the most extremely expressed genes are not selected by limma, while genes that are moderately over- or under-expressed but have little variability across samples are chosen.
Why should I use a linear model?
Linear models are powerful and can deal with complicated experimental designs in an elegant way. Linear models are a standard statistical tool, and one can get a traditional t-statistic from limma in addition to the moderated one the authors of the package suggest using. Other statistics can also be calculated, including Bayesian estimators. Of course, fitting a linear model involves certain assumptions that are not met in the microarray case, but similar assumptions are made for the other two methods as well. Limma is a package in Bioconductor, a free, open-source software project for R.
SAM
Significance Analysis of Microarrays (SAM) uses a test statistic similar to that used by limma, but it calculates the variability of a gene differently, using information from the specific gene as well as information from other genes. Although the test statistic is similar, SAM will choose a slightly different set of genes. The figure below shows a profile plot of genes chosen by SAM.
Profile of SAM's Top 300 Genes
In this experiment, SAM didn’t choose any genes which were under-expressed. They seem to have been too variable. SAM seems to value consistency more than limma does, and will choose genes with a smaller difference in expression as long as that difference is consistent across samples.
Why should I use SAM?
SAM is free for academic users and can be easily used in Excel or R. The genes selected by SAM will be very consistent, if not greatly over or under expressed. Like limma, SAM uses a t-test and one can expect a list similar to one obtained from limma, although p-values (SAM uses a “q-value”) will differ.
RankProd
The RankProd package in R can be used to analyze microarray data with a rank-based nonparametric method which may be able to identify differentially expressed genes not identified by the other two methods.
The RankProd statistic for a gene is the geometric mean rank of that gene and was proposed by Breitling et al. (2004) as a useful test statistic in the microarray case. Its distribution is estimated by randomly permuting the observed ranks. Using this estimated distribution, RankProd gives estimates of the false discovery rate for each gene. RankProd calls this the "PFP", which is interpreted as the false discovery rate for the whole list of genes that have an equal or smaller p-value. The procedure ranks each gene two ways, from most to least expressed and from least to most expressed.
The rank product of a gene is affected by extreme ranks more than, say, the sum of expression values is affected by extreme expression values. The rank product analysis, therefore, picks up genes that are extremely expressed in at least one sample, even though they are less extreme in other samples. A profile plot of genes selected by RankProd is shown below.
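The rank product idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the RankProd package's implementation: genes are ranked in one direction only (rank 1 = most over-expressed), ties are broken by position, and the permutation step simply shuffles values within each sample. All names and the toy data are hypothetical.

```python
import math
import random

def ranks_descending(values):
    """Rank 1 = largest value (ties broken by position, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def rank_product_ints(samples):
    """Product of each gene's ranks across samples (exact integers)."""
    per_sample = [ranks_descending(s) for s in samples]
    return [math.prod(r[g] for r in per_sample) for g in range(len(samples[0]))]

def rank_products(samples):
    """Geometric mean rank of each gene across samples."""
    k = len(samples)
    return [p ** (1.0 / k) for p in rank_product_ints(samples)]

def permutation_pvalues(samples, n_perm=1000, seed=0):
    """Fraction of permuted rank products at least as small as each observed one."""
    rng = random.Random(seed)
    observed = rank_product_ints(samples)
    counts = [0] * len(observed)
    for _ in range(n_perm):
        shuffled = [rng.sample(s, len(s)) for s in samples]
        null = rank_product_ints(shuffled)
        for g, obs in enumerate(observed):
            counts[g] += sum(1 for x in null if x <= obs)
    return [c / (n_perm * len(observed)) for c in counts]

# Toy data: three samples (rows) by five genes (columns) of log ratios.
samples = [
    [2.5, 0.1, -0.3, 1.2, 0.0],
    [0.2, 0.3, -0.1, 1.0, 0.1],
    [3.0, -0.2, 0.0, 1.1, 0.2],
]
print([round(rp, 3) for rp in rank_products(samples)])
print([round(p, 3) for p in permutation_pvalues(samples)])
```

In the toy data the first gene is extreme in two samples and unremarkable in the third, while the fourth gene is steadily high in all three. The first gene still gets the smaller rank product, which mirrors the behavior described above.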
Profile of RankProd's Top 300 Genes
One can see that a whole group of genes that were extremely under-expressed in one or two samples are picked up by RankProd but were not chosen by limma or SAM because of their variability. RankProd seems to value large differences more than small standard errors. Unlike the first two lists, genes with missing values are included in RankProd’s top 300.
Why choose RankProd?
RankProd seems to be more in line with a fold-change criterion and, according to Breitling et al. (2004), may select genes with more biological significance. Genes selected by RankProd might be less consistent across samples, but will have large differences in some of the samples.
How do they compare?
Below is a Venn diagram showing the overlap between methods in this case. SAM and limma had the most overlap (268 common genes), while limma and RankProd were the least similar pair (152 common genes). The list of genes from RankProd was the most distinct, with 122 genes not found significant by either of the other methods. There was some agreement among all three methods, though: half of the genes selected by each method were significant in both of the other two.
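The overlap counts behind a Venn diagram like this one can be computed with set operations once each method's gene list is in hand. A minimal Python sketch follows. The gene identifiers are made up for illustration and are not the data described above.

```python
def venn_counts(limma_genes, sam_genes, rankprod_genes):
    """Count pairwise and three-way overlaps between three gene lists."""
    limma, sam, rankprod = set(limma_genes), set(sam_genes), set(rankprod_genes)
    return {
        "limma_and_sam": len(limma & sam),
        "limma_and_rankprod": len(limma & rankprod),
        "sam_and_rankprod": len(sam & rankprod),
        "all_three": len(limma & sam & rankprod),
        "rankprod_only": len(rankprod - limma - sam),
    }

# Toy lists (hypothetical gene names standing in for the top-300 lists).
limma_top = ["g1", "g2", "g3", "g4"]
sam_top = ["g1", "g2", "g3", "g5"]
rankprod_top = ["g1", "g2", "g6", "g7"]

counts = venn_counts(limma_top, sam_top, rankprod_top)
for label, n in counts.items():
    print(label, n)
```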
More Information
R / Bioconductor:
- To learn more about R-software go to the R-software homepage: http://www.r-project.org/
- Also look at the Bioconductor site: http://www.bioconductor.org/
- How to install Bioconductor packages (limma, RankProd, siggenes, where SAM can be found, and many others): http://www.bioconductor.org/download
LIMMA:
- Limma details: http://www.bioconductor.org/packages/1.9/bioc/html/limma.html
- Reference - Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 3. http://www.statsci.org/smyth/pubs/ebayes.pdf
RankProd:
- RankProd vignette: http://www.bioconductor.org/repository/devel/vignette/RankProd.pdf
- The paper by Breitling et al. (reference below) goes through a very detailed comparison of RankProd, SAM, and fold-change gene selection criteria.
- Reference - Breitling, R., Armengaud, P., Amtmann, A., and Herzyk, P. (2004). Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters, 573, 83-92. http://www.dcs.gla.ac.uk/~rb106x/publications/RankProducts_FEBS.pdf
SAM:
- Information about SAM can be found at: http://www-stat.stanford.edu/~tibs/SAM/
- Bioconductor siggenes vignette: http://www.bioconductor.org/repository/devel/vignette/siggenes.pdf
- Reference - Tusher, V., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98, 5116-5121. http://www-stat.stanford.edu/~tibs/SAM/pnassam.pdf
How do I convert from one log base to another?
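One general approach uses the change-of-base identity: a logarithm in the new base equals the logarithm in the old base multiplied by ln(old base) / ln(new base). A minimal Python sketch, with an illustrative function name:

```python
import math

def convert_log_base(value, from_base, to_base):
    """Convert log_from(x) to log_to(x) without recovering x itself.

    Uses log_to(x) = log_from(x) * ln(from_base) / ln(to_base).
    """
    return value * math.log(from_base) / math.log(to_base)

# A log10 value of 2 corresponds to x = 100; print its log2 equivalent.
print(convert_log_base(2.0, 10, 2))
```

Because the conversion is a constant multiplier, it can be applied to a whole column of values at once. For example, multiply log10 values by about 3.32 to get log2 values.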
How do I convert from log ratios to fold changes?
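A fold change is the log base raised to the power of the log ratio. The base depends on the software that produced the values, so it is a parameter in this minimal Python sketch (base 2 is only the default chosen here). The "signed" variant, which reports ratios below 1 as negative fold changes, is one common convention and is an assumption rather than a requirement.

```python
def log_ratio_to_fold_change(log_ratio, base=2.0):
    """Plain ratio: base ** log_ratio (values below 1 mean down-regulation)."""
    return base ** log_ratio

def signed_fold_change(log_ratio, base=2.0):
    """Report ratios below 1 as negative fold changes (e.g. 0.5 -> -2.0)."""
    ratio = base ** log_ratio
    return ratio if ratio >= 1.0 else -1.0 / ratio

for lr in (1.0, 0.0, -1.0):
    print(lr, log_ratio_to_fold_change(lr), signed_fold_change(lr))
```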
How do I hyperlink my results to an annotation database?
How do I merge annotations into my results?
Where do I get the design file for an Agilent array?
The layout of each Agilent microarray is described in a "design file" which is needed for loading Agilent-format microarray data into analysis software such as MeV or Agilent CGH Analytics.
Download
The Agilent array design files can be obtained from Agilent. You will need a microarray barcode to download the design file. The barcode of each array is a 12-digit number (typically beginning with "251...") that is embedded in the names of your Agilent microarray data files. You can also find the barcode in the header of the .txt format Agilent data file. The barcode should be on row 3, directly beneath a cell containing the word "FeatureExtractor_Barcode". If the data file is opened in Excel, the barcode is usually in cell T3.
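Pulling the barcode out of a data file name can be automated with a regular expression. The sketch below assumes the barcode is a 12-digit run followed by a slot suffix such as "_S01", as in the example file names listed earlier. That suffix check is an assumption added here so that other 12-digit runs, such as the timestamp in the run summary file name, are not mistaken for a barcode. The sample file names are illustrative.

```python
import re

# 12 digits, not part of a longer number, followed by a slot suffix like _S01.
BARCODE_PATTERN = re.compile(r"(?<!\d)(\d{12})_S\d{2}(?!\d)")

def extract_barcode(filename):
    """Return the 12-digit barcode embedded in a file name, or None."""
    match = BARCODE_PATTERN.search(filename)
    return match.group(1) if match else None

for name in (
    "5054E1_251486815301_S01_GE2-v5_95_Aug07_1_1.txt",
    "5054R_251486815301_S01_H.tif",
    "5054R_200708151242.rtf",   # timestamp, not a barcode
):
    print(name, extract_barcode(name))
```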
Formats
The design files are available in several different formats. The format you need depends on which analysis software you use.
- Agilent CGH Analytics: GEML (.xml) format
- MeV: Tab-delimited text (.txt) format
- Agilent scanner configuration
- DNA: Back of slide
- Barcode: Left side
- Scan: Landscape
What analysis software is available?
See Software
How do I convert from Z scores to P values?
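If the Z scores can be assumed to follow a standard normal distribution under the null hypothesis, the P value is the tail area beyond the observed score. A minimal Python sketch using only the standard library (the function name is illustrative):

```python
import math

def z_to_p(z, two_sided=True):
    """P value for a Z score, assuming a standard normal null distribution."""
    if two_sided:
        # Area in both tails beyond |z|.
        return math.erfc(abs(z) / math.sqrt(2.0))
    # Upper-tail area only: P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for z in (0.0, 1.96, 3.0):
    print(z, z_to_p(z), z_to_p(z, two_sided=False))
```

Whether a one-sided or two-sided P value is appropriate depends on the question being asked. The two-sided form is the default here because over- and under-expression are usually both of interest.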
How do I upload my results to GEO or ArrayExpress?
- Main GEO web site
- GEO's web deposit guide
- Main ArrayExpress web site
- ArrayExpress' web uploader MIAMExpress
How do I get the sequence for a particular microarray probe?
Agilent Probes
Agilent probe sequences are available via Agilent's eArray web site. You will need a login to use this site, which can be obtained from Brian Dalley or the Bioinformatics Core.
Step by step:
Once you've logged in to eArray, follow these steps to find the sequence of a probe:
- Click on the "Probes" tab.
- In the "Search Term" field enter a probe id (e.g. A_44_P409518), an accession number (e.g. NM_057188) or a gene symbol (e.g. Gmpr).
- Select a species by clicking on the "Select and add species" link. This will open the species selection window, which is a little tricky to use. eArray stores the species as "H. sapiens" or "R. norvegicus", so a search for "Human" or "rat" probably won't produce what you want. Try searching by the species name, like "sapiens", "musculus", "norvegicus", etc. Perform the search, click on the result you want (e.g. R. norvegicus), then click the "Add>" button, then "Done". Whew!
- Back in the main search page, select a Folder for the search: either "Agilent Catalog" for Agilent-designed probes (including commercial arrays) or "University of Utah" for custom-designed probes.
- Click the "Search" button.
- Once the search is done, click on the checkbox for the probe(s) of interest.
- Finally, click on the "Show Statistics" button at the bottom of the page.
Affymetrix Probes
How do I convert between different gene names?
Several web sites can get you 90% of the way there. Check out these two: http://discover.nci.nih.gov/matchminer/index.jsp and http://david.abcc.ncifcrf.gov/conversion.jsp . For names with no match, try manually entering the name into the Ensembl and UCSC search bars (http://www.ensembl.org/index.html and http://genome.ucsc.edu/cgi-bin/hgGateway).
What is P-value Adjustment?
P value adjustment is an essential part of microarray analysis. Raw p values are calculated for each individual gene on a microarray using, for example, a t test. But since there are about 40,000 different probes on a typical gene expression array, you are doing about 40,000 separate statistical tests on the data. Even at a reasonable p value cutoff of 0.05 or 0.01 you should expect many false positive results: if no gene were truly differentially expressed, about 2,000 or 400 probes, respectively, would still pass by chance. P value adjustment using a method such as that of Benjamini and Hochberg converts the raw p values (for individual genes) into a false discovery rate for the whole experiment. For example, if you find 500 genes significant at an adjusted p value of 0.05, you can expect approximately 25 of the 500 genes (5%) to be false positives.
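The Benjamini and Hochberg adjustment itself is short enough to sketch in plain Python. Each raw p value is multiplied by the number of tests divided by its rank (smallest p value = rank 1), and the results are then made monotone from the largest p value downward. The function name and the toy p values are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest, keeping a running minimum
    # so that adjusted values never increase as raw p values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(p, round(q, 4))
```

Genes whose adjusted value falls below the chosen false discovery rate (for example 0.05) form the list of significant genes.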
Administration
How do I get a uNID for people outside the U of U?
Make them an affiliate member:
- Download and complete the POI Form. HCI's Working Department Org ID is 00215




