
Basic Analysis of NimbleGen ChIP-on-chip Data using Bioconductor/R

Source: 生物谷, 2010/1/28


Hybridization of chromatin immuno-precipitation (ChIP) material to tiling arrays at NimbleGen service facilities usually leaves the customer with a set of data files that are of limited use. Most information about the experiment is gained either by displaying immuno-precipitate (IP)/input ratio tracks (the GFF files provided) of individual hybridisation experiments with NimbleGen’s SignalMap software or by scanning the list of peaks identified by the automated data analysis. Summary profiles from replicate experiments cannot be investigated; further calculations are left to the customers. Apart from the fact that data quality cannot be directly evaluated, raw data are not corrected for systematic signal distortions when the ratio GFF files are generated. Furthermore, the robustness of the provided peak finding procedures is questionable, as the algorithm will frequently identify many "significant" peaks in noise-only experiments.

Several independent software tools are available that provide a more robust and more detailed analysis. Unfortunately, those tools frequently underperform on datasets that differ substantially from the ones they were built for. Apart from simple problems such as compatibility issues between different organisms and naming conventions for chromosomes, many tools aim specifically at identifying peaks or peak centers. In the realm of epigenetics, however, many features are distributed rather broadly and peak centers are not the only features defining biological states. Overall, many tools do not provide sufficient flexibility and transparency for the user to know and control what is actually happening with the data. This will ultimately lead to non-reproducibility and/or analysis failures once the tools are modified.

With increasing amounts of quantitative data, biologists are often left with endpoints of analyses that they simply have to trust unless they are given the possibility to look behind the procedures. This protocol mainly aims to give biologists a rather unbiased look at the quantitative data of their NimbleGen tiling array experiments during the initial processing steps. Ultimately, summary tracks of the profiles will be generated that permit visual browsing of the data, which serves as important inspiration for downstream analyses. The procedures introduced here can in principle be applied to any other 2-color tiling array data (Agilent) and/or 2-sample comparison data on Affymetrix chips (single colour).


R is a free and platform-independent scripting framework for statistical analyses. Its modular structure allows for integration of specific tools such as the Bioconductor modules for bioinformatic applications. These extensions facilitate processing of biological data by providing easy-to-use interfaces to statistical core modules and by adding specific algorithms such as normalization steps for microarray analyses. Other highlights of R/Bioconductor are its excellent graphing capabilities, its scriptability, which allows for automation of complex analysis pipelines, and the existence of a large community creating and maintaining state-of-the-art tools. Most importantly, however, data in R are ready for exploratory examination (for example correlations) and complex downstream analyses.

Unfortunately, data handling in R is driven by commands typed into a terminal window. This implies a rather steep learning curve, but there are several reasons not to shy away. First, there is plenty of documentation on R usage for biologists. I recommend the public domain resources "R-introduction for Biostatistics" by Kim Seefeld and the "R & BioConductor Manual" by Thomas Girke. In addition, there are several good introductory books published on R/Bioconductor ("Bioinformatics and Computational Biology Solutions Using R and Bioconductor" by Robert Gentleman et al. and "Bioconductor Case Studies (Use R!)" by Florian Hahne et al.). Second, the commands to be typed can simply be copy-pasted or loaded from prepared text files. The analysis steps executed in this protocol can therefore be adjusted to any other NimbleGen dataset by editing a few lines in a sample description file. This protocol does not require the reader to be familiar with R/Bioconductor.

Rationale of the analysis pipeline

Quality control: the validity and robustness of any analysis will depend on the quality of the raw data. The most important quality parameters are the distribution of raw signal intensities and the independence of IP/input ratio and absolute signal strength. In addition, arrays with severe hybridisation artefacts should be identified.

Normalization: shifts in signal intensity distributions between the two channels as well as between arrays have to be adjusted. Dependences of ratio and absolute signal strengths should be eliminated.

Probe summary statistics: For each probe on the array, a mean ratio value of all replicates and a statistical test for enrichment are provided.

Region summarization: The spatial organization of bound probes can be exploited to identify statistically significant regions of binding.

Data visualization: The results of the analyses will be exported for visualization in genome browsing applications.
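The quality-control and normalization ideas listed above can be illustrated with a minimal sketch in base R. The data are simulated (not the data set used in this protocol), all variable names are hypothetical, and the median centring shown is only a stand-in for a real normalization method, not the procedure used by the protocol scripts.

```r
# Simulated intensities for one array (hypothetical data)
set.seed(1)
n <- 1000
A_true <- rnorm(n, mean = 10, sd = 1)     # mean log2 signal per probe
M_true <- rnorm(n, mean = 0.3, sd = 0.5)  # log2 ratio with a global shift
ip    <- 2^(A_true + M_true / 2)          # IP channel, raw scale
input <- 2^(A_true - M_true / 2)          # input channel, raw scale

# Per-probe log2 IP/input ratio (M) and mean log2 intensity (A)
M <- log2(ip) - log2(input)
A <- (log2(ip) + log2(input)) / 2

# Quality control: correlation between ratio and absolute signal;
# a strong dependence would indicate a systematic distortion
ma_cor <- cor(M, A)

# Simplest possible adjustment: centre the ratios on their median
M_centred <- M - median(M)
```

Plotting M against A, for example with plot(A, M), gives the usual visual check for a dependence of ratio on signal strength.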



Installation of R/Bioconductor

Get the most recent version of R for your platform by choosing a mirror server near you at http://cran.r-project.org/mirrors.html. Follow the installation instructions provided. GUI versions of R are launched by double-clicking the corresponding icons in OSX or Windows. On Linux, R is started by issuing the command 'R' in a terminal window.

In addition to the R base packages provided with the standard installation, this protocol requires some additional modules, some of which are Bioconductor packages. The following chain of commands pasted into the R terminal will install all packages required. The installation will take some time and require some hard disk space (1-3 GB).

Code presented here is preceded by the default R command prompt (>), which should not be part of the pasted command. No details on what is actually performed by the commands are provided here. The entire code (including the installation commands) without the command prompts is provided (http://genome1.bio.med.uni-muenchen.de/downloads.htm) in fully commented text files (noe_installation.R) that can either be copy-pasted or executed using the source command - source("noe_installation.R") - from within the terminal window.

> install.packages("st", dependencies=TRUE)
> install.packages("locfdr", dependencies=TRUE)
> install.packages("tileHMM", dependencies=TRUE)
> install.packages("gplots", dependencies=TRUE)
> source("http://bioconductor.org/biocLite.R")
> biocLite("geneplotter")
> biocLite("limma")
> biocLite("vsn")
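On recent R releases the biocLite.R script has been superseded by the BiocManager package, so the commands above may need to be adapted. The following is a minimal sketch under that assumption; the helper names missing_pkgs and install_missing are hypothetical and not part of the protocol scripts, and the availability of the individual packages for a current R version is not checked here.

```r
cran_pkgs <- c("st", "locfdr", "tileHMM", "gplots")
bioc_pkgs <- c("geneplotter", "limma", "vsn")

# Return the subset of 'pkgs' that is not yet installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs,
                      function(p) nzchar(system.file(package = p)),
                      logical(1))
  pkgs[!installed]
}

# Install only what is missing (hypothetical helper, not called here)
install_missing <- function() {
  to_cran <- missing_pkgs(cran_pkgs)
  if (length(to_cran) > 0) {
    install.packages(to_cran, dependencies = TRUE)
  }
  to_bioc <- missing_pkgs(bioc_pkgs)
  if (length(to_bioc) > 0) {
    if (!nzchar(system.file(package = "BiocManager"))) {
      install.packages("BiocManager")
    }
    BiocManager::install(to_bioc)
  }
  invisible(c(to_cran, to_bioc))
}
```

Calling install_missing() once from the R terminal would then attempt the installation; calling missing_pkgs(c(cran_pkgs, bioc_pkgs)) afterwards lists anything that could not be installed.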

The example dataset

Here I demonstrate the protocol on a published data set (Straub et al, 2008). The target protein is Drosophila MSL2, a member of the Dosage Compensation Complex that specifically binds the X chromosome in male flies. The data comprises 3 biological replicates and includes one dye swap. Immuno-precipitated (IP) and Input material was hybridized to a custom NimbleGen array that uses an isothermal design layout (i.e. oligo probes of varying lengths).

Organizing the workspace folder

All data belonging to the experiment are organized into one folder (Figure 1). This comprises the corresponding *.pair files found in the RawDataFiles folder of the service DVD and the *.ndf and *.pos files found in the layout folder within the DesignInformation folder. The experiment folder should also contain a description file of the samples (sample_key.txt, see below), an R script file that defines functions for some more complicated routines (noe_source.R) and an R script file that contains the commands of the analysis pipeline (noe_protocol.R).

Figure 1: Organization of files
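Whether a folder is complete can be checked from within R before starting the analysis. The sketch below builds a mock experiment folder in a temporary directory; the raw data and layout file names are invented for illustration, and check_folder is a hypothetical helper that is not part of noe_source.R.

```r
# Build a mock experiment folder (hypothetical file names)
exp_dir <- file.path(tempdir(), "dataset1")
dir.create(exp_dir, showWarnings = FALSE)
mock_files <- c("rep1_532.pair", "rep1_635.pair",
                "design.ndf", "design.pos",
                "sample_key.txt", "noe_source.R", "noe_protocol.R")
invisible(file.create(file.path(exp_dir, mock_files)))

# Report which of the expected file types are present in 'dir'
check_folder <- function(dir) {
  files <- list.files(dir)
  c(pair       = any(grepl("\\.pair$", files)),
    ndf        = any(grepl("\\.ndf$", files)),
    pos        = any(grepl("\\.pos$", files)),
    sample_key = "sample_key.txt" %in% files,
    source     = "noe_source.R" %in% files,
    protocol   = "noe_protocol.R" %in% files)
}

status <- check_folder(exp_dir)
```

Any entry of status that is FALSE points to a file type that still has to be copied into the folder.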

The fully commented R script files can be downloaded from here: http://genome1.bio.med.uni-muenchen.de/downloads.htm

An example folder containing the raw data and layout information, as well as sample_key and R-script files can be obtained here: http://genome1.bio.med.uni-muenchen.de/dl/protocol.zip

Creating and editing of the sample key file

The use of a sample description file facilitates recycling of the R scripts because experiment-specific data are kept outside of the script. The file 'sample_key.txt' provided here is a tab-delimited table that can be created or edited in Microsoft Excel (Figure 2). Please note that the file has to be saved as a tab-delimited text file. For this protocol 3 columns are required: a column named 'file' that contains the name of the raw data file (*.pair); a column 'sample.type' that contains 'e' (enriched) if IP material has been hybridized or 'i' (input) if total input chromatin has been used; and a column 'sample.name' that groups the raw data files from the two channels of the same array.

Figure 2: Contents of the sample key file
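As an alternative to Excel, a sample key with the three required columns can be written and checked directly in R. This is a minimal sketch: the file names and sample names are invented, the second array illustrates a dye swap, and the file is written to a temporary directory rather than to the experiment folder.

```r
# Hypothetical sample key for two arrays; the second one is a dye swap
sample_key <- data.frame(
  file        = c("rep1_532.pair", "rep1_635.pair",
                  "rep2_532.pair", "rep2_635.pair"),
  sample.type = c("i", "e", "e", "i"),
  sample.name = c("rep1", "rep1", "rep2", "rep2"),
  stringsAsFactors = FALSE
)

# Save as a tab-delimited text file and read it back
key_path <- file.path(tempdir(), "sample_key.txt")
write.table(sample_key, key_path, sep = "\t",
            quote = FALSE, row.names = FALSE)
key <- read.delim(key_path, stringsAsFactors = FALSE)

# Basic sanity checks on the columns described above
stopifnot(all(c("file", "sample.type", "sample.name") %in% colnames(key)))
stopifnot(all(key$sample.type %in% c("e", "i")))

# Each array should contribute one enriched and one input channel
channels <- table(key$sample.name, key$sample.type)
```

A count other than 1 anywhere in channels would indicate a mislabelled or missing channel.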

Reading the raw data

Launch R and navigate to your experiment folder. In R.app for OSX this is simply achieved by dragging the folder icon onto the R icon in the launch bar. Otherwise set the working directory accordingly using the R command "setwd()". To check that you are in the proper working directory, list the contents of the folder using "list.files()".

> setwd("/Path/to/your/dataset1")
> list.files()

Reading of both raw data and layout information is performed by calling specific functions that are provided in the file 'noe_source.R' and become available once the script file has been sourced. First, the sample key file is read to extract information on the raw data files. The raw signal intensities are then loaded into a matrix with columns corresponding to hybridization channels and rows corresponding to probes. The name of each row is the Probe ID as provided in the NimbleGen layout.
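How such a signal matrix can be assembled is sketched below on two tiny mock files. This is not the code in noe_source.R: the function read_pair is hypothetical, and the sketch assumes that a *.pair file is tab-delimited, starts with a header line beginning with '#', and has columns named PROBE_ID and PM. Real files contain further columns.

```r
# Write a tiny mock .pair file (assumed format, hypothetical content)
write_mock_pair <- function(path, ids, signal) {
  writeLines(c("# designname=mock_design",
               "PROBE_ID\tPM",
               paste(ids, signal, sep = "\t")), path)
}

# Read one channel and return the signals named by probe ID
read_pair <- function(path) {
  d <- read.delim(path, comment.char = "#", stringsAsFactors = FALSE)
  setNames(d$PM, d$PROBE_ID)
}

paths <- file.path(tempdir(), c("mock_532.pair", "mock_635.pair"))
write_mock_pair(paths[1], c("P3", "P1", "P2"), c(310, 110, 210))
write_mock_pair(paths[2], c("P1", "P2", "P3"), c(120, 220, 320))

# Align every channel on the same sorted probe IDs, then bind columns
probe_ids <- sort(names(read_pair(paths[1])))
cols <- lapply(paths, function(p) read_pair(p)[probe_ids])
signals <- do.call(cbind, cols)
rownames(signals) <- probe_ids
colnames(signals) <- basename(paths)
```

Indexing each channel by probe ID before binding guards against files that list the probes in a different order.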

The name of the layout is extracted from the header line of the first raw signal file read. It should correspond to the name of the *.ndf and *.pos files in the folder. Parts of the layout and position information are then read into a data frame (comment 1). Non-experimental probe information is omitted from both signal and layout tables. Only experimental probes are included in the processing pipeline; therefore the number of table rows, each corresponding to one probe, is smaller than the actual number of probes on the array. Both the signal matrix and the layout frame are sorted by probe ID. This ensures that a probe has the same row index in both tables.

After reading of the data, the tables for sample description, raw signals, and layout are saved into separate R data files. This speeds up reading of the raw data next time they are needed.
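The sorting and caching steps described above can be sketched as follows. The objects are small mock tables with hypothetical column names, and the use of save() and load() with an .RData file is an assumption about how the protocol scripts store the data.

```r
# Mock signal matrix and layout frame in different row orders
signals <- matrix(c(310, 110, 210, 320, 120, 220), ncol = 2,
                  dimnames = list(c("P3", "P1", "P2"),
                                  c("mock_532", "mock_635")))
layout <- data.frame(id  = c("P2", "P3", "P1"),
                     chr = c("chrX", "chrX", "chr2L"),
                     pos = c(200L, 300L, 100L),
                     stringsAsFactors = FALSE)

# Sort both tables by probe ID so that row indices correspond
signals <- signals[order(rownames(signals)), , drop = FALSE]
layout  <- layout[order(layout$id), ]
rownames(layout) <- NULL
stopifnot(identical(rownames(signals), layout$id))

# Cache the signal matrix in an R data file and restore it
rdata <- file.path(tempdir(), "raw_signals.RData")
save(signals, file = rdata)
saved <- signals
rm(signals)
load(rdata)   # restores the object 'signals'
```

In a later session only the load() call is needed, which is much faster than parsing the raw text files again.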
