High-Dimensional Models and Microarray Data Analysis
Jian Huang
Over the past decade, DNA microarray technology has attracted tremendous interest in basic science labs, clinical labs, and industry. Microarrays can monitor the expression of thousands of genes simultaneously and have many important applications in biological and biomedical research. They are used, for example, to characterize disease states, determine the effects of certain treatments, and examine the process of development. Microarrays are also increasingly used to identify genes and genomic regions that increase the risk of common and complex diseases such as diabetes, heart disease, and autism.
While microarrays have become a routine tool in research, analysis of microarray data is challenging. A hallmark of microarray data is high dimensionality: a typical microarray study surveys thousands of genes or more, but the sample size is often at most in the hundreds. This is called a "large p, small n" problem in statistics, where p refers to the number of variables (genes) and n refers to the sample size (the number of subjects participating in the study). Standard methods are not applicable to such problems since they require that p be smaller than n. Two other important features of microarray data are sparsity and the presence of cluster structure. Sparsity arises because the number of genes important to a trait or disease is usually small. The task of finding such genes for a given trait can be formulated as a variable selection problem in statistical modeling. Cluster structure is present because genes in the same biological pathways or functional groups tend to be correlated. Incorporating such information in statistical modeling facilitates the identification of statistically and biologically significant patterns in the data.
I have been working on approaches for correlating microarray data with a clinical outcome. These methods take into account the features described above. The focus is on developing variable selection methods for identifying genes and pathways that are associated with a disease, such as age-related macular degeneration, or with a disease-related quantitative trait, such as the survival time of lymphoma patients.
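To make the "large p, small n" variable selection problem concrete, the following is a minimal sketch using the lasso (an L1-penalized least-squares fit) solved by coordinate descent. The lasso is one widely used sparse selection technique and is used here only as an illustration; it is not necessarily the method developed in this research. The simulated data, the penalty level `lam`, and the function names are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * x_j' x_j for each column
    resid = y - X @ beta                   # current residual vector
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (excluding j).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)   # keep the residual in sync
            beta[j] = new
    return beta

# Simulated data (an assumption): n = 50 subjects, p = 500 "genes",
# of which only the first three truly affect the outcome.
rng = np.random.default_rng(0)
n, p = 50, 500
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 2.5]
y = X @ true_beta + 0.1 * rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=0.5)
selected = np.flatnonzero(beta_hat)        # indices of genes kept in the model
print("number of selected variables:", len(selected))
```

Ordinary least squares has no unique solution here because p exceeds n, whereas the L1 penalty sets most coefficients exactly to zero. In practice the penalty level would be chosen by cross-validation or a similar criterion rather than fixed in advance.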

Image from "A Primer of Genome Science" by Greg Gibson and Spencer Muse (Sinauer Associates, 2002).
The image above is part of a cDNA microarray. Each pixel in the image represents part of the DNA sequence of a gene. Red pixels indicate genes with relatively higher expression in the treatment sample than in the mutant sample. The dendrogram on the left side indicates that genes tend to be clustered according to their expression across the samples; the one on the top suggests that samples can also be clustered using gene expression.
MRI Tissue Classification of the Human Brain
Dai Feng and Luke Tierney
Magnetic Resonance Imaging (MRI) is an important non-invasive tool for
understanding the structure and function of the human brain. One
important task is to use MR images to identify the major tissue types,
white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF),
within a particular subject's brain. This is valuable, for example, in
detecting diseases, in preparing for surgery, and in aiding
subsequent functional studies of the brain.
An MR image is based on a discretization of the viewing area into a
3-dimensional array of volume elements, or voxels. Typical images
consist of a 256 x 256 x 256 array of one cubic millimeter voxels.
Segmentation is usually based on a T1-weighted image providing one
measurement for each voxel. The measurements contain some noise that
is usually modeled as normally distributed and independent from voxel
to voxel. A simple model views each voxel as homogeneous, belonging
entirely to one of the three major tissue types; the measurements are
thus normally distributed with means depending on the tissue types of
their voxels. The tissue types are not known and need to be identified
from the image. Since nearby voxels tend to be of the same tissue
type, a Markov random field model can be used to capture the spatial
similarity of voxels. A Markov chain Monte Carlo approach can be used
to fit this model.
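As a rough illustration of the simple model just described, the sketch below runs a Gibbs sampler (one kind of Markov chain Monte Carlo) on a small 2-D image, combining a normal likelihood with a Potts-type Markov random field prior over 4-neighbor pixels. The tissue means, noise level `sigma`, interaction strength `beta`, and the 2-D toy image are assumptions chosen for the example, and the parameters are treated as known; this is not the authors' implementation, which works on 3-D volumes.

```python
import numpy as np

def gibbs_segment(img, means, sigma, beta, n_sweeps, rng):
    """Label each pixel by its most frequent tissue type over Gibbs sweeps."""
    means = np.asarray(means, dtype=float)
    K = len(means)
    H, W = img.shape
    # Start from the pointwise nearest-mean classification.
    labels = np.abs(img[..., None] - means).argmin(axis=-1)
    counts = np.zeros((H, W, K))
    rows, cols = np.arange(H)[:, None], np.arange(W)[None, :]
    for sweep in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                # Count how many of the 4 neighbors carry each label.
                nb = np.zeros(K)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    a, b = i + di, j + dj
                    if 0 <= a < H and 0 <= b < W:
                        nb[labels[a, b]] += 1
                # Log full conditional: normal likelihood + Potts prior.
                logp = -0.5 * ((img[i, j] - means) / sigma) ** 2 + beta * nb
                prob = np.exp(logp - logp.max())
                prob /= prob.sum()
                labels[i, j] = rng.choice(K, p=prob)
        if sweep >= n_sweeps // 2:          # discard the first half as burn-in
            counts[rows, cols, labels] += 1
    return counts.argmax(axis=-1)

# Toy image (an assumption): three vertical bands standing in for CSF, GM, WM.
rng = np.random.default_rng(1)
means, sigma = (30.0, 80.0, 120.0), 15.0
truth = np.repeat(np.array([0, 1, 2]), 6)[None, :].repeat(18, axis=0)
img = rng.normal(np.asarray(means)[truth], sigma)

seg = gibbs_segment(img, means, sigma, beta=0.8, n_sweeps=30, rng=rng)
pointwise = np.abs(img[..., None] - np.asarray(means)).argmin(axis=-1)
acc_mrf = (seg == truth).mean()
acc_pointwise = (pointwise == truth).mean()
print("MRF accuracy:", acc_mrf, "pointwise accuracy:", acc_pointwise)
```

The neighbor term `beta * nb` is what encodes spatial similarity: a label shared by more neighbors becomes more probable, so isolated misclassifications caused by noise tend to be smoothed away.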
A more realistic model than the one just described would take into
account the fact that the volume elements are not homogeneous; while
some may contain only one tissue type, others on the interface will
contain two or possibly three different tissue types. Our approach to
this problem is to construct a higher resolution image in which each
voxel is divided into 8 subvoxels. For each voxel the measured value
is the sum of the unobserved measurements for the subvoxels. The
subvoxels are in turn assumed to be homogeneous and follow the simpler
model described above. This approach provides more accurate tissue
classification and also allows more effective estimation of the
proportion of each voxel that belongs to each of the major
tissue types.
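The relationship between subvoxels and voxels can be sketched as a small forward model: each voxel is split into 2 x 2 x 2 = 8 subvoxels, each subvoxel has a single tissue label, the voxel measurement is the sum of the subvoxel measurements, and the tissue proportions of a voxel are the fractions of its subvoxels with each label. The scaling used below (each subvoxel has mean `mu / 8` and standard deviation `sigma / sqrt(8)`, so that a pure voxel has mean `mu` and standard deviation `sigma`) is an assumption made for the example, as are the function names and tissue means.

```python
import numpy as np

def voxel_proportions(sub_labels, n_tissues):
    """Fraction of each voxel's 8 subvoxels that belongs to each tissue type."""
    D, H, W = sub_labels.shape
    onehot = np.eye(n_tissues)[sub_labels]               # shape (D, H, W, K)
    blocks = onehot.reshape(D // 2, 2, H // 2, 2, W // 2, 2, n_tissues)
    return blocks.mean(axis=(1, 3, 5))                   # average over 2x2x2

def simulate_voxels(sub_labels, means, sigma, rng):
    """Voxel measurement = sum of the 8 unobserved subvoxel measurements."""
    means = np.asarray(means, dtype=float)
    # Assumed scaling: a pure voxel then has mean mu and std sigma.
    sub = rng.normal(means[sub_labels] / 8.0, sigma / np.sqrt(8.0))
    D, H, W = sub.shape
    return sub.reshape(D // 2, 2, H // 2, 2, W // 2, 2).sum(axis=(1, 3, 5))

# One voxel on a tissue interface: 6 subvoxels of type 0, 2 of type 1.
sub_labels = np.zeros((2, 2, 2), dtype=int)
sub_labels[0, 0, :] = 1
means = (30.0, 80.0, 120.0)          # hypothetical CSF, GM, WM intensities

props = voxel_proportions(sub_labels, n_tissues=3)
noiseless = simulate_voxels(sub_labels, means, 0.0, np.random.default_rng(0))
print(props[0, 0, 0])     # tissue proportions of the single voxel
print(noiseless[0, 0, 0])  # its expected intensity, a mixture of the means
```

Fitting the model runs in the opposite direction: the subvoxel labels are unobserved and must be inferred from the voxel sums, after which the proportions follow directly from the inferred labels.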
The image on the left shows a coronal slice of a T1-weighted MR image
of a brain, and the image on the right shows the corresponding tissue
classifications with CSF shown in dark gray, GM in medium gray, and WM
in light gray.

Coarsened-Data Statistical Methods for Spatial Epidemiology
Dale Zimmerman
Estimating intensity and estimating spatial variation in relative risk are important inference problems in spatial epidemiologic studies. A standard data assimilation component of these studies is the assignment of a geocode, i.e., point-level spatial coordinates, to the address of each subject in the study population. Unfortunately, when geocoding is performed by the standard automated method of street-segment matching to a georeferenced road file and subsequent interpolation, it is rarely completely successful. Typically, 10% to 30% of the addresses in the study population, and even higher percentages in particular subgroups, fail to geocode, potentially leading to a selection bias, called geographic bias, and to an inefficient analysis. Missing-data methods could be considered for analyzing such data; however, since some geographic information coarser than a point (e.g., a zip code) is almost always observed for the addresses that fail to geocode, a coarsened-data analysis is more appropriate. Recently I have been developing methodology for estimating spatial intensity and relative risk functions from coarsened geocoded data. With this new methodology, coarsened-data analyses have shown substantial improvements in estimation quality relative to analyses of only the observations that geocode.
For example, using data from a rural health study in Iowa in which only 64% of rural addresses and 85% of non-rural addresses geocoded, but imprecise locational information was available for all addresses, I obtained a kernel-based intensity estimate using only the data that geocoded and a coarsened-data intensity estimate using all the data. Pointwise ratios of each of these two estimates to the complete-data kernel intensity estimate are displayed in the figure. The coarsened-data estimate more closely approximates the complete-data estimate; in fact its integrated absolute error is less than half that of the incomplete-data estimate.
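The quantities compared above can be sketched on simulated data: a Gaussian kernel intensity estimate, the pointwise ratio of the geocoded-only estimate to the complete-data estimate, and the integrated absolute error. The sketch does not implement the coarsened-data estimator itself, whose details are not given here. The uniform point pattern, the bandwidth `h`, the split of the unit square into a "rural" left half and a "non-rural" right half, and the function names are assumptions; only the 64% and 85% geocoding rates are taken from the example above.

```python
import numpy as np

def kernel_intensity(points, grid, h):
    """Gaussian kernel intensity estimate at each grid location (2-D)."""
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * h * h)).sum(axis=1) / (2.0 * np.pi * h * h)

def integrated_abs_error(est, ref, cell_area):
    """Riemann-sum approximation to the integral of |est - ref|."""
    return np.abs(est - ref).sum() * cell_area

# Evaluation grid: centers of an m x m partition of the unit square.
m = 40
centers = (np.arange(m) + 0.5) / m
gx, gy = np.meshgrid(centers, centers)
grid = np.column_stack([gx.ravel(), gy.ravel()])
cell_area = 1.0 / m ** 2

# Simulated addresses (an assumption): uniform on the unit square,
# with the left half treated as "rural".
rng = np.random.default_rng(2)
n = 800
points = rng.random((n, 2))
rural = points[:, 0] < 0.5
geocoded = rng.random(n) < np.where(rural, 0.64, 0.85)

h = 0.1
complete = kernel_intensity(points, grid, h)
geo_only = kernel_intensity(points[geocoded], grid, h)

ratio = geo_only / complete                 # pointwise ratio, as in the figure
iae_geo = integrated_abs_error(geo_only, complete, cell_area)
left_mean = ratio[grid[:, 0] < 0.3].mean()
right_mean = ratio[grid[:, 0] > 0.7].mean()
print("mean ratio, rural side:", left_mean, "non-rural side:", right_mean)
print("integrated absolute error of geocoded-only estimate:", iae_geo)
```

Because geocoding fails more often in the rural half, the geocoded-only estimate understates intensity there more severely, which is the geographic bias described above. A coarsened-data estimate would be compared with the complete-data estimate in the same way.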