Gene Expression Experimental Data Analysis
Goals
Get familiar with exploratory data analysis techniques:
- hierarchical clustering (heatmaps and line plots)
- principal component analysis
Data
- differential expression of E. coli cells in biofilm versus in suspension (i.e., cells clumped together vs. cells floating freely in solution)
- GSE3905 dataset from HW7 (article)
Preprocessing
A preview
There are quite a few tab-delimited fields:
Fields
| Time slot | Biofilm | Suspension |
|---|---|---|
| 4 h | GSM88912 | GSM88916 |
| 7 h | GSM88913 | GSM88917 |
| 15 h | GSM88914 | GSM88918 |
| 24 h | GSM88915 | GSM88919 |
Format
| Field | Description |
|---|---|
| 0 | IDREF |
| 1 | IDENTIFIER |
| 2 | 15h-suspension |
| 3 | 15h-biofilm |
| 4 | 24h-suspension |
| 5 | 24h-biofilm |
| 6 | 7h-suspension |
| 7 | 7h-biofilm |
| 8 | 4h-suspension |
| 9 | 4h-biofilm |
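To make the field layout concrete, here is a minimal Perl sketch that splits one tab-delimited line and looks values up by time point and condition. The column indices come from the Format table above; the helper name `parse_line`, the `%COLUMN` lookup, and the sample values are illustrative assumptions, not part of the dataset.

```perl
use strict;
use warnings;

# Column index for each (time point, condition), copied from the Format table.
my %COLUMN = (
    15 => { suspension => 2, biofilm => 3 },
    24 => { suspension => 4, biofilm => 5 },
    7  => { suspension => 6, biofilm => 7 },
    4  => { suspension => 8, biofilm => 9 },
);

# Split one tab-delimited data line into a record (hypothetical helper).
sub parse_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                  # strip the line ending
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    my %record = (idref => $fields[0], identifier => $fields[1]);
    for my $hour (keys %COLUMN) {
        for my $condition (keys %{ $COLUMN{$hour} }) {
            $record{$hour}{$condition} = $fields[ $COLUMN{$hour}{$condition} ];
        }
    }
    return \%record;
}

# Example with made-up values (not taken from GSE3905).
my $rec = parse_line("probe_1\tgeneA\t1.0\t2.0\t3.0\t4.0\t5.0\t6.0\t7.0\t8.0\n");
print "$rec->{idref} 4h biofilm: $rec->{4}{biofilm}\n";
```

Note that the columns are not in chronological order (15 h, 24 h, 7 h, 4 h), so looking indices up by name is less error-prone than hard-coding them wherever they are used.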
Tasks
Write a data parser (~30 lines of Perl):
- Read the data into Perl
- Calculate the difference between suspension and biofilm expression at each time point (4, 7, 15 and 24 h)
- Output a CSV file like this:
| IDREF | diff4 | diff7 | diff15 | diff24 |
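
The three parser steps listed under Tasks could be sketched roughly as follows. This is one possible approach under stated assumptions, not the reference solution: the sign convention (suspension minus biofilm), the decision to skip rows with missing or non-numeric values, and the helper name `diff_row` are all assumptions. The column indices are taken from the Format table.

```perl
use strict;
use warnings;

# [suspension, biofilm] column pairs in output order diff4, diff7, diff15, diff24.
my @PAIRS = ([8, 9], [6, 7], [2, 3], [4, 5]);

# Turn one tab-delimited line into a CSV row; return nothing for lines
# that are not complete numeric data rows (hypothetical helper).
sub diff_row {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    my @f = split /\t/, $line, -1;
    return if @f < 10;
    my @diffs;
    for my $pair (@PAIRS) {
        my ($susp, $bio) = @f[@$pair];
        # Assumption: skip the row if a value is missing or not a number
        # (this also skips a header line).
        for my $v ($susp, $bio) {
            return unless $v =~ /\A-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?\z/;
        }
        push @diffs, $susp - $bio;    # assumed direction: suspension minus biofilm
    }
    return join ',', $f[0], @diffs;
}

# Read the data from standard input and write the CSV to standard output.
print "IDREF,diff4,diff7,diff15,diff24\n";
while (my $line = <STDIN>) {
    my $row = diff_row($line);
    print "$row\n" if defined $row;
}
```

Saved under a hypothetical name such as `parser.pl`, it could be run as `perl parser.pl < data.txt > diffs.csv`, where both file names are placeholders.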