Materials: Presentation

Videos: Data Exploratory Classification DEA (Statistics) DEA (ANOVA) Enrichment Single Cell Task2

Webex E-mail: Skype: pvn.public


0. Data and software download

Before we start, let’s download the data and install the software for ThermoFisher / Affymetrix microarray data analysis.

Small LUSC dataset

TAC software

Microarray raw CEL files

Installation instructions

If you use Windows, please download the TAC software and run the installer. On macOS and Linux you will need additional software in order to run TAC. Options:

  • WineHQ - free software.

  • Crossover - commercial software with 14-day trial version available.

1. Simple example of data analysis

Let us try to work with gene expression data directly in Excel. This is definitely not the best choice, but it will help you get a feel for the data. Please download a subset of TCGA LUSC data: lusc20.txt

It contains gene expression for 10 normal and 10 cancer lung tissues.

Task 1:

  1. Download the data and save as txt file.

  2. Import the data into Excel. Make sure that you use “Open...” from within Excel; otherwise gene names (e.g. SEPT11) will be converted to dates!

Alternatively, please use the prepared lusc20.xlsx file (on some systems the problems with the decimal separator are severe!).

  3. Calculate:

     • average expression for each gene: global, normal, cancer =AVERAGE()

     • exclude genes that are not detected (this is already done for lusc20.xlsx)

     • log fold change: logFC = MeanTumour - MeanNormal (“-”, not “/”, as we work in log scale)

     • perform a t-test comparing tumour and normal tissues =T.TEST()

     • assign a rank to each p-value (ascending, so that the smallest p-value gets rank 1), either manually or by =RANK.AVG()

     • estimate FDR: FDR = m * pv / k, where m - number of genes, pv - p-value, k - rank (1..m)

  4. Draw several plots used for visualization in transcriptomics:

     • MA-plot (x: AverageGlobal, y: logFC)

     • Volcano-plot (x: logFC, y: -log10(FDR))

  5. Run online enrichment analysis. Select the 1000 genes with the lowest FDR (ensure that they all have FDR<0.05) and feed them to Enrichr. Investigate Pathways: Reactome 2016 and Ontologies: GO Biological Process 2018.
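For those who want to cross-check their spreadsheet, the calculations of Task 1 can be sketched in plain Python. This is only an illustration under assumptions: the function names (log_fc, average_ranks, fdr, volcano_point) and the toy numbers are invented here and are not taken from lusc20.txt, and the p-values are assumed to come from a t-test done elsewhere (e.g. =T.TEST() in Excel). The FDR follows the formula of the task literally; standard Benjamini-Hochberg implementations additionally cap the values at 1 and enforce monotonicity.

```python
import math
from statistics import mean


def log_fc(tumour, normal):
    """logFC for log-scale data: a difference of means, not a ratio."""
    return mean(tumour) - mean(normal)


def average_ranks(values):
    """Ascending ranks (1 = smallest); ties share the average rank,
    similar to Excel's RANK.AVG with ascending order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def fdr(pvalues):
    """FDR = m * pv / k, where m is the number of genes and k the rank."""
    m = len(pvalues)
    return [m * p / k for p, k in zip(pvalues, average_ranks(pvalues))]


def volcano_point(logfc_value, fdr_value):
    """Coordinates of one gene on the volcano plot."""
    return logfc_value, -math.log10(fdr_value)


# Toy numbers, invented for illustration only.
tumour = [8.0, 9.0, 10.0]
normal = [6.0, 7.0, 8.0]
pvalues = [0.01, 0.04, 0.03, 0.002]

print("logFC:", log_fc(tumour, normal))
print("FDR:", fdr(pvalues))
```

The MA-plot uses the same logFC on the y-axis, with the global average expression of the gene on the x-axis.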

2. TAC software (optional: we do it in Webex)

Transcriptome Analysis Console (TAC) is a user-friendly tool for the analysis of former Affymetrix arrays (Affymetrix is now part of ThermoFisher). Please install it and import the CEL files from SCC_CEL.zip.

You might need to register in order to download library files. If you do not have an account, use login: Ask for the password.

Task 2:

  1. Annotate and import the data. Samples with “N” in the name come from normal tissue. In order to speed up the analysis, select “Gene” instead of “Gene+Exon” (optional).

  2. See PCA visualization

  3. Perform differential expression analysis (DEA)

  4. Export the results of DEA and the data

  5. Check functional annotation of the significant genes

3. Independent work

This task is meant to give you a better command of the material. Please perform an analysis of a time-course experiment on the IFNg-stimulated A375 cell line. The dataset is discussed in Nazarov et al, 2013. Do not hesitate to contact me with questions.

The data are available in 2 versions:

  1. Text file mrna_ifng.amd.txt.

  2. Excel file mrna_ifng.amd.xlsx

Please try (1) first to explore the issues of importing into Excel. Hint: the decimal separator is “.” and the GeneSymbol column should be imported as “Text”. If you find it difficult - use (2).
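The same two import rules (“.” as decimal separator, GeneSymbol kept as text) can be illustrated outside Excel with a short Python sketch. It assumes a tab-delimited file with a header row containing a GeneSymbol column; the real layout of mrna_ifng.amd.txt may differ. The function name read_expression_table, the column names T00 and T24, and the numbers in the sample are made up for the example.

```python
import csv
import io


def read_expression_table(handle, text_columns=("GeneSymbol",)):
    """Read a tab-delimited table; keep text columns untouched,
    convert everything else to float where possible."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        row = {}
        for name, value in record.items():
            if name in text_columns:
                row[name] = value  # stays text, so "SEPT11" is not reinterpreted
            else:
                try:
                    row[name] = float(value)  # float() expects "." as separator
                except (TypeError, ValueError):
                    row[name] = value  # leave non-numeric cells as they are
        rows.append(row)
    return rows


# Hypothetical sample standing in for the real file.
sample = "GeneSymbol\tT00\tT24\nSEPT11\t7.25\t7.5\nMARCH1\t3.1\t4.0\n"
table = read_expression_table(io.StringIO(sample))
print(table[0])
```

For the real file, replace the io.StringIO object with open("mrna_ifng.amd.txt", newline="").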

Task 3:

  • Remove transcript clusters (rows) which have no GeneSymbol or whose maximal expression over all samples is below 5

  • Find mRNAs activated by IFNg stimulation at 24h: T24 vs T00. Consider genes with FDR<0.01

  • Put these genes (GeneSymbol) to Enrichr and investigate their biological functions
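The filtering rule in the first step of Task 3 can be written as a small Python check, shown here as a sketch rather than a reference solution. The function name keep_row, the missing_markers parameter and the example rows are hypothetical; in particular, how a missing GeneSymbol is encoded in the real file (empty cell or some placeholder) is an assumption to verify against the data.

```python
def keep_row(gene_symbol, expression, threshold=5.0, missing_markers=("",)):
    """Return True if the row survives the Task 3 filter.

    A row is removed when it has no gene symbol, or when its maximal
    expression over all samples is below the threshold.
    """
    symbol = (gene_symbol or "").strip()
    if symbol in missing_markers:
        return False
    return max(expression) >= threshold


# Invented rows: (GeneSymbol, expression values over all samples).
rows = [
    ("GENE_A", [6.2, 7.9]),  # kept: annotated and expressed
    ("", [8.0, 8.1]),        # removed: no GeneSymbol
    ("GENE_B", [3.0, 4.9]),  # removed: maximum below 5
]

kept = [symbol for symbol, values in rows if keep_row(symbol, values)]
print(kept)
```

In Excel the same logic corresponds to a helper column with =MAX() over the sample columns, followed by a filter on that column and on GeneSymbol.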

Additional (fun) materials

  1. Principal Component Analysis Explained Visually by Victor Powell

  2. How to Use t-SNE Effectively by Wattenberg et al., Distill, 2016.

  3. Understanding UMAP by Andy Coenen and Adam Pearce

  4. Dimensionality Reduction for Data Visualization: PCA vs TSNE vs UMAP vs LDA by Sivakar Sivarajah

  5. Tensorflow Playground to understand classification with neural networks.

  6. ANOVA via Partek Shoes Example.