{ "cells": [ { "cell_type": "markdown", "id": "improved-explorer", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Cancer RNA-Seq Data Clustering" ] }, { "cell_type": "markdown", "id": "christian-experience", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "In this notebook, we will learn to cluster data, i.e., to divide the data points into distinct groups so that the variation within each group is relatively small and the variation between groups is larger. This task belongs to \"unsupervised learning\" because the training data are not labeled. Unlike classification, where the goal is to match known answers, here we try to find patterns in the data without additional information." ] }, { "cell_type": "markdown", "id": "conventional-parliament", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "The dataset that we will use as our example is the [gene expression cancer RNA-Seq dataset](https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq). It contains the expression levels of 20531 genes from 801 patients with different types of tumors. Our goal is to analyze the data and cluster the patients into groups that may correspond to different tumor types. The dataset in fact comes with labels: the patients were diagnosed with one of 5 tumor types (BRCA, KIRC, COAD, LUAD, or PRAD). But when we analyze the data, we will pretend that the diagnoses are not known (or possibly not all correct) and see how well we can recover them." ] }, { "cell_type": "markdown", "id": "golden-combine", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Load data" ] }, { "cell_type": "markdown", "id": "delayed-porter", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "The dataset can be loaded and printed using `pandas`, a popular package for data analysis and manipulation. 
The data are loaded into `pandas` `DataFrame`s, which `numpy` can operate on just like ordinary arrays (in most cases)." ] }, { "cell_type": "code", "execution_count": 1, "id": "solved-nothing", "metadata": { "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "lined-resistance", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('data/cancer-data.csv', header=0, index_col=0) # load data, using row 0 as column names and column 0 as row names\n", "labels = pd.read_csv('data/cancer-labels.csv', header=0, index_col=0) # load labels; used later to check the results" ] }, { "cell_type": "code", "execution_count": 3, "id": "flush-demand", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "|  | gene_0 | gene_1 | gene_2 | gene_3 | gene_4 | gene_5 | gene_6 | gene_7 | gene_8 | gene_9 | ... | gene_20521 | gene_20522 | gene_20523 | gene_20524 | gene_20525 | gene_20526 | gene_20527 | gene_20528 | gene_20529 | gene_20530 |\n", "
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n", "
| sample_0 | 0.0 | 2.017209 | 3.265527 | 5.478487 | 10.431999 | 0.0 | 7.175175 | 0.591871 | 0.0 | 0.0 | ... | 4.926711 | 8.210257 | 9.723516 | 7.220030 | 9.119813 | 12.003135 | 9.650743 | 8.921326 | 5.286759 | 0.000000 |\n", "
| sample_1 | 0.0 | 0.592732 | 1.588421 | 7.586157 | 9.623011 | 0.0 | 6.816049 | 0.000000 | 0.0 | 0.0 | ... | 4.593372 | 7.323865 | 9.740931 | 6.256586 | 8.381612 | 12.674552 | 10.517059 | 9.397854 | 2.094168 | 0.000000 |\n", "
| sample_2 | 0.0 | 3.511759 | 4.327199 | 6.881787 | 9.870730 | 0.0 | 6.972130 | 0.452595 | 0.0 | 0.0 | ... | 5.125213 | 8.127123 | 10.908640 | 5.401607 | 9.911597 | 9.045255 | 9.788359 | 10.090470 | 1.683023 | 0.000000 |\n", "
| sample_3 | 0.0 | 3.663618 | 4.507649 | 6.659068 | 10.196184 | 0.0 | 7.843375 | 0.434882 | 0.0 | 0.0 | ... | 6.076566 | 8.792959 | 10.141520 | 8.942805 | 9.601208 | 11.392682 | 9.694814 | 9.684365 | 3.292001 | 0.000000 |\n", "
| sample_4 | 0.0 | 2.655741 | 2.821547 | 6.539454 | 9.738265 | 0.0 | 6.566967 | 0.360982 | 0.0 | 0.0 | ... | 5.996032 | 8.891425 | 10.373790 | 7.181162 | 9.846910 | 11.922439 | 9.217749 | 9.461191 | 5.110372 | 0.000000 |\n", "
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n", "
| sample_796 | 0.0 | 1.865642 | 2.718197 | 7.350099 | 10.006003 | 0.0 | 6.764792 | 0.496922 | 0.0 | 0.0 | ... | 6.088133 | 9.118313 | 10.004852 | 4.484415 | 9.614701 | 12.031267 | 9.813063 | 10.092770 | 8.819269 | 0.000000 |\n", "
| sample_797 | 0.0 | 3.942955 | 4.453807 | 6.346597 | 10.056868 | 0.0 | 7.320331 | 0.000000 | 0.0 | 0.0 | ... | 6.371876 | 9.623335 | 9.823921 | 6.555327 | 9.064002 | 11.633422 | 10.317266 | 8.745983 | 9.659081 | 0.000000 |\n", "
| sample_798 | 0.0 | 3.249582 | 3.707492 | 8.185901 | 9.504082 | 0.0 | 7.536589 | 1.811101 | 0.0 | 0.0 | ... | 5.719386 | 8.610704 | 10.485517 | 3.589763 | 9.350636 | 12.180944 | 10.681194 | 9.466711 | 4.677458 | 0.586693 |\n", "
| sample_799 | 0.0 | 2.590339 | 2.787976 | 7.318624 | 9.987136 | 0.0 | 9.213464 | 0.000000 | 0.0 | 0.0 | ... | 5.785237 | 8.605387 | 11.004677 | 4.745888 | 9.626383 | 11.198279 | 10.335513 | 10.400581 | 5.718751 | 0.000000 |\n", "
| sample_800 | 0.0 | 2.325242 | 3.805932 | 6.530246 | 9.560367 | 0.0 | 7.957027 | 0.000000 | 0.0 | 0.0 | ... | 6.403075 | 8.594354 | 10.243079 | 9.139459 | 10.102934 | 11.641081 | 10.607358 | 9.844794 | 4.550716 | 0.000000 |\n", "
801 rows × 20531 columns
\n", "\n", "|  | Class |\n", "
| --- | --- |\n", "
| sample_0 | PRAD |\n", "
| sample_1 | LUAD |\n", "
| sample_2 | PRAD |\n", "
| sample_3 | PRAD |\n", "
| sample_4 | BRCA |\n", "
| ... | ... |\n", "
| sample_796 | BRCA |\n", "
| sample_797 | LUAD |\n", "
| sample_798 | COAD |\n", "
| sample_799 | PRAD |\n", "
| sample_800 | PRAD |\n", "
801 rows × 1 columns
\n", "