{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SLT-CE-2: Deterministic Annealing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
1. Sections II.A.1 (principled derivation of deterministic annealing) and II.A.3 (Mass-constrained clustering) of 'Deterministic annealing for clustering, compression, classification, regression, and related optimization problems', Kenneth Rose, 1998, http://ieeexplore.ieee.org/document/726788/ \n", "
2. \n", "\n", "
3. The wine data set, http://www3.dsi.uminho.pt/pcortez/wine5.pdf\n", "
4. \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn as skl\n", "from sklearn.utils.validation import check_is_fitted\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import make_blobs\n", "import sklearn.svm as svm\n", "from sklearn import cluster\n", "\n", "import pandas as pd\n", "import numpy as np\n", "from treelib import Tree\n", "\n", "import matplotlib.pyplot as plt\n", "from matplotlib import cm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## Section 4.0\n", "    Complete all problems in this and the previous sections to get a grade of 4.0.\n", "

\n", "\n", "

\n", "    For this exercise it is essential to read reference [1] on deterministic annealing clustering (DAC), since our implementation is based on it. Please briefly summarize what the authors refer to as the preferred implementation of the DAC algorithm.\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Put your markdown text here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "    To avoid headaches with numerical instabilities, we first try our algorithm on the simple artificially generated data set below. Run the block below to have a look at the data. Later, once everything is implemented, we will examine some real-world data.\n", "

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_clusters = 4\n", "ran_s = 42\n", "\n", "# Generate artificial dataset\n", "X, y_true = make_blobs(n_samples=7000, centers=4,\n", " cluster_std=0.3, random_state=ran_s,\n", " center_box=(-8.0, 8.0),\n", " shuffle=False)\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y_true, train_size=6000, random_state=42)\n", "\n", "plt.figure()\n", "plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=40, cmap='viridis')\n", "plt.title(\"Training data\")\n", "\n", "plt.figure()\n", "plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=40, cmap='viridis')\n", "plt.title(\"Test data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "    Implement the fit method of the template class DeterministicAnnealingClustering, according to the contract outlined in its docstring. (The template class lives in the file DA.py, which you can open in your favourite IDE.) It may help to take a look at the get_distance and _calculate_cluster_probs methods and implement them as well. Of course, you are free to change any of these methods and/or write additional ones for your purposes.\n", "    See http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html for complementary information.\n", "
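As a starting point, the assignment step from reference [1] is a temperature-scaled softmax (Gibbs distribution) over squared distances. The following is a minimal sketch only; the function name and array shapes are assumptions, not the template's actual signature. Working in log space and subtracting the row-wise maximum keeps the exponentials from under- or overflowing:

```python
import numpy as np

def calculate_cluster_probs(dist_sq, cluster_probs, T):
    """Gibbs assignment probabilities p(y_j | x_i) at temperature T.

    dist_sq: (n_samples, n_clusters) squared distances ||x_i - y_j||^2
    cluster_probs: (n_clusters,) marginal cluster probabilities p(y_j)
    """
    # log p(y_j | x_i) up to the normalizer: log p(y_j) - d_ij / T
    log_p = np.log(cluster_probs)[None, :] - dist_sq / T
    # shift by the row-wise max for numerical stability (log-sum-exp trick)
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)
```

As the temperature drops, these probabilities sharpen towards a hard (k-means-like) assignment, which is exactly the annealing behaviour the fit loop relies on.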

\n", "

\n", "    While implementing, you can run the block below to test your implementation.\n", "

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from DA import DeterministicAnnealingClustering\n", "\n", "DAC = DeterministicAnnealingClustering(\n", " n_clusters=n_clusters, random_state=ran_s)\n", "DAC.fit(X_train)\n", "y_DAC = DAC.predict(X_test)\n", "y_DAC_hard = np.argmax(y_DAC, axis=1)\n", "plt.figure()\n", "plt.scatter(X_test[:, 0], X_test[:, 1], c=y_DAC_hard, s=40, cmap='viridis')\n", "plt.title(\"DA clustering\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## Section 4.5\n", "    Complete all problems in this section to earn an additional 0.5 points on top of the previous ones. Note that the maximum total is 6 points.\n", "

\n", "\n", "

\n", "    In this section we implement a plot that helps us better understand the DA method and can also aid in debugging your implementation.\n", "

\n", "
• \n", "    Modify your implementation of the fit method so that the plot_phase_diagram method produces a plot similar to the phase diagram shown in Figure 2 of the reference paper.\n", "
• \n", "
\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "    Produce a phase diagram plot of the expected distortion D, as shown in Figure 2 of reference [1]. For this, extend DAC.fit to save the expected distortion during annealing as an additional attribute self.distortion. You might also want to save the number of effective clusters and the temperature along the way.\n", "
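One way such a plot could be drawn, as a sketch under the assumption that fit has stored lists of temperatures, distortions, and effective cluster counts (the function name and argument layout are illustrative, not the template's API): plot the distortion against the inverse temperature on a log scale, and mark each phase transition where the number of effective clusters changes.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_phase_diagram(temp, distortion, n_eff_clusters):
    """Sketch: distortion D vs. inverse temperature, with split markers."""
    beta = 1.0 / np.asarray(temp, dtype=float)
    fig, ax = plt.subplots()
    ax.plot(beta, np.asarray(distortion, dtype=float))
    # dashed lines at the phase transitions, i.e. where the number of
    # effective clusters changes from one annealing step to the next
    splits = np.flatnonzero(np.diff(n_eff_clusters)) + 1
    for s in splits:
        ax.axvline(beta[s], linestyle="--", color="grey")
    ax.set_xscale("log")
    ax.set_xlabel("1 / T")
    ax.set_ylabel("expected distortion D")
    return fig
```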
\n", "

\n", "\n", "    # In DAC.fit(self, X), save information for each (n-th) annealing step:\n", "    #     self.distortion = [d0, d1, d2, ...]\n", "    #     self.n_eff_clusters = [e0, e1, e2, ...]\n", "    #     self.temp = [t0, t1, t2, ...]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DAC.plot_phase_diagram()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## Section 5.0\n", "    Complete all problems in this section to earn an additional 0.5 points on top of the previous ones.\n", "

\n", "
\n", "Here we implement another plot that helps us better understand the dynamics of the algorithm.\n", "
• \n", " Implement DAC.plot_bifurcation, which should create a bifurcation plot.
\n", "    Modify DAC.fit to keep track of the distances using the tree object DAC.bifurcation_tree. When a cluster splits, it creates two child nodes. Each node should store its centroid vector and the distance to the parent centroid vector. After splitting, the parent node is no longer updated.
\n", " In the bifurcation plot, the horizontal distance of a child node to its parent node should be exactly the distance to the parent centroid vector. The two child nodes should move in opposite directions, i.e. one to the left of the parent and one to the right.\n", "
• \n", "
\n", "\n", "This section can be a bit tedious; feel free to jump ahead to the next sections and come back here later. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DAC.plot_bifurcation()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## Section 5.5\n", "    Complete all problems in this section to earn an additional 0.5 points on top of the previous ones.\n", "

\n", "\n", "

\n", "Now we are ready to use some real-world data. This might need some tweaking and handling of numerical instabilities. Please make sure you understand the data.\n", "
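One common precaution worth considering here (a suggestion, not a requirement of the exercise): standardize the features first. The wine attributes live on very different scales, and large squared distances can make the exponentials exp(-d/T) underflow at low temperatures. A minimal sketch:

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale both splits using statistics of the training split only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```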

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "Read the wine data [3], which contains 11 physicochemical attributes and two labels (quality and color).\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "    Create an instance of your DAC class with n_clusters = 2 and fit it to the first 6000 samples of the wine data set. Record the execution time. Furthermore, create an instance of the sklearn k-means class and fit it with the same parameters, again recording the execution time. Make sure that the hyperparameters (initial temperature, minimum temperature, convergence criteria, noise, etc.) make sense and lead to a reasonable clustering.\n", "

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from DA import read_data_csv\n", "X, y = read_data_csv(\"wine-data.csv\", y_names=[\"quality\", \"color\"])\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y[\"color\"], train_size=6000, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "DAC = DeterministicAnnealingClustering(n_clusters=2, random_state=42)\n", "DAC.fit(X_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "kmeans = cluster.KMeans(n_clusters=2,random_state=42)\n", "kmeans.fit(X_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "y_kmeans = kmeans.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "%%time\n", "y_DAC = DAC.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## Section 6.0\n", "    Complete all problems in this section to earn an additional 0.5 points on top of the previous ones.\n", "

\n", "
\n", "
• Before we can compute the confusion matrix, we need to perform some post-processing on the DAC cluster assignments.\n", " Explain what the function postprocess (defined below) does, and why we need it. To do so, complete the docstring of the function postprocess.\n", "
• \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def postprocess(y_DAC, y_kmeans):\n", " \"\"\"TODO: Add explanation\"\"\"\n", " \n", " y_DAC_hard = np.argmax(y_DAC, axis=1)\n", " \n", " n_clusters = len(np.unique(y_DAC_hard))\n", " dac2kmeans = []\n", " for cluster in range(n_clusters):\n", " argmax = np.argmax(y_DAC[:, cluster])\n", " dac2kmeans.append(y_kmeans[argmax])\n", " \n", " y_DAC_new = []\n", " for dac_label in y_DAC_hard:\n", " y_DAC_new.append(dac2kmeans[dac_label])\n", " \n", " return np.array(y_DAC_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "skl.metrics.confusion_matrix(y_kmeans, postprocess(y_DAC, y_kmeans))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "
\n", "
• Read the docstring of transform method and understand what it does.\n", "
• \n", "
• \n", "    Use DAC.transform and kmeans.transform to transform both X_train and X_test.\n", "
• \n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "X_train_DAC = DAC.transform(X_train)\n", "X_test_DAC = DAC.transform(X_test)\n", "\n", "X_train_kmeans = kmeans.transform(X_train)\n", "X_test_kmeans = kmeans.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
• \n", "    Fit an SVM classifier with default parameters to the untransformed data and to the transformed data. Compare the performance in predicting whether the color of a wine is red or white.\n", "
• \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use a new name so the imported module svm is not shadowed\n", "svm_raw = svm.SVC(random_state=42)\n", "svm_raw.fit(X_train, y_train)\n", "svm_raw.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "svm_DAC = svm.SVC(random_state=42)\n", "svm_DAC.fit(X_train_DAC, y_train)\n", "svm_DAC.score(X_test_DAC, y_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "svm_kmeans = svm.SVC(random_state=42)\n", "svm_kmeans.fit(X_train_kmeans, y_train)\n", "svm_kmeans.score(X_test_kmeans, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
• \n", " Produce two scatter plots, one for X_train_DAC and one for X_train_kmeans.
\n", " Make the marker color indicate the wine color.\n", "
• \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
• \n", "    Create a fixed 2D embedding (e.g. with LLE, t-SNE, or MDS) of the wine data and color the markers according to quality and color. Fit and transform X_train with DAC(n_clusters=2, 3, 4, 5, 6, 7, 8, ...). Produce a plot of the SVM score svm_DAC.score(X_test_DAC, y_test) as a function of n_clusters. Each time, use marker shapes to display the cluster memberships and compare them to the labels color and quality.\n", "
• \n", "
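The sweep over n_clusters could be organized as below. This is only a sketch: the make_dac factory argument is a hypothetical convenience so that any estimator exposing fit and transform (such as your DeterministicAnnealingClustering) can be plugged in.

```python
import matplotlib.pyplot as plt
from sklearn.svm import SVC

def sweep_n_clusters(X_train, y_train, X_test, y_test, ks, make_dac):
    """For each k: fit a clusterer, train an SVM on the transformed data,
    and collect the test score; finally plot score vs. n_clusters."""
    scores = []
    for k in ks:
        dac = make_dac(k)
        dac.fit(X_train)
        clf = SVC(random_state=42)
        clf.fit(dac.transform(X_train), y_train)
        scores.append(clf.score(dac.transform(X_test), y_test))
    fig, ax = plt.subplots()
    ax.plot(list(ks), scores, marker="o")
    ax.set_xlabel("n_clusters")
    ax.set_ylabel("SVM test score")
    return scores
```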
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", " %%time\n", " lle = skl.manifold.LocallyLinearEmbedding(random_state=...)\n", " lle.fit(...)\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
• \n", "    So far, our implementation of DAC assumed that our data is compatible with the Euclidean metric. Argue why this assumption is not justified for the wine data. Suggest a better alternative (no implementation required!).\n", "
• \n", "
\n", "

\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }