References:

- [1] Sections II.A.1 (principled derivation of deterministic annealing) and II.A.3 (mass-constrained clustering) of "Deterministic annealing for clustering, compression, classification, regression, and related optimization problems", Kenneth Rose, 1998, http://ieeexplore.ieee.org/document/726788/
- [3] The wine data set, http://www3.dsi.uminho.pt/pcortez/wine5.pdf

For this exercise, it is of utmost importance to read reference [1] about deterministic annealing clustering (DAC). Our implementation will be based on this reference. Please briefly summarize what the authors refer to as the *preferred implementation* of the DAC algorithm.
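As a reading aid, the core iteration of the mass-constrained algorithm of [1] can be sketched as follows. This is a minimal sketch on made-up 2D data; the function and variable names here are our own and are not part of DA.py:

```python
import numpy as np

def da_step(X, centers, mass, T):
    """One deterministic-annealing iteration at temperature T.

    E-step: Gibbs assignment probabilities p(y|x), weighted by the
    cluster masses p(y) (mass-constrained variant of reference [1]).
    M-step: re-estimate masses and centroids from the soft assignments.
    """
    # squared Euclidean distances, shape (n_samples, n_clusters)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # p(y|x) proportional to p(y) * exp(-d2 / T)
    w = mass * np.exp(-d2 / T)
    p_y_given_x = w / w.sum(axis=1, keepdims=True)
    # updated cluster masses and centroids
    new_mass = p_y_given_x.mean(axis=0)
    new_centers = (p_y_given_x.T @ X) / p_y_given_x.sum(axis=0)[:, None]
    return new_centers, new_mass, p_y_given_x

# two well-separated blobs as toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
mass = np.array([0.5, 0.5])
for _ in range(20):
    centers, mass, probs = da_step(X, centers, mass, T=0.5)
```

At a fixed temperature these two steps are iterated to convergence; the full algorithm then lowers T according to a cooling schedule.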

To avoid headaches with numerical instabilities, we first try our algorithm on simple artificially generated data, as below. Run the block below to have a look at the data. Later, once everything is implemented, we will examine some real-world data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

n_clusters = 4
ran_s = 42

# Generate artificial dataset
X, y_true = make_blobs(n_samples=7000, centers=4,
                       cluster_std=0.3, random_state=ran_s,
                       center_box=(-8.0, 8.0),
                       shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_true, train_size=6000, random_state=42)

plt.figure()
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=40, cmap='viridis')
plt.title("Training data")

plt.figure()
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=40, cmap='viridis')
plt.title("Test data")
```
Implement the **fit** method of the template class DeterministicAnnealingClustering, according to the contract outlined in its docstring. (The template class is in the file **DA.py**, which you can open in your favourite IDE.) For the implementation, it may help to take a look at the **get_distance** and **_calculate_cluster_probs** methods and implement them as well. Of course, you are free to change all these methods and/or write additional methods for your purposes.
You can add more class methods as necessary.
See http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html for complementary information.

While implementing, you can run the block below to test your implementation.

```python
from DA import DeterministicAnnealingClustering

DAC = DeterministicAnnealingClustering(
    n_clusters=n_clusters, random_state=ran_s)
DAC.fit(X_train)
y_DAC = DAC.predict(X_test)
y_DAC_hard = np.argmax(y_DAC, axis=1)
plt.figure()
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_DAC_hard, s=40, cmap='viridis')
plt.title("DA clustering")
plt.show()
```

In this section we implement a plot which will help us better understand the DA method; it can also help with debugging your implementation.

- Modify your implementation of the **fit** method such that the **plot_phase_diagram** method will produce a plot similar to the phase diagram shown in Figure 2 of the reference paper.

- Produce a phase diagram plot of the expected distortion D, as shown in Figure 2 of reference [1]. For this, extend DAC.fit to save the expected distortion during annealing in an additional attribute self.distortion. You might also want to save the number of effective clusters and the temperature along the way.
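A sketch of what such a plot could look like, assuming fit has recorded per-iteration histories (the names `temperatures`, `distortions`, and `n_eff_clusters` are our suggestion, not prescribed by DA.py); here we feed it a made-up history:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_phase_diagram(temperatures, distortions, n_eff_clusters):
    """Plot the expected distortion against inverse temperature and mark
    the phase transitions (cf. Figure 2 of reference [1])."""
    beta = 1.0 / np.asarray(temperatures)
    distortions = np.asarray(distortions)
    n_eff = np.asarray(n_eff_clusters)
    fig, ax = plt.subplots()
    ax.plot(beta, distortions, marker=".")
    ax.set_xscale("log")
    ax.set_xlabel("inverse temperature 1/T")
    ax.set_ylabel("expected distortion D")
    # a phase transition is where the number of effective clusters jumps
    for i in np.flatnonzero(np.diff(n_eff) > 0) + 1:
        ax.axvline(beta[i], linestyle="--", alpha=0.4)
        ax.annotate(f"K = {int(n_eff[i])}", (beta[i], distortions[i]))
    return fig

# made-up history standing in for what DAC.fit would have recorded
T_hist = np.geomspace(10.0, 0.01, 40)
D_hist = np.linspace(5.0, 0.5, 40)
K_hist = np.concatenate([np.ones(10), 2 * np.ones(15), 4 * np.ones(15)])
fig = plot_phase_diagram(T_hist, D_hist, K_hist)
```

With a real history, the dashed lines should coincide with the characteristic kinks in the distortion curve at the critical temperatures.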

Here we implement another plot which helps to better understand the dynamics of the algorithm.
- Implement DAC.plot_bifurcation, which should create a bifurcation plot.

  Modify DAC.fit to keep track of the distances, using the tree object DAC.bifurcation_tree. When a cluster splits, it creates two child nodes. Each node should store its centroid vector and the distance to the parent centroid vector. After splitting, the parent node is no longer updated.

  In the bifurcation plot, the horizontal distance of a child node to its parent node should be exactly the distance to the parent centroid vector. The two child nodes should move in opposite directions, i.e. one to the left of the parent and one to the right.
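One possible shape for such a tree node is sketched below; this is only one design, and DA.py may prescribe another for DAC.bifurcation_tree:

```python
import numpy as np

class BifurcationNode:
    """One cluster in the bifurcation tree (a sketch).

    Each node stores its centroid vector and its distance to the parent
    centroid. Once a node splits it is frozen: only its two children
    keep evolving during the annealing.
    """

    def __init__(self, center, parent=None):
        self.center = np.asarray(center, dtype=float)
        self.parent = parent
        self.children = []
        self.dist_to_parent = 0.0

    def update_center(self, center):
        """Move the centroid and refresh the distance to the parent."""
        self.center = np.asarray(center, dtype=float)
        if self.parent is not None:
            self.dist_to_parent = float(
                np.linalg.norm(self.center - self.parent.center))

    def split(self):
        """Freeze this node and spawn two children at the current centroid."""
        left = BifurcationNode(self.center, parent=self)
        right = BifurcationNode(self.center, parent=self)
        self.children = [left, right]
        return left, right

# usage: the root splits, then its children drift apart during annealing
root = BifurcationNode([0.0, 0.0])
left, right = root.split()
left.update_center([-0.5, 0.0])
right.update_center([0.5, 0.0])
```

Plotting then amounts to walking the tree and drawing each child at its parent's horizontal position plus or minus dist_to_parent.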

Now we are ready to use some real-world data. This might need some tweaking and handling of numerical instabilities. Please make sure you understand the data.
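A typical instability is that the Gibbs weights exp(-d²/T) underflow to zero at low temperature, producing 0/0 in the normalization. A standard remedy (our suggestion, not required by the exercise) is a log-sum-exp shift, which leaves the normalized probabilities unchanged:

```python
import numpy as np

def stable_cluster_probs(d2, mass, T):
    """p(y|x) proportional to mass * exp(-d2/T), computed without underflow.

    Subtracting each sample's maximum logit before exponentiating does
    not change the normalized probabilities (log-sum-exp trick).
    """
    logits = -d2 / T + np.log(mass)
    logits -= logits.max(axis=1, keepdims=True)  # shift for stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

# at T = 0.01 the unshifted np.exp(-d2 / T) underflows for every entry
# except the exact zeros, and would give 0/0 for less extreme inputs
d2 = np.array([[0.0, 5000.0], [4000.0, 0.0]])
p = stable_cluster_probs(d2, mass=np.array([0.5, 0.5]), T=0.01)
```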

Read the wine data [3], which contains 11 physicochemical attributes and two labels (quality and color).

Create an instance of your DAC class with n_clusters = 2 and **fit the first 6000 samples** of the wine data set. Record the execution time. Furthermore, create an instance of the sklearn k-means class and fit it with the same parameters; again, record the execution time. Make sure that the hyperparameters (initial temperature, minimum temperature, convergence criteria, noise, etc.) make sense and lead to a reasonable clustering.
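A small timing helper can keep the two measurements comparable; below it is demonstrated on random data and sklearn's KMeans, with your DAC instance and the wine samples slotting in the same way:

```python
import time
import numpy as np
from sklearn.cluster import KMeans

def timed_fit(model, X):
    """Fit model on X and return (fitted model, wall-clock seconds)."""
    t0 = time.perf_counter()
    model.fit(X)
    return model, time.perf_counter() - t0

# demo on random data shaped like the wine attributes; in the exercise,
# X would be the first 6000 wine samples and the second call your DAC
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 11))
kmeans, t_kmeans = timed_fit(
    KMeans(n_clusters=2, n_init=10, random_state=42), X_demo)
```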

- Before we can compute the confusion matrix, we need to perform some post-processing on the DAC cluster assignments. Explain what the function postprocess (defined below) does, and why we need it. To do so, complete the docstring of the function postprocess.
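As a reference point while you study postprocess (without claiming this is what it does): cluster indices are arbitrary, so before a confusion matrix is meaningful, predicted labels are commonly permuted to best match the ground truth, e.g. with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def match_labels(y_true, y_pred):
    """Relabel y_pred to best agree with y_true (Hungarian method)."""
    cm = confusion_matrix(y_true, y_pred)
    # maximize the matched counts by minimizing the negated matrix
    rows, cols = linear_sum_assignment(-cm)
    mapping = {c: r for r, c in zip(rows, cols)}
    return np.array([mapping[p] for p in y_pred])

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 0])  # same partition, swapped labels
y_matched = match_labels(y_true, y_pred)
```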

- Read the docstring of the **transform** method and understand what it does.
- Use DAC.transform and kmeans.transform to transform both X_train and X_test.
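For orientation, this is the sklearn convention that KMeans.transform follows (and that DAC.transform presumably mirrors): each sample is mapped to its vector of distances to the cluster centers. Demonstrated on random data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(100, 4))
X_test_demo = rng.normal(size=(20, 4))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train_demo)
# transform returns the distance of each sample to each cluster center,
# so the result has shape (n_samples, n_clusters)
X_train_kmeans = kmeans.transform(X_train_demo)
X_test_kmeans = kmeans.transform(X_test_demo)
```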

- Fit an SVM classifier with default parameters to the untransformed data and to the transformed data. Compare the performance of predicting whether the color of a wine is red or white.
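The comparison can be scaffolded as below; make_blobs and KMeans stand in here for the wine data and your DAC class, so only the pattern carries over:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in for the wine features and the binary color label
X, y = make_blobs(n_samples=400, centers=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# SVM on the untransformed features
score_raw = SVC().fit(X_tr, y_tr).score(X_te, y_te)

# SVM on the cluster-distance features produced by transform
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_tr)
score_km = SVC().fit(km.transform(X_tr), y_tr).score(km.transform(X_te), y_te)
```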

- Produce two scatter plots, one for X_train_DAC and one for X_train_kmeans. Make the marker color indicate the wine color.

- Create a fixed 2D embedding (e.g. with LLE, t-SNE, or MDS) of the wine data and color the markers according to quality and color. Fit and transform X_train with DAC(n_clusters=2,3,4,5,6,7,8,...). Produce a plot of the SVM score svm_DAC.score(X_test_DAC, y_test) as a function of n_clusters. Each time, use marker shapes to display the cluster memberships, and compare to the labels color and quality.
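The scaffolding for this sweep might look as follows, with KMeans standing in for your DAC class, t-SNE as the fixed embedding, and blob data in place of the wine data:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=300, centers=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fixed 2D embedding of the training data, computed once and reused
emb = TSNE(n_components=2, random_state=0).fit_transform(X_tr)

ks = range(2, 7)
scores = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_tr)
    # SVM on the cluster-distance features
    svm = SVC().fit(km.transform(X_tr), y_tr)
    scores.append(svm.score(km.transform(X_te), y_te))
    # embedding colored by label, marker shape showing cluster membership
    fig, ax = plt.subplots()
    for c, m in zip(range(k), "osv^D*P8"):
        sel = km.labels_ == c
        ax.scatter(emb[sel, 0], emb[sel, 1], c=y_tr[sel],
                   marker=m, cmap="viridis")
    ax.set_title(f"n_clusters = {k}")

plt.figure()
plt.plot(list(ks), scores, marker="o")
plt.xlabel("n_clusters")
plt.ylabel("SVM test score")
```

For the wine data, you would repeat the colored-embedding plot twice per k, once colored by quality and once by color.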

- So far, our implementation of DAC has assumed that our data is compatible with the Euclidean metric. Argue why this assumption is not justified for the wine data. Suggest a better alternative (no implementation required!).
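To make the scale issue concrete (an illustration, not the expected answer): with features on wildly different scales, squared Euclidean distance is dominated by the largest-scale feature, whereas a covariance-aware distance such as the Mahalanobis distance weights each direction by the data's spread:

```python
import numpy as np

def mahalanobis_sq(X, center, cov_inv):
    """Squared Mahalanobis distance of each row of X to center."""
    diff = X - center
    # quadratic form diff @ cov_inv @ diff per row
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

rng = np.random.default_rng(0)
# two features on wildly different scales, as in the wine attributes
X = rng.normal(size=(100, 2)) * np.array([1.0, 100.0])
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d = mahalanobis_sq(X, X.mean(axis=0), cov_inv)
```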