Commit 6e4eca6c authored by alinadu (parent c9404489)

added coding exercise 4
%% Cell type:code id: tags:
``` python
legi = ""  # TODO: enter your legi (student ID) number here
```
%% Cell type:markdown id: tags:
# SLT-CE-4: Constant Shift Embedding
%% Cell type:markdown id: tags:
## Task
%% Cell type:markdown id: tags:
Many real-world phenomena are described by pairwise proximity data, modeling interactions between the entities of a system. This is in contrast to the more common situation where each data sample is given as a feature vector. Even though clustering of proximity data may be performed directly on the data matrix, embedding the data into a vector space has some advantages. For example, it enables the use of standard preprocessing techniques such as denoising or dimensionality reduction. In this coding exercise, we will explore the technique called _Constant Shift Embedding_ for restating pairwise clustering problems in vector spaces [1] while preserving the cluster structure. We will apply the algorithm described in [1] to cluster research community members into groups based on an email correspondence matrix. The data and its description are given in [2].
### References
[1] [Optimal cluster preserving embedding of nonmetric proximity data](https://ieeexplore.ieee.org/document/1251147)
[2] [email-Eu-core](https://snap.stanford.edu/data/email-Eu-core.html)
%% Cell type:markdown id: tags:
<h2 style="background-color:#f0b375;"> Setup </h2>
We start by importing the necessary Python packages.
%% Cell type:code id: tags:
``` python
import numpy as np
import sklearn as skl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
import scipy.linalg as la
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib import ticker

# Fix random seed for reproducibility
np.random.seed(42)
```
%% Cell type:markdown id: tags:
The number of nodes is hardcoded for simplicity (taken from [2]):
%% Cell type:code id: tags:
``` python
NUM_NODES = 1005
```
%% Cell type:markdown id: tags:
We load the file which contains the list of interactions between the community members (nodes). Our data matrix represents an undirected graph that connects two nodes if at least one email was sent between the two corresponding community members. Thus our data matrix is essentially an adjacency matrix.
%% Cell type:code id: tags:
``` python
# Initialize the data matrix, which will be an adjacency matrix
DATA = np.zeros((NUM_NODES, NUM_NODES))
# Fill out the symmetric adjacency matrix
with open("email-Eu-core.txt") as file:
    for line in file:
        pair = [int(x) for x in line.split()]
        DATA[pair[0], pair[1]] = 1
        DATA[pair[1], pair[0]] = 1
```
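%% Cell type:markdown id: tags:
As an optional sanity check, we can verify on a toy example that a matrix built this way is symmetric, has no self-loops, and count its edges (the small matrix `A` below is a hypothetical stand-in for `DATA`):
%% Cell type:code id: tags:
``` python
import numpy as np

# Hypothetical small adjacency matrix standing in for DATA
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1

assert np.allclose(A, A.T), "adjacency matrix should be symmetric"
assert np.all(np.diag(A) == 0), "no self-loops expected"
print("number of undirected edges:", int(A.sum() // 2))
```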
%% Cell type:markdown id: tags:
Note that DATA is the adjacency matrix of the email graph; it is not claimed to be a proper dissimilarity matrix as required by the CSE algorithm. You are therefore allowed to perform any manipulations needed to construct a suitable (dis-)similarity matrix for the further analysis.
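%% Cell type:markdown id: tags:
One simple option (a sketch, not the only valid choice; the helper name is illustrative) is to treat the binary adjacency matrix as a similarity and invert it, so that connected pairs get a small dissimilarity and unconnected pairs a large one:
%% Cell type:code id: tags:
``` python
import numpy as np

def adjacency_to_dissimilarity(S):
    """Turn a binary adjacency (similarity) matrix into a dissimilarity
    matrix: connected pairs get distance 0, unconnected pairs distance 1."""
    D = S.max() - S           # invert similarities
    np.fill_diagonal(D, 0.0)  # zero self-dissimilarity
    return D

# Toy example
S = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D = adjacency_to_dissimilarity(S)
```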
%% Cell type:markdown id: tags:
Next we define a class which contains the main functionality (to be implemented).
%% Cell type:code id: tags:
``` python
class ConstantShiftEmbedding(skl.base.BaseEstimator, skl.base.TransformerMixin):
    """Template class for Constant Shift Embedding (CSE).

    Attributes:
        PMAT (np.ndarray): Proximity matrix used for calculating the embeddings.
        S (np.ndarray): Similarity matrix.
        D (np.ndarray): Dissimilarity matrix.
    """

    def __init__(self):
        self.PMAT = None
        self.S = None
        self.D = None
        # Add/change parameters, if necessary.

    def fit(self, PMAT):
        """Calculate the similarity/dissimilarity matrix and all
        variables necessary for calculating the embeddings.

        Args:
            PMAT (np.ndarray): proximity matrix
        """
        # Save data
        self.PMAT = PMAT
        ## IMPLEMENT THIS METHOD

    def get_embedded_vectors(self, p):
        """Return the embedded vectors.

        Args:
            p (int): cut-off value in the eigenspectrum

        Returns:
            Xp (np.ndarray): embedded vectors
        """
        ## IMPLEMENT THIS METHOD
```
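%% Cell type:markdown id: tags:
For orientation, the core computation described in [1] can be sketched as a single function (a minimal illustration, not the required implementation; the function name and interface are ours): center the dissimilarity matrix, shift its spectrum so it becomes positive semi-definite, and embed using the leading eigen-directions.
%% Cell type:code id: tags:
``` python
import numpy as np

def cse_embed(D, p):
    """Sketch of the CSE pipeline from [1]: center the dissimilarity
    matrix, apply a constant spectral shift to make it positive
    semi-definite, and embed using the leading p eigen-directions."""
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Sc = -0.5 * Q @ D @ Q                        # centered similarity matrix
    eigvals, eigvecs = np.linalg.eigh(Sc)
    eigvals = eigvals - min(eigvals.min(), 0.0)  # constant shift removes negative part
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :p] * np.sqrt(eigvals[:p])

# Toy check: squared distances of the points 0, 1, 3 on a line
D_toy = np.array([[0., 1., 9.],
                  [1., 0., 4.],
                  [9., 4., 0.]])
X = cse_embed(D_toy, 1)  # recovers the 1-D configuration up to sign/translation
```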
%% Cell type:markdown id: tags:
<h2 style="background-color:#f0b375;">
Section 4.0
<span style=font-size:50%> Complete all problems in this section to get a pass on this exercise. </span>
</h2>
<p style="background-color:#adebad;">Describe briefly and concisely the model given in [1]. Explain the main steps of the <i>Constant Shift Embedding</i> algorithm. See <a href="https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed">markdown cheatsheet</a> for text editing.</p>
- **Constant Shift Embedding**: (Put your answer here)
%% Cell type:markdown id: tags:
<p style="background-color:#adebad;">
Implement Constant Shift Embedding. We start off by making an instance of the corresponding class.
</p>
%% Cell type:code id: tags:
``` python
CSE = ConstantShiftEmbedding()
```
%% Cell type:markdown id: tags:
<p style="background-color:#adebad;">
Fit the data matrix. The <i>fit(...)</i> method computes the necessary variables, which can later be used to produce the embeddings [1].
</p>
%% Cell type:code id: tags:
``` python
CSE.fit(DATA)
```
%% Cell type:markdown id: tags:
<h2 style="background-color:#f0b375;">
Section 4.5
<span style=font-size:50%> Complete all problems in this and previous sections to get a grade of 4.5 </span>
</h2>
<p style="background-color:#adebad;">
Next, try to find an approximately optimal $p = p^*$, a cut-off value which removes the noise from the data. To do this, produce an eigen-spectrum plot as shown in figure 4a of [1] and briefly explain your choice of $p^*$.
</p>
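%% Cell type:markdown id: tags:
A sketch of how such a spectrum plot could look (the eigenvalues below are synthetic stand-ins for the spectrum of your centered matrix; in the notebook the `matplotlib.use("Agg")` line is unnecessary):
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in the notebook
import matplotlib.pyplot as plt

# Hypothetical eigenvalues standing in for the spectrum of your centered matrix
rng = np.random.default_rng(0)
eigvals = np.sort(np.abs(rng.normal(size=50)))[::-1]
eigvals[:5] += 5.0  # a few dominant "signal" eigenvalues

plt.figure()
plt.plot(np.arange(1, 51), eigvals, "o-")
plt.axvline(x=5.5, color="r", linestyle="--", label="cut-off $p^*$")
plt.xlabel("eigenvalue index")
plt.ylabel("eigenvalue")
plt.legend()
plt.savefig("spectrum.png")
```
A natural choice of $p^*$ is the index where the spectrum flattens out into the noise floor.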
%% Cell type:code id: tags:
``` python
## Compute eigen-spectrum
```
%% Cell type:code id: tags:
``` python
## Determine a good cut-off value (and write some lines to explain your choice)
p_opt = 1 ## change accordingly
print("Chosen cut-off value is: ", p_opt)
```
%% Cell type:code id: tags:
``` python
## Plot spectrum and indicate the cut-off value on the spectrum
```
%% Cell type:markdown id: tags:
<h2 style="background-color:#f0b375;">
Section 5.0
<span style=font-size:50%> Complete all problems in this and previous sections to get a grade of 5.0 </span>
</h2>
<p style="background-color:#adebad;">
Plot the distance matrices for both the denoised version ($p = p^*$ from the previous step) and the original one, as shown in figure 5 in [1]. Note that the distance matrix contains the pairwise distances between every two points of the dataset ($d_{ij} = \mathrm{dist}(x_i, x_j)$).<br>
Run the K-MEANS algorithm for a varying number of clusters K on the embedded vectors derived from CSE. You may use the sklearn implementation of K-MEANS. To make the aforementioned plots meaningful, sort the nodes according to their cluster assignments for every number of clusters K (see figure 5). For now, there is no need to include the actual ground-truth labels given in [2].
</p>
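%% Cell type:markdown id: tags:
One way to sort a distance matrix by cluster membership (a sketch on hypothetical 2-D embeddings; with real data, `X` would be your CSE embeddings):
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# Hypothetical 2-D embeddings: two well-separated blobs of 20 points each
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
order = np.argsort(labels, kind="stable")  # group points of the same cluster together
dist = squareform(pdist(X))                # full pairwise distance matrix
dist_sorted = dist[np.ix_(order, order)]   # reorder rows and columns by cluster
```
After sorting, a clear block-diagonal structure in `dist_sorted` indicates compact, well-separated clusters.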
%% Cell type:code id: tags:
``` python
## Distance matrices
```
%% Cell type:markdown id: tags:
<h2 style="background-color:#f0b375;">
Section 5.5
<span style=font-size:50%> Complete all problems in this and previous sections to get a grade of 5.5 </span>
</h2>
<p style="background-color:#adebad;">
Producing 2D and 3D embeddings allows us to nicely visualize the generated clusters. Now calculate the embeddings for $p = 2$ (2D case) and $p = 3$ (3D case) and plot the clusterings for a few values of K. Alternatively, you could use $p = p^*$ for gentler denoising, cluster the denoised embeddings, and only then apply a dimensionality reduction technique to obtain a plot in 2- or 3-dimensional space. You could use PCA, LLE, t-SNE, etc.; figure out what works for you. As an example, see figure 6 (b) in [1], where CSE is combined with PCA.
</p>
%% Cell type:code id: tags:
``` python
## Get embeddings, run K-MEANS and generate plots
```
%% Cell type:code id: tags:
``` python
## p = 2
```
%% Cell type:code id: tags:
``` python
## p = 3
```
%% Cell type:code id: tags:
``` python
## Choose p > 3, for example p = p_opt, to compute the CSE embeddings.
## First, cluster the computed p-dimensional embeddings, then project them onto 2-dimensional space
## for visualization using PCA, LLE, t-SNE or something else.
```
%% Cell type:markdown id: tags:
<h2 style="background-color:#f0b375;">
Section 6.0
<span style=font-size:50%> Complete all problems in this and previous sections to get a grade of 6.0 </span>
</h2>
<p style="background-color:#adebad;">
Finally, to evaluate the quality of the clusters derived above, let's compare our predictions with the ground truth. We will use the actual member-department mappings given in [2]. You can reuse code from the previous coding exercises to align the cluster labels with the ground truth.
</p>
%% Cell type:code id: tags:
``` python
# Initialize the community members' affiliation array
AFFILIATIONS = np.zeros((NUM_NODES, ))
# Fill out the affiliation array
with open("email-Eu-core-department-labels.txt") as file:
    for line in file:
        pair = [int(x) for x in line.split()]
        AFFILIATIONS[pair[0]] = pair[1]
# Print the number of departments
print("The true number of clusters (departments) is: ", len(np.unique(AFFILIATIONS)))
```
%% Cell type:markdown id: tags:
<p style="background-color:#adebad;">
Visually or quantitatively, in a clever and convincing way, show that the K-MEANS-generated clusters overlap with the ground-truth clusters (member affiliations). How can we measure the overlap between the predicted and true clusters?
</p>
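%% Cell type:markdown id: tags:
Two standard quantitative overlap measures are the adjusted Rand index and normalized mutual information; both are invariant to permutations of the cluster labels and available in sklearn (a sketch on hypothetical toy labels):
%% Cell type:code id: tags:
``` python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical ground-truth and predicted labels (cluster names permuted, one error)
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]

ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
print(f"ARI = {ari:.3f}, NMI = {nmi:.3f}")
```
Both scores reach 1.0 only for a perfect match (up to relabeling), so no explicit label alignment is needed for these metrics.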
%% Cell type:code id: tags:
``` python
## Here you can provide plots and calculations
```
%% Cell type:markdown id: tags:
Please write your explanations, observations, and thoughts about the results of the above experiments here.
%% Cell type:markdown id: tags:
## Comments
We hope you found this exercise instructive.
Feel free to leave comments below, we will read them carefully.