<a href="https://colab.research.google.com/github/NeuromatchAcademy/course-content-dl/blob/main/tutorials/W3D3_UnsupervisedAndSelfSupervisedLearning/instructor/W3D3_Tutorial1.ipynb" target="_blank"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"/></a>   <a href="https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/tutorials/W3D3_UnsupervisedAndSelfSupervisedLearning/instructor/W3D3_Tutorial1.ipynb" target="_blank"><img alt="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"/></a>

# Tutorial 1: Un/Self-supervised learning methods

**Week 3, Day 3: Unsupervised and self-supervised learning**

**By Neuromatch Academy**

__Content creators:__ Arna Ghosh, Colleen Gillon, Tim Lillicrap, Blake Richards

__Content reviewers:__ Atnafu Lambebo, Hadi Vafaei, Khalid Almubarak, Melvin Selim Atay, Kelson Shilling-Scrivo

__Content editors:__ Anoop Kulkarni, Spiros Chavlis

__Production editors:__ Deepak Raya, Gagana B, Spiros Chavlis

---
# Tutorial Objectives

In this tutorial, you will learn about the importance of learning good representations of data.

Specific objectives for this tutorial:
*   Train logistic regressions (A) directly on input data and (B) on representations learned from the data.
*   Compare the classification performances achieved by the different networks.
*   Compare the representations learned by the different networks.
*   Identify the advantages of self-supervised learning over supervised or traditional unsupervised methods.

In [None]:
# @markdown
from IPython.display import IFrame
from ipywidgets import widgets
out = widgets.Output()
with out:
    print(f"If you want to download the slides: https://osf.io/download/wvt34/")
    display(IFrame(src=f"https://mfr.ca-1.osf.io/render?url=https://osf.io/wvt34/?direct%26mode=render%26action=download%26mode=render", width=730, height=410))
display(out)

---
# Setup

##  Install dependencies


In [None]:
# @title Install dependencies

# @markdown Download dataset, modules, and files needed for the tutorial from GitHub.

# @markdown This cell will download the library from OSF, but you can check out the code in https://github.com/colleenjg/neuromatch_ssl_tutorial.git

import os, sys, shutil, importlib

REPO_PATH = "neuromatch_ssl_tutorial"
download_str = "Downloading"
if os.path.exists(REPO_PATH):
  download_str = "Redownloading"
  shutil.rmtree(REPO_PATH)

# Download from github repo directly
# !git clone https://github.com/colleenjg/neuromatch_ssl_tutorial.git --quiet

from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

zipurl = 'https://osf.io/smqvg/download'
print(f"{download_str} and unzipping the file... Please wait.")
with urlopen(zipurl) as zipresp:
  with ZipFile(BytesIO(zipresp.read())) as zfile:
    zfile.extractall()

# Correct now-broken use of deprecated np.product method
for module in ["data.py", "load.py", "models.py"]:
  module_path = os.path.join(REPO_PATH, "modules", module)
  # Read the full source, then close the file before rewriting it
  with open(module_path, "r") as f:
    source = f.read()
  source = source.replace("np.product(", "np.prod(")
  with open(module_path, "w") as f:
    f.write(source)

print("Download completed!")

##  Install and import feedback gadget


In [None]:
# @title Install and import feedback gadget

!pip3 install vibecheck datatops --quiet

from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "neuromatch_dl",
            "user_key": "f379rz8y",
        },
    ).render()


feedback_prefix = "W3D3_T1"

In [None]:
# Imports
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt

# Import modules designed for use in this notebook.
from neuromatch_ssl_tutorial.modules import data, load, models, plot_util
importlib.reload(data)
importlib.reload(load)
importlib.reload(models)
importlib.reload(plot_util)

##  Figure settings


In [None]:
# @title Figure settings
import logging
logging.getLogger('matplotlib.font_manager').disabled = True

import ipywidgets as widgets  # Interactive display
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")

plt.rc('axes', unicode_minus=False) # To ensure negatives render correctly with xkcd style
import warnings
warnings.filterwarnings("ignore")

##  Plotting functions


 Function to plot a histogram of RSM values: `plot_rsm_histogram(rsms, colors)`


In [None]:
# @title Plotting functions

# @markdown Function to plot a histogram of RSM values: `plot_rsm_histogram(rsms, colors)`
def plot_rsm_histogram(rsms, colors, labels=None, nbins=100):
  """
  Function to plot histogram based on Representational Similarity Matrices

  Args:
    rsms: List
      List of values within RSM
    colors: List
      List of colors for histogram
    labels: List
      List of RSM Labels
    nbins: Integer
      Specifies number of histogram bins

  Returns:
    Nothing
  """
  fig, ax = plt.subplots(1)
  ax.set_title("Histogram of RSM values", y=1.05)

  min_val = np.min([np.nanmin(rsm) for rsm in rsms])
  max_val = np.max([np.nanmax(rsm) for rsm in rsms])

  bins = np.linspace(min_val, max_val, nbins+1)

  if labels is None:
    labels = [labels] * len(rsms)
  elif len(labels) != len(rsms):
    raise ValueError("If providing labels, must provide as many as RSMs.")

  if len(rsms) != len(colors):
    raise ValueError("Must provide as many colors as RSMs.")

  for r, rsm in enumerate(rsms):
    ax.hist(
        rsm.reshape(-1), bins, density=True, alpha=0.4,
        color=colors[r], label=labels[r]
        )
  ax.axvline(x=0, ls="dashed", alpha=0.6, color="k")
  ax.set_ylabel("Density")
  ax.set_xlabel("Similarity values")
  ax.legend()
  plt.show()

##  Helper functions


In [None]:
# @title Helper functions

from IPython.display import display, Image # to visualize images

# @markdown Function to test a custom torch RSM function: `test_custom_torch_RSM_fct()`
def test_custom_torch_RSM_fct(custom_torch_RSM_fct):
  """
  Function to test the implementation of custom_torch_RSM_fct

  Args:
    custom_torch_RSM_fct: f_name
      Function to test

  Returns:
    Nothing
  """
  rand_feats = torch.rand(100, 1000)
  RSM_custom = custom_torch_RSM_fct(rand_feats)
  RSM_ground_truth = data.calculate_torch_RSM(rand_feats)

  if torch.allclose(RSM_custom, RSM_ground_truth, equal_nan=True):
    print("custom_torch_RSM_fct() is correctly implemented.")
  else:
    print("custom_torch_RSM_fct() is NOT correctly implemented.")


# @markdown Function to test a custom contrastive loss function: `test_custom_contrastive_loss_fct()`
def test_custom_contrastive_loss_fct(custom_simclr_contrastive_loss):
  """
  Function to test the implementation of custom_simclr_contrastive_loss

  Args:
    custom_simclr_contrastive_loss: f_name
      Function to test

  Returns:
    Nothing
  """
  rand_proj_feat1 = torch.rand(100, 1000)
  rand_proj_feat2 = torch.rand(100, 1000)
  loss_custom = custom_simclr_contrastive_loss(rand_proj_feat1, rand_proj_feat2)
  loss_ground_truth = models.contrastive_loss(rand_proj_feat1,rand_proj_feat2)

  if torch.allclose(loss_custom, loss_ground_truth):
    print("custom_simclr_contrastive_loss() is correctly implemented.")
  else:
    print("custom_simclr_contrastive_loss() is NOT correctly implemented.")

##  Set random seed


 Executing `set_seed(seed=seed)` you are setting the seed


In [None]:
# @title Set random seed

# @markdown Executing `set_seed(seed=seed)` you are setting the seed

# For DL, it's critical to set the random seed so that students have a
# baseline against which to compare their results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html

# Call `set_seed` function in the exercises to ensure reproducibility.
import random
import torch

def set_seed(seed=None, seed_torch=True):
  """
  Handles variability by controlling sources of randomness
  through set seed values

  Args:
    seed: Integer
      Set the seed value to given integer.
      If no seed is given, set seed value to a random integer in [0, 2^32)
    seed_torch: Bool
      Seeds the random number generator for all devices to
      offer some guarantees on reproducibility

  Returns:
    Nothing
  """
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

  print(f'Random seed {seed} has been set.')


# In case that `DataLoader` is used
def seed_worker(worker_id):
  """
  DataLoader will reseed workers following randomness in
  multi-process data loading algorithm.

  Args:
    worker_id: integer
      ID of subprocess to seed. 0 means that
      the data will be loaded in the main process
      Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details

  Returns:
    Nothing
  """
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)

##  Set device (GPU or CPU). Execute `set_device()`


In [None]:
# @title Set device (GPU or CPU). Execute `set_device()`
# especially if torch modules used.

# Inform the user if the notebook uses GPU or CPU.

def set_device():
  """
  Set the device. CUDA if available, CPU otherwise

  Args:
    None

  Returns:
    device: String
      "cuda" if a GPU is available, "cpu" otherwise
  """
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device

In [None]:
# Set global variables
SEED = 2021
set_seed(seed=SEED)
DEVICE = set_device()

 ### Pre-load variables (allows each section to be run independently)


In [None]:
# @markdown ### Pre-load variables (allows each section to be run independently)

# Section 1
dSprites = data.dSpritesDataset(
    os.path.join(REPO_PATH, "dsprites", "dsprites_subset.npz")
    )

dSprites_torchdataset = data.dSpritesTorchDataset(
  dSprites,
  target_latent="shape"
  )

train_sampler, test_sampler = data.train_test_split_idx(
  dSprites_torchdataset,
  fraction_train=0.8,
  randst=SEED
  )

supervised_encoder = load.load_encoder(REPO_PATH,
                                       model_type="supervised",
                                       verbose=False)

# Section 2
custom_torch_RSM_fct = None  # Default is used instead

# Section 3
random_encoder = load.load_encoder(REPO_PATH,
                                   model_type="random",
                                   verbose=False)

# Section 4
vae_encoder = load.load_encoder(REPO_PATH,
                                model_type="vae",
                                verbose=False)

# Section 5
invariance_transforms = torchvision.transforms.RandomAffine(
    degrees=90,
    translate=(0.2, 0.2),
    scale=(0.8, 1.2)
    )
dSprites_invariance_torchdataset = data.dSpritesTorchDataset(
    dSprites,
    target_latent="shape",
    simclr=True,
    simclr_transforms=invariance_transforms
    )

# Section 6
simclr_encoder = load.load_encoder(REPO_PATH,
                                   model_type="simclr",
                                   verbose=False)

---
# Section 0: Introduction

##  Video 0: Introduction


In [None]:
# @title Video 0: Introduction
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'Q3b_EqFUI00'), ('Bilibili', 'BV1D64y1s78e')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Introduction_Video")

---
# Section 1: Representations are important

*Time estimate: ~30mins*

##  Video 1: Why do representations matter?


In [None]:
# @title Video 1: Why do representations matter?
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'lj5uTUo6W88'), ('Bilibili', 'BV1g54y1J7cE')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Why_do_representations_matter_Video")

## Section 1.1: Introducing the dSprites dataset

In this tutorial, we will be using a subset of the openly available **dSprites dataset** to investigate the importance of learning good representations.

_**Note on dataset:** For convenience, we will be using a subset of the original, full dataset which is available [here](https://github.com/deepmind/dsprites-dataset/), on GitHub._

### Interactive Demo 1.1.1: Exploring the dSprites dataset

In this first demo, we will get to know the **dSprites dataset**. This dataset is made up of black and white images (20,000 images total in the subset we are using).

The images in the dataset can be described using different combinations of **latent dimension values**, sampled from:
- **Shapes (3):** square (1.0), oval (2.0) or heart (3.0)
- **Scales (6):** 0.5 to 1.0
- **Orientations (40):** 0 to 2$\pi$
- **Positions in X (32):** 0 to 1 (left to right)
- **Positions in Y (32):** 0 to 1 (top to bottom)

As a result, **each image carries 5 labels**, one for each of the latent dimensions.
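
To get a feel for the size of this latent space, the short sketch below multiplies the number of values listed above for each latent dimension. It assumes every combination of latent values is possible; the dictionary name `latent_sizes` is introduced here purely for illustration.

```python
import math

# Number of distinct values per latent dimension, as listed above
latent_sizes = {
    "shape": 3,
    "scale": 6,
    "orientation": 40,
    "posX": 32,
    "posY": 32,
}

# Total number of distinct latent combinations
num_combinations = math.prod(latent_sizes.values())
print(f"Possible latent combinations: {num_combinations:,}")

# Fraction covered by a 20,000-image subset, assuming no repeated images
subset_fraction = 20000 / num_combinations
print(f"Fraction covered by the subset: {subset_fraction:.1%}")
```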

We will first load the dataset into the `dSprites` object, which is an instance of the `data.dSpritesDataset` class.

In [None]:
dSprites = data.dSpritesDataset(
    os.path.join(REPO_PATH, "dsprites", "dsprites_subset.npz")
    )

Next, we use the `dSpritesDataset` method `show_images()` to plot a few images from the dataset, with their latent dimension values printed below.

**Interactive Demo:** View a different set of randomly sampled images by passing the random state argument `randst` any integer or the value `None`. (The original setting is `randst=SEED`.)

In [None]:
# DEMO: To view different images, set randst to any integer value.
dSprites.show_images(num_images=10, randst=SEED)

To better understand the `posX` and `posY` latent dimensions (which will be most relevant in **Bonus 2**), we plot the images with some annotations. The annotations (in red) do not modify the actual images; they are added **purely for visualization purposes**, and show:
  - the **edges** of the `posX` and `posY` spans, and
  - the **center**, i.e., `(posX, posY)`, for each shape.

_**Note on shape positions:** Notice that all shape centers are positioned **within the area marked by the red square**. `posX` and `posY` actually describe the relative position of the center of a shape within this area: `posX=0` (left) to `posX=1` (right), and `posY=0` (top) to `posY=1` (bottom). No shape center appears outside, in the buffer area. This choice in the dSprites dataset design ensures that shapes of different scales and rotations **all appear fully**._

In [None]:
# DEMO: To view different images, set randst to any integer value.
dSprites.show_images(num_images=10, randst=SEED, annotations="pos")

## Section 1.2: Training a classifier with and without representations

Now, we will investigate how 2 different types of classifiers perform when trained to decode the shape latent dimension of images in the **dSprites dataset**.

Specifically, we will train **one classifier directly on the images**, and **another on the output of an encoder network**.

The **encoder network** we will use here and throughout the tutorial is the multi-layer convolutional network, pictured below. It comprises 2 consecutive convolutional layers, followed by 3 fully connected layers, and uses average pooling and batch normalization between layers, as well as rectified linear units as non-linearities.

The **classifier layer** then takes the encoder features as input, predicting, for example, the shape latent dimension of encoded input images.

_**Note on terminology:** In this tutorial, both the terms **representations** and **features** are used to refer to the data embeddings learned in the final layer of the encoder network (of dimension 1x84, and indicated by a red dashed box) which are fed to the classifiers._
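
For readers who prefer code to prose, here is a rough sketch of an encoder matching the description above: 2 convolutional layers, 3 fully connected layers, average pooling, batch normalization, ReLU non-linearities, and an 84-dimensional output. This is **not** the `models.EncoderCore` implementation. The class name `SketchEncoder`, the channel counts, kernel sizes, hidden layer sizes, layer ordering, and the 64x64 single-channel input are all assumptions made for illustration.

```python
import torch
from torch import nn


class SketchEncoder(nn.Module):
  """Hypothetical encoder sketch; all layer sizes are assumptions."""

  def __init__(self, feat_size=84):
    super().__init__()
    # 2 convolutional layers, each followed by batch norm, ReLU and pooling
    self.conv = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),   # 64x64 -> 60x60
        nn.BatchNorm2d(6),
        nn.ReLU(),
        nn.AvgPool2d(2),                  # 60x60 -> 30x30
        nn.Conv2d(6, 16, kernel_size=5),  # 30x30 -> 26x26
        nn.BatchNorm2d(16),
        nn.ReLU(),
        nn.AvgPool2d(2),                  # 26x26 -> 13x13
        )
    # 3 fully connected layers, ending in the feature layer
    self.fc = nn.Sequential(
        nn.Flatten(),
        nn.Linear(16 * 13 * 13, 256),
        nn.ReLU(),
        nn.Linear(256, 120),
        nn.ReLU(),
        nn.Linear(120, feat_size),
        )

  def forward(self, x):
    return self.fc(self.conv(x))


encoder = SketchEncoder()
classifier = nn.Linear(84, 3)  # classifier layer: features -> 3 shape classes

images = torch.rand(4, 1, 64, 64)  # hypothetical batch of 4 images
features = encoder(images)         # the "representations"
logits = classifier(features)
print(features.shape, logits.shape)
```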

 ### Encoder network schematic


In [None]:
# @markdown ### Encoder network schematic
Image(filename=os.path.join(REPO_PATH, "images", "feat_encoder_schematic.png"), width=1200)

The following code:
*    Seeds modules that will use random processes, to ensure the results are consistently reproducible, using the `set_seed()` function,
*    Collects the dSprites dataset into a torch dataset using the `data.dSpritesTorchDataset` class,
*    Initializes a training and a test sampler to keep the two datasets separate using the `data.train_test_split_idx()` function.

In [None]:
# Set the seed before building any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

# Initialize a torch dataset, specifying the target latent dimension for
# the classifier
dSprites_torchdataset = data.dSpritesTorchDataset(
  dSprites,
  target_latent="shape"
  )

# Initialize a train_sampler and a test_sampler to keep the two sets
# consistently separate
train_sampler, test_sampler = data.train_test_split_idx(
  dSprites_torchdataset,
  fraction_train=0.8,  # 80:20 data split
  randst=SEED
  )

print(f"Dataset size: {len(train_sampler)} training, "
      f"{len(test_sampler)} test images")
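
To illustrate what a seeded 80:20 split involves, here is a minimal NumPy sketch. It is not the implementation of `data.train_test_split_idx()`; the helper name `split_indices()` and its arguments are hypothetical.

```python
import numpy as np


def split_indices(num_items, fraction_train=0.8, seed=2021):
  """Hypothetical helper: shuffle indices with a seeded generator, then cut."""
  rng = np.random.RandomState(seed)
  shuffled = rng.permutation(num_items)
  num_train = int(round(fraction_train * num_items))
  return shuffled[:num_train], shuffled[num_train:]


train_idx, test_idx = split_indices(20000)
print(f"{len(train_idx)} training, {len(test_idx)} test indices")

# Re-running with the same seed reproduces exactly the same split
train_idx_again, _ = split_indices(20000)
print(np.array_equal(train_idx, train_idx_again))
```

Because the shuffle depends only on the seed, the training and test sets stay the same across runs, which keeps test images from leaking into training when cells are re-executed.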

### Interactive Demo 1.2.1: Training a logistic regression classifier directly on images

The following code:
*    trains a logistic regression directly on the training set images to classify their shape, and assesses its performance on the test set images using the `models.train_classifier()` function.

_**Interactive Demo:** Try a few different `num_epochs` settings to see whether performance improves with more training, e.g., between 1 and 50 epochs. (The original setting is `num_epochs=25`.)_
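
As a rough picture of what a logistic regression "directly on the images" computes, the NumPy sketch below flattens each image into a vector and applies a single linear layer followed by a softmax. It only shows the forward pass with untrained (zero) weights and is not how `models.train_classifier()` is implemented; the batch size, the 64x64 image size, and the variable names are assumptions.

```python
import numpy as np


def softmax(logits):
  # Subtract the row-wise max for numerical stability
  shifted = logits - logits.max(axis=1, keepdims=True)
  exp = np.exp(shifted)
  return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.RandomState(0)
images = rng.rand(8, 64, 64)          # hypothetical batch of 8 images
X = images.reshape(len(images), -1)   # flatten each image: (8, 4096)

num_classes = 3                       # square, oval, heart
W = np.zeros((X.shape[1], num_classes))  # untrained weights
b = np.zeros(num_classes)

probs = softmax(X @ W + b)            # class probabilities per image
predicted = probs.argmax(axis=1)
print(probs[0])
```

With untrained weights, every class receives the same probability, which corresponds to the chance level of 1 in 3 for this task.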

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_What_models_Video")

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

num_epochs = 25  # DEMO: Try different numbers of training epochs

# Train a classifier directly on the images
print("Training a classifier directly on the images...")
_ = models.train_classifier(
  encoder=None,
  dataset=dSprites_torchdataset,
  train_sampler=train_sampler,
  test_sampler=test_sampler,
  freeze_features=True,  # There is no feature encoder to train here, anyway
  num_epochs=num_epochs,
  verbose=True  # Print results
  )

As we can observe, the classifier trained directly on the images performs only a bit above chance (39.55%) on the test set, after 25 training epochs.

<b>Shape classification results using different feature encoders:</b>

| _Chance_ |  | None (raw data) |
| - | - | --- |
| _33.33%_ |  | 39.55% |

### Coding Exercise 1.2.1: Training a logistic regression classifier along with an encoder

The following code:
*    Uses the same dSprites torch dataset (`dSprites_torchdataset`) initialized above, as well as the training and test samplers (`train_sampler`, `test_sampler`),
*    Again seeds the modules that use random processes, to ensure the results are consistently reproducible,
*    Initializes an encoder network to use in the supervised network using the `models.EncoderCore` class,
*    Sets a proposed number of epochs to use when training the classifier and encoder (`num_epochs=10`).

**Exercise:** Train a classifier, along with the encoder, to classify the input images according to shape, using `models.train_classifier()`. How does it perform?

**Hints**:
- `models.train_classifier()`:
    - Is introduced in **Interactive Demo 1.2.1**.
    - Takes `freeze_features` as an input argument:
        - If set to `True`, the encoder is frozen, and so only the classifier layer is trained.
        - If set to `False`, the encoder is **not** frozen, and is trained along with the classifier layer.
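
For intuition about what freezing means in PyTorch, the sketch below shows one common idiom: switching off `requires_grad` for the encoder's parameters so that only the classifier receives gradients. How `models.train_classifier()` handles `freeze_features` internally is not shown in this tutorial, so treat this as an assumption; the tiny linear layers and the `set_frozen()` helper are hypothetical stand-ins.

```python
import torch

# Hypothetical stand-ins for an encoder and a classifier layer
encoder = torch.nn.Linear(16, 8)
classifier = torch.nn.Linear(8, 3)


def set_frozen(module, frozen):
  """Hypothetical helper: disable (or re-enable) gradients for a module."""
  for param in module.parameters():
    param.requires_grad = not frozen


set_frozen(encoder, frozen=True)

# Only parameters that still require gradients are handed to the optimizer
trainable = [
    p for m in (encoder, classifier) for p in m.parameters()
    if p.requires_grad
    ]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

x = torch.rand(4, 16)
loss = classifier(encoder(x)).sum()
loss.backward()

print("Encoder gradient:", encoder.weight.grad)  # stays None when frozen
print("Classifier has gradient:", classifier.weight.grad is not None)
```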

```python
def train_supervised_encoder(num_epochs, seed):
  """
  Helper function to train the encoder in a supervised way

  Args:
    num_epochs: Integer
      Number of epochs the supervised encoder is to be trained for
    seed: Integer
      The seed value for the dataset/network

  Returns:
    supervised_encoder: nn.Module
      The trained encoder with mentioned parameters/hyperparameters
  """
  # Call this before any dataset/network initializing or training,
  # to ensure reproducibility
  set_seed(seed)

  # Initialize a core encoder network on which the classifier will be added
  supervised_encoder = models.EncoderCore()

  #################################################
  # Fill in missing code below (...),
  # then remove or comment the line below to test your implementation
  raise NotImplementedError("Exercise: Train a supervised encoder and classifier.")
  #################################################
  # Train an encoder and classifier on the images, using models.train_classifier()
  print("Training a supervised encoder and classifier...")
  _ = models.train_classifier(
      encoder=...,
      dataset=...,
      train_sampler=...,
      test_sampler=...,
      freeze_features=...,
      num_epochs=num_epochs,
      verbose=...  # print results
      )

  return supervised_encoder



num_epochs = 10  # Proposed number of training epochs
## Uncomment below to test your function
# supervised_encoder = train_supervised_encoder(num_epochs=num_epochs, seed=SEED)

```

```
Network performance after 10 encoder and classifier training epochs (chance: 33.33%):
    Training accuracy: 100.00%
    Testing accuracy: 98.70%
```

In [None]:
# to_remove solution
def train_supervised_encoder(num_epochs, seed):
  """
  Helper function to train the encoder in a supervised way

  Args:
    num_epochs: Integer
      Number of epochs the supervised encoder is to be trained for
    seed: Integer
      The seed value for the dataset/network

  Returns:
    supervised_encoder: nn.Module
      The trained encoder with mentioned parameters/hyperparameters
  """
  # Call this before any dataset/network initializing or training,
  # to ensure reproducibility
  set_seed(seed)

  # Initialize a core encoder network on which the classifier will be added
  supervised_encoder = models.EncoderCore()
  # Train an encoder and classifier on the images, using models.train_classifier()
  print("Training a supervised encoder and classifier...")
  _ = models.train_classifier(
      encoder=supervised_encoder,
      dataset=dSprites_torchdataset,
      train_sampler=train_sampler,
      test_sampler=test_sampler,
      freeze_features=False,
      num_epochs=num_epochs,
      verbose=True  # print results
      )

  return supervised_encoder



num_epochs = 10  # Proposed number of training epochs
## Test your function
supervised_encoder = train_supervised_encoder(num_epochs=num_epochs, seed=SEED)

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Logistic_regression_classifier_Exercise")

When the classifier is trained with an encoder network, however, it achieves very high classification accuracy (~98.70%) on the test set, after only 10 training epochs.

<b>Shape classification results using different feature encoders:</b>

| _Chance_ |  | None (raw data) | Supervised |
| - | - | --- | --- |
| _33.33%_ |  | 39.55% | 98.70% |

---
# Section 2: Supervised learning induces invariant representations

*Time estimate: ~20mins*

##  Video 2: Supervised Learning and Invariance


In [None]:
# @title Video 2: Supervised Learning and Invariance
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'ZQka4k8ZOs0'), ('Bilibili', 'BV1d54y1E76W')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Supervised_learning_and_invariance_Video")

## Section 2.1: Examining Representational Similarity Matrices (RSMs)

To examine the representations learned by the encoder network, we use **Representational Similarity Matrices (RSMs)**. In these matrices, the similarity between the encoder's representations of each possible pair of images is plotted to reveal overall structure in representation space.

_**Note on cosine similarity:** Here, we use cosine similarity as a measure of representational similarity. Cosine similarity is the cosine of the angle between 2 vectors, and can be thought of as their normalized dot product. It ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction)._
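
The following NumPy sketch illustrates cosine similarity as a normalized dot product on a few hand-picked 2D vectors, and then builds a small RSM from them. The vectors and function name are made up for illustration, and the matrix-product approach shown here is a different route from the `torch.nn.functional.cosine_similarity()` approach used in the exercise below.

```python
import numpy as np


def cosine_similarity(u, v):
  # Dot product divided by the product of the vector lengths
  return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a
d = np.array([-1.0, 0.0])  # opposite direction to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
print(cosine_similarity(a, d))

# A small RSM: normalize each row to unit length, then take all dot products
features = np.array([a, b, c, d])  # (4 items x 2 features)
unit = features / np.linalg.norm(features, axis=1, keepdims=True)
rsm = unit @ unit.T                # (4 items x 4 items)
print(rsm)
```

Note that vector length does not matter: `a` and `b` differ in magnitude but point the same way, so their similarity is maximal.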

### Coding Exercise 2.1.1: Complete a function that calculates RSMs

The following code:
*    Lays out the skeleton of a function `custom_torch_RSM_fct()` which calculates an RSM from features,
*    Tests the custom function against the solution implementation.

**Exercise:** Complete the `custom_torch_RSM_fct()` implementation.

**Hints**:
- `custom_torch_RSM_fct()`:
    - Takes 1 input argument:
        - `features` (2D torch Tensor): Feature matrix (nbr items x nbr features)
    - Returns 1 output:
        - `rsm` (2D torch Tensor): Similarity matrix (nbr items x nbr items)
    - Uses `torch.nn.functional.cosine_similarity()`.
- `torch.nn.functional.cosine_similarity()`:
    - Takes 3 arguments, in order:
        - `x1` (torch Tensor),
        - `x2` (torch Tensor),
        - `dim` (int)
    - Returns the similarity between `x1` and `x2` along dimension `dim`.

**Detailed hint**:
- To use `torch.nn.functional.cosine_similarity()` to measure the similarity of `features` to **itself** for each possible **pair of items**:
    - Pass 2 versions of `features` as `x1` and `x2`, respectively.
    - Ensure that for `x1` and `x2`, the **features dimension is at the same position**, and specify that dimension with `dim`.
    - To obtain the similarity between each possible pair of items, ensure that for `x1` and `x2`, the **items dimensions are orthogonal** to one another (i.e., at different positions).
    - Don't forget that to achieve this, singleton dimensions (i.e., dimensions of length 1) can be used.
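
To show how singleton dimensions produce all pairwise combinations through broadcasting, here is a NumPy sketch that uses a different measure (Euclidean distance) on made-up points, so as not to give away the exercise. The same shape logic carries over to tensors in torch.

```python
import numpy as np

points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])       # (3 items x 2 features)

# Place the items dimension at different positions using singleton dimensions
rows = points[:, np.newaxis, :]       # shape (3, 1, 2)
cols = points[np.newaxis, :, :]       # shape (1, 3, 2)

# Broadcasting expands both to (3, 3, 2): one entry per pair of items
diffs = rows - cols

# Reduce over the features dimension (position 2 in both inputs)
dists = np.sqrt((diffs ** 2).sum(axis=2))  # shape (3, 3)
print(dists)
```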

```python
def custom_torch_RSM_fct(features):
  """
  Custom function to calculate representational similarity matrix (RSM) of a feature
  matrix using pairwise cosine similarity.

  Args:
    features: 2D torch.Tensor
      Feature matrix of size (nbr items x nbr features)

  Returns:
    rsm: 2D torch.Tensor
      Similarity matrix of size (nbr items x nbr items)
  """

  num_items, num_features = features.shape

  #################################################
  # Fill in missing code below (...), following the guidelines above.
  # Use torch.nn.functional.cosine_similarity(),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Exercise: Implement RSM calculation.")
  #################################################
  # EXERCISE: Implement RSM calculation
  rsm = ...

  if not rsm.shape == (num_items, num_items):
    raise ValueError(f"RSM should be of shape ({num_items}, {num_items})")

  return rsm



## Test implementation by comparing output to solution implementation
# test_custom_torch_RSM_fct(custom_torch_RSM_fct)

```

```
custom_torch_RSM_fct() is correctly implemented.
```

In [None]:
# to_remove solution
def custom_torch_RSM_fct(features):
  """
  Custom function to calculate representational similarity matrix (RSM) of a feature
  matrix using pairwise cosine similarity.

  Args:
    features: 2D torch.Tensor
      Feature matrix of size (nbr items x nbr features)

  Returns:
    rsm: 2D torch.Tensor
      Similarity matrix of size (nbr items x nbr items)
  """

  num_items, num_features = features.shape

  # EXERCISE: Implement RSM calculation
  rsm = torch.nn.functional.cosine_similarity(
      features.reshape(1, num_items, num_features),
      features.reshape(num_items, 1, num_features),
      dim=2
      )

  if not rsm.shape == (num_items, num_items):
    raise ValueError(f"RSM should be of shape ({num_items}, {num_items})")

  return rsm



## Test implementation by comparing output to solution implementation
test_custom_torch_RSM_fct(custom_torch_RSM_fct)
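
As an optional sanity check, the same matrix can be obtained by normalizing each feature vector to unit length and taking a matrix product. The sketch below compares both routes on random data; `rsm_via_matmul` and its `eps` argument are hypothetical names introduced here, not part of the tutorial's helper modules.

```python
import torch
import torch.nn.functional as F


def rsm_via_matmul(features, eps=1e-8):
  """Cosine-similarity RSM computed from unit-length rows (illustrative helper)."""
  # Euclidean norm of each item's feature vector, kept as a column
  norms = features.pow(2).sum(dim=1, keepdim=True).sqrt()
  # Guard against division by zero for all-zero rows
  unit_features = features / norms.clamp_min(eps)
  # Dot products between unit vectors are cosine similarities
  return unit_features @ unit_features.T


torch.manual_seed(0)
features = torch.randn(5, 4)  # 5 items x 4 features

rsm_broadcast = F.cosine_similarity(
    features.unsqueeze(0), features.unsqueeze(1), dim=2
    )
rsm_matmul = rsm_via_matmul(features)

print(torch.allclose(rsm_broadcast, rsm_matmul, atol=1e-5))
```

The matrix product avoids materializing the intermediate (items x items x features) tensor, which may matter for larger feature matrices.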

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Function_that_calculates_RSMs_Exercise")

### Interactive Demo 2.1.1: Plotting the supervised network encoder RSM along different latent dimensions

In this demo, we calculate an RSM for representations of the test set images generated by the supervised network encoder.

The following code:
*    Calculates and plots the RSM for the test set, with rows and columns sorted by whichever latent dimension is specified (e.g., `sorting_latent="shape"`) using `models.plot_model_RSMs()`.

**Interactive Demo:** In the current example, the rows and columns of the RSM are organized along the `shape` latent dimension. Try organizing them along one of the other latent dimensions (`"scale"`, `"orientation"`, `"posX"` or `"posY"`) to see whether different patterns emerge. (The original setting is `sorting_latent="shape"`.)

In [None]:
sorting_latent = "shape"  # DEMO: Try sorting by different latent dimensions
print("Plotting RSMs...")
_ = models.plot_model_RSMs(
    encoders=[supervised_encoder],  # We pass the trained supervised_encoder
    dataset=dSprites_torchdataset,
    sampler=test_sampler,  # We want to see the representations on the held out test set
    titles=["Supervised network encoder RSM"],  # Plot title
    sorting_latent=sorting_latent,
    )
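
`models.plot_model_RSMs()` handles the sorting internally, and its implementation is not shown here. As a rough sketch of what sorting an RSM along a latent dimension involves, the hypothetical helper below reorders rows and columns with the same permutation, so that items sharing a latent value become neighbours.

```python
import torch


def sort_rsm_by_latent(rsm, latent_values):
  """Reorders RSM rows and columns by ascending latent value (illustrative helper)."""
  order = torch.argsort(latent_values)
  # Apply the same permutation to rows and columns
  return rsm[order][:, order], order


# Toy example: similarity is 1 for items with the same label, 0 otherwise
labels = torch.tensor([1, 0, 1, 0])
rsm = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()

sorted_rsm, order = sort_rsm_by_latent(rsm, labels)
print(sorted_rsm)  # blocks of ones now lie along the diagonal
```

Sorting changes only the layout of the matrix, not its values, which is why the same RSM can look structured under one latent and unstructured under another.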

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Supervised_network_encoder_RSM_Interactive_Demo")

### Discussion 2.1.1: What patterns do the RSMs reveal about how the encoder represents different images?
**A.** What does the yellow (maximal similarity color) diagonal, going from the top left to the bottom right, correspond to?
**B.** What pattern can be observed when comparing RSM values for pairs of images that share a similar latent value (e.g., 2 heart images) vs pairs of images that do not (e.g., a heart and a square image)?
**C.** Do some shapes appear to be encoded more similarly than others?
**D.** Do some latent dimensions show clearer RSM patterns than others? Why might that be so?

 #### Supporting images for Discussion response examples for 2.1.1


In [None]:
# @markdown #### Supporting images for Discussion response examples for 2.1.1
Image(filename=os.path.join(REPO_PATH, "images", "rsms_supervised_encoder_10ep_bs1000_seed2021.png"), width=1200)

In [None]:
# to_remove explanation

"""
A. The yellow diagonal corresponds to the similarity between each encoded image and itself.
   Since each encoded image is, of course, identical to itself, the similarity
   is 1 at each point on the diagonal.

B. The pattern we observe is that there are square sections of the RSM that have
   higher similarity values than the rest, and these sections lie along the
   yellow diagonal.
   These sections correspond to the similarities between **encoded images of the
   same shape** (e.g., 2 hearts), which are generally **higher than the similarities
   between encoded images of different shapes** (e.g., a heart and a square),
   when using this trained, supervised encoder.

C. It is a bit subtle, but it looks like the **hearts and squares** might be
   encoded more similarly to one another than the **ovals and squares**, in general.
   This is based on the fact that the RSM values for hearts x squares
   (bottom left and top right) appear to be lighter (more yellow, hence higher) than the RSM
   values for ovals x squares (top middle and middle left),
   which are a bit darker (more blue, hence lower).

D. If we sort by different latent dimensions (e.g., `scale`, `orientation`, `posX` or `posY`),
   we do not see as much structure in the RSMs. This is because the supervised
   encoder is specifically trained on a shape classification task, which forces
   it to encode images of the same shape more similarly, and images of different
   shapes more differently. It is not trained to distinguish scales, orientations
   or positions. If it **were** trained to predict `orientation`, `scale` or `position`,
   we could expect to see similar RSM patterns, with high similarity along the
   diagonal for the predicted dimension.
""";

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_What_patterns_do_the_RSMs_reveal_Discussion")
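
The block structure discussed above can also be summarized numerically by comparing the mean similarity of same-label pairs to that of different-label pairs. The following is a minimal sketch on toy features; `within_between_similarity` and the toy data are assumptions made for illustration, not part of the tutorial code.

```python
import torch
import torch.nn.functional as F


def within_between_similarity(rsm, labels):
  """Mean RSM value for same-label vs different-label pairs (illustrative helper)."""
  same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
  # Exclude the diagonal, where each item is compared to itself
  off_diagonal = ~torch.eye(len(labels), dtype=torch.bool)
  within = rsm[same_label & off_diagonal].mean().item()
  between = rsm[~same_label].mean().item()
  return within, between


# Toy features: two items per class, with classes pointing in different directions
features = torch.tensor([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
labels = torch.tensor([0, 0, 1, 1])

rsm = F.cosine_similarity(features.unsqueeze(0), features.unsqueeze(1), dim=2)
within, between = within_between_similarity(rsm, labels)
print(f"within-class: {within:.3f}, between-class: {between:.3f}")
```

A large gap between the two means corresponds to clearly visible squares along the diagonal when the RSM is sorted by that label.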

---
# Section 3: Random projections don’t work as well

*Time estimate: ~20mins*


##  Video 3: Random Representations


In [None]:
# @title Video 3: Random Representations
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'LVM7Fm5T6Fs'), ('Bilibili', 'BV1Jf4y15789')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Random_representations_Video")

## Section 3.1: Examining RSMs of a random encoder

To determine whether the patterns observed in the RSMs of the supervised network encoder are trivial, we investigate whether they also emerge from the **random projections of an untrained encoder**.

### Coding Exercise 3.1.1: Plotting a random network encoder RSM along different latent dimensions

In this exercise, we repeat the same analysis as in **Section 2.1**, but with a random encoder.

The following code:
*    Initializes an untrained encoder network, which serves as the random encoder, using the `models.EncoderCore` class,
*    Proposes a latent dimension along which to sort the rows and columns (`sorting_latent="shape"`).

**Exercise:**
*    Visualize the RSMs for the supervised and random network encoders, using `models.plot_model_RSMs()`.
*    Visualize the RSMs, organized along different latent dimensions (`"scale"`, `"orientation"`, `"posX"` or `"posY"`), and compare the patterns observed for the supervised versus the random encoder network.

**Hint**: `models.plot_model_RSMs()` is introduced in **Interactive Demo 2.1.1**.

```python
def plot_rsms(seed):
  """
  Helper function to plot Representational Similarity Matrices (RSMs)

  Args:
    seed: Integer
      The seed value for the dataset/network

  Returns:
    random_encoder: nn.Module
      The untrained, randomly initialized core encoder
  """
  # Call this before any dataset/network initializing or training,
  # to ensure reproducibility
  set_seed(seed)

  # Initialize a core encoder network that will not get trained
  random_encoder = models.EncoderCore()

  # Try sorting by different latent dimensions
  sorting_latent = "shape"

  #################################################
  # Fill in missing code below (...),
  # then remove or comment the line below to test your implementation
  raise NotImplementedError("Exercise: Plot RSMs.")
  #################################################
  # Plot RSMs
  print("Plotting RSMs...")
  _ = models.plot_model_RSMs(
      encoders=[..., ...],  # Pass both encoders
      dataset=...,
      sampler=...,  # To see the representations on the held out test set
      titles=["Supervised network encoder RSM",
              "Random network encoder RSM"],  # Plot titles
      sorting_latent=sorting_latent,
      )

  return random_encoder



## Uncomment below to test your function
# random_encoder = plot_rsms(seed=SEED)

```

In [None]:
# to_remove solution
def plot_rsms(seed):
  """
  Helper function to plot Representational Similarity Matrices (RSMs)

  Args:
    seed: Integer
      The seed value for the dataset/network

  Returns:
    random_encoder: nn.Module
      The untrained, randomly initialized core encoder
  """
  # Call this before any dataset/network initializing or training,
  # to ensure reproducibility
  set_seed(seed)

  # Initialize a core encoder network that will not get trained
  random_encoder = models.EncoderCore()

  # Try sorting by different latent dimensions
  sorting_latent = "shape"

  # Plot RSMs
  print("Plotting RSMs...")
  _ = models.plot_model_RSMs(
      encoders=[supervised_encoder, random_encoder],  # Pass both encoders
      dataset=dSprites_torchdataset,
      sampler=test_sampler,  # To see the representations on the held out test set
      titles=["Supervised network encoder RSM",
              "Random network encoder RSM"],  # Plot titles
      sorting_latent=sorting_latent,
      )

  return random_encoder



## Uncomment below to test your function
with plt.xkcd():
  random_encoder = plot_rsms(seed=SEED)

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Plotting_a_random_network_encoder_Exercise")

### Discussion 3.1.1: What does comparing these RSMs reveal about the potential value of trained versus random encoder representations?

**A.** What patterns, if any, are visible in the random network encoder RSM?
**B.** Which encoder network is most likely to produce meaningful representations?

 #### Supporting images for Discussion response examples for 3.1.1: All random encoder RSMs


In [None]:
# @markdown #### Supporting images for Discussion response examples for 3.1.1: All random encoder RSMs
Image(filename=os.path.join(REPO_PATH, "images", "rsms_random_encoder_0ep_bs0_seed2021.png"), width=1000)

In [None]:
# to_remove explanation

"""
A. Only the yellow diagonal identity line is visible. No other patterns emerge,
   as most images are encoded with near 0 similarity to one another, using the random encoder.

B. The trained, supervised network produces more meaningful representations,
   as the similarity between different encoded images actually
   **captures certain meaningful conceptual similarities between the different images**,
   specifically shape similarities. In other words, the image representations
   obtained with the trained, supervised encoding reflect the fact that two hearts
   are more conceptually similar to each other in terms of shape than a heart and a square.
""";

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Trained_vs_Random_encoder_Discussion")

### Coding Exercise 3.1.2: Evaluating the classification performance of a logistic regression trained on the representations produced by a random network encoder

In this exercise, we repeat a similar analysis to **Section 1.2**, but with the random encoder network. Importantly, this time, the encoder parameters must stay frozen during training by setting `freeze_features=True`. Instead of being given a suggested number of training epochs ahead of time, we use the training loss array to select a good value.


The following code:
* Trains a logistic regression on top of the random encoder network to classify images based on shape, and assesses its performance on the test set images using `models.train_classifier()` with `freeze_features=True` to ensure that the encoder is **not** trained, and only the classifier is.
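
The internals of `models.train_classifier()` are not shown here, so the following is only a sketch of what keeping an encoder frozen can look like in plain PyTorch. The small `encoder` and `classifier` modules, the random data and the hyperparameters are all stand-ins invented for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for an encoder and a logistic regression classifier
encoder = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
classifier = nn.Linear(8, 3)  # 3 classes

# Freeze the encoder: no gradients are computed for its parameters
for param in encoder.parameters():
  param.requires_grad = False
encoder.eval()

# Only the classifier parameters are handed to the optimizer
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 16)
targets = torch.randint(0, 3, (32,))

encoder_before = [p.detach().clone() for p in encoder.parameters()]
classifier_before = classifier.weight.detach().clone()

for _ in range(5):
  optimizer.zero_grad()
  with torch.no_grad():  # features are treated as fixed inputs
    features = encoder(inputs)
  loss = loss_fn(classifier(features), targets)
  loss.backward()
  optimizer.step()

encoder_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(encoder_before, encoder.parameters())
    )
classifier_changed = not torch.equal(classifier_before, classifier.weight)
print(encoder_unchanged, classifier_changed)
```

With the encoder frozen, classification accuracy reflects how linearly separable the existing representations are, rather than how well the encoder can be adapted to the task.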

**Exercise:**
* Set a number of epochs for which to train the classifier.
* Plot the training loss array (`random_loss_array`, i.e., training loss at each epoch) returned when training the model.
* Rerun the classifier if more training epochs are needed based on the progression of the training loss.


```python
def plot_loss(num_epochs, seed):
  """
  Helper function to train a classifier on top of the frozen random encoder
  and plot its training loss

  Args:
    num_epochs: Integer
      Number of epochs to train the classifier for
    seed: Integer
      The seed value for the dataset/network

  Returns:
    random_loss_array: List
      Loss per epoch
  """
  # Call this before any dataset/network initializing or training,
  # to ensure reproducibility
  set_seed(seed)

  # Train classifier on the randomly encoded images
  print("Training a classifier on the random encoder representations...")
  _, random_loss_array, _, _ = models.train_classifier(
      encoder=random_encoder,
      dataset=dSprites_torchdataset,
      train_sampler=train_sampler,
      test_sampler=test_sampler,
      freeze_features=True,  # Keep the encoder frozen while training the classifier
      num_epochs=num_epochs,
      verbose=True  # Print results
      )
  #################################################
  # Fill in missing code below (...),
  # then remove or comment the line below to test your implementation
  raise NotImplementedError("Exercise: Plot loss array.")
  #################################################
  # Plot the loss array
  fig, ax = plt.subplots()
  ax.plot(...)
  ax.set_title(...)
  ax.set_xlabel(...)
  ax.set_ylabel(...)

  return random_loss_array



## Set a reasonable number of training epochs
num_epochs = 25
## Uncomment below to test your plot
# random_loss_array = plot_loss(num_epochs=num_epochs, seed=SEED)

```

```
Network performance after 25 classifier training epochs (chance: 33.33%):
    Training accuracy: 46.02%
    Testing accuracy: 44.67%
```

In [None]:
# to_remove solution
def plot_loss(num_epochs, seed):
  """
  Helper function to train a classifier on top of the frozen random encoder
  and plot its training loss

  Args:
    num_epochs: Integer
      Number of epochs to train the classifier for
    seed: Integer
      The seed value for the dataset/network

  Returns:
    random_loss_array: List
      Loss per epoch
  """
  # Call this before any dataset/network initializing or training,
  # to ensure reproducibility
  set_seed(seed)

  # Train classifier on the randomly encoded images
  print("Training a classifier on the random encoder representations...")
  _, random_loss_array, _, _ = models.train_classifier(
      encoder=random_encoder,
      dataset=dSprites_torchdataset,
      train_sampler=train_sampler,
      test_sampler=test_sampler,
      freeze_features=True,  # Keep the encoder frozen while training the classifier
      num_epochs=num_epochs,
      verbose=True  # Print results
      )

  # Plot the loss array
  fig, ax = plt.subplots()
  ax.plot(random_loss_array)
  ax.set_title("Loss of classifier trained on a random encoder.")
  ax.set_xlabel("Epoch number")
  ax.set_ylabel("Training loss")

  return random_loss_array



## Set a reasonable number of training epochs
num_epochs = 25
## Uncomment below to test your plot
with plt.xkcd():
  random_loss_array = plot_loss(num_epochs=num_epochs, seed=SEED)

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Evaluating_the_classification_performance_Exercise")

The training loss is fairly stable by 25 epochs, at which point the classifier reaches 44.67% accuracy on the test dataset.
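
One rough way to judge whether a training loss has stabilized, beyond inspecting the plot, is to compare its average over the most recent epochs to its average over the epochs just before. The helper below is a hypothetical sketch: the name `has_plateaued`, the `window` size and the `rel_tol` threshold are arbitrary choices made for this example, not values used by the tutorial.

```python
def has_plateaued(loss_array, window=5, rel_tol=0.01):
  """Returns True if the mean loss improved by less than rel_tol (relative)
  between the previous `window` epochs and the last `window` epochs."""
  if len(loss_array) < 2 * window:
    return False  # not enough epochs to compare two windows
  previous = sum(loss_array[-2 * window:-window]) / window
  recent = sum(loss_array[-window:]) / window
  if previous == 0:
    return True  # loss is already zero, nothing left to improve
  return (previous - recent) / abs(previous) < rel_tol


# Synthetic loss curves (made up for illustration)
still_decreasing = [1.0 / (epoch + 1) for epoch in range(10)]
stable = [1.1 + 0.5 * 0.5 ** epoch for epoch in range(25)]

print(has_plateaued(still_decreasing))
print(has_plateaued(stable))
```

If such a check fails on a returned loss array like `random_loss_array`, rerunning the training with a larger `num_epochs` is the natural next step.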

<b>Shape classification results using different feature encoders:</b>

| _Chance_ |  | None (raw data) | Supervised | Random |
| - | - | --- | --- | --- |
| _33.33%_ |  | 39.55% | 98.70% | 44.67% |

### Discussion 3.1.2: What can we conclude about the potential consequences of using random projections with a dataset like dSprites?

**A.** How does the classifier performance compare to the classifier trained directly on the images?
**B.** How does the classifier performance compare to the classifier trained along with the encoder (supervised encoder)?
**C.** What explains these different performances?

In [None]:
# to_remove explanation

"""
A. The classifier trained with the random encoder appears to perform a bit
   better than the classifier trained directly on the raw data.

B. The classifier trained with the random encoder performs substantially worse
   than the classifier trained along with the encoder (supervised encoder).

C. The random encoder projects the raw data to a lower dimensional
   feature space (84 features instead of 64 x 64 pixels).
   This **reduction in dimensionality**, as well as the possibility that some
   of the features **may randomly carry some shape-relevant information**,
   may explain the slight improvement in classification performance over training
   directly on the raw data. However, since the features are random,
   they are far less useful for the shape classification task than the
   supervised encoder's features which were specifically tuned to that task.
""";

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Random_projections_with_dSprites_Discussion")

---
# Section 4: Generative approaches to representation learning can fail

*Time estimate: ~30mins*


##  Video 4: Generative models


In [None]:
# @title Video 4: Generative models
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'NUittg0EKSM'), ('Bilibili', 'BV1YP4y147UT')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Generative_models_Video")

## Section 4.1: Examining the RSMs of a Variational Autoencoder

We next ask: what kind of representations can a network learn in the absence of labelled data? To answer this question, we first look at a **generative model**, namely the **Variational Autoencoder (VAE)**.

Given that generative models typically require more training than supervised models, instead of training a network here, we will load one that was **pre-trained for 300 epochs**. Importantly, the **encoder shares the same architecture** as the one used for the supervised and random examples above.

The following code:
* Loads the parameters of a full Variational AutoEncoder (VAE) network (encoder and decoder) pre-trained on the generative task of reconstructing the input images, under the Kullback–Leibler Divergence (KLD) minimization constraint over the latent space that characterizes VAEs, using `load.load_encoder()` and `load.load_vae_decoder()`.
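
The exact loss used to pre-train this VAE is not shown in this section. As a sketch of how a reconstruction term and a KLD term are commonly combined, the example below assumes a binary cross-entropy reconstruction loss, a diagonal Gaussian posterior parameterized by `mu` and `logvar`, and a standard normal prior. The function `vae_loss` and the weighting factor `beta` are hypothetical names introduced for illustration.

```python
import torch
import torch.nn.functional as F


def vae_loss(recon_logits, targets, mu, logvar, beta=1.0):
  """Per-item VAE loss: reconstruction term plus beta times the KLD term."""
  batch_size = targets.shape[0]
  # Reconstruction: how well the decoder output matches the input image
  recon = F.binary_cross_entropy_with_logits(
      recon_logits, targets, reduction="sum"
      ) / batch_size
  # Closed-form KLD between N(mu, exp(logvar)) and the N(0, 1) prior
  kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch_size
  return recon + beta * kld, recon, kld


# Toy batch: 2 flattened "images" of 4 binary pixels, 3 latent dimensions
targets = torch.tensor([[0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]])
recon_logits = torch.zeros(2, 4)  # decoder is maximally uncertain
mu = torch.ones(2, 3)             # posterior means shifted away from the prior
logvar = torch.zeros(2, 3)        # unit posterior variance

total, recon, kld = vae_loss(recon_logits, targets, mu, logvar)
print(f"total: {total.item():.4f}, recon: {recon.item():.4f}, KLD: {kld.item():.4f}")
```

The KLD term pulls the encoder's output distribution toward the prior, so the learned feature space is shaped by both objectives rather than by reconstruction alone.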

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

# Load VAE encoder and decoder pre-trained on the reconstruction and KLD tasks
vae_encoder = load.load_encoder(REPO_PATH, model_type="vae")
vae_decoder = load.load_vae_decoder(REPO_PATH)

### Interactive Demo 4.1.1: Plotting example reconstructions using the pre-trained VAE encoder and decoder

In this demo, we sample images from the test set, and take a look at the quality of the reconstructions using `models.plot_vae_reconstructions()`.

**Interactive Demo:** Try plotting different images from the test dataset by selecting different `test_sampler.indices` values. (Original setting is `indices=test_sampler.indices[:10]`.)

In [None]:
models.plot_vae_reconstructions(
    vae_encoder,  # Pre-trained encoder
    vae_decoder,  # Pre-trained decoder
    dataset=dSprites_torchdataset,
    indices=test_sampler.indices[:10],  # DEMO: Select different indices to plot from the test set
    title="VAE test set image reconstructions",
    )

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Pretrained_VAE_Interactive_Demo")

### Discussion 4.1.1: How does the VAE perform on the reconstruction task?
**A.** Which latent features does the network appear to preserve well, and which does it preserve less well?
**B.** Based on the reconstruction performance, what do you expect to see in the different RSMs?

**Note on reconstruction quality:** This VAE network uses a basic VAE loss with a convolutional encoder (our core encoder network), and a deconvolutional decoder. This can lead to some blurriness in the reconstructed shapes which a more sophisticated VAE could overcome.

In [None]:
# to_remove explanation

"""
A. The reconstructions preserve most of the latent features well:
  - Shape: oval shapes are well preserved, whereas squares lose some sharpness
    in their edges. Hearts are poorly preserved, as the sharp angles of their edges are lost.
  - Scale: shape scales are well preserved.
  - Orientation: orientations are well preserved.
  - PosX and PosY: shape positions are very well preserved.

B. Since several of the latent features of the images are well preserved
   in the reconstructions, it is possible that the VAE encoder has indeed
   learned a feature space very similar to the known latent dimensions of the data.
   However, it is also possible that the VAE encoder instead learned a different
   latent feature space that is good enough to achieve reasonable image reconstruction.
   Examining the RSMs should shed light on that.
""";

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_VAE_on_the_reconstruction_task_Discussion")

### Interactive Demo 4.1.2: Visualizing the VAE encoder RSMs, organized along different latent dimensions

We will now compare the pre-trained VAE encoder network RSM to the previously generated encoder RSMs.

**Interactive Demo:** Visualize the RSMs, organized along different latent dimensions (`"scale"`, `"orientation"`, `"posX"` or `"posY"`), and compare the patterns observed for the different encoder networks. (The original setting is `sorting_latent="shape"`.)

In [None]:
sorting_latent = "shape"  # DEMO: Try sorting by different latent dimensions
print("Plotting RSMs...")
_ = models.plot_model_RSMs(
    encoders=[supervised_encoder, random_encoder, vae_encoder],  # Pass all three encoders
    dataset=dSprites_torchdataset,
    sampler=test_sampler,  # To see the representations on the held out test set
    titles=["Supervised network encoder RSM", "Random network encoder RSM",
            "VAE network encoder RSM"],  # Plot titles
    sorting_latent=sorting_latent,
    )

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_VAE_encoder_RSMs_Interactive_Demo")

### Discussion 4.1.2: What can we conclude about the ability of generative models like VAEs to construct a meaningful representation space?

**A.** What structure can be observed in the pre-trained VAE encoder RSMs when sorted along the different latent dimensions, and what does that suggest about the feature space learned by the VAE encoder?
**B.** How do the pre-trained VAE encoder RSMs compare to the supervised and random encoder network RSMs?
**C.** What explains these different RSMs?
**D.** How well will the pre-trained VAE encoder likely perform on the shape classification task, as compared to the other encoder networks?
**E.** Might the pre-trained VAE encoder be better suited to predicting a different latent dimension?

 #### Supporting images for Discussion response examples for 4.1.2: All VAE encoder RSMs


In [None]:
# @markdown #### Supporting images for Discussion response examples for 4.1.2: All VAE encoder RSMs
Image(filename=os.path.join(REPO_PATH, "images", "rsms_vae_encoder_300ep_bs500_seed2021.png"), width=1000)

In [None]:
# to_remove explanation

"""
A. The VAE RSMs show very little structure along the diagonal when sorted by `shape` or `orientation`,
   even though both features were reasonably well preserved in the reconstructions.
   Some weak structure may be visible for `scale`, but the **clearest structure emerges for `posX` and `posY`**.
   This structure shows that the VAE encoder encodes shapes at nearby positions
   more similarly to each other than shapes that are farther apart.
   Together, these results suggest that although **the VAE was able to reconstruct the images,
   it did not end up learning a feature space that fits the known latent dimensions** of the dataset, except position in x and y.

B. Like the random encoder, but unlike the supervised encoder, the pre-trained
   VAE encoder **shape RSM** shows no structure. However, the pre-trained
   VAE encoder's **position RSMs** do show more structure than either
   the supervised or random encoder RSMs.

C. The differences between the supervised and pre-trained VAE encoder RSMs
   reflect the fact that the encoders are trained on very different
   tasks: classification and reconstruction, respectively. With such different
   constraints and requirements, the feature space needed to accomplish these
   tasks is likely to be quite different, as is the case here. Of course, the
   random encoder is not trained at all, so its feature space is unlikely to
   randomly be similar to either the supervised or VAE encoder's feature space.

D. The pre-trained VAE encoder is not likely to perform very well on shape classification,
   given the lack of structure in its RSM.

E. The pre-trained VAE encoder might be better suited to **predicting `posX` or `posY`**,
   as the RSMs show that its feature space does encode these features to some extent.
""";

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Construct_a_meaningful_representation_space_Discussion")

### Coding Exercise 4.1.2: Evaluating the classification performance of a logistic regression trained on the representations produced by the pre-trained VAE network encoder

As the parameters of the pre-trained VAE encoder have already been trained, they should be kept frozen while the classifier is trained. This is done by setting `freeze_features=True`.

**Exercise:**
*     Set a number of epochs for which to train the classifier.
*     Train a classifier on top of the frozen, pre-trained VAE encoder to classify the input images according to shape, using `models.train_classifier()`.
*     Plot the loss array returned when training the model, and update the number of training epochs, if needed.

**Hint**: `models.train_classifier()` is introduced in **Interactive Demo 1.2.1**.

```python
def vae_train_loss(num_epochs, seed):
  """
  Helper function to train a classifier on top of the frozen, pre-trained
  variational autoencoder (VAE) encoder and plot its training loss

  Args:
    num_epochs: Integer
      Number of epochs to train the classifier for
    seed: Integer
      The seed value for the dataset/network

  Returns:
    vae_loss_array: List
      Loss per epoch
  """
  # Call this before any dataset/network initializing or training,
  # to ensure reproducibility
  set_seed(seed)
  #################################################
  # Fill in missing code below (...),
  # then remove or comment the line below to test your implementation
  raise NotImplementedError("Exercise: Train a classifier on the pre-trained VAE encoder representations.")
  #################################################
  # Train a classifier on top of the frozen VAE encoder, using models.train_classifier()
  print("Training a classifier on the pre-trained VAE encoder representations...")
  _, vae_loss_array, _, _ = models.train_classifier(
      encoder=...,
      dataset=...,
      train_sampler=...,
      test_sampler=...,
      freeze_features=..., # Keep the encoder frozen while training the classifier
      num_epochs=...,
      verbose=... # Print results
      )
  #################################################
  # Fill in missing code below (...),
  # then remove or comment the line below to test your implementation
  raise NotImplementedError("Exercise: Plot the VAE classifier training loss.")
  #################################################
  # Plot the VAE classifier training loss.
  fig, ax = plt.subplots()
  ax.plot(...)
  ax.set_title(...)
  ax.set_xlabel(...)
  ax.set_ylabel(...)

  return vae_loss_array



# Set a reasonable number of training epochs
num_epochs = 25
## Uncomment below to test your function
# vae_loss_array = vae_train_loss(num_epochs=num_epochs, seed=SEED)

```

```
Network performance after 25 classifier training epochs (chance: 33.33%):
    Training accuracy: 46.48%
    Testing accuracy: 45.75%
```

In [None]:
# to_remove solution
def vae_train_loss(num_epochs, seed):
  """
  Helper function to train a classifier on top of the frozen, pre-trained
  variational autoencoder (VAE) encoder and plot its training loss

  Args:
    num_epochs: Integer
      Number of epochs to train the classifier for
    seed: Integer
      The seed value for the dataset/network

  Returns:
    vae_loss_array: List
      Loss per epoch
  """
  # Call this before any dataset/network initializing or training,
  # to ensure reproducibility
  set_seed(seed)
  # Train a classifier on top of the frozen VAE encoder, using models.train_classifier()
  print("Training a classifier on the pre-trained VAE encoder representations...")
  _, vae_loss_array, _, _ = models.train_classifier(
      encoder=vae_encoder,
      dataset=dSprites_torchdataset,
      train_sampler=train_sampler,
      test_sampler=test_sampler,
      freeze_features=True, # Keep the encoder frozen while training the classifier
      num_epochs=num_epochs,
      verbose=True # Print results
      )

  # Plot the VAE classifier training loss.
  fig, ax = plt.subplots()
  ax.plot(vae_loss_array)
  ax.set_title("Loss of classifier trained on a VAE encoder")
  ax.set_xlabel("Epoch number")
  ax.set_ylabel("Training loss")

  return vae_loss_array



# Set a reasonable number of training epochs
num_epochs = 25
## Test your function
with plt.xkcd():
  vae_loss_array = vae_train_loss(num_epochs=num_epochs, seed=SEED)

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Evaluate_performance_using_pretrained_VAE_Exercise")

The training loss is fairly stable by 25 epochs, at which point the classifier performs at 45.75% accuracy on the test dataset.

<b>Shape classification results using different feature encoders:</b>

| _Chance_ |  | None (raw data) | Supervised | Random | VAE |
| - | - | --- | --- | --- | --- |
| _33.33%_ |  | 39.55% | 98.70% | 44.67% | 45.75% |

---
# Section 5: The modern approach to self-supervised training for invariance

*Time estimate: ~10mins*

##  Video 5: Modern Approach in Self-supervised Learning


In [None]:
# @title Video 5: Modern Approach in Self-supervised Learning
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'hUWcsSFWZyw'), ('Bilibili', 'BV1Bv411n7zP')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Modern_approach_in_Selfsupervised_Learning_Video")

## Section 5.1: Examining different options for learning invariant representations.

We now take a look at a few options for learning invariant shape representations for a dataset such as **dSprites**.

### Interactive Demo 5.1.1: Visualizing a few different image transformations available that could be used to learn invariance

The following code:
*    Initializes a set of transforms called `invariance_transforms` using the `torchvision.transforms.RandomAffine` class,
*    Collects the dSprites dataset into a torch dataset `dSprites_invariance_torchdataset` which takes the `invariance_transforms` as input and deploys the transforms when it is called,
*    Shows a few examples of images and their transformed versions using the `data.dSpritesTorchDataset` `show_images()` method.

The `torchvision.transforms.RandomAffine` class enables us to predetermine which types and ranges of transforms will be sampled from when transforming the images, by setting the following arguments:
*    `degrees`: Absolute maximum number of degrees to rotate
*    `translate`: Absolute maximum proportion of width to shift in x, and of height to shift in y
*   `scale`: Minimum to maximum scaling factor

**Interactive Demo:** Try out a few combinations of the transformation parameters, and visualize the pairs of transformations of the same image. (The original settings are `degrees=90`, `translate=(0.2, 0.2)`, `scale=(0.8, 1.2)`.)
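To build intuition for what these three arguments control, here is a minimal, dependency-free sketch of how one set of transform parameters could be drawn. It follows the meaning of the arguments as described above and assumes 64 x 64 images; `sample_affine_params` is a hypothetical helper written for illustration, not part of torchvision or of this tutorial's `data` module, and the library's internals may differ in detail.

```python
import random


def sample_affine_params(degrees, translate, scale, img_size, rng):
    """Draw one (angle, (tx, ty), scale_factor) triple.

    Hypothetical helper: it only samples parameters and does not
    apply the transform to an image.
    """
    width, height = img_size
    angle = rng.uniform(-degrees, degrees)      # rotation, in degrees
    max_dx = translate[0] * width               # largest shift in x, in pixels
    max_dy = translate[1] * height              # largest shift in y, in pixels
    tx = round(rng.uniform(-max_dx, max_dx))
    ty = round(rng.uniform(-max_dy, max_dy))
    scale_factor = rng.uniform(scale[0], scale[1])
    return angle, (tx, ty), scale_factor


rng = random.Random(0)
params = [
    sample_affine_params(90, (0.2, 0.2), (0.8, 1.2), (64, 64), rng)
    for _ in range(5)
]
for angle, shift, factor in params:
    print(f"angle={angle:7.2f}  shift={shift}  scale={factor:.2f}")
```

Each call yields a different triple, which is why two augmentations of the same image generally differ in rotation, position and size.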

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

# DEMO: Try some random affine data augmentations combinations to apply to the images
invariance_transforms = torchvision.transforms.RandomAffine(
    degrees=90,
    translate=(0.2, 0.2),  # (in x, in y)
    scale=(0.8, 1.2)   # min to max scaling
    )

# Initialize a simclr-specific torch dataset
dSprites_invariance_torchdataset = data.dSpritesTorchDataset(
    dSprites,
    target_latent="shape",
    simclr=True,
    simclr_transforms=invariance_transforms
    )

# Show a few examples of pairs of image augmentations
_ = dSprites_invariance_torchdataset.show_images(randst=SEED)

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Image_transformations_Interactive_Demo")

---
# Section 6: How to train for invariance to transformations with a target network

*Time estimate: ~40mins*

##  Video 6: Data Transformations


In [None]:
# @title Video 6: Data Transformations
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'g6IxiUXubhM'), ('Bilibili', 'BV1H64y1t7ag')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Data_Transformations_Video")

## Section 6.1: Using image transformations to learn feature invariant representations in a Self-supervised Learning (SSL) network.

We will now investigate the effects of selecting certain transformations compared to others on the invariance learned by an encoder network trained with a **specific type of SSL algorithm, namely SimCLR**. Specifically, we will observe how pre-training an encoder network with SimCLR affects the performance of a classifier trained on the representations the network has learned.

### Coding Exercise 6.1.1: Complete a SimCLR loss function

The following code:
*    Lays out the skeleton of a function `custom_simclr_contrastive_loss()` which calculates the contrastive loss for a SimCLR network,
*    Tests the custom function against the solution implementation,
*    Trains SimCLR for a few epochs.

**Exercise:**
*    Complete the `custom_simclr_contrastive_loss()` implementation,
*    Plot the loss after training SimCLR with the custom loss function for a few epochs.

**Detailed hint**:
- `custom_simclr_contrastive_loss()`:
    - Takes 2 input arguments:
        - `proj_feat1` (2D torch Tensor): Projected features for first image augmentations (batch_size x feat_size)
        - `proj_feat2` (2D torch Tensor): Projected features for second image augmentations (batch_size x feat_size)
    - Computes the `similarity_matrix` for all possible pairs of image augmentations.
    - Identifies positive and negative sample indicators for indexing the `similarity_matrix`:
        - `pos_sample_indicators` (2D torch Tensor): Tensor indicating the positions of **positive** image pairs with 1s (and 0s in all other positions). (batch_size \* 2 x batch_size \* 2)
        - `neg_sample_indicators` (2D torch Tensor): Tensor indicating the positions of **negative** image pairs with 1s (and 0s in all other positions). (batch_size \* 2 x batch_size \* 2)
    - Computes the 2 parts of the contrastive loss, retrieving the relevant values from the `similarity_matrix` using the indicators:
        - `numerator`: Calculated from the `similarity_matrix` values for positive pairs.
        - `denominator`: Calculated from the `similarity_matrix` values for negative pairs.
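
Putting the hint together, the quantity computed by the function can be written as follows. The notation is ours, chosen to mirror the code: $s_{i,k}$ is the entry of `similarity_matrix` for items $i$ and $k$, $\tau$ is `temperature`, $N$ is the batch size, and $p(i)$ is the index of the positive partner of item $i$ among the $2N$ augmented images.

$$
\ell_i = -\log \frac{\exp\left(s_{i,p(i)} / \tau\right)}{\sum_{k=1,\, k \neq i}^{2N} \exp\left(s_{i,k} / \tau\right)},
\qquad
\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{2N} \ell_i
$$

Note that with `neg_sample_indicators` defined as in the skeleton below (ones everywhere except on the diagonal), the sum in the denominator runs over every pair except an item with itself, so it includes the positive pair as well as the negative pairs.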

```python
def custom_simclr_contrastive_loss(proj_feat1, proj_feat2, temperature=0.5):
  """
  Returns contrastive loss, given sets of projected features, with positive
  pairs matched along the batch dimension.

  Args:
    Required:
      proj_feat1: 2D torch.Tensor
        Projected features for first image with augmentations (size: batch_size x feat_size)
      proj_feat2: 2D torch.Tensor
        Projected features for second image with augmentations (size: batch_size x feat_size)
    Optional:
      temperature: Float
        relaxation temperature (default: 0.5)
        l2 normalization along with temperature effectively weights different
        examples, and an appropriate temperature can help the model learn from hard negatives.
  Returns:
    loss: Float
      Mean contrastive loss
  """
  device = proj_feat1.device

  if len(proj_feat1) != len(proj_feat2):
    raise ValueError(f"Batch dimension of proj_feat1 ({len(proj_feat1)}) "
                     f"and proj_feat2 ({len(proj_feat2)}) should be same")

  batch_size = len(proj_feat1) # N
  z1 = torch.nn.functional.normalize(proj_feat1, dim=1)
  z2 = torch.nn.functional.normalize(proj_feat2, dim=1)

  proj_features = torch.cat([z1, z2], dim=0) # 2N x projected feature dimension
  similarity_matrix = torch.nn.functional.cosine_similarity(
      proj_features.unsqueeze(1), proj_features.unsqueeze(0), dim=2
      ) # dim: 2N x 2N

  # Initialize arrays to identify sets of positive and negative examples, of
  # shape (batch_size * 2, batch_size * 2), and where
  # 0 indicates that 2 images are NOT a pair (either positive or negative, depending on the indicator type)
  # 1 indicates that 2 images ARE a pair (either positive or negative, depending on the indicator type)
  pos_sample_indicators = torch.roll(torch.eye(2 * batch_size), batch_size, 1).to(device)
  neg_sample_indicators = (torch.ones(2 * batch_size) - torch.eye(2 * batch_size)).to(device)

  #################################################
  # Fill in missing code below (...),
  # then remove or comment the line below to test your function
  raise NotImplementedError("Exercise: Implement SimCLR loss.")
  #################################################
  # Implement the SimCLR loss calculation
  # Calculate the numerator of the Loss expression by selecting the appropriate elements from similarity_matrix.
  # Use the pos_sample_indicators tensor
  numerator = ...

  # Calculate the denominator of the Loss expression by selecting the appropriate elements from similarity_matrix,
  # and summing over pairs for each item.
  # Use the neg_sample_indicators tensor
  denominator = ...

  if (denominator < 1e-8).any(): # Clamp to avoid division by 0
    denominator = torch.clamp(denominator, 1e-8)

  loss = torch.mean(-torch.log(numerator / denominator))

  return loss



## Uncomment below to test your function
# test_custom_contrastive_loss_fct(custom_simclr_contrastive_loss)

```

```
custom_simclr_contrastive_loss() is correctly implemented.
```

In [None]:
# to_remove solution
def custom_simclr_contrastive_loss(proj_feat1, proj_feat2, temperature=0.5):
  """
  Returns contrastive loss, given sets of projected features, with positive
  pairs matched along the batch dimension.

  Args:
    Required:
      proj_feat1: 2D torch.Tensor
        Projected features for first image with augmentations (size: batch_size x feat_size)
      proj_feat2: 2D torch.Tensor
        Projected features for second image with augmentations (size: batch_size x feat_size)
    Optional:
      temperature: Float
        Relaxation temperature (default: 0.5)
        l2 normalization along with temperature effectively weights different
        examples, and an appropriate temperature can help the model learn from hard negatives.
  Returns:
    loss: Float
      Mean contrastive loss
  """
  device = proj_feat1.device

  if len(proj_feat1) != len(proj_feat2):
    raise ValueError(f"Batch dimension of proj_feat1 ({len(proj_feat1)}) "
                     f"and proj_feat2 ({len(proj_feat2)}) should be same")

  batch_size = len(proj_feat1) # N
  z1 = torch.nn.functional.normalize(proj_feat1, dim=1)
  z2 = torch.nn.functional.normalize(proj_feat2, dim=1)

  proj_features = torch.cat([z1, z2], dim=0) # 2N x projected feature dimension
  similarity_matrix = torch.nn.functional.cosine_similarity(
      proj_features.unsqueeze(1), proj_features.unsqueeze(0), dim=2
      ) # dim: 2N x 2N

  # Initialize arrays to identify sets of positive and negative examples, of
  # shape (batch_size * 2, batch_size * 2), and where
  # 0 indicates that 2 images are NOT a pair (either positive or negative, depending on the indicator type)
  # 1 indicates that 2 images ARE a pair (either positive or negative, depending on the indicator type)
  pos_sample_indicators = torch.roll(torch.eye(2 * batch_size), batch_size, 1).to(device)
  neg_sample_indicators = (torch.ones(2 * batch_size) - torch.eye(2 * batch_size)).to(device)

  # Implement the SimCLR loss calculation
  # Calculate the numerator of the Loss expression by selecting the appropriate elements from similarity_matrix.
  # Use the pos_sample_indicators tensor
  numerator = torch.exp(similarity_matrix / temperature)[pos_sample_indicators.bool()]

  # Calculate the denominator of the Loss expression by selecting the appropriate elements from similarity_matrix,
  # and summing over pairs for each item.
  # Use the neg_sample_indicators tensor
  denominator = torch.sum(
      torch.exp(similarity_matrix / temperature) * neg_sample_indicators,
      dim=1
      )

  if (denominator < 1e-8).any(): # Clamp to avoid division by 0
    denominator = torch.clamp(denominator, 1e-8)

  loss = torch.mean(-torch.log(numerator / denominator))

  return loss



## Test your function
test_custom_contrastive_loss_fct(custom_simclr_contrastive_loss)

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_SimCLR_loss_function_Exercise")

We can now train the SimCLR encoder with the custom contrastive loss for a few epochs.
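
Before training, it can help to see the loss behave as expected on a toy batch. The sketch below re-implements the same calculation in plain Python (no torch) for a handful of 2D feature vectors; `toy_contrastive_loss` and the example vectors are hypothetical and only meant to illustrate the calculation, not to replace `custom_simclr_contrastive_loss()`.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_contrastive_loss(feat1, feat2, temperature=0.5):
    """Mean contrastive loss; positives are matched by index across the two lists."""
    n = len(feat1)
    z = [normalize(v) for v in feat1 + feat2]   # 2N unit vectors
    # For unit vectors, cosine similarity is simply the dot product
    sim = [[sum(a * b for a, b in zip(zi, zk)) for zk in z] for zi in z]
    losses = []
    for i in range(2 * n):
        pos = (i + n) % (2 * n)                 # index of the partner view of item i
        numerator = math.exp(sim[i][pos] / temperature)
        # Sum over every other item (self-similarity is excluded)
        denominator = sum(
            math.exp(sim[i][k] / temperature) for k in range(2 * n) if k != i
        )
        losses.append(-math.log(numerator / denominator))
    return sum(losses) / len(losses)


# Two "images": the second views match their partners in the first case,
# and are swapped between images in the second case
view1 = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = toy_contrastive_loss(view1, [[1.0, 0.1], [0.1, 1.0]])
swapped_loss = toy_contrastive_loss(view1, [[0.1, 1.0], [1.0, 0.1]])
print(f"aligned: {aligned_loss:.3f}, swapped: {swapped_loss:.3f}")
```

The loss is lower when the two views of each image have similar features and differ from the views of the other image, which is the behaviour the encoder is trained towards.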

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

# Train SimCLR for a few epochs
print("Training a SimCLR encoder with the custom contrastive loss...")
num_epochs = 5
_, test_simclr_loss_array = models.train_simclr(
    encoder=models.EncoderCore(),
    dataset=dSprites_invariance_torchdataset,
    train_sampler=train_sampler,
    num_epochs=num_epochs,
    loss_fct=custom_simclr_contrastive_loss
    )

# Plot SimCLR loss over a few epochs.
fig, ax = plt.subplots()
ax.plot(test_simclr_loss_array)
ax.set_title("SimCLR network loss")
ax.set_xlabel("Epoch number")
_ = ax.set_ylabel("Training loss")


Given that self-supervised models typically require more training than supervised models, instead of fully pre-training a network here, we will load one that was **pre-trained for 60 epochs**. Again, the **encoder shares the same architecture** as the one used for the supervised, random and VAE examples above.

The following code:
*    Loads the parameters of a SimCLR network pre-trained on the SimCLR contrastive task using `load.load_encoder()`.

In [None]:
# Load SimCLR encoder pre-trained on the contrastive loss
simclr_encoder = load.load_encoder(REPO_PATH, model_type="simclr")

### Interactive Demo 6.1.1: Evaluating the classification performance of a logistic regression trained on the representations produced by a SimCLR network encoder that was pre-trained using different image transformations

As with the VAE encoder, the parameters of the pre-trained SimCLR encoder have already been trained, so they should be kept frozen while the classifier is trained, by setting `freeze_features=True`.

We train and test with `dSprites_torchdataset` instead of `dSprites_invariance_torchdataset`, as we are interested in the classifier performance on the real dSprites images, and not their augmentations.
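
To make the idea of a frozen encoder concrete, here is a small, dependency-free sketch of training a logistic regression on top of fixed features. Everything in it (the linear `encode` function, `train_probe`, the toy inputs) is hypothetical and much simpler than `models.train_classifier()`; it only illustrates that the classifier parameters are updated while the encoder weights are left untouched.

```python
import math

# Hypothetical frozen "encoder": a fixed linear map whose weights are never updated
ENCODER_WEIGHTS = [[1.0, 1.0], [1.0, -1.0]]


def encode(x):
    """Apply the frozen encoder to one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in ENCODER_WEIGHTS]


def train_probe(inputs, labels, num_epochs=100, lr=0.5):
    """Fit a logistic regression on frozen features with batch gradient descent."""
    feats = [encode(x) for x in inputs]   # features are computed once and reused
    w, b = [0.0, 0.0], 0.0                # only these classifier parameters are trained
    for _ in range(num_epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for f, y in zip(feats, labels):
            p = 1.0 / (1.0 + math.exp(-(w[0] * f[0] + w[1] * f[1] + b)))
            grad_w = [g + (p - y) * fi for g, fi in zip(grad_w, f)]
            grad_b += p - y
        w = [wi - lr * g / len(feats) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(feats)
    return w, b


def accuracy(inputs, labels, w, b):
    """Fraction of inputs whose predicted class matches the label."""
    correct = 0
    for x, y in zip(inputs, labels):
        f = encode(x)
        pred = 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0
        correct += int(pred == y)
    return correct / len(inputs)


inputs = [[1, 1], [2, 0.5], [0.5, 1.5], [-1, -1], [-2, -0.5], [-0.5, -1.5]]
labels = [1, 1, 1, 0, 0, 0]
w, b = train_probe(inputs, labels)
print(f"probe accuracy: {accuracy(inputs, labels, w, b):.2f}")
```

Because the encoder is fixed, how well such a classifier can do depends entirely on how well the frozen features separate the classes, which is what the comparison between encoders in this tutorial measures.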

**Interactive Demo:** Try different numbers of epochs for which to train the classifier. (The original setting is `num_epochs=10`.)

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

print("Training a classifier on the pre-trained SimCLR encoder representations...")
_, simclr_loss_array, _, _ = models.train_classifier(
    encoder=simclr_encoder,
    dataset=dSprites_torchdataset,
    train_sampler=train_sampler,
    test_sampler=test_sampler,
    freeze_features=True,  # Keep the encoder frozen while training the classifier
    num_epochs=10,  # DEMO: Try different numbers of epochs
    verbose=True
    )

fig, ax = plt.subplots()
ax.plot(simclr_loss_array)
ax.set_title("Loss of classifier trained on a SimCLR encoder.")
ax.set_xlabel("Epoch number")
_ = ax.set_ylabel("Training loss")

```
Network performance after 10 classifier training epochs (chance: 33.33%):
    Training accuracy: 97.83%
    Testing accuracy: 97.53%
```

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Evaluate_performance_using_pretrained_SimCLR_Interactive_Demo")

The network (using the transforms proposed above) performs at 97.53% accuracy on the test dataset, after 10 classifier training epochs.

<b>Shape classification results using different feature encoders:</b>

| _Chance_ |  | None (raw data) | Supervised | Random | VAE | SimCLR |
| - | - | --- | --- | --- | --- | --- |
| _33.33%_ |  | 39.55% | 98.70% | 44.67% | 45.75% | 97.53% |
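
As a quick way to read the table, the short sketch below expresses each test accuracy as a margin over chance. The accuracies are copied from the table above; the variable names are ours, and chance is computed assuming three equally likely shape classes.

```python
# Test accuracies (%) reported in the table above
test_accuracy = {
    "None (raw data)": 39.55,
    "Supervised": 98.70,
    "Random": 44.67,
    "VAE": 45.75,
    "SimCLR": 97.53,
}
num_classes = 3  # squares, ovals, hearts
chance = 100 / num_classes

# Percentage points above chance for each feature encoder
gain_over_chance = {name: acc - chance for name, acc in test_accuracy.items()}
for name, gain in sorted(gain_over_chance.items(), key=lambda kv: -kv[1]):
    print(f"{name:<16} {gain:+6.2f} points above chance")
```

Viewed this way, the SimCLR encoder, which never saw shape labels during pre-training, comes within about one percentage point of the fully supervised encoder.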

---
# Section 7: Ethical considerations for self-supervised learning from biased datasets

##  Video 7: Un/Self-Supervised Learning


In [None]:
# @title Video 7: Un/Self-Supervised Learning
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'NT006a6nkyg'), ('Bilibili', 'BV1mP4y1473E')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Un_self_supervised_learning_Video")

## Section 7.1: The consequences of training models on biased datasets

If a model is trained on a biased dataset, it is likely to learn a representational encoding that reproduces these biases, impairing its ability to generalize properly and increasing the likelihood that it will propagate these biases forward.

Here, we investigate the effects of training the models on a biased subset of the training dataset. Specifically, we introduce a `train_sampler_biased`, a training dataset sampler that only samples:
*    **Squares**, if they are centered on the **left-hand** side of an image **(posX: 0 to 0.3)**,
*    **Ovals**, if they are centered in the **center** of an image **(posX: 0.35 to 0.65)**,
*    **Hearts**, if they are centered on the **right-hand** side of an image **(posX: 0.7 to 1.0)**.

This sampling bias introduces a correlation between `shape` and `posX` that does not exist in the original dataset.
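
The selection rule can be sketched in a few lines of plain Python. This is only an illustration of the rule described above, applied to made-up latent values; `biased_indices` and `POSX_RANGES` are hypothetical names, and this is not how `data.train_test_split_idx()` is implemented.

```python
# posX range (inclusive) in which each shape is kept, as described above
POSX_RANGES = {
    "square": (0.0, 0.3),
    "oval": (0.35, 0.65),
    "heart": (0.7, 1.0),
}


def biased_indices(shapes, pos_x):
    """Return indices of items whose posX falls in the range allowed for their shape."""
    keep = []
    for idx, (shape, x) in enumerate(zip(shapes, pos_x)):
        low, high = POSX_RANGES[shape]
        if low <= x <= high:
            keep.append(idx)
    return keep


# Hypothetical toy latents: every shape appears at three horizontal positions
shapes = ["square", "square", "square", "oval", "oval", "oval",
          "heart", "heart", "heart"]
pos_x = [0.1, 0.5, 0.9, 0.1, 0.5, 0.9, 0.1, 0.5, 0.9]

kept = biased_indices(shapes, pos_x)
print([(shapes[i], pos_x[i]) for i in kept])
```

After filtering, knowing `posX` is enough to predict `shape` in the retained items, which is exactly the shortcut a model trained on them may learn.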

We then train each model as above on the dataset, and observe their performance when tested on an unbiased dataset.

_**Note on dataset size:** This biased sampling also significantly reduces the size of the training dataset available (by a factor of approximately 6). Thus, it would not be fair to compare our results here to those obtained previously in the tutorial, when we were using the full dataset. For this reason, **as a control, we will also separately train the models with `train_sampler_bias_ctrl`**, a training dataset sampler that does not share the sampling bias of `train_sampler_biased`, but is limited to the same number of samples as `train_sampler_biased`._
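
One simple way to build such a size-matched control, sketched here under our own assumptions, is to draw a uniformly random subset of the full training pool with as many items as the biased subset. The function `control_indices` and the index lists are hypothetical; the tutorial's own control sampler may be constructed differently.

```python
import random


def control_indices(all_indices, num_biased, seed):
    """Draw an unbiased subset with as many items as the biased subset."""
    rng = random.Random(seed)
    return sorted(rng.sample(all_indices, num_biased))


all_indices = list(range(30))   # hypothetical full training pool
biased = [0, 4, 8, 13, 22]      # hypothetical indices kept by a biased sampler
control = control_indices(all_indices, len(biased), seed=2021)
print(f"{len(biased)} biased vs. {len(control)} control samples")
```

Matching the sample counts means that any difference in performance between the two conditions can be attributed to the bias itself rather than to the amount of training data.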

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

bias_type = "shape_posX_spaced"  # Name of bias

# Initialize a biased training sampler and an unbiased test sampler
train_sampler_biased, test_sampler_for_biased = data.train_test_split_idx(
    dSprites_torchdataset,
    fraction_train=0.95,  # 95:5 Split to partially compensate for loss of training examples due to bias
    randst=SEED,
    train_bias=bias_type
    )

# Initialize a control, unbiased training sampler and an unbiased test sampler
train_sampler_bias_ctrl, test_sampler_for_bias_ctrl = data.train_test_split_idx(
    dSprites_torchdataset,
    fraction_train=0.95,
    randst=SEED,
    train_bias=bias_type,
    control=True
    )

print(f"Biased dataset: {len(train_sampler_biased)} training, "
      f"{len(test_sampler_for_biased)} test images")
print(f"Bias control dataset: {len(train_sampler_bias_ctrl)} training, "
      f"{len(test_sampler_for_bias_ctrl)} test images")

We plot some images sampled with `train_sampler_biased` to observe the pattern described above where `shape` and `posX` are now correlated.

To better visualize the bias introduced, we will plot them with annotations that show, in red:
  - The **edges** of each of the 3 `posX` sections, and
  - The **center**, i.e. `(posX, posY)`, for each shape.

In [None]:
print("Plotting first 20 images from the biased training dataset.\n")
dSprites.show_images(indices=train_sampler_biased.indices[:20], annotations="posX_quadrants")

We also plot some images sampled with `train_sampler_bias_ctrl` to verify visually that this biased pattern does not appear in the control dataset.

Again, the annotations are added, **purely for visualization purposes**.

In [None]:
print("Plotting sample images from the bias control training dataset.\n")
dSprites.show_images(indices=train_sampler_bias_ctrl.indices[:20], annotations="posX_quadrants")

In [None]:
# @markdown ### Function to run full training procedure
# @markdown (from initializing and pretraining encoders to training classifiers):

# @markdown `full_training_procedure(train_sampler, test_sampler)`

def full_training_procedure(train_sampler, test_sampler, title=None,
                            dataset_type="biased", verbose=True):
  """
  Function to load pre-trained VAE and SimCLR encoders, initialize supervised
  and random encoders, and train a classifier on each

  Args:
    train_sampler: torch.Tensor
      Training Data
    test_sampler: torch.Tensor
      Test Data
    title: String
      Title
    dataset_type: String
      Specifies if the expected model type is biased/bias-controlled
    verbose: Boolean
      If true, the shell shows all lines in the script in execution

  Returns:
    Nothing
  """
  if dataset_type not in ["biased", "bias_ctrl"]:
    raise ValueError("Expected dataset_type to be 'biased' or 'bias_ctrl', "
                     f"but found {dataset_type}.")

  supervised_encoder = models.EncoderCore()
  random_encoder = models.EncoderCore()

  # Load pre-trained VAE
  vae_encoder = load.load_encoder(
      REPO_PATH, model_type="vae", dataset_type=dataset_type,
      verbose=verbose
      )

  # Load pre-trained SimCLR encoder
  simclr_encoder = load.load_encoder(
      REPO_PATH, model_type="simclr", dataset_type=dataset_type,
      verbose=verbose
      )

  encoders = [supervised_encoder, random_encoder, vae_encoder, simclr_encoder]
  freeze_features = [False, True, True, True]
  encoder_labels = ["supervised", "random", "VAE", "SimCLR"]

  num_clf_epochs = [80, 30, 30, 30]
  print(f"\nTraining supervised encoder and classifier for {num_clf_epochs[0]} "
    f"epochs, and all other classifiers for {num_clf_epochs[1]} epochs each.")
  _ = models.train_encoder_clfs_by_fraction_labelled(
      encoders=encoders,
      dataset=dSprites_torchdataset,
      train_sampler=train_sampler,
      test_sampler=test_sampler,
      num_epochs=num_clf_epochs,
      freeze_features=freeze_features,
      subset_seed=SEED,
      encoder_labels=encoder_labels,
      title=title,
      verbose=verbose
      )

Here, we use a **biased training data sampler** (and unbiased control sampler) to observe how the different models perform. Because the dataset is much smaller, we increase the number of pre-training and training epochs for the encoders and classifiers.

Let us start with our **unbiased control sampler**, to get a sense of the classification performance levels we should expect with a dataset this size.

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

print("Training all models using the control, unbiased training dataset\n")
full_training_procedure(
    train_sampler_bias_ctrl, test_sampler_for_bias_ctrl,
    title="Classifier performances with control, unbiased training dataset",
    dataset_type="bias_ctrl"  # For loading correct pre-trained networks
    )

A similar pattern is observed here as with the full dataset, though notably most performances are a bit weaker, likely due to us (A) using a smaller training dataset, and (B) training and pre-training for fewer iterations, considering the dataset size, for time-efficiency reasons.

Using the same parameters, we now repeat the analysis with the **biased** training data sampler.

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

print("Training all models using the biased training dataset\n")
full_training_procedure(
    train_sampler_biased, test_sampler_for_biased,
    title="Classifier performances with biased training dataset",
    dataset_type="biased"  # For loading correct pre-trained networks
    )

Interestingly, the SimCLR network encoder is not only the sole network to perform well, but it also outperforms its control performance (which uses the same test dataset), at least with this particular dataset and biasing.

_**Note on performance improvement:** This improvement for the SimCLR encoder is reflected in the pre-training loss curves (not shown here), which show that the encoder trained with the biased dataset learns faster than the encoder trained with the unbiased training set. It is possible that the dataset biasing, by reducing the variability in the dataset, makes the contrastive task easier, thus enabling the network to learn a good feature space for the classification task in fewer epochs._

### Discussion 7.1.1: How do different models cope with a biased training dataset?

**A.** Which models are most and least affected by the biased training dataset?

**B.** Which types of images in the test set are most likely causing the observed drop in performance?

**C.** Why are certain models more robust to the bias introduced here than others?

**D.** What are some methods we can employ to help mitigate the negative effects of biases in our training sets on our ability to learn good data representations with our models?

In [None]:
# to_remove explanation

"""
A. The classifier trained with the supervised encoder is most affected by the
    biased training dataset i.e., drops to chance performance.
    In contrast, the classifier trained with the pre-trained SimCLR encoder
    is least affected. Classifiers trained with the random or pre-trained VAE
    encoders also drop to chance performance.

B. It is likely that the drop in performance observed with the supervised
   encoder is due to the fact that the test set image distribution is poorly
   represented by the training set image distribution. Indeed, certain
   `shape`/`posX` combinations which exist in the test set do not appear at all
   in the training set. As a result, it is likely that the classifier performs
   very poorly when classifying squares on the right or hearts on the left,
   as in the training set, all squares are on the left, and all hearts are
   on the right. In fact, it is possible that the network picked up on this
   biased relationship between `shape` and `posX` during training, and learned
   to classify shapes almost exclusively from their position in X in the image.
   Such a solution would generalize very poorly to the test set where this
   biased relationship does not exist.

C. The pre-trained SimCLR encoder is far more robust to this bias,
   as its pre-training is not limited to the training set images. Indeed, since
   it is trained on training set image augmentations, this forces the encoder
   to learn a feature space that captures a much broader distribution of images
   than exists in the training set alone. In this case, the data augmentations
   directly ensure that the SimCLR encoder is robust to the specific type of
   bias used, as they push the network to learn **representations that are
   invariant to position in x** (and y). As a result, when the classifier is
   trained on top of the pre-trained encoder with only the biased training set,
   it is more likely to learn the appropriate mapping from the feature space to
   shape, without interference from the correlated position information.

D. From these examples, we can see how self-supervised learning, and
   specifically data augmentation, has the potential to help mitigate the
   negative effects of biases that exist in our training sets. However, it is
   important to note that in this example, the dataset and classification task
   are very simple, and we actually know exactly what the bias in the
   training dataset is. This makes selecting appropriate data augmentations
   quite simple. In real-world scenarios, it is not so obvious.
   Some strategies for selecting good data augmentations in real-world
   scenarios might include:
    - identifying dimensions that a model should in theory be invariant to and
      designing augmentations tailored to promote invariance to these dimensions,
    - identifying **known sources** of bias, for example based on existing
      research in psychology and sociology, and tailoring data augmentations to
      these biases,
    - designing biased training datasets to evaluate how robust models are to
      these biases following training.

Of course, in addition to mitigation strategies, it is critically important that
we reduce biases at the source by improving data collection and dataset curation
strategies.
""";
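
The position-invariance argument in C can be illustrated with a minimal sketch of a translation augmentation. It uses plain NumPy rather than the tutorial's own augmentation pipeline, and `random_translate`, the toy image and the shift range are hypothetical names and values introduced only for illustration.

```python
import numpy as np


def random_translate(image, max_shift, rng):
    """Shift a 2D image by a random integer offset in x and y, zero-padding the edges."""
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Copy the part of the image that remains in frame after the shift
    shifted[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return shifted


rng = np.random.default_rng(0)
image = np.zeros((16, 16))
image[6:10, 2:6] = 1.0  # toy "square on the left" (hypothetical example image)

# Two independently shifted views of the same image form one positive pair
view_a = random_translate(image, max_shift=2, rng=rng)
view_b = random_translate(image, max_shift=2, rng=rng)
```

Because both views show the same shape at different positions, an encoder trained to map them to similar features is pushed to discard position, which is the property that protects the downstream classifier from the `shape`/`posX` correlation.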

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Biased_training_dataset_Discussion")

### Discussion 7.1.2: How do these principles apply more generally?

We have seen now how self-supervised learning (SSL) can improve a network's ability to learn good representations of data. For the purposes of this tutorial, we presented examples with a **simplified dataset**: the dSprites dataset, where we know:
(1) The latent dimensions for all images,
(2) The joint probability distribution across latent dimensions for the full dataset, and
(3) The precise nature of the bias introduced into our biased dataset **(see Bonus 2 for more details)**.

As a result, it is quite simple to design data augmentations that ensure that the pre-trained encoder will learn a good feature space for the downstream classification task.
<br>
In real-world applications, with more complex or difficult datasets:
**A.** What principles can we draw on to successfully apply SSL to learn good data representations in feature space?
**B.** What challenges might we face with new datasets, compared to applying SSL to dSprites?
**C.** What types of augmentations might we use when working with non-visual datasets, e.g., a speech dataset?<br>
In addition, we primarily discussed **only one type of SSL, namely SimCLR**. However, many different types of SSL exist, some of which do not use explicit data augmentations.<br>
**D.** What type of SSL task could be implemented for **sequential or time series** data? For example, you might wish to predict from electrical brain recordings what stage of sleep a person is in. How might you use the knowledge that sleep stages change slowly in time to construct a useful SSL task?

In [None]:
# to_remove explanation

"""
A.  A few examples of principles we can draw on to successfully apply
    SSL to learning good data representations include:
    - When limited labelled data is available, model pre-training
      can greatly enhance the quality of the feature space a model learns.
    - Data augmentations selection can be guided, at least in part,
      by an understanding of both:
      - the types of latent variables that underlie the data of interest
      - the types of latent variables our encoder might need to learn as
        features (e.g., shape, orientation) or become invariant to
        (e.g., color, motion, scale) in order to perform well on specific
        types of downstream tasks.

B. A few examples of challenges that we might face include:
    - Identifying useful data augmentations to learn good features for a more
      complex dataset (e.g., a biosignal dataset, a speech dataset).
    - Anticipating the types of features that might be relevant to more complex
      downstream tasks (e.g., detecting pathological biosignals, classifying
      speaker identity).
    - Allocating additional training time and resources, as SSL tasks are
      typically more computationally demanding than their supervised
      counterparts.

C. When working with a non-visual dataset, e.g., a speech dataset, different
   types of augmentations are needed from the ones used in this tutorial. The
   type of augmentations used will still depend on the downstream task of
   interest, of course. If the downstream task of interest is something like
   speech-to-text translation, it could be useful to learn representations
   that are invariant to the pitch of a voice. So, one could design an
   augmentation that randomly shifts the pitch of a speech sample. If the
   downstream task is pitch sensitive, however, like speaker identification,
   a different augmentation could be designed, like a time shift where two
   samples close in time are positive pairs for each other.

D. For sequential or time series data, one could use a very different type of
   SSL task, like a predictive task. In such a task, a network could be trained
   to predict the representation of time point t_2 from the representation of
   time point t_1. In order to successfully accomplish this task of predicting
   electrical brain activity representations sequentially, our network would
   likely learn data representations that change gradually and predictably
   in time. Since the stages of sleep also change gradually through time, this
   network would have a good chance of being successful in downstream tasks
   that rely on temporal features, like sleep staging. This type of
   predictive SSL is applied in a more sophisticated way in algorithms like
   Contrastive Predictive Coding (CPC), for example.
""";
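
The "close in time" idea from C, which also exploits the slow changes described in D, can be sketched as a way of sampling positive pairs from a recording. This is a contrastive-style illustration under stated assumptions, not CPC and not part of the tutorial's code: `sample_temporal_pairs`, the window length and the maximum gap are hypothetical, and the random signal only stands in for a real EEG channel.

```python
import numpy as np


def sample_temporal_pairs(signal, window, max_gap, n_pairs, rng):
    """Return (anchors, positives): windows whose start times differ by at most max_gap samples."""
    last_start = len(signal) - window
    anchor_starts = rng.integers(0, last_start + 1, size=n_pairs)
    offsets = rng.integers(-max_gap, max_gap + 1, size=n_pairs)
    # Keep the positive windows inside the recording
    positive_starts = np.clip(anchor_starts + offsets, 0, last_start)
    anchors = np.stack([signal[s:s + window] for s in anchor_starts])
    positives = np.stack([signal[s:s + window] for s in positive_starts])
    return anchors, positives


rng = np.random.default_rng(0)
recording = rng.standard_normal(10_000)  # stand-in for one EEG channel (assumption)
anchors, positives = sample_temporal_pairs(
    recording, window=256, max_gap=128, n_pairs=8, rng=rng
    )
```

Windows drawn from other, more distant parts of the recording would then act as negative examples. Choosing `max_gap` well below the typical duration of a sleep stage is what ties the pretext task to the slow dynamics of interest.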

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_General_Principles_Discussion")

---
# Summary

##  Video 8: Conclusion


In [None]:
# @title Video 8: Conclusion
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'tvZzYfi_bTI'), ('Bilibili', 'BV1Tq4y1X7e1')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Conclusion_Video")

---
# Bonus 1: Self-supervised networks learn representation invariance

*Time estimate: ~20mins*

##  Video 9: Invariant Representations


In [None]:
# @title Video 9: Invariant Representations
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'f8FCk519-lI'), ('Bilibili', 'BV1Ry4y1L7Hz')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Invariant_Representations_Bonus_Video")

## Bonus 1.1: The effects of using data transformations on invariance in SimCLR network representations

We now observe the effects of adding our data transformations on the invariance learned by a pre-trained SimCLR network encoder.

### Bonus Interactive Demo 1.1.1: Visualizing the SimCLR network encoder RSMs, organized along different latent dimensions

We will now compare the pre-trained SimCLR encoder network RSM to the previously generated encoder RSMs.

Again, we pass `dSprites_torchdataset` instead of `dSprites_invariance_torchdataset`, as we are interested in the RSMs for the real dSprites images, and not their augmentations.

**Interactive Demo:** Visualize the RSMs, organized along different latent dimensions (`"scale"`, `"orientation"`, `"posX"` or `"posY"`), and compare the patterns observed for the different encoder networks. (The original setting is `sorting_latent="shape"`.)
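
As a reminder of what an RSM sorted along a latent dimension contains, here is a minimal NumPy sketch. It assumes cosine similarity between feature vectors; `cosine_rsm` is a hypothetical helper, and `models.plot_model_RSMs()` may compute or order its matrices differently.

```python
import numpy as np


def cosine_rsm(features, sort_values):
    """Cosine-similarity matrix between feature vectors, with rows/columns ordered by sort_values."""
    order = np.argsort(sort_values, kind="stable")
    feats = features[order]
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)  # guard against zero-length vectors
    return unit @ unit.T


rng = np.random.default_rng(0)
toy_features = rng.standard_normal((6, 4))   # 6 images, 4 features each (toy data)
toy_shape_labels = np.array([2, 0, 1, 0, 2, 1])  # hypothetical latent values
rsm = cosine_rsm(toy_features, toy_shape_labels)
```

If an encoder represents the sorting latent strongly, images sharing a latent value end up adjacent and similar, so the matrix shows bright blocks along the diagonal. If the encoder is invariant to that latent, sorting by it reveals no block structure.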


In [None]:
sorting_latent = "shape"  # DEMO: Try sorting by different latent dimensions
print("Plotting RSMs...")
_ = models.plot_model_RSMs(
    encoders=[supervised_encoder, vae_encoder, simclr_encoder],
    dataset=dSprites_torchdataset,
    sampler=test_sampler,  # To see the representations on the held out test set
    titles=["Supervised network encoder RSM", "VAE network encoder RSM",
            "SimCLR network encoder RSM"],  # Plot titles
    sorting_latent=sorting_latent
    )

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_SimCLR_network_encoder_RSMs_Bonus_Interactive_Demo")

### Bonus Discussion 1.1.1: What can we conclude about the ability of contrastive models like SimCLR to construct a meaningful representation space?

**A.** How do the pre-trained SimCLR encoder RSMs (sorted along different latent dimensions) compare to the supervised and pre-trained VAE encoder network RSMs?
**B.**  What explains these different RSMs?
**C.**  What advantages might some encoders have over others?
**D.**  Does a good performance by the SimCLR encoder on a contrastive task guarantee good performance on a downstream classification task?
**E.**  How might one modify the SimCLR encoder pre-training, for example, if the downstream task were to predict `orientation` instead of `shape`?

 #### Supporting images for Discussion response examples for Bonus 1.1.1: All SimCLR encoder RSMs


In [None]:
# @markdown #### Supporting images for Discussion response examples for Bonus 1.1.1: All SimCLR encoder RSMs
Image(filename=os.path.join(REPO_PATH, "images", "rsms_simclr_encoder_60ep_bs1000_deg90_trans0-2_scale0-8to1-2_seed2021.png"), width=1000)

In [None]:
# to_remove explanation

"""
A. The RSMs for the pre-trained SimCLR encoder show that it
   **encodes `shape` almost as strongly** as the supervised encoder.
   Unlike the supervised encoder, it also appears to encode `scale` quite strongly.
   In addition, unlike the pre-trained VAE encoder, it does not appear to
   encode position strongly.

B. The RSM structures observed for the pre-trained SimCLR encoder strongly
   reflect the transformations selected as augmentations during training.
   Indeed, these augmentations are closely related to several of the original
   latent dimensions used to sort the RSMs:
    - `scale` for `scale`,
    - `degrees` for `orientation`, and
    - `translation` for `posX` and `posY`

   This likely explains why almost no structure is visible in the RSMs sorted by
   `orientation`, `posX` and `posY`: **the encoder was specifically trained to
   ignore these features**, i.e. to learn a feature space that is invariant to
   differences along these dimensions. As a result of this training,
   the only original latent dimension left for the encoder to encode in
   feature space was `shape`.
   Interestingly, although the SimCLR encoder was also trained to ignore `scale`,
   it converged on a solution to the contrastive task that still encodes
   `scale` to some extent.

C. The supervised encoder is likely the best encoder for tasks that rely on
   distinguishing shapes similar to those in the dataset. The SimCLR encoder is
   likely useful for tasks that rely on differentiating shape and/or scale.
   Lastly, the VAE encoder is likely the most useful of the three for position
   decoding, as well as image reconstruction, when paired with the VAE decoder
   it was trained with.

D. Good performance by a SimCLR encoder on a contrastive task does **not** guarantee
   that the encoder will perform well on a downstream classification task.
   The performance of the encoder on the contrastive task will likely only reflect
   future classification performance if the contrastive task has been designed
   in a way that **specifically promotes learning of a feature space that is
   relevant to that downstream classification task**. For example, here the
   contrastive task drove the encoder towards becoming invariant to all latent dimensions
   other than shape, which is intuitively exactly what is needed for a shape
   classification task.
   If other, less relevant augmentations had been selected instead
   (e.g., adding random noise or inverting the colors), an encoder with the
   same amount of pre-training might have performed very poorly on the
   downstream shape classification task.

E. If we wanted to pre-train a SimCLR encoder to decode `orientation` instead of `shape`,
   we would likely **remove the `degrees` augmentation**, as it drives the encoder
   to become invariant to the different orientations of the shapes.
   To support the network's ability to generalize to new shapes, we might want to
   push the encoder to be more invariant to the `shape` dimension. To do this, we
   might use some sort of **filter augmentation**, like a Gaussian, that slightly
   distorts shape edges. However, such a shape distortion augmentation would probably
   have to be applied **carefully** in order to avoid transforming all shapes into amorphous
   blobs with no discernible orientation or making shapes appear like
   **different shapes with totally different orientations.**
   Indeed, since **orientation is at least partially determined from shape**, a
   feature space that is good for predicting orientation will likely not be
   entirely invariant to shape.
""";

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Contrastive_models_Bonus_Discussion")

---
# Bonus 2: Avoiding representational collapse


##  Video 10: Avoiding Representational Collapse


In [None]:
# @title Video 10: Avoiding Representational Collapse
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'fS2BAKVdpIY'), ('Bilibili', 'BV1Gv411E7xe')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Avoiding_Representational_Collapse_Bonus_Video")

## Bonus 2.1: The effects of reducing the number of negative examples used in the SimCLR contrastive loss

As seen above in the contrastive loss implementation, a strategy used to train neural networks with contrastive losses is to use large batch sizes (here, we used 1,000 examples per batch), and to use the representations of different images in a batch as **each other's negative examples**. So with a batch size of 1,000, each image has one positive paired image (its paired augmentation), and 999 negative paired images (every image but itself, including its own paired augmentation, again). This enables the contrastive loss to obtain a good estimate of the full representational similarity distribution.

To observe the consequences of sampling using fewer negative examples in the contrastive loss, we use a pre-trained SimCLR network again. However, this one was pre-trained with a parameter called `neg_pairs` set to `2`. Under the hood, this parameter affects only the contrastive loss calculation, allowing it to use **only 2 of the total available negative pairs in a batch, for each image.**

The following code:
*    Loads the parameters of a SimCLR network pre-trained on the SimCLR contrastive task, but with only 2 negative pairs used per image in the loss calculation, using `load.load_encoder()`,
*    Plots the RSMs of a few network encoders for comparison.
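
To make the effect of limiting negative pairs concrete, here is a simplified InfoNCE-style loss in NumPy. It is a sketch under stated assumptions and not the tutorial's contrastive loss: it only uses cross-view negatives, the temperature value is arbitrary, and the `neg_pairs` argument merely mimics the role described above.

```python
import numpy as np


def contrastive_loss(z_a, z_b, temperature=0.5, neg_pairs=None, rng=None):
    """Simplified InfoNCE loss; z_a[i] and z_b[i] are the two views of image i."""
    if rng is None:
        rng = np.random.default_rng()
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sim = (z_a @ z_b.T) / temperature  # sim[i, j]: view a of image i vs. view b of image j
    n = sim.shape[0]
    losses = []
    for i in range(n):
        neg_idx = np.delete(np.arange(n), i)  # all other images in the batch
        if neg_pairs is not None:
            # Keep only a random subset of the available negatives
            neg_idx = rng.choice(neg_idx, size=neg_pairs, replace=False)
        logits = np.concatenate(([sim[i, i]], sim[i, neg_idx]))
        # Cross-entropy with the positive at index 0, using a stable log-sum-exp
        log_denominator = logits.max() + np.log(np.exp(logits - logits.max()).sum())
        losses.append(log_denominator - logits[0])
    return float(np.mean(losses))


rng = np.random.default_rng(0)
z = rng.standard_normal((16, 8))
z_aug = z + 0.1 * rng.standard_normal((16, 8))  # toy "augmented" features
loss_all_negatives = contrastive_loss(z, z_aug)
loss_two_negatives = contrastive_loss(z, z_aug, neg_pairs=2, rng=rng)
```

With only 2 negatives per image, the denominator contains fewer terms, so the same representations yield a lower loss. The task becomes easier to satisfy, which weakens the pressure to spread unrelated images apart in feature space.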

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

# Load SimCLR encoder pre-trained on the contrastive loss
simclr_encoder_neg_pairs = load.load_encoder(
    REPO_PATH, model_type="simclr", neg_pairs=2
    )

### Bonus Coding Exercise 2.1.1: Visualizing the network encoder RSMs, organized along different latent dimensions, and plotting similarity histograms

We will now compare the RSM for the pre-trained SimCLR encoder trained with **only 2 negative pairs** to the normal pre-trained SimCLR network encoder and the random network encoder. To help us compare the representations learned by the normal and modified SimCLR encoders, we will plot a histogram of the values that make up both RSMs.
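
For intuition about what such a histogram summarizes, the sketch below computes a density histogram over the off-diagonal values of a similarity matrix. `rsm_value_density` is a hypothetical helper written for this illustration. Whether the tutorial's `plot_rsm_histogram()` excludes the diagonal or uses the same value range is an assumption.

```python
import numpy as np


def rsm_value_density(rsm, nbins=100):
    """Density histogram of the upper-triangle (off-diagonal) values of an RSM."""
    upper = rsm[np.triu_indices_from(rsm, k=1)]  # each image pair counted once
    density, edges = np.histogram(upper, bins=nbins, range=(-1.0, 1.0), density=True)
    return density, edges


rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 10))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
toy_rsm = np.clip(feats @ feats.T, -1.0, 1.0)  # toy cosine-similarity RSM
density, edges = rsm_value_density(toy_rsm)
```

A partially collapsed feature space would show up here as density piling up near a similarity of 1, since many distinct images would be encoded almost identically.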

**Exercise:**
*    Visualize the RSMs, organized along the `shape` latent dimension, and compare the patterns observed for the different encoder networks.
*    Plot a histogram of RSM values for the normal and 2-neg-pair SimCLR network encoders.

**Hint**:
*    `models.plot_model_RSMs()` returns the **data matrices** calculated for each encoder's RSM, in order.

```python
def rsms_and_histogram_plot():
  """
  Function to plot Representational Similarity Matrices (RSMs) and Histograms

  Args:
    None

  Returns:
    Nothing
  """
  sorting_latent = "shape" # Exercise: Try sorting by different latent dimensions
  # EXERCISE: Visualize RSMs for the normal SimCLR, 2-neg-pair SimCLR and random network encoders.
  print("Plotting RSMs...")
  simclr_rsm, simclr_neg_pairs_rsm, random_rsm = models.plot_model_RSMs(
      encoders=[simclr_encoder, simclr_encoder_neg_pairs, random_encoder],
      dataset=dSprites_torchdataset,
      sampler=test_sampler, # To see the representations on the held out test set
      titles=["SimCLR network encoder RSM",
              f"SimCLR network encoder RSM\n(2 negative pairs per image used in loss calc.)",
              "Random network encoder RSM"], # Plot titles
      sorting_latent=sorting_latent
      )
  #################################################
  # Fill in missing code below (...),
  # then remove or comment the line below to test your implementation
  raise NotImplementedError("Exercise: Plot histogram.")
  #################################################
  # Plot a histogram of RSM values for both SimCLR encoders.
  plot_rsm_histogram(
      [..., ...],
      colors=[...],
      labels=[..., ...],
      nbins=100
      )


## Uncomment below to test your code
# rsms_and_histogram_plot()

```

In [None]:
# to_remove solution
def rsms_and_histogram_plot():
  """
  Function to plot Representational Similarity Matrices (RSMs) and Histograms

  Args:
    None

  Returns:
    Nothing
  """
  sorting_latent = "shape" # Exercise: Try sorting by different latent dimensions
  # EXERCISE: Visualize RSMs for the normal SimCLR, 2-neg-pair SimCLR and random network encoders.
  print("Plotting RSMs...")
  simclr_rsm, simclr_neg_pairs_rsm, random_rsm = models.plot_model_RSMs(
      encoders=[simclr_encoder, simclr_encoder_neg_pairs, random_encoder],
      dataset=dSprites_torchdataset,
      sampler=test_sampler, # To see the representations on the held out test set
      titles=["SimCLR network encoder RSM",
              f"SimCLR network encoder RSM\n(2 negative pairs per image used in loss calc.)",
              "Random network encoder RSM"], # Plot titles
      sorting_latent=sorting_latent
      )

  # Plot a histogram of RSM values for both SimCLR encoders.
  plot_rsm_histogram(
      [simclr_neg_pairs_rsm, simclr_rsm],
      colors=["royalblue", "gray"],
      labels=["few neg. pairs SimCLR RSM", "normal SimCLR RSM"],
      nbins=100
      )

## Uncomment below to test your code
with plt.xkcd():
  rsms_and_histogram_plot()

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Visualizing_the_network_encoder_RSMs_Bonus_Exercise")

### Bonus Interactive Demo 2.1.1: Evaluating the classification performance of a logistic regression trained on the representations produced by a SimCLR network encoder pre-trained with only a few negative pairs

For the 2-neg-pair SimCLR encoder, as the encoder parameters have already been trained, they should again be kept frozen while the classifier is trained by setting `freeze_features=True`.

_**Interactive Demo:** Try different numbers of epochs for which to train the classifier. (The original setting is `num_epochs=50`.)_

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)
print("Training a classifier on the representations learned by the SimCLR "
      "network encoder pre-trained\nusing only 2 negative pairs per image "
      "for the loss calculation...")
_, simclr_neg_pairs_loss_array, _, _ = models.train_classifier(
    encoder=simclr_encoder_neg_pairs,
    dataset=dSprites_torchdataset,
    train_sampler=train_sampler,
    test_sampler=test_sampler,
    freeze_features=True,  # Keep the encoder frozen while training the classifier
    num_epochs=50,  # DEMO: Try different numbers of epochs
    verbose=True
    )

# Plot the loss array
fig, ax = plt.subplots()
ax.plot(simclr_neg_pairs_loss_array)
ax.set_title("Loss of classifier trained on a SimCLR encoder\n"
             "trained with 2 negative pairs only.")
ax.set_xlabel("Epoch number")
_ = ax.set_ylabel("Training loss")

### Bonus Discussion 2.1.1: What can we conclude about the importance of negative pairs in computing the contrastive loss for models like SimCLR?

**A.**  How does changing the number of negative pairs affect the networks' RSMs?
**B.**  How is the shape classifier likely to perform when the encoder is pre-trained with very few negative pairs?
**C.**  What, intuitively, is the role of negative pairs in shaping the feature space that a contrastive model learns, and how does this role relate to the role of positive pairs?

 #### Supporting images for Discussion response examples for Bonus 2.1.1: All SimCLR encoder (2 neg. pairs) RSMs


In [None]:
# @markdown #### Supporting images for Discussion response examples for Bonus 2.1.1: All SimCLR encoder (2 neg. pairs) RSMs
Image(filename=os.path.join(REPO_PATH, "images", "rsms_simclr_encoder_2neg_60ep_bs1000_deg90_trans0-2_scale0-8to1-2_seed2021.png"), width=1000)

In [None]:
# to_remove explanation

"""
A. As seen in the much yellower RSMs of the SimCLR encoder trained with only
   2 negative pairs used in the contrastive loss, reducing the number of
   negative pairs leads to a **substantial increase in the density of high
   feature similarity values** (near 1). This is quantified in the histogram,
   which shows that the probability density of similarity values above 0.5
   increases considerably, with the density of values of almost 1 increasing
   3-4x (from 1 to 3.5). If we look at the shape RSM, we see that this results
   in a loss of the **distinction between squares and ovals** in feature space.
   Interestingly, given that a few negative pairs are still included in the
   contrastive loss, the observed increase in high similarity values is
   counterbalanced by a concurrent increase in strongly negative similarity
   values (near -1).

B. The shape classifier is likely to still classify hearts reasonably well,
   but to do a poor job of distinguishing ovals from squares.

C. Negative pairs are used, in contrastive models like SimCLR, as a
   **counterweight to positive pairs**. Indeed, if only positive pairs
   were used to train a contrastive model, the network could settle on a
   trivial solution: a **collapsed feature space where all or most images
   are encoded identically**. Such a feature space would be entirely useless,
   as it would not preserve any information about the input data. To prevent
   this, it is important to ensure that while the network updates its weights
   to encode positive pairs similarly, **it still generally encodes other,
   randomly selected pairs of images distinctly**. The negative pairs are
   therefore used to obtain an estimate of how distinctly the network would
   encode random pairs of images. If this sample is too small
   (as with our SimCLR encoder trained with a loss calculated
   from only 2 negative pairs),
   the estimate will likely not be representative at all. In this case,
   although the network may still encode a few pairs as highly dissimilar,
   it still runs the risk of learning a partially collapsed feature space,
   as observed above. Our normal SimCLR encoder is trained with a loss calculated
   from far more (999) negative pairs, and this enables it to learn a much more
   distributed and meaningful feature space.
""";

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Negative_pairs_in_computing_the_contrastive_loss_Bonus_Discussion")

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_SimCLR_network_encoder_pretrained_with_only_a_few_negative_pairs_Bonus_Interactive_Demo")

After reducing the number of negative pairs used per image in SimCLR pre-training to 2, classification accuracy on the test dataset drops to 66.75%, even after 50 classifier training epochs.

<b>Shape classification results using different feature encoders:</b>

| _Chance_ |  | None (raw data) | Supervised | Random | VAE | SimCLR | SimCLR (few neg.pairs) |
| - | - | --- | --- | --- | --- | --- | --- |
| _33.33%_ |  | 39.55% | 98.70% | 44.67% | 45.75% | 97.53% | 66.75% |

---
# Bonus 3: Good representations enable few-shot learning


##  Video 11: Few-shot Supervised Learning


In [None]:
# @title Video 11: Few-shot Supervised Learning
from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents


video_ids = [('Youtube', 'okrvQDeN2cc'), ('Bilibili', 'BV1BP4y147fs')]
tab_contents = display_videos(video_ids, W=730, H=410)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

##  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_FewShot_Supervised_learning_Bonus_Video")

## Bonus 3.1: The benefits of pre-training an encoder network in a few-shot learning scenario, i.e., when only a few labelled examples are available

The toy dataset we have been using, **dSprites**, is thoroughly labelled along 5 different dimensions. However, this is not the case for many datasets. Some very large datasets may have few if any labels.

One of our last steps is to examine how each of our models performs when only a few labelled images are available for training. In this scenario, we will train classifiers on different fractions of the training data (between 0.01 and 1.0), and see how they perform on the test set.

For the different types of encoder, this means:
*    **Supervised encoder:** As the supervised encoder can only be trained with labels, we will start from random encoders and train them end-to-end on the classification task with the fraction of labelled images allowed.
_**Note on * symbol:** Given that this network is trained end-to-end, we will train it for more epochs, and mark it with "\*" in the graphs._
*    **Random encoder:** By definition, the random encoder is untrained.
*    **VAE encoder**: As a generative model can be pre-trained on unlabelled data, we will use the VAE encoder pre-trained on the reconstruction task using the full dataset, before training the classifier layer with the fraction of labelled images allowed.
*    **SimCLR encoder**: As an SSL model can be pre-trained on unlabelled data, we will use the SimCLR encoder pre-trained on the contrastive task using the full dataset, before training the classifier layer with the fraction of labelled images allowed.
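To make "the fraction of labelled images allowed" concrete, here is a minimal sketch of drawing a reproducible random subset of training indices. The helper `sample_labelled_subset` and the training set size are hypothetical and only illustrate the idea; the tutorial's own subsetting is handled inside `models.train_encoder_clfs_by_fraction_labelled` (via `subset_seed`) and may differ.

```python
import random


def sample_labelled_subset(train_indices, labelled_fraction, seed=0):
    """Return a reproducible random subset of the training indices.

    Hypothetical helper for illustration only; it is not part of the
    tutorial's `models` module.
    """
    if not 0.0 < labelled_fraction <= 1.0:
        raise ValueError("labelled_fraction must be in (0, 1]")
    train_indices = list(train_indices)
    # Keep at least one example, even for very small fractions
    n_keep = max(1, round(len(train_indices) * labelled_fraction))
    # Local generator, so the global random state is left untouched
    rng = random.Random(seed)
    return sorted(rng.sample(train_indices, n_keep))


train_indices = range(10_000)  # Hypothetical training set size
subset = sample_labelled_subset(train_indices, labelled_fraction=0.01, seed=2021)
print(f"{len(subset)} labelled examples kept out of {len(train_indices)}")
```

Using a fixed seed means every encoder is compared on the same labelled subset, so differences in performance reflect the encoders rather than the draw.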

_**Note on number of training epochs:** The numbers of epochs are specified below for when the **full training dataset** is used. For each fraction of the dataset a classifier is trained on, the **number of training epochs is scaled up** to compensate for the drop in number of training examples. For example, if we specify 10 epochs for a model, the classifier trained on a 0.1 fraction of the labelled data will be trained over ~30 epochs. Also, in the interest of time, we use **slightly fewer epochs** here than above._
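The exact scaling rule is not spelled out in the note above. The example given (10 epochs becoming ~30 at a 0.1 fraction) is consistent with scaling by the inverse square root of the fraction, so the sketch below assumes that rule. The function `scale_num_epochs` is hypothetical, and the tutorial's actual implementation may use a different formula.

```python
import math


def scale_num_epochs(num_epochs, labelled_fraction):
    """Scale the epoch count up as the labelled fraction shrinks.

    ASSUMPTION: inverse-square-root scaling, chosen because it roughly
    reproduces the "10 epochs -> ~30 epochs at 0.1" example in the text.
    """
    if not 0.0 < labelled_fraction <= 1.0:
        raise ValueError("labelled_fraction must be in (0, 1]")
    return math.ceil(num_epochs / math.sqrt(labelled_fraction))


for fraction in [1.0, 0.25, 0.1]:
    print(f"fraction={fraction}: {scale_num_epochs(10, fraction)} epochs")
```

Under this assumed rule, the total number of training examples seen still shrinks as the fraction decreases, just more slowly than it would with a fixed number of epochs.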

### Bonus Interactive Demo 3.1.1: Training classifiers on different encoders, using only a fraction of the full labelled dataset

In this demo, we select a few fractions (4 to 6) of the full labelled dataset with which to train the classifiers.

_**Interactive Demo:** Set the `labelled_fractions` argument to a list of fractions (4 to 6 values between 0.01 and 1.0) with which to train classifiers for each encoder._
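One way to choose the fractions is to space them evenly on a log scale, so that the small-data regime is sampled as densely as the large-data regime. The helper `log_spaced_fractions` below is a hypothetical convenience function, not part of the tutorial code; any 4 to 6 values between 0.01 and 1.0 will work for the demo.

```python
import math


def log_spaced_fractions(n_values, min_fraction=0.01, max_fraction=1.0):
    """Return n_values fractions evenly spaced on a log scale, endpoints included.

    Hypothetical helper for illustration only.
    """
    if n_values < 2:
        raise ValueError("n_values must be at least 2")
    if not 0.0 < min_fraction < max_fraction <= 1.0:
        raise ValueError("need 0 < min_fraction < max_fraction <= 1")
    log_min = math.log10(min_fraction)
    log_max = math.log10(max_fraction)
    step = (log_max - log_min) / (n_values - 1)
    # Round to 3 decimals to keep the values readable in plots and logs
    return [round(10 ** (log_min + i * step), 3) for i in range(n_values)]


labelled_fractions = log_spaced_fractions(5)
print(labelled_fractions)
```

The resulting list can be passed directly as the `labelled_fractions` argument in the cell below.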

In [None]:
# Call this before any dataset/network initializing or training,
# to ensure reproducibility
set_seed(SEED)

new_supervised_encoder = models.EncoderCore()  # New, random supervised encoder

_ = models.train_encoder_clfs_by_fraction_labelled(
    encoders=[new_supervised_encoder, random_encoder, vae_encoder, simclr_encoder],
    dataset=dSprites_torchdataset,
    train_sampler=train_sampler,
    test_sampler=test_sampler,
    labelled_fractions=[0.01],  # DEMO: select 4-6 fractions to run
    num_epochs=[20, 8, 8, 8],  # Train the supervised network (end-to-end) for more epochs
    freeze_features=[False, True, True, True],  # Only train new supervised network end-to-end
    subset_seed=SEED,
    encoder_labels=["supervised", "random", "VAE", "SimCLR"],
    title="Performance of classifiers trained\nwith different network encoders",
    verbose=True
    )

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Use_a_fraction_of_the_labelled_dataset_Bonus_Interactive_Demo")

### Bonus Discussion 3.1.1: What can we conclude about the advantages and disadvantages of the different encoder network types under different conditions?

**A.** Which models are most and least affected by how much labelled data is available?
**B.** What might explain why different models are affected differently?

#### Supporting images for Discussion response examples for Bonus 3.1.1: Classifier performances for various fractions of labelled data


In [None]:
# @markdown #### Supporting images for Discussion response examples for Bonus 3.1.1: Classifier performances for various fractions of labelled data
Image(filename=os.path.join(REPO_PATH, "images", "labelled_fractions.png"), width=600)

In [None]:
# to_remove explanation

"""
A. The classifiers trained on top of the random and pre-trained VAE encoders
   perform poorly (below 50%) on the shape classification task, regardless of
   how much of the labelled data is available to train the classifier.
   In contrast, the classifier trained on top of the pre-trained
   **SimCLR encoder maintains a performance above 90%** even when it is
   trained with only 5% of the total available labelled data. The classifier
   trained along with the **supervised encoder is most heavily affected**, with
   its performance dropping to about 45% when it is trained with only 5% of the
   total available data.

B. Since the **supervised encoder** cannot be trained with unlabelled data,
   any reduction in the fraction of labelled data available effectively means a
   reduction in number of training examples. If the number of training examples
   available is quite small, the encoder may not be able to learn generalizable
   features. In contrast, since the task used to train the **SimCLR encoder does
   not require any labels**, the encoder can first be pre-trained on the full dataset
   to learn generalizable features that are relevant to the downstream classification,
   as was done in the previous sections. If this is successful, the classifier layer
   training becomes relatively simple. Indeed, if the encoder already broadly encodes
   shapes, the classifier layer needs only to learn to map shape-like features
   to the correct shape. As the results here show, such a simple task likely
   requires far fewer training examples than training a full encoder network
   and classifier from scratch. Importantly, this is not a trivial finding,
   as shown by the overall poor shape classification performance of the
   classifiers trained on the **random** and **pre-trained VAE** encoders.
   The fact that increasing the number of labelled examples available barely
   impacts their performance suggests that their features, in contrast to the
   SimCLR encoder's features, are not very informative at all about shape.
""";

####  Submit your feedback


In [None]:
# @title Submit your feedback
content_review(f"{feedback_prefix}_Advantages_and_disadvantages_of_encoders_Bonus_Discussion")