diff --git a/docs/finn/internals.rst b/docs/finn/internals.rst
index d0c4cd20650a7cb1ef63f68ff559bebbba93ae05..652c94ac248437bdf83c0c3047f6cbd2d3b85651 100644
--- a/docs/finn/internals.rst
+++ b/docs/finn/internals.rst
@@ -206,6 +206,64 @@ How to set *mem_mode*
 ---------------------
 When the nodes in the network are converted to HLS layers, the *mem_mode* can be passed. More detailed information about the transformations that prepare the network and the transformation that performs the conversion to HLS layers can be found in chapter :ref:`nw_prep`. The *mem_mode* is passed as argument. Note that if no argument is passed, the default is *const*.
+
+.. _folding_factors:
+
+Constraints to folding factors per layer
+=========================================
+
+.. list-table:: Folding factor constraints
+
+   * - **Layers**
+     - **Parameters**
+     - **Constraints**
+   * - Addstreams_Batch
+     - PE
+     - inp_channels % PE == 0
+   * - ChannelwiseOp_Batch
+     - PE
+     - channels % PE == 0
+   * - ConvolutionInputGenerator
+     - SIMD
+     - inp_channels % SIMD == 0
+   * - ConvolutionInputGenerator1d
+     - SIMD
+     - inp_channels % SIMD == 0
+   * - Downsampler
+     - SIMD
+     - inp_channels % SIMD == 0
+   * - DuplicateStreams_Batch
+     - PE
+     - channels % PE == 0
+   * - Eltwise
+     - PE
+     - inp_channels % PE == 0
+   * - FMPadding_batch
+     - SIMD
+     - inp_channels % SIMD == 0
+   * - FMPadding_rtl
+     - SIMD
+     - inp_channels % SIMD == 0
+   * - Globalaccpool_Batch
+     - PE
+     - channels % PE == 0
+   * - Labelselect_Batch
+     - PE
+     - num_labels % PE == 0
+   * - MatrixVectorActivation
+     - PE & SIMD
+     - MH % PE == 0 & MW % SIMD == 0
+   * - Pool_Batch
+     - PE
+     - inp_channels % PE == 0
+   * - Thresholding_Batch
+     - PE
+     - MH % PE == 0
+   * - VectorVectorActivation
+     - PE & SIMD
+     - k_h * k_w % SIMD == 0 & channels % PE == 0
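+
+These constraints can also be checked programmatically before setting the attributes. A minimal sketch (assuming ``node`` is a ``MatrixVectorActivation`` node of a loaded model, and ``pe``/``simd`` are the values you intend to set):
+
+.. code-block:: python
+
+    from qonnx.custom_op.registry import getCustomOp
+
+    inst = getCustomOp(node)  # node: a MatrixVectorActivation ONNX node
+    pe = inst.get_nodeattr("PE")
+    simd = inst.get_nodeattr("SIMD")
+    # the table above requires MH % PE == 0 and MW % SIMD == 0
+    assert inst.get_nodeattr("MH") % pe == 0
+    assert inst.get_nodeattr("MW") % simd == 0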
\n", + "\n", + "Please be aware that the folding factors can not be selected arbitrarily, each layer has constraints on which values the parallelization parameters can be set to, for more information see here: https://finn-dev.readthedocs.io/en/latest/internals.html#constraints-to-folding-factors-per-layer\n", + "\n", + "We'll use the utility function `showInNetron()` to visualize and interact with our network in the Jupyter Notebook and `showSrc()` to show source code of FINN library calls." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from finn.util.visualization import showInNetron, showSrc" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note: The build_flow in the cybsec_mlp notebook comprises a transformation step `step_target_fps_parallelization` that automatically sets custom parallelization parameters needed to achieve a given `target_fps` by invoking the [`SetFolding` transformation](https://github.com/Xilinx/finn/blob/main/src/finn/transformation/fpgadataflow/set_folding.py#L46).\n", + "\n", + "More details of the above step can be found [here](https://github.com/Xilinx/finn/blob/main/src/finn/builder/build_dataflow_steps.py#L394)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook shows the manual version of this step and explains how these attributes can improve performance and what are their effects on resource utilization for developers who need to maximize the performance of their network. \n", + "\n", + "For that we will use the `cybsec_PE_SIMD.onnx` file as starting point. This intermediate model from the cybersecurity example is the model representation after the high-level ONNX layers are converted to HLS layers. Each node in the graph now corresponds to an HLS C++ function call and the parallelization parameters can be set using the node attributes.\n", + "\n", + "We will take this model to show how to set the folding factors manually and analyze the estimated execution clock cycles and the resource utilization of each layer in the network." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### FINN-style Dataflow Architectures <a id='dataflow_arch'></a>\n", + "\n", + "We start with a quick recap of FINN-style dataflow architectures. The key idea in such architectures is to parallelize across layers as well as within layers by dedicating a proportionate amount of compute resources to each layer, as illustrated in the figure below.\n", + "\n", + "\n", + "\n", + "In practice, the layers are instantiated by function calls to optimized Vitis HLS building blocks from the [finn-hlslib](https://github.com/Xilinx/finn-hlslib) library.\n", + "\n", + "Since each layer will be instantiated, we can flexibly set the parallelization of each layer and thus control resources and throughput of our network, as visualized in the image below:\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part-1 : Loading the ONNX model.\n", + "\n", + "As discussed above, the network needs to go through a few preparation steps before it can be fed into our estimation functions.\n", + "\n", + "The `.onnx` file loaded here is taken from the cybersecurity end2end example notebook. \n", + "We pick the onnx file `cybsec_PE_SIMD.onnx` to which the necessary transformations have been applied for this notebook. This means, network layers mapped to necessary FINN-HLS blocks. 
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook shows the manual version of this step and explains how these attributes can improve performance and what their effects on resource utilization are, for developers who need to maximize the performance of their network.\n",
+    "\n",
+    "For that we will use the `cybsec_PE_SIMD.onnx` file as the starting point. This intermediate model from the cybersecurity example is the model representation after the high-level ONNX layers have been converted to HLS layers. Each node in the graph now corresponds to an HLS C++ function call and the parallelization parameters can be set using the node attributes.\n",
+    "\n",
+    "We will take this model to show how to set the folding factors manually and analyze the estimated execution clock cycles and the resource utilization of each layer in the network."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### FINN-style Dataflow Architectures <a id='dataflow_arch'></a>\n",
+    "\n",
+    "We start with a quick recap of FINN-style dataflow architectures. The key idea in such architectures is to parallelize across layers as well as within layers by dedicating a proportionate amount of compute resources to each layer, as illustrated in the figure below.\n",
+    "\n",
+    "![finn-dataflow](finn-dataflow.png)\n",
+    "\n",
+    "In practice, the layers are instantiated by function calls to optimized Vitis HLS building blocks from the [finn-hlslib](https://github.com/Xilinx/finn-hlslib) library.\n",
+    "\n",
+    "Since each layer will be instantiated, we can flexibly set the parallelization of each layer and thus control resources and throughput of our network, as visualized in the image below:\n",
+    "\n",
+    "![finn-folding](finn-folding.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Part 1 : Loading the ONNX model\n",
+    "\n",
+    "As discussed above, the network needs to go through a few preparation steps before it can be fed into our estimation functions.\n",
+    "\n",
+    "The `.onnx` file loaded here is taken from the cybersecurity end2end example notebook.\n",
+    "We pick the onnx file `cybsec_PE_SIMD.onnx`, to which the necessary transformations have been applied for this notebook. This means the network layers have been mapped to the corresponding FINN-HLS blocks, in this case `MatrixVectorActivation` units.\n",
+    "\n",
+    "To interact with the `.onnx` file we use `ModelWrapper()`. This wrapper simplifies the access to different model attributes and allows us to apply custom transformations on the model.\n",
+    "\n",
+    "In the cell below, we load our onnx file and view the cybersecurity MLP network in Netron."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from qonnx.core.modelwrapper import ModelWrapper\n",
+    "model_path = os.environ[\"FINN_ROOT\"] + \"/notebooks/advanced/cybsec_PE_SIMD.onnx\"\n",
+    "model = ModelWrapper(model_path)\n",
+    "\n",
+    "showInNetron(model_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Part 2 : Parallelization Parameters: PE & SIMD"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The computational parallelism can be varied by setting the folding factors, also called parallelization parameters, **PE** and **SIMD** of each layer. These parallelization attributes are subject to certain constraints and should be selected accordingly.\n",
+    "\n",
+    "To see more details about how this is implemented in the `MatrixVectorActivation` layer (MVAU), please have a look at [this documentation](https://github.com/Xilinx/finn/blob/github-pages/docs/finn-sheduling-and-folding.pptx). A schematic of the folding in an MVAU for a fully-connected layer is shown below:\n",
+    "\n",
+    "![finn-folding-mvau](finn-folding-mvau.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the case of the MVAU, `PE` & `SIMD` are subject to the following constraints:\n",
+    "\n",
+    "If `MW` is the number of input features and `MH` the number of output features:\n",
+    "\n",
+    "    MW % SIMD == 0\n",
+    "    MH % PE == 0\n",
+    "\n",
+    "Total folding in the case of the MVAU is defined as:\n",
+    "\n",
+    "    Total folding = (MH/PE) x (MW/SIMD)\n",
+    "\n",
+    "In a streaming dataflow architecture like the ones generated by FINN, the throughput is determined by the slowest layer. So, the goal of adjusting these parameters is to get an almost balanced pipeline, i.e. to equalize the throughput of the layers in the generated dataflow architecture.\n",
+    "\n",
+    "The FINN compiler provides analysis passes to facilitate the exploration of the folding factors of each layer. In this notebook we will show how to use these functions and explore how the parallelization parameters affect the clock cycles and the resource utilization of the generated dataflow architecture.\n",
+    "\n",
+    "We start with a naive case where the `PE` & `SIMD` values across all layers are 1. This is the starting point of our exploration and is also the state the network is in after the conversion to HLS layers. If you take a look at the model using Netron and click on one of the MVAU layers, you can see that `PE` and `SIMD` are both set to 1 by default."
+   ]
+  },
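+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check, we can read `MW` and `MH` from the first `MatrixVectorActivation` node and evaluate the constraints and the total folding formula directly (a small sketch using the node attributes; the `getCustomOp()` wrapper used here is introduced in more detail further below):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from qonnx.custom_op.registry import getCustomOp\n",
+    "\n",
+    "mvau0_check = getCustomOp(model.get_nodes_by_op_type(\"MatrixVectorActivation\")[0])\n",
+    "mw = mvau0_check.get_nodeattr(\"MW\")\n",
+    "mh = mvau0_check.get_nodeattr(\"MH\")\n",
+    "pe = mvau0_check.get_nodeattr(\"PE\")\n",
+    "simd = mvau0_check.get_nodeattr(\"SIMD\")\n",
+    "\n",
+    "# the folding constraints of the MVAU must hold\n",
+    "assert mw % simd == 0 and mh % pe == 0\n",
+    "print(\"Total folding: (%d/%d) x (%d/%d) = %d\" % (mh, pe, mw, simd, (mh // pe) * (mw // simd)))"
+   ]
+  },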
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "showInNetron(model_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We import the analysis passes `exp_cycles_per_layer()` and `res_estimation()` to estimate the number of clock cycles and the resource utilization of each network layer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer\n",
+    "from finn.analysis.fpgadataflow.res_estimation import res_estimation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Analysis passes in FINN return information about the model in the form of a dictionary. You can learn more about analysis passes in general in this Jupyter notebook: [0_custom_analysis_pass.ipynb](0_custom_analysis_pass.ipynb).\n",
+    "\n",
+    "We start by calling the analysis pass `exp_cycles_per_layer()`, which returns a dictionary with the layer names as keys and the expected cycles as values. Afterwards, we plot the result in a bar chart."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cycles_dict = model.analysis(exp_cycles_per_layer)\n",
+    "cycles_dict"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "fig = plt.figure(figsize=(10, 5))\n",
+    "plt.bar(cycles_dict.keys(), cycles_dict.values(), color='blue', width=0.3)\n",
+    "plt.xlabel(\"Network layers\")\n",
+    "plt.ylabel(\"Number of clock cycles\")\n",
+    "plt.title(\"Clock cycles per layer PE=SIMD=1\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We observe that the bottleneck in the execution of the model on hardware would come from the execution of the first layer, which takes an estimated 38400 clock cycles to process one set of inputs.\n",
+    "\n",
+    "No matter how quickly the other layers execute, the throughput will be defined by the first layer's execution latency.\n",
+    "\n",
+    "Let's now have a look at the estimated resources per layer by calling another analysis pass.\n",
+    "The keys are again the layer names, but the values are now a dictionary with the resource estimates per layer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "res_dict = model.analysis(res_estimation)\n",
+    "res_dict"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next to the absolute numbers of LUTs, BRAM, URAM and DSPs, the analysis pass also provides information about the efficiency of the memory usage. If a memory type is not utilized, the efficiency is 1 by default; you can see that above for `URAM_efficiency`. In all other cases the efficiency indicates the actual parameter storage needed divided by the allocated BRAM/URAM storage. In our example this means that MVAU_0 uses 5 BRAMs, which are 83% utilized."
+   ]
+  },
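+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can pull these numbers out of the dictionary directly (a small sketch, assuming the key names shown in the estimates printed above, e.g. `BRAM_18K` and `BRAM_efficiency`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# look up the BRAM usage and efficiency of the first layer\n",
+    "layer0_name = list(res_dict.keys())[0]\n",
+    "print(\"Layer: %s\" % layer0_name)\n",
+    "print(\"BRAM_18K: %s\" % str(res_dict[layer0_name][\"BRAM_18K\"]))\n",
+    "print(\"BRAM efficiency: %.0f%%\" % (100 * res_dict[layer0_name][\"BRAM_efficiency\"]))"
+   ]
+  },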
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After we extract that information from the model, we plot the number of LUTs. In this notebook we concentrate on the influence on the LUT usage, but by modifying the code below you can also extract information about memory and DSP usage."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Extracting LUTs from res_dict\n",
+    "LUTs = [res_dict[key][\"LUT\"] for key in res_dict.keys()]\n",
+    "\n",
+    "# Plotting the bar graph of each network layer with their corresponding LUT resource utilization\n",
+    "fig = plt.figure(figsize=(10, 5))\n",
+    "plt.bar(res_dict.keys(), LUTs, color='green', width=0.3)\n",
+    "plt.xlabel(\"Network layers\")\n",
+    "plt.ylabel(\"Number of LUTs\")\n",
+    "plt.title(\"No. of LUTs per layer PE=SIMD=1\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since we identified above that the first layer takes the highest number of cycles to complete its execution, we will now try to adjust its folding parameters to reduce its latency, at the expense of an increase in resource utilization."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Modify Parameters\n",
+    "\n",
+    "We now modify the parallelization parameters of the first network layer to reduce its latency.\n",
+    "We extract only the first `MatrixVectorActivation` block from the model and set the parallelization parameters manually.\n",
+    "\n",
+    "In the first step, we left the `PE` & `SIMD` values of all layers at their default (=1) to establish a baseline and measure the estimated clock cycles and resource utilization for each of the individual layers.\n",
+    "\n",
+    "To set `PE` & `SIMD`, we will utilize functionality from the FINN compiler. Each layer type has a Python wrapper which can be instantiated using the `getCustomOp()` function. The wrapper offers several helper functions like `get_nodeattr()` and `set_nodeattr()` to access and set the attributes of a node."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from qonnx.custom_op.registry import getCustomOp\n",
+    "\n",
+    "list_of_mvaus = model.get_nodes_by_op_type(\"MatrixVectorActivation\")\n",
+    "mvau0 = list_of_mvaus[0]\n",
+    "\n",
+    "mvau0_inst = getCustomOp(mvau0)\n",
+    "\n",
+    "# Get the node attributes to check the current setting\n",
+    "print(\"The parallelization parameters of %s were: \" % mvau0.name)\n",
+    "print(\"PE: \" + str(mvau0_inst.get_nodeattr(\"PE\")))\n",
+    "print(\"SIMD: \" + str(mvau0_inst.get_nodeattr(\"SIMD\")))\n",
+    "\n",
+    "# Set the new node attributes\n",
+    "mvau0_inst.set_nodeattr(\"PE\", 2)\n",
+    "mvau0_inst.set_nodeattr(\"SIMD\", 5)\n",
+    "\n",
+    "# Get the node attributes to check the updated setting\n",
+    "print(\"The parallelization parameters of %s are updated to: \" % mvau0.name)\n",
+    "print(\"PE: \" + str(mvau0_inst.get_nodeattr(\"PE\")))\n",
+    "print(\"SIMD: \" + str(mvau0_inst.get_nodeattr(\"SIMD\")))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We save the model and view it. On expanding the first `MatrixVectorActivation` we can see the updated `PE` & `SIMD` parameters for that layer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.save(\"cybsec_PE_SIMD_modified.onnx\")\n",
+    "showInNetron(\"cybsec_PE_SIMD_modified.onnx\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using the total folding formula from above, we have reduced the total folding of our layer from `600 x 64` to `120 x 32`, resulting in an estimated `10x` decrease in the execution latency of the layer.\n",
+    "This can be observed in the new estimated clock cycles."
+   ]
+  },
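+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a first check, we can query the modified node's wrapper directly for its expected cycles (`get_exp_cycles()` is the same per-node helper that the `exp_cycles_per_layer` analysis pass calls):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# (MH/PE) x (MW/SIMD) = (64/2) x (600/5) = 32 x 120 = 3840 expected cycles\n",
+    "print(\"Expected cycles of %s: %d\" % (mvau0.name, mvau0_inst.get_exp_cycles()))"
+   ]
+  },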
\n", + "This can be observed in the new estimated clock cycles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cycles_dict_updated = model.analysis(exp_cycles_per_layer)\n", + "cycles_dict_updated" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig = plt.figure(figsize = (10, 5))\n", + "plt.bar(cycles_dict_updated.keys(), cycles_dict_updated.values(), color ='blue', width = 0.3)\n", + "plt.xlabel(\"Network layers\")\n", + "plt.ylabel(\"Number of clock cycles\")\n", + "plt.title(\"Clock cycles per layer with updated folding factors\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This has of course consequences for the resource usage of the network." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "res_dict_updated = model.analysis(res_estimation)\n", + "res_dict_updated" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Extracting LUTs from res_dict\n", + "LUTs_updated = [res_dict_updated[key][\"LUT\"] for key in res_dict_updated.keys()] \n", + "\n", + "#Plotting the bar graph of each network layer with their corresponding LUT resource utilization\n", + "fig = plt.figure(figsize = (10, 5))\n", + "plt.bar(res_dict_updated.keys(), LUTs_updated, color ='green', width = 0.3)\n", + "plt.xlabel(\"Network Layers\")\n", + "plt.ylabel(\"LUT Utilisation\")\n", + "plt.title(\"No. of LUTs per layer with updated folding factors\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From these numbers, we see that the first layer has been removed as the bottleneck and that the entire network can now perform one inference in ~4096 clock cycles (when the pipeline is full) as compared to the earlier configuration where it took ~38400 execution cycles.\n", + "\n", + "This decrease in execution latency of the network though comes at a cost of a 45% increase in LUT resource utilization for the first layer of the network." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Important Note : StreamingDataWidthConverters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next to resources and performance, folding factors (or parallelization parameters) are influencing also other properties of the generated design. Since we are able to generate results in parallel, the data that gets fed into the layer needs to be packed in a specific format to provide the correct data at the correct time for the internal parallelism. Also, the data that comes out of a layer will be in a specific format depending on the internal parallelism." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To analyze the influence of the folding factors on the data streams between layers, we first will import the original model (with `PE=SIMD=1`) and then we will import the updated model, so that we can compare the two of them." 
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dir_path = os.environ[\"FINN_ROOT\"] + \"/notebooks/advanced/\"\n",
+    "model_orig = ModelWrapper(dir_path + \"cybsec_PE_SIMD.onnx\")\n",
+    "model_updated = ModelWrapper(\"cybsec_PE_SIMD_modified.onnx\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the next step we extract this information from all layers. For MVAUs the folded input shape is (1, MW/SIMD, SIMD) and the folded output shape is (1, MH/PE, PE)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Original model\n",
+    "list_of_mvaus = model_orig.get_nodes_by_op_type(\"MatrixVectorActivation\")\n",
+    "print(\"In the original model (pe=simd=1): \")\n",
+    "for mvau in list_of_mvaus:\n",
+    "    mvau_inst = getCustomOp(mvau)\n",
+    "    print(\"Layer: \" + mvau.name)\n",
+    "    print(\"Input shape: \" + str(mvau_inst.get_folded_input_shape()))\n",
+    "    print(\"Output shape: \" + str(mvau_inst.get_folded_output_shape()))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Updated model\n",
+    "list_of_mvaus = model_updated.get_nodes_by_op_type(\"MatrixVectorActivation\")\n",
+    "print(\"In the updated model: \")\n",
+    "for mvau in list_of_mvaus:\n",
+    "    mvau_inst = getCustomOp(mvau)\n",
+    "    print(\"Layer: \" + mvau.name)\n",
+    "    print(\"Input shape: \" + str(mvau_inst.get_folded_input_shape()))\n",
+    "    print(\"Output shape: \" + str(mvau_inst.get_folded_output_shape()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can see that the input and output shapes for MatrixVectorActivation_0 have changed after we modified the folding factors. These changes have a direct influence on the input/output stream widths. We can have a closer look at the formulas used to calculate the stream widths of an MVAU."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "showSrc(mvau_inst.get_instream_width)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "showSrc(mvau_inst.get_outstream_width)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The input stream width is calculated by multiplying the input bit width with SIMD, and the output stream width by multiplying the output bit width with PE."
+   ]
+  },
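+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can verify this for the first MVAU of the updated model (a small sketch; `get_input_datatype()` and `get_output_datatype()` return the QONNX datatypes of the streams):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mvau0_upd = getCustomOp(model_updated.get_nodes_by_op_type(\"MatrixVectorActivation\")[0])\n",
+    "ibits = mvau0_upd.get_input_datatype().bitwidth()\n",
+    "obits = mvau0_upd.get_output_datatype().bitwidth()\n",
+    "print(\"Input: %d bit x SIMD=%d -> stream width %d\" % (ibits, mvau0_upd.get_nodeattr(\"SIMD\"), mvau0_upd.get_instream_width()))\n",
+    "print(\"Output: %d bit x PE=%d -> stream width %d\" % (obits, mvau0_upd.get_nodeattr(\"PE\"), mvau0_upd.get_outstream_width()))"
+   ]
+  },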
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To connect two layers with each other in the final design, the input stream width of a node needs to match the output stream width of the preceding node. If that is not the case, FINN inserts DataWidthConverters (DWCs) to resolve the mismatch. Let's have a look at the input/output stream widths of the layers before updating the parallelization parameters."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Original model\n",
+    "list_of_mvaus = model_orig.get_nodes_by_op_type(\"MatrixVectorActivation\")\n",
+    "print(\"In the original model (pe=simd=1): \")\n",
+    "for mvau in list_of_mvaus:\n",
+    "    mvau_inst = getCustomOp(mvau)\n",
+    "    print(\"Layer: \" + mvau.name)\n",
+    "    print(\"Input stream width: \" + str(mvau_inst.get_instream_width()))\n",
+    "    print(\"Output stream width: \" + str(mvau_inst.get_outstream_width()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the original model the output stream width of one layer matches the input stream width of the following layer, so no DWC would be required when generating the final design."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For the updated model, the situation is different. Let's have a look at how the stream widths have changed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Updated model\n",
+    "list_of_mvaus = model_updated.get_nodes_by_op_type(\"MatrixVectorActivation\")\n",
+    "print(\"In the updated model: \")\n",
+    "for mvau in list_of_mvaus:\n",
+    "    mvau_inst = getCustomOp(mvau)\n",
+    "    print(\"Layer: \" + mvau.name)\n",
+    "    print(\"Input stream width: \" + str(mvau_inst.get_instream_width()))\n",
+    "    print(\"Output stream width: \" + str(mvau_inst.get_outstream_width()))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As we can see, the output stream width of MatrixVectorActivation_0 has now changed to `4`, while the input stream width of MatrixVectorActivation_1 stayed `2`. The FINN compiler would therefore insert a DWC between these nodes. We can manually invoke this behavior by calling the `InsertDWC` transformation on our model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from finn.transformation.fpgadataflow.insert_dwc import InsertDWC\n",
+    "from qonnx.transformation.general import GiveUniqueNodeNames\n",
+    "\n",
+    "model_updated = model_updated.transform(InsertDWC())\n",
+    "model_updated = model_updated.transform(GiveUniqueNodeNames())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model_updated.save(\"cybsec_DWC.onnx\")\n",
+    "showInNetron(\"cybsec_DWC.onnx\")"
+   ]
+  },
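+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Besides inspecting the graph in Netron, we can confirm the insertion programmatically by counting the DWC nodes in the transformed model:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dwc_nodes = model_updated.get_nodes_by_op_type(\"StreamingDataWidthConverter_Batch\")\n",
+    "print(\"Number of inserted DWCs: %d\" % len(dwc_nodes))"
+   ]
+  },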
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can observe in the model that a DWC was inserted between the first two layers.\n",
+    "Since the DWC will also be a hardware block in our final FINN design, it has a latency and resources associated with it. Let's have a final look at our resource estimates."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model_dwc = ModelWrapper(\"cybsec_DWC.onnx\")\n",
+    "res_dict_dwc = model_dwc.analysis(res_estimation)\n",
+    "res_dict_dwc"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since we now have one additional layer, we shorten the layer names for the plot."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "layers = res_dict_dwc.keys()\n",
+    "# replace layer names with abbreviations\n",
+    "layers = [n.replace(\"MatrixVectorActivation_\", \"MVU\") for n in layers]\n",
+    "layers = [n.replace(\"StreamingDataWidthConverter_Batch\", \"DWC\") for n in layers]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Extracting LUTs from res_dict\n",
+    "LUTs_dwc = [res_dict_dwc[key][\"LUT\"] for key in res_dict_dwc.keys()]\n",
+    "\n",
+    "# Plotting the bar graph of each network layer with their corresponding LUT resource utilization\n",
+    "fig = plt.figure(figsize=(10, 5))\n",
+    "plt.bar(layers, LUTs_dwc, color='red', width=0.3)\n",
+    "plt.xlabel(\"Network layers\")\n",
+    "plt.ylabel(\"Number of LUTs\")\n",
+    "plt.title(\"Estimated LUT values used for each network layer\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the case of our example network, the `StreamingDataWidthConverter_Batch` layer does not consume a large number of LUT resources, as shown in the graph. This might be different for larger models and if a higher number of DWCs is inserted. Please be aware of this when setting the folding factors for your network."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/notebooks/advanced/cybsec_PE_SIMD.onnx b/notebooks/advanced/cybsec_PE_SIMD.onnx
new file mode 100644
index 0000000000000000000000000000000000000000..b450cc9e43361e845fda8c95d743e1b461a1a9ad
Binary files /dev/null and b/notebooks/advanced/cybsec_PE_SIMD.onnx differ
diff --git a/notebooks/advanced/finn-dataflow.png b/notebooks/advanced/finn-dataflow.png
new file mode 100755
index 0000000000000000000000000000000000000000..ebe98d0fbd1878fabb9ae2d87bd9b111d62dc39e
Binary files /dev/null and b/notebooks/advanced/finn-dataflow.png differ
diff --git a/notebooks/advanced/finn-folding-mvau.png b/notebooks/advanced/finn-folding-mvau.png
new file mode 100755
index 0000000000000000000000000000000000000000..bbba00182c888b072432116a3a9eafbb1d8cec0e
Binary files /dev/null and b/notebooks/advanced/finn-folding-mvau.png differ
diff --git a/notebooks/advanced/finn-folding.png b/notebooks/advanced/finn-folding.png
new file mode 100755
index 0000000000000000000000000000000000000000..019b4aa1e7d2f447949d9450609b2e5e9cbd04c0
Binary files /dev/null and b/notebooks/advanced/finn-folding.png differ
diff --git a/tests/notebooks/test_jupyter_notebooks.py b/tests/notebooks/test_jupyter_notebooks.py
index 819b4ccde0333cfdf6e16f30e25fb5303fbf1f70..836f1e059efc3cfae95fa9e2ccd0b74f6fca9c11 100644
--- a/tests/notebooks/test_jupyter_notebooks.py
+++ b/tests/notebooks/test_jupyter_notebooks.py
@@ -21,6 +21,7 @@ advanced_notebooks = [
     pytest.param(notebook_advanced_dir + "0_custom_analysis_pass.ipynb"),
     pytest.param(notebook_advanced_dir + "1_custom_transformation_pass.ipynb"),
     pytest.param(notebook_advanced_dir + "2_custom_op.ipynb"),
+    pytest.param(notebook_advanced_dir + "3_folding.ipynb"),
 ]
 
 cyber_notebooks = [