Added suggestions form testing (#270)

Unverified Commit f2d7b6b3 authored 4 years ago by Hendrik Borras Committed by GitHub 4 years ago
--- a/notebooks/end2end_example/cybersecurity/1-train-mlp-with-brevitas.ipynb
+++ b/notebooks/end2end_example/cybersecurity/1-train-mlp-with-brevitas.ipynb
--- a/notebooks/end2end_example/cybersecurity/2-export-to-finn-and-verify.ipynb
+++ b/notebooks/end2end_example/cybersecurity/2-export-to-finn-and-verify.ipynb
@@ -6,7 +6,9 @@
   "source": [
    "# Verify Exported ONNX Model in FINN\n",
    "\n",
-    "**Important: This notebook depends on the 1-train-mlp-with-brevitas notebook, because we are using the ONNX model that was exported there. So please make sure the needed .onnx file is generated before you run this notebook. Also remember to 'close and halt' any other FINN notebooks, since Netron visualizations use the same port.**\n",
+    "**Important: This notebook depends on the 1-train-mlp-with-brevitas notebook, because we are using the ONNX model that was exported there. So please make sure the needed .onnx file is generated before you run this notebook.**\n",
+    "\n",
+    "**Also remember to 'close and halt' any other FINN notebooks, since Netron visualizations use the same port.**\n",
    "\n",
    "In this notebook we will show how to import the network we trained in Brevitas and verify it in the FINN compiler. \n",
    "This verification process can actually be done at various stages in the compiler [as explained in this notebook](../bnn-pynq/tfc_end2end_verification.ipynb) but for this example we'll only consider the first step: verifying the exported high-level FINN-ONNX model.\n",
@@ -122,7 +124,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -150,7 +152,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
@@ -243,7 +245,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
@@ -252,7 +254,7 @@
       "torch.Size([100, 593])"
      ]
     },
-     "execution_count": 7,
+     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -280,7 +282,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
@@ -289,7 +291,7 @@
       "IncompatibleKeys(missing_keys=[], unexpected_keys=[])"
      ]
     },
-     "execution_count": 8,
+     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -325,12 +327,15 @@
    "# replace this with your trained network checkpoint if you're not\n",
    "# using the pretrained weights\n",
    "trained_state_dict = torch.load(\"state_dict.pth\")[\"models_state_dict\"][0]\n",
+    "# Uncomment the following line if you previously chose to train the network yourself\n",
+    "#trained_state_dict = torch.load(\"state_dict_self-trained.pth\")\n",
+    "\n",
    "brevitas_model.load_state_dict(trained_state_dict, strict=False)"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -360,7 +365,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -390,14 +395,14 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
-      "ok 100 nok 0: 100%|██████████| 100/100 [00:48<00:00,  2.09it/s]\n"
+      "ok 100 nok 0: 100%|██████████| 100/100 [00:46<00:00,  2.17it/s]\n"
     ]
    }
   ],
@@ -426,7 +431,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {

 %% Cell type:markdown id: tags:

 # Verify Exported ONNX Model in FINN

-**Important: This notebook depends on the 1-train-mlp-with-brevitas notebook, because we are using the ONNX model that was exported there. So please make sure the needed .onnx file is generated before you run this notebook. Also remember to 'close and halt' any other FINN notebooks, since Netron visualizations use the same port.**
+**Important: This notebook depends on the 1-train-mlp-with-brevitas notebook, because we are using the ONNX model that was exported there. So please make sure the needed .onnx file is generated before you run this notebook.**
+
+**Also remember to 'close and halt' any other FINN notebooks, since Netron visualizations use the same port.**

 In this notebook we will show how to import the network we trained in Brevitas and verify it in the FINN compiler.
 This verification process can actually be done at various stages in the compiler [as explained in this notebook](../bnn-pynq/tfc_end2end_verification.ipynb) but for this example we'll only consider the first step: verifying the exported high-level FINN-ONNX model.
 Once this model is successfully verified, we'll generate an FPGA accelerator from it in the next notebook.

 %% Cell type:code id: tags:

 ``` python
 import onnx
 import torch
 ```

 %% Cell type:markdown id: tags:

 **This is important -- always import onnx before torch**. This is a workaround for a [known bug](https://github.com/onnx/onnx/issues/2394).

 %% Cell type:markdown id: tags:

 ## Outline
 -------------
 1. [Import model and visualize in Netron](#brevitas_import_visualization)
 2. [Network preparations: Tidy up transformations](#network_preparations)
 3. [Load the dataset and Brevitas model](#load_dataset)
 4. [Compare FINN and Brevitas execution](#compare_brevitas)

 %% Cell type:markdown id: tags:

 # 1. Import model and visualize in Netron <a id="brevitas_import_visualization"></a>

 Now that we have the model in .onnx format, we can work with it using FINN. To import it into FINN, we'll use the [`ModelWrapper`](https://finn.readthedocs.io/en/latest/source_code/finn.core.html#finn.core.modelwrapper.ModelWrapper). It is a wrapper around the ONNX model which provides several helper functions to make it easier to work with the model.

 %% Cell type:code id: tags:

 ``` python
 from finn.core.modelwrapper import ModelWrapper

 model_file_path = "cybsec-mlp.onnx"
 model_for_sim = ModelWrapper(model_file_path)
 ```

 %% Cell type:markdown id: tags:

 To visualize the exported model, Netron can be used. Netron is a visualizer for neural networks and allows interactive investigation of network properties. For example, you can click on the individual nodes and view the properties.

 %% Cell type:code id: tags:

 ``` python
 from finn.util.visualization import showInNetron
 showInNetron(model_file_path)
 ```

 %% Output

    Serving 'cybsec-mlp.onnx' at http://0.0.0.0:8081

    <IPython.lib.display.IFrame at 0x7fc1fc950748>

 %% Cell type:markdown id: tags:

 # 2. Network preparation: Tidy up transformations <a id="network_preparations"></a>

 Before running the verification, we need to prepare our FINN-ONNX model. In particular, all the intermediate tensors need to have statically defined shapes. To do this, we apply some transformations to the model like a kind of "tidy-up" to make it easier to process. You can read more about these transformations in [this notebook](../bnn-pynq/tfc_end2end_example.ipynb).

 %% Cell type:code id: tags:

 ``` python
 from finn.transformation.general import GiveReadableTensorNames, GiveUniqueNodeNames, RemoveStaticGraphInputs
 from finn.transformation.infer_shapes import InferShapes
 from finn.transformation.infer_datatypes import InferDataTypes
 from finn.transformation.fold_constants import FoldConstants

 model_for_sim = model_for_sim.transform(InferShapes())
 model_for_sim = model_for_sim.transform(FoldConstants())
 model_for_sim = model_for_sim.transform(GiveUniqueNodeNames())
 model_for_sim = model_for_sim.transform(GiveReadableTensorNames())
 model_for_sim = model_for_sim.transform(InferDataTypes())
 model_for_sim = model_for_sim.transform(RemoveStaticGraphInputs())
 ```

 %% Cell type:markdown id: tags:

 There's one more thing we'll do: we will mark the input tensor datatype as bipolar, which will be used by the compiler later on.

 *In the near future it will be possible to add this information to the model while exporting, instead of having to add it manually.*

 %% Cell type:code id: tags:

 ``` python
 from finn.core.datatype import DataType

 finnonnx_in_tensor_name = model_for_sim.graph.input[0].name
 finnonnx_out_tensor_name = model_for_sim.graph.output[0].name
 print("Input tensor name: %s" % finnonnx_in_tensor_name)
 print("Output tensor name: %s" % finnonnx_out_tensor_name)
 finnonnx_model_in_shape = model_for_sim.get_tensor_shape(finnonnx_in_tensor_name)
 print("Input tensor shape: %s" % str(finnonnx_model_in_shape))
 model_for_sim.set_tensor_datatype(finnonnx_in_tensor_name, DataType.BIPOLAR)
 print("Input tensor datatype: %s" % str(model_for_sim.get_tensor_datatype(finnonnx_in_tensor_name)))

 verified_model_filename = "cybsec-mlp-verified.onnx"
 model_for_sim.save(verified_model_filename)
 ```

 %% Output

    Input tensor name: global_in
    Output tensor name: global_out
    Input tensor shape: [1, 600]
    Input tensor datatype: DataType.BIPOLAR

 %% Cell type:markdown id: tags:

 Let's view our ready-to-go model. Some changes to note:

 * all intermediate tensors now have their shapes specified (indicated by numbers next to the arrows going between layers)
 * the datatype on the input tensor is set to DataType.BIPOLAR (click on the `global_in` node to view properties)

 %% Cell type:code id: tags:

 ``` python
 showInNetron(verified_model_filename)
 ```

 %% Output

    
    Stopping http://0.0.0.0:8081
    Serving 'cybsec-mlp-verified.onnx' at http://0.0.0.0:8081

    <IPython.lib.display.IFrame at 0x7fc280154278>

 %% Cell type:markdown id: tags:

 # 3. Load the Dataset and the Brevitas Model <a id="load_dataset"></a>

 We'll use some example data from the quantized UNSW-NB15 dataset (from the previous notebook) as inputs for the verification.

 Recall that the quantized values from the dataset are 593-bit binary {0, 1} vectors whereas our exported model takes 600-bit bipolar {-1, +1} vectors, so we'll have to preprocess it a bit before we can use it for verifying the ONNX model.
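The preprocessing described above can be sketched with NumPy as follows. This is a minimal illustration on a synthetic vector, not on the dataset itself, and the helper name `binary_to_padded_bipolar` is hypothetical. It assumes, in line with the FINN helper used later in this notebook, that the 7 padding positions are filled with zeros before rescaling, so they end up as -1.

```python
import numpy as np

def binary_to_padded_bipolar(x, padded_width=600):
    """Pad {0, 1} rows with zeros up to padded_width, then map {0, 1} -> {-1, +1}."""
    x = np.asarray(x, dtype=np.float32)
    n_pad = padded_width - x.shape[1]
    # pad only the feature axis (axis 1); the batch axis is left untouched
    x = np.pad(x, [(0, 0), (0, n_pad)])
    return 2 * x - 1

# synthetic 593-bit binary input, for illustration only
rng = np.random.default_rng(0)
example = rng.integers(0, 2, size=(1, 593))
bipolar = binary_to_padded_bipolar(example)
print(bipolar.shape)
```

Padding before rescaling matters: padding after the `2*x - 1` step would insert zeros, which are not valid bipolar values.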

 %% Cell type:code id: tags:

 ``` python
 from torch.utils.data import DataLoader, Dataset
 from dataloader_quantized import UNSW_NB15_quantized

 test_quantized_dataset = UNSW_NB15_quantized(file_path_train='UNSW_NB15_training-set.csv', \
                                              file_path_test = "UNSW_NB15_testing-set.csv", \
                                              train=False)

 n_verification_inputs = 100
 # last column is the label, exclude it
 input_tensor = test_quantized_dataset.data[:n_verification_inputs,:-1]
 input_tensor.shape
 ```

 %% Output

    torch.Size([100, 593])

 %% Cell type:markdown id: tags:

 Let's also bring up the MLP we trained in Brevitas from the previous notebook. We'll compare its outputs to what is generated by FINN.

 %% Cell type:code id: tags:

 ``` python
 input_size = 593
 hidden1 = 64
 hidden2 = 64
 hidden3 = 64
 weight_bit_width = 2
 act_bit_width = 2
 num_classes = 1

 from brevitas.nn import QuantLinear, QuantReLU
 import torch.nn as nn

 brevitas_model = nn.Sequential(
      QuantLinear(input_size, hidden1, bias=True, weight_bit_width=weight_bit_width),
      nn.BatchNorm1d(hidden1),
      nn.Dropout(0.5),
      QuantReLU(bit_width=act_bit_width),
      QuantLinear(hidden1, hidden2, bias=True, weight_bit_width=weight_bit_width),
      nn.BatchNorm1d(hidden2),
      nn.Dropout(0.5),
      QuantReLU(bit_width=act_bit_width),
      QuantLinear(hidden2, hidden3, bias=True, weight_bit_width=weight_bit_width),
      nn.BatchNorm1d(hidden3),
      nn.Dropout(0.5),
      QuantReLU(bit_width=act_bit_width),
      QuantLinear(hidden3, num_classes, bias=True, weight_bit_width=weight_bit_width)
 )

 # replace this with your trained network checkpoint if you're not
 # using the pretrained weights
 trained_state_dict = torch.load("state_dict.pth")["models_state_dict"][0]
+# Uncomment the following line if you previously chose to train the network yourself
+#trained_state_dict = torch.load("state_dict_self-trained.pth")
+
 brevitas_model.load_state_dict(trained_state_dict, strict=False)
 ```

 %% Output

    IncompatibleKeys(missing_keys=[], unexpected_keys=[])

 %% Cell type:code id: tags:

 ``` python
 def inference_with_brevitas(current_inp):
    brevitas_output = brevitas_model.forward(current_inp)
    # apply sigmoid + threshold
    brevitas_output = torch.sigmoid(brevitas_output)
    brevitas_output = (brevitas_output.detach().numpy() > 0.5) * 1
    # convert output to bipolar
    brevitas_output = 2*brevitas_output - 1
    return brevitas_output
 ```
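Because the sigmoid is monotonic and maps 0 to 0.5, thresholding its output at 0.5 gives the same decision as checking the sign of the raw network output (apart from floating-point rounding extremely close to zero). The standalone sketch below uses plain Python and hypothetical helper names to show how a raw output value maps to a bipolar label.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def to_bipolar_label(logit):
    """Sigmoid + 0.5 threshold, then map {0, 1} -> {-1, +1}."""
    binary = 1 if sigmoid(logit) > 0.5 else 0
    return 2 * binary - 1

for logit in (-3.0, -0.1, 0.0, 0.1, 3.0):
    print(logit, to_bipolar_label(logit))
```

Note that an output of exactly 0 falls on the -1 side, since the comparison is strict.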

 %% Cell type:markdown id: tags:

 # 4. Compare FINN & Brevitas execution <a id="compare_brevitas"></a>

 %% Cell type:markdown id: tags:

 Let's make helper functions to execute the same input with Brevitas and FINN. For FINN, we'll use the [`finn.core.onnx_exec`](https://finn.readthedocs.io/en/latest/source_code/finn.core.html#finn.core.onnx_exec.execute_onnx) function to execute the exported FINN-ONNX on the inputs.

 %% Cell type:code id: tags:

 ``` python
 def inference_with_finn_onnx(current_inp):
    # convert input to numpy for FINN
    current_inp = current_inp.detach().numpy()
    # add padding and re-scale to bipolar
    current_inp = np.pad(current_inp, [(0, 0), (0, 7)])
    current_inp = 2*current_inp-1
    # reshape to expected input (add 1 for batch dimension)
    current_inp = current_inp.reshape(finnonnx_model_in_shape)
    # create the input dictionary
    input_dict = {finnonnx_in_tensor_name : current_inp}
    # run with FINN's execute_onnx
    output_dict = oxe.execute_onnx(model_for_sim, input_dict)
    #get the output tensor
    finn_output = output_dict[finnonnx_out_tensor_name]
    return finn_output
 ```

 %% Cell type:markdown id: tags:

 Now we can call our inference helper functions for each input and compare the outputs.

 %% Cell type:code id: tags:

 ``` python
 import finn.core.onnx_exec as oxe
 import numpy as np
 from tqdm import trange

 verify_range = trange(n_verification_inputs, desc="FINN execution", position=0, leave=True)
 brevitas_model.eval()

 ok = 0
 nok = 0

 for i in verify_range:
    # run in Brevitas with PyTorch tensor
    current_inp = input_tensor[i].reshape((1, 593))
    brevitas_output = inference_with_brevitas(current_inp)
    finn_output = inference_with_finn_onnx(current_inp)
    # compare the outputs
    ok += 1 if finn_output == brevitas_output else 0
    nok += 1 if finn_output != brevitas_output else 0
    verify_range.set_description("ok %d nok %d" % (ok, nok))
    verify_range.refresh() # to show immediately the update
 ```

 %% Output

-    ok 100 nok 0: 100%|██████████| 100/100 [00:48<00:00,  2.09it/s]
+    ok 100 nok 0: 100%|██████████| 100/100 [00:46<00:00,  2.17it/s]

 %% Cell type:code id: tags:

 ``` python
 if ok == n_verification_inputs:
    print("Verification succeeded. Brevitas and FINN-ONNX execution outputs are identical")
 else:
    print("Verification failed. Brevitas and FINN-ONNX execution outputs are NOT identical")
 ```

 %% Output

    Verification succeeded. Brevitas and FINN-ONNX execution outputs are identical

 %% Cell type:markdown id: tags:

 This concludes our second notebook. In the next one, we'll take the ONNX model we just verified all the way down to FPGA hardware with the FINN compiler.

 %% Cell type:code id: tags:

 ``` python
 ```

--- a/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb
+++ b/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb
@@ -41,7 +41,7 @@
    "Since version 0.5b, the FINN compiler has a `build_dataflow` tool. Compared to previous versions which required setting up all the needed transformations in a Python script, it makes experimenting with dataflow architecture generation easier. The core idea is to specify the relevant build info as a configuration `dict`, which invokes all the necessary steps to make the dataflow build happen. It can be invoked either from the [command line](https://finn-dev.readthedocs.io/en/latest/command_line.html) or with a single Python function call\n",
    "\n",
    "\n",
-    "In this notebook, we'll use the Python function call to invoke the builds to stay inside the Jupyter notebook, but feel free to experiment with reproducing what we do here with the `./run-docker.sh build_dataflow` and `./run-docker.sh build_custom` command-line entry points too, as documented [here]((https://finn-dev.readthedocs.io/en/latest/command_line.html))."
+    "In this notebook, we'll use the Python function call to invoke the builds to stay inside the Jupyter notebook, but feel free to experiment with reproducing what we do here with the `./run-docker.sh build_dataflow` and `./run-docker.sh build_custom` command-line entry points too, as documented [here](https://finn-dev.readthedocs.io/en/latest/command_line.html)."
   ]
  },
  {
@@ -277,7 +277,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Here, we can see the estimated number of clock cycles each layer will take. Recall that all of these layers will be running in parallel, and the slowest layer will determine the overall throughput of the entire neural network. FINN attempts to parallelize each layer such that they all take a similar number of cycles, and less than the corresponding number of cycles that would be required to meet `target_fps`.\n",
+    "Here, we can see the estimated number of clock cycles each layer will take. Recall that all of these layers will be running in parallel, and the slowest layer will determine the overall throughput of the entire neural network. FINN attempts to parallelize each layer such that they all take a similar number of cycles, and less than the corresponding number of cycles that would be required to meet `target_fps`. Additionally by summing up all layer cycle estimates one can obtain an estimate for the overall latency of the whole network. \n",
    "\n",
    "Finally, we can see the layer-by-layer resource estimates in the `estimate_layer_resources.json` report:"
   ]
@@ -341,7 +341,9 @@
   "source": [
    "## Launch a Build: Stitched IP, out-of-context synth and rtlsim Performance <a id=\"build_ip_synth_rtlsim\"></a>\n",
    "\n",
-    "Once we have a configuration that gives satisfactory estimates, we can move on to generating the accelerator. We can do this in different ways depending on how we want to integrate the accelerator into a larger system. For instance, if we have a larger streaming system built in Vivado or if we'd like to re-use this generated accelerator as an IP component in other projects, the `STITCHED_IP` output product is a good choice. We can also use the `OOC_SYNTH` output product to get post-synthesis resource and clock frequency numbers for our accelerator."
+    "Once we have a configuration that gives satisfactory estimates, we can move on to generating the accelerator. We can do this in different ways depending on how we want to integrate the accelerator into a larger system. For instance, if we have a larger streaming system built in Vivado or if we'd like to re-use this generated accelerator as an IP component in other projects, the `STITCHED_IP` output product is a good choice. We can also use the `OOC_SYNTH` output product to get post-synthesis resource and clock frequency numbers for our accelerator.\n",
+    "\n",
+    "**NOTE: These next builds will take several minutes since multiple calls to Vivado and a call to the RTL simulator are involved.**"
   ]
  },
  {

 %% Cell type:markdown id: tags:

 # Building the Streaming Dataflow Accelerator

 **Important: This notebook depends on the 2-export-to-finn-and-verify notebook because we are using the model that was created there. So please make sure the needed .onnx file is generated prior to running this notebook.**

 <img align="left" src="finn-example.png" alt="drawing" style="margin-right: 20px" width="250"/>

 In this notebook, we'll use the FINN compiler to generate an FPGA accelerator with a streaming dataflow architecture from our quantized MLP for the cybersecurity task. The key idea in such architectures is to parallelize across layers as well as within layers by dedicating a proportionate amount of compute resources to each layer, as illustrated in the figure to the left. You can read more about the general concept in the [FINN](https://arxiv.org/pdf/1612.07119) and [FINN-R](https://dl.acm.org/doi/pdf/10.1145/3242897) papers. This is done by mapping each layer to a Vivado HLS description, parallelizing each layer's implementation to the appropriate degree and using on-chip FIFOs to link up the layers to create the full accelerator.

 These implementations offer a good balance of performance and flexibility, but building them by hand is difficult and time-consuming. This is where the FINN compiler comes in: it can build streaming dataflow accelerators from an ONNX description to match the desired throughput.

 %% Cell type:markdown id: tags:

 ## Outline
 -------------

 1. [Introduction to  `build_dataflow` Tool](#intro_build_dataflow)
 2. [Understanding the Build Configuration: `DataflowBuildConfig`](#underst_build_conf)
    2.1. [Output Products](#output_prod)
    2.2. [Configuring the Board and FPGA Part](#config_fpga)
    2.3. [Configuring the Performance](#config_perf)
 3. [Launch a Build: Only Estimate Reports](#build_estimate_report)
 4. [Launch a Build: Stitched IP, out-of-context synth and rtlsim Performance](#build_ip_synth_rtlsim)
 5. [Launch a Build: PYNQ Bitfile and Driver](#build_bitfile_driver)

 %% Cell type:markdown id: tags:

 ## Introduction to  `build_dataflow` Tool <a id="intro_build_dataflow"></a>

 Since version 0.5b, the FINN compiler has a `build_dataflow` tool. Compared to previous versions, which required setting up all the needed transformations in a Python script, it makes experimenting with dataflow architecture generation easier. The core idea is to specify the relevant build info as a configuration `dict`, which invokes all the necessary steps to make the dataflow build happen. It can be invoked either from the [command line](https://finn-dev.readthedocs.io/en/latest/command_line.html) or with a single Python function call.


-In this notebook, we'll use the Python function call to invoke the builds to stay inside the Jupyter notebook, but feel free to experiment with reproducing what we do here with the `./run-docker.sh build_dataflow` and `./run-docker.sh build_custom` command-line entry points too, as documented [here]((https://finn-dev.readthedocs.io/en/latest/command_line.html)).
+In this notebook, we'll use the Python function call to invoke the builds to stay inside the Jupyter notebook, but feel free to experiment with reproducing what we do here with the `./run-docker.sh build_dataflow` and `./run-docker.sh build_custom` command-line entry points too, as documented [here](https://finn-dev.readthedocs.io/en/latest/command_line.html).

 %% Cell type:markdown id: tags:

 ## Understanding the Build Configuration: `DataflowBuildConfig` <a id="underst_build_conf"></a>

 The build configuration is specified by an instance of `finn.builder.build_dataflow_config.DataflowBuildConfig`. The configuration is a Python [`dataclass`](https://docs.python.org/3/library/dataclasses.html) which can be serialized into or de-serialized from JSON files for persistence, although we'll just set it up in Python here.
 There are many options in the configuration to customize different aspects of the build, but we'll only cover a few of them in this notebook. You can read the details on all the config options in [the FINN API documentation](https://finn-dev.readthedocs.io/en/latest/source_code/finn.builder.html#finn.builder.build_dataflow_config.DataflowBuildConfig).
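As a rough illustration of the dataclass-plus-JSON idea, the sketch below round-trips a small configuration through JSON using only the Python standard library. `ToyBuildConfig` is a hypothetical stand-in with three fields borrowed from this notebook; it is not the real `DataflowBuildConfig`, which has many more options and may use different serialization helpers.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ToyBuildConfig:
    # hypothetical stand-in, NOT the real DataflowBuildConfig
    output_dir: str
    target_fps: int
    synth_clk_period_ns: float

cfg = ToyBuildConfig(
    output_dir="output_estimates_only",
    target_fps=1000000,
    synth_clk_period_ns=10.0,
)

# serialize to JSON text, then rebuild an equal instance from it
as_json = json.dumps(asdict(cfg), indent=2)
restored = ToyBuildConfig(**json.loads(as_json))
print(as_json)
```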

 Let's go over some of the members of the `DataflowBuildConfig`:

 ### Output Products <a id="output_prod"></a>

 The build can produce many different outputs, and some of them can take a long time (e.g. bitfile synthesis for a large network). When you first start working on generating a new accelerator and exploring the different performance options, you may not want to go all the way to a bitfile. Thus, in the beginning you may just select the estimate reports as the output products. Gradually, you can generate the output products from later stages until you are happy enough with the design to build the full accelerator integrated into a shell.

 The output products are controlled by:

 * `generate_outputs`: list of output products (of type [`finn.builder.build_dataflow_config.DataflowOutputType`](https://finn-dev.readthedocs.io/en/latest/source_code/finn.builder.html#finn.builder.build_dataflow_config.DataflowOutputType)) that will be generated by the build. Some available options are:
    - `ESTIMATE_REPORTS` : report expected resources and performance per layer and for the whole network without any synthesis
    - `STITCHED_IP` : create a stream-in stream-out IP design that can be integrated into other Vivado IPI or RTL designs
    - `RTLSIM_PERFORMANCE` : use PyVerilator to do a performance/latency test of the `STITCHED_IP` design
    - `OOC_SYNTH` : run out-of-context synthesis (just the accelerator itself, without any system surrounding it) on the `STITCHED_IP` design to get post-synthesis FPGA resources and achievable clock frequency
    - `BITFILE` : integrate the accelerator into a shell to produce a standalone bitfile
    - `PYNQ_DRIVER` : generate a PYNQ Python driver that can be used to launch the accelerator
    - `DEPLOYMENT_PACKAGE` : create a folder with the `BITFILE` and `PYNQ_DRIVER` outputs, ready to be copied to the target FPGA platform.
 * `output_dir`: the directory where all the generated build outputs above will be written.
 * `steps`: list of predefined (or custom) build steps FINN will go through. Use `build_dataflow_config.estimate_only_dataflow_steps` to execute only the steps needed for estimation (without any synthesis), and the `build_dataflow_config.default_build_dataflow_steps` otherwise (which is the default value).

 ### Configuring the Board and FPGA Part <a id="config_fpga"></a>

 * `fpga_part`: Xilinx FPGA part to be used for synthesis, can be left unspecified to be inferred from `board` below, or specified explicitly for e.g. out-of-context synthesis.
 * `board`: target Xilinx Zynq or Alveo board for generating accelerators integrated into a shell. See the `pynq_part_map` and `alveo_part_map` dicts in [this file](https://github.com/Xilinx/finn-base/blob/dev/src/finn/util/basic.py#L41) for a list of possible boards.
 * `shell_flow_type`: the target [shell flow type](https://finn-dev.readthedocs.io/en/latest/source_code/finn.builder.html#finn.builder.build_dataflow_config.ShellFlowType), only needed for generating full bitfiles where the FINN design is integrated into a shell (so only needed if `BITFILE` is selected)

 ### Configuring the Performance <a id="config_perf"></a>

 You can configure the performance (and correspondingly, the FPGA resource footprint) of the generated accelerator in two ways:

 1) (basic) Set a target performance and let the compiler figure out the per-node parallelization settings.

 2) (advanced) Specify a separate .json as `folding_config_file` that lists the degree of parallelization (as well as other hardware options) for each layer.

 This notebook only deals with the basic approach, for which you need to set up:

 * `target_fps`: target inference performance in frames per second. Note that the target may not be achievable due to specific layer constraints, or due to resource limitations of the FPGA.
 * `synth_clk_period_ns`: target clock period (in nanoseconds) for Vivado synthesis, e.g. `synth_clk_period_ns=5.0` will target a 200 MHz clock. Note that the target clock period may not be achievable depending on the FPGA part and design complexity.
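To get a feel for how these two settings interact, the sketch below converts a clock period into a frequency and derives how many clock cycles are available per frame at a given target. The helper name is hypothetical and the calculation is plain arithmetic, not a FINN API call; the example values are the ones used in the estimate-only build in this notebook.

```python
def cycles_per_frame_budget(target_fps, synth_clk_period_ns):
    """Clock cycles available per frame if target_fps is to be met."""
    clk_hz = 1e9 / synth_clk_period_ns
    return clk_hz / target_fps

# 10 ns period (100 MHz) and 1,000,000 frames per second
budget = cycles_per_frame_budget(target_fps=1_000_000, synth_clk_period_ns=10.0)
print("cycles per frame budget:", budget)

# a 5 ns period corresponds to a 200 MHz clock
clk_mhz = 1e3 / 5.0
print("clock for 5 ns period: %.0f MHz" % clk_mhz)
```

Under these assumptions, a layer has to finish a frame within about 100 cycles for the target throughput to be reachable.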

 %% Cell type:markdown id: tags:

 ## Launch a Build: Only Estimate Reports <a id="build_estimate_report"></a>

 First, we'll launch a build that only generates the estimate reports, which does not require any synthesis. Note two things below: how the `generate_outputs` only contains `ESTIMATE_REPORTS`, but also how the `steps` uses a value of `estimate_only_dataflow_steps`. This skips steps like HLS synthesis to provide a quick estimate from analytical models.

 %% Cell type:code id: tags:

 ``` python
 import finn.builder.build_dataflow as build
 import finn.builder.build_dataflow_config as build_cfg

 model_file = "cybsec-mlp-verified.onnx"

 estimates_output_dir = "output_estimates_only"

 cfg = build.DataflowBuildConfig(
    output_dir          = estimates_output_dir,
    target_fps          = 1000000,
    synth_clk_period_ns = 10.0,
    fpga_part           = "xc7z020clg400-1",
    steps               = build_cfg.estimate_only_dataflow_steps,
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ]
 )

 build.build_dataflow_cfg(model_file, cfg)
 ```

 %% Output

    Building dataflow accelerator from cybsec-mlp-verified.onnx
    Intermediate outputs will be generated in /tmp/finn_dev_osboxes
    Final outputs will be generated in output_estimates_only
    Build log is at output_estimates_only/build_dataflow.log
    Running step: step_tidy_up [1/7]
    Running step: step_streamline [2/7]
    Running step: step_convert_to_hls [3/7]
    Running step: step_create_dataflow_partition [4/7]
    Running step: step_target_fps_parallelization [5/7]
    Running step: step_apply_folding_config [6/7]
    Running step: step_generate_estimate_reports [7/7]
    Completed successfully

    0

 %% Cell type:markdown id: tags:

 We'll now examine the generated outputs from this build. If we look under the outputs directory, we'll find a subfolder with the generated estimate reports.

 %% Cell type:code id: tags:

 ``` python
 ! ls {estimates_output_dir}
 ```

 %% Output

    build_dataflow.log  intermediate_models  report  time_per_step.json

 %% Cell type:code id: tags:

 ``` python
 ! ls {estimates_output_dir}/report
 ```

 %% Output

    estimate_layer_config_alternatives.json  estimate_network_performance.json
    estimate_layer_cycles.json		 op_and_param_counts.json
    estimate_layer_resources.json

 %% Cell type:markdown id: tags:

 We see that various reports have been generated as .json files. Let's examine the contents of the `estimate_network_performance.json` for starters. Here, we can see the analytical estimates for the performance and latency.

 %% Cell type:code id: tags:

 ``` python
 ! cat {estimates_output_dir}/report/estimate_network_performance.json
 ```

 %% Output

    {
      "critical_path_cycles": 272,
      "max_cycles": 80,
      "max_cycles_node_name": "StreamingFCLayer_Batch_0",
      "estimated_throughput_fps": 1250000.0,
      "estimated_latency_ns": 2720.0
    }

 %% Cell type:markdown id: tags:

 Since all of these reports are .json files, we can easily load them into Python for further processing. Let's define a helper function and look at the `estimate_layer_cycles.json` report.

 %% Cell type:code id: tags:

 ``` python
 import json
 def read_json_dict(filename):
    with open(filename, "r") as f:
        ret = json.load(f)
    return ret
 ```

 %% Cell type:code id: tags:

 ``` python
 read_json_dict(estimates_output_dir + "/report/estimate_layer_cycles.json")
 ```

 %% Output

    {'StreamingFCLayer_Batch_0': 80,
     'StreamingFCLayer_Batch_1': 64,
     'StreamingFCLayer_Batch_2': 64,
     'StreamingFCLayer_Batch_3': 64}

 %% Cell type:markdown id: tags:

-Here, we can see the estimated number of clock cycles each layer will take. Recall that all of these layers will be running in parallel, and the slowest layer will determine the overall throughput of the entire neural network. FINN attempts to parallelize each layer such that they all take a similar number of cycles, and less than the corresponding number of cycles that would be required to meet `target_fps`.
+Here, we can see the estimated number of clock cycles each layer will take. Recall that all of these layers will be running in parallel, and the slowest layer will determine the overall throughput of the entire neural network. FINN attempts to parallelize each layer such that they all take a similar number of cycles, and less than the corresponding number of cycles that would be required to meet `target_fps`. Additionally, by summing up all layer cycle estimates, one can obtain an estimate for the overall latency of the whole network.
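
To make this concrete, the following sketch finds the bottleneck layer, sums the per-layer cycles, and compares each layer against the cycle budget implied by a target throughput. It assumes `target_fps = 1000000` and a 10 ns clock period, the values used for the other builds in this notebook; the helper name `cycle_budget` is hypothetical.

```python
def cycle_budget(target_fps, clk_period_ns):
    """Maximum cycles per sample that still meets target_fps (hypothetical helper)."""
    return 1e9 / (target_fps * clk_period_ns)


# Per-layer estimates copied from the report above.
layer_cycles = {
    "StreamingFCLayer_Batch_0": 80,
    "StreamingFCLayer_Batch_1": 64,
    "StreamingFCLayer_Batch_2": 64,
    "StreamingFCLayer_Batch_3": 64,
}

# The slowest layer limits the throughput of the whole pipeline.
bottleneck = max(layer_cycles, key=layer_cycles.get)
# The sum over all layers approximates the end-to-end latency in cycles.
total_cycles = sum(layer_cycles.values())
# Assumed settings: target_fps = 1000000 and a 10 ns clock period.
budget = cycle_budget(1000000, 10.0)
within_budget = all(c <= budget for c in layer_cycles.values())

print(bottleneck, total_cycles, budget, within_budget)
```

The summed cycle count equals the `critical_path_cycles` value in `estimate_network_performance.json`.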

 Finally, we can see the layer-by-layer resource estimates in the `estimate_layer_resources.json` report:

 %% Cell type:code id: tags:

 ``` python
 read_json_dict(estimates_output_dir + "/report/estimate_layer_resources.json")
 ```

 %% Output

    {'StreamingFCLayer_Batch_0': {'BRAM_18K': 27,
      'BRAM_efficiency': 0.15432098765432098,
      'LUT': 8149,
      'URAM': 0,
      'URAM_efficiency': 1,
      'DSP': 0},
     'StreamingFCLayer_Batch_1': {'BRAM_18K': 4,
      'BRAM_efficiency': 0.1111111111111111,
      'LUT': 1435,
      'URAM': 0,
      'URAM_efficiency': 1,
      'DSP': 0},
     'StreamingFCLayer_Batch_2': {'BRAM_18K': 4,
      'BRAM_efficiency': 0.1111111111111111,
      'LUT': 1435,
      'URAM': 0,
      'URAM_efficiency': 1,
      'DSP': 0},
     'StreamingFCLayer_Batch_3': {'BRAM_18K': 1,
      'BRAM_efficiency': 0.006944444444444444,
      'LUT': 341,
      'URAM': 0,
      'URAM_efficiency': 1,
      'DSP': 0},
     'total': {'BRAM_18K': 36.0, 'LUT': 11360.0, 'URAM': 0.0, 'DSP': 0.0}}

 %% Cell type:markdown id: tags:

 This particular report is useful to determine whether the current configuration will fit into a particular FPGA. If you see that the resource requirements are too high for the FPGA you had in mind, you should consider lowering the `target_fps`.

 *Note that the analytical models tend to over-estimate how many resources are needed, since they can't capture the effects of various synthesis optimizations.*
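
One way to check the fit is to compare the `total` entry against the resources available on the target device. The sketch below does this for an xc7z020. The device budget used here (53,200 LUTs, 280 18Kb block RAMs, 220 DSP slices, no URAM) is an assumption that should be verified against the datasheet for your part, and the helper name `utilization` is hypothetical.

```python
def utilization(total, device):
    """Return the fraction of each device resource used (hypothetical helper)."""
    result = {}
    for name, used in total.items():
        available = device.get(name, 0)
        if available == 0:
            # A resource the device lacks only fits if none of it is needed.
            result[name] = 0.0 if used == 0 else float("inf")
        else:
            result[name] = used / available
    return result


# 'total' entry copied from the report above.
total = {"BRAM_18K": 36.0, "LUT": 11360.0, "URAM": 0.0, "DSP": 0.0}
# Assumed resource budget for an xc7z020; verify against the datasheet.
xc7z020 = {"BRAM_18K": 280, "LUT": 53200, "URAM": 0, "DSP": 220}

util = utilization(total, xc7z020)
fits = all(frac <= 1.0 for frac in util.values())
print(util, fits)
```

Since the analytical models tend to over-estimate, a design that fits according to this check leaves some margin, but the post-synthesis reports remain the authoritative numbers.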

 %% Cell type:markdown id: tags:

 ## Launch a Build: Stitched IP, out-of-context synth and rtlsim Performance <a id="build_ip_synth_rtlsim"></a>

 Once we have a configuration that gives satisfactory estimates, we can move on to generating the accelerator. We can do this in different ways depending on how we want to integrate the accelerator into a larger system. For instance, if we have a larger streaming system built in Vivado or if we'd like to re-use this generated accelerator as an IP component in other projects, the `STITCHED_IP` output product is a good choice. We can also use the `OOC_SYNTH` output product to get post-synthesis resource and clock frequency numbers for our accelerator.

+**NOTE: These next builds will take several minutes since multiple calls to Vivado and a call to the RTL simulator are involved.**
+
 %% Cell type:code id: tags:

 ``` python
 import finn.builder.build_dataflow as build
 import finn.builder.build_dataflow_config as build_cfg

 model_file = "cybsec-mlp-verified.onnx"

 rtlsim_output_dir = "output_ipstitch_ooc_rtlsim"

 cfg = build.DataflowBuildConfig(
    output_dir          = rtlsim_output_dir,
    target_fps          = 1000000,
    synth_clk_period_ns = 10.0,
    fpga_part           = "xc7z020clg400-1",
    generate_outputs=[
        build_cfg.DataflowOutputType.STITCHED_IP,
        build_cfg.DataflowOutputType.RTLSIM_PERFORMANCE,
        build_cfg.DataflowOutputType.OOC_SYNTH,
    ]
 )

 build.build_dataflow_cfg(model_file, cfg)
 ```

 %% Output

    Building dataflow accelerator from cybsec-mlp-verified.onnx
    Intermediate outputs will be generated in /tmp/finn_dev_osboxes
    Final outputs will be generated in output_ipstitch_ooc_rtlsim
    Build log is at output_ipstitch_ooc_rtlsim/build_dataflow.log
    Running step: step_tidy_up [1/15]
    Running step: step_streamline [2/15]
    Running step: step_convert_to_hls [3/15]
    Running step: step_create_dataflow_partition [4/15]
    Running step: step_target_fps_parallelization [5/15]
    Running step: step_apply_folding_config [6/15]
    Running step: step_generate_estimate_reports [7/15]
    Running step: step_hls_ipgen [8/15]
    Running step: step_set_fifo_depths [9/15]
    Running step: step_create_stitched_ip [10/15]
    Running step: step_measure_rtlsim_performance [11/15]
    Running step: step_make_pynq_driver [12/15]
    Running step: step_out_of_context_synthesis [13/15]
    Running step: step_synthesize_bitfile [14/15]
    Running step: step_deployment_package [15/15]
    Completed successfully

    0

 %% Cell type:markdown id: tags:

 Among the output products, we will find the accelerator exported as IP:

 %% Cell type:code id: tags:

 ``` python
 ! ls {rtlsim_output_dir}/stitched_ip
 ```

 %% Output

    all_verilog_srcs.txt		       finn_vivado_stitch_proj.xpr
    finn_vivado_stitch_proj.cache	       ip
    finn_vivado_stitch_proj.hbs	       make_project.sh
    finn_vivado_stitch_proj.hw	       make_project.tcl
    finn_vivado_stitch_proj.ip_user_files  vivado.jou
    finn_vivado_stitch_proj.srcs	       vivado.log

 %% Cell type:markdown id: tags:

 We also have a few reports generated by these output products, different from the ones generated by `ESTIMATE_REPORTS`.

 %% Cell type:code id: tags:

 ``` python
 ! ls {rtlsim_output_dir}/report
 ```

 %% Output

    estimate_layer_resources_hls.json  rtlsim_performance.json
    ooc_synth_and_timing.json

 %% Cell type:markdown id: tags:

 In `ooc_synth_and_timing.json` we can find the post-synthesis resource usage and the maximum clock frequency estimate for the accelerator. Note that the clock frequency estimate here tends to be optimistic, since out-of-context synthesis is less constrained.

 %% Cell type:code id: tags:

 ``` python
 ! cat {rtlsim_output_dir}/report/ooc_synth_and_timing.json
 ```

 %% Output

    {
      "vivado_proj_folder": "/tmp/finn_dev_osboxes/synth_out_of_context_wy3b6qf4/results_finn_design_wrapper",
      "LUT": 7073.0,
      "FF": 7534.0,
      "DSP": 0.0,
      "BRAM": 18.0,
      "WNS": 0.632,
      "": 0,
      "fmax_mhz": 106.7463706233988,
      "estimated_throughput_fps": 1334329.6327924852
    }
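
The `fmax_mhz` and `estimated_throughput_fps` fields in this report are consistent with a simple calculation from the worst negative slack (`WNS`). The sketch below assumes that WNS is given in ns relative to the 10 ns clock period set via `synth_clk_period_ns`, and that throughput is limited by the 80-cycle bottleneck layer from the estimate reports. It illustrates the relationship only and is not FINN's own code; both function names are hypothetical.

```python
def fmax_from_wns(clk_period_ns, wns_ns):
    """Highest clock frequency (MHz) at which timing would still be met."""
    # Positive slack means the critical path is shorter than the clock period.
    return 1000.0 / (clk_period_ns - wns_ns)


def throughput_at(fmax_mhz, max_cycles):
    """Samples per second if the bottleneck layer needs max_cycles per sample."""
    return fmax_mhz * 1e6 / max_cycles


# Clock period and WNS from the build above; 80 cycles from the estimate report.
fmax = fmax_from_wns(10.0, 0.632)
fps = throughput_at(fmax, 80)
print(fmax, fps)
```

A negative WNS would push the computed frequency below the requested clock, indicating that the design does not meet timing at the chosen `synth_clk_period_ns`.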

 %% Cell type:markdown id: tags:

 In `rtlsim_performance.json` we can find the steady-state throughput and latency for the accelerator, as obtained by rtlsim. If the DRAM bandwidth numbers reported here are below what the hardware platform is capable of (i.e. the accelerator is not memory-bound), you can expect the same steady-state throughput in real hardware.

 %% Cell type:code id: tags:

 ``` python
 ! cat {rtlsim_output_dir}/report/rtlsim_performance.json
 ```

 %% Output

    {
      "cycles": 838,
      "runtime[ms]": 0.00838,
      "throughput[images/s]": 954653.9379474939,
      "DRAM_in_bandwidth[Mb/s]": 71.59904534606204,
      "DRAM_out_bandwidth[Mb/s]": 0.11933174224343673,
      "fclk[mhz]": 100.0,
      "N": 8,
      "latency_cycles": 229
    }
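
The runtime and throughput fields can be cross-checked from the raw cycle count. The sketch below assumes that `N` is the number of samples pushed through the simulation and that `cycles` covers the whole batch, which is consistent with the reported numbers; the function name `rtlsim_summary` is hypothetical.

```python
def rtlsim_summary(cycles, n_samples, fclk_mhz):
    """Derive runtime (ms) and throughput (samples/s) from an rtlsim run."""
    runtime_s = cycles / (fclk_mhz * 1e6)
    return runtime_s * 1e3, n_samples / runtime_s


# Values copied from the report above.
runtime_ms, throughput = rtlsim_summary(cycles=838, n_samples=8, fclk_mhz=100.0)
# Latency of a single sample in ns at the same clock frequency.
latency_ns = 229 / 100.0 * 1e3
print(runtime_ms, throughput, latency_ns)
```

Compared with the analytical estimates, the simulated latency (229 cycles) is below the estimated 272 cycles, while the simulated throughput for this batch of 8 is below the estimated 1,250,000 samples per second.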

 %% Cell type:markdown id: tags:

 Finally, let's have a look at `final_hw_config.json`. This is the node-by-node hardware configuration determined by the FINN compiler, including FIFO depths, parallelization settings (PE/SIMD) and others. If you want to optimize your build further (the "advanced" method we mentioned under "Configuring the performance"), you can pass this .json file as the `folding_config_file` for a new run and use it as a starting point for further exploration and optimization.

 %% Cell type:code id: tags:

 ``` python
 ! cat {rtlsim_output_dir}/final_hw_config.json
 ```

 %% Output

    {
      "Defaults": {},
      "StreamingFIFO_0": {
        "ram_style": "auto",
        "depth": 32,
        "impl_style": "rtl"
      },
      "StreamingFCLayer_Batch_0": {
        "PE": 32,
        "SIMD": 15,
        "ram_style": "auto",
        "resType": "lut",
        "mem_mode": "decoupled",
        "runtime_writeable_weights": 0
      },
      "StreamingDataWidthConverter_Batch_0": {
        "impl_style": "hls"
      },
      "StreamingFCLayer_Batch_1": {
        "PE": 4,
        "SIMD": 16,
        "ram_style": "auto",
        "resType": "lut",
        "mem_mode": "decoupled",
        "runtime_writeable_weights": 0
      },
      "StreamingDataWidthConverter_Batch_1": {
        "impl_style": "hls"
      },
      "StreamingFCLayer_Batch_2": {
        "PE": 4,
        "SIMD": 16,
        "ram_style": "auto",
        "resType": "lut",
        "mem_mode": "decoupled",
        "runtime_writeable_weights": 0
      },
      "StreamingDataWidthConverter_Batch_2": {
        "impl_style": "hls"
      },
      "StreamingFCLayer_Batch_3": {
        "PE": 1,
        "SIMD": 1,
        "ram_style": "auto",
        "resType": "lut",
        "mem_mode": "decoupled",
        "runtime_writeable_weights": 0
      }
    }
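
As a sketch of that workflow, the snippet below takes a hardware configuration, overrides the parallelization of one layer, and writes the result to a new file that could then be passed as `folding_config_file`. The helper `override_folding`, the output filename, and the chosen PE value are all hypothetical; valid PE/SIMD values depend on the layer dimensions, which this example does not check.

```python
import copy
import json
import os
import tempfile


def override_folding(hw_cfg, overrides):
    """Return a copy of hw_cfg with per-node attributes replaced.

    Hypothetical helper; overrides maps node names to attribute dicts.
    """
    new_cfg = copy.deepcopy(hw_cfg)
    for node_name, attrs in overrides.items():
        if node_name not in new_cfg:
            raise KeyError(f"unknown node: {node_name}")
        new_cfg[node_name].update(attrs)
    return new_cfg


# A short excerpt of the configuration shown above, inlined so the example
# runs on its own. In the notebook you would instead load the real file with
# read_json_dict(rtlsim_output_dir + "/final_hw_config.json").
hw_cfg = {
    "Defaults": {},
    "StreamingFCLayer_Batch_1": {"PE": 4, "SIMD": 16, "ram_style": "auto"},
}

# Illustrative value only: reduce the PE count of one layer.
new_cfg = override_folding(hw_cfg, {"StreamingFCLayer_Batch_1": {"PE": 2}})

# Hypothetical output location; a temporary directory keeps the demo tidy.
out_path = os.path.join(tempfile.mkdtemp(), "my_folding_config.json")
with open(out_path, "w") as f:
    json.dump(new_cfg, f, indent=2)
print(out_path)
```

The original dictionary is left untouched, so several variants can be derived from the same starting point and compared using the estimate reports before launching a full build.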

 %% Cell type:markdown id: tags:

 ## Launch a Build: PYNQ Bitfile and Driver <a id="build_bitfile_driver"></a>

 %% Cell type:code id: tags:

 ``` python
 import finn.builder.build_dataflow as build
 import finn.builder.build_dataflow_config as build_cfg

 model_file = "cybsec-mlp-verified.onnx"

 final_output_dir = "output_final"

 cfg = build.DataflowBuildConfig(
    output_dir          = final_output_dir,
    target_fps          = 1000000,
    synth_clk_period_ns = 10.0,
    board               = "Pynq-Z1",
    shell_flow_type     = build_cfg.ShellFlowType.VIVADO_ZYNQ,
    generate_outputs=[
        build_cfg.DataflowOutputType.BITFILE,
        build_cfg.DataflowOutputType.PYNQ_DRIVER,
        build_cfg.DataflowOutputType.DEPLOYMENT_PACKAGE,
    ]
 )

 build.build_dataflow_cfg(model_file, cfg)
 ```

 %% Output

    Building dataflow accelerator from cybsec-mlp-verified.onnx
    Intermediate outputs will be generated in /tmp/finn_dev_osboxes
    Final outputs will be generated in output_final
    Build log is at output_final/build_dataflow.log
    Running step: step_tidy_up [1/15]
    Running step: step_streamline [2/15]
    Running step: step_convert_to_hls [3/15]
    Running step: step_create_dataflow_partition [4/15]
    Running step: step_target_fps_parallelization [5/15]
    Running step: step_apply_folding_config [6/15]
    Running step: step_generate_estimate_reports [7/15]
    Running step: step_hls_ipgen [8/15]
    Running step: step_set_fifo_depths [9/15]
    Running step: step_create_stitched_ip [10/15]
    Running step: step_measure_rtlsim_performance [11/15]
    Running step: step_make_pynq_driver [12/15]
    Running step: step_out_of_context_synthesis [13/15]
    Running step: step_synthesize_bitfile [14/15]
    Running step: step_deployment_package [15/15]
    Completed successfully

    0

 %% Cell type:markdown id: tags:

 For our final build, the output products include the bitfile (and the accompanying .hwh file, also needed to execute correctly on PYNQ for Zynq platforms):

 %% Cell type:code id: tags:

 ``` python
 ! ls {final_output_dir}/bitfile
 ```

 %% Output

    finn-accel.bit	finn-accel.hwh

 %% Cell type:markdown id: tags:

 The generated Python driver lets us execute the accelerator on PYNQ platforms with simple numpy I/O. You can find some notebooks showing how to use FINN-generated accelerators at runtime in the [finn-examples](https://github.com/Xilinx/finn-examples) repository.

 %% Cell type:code id: tags:

 ``` python
 ! ls {final_output_dir}/driver
 ```

 %% Output

    driver.py  driver_base.py  finn  runtime_weights  validate.py

 %% Cell type:markdown id: tags:

 The `report` folder contains the post-synthesis resource and timing reports:

 %% Cell type:code id: tags:

 ``` python
 ! ls {final_output_dir}/report
 ```

 %% Output

    estimate_layer_resources_hls.json  post_synth_resources.xml
    post_route_timing.rpt

 %% Cell type:markdown id: tags:

 Finally, we have the `deploy` folder which contains everything you need to copy onto the target board to get the accelerator running:

 %% Cell type:code id: tags:

 ``` python
 ! ls {final_output_dir}/deploy
 ```

 %% Output

    bitfile  driver
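
Before copying the deployment package to the board, it can be useful to confirm that the expected files are present. The sketch below checks for the bitfile, the .hwh file and the driver script listed above. The helper name `missing_deploy_files` is hypothetical, and the assumption that `deploy` mirrors the contents of the `bitfile` and `driver` folders shown earlier is for illustration only.

```python
import os
import tempfile

# Relative paths expected inside the deploy folder (assumed layout).
EXPECTED_FILES = [
    os.path.join("bitfile", "finn-accel.bit"),
    os.path.join("bitfile", "finn-accel.hwh"),
    os.path.join("driver", "driver.py"),
]


def missing_deploy_files(deploy_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent from deploy_dir."""
    return [p for p in expected if not os.path.isfile(os.path.join(deploy_dir, p))]


# Self-contained demonstration on a temporary directory that only has a driver.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "driver"))
open(os.path.join(demo_dir, "driver", "driver.py"), "w").close()

missing = missing_deploy_files(demo_dir)
print(missing)
```

In the notebook, calling `missing_deploy_files(final_output_dir + "/deploy")` should return an empty list if the layout assumption holds.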

 %% Cell type:code id: tags:

 ``` python
 ```