%% Cell type:markdown id: tags:
# End-to-End FINN Flow for a Simple Convolutional Net
-----------------------------------------------------------------
In this notebook, we will go through the FINN steps needed to take a binarized convolutional network all the way down to a heterogeneous streaming dataflow accelerator running on the FPGA.
It's recommended to go through the simpler [end-to-end notebook for a fully connected network](tfc_end2end_example.ipynb) first, since many steps here are very similar and we will focus on what is done differently for convolutions.
This notebook is quite lengthy, and some of the cells (involving Vivado synthesis) may take up to an hour to finish running. To let you save and resume your progress, we will save the intermediate ONNX models that are generated in the various steps to disk, so that you can jump back directly to where you left off.
%% Cell type:markdown id: tags:
## Quick Introduction to the CNV-w1a1 Network
The particular quantized neural network (QNN) we will be targeting in this notebook is referred to as CNV-w1a1 and it classifies 32x32 RGB images into one of ten CIFAR-10 classes. All weights and activations in this network are quantized to bipolar values (either -1 or +1), with the exception of the input (which is RGB with 8 bits per channel) and the final output (which is 32-bit numbers). It first appeared in the original [FINN paper](https://arxiv.org/abs/1612.07119) from ISFPGA'17 with the name CNV, as a variant of the binarized convolutional network from the [BinaryNet paper](https://arxiv.org/abs/1602.02830), in turn inspired by the VGG-11 topology which was the runner-up for the 2014 [ImageNet Large Scale Visual Recognition Challenge](http://www.image-net.org/challenges/LSVRC/).
You'll have a chance to interactively examine the layers that make up the network in Netron in a moment, so that's enough about the network for now.
%% Cell type:markdown id: tags:
## Quick Recap of the End-to-End Flow
The FINN compiler comes with many *transformations* that modify the ONNX representation of the network according to certain patterns. This notebook will demonstrate a *possible* sequence of such transformations to take a particular trained network all the way down to hardware, as shown in the figure below.
%% Cell type:markdown id: tags:
![](finn-design-flow-example.svg)
%% Cell type:markdown id: tags:
The white fields show the state of the network representation in the respective step. The colored fields represent the transformations that are applied to the network to achieve a certain result. The diagram is divided into 5 sections, each represented by a different color and comprising several flow steps. The flow starts in the top left corner with Brevitas export (green section), followed by the preparation of the network (blue section) for the Vivado HLS synthesis and Vivado IPI stitching (orange section), and finally building a PYNQ overlay bitfile and testing it on a PYNQ board (yellow section).
There is an additional section for functional verification (red section) on the left side of the diagram, which we will not cover in this notebook. For details, please take a look at the verification notebook, which you can find [here](tfc_end2end_verification.ipynb).
We will use the helper function `showInNetron` to show the ONNX model at the current transformation step. The Netron displays are interactive, but they only work when running the notebook actively and not on GitHub (i.e. if you are viewing this on GitHub you'll only see blank squares).
%% Cell type:code id: tags:
``` python
from finn.util.basic import make_build_dir
from finn.util.visualization import showInNetron
import os
build_dir = os.environ["FINN_BUILD_DIR"]
```
%% Cell type:markdown id: tags:
## 1. Brevitas Export, FINN Import and Tidy-Up
Similar to what we did in the TFC-w1a1 end-to-end notebook, we will start by exporting the [pretrained CNV-w1a1 network](https://github.com/Xilinx/brevitas/tree/master/src/brevitas_examples/bnn_pynq) to ONNX, importing that into FINN and running the "tidy-up" transformations to have a first look at the topology.
%% Cell type:code id: tags:
``` python
import onnx
from finn.util.test import get_test_model_trained
import brevitas.onnx as bo
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.transformation.infer_shapes import InferShapes
from qonnx.transformation.fold_constants import FoldConstants
from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNodeNames, RemoveStaticGraphInputs
cnv = get_test_model_trained("CNV", 1, 1)
bo.export_finn_onnx(cnv, (1, 3, 32, 32), build_dir + "/end2end_cnv_w1a1_export.onnx")
model = ModelWrapper(build_dir + "/end2end_cnv_w1a1_export.onnx")
model = model.transform(InferShapes())
model = model.transform(FoldConstants())
model = model.transform(GiveUniqueNodeNames())
model = model.transform(GiveReadableTensorNames())
model = model.transform(RemoveStaticGraphInputs())
model.save(build_dir + "/end2end_cnv_w1a1_tidy.onnx")
```
%% Cell type:markdown id: tags:
Now that the model is exported, let's have a look at its layer structure with Netron. Remember that the visualization below is interactive, you can click on the individual nodes and view the layer attributes, trained weights and so on.
%% Cell type:code id: tags:
``` python
showInNetron(build_dir+"/end2end_cnv_w1a1_tidy.onnx")
```
%% Cell type:markdown id: tags:
You can see that the network is composed of a repeating convolution-convolution-maxpool layer pattern that extracts features using 3x3 convolution kernels (with binarized weights), followed by fully connected layers acting as the classifier. Also notice the initial `MultiThreshold` layer at the beginning of the network, which quantizes the floating point inputs to 8-bit values.
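If you're curious what a `MultiThreshold` node actually computes, here is a minimal numpy sketch of the idea: each output is the number of thresholds the input crosses. This is a simplification (a single shared set of thresholds, no output scale/offset) and is not FINN's actual implementation.
```python
import numpy as np
# minimal sketch of the thresholding idea behind MultiThreshold (not FINN's
# actual implementation): each output is the number of thresholds crossed,
# mapping a floating point input to a small integer
def multithreshold_sketch(x, thresholds):
    return np.sum(x[..., np.newaxis] >= thresholds, axis=-1)

x = np.array([-0.7, 0.1, 0.9])
thresholds = np.array([-0.5, 0.0, 0.5])
print(multithreshold_sketch(x, thresholds))  # [0 2 3]
```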
%% Cell type:markdown id: tags:
### Adding Pre- and Postprocessing <a id='prepost'></a>
Preprocessing and postprocessing steps can be added directly in the ONNX graph. In this case, the preprocessing step divides the input `uint8` data by 255 so the inputs to the CNV-w1a1 network are bounded between [0, 1]. The postprocessing step takes the output of the network and returns the index (0-9) of the image category with the highest probability (top-1).
%% Cell type:code id: tags:
``` python
from finn.util.pytorch import ToTensor
from qonnx.transformation.merge_onnx_models import MergeONNXModels
from qonnx.core.datatype import DataType
model = ModelWrapper(build_dir+"/end2end_cnv_w1a1_tidy.onnx")
global_inp_name = model.graph.input[0].name
ishape = model.get_tensor_shape(global_inp_name)
# preprocessing: torchvision's ToTensor divides uint8 inputs by 255
totensor_pyt = ToTensor()
chkpt_preproc_name = build_dir+"/end2end_cnv_w1a1_preproc.onnx"
bo.export_finn_onnx(totensor_pyt, ishape, chkpt_preproc_name)
# join preprocessing and core model
pre_model = ModelWrapper(chkpt_preproc_name)
model = model.transform(MergeONNXModels(pre_model))
# add input quantization annotation: UINT8 for all BNN-PYNQ models
global_inp_name = model.graph.input[0].name
model.set_tensor_datatype(global_inp_name, DataType["UINT8"])
```
%% Cell type:code id: tags:
``` python
from qonnx.transformation.insert_topk import InsertTopK
from qonnx.transformation.infer_datatypes import InferDataTypes
# postprocessing: insert Top-1 node at the end
model = model.transform(InsertTopK(k=1))
chkpt_name = build_dir+"/end2end_cnv_w1a1_pre_post.onnx"
# tidy-up again
model = model.transform(InferShapes())
model = model.transform(FoldConstants())
model = model.transform(GiveUniqueNodeNames())
model = model.transform(GiveReadableTensorNames())
model = model.transform(InferDataTypes())
model = model.transform(RemoveStaticGraphInputs())
model.save(chkpt_name)
```
%% Cell type:code id: tags:
``` python
showInNetron(build_dir+"/end2end_cnv_w1a1_pre_post.onnx")
```
%% Cell type:markdown id: tags:
## 2. How FINN Implements Convolutions: Lowering and Streamlining
In FINN, we implement convolutions with the *lowering* approach: we convert them to matrix-matrix multiply operations, where one of the matrices is generated by sliding a window over the input image. You can read more about the sliding window operator and how convolution lowering works [in this notebook](https://github.com/maltanar/qnn-inference-examples/blob/master/3-convolutional-binarized-gtsrb.ipynb). The streaming dataflow architecture we will end up with is going to look something like this figure from the [FINN-R paper](https://arxiv.org/abs/1809.04570):
![](cnv-mp-fc.png)
Note how the convolution layer looks very similar to the fully connected one in terms of the matrix-vector-threshold unit (MVTU), but now the MVTU is preceded by a sliding window unit that produces the matrix from the input image. All of these building blocks, including the `MaxPool` layer you see in this figure, exist as templated Vivado HLS C++ functions in [finn-hlslib](https://github.com/Xilinx/finn-hlslib).
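To make the lowering idea concrete, here is a small numpy sketch of convolution expressed as an im2col step followed by a matrix multiply. It assumes a single image, stride 1 and no padding, and the names and shapes are illustrative only; this is not FINN's `Im2Col` implementation.
```python
import numpy as np
# minimal sketch of convolution lowering (single image, stride 1, no padding)
def conv_as_matmul(img, kernels, k):
    C, H, W = img.shape                      # channels, height, width
    out_h, out_w = H - k + 1, W - k + 1
    # im2col: every k x k sliding window becomes one row of a matrix
    cols = np.stack([
        img[:, i:i + k, j:j + k].reshape(-1)
        for i in range(out_h) for j in range(out_w)
    ])                                       # shape (out_h*out_w, C*k*k)
    w_mat = kernels.reshape(kernels.shape[0], -1)   # shape (out_ch, C*k*k)
    return (cols @ w_mat.T).reshape(out_h, out_w, -1)  # NHWC-style output

img = np.random.rand(3, 8, 8)                # placeholder 3-channel input
kernels = np.random.rand(16, 3, 3, 3)        # 16 output channels, 3x3 kernels
print(conv_as_matmul(img, kernels, 3).shape)  # (6, 6, 16)
```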
To target this kind of hardware architecture with our network we'll apply a convolution lowering transformation, in addition to streamlining. You may recall the *streamlining transformation* that we applied to the TFC-w1a1 network, which is a series of mathematical simplifications that allow us to get rid of floating point scaling operations by implementing few-bit activations as thresholding operations.
**The current implementation of streamlining is highly network-specific and may not work for your network if its topology is very different than the example network here. We hope to rectify this in future releases.**
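As a small illustration of the streamlining idea, a positive scale factor in front of a thresholding activation can be folded into the thresholds themselves, which is (in simplified form) how the floating point `Mul` nodes disappear. A minimal sketch, not the actual transformation code:
```python
import numpy as np
# for a positive scale s, comparing s*x against threshold t is the same as
# comparing x against t/s, so the multiply can be absorbed into the thresholds
x = np.random.randn(10)
scale, thresholds = 0.5, np.array([-1.0, 0.0, 1.0])
act_original = np.sum((scale * x)[:, None] >= thresholds, axis=1)
act_streamlined = np.sum(x[:, None] >= thresholds / scale, axis=1)
assert np.array_equal(act_original, act_streamlined)
```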
%% Cell type:code id: tags:
``` python
from finn.transformation.streamline import Streamline
from qonnx.transformation.lower_convs_to_matmul import LowerConvsToMatMul
from qonnx.transformation.bipolar_to_xnor import ConvertBipolarMatMulToXnorPopcount
import finn.transformation.streamline.absorb as absorb
from finn.transformation.streamline.reorder import MakeMaxPoolNHWC, MoveScalarLinearPastInvariants
from qonnx.transformation.infer_data_layouts import InferDataLayouts
from qonnx.transformation.general import RemoveUnusedTensors
model = ModelWrapper(build_dir + "/end2end_cnv_w1a1_pre_post.onnx")
model = model.transform(MoveScalarLinearPastInvariants())
model = model.transform(Streamline())
model = model.transform(LowerConvsToMatMul())
model = model.transform(MakeMaxPoolNHWC())
model = model.transform(absorb.AbsorbTransposeIntoMultiThreshold())
model = model.transform(ConvertBipolarMatMulToXnorPopcount())
model = model.transform(Streamline())
# absorb final add-mul nodes into TopK
model = model.transform(absorb.AbsorbScalarMulAddIntoTopK())
model = model.transform(InferDataLayouts())
model = model.transform(RemoveUnusedTensors())
model.save(build_dir + "/end2end_cnv_w1a1_streamlined.onnx")
```
%% Cell type:markdown id: tags:
We won't go into too much detail about what happens in each transformation and why they are called in this particular order (feel free to visualize the intermediate steps using Netron yourself if you are curious), but here is a brief summary:
* `Streamline` moves floating point scaling and addition operations closer to the input of the nearest thresholding activation and absorbs them into thresholds
* `LowerConvsToMatMul` converts ONNX `Conv` nodes into sequences of `Im2Col, MatMul` nodes as discussed above. `Im2Col` is a custom FINN ONNX high-level node type that implements the sliding window operator.
* `MakeMaxPoolNHWC` and `AbsorbTransposeIntoMultiThreshold` convert the *data layout* of the network into the NHWC data layout that finn-hlslib primitives use. NHWC means the tensor dimensions are ordered as `(N : batch, H : height, W : width, C : channels)` (assuming 2D images). The ONNX standard ops normally use the NCHW layout, but the ONNX intermediate representation itself does not dictate any data layout.
* You may recall `ConvertBipolarMatMulToXnorPopcount` from the TFC-w1a1 example, which is needed to implement bipolar-by-bipolar (w1a1) networks correctly using finn-hlslib.
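As a quick aside on the last bullet, the XNOR-popcount trick works because the dot product of two bipolar vectors can be computed from the number of matching bit positions. A minimal numpy sketch of the equivalence (not the hlslib implementation):
```python
import numpy as np
# for bipolar vectors a, b in {-1,+1}^N with binary encodings a_bin = (a+1)/2:
#   a . b = 2 * popcount(XNOR(a_bin, b_bin)) - N
N = 8
a = np.random.choice([-1, +1], size=N)
b = np.random.choice([-1, +1], size=N)
a_bin, b_bin = (a + 1) // 2, (b + 1) // 2
xnor = 1 - np.bitwise_xor(a_bin, b_bin)      # 1 where the bits match
assert a @ b == 2 * np.sum(xnor) - N
```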
Let's visualize the streamlined and lowered network with Netron. Observe how all the `Conv` nodes have turned into pairs of `Im2Col, MatMul` nodes, and many nodes, including the `BatchNorm, Mul, Add` nodes, have disappeared and been replaced with `MultiThreshold` nodes.
%% Cell type:code id: tags:
``` python
showInNetron(build_dir+"/end2end_cnv_w1a1_streamlined.onnx")
```
%% Cell type:markdown id: tags:
## 3. Partitioning, Conversion to HLS Layers and Folding
The next steps will be (again) very similar to what we did for the TFC-w1a1 network. We'll first convert the layers that we can put into the FPGA into their HLS equivalents and separate them out into a *dataflow partition*:
%% Cell type:code id: tags:
``` python
import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls
from finn.transformation.fpgadataflow.create_dataflow_partition import (
    CreateDataflowPartition,
)
from finn.transformation.move_reshape import RemoveCNVtoFCFlatten
from qonnx.custom_op.registry import getCustomOp
from qonnx.transformation.infer_data_layouts import InferDataLayouts
# choose the memory mode for the MVTU units, decoupled or const
mem_mode = "decoupled"
model = ModelWrapper(build_dir + "/end2end_cnv_w1a1_streamlined.onnx")
model = model.transform(to_hls.InferBinaryMatrixVectorActivation(mem_mode))
model = model.transform(to_hls.InferQuantizedMatrixVectorActivation(mem_mode))
# TopK to LabelSelect
model = model.transform(to_hls.InferLabelSelectLayer())
# input quantization (if any) to standalone thresholding
model = model.transform(to_hls.InferThresholdingLayer())
model = model.transform(to_hls.InferConvInpGen())
model = model.transform(to_hls.InferStreamingMaxPool())
# get rid of Reshape(-1, 1) operation between hlslib nodes
model = model.transform(RemoveCNVtoFCFlatten())
# get rid of Tranpose -> Tranpose identity seq
model = model.transform(absorb.AbsorbConsecutiveTransposes())
# infer tensor data layouts
model = model.transform(InferDataLayouts())
parent_model = model.transform(CreateDataflowPartition())
parent_model.save(build_dir + "/end2end_cnv_w1a1_dataflow_parent.onnx")
sdp_node = parent_model.get_nodes_by_op_type("StreamingDataflowPartition")[0]
sdp_node = getCustomOp(sdp_node)
dataflow_model_filename = sdp_node.get_nodeattr("model")
# save the dataflow partition with a different name for easier access
dataflow_model = ModelWrapper(dataflow_model_filename)
dataflow_model.save(build_dir + "/end2end_cnv_w1a1_dataflow_model.onnx")
```
%% Cell type:markdown id: tags:
Notice the additional `RemoveCNVtoFCFlatten` transformation that was not used for TFC-w1a1. In the last Netron visualization you may have noticed a `Reshape` operation towards the end of the network, where the convolutional part of the network ends and the fully-connected layers begin. That `Reshape` is essentially a tensor flattening operation, which we can remove for the purposes of hardware implementation. We can examine the contents of the dataflow partition with Netron, and observe the `ConvolutionInputGenerator`, `MatrixVectorActivation` and `StreamingMaxPool_Batch` nodes that implement the sliding window, matrix multiply and maxpool operations in hlslib. *Note that the MatrixVectorActivation instances following the ConvolutionInputGenerator nodes are really implementing the convolutions, despite the name. The final three MatrixVectorActivation instances implement actual FC layers.*
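If you prefer a quick textual check over Netron, you can simply iterate over the nodes of the dataflow partition and print their op types. A small sketch using the `dataflow_model` we just saved:
```python
# list the HLS layer sequence inside the dataflow partition
for node in dataflow_model.graph.node:
    print(node.op_type)
```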
%% Cell type:code id: tags:
``` python
showInNetron(build_dir + "/end2end_cnv_w1a1_dataflow_parent.onnx")
```
%% Cell type:markdown id: tags:
Note that pretty much everything has gone into the `StreamingDataflowPartition` node; the only operation remaining is to apply a `Transpose` to obtain NHWC input from an NCHW input (the ONNX default layout).
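As a reminder of what that `Transpose` does, converting an NCHW tensor to NHWC is just a permutation of the axes. A minimal numpy sketch with placeholder data:
```python
import numpy as np
# placeholder input in NCHW layout (batch, channels, height, width)
x_nchw = np.random.rand(1, 3, 32, 32).astype(np.float32)
# permute to NHWC (batch, height, width, channels), as the partition expects
x_nhwc = x_nchw.transpose(0, 2, 3, 1)
print(x_nhwc.shape)  # (1, 32, 32, 3)
```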
%% Cell type:code id: tags:
``` python
showInNetron(build_dir + "/end2end_cnv_w1a1_dataflow_model.onnx")
```
%% Cell type:markdown id: tags:
Now we have to set the *folding factors* for certain layers to adjust the performance of our accelerator, similar to the TFC-w1a1 example. We'll also set the desired FIFO depths around those layers, which are important to achieve full throughput in the accelerator.
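As a rough intuition for why `PE` and `SIMD` matter: for a `MatrixVectorActivation` with input dimension `MW` and output dimension `MH`, producing one output vector takes on the order of `(MH / PE) * (MW / SIMD)` cycles, so larger folding factors mean fewer cycles per output (at the cost of more hardware). A back-of-the-envelope sketch with made-up numbers:
```python
# illustrative only: estimate cycles per output vector for one MVAU layer
mw, mh = 288, 64      # made-up matrix dimensions (inputs x outputs)
pe, simd = 16, 3      # e.g. the folding factors of the first layer below
cycles_per_output = (mh // pe) * (mw // simd)
print(cycles_per_output)  # 384
```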
%% Cell type:code id: tags:
``` python
model = ModelWrapper(build_dir + "/end2end_cnv_w1a1_dataflow_model.onnx")
fc_layers = model.get_nodes_by_op_type("MatrixVectorActivation")
# each tuple is (PE, SIMD, in_fifo_depth) for a layer
folding = [
    (16, 3, [128]),
    (32, 32, [128]),
    (16, 32, [128]),
    (16, 32, [128]),
    (4, 32, [81]),
    (1, 32, [2]),
    (1, 4, [2]),
    (1, 8, [128]),
    (5, 1, [3]),
]
for fcl, (pe, simd, ififodepth) in zip(fc_layers, folding):
    fcl_inst = getCustomOp(fcl)
    fcl_inst.set_nodeattr("PE", pe)
    fcl_inst.set_nodeattr("SIMD", simd)
    fcl_inst.set_nodeattr("inFIFODepths", ififodepth)
# use same SIMD values for the sliding window operators
swg_layers = model.get_nodes_by_op_type("ConvolutionInputGenerator")
for i in range(len(swg_layers)):
    swg_inst = getCustomOp(swg_layers[i])
    simd = folding[i][1]
    swg_inst.set_nodeattr("SIMD", simd)
model = model.transform(GiveUniqueNodeNames())
model.save(build_dir + "/end2end_cnv_w1a1_folded.onnx")
```
%% Cell type:markdown id: tags:
Below we visualize the graph in Netron to inspect the folding factors in the `PE` and `SIMD` attributes of each `MatrixVectorActivation`, as well as the FIFO depths we just set via the `inFIFODepths` attribute.
%% Cell type:code id: tags:
``` python
showInNetron(build_dir + "/end2end_cnv_w1a1_folded.onnx")
```
%% Cell type:markdown id: tags:
Our network is now ready and we can start with the hardware generation.
%% Cell type:markdown id: tags:
## 4. Hardware Generation
From this point onward, the steps we have to follow do not depend on the particular network and will be exactly the same as in the TFC-w1a1 example. **The `ZynqBuild` step below runs Vivado synthesis and may take about 30 minutes depending on your host computer.** For more details about what's going on in this step, please consult the [TFC end-to-end notebook](tfc_end2end_example.ipynb) or the appropriate section in the [FINN documentation](https://finn.readthedocs.io/en/latest/hw_build.html).
%% Cell type:code id: tags:
``` python
test_pynq_board = "Pynq-Z1"
target_clk_ns = 10
from finn.transformation.fpgadataflow.make_zynq_proj import ZynqBuild
model = ModelWrapper(build_dir+"/end2end_cnv_w1a1_folded.onnx")
model = model.transform(ZynqBuild(platform = test_pynq_board, period_ns = target_clk_ns))
```
%% Cell type:markdown id: tags:
After the `ZynqBuild` we run one additional transformation to generate a PYNQ driver for the accelerator.
%% Cell type:code id: tags:
``` python
from finn.transformation.fpgadataflow.make_pynq_driver import MakePYNQDriver
model = model.transform(MakePYNQDriver("zynq-iodma"))
```
%% Cell type:code id: tags:
``` python
model.save(build_dir + "/end2end_cnv_w1a1_synth.onnx")
```
%% Cell type:markdown id: tags:
## 5. Deployment and Execution
Now that we're done with the hardware generation, the bitfile and the generated driver file(s) will be copied into a deployment folder, which can then be used to run the network on the PYNQ board.
%% Cell type:code id: tags:
``` python
from shutil import copy
from distutils.dir_util import copy_tree
model = ModelWrapper(build_dir + "/end2end_cnv_w1a1_synth.onnx")
# create directory for deployment files
deployment_dir = make_build_dir(prefix="pynq_deployment_")
model.set_metadata_prop("pynq_deployment_dir", deployment_dir)
# get and copy necessary files
# .bit and .hwh file
bitfile = model.get_metadata_prop("bitfile")
hwh_file = model.get_metadata_prop("hw_handoff")
deploy_files = [bitfile, hwh_file]
for dfile in deploy_files:
    if dfile is not None:
        copy(dfile, deployment_dir)
# driver.py and python libraries
pynq_driver_dir = model.get_metadata_prop("pynq_driver_dir")
copy_tree(pynq_driver_dir, deployment_dir)
```
%% Cell type:markdown id: tags:
Next to these files, we will also need an example numpy array to test the network on the PYNQ board -- *and before you ask, that's supposed to be a cat (CIFAR-10 class number 3)*. Recall that we partitioned our original network into a parent graph that contained the non-synthesizable nodes and a child graph that contained the bulk of the network, which we turned into a bitfile. The only operator left outside the FPGA partition was a `Transpose` to convert NCHW images into NHWC ones. Thus, we can skip the execution in the parent as long as we ensure our image has the expected data layout. The example numpy array can then be saved as an .npy file.
%% Cell type:code id: tags:
``` python
import pkg_resources as pk
import matplotlib.pyplot as plt
import numpy as np
fn = pk.resource_filename("finn.qnn-data", "cifar10/cifar10-test-data-class3.npz")
x = np.load(fn)["arr_0"]
x = x.reshape(3, 32, 32).transpose(1, 2, 0)
plt.imshow(x)
```
%% Cell type:code id: tags:
``` python
import numpy as np
iname = model.graph.input[0].name
ishape = model.get_tensor_shape(iname)
np.save(deployment_dir + "/input.npy", x.reshape(ishape))
```
%% Cell type:code id: tags:
``` python
! ls {deployment_dir}
```
%% Cell type:code id: tags:
``` python
from shutil import make_archive
make_archive('deploy-on-pynq-cnv', 'zip', deployment_dir)
```
%% Cell type:markdown id: tags:
You can now download the created zipfile (File -> Open, mark the checkbox next to the deploy-on-pynq-cnv.zip and select Download from the toolbar), then copy it to your PYNQ board (for instance via scp or rsync). Then, run the following commands on the PYNQ board to extract the archive and run the execution:
%% Cell type:markdown id: tags:
```shell
unzip deploy-on-pynq-cnv.zip -d finn-cnv-demo
cd finn-cnv-demo
sudo python3.6 -m pip install bitstring
sudo python3.6 driver.py --exec_mode=execute --batchsize=1 --bitfile=resizer.bit --inputfile=input.npy
```
%% Cell type:markdown id: tags:
The output will be saved on the PYNQ board as `output.npy` and can be copied back to the host and opened with `np.load()`. For this test image, the network should predict class 3 ("cat").
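Once you have copied `output.npy` back to the host, a quick way to inspect it (a sketch, assuming the file sits in your current working directory):
```python
import numpy as np
# the accelerator returns the top-1 class index; for our cat image this
# should be 3
out = np.load("output.npy")
print(out)
```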
%% Cell type:markdown id: tags:
### Validating the Accuracy on a PYNQ Board <a id='validation'></a>
All the command line prompts here are meant to be executed with `sudo` on the PYNQ board.
**Ensure that your PYNQ board has a working internet connection for the next steps, since there is some downloading involved.**
To validate the accuracy, we first need to install the [`dataset-loading`](https://github.com/fbcotter/dataset_loading) Python package on the PYNQ board. This will give us a convenient way of downloading and accessing the CIFAR-10 dataset.
Command to execute on PYNQ board:
```sudo pip3 install git+https://github.com/fbcotter/dataset_loading.git@0.0.4#egg=dataset_loading```
%% Cell type:markdown id: tags:
We can now use the `validate.py` script that was generated together with the driver to measure top-1 accuracy on the CIFAR-10 dataset.
Command to execute on PYNQ board:
`sudo python3.6 validate.py --dataset cifar10 --batchsize 1000`
%% Cell type:markdown id: tags:
We see that the final top-1 accuracy is 84.19%, which is very close to the 84.22% reported on the [BNN-PYNQ accuracy table in Brevitas](https://github.com/Xilinx/brevitas/tree/master/src/brevitas_examples/bnn_pynq).