diff --git a/.github/workflows/docker-image.yml b/.github/workflows/docker-image.yml index 2c91a0a83b5158cdf6458bcd4c453f22757e4130..00c25a4a3150a8368405b449fdce04456ccbe88d 100644 --- a/.github/workflows/docker-image.yml +++ b/.github/workflows/docker-image.yml @@ -1,9 +1,10 @@ name: DockerImage on: + pull_request: + branches: [ dev ] push: - branches: - - 'dev' + branches: [ dev ] jobs: docker: @@ -11,7 +12,7 @@ jobs: steps: - name: checkout - uses: actions/checkout@v2 + uses: actions/checkout@v3 - name: Set up Docker Buildx uses: docker/setup-buildx-action@v1 diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml index f61af878ffadb6b4fd7a527110b105b19cee0ded..5f03379bbc37ab913f712571c630035dbad84cce 100644 --- a/.github/workflows/pre-commit.yml +++ b/.github/workflows/pre-commit.yml @@ -16,7 +16,7 @@ jobs: uses: actions/checkout@v3 - name: Setup Python - uses: actions/setup-python@v3 + uses: actions/setup-python@v4 with: python-version: '3.8' diff --git a/.github/workflows/quicktest-dev-pr.yml b/.github/workflows/quicktest-dev-pr.yml index a726ab584f4879407ef9be92719beb42d7a86af7..e2ba47ec296f73cfd7c0eede98bac3acd066075a 100644 --- a/.github/workflows/quicktest-dev-pr.yml +++ b/.github/workflows/quicktest-dev-pr.yml @@ -15,7 +15,7 @@ jobs: steps: - name: checkout - uses: actions/checkout@v2 + uses: actions/checkout@v3 - name: DockerRunQuicktest run: | diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 5a7f70f8f69293d8dcef9b64c763aa606d5d73f5..126a4ac4b2bee7f3eaaf610646855b48d07b9e32 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -51,7 +51,7 @@ repos: args: ['--fix=no'] - repo: https://github.com/PyCQA/isort - rev: 5.10.1 + rev: 5.12.0 hooks: - id: isort diff --git a/.readthedocs.yaml b/.readthedocs.yaml index 3601fcdccff675e6f850d4636ebbfc0726f7cd4d..478957be113b686c4fabd3d071fdf6203dd37dd3 100644 --- a/.readthedocs.yaml +++ b/.readthedocs.yaml @@ -35,7 +35,7 @@ sphinx: configuration: docs/finn/conf.py python: - version: 3.7 + version: 3.8 install: - method: pip path: . diff --git a/AUTHORS.rst b/AUTHORS.rst index d011ce3d7ad74125b7013b7a7e987eb22e70a9f3..861b81924b187620d77f8cd47d4faff8d7f15bf8 100644 --- a/AUTHORS.rst +++ b/AUTHORS.rst @@ -9,7 +9,7 @@ Contributors * Hendrik Borras (@HenniOVP) * Lucian Petrica (@quetric) * Tobias Alonso (@Tobi-Alonso) -* Felix Paul Jentzsch (@felixpj) +* Felix Paul Jentzsch (@fpjentzsch) * Mirza Mrahorovic (@mmrahorovic) * Suranga Mahesh (@surangamh) * Peter Lehnhardt (@pete-lennart) @@ -26,3 +26,5 @@ Contributors * Aziz Bahri (@azizb-xlnx) * Fionn O'Donohoe (@fionnodonohoe-xlnx) * Matthias Gehre (@mgehre-amd) +* Hugo Le Blevec (@hleblevec) +* Patrick Geel (@patrickgeel) diff --git a/README.md b/README.md index 1b8efc8f19d0b664a17320585f5ea60acbe03eb4..2e1faf8f0c4422c8690506bb5f79611c6661fa9c 100644 --- a/README.md +++ b/README.md @@ -28,7 +28,7 @@ Please see the [Getting Started](https://finn.readthedocs.io/en/latest/getting_s ## Documentation -You can view the documentation on [readthedocs](https://finn.readthedocs.io) or build them locally using `python setup.py doc` from inside the Docker container. Additionally, there is a series of [Jupyter notebook tutorials](https://github.com/Xilinx/finn/tree/master/notebooks), which we recommend running from inside Docker for a better experience. +You can view the documentation on [readthedocs](https://finn.readthedocs.io) or build them locally using `python setup.py doc` from inside the Docker container. 
Additionally, there is a series of [Jupyter notebook tutorials](https://github.com/Xilinx/finn/tree/main/notebooks), which we recommend running from inside Docker for a better experience. ## Community @@ -67,4 +67,4 @@ The current implementation of the framework is based on the following publicatio ## Old version We previously released an early-stage prototype of a toolflow that took in Caffe-HWGQ binarized network descriptions and produced dataflow architectures. You can find it in the [v0.1](https://github.com/Xilinx/finn/tree/v0.1) branch in this repository. -Please be aware that this version is deprecated and unsupported, and the master branch does not share history with that branch so it should be treated as a separate repository for all purposes. +Please be aware that this version is deprecated and unsupported, and the main branch does not share history with that branch so it should be treated as a separate repository for all purposes. diff --git a/docs/finn/brevitas_export.rst b/docs/finn/brevitas_export.rst index 304aa30854118e1ebd3258169ee4698a873e8689..950b601f98d14e99a00841f23894770eb0bb1569 100644 --- a/docs/finn/brevitas_export.rst +++ b/docs/finn/brevitas_export.rst @@ -16,6 +16,6 @@ Two of the Brevitas-exported ONNX variants can be ingested by FINN: To work with either type of ONNX model, it is loaded into a :ref:`modelwrapper` provided by FINN. -At this stage we can already use the functional verification flow to simulate the model using Python, this is marked in the graphic with the dotted arrow. For more details please have look at :ref:`verification`. +At this stage we can already use the functional verification flow to simulate the model using Python. For more details please have a look at :ref:`verification`. The model can now be further processed in FINN, the next flow step is :ref:`nw_prep`. diff --git a/docs/finn/command_line.rst b/docs/finn/command_line.rst index 12e01db5544e847a775d330929d1eea916cae74e..8c37479a28ea7c2ae76bbcce9cf5bfc53646a2cb 100644 --- a/docs/finn/command_line.rst +++ b/docs/finn/command_line.rst @@ -105,7 +105,7 @@ The following outputs will be generated regardless of which particular outputs a The other output products are controlled by the `generate_outputs` field in the build configuration), and are detailed below. -* :py:mod:`finn.builder.build_dataflow.DataflowOutputType.ESTIMATE_REPORTS` produces a variety of reports to estimate resource usage and performance *without* running any synthesis. This can be useful for setting up the parallelization and other hardware configuration: +* :py:mod:`finn.builder.build_dataflow_config.DataflowOutputType.ESTIMATE_REPORTS` produces a variety of reports to estimate resource usage and performance *without* running any synthesis. This can be useful for setting up the parallelization and other hardware configuration: * ``report/estimate_layer_cycles.json`` -- cycles per layer estimation from analytical model * ``report/estimate_layer_resources.json`` -- resources per layer estimation from analytical model @@ -113,31 +113,31 @@ build configuration), and are detailed below.
* ``report/estimate_network_performance.json`` -- whole-network performance estimation from analytical model * ``report/op_and_param_counts.json`` -- per-layer and total number of operations and parameters (independent of parallelization) -* :py:mod:`finn.builder.build_dataflow.DataflowOutputType.STITCHED_IP`: produces a stitched Vivado IP block design that can be integrated with other FPGA designs in Vivado IPI: +* :py:mod:`finn.builder.build_dataflow_config.DataflowOutputType.STITCHED_IP`: produces a stitched Vivado IP block design that can be integrated with other FPGA designs in Vivado IPI: * ``stitched_ip/finn_vivado_stitch_proj.xpr`` -- Vivado project (including Vivado IP Integrator block design) to generate the stitched IP * ``stitched_ip/ip`` -- exported Vivado IP for the stitched design -* :py:mod:`finn.builder.build_dataflow.DataflowOutputType.RTLSIM_PERFORMANCE`: measure latency and performance for the stitched IP in RTL simulation, using PyVerilator +* :py:mod:`finn.builder.build_dataflow_config.DataflowOutputType.RTLSIM_PERFORMANCE`: measure latency and performance for the stitched IP in RTL simulation, using PyVerilator * ``report/rtlsim_performance.json`` -- accelerator throughput and latency from RTL simulation -* :py:mod:`finn.builder.build_dataflow.DataflowOutputType.OOC_SYNTH` runs out-of-context synthesis for the stitched IP. This is useful for getting post-synthesis resource counts and achievable clock frequency without having to produce a full bitfile with DMA engines: +* :py:mod:`finn.builder.build_dataflow_config.DataflowOutputType.OOC_SYNTH` runs out-of-context synthesis for the stitched IP. This is useful for getting post-synthesis resource counts and achievable clock frequency without having to produce a full bitfile with DMA engines: * ``report/ooc_synth_and_timing.json`` -- resources and achievable clock frequency from out-of-context synthesis -* :py:mod:`finn.builder.build_dataflow.DataflowOutputType.BITFILE` will run Vivado and/or Vitis to insert the FINN accelerator inside a shell, with DMA engines instantiated to move data to/from main memory: +* :py:mod:`finn.builder.build_dataflow_config.DataflowOutputType.BITFILE` will run Vivado and/or Vitis to insert the FINN accelerator inside a shell, with DMA engines instantiated to move data to/from main memory: * ``bitfile/finn-accel.(bit|xclbin)`` -- generated bitfile depending on platform * ``report/post_synth_resources.xml`` -- FPGA resource utilization after synthesis * ``report/post_route_timing.rpt`` -- post-route timing report -* :py:mod:`finn.builder.build_dataflow.DataflowOutputType.PYNQ_DRIVER` will generate a PYNQ Python driver that can be used to interface the generated accelerator: +* :py:mod:`finn.builder.build_dataflow_config.DataflowOutputType.PYNQ_DRIVER` will generate a PYNQ Python driver that can be used to interface the generated accelerator: * ``driver/driver.py`` -- Python driver that can be used on PYNQ on Zynq or Alveo platforms to launch the accelerator -* :py:mod:`finn.builder.build_dataflow.DataflowOutputType.DEPLOYMENT_PACKAGE`: +* :py:mod:`finn.builder.build_dataflow_config.DataflowOutputType.DEPLOYMENT_PACKAGE`: * ``deploy/`` -- deployment package folder with a bitfile and driver, ready to be copied to target hardware platform @@ -153,7 +153,7 @@ and compare it against the expected output that you provide. 
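Putting these output products and the verification flow together, a minimal sketch of a Python build script (the model file name, board and clock period are placeholders; the enum members are the ones documented above)::

    import finn.builder.build_dataflow as build
    import finn.builder.build_dataflow_config as build_cfg

    cfg = build_cfg.DataflowBuildConfig(
        output_dir="build_output",   # receives report/, stitched_ip/, deploy/, ...
        synth_clk_period_ns=10.0,    # placeholder: 100 MHz target clock
        board="Pynq-Z1",             # placeholder target platform
        shell_flow_type=build_cfg.ShellFlowType.VIVADO_ZYNQ,
        generate_outputs=[
            build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
            build_cfg.DataflowOutputType.STITCHED_IP,
            build_cfg.DataflowOutputType.BITFILE,
            build_cfg.DataflowOutputType.PYNQ_DRIVER,
            build_cfg.DataflowOutputType.DEPLOYMENT_PACKAGE,
        ],
        # verify selected intermediate models against the input and expected
        # output you provide (.npy files, see the verify_input_npy and
        # verify_expected_output_npy config fields)
        verify_steps=[
            build_cfg.VerificationStepType.STREAMLINED_PYTHON,
            build_cfg.VerificationStepType.STITCHED_IP_RTLSIM,
        ],
    )
    build.build_dataflow_cfg("model.onnx", cfg)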
This is achieved by setting up the following members of the build configuration: -* Set ``verify_steps`` to be a list of :py:mod:`finn.builder.build_dataflow.VerificationStepType` +* Set ``verify_steps`` to be a list of :py:mod:`finn.builder.build_dataflow_config.VerificationStepType` where each element in the list indicates the output of a particular step that will be verified. See the documentation of the ``VerificationStepType`` for more information. diff --git a/docs/finn/developers.rst b/docs/finn/developers.rst index b152dfef66d0eb47e086d3c5cd51174c5df52128..f9252f764c3f8297140f81d7ed42ab2da1218dae 100644 --- a/docs/finn/developers.rst +++ b/docs/finn/developers.rst @@ -12,7 +12,7 @@ Prerequisites Before starting to do development on FINN it's a good idea to start with understanding the basics as a user. Going through all of the -:ref:`tutorials` is strongly recommended if you haven' already done so. +:ref:`tutorials` is strongly recommended if you haven't already done so. Additionally, please review the documentation available on :ref:`internals`. Repository structure @@ -153,7 +153,7 @@ from the FINN root directory as follows: :: - python setup.py test --addopts "-k test_brevitas_debug --pdb" + pytest -k test_brevitas_debug --pdb If you want to run tests in parallel (e.g. to take advantage of a multi-core CPU) diff --git a/docs/finn/end_to_end_flow.rst b/docs/finn/end_to_end_flow.rst index bc5c5230718bcc8dd50334cc1f20c3c84c012ca4..0a022067c38ec3bb3c793d288e0230013ca8b21c 100644 --- a/docs/finn/end_to_end_flow.rst +++ b/docs/finn/end_to_end_flow.rst @@ -9,7 +9,7 @@ As you can see in the picture, FINN has a high modularity and has the property t :scale: 50% :align: center -The white fields show the state of the network representation in the respective step. The colored fields represent the transformations that are applied to the network to achieve a certain result. The diagram is divided into five sections, each of it includes several flow steps. The flow starts in top left corner with Brevitas export (green section), followed by the preparation of the network (blue section) for the Vivado HLS and Vivado IPI (orange section). There is also a section for testing and verification in software (red section) and the hardware generation and deployment on the PYNQ board (yellow section). +The white fields show the state of the network representation in the respective step. The colored fields represent the transformations that are applied to the network to achieve a certain result. The diagram is divided into five sections, each of which includes several flow steps. The flow starts in the top left corner with Brevitas export, followed by the preparation of the network for Vitis HLS and Vivado IPI. There is also a section for testing and verification in software (in the cloud on the right), as well as one for the hardware generation and deployment on the PYNQ board. This example flow is covered in the `end2end_example <https://github.com/Xilinx/finn/tree/main/notebooks/end2end_example>`_ Jupyter notebooks. For a more detailed overview about the different flow sections, please have a look at the corresponding pages: diff --git a/docs/finn/getting_started.rst b/docs/finn/getting_started.rst index 40425c119fafdcd03292b05c7a7e71310f767239..9b3111b70eae97a3644e1de23c368bd5b09f7927 100644 --- a/docs/finn/getting_started.rst +++ b/docs/finn/getting_started.rst @@ -20,7 +20,7 @@ How do I use FINN?
================== We strongly recommend that you first watch one of the pre-recorded `FINN tutorial <https://www.youtube.com/watch?v=zw2aG4PhzmA&%3Bindex=2>`_ -videos, then follow the Jupyter notebook tutorials for `training and deploying an MLP for network intrusion detection <https://github.com/Xilinx/finn/tree/master/notebooks/end2end_example/cybersecurity>`_ . +videos, then follow the Jupyter notebook tutorials for `training and deploying an MLP for network intrusion detection <https://github.com/Xilinx/finn/tree/main/notebooks/end2end_example/cybersecurity>`_ . You may also want to check out the other :ref:`tutorials`, and the `FINN examples repository <https://github.com/Xilinx/finn-examples>`_ . Our aim in FINN is *not* to accelerate common off-the-shelf neural networks, but instead provide you with a set of tools @@ -28,19 +28,19 @@ to train *customized* networks and create highly-efficient FPGA implementations In general, the approach for using the FINN framework is as follows: 1. Train your own quantized neural network (QNN) in `Brevitas <https://github.com/Xilinx/brevitas>`_. We have some `guidelines <https://bit.ly/finn-hls4ml-qat-guidelines>`_ on quantization-aware training (QAT). -2. Export to FINN-ONNX by following `this tutorial <https://github.com/Xilinx/finn/blob/master/notebooks/basics/1_brevitas_network_import.ipynb>`_ . -3. Use FINN's ``build_dataflow`` system on the exported model by following this `tutorial <https://github.com/Xilinx/finn/blob/master/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb>`_ +2. Export to FINN-ONNX by following `this tutorial <https://github.com/Xilinx/finn/blob/main/notebooks/basics/1_brevitas_network_import.ipynb>`_ . +3. Use FINN's ``build_dataflow`` system on the exported model by following this `tutorial <https://github.com/Xilinx/finn/blob/main/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb>`_ 4. Adjust your QNN topology, quantization settings and ``build_dataflow`` configuration to get the desired results. Please note that the framework is still under development, and how well this works will depend on how similar your custom network is to the examples we provide. If there are substantial differences, you will most likely have to write your own Python scripts that call the appropriate FINN compiler functions that process your design correctly, or adding new functions (including -Vivado HLS layers) +Vitis HLS layers) as required. -The `advanced FINN tutorials <https://github.com/Xilinx/finn/tree/master/notebooks/advanced>`_ can be useful here. +The `advanced FINN tutorials <https://github.com/Xilinx/finn/tree/main/notebooks/advanced>`_ can be useful here. For custom networks, we recommend making a copy of the `BNN-PYNQ end-to-end -Jupyter notebook tutorials <https://github.com/Xilinx/finn/tree/master/notebooks/end2end_example/bnn-pynq>`_ as a starting point, visualizing the model at intermediate +Jupyter notebook tutorials <https://github.com/Xilinx/finn/tree/main/notebooks/end2end_example/bnn-pynq>`_ as a starting point, visualizing the model at intermediate steps and adding calls to new transformations as needed. Once you have a working flow, you can implement a command line entry for this by using the "advanced mode" described in the :ref:`command_line` section. @@ -50,7 +50,8 @@ Running FINN in Docker FINN runs inside a Docker container, it comes with a script to easily build and launch the container. 
If you are not familiar with Docker, there are many excellent `online resources <https://docker-curriculum.com/>`_ to get started. You may want to review the :ref:`General FINN Docker tips` and :ref:`Environment variables` as well. If you want to use prebuilt images, read :ref:`Using a prebuilt image`. -The ``run-docker.sh`` script that can be launched in the following modes: + +The above-mentioned script to build and launch the FINN Docker container is called `run-docker.sh <https://github.com/Xilinx/finn/blob/main/run-docker.sh>`_ . It can be launched in the following modes: Launch interactive shell ************************ @@ -140,10 +141,7 @@ If you are having trouble building the Docker image or need offline access, you Supported FPGA Hardware ======================= -**Shell-integrated accelerator + driver:** For quick deployment, we target boards supported by `PYNQ <http://www.pynq.io/>`_ . For these platforms, we can build a full bitfile including DMAs to move data into and out of the FINN-generated accelerator, as well as a Python driver to launch the accelerator. We support the Pynq-Z1, Pynq-Z2, Ultra96, ZCU102 and ZCU104 boards. - -.. warning:: - In previous FINN versions (v0.4b - v0.7) we had support for `Xilinx Alveo boards <https://www.xilinx.com/products/boards-and-kits/alveo.html>`_ using PYNQ and Vitis 2020.1, see instructions below for Alveo setup that works with older versions. Please note that with the new release with Vitis 2022.1, we do only have experimental support to automatically deployment for Alveo cards. +**Shell-integrated accelerator + driver:** For quick deployment, we target boards supported by `PYNQ <http://www.pynq.io/>`_ . For these platforms, we can build a full bitfile including DMAs to move data into and out of the FINN-generated accelerator, as well as a Python driver to launch the accelerator. We support the Pynq-Z1, Pynq-Z2, Ultra96, ZCU102 and ZCU104 boards, as well as Alveo cards. **Vivado IPI support for any Xilinx FPGA:** FINN generates a Vivado IP Integrator (IPI) design from the neural network with AXI stream (FIFO) in-out interfaces, which can be integrated onto any Xilinx FPGA as part of a larger system. It's up to you to take the FINN-generated accelerator (what we call "stitched IP" in the tutorials), wire it up to your FPGA design and send/receive neural network data to/from the accelerator. @@ -181,12 +179,12 @@ On the target side: On the host side: -1. Install Vitis 2020.1 and set up the ``VITIS_PATH`` environment variable to point to your installation. +1. Install Vitis 2022.1 and set up the ``VITIS_PATH`` environment variable to point to your installation. 2. Install Xilinx XRT. Ensure that the ``XRT_DEB_VERSION`` environment variable reflects which version of XRT you have installed. 3. Install the Vitis platform files for Alveo and set up the ``PLATFORM_REPO_PATHS`` environment variable to point to your installation. *This must be the same path as the target's platform files (target step 2)* 4. Set up the ``ALVEO_*`` environment variables accordingly for your target, see description of environment variables above. 5. `Set up public key authentication <https://www.digitalocean.com/community/tutorials/how-to-configure-ssh-key-based-authentication-on-a-linux-server>`_. Copy your private key to the ``finn/ssh_keys`` folder on the host to get password-less deployment and remote execution. -6. Done! You can try the ``test_end2end_vitis`` tests in the FINN Docker to verify your setup, although this will take some time. +6. Done!
Vivado/Vitis license ********************* @@ -214,7 +212,7 @@ We also recommend running the FINN compiler on a system with sufficiently strong hardware: * **RAM.** Depending on your target FPGA platform, your system must have sufficient RAM to be - able to run Vivado/Vitis synthesis for that part. See `this page <https://www.xilinx.com/products/design-tools/vivado/memory.html>`_ + able to run Vivado/Vitis synthesis for that part. See `this page <https://www.xilinx.com/products/design-tools/vivado/vivado-ml.html#memory>`_ for more information. For targeting Zynq and Zynq UltraScale+ parts, at least 8 GB is recommended. Larger parts may require up to 16 GB. For targeting Alveo parts with Vitis, at least 64 GB RAM is recommended. diff --git a/docs/finn/hw_build.rst b/docs/finn/hw_build.rst index 2a64b87943075ff004f79c9d457136e41e27723d..a5c486935d531f7a037f3c49ead5bc7906afa831 100644 --- a/docs/finn/hw_build.rst +++ b/docs/finn/hw_build.rst @@ -9,14 +9,14 @@ Hardware Build and Deployment :align: center A model where all layers have been converted to HLS layers can be processed by -FINN to build a bitfile and driver targeting a Zynq system or to generate a Vivado IP Integrator (IPI) +FINN to build a bitfile and driver targeting a Zynq or Alveo system or to generate a Vivado IP Integrator (IPI) design with AXI stream (FIFO) in-out interfaces, which can be integrated onto any Xilinx FPGA as part of a larger system. Hardware Build ============== -Internally, the hardware build for Zynq devices consists of the following steps: +Internally, the hardware build consists of the following steps: 1. Driver generation 2. DMA and DWC node insertion @@ -89,9 +89,4 @@ Deployment Deployment and Remote Execution ------------------------------- -The bitfile and the driver file(s) are copied to the PYNQ board and can be executed there using the *onnx_exec* function with the right *exec_mode* settings. For details please have a look at transformation :py:mod:`finn.transformation.fpgadataflow.make_deployment.DeployToPYNQ` and the execution function :py:mod:`finn.core.onnx_exec`. - -Throughput Test ---------------- - -FINN also offers the possibility to measure the network performance directly on the PYNQ board. This can be done by using :py:mod:`finn.core.throughput_test`. When running this function the metrics of the network are returned as dictionary. +The bitfile and the driver file(s) are copied to the PYNQ board and can be executed there. For more information see the description in the `end2end_example <https://github.com/Xilinx/finn/tree/main/notebooks/end2end_example>`_ Jupyter notebooks. diff --git a/docs/finn/internals.rst b/docs/finn/internals.rst index e89fb997509538aa6ce073b8d428c88905475b55..848d22afecfb2736b91788ce3a201477b4ff5669 100644 --- a/docs/finn/internals.rst +++ b/docs/finn/internals.rst @@ -7,7 +7,7 @@ Internals Intermediate Representation: QONNX and FINN-ONNX ================================================ -FINN uses `ONNX <https://github.com/onnx/onnx>`_ as an intermediate representation (IR) for neural networks. As such, almost every component inside FINN uses ONNX and its `Python API <https://github.com/onnx/onnx/blob/master/docs/PythonAPIOverview.md>`_, so you may want to familiarize yourself with how ONNX represents DNNs. 
Specifically, the `ONNX protobuf description <https://github.com/onnx/onnx/blob/master/onnx/onnx.proto>`_ (or its `human-readable documentation <https://github.com/onnx/onnx/blob/master/docs/IR.md>`_ and the `operator schemas <https://github.com/onnx/onnx/blob/master/docs/Operators.md>`_ are useful as reference documents. We also provide a Jupyter notebook that can help to get familiar with ONNX by showing how to work with a simple ONNX model in FINN, see chapter :ref:`tutorials` for details. +FINN uses `ONNX <https://github.com/onnx/onnx>`_ as an intermediate representation (IR) for neural networks. As such, almost every component inside FINN uses ONNX and its `Python API <https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md>`_, so you may want to familiarize yourself with how ONNX represents DNNs. Specifically, the `ONNX protobuf description <https://github.com/onnx/onnx/blob/main/onnx/onnx.proto>`_ (or its `human-readable documentation <https://github.com/onnx/onnx/blob/main/docs/IR.md>`_) and the `operator schemas <https://github.com/onnx/onnx/blob/main/docs/Operators.md>`_ are useful as reference documents. We also provide a Jupyter notebook that can help to get familiar with ONNX by showing how to work with a simple ONNX model in FINN; see chapter :ref:`tutorials` for details. .. note:: FINN supports two specialized variants of ONNX called QONNX and FINN-ONNX, and not all ONNX graphs are supported by FINN (and vice versa). @@ -137,14 +137,14 @@ ModelWrapper contains more useful functions, if you are interested please have a Analysis Pass ============= -An analysis pass traverses the graph structure and produces information about certain properties. It gets the model in the ModelWrapper as input and returns a dictionary of the properties the analysis extracts. If you are interested in how to write an analysis pass for FINN, please take a look at the Jupyter notebook about how to write an analysis pass, see chapter :ref:`tutorials` for details. For more information about existing analysis passes in FINN, see module :py:mod:`finn.analysis`. +An analysis pass traverses the graph structure and produces information about certain properties. It gets the model in the ModelWrapper as input and returns a dictionary of the properties the analysis extracts. If you are interested in how to write an analysis pass for FINN, please take a look at the Jupyter notebook about how to write an analysis pass; see chapter :ref:`tutorials` for details. For more information about existing analysis passes in FINN, see module :py:mod:`finn.analysis` . .. _transformation_pass: Transformation Pass =================== -A transformation passes changes (transforms) the given model, it gets the model in the ModelWrapper as input and returns the changed model (ModelWrapper) to the FINN flow. Additional the flag *model_was_changed* which indicates if a transformation has to be performed more than once, is returned. +A transformation pass changes (transforms) the given model: it gets the model in the ModelWrapper as input and returns the changed model (ModelWrapper) to the FINN flow. Additionally, the flag *model_was_changed*, which indicates whether a transformation has to be performed more than once, is returned.
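As a minimal sketch of both conventions (the function and class names are illustrative; ``model`` is assumed to be a ModelWrapper, and the ``Transformation`` base class lives in :py:mod:`qonnx.transformation.base`)::

    from qonnx.transformation.base import Transformation

    # Analysis pass: a plain function, ModelWrapper in, dictionary out.
    def count_op_types(model):
        counts = {}
        for node in model.graph.node:
            counts[node.op_type] = counts.get(node.op_type, 0) + 1
        return counts

    # Transformation pass: apply() returns the changed model plus the
    # model_was_changed flag; the flow re-applies the transformation
    # until the flag comes back False.
    class NameUnnamedNodes(Transformation):
        def apply(self, model):
            graph_modified = False
            for idx, node in enumerate(model.graph.node):
                if not node.name:
                    node.name = "%s_%d" % (node.op_type, idx)
                    graph_modified = True
            return (model, graph_modified)

    props = model.analysis(count_op_types)
    model = model.transform(NameUnnamedNodes())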
If you are interested in how to write a transformation pass for FINN, please take a look at the Jupyter notebook about how to write a transformation pass; see chapter :ref:`tutorials` for details. For more information about existing transformation passes in FINN, see module :py:mod:`finn.transformation` . .. _mem_mode: @@ -167,7 +167,7 @@ The following picture shows the idea behind the "const" and "decoupled" mode. Const mode ---------- -In *const* mode the weights are "baked in" into the Matrix-Vector-Activate-Unit (MVAU), which means they are part of the HLS code. During the IP block generation the weight values are integrated as *params.h* file in the HLS code and synthesized together with it. For the *const* mode IP block generation the `Matrix_Vector_Activate_Batch function <https://github.com/Xilinx/finn-hlslib/blob/19fa1197c09bca24a0f77a7fa04b8d7cb5cc1c1d/mvau.hpp#L93>`_ from the finn-hls library is used, which implements a standard MVAU. The resulting IP block has an input and an output stream, as shown in the above picture on the left. FIFOs in the form of verilog components are connected to these. +In *const* mode the weights are "baked into" the Matrix-Vector-Activate-Unit (MVAU), which means they are part of the HLS code. During the IP block generation the weight values are integrated as a *params.h* file in the HLS code and synthesized together with it. For the *const* mode IP block generation, the `Matrix_Vector_Activate_Batch function <https://github.com/Xilinx/finn-hlslib/blob/master/mvau.hpp#L92>`_ from the finn-hls library is used, which implements a standard MVAU. The resulting IP block has an input and an output stream, as shown in the above picture on the left. FIFOs in the form of verilog components are connected to these. Advantages: @@ -185,7 +185,7 @@ Disadvantages: Decoupled mode -------------- -In *decoupled* mode a different variant of the MVAU with three ports is used. Besides the input and output streams, which are fed into the circuit via Verilog FIFOs, there is another input, which is used to stream the weights. For this the `streaming MVAU <https://github.com/Xilinx/finn-hlslib/blob/07a8353f6cdfd8bcdd81e309a5581044c2a93d3b/mvau.hpp#L213>`_ from the finn-hls library is used. To make the streaming possible a Verilog weight streamer component accesses the weight memory and sends the values via another FIFO to the MVAU. This component can be found in the `finn-rtllib <https://github.com/Xilinx/finn/tree/dev/finn-rtllib>`_ under the name *memstream.v*. For the IP block generation this component, the IP block resulting from the synthesis of the HLS code of the streaming MVAU and a FIFO for the weight stream are combined in a verilog wrapper. The weight values are saved in .dat files and stored in the weight memory from which the weight streamer reads. The resulting verilog component, which is named after the name of the node and has the suffix "_memstream.v", exposes only two ports to the outside, the data input and output. It therefore behaves externally in the same way as the MVAU in *const* mode. +In *decoupled* mode a different variant of the MVAU with three ports is used. Besides the input and output streams, which are fed into the circuit via Verilog FIFOs, there is another input, which is used to stream the weights. For this, the `streaming MVAU <https://github.com/Xilinx/finn-hlslib/blob/master/mvau.hpp#L214>`_ from the finn-hls library is used.
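In the compiler, the mode is selected per node via the ``mem_mode`` attribute; a minimal sketch of switching all MVAU layers to the decoupled variant (assuming ``model`` is a ModelWrapper of the prepared network)::

    from qonnx.custom_op.registry import getCustomOp

    for node in model.graph.node:
        if node.op_type == "MatrixVectorActivation":
            # select the three-port streaming variant described here
            getCustomOp(node).set_nodeattr("mem_mode", "decoupled")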
To make the streaming possible a Verilog weight streamer component accesses the weight memory and sends the values via another FIFO to the MVAU. This component can be found in the `finn-rtllib <https://github.com/Xilinx/finn/tree/dev/finn-rtllib>`_ under the name *memstream.v*. For the IP block generation this component, the IP block resulting from the synthesis of the HLS code of the streaming MVAU and a FIFO for the weight stream are combined in a verilog wrapper. The weight values are saved in .dat files and stored in the weight memory from which the weight streamer reads. The resulting verilog component, which is named after the name of the node and has the suffix "_memstream.v", exposes only two ports to the outside, the data input and output. It therefore behaves externally in the same way as the MVAU in *const* mode. Advantages: diff --git a/docs/finn/nw_prep.rst b/docs/finn/nw_prep.rst index 566eda5bac38855e9ed8edfdf53193bb6c025256..6fea992cf70ad2cb29b385133ccdcf34606b2185 100644 --- a/docs/finn/nw_prep.rst +++ b/docs/finn/nw_prep.rst @@ -10,7 +10,7 @@ Network Preparation The main principle of FINN are analysis and transformation passes. If you like to have more information about these please have a look at section :ref:`analysis_pass` and :ref:`transformation_pass` or at chapter :ref:`tutorials` about the provided Jupyter notebooks. -This page is about the network preparation, the flow step that comes after the :ref:`brevitas_export`. Its main idea is to optimize the network and convert the nodes to custom nodes that correspond to `finn-hlslib <https://github.com/Xilinx/finn-hlslib>`_ functions. In this way we get a network that we can bring to hardware with the help of Vivado. For that we have to apply several transformations on the ONNX model, which this flow step receives wrapped in the :ref:`modelwrapper`. +This page is about the network preparation, the flow step that comes after the :ref:`brevitas_export`. Its main idea is to optimize the network and convert the nodes to custom nodes that correspond to `finn-hlslib <https://github.com/Xilinx/finn-hlslib>`_ functions. In this way we get a network that we can bring to hardware with the help of Vitis and Vivado. For that we have to apply several transformations on the ONNX model, which this flow step receives wrapped in the :ref:`modelwrapper`. Various transformations are involved in the network preparation. The following is a short overview of these. diff --git a/docs/finn/source_code/finn.analysis.fpgadataflow.rst b/docs/finn/source_code/finn.analysis.fpgadataflow.rst index b52e994ee6033d4c3c1aae6400e20e103455d7b6..57472cb670b6fa6cb95e6c137458d3a522f82f5a 100644 --- a/docs/finn/source_code/finn.analysis.fpgadataflow.rst +++ b/docs/finn/source_code/finn.analysis.fpgadataflow.rst @@ -30,6 +30,7 @@ finn.analysis.fpgadataflow.floorplan\_params :undoc-members: :show-inheritance: + finn.analysis.fpgadataflow.hls\_synth\_res\_estimation ------------------------------------------------------------- @@ -38,14 +39,15 @@ finn.analysis.fpgadataflow.hls\_synth\_res\_estimation :undoc-members: :show-inheritance: - finn.analysis.fpgadataflow.op\_and\_param\_counts - -------------------------------------------------- +finn.analysis.fpgadataflow.op\_and\_param\_counts +-------------------------------------------------- - .. automodule:: finn.analysis.fpgadataflow.op_and_param_counts +.. 
automodule:: finn.analysis.fpgadataflow.op_and_param_counts :members: :undoc-members: :show-inheritance: + finn.analysis.fpgadataflow.post\_synth\_res -------------------------------------------------- @@ -54,6 +56,7 @@ finn.analysis.fpgadataflow.post\_synth\_res :undoc-members: :show-inheritance: + finn.analysis.fpgadataflow.res\_estimation ------------------------------------------------- diff --git a/docs/finn/source_code/finn.builder.rst b/docs/finn/source_code/finn.builder.rst index 2433cab83d1aa140010f4082ec8323bdaa8c6ff4..caadf3f91f7c9aa06f04be356e9c3594fc208d2d 100644 --- a/docs/finn/source_code/finn.builder.rst +++ b/docs/finn/source_code/finn.builder.rst @@ -9,9 +9,9 @@ finn.builder.build\_dataflow ---------------------------- .. automodule:: finn.builder.build_dataflow - :members: - :undoc-members: - :show-inheritance: + :members: + :undoc-members: + :show-inheritance: finn.builder.build\_dataflow\_config ------------------------------------ @@ -26,6 +26,6 @@ finn.builder.build\_dataflow\_steps ------------------------------------ .. automodule:: finn.builder.build_dataflow_steps - :members: - :undoc-members: - :show-inheritance: + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/finn/source_code/finn.core.rst b/docs/finn/source_code/finn.core.rst index 4e3de458e153871d1d5969442af5940dc1673ecd..afa1ecffa08213db6a282076c6fdf59694f9e13e 100644 --- a/docs/finn/source_code/finn.core.rst +++ b/docs/finn/source_code/finn.core.rst @@ -37,6 +37,15 @@ qonnx.core.modelwrapper :undoc-members: :show-inheritance: +qonnx.core.onnx\_exec +--------------------------- + +.. automodule:: qonnx.core.onnx_exec + :members: + :undoc-members: + :show-inheritance: + + finn.core.onnx\_exec --------------------------- diff --git a/docs/finn/source_code/finn.custom_op.fpgadataflow.rst b/docs/finn/source_code/finn.custom_op.fpgadataflow.rst index cc56ea603e589d7000fe5b2b2943e67cdb90c884..fdcf44c6d99561658b727dc64c0a1b98b247c7df 100644 --- a/docs/finn/source_code/finn.custom_op.fpgadataflow.rst +++ b/docs/finn/source_code/finn.custom_op.fpgadataflow.rst @@ -8,7 +8,7 @@ HLS Custom Op Nodes Base Class ---------- -.. automodule:: finn.custom_op.fpgadataflow +.. automodule:: finn.custom_op.fpgadataflow.hlscustomop :members: :undoc-members: :show-inheritance: @@ -29,9 +29,25 @@ finn.custom\_op.fpgadataflow.channelwise\_op\_batch :undoc-members: :show-inheritance: +finn.custom\_op.fpgadataflow.checksum +-------------------------------------- + +.. automodule:: finn.custom_op.fpgadataflow.checksum + :members: + :undoc-members: + :show-inheritance: + +finn.custom\_op.fpgadataflow.concat +------------------------------------- + +.. automodule:: finn.custom_op.fpgadataflow.concat + :members: + :undoc-members: + :show-inheritance: + finn.custom\_op.fpgadataflow.convolutioninputgenerator -------------------------------------------------------------- +-------------------------------------------------------- .. automodule:: finn.custom_op.fpgadataflow.convolutioninputgenerator :members: @@ -46,6 +62,15 @@ finn.custom\_op.fpgadataflow.convolutioninputgenerator1d :undoc-members: :show-inheritance: + +finn.custom\_op.fpgadataflow.convolutioninputgenerator\_rtl +------------------------------------------------------------ + +.. 
automodule:: finn.custom_op.fpgadataflow.convolutioninputgenerator_rtl + :members: + :undoc-members: + :show-inheritance: + finn.custom\_op.fpgadataflow.downsampler ----------------------------------------- @@ -62,6 +87,16 @@ finn.custom\_op.fpgadataflow.duplicatestreams\_batch :undoc-members: :show-inheritance: + +finn.custom\_op.fpgadataflow.eltwise +------------------------------------- + +.. automodule:: finn.custom_op.fpgadataflow.eltwise + :members: + :undoc-members: + :show-inheritance: + + finn.custom\_op.fpgadataflow.fmpadding\_batch ----------------------------------------------- @@ -79,7 +114,7 @@ finn.custom\_op.fpgadataflow.globalaccpool\_batch :show-inheritance: finn.custom\_op.fpgadataflow.iodma ------------------------------------------------ +------------------------------------ .. automodule:: finn.custom_op.fpgadataflow.iodma :members: @@ -102,6 +137,15 @@ finn.custom\_op.fpgadataflow.lookup :undoc-members: :show-inheritance: +finn.custom\_op.fpgadataflow.matrixvectoractivation +----------------------------------------------------------- + +.. automodule:: finn.custom_op.fpgadataflow.matrixvectoractivation + :members: + :undoc-members: + :show-inheritance: + + finn.custom\_op.fpgadataflow.pool\_batch ----------------------------------------------- @@ -127,14 +171,6 @@ finn.custom\_op.fpgadataflow.streamingdatawidthconverter\_batch :undoc-members: :show-inheritance: -finn.custom\_op.fpgadataflow.matrixvectoractivation ------------------------------------------------------------ - -.. automodule:: finn.custom_op.fpgadataflow.matrixvectoractivation - :members: - :undoc-members: - :show-inheritance: - finn.custom\_op.fpgadataflow.streamingfifo ------------------------------------------------- diff --git a/docs/finn/source_code/finn.custom_op.rst b/docs/finn/source_code/finn.custom_op.rst index 20d90a7bb596d6ce5638d9b2d9bae8a5c7e5c723..cdbe957c713ef6916e4ed7baabe09135f71fdeef 100644 --- a/docs/finn/source_code/finn.custom_op.rst +++ b/docs/finn/source_code/finn.custom_op.rst @@ -9,6 +9,7 @@ Submodules :maxdepth: 2 finn.custom_op.fpgadataflow + qonnx.custom_op.channels_last qonnx.custom_op.general Custom Op Nodes diff --git a/docs/finn/source_code/finn.transformation.fpgadataflow.rst b/docs/finn/source_code/finn.transformation.fpgadataflow.rst index b1e7075bdcfb675a894f3e66b61d59117e4f078d..9f8ec079309f16daa022e14317ebddfd7758d639 100644 --- a/docs/finn/source_code/finn.transformation.fpgadataflow.rst +++ b/docs/finn/source_code/finn.transformation.fpgadataflow.rst @@ -62,6 +62,14 @@ finn.transformation.fpgadataflow.create\_stitched\_ip :undoc-members: :show-inheritance: +finn.transformation.fpgadataflow.derive\_characteristic +------------------------------------------------------------ + +.. automodule:: finn.transformation.fpgadataflow.derive_characteristic + :members: + :undoc-members: + :show-inheritance: + finn.transformation.fpgadataflow.externalize\_params ------------------------------------------------------------ @@ -103,6 +111,17 @@ finn.transformation.fpgadataflow.insert\_fifo :undoc-members: :show-inheritance: + +finn.transformation.fpgadataflow.insert\_hook +---------------------------------------------------- + +.. 
automodule:: finn.transformation.fpgadataflow.insert_hook + :members: + :undoc-members: + :show-inheritance: + + + finn.transformation.fpgadataflow.insert\_iodma ---------------------------------------------------- diff --git a/docs/finn/source_code/finn.transformation.rst b/docs/finn/source_code/finn.transformation.rst index 6a28eeedb2aa547ba80677864ae9fb8c6aa64097..f42b595a50ec90ef055e2818d66f4b2410c25594 100644 --- a/docs/finn/source_code/finn.transformation.rst +++ b/docs/finn/source_code/finn.transformation.rst @@ -20,7 +20,7 @@ Transformation Passes Base Class ---------- -.. automodule:: finn.transformation +.. automodule:: qonnx.transformation.base :members: :undoc-members: :show-inheritance: @@ -42,7 +42,7 @@ qonnx.transformation.bipolar\_to\_xnor :show-inheritance: qonnx.transformation.change\_3d\_tensors\_to\_4d ------------------------------------------------- +------------------------------------------------- .. automodule:: qonnx.transformation.change_3d_tensors_to_4d :members: @@ -57,8 +57,18 @@ qonnx.transformation.change\_datalayout :undoc-members: :show-inheritance: + +qonnx.transformation.channels\_last +-------------------------------------------- + +.. automodule:: qonnx.transformation.channels_last + :members: + :undoc-members: + :show-inheritance: + + qonnx.transformation.create\_generic\_partitions ------------------------------------------------- +------------------------------------------------- .. automodule:: qonnx.transformation.create_generic_partitions :members: @@ -171,13 +181,22 @@ qonnx.transformation.merge\_onnx\_models :show-inheritance: -finn.transformation.move\_reshape +qonnx.transformation.quant\_constant\_folding +---------------------------------------------- + +.. automodule:: qonnx.transformation.quant_constant_folding + :members: + :undoc-members: + :show-inheritance: + + +qonnx.transformation.rebalance\_conv ---------------------------------------- -.. automodule:: finn.transformation.move_reshape - :members: - :undoc-members: - :show-inheritance: +.. automodule:: qonnx.transformation.rebalance_conv + :members: + :undoc-members: + :show-inheritance: qonnx.transformation.remove ------------------------------------- @@ -186,3 +205,12 @@ qonnx.transformation.remove :members: :undoc-members: :show-inheritance: + + +finn.transformation.move\_reshape +---------------------------------------- + +.. automodule:: finn.transformation.move_reshape + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/finn/source_code/finn.util.rst b/docs/finn/source_code/finn.util.rst index 8dffa016327c3bbe50f21278c859c83556b2b213..7ba3b252abfa0086a8c0281eb9a792fb239d6ec3 100644 --- a/docs/finn/source_code/finn.util.rst +++ b/docs/finn/source_code/finn.util.rst @@ -14,6 +14,15 @@ qonnx.util.basic :show-inheritance: +qonnx.util.cleanup +---------------------- + +.. automodule:: qonnx.util.cleanup + :members: + :undoc-members: + :show-inheritance: + + qonnx.util.config -------------------- @@ -22,6 +31,40 @@ qonnx.util.config :undoc-members: :show-inheritance: +qonnx.util.exec\_qonnx +---------------------- + +.. automodule:: qonnx.util.exec_qonnx + :members: + :undoc-members: + :show-inheritance: + +qonnx.util.inference\_cost +-------------------------- + +.. automodule:: qonnx.util.inference_cost + :members: + :undoc-members: + :show-inheritance: + +qonnx.util.onnx +------------------- + +.. automodule:: qonnx.util.onnx + :members: + :undoc-members: + :show-inheritance: + + +qonnx.util.to\_channels\_last +------------------------------ + +.. 
automodule:: qonnx.util.to_channels_last + :members: + :undoc-members: + :show-inheritance: + + finn.util.basic ---------------------- @@ -64,6 +107,15 @@ finn.util.gdrive :undoc-members: :show-inheritance: +finn.util.hls +--------------- + +.. automodule:: finn.util.hls + :members: + :undoc-members: + :show-inheritance: + + finn.util.imagenet ----------------------------- @@ -72,14 +124,6 @@ finn.util.imagenet :undoc-members: :show-inheritance: -qonnx.util.onnx ---------------------- - -.. automodule:: qonnx.util.onnx - :members: - :undoc-members: - :show-inheritance: - finn.util.platforms -------------------- diff --git a/docs/finn/source_code/modules.rst b/docs/finn/source_code/modules.rst deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/docs/finn/source_code/qonnx.custom_op.channels_last.rst b/docs/finn/source_code/qonnx.custom_op.channels_last.rst new file mode 100644 index 0000000000000000000000000000000000000000..3ad10d94a6b34a99e2213994a75b0f063fd3d36f --- /dev/null +++ b/docs/finn/source_code/qonnx.custom_op.channels_last.rst @@ -0,0 +1,41 @@ +************************** +Custom Op - Channels Last +************************** + +Channels Last Custom Ops +========================= + +qonnx.custom\_op.channels\_last.base\_wrapped\_op +-------------------------------------------------- + +.. automodule:: qonnx.custom_op.channels_last.base_wrapped_op + :members: + :undoc-members: + :show-inheritance: + + +qonnx.custom\_op.channels\_last.batch\_normalization +------------------------------------------------------ + +.. automodule:: qonnx.custom_op.channels_last.batch_normalization + :members: + :undoc-members: + :show-inheritance: + + +qonnx.custom\_op.channels\_last.conv +-------------------------------------- + +.. automodule:: qonnx.custom_op.channels_last.conv + :members: + :undoc-members: + :show-inheritance: + + +qonnx.custom\_op.channels\_last.max\_pool +------------------------------------------ + +.. automodule:: qonnx.custom_op.channels_last.max_pool + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/finn/tutorials.rst b/docs/finn/tutorials.rst index 110f77c5b10d2415ac2d2ff7b716567ec5cb76fa..7ac54501cf22a0b123b7b3d156a6a437e8045f22 100644 --- a/docs/finn/tutorials.rst +++ b/docs/finn/tutorials.rst @@ -46,3 +46,8 @@ The notebooks in this folder are more developer oriented. They should help you t * 2_custom_op * Explains the basics of FINN custom ops and how to define a new one. + +FINN Example FPGA Flow Using MNIST Numerals +============================================ + +In addition to the Jupyter notebooks above, there is a tutorial about the command-line build_dataflow `here <https://github.com/Xilinx/finn/tree/main/tutorials/fpga_flow>`_, which shows how to bring a FINN-compiled model into the Vivado FPGA design environment. diff --git a/fetch-repos.sh b/fetch-repos.sh index b0f6400ed142b203b1c9f6d7ea4ac6ababcf34d1..dd23c33e1bfc17a49ec43158291d6e3b4c1a0c89 100755 --- a/fetch-repos.sh +++ b/fetch-repos.sh @@ -27,15 +27,16 @@ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-QONNX_COMMIT="f702b17cdb9d5e57f85f43a5d33890647e063de6" +QONNX_COMMIT="dd35a8ff49d7225a07ffceeebe25a6361df48349" FINN_EXP_COMMIT="9cbd2787b5160e2b44e0e8164a0df1457dbd5366" BREVITAS_COMMIT="a5b71d6de1389d3e7db898fef72e014842670f03" PYVERILATOR_COMMIT="766e457465f5c0dd315490d7b9cc5d74f9a76f4f" CNPY_COMMIT="4e8810b1a8637695171ed346ce68f6984e585ef4" -HLSLIB_COMMIT="d27f6b6c5d8f1bb208db395659389603f63ad4be" +HLSLIB_COMMIT="4ddfa00b07275a3f1de1c13409e6acb489115fe2" OMX_COMMIT="d1065a788219ca0eb54d5e57600b1f9d7f67d4cc" AVNET_BDF_COMMIT="2d49cfc25766f07792c0b314489f21fe916b639b" XIL_BDF_COMMIT="8cf4bb674a919ac34e3d99d8d71a9e60af93d14e" +KV260_BDF_COMMIT="98e0d3efc901f0b974006bc4370c2a7ad8856c79" EXP_BOARD_FILES_MD5="30eecc497c31050bd46d10ea20eba232" QONNX_URL="https://github.com/fastmachinelearning/qonnx.git" @@ -47,6 +48,7 @@ HLSLIB_URL="https://github.com/Xilinx/finn-hlslib.git" OMX_URL="https://github.com/maltanar/oh-my-xilinx.git" AVNET_BDF_URL="https://github.com/Avnet/bdf.git" XIL_BDF_URL="https://github.com/Xilinx/XilinxBoardStore.git" +KV260_BDF_URL="https://github.com/Xilinx/XilinxBoardStore.git" QONNX_DIR="qonnx" FINN_EXP_DIR="finn-experimental" @@ -57,6 +59,7 @@ HLSLIB_DIR="finn-hlslib" OMX_DIR="oh-my-xilinx" AVNET_BDF_DIR="avnet-bdf" XIL_BDF_DIR="xil-bdf" +KV260_SOM_BDF_DIR="kv260-som-bdf" # absolute path to this script, e.g. /home/user/bin/foo.sh SCRIPT=$(readlink -f "$0") @@ -104,6 +107,7 @@ fetch_board_files() { unzip -q pynq-z2.zip cp -r $SCRIPTPATH/deps/$AVNET_BDF_DIR/* $SCRIPTPATH/deps/board_files/ cp -r $SCRIPTPATH/deps/$XIL_BDF_DIR/boards/Xilinx/rfsoc2x2 $SCRIPTPATH/deps/board_files/; + cp -r $SCRIPTPATH/deps/$KV260_SOM_BDF_DIR/boards/Xilinx/kv260_som $SCRIPTPATH/deps/board_files/; cd $OLD_PWD } @@ -116,6 +120,7 @@ fetch_repo $HLSLIB_URL $HLSLIB_COMMIT $HLSLIB_DIR fetch_repo $OMX_URL $OMX_COMMIT $OMX_DIR fetch_repo $AVNET_BDF_URL $AVNET_BDF_COMMIT $AVNET_BDF_DIR fetch_repo $XIL_BDF_URL $XIL_BDF_COMMIT $XIL_BDF_DIR +fetch_repo $KV260_BDF_URL $KV260_BDF_COMMIT $KV260_SOM_BDF_DIR # download extra Pynq board files and extract if needed if [ ! -d "$SCRIPTPATH/deps/board_files" ]; then diff --git a/finn-rtllib/fmpadding/hdl/axi2we.sv b/finn-rtllib/fmpadding/hdl/axi2we.sv new file mode 100644 index 0000000000000000000000000000000000000000..842ba3632c4224d58f87c66e1affc4c028b60ef3 --- /dev/null +++ b/finn-rtllib/fmpadding/hdl/axi2we.sv @@ -0,0 +1,122 @@ +/****************************************************************************** + * Copyright (C) 2022, Advanced Micro Devices, Inc. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + * this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; + * OR BUSINESS INTERRUPTION). HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR + * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF + * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * @brief AXI-Lite adapter for trivial write enable interface. + * @author Thomas B. Preußer <tpreusse@amd.com> + *****************************************************************************/ + +module axi2we #( + int unsigned ADDR_BITS +)( + //- Global Control ------------------ + input logic ap_clk, + input logic ap_rst_n, + + //- AXI Lite ------------------------ + // Writing + input s_axilite_AWVALID, + output s_axilite_AWREADY, + input [ADDR_BITS-1:0] s_axilite_AWADDR, + + input s_axilite_WVALID, + output s_axilite_WREADY, + input [31:0] s_axilite_WDATA, + input [ 3:0] s_axilite_WSTRB, + + output s_axilite_BVALID, + input s_axilite_BREADY, + output [1:0] s_axilite_BRESP, + + // Reading tied to all-ones + input s_axilite_ARVALID, + output s_axilite_ARREADY, + input [ADDR_BITS-1:0] s_axilite_ARADDR, + + output s_axilite_RVALID, + input s_axilite_RREADY, + output [31:0] s_axilite_RDATA, + output [ 1:0] s_axilite_RRESP, + + // Write Enable Interface + output logic we, + output logic [ADDR_BITS-1:0] wa, + output logic [ 31:0] wd +); + + uwire clk = ap_clk; + uwire rst = !ap_rst_n; + + + logic WABusy = 0; + logic WDBusy = 0; + logic [ADDR_BITS-1:0] Addr = 'x; + logic [ 31:0] Data = 'x; + + assign we = WABusy && WDBusy && s_axilite_BREADY; + assign wa = Addr; + assign wd = Data; + + uwire clr_wr = rst || we; + always_ff @(posedge clk) begin + if(clr_wr) begin + WABusy <= 0; + Addr <= 'x; + WDBusy <= 0; + Data <= 'x; + end + else begin + if(!WABusy) begin + WABusy <= s_axilite_AWVALID; + Addr <= s_axilite_AWADDR; + end + if(!WDBusy) begin + WDBusy <= s_axilite_WVALID; + Data <= s_axilite_WDATA; + end + end + end + assign s_axilite_AWREADY = !WABusy; + assign s_axilite_WREADY = !WDBusy; + assign s_axilite_BVALID = WABusy && WDBusy; + assign s_axilite_BRESP = '0; // OK + + // Answer all reads with '1 + logic RValid = 0; + uwire clr_rd = rst || (RValid && s_axilite_RREADY); + always_ff @(posedge clk) begin + if(clr_rd) RValid <= 0; + else if(!RValid) RValid <= s_axilite_ARVALID; + end + assign s_axilite_ARREADY = !RValid; + assign s_axilite_RVALID = RValid; + assign s_axilite_RDATA = '1; + assign s_axilite_RRESP = '0; // OK + +endmodule : axi2we diff --git a/finn-rtllib/fmpadding/hdl/fmpadding.sv b/finn-rtllib/fmpadding/hdl/fmpadding.sv new file mode 100644 index 0000000000000000000000000000000000000000..904c7c381f7b2499fc354ebf798e86edab262866 --- /dev/null +++ b/finn-rtllib/fmpadding/hdl/fmpadding.sv @@ -0,0 +1,224 @@ +/****************************************************************************** + * Copyright (C) 2022, Advanced Micro Devices, Inc. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + * this list of conditions and the following disclaimer. + * + * 2.
Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; + * OR BUSINESS INTERRUPTION). HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR + * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF + * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * @brief Feature map padding. + * @author Thomas B. Preußer <tpreusse@amd.com> + *****************************************************************************/ + +module fmpadding #( + int unsigned XCOUNTER_BITS, + int unsigned YCOUNTER_BITS, + int unsigned NUM_CHANNELS, + int unsigned SIMD, + int unsigned ELEM_BITS, + int unsigned INIT_XON, + int unsigned INIT_XOFF, + int unsigned INIT_XEND, + int unsigned INIT_YON, + int unsigned INIT_YOFF, + int unsigned INIT_YEND, + + localparam int unsigned STREAM_BITS = 8*(1 + (SIMD*ELEM_BITS-1)/8) +)( + //- Global Control ------------------ + input logic ap_clk, + input logic ap_rst_n, + + // Parameter Configuration ---------- + input logic we, + input logic [ 4:0] wa, + input logic [31:0] wd, + + //- AXI Stream - Input -------------- + output logic s_axis_tready, + input logic s_axis_tvalid, + input logic [STREAM_BITS-1:0] s_axis_tdata, + + //- AXI Stream - Output ------------- + input logic m_axis_tready, + output logic m_axis_tvalid, + output logic [STREAM_BITS-1:0] m_axis_tdata +); + + uwire clk = ap_clk; + uwire rst = !ap_rst_n; + + //----------------------------------------------------------------------- + // Parameter Sanity Checking + initial begin + automatic bit fail = 0; + + if(XCOUNTER_BITS < $clog2(1+INIT_XEND)) begin + $error("XCounter size too small to accommodate end count."); + fail = 1; + end + if(XCOUNTER_BITS < $clog2(1+INIT_XON)) begin + $error("XCounter size too small to accommodate ON count."); + fail = 1; + end + if(XCOUNTER_BITS < $clog2(1+INIT_XOFF)) begin + $error("XCounter size too small to accommodate OFF count."); + fail = 1; + end + if(YCOUNTER_BITS < $clog2(1+INIT_YEND)) begin + $error("YCounter size too small to accommodate end count."); + fail = 1; + end + if(YCOUNTER_BITS < $clog2(1+INIT_YON)) begin + $error("YCounter size too small to accommodate ON count."); + fail = 1; + end + if(YCOUNTER_BITS < $clog2(1+INIT_YOFF)) begin + $error("YCounter size too small to accommodate OFF count."); + fail = 1; + end + + if((INIT_XEND < INIT_XON) || (INIT_XOFF <= INIT_XON)) begin + $warning("Initial empty X output range."); + end + if((INIT_YEND < INIT_YON) || (INIT_YOFF <= INIT_YON)) begin + $warning("Initial empty Y output range."); + end + + if(fail) $finish(); + end + + 
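// Register file write address map (byte offsets on the AXI-Lite write channel, matching the case statement below): 0x00: XOn, 0x04: XOff, 0x08: XEnd, 0x0C: YOn, 0x10: YOff, 0x14: YEnd +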
//----------------------------------------------------------------------- + // Dynamically configurable state + typedef logic [XCOUNTER_BITS-1:0] xcount_t; + xcount_t XEnd = INIT_XEND; + xcount_t XOn = INIT_XON; + xcount_t XOff = INIT_XOFF; + + typedef logic [YCOUNTER_BITS-1:0] ycount_t; + ycount_t YEnd = INIT_YEND; + ycount_t YOn = INIT_YON; + ycount_t YOff = INIT_YOFF; + + always_ff @(posedge clk) begin + if(we) begin + unique case(wa) + 0*4: XOn <= wd; + 1*4: XOff <= wd; + 2*4: XEnd <= wd; + 3*4: YOn <= wd; + 4*4: YOff <= wd; + 5*4: YEnd <= wd; + + default: assert(0) else begin + $error("Illegal write address."); + $stop; + end + endcase + end + end + + //----------------------------------------------------------------------- + // Cascaded enables for the nested counters: SCount, XCount, YCount + uwire sen; + uwire xen; + uwire yen; + + //- S-Counter: SIMD fold ------------ + initial begin + if((NUM_CHANNELS < 1) || (NUM_CHANNELS % SIMD != 0)) begin + $error("Channel count must be SIMD multiple."); + $finish; + end + end + // Count SF-2, SF-3, ..., 1, 0, -1 + localparam int unsigned SF = NUM_CHANNELS/SIMD; + typedef logic [$clog2(SF-1):0] scount_t; + scount_t SCount = SF-2; + + assign xen = sen && SCount[$left(SCount)]; + uwire sclr = rst || xen; + always_ff @(posedge clk) begin + if(sclr) SCount <= SF-2; + else if(sen) SCount <= SCount - 1; + end + + //- X-Counter: image width ---------- + xcount_t XCount = 0; + + assign yen = xen && (XCount == XEnd); + uwire xclr = rst || yen; + always_ff @(posedge clk) begin + if(xclr) XCount <= 0; + else if(xen) XCount <= XCount + 1; + end + uwire xfwd = (XOn <= XCount) && (XCount < XOff); + + //- Y-Counter: image height --------- + ycount_t YCount = 0; + + uwire yclr = rst || (yen && (YCount == YEnd)); + always_ff @(posedge clk) begin + if(yclr) YCount <= 0; + else if(yen) YCount <= YCount + 1; + end + uwire yfwd = (YOn <= YCount) && (YCount < YOff); + + //----------------------------------------------------------------------- + // Input forwarding and edge padding + typedef struct { + logic vld; + logic [STREAM_BITS-1:0] dat; + } buf_t; + buf_t A = '{ vld: 0, dat: 'x }; + buf_t B = '{ vld: 0, dat: 'x }; + + uwire fwd = xfwd && yfwd; + assign sen = (m_axis_tready || !B.vld) && (s_axis_tvalid || A.vld || !fwd); + assign s_axis_tready = !A.vld; + assign m_axis_tvalid = B.vld; + assign m_axis_tdata = B.dat; + + always_ff @(posedge clk) begin + if(rst) begin + B <= '{ vld: 0, dat: 'x }; + end + else if(m_axis_tready || !B.vld) begin + B.vld <= s_axis_tvalid || A.vld || !fwd; + B.dat <= !fwd? '0 : A.vld? A.dat : s_axis_tdata; + end + end + + always_ff @(posedge clk) begin + if(rst) begin + A <= '{ vld: 0, dat: 'x }; + end + else begin + A.vld <= (A.vld || s_axis_tvalid) && ((B.vld && !m_axis_tready) || !fwd); + if(!A.vld) A.dat <= s_axis_tdata; + end + end + +endmodule : fmpadding diff --git a/finn-rtllib/fmpadding/hdl/fmpadding_axi.sv b/finn-rtllib/fmpadding/hdl/fmpadding_axi.sv new file mode 100644 index 0000000000000000000000000000000000000000..5948341d000a1dd82ff363b36557f897d3a064c7 --- /dev/null +++ b/finn-rtllib/fmpadding/hdl/fmpadding_axi.sv @@ -0,0 +1,123 @@ +/****************************************************************************** + * Copyright (C) 2022, Advanced Micro Devices, Inc. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. 
Redistributions of source code must retain the above copyright notice, + * this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; + * OR BUSINESS INTERRUPTION). HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR + * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF + * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * @brief Feature map padding. + * @author Thomas B. Preußer <tpreusse@amd.com> + *****************************************************************************/ + +module fmpadding_axi #( + int unsigned XCOUNTER_BITS, + int unsigned YCOUNTER_BITS, + int unsigned NUM_CHANNELS, + int unsigned SIMD, + int unsigned ELEM_BITS, + int unsigned INIT_XON, + int unsigned INIT_XOFF, + int unsigned INIT_XEND, + int unsigned INIT_YON, + int unsigned INIT_YOFF, + int unsigned INIT_YEND, + + localparam int unsigned STREAM_BITS = 8*(1 + (SIMD*ELEM_BITS-1)/8) +)( + //- Global Control ------------------ + input logic ap_clk, + input logic ap_rst_n, + + //- AXI Lite ------------------------ + // Writing + input s_axilite_AWVALID, + output s_axilite_AWREADY, + input [4:0] s_axilite_AWADDR, + + input s_axilite_WVALID, + output s_axilite_WREADY, + input [31:0] s_axilite_WDATA, + input [ 3:0] s_axilite_WSTRB, + + output s_axilite_BVALID, + input s_axilite_BREADY, + output [1:0] s_axilite_BRESP, + + // Reading + input s_axilite_ARVALID, + output s_axilite_ARREADY, + input [4:0] s_axilite_ARADDR, + + output s_axilite_RVALID, + input s_axilite_RREADY, + output [31:0] s_axilite_RDATA, + output [ 1:0] s_axilite_RRESP, + + //- AXI Stream - Input -------------- + output logic s_axis_tready, + input logic s_axis_tvalid, + input logic [STREAM_BITS-1:0] s_axis_tdata, + + //- AXI Stream - Output ------------- + input logic m_axis_tready, + output logic m_axis_tvalid, + output logic [STREAM_BITS-1:0] m_axis_tdata +); + + // AXI-Lite Adapter + uwire we; + uwire [ 4:0] wa; + uwire [31:0] wd; + axi2we #(.ADDR_BITS(5)) axilight_adapter ( + .ap_clk, .ap_rst_n, + + .s_axilite_AWVALID, .s_axilite_AWREADY, .s_axilite_AWADDR, + .s_axilite_WVALID, .s_axilite_WREADY, .s_axilite_WDATA, .s_axilite_WSTRB, + .s_axilite_BVALID, .s_axilite_BREADY, .s_axilite_BRESP, + + .s_axilite_ARVALID, .s_axilite_ARREADY, .s_axilite_ARADDR, + .s_axilite_RVALID, .s_axilite_RREADY, .s_axilite_RDATA, .s_axilite_RRESP, + + .we, .wa, .wd + ); + + // Actual Padding + fmpadding #( + .XCOUNTER_BITS(XCOUNTER_BITS), .YCOUNTER_BITS(YCOUNTER_BITS), + .NUM_CHANNELS(NUM_CHANNELS), .SIMD(SIMD), + .INIT_XON(INIT_XON), 
.INIT_XOFF(INIT_XOFF), .INIT_XEND(INIT_XEND), + .INIT_YON(INIT_YON), .INIT_YOFF(INIT_YOFF), .INIT_YEND(INIT_YEND), + .ELEM_BITS(ELEM_BITS) + ) padding ( + .ap_clk, .ap_rst_n, + + .we, .wa, .wd, + + .s_axis_tready, .s_axis_tvalid, .s_axis_tdata, + .m_axis_tready, .m_axis_tvalid, .m_axis_tdata + ); + +endmodule : fmpadding_axi diff --git a/finn-rtllib/fmpadding/hdl/fmpadding_axi_tb.sv b/finn-rtllib/fmpadding/hdl/fmpadding_axi_tb.sv new file mode 100644 index 0000000000000000000000000000000000000000..741689b3a7af7ad4d07f2af569f71135c1d35c7b --- /dev/null +++ b/finn-rtllib/fmpadding/hdl/fmpadding_axi_tb.sv @@ -0,0 +1,154 @@ + +module fmpadding_axi_tb #( + int unsigned XCOUNTER_BITS = 8, + int unsigned YCOUNTER_BITS = 8, + int unsigned NUM_CHANNELS = 4, + int unsigned SIMD = 2, + int unsigned ELEM_BITS = 4 +)(); + localparam int unsigned STREAM_BITS = 8*(1 + (SIMD*ELEM_BITS-1)/8); + + //- Global Control ------------------ + logic clk = 0; + always #5ns clk = !clk; + logic rst; + + // AXI-Lite for Parameter Configuration + logic s_axilite_AWVALID; + uwire s_axilite_AWREADY; + logic [4:0] s_axilite_AWADDR; + + logic s_axilite_WVALID; + uwire s_axilite_WREADY; + logic [31:0] s_axilite_WDATA; + + //- AXI Stream - Input -------------- + uwire s_axis_tready; + logic s_axis_tvalid; + logic [STREAM_BITS-1:0] s_axis_tdata; + + //- AXI Stream - Output ------------- + logic m_axis_tready; + uwire m_axis_tvalid; + uwire [STREAM_BITS-1:0] m_axis_tdata; + + + // DUT + fmpadding_axi #( + .XCOUNTER_BITS(XCOUNTER_BITS), + .YCOUNTER_BITS(YCOUNTER_BITS), + .NUM_CHANNELS(NUM_CHANNELS), + .SIMD(SIMD), + .INIT_XON(0), .INIT_XOFF(0), .INIT_XEND(0), + .INIT_YON(0), .INIT_YOFF(0), .INIT_YEND(0), + .ELEM_BITS(ELEM_BITS) + ) dut ( + .ap_clk(clk), .ap_rst_n(!rst), + + .s_axilite_AWVALID, .s_axilite_AWREADY, .s_axilite_AWADDR, + .s_axilite_WVALID, .s_axilite_WREADY, .s_axilite_WDATA, .s_axilite_WSTRB('1), + .s_axilite_BVALID(), .s_axilite_BREADY('1), .s_axilite_BRESP(), + .s_axilite_ARVALID('0), .s_axilite_ARREADY(), .s_axilite_ARADDR('x), + .s_axilite_RVALID(), .s_axilite_RREADY('0), .s_axilite_RDATA(), .s_axilite_RRESP(), + + .s_axis_tready, .s_axis_tvalid, .s_axis_tdata, + .m_axis_tready, .m_axis_tvalid, .m_axis_tdata + ); + + // Stimuli + localparam int unsigned IMAGES = 2; + localparam int unsigned XSIZE = 10; + localparam int unsigned YSIZE = 7; + localparam int unsigned PAD_LEFT = 2; + localparam int unsigned PAD_RIGHT = 3; + localparam int unsigned PAD_TOP = 1; + localparam int unsigned PAD_BOTTOM = 2; + + task axi_write(input logic [4:0] wa, input logic [31:0] wd); + s_axilite_AWVALID <= 1; + s_axilite_AWADDR <= wa; + @(posedge clk iff s_axilite_AWREADY); + s_axilite_AWVALID <= 0; + s_axilite_AWADDR <= 'x; + + s_axilite_WVALID <= 1; + s_axilite_WDATA <= wd; + @(posedge clk iff s_axilite_WREADY); + s_axilite_WVALID <= 0; + s_axilite_WDATA <= 'x; + endtask : axi_write + + + initial begin + s_axilite_AWVALID = 0; + s_axilite_AWADDR = 'x; + s_axilite_WVALID = 0; + s_axilite_WDATA = 'x; + + s_axis_tvalid = 0; + s_axis_tdata = 'x; + + // Configure Parameters (registers are word-aligned: byte address = register index * 4) + rst = 0; + @(posedge clk); + /* XOn */ axi_write(0*4, PAD_LEFT); + /* XOff */ axi_write(1*4, XSIZE - PAD_RIGHT); + /* XEnd */ axi_write(2*4, XSIZE - 1); + /* YOn */ axi_write(3*4, PAD_TOP); + /* YOff */ axi_write(4*4, YSIZE - PAD_BOTTOM); + /* YEnd */ axi_write(5*4, YSIZE - 1); + @(posedge clk); + rst <= 1; + @(posedge clk); + rst <= 0; + @(posedge clk); + + // Feed data input + s_axis_tvalid <= 1; + for(int unsigned i = 0; i < IMAGES * (XSIZE-PAD_LEFT-PAD_RIGHT) * 
(YSIZE-PAD_TOP-PAD_BOTTOM) * (NUM_CHANNELS/SIMD); i++) begin + s_axis_tdata <= i; + @(posedge clk iff s_axis_tready); + if($urandom()%5 == 0) begin + s_axis_tvalid <= 0; + s_axis_tdata <= 'x; + @(posedge clk); + s_axis_tvalid <= 1; + end + end + s_axis_tvalid <= 0; + s_axis_tdata <= 'x; + end + + // Output Throttler + initial begin + m_axis_tready = 0; + @(posedge clk iff !rst); + m_axis_tready <= 1; + forever @(posedge clk iff m_axis_tvalid) begin + m_axis_tready <= 0; + repeat(4-$clog2(1+$urandom()%15)) @(posedge clk); + m_axis_tready <= 1; + end + end + + // Output logger + initial begin + @(negedge rst); + repeat(IMAGES) begin + for(int unsigned y = 0; y < YSIZE; y++) begin + for(int unsigned x = 0; x < XSIZE; x++) begin + automatic string delim = " "; + for(int unsigned s = 0; s < NUM_CHANNELS/SIMD; s++) begin + @(posedge clk iff m_axis_tvalid && m_axis_tready); + $write("%s%02X", delim, m_axis_tdata); + delim = ":"; + end + end + $display(); + end + $display("----"); + end + $finish; + end + +endmodule : fmpadding_axi_tb diff --git a/finn-rtllib/fmpadding/hdl/fmpadding_template.v b/finn-rtllib/fmpadding/hdl/fmpadding_template.v new file mode 100644 index 0000000000000000000000000000000000000000..0b0f40f86a44ac1d905c89bed5328d6d1ea48876 --- /dev/null +++ b/finn-rtllib/fmpadding/hdl/fmpadding_template.v @@ -0,0 +1,118 @@ +/****************************************************************************** + * Copyright (C) 2022, Advanced Micro Devices, Inc. + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, + * this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; + * OR BUSINESS INTERRUPTION). HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, + * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR + * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF + * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ *****************************************************************************/ + +module $TOP_MODULE_NAME$( +//- Global Control ------------------ +(* X_INTERFACE_PARAMETER = "ASSOCIATED_BUSIF in0_V:out_V:s_axilite" *) +input ap_clk, +(* X_INTERFACE_PARAMETER = "ASSOCIATED_BUSIF in0_V:out_V:s_axilite" *) +input ap_rst_n, + +//- AXI Lite ------------------------ +// Writing +input s_axilite_AWVALID, +output s_axilite_AWREADY, +input [4:0] s_axilite_AWADDR, + +input s_axilite_WVALID, +output s_axilite_WREADY, +input [31:0] s_axilite_WDATA, +input [ 3:0] s_axilite_WSTRB, + +output s_axilite_BVALID, +input s_axilite_BREADY, +output [1:0] s_axilite_BRESP, + +// Reading +input s_axilite_ARVALID, +output s_axilite_ARREADY, +input [4:0] s_axilite_ARADDR, + +output s_axilite_RVALID, +input s_axilite_RREADY, +output [31:0] s_axilite_RDATA, +output [ 1:0] s_axilite_RRESP, + +//- AXI Stream - Input -------------- +output in0_V_TREADY, +input in0_V_TVALID, +input [$STREAM_BITS$-1:0] in0_V_TDATA, + +//- AXI Stream - Output ------------- +input out_V_TREADY, +output out_V_TVALID, +output [$STREAM_BITS$-1:0] out_V_TDATA +); + + +fmpadding_axi #( +.XCOUNTER_BITS($XCOUNTER_BITS$), +.YCOUNTER_BITS($YCOUNTER_BITS$), +.NUM_CHANNELS($NUM_CHANNELS$), +.SIMD($SIMD$), +.ELEM_BITS($ELEM_BITS$), +.INIT_XON($INIT_XON$), +.INIT_XOFF($INIT_XOFF$), +.INIT_XEND($INIT_XEND$), +.INIT_YON($INIT_YON$), +.INIT_YOFF($INIT_YOFF$), +.INIT_YEND($INIT_YEND$) +) +$TOP_MODULE_NAME$_impl +( + .ap_clk(ap_clk), + .ap_rst_n(ap_rst_n), + .s_axilite_AWVALID(s_axilite_AWVALID), + .s_axilite_AWREADY(s_axilite_AWREADY), + .s_axilite_AWADDR(s_axilite_AWADDR), + .s_axilite_WVALID(s_axilite_WVALID), + .s_axilite_WREADY(s_axilite_WREADY), + .s_axilite_WDATA(s_axilite_WDATA), + .s_axilite_WSTRB(s_axilite_WSTRB), + .s_axilite_BVALID(s_axilite_BVALID), + .s_axilite_BREADY(s_axilite_BREADY), + .s_axilite_BRESP(s_axilite_BRESP), + .s_axilite_ARVALID(s_axilite_ARVALID), + .s_axilite_ARREADY(s_axilite_ARREADY), + .s_axilite_ARADDR(s_axilite_ARADDR), + .s_axilite_RVALID(s_axilite_RVALID), + .s_axilite_RREADY(s_axilite_RREADY), + .s_axilite_RDATA(s_axilite_RDATA), + .s_axilite_RRESP(s_axilite_RRESP), + .s_axis_tready(in0_V_TREADY), + .s_axis_tvalid(in0_V_TVALID), + .s_axis_tdata(in0_V_TDATA), + .m_axis_tready(out_V_TREADY), + .m_axis_tvalid(out_V_TVALID), + .m_axis_tdata(out_V_TDATA) +); + +endmodule diff --git a/notebooks/advanced/0_custom_analysis_pass.ipynb b/notebooks/advanced/0_custom_analysis_pass.ipynb index a4ad32ed7f547a4c035b5cbe4da11ebe2565883a..f8444520c3ded795702420d7f86335d0048ef043 100644 --- a/notebooks/advanced/0_custom_analysis_pass.ipynb +++ b/notebooks/advanced/0_custom_analysis_pass.ipynb @@ -137,7 +137,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/advanced/1_custom_transformation_pass.ipynb b/notebooks/advanced/1_custom_transformation_pass.ipynb index e40a534af56352712f20bfb250112aeacfee278f..391e852a71e1109b376abd7bb5d5f9d264d06498 100644 --- a/notebooks/advanced/1_custom_transformation_pass.ipynb +++ b/notebooks/advanced/1_custom_transformation_pass.ipynb @@ -233,7 +233,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/advanced/2_custom_op.ipynb b/notebooks/advanced/2_custom_op.ipynb index c27f8bdca788e6404fbc01e226b06e8cfaaba066..636da64dd52fab81f8d6a763d199e8e13e9e3cc0 100644 
--- a/notebooks/advanced/2_custom_op.ipynb +++ b/notebooks/advanced/2_custom_op.ipynb @@ -8,14 +8,14 @@ "\n", "Suppose that you want to introduce a new (custom) operation type into the FINN compiler. Custom operations in FINN are useful for a variety of things ranging from code generation to functional verification. This is achieved by creating a new Python module for your custom operation that fulfills certain interface specifications.\n", "\n", - "One thing to point out before we start is that **these custom operations are generic** and not really tied to e.g. Vivado HLS or few-bit quantization. As you will see in this notebook, it's possible to provide arbitrary Python/C/C++/... execution and code generation paths for custom nodes.\n", + "One thing to point out before we start is that **these custom operations are generic** and not really tied to e.g. Vitis HLS or few-bit quantization. As you will see in this notebook, it's possible to provide arbitrary Python/C/C++/... execution and code generation paths for custom nodes.\n", "\n", "## The CustomOp base class\n", "\n", "Subclasses of `CustomOp` provide a way of providing custom functionality for ONNX nodes in FINN.\n", "This is the base class for every custom op node used in the framework, so you must create subclasses of `CustomOp` to provide execution, code generation and other functionalities in FINN.\n", "\n", - "Let's start by looking at the `CustomOp` base class itself, which lives in the `finn-base` repository. You can view it [here](https://github.com/Xilinx/finn-base/blob/dev/src/finn/custom_op/base.py). Note that the `finn` Docker container already has `finn-base` set up as a dependency.\n", + "Let's start by looking at the `CustomOp` base class itself, which lives in the `qonnx` repository. You can view it [here](https://github.com/fastmachinelearning/qonnx/blob/main/src/qonnx/custom_op/base.py). Note that the `finn` Docker container already has `qonnx` set up as a dependency.\n", "\n", "Some points of importance:\n", "\n", @@ -23,7 +23,7 @@ "\n", "2. `CustomOp` subclasses need to implement the methods below (those not starting with underscore).\n", "\n", - "3. To be discoverable in the custom op register, `CustomOp` subclasses must set the `domain` field to the name of the Python module they appear in. For instance, to use the custom `Im2Col` op type from [here](https://github.com/Xilinx/finn-base/blob/dev/src/finn/custom_op/general/im2col.py), the ONNX node must use `domain=qonnx.custom_op.general` since its module is located at `finn/custom_op/general/im2col.py`." + "3. To be discoverable in the custom op register, `CustomOp` subclasses must set the `domain` field to the name of the Python module they appear in. For instance, to use the custom `Im2Col` op type from [here](https://github.com/fastmachinelearning/qonnx/blob/main/src/qonnx/custom_op/general/im2col.py), the ONNX node must use `domain=qonnx.custom_op.general` since its module is located at `qonnx/custom_op/general/im2col.py`." ] }, { @@ -130,7 +130,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To make sure our custom op is available, it needs to be registered. The best practice for this is to create a submodule under `finn.custom_op` which includes a `custom_op` dictionary that maps strings (op names) to classes (op implementations). Since we're in a Jupyter notebook we'll just hijack it at runtime like this:" + "To make sure our custom op is available, it needs to be registered. 
The best practice for this is to create a submodule under `qonnx.custom_op` which includes a `custom_op` dictionary that maps strings (op names) to classes (op implementations). Since we're in a Jupyter notebook we'll just hijack it at runtime like this:" ] }, { @@ -178,6 +178,7 @@ "source": [ "from qonnx.core.modelwrapper import ModelWrapper\n", "from onnx import TensorProto\n", + "from qonnx.util.basic import qonnx_make_model\n", "\n", "def make_graph(ishape, exp, op_type = \"MyPythonPowerOp\"):\n", " inp = helper.make_tensor_value_info(\n", @@ -204,7 +205,7 @@ " graph = helper.make_graph(\n", " nodes=[custom_node], name=\"custom_graph\", inputs=[inp], outputs=[outp]\n", " )\n", - " model = helper.make_model(graph, producer_name=\"custom-model\")\n", + " model = qonnx_make_model(graph, producer_name=\"custom-model\")\n", " return ModelWrapper(model)" ] }, @@ -657,7 +658,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/basics/0_how_to_work_with_onnx.ipynb b/notebooks/basics/0_how_to_work_with_onnx.ipynb index 514efd1693d667af896e89902a264ea7e6e01da7..35a83ea97b87bbe78ae1ff58a5ee50a0b0420a8f 100644 --- a/notebooks/basics/0_how_to_work_with_onnx.ipynb +++ b/notebooks/basics/0_how_to_work_with_onnx.ipynb @@ -24,7 +24,7 @@ "source": [ "### How to create a simple ONNX model\n", "\n", - "To explain how to create an ONNX model a simple example with mathematical operations is used. All nodes are from the [standard operations library of ONNX](https://github.com/onnx/onnx/blob/master/docs/Operators.md).\n", + "To explain how to create an ONNX model, a simple example with mathematical operations is used. All nodes are from the [standard operations library of ONNX](https://github.com/onnx/onnx/blob/main/docs/Operators.md).\n", "\n", "First ONNX is imported, then the helper function can be used to make a node." ] @@ -36,6 +36,7 @@ "outputs": [], "source": [ "import onnx\n", + "from qonnx.util.basic import qonnx_make_model\n", "\n", "Add1_node = onnx.helper.make_node(\n", " 'Add',\n", @@ -158,7 +159,7 @@ "metadata": {}, "outputs": [], "source": [ - "onnx_model = onnx.helper.make_model(graph, producer_name=\"simple-model\")\n", + "onnx_model = qonnx_make_model(graph, producer_name=\"simple-model\")\n", "onnx.save(onnx_model, '/tmp/simple_model.onnx')" ] }, @@ -304,7 +305,7 @@ "source": [ "### How to manipulate an ONNX model\n", "\n", - "In the model there are two successive adder nodes. An adder node in ONNX can only add two inputs, but there is also the [**sum**](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Sum) node, which can process more than two inputs. So it would be a reasonable change of the graph to combine the two successive adder nodes to one sum node." + "In the model there are two successive adder nodes. An adder node in ONNX can only add two inputs, but there is also the [**sum**](https://github.com/onnx/onnx/blob/main/docs/Operators.md#Sum) node, which can process more than two inputs. So it would be a reasonable change to the graph to combine the two successive adder nodes into one sum node."
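To make the Add-to-Sum rewrite concrete, here is a minimal, self-contained sketch of the replacement graph built with the plain ONNX helper API. The tensor names and shapes are arbitrary placeholders, and `onnx.helper.make_model` is used instead of the notebooks' `qonnx_make_model` wrapper to keep the sketch dependency-free:

```python
import onnx
from onnx import TensorProto, helper

# a single variadic Sum node replaces two chained two-input Add nodes
sum_node = helper.make_node("Sum", inputs=["in1", "in2", "in3"], outputs=["out"])

def vi(name):
    # hypothetical 4x4 float tensors, just for illustration
    return helper.make_tensor_value_info(name, TensorProto.FLOAT, [4, 4])

graph = helper.make_graph(
    nodes=[sum_node], name="sum_graph",
    inputs=[vi("in1"), vi("in2"), vi("in3")], outputs=[vi("out")],
)
model = helper.make_model(graph, producer_name="sum-example")
onnx.checker.check_model(model)  # Sum legally accepts three inputs
```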
] }, { @@ -550,7 +551,7 @@ "metadata": {}, "outputs": [], "source": [ - "onnx_model1 = onnx.helper.make_model(graph, producer_name=\"simple-model1\")\n", + "onnx_model1 = qonnx_make_model(graph, producer_name=\"simple-model1\")\n", "onnx.save(onnx_model1, '/tmp/simple_model1.onnx')" ] }, @@ -598,7 +599,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/basics/1_brevitas_network_import.ipynb b/notebooks/basics/1a_brevitas_network_import_via_FINN-ONNX.ipynb similarity index 94% rename from notebooks/basics/1_brevitas_network_import.ipynb rename to notebooks/basics/1a_brevitas_network_import_via_FINN-ONNX.ipynb index 5fb29754dc0ad56c2d07c783cf43102975b1621b..429effca8386d23a5fac0f83cd7cd9f501e77f3e 100644 --- a/notebooks/basics/1_brevitas_network_import.ipynb +++ b/notebooks/basics/1a_brevitas_network_import_via_FINN-ONNX.ipynb @@ -4,7 +4,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Importing Brevitas networks into FINN\n", + "# Importing Brevitas networks into FINN with the FINN-ONNX interchange format\n", + "\n", + "**Note: This notebook is very similar to the 1b notebook, in that it shows the same concepts for the FINN-ONNX ingestion as 1b does for QONNX. Section 1 is identical in both notebooks.**\n", "\n", "In this notebook we'll go through an example of how to import a Brevitas-trained QNN into FINN. The steps will be as follows:\n", "\n", @@ -138,9 +140,9 @@ "outputs": [], "source": [ "import brevitas.onnx as bo\n", - "export_onnx_path = \"/tmp/LFCW1A1.onnx\"\n", + "export_onnx_path = \"/tmp/LFCW1A1_finn-onnx.onnx\"\n", "input_shape = (1, 1, 28, 28)\n", - "bo.export_finn_onnx(lfc, input_shape, export_onnx_path)" + "bo.export_finn_onnx(lfc, input_shape, export_onnx_path);" ] }, { @@ -156,7 +158,7 @@ "metadata": {}, "outputs": [], "source": [ - "showInNetron('/tmp/LFCW1A1.onnx')" + "showInNetron(export_onnx_path)" ] }, { @@ -244,7 +246,7 @@ "from qonnx.transformation.infer_shapes import InferShapes\n", "model = model.transform(InferShapes())\n", "model = model.transform(FoldConstants())\n", - "export_onnx_path_transformed = \"/tmp/LFCW1A1-clean.onnx\"\n", + "export_onnx_path_transformed = \"/tmp/LFCW1A1-finn-onnx-clean.onnx\"\n", "model.save(export_onnx_path_transformed)" ] }, @@ -254,7 +256,7 @@ "metadata": {}, "outputs": [], "source": [ - "showInNetron('/tmp/LFCW1A1-clean.onnx')" + "showInNetron(export_onnx_path_transformed)" ] }, { @@ -297,7 +299,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/basics/1b_brevitas_network_import_via_QONNX.ipynb b/notebooks/basics/1b_brevitas_network_import_via_QONNX.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..fba824dca24b21996a03a1647b2b98c1c1795157 --- /dev/null +++ b/notebooks/basics/1b_brevitas_network_import_via_QONNX.ipynb @@ -0,0 +1,326 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Importing Brevitas networks into FINN with the QONNX interchange format\n", + "\n", + "**Note: This notebook is very similar to the 1a notebook, in that it shows the same concepts for the QONNX ingestion as 1a does for FINN-ONNX. Section 1 is identical in both notebooks.**\n", + "\n", + "In this notebook we'll go through an example of how to import a Brevitas-trained QNN into FINN. The steps will be as follows:\n", + "\n", + "1. 
Load up the trained PyTorch model\n", + "2. Call Brevitas QONNX export and visualize with Netron\n", + "3. Import into FINN and converting QONNX to FINN-ONNX\n", + "\n", + "We'll use the following utility functions to print the source code for function calls (`showSrc()`) and to visualize a network using netron (`showInNetron()`) in the Jupyter notebook:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import onnx\n", + "from finn.util.visualization import showSrc, showInNetron" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Load up the trained PyTorch model\n", + "\n", + "The FINN Docker image comes with several [example Brevitas networks](https://github.com/Xilinx/brevitas/tree/master/src/brevitas_examples/bnn_pynq), and we'll use the LFC-w1a1 model as the example network here. This is a binarized fully connected network trained on the MNIST dataset. Let's start by looking at what the PyTorch network definition looks like:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from brevitas_examples import bnn_pynq\n", + "showSrc(bnn_pynq.models.FC)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the network topology is constructed using a few helper functions that generate the quantized linear layers and quantized activations. The bitwidth of the layers is actually parametrized in the constructor, so let's instantiate a 1-bit weights and activations version of this network. We also have pretrained weights for this network, which we will load into the model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from finn.util.test import get_test_model\n", + "lfc = get_test_model(netname = \"LFC\", wbits = 1, abits = 1, pretrained = True)\n", + "lfc" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have now instantiated our trained PyTorch network. Let's try to run an example MNIST image through the network using PyTorch." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "import matplotlib.pyplot as plt\n", + "from pkgutil import get_data\n", + "import onnx\n", + "import onnx.numpy_helper as nph\n", + "raw_i = get_data(\"qonnx.data\", \"onnx/mnist-conv/test_data_set_0/input_0.pb\")\n", + "input_tensor = onnx.load_tensor_from_string(raw_i)\n", + "input_tensor_npy = nph.to_array(input_tensor)\n", + "input_tensor_pyt = torch.from_numpy(input_tensor_npy).float()\n", + "imgplot = plt.imshow(input_tensor_npy.reshape(28,28), cmap='gray')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from torch.nn.functional import softmax\n", + "# do forward pass in PyTorch/Brevitas\n", + "produced = lfc.forward(input_tensor_pyt).detach()\n", + "probabilities = softmax(produced, dim=-1).flatten()\n", + "probabilities" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "objects = [str(x) for x in range(10)]\n", + "y_pos = np.arange(len(objects))\n", + "plt.bar(y_pos, probabilities, align='center', alpha=0.5)\n", + "plt.xticks(y_pos, objects)\n", + "plt.ylabel('Predicted Probability')\n", + "plt.title('LFC-w1a1 Predictions for Image')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Call Brevitas QONNX export and visualize with Netron\n", + "\n", + "Brevitas comes with built-in QONNX export functionality. This is similar to the regular ONNX export capabilities of PyTorch, with a few differences:\n", + "\n", + "1. Weight and activation quantization is represented as a 'fake-quantization' with Quant and BipolarQuant nodes.\n", + "2. Truncation operations as required by average pooling are represented with a Trunc node.\n", + "\n", + "One can read more about how QONNX works and why it was developed here: https://xilinx.github.io/finn//2021/11/03/qonnx-and-finn.html\n", + "\n", + "Additionally, QONNX comes with a set of tools for working with the format. These are maintained together with the Fast Machine Learning collaboration as an open-source project here: https://github.com/fastmachinelearning/qonnx\n", + "\n", + "It's actually quite straightforward to export QONNX from our Brevitas model as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from brevitas.export.onnx.generic.manager import BrevitasONNXManager\n", + "export_onnx_path = \"/tmp/LFCW1A1_qonnx.onnx\"\n", + "input_shape = (1, 1, 28, 28)\n", + "BrevitasONNXManager.export(lfc, input_shape, export_onnx_path);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's examine what the exported ONNX model looks like. For this, we will use the Netron visualizer:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "showInNetron(export_onnx_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When running this notebook in the FINN Docker container, you should be able to see an interactive visualization of the imported network above, and click on individual nodes to inspect their parameters. If you look at any of the MatMul nodes, you should be able to see that the weights are all {-1, +1} values." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
Import into FINN and converting QONNX to FINN-ONNX\n", + "\n", + "Similarly to the 1a notebook, we will first run a cleanup transformation on the exported QONNX model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from qonnx.util.cleanup import cleanup\n", + "\n", + "export_onnx_path_cleaned = \"/tmp/LFCW1A1-qonnx-clean.onnx\"\n", + "cleanup(export_onnx_path, out_file=export_onnx_path_cleaned)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "showInNetron(export_onnx_path_cleaned)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will now import this QONNX model into FINN using the ModelWrapper. Here we can immediately execute the model to verify correctness." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from qonnx.core.modelwrapper import ModelWrapper\n", + "import qonnx.core.onnx_exec as oxe\n", + "model = ModelWrapper(export_onnx_path_cleaned)\n", + "input_dict = {\"global_in\": nph.to_array(input_tensor)}\n", + "output_dict = oxe.execute_onnx(model, input_dict)\n", + "produced_qonnx = output_dict[list(output_dict.keys())[0]]\n", + "\n", + "produced_qonnx" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "np.isclose(produced, produced_qonnx).all()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using the `QONNXtoFINN` transformation we can convert the model to the FINN internal FINN-ONNX representation. Notably, all Quant and BipolarQuant nodes will have disappeared and are converted into MultiThreshold nodes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from finn.transformation.qonnx.convert_qonnx_to_finn import ConvertQONNXtoFINN\n", + "model = ModelWrapper(export_onnx_path_cleaned)\n", + "\n", + "model = model.transform(ConvertQONNXtoFINN())\n", + "\n", + "export_onnx_path_converted = \"/tmp/LFCW1A1-qonnx-converted.onnx\"\n", + "model.save(export_onnx_path_converted)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "showInNetron(export_onnx_path_converted)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And once again we can execute the model with the FINN/QONNX execution engine." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model = ModelWrapper(export_onnx_path_cleaned)\n", + "input_dict = {\"global_in\": nph.to_array(input_tensor)}\n", + "output_dict = oxe.execute_onnx(model, input_dict)\n", + "produced_finn = output_dict[list(output_dict.keys())[0]]\n", + "\n", + "produced_finn" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "np.isclose(produced_qonnx, produced_finn).all()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have successfully verified that the transformed and cleaned-up FINN graph still produces the same output, and can now use this model for further processing in FINN."
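For quick reference, the ingestion path walked through above can be condensed into one cell. The sketch below reuses only calls that appear in this notebook; the all-zeros array is merely a stand-in for the MNIST input tensor loaded earlier:

```python
import numpy as np
import qonnx.core.onnx_exec as oxe
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.util.cleanup import cleanup
from finn.transformation.qonnx.convert_qonnx_to_finn import ConvertQONNXtoFINN

# tidy up the Brevitas QONNX export, then convert Quant nodes to MultiThreshold
cleanup("/tmp/LFCW1A1_qonnx.onnx", out_file="/tmp/LFCW1A1-qonnx-clean.onnx")
model = ModelWrapper("/tmp/LFCW1A1-qonnx-clean.onnx")
model = model.transform(ConvertQONNXtoFINN())

# execute the converted model on a dummy input
input_dict = {"global_in": np.zeros((1, 1, 28, 28), dtype=np.float32)}
output_dict = oxe.execute_onnx(model, input_dict)
print(output_dict[list(output_dict.keys())[0]])
```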
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/end2end_example/bnn-pynq/cnv_end2end_example.ipynb b/notebooks/end2end_example/bnn-pynq/cnv_end2end_example.ipynb index 28155d6f3eacd4dfd77aefbc73fc4ed3ef12f1dd..388accad3aa2bb2633fb691241d234442d59bb11 100644 --- a/notebooks/end2end_example/bnn-pynq/cnv_end2end_example.ipynb +++ b/notebooks/end2end_example/bnn-pynq/cnv_end2end_example.ipynb @@ -46,7 +46,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The white fields show the state of the network representation in the respective step. The colored fields represent the transformations that are applied to the network to achieve a certain result. The diagram is divided into 5 sections represented by a different color, each of it includes several flow steps. The flow starts in top left corner with Brevitas export (green section), followed by the preparation of the network (blue section) for the Vivado HLS synthesis and Vivado IPI stitching (orange section), and finally building a PYNQ overlay bitfile and testing it on a PYNQ board (yellow section).\n", + "The white fields show the state of the network representation in the respective step. The colored fields represent the transformations that are applied to the network to achieve a certain result. The diagram is divided into 5 sections represented by a different color, each of which includes several flow steps. The flow starts in the top left corner with Brevitas export (green section), followed by the preparation of the network (blue section) for the Vitis HLS synthesis and Vivado IPI stitching (orange section), and finally building a PYNQ overlay bitfile and testing it on a PYNQ board (yellow section).\n", "There is an additional section for functional verification (red section) on the left side of the diagram, which we will not cover in this notebook. For details please take a look in the verification notebook which you can find [here](tfc_end2end_verification.ipynb)\n", "\n", "\n", @@ -199,7 +199,7 @@ "\n", "\n", "\n", - "Note how the convolution layer looks very similar to the fully connected one in terms of the matrix-vector-threshold unit (MVTU), but now the MVTU is preceded by a sliding window unit that produces the matrix from the input image. All of these building blocks, including the `MaxPool` layer you see in this figure, exist as templated Vivado HLS C++ functions in [finn-hlslib](https://github.com/Xilinx/finn-hlslib).\n", + "Note how the convolution layer looks very similar to the fully connected one in terms of the matrix-vector-threshold unit (MVTU), but now the MVTU is preceded by a sliding window unit that produces the matrix from the input image. All of these building blocks, including the `MaxPool` layer you see in this figure, exist as templated Vitis HLS C++ functions in [finn-hlslib](https://github.com/Xilinx/finn-hlslib).\n", "\n", "\n", "To target this kind of hardware architecture with our network we'll apply a convolution lowering transformation, in addition to streamlining. 
You may recall the *streamlining transformation* that we applied to the TFC-w1a1 network, which is a series of mathematical simplifications that allow us to get rid of floating point scaling operations by implementing few-bit activations as thresholding operations. \n", @@ -462,11 +462,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 5. Deployment and Remote Execution\n", + "## 5. Deployment and Execution\n", "\n", - "Now that we're done with the hardware generation, we can copy the necessary files onto our PYNQ board.\n", - "\n", - "**Make sure you've [set up the SSH keys for your PYNQ board](https://finn-dev.readthedocs.io/en/latest/getting_started.html#pynq-board-first-time-setup) before executing this step.**" + "The bitfile and generated driver file(s) will be copied into a deployment folder which then can be used to run the network on the PYNQ board." ] }, { @@ -475,33 +473,33 @@ "metadata": {}, "outputs": [], "source": [ - "import os\n", + "from shutil import copy\n", + "from distutils.dir_util import copy_tree\n", + "\n", + "# create directory for deployment files\n", + "deployment_dir = make_build_dir(prefix=\"pynq_deployment_\")\n", + "model.set_metadata_prop(\"pynq_deployment_dir\", deployment_dir)\n", "\n", - "# set up the following values according to your own environment\n", - "# FINN will use ssh to deploy and run the generated accelerator\n", - "ip = \"192.168.2.99\"\n", - "username = os.getenv(\"PYNQ_USERNAME\", \"xilinx\")\n", - "password = os.getenv(\"PYNQ_PASSWORD\", \"xilinx\")\n", - "port = os.getenv(\"PYNQ_PORT\", 22)\n", - "target_dir = os.getenv(\"PYNQ_TARGET_DIR\", \"/home/xilinx/finn_cnv_end2end_example\")\n", - "# set up ssh options to only allow publickey authentication\n", - "options = \"-o PreferredAuthentications=publickey -o PasswordAuthentication=no\"\n", + "# get and copy necessary files\n", + "# .bit and .hwh file\n", + "bitfile = model.get_metadata_prop(\"bitfile\")\n", + "hwh_file = model.get_metadata_prop(\"hw_handoff\")\n", + "deploy_files = [bitfile, hwh_file]\n", "\n", - "# test access to PYNQ board\n", - "! ssh {options} {username}@{ip} -p {port} cat /var/run/motd.dynamic" + "for dfile in deploy_files:\n", + " if dfile is not None:\n", + " copy(dfile, deployment_dir)\n", + "\n", + "# driver.py and python libraries\n", + "pynq_driver_dir = model.get_metadata_prop(\"pynq_driver_dir\")\n", + "copy_tree(pynq_driver_dir, deployment_dir)" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "from finn.transformation.fpgadataflow.make_deployment import DeployToPYNQ\n", - "\n", - "model = ModelWrapper(build_dir + \"/end2end_cnv_w1a1_synth.onnx\")\n", - "model = model.transform(DeployToPYNQ(ip, port, username, password, target_dir))\n", - "model.save(build_dir + \"/end2end_cnv_w1a1_pynq_deploy.onnx\")" + "Next to these files, we will also need an example numpy array to test the network on the PYNQ board. (*and before you ask, that's supposed to be a cat (CIFAR-10 class number 3)*) Recall that we partitioned our original network into a parent graph that contained the non-synthesizable nodes and a child graph that contained the bulk of the network, which we turned into a bitfile. The only operator left outside the FPGA partition was a `Transpose` to convert NCHW images into NHWC ones. Thus, we can skip the execution in the parent as long as we ensure our image has the expected data layout. The example numpy array can then be saved as an .npy file."
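For completeness, once `output.npy` has been copied back to the host, decoding the prediction takes only a few lines. This sketch assumes the file holds the raw per-class scores for a single image (batchsize 1), which is what the flow above produces:

```python
import numpy as np

out = np.load("output.npy")              # e.g. shape (1, 10) for batchsize=1
pred = int(out.reshape(-1).argmax())
cifar10_classes = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]
print(pred, cifar10_classes[pred])       # the example image should give 3 -> cat
```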
] }, { @@ -510,8 +508,14 @@ "metadata": {}, "outputs": [], "source": [ - "target_dir_pynq = target_dir + \"/\" + model.get_metadata_prop(\"pynq_deployment_dir\").split(\"/\")[-1]\n", - "target_dir_pynq" + "import pkg_resources as pk\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "\n", + "fn = pk.resource_filename(\"finn.qnn-data\", \"cifar10/cifar10-test-data-class3.npz\")\n", + "x = np.load(fn)[\"arr_0\"]\n", + "x = x.reshape(3, 32,32).transpose(1, 2, 0)\n", + "plt.imshow(x)" ] }, { @@ -520,14 +524,19 @@ "metadata": {}, "outputs": [], "source": [ - "! ssh {options} {username}@{ip} -p {port} 'ls -l {target_dir_pynq}'" + "model = ModelWrapper(build_dir + \"/end2end_cnv_w1a1_synth.onnx\")\n", + "iname = model.graph.input[0].name\n", + "ishape = model.get_tensor_shape(iname)\n", + "np.save(deployment_dir + \"/input.npy\", x.reshape(ishape))" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "We only have two more steps to be able to remotely execute the deployed bitfile with some test data from the CIFAR-10 dataset. Let's load up some test data that comes bundled with FINN -- *and before you ask, that's supposed to be a cat (CIFAR-10 class number 3)*." + "! ls {deployment_dir}" ] }, { @@ -536,54 +545,34 @@ "metadata": {}, "outputs": [], "source": [ - "import pkg_resources as pk\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "\n", - "fn = pk.resource_filename(\"finn.qnn-data\", \"cifar10/cifar10-test-data-class3.npz\")\n", - "x = np.load(fn)[\"arr_0\"]\n", - "x = x.reshape(3, 32,32).transpose(1, 2, 0)\n", - "plt.imshow(x)" + "from shutil import make_archive\n", + "make_archive('deploy-on-pynq-cnv', 'zip', deployment_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Recall that we partitioned our original network into a parent graph that contained the non-synthesizable nodes and a child graph that contained the bulk of the network, which we turned into a bitfile. The only operator left outside the FPGA partition was a `Transpose` to convert NCHW images into NHWC ones. Thus, we can skip the execution in the parent as long as we ensure our image has the expected data layout, which we have done above." + "You can now download the created zipfile (File -> Open, mark the checkbox next to the deploy-on-pynq-cnv.zip and select Download from the toolbar), then copy it to your PYNQ board (for instance via scp or rsync). 
Then, run the following commands on the PYNQ board to extract the archive and run the execution:" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "from finn.core.onnx_exec import execute_onnx\n", - "\n", - "model = ModelWrapper(build_dir + \"/end2end_cnv_w1a1_pynq_deploy.onnx\")\n", - "iname = model.graph.input[0].name\n", - "oname = model.graph.output[0].name\n", - "ishape = model.get_tensor_shape(iname)\n", - "input_dict = {iname: x.astype(np.float32).reshape(ishape)}\n", - "ret = execute_onnx(model, input_dict, True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "ret[oname]" + "```shell\n", + "unzip deploy-on-pynq-cnv.zip -d finn-cnv-demo\n", + "cd finn-cnv-demo\n", + "sudo python3 -m pip install bitstring\n", + "sudo python3 driver.py --exec_mode=execute --batchsize=1 --bitfile=resizer.bit --inputfile=input.npy\n", + "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We see that the network correctly predicts this as a class 3 (\"cat\"). " + "The output will be saved on the PYNQ board as `output.npy` and can be copied to the host and opened with `np.load()`." ] }, { @@ -592,7 +581,7 @@ "source": [ "### Validating the Accuracy on a PYNQ Board <a id='validation'></a>\n", "\n", - "All the command line prompts here are meant to be executed with `sudo` on the PYNQ board, so we'll use a workaround (`echo password | sudo -S command`) to get that working from this notebook running on the host computer.\n", + "All the command line prompts here are meant to be executed with `sudo` on the PYNQ board.\n", "\n", "**Ensure that your PYNQ board has a working internet connection for the next steps, since there is some downloading involved.**\n", "\n", @@ -601,16 +590,9 @@ "\n", "Command to execute on PYNQ:\n", "\n", - "```pip3 install git+https://github.com/fbcotter/dataset_loading.git@0.0.4#egg=dataset_loading```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "! ssh {options} -t {username}@{ip} -p {port} 'echo {password} | sudo -S pip3 install git+https://github.com/fbcotter/dataset_loading.git@0.0.4#egg=dataset_loading'" + "```shell\n", + "sudo pip3 install git+https://github.com/fbcotter/dataset_loading.git@0.0.4#egg=dataset_loading\n", + "```" ] }, { @@ -621,16 +603,9 @@ "\n", "Command to execute on PYNQ:\n", "\n", - "`python3.6 validate.py --dataset cifar10 --batchsize 1000`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "! 
ssh {options} -t {username}@{ip} -p {port} 'cd {target_dir_pynq}; echo {password} | sudo -S python3.6 validate.py --dataset cifar10 --batchsize 1000'" + "```shell\n", + "sudo python3 validate.py --dataset cifar10 --batchsize 1000\n", + "```" ] }, { @@ -643,7 +618,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/end2end_example/bnn-pynq/tfc_end2end_example.ipynb b/notebooks/end2end_example/bnn-pynq/tfc_end2end_example.ipynb index c4fc92b97c91d6b1dfadc41ac3c23d014bd9fada..eec17b2fa7e8226cafe48f095aa38eb704b2812e 100644 --- a/notebooks/end2end_example/bnn-pynq/tfc_end2end_example.ipynb +++ b/notebooks/end2end_example/bnn-pynq/tfc_end2end_example.ipynb @@ -33,7 +33,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The white fields show the state of the network representation in the respective step. The colored fields represent the transformations that are applied to the network to achieve a certain result. The diagram is divided into 5 sections represented by a different color, each of it includes several flow steps. The flow starts in top left corner with Brevitas export (green section), followed by the preparation of the network (blue section) for the Vivado HLS synthesis and Vivado IPI stitching (orange section), and finally building a PYNQ overlay bitfile and testing it on a PYNQ board (yellow section).\n", + "The white fields show the state of the network representation in the respective step. The colored fields represent the transformations that are applied to the network to achieve a certain result. The diagram is divided into 5 sections represented by a different color, each of which includes several flow steps. The flow starts in the top left corner with Brevitas export (green section), followed by the preparation of the network (blue section) for the Vitis HLS synthesis and Vivado IPI stitching (orange section), and finally building a PYNQ overlay bitfile and testing it on a PYNQ board (yellow section).\n", "There is an additional section for functional verification (red section) on the right side of the diagram, which we will not cover in this notebook. For details please take a look in the verification notebook which you can find [here](tfc_end2end_verification.ipynb)\n", "\n", "\n", @@ -161,7 +161,7 @@ "\n", "\n", "\n", - "In practice, the compute arrays are instantiated by function calls to optimized Vivado HLS building blocks from the [finn-hlslib](https://github.com/Xilinx/finn-hlslib) library. As these function calls can only handle certain patterns/cases, we need to transform the network into an appropriate form so that we can replace network layers with these function calls, which is the goal of the network preparation process." + "In practice, the compute arrays are instantiated by function calls to optimized Vitis HLS building blocks from the [finn-hlslib](https://github.com/Xilinx/finn-hlslib) library. As these function calls can only handle certain patterns/cases, we need to transform the network into an appropriate form so that we can replace network layers with these function calls, which is the goal of the network preparation process." ] }, { @@ -248,7 +248,7 @@ "\n", "In FINN, we can bake some of these pre/postprocessing operations into the graph, and in some cases these can be highly beneficial for performance by allowing our accelerator to directly consume raw data instead of going through CPU preprocessing. 
\n", - "We'll demonstrate this for our small image classification network as follows. Brevitas preprocesses BNN-PYNQ network inputs with `torchvision.transforms.ToTensor()` [prior to training](https://github.com/Xilinx/brevitas/blob/master/src/brevitas_examples/bnn_pynq/trainer.py#L104), which converts 8-bit RGB values into floats between 0 and 1 by dividing the input by 255. We can achieve the same effect in FINN by exporting a single-node ONNX graph for division by 255 (which already exists as `finn.util.pytorch.ToTensor` and merging this with our original model. Finally, we're going to mark our input tensor as 8-bit to let FINN know which level of precision to use." + "We'll demonstrate this for our small image classification network as follows. Brevitas preprocesses BNN-PYNQ network inputs with `torchvision.transforms.ToTensor()` [prior to training](https://github.com/Xilinx/brevitas/blob/master/src/brevitas_examples/bnn_pynq/trainer.py#L86), which converts 8-bit RGB values into floats between 0 and 1 by dividing the input by 255. We can achieve the same effect in FINN by exporting a single-node ONNX graph for division by 255 (which already exists as `finn.util.pytorch.ToTensor`) and merging this with our original model. Finally, we're going to mark our input tensor as 8-bit to let FINN know which level of precision to use." ] }, { @@ -343,7 +343,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As can be seen, several transformations are involved in the streamlining transformation. There are move and collapse transformations. In the last step the operations are transformed into multithresholds. The involved transformations can be viewed in detail [here](https://github.com/Xilinx/finn/tree/master/src/finn/transformation/streamline). After each transformation, three of the tidy-up transformations (`GiveUniqueNodeNames`, `GiveReadableTensorNames` and `InferDataTypes`) are applied to the model.\n", + "As can be seen, several transformations are involved in the streamlining transformation. There are move and collapse transformations. In the last step the operations are transformed into multithresholds. The involved transformations can be viewed in detail [here](https://github.com/Xilinx/finn/tree/main/src/finn/transformation/streamline). After each transformation, three of the tidy-up transformations (`GiveUniqueNodeNames`, `GiveReadableTensorNames` and `InferDataTypes`) are applied to the model.\n", "\n", "After streamlining the network looks as follows:" ] @@ -525,7 +525,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can use the higher-level [HLSCustomOp](https://github.com/Xilinx/finn/blob/main/src/finn/custom_op/fpgadataflow/__init__.py) wrappers for this node. These wrappers provide easy access to specific properties of these nodes, such as the folding factors (PE and SIMD). Let's have a look at which node attributes are defined by the CustomOp wrapper, and adjust the SIMD and PE attributes." + "We can use the higher-level [HLSCustomOp](https://github.com/Xilinx/finn/blob/main/src/finn/custom_op/fpgadataflow/hlscustomop.py) wrappers for this node. These wrappers provide easy access to specific properties of these nodes, such as the folding factors (PE and SIMD). Let's have a look at which node attributes are defined by the CustomOp wrapper, and adjust the SIMD and PE attributes."
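As an illustration of the wrapper mechanism described above, folding attributes might be adjusted as in the following sketch. It is hypothetical rather than taken from this notebook: the op type to query (here "MatrixVectorActivation") depends on the FINN version in use, and `model` is assumed to be the ModelWrapper instance from the surrounding flow:

```python
from qonnx.custom_op.registry import getCustomOp

# wrap the first matrix-vector compute node in the graph
fc0 = model.get_nodes_by_op_type("MatrixVectorActivation")[0]
fc0_inst = getCustomOp(fc0)
print(fc0_inst.get_nodeattr_types())  # lists every settable attribute

# folding factors: SIMD divides the input width, PE divides the output width
fc0_inst.set_nodeattr("SIMD", 16)
fc0_inst.set_nodeattr("PE", 16)
```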
] }, { @@ -547,7 +547,7 @@ "metadata": {}, "source": [ "We can see that the PE and SIMD are listed as node attributes, as well as the depths of the FIFOs that will be inserted between consecutive layers, and all can be adjusted using `set_nodeattr` subject to certain constraints. There are also a lot of additional attributes that can be set for this node type.\n", - "**In this notebook we are setting the folding factors and FIFO depths manually, but in a future version we will support determining the folding factors given an FPGA resource budget according to the analytical model from the [FINN-R paper](https://arxiv.org/pdf/1809.04570).**" + "**In this notebook we are setting the folding factors and FIFO depths manually, but it is possible to use FINN transformations for this ([SetFolding](https://github.com/Xilinx/finn/blob/main/src/finn/transformation/fpgadataflow/set_folding.py) and [InsertAndSetFIFODepths](https://github.com/Xilinx/finn/blob/main/src/finn/transformation/fpgadataflow/set_fifo_depths.py)).**" ] }, { @@ -609,7 +609,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This completes the network preparation and the network can be passed on to the next block *Vivado HLS and IPI*, which is described below." + "This completes the network preparation and the network can be passed on to the next block *Vitis HLS and IPI*, which is described below." ] }, { @@ -798,23 +798,21 @@ "source": [ "## 4. PYNQ deployment <a id='hw_test'></a>\n", "\n", - "* [Deployment and Remote Execution](#deploy)\n", + "* [Deployment](#deploy)\n", "* [Validation on PYNQ Board](#validation)\n", "* [Throughput Test on PYNQ Board](#throughput)\n", "\n", "\n", - "We are almost done preparing our hardware design. We'll now put it in a form suitable for use as a PYNQ overlay, synthesize and deploy it." + "The bitfile and generated driver will be copied together with some necessary files for execution into a deployment folder, which can then be used to run the network on the PYNQ board." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Deployment and Remote Execution <a id='deploy'></a>\n", + "### Deployment <a id='deploy'></a>\n", "\n", - "We'll now use the `DeployToPYNQ` transformation to create a deployment folder with the bitfile and driver file(s), and copy that to the PYNQ board. You can change the default IP address, username, password and target folder for the PYNQ below.\n", - "\n", - "**Make sure you've [set up the SSH keys for your PYNQ board](https://finn-dev.readthedocs.io/en/latest/getting_started.html#pynq-board-first-time-setup) before executing this step.**" + "We'll now create a deployment folder with the bitfile and driver file(s), zip it, and then copy it to the PYNQ board for execution and validation."
] }, { @@ -823,74 +821,33 @@ "metadata": {}, "outputs": [], "source": [ - "import os\n", + "from shutil import copy\n", + "from distutils.dir_util import copy_tree\n", "\n", - "# set up the following values according to your own environment\n", - "# FINN will use ssh to deploy and run the generated accelerator\n", - "ip = \"192.168.2.99\"\n", - "username = os.getenv(\"PYNQ_USERNAME\", \"xilinx\")\n", - "password = os.getenv(\"PYNQ_PASSWORD\", \"xilinx\")\n", - "port = os.getenv(\"PYNQ_PORT\", 22)\n", - "target_dir = os.getenv(\"PYNQ_TARGET_DIR\", \"/home/xilinx/finn_tfc_end2end_example\")\n", - "# set up ssh options to only allow publickey authentication\n", - "options = \"-o PreferredAuthentications=publickey -o PasswordAuthentication=no\"\n", + "# create directory for deployment files\n", + "deployment_dir = make_build_dir(prefix=\"pynq_deployment_\")\n", + "model.set_metadata_prop(\"pynq_deployment_dir\", deployment_dir)\n", "\n", - "# test access to PYNQ board\n", - "! ssh {options} {username}@{ip} -p {port} cat /var/run/motd.dynamic" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from finn.transformation.fpgadataflow.make_deployment import DeployToPYNQ\n", + "# get and copy necessary files\n", + "# .bit and .hwh file\n", + "bitfile = model.get_metadata_prop(\"bitfile\")\n", + "hwh_file = model.get_metadata_prop(\"hw_handoff\")\n", + "deploy_files = [bitfile, hwh_file]\n", "\n", - "model = model.transform(DeployToPYNQ(ip, port, username, password, target_dir))\n", - "model.save(build_dir + \"/tfc_w1_a1_pynq_deploy.onnx\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's verify that the remote access credentials is saved in the model metadata, and that the deployment folder has been successfully copied to the board:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model.model.metadata_props" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "target_dir_pynq = target_dir + \"/\" + model.get_metadata_prop(\"pynq_deployment_dir\").split(\"/\")[-1]\n", - "target_dir_pynq" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "! ssh {options} {username}@{ip} -p {port} 'ls -l {target_dir_pynq}'" + "for dfile in deploy_files:\n", + " if dfile is not None:\n", + " copy(dfile, deployment_dir)\n", + "\n", + "# driver.py and python libraries\n", + "pynq_driver_dir = model.get_metadata_prop(\"pynq_driver_dir\")\n", + "copy_tree(pynq_driver_dir, deployment_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We only have two more steps to be able to remotely execute the deployed bitfile with some test data from the MNIST dataset. Let's load up some test data that comes bundled with FINN." + "In addition to these files, we will also need an example numpy array to test the network on the PYNQ board. You may recall that one \"reshape\" node was left out of the StreamingDataflowPartition. We'll do that manually with a numpy function call when passing in the input, but everything else in the network ended up inside the StreamingDataflowPartition so that's all we need to do. The example numpy array can then be saved as a .npy file.
" ] }, { @@ -914,18 +871,23 @@ "metadata": {}, "outputs": [], "source": [ - "model = ModelWrapper(build_dir + \"/tfc_w1_a1_pynq_deploy.onnx\")\n", + "import numpy as np\n", + "\n", + "model = ModelWrapper(build_dir + \"/tfc_w1_a1_post_synthesis.onnx\")\n", "iname = model.graph.input[0].name\n", "oname = parent_model.graph.output[0].name\n", "ishape = model.get_tensor_shape(iname)\n", - "print(\"Expected network input shape is \" + str(ishape))" + "print(\"Expected network input shape is \" + str(ishape))\n", + "np.save(deployment_dir + \"/input.npy\", x.reshape(ishape))" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "Finally, we can call `execute_onnx` on the graph, which will internally call remote execution with the bitfile, grab the results and return a numpy array. You may recall that one \"reshape\" node was left out of the StreamingDataflowPartition. We'll do that manually with a numpy function call when passing in the input, but everything else in the network ended up inside the StreamingDataflowPartition so that's all we need to do." + "! ls {deployment_dir}" ] }, { @@ -934,27 +896,34 @@ "metadata": {}, "outputs": [], "source": [ - "import numpy as np\n", - "from finn.core.onnx_exec import execute_onnx\n", - "\n", - "input_dict = {iname: x.reshape(ishape)}\n", - "ret = execute_onnx(model, input_dict)" + "from shutil import make_archive\n", + "make_archive('deploy-on-pynq-tfc', 'zip', deployment_dir)" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can now download the created zipfile (**File -> Open**, mark the checkbox next to the `deploy-on-pynq-tfc.zip` and select Download from the toolbar), then copy it to your PYNQ board (for instance via `scp` or `rsync`). Then, run the following commands **on the PYNQ board** to extract the archive and run the execution:" + ] + }, + { + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "ret[oname]" + "```shell\n", + "unzip deploy-on-pynq-tfc.zip -d finn-tfc-demo\n", + "cd finn-tfc-demo\n", + "sudo python3 -m pip install bitstring\n", + "sudo python3 driver.py --exec_mode=execute --batchsize=1 --bitfile=resizer.bit --inputfile=input.npy\n", + "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We see that the network correctly predicts this as a digit 2." + "The output will be saved on the PYNQ board as `output.npy` and can be copied to the host and opened with `np.load()`." ] }, { @@ -963,25 +932,16 @@ "source": [ "### Validating the Accuracy on a PYNQ Board <a id='validation'></a>\n", "\n", - "All the command line prompts here are meant to be executed with `sudo` on the PYNQ board, so we'll use a workaround (`echo password | sudo -S command`) to get that working from this notebook running on the host computer.\n", - "\n", "**Ensure that your PYNQ board has a working internet connecting for the next steps, since there is some downloading involved.**\n", "\n", "To validate the accuracy, we first need to install the [`dataset-loading`](https://github.com/fbcotter/dataset_loading) Python package to the PYNQ board. 
This will give us a convenient way of downloading and accessing the MNIST dataset.\n", "\n", "\n", - "Command to execute on PYNQ:\n", + "Command to execute on PYNQ board:\n", "\n", - "```sudo pip3 install git+https://github.com/fbcotter/dataset_loading.git@0.0.4#egg=dataset_loading```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "! ssh {options} -t {username}@{ip} -p {port} 'echo {password} | sudo -S pip3 install git+https://github.com/fbcotter/dataset_loading.git@0.0.4#egg=dataset_loading'" + "```shell\n", + "sudo pip3 install git+https://github.com/fbcotter/dataset_loading.git@0.0.4#egg=dataset_loading\n", + "```" ] }, { @@ -990,18 +950,11 @@ "source": [ "We can now use the `validate.py` script that was generated together with the driver to measure top-1 accuracy on the MNIST dataset.\n", "\n", - "Command to execute on PYNQ:\n", + "Command to execute on PYNQ board:\n", "\n", - "`python3.6 validate.py --dataset mnist --batchsize 1000`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "! ssh {options} -t {username}@{ip} -p {port} 'cd {target_dir_pynq}; echo {password} | sudo -S python3.6 validate.py --dataset mnist --batchsize 1000'" + "```shell\n", + "sudo python3 validate.py --dataset mnist --batchsize 1000\n", + "```" ] }, { @@ -1016,60 +969,30 @@ "metadata": {}, "source": [ "### Throughput Test on PYNQ Board <a id='throughput'></a>\n", - "In addition to the functional verification, FINN also offers the possibility to measure the network performance directly on the PYNQ board. This can be done using the core function `throughput_test`. In the next section we import the function and execute it.\n", - "First we extract the `remote_exec_model` again and pass it to the function. The function returns the metrics of the network as dictionary. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from finn.core.throughput_test import throughput_test_remote\n", - "\n", - "model = ModelWrapper(build_dir + \"/tfc_w1_a1_pynq_deploy.onnx\")\n", - "res = throughput_test_remote(model, 10000)\n", - "print(\"Network metrics:\")\n", - "for key in res:\n", - " print(str(key) + \": \" + str(res[key]))" + "In addition to the functional verification, FINN also offers the possibility to measure the network performance directly on the PYNQ board. This can be done by setting the `exec_mode` to `throughput_test`. \n", + "Command to execute on PYNQ board:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Together with the values for folding we can evaluate the performance of our accelerator. Each layer has a total folding factor of 64 and because the network is fully pipelined, it follows: `II = 64`. II is the initiation interval and indicates how many cycles are needed for one input to be processed.
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "II = 64\n", - "# frequency in MHz\n", - "f_MHz = 100\n", - "# expected throughput in MFPS\n", - "expected_throughput = f_MHz / II\n", - "# measured throughput (FPS) from throughput test, converted to MFPS\n", - "measured_throughput = res[\"throughput[images/s]\"] * 0.000001\n", - "# peformance\n", - "print(\"We reach approximately \" + str(round((measured_throughput / expected_throughput)*100)) + \"% of the ideal performance.\")" + "```shell\n", + "sudo python3 driver.py --exec_mode=throughput_test --batchsize=1000 --bitfile=resizer.bit\n", + "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The measured values were recorded with a batch size of 10000 and at a frequency of 100 MHz. We will be improving the efficiency of the generated accelerator examples in the coming FINN releases." + "The network metrics from the throughput test are saved in a file called `nw_metrics.txt` on the PYNQ board. Which can be investigated after running the command above." ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/end2end_example/bnn-pynq/tfc_end2end_verification.ipynb b/notebooks/end2end_example/bnn-pynq/tfc_end2end_verification.ipynb index 813127197e07e4ddb5ec5ff39aed0278e117babc..6c3b7965098e013fa35ac5f5b2b481e678d68f5d 100644 --- a/notebooks/end2end_example/bnn-pynq/tfc_end2end_verification.ipynb +++ b/notebooks/end2end_example/bnn-pynq/tfc_end2end_verification.ipynb @@ -61,7 +61,7 @@ "fc = get_test_model_trained(\"TFC\", 1, 1)\n", "raw_i = get_data(\"qonnx.data\", \"onnx/mnist-conv/test_data_set_0/input_0.pb\")\n", "input_tensor = onnx.load_tensor_from_string(raw_i)\n", - "input_brevitas = torch.from_numpy(nph.to_array(input_tensor)).float()\n", + "input_brevitas = torch.from_numpy(nph.to_array(input_tensor).copy()).float()\n", "output_golden = fc.forward(input_brevitas).detach().numpy()\n", "output_golden" ] @@ -72,7 +72,7 @@ "source": [ "## Simulation using Python <a id='simpy'></a>\n", "\n", - "If an ONNX model consists of [standard ONNX](https://github.com/onnx/onnx/blob/master/docs/Operators.md) nodes and/or FINN custom operations that do not belong to the fpgadataflow (`backend` $\\neq$ `fpgadataflow`) this model can be checked for functionality using Python.\n", + "If an ONNX model consists of [standard ONNX](https://github.com/onnx/onnx/blob/main/docs/Operators.md) nodes and/or FINN custom operations that do not belong to the fpgadataflow (`backend` $\\neq$ `fpgadataflow`) this model can be checked for functionality using Python.\n", "\n", "To simulate a standard ONNX node [onnxruntime](https://github.com/microsoft/onnxruntime) is used. onnxruntime is an open source tool developed by Microsoft to run standard ONNX nodes. For the FINN custom op nodes execution, functions are defined. 
The following is an example of the execution function of a XNOR popcount node.\n" ] @@ -383,7 +383,15 @@ "\n", "child_model = ModelWrapper(build_dir + \"/tfc_w1_a1_dataflow_child.onnx\")\n", "child_model = child_model.transform(InsertDWC())\n", - "child_model = child_model.transform(InsertFIFO())\n", + "\n", + "# set all impl_styles of the DWCs to hls to enable emulation\n", + "dwc_nodes = child_model.get_nodes_by_op_type(\"StreamingDataWidthConverter_Batch\")\n", + "for dwc in dwc_nodes:\n", + " dwc_inst = getCustomOp(dwc)\n", + " dwc_inst.set_nodeattr(\"impl_style\", \"hls\")\n", + " \n", + "child_model = child_model.transform(InsertFIFO(create_shallow_fifos=True))\n", + "child_model.save(build_dir + \"/test.onnx\");\n", "child_model = child_model.transform(GiveUniqueNodeNames())\n", "child_model = child_model.transform(PrepareIP(test_fpga_part, target_clk_ns))\n", "child_model = child_model.transform(HLSSynthIP())\n", @@ -431,7 +439,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/end2end_example/cybersecurity/1-train-mlp-with-brevitas.ipynb b/notebooks/end2end_example/cybersecurity/1-train-mlp-with-brevitas.ipynb index 5625a6f1c20ee5e4a66df28931a6a891f699a738..3d77586258b9ddb64985e7f7b7a2215565839c50 100644 --- a/notebooks/end2end_example/cybersecurity/1-train-mlp-with-brevitas.ipynb +++ b/notebooks/end2end_example/cybersecurity/1-train-mlp-with-brevitas.ipynb @@ -741,7 +741,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/end2end_example/cybersecurity/2-import-into-finn-and-verify.ipynb b/notebooks/end2end_example/cybersecurity/2-import-into-finn-and-verify.ipynb index 370312c77e90c67a3095e0800ad0c6046bfd75f4..e4848a1f40bed5865eccc1d831a634ac5f54e965 100644 --- a/notebooks/end2end_example/cybersecurity/2-import-into-finn-and-verify.ipynb +++ b/notebooks/end2end_example/cybersecurity/2-import-into-finn-and-verify.ipynb @@ -381,7 +381,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb b/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb index 33adb68dc8ddfff1b427d82e4666a70e883bf2c8..a18cafd6044328d53139acafb2be2cf73a4ec9b6 100644 --- a/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb +++ b/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb @@ -624,7 +624,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, diff --git a/requirements.txt b/requirements.txt index 348b1afab9deca1547d40cb8d8c54a396befa65d..83aad07d729e30cbbbaf565b4332fb1f7ae6f014 100644 --- a/requirements.txt +++ b/requirements.txt @@ -4,11 +4,11 @@ dataclasses-json==0.5.7 docrep==0.2.7 gspread==3.6.0 numpy==1.22.0 -onnx==1.11.0 +onnx==1.13.0 onnxoptimizer onnxruntime==1.11.1 pre-commit==2.9.2 -protobuf==3.20.2 +protobuf==3.20.3 psutil==5.9.4 pyscaffold==3.2.1 scipy==1.5.2 diff --git a/setup.cfg b/setup.cfg index a1d0fef6cb08994ae8666fd2ea37166bf1cd3752..1893aa42316dad341fcedbd527f5abcf482e5cfb 100644 --- a/setup.cfg +++ b/setup.cfg @@ -72,18 +72,20 @@ exclude = # Add here additional requirements for extra 
features, to install with: # `pip install FINN[PDF]` like: # PDF = ReportLab; RXP -# finn-base is needed to build the full set of docs +# qonnx is needed to build the full set of docs docs = - finn-base==0.0.3 docutils==0.17.1 dataclasses-json==0.5.7 gspread==3.6.0 + IPython pytest netron vcdvcd torchvision torch qonnx@git+https://github.com/fastmachinelearning/qonnx@main#egg=qonnx + pyverilator@git+https://github.com/maltanar/pyverilator@master#egg=pyverilator + brevitas@git+https://github.com/Xilinx/brevitas@master#egg=brevitas_examples # Add here test requirements (semicolon/line-separated) testing = diff --git a/src/finn/builder/build_dataflow_config.py b/src/finn/builder/build_dataflow_config.py index 80934d812fd7a165fdaba9c3fb17dc37e5a82d49..a38cb6e572d683871a924330742a1859b6fbe75d 100644 --- a/src/finn/builder/build_dataflow_config.py +++ b/src/finn/builder/build_dataflow_config.py @@ -253,6 +253,10 @@ class DataflowBuildConfig: #: for each FIFO. auto_fifo_depths: Optional[bool] = True + #: Whether FIFO nodes with depth larger than 32768 will be split. + #: Allows configuring very large FIFOs via the folding_config_file. + split_large_fifos: Optional[bool] = False + #: When `auto_fifo_depths = True`, select which method will be used for #: setting the FIFO sizes. auto_fifo_strategy: Optional[ diff --git a/src/finn/builder/build_dataflow_steps.py b/src/finn/builder/build_dataflow_steps.py index 9a6966ac9bc877deb0c246f3bf84324a3b069e64..9ae5b1aab85cde8b26084f86d0f52eeef3965a6b 100644 --- a/src/finn/builder/build_dataflow_steps.py +++ b/src/finn/builder/build_dataflow_steps.py @@ -99,6 +99,7 @@ from finn.transformation.fpgadataflow.set_exec_mode import SetExecMode from finn.transformation.fpgadataflow.set_fifo_depths import ( InsertAndSetFIFODepths, RemoveShallowFIFOs, + SplitLargeFIFOs, ) from finn.transformation.fpgadataflow.set_folding import SetFolding from finn.transformation.fpgadataflow.synth_ooc import SynthOutOfContext @@ -528,7 +529,9 @@ def step_set_fifo_depths(model: ModelWrapper, cfg: DataflowBuildConfig): model = model.transform(DeriveFIFOSizes()) model = model.transform( InsertFIFO( - vivado_ram_style=cfg.large_fifo_mem_style, max_qsrl_depth=256 + vivado_ram_style=cfg.large_fifo_mem_style, + max_qsrl_depth=256, + create_shallow_fifos=True, ) ) model = model.transform(GiveUniqueNodeNames()) @@ -550,6 +553,8 @@ def step_set_fifo_depths(model: ModelWrapper, cfg: DataflowBuildConfig): force_python_sim=force_python_sim, ) ) + # InsertAndSetFIFODepths internally removes any shallow FIFOs + # so no need to call RemoveShallowFIFOs here else: assert "Unsupported auto_fifo_strategy: " + cfg.auto_fifo_strategy else: @@ -563,8 +568,6 @@ def step_set_fifo_depths(model: ModelWrapper, cfg: DataflowBuildConfig): model = model.transform(GiveReadableTensorNames()) if cfg.folding_config_file is not None: model = model.transform(ApplyConfig(cfg.folding_config_file)) - # remove any shallow FIFOs - model = model.transform(RemoveShallowFIFOs()) # extract the final configuration and save it as json hw_attrs = [ @@ -576,11 +579,20 @@ "resType", "mem_mode", "runtime_writeable_weights", + "inFIFODepths", + "outFIFODepths", ] extract_model_config_to_json( model, cfg.output_dir + "/final_hw_config.json", hw_attrs ) + # perform FIFO splitting and shallow FIFO removal only after the final config + # json file has been written.
otherwise, since these transforms may add/remove + # FIFOs, we get name mismatch problems when trying to reuse the final config. + if cfg.split_large_fifos: + model = model.transform(SplitLargeFIFOs()) + model = model.transform(RemoveShallowFIFOs()) + # after FIFOs are ready to go, call PrepareIP and HLSSynthIP again # this will only run for the new nodes (e.g. FIFOs and DWCs) model = model.transform( @@ -655,9 +667,8 @@ def step_measure_rtlsim_performance(model: ModelWrapper, cfg: DataflowBuildConfi + "in FINN C++ verilator driver, falling back to Python" ) rtlsim_bs = int(cfg.rtlsim_batch_size) + orig_rtlsim_trace_depth = get_rtlsim_trace_depth() if force_python_rtlsim: - # run with single input to get latency - orig_rtlsim_trace_depth = get_rtlsim_trace_depth() assert rtlsim_bs > 0, "rtlsim batch size must be >0" if cfg.verify_save_rtlsim_waveforms: # set depth to 3 for layer-by-layer visibility @@ -669,9 +680,11 @@ def step_measure_rtlsim_performance(model: ModelWrapper, cfg: DataflowBuildConfi rtlsim_model.set_metadata_prop( "extra_verilator_args", str(["-CFLAGS", "-O3"]) ) + # run with single input to get latency + rtlsim_latency_dict = throughput_test_rtlsim(rtlsim_model, 1) + # run with batch to get stable-state throughput rtlsim_perf_dict = throughput_test_rtlsim(rtlsim_model, rtlsim_bs) - rtlsim_latency = rtlsim_perf_dict["cycles"] - rtlsim_perf_dict["latency_cycles"] = rtlsim_latency + rtlsim_perf_dict["latency_cycles"] = rtlsim_latency_dict["cycles"] else: rtlsim_perf_dict = verilator_fifosim(model, rtlsim_bs) # keep keys consistent between the Python and C++-styles @@ -685,6 +698,19 @@ def step_measure_rtlsim_performance(model: ModelWrapper, cfg: DataflowBuildConfi for (key, val) in rtlsim_perf_dict.items(): if "max_count" in key: del rtlsim_perf_dict[key] + # estimate stable-state throughput based on latency+throughput + if rtlsim_bs == 1: + rtlsim_perf_dict["stable_throughput[images/s]"] = rtlsim_perf_dict[ + "throughput[images/s]" + ] + else: + total_cycles = rtlsim_perf_dict["cycles"] + latency_cycles = rtlsim_perf_dict["latency_cycles"] + stablestate_cycles = total_cycles - latency_cycles + clk_ns = float(model.get_metadata_prop("clk_ns")) + fclk_mhz = 1 / (clk_ns * 0.001) + runtime_s = (stablestate_cycles * clk_ns) * (10**-9) + rtlsim_perf_dict["stable_throughput[images/s]"] = rtlsim_bs / runtime_s with open(report_dir + "/rtlsim_performance.json", "w") as f: json.dump(rtlsim_perf_dict, f, indent=2) diff --git a/src/finn/custom_op/fpgadataflow/__init__.py b/src/finn/custom_op/fpgadataflow/__init__.py index e5eb483a00f6890f5eeb16c5cec533a4533c9f15..56d4230a3af3057daaa5c47140fcde1590dee686 100644 --- a/src/finn/custom_op/fpgadataflow/__init__.py +++ b/src/finn/custom_op/fpgadataflow/__init__.py @@ -43,6 +43,7 @@ from finn.custom_op.fpgadataflow.downsampler import DownSampler from finn.custom_op.fpgadataflow.duplicatestreams_batch import DuplicateStreams_Batch from finn.custom_op.fpgadataflow.eltwise import StreamingEltwise from finn.custom_op.fpgadataflow.fmpadding_batch import FMPadding_Batch +from finn.custom_op.fpgadataflow.fmpadding_rtl import FMPadding_rtl from finn.custom_op.fpgadataflow.globalaccpool_batch import GlobalAccPool_Batch from finn.custom_op.fpgadataflow.iodma import IODMA from finn.custom_op.fpgadataflow.labelselect_batch import LabelSelect_Batch @@ -91,3 +92,4 @@ custom_op["Lookup"] = Lookup custom_op["StreamingConcat"] = StreamingConcat custom_op["CheckSum"] = CheckSum custom_op["StreamingEltwise"] = StreamingEltwise +custom_op["FMPadding_rtl"] = 
FMPadding_rtl diff --git a/src/finn/custom_op/fpgadataflow/eltwise.py b/src/finn/custom_op/fpgadataflow/eltwise.py index d6284750c73026c09fb7986ffc2517ed9ae3b153..68ed6546c741277bd8e962b6e80eda083cedba9c 100644 --- a/src/finn/custom_op/fpgadataflow/eltwise.py +++ b/src/finn/custom_op/fpgadataflow/eltwise.py @@ -398,7 +398,7 @@ class StreamingEltwise(HLSCustomOp): "StreamingEltwise", self.get_nodeattr("NumChannels"), self.get_nodeattr("PE"), - self.get_number_output_values(), + int(np.prod(self.get_folded_output_shape()[:-2])), slice_in0, slice_in1, slice_out, diff --git a/src/finn/custom_op/fpgadataflow/fmpadding_rtl.py b/src/finn/custom_op/fpgadataflow/fmpadding_rtl.py new file mode 100644 index 0000000000000000000000000000000000000000..5650d218857a7c7ff86c15ac057c4ebbc18df5ca --- /dev/null +++ b/src/finn/custom_op/fpgadataflow/fmpadding_rtl.py @@ -0,0 +1,420 @@ +# Copyright (C) 2022, Advanced Micro Devices, Inc. +# All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# * Redistributions of source code must retain the above copyright notice, this +# list of conditions and the following disclaimer. +# +# * Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# +# * Neither the name of FINN nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import math +import numpy as np +import os +import shutil +import warnings +from qonnx.core.datatype import DataType +from qonnx.util.basic import roundup_to_integer_multiple + +from finn.custom_op.fpgadataflow.hlscustomop import HLSCustomOp +from finn.util.basic import get_rtlsim_trace_depth, make_build_dir +from finn.util.data_packing import npy_to_rtlsim_input, rtlsim_output_to_npy + +try: + from pyverilator import PyVerilator +except ModuleNotFoundError: + PyVerilator = None + + +class FMPadding_rtl(HLSCustomOp): + """CustomOp wrapper for the finn-rtllib fmpadding_axi component + Supports adjusting the padding amount and spatial feature sizes at + runtime.""" + + def __init__(self, onnx_node): + super().__init__(onnx_node) + + def get_nodeattr_types(self): + my_attrs = { + # spatial size of input images + "ImgDim": ("ints", True, []), # [H, W] = [Y, X] + # total padding (per dimension) to apply + "Padding": ( + "ints", + True, + [1, 1, 1, 1], + ), # [H_begin, W_begin, H_end, W_end] = [Y_begin, X_begin, Y_end, X_end] + # number of channels in input image + "NumChannels": ("i", True, 0), + # SIMD Input parallelism + "SIMD": ("i", False, 1), + # FINN input datatype + "inputDataType": ("s", True, ""), + # shape describing input vecs per execution + "numInputVectors": ("i", False, 1), + # Enable reprogrammable implementation to change FM dimensions, + # stride, or dilation during runtime + "dynamic_mode": ("i", False, 0, {0, 1}), + # attribute to save top module name - not user configurable + "gen_top_module": ("s", False, ""), + } + my_attrs.update(super().get_nodeattr_types()) + return my_attrs + + def get_padded_odim(self): + "Return the padded spatial size of the output." + idim_h, idim_w = self.get_nodeattr("ImgDim") + pad = self.get_nodeattr("Padding") + pad_h = pad[0] + pad[2] + pad_w = pad[1] + pad[3] + odim_h = idim_h + pad_h + odim_w = idim_w + pad_w + return [odim_h, odim_w] + + def get_exp_cycles(self): + odim_h, odim_w = self.get_padded_odim() + channels = self.get_nodeattr("NumChannels") + simd = self.get_nodeattr("SIMD") + batch_size = self.get_nodeattr("numInputVectors") + exp_cycles = (channels / simd) * batch_size * odim_h * odim_w + return int(exp_cycles) + + def get_normal_input_shape(self, ind=0): + idim_h, idim_w = self.get_nodeattr("ImgDim") + num_ch = self.get_nodeattr("NumChannels") + ishape = (1, idim_h, idim_w, num_ch) + return ishape + + def get_normal_output_shape(self, ind=0): + odim_h, odim_w = self.get_padded_odim() + num_ch = self.get_nodeattr("NumChannels") + + oshape = (1, odim_h, odim_w, num_ch) + return oshape + + def get_folded_input_shape(self, ind=0): + normal_ishape = list(self.get_normal_input_shape()) + ifm_ch = self.get_nodeattr("NumChannels") + simd = self.get_nodeattr("SIMD") + assert ifm_ch % simd == 0, "SIMD must divide input channels" + fold = int(normal_ishape[-1] / simd) + folded_ishape = normal_ishape[:-1] + [fold, simd] + return tuple(folded_ishape) + + def get_folded_output_shape(self, ind=0): + normal_oshape = list(self.get_normal_output_shape()) + ifm_ch = self.get_nodeattr("NumChannels") + simd = self.get_nodeattr("SIMD") + assert ifm_ch % simd == 0, "SIMD must divide input channels" + fold = int(normal_oshape[-1] / simd) + folded_oshape = normal_oshape[:-1] + [fold, simd] + return tuple(folded_oshape) + + def make_shape_compatible_op(self, model): + exp_ishape = self.get_normal_input_shape() + oshape = self.get_normal_output_shape() + ishape = tuple(model.get_tensor_shape(self.onnx_node.input[0])) + assert ishape == exp_ishape, 
"Unexpected input shape for FMPadding_rtl." + return super().make_const_shape_op(oshape) + + def infer_node_datatype(self, model): + node = self.onnx_node + idt = model.get_tensor_datatype(node.input[0]) + if idt != self.get_input_datatype(): + warn_str = "inputDataType changing for %s: %s -> %s " % ( + node.name, + str(self.get_input_datatype()), + str(idt), + ) + warnings.warn(warn_str) + self.set_nodeattr("inputDataType", idt.name) + model.set_tensor_datatype(node.output[0], idt) + + def verify_node(self): + pass + + def get_input_datatype(self, ind=0): + """Returns FINN DataType of input.""" + ret = DataType[self.get_nodeattr("inputDataType")] + # the hlslib op always pads with zeros, so ensure that the DataType + # is able to represent zeros + assert ret.allowed(0), "FMPadding_rtl DataType must support zero" + return ret + + def get_output_datatype(self, ind=0): + """Returns FINN DataType of output. (Same as input datatype)""" + return self.get_input_datatype() + + def get_instream_width(self, ind=0): + ibits = self.get_input_datatype().bitwidth() + simd = self.get_nodeattr("SIMD") + return ibits * simd + + def get_outstream_width(self, ind=0): + obits = self.get_output_datatype().bitwidth() + simd = self.get_nodeattr("SIMD") + return obits * simd + + def get_number_output_values(self): + folded_oshape = self.get_folded_output_shape() + return np.prod(folded_oshape[:-1]) + + def get_verilog_top_module_intf_names(self): + # Overload default HLSCustomOp implementation to add axilite control IF + intf_names = super().get_verilog_top_module_intf_names() + if self.get_nodeattr("dynamic_mode"): + intf_names["axilite"] = ["s_axilite"] + return intf_names + + def execute_node(self, context, graph): + mode = self.get_nodeattr("exec_mode") + node = self.onnx_node + exp_ishape = self.get_normal_input_shape() + exp_oshape = self.get_normal_output_shape() + folded_ishape = self.get_folded_input_shape() + + if mode == "cppsim": + raise Exception( + "cppsim not possible for FMPadding_rtl, please set exec_mode to rtlsim" + ) + elif mode == "rtlsim": + code_gen_dir = self.get_nodeattr("code_gen_dir_ipgen") + else: + raise Exception( + """Invalid value for attribute exec_mode! 
Is currently set to: {} + has to be set to one of the following value ("cppsim", "rtlsim")""".format( + mode + ) + ) + + inp = context[node.input[0]] + assert str(inp.dtype) == "float32", "Input datatype is not float32" + assert ( + inp.shape == exp_ishape + ), """Input shape doesn't + match expected shape (1, ImgDim_h, ImgDim_w, NumChannels).""" + export_idt = self.get_input_datatype() + + reshaped_input = inp.reshape(folded_ishape) + np.save(os.path.join(code_gen_dir, "input_0.npy"), reshaped_input) + + sim = self.get_rtlsim() + nbits = self.get_instream_width() + rtlsim_inp = npy_to_rtlsim_input( + "{}/input_0.npy".format(code_gen_dir), export_idt, nbits + ) + super().reset_rtlsim(sim) + super().toggle_clk(sim) + rtlsim_output = self.rtlsim(sim, rtlsim_inp) + odt = export_idt + target_bits = odt.bitwidth() + packed_bits = self.get_outstream_width() + out_npy_path = "{}/output.npy".format(code_gen_dir) + out_shape = self.get_folded_output_shape() + rtlsim_output_to_npy( + rtlsim_output, out_npy_path, odt, out_shape, packed_bits, target_bits + ) + # load and reshape output + output = np.load(out_npy_path) + output = np.asarray([output], dtype=np.float32).reshape(*exp_oshape) + context[node.output[0]] = output + + assert ( + context[node.output[0]].shape == exp_oshape + ), """Output shape doesn't match expected shape + (1, OutputDim_H, OutputDim_W, NumChannels).""" + + def get_template_values(self, ifm_dims, pads, chans, simd, idt): + dimY, dimX = ifm_dims + padT, padL, padB, padR = pads + y_counter_bits = int(math.ceil(math.log2(padT + dimY + padB + 1))) + x_counter_bits = int(math.ceil(math.log2(padL + dimX + padR + 1))) + topname = self.get_verilog_top_module_name() + stream_bits = idt.bitwidth() * simd + stream_bits = int(roundup_to_integer_multiple(stream_bits, 8)) + code_gen_dict = { + "XCOUNTER_BITS": int(x_counter_bits), + "YCOUNTER_BITS": int(y_counter_bits), + "NUM_CHANNELS": int(chans), + "SIMD": int(simd), + "ELEM_BITS": idt.bitwidth(), + "TOP_MODULE_NAME": topname, + "INIT_XON": int(padL), + "INIT_XOFF": int(padL + dimX), + "INIT_XEND": int(padL + dimX + padR - 1), + "INIT_YON": int(padT), + "INIT_YOFF": int(padT + dimY), + "INIT_YEND": int(padT + dimY + padB - 1), + "STREAM_BITS": int(stream_bits), + } + return code_gen_dict + + def get_dynamic_config(self, ifm_dims=None, pads=None): + """Returns a configuration dict to re-configure FM dimension and + padding amounts during runtime.""" + + if ifm_dims is None: + ifm_dims = self.get_nodeattr("ImgDim") + if pads is None: + pads = self.get_nodeattr("Padding") + chans = self.get_nodeattr("NumChannels") + simd = self.get_nodeattr("SIMD") + idt = self.get_input_datatype() + code_gen_dict = self.get_template_values(ifm_dims, pads, chans, simd, idt) + config = { + "XON": (0 * 4, (code_gen_dict["INIT_XON"])), + "XOFF": (1 * 4, (code_gen_dict["INIT_XOFF"])), + "XEND": (2 * 4, (code_gen_dict["INIT_XEND"])), + "YON": (3 * 4, (code_gen_dict["INIT_YON"])), + "YOFF": (4 * 4, (code_gen_dict["INIT_YOFF"])), + "YEND": (5 * 4, (code_gen_dict["INIT_YEND"])), + } + return config + + def generate_hdl(self): + rtlsrc = os.environ["FINN_ROOT"] + "/finn-rtllib/fmpadding/hdl" + template_path = rtlsrc + "/fmpadding_template.v" + dims = self.get_nodeattr("ImgDim") + pads = self.get_nodeattr("Padding") + chans = self.get_nodeattr("NumChannels") + simd = self.get_nodeattr("SIMD") + idt = self.get_input_datatype() + code_gen_dict = self.get_template_values(dims, pads, chans, simd, idt) + # save top module name so we can refer to it after this node has been 
renamed + # (e.g. by GiveUniqueNodeNames(prefix) during MakeZynqProject) + self.set_nodeattr("gen_top_module", self.get_verilog_top_module_name()) + + # apply code generation to templates + code_gen_dir = self.get_nodeattr("code_gen_dir_ipgen") + with open(template_path, "r") as f: + template = f.read() + for key_name in code_gen_dict: + key = "$%s$" % key_name + template = template.replace(key, str(code_gen_dict[key_name])) + + with open( + os.path.join(code_gen_dir, self.get_verilog_top_module_name() + ".v"), + "w", + ) as f: + f.write(template) + + sv_files = ["fmpadding_axi.sv", "fmpadding.sv", "axi2we.sv"] + for sv_file in sv_files: + shutil.copy(rtlsrc + "/" + sv_file, code_gen_dir) + # set ipgen_path and ip_path so that HLS-Synth transformation + # and stich_ip transformation do not complain + self.set_nodeattr("ipgen_path", code_gen_dir) + self.set_nodeattr("ip_path", code_gen_dir) + + def prepare_rtlsim(self): + """Creates a Verilator emulation library for the RTL code generated + for this node, sets the rtlsim_so attribute to its path and returns + a PyVerilator wrapper around it.""" + # Modified to use generated (System-)Verilog instead of HLS output products + + if PyVerilator is None: + raise ImportError("Installation of PyVerilator is required.") + + code_gen_dir = self.get_nodeattr("code_gen_dir_ipgen") + verilog_paths = [code_gen_dir] + verilog_files = [ + "fmpadding_axi.sv", + "fmpadding.sv", + "axi2we.sv", + self.get_nodeattr("gen_top_module") + ".v", + ] + + # build the Verilator emu library + sim = PyVerilator.build( + verilog_files, + build_dir=make_build_dir("pyverilator_" + self.onnx_node.name + "_"), + verilog_path=verilog_paths, + trace_depth=get_rtlsim_trace_depth(), + top_module_name=self.get_verilog_top_module_name(), + ) + # save generated lib filename in attribute + self.set_nodeattr("rtlsim_so", sim.lib._name) + return sim + + def code_generation_ipi(self): + """Constructs and returns the TCL for node instantiation in Vivado IPI.""" + code_gen_dir = self.get_nodeattr("code_gen_dir_ipgen") + + sourcefiles = [ + "fmpadding_axi.sv", + "fmpadding.sv", + "axi2we.sv", + self.get_nodeattr("gen_top_module") + ".v", + ] + + sourcefiles = [os.path.join(code_gen_dir, f) for f in sourcefiles] + + cmd = [] + for f in sourcefiles: + cmd += ["add_files -norecurse %s" % (f)] + cmd += [ + "create_bd_cell -type module -reference %s %s" + % (self.get_nodeattr("gen_top_module"), self.onnx_node.name) + ] + return cmd + + def code_generation_ipgen(self, model, fpgapart, clk): + """Normally: Generates C++ code and tcl script for IP generation. 
+ Here: Generates (System-)Verilog code for IP generation.""" + self.generate_hdl() + + def ipgen_singlenode_code(self): + """Normally: Builds the bash script for IP generation.""" + pass + + def code_generation_cppsim(self, model): + """Normally: Generates C++ code for simulation (cppsim).""" + pass + + def compile_singlenode_code(self): + pass + + def global_includes(self): + pass + + def defines(self, var): + pass + + def read_npy_data(self): + pass + + def strm_decl(self): + pass + + def docompute(self): + pass + + def dataoutstrm(self): + pass + + def save_as_npy(self): + pass + + def blackboxfunction(self): + pass + + def pragmas(self): + pass diff --git a/src/finn/custom_op/fpgadataflow/matrixvectoractivation.py b/src/finn/custom_op/fpgadataflow/matrixvectoractivation.py index df9d1f1e70674f7bc91460e154f4e24af08df79c..72128fda4cfe23db4858fe3ffe80a755733954cc 100644 --- a/src/finn/custom_op/fpgadataflow/matrixvectoractivation.py +++ b/src/finn/custom_op/fpgadataflow/matrixvectoractivation.py @@ -576,6 +576,10 @@ class MatrixVectorActivation(HLSCustomOp): def minimize_accumulator_width(self, model): weights = model.get_initializer(self.onnx_node.input[1]) + # since the calculation uses the values of the weight matrix, + # in binaryXnorMode they first need to be converted to bipolar + if self.get_nodeattr("binaryXnorMode"): + weights = 2 * weights - 1 if len(self.onnx_node.input) > 2: thresholds = model.get_initializer(self.onnx_node.input[2]) else: @@ -702,10 +706,12 @@ class MatrixVectorActivation(HLSCustomOp): of weights. Arguments: + * weights : numpy array with weights to be put into the file * weight_file_mode : one of {hls_header, decoupled_verilog_dat, decoupled_runtime} * weight_file_name : filename for the weight file to be generated + """ # convert weights into hlslib-compatible format weight_tensor = self.get_hls_compatible_weight_tensor(weights) diff --git a/src/finn/custom_op/fpgadataflow/pool_batch.py b/src/finn/custom_op/fpgadataflow/pool_batch.py index 91cd537baeff0c7666bbf3596b46a7412ec2fe4e..813f13e504eae181f4398eccbe40ad66b6e3bf16 100644 --- a/src/finn/custom_op/fpgadataflow/pool_batch.py +++ b/src/finn/custom_op/fpgadataflow/pool_batch.py @@ -42,12 +42,13 @@ class Pool_Batch(HLSCustomOp): Output shape (BatchSize,OutImgDim,OutImgDim,Channels) Notes: - # The input shape was chosen to be compatible with im2col (only true when there - is not folding). - # The actual data layout produced by the hlslib kernels is different - for depthwise ops. - * depthwise SWG: (1, OFMDim, OFMDim, IFMChannels/PE, K, K, PE) + * The input shape was chosen to be compatible with im2col (only true when there + is no folding). + * The actual data layout produced by the hlslib kernels is different + for depthwise ops.
+ + * depthwise SWG: (1, OFMDim, OFMDim, IFMChannels/PE, K, K, PE) Channels can be folded using PE (SIMD from the input perspective) """ diff --git a/src/finn/custom_op/fpgadataflow/streamingdatawidthconverter_batch.py b/src/finn/custom_op/fpgadataflow/streamingdatawidthconverter_batch.py index a3aa9d570d0efcbe82090d19a151d4f5b12078b6..a80d2bbefac96e8ec2a48e04179d3d285e78cef7 100644 --- a/src/finn/custom_op/fpgadataflow/streamingdatawidthconverter_batch.py +++ b/src/finn/custom_op/fpgadataflow/streamingdatawidthconverter_batch.py @@ -78,24 +78,33 @@ class StreamingDataWidthConverter_Batch(HLSCustomOp): def check_divisible_iowidths(self): impl_style = self.get_nodeattr("impl_style") - if impl_style == "hls": - # when using impl_style = hls must have the following - # if inWidth > outWidth: inWidth % outWidth = 0 - # if inWidth < outWidth: outWidth % inWidth = 0 - iwidth = self.get_nodeattr("inWidth") - owidth = self.get_nodeattr("outWidth") - if iwidth > owidth: - assert ( - iwidth % owidth == 0 - ), """DWC InWidth is bigger than OutWidth and is not divisible by it. - Please adjust PE and SIMD values so that InWidth % OutWidth = 0 - or alternatively use impl_style = vivado""" - else: - assert ( - owidth % iwidth == 0 - ), """DWC OutWidth is bigger than InWidth and is not divisible by it. - Please adjust PE and SIMD values so that OutWidth % InWidth = 0 - or alternatively use impl_style = vivado""" + iwidth = self.get_nodeattr("inWidth") + owidth = self.get_nodeattr("outWidth") + if impl_style == "vivado": + # the AXIS IP we use in vivado mode only supports + # stream widths that are divisible by 8 + iwidth_d8 = iwidth % 8 == 0 + owidth_d8 = owidth % 8 == 0 + assert ( + iwidth_d8 and owidth_d8 + ), """DWC impl_style=vivado requires + stream widths that are divisible by 8: (%d, %d)""" % ( + iwidth, + owidth, + ) + + def get_iowidth_lcm(self): + iwidth = self.get_nodeattr("inWidth") + owidth = self.get_nodeattr("outWidth") + return int(np.lcm(iwidth, owidth)) + + def needs_lcm(self): + iwidth = self.get_nodeattr("inWidth") + owidth = self.get_nodeattr("outWidth") + maxwidth = max(iwidth, owidth) + minwidth = min(iwidth, owidth) + impl_style = self.get_nodeattr("impl_style") + return (impl_style == "hls") and (maxwidth % minwidth != 0) def get_folded_input_shape(self, ind=0): self.check_divisible_iowidths() @@ -202,6 +211,16 @@ class StreamingDataWidthConverter_Batch(HLSCustomOp): "#define NumInWords %d " % numInWords, "#define numReps %d" % numReps, ] + if self.needs_lcm(): + lcmWidth = self.get_iowidth_lcm() + assert ( + numInWords % (lcmWidth / inWidth) == 0 + ), "Error in DWC LCM calculation" + numLCMToOut = numInWords // (lcmWidth / inWidth) + self.code_gen_dict["$DEFINES$"].append("#define LCMWidth %d" % lcmWidth) + self.code_gen_dict["$DEFINES$"].append( + "#define NumLCMToOut %d" % (numLCMToOut) + ) def read_npy_data(self): code_gen_dir = self.get_nodeattr("code_gen_dir_cppsim") @@ -226,6 +245,12 @@ class StreamingDataWidthConverter_Batch(HLSCustomOp): self.code_gen_dict["$STREAMDECLARATIONS$"].append( 'hls::stream<ap_uint<{}>> in0 ("in0");'.format(self.get_instream_width()) ) + if self.needs_lcm(): + self.code_gen_dict["$STREAMDECLARATIONS$"].append( + 'hls::stream<ap_uint<{}>> intermediate ("intermediate");'.format( + self.get_iowidth_lcm() + ) + ) self.code_gen_dict["$STREAMDECLARATIONS$"].append( 'hls::stream<ap_uint<{}>> out ("out");'.format(self.get_outstream_width()) ) @@ -233,9 +258,19 @@ class StreamingDataWidthConverter_Batch(HLSCustomOp): def docompute(self): # TODO continue 
with fxns below, they are copy-pasted op = "StreamingDataWidthConverter_Batch" - self.code_gen_dict["$DOCOMPUTE$"] = [ - "%s<InWidth, OutWidth, NumInWords>(in0, out, numReps);" % (op) - ] + if self.needs_lcm(): + self.code_gen_dict["$DOCOMPUTE$"] = [ + 'hls::stream<ap_uint<{}>> intermediate ("intermediate");'.format( + self.get_iowidth_lcm() + ), + "%s<InWidth, LCMWidth, NumInWords>(in0, intermediate, numReps);" % (op), + "%s<LCMWidth, OutWidth, NumLCMToOut>(intermediate, out, numReps);" + % (op), + ] + else: + self.code_gen_dict["$DOCOMPUTE$"] = [ + "%s<InWidth, OutWidth, NumInWords>(in0, out, numReps);" % (op) + ] def dataoutstrm(self): code_gen_dir = self.get_nodeattr("code_gen_dir_cppsim") @@ -287,6 +322,10 @@ class StreamingDataWidthConverter_Batch(HLSCustomOp): self.code_gen_dict["$PRAGMAS$"].append( "#pragma HLS INTERFACE ap_ctrl_none port=return" ) + if self.needs_lcm(): + self.code_gen_dict["$PRAGMAS$"].append( + "#pragma HLS DATAFLOW disable_start_propagation" + ) def execute_node(self, context, graph): mode = self.get_nodeattr("exec_mode") @@ -466,3 +505,28 @@ class StreamingDataWidthConverter_Batch(HLSCustomOp): cset_luts += outw return int(cnt_luts + cset_luts) + + def prepare_rtlsim(self): + assert self.get_nodeattr("impl_style") != "vivado", ( + "StreamingDataWidthConverter impl_style " + "cannot be vivado for rtlsim. Only impl_style=rtl supported." + ) + super().prepare_rtlsim() + + def code_generation_ipgen(self, model, fpgapart, clk): + # no codegen required for impl_style=vivado since + # that uses premade, configurable AXIS IP + if self.get_nodeattr("impl_style") == "hls": + super().code_generation_ipgen(model, fpgapart, clk) + + def ipgen_singlenode_code(self): + # no IP generation required for impl_style=vivado since + # that uses premade, configurable AXIS IP + if self.get_nodeattr("impl_style") == "hls": + super().ipgen_singlenode_code() + else: + code_gen_dir = self.get_nodeattr("code_gen_dir_ipgen") + # set ipgen_path and ip_path so that HLSSynthIP + # and CreatedStitchedIP transformations do not complain + self.set_nodeattr("ipgen_path", code_gen_dir) + self.set_nodeattr("ip_path", code_gen_dir) diff --git a/src/finn/custom_op/fpgadataflow/streamingfifo.py b/src/finn/custom_op/fpgadataflow/streamingfifo.py index c71e8ffe323b1f2bb459a0f982e63d881a7ae58d..522305327ff7c5f1356aad4fdf6b9e0a942eca72 100644 --- a/src/finn/custom_op/fpgadataflow/streamingfifo.py +++ b/src/finn/custom_op/fpgadataflow/streamingfifo.py @@ -72,6 +72,9 @@ class StreamingFIFO(HLSCustomOp): ), # whether depth monitoring is enabled (impl_style=rtl only) "depth_monitor": ("i", False, 0), + # the FIFO does not need its own FIFOs + "inFIFODepths": ("ints", False, [0]), + "outFIFODepths": ("ints", False, [0]), } ) diff --git a/src/finn/custom_op/fpgadataflow/thresholding_batch.py b/src/finn/custom_op/fpgadataflow/thresholding_batch.py index f2cc64668d62ef15446772309577e9b15a378ef5..d9745acf63c4685b3369ac379abde0a6c5a3f157 100644 --- a/src/finn/custom_op/fpgadataflow/thresholding_batch.py +++ b/src/finn/custom_op/fpgadataflow/thresholding_batch.py @@ -354,10 +354,12 @@ class Thresholding_Batch(HLSCustomOp): run-time reconfig of weights. 
Arguments: + * weights : numpy array with weights to be put into the file * weight_file_mode : one of {hls_header, decoupled_verilog_dat, decoupled_runtime} * weight_file_name : filename for the weight file to be generated + """ threshold_tensor = self.get_hls_compatible_threshold_tensor(weights) tdt = self.get_weight_datatype() @@ -600,13 +602,17 @@ class Thresholding_Batch(HLSCustomOp): # TODO check and add whatever missing def defines(self, var): + numReps = 1 numInputVectors = list(self.get_nodeattr("numInputVectors")) - numReps = int(np.prod(numInputVectors)) + total_spatial_size = int(np.prod(numInputVectors)) + self.code_gen_dict["$DEFINES$"] = [ - """#define NumChannels1 {}\n #define PE1 {}\n #define numReps {}""".format( + """#define NumChannels1 {}\n #define PE1 {}\n #define numReps {}\n + #define ImgDim1 {}""".format( self.get_nodeattr("NumChannels"), self.get_nodeattr("PE"), numReps, + total_spatial_size, ) ] if self.get_nodeattr("mem_mode") == "decoupled": @@ -647,7 +653,7 @@ class Thresholding_Batch(HLSCustomOp): npy_in = "%s/thresholds.npy" % code_gen_dir self.code_gen_dict["$READNPYDATA$"].append( - 'npy2apintstream<%s, %s, %d, %s>("%s", weights, false, numReps);' + 'npy2apintstream<%s, %s, %d, %s>("%s", weights, false, ImgDim1);' % (packed_hls_type, elem_hls_type, elem_bits, npy_type, npy_in) ) @@ -669,18 +675,13 @@ class Thresholding_Batch(HLSCustomOp): def docompute(self): tmpl_args = self.get_template_param_values() - # TODO: why put some template parameters into defines and not others? - # should ImgDim be defined or just filled in here like we do now? node = self.onnx_node - inp_vecs = self.get_nodeattr("numInputVectors") - total_spatial_size = int(np.prod(inp_vecs)) mem_mode = self.get_nodeattr("mem_mode") if mem_mode == "const": self.code_gen_dict["$DOCOMPUTE$"] = [ - """{}<{}, NumChannels1, PE1, {}, {}> + """{}<ImgDim1, NumChannels1, PE1, {}, {}> (in0, out, threshs, numReps);""".format( node.op_type, - total_spatial_size, tmpl_args["TSrcI"], tmpl_args["TDstI"], ) @@ -690,10 +691,9 @@ class Thresholding_Batch(HLSCustomOp): # - for cppsim the repetition comes from the threshold stream reader+input # - for synth the unit runs continuously anyway (ap_ctrl_none) self.code_gen_dict["$DOCOMPUTE$"] = [ - """{}<{}, NumChannels1, PE1, {}, {}, ActVal1, ThresType1, NumSteps1> - (in0, out, weights, 1);""".format( + """{}<ImgDim1, NumChannels1, PE1, {}, {}, ActVal1, ThresType1, NumSteps1> + (in0, out, weights, numReps);""".format( "Thresholding_Stream_Batch", - total_spatial_size, tmpl_args["TSrcI"], tmpl_args["TDstI"], ) diff --git a/src/finn/custom_op/fpgadataflow/vectorvectoractivation.py b/src/finn/custom_op/fpgadataflow/vectorvectoractivation.py index a411d245a9a397e6d827abfa8ee4a784f207ecd5..d5e29ca22acf89440c3c3a66101bec89d4a66d46 100644 --- a/src/finn/custom_op/fpgadataflow/vectorvectoractivation.py +++ b/src/finn/custom_op/fpgadataflow/vectorvectoractivation.py @@ -410,10 +410,12 @@ class VectorVectorActivation(HLSCustomOp): of weights. 
Arguments: + * weights : numpy array with weights to be put into the file * weight_file_mode : one of {hls_header, decoupled_verilog_dat, decoupled_runtime} * weight_file_name : filename for the weight file to be generated + """ # convert weights into hlslib-compatible format weight_tensor = self.get_hls_compatible_weight_tensor(weights) diff --git a/src/finn/qnn-data/build_dataflow/dataflow_build_config.json b/src/finn/qnn-data/build_dataflow/dataflow_build_config.json index 27ec38f6a4eb55c99dc4805f91d6e388e735308c..a053c1a22f7d3d290628c661a5cf113a3be44f53 100644 --- a/src/finn/qnn-data/build_dataflow/dataflow_build_config.json +++ b/src/finn/qnn-data/build_dataflow/dataflow_build_config.json @@ -7,6 +7,7 @@ "standalone_thresholds": true, "shell_flow_type": "vivado_zynq", "verify_save_rtlsim_waveforms": true, + "force_python_rtlsim": true, "verify_steps": [ "initial_python", "streamlined_python", diff --git a/src/finn/transformation/fpgadataflow/convert_to_hls_layers.py b/src/finn/transformation/fpgadataflow/convert_to_hls_layers.py index d1f6eb4608f71b57810366b7753b3036c52e117e..b9b4fd45d984d51fb45698df808ecbc0386cd2b3 100644 --- a/src/finn/transformation/fpgadataflow/convert_to_hls_layers.py +++ b/src/finn/transformation/fpgadataflow/convert_to_hls_layers.py @@ -117,8 +117,12 @@ class InferConvInpGen(Transformation): ConvInpGen_idim_h = odim_padding_h ConvInpGen_idim_w = odim_padding_w + padding_optype = ( + "FMPadding_rtl" if self.use_rtl_variant else "FMPadding_Batch" + ) + padding_node = helper.make_node( - "FMPadding_Batch", + padding_optype, [i2c_input], [padding_out], domain="finn.custom_op.fpgadataflow", diff --git a/src/finn/transformation/fpgadataflow/create_stitched_ip.py b/src/finn/transformation/fpgadataflow/create_stitched_ip.py index 52e4e88b409766f0764d3ce7666dbf1971713575..d1cb3c4af9decb30a731bafe209fbf507fe03991 100644 --- a/src/finn/transformation/fpgadataflow/create_stitched_ip.py +++ b/src/finn/transformation/fpgadataflow/create_stitched_ip.py @@ -310,6 +310,14 @@ class CreateStitchedIP(Transformation): behavior. It is strongly recommended to insert FIFOs prior to calling CreateStitchedIP.""" ) + if model.graph.node[0].op_type == "StreamingFIFO": + firstfifo = getCustomOp(model.graph.node[0]) + if firstfifo.get_nodeattr("impl_style") == "vivado": + warnings.warn( + """First FIFO has impl_style=vivado, which may cause + simulation glitches (e.g. dropping the first input sample + after reset).""" + ) for node in model.graph.node: # ensure that all nodes are fpgadataflow, and that IPs are generated assert is_fpgadataflow_node( @@ -404,7 +412,7 @@ class CreateStitchedIP(Transformation): wrapper_filename = "%s/hdl/%s_wrapper.v" % (bd_base, block_name) tcl.append("add_files -norecurse %s" % wrapper_filename) model.set_metadata_prop("wrapper_filename", wrapper_filename) - tcl.append("set_property top finn_design_wrapper [current_fileset]") + tcl.append("set_property top %s_wrapper [current_fileset]" % block_name) # synthesize to DCP and export stub, DCP and constraints if self.vitis: tcl.append( diff --git a/src/finn/transformation/fpgadataflow/derive_characteristic.py b/src/finn/transformation/fpgadataflow/derive_characteristic.py index f783f7ae711739cf4e011315c6714ad95d3c7919..67eb96995ef3312dff72799c905216b82b7ef8ee 100644 --- a/src/finn/transformation/fpgadataflow/derive_characteristic.py +++ b/src/finn/transformation/fpgadataflow/derive_characteristic.py @@ -134,8 +134,9 @@ class DeriveFIFOSizes(NodeLocalTransformation): NodeLocalTransformation for more details. 
""" - def __init__(self, num_workers=None): + def __init__(self, num_workers=None, io_fifo_depth=32): super().__init__(num_workers=num_workers) + self.io_fifo_depth = io_fifo_depth def applyNodeLocal(self, node): op_type = node.op_type @@ -161,7 +162,7 @@ class DeriveFIFOSizes(NodeLocalTransformation): if cons_node is None: # could be final node, will be overridden if so # need an entry in the list anyway - out_fifo_depths.append(2) + out_fifo_depths.append(self.io_fifo_depth) continue cons = registry.getCustomOp(cons_node) cons_chrc = cons.get_nodeattr("io_chrc_in")[0] @@ -182,6 +183,14 @@ class DeriveFIFOSizes(NodeLocalTransformation): # for each tensor prod.set_nodeattr("outFIFODepths", out_fifo_depths) + # finally, check node inputs to ensure FIFOs are added to + # any top-level inputs (at least self.io_fifo_depth deep) + in_fifo_depths = prod.get_nodeattr("inFIFODepths") + for (i, input_name) in enumerate(node.input): + if input_name in [x.name for x in model.graph.input]: + in_fifo_depths[i] = max(self.io_fifo_depth, in_fifo_depths[i]) + prod.set_nodeattr("inFIFODepths", in_fifo_depths) + except KeyError: # exception if op_type is not supported raise Exception( diff --git a/src/finn/transformation/fpgadataflow/insert_dwc.py b/src/finn/transformation/fpgadataflow/insert_dwc.py index efc179923545eb06e4d173c683b0941887f8bb79..cff8b602674fec41a1e6fd1d467acdc989b4afe2 100644 --- a/src/finn/transformation/fpgadataflow/insert_dwc.py +++ b/src/finn/transformation/fpgadataflow/insert_dwc.py @@ -81,12 +81,11 @@ class InsertDWC(Transformation): dwc_in_width = n0.get_outstream_width() # determine dwc outwidth dwc_out_width = n1.get_instream_width() - larger_width = max(dwc_in_width, dwc_out_width) - smaller_width = min(dwc_in_width, dwc_out_width) - if larger_width % smaller_width == 0: - impl_style = "hls" - else: - impl_style = "vivado" + # use hls mode by default since it supports more configs + # vivado mode can be manually enabled by user, but does not + # support e.g. node-by-node rtlsim neded for + # characterization-based FIFO sizing + impl_style = "hls" # determine shape for dwc dwc_shape = n0.get_normal_output_shape() diff --git a/src/finn/transformation/fpgadataflow/insert_fifo.py b/src/finn/transformation/fpgadataflow/insert_fifo.py index 38c2d60c9a09644ceec07e11bf91315ee743e02c..bfeee95e9bbd2a3a3f7c6eb0a4c7e74d30f76228 100644 --- a/src/finn/transformation/fpgadataflow/insert_fifo.py +++ b/src/finn/transformation/fpgadataflow/insert_fifo.py @@ -71,13 +71,15 @@ class InsertFIFO(Transformation): of the subsequent node. max() of these two values sets the FIFO depth. Constructor arguments: - - max_qsrl_depth : FIFOs deeper than this will use Vivado IP instead of - Verilog FIFOs (Q_srl.v) - - vivado_ram_style : the StreamingFIFO.ram_style attribute to be used for - large FIFOs implemented by Vivado - - create_shallow_fifos : Normally, shallow-depth (<=2) FIFOs won't be created since - HLS streaming interfaces already have a degree of buffering. - Override with this parameter. + + :parameter max_qsrl_depth: FIFOs deeper than this will use Vivado IP + instead of Verilog FIFOs (Q_srl.v) + :parameter vivado_ram_style: the StreamingFIFO.ram_style attribute + to be used for large FIFOs implemented by Vivado + :parameter create_shallow_fifos: Normally, shallow-depth (<=2) FIFOs + won't be created since HLS streaming interfaces + already have a degree of buffering. + Override with this parameter. 
The other node attributes necessary to create a FIFO node are taken from the @@ -175,14 +177,9 @@ class InsertFIFO(Transformation): for idx, inp in enumerate(consumer.input): if inp == output_name: consumer.input[idx] = fifo_output_tensor.name - # ensure created FIFO depth is reflected on both sides - odepths = n0.get_nodeattr("outFIFODepths") - odepths[idx_out] = fifo_depth - n0.set_nodeattr("outFIFODepths", odepths) - idepths = n1.get_nodeattr("inFIFODepths") - idepths[idx_inp] = fifo_depth - n1.set_nodeattr("inFIFODepths", idepths) - + # removed setting of node attributes based on created + # FIFO sizes here; better to preserve original attrs + # as they are. graph_modified = True if graph_modified is False: @@ -202,41 +199,44 @@ class InsertFIFO(Transformation): dtype = n0.get_input_datatype(inp_ind) fifo_depth = n0.get_nodeattr("inFIFODepths")[inp_ind] - if fifo_depth <= 2: - warnings.warn("Overriding input FIFO depth to 32") - fifo_depth = 32 - - # create fifo node - fifo_output_tensor = oh.make_tensor_value_info( - model.make_new_valueinfo_name(), - TensorProto.FLOAT, - n0.get_normal_input_shape(), - ) - graph.value_info.append(fifo_output_tensor) - model.set_tensor_datatype(fifo_output_tensor.name, dtype) + if fifo_depth > 2 or self.create_shallow_fifos: + # create fifo node + fifo_output_tensor = oh.make_tensor_value_info( + model.make_new_valueinfo_name(), + TensorProto.FLOAT, + n0.get_normal_input_shape(), + ) + graph.value_info.append(fifo_output_tensor) + model.set_tensor_datatype(fifo_output_tensor.name, dtype) - if self.max_qsrl_depth is None or fifo_depth <= self.max_qsrl_depth: + # only use rtl-style FIFOs to avoid simulation bug + # (top-level IOs should not have impl_style=vivado) impl_style = "rtl" + + fifo_node = oh.make_node( + "StreamingFIFO", + [n_input], + [fifo_output_tensor.name], + domain="finn.custom_op.fpgadataflow", + backend="fpgadataflow", + depth=fifo_depth, + folded_shape=fld_shape, + dataType=str(dtype.name), + impl_style=impl_style, + ram_style=self.vivado_ram_style, + ) + # insert fifo + graph.node.insert(0, fifo_node) + + # set fifo output tensor as new input tensor of second node + first_node.input[inp_ind] = fifo_output_tensor.name else: - impl_style = "vivado" - - fifo_node = oh.make_node( - "StreamingFIFO", - [n_input], - [fifo_output_tensor.name], - domain="finn.custom_op.fpgadataflow", - backend="fpgadataflow", - depth=fifo_depth, - folded_shape=fld_shape, - dataType=str(dtype.name), - impl_style=impl_style, - ram_style=self.vivado_ram_style, - ) - # insert fifo - graph.node.insert(0, fifo_node) - - # set fifo output tensor as new input tensor of second node - first_node.input[inp_ind] = fifo_output_tensor.name + warnings.warn( + """Input FIFO for %s has depth %d and won't + be created. This may cause RTL simulation issues.
+ """ + % (graph_in_name, fifo_depth) + ) # insert FIFO as last node, except when last node is DMA graph_out_names = [x.name for x in model.graph.output] @@ -257,40 +257,43 @@ class InsertFIFO(Transformation): dtype = n0.get_output_datatype(out_ind) fifo_depth = n0.get_nodeattr("outFIFODepths")[out_ind] - if fifo_depth <= 2: - warnings.warn("Overriding output FIFO depth to 32") - fifo_depth = 32 - - # create fifo node - fifo_input_tensor = oh.make_tensor_value_info( - model.make_new_valueinfo_name(), - TensorProto.FLOAT, - n0.get_normal_output_shape(), - ) - graph.value_info.append(fifo_input_tensor) - model.set_tensor_datatype(fifo_input_tensor.name, dtype) + if fifo_depth > 2 or self.create_shallow_fifos: + # create fifo node + fifo_input_tensor = oh.make_tensor_value_info( + model.make_new_valueinfo_name(), + TensorProto.FLOAT, + n0.get_normal_output_shape(), + ) + graph.value_info.append(fifo_input_tensor) + model.set_tensor_datatype(fifo_input_tensor.name, dtype) - if self.max_qsrl_depth is None or fifo_depth <= self.max_qsrl_depth: + # only use rtl-style FIFOs to avoid simulation bug + # (top-level IOs should not have impl_style=vivado) impl_style = "rtl" + + fifo_node = oh.make_node( + "StreamingFIFO", + [fifo_input_tensor.name], + [graph_out_name], + domain="finn.custom_op.fpgadataflow", + backend="fpgadataflow", + depth=fifo_depth, + folded_shape=fld_shape, + dataType=str(dtype.name), + impl_style=impl_style, + ram_style=self.vivado_ram_style, + ) + # insert fifo + graph.node.append(fifo_node) + + # set fifo output tensor as new input tensor of second node + final_node.output[0] = fifo_input_tensor.name else: - impl_style = "vivado" - - fifo_node = oh.make_node( - "StreamingFIFO", - [fifo_input_tensor.name], - [graph_out_name], - domain="finn.custom_op.fpgadataflow", - backend="fpgadataflow", - depth=fifo_depth, - folded_shape=fld_shape, - dataType=str(dtype.name), - impl_style=impl_style, - ram_style=self.vivado_ram_style, - ) - # insert fifo - graph.node.append(fifo_node) - - # set fifo output tensor as new input tensor of second node - final_node.output[0] = fifo_input_tensor.name + warnings.warn( + """Output FIFO for %s has depth %d and won't + be created. This may cause RTL simulation issues. + """ + % (graph_out_name, fifo_depth) + ) return (model, graph_modified) diff --git a/src/finn/transformation/fpgadataflow/set_fifo_depths.py b/src/finn/transformation/fpgadataflow/set_fifo_depths.py index 4492be0725bd7d804b29424970639776f692ac83..35e7b9e6c929587d00038650742edb5dcb922130 100644 --- a/src/finn/transformation/fpgadataflow/set_fifo_depths.py +++ b/src/finn/transformation/fpgadataflow/set_fifo_depths.py @@ -29,10 +29,16 @@ import math import numpy as np import warnings +from onnx import TensorProto, helper from pyverilator.util.axi_utils import reset_rtlsim, toggle_clk +from qonnx.core.datatype import DataType from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.base import Transformation -from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNodeNames +from qonnx.transformation.general import ( + GiveReadableTensorNames, + GiveUniqueNodeNames, + SortGraph, +) from finn.analysis.fpgadataflow.dataflow_performance import dataflow_performance from finn.transformation.fpgadataflow.annotate_cycles import AnnotateCycles @@ -126,14 +132,17 @@ class CapConvolutionFIFODepths(Transformation): constructor flag is set. 
Constructor arguments: - - max_qsrl_depth : FIFOs deeper than this will use Vivado IP instead of - Verilog FIFOs (Q_srl.v) + + :parameter max_qsrl_depth: FIFOs deeper than this will use Vivado IP + instead of Verilog FIFOs (Q_srl.v) Assumed input graph properties: + - all nodes are fpgadataflow nodes - FIFOs inserted with InsertAndSetFIFODepths Output: + - graph with smaller-depth FIFOs for convolutions Background: @@ -189,22 +198,25 @@ class InsertAndSetFIFODepths(Transformation): throughput in the created accelerator. Constructor arguments: - - clk_ns : clock period (used for IP preparation) - - max_qsrl_depth : FIFOs deeper than this will use Vivado IP instead of - Verilog FIFOs (Q_srl.v) - - max_depth : how deep the "max"-sized FIFOs initially inserted will be - if set to None, use the tensor size as the depth - - swg_exception : call CapConvolutionFIFODepths to make convolution FIFOs - smaller where appropriate - - vivado_ram_style : the StreamingFIFO.ram_style attribute to be used for - large FIFOs implemented by Vivado afterwards + + :parameter clk_ns: clock period (used for IP preparation) + :parameter max_qsrl_depth: FIFOs deeper than this will use Vivado IP + instead of Verilog FIFOs (Q_srl.v) + :parameter max_depth: how deep the "max"-sized FIFOs initially inserted + will be. If set to None, use the tensor size as the depth + :parameter swg_exception: call CapConvolutionFIFODepths to make convolution FIFOs + smaller where appropriate + :parameter vivado_ram_style: the StreamingFIFO.ram_style attribute to be used + for large FIFOs implemented by Vivado afterwards Assumed input graph properties: + - all nodes are fpgadataflow nodes - no FIFOs inserted, - (inFIFODepths/outFIFODepths attrs will be ignored) Output: + - graph with appropriate-depth FIFOs inserted Background: @@ -212,12 +224,14 @@ class InsertAndSetFIFODepths(Transformation): necessary to insert FIFOs between them to prevent stalls due to bursty behavior. 
The sizes of those FIFOs are hard to predict analytically, so we do the following: + - insert deep (=tensor size) FIFOs between all fpgadataflow nodes - create stitched design - run through rtlsim with stream of multiple random input images (to fill pipeline) - keep track of observed maximum occupancy for each FIFO during rtlsim - when sim finished, update each FIFO depth to maximum observed occupancy - and set inFIFODepths/outFIFODepths attrs to 0 on relevant nodes + and set inFIFODepths/outFIFODepths attrs to that depth as well + """ def __init__( @@ -281,7 +295,7 @@ class InsertAndSetFIFODepths(Transformation): # insert stream infrastructure (DWC/FIFO) model = model.transform(InsertDWC()) - model = model.transform(InsertFIFO()) + model = model.transform(InsertFIFO(create_shallow_fifos=True)) model = model.transform(GiveUniqueNodeNames()) model = model.transform(GiveReadableTensorNames()) @@ -374,9 +388,9 @@ # nodes as # inputs to drive the simulation n_inputs = int(len(model.graph.node) / 2) else: - # convnet, single input is typically enough to fill entire + # convnet, two inputs are typically enough to fill entire # layer pipeline due to overlaps - n_inputs = 1 + n_inputs = 2 sim = verilator_fifosim(model, n_inputs) for ind, node in enumerate(fifo_nodes): @@ -398,8 +412,13 @@ node_inst = getCustomOp(node) node_inst.set_nodeattr("depth", depth) node_inst.set_nodeattr("depth_monitor", 0) + # exception for top-level IO FIFOs which cause a bug in simulation + # (top-level IOs should not have impl_style=vivado) + toplevel_in = node.input[0] in [x.name for x in model.graph.input] + toplevel_out = node.output[0] in [x.name for x in model.graph.output] + toplevel_style_exception = toplevel_in or toplevel_out # Set FIFO implementation/ram styles - if depth > self.max_qsrl_depth: + if (depth > self.max_qsrl_depth) and (not toplevel_style_exception): node_inst.set_nodeattr("impl_style", "vivado") node_inst.set_nodeattr("ram_style", self.vivado_ram_style) else: @@ -408,11 +427,7 @@ reset_implementation(node_inst) del fifos[node.name] else: - inst = getCustomOp(node) - ifd = inst.get_nodeattr("inFIFODepths") - ofd = inst.get_nodeattr("outFIFODepths") - inst.set_nodeattr("inFIFODepths", [0] * len(ifd)) - inst.set_nodeattr("outFIFODepths", [0] * len(ofd)) + # (removed setting of node FIFO size attributes to 0 here) # for every extw node we changed from external to decoupled, # change back and reset implementation if node.op_type in extw_optypes: @@ -434,4 +449,172 @@ # remove shallow FIFOs model = model.transform(RemoveShallowFIFOs()) + # reflect final values in attributes + for node in model.graph.node: + if node.op_type != "StreamingFIFO": + node_inst = getCustomOp(node) + fifodepth_in = [] + for node_inp in node.input: + prod = model.find_producer(node_inp) + if prod is None: + # no producer for this input + if node_inp in [x.name for x in model.graph.input]: + # top-level input with no FIFO + fifodepth_in.append(0) + else: + # FIFO depth attr applies only to dynamic inputs + pass + else: + # there is a producer for this input + if prod.op_type == "StreamingFIFO": + prod_inst = getCustomOp(prod) + fifodepth_in.append(prod_inst.get_nodeattr("depth")) + else: + # explicitly no FIFO on this dynamic input + fifodepth_in.append(0) + fifodepth_out = [] + for node_out in node.output: + cons =
model.find_consumer(node_out) + if cons is None: + # no consumer for this output + if node_out in [x.name for x in model.graph.output]: + # top-level output with no FIFO + fifodepth_out.append(0) + else: + # FIFO depth attr applies only to dynamic outputs + pass + else: + # there is a consumer for this output + if cons.op_type == "StreamingFIFO": + cons_inst = getCustomOp(cons) + fifodepth_out.append(cons_inst.get_nodeattr("depth")) + else: + # explicitly no FIFO on this dynamic output + fifodepth_out.append(0) + node_inst.set_nodeattr("inFIFODepths", fifodepth_in) + node_inst.set_nodeattr("outFIFODepths", fifodepth_out) + + return (model, False) + + +def get_fifo_split_configs(depth, max_qsrl_depth=256, max_vivado_depth=32768): + """Break non-power-of-2 sized FIFO depths into several ones""" + + def floor_pow2(x): + if (x & (x - 1) == 0) and x != 0: + return x + else: + return 1 << ((x - 1).bit_length() - 1) + + def decompose_pow2(x): + if x <= max_qsrl_depth: + return [x] + else: + r = floor_pow2(x) + if x == r: + return [x] + else: + return [r, *decompose_pow2(x - r)] + + ret = [] + # trivial case: for small FIFOs, return as-is with rtl style + if depth <= max_qsrl_depth: + return [(depth, "rtl")] + # first pass: ensure max depth is respected + # (restricted by Vivado AXIS infra IP) + remainder = depth + while remainder != 0: + if remainder > max_vivado_depth: + ret.append(max_vivado_depth) + remainder -= max_vivado_depth + else: + ret.append(remainder) + remainder = 0 + # second pass: break non-power-of-2 sized FIFOs + # into several ones + + ret_pass2 = list(map(decompose_pow2, ret)) + # unpack list of lists + ret_pass2 = [x for dec_list in ret_pass2 for x in dec_list] + + # finally, add impl_style to each split FIFO + ret_final = [] + for cand_depth in ret_pass2: + if cand_depth <= max_qsrl_depth: + ret_final.append((cand_depth, "rtl")) + else: + ret_final.append((cand_depth, "vivado")) + + return ret_final + + +class SplitLargeFIFOs(Transformation): + """Split large FIFOs before implementation, for two reasons: + + - impl_style="vivado" supports a max depth of 32k. Any larger + FIFOs must be implemented as a sequence of smaller FIFOs. + - impl_style="vivado" requires power-of-two depths, which is + normally handled by rounding up to the nearest power-of-two. + So a FIFO of size 8196 normally gets rounded up to a depth of + 16384 and wastes a lot of resources. Here, instead, we split + this up into two FIFOs of depth 8192 + 4.
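Since `get_fifo_split_configs` is shown in full above, its behavior can be sanity-checked directly. A small usage sketch, assuming the function is importable from `finn.transformation.fpgadataflow.set_fifo_depths`:

```python
from finn.transformation.fpgadataflow.set_fifo_depths import get_fifo_split_configs

# small FIFOs stay whole and keep the qsrl (rtl) implementation
assert get_fifo_split_configs(100) == [(100, "rtl")]
# the docstring example: 8196 splits into 8192 (vivado) + 4 (rtl)
# rather than being rounded up to a single 16384-deep FIFO
assert get_fifo_split_configs(8196) == [(8192, "vivado"), (4, "rtl")]
# depths beyond 32768 are first chunked to the Vivado maximum, then
# power-of-two decomposed; the split depths always sum to the original
cfg = get_fifo_split_configs(40000)
assert sum(d for d, _ in cfg) == 40000
```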
+ + """ + + def __init__(self, max_qsrl_depth=256, max_vivado_depth=32768): + super().__init__() + self.max_qsrl_depth = max_qsrl_depth + self.max_vivado_depth = max_vivado_depth + + def apply(self, model): + graph = model.graph + node_ind = 0 + graph_modified = False + for node in graph.node: + node_ind += 1 + if node.op_type == "StreamingFIFO": + n_inst = getCustomOp(node) + depth = n_inst.get_nodeattr("depth") + cfgs = get_fifo_split_configs( + depth, self.max_qsrl_depth, self.max_vivado_depth + ) + if len(cfgs) > 1: + fld_shape = n_inst.get_folded_output_shape() + dtype = n_inst.get_nodeattr("dataType") + ram_style = n_inst.get_nodeattr("ram_style") + shape = model.get_tensor_shape(node.input[0]) + for i, (fifo_depth, impl_style) in enumerate(cfgs): + if i == 0: + inp = node.input[0] + else: + inp = node.name + "_" + str(i - 1) + "_out" + if i == len(cfgs) - 1: + outp = node.output[0] + else: + outp = node.name + "_" + str(i) + "_out" + out_tensor = helper.make_tensor_value_info( + outp, TensorProto.FLOAT, shape + ) + graph.value_info.append(out_tensor) + model.set_tensor_datatype(out_tensor.name, DataType[dtype]) + fifo_node = helper.make_node( + "StreamingFIFO", + [inp], + [outp], + domain="finn.custom_op.fpgadataflow", + backend="fpgadataflow", + depth=fifo_depth, + folded_shape=fld_shape, + dataType=dtype, + impl_style=impl_style, + ram_style=ram_style, + name=node.name + "_" + str(i), + ) + graph.node.insert(node_ind + i, fifo_node) + + graph.node.remove(node) + graph_modified = True + if graph_modified: + model = model.transform(SortGraph()) + model = model.transform(GiveReadableTensorNames()) return (model, False) diff --git a/src/finn/transformation/fpgadataflow/set_folding.py b/src/finn/transformation/fpgadataflow/set_folding.py index 48e5d9f9e14c2d606bbdb10f695f11f32d388a2e..4e79a3faa5bcfa502c4d82dc1c86572ce88ccf20 100644 --- a/src/finn/transformation/fpgadataflow/set_folding.py +++ b/src/finn/transformation/fpgadataflow/set_folding.py @@ -62,13 +62,14 @@ class SetFolding(Transformation): Notable exceptions and special behavior: - * When folding dense convolution/FC compute engines ("MVAU"/MatrixVectorActivation), + When folding dense convolution/FC compute engines ("MVAU"/MatrixVectorActivation), which have two attributes (PE and SIMD): - * first increases SIMD while weight stream width per PE is <= mvau_wwidth_max - (configurable in the SetFolding initializer, defaults to 36) - * then increases PE until the target is met or max PE reached - * When folding depthwise convolutions ("VVAU"/VectorVectorActivation) + * first increases SIMD while weight stream width per PE is <= mvau_wwidth_max + (configurable in the SetFolding initializer, defaults to 36) + * then increases PE until the target is met or max PE reached + + When folding depthwise convolutions ("VVAU"/VectorVectorActivation) or spatial reduction ops (Pool_Batch): * the producer of the node is expected to be a ConvolutionInputGenerator with depthwise=1, whose SIMD value will be set equal to the PE value of diff --git a/src/finn/transformation/fpgadataflow/templates.py b/src/finn/transformation/fpgadataflow/templates.py index 78bcdea0d701f97e9f80d7c7c489aa01bc93fa52..f52bad0ffb35ae4714acc24aef368d01967db426 100644 --- a/src/finn/transformation/fpgadataflow/templates.py +++ b/src/finn/transformation/fpgadataflow/templates.py @@ -126,6 +126,9 @@ if {$BOARD == "ZCU104"} { } elseif {$BOARD == "Pynq-Z1"} { set ZYNQ_TYPE "zynq_7000" set_property board_part www.digilentinc.com:pynq-z1:part0:1.0 [current_project] +} elseif 
{$BOARD == "KV260_SOM"} { + set ZYNQ_TYPE "zynq_us+" + set_property board_part xilinx.com:kv260_som:part0:1.3 [current_project] } else { puts "Unrecognized board" } diff --git a/src/finn/transformation/fpgadataflow/vitis_build.py b/src/finn/transformation/fpgadataflow/vitis_build.py index 97da4d41524e7c86fb5a73375d9ea2c2c9aa10dc..e0a5666000fc2aa9599bb7475c1b8dd37489afac 100644 --- a/src/finn/transformation/fpgadataflow/vitis_build.py +++ b/src/finn/transformation/fpgadataflow/vitis_build.py @@ -358,16 +358,16 @@ class VitisBuild(Transformation): """Best-effort attempt at building the accelerator with Vitis. It assumes the model has only fpgadataflow nodes - fpga_part: string identifying the target FPGA - period_ns: target clock period - platform: target Alveo platform, one of ["U50", "U200", "U250", "U280"] - strategy: Vitis optimization strategy - enable_debug: add Chipscope to all AXI interfaces - floorplan_file: path to a JSON containing a dictionary with SLR assignments - for each node in the ONNX graph. Must be parse-able by - the ApplyConfig transform. - enable_link: enable linking kernels (.xo files), otherwise just synthesize - them independently. + :parameter fpga_part: string identifying the target FPGA + :parameter period_ns: target clock period + :parameter platform: target Alveo platform, one of ["U50", "U200", "U250", "U280"] + :parameter strategy: Vitis optimization strategy + :parameter enable_debug: add Chipscope to all AXI interfaces + :parameter floorplan_file: path to a JSON containing a dictionary with + SLR assignments for each node in the ONNX graph. + Must be parse-able by the ApplyConfig transform. + :parameter enable_link: enable linking kernels (.xo files), + otherwise just synthesize them independently. """ def __init__( diff --git a/src/finn/transformation/qonnx/convert_qonnx_to_finn.py b/src/finn/transformation/qonnx/convert_qonnx_to_finn.py index 967a1276365e4af1a6d617c081b9c04b4710da97..34f11d1e95e6bc3f6a36ce6d878ed493108b3ba6 100644 --- a/src/finn/transformation/qonnx/convert_qonnx_to_finn.py +++ b/src/finn/transformation/qonnx/convert_qonnx_to_finn.py @@ -56,12 +56,12 @@ class ConvertQONNXtoFINN(Transformation): is not converted to a MultiThreshold node. :param filter_function: Each candidate Quant and BinaryQant node is first evaluated - by this function. If the function returns False, - then the node is not converted to a MultiTrheshold node. - The function is given the model and candidate node as parameters. - Per default a filter function is inserted, which disables the conversion of - Quant nodes, which have a bit width of larger than 8. - Defaults to: default_filter_function_generator(max_multithreshold_bit_width=8) + by this function. If the function returns False, + then the node is not converted to a MultiTrheshold node. + The function is given the model and candidate node as parameters. + Per default a filter function is inserted, which disables the conversion of + Quant nodes, which have a bit width of larger than 8. 
+ Defaults to: default_filter_function_generator(max_multithreshold_bit_width=8) """ def __init__( diff --git a/src/finn/transformation/qonnx/qonnx_activation_handlers.py b/src/finn/transformation/qonnx/qonnx_activation_handlers.py index a50a5850779cadf7ab21b9c1c4dfdbb36232af42..9819086d826a51d1df5240d88c4fda8513cc9ba6 100644 --- a/src/finn/transformation/qonnx/qonnx_activation_handlers.py +++ b/src/finn/transformation/qonnx/qonnx_activation_handlers.py @@ -52,9 +52,7 @@ class QuantActBaseHandler(ABC): self._q_node = quant_node self._q_index = quant_node_index - @property @classmethod - @abstractmethod def valid_predecessor_op_types(self): """Defines which op types the preceding node is allowed to have for this type of activation. @@ -284,9 +282,11 @@ class QuantReluHandler(QuantActBaseHandler): """Class for converting a quantized relu operation expressed in the QONNX dialect to the FINN ONNX dialect.""" - valid_predecessor_op_types = [ - "Relu", - ] + @classmethod + def valid_predecessor_op_types(self): + return [ + "Relu", + ] def _check_compatibility(self): if self._q_node.op_type == "Quant": @@ -391,15 +391,17 @@ class QuantIdentityHandler(QuantActBaseHandler): these are equivalent to quantized identity activations. """ - valid_predecessor_op_types = [ - "BatchNormalization", - "Sub", - "Add", - "Mul", - "Div", - "DebugMarker", - None, - ] + @classmethod + def valid_predecessor_op_types(self): + return [ + "BatchNormalization", + "Sub", + "Add", + "Mul", + "Div", + "DebugMarker", + None, + ] def _check_compatibility(self): # Gather parameters to check diff --git a/src/finn/transformation/qonnx/quant_act_to_multithreshold.py b/src/finn/transformation/qonnx/quant_act_to_multithreshold.py index 77025ecdf57d5a422992d4163d05c740454986bb..48dda3820deb051bd8a291188f02fe7d1dd2cc0b 100644 --- a/src/finn/transformation/qonnx/quant_act_to_multithreshold.py +++ b/src/finn/transformation/qonnx/quant_act_to_multithreshold.py @@ -30,7 +30,10 @@ import warnings from qonnx.transformation.base import Transformation -from finn.transformation.qonnx.qonnx_activation_handlers import QuantActBaseHandler +from finn.transformation.qonnx.qonnx_activation_handlers import ( + QuantActBaseHandler, + QuantIdentityHandler, +) def default_filter_function_generator(max_multithreshold_bit_width=8): @@ -66,8 +69,7 @@ def default_filter_function_generator(max_multithreshold_bit_width=8): class ConvertQuantActToMultiThreshold(Transformation): - """ - Converts Quant nodes in the activation path to MultiThreshold nodes. + """Converts Quant nodes in the activation path to MultiThreshold nodes. The optional keyword argument `filter_function` presents a way to control which Quant and BipolarQuant nodes in the activation path is not converted to a MultiThreshold node. :param filter_function: Each candidate Quant and BinaryQuant node is first evaluated - by this function. If the function returns False, - then the node is not converted to a MultiTrheshold node. - The function is given the model and candidate node as parameters. - Per default a filter function is inserted, which disables the conversion of - Quant nodes, which have a bit width of larger than 8. - Defaults to: default_filter_function_generator(max_multithreshold_bit_width=8) + by this function. If the function returns False, + then the node is not converted to a MultiThreshold node. + The function is given the model and candidate node as parameters.
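A note on the handler refactor above: `valid_predecessor_op_types` changes from a list-valued class attribute (previously stacked with `@property` and `@abstractmethod`, which the diff drops) into a classmethod, so every call site must now add parentheses, as the `quant_act_to_multithreshold.py` hunk below does. A minimal standalone sketch of the pattern (demo class, not the FINN one):

```python
# Demo of the refactored lookup pattern: a classmethod returning the
# list of allowed predecessor op types, invoked with parentheses.
class DemoReluHandler:
    @classmethod
    def valid_predecessor_op_types(cls):
        return ["Relu"]

# call sites iterate over handler classes and call the method:
assert "Relu" in DemoReluHandler.valid_predecessor_op_types()
```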
+ By default a filter function is inserted, which disables the conversion of + Quant nodes with a bit width larger than 8. + Defaults to: default_filter_function_generator(max_multithreshold_bit_width=8) """ def __init__( @@ -127,7 +129,7 @@ class ConvertQuantActToMultiThreshold(Transformation): # Check for possible ambiguity in handler selection valid_predecessors = [] for cls in QuantActBaseHandler.__subclasses__(): - valid_predecessors.extend(cls.valid_predecessor_op_types) + valid_predecessors.extend(cls.valid_predecessor_op_types()) if len(valid_predecessors) != len(set(valid_predecessors)): raise RuntimeError( "Two or more activation handlers declare the same " @@ -138,16 +140,15 @@ # Try to find a fitting handler for this Quant activation node for handler_cls in QuantActBaseHandler.__subclasses__(): - if predecessor_op_type in handler_cls.valid_predecessor_op_types: + if predecessor_op_type in handler_cls.valid_predecessor_op_types(): handler = handler_cls(model, n, node_ind) break else: - raise ValueError( - f"Quant nodes in the activation path and with predecessor " - f"nodes of type {predecessor_op_type} are currently not " - f"supported by FINN and can not be converted to " - f"MultiThreshold nodes." - ) + # fall back to QuantIdentityHandler here + # it may still not work due to its particular restrictions, + # but better than just erroring out without trying + handler = QuantIdentityHandler(model, n, node_ind) + model = handler.replace_quant_node() graph_modified = True return (model, graph_modified) diff --git a/src/finn/util/basic.py b/src/finn/util/basic.py index 4aba87216c8999612f748e989a945ceff33da167..3bc5b803db2072f4d0ed3829adab93b4fbd3b98e 100644 --- a/src/finn/util/basic.py +++ b/src/finn/util/basic.py @@ -40,6 +40,8 @@ pynq_part_map["ZCU102"] = "xczu9eg-ffvb1156-2-e" pynq_part_map["ZCU104"] = "xczu7ev-ffvc1156-2-e" pynq_part_map["ZCU111"] = "xczu28dr-ffvg1517-2-e" pynq_part_map["RFSoC2x2"] = "xczu28dr-ffvg1517-2-e" +pynq_part_map["KV260_SOM"] = "xck26-sfvc784-2LV-c" + # native AXI HP port width (in bits) for PYNQ boards pynq_native_port_width = dict() @@ -50,6 +52,7 @@ pynq_native_port_width["ZCU102"] = 128 pynq_native_port_width["ZCU104"] = 128 pynq_native_port_width["ZCU111"] = 128 pynq_native_port_width["RFSoC2x2"] = 128 +pynq_native_port_width["KV260_SOM"] = 128 # Alveo device and platform mappings alveo_part_map = dict() diff --git a/src/finn/util/create.py b/src/finn/util/create.py index a8c2e67b385b797905cd4c5a196091069898b583..ed3e1a843eca47d2e20e9ca1c9df0d2d6f5a8a13 100644 --- a/src/finn/util/create.py +++ b/src/finn/util/create.py @@ -30,7 +30,11 @@ import numpy as np from onnx import TensorProto, helper from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper -from qonnx.util.basic import calculate_signed_dot_prod_range, gen_finn_dt_tensor +from qonnx.util.basic import ( + calculate_signed_dot_prod_range, + gen_finn_dt_tensor, + qonnx_make_model, +) def hls_random_mlp_maker(layer_spec): @@ -84,7 +88,7 @@ def hls_mlp_maker(layer_spec): graph = helper.make_graph(nodes=[], name="mlp", inputs=[], outputs=[]) - model = helper.make_model(graph, producer_name="finn") + model = qonnx_make_model(graph, producer_name="finn") model = ModelWrapper(model) for lyr in layer_spec: diff --git a/src/finn/util/pyverilator.py b/src/finn/util/pyverilator.py index a00899cf784cbe3985b942af6b3d9a4c14cd8706..8d188585694c172d97d73fa6b5820edb7b48a948 100644 ---
a/src/finn/util/pyverilator.py +++ b/src/finn/util/pyverilator.py @@ -133,6 +133,7 @@ def verilator_fifosim(model, n_inputs, max_iters=100000000): and throughput measurement.""" vivado_stitch_proj_dir = prepare_stitched_ip_for_verilator(model) + verilog_header_dir = vivado_stitch_proj_dir + "/pyverilator_vh" build_dir = make_build_dir("verilator_fifosim_") fifosim_cpp_fname = pk.resource_filename( "finn.qnn-data", "cpp/verilator_fifosim.cpp" ) @@ -184,6 +185,19 @@ def verilator_fifosim(model, n_inputs, max_iters=100000000): if which_verilator is None: raise Exception("'verilator' executable not found") + # add defines to make certain XPM src files work with Verilator + xpm_args = [] + xpm_args.append("-DDISABLE_XPM_ASSERTIONS") + xpm_args.append("-DOBSOLETE") + xpm_args.append("-DONESPIN") + xpm_args.append("--bbox-unsup") + vivado_path = os.environ["VIVADO_PATH"] + # additional SystemVerilog modules to make XPMs work with Verilator + xpm_memory = f"{vivado_path}/data/ip/xpm/xpm_memory/hdl/xpm_memory.sv" + xpm_cdc = f"{vivado_path}/data/ip/xpm/xpm_cdc/hdl/xpm_cdc.sv" + xpm_fifo = f"{vivado_path}/data/ip/xpm/xpm_fifo/hdl/xpm_fifo.sv" + verilog_file_arg = ["finn_design_wrapper.v", xpm_memory, xpm_cdc, xpm_fifo] + verilator_args = [ "perl", which_verilator, @@ -192,6 +206,8 @@ def verilator_fifosim(model, n_inputs, max_iters=100000000): build_dir, "-y", vivado_stitch_proj_dir, + "-y", + verilog_header_dir, "--CFLAGS", "--std=c++11", "-O3", @@ -201,13 +217,14 @@ def verilator_fifosim(model, n_inputs, max_iters=100000000): "fast", "--noassert", "--cc", - "finn_design_wrapper.v", + *verilog_file_arg, "--top-module", "finn_design_wrapper", "--exe", "verilator_fifosim.cpp", "--threads", "4", + *xpm_args, ] proc_env = os.environ.copy() diff --git a/src/finn/util/vcd.py b/src/finn/util/vcd.py index aaeb3ab920d1d8fae79c1173582d18cf81d03063..1f77276d5a72e5f886d5f94af8d35121ccadd486 100644 --- a/src/finn/util/vcd.py +++ b/src/finn/util/vcd.py @@ -101,19 +101,21 @@ def get_stream_if_stats(vcd_file, if_base_name): <stream_state>: (<num_samples>, <fraction_of_time>), where <stream_state> is the combination of (V)alid/(R)eady values, - <num_samples> is the approximate number of rising clock edges spent in <state> - , and <fraction_of_time> is the fraction of <num_samples> to total + <num_samples> is the approximate number of rising clock edges spent in <state>, + and <fraction_of_time> is <num_samples> as a fraction of the total amount of time recorded by the trace.
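For orientation, a hypothetical invocation of `get_stream_if_stats`; the VCD file name and the AXI-Stream interface base name below are placeholders for illustration, so substitute the trace and signal prefix from your own rtlsim run:

```python
from finn.util.vcd import get_stream_if_stats

# "trace.vcd" and the interface prefix are made-up values;
# if_base_name is the prefix of the tvalid/tready signals to inspect
stats = get_stream_if_stats("trace.vcd", "finn_design_wrapper.m_axis_0_")
for state, (n_samples, frac) in stats.items():
    print(f"{state}: {n_samples} rising edges ({100 * frac:.1f}% of trace)")
```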
Example: - {"{'V': 0, 'R': 0}": (5, 0.0006060606060606061), - "{'V': 1, 'R': 0}": (0, 0.0), - "{'V': 0, 'R': 1}": (7605, 0.9218181818181819), - "{'V': 1, 'R': 1}": (640, 0.07757575757575758)} - + { + "{'V': 0, 'R': 0}": (5, 0.0006060606060606061), + "{'V': 1, 'R': 0}": (0, 0.0), + "{'V': 0, 'R': 1}": (7605, 0.9218181818181819), + "{'V': 1, 'R': 1}": (640, 0.07757575757575758) + } Here we can see the stream was transmitting values 7.7% of the time, and 9.2% of the time there was no incoming data (valid 0, ready 1) """ + if_valid = if_base_name + vname if_ready = if_base_name + rname v = VCDVCD(vcd_file, signals=[if_valid], store_tvs=True) diff --git a/tests/end2end/test_end2end_bnn_pynq.py b/tests/end2end/test_end2end_bnn_pynq.py index 79cfafa22d670f168c3c03a5ef01a51256912a8c..858363d6d31c7c17803bffdb87e7b168dec4b76d 100644 --- a/tests/end2end/test_end2end_bnn_pynq.py +++ b/tests/end2end/test_end2end_bnn_pynq.py @@ -564,12 +564,6 @@ class TestEnd2End: model = model.transform(InsertAndSetFIFODepths(test_fpga_part, target_clk_ns)) fifo_layers = model.get_nodes_by_op_type("StreamingFIFO") assert len(fifo_layers) > 0 - hls_layers = model.get_finn_nodes() - for node in hls_layers: - if node.op_type != "StreamingFIFO": - op_inst = getCustomOp(node) - assert op_inst.get_nodeattr("inFIFODepths") == [0] - assert op_inst.get_nodeattr("outFIFODepths") == [0] model.save( get_checkpoint_name( topology, wbits, abits, QONNX_export, "fifodepth_" + kind diff --git a/tests/fpgadataflow/test_code_gen_trafo.py b/tests/fpgadataflow/test_code_gen_trafo.py index 49ee32c71ee941ff7435d4c12ccadae3f8e55c5e..f5edabbd4ba029899239cc2f40dd6a94d178eafd 100644 --- a/tests/fpgadataflow/test_code_gen_trafo.py +++ b/tests/fpgadataflow/test_code_gen_trafo.py @@ -32,7 +32,7 @@ import os from onnx import TensorProto, helper from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper -from qonnx.util.basic import gen_finn_dt_tensor, get_by_name +from qonnx.util.basic import gen_finn_dt_tensor, get_by_name, qonnx_make_model from finn.transformation.fpgadataflow.prepare_cppsim import PrepareCppSim @@ -70,7 +70,7 @@ def test_code_gen_trafo(): nodes=[FCLayer_node], name="fclayer_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="fclayer-model") + model = qonnx_make_model(graph, producer_name="fclayer-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_compilation_trafo.py b/tests/fpgadataflow/test_compilation_trafo.py index 9bafb101cedabc99d97356069c883cab4ed8a87f..d04b68a56ba7fc5f01e1eef57075636954f86843 100644 --- a/tests/fpgadataflow/test_compilation_trafo.py +++ b/tests/fpgadataflow/test_compilation_trafo.py @@ -32,7 +32,7 @@ import os from onnx import TensorProto, helper from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper -from qonnx.util.basic import gen_finn_dt_tensor, get_by_name +from qonnx.util.basic import gen_finn_dt_tensor, get_by_name, qonnx_make_model from finn.transformation.fpgadataflow.compile_cppsim import CompileCppSim from finn.transformation.fpgadataflow.prepare_cppsim import PrepareCppSim @@ -71,7 +71,7 @@ def test_compilation_trafo(): nodes=[FCLayer_node], name="fclayer_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="fclayer-model") + model = qonnx_make_model(graph, producer_name="fclayer-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git 
a/tests/fpgadataflow/test_convert_to_hls_1d_conv_layer.py b/tests/fpgadataflow/test_convert_to_hls_1d_conv_layer.py index 7b3e20616410f54e4718290baec9a510a0d49c0d..98a7c76ee4de0332586772ba7c1007ee55979a51 100644 --- a/tests/fpgadataflow/test_convert_to_hls_1d_conv_layer.py +++ b/tests/fpgadataflow/test_convert_to_hls_1d_conv_layer.py @@ -38,7 +38,7 @@ from qonnx.transformation.general import GiveUniqueNodeNames from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes from qonnx.transformation.lower_convs_to_matmul import LowerConvsToMatMul -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls @@ -121,7 +121,7 @@ def test_convert_to_hls_1d_conv_layer(conv_config, depthwise, use_rtl_swg, exec_ helper.make_tensor_value_info("p1", TensorProto.FLOAT, conv_param_shape) ] - modelproto = helper.make_model( + modelproto = qonnx_make_model( helper.make_graph( name="conv_test", inputs=[top_in], diff --git a/tests/fpgadataflow/test_convert_to_hls_channelwise_layer.py b/tests/fpgadataflow/test_convert_to_hls_channelwise_layer.py index 0f19b6d79ab0ed77981022f286fabd430094d69f..089d1ae420f4fab744fcda5950d88b13216b4c93 100644 --- a/tests/fpgadataflow/test_convert_to_hls_channelwise_layer.py +++ b/tests/fpgadataflow/test_convert_to_hls_channelwise_layer.py @@ -35,7 +35,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.general import GiveUniqueNodeNames from qonnx.transformation.infer_data_layouts import InferDataLayouts from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls @@ -57,7 +57,7 @@ def make_single_maxpool_modelwrapper(onnx_op_name, ishape, idt, pdt, pshape): outp = helper.make_tensor_value_info("outp", TensorProto.FLOAT, ishape) p0 = helper.make_tensor_value_info("p0", TensorProto.FLOAT, pshape) - model = helper.make_model( + model = qonnx_make_model( helper.make_graph( name="test", inputs=[inp], diff --git a/tests/fpgadataflow/test_convert_to_hls_conv_fc_transition.py b/tests/fpgadataflow/test_convert_to_hls_conv_fc_transition.py index 0760ff9b37487f4a1ac06853055d2e47b7269f9e..3512c39cb3fab04e4e4225728c9495b546b7c655 100755 --- a/tests/fpgadataflow/test_convert_to_hls_conv_fc_transition.py +++ b/tests/fpgadataflow/test_convert_to_hls_conv_fc_transition.py @@ -39,7 +39,7 @@ from qonnx.transformation.infer_data_layouts import InferDataLayouts from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes from qonnx.transformation.lower_convs_to_matmul import LowerConvsToMatMul -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls @@ -149,7 +149,7 @@ def test_convert_to_hls_conv_fc_transition(conv_config, depthwise, use_reshape): "Flatten", ["thres1_out"], ["flatten_out"], axis=1 ) - modelproto = helper.make_model( + modelproto = qonnx_make_model( helper.make_graph( name="test", inputs=[global_in], diff --git a/tests/fpgadataflow/test_convert_to_hls_conv_layer.py 
b/tests/fpgadataflow/test_convert_to_hls_conv_layer.py index 8c9f110c315089ec03354863bf2213963197217a..de31ef0f125cb96ea82f953eadb9d5ccf7aab16c 100644 --- a/tests/fpgadataflow/test_convert_to_hls_conv_layer.py +++ b/tests/fpgadataflow/test_convert_to_hls_conv_layer.py @@ -38,7 +38,7 @@ from qonnx.transformation.general import GiveUniqueNodeNames from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes from qonnx.transformation.lower_convs_to_matmul import LowerConvsToMatMul -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls @@ -107,7 +107,7 @@ def test_convert_to_hls_conv_layer(conv_config, depthwise, use_rtl_swg, exec_mod helper.make_tensor_value_info("p1", TensorProto.FLOAT, conv_param_shape) ] - modelproto = helper.make_model( + modelproto = qonnx_make_model( helper.make_graph( name="conv_test", inputs=[top_in], @@ -175,8 +175,11 @@ def test_convert_to_hls_conv_layer(conv_config, depthwise, use_rtl_swg, exec_mod assert np.isclose(exp_cycles, cycles_rtlsim, atol=11) assert exp_cycles != 0 - if pad == 1: - padding_node = new_model.get_nodes_by_op_type("FMPadding_Batch")[0] + if pad: + if use_rtl_swg: + padding_node = new_model.get_nodes_by_op_type("FMPadding_rtl")[0] + else: + padding_node = new_model.get_nodes_by_op_type("FMPadding_Batch")[0] padding_inst = getCustomOp(padding_node) assert padding_inst.get_nodeattr("SIMD") == in_chn diff --git a/tests/fpgadataflow/test_convert_to_hls_layers_synthetic.py b/tests/fpgadataflow/test_convert_to_hls_layers_synthetic.py index 79a48793e0c4f062654e43aadcaf09ebf6d7da5b..c837a46a7ca7dcab6628cbf16373161b7b9ab9c2 100644 --- a/tests/fpgadataflow/test_convert_to_hls_layers_synthetic.py +++ b/tests/fpgadataflow/test_convert_to_hls_layers_synthetic.py @@ -43,7 +43,7 @@ from qonnx.transformation.infer_data_layouts import InferDataLayouts from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes from qonnx.transformation.insert_topk import InsertTopK -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls @@ -123,7 +123,7 @@ def make_model(ch, ifmdim): outputs=[outp], ) - model = helper.make_model(graph, producer_name="add-model") + model = qonnx_make_model(graph, producer_name="add-model") model = ModelWrapper(model) # set initializers for scalar add/mul nodes diff --git a/tests/fpgadataflow/test_convert_to_hls_pool_batch.py b/tests/fpgadataflow/test_convert_to_hls_pool_batch.py index ef9bd7a13dcecf7aa61ecb982ac6393d7813a4d5..6d628c9e53828fef88028bdc115bd64b0292dfed 100644 --- a/tests/fpgadataflow/test_convert_to_hls_pool_batch.py +++ b/tests/fpgadataflow/test_convert_to_hls_pool_batch.py @@ -35,7 +35,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls @@ -78,7 +78,7 @@ def make_single_maxpool_modelwrapper( 
nodes=[mp_node], name="mp_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="mp-model") + model = qonnx_make_model(graph, producer_name="mp-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) @@ -112,7 +112,7 @@ def make_single_quantavpool_modelwrapper(k, stride, ifm_ch, ifm_dim, ofm_dim, id nodes=[mp_node], name="mp_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="mp-model") + model = qonnx_make_model(graph, producer_name="mp-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_depthwise_convolution.py b/tests/fpgadataflow/test_depthwise_convolution.py index 5228ade3d0f4db3bd99f5fcccb7aee41f57ed73b..8ab22bcfdcb0312bd49677f0e00d8e97cdcad3c1 100644 --- a/tests/fpgadataflow/test_depthwise_convolution.py +++ b/tests/fpgadataflow/test_depthwise_convolution.py @@ -37,7 +37,11 @@ from qonnx.custom_op.general.im2col import compute_conv_output_dim from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import calculate_signed_dot_prod_range, gen_finn_dt_tensor +from qonnx.util.basic import ( + calculate_signed_dot_prod_range, + gen_finn_dt_tensor, + qonnx_make_model, +) import finn.core.onnx_exec as oxe from finn.transformation.fpgadataflow.compile_cppsim import CompileCppSim @@ -123,7 +127,7 @@ def set_up_reference_model(act, idt, wdt, k, ifm_dim, ifm_ch, stride, padding): outputs=[global_out], value_info=value_info, ) - model = oh.make_model(graph, producer_name="lowered_dw_cnv-model") + model = qonnx_make_model(graph, producer_name="lowered_dw_cnv-model") model = ModelWrapper(model) # initialize model diff --git a/tests/fpgadataflow/test_fifosizing.py b/tests/fpgadataflow/test_fifosizing.py index 116df98d17985951892cf7f43baf486ab4fd80c8..9399fbe3949a5d0052ba80b24b1a9e0c44c5597c 100644 --- a/tests/fpgadataflow/test_fifosizing.py +++ b/tests/fpgadataflow/test_fifosizing.py @@ -32,6 +32,8 @@ import pytest import json import shutil from brevitas.export.onnx.generic.manager import BrevitasONNXManager +from qonnx.core.modelwrapper import ModelWrapper +from qonnx.custom_op.registry import getCustomOp import finn.builder.build_dataflow as build import finn.builder.build_dataflow_config as build_cfg @@ -49,22 +51,24 @@ def fetch_test_model(topology, wbits=2, abits=2): @pytest.mark.slow @pytest.mark.vivado +@pytest.mark.fpgadataflow @pytest.mark.parametrize( "method", ["largefifo_rtlsim_python", "largefifo_rtlsim_cpp", "characterize"] ) -def test_fifosizing_linear(method): +@pytest.mark.parametrize("topology", ["tfc", "cnv"]) +def test_fifosizing_linear(method, topology): force_python_rtlsim = "python" in method method_key = "largefifo_rtlsim" if "largefifo_rtlsim" in method else "characterize" - tmp_output_dir = fetch_test_model("tfc") + tmp_output_dir = fetch_test_model(topology) cfg = build_cfg.DataflowBuildConfig( output_dir=tmp_output_dir, auto_fifo_depths=True, auto_fifo_strategy=method_key, - target_fps=10000, + target_fps=10000 if topology == "tfc" else 1000, force_python_rtlsim=force_python_rtlsim, synth_clk_period_ns=10.0, board="Pynq-Z1", - rtlsim_batch_size=100, + rtlsim_batch_size=100 if topology == "tfc" else 2, shell_flow_type=build_cfg.ShellFlowType.VIVADO_ZYNQ, generate_outputs=[ build_cfg.DataflowOutputType.ESTIMATE_REPORTS, @@ -79,8 +83,36 @@ def test_fifosizing_linear(method): with 
open(tmp_output_dir + "/report/rtlsim_performance.json") as f: sim_data = json.load(f) assert ( - float(sim_data["throughput[images/s]"]) + float(sim_data["stable_throughput[images/s]"]) / float(est_data["estimated_throughput_fps"]) > 0.9 ) + # now run the same build using the generated folding and FIFO config + tmp_output_dir_cmp = fetch_test_model(topology) + cfg_cmp = cfg + cfg_cmp.output_dir = tmp_output_dir_cmp + cfg_cmp.auto_fifo_depths = False + cfg_cmp.target_fps = None + cfg_cmp.generate_outputs = [build_cfg.DataflowOutputType.STITCHED_IP] + cfg_cmp.folding_config_file = tmp_output_dir + "/final_hw_config.json" + build.build_dataflow_cfg(tmp_output_dir_cmp + "/model.onnx", cfg_cmp) + + model0 = ModelWrapper( + tmp_output_dir + "/intermediate_models/step_create_stitched_ip.onnx" + ) + model1 = ModelWrapper( + tmp_output_dir_cmp + "/intermediate_models/step_create_stitched_ip.onnx" + ) + + assert len(model0.graph.node) == len(model1.graph.node) + for i in range(len(model0.graph.node)): + node0 = model0.graph.node[i] + node1 = model1.graph.node[i] + assert node0.op_type == node1.op_type + if node0.op_type == "StreamingFIFO": + node0_inst = getCustomOp(node0) + node1_inst = getCustomOp(node1) + assert node0_inst.get_nodeattr("depth") == node1_inst.get_nodeattr("depth") + shutil.rmtree(tmp_output_dir) + shutil.rmtree(tmp_output_dir_cmp) diff --git a/tests/fpgadataflow/test_fpgadataflow_addstreams.py b/tests/fpgadataflow/test_fpgadataflow_addstreams.py index 6d881f45b60384d9a78b5d9f9705581a10b48e6c..1ad2c26610c99c46bde4c05ed156a81b122aba53 100644 --- a/tests/fpgadataflow/test_fpgadataflow_addstreams.py +++ b/tests/fpgadataflow/test_fpgadataflow_addstreams.py @@ -34,7 +34,7 @@ from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -68,7 +68,7 @@ def make_addstreams_modelwrapper(ch, pe, idt): outputs=[outp], ) - model = helper.make_model(graph, producer_name="addstreams-model") + model = qonnx_make_model(graph, producer_name="addstreams-model") model = ModelWrapper(model) model.set_tensor_datatype("inp1", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_channelwise_ops.py b/tests/fpgadataflow/test_fpgadataflow_channelwise_ops.py index ceafda90e54004c7aea8786d003b6adf1defab35..13fab9a47f15999c184680b9db04494787889881 100644 --- a/tests/fpgadataflow/test_fpgadataflow_channelwise_ops.py +++ b/tests/fpgadataflow/test_fpgadataflow_channelwise_ops.py @@ -34,7 +34,7 @@ from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -73,7 +73,7 @@ def make_modelwrapper(C, pe, idt, odt, pdt, func, vecs): ) graph = helper.make_graph(nodes=[node], name="graph", inputs=[inp], outputs=[outp]) - model = helper.make_model(graph, producer_name="model") + model = qonnx_make_model(graph, producer_name="model") model = ModelWrapper(model) 
model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_checksum.py b/tests/fpgadataflow/test_fpgadataflow_checksum.py index 495fcd10b6a977c6b0917ac37b58ec5595185c25..cd404f5a6332d77f17ec69c47b53c8c893f28607 100644 --- a/tests/fpgadataflow/test_fpgadataflow_checksum.py +++ b/tests/fpgadataflow/test_fpgadataflow_checksum.py @@ -36,7 +36,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNodeNames from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.core.rtlsim_exec import rtlsim_exec @@ -115,7 +115,7 @@ def create_two_fc_model(): value_info=[mid], ) - model = helper.make_model(graph, producer_name="fclayer-model") + model = qonnx_make_model(graph, producer_name="fclayer-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_convinputgenerator.py b/tests/fpgadataflow/test_fpgadataflow_convinputgenerator.py index a196ecbb61b74843ddc8efa4ac3c5ab8197e64fe..3cfff9ac34ae47bdc072bca9f6ca0fffeea756c5 100644 --- a/tests/fpgadataflow/test_fpgadataflow_convinputgenerator.py +++ b/tests/fpgadataflow/test_fpgadataflow_convinputgenerator.py @@ -34,7 +34,7 @@ from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -73,7 +73,7 @@ def make_single_im2col_modelwrapper( nodes=[im2col_node], name="im2col_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="im2col-model") + model = qonnx_make_model(graph, producer_name="im2col-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) @@ -117,7 +117,7 @@ def make_single_slidingwindow_modelwrapper( outputs=[outp], ) - model = helper.make_model(graph, producer_name="slidingwindow-model") + model = qonnx_make_model(graph, producer_name="slidingwindow-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_convinputgenerator1d.py b/tests/fpgadataflow/test_fpgadataflow_convinputgenerator1d.py index 0fc3ca82cfa919079a324160e4876377ac4dc3b4..f467f37618bbee6359bb7b7dfa963e3d8785d0c9 100644 --- a/tests/fpgadataflow/test_fpgadataflow_convinputgenerator1d.py +++ b/tests/fpgadataflow/test_fpgadataflow_convinputgenerator1d.py @@ -35,7 +35,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.general.im2col import compute_conv_output_dim from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -82,7 +82,7 @@ def make_single_im2col_modelwrapper( nodes=[im2col_node], name="im2col_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="im2col-model") + model = 
qonnx_make_model(graph, producer_name="im2col-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) @@ -133,7 +133,7 @@ def make_single_slidingwindow_modelwrapper( outputs=[outp], ) - model = helper.make_model(graph, producer_name="slidingwindow-model") + model = qonnx_make_model(graph, producer_name="slidingwindow-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_convinputgenerator_rtl.py b/tests/fpgadataflow/test_fpgadataflow_convinputgenerator_rtl.py index a66038ef29a9b68541507cdefadc52c4cfce22d3..67754986109346d024edc13c4a57f6d63d2f10f3 100755 --- a/tests/fpgadataflow/test_fpgadataflow_convinputgenerator_rtl.py +++ b/tests/fpgadataflow/test_fpgadataflow_convinputgenerator_rtl.py @@ -33,7 +33,7 @@ from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.general.im2col import compute_conv_output_dim from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.fpgadataflow.prepare_ip import PrepareIP @@ -72,7 +72,7 @@ def make_single_im2col_modelwrapper(k, ifm_ch, ifm_dim, ofm_dim, stride, dilatio nodes=[im2col_node], name="im2col_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="im2col-model") + model = qonnx_make_model(graph, producer_name="im2col-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) @@ -124,7 +124,7 @@ def make_single_slidingwindow_modelwrapper( outputs=[outp], ) - model = helper.make_model(graph, producer_name="slidingwindow-model") + model = qonnx_make_model(graph, producer_name="slidingwindow-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_convinputgenerator_rtl_dynamic.py b/tests/fpgadataflow/test_fpgadataflow_convinputgenerator_rtl_dynamic.py index 979dcbfab8061ba4d963e53d010e4e46b92c02fe..7f7bf649a9284e7716aec5adfb91957fdabb55d5 100644 --- a/tests/fpgadataflow/test_fpgadataflow_convinputgenerator_rtl_dynamic.py +++ b/tests/fpgadataflow/test_fpgadataflow_convinputgenerator_rtl_dynamic.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, Xilinx +# Copyright (c) 2022, Advanced Micro Devices, Inc. # All rights reserved. 
# # Redistribution and use in source and binary forms, with or without @@ -41,8 +41,11 @@ from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNodeNames from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes -from qonnx.transformation.lower_convs_to_matmul import LowerConvsToMatMul -from qonnx.util.basic import gen_finn_dt_tensor, get_by_name +from qonnx.transformation.lower_convs_to_matmul import ( + LowerConvsToMatMul, + _auto_pad_to_explicit_padding, +) +from qonnx.util.basic import gen_finn_dt_tensor, get_by_name, qonnx_make_model import finn.core.onnx_exec as oxe import finn.transformation.fpgadataflow.convert_to_hls_layers as to_hls @@ -54,25 +57,48 @@ from finn.transformation.fpgadataflow.create_dataflow_partition import ( ) from finn.transformation.fpgadataflow.create_stitched_ip import CreateStitchedIP from finn.transformation.fpgadataflow.hlssynth_ip import HLSSynthIP +from finn.transformation.fpgadataflow.insert_dwc import InsertDWC from finn.transformation.fpgadataflow.insert_fifo import InsertFIFO from finn.transformation.fpgadataflow.prepare_ip import PrepareIP from finn.util.basic import pyverilate_get_liveness_threshold_cycles -def create_conv_model(idim, ifm, k, stride, ofm, idt, wdt): +def create_conv_model( + idim_h, idim_w, ifm, k, stride, ofm, idt, wdt, pad_mode, depthwise +): np.random.seed(0) - ishp = (1, ifm, idim, idim) - int_dim = compute_conv_output_dim(idim, k, stride) - odim = compute_conv_output_dim(int_dim, k, stride) - oshp = (1, ofm, odim, odim) - wshp = (ofm, ifm, k, k) - wshp_1 = (ofm, ofm, k, k) + group = ifm if depthwise else 1 + group_str = str(group) + ishp = (1, ifm, idim_h, idim_w) + pad_0 = _auto_pad_to_explicit_padding( + pad_mode, idim_h, idim_w, k, k, stride, stride, 2 + ) + int_dim_h = compute_conv_output_dim( + idim_h, k, stride, total_pad=pad_0[0] + pad_0[2] + ) + int_dim_w = compute_conv_output_dim( + idim_w, k, stride, total_pad=pad_0[1] + pad_0[3] + ) + + pad_1 = _auto_pad_to_explicit_padding( + pad_mode, int_dim_h, int_dim_w, k, k, stride, stride, 2 + ) + odim_h = compute_conv_output_dim( + int_dim_h, k, stride, total_pad=pad_1[0] + pad_1[2] + ) + odim_w = compute_conv_output_dim( + int_dim_w, k, stride, total_pad=pad_1[1] + pad_1[3] + ) + oshp = (1, ifm, odim_h, odim_w) if depthwise else (1, ofm, odim_h, odim_w) + wshp = (ifm, 1, k, k) if depthwise else (ofm, ifm, k, k) + wshp_1 = (ifm, 1, k, k) if depthwise else (ofm, ofm, k, k) ishp_str = str(list(ishp)) oshp_str = str(list(oshp)) wshp_str = str(list(wshp)) wshp_1_str = str(list(wshp_1)) kshp_str = str([k, k]) - pad_str = str([0, 0, 0, 0]) + pad_0_str = str(list(pad_0)) + pad_1_str = str(list(pad_1)) stride_str = str([stride, stride]) dil_str = str([1, 1]) @@ -88,11 +114,11 @@ def create_conv_model(idim, ifm, k, stride, ofm, idt, wdt): > {{ conv0 = Conv< - dilations={dil_str},group=1,kernel_shape={kshp_str},pads={pad_str}, + dilations={dil_str},group={group_str},kernel_shape={kshp_str},pads={pad_0_str}, strides={stride_str} >(in0, param_c0_weight) out0 = Conv< - dilations={dil_str},group=1,kernel_shape={kshp_str},pads={pad_str}, + dilations={dil_str},group={group_str},kernel_shape={kshp_str},pads={pad_1_str}, strides={stride_str} >(conv0, param_c1_weight) }} @@ -109,17 +135,19 @@ def create_conv_model(idim, ifm, k, stride, ofm, idt, wdt): return model -def update_conv_model_dims(model, idim_new): +def update_conv_model_dims(model, idim_new_h, 
idim_new_w):
     cnode = model.get_nodes_by_op_type("Conv")[0]
     k, _ = get_by_name(cnode.attribute, "kernel_shape").ints
     stride, _ = get_by_name(cnode.attribute, "strides").ints
     ishp = model.get_tensor_shape("in0")
     n, ci, _, _ = ishp
     n, co, _, _ = model.get_tensor_shape("out0")
-    int_dim = compute_conv_output_dim(idim_new, k, stride)
-    odim = compute_conv_output_dim(int_dim, k, stride)
-    model.set_tensor_shape("in0", (n, ci, idim_new, idim_new))
-    model.set_tensor_shape("out0", (n, co, odim, odim))
+    int_dim_h = compute_conv_output_dim(idim_new_h, k, stride)
+    int_dim_w = compute_conv_output_dim(idim_new_w, k, stride)
+    odim_h = compute_conv_output_dim(int_dim_h, k, stride)
+    odim_w = compute_conv_output_dim(int_dim_w, k, stride)
+    model.set_tensor_shape("in0", (n, ci, idim_new_h, idim_new_w))
+    model.set_tensor_shape("out0", (n, co, odim_h, odim_w))
     # remove all existing shapes
     del model.graph.value_info[:]
     model = model.transform(InferShapes())
@@ -142,42 +170,87 @@
         return None
 
     def write_swg_config(sim):
+        reset_rtlsim(sim)
         for axi_name, config in configs:
-            # 1. Write config registers to the SWG, dict defines (addr, value) tuples
+            # Write config registers to the SWG/FMPadding;
+            # dict defines (addr, value) tuples
             for config_entry in config.values():
                 axilite_write(sim, config_entry[0], config_entry[1], basename=axi_name)
-            # 2. Set cfg_valid flag (>= 1 cycle)
-            axilite_write(sim, 0, 1, basename=axi_name)
-            # 3. Reset component (>= 1 cycle)
         reset_rtlsim(sim)
 
     return write_swg_config
 
 
+cfg0 = {
+    "idims": [(32, 32), (8, 8)],
+    "ifm": 64,
+    "k": 3,
+    "stride": 1,
+    "ofm": 64,
+    "depthwise": True,
+    "pad_mode": "SAME_UPPER",
+}
+cfg1 = {
+    "idims": [(32, 16), (16, 8)],
+    "ifm": 4,
+    "k": 4,
+    "stride": 1,
+    "ofm": 8,
+    "depthwise": False,
+    "pad_mode": "SAME_UPPER",
+}
+cfg2 = {
+    "idims": [(64, 128), (2, 4)],
+    "ifm": 64,
+    "k": 3,
+    "stride": 1,
+    "ofm": 64,
+    "depthwise": True,
+    "pad_mode": "SAME_UPPER",
+}
+
+
+@pytest.mark.parametrize("cfg", [cfg0, cfg1, cfg2])
 @pytest.mark.slow
 @pytest.mark.vivado
-def test_fpgadataflow_conv_dynamic():
-    idims = [32, 16]
-    ifm = 4
-    k = 4
-    stride = 1
-    ofm = 8
-    idt = DataType["UINT8"]
+@pytest.mark.fpgadataflow
+def test_fpgadataflow_conv_dynamic(cfg):
+    pad_mode = cfg["pad_mode"]
+    depthwise = cfg["depthwise"]
+    idims = cfg["idims"]
+    ifm = cfg["ifm"]
+    k = cfg["k"]
+    stride = cfg["stride"]
+    ofm = cfg["ofm"]
+    idt = DataType["UINT4"]
     wdt = DataType["INT2"]
     exp_cfgs = []
     largest_model = None
     for idim in idims:
-        ishp = (1, ifm, idim, idim)
+        idim_h, idim_w = idim
+        ishp = (1, ifm, idim_h, idim_w)
         np.random.seed(0)
         inp = gen_finn_dt_tensor(idt, ishp)
-        model = create_conv_model(idim, ifm, k, stride, ofm, idt, wdt)
-        _, _, int_dim, _ = model.get_tensor_shape("conv0")
-        _, _, odim, _ = model.get_tensor_shape("out0")
+        model = create_conv_model(
+            idim_h, idim_w, ifm, k, stride, ofm, idt, wdt, pad_mode, depthwise
+        )
+        _, _, int_dim_h, int_dim_w = model.get_tensor_shape("conv0")
+        _, _, odim_h, odim_w = model.get_tensor_shape("out0")
+        pad0 = get_by_name(model.graph.node[0].attribute, "pads").ints
+        pad1 = get_by_name(model.graph.node[1].attribute, "pads").ints
         if idim == max(idims):
             # use largest model for hardware conversion
             largest_model = copy.deepcopy(model)
         golden = execute_onnx(model, {"in0": inp})["out0"]
-        exp_cfg = (idim, int_dim, odim, inp, golden)
+        exp_cfg = (
+            (idim_h, idim_w),
+            (int_dim_h, int_dim_w),
+            (odim_h, odim_w),
+            pad0,
+            pad1,
+            inp,
+            golden,
+        )
         exp_cfgs.append(exp_cfg)
 
     # convert to hardware and prepare
simulation @@ -186,21 +259,35 @@ def test_fpgadataflow_conv_dynamic(): model = model.transform( to_hls.InferQuantizedMatrixVectorActivation(mem_mode="decoupled") ) + model = model.transform(to_hls.InferVectorVectorActivation()) model = model.transform(absorb.AbsorbConsecutiveTransposes()) parent_model = model.transform(CreateDataflowPartition()) sdp_inst = getCustomOp( parent_model.get_nodes_by_op_type("StreamingDataflowPartition")[0] ) model = ModelWrapper(sdp_inst.get_nodeattr("model")) - for swg_node in model.get_nodes_by_op_type("ConvolutionInputGenerator_rtl"): - getCustomOp(swg_node).set_nodeattr("SIMD", 1) + assert len(model.get_nodes_by_op_type("ConvolutionInputGenerator_rtl")) == 2 + if pad_mode == "VALID": + assert len(model.get_nodes_by_op_type("FMPadding_rtl")) == 0 + else: + assert len(model.get_nodes_by_op_type("FMPadding_rtl")) == 2 + dyn_nodes = model.get_nodes_by_op_type("ConvolutionInputGenerator_rtl") + dyn_nodes += model.get_nodes_by_op_type("FMPadding_rtl") + for swg_node in dyn_nodes: + getCustomOp(swg_node).set_nodeattr("SIMD", 4) getCustomOp(swg_node).set_nodeattr("dynamic_mode", 1) getCustomOp(swg_node).set_nodeattr("inFIFODepths", [16]) getCustomOp(swg_node).set_nodeattr("outFIFODepths", [16]) - print("SWG initial config:") - idim = getCustomOp(swg_node).get_nodeattr("IFMDim") - print(getCustomOp(swg_node).get_dynamic_config(idim)) - model = model.transform(InsertFIFO()) + comp_nodes = model.get_nodes_by_op_type("MatrixVectorActivation") + comp_nodes += model.get_nodes_by_op_type("VectorVectorActivation") + for comp_node in comp_nodes: + if depthwise: + getCustomOp(comp_node).set_nodeattr("PE", 4) + else: + getCustomOp(comp_node).set_nodeattr("SIMD", 4) + getCustomOp(comp_node).set_nodeattr("PE", 4) + model = model.transform(InsertDWC()) + model = model.transform(InsertFIFO(create_shallow_fifos=True)) model = model.transform(GiveUniqueNodeNames()) model = model.transform(GiveReadableTensorNames()) model = model.transform(PrepareIP("xc7z020clg400-1", 5)) @@ -210,31 +297,64 @@ def test_fpgadataflow_conv_dynamic(): # loop through experiment configurations for exp_cfg in exp_cfgs: - idim, int_dim, odim, inp, golden = exp_cfg + ( + (idim_h, idim_w), + (int_dim_h, int_dim_w), + (odim_h, odim_w), + pad0, + pad1, + inp, + golden, + ) = exp_cfg + conv0_idim_h = idim_h + pad0[0] + pad0[2] + conv0_idim_w = idim_w + pad0[1] + pad0[3] + conv1_idim_h = int_dim_h + pad1[0] + pad1[2] + conv1_idim_w = int_dim_w + pad1[1] + pad1[3] # get config for the new dimensions swg_nodes = model.get_nodes_by_op_type("ConvolutionInputGenerator_rtl") swg0 = getCustomOp(swg_nodes[0]) - update_tensor_dim(model, swg0.onnx_node.input[0], (idim, idim)) - update_tensor_dim(model, swg0.onnx_node.output[0], (int_dim, int_dim)) - config0 = swg0.get_dynamic_config((idim, idim)) + update_tensor_dim(model, swg0.onnx_node.input[0], (conv0_idim_h, conv0_idim_w)) + update_tensor_dim(model, swg0.onnx_node.output[0], (int_dim_h, int_dim_w)) + swg_config0 = swg0.get_dynamic_config((conv0_idim_h, conv0_idim_w)) swg1 = getCustomOp(swg_nodes[1]) - update_tensor_dim(model, swg1.onnx_node.input[0], (int_dim, int_dim)) - update_tensor_dim(model, swg1.onnx_node.output[0], (odim, odim)) - config1 = swg1.get_dynamic_config((int_dim, int_dim)) - configs = [("s_axilite_0_", config0), ("s_axilite_1_", config1)] + update_tensor_dim(model, swg1.onnx_node.input[0], (conv1_idim_h, conv1_idim_w)) + update_tensor_dim(model, swg1.onnx_node.output[0], (odim_h, odim_w)) + swg_config1 = swg1.get_dynamic_config((conv1_idim_h, 
conv1_idim_w)) + if pad_mode != "VALID": + pad_nodes = model.get_nodes_by_op_type("FMPadding_rtl") + padder0 = getCustomOp(pad_nodes[0]) + update_tensor_dim(model, padder0.onnx_node.input[0], (idim_h, idim_w)) + update_tensor_dim( + model, padder0.onnx_node.output[0], (conv0_idim_h, conv0_idim_w) + ) + pad_config0 = padder0.get_dynamic_config((idim_h, idim_w), pad0) + padder1 = getCustomOp(pad_nodes[1]) + update_tensor_dim(model, padder1.onnx_node.input[0], (int_dim_h, int_dim_w)) + update_tensor_dim( + model, padder1.onnx_node.output[0], (conv1_idim_h, conv1_idim_w) + ) + pad_config1 = padder1.get_dynamic_config((int_dim_h, int_dim_w), pad1) + configs = [ + ("s_axilite_0_", pad_config0), + ("s_axilite_1_", swg_config0), + ("s_axilite_2_", pad_config1), + ("s_axilite_3_", swg_config1), + ] + else: + configs = [("s_axilite_0_", swg_config0), ("s_axilite_1_", swg_config1)] # adjust folded shapes for I/O FIFOs # (since rtlsim_exec uses folded shape info to fold global i/o tensors) first_node = getCustomOp(model.graph.node[0]) first_node_shp = list(first_node.get_folded_input_shape()) - first_node_shp[1] = idim - first_node_shp[2] = idim + first_node_shp[1] = idim_h + first_node_shp[2] = idim_w first_node.set_nodeattr("folded_shape", first_node_shp) - update_tensor_dim(model, first_node.onnx_node.input[0], (idim, idim)) + update_tensor_dim(model, first_node.onnx_node.input[0], (idim_h, idim_w)) last_node = getCustomOp(model.graph.node[-1]) last_node_shp = list(last_node.get_folded_output_shape()) - last_node_shp[1] = odim - last_node_shp[2] = odim - update_tensor_dim(model, last_node.onnx_node.output[0], (odim, odim)) + last_node_shp[1] = odim_h + last_node_shp[2] = odim_w + update_tensor_dim(model, last_node.onnx_node.output[0], (odim_h, odim_w)) last_node.set_nodeattr("folded_shape", last_node_shp) ctx = {"global_in": inp.transpose(0, 2, 3, 1)} liveness_prev = pyverilate_get_liveness_threshold_cycles() @@ -276,7 +396,7 @@ def make_single_im2col_modelwrapper(k, ifm_ch, ifm_dim, ofm_dim, stride, dilatio nodes=[im2col_node], name="im2col_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="im2col-model") + model = qonnx_make_model(graph, producer_name="im2col-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) @@ -329,7 +449,7 @@ def make_single_slidingwindow_modelwrapper( outputs=[outp], ) - model = helper.make_model(graph, producer_name="slidingwindow-model") + model = qonnx_make_model(graph, producer_name="slidingwindow-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) @@ -364,6 +484,7 @@ def prepare_inputs(input_tensor): @pytest.mark.parametrize("m", [1]) @pytest.mark.slow @pytest.mark.vivado +@pytest.mark.fpgadataflow def test_fpgadataflow_slidingwindow_rtl_dynamic( idt, k, ifm_dim_series, ifm_ch, stride, dilation, dw, simd, m, parallel_window ): diff --git a/tests/fpgadataflow/test_fpgadataflow_duplicatestreams.py b/tests/fpgadataflow/test_fpgadataflow_duplicatestreams.py index 7ec254405d23f0a972de7f9d02d2ab021ed3d959..441bbce50a8a218185f93a7968767abe2541ed15 100644 --- a/tests/fpgadataflow/test_fpgadataflow_duplicatestreams.py +++ b/tests/fpgadataflow/test_fpgadataflow_duplicatestreams.py @@ -36,7 +36,7 @@ from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic 
import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -76,7 +76,7 @@ def make_dupstreams_modelwrapper(ch, pe, idim, idt, n_dupl): nodes=[dupstrm_node], name="graph", inputs=[inp], outputs=out_vi ) - model = helper.make_model(graph, producer_name="addstreams-model") + model = qonnx_make_model(graph, producer_name="addstreams-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_dwc.py b/tests/fpgadataflow/test_fpgadataflow_dwc.py index bcf2a1fe3d304ac27a06b544825a84f5757830c9..2bde148a1499e4c7065ab1e151e3c4198e1e96da 100644 --- a/tests/fpgadataflow/test_fpgadataflow_dwc.py +++ b/tests/fpgadataflow/test_fpgadataflow_dwc.py @@ -32,19 +32,19 @@ from onnx import TensorProto, helper from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe +from finn.transformation.fpgadataflow.create_stitched_ip import CreateStitchedIP from finn.transformation.fpgadataflow.hlssynth_ip import HLSSynthIP +from finn.transformation.fpgadataflow.insert_fifo import InsertFIFO from finn.transformation.fpgadataflow.prepare_ip import PrepareIP -from finn.transformation.fpgadataflow.prepare_rtlsim import PrepareRTLSim -from finn.transformation.fpgadataflow.set_exec_mode import SetExecMode -def make_single_dwc_modelwrapper(Shape, INWidth, OUTWidth, finn_dtype): +def make_single_dwc_modelwrapper(shape, inWidth, outWidth, finn_dtype, impl_style): - inp = helper.make_tensor_value_info("inp", TensorProto.FLOAT, Shape) - outp = helper.make_tensor_value_info("outp", TensorProto.FLOAT, Shape) + inp = helper.make_tensor_value_info("inp", TensorProto.FLOAT, shape) + outp = helper.make_tensor_value_info("outp", TensorProto.FLOAT, shape) DWC_node = helper.make_node( "StreamingDataWidthConverter_Batch", @@ -52,17 +52,18 @@ def make_single_dwc_modelwrapper(Shape, INWidth, OUTWidth, finn_dtype): ["outp"], domain="finn.custom_op.fpgadataflow", backend="fpgadataflow", - shape=Shape, - inWidth=INWidth, - outWidth=OUTWidth, + shape=shape, + inWidth=inWidth, + outWidth=outWidth, dataType=str(finn_dtype.name), + impl_style=impl_style, ) graph = helper.make_graph( nodes=[DWC_node], name="dwc_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="dwc-model") + model = qonnx_make_model(graph, producer_name="dwc-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", finn_dtype) @@ -75,34 +76,42 @@ def prepare_inputs(input_tensor, dt): return {"inp": input_tensor} -# shape -@pytest.mark.parametrize("Shape", [[1, 4], [1, 2, 8]]) -# inWidth -@pytest.mark.parametrize("INWidth", [2, 4]) -# outWidth -@pytest.mark.parametrize("OUTWidth", [2, 4]) -# finn_dtype -@pytest.mark.parametrize("finn_dtype", [DataType["BIPOLAR"], DataType["INT2"]]) +@pytest.mark.parametrize( + "config", + [ + ([1, 24], 6, 4, DataType["INT2"], "hls"), + ([1, 24], 4, 6, DataType["INT2"], "hls"), + ([1, 4], 2, 4, DataType["BIPOLAR"], "hls"), + ([1, 2, 8], 2, 4, DataType["BIPOLAR"], "hls"), + ([1, 4], 4, 2, DataType["INT2"], "hls"), + ([1, 2, 8], 4, 4, DataType["INT2"], "hls"), + ([1, 2, 8], 8, 16, DataType["INT2"], "vivado"), + ], +) @pytest.mark.fpgadataflow @pytest.mark.slow @pytest.mark.vivado -def 
test_fpgadataflow_dwc_rtlsim(Shape, INWidth, OUTWidth, finn_dtype): - +def test_fpgadataflow_dwc_rtlsim(config): + shape, inWidth, outWidth, finn_dtype, impl_style = config + test_fpga_part = "xc7z020clg400-1" + target_clk_ns = 10.0 # generate input data - x = gen_finn_dt_tensor(finn_dtype, Shape) + x = gen_finn_dt_tensor(finn_dtype, shape) input_dict = prepare_inputs(x, finn_dtype) - model = make_single_dwc_modelwrapper(Shape, INWidth, OUTWidth, finn_dtype) - - model = model.transform(SetExecMode("rtlsim")) + model = make_single_dwc_modelwrapper( + shape, inWidth, outWidth, finn_dtype, impl_style + ) + model = model.transform(InsertFIFO(create_shallow_fifos=True)) model = model.transform(GiveUniqueNodeNames()) - model = model.transform(PrepareIP("xc7z020clg400-1", 5)) + model = model.transform(PrepareIP(test_fpga_part, 5)) model = model.transform(HLSSynthIP()) - model = model.transform(PrepareRTLSim()) + model = model.transform(CreateStitchedIP(test_fpga_part, target_clk_ns)) + model.set_metadata_prop("exec_mode", "rtlsim") y = oxe.execute_onnx(model, input_dict)["outp"] assert ( y == x ).all(), """The output values are not the same as the input values anymore.""" - assert y.shape == tuple(Shape), """The output shape is incorrect.""" + assert y.shape == tuple(shape), """The output shape is incorrect.""" diff --git a/tests/fpgadataflow/test_fpgadataflow_fifo.py b/tests/fpgadataflow/test_fpgadataflow_fifo.py index b9c74185d9f104e15355a5dd6021d7e74dac641e..efdb3bf6aaab23fec67055ae28b2e285f1a32b6a 100644 --- a/tests/fpgadataflow/test_fpgadataflow_fifo.py +++ b/tests/fpgadataflow/test_fpgadataflow_fifo.py @@ -33,7 +33,7 @@ from onnx import TensorProto, helper from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.fpgadataflow.hlssynth_ip import HLSSynthIP @@ -66,7 +66,7 @@ def make_single_fifo_modelwrapper(Shape, Depth, fld_shape, finn_dtype): nodes=[FIFO_node], name="fifo_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="fifo-model") + model = qonnx_make_model(graph, producer_name="fifo-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", finn_dtype) diff --git a/tests/fpgadataflow/test_fpgadataflow_fmpadding.py b/tests/fpgadataflow/test_fpgadataflow_fmpadding.py index 34928ce45be0fd96d27b153ae28e2128bf306bb5..b95409fda87718f30a74bad88697c3dbad0bf98f 100644 --- a/tests/fpgadataflow/test_fpgadataflow_fmpadding.py +++ b/tests/fpgadataflow/test_fpgadataflow_fmpadding.py @@ -36,7 +36,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -53,7 +53,7 @@ test_fpga_part = pynq_part_map[test_pynq_board] target_clk_ns = 10 -def make_single_fmpadding_modelwrapper(idim, padding, num_ch, simd, idt): +def make_single_fmpadding_modelwrapper(optype, idim, padding, num_ch, simd, idt): pad_h = padding[0] + padding[2] pad_w = padding[1] + padding[3] idim_h, idim_w = idim @@ -70,7 +70,7 @@ def 
make_single_fmpadding_modelwrapper(idim, padding, num_ch, simd, idt): ) FMPadding = helper.make_node( - "FMPadding_Batch", + optype, ["inp"], ["outp"], domain="finn.custom_op.fpgadataflow", @@ -87,7 +87,7 @@ def make_single_fmpadding_modelwrapper(idim, padding, num_ch, simd, idt): nodes=[FMPadding], name="fmpadding_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="fmpadding-model") + model = qonnx_make_model(graph, producer_name="fmpadding-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) @@ -110,10 +110,14 @@ def make_single_fmpadding_modelwrapper(idim, padding, num_ch, simd, idt): @pytest.mark.parametrize("idt", [DataType["INT2"], DataType["INT4"]]) # execution mode @pytest.mark.parametrize("mode", ["cppsim", "rtlsim"]) +# implementation style +@pytest.mark.parametrize("impl_style", ["rtl", "hls"]) @pytest.mark.fpgadataflow @pytest.mark.slow @pytest.mark.vivado -def test_fpgadataflow_fmpadding(idim, pad, num_ch, simd, idt, mode): +def test_fpgadataflow_fmpadding(idim, pad, num_ch, simd, idt, mode, impl_style): + if impl_style == "rtl" and mode == "cppsim": + pytest.skip("rtl implstyle has no cppsim, skipping") if num_ch % simd != 0: pytest.skip(" num_ch % simd != 0, skipping") @@ -127,7 +131,9 @@ def test_fpgadataflow_fmpadding(idim, pad, num_ch, simd, idt, mode): odim_h = idim_h + pad_h odim_w = idim_w + pad_w - model = make_single_fmpadding_modelwrapper(idim, pad, num_ch, simd, idt) + optype = {"hls": "FMPadding_Batch", "rtl": "FMPadding_rtl"}[impl_style] + + model = make_single_fmpadding_modelwrapper(optype, idim, pad, num_ch, simd, idt) model = model.transform(InferShapes()) model = model.transform(SetExecMode(mode)) model = model.transform(GiveUniqueNodeNames()) @@ -138,6 +144,7 @@ def test_fpgadataflow_fmpadding(idim, pad, num_ch, simd, idt, mode): model = model.transform(PrepareIP(test_fpga_part, target_clk_ns)) model = model.transform(HLSSynthIP()) model = model.transform(PrepareRTLSim()) + y_produced = oxe.execute_onnx(model, input_dict)["outp"] expected_oshape = (1, odim_h, odim_w, num_ch) assert y_produced.shape == expected_oshape @@ -149,7 +156,7 @@ def test_fpgadataflow_fmpadding(idim, pad, num_ch, simd, idt, mode): assert (y_produced == y_expected).all() if mode == "rtlsim": - node = model.get_nodes_by_op_type("FMPadding_Batch")[0] + node = model.get_nodes_by_op_type(optype)[0] inst = getCustomOp(node) cycles_rtlsim = inst.get_nodeattr("cycles_rtlsim") exp_cycles_dict = model.analysis(exp_cycles_per_layer) diff --git a/tests/fpgadataflow/test_fpgadataflow_globalaccpool.py b/tests/fpgadataflow/test_fpgadataflow_globalaccpool.py index a37e6e3271a9f7e033e6beaa6dbed01271365101..a2c3d09a55f81dc5e9d5ae1819cd8ea6b7df1e27 100644 --- a/tests/fpgadataflow/test_fpgadataflow_globalaccpool.py +++ b/tests/fpgadataflow/test_fpgadataflow_globalaccpool.py @@ -34,7 +34,7 @@ from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -65,7 +65,7 @@ def make_accpool_modelwrapper(ch, pe, idim, idt): nodes=[accpool_node], name="graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="thresholding-model") + model = 
qonnx_make_model(graph, producer_name="thresholding-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_ipstitch.py b/tests/fpgadataflow/test_fpgadataflow_ipstitch.py index 325470a6d6c6032249ca1dd64317fb288d3e94c9..b220338e6919e8eeaeef0f6e5343fed9b1dfca10 100644 --- a/tests/fpgadataflow/test_fpgadataflow_ipstitch.py +++ b/tests/fpgadataflow/test_fpgadataflow_ipstitch.py @@ -36,7 +36,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames from qonnx.transformation.infer_data_layouts import InferDataLayouts -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model from finn.core.onnx_exec import execute_onnx from finn.transformation.fpgadataflow.create_dataflow_partition import ( @@ -100,7 +100,7 @@ def create_one_fc_model(mem_mode="const"): nodes=[fc0], name="fclayer_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="fclayer-model") + model = qonnx_make_model(graph, producer_name="fclayer-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) @@ -177,7 +177,7 @@ def create_two_fc_model(mem_mode="decoupled"): value_info=[mid], ) - model = helper.make_model(graph, producer_name="fclayer-model") + model = qonnx_make_model(graph, producer_name="fclayer-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_labelselect.py b/tests/fpgadataflow/test_fpgadataflow_labelselect.py index a9b98ecaf80b4c86fc1e9ccec23e6d97c5982f55..553f263ba2e004233011db90feabea057d88026a 100644 --- a/tests/fpgadataflow/test_fpgadataflow_labelselect.py +++ b/tests/fpgadataflow/test_fpgadataflow_labelselect.py @@ -33,7 +33,7 @@ from onnx import TensorProto, helper from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.fpgadataflow.compile_cppsim import CompileCppSim @@ -67,7 +67,7 @@ def make_labelselect_modelwrapper(labels, pe, k, idt): outputs=[outp], ) - model = helper.make_model(graph, producer_name="thresholding-model") + model = qonnx_make_model(graph, producer_name="thresholding-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_mvau.py b/tests/fpgadataflow/test_fpgadataflow_mvau.py index a7e7eba7ee8de81ec5eebe3e270e8e1d28564a00..b80ef76a19e487a93b23ae7db17350e85fb66822 100644 --- a/tests/fpgadataflow/test_fpgadataflow_mvau.py +++ b/tests/fpgadataflow/test_fpgadataflow_mvau.py @@ -36,7 +36,11 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.general.multithreshold import multithreshold from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import calculate_signed_dot_prod_range, gen_finn_dt_tensor +from qonnx.util.basic import ( + calculate_signed_dot_prod_range, + gen_finn_dt_tensor, + qonnx_make_model, +) import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -106,7 +110,7 @@ def make_single_fclayer_modelwrapper(W, pe, simd, wdt, idt, odt, T=None, tdt=Non 
nodes=[FCLayer_node], name="fclayer_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="fclayer-model") + model = qonnx_make_model(graph, producer_name="fclayer-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_res_estimate.py b/tests/fpgadataflow/test_fpgadataflow_res_estimate.py index e3c79fa44fb57718d359b58d1a8716746f6668fb..b3cf7b4229c39f27c7f3689ef51fb7d22c7aa0f2 100644 --- a/tests/fpgadataflow/test_fpgadataflow_res_estimate.py +++ b/tests/fpgadataflow/test_fpgadataflow_res_estimate.py @@ -32,6 +32,7 @@ from onnx import TensorProto, helper from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.general import GiveUniqueNodeNames +from qonnx.util.basic import qonnx_make_model from finn.analysis.fpgadataflow.res_estimation import ( res_estimation, @@ -87,7 +88,7 @@ def test_res_estimate(): nodes=[FCLayer_node], name="fclayer_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="fclayer-model") + model = qonnx_make_model(graph, producer_name="fclayer-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_streamingmaxpool.py b/tests/fpgadataflow/test_fpgadataflow_streamingmaxpool.py index a3968cf79704092ffb5ec53c887842372b625f4d..628721b429abadf198126a2f5801178f2f710033 100644 --- a/tests/fpgadataflow/test_fpgadataflow_streamingmaxpool.py +++ b/tests/fpgadataflow/test_fpgadataflow_streamingmaxpool.py @@ -35,7 +35,7 @@ from qonnx.custom_op.general.maxpoolnhwc import compute_pool_output_dim from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -74,7 +74,7 @@ def make_single_maxpoolnhwc_modelwrapper(k, ifm_ch, ifm_dim, ofm_dim, idt, ceil_ nodes=[mp_node], name="mp_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="mp-model") + model = qonnx_make_model(graph, producer_name="mp-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_fpgadataflow_thresholding.py b/tests/fpgadataflow/test_fpgadataflow_thresholding.py index 706679b6809844d0b2924411440088ea892ba7a9..96cd69c3453793c1634f132cb159f0cc8a94a28c 100644 --- a/tests/fpgadataflow/test_fpgadataflow_thresholding.py +++ b/tests/fpgadataflow/test_fpgadataflow_thresholding.py @@ -37,7 +37,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.general.multithreshold import multithreshold from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -93,7 +93,7 @@ def make_single_thresholding_modelwrapper( outputs=[outp], ) - model = helper.make_model(graph, producer_name="thresholding-model") + model = qonnx_make_model(graph, producer_name="thresholding-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git 
a/tests/fpgadataflow/test_fpgadataflow_vvau.py b/tests/fpgadataflow/test_fpgadataflow_vvau.py index 03ddb1286320b8178276ea53082095106a43d7a1..abf8ba0b9efde67c77711abc8451475887430cae 100644 --- a/tests/fpgadataflow/test_fpgadataflow_vvau.py +++ b/tests/fpgadataflow/test_fpgadataflow_vvau.py @@ -35,7 +35,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.general.multithreshold import multithreshold from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer @@ -132,7 +132,7 @@ def _make_single_vvau_modelwrapper( nodes=[VVAU_node], name="vvau_graph", inputs=[inp], outputs=[outp] ) - model = helper.make_model(graph, producer_name="vvau-model") + model = qonnx_make_model(graph, producer_name="vvau-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) diff --git a/tests/fpgadataflow/test_set_folding.py b/tests/fpgadataflow/test_set_folding.py index 8ea0e18f2cace10b6fefae50ce1e28845ab24050..5355dd7044343d9dbb077225b5b8786eb7fdfe32 100644 --- a/tests/fpgadataflow/test_set_folding.py +++ b/tests/fpgadataflow/test_set_folding.py @@ -34,6 +34,7 @@ from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.registry import getCustomOp from qonnx.transformation.general import GiveUniqueNodeNames +from qonnx.util.basic import qonnx_make_model from finn.analysis.fpgadataflow.exp_cycles_per_layer import exp_cycles_per_layer from finn.transformation.fpgadataflow.create_dataflow_partition import ( @@ -91,7 +92,7 @@ def make_multi_fclayer_model(ch, wdt, adt, tdt, nnodes): outputs=[tensors[-1]], ) - model = helper.make_model(graph, producer_name="fclayer-model") + model = qonnx_make_model(graph, producer_name="fclayer-model") model = ModelWrapper(model) model.set_tensor_datatype("inp", adt) diff --git a/tests/fpgadataflow/test_split_large_fifos.py b/tests/fpgadataflow/test_split_large_fifos.py new file mode 100644 index 0000000000000000000000000000000000000000..85b4a2bfa8dc0de3cbdd0ca34ec5b1ee68f37acf --- /dev/null +++ b/tests/fpgadataflow/test_split_large_fifos.py @@ -0,0 +1,128 @@ +# Copyright (C) 2022, Advanced Micro Devices, Inc. +# All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# * Redistributions of source code must retain the above copyright notice, this +# list of conditions and the following disclaimer. +# +# * Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# +# * Neither the name of Xilinx nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+import pytest
+
+import json
+import shutil
+from brevitas.export.onnx.generic.manager import BrevitasONNXManager
+from qonnx.core.modelwrapper import ModelWrapper
+from qonnx.custom_op.registry import getCustomOp
+
+import finn.builder.build_dataflow as build
+import finn.builder.build_dataflow_config as build_cfg
+from finn.transformation.fpgadataflow.set_fifo_depths import get_fifo_split_configs
+from finn.util.basic import make_build_dir
+from finn.util.test import get_trained_network_and_ishape
+
+
+def fetch_test_model(topology, wbits=2, abits=2):
+    tmp_output_dir = make_build_dir("build_fifosizing_%s_" % topology)
+    (model, ishape) = get_trained_network_and_ishape(topology, wbits, abits)
+    chkpt_name = tmp_output_dir + "/model.onnx"
+    BrevitasONNXManager.export(model, ishape, chkpt_name)
+    return tmp_output_dir
+
+
+def get_folding_cfg(depth=65536):
+    cfg = dict()
+    cfg["Defaults"] = dict()
+    for i in range(3):
+        key = "StreamingFIFO_" + str(i)
+        cfg[key] = {"depth": depth, "ram_style": "auto", "impl_style": "vivado"}
+    return cfg
+
+
+@pytest.mark.slow
+@pytest.mark.vivado
+@pytest.mark.fpgadataflow
+@pytest.mark.parametrize("depth", [16384, 65536, 45000])
+@pytest.mark.parametrize("force_python_rtlsim", [True, False])
+def test_split_large_fifos(depth, force_python_rtlsim):
+    tmp_output_dir = fetch_test_model("tfc")
+    folding_cfg = get_folding_cfg(depth)
+    with open(tmp_output_dir + "/folding_config.json", "w") as f:
+        json.dump(folding_cfg, f, indent=2)
+    cfg = build_cfg.DataflowBuildConfig(
+        output_dir=tmp_output_dir,
+        auto_fifo_depths=False,
+        split_large_fifos=True,
+        folding_config_file=tmp_output_dir + "/folding_config.json",
+        target_fps=10000,
+        force_python_rtlsim=force_python_rtlsim,
+        synth_clk_period_ns=10.0,
+        board="Pynq-Z1",
+        rtlsim_batch_size=100,
+        shell_flow_type=build_cfg.ShellFlowType.VIVADO_ZYNQ,
+        generate_outputs=[
+            build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
+            build_cfg.DataflowOutputType.STITCHED_IP,
+            build_cfg.DataflowOutputType.RTLSIM_PERFORMANCE,
+        ],
+        default_mem_mode=build_cfg.ComputeEngineMemMode.DECOUPLED,
+    )
+    build.build_dataflow_cfg(tmp_output_dir + "/model.onnx", cfg)
+    with open(tmp_output_dir + "/report/estimate_network_performance.json") as f:
+        est_data = json.load(f)
+    with open(tmp_output_dir + "/report/rtlsim_performance.json") as f:
+        sim_data = json.load(f)
+    assert (
+        float(sim_data["throughput[images/s]"])
+        / float(est_data["estimated_throughput_fps"])
+        > 0.9
+    )
+    model = ModelWrapper(
+        tmp_output_dir + "/intermediate_models/step_set_fifo_depths.onnx"
+    )
+    # exclude final FIFO node (output FIFO, not part of test)
+    fifo_nodes = model.get_nodes_by_op_type("StreamingFIFO")[:-1]
+    golden_cfg = get_fifo_split_configs(depth, 256, 32768)
+    for i, fifo_node in enumerate(fifo_nodes):
+        inst = getCustomOp(fifo_node)
+        fifo_depth = inst.get_nodeattr("depth")
+        assert fifo_depth == golden_cfg[i % len(golden_cfg)][0]
+
+    shutil.rmtree(tmp_output_dir)
+
+
+def test_split_large_fifo_configs():
+    ret0 = get_fifo_split_configs(513, 256, 32768)
+    assert ret0 == [(512, "vivado"), (1, "rtl")]
+    ret1 = get_fifo_split_configs(1200, 256, 32768)
+    assert ret1 == [(1024, "vivado"), (176, "rtl")]
+    ret2 = get_fifo_split_configs(45000, 256, 32768)
+    assert ret2 == [
+        (32768, "vivado"),
+        (8192, "vivado"),
+        (2048, "vivado"),
+        (1024, "vivado"),
+        (512, "vivado"),
+        (256, "rtl"),
+        (200, "rtl"),
+    ]
diff --git a/tests/transformation/streamline/test_absorb_mul_into_topk.py b/tests/transformation/streamline/test_absorb_mul_into_topk.py
index a6dff788dc58dba45536a280c7fe5f5c53edc4e1..89ef74e0b3f83fc092268ad2582c533e47eab618 100644
--- a/tests/transformation/streamline/test_absorb_mul_into_topk.py
+++ b/tests/transformation/streamline/test_absorb_mul_into_topk.py
@@ -34,6 +34,7 @@ from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNode
 from qonnx.transformation.infer_datatypes import InferDataTypes
 from qonnx.transformation.infer_shapes import InferShapes
 from qonnx.transformation.insert_topk import InsertTopK
+from qonnx.util.basic import qonnx_make_model
 
 import finn.core.onnx_exec as oxe
 from finn.transformation.streamline.absorb import AbsorbScalarMulAddIntoTopK
@@ -65,7 +66,7 @@ def test_absorb_mul_into_topk(mul_positive, scalar):
         value_info=[a0, b0, c0],
     )
 
-    model = helper.make_model(mul_graph, producer_name="mul_model")
+    model = qonnx_make_model(mul_graph, producer_name="mul_model")
     model = ModelWrapper(model)
     # initialize values
     # for mul
diff --git a/tests/transformation/streamline/test_absorb_transp_into_flatten.py b/tests/transformation/streamline/test_absorb_transp_into_flatten.py
index 1358d468c04c3edf08b11e7e9b858dda58965368..44b0c1d7e04447f13043cb326047a7b8d69469dd 100644
--- a/tests/transformation/streamline/test_absorb_transp_into_flatten.py
+++ b/tests/transformation/streamline/test_absorb_transp_into_flatten.py
@@ -8,6 +8,7 @@ from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNode
 from qonnx.transformation.infer_data_layouts import InferDataLayouts
 from qonnx.transformation.infer_datatypes import InferDataTypes
 from qonnx.transformation.infer_shapes import InferShapes
+from qonnx.util.basic import qonnx_make_model
 
 import finn.core.onnx_exec as oxe
 from finn.transformation.streamline.absorb import AbsorbTransposeIntoFlatten
@@ -45,7 +46,7 @@ def test_absorb_transp_into_flatten(perm, shape, ishape, data_layout):
         outputs=[outp],
     )
 
-    model = helper.make_model(graph, producer_name="absorb_transpose_model")
+    model = qonnx_make_model(graph, producer_name="absorb_transpose_model")
     model = ModelWrapper(model)
     if shape is not None:
         model.graph.value_info.append(shape0)
diff --git a/tests/transformation/streamline/test_collapse_repeated_op.py b/tests/transformation/streamline/test_collapse_repeated_op.py
index 268e0ffc5c5cb342634ff51ac8fe02157ae8c7c6..c1d3ee00883b84ec2a8c18d093b1756a4d6aea36 100644
--- a/tests/transformation/streamline/test_collapse_repeated_op.py
+++ b/tests/transformation/streamline/test_collapse_repeated_op.py
@@ -33,6 +33,7 @@ import onnx.helper as oh
 from onnx import TensorProto
 from qonnx.core.modelwrapper import ModelWrapper
 from qonnx.transformation.infer_shapes import InferShapes
+from qonnx.util.basic import qonnx_make_model
 
 import finn.core.onnx_exec as ox
 from finn.transformation.streamline import CollapseRepeatedAdd, CollapseRepeatedMul
@@ -46,7 +47,7 @@ def test_collapse_repeated_op():
     add_param_1 = oh.make_tensor_value_info("add_param_1", TensorProto.FLOAT, [2])
     mul_param_1 = 
oh.make_tensor_value_info("mul_param_1", TensorProto.FLOAT, [2]) top_out = oh.make_tensor_value_info("top_out", TensorProto.FLOAT, [2]) - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], @@ -96,7 +97,7 @@ def test_collapse_repeated_only_if_linear(test_args): value_info += [oh.make_tensor_value_info("p4", TensorProto.FLOAT, [1])] value_info += [oh.make_tensor_value_info("p5", TensorProto.FLOAT, [1])] - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], diff --git a/tests/transformation/streamline/test_factor_out_mul_sign_magnitude.py b/tests/transformation/streamline/test_factor_out_mul_sign_magnitude.py index 04ab9bf0b9c092bdf2c2a6c6268974fd78020eee..89596a1c0f4af4b95e19f3b6aba19e7f459aa7df 100644 --- a/tests/transformation/streamline/test_factor_out_mul_sign_magnitude.py +++ b/tests/transformation/streamline/test_factor_out_mul_sign_magnitude.py @@ -33,6 +33,7 @@ import onnx.helper as oh from onnx import TensorProto from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.infer_shapes import InferShapes +from qonnx.util.basic import qonnx_make_model import finn.core.onnx_exec as ox from finn.transformation.streamline import FactorOutMulSignMagnitude @@ -43,7 +44,7 @@ def test_factor_out_mul_sign_magnitude(): top_in = oh.make_tensor_value_info("top_in", TensorProto.FLOAT, [1, 2]) mul_param = oh.make_tensor_value_info("mul_param", TensorProto.FLOAT, [1, 2]) top_out = oh.make_tensor_value_info("top_out", TensorProto.FLOAT, [1, 2]) - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], diff --git a/tests/transformation/streamline/test_linear_past_eltwise.py b/tests/transformation/streamline/test_linear_past_eltwise.py index 12633d750bb405757efca0c028dece92b289b472..4e5dcd63862b61f5575d8adf2cbb69912ee726d7 100644 --- a/tests/transformation/streamline/test_linear_past_eltwise.py +++ b/tests/transformation/streamline/test_linear_past_eltwise.py @@ -35,6 +35,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.fold_constants import FoldConstants from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNodeNames from qonnx.transformation.infer_shapes import InferShapes +from qonnx.util.basic import qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MoveLinearPastEltwiseAdd @@ -78,7 +79,7 @@ def make_model(shape): outputs=[outp], ) - model = helper.make_model(graph, producer_name="add-model") + model = qonnx_make_model(graph, producer_name="add-model") model = ModelWrapper(model) # set initializers for scalar add/mul nodes @@ -156,7 +157,7 @@ def test_linear_past_eltwise_add_multiple_forks(ch, ifmdim): helper.make_tensor_value_info("p" + str(i), TensorProto.FLOAT, input_shape) ] - modelproto = helper.make_model( + modelproto = qonnx_make_model( helper.make_graph( name="test", inputs=[top_in], diff --git a/tests/transformation/streamline/test_maxpool_nhwc.py b/tests/transformation/streamline/test_maxpool_nhwc.py index aa77b5cf1a6e77d67ff8351ca5f544a63eb47f29..d61eedaaf5d1f10e64712d5282190b67f56acb49 100644 --- a/tests/transformation/streamline/test_maxpool_nhwc.py +++ b/tests/transformation/streamline/test_maxpool_nhwc.py @@ -7,7 +7,7 @@ from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.general.maxpoolnhwc import compute_pool_output_dim from 
qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MakeMaxPoolNHWC @@ -56,7 +56,7 @@ def create_maxpool(ifm_dim, ifm_ch, kernel_shape, pads, strides, ceil_mode, idt) value_info=[outp_mp], ) - model = oh.make_model(graph, producer_name="maxpool_model") + model = qonnx_make_model(graph, producer_name="maxpool_model") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) model.set_tensor_datatype("outp", idt) diff --git a/tests/transformation/streamline/test_move_add_past_mul.py b/tests/transformation/streamline/test_move_add_past_mul.py index 0fb4dd9f7a116d0d52578d7222421f251ac17ec1..ea9c2a954d2bd7b4a4be421c1869d4a8dd8f0cf1 100644 --- a/tests/transformation/streamline/test_move_add_past_mul.py +++ b/tests/transformation/streamline/test_move_add_past_mul.py @@ -33,6 +33,7 @@ import onnx.helper as oh from onnx import TensorProto from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.infer_shapes import InferShapes +from qonnx.util.basic import qonnx_make_model import finn.core.onnx_exec as ox from finn.transformation.streamline import MoveAddPastMul @@ -44,7 +45,7 @@ def test_move_add_past_mul_single(): add_param = oh.make_tensor_value_info("add_param", TensorProto.FLOAT, [2]) mul_param = oh.make_tensor_value_info("mul_param", TensorProto.FLOAT, [2]) top_out = oh.make_tensor_value_info("top_out", TensorProto.FLOAT, [2]) - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], @@ -76,7 +77,7 @@ def test_move_add_past_mul_multi(): add_param_1 = oh.make_tensor_value_info("add_param_1", TensorProto.FLOAT, [2]) mul_param_1 = oh.make_tensor_value_info("mul_param_1", TensorProto.FLOAT, [2]) top_out = oh.make_tensor_value_info("top_out", TensorProto.FLOAT, [2]) - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], @@ -116,7 +117,7 @@ def test_move_add_past_mul_only_if_linear(): value_info += [oh.make_tensor_value_info("mul1_param", TensorProto.FLOAT, [1])] value_info += [oh.make_tensor_value_info("mul2_param", TensorProto.FLOAT, [1])] value_info += [oh.make_tensor_value_info("mul3_param", TensorProto.FLOAT, [1])] - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], diff --git a/tests/transformation/streamline/test_move_chw_add_past_conv.py b/tests/transformation/streamline/test_move_chw_add_past_conv.py index 7eb7f9f1af67efa1a6934157b9c2b3f8a6a814c2..e1b324a798a23b5f4a6878f5e2b27434a61fe8f8 100644 --- a/tests/transformation/streamline/test_move_chw_add_past_conv.py +++ b/tests/transformation/streamline/test_move_chw_add_past_conv.py @@ -33,6 +33,7 @@ from onnx import TensorProto, helper from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.general.im2col import compute_conv_output_dim from qonnx.transformation.infer_shapes import InferShapes +from qonnx.util.basic import qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MoveAddPastConv @@ -72,7 +73,7 @@ def test_move_chw_add_past_conv(idim, k, s, ich, och): add_node = helper.make_node("Add", ["inp", "a0"], ["add_out"]) conv_node = helper.make_node("Conv", ["add_out", "a1"], ["outp"], **conv_config) - model = helper.make_model( + model = qonnx_make_model( helper.make_graph( nodes=[add_node, 
conv_node], name="move-add-graph", diff --git a/tests/transformation/streamline/test_move_flatten_past_affine.py b/tests/transformation/streamline/test_move_flatten_past_affine.py index 8c3f71d1f35de1b03fb33e53e41599fae7e02304..22c5e19fac700e147a36f74f10dad10614d47992 100644 --- a/tests/transformation/streamline/test_move_flatten_past_affine.py +++ b/tests/transformation/streamline/test_move_flatten_past_affine.py @@ -36,7 +36,7 @@ from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNode from qonnx.transformation.infer_data_layouts import InferDataLayouts from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MoveFlattenPastAffine @@ -74,7 +74,7 @@ def test_move_flatten_past_affine(data_layout, batch_size): value_info=[a0, a1, a2], ) - model = helper.make_model(graph, producer_name="move_reshape_model") + model = qonnx_make_model(graph, producer_name="move_reshape_model") model = ModelWrapper(model) # initialize values diff --git a/tests/transformation/streamline/test_move_flatten_past_topk.py b/tests/transformation/streamline/test_move_flatten_past_topk.py index d1478088e2e8caaeb33fbec2880e74ea65905073..82336cd3e69d865e4c36536e7e0b16f092a7033d 100644 --- a/tests/transformation/streamline/test_move_flatten_past_topk.py +++ b/tests/transformation/streamline/test_move_flatten_past_topk.py @@ -36,7 +36,7 @@ from qonnx.transformation.infer_data_layouts import InferDataLayouts from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes from qonnx.transformation.insert_topk import InsertTopK -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MoveFlattenPastTopK @@ -67,7 +67,7 @@ def test_move_flatten_past_topk(data_layout, batch_size): outputs=[outp], ) - model = helper.make_model(graph, producer_name="move_flatten_model") + model = qonnx_make_model(graph, producer_name="move_flatten_model") model = ModelWrapper(model) model.set_tensor_datatype("inp", DataType["INT2"]) diff --git a/tests/transformation/streamline/test_move_identical_op_past_join_op.py b/tests/transformation/streamline/test_move_identical_op_past_join_op.py index 4986363ff4dba0b0126babdbd1f393faa2df5de3..7be97631625354297c322267792520628454c4f9 100644 --- a/tests/transformation/streamline/test_move_identical_op_past_join_op.py +++ b/tests/transformation/streamline/test_move_identical_op_past_join_op.py @@ -30,7 +30,7 @@ import pytest from onnx import TensorProto from onnx import helper as oh from qonnx.core.modelwrapper import ModelWrapper -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MoveTransposePastJoinAdd @@ -81,7 +81,7 @@ def create_model(perm): ], ) - onnx_model = oh.make_model(graph, producer_name="test_model") + onnx_model = qonnx_make_model(graph, producer_name="test_model") model = ModelWrapper(onnx_model) return model diff --git a/tests/transformation/streamline/test_move_maxpool_past_multithreshold.py 
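The streamline reorder tests in this stretch of the diff share one verification recipe: build a small model, record its output, apply the transformation, and assert the output is unchanged. A condensed sketch of that recipe for MoveAddPastMul, mirroring test_move_add_past_mul_single above (tensor names and the concrete values are illustrative only):

    import numpy as np
    import onnx.helper as oh
    from onnx import TensorProto
    from qonnx.core.modelwrapper import ModelWrapper
    from qonnx.transformation.infer_shapes import InferShapes
    from qonnx.util.basic import qonnx_make_model

    import finn.core.onnx_exec as oxe
    from finn.transformation.streamline import MoveAddPastMul

    top_in = oh.make_tensor_value_info("top_in", TensorProto.FLOAT, [2])
    top_out = oh.make_tensor_value_info("top_out", TensorProto.FLOAT, [2])
    add_param = oh.make_tensor_value_info("add_param", TensorProto.FLOAT, [2])
    mul_param = oh.make_tensor_value_info("mul_param", TensorProto.FLOAT, [2])
    graph = oh.make_graph(
        name="test",
        inputs=[top_in],
        outputs=[top_out],
        value_info=[add_param, mul_param],
        nodes=[
            oh.make_node("Add", ["top_in", "add_param"], ["middle"]),
            oh.make_node("Mul", ["middle", "mul_param"], ["top_out"]),
        ],
    )
    model = ModelWrapper(qonnx_make_model(graph, producer_name="test"))
    model.set_initializer("add_param", np.asarray([1.0, 3.0], dtype=np.float32))
    model.set_initializer("mul_param", np.asarray([2.0, 4.0], dtype=np.float32))
    model = model.transform(InferShapes())

    # (x + a) * m must equal x * m + a * m after the reorder
    inp = {"top_in": np.asarray([-1.0, 1.0], dtype=np.float32)}
    expected = oxe.execute_onnx(model, inp)["top_out"]
    model = model.transform(MoveAddPastMul())
    produced = oxe.execute_onnx(model, inp)["top_out"]
    assert np.isclose(expected, produced).all()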
b/tests/transformation/streamline/test_move_maxpool_past_multithreshold.py index bf25eee9e685d2536faf5bd25bc7b1aa36700463..6126acd9e388869c34cd0c73bb64f4b6c56b4c06 100644 --- a/tests/transformation/streamline/test_move_maxpool_past_multithreshold.py +++ b/tests/transformation/streamline/test_move_maxpool_past_multithreshold.py @@ -32,6 +32,7 @@ from onnx import TensorProto, helper from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes +from qonnx.util.basic import qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MoveMaxPoolPastMultiThreshold @@ -99,7 +100,7 @@ def test_move_maxpool_past_multithreshold(): ) ] - modelproto = helper.make_model( + modelproto = qonnx_make_model( helper.make_graph( name="test", inputs=[top_in], diff --git a/tests/transformation/streamline/test_move_mul_past_dw_conv.py b/tests/transformation/streamline/test_move_mul_past_dw_conv.py index 401631a728412e7676fa804626601cfc58b5a5e3..72a6650ec4e6b853b79c93941af84dd15a7e5c47 100644 --- a/tests/transformation/streamline/test_move_mul_past_dw_conv.py +++ b/tests/transformation/streamline/test_move_mul_past_dw_conv.py @@ -33,7 +33,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.general.im2col import compute_conv_output_dim from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MoveMulPastDWConv @@ -94,7 +94,7 @@ def test_move_mul_past_dw_conv(ifm_dim, ifm_ch, k, stride, pad_amt, dw): value_info=[mul, W], ) - model = helper.make_model(graph, producer_name="mulpastconv-model") + model = qonnx_make_model(graph, producer_name="mulpastconv-model") model = ModelWrapper(model) inp_values = gen_finn_dt_tensor(DataType["INT2"], [1, ifm_ch, ifm_dim, ifm_dim]) mul_values = gen_finn_dt_tensor(DataType["INT2"], [1, ifm_ch, 1, 1]) diff --git a/tests/transformation/streamline/test_move_mul_past_maxpool.py b/tests/transformation/streamline/test_move_mul_past_maxpool.py index fcc1b6513230c548bdcc04a40aad793b64c6faf2..3bae2905a064b8372b520a7a8083905284343429 100755 --- a/tests/transformation/streamline/test_move_mul_past_maxpool.py +++ b/tests/transformation/streamline/test_move_mul_past_maxpool.py @@ -34,7 +34,7 @@ from qonnx.core.modelwrapper import ModelWrapper from qonnx.custom_op.general.maxpoolnhwc import compute_pool_output_dim from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MoveMulPastMaxPool @@ -92,7 +92,7 @@ def test_move_mul_past_maxpool(ifm_dim, ifm_ch, k, stride, pad, cw, negative): value_info=[mul], ) - model = helper.make_model(graph, producer_name="mulpastmaxpool-model") + model = qonnx_make_model(graph, producer_name="mulpastmaxpool-model") model = ModelWrapper(model) inp_values = gen_finn_dt_tensor(DataType["INT2"], [1, ifm_ch, ifm_dim, ifm_dim]) mul_values = np.random.random_sample(mul_shape).astype(np.float32) diff --git a/tests/transformation/streamline/test_move_scalar_past_conv.py 
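Nearly every test touched in this PR draws its random stimuli from gen_finn_dt_tensor, which produces a float32 ndarray whose entries are all legal values of the requested FINN datatype. A two-line illustration (shape and datatype chosen arbitrarily):

    from qonnx.core.datatype import DataType
    from qonnx.util.basic import gen_finn_dt_tensor

    # random tensor whose entries are valid INT2 values, i.e. integers in [-2, 1]
    x = gen_finn_dt_tensor(DataType["INT2"], (1, 3, 8, 8))
    assert ((x >= -2) & (x <= 1)).all()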
b/tests/transformation/streamline/test_move_scalar_past_conv.py index 59b8b8f8b2fee99bbb77c6d354620406a108cb54..bb99fd1d8f7d48ab9ad7038d78f5352f26f2ad06 100644 --- a/tests/transformation/streamline/test_move_scalar_past_conv.py +++ b/tests/transformation/streamline/test_move_scalar_past_conv.py @@ -32,6 +32,7 @@ import onnx.helper as oh from onnx import TensorProto from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.infer_shapes import InferShapes +from qonnx.util.basic import qonnx_make_model import finn.core.onnx_exec as ox from finn.transformation.streamline import MoveAddPastConv, MoveScalarMulPastConv @@ -79,7 +80,7 @@ def test_move_scalar_past_conv(test_args, padding): value_info += [oh.make_tensor_value_info("p2", TensorProto.FLOAT, conv_param_shape)] value_info += [oh.make_tensor_value_info("p3", TensorProto.FLOAT, conv_param_shape)] - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], @@ -158,7 +159,7 @@ def test_move_scalar_past_conv_only_if_linear(test_args): value_info += [oh.make_tensor_value_info("p4", TensorProto.FLOAT, conv_param_shape)] value_info += [oh.make_tensor_value_info("p5", TensorProto.FLOAT, conv_param_shape)] - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], diff --git a/tests/transformation/streamline/test_move_scalar_past_matmul.py b/tests/transformation/streamline/test_move_scalar_past_matmul.py index 6fdaaadfaea5862b566fd3a8d060ac28acadf1cd..6c788294bc739332c0b9bd0e98081bcb83330b53 100644 --- a/tests/transformation/streamline/test_move_scalar_past_matmul.py +++ b/tests/transformation/streamline/test_move_scalar_past_matmul.py @@ -33,6 +33,7 @@ import onnx.helper as oh from onnx import TensorProto from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.infer_shapes import InferShapes +from qonnx.util.basic import qonnx_make_model import finn.core.onnx_exec as ox from finn.transformation.streamline import ( @@ -47,7 +48,7 @@ def test_move_scalar_mul_past_matmul(): mul_param = oh.make_tensor_value_info("mul_param", TensorProto.FLOAT, [1, 1]) matmul_param = oh.make_tensor_value_info("matmul_param", TensorProto.FLOAT, [2, 2]) top_out = oh.make_tensor_value_info("top_out", TensorProto.FLOAT, [1, 2]) - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], @@ -79,7 +80,7 @@ def test_move_scalar_add_past_matmul(): add_param = oh.make_tensor_value_info("add_param", TensorProto.FLOAT, [1, 1]) matmul_param = oh.make_tensor_value_info("matmul_param", TensorProto.FLOAT, [2, 2]) top_out = oh.make_tensor_value_info("top_out", TensorProto.FLOAT, [1, 2]) - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], @@ -122,7 +123,7 @@ def test_move_scalar_past_matmul_only_if_linear(test_args): p2 = oh.make_tensor_value_info("p2", TensorProto.FLOAT, matmul_shape) p3 = oh.make_tensor_value_info("p3", TensorProto.FLOAT, matmul_shape) p4 = oh.make_tensor_value_info("p4", TensorProto.FLOAT, matmul_shape) - modelproto = oh.make_model( + modelproto = qonnx_make_model( oh.make_graph( name="test", inputs=[top_in], diff --git a/tests/transformation/streamline/test_move_transpose_past_scalar_mul.py b/tests/transformation/streamline/test_move_transpose_past_scalar_mul.py index 9662ba8a908e9bb793e0c0c2b078cf26adb5cef3..6bf72961ac06331c8ce972c8ca78dea99fb3c0a0 100644 --- 
a/tests/transformation/streamline/test_move_transpose_past_scalar_mul.py +++ b/tests/transformation/streamline/test_move_transpose_past_scalar_mul.py @@ -36,6 +36,7 @@ from qonnx.transformation.general import GiveReadableTensorNames, GiveUniqueNode from qonnx.transformation.infer_data_layouts import InferDataLayouts from qonnx.transformation.infer_datatypes import InferDataTypes from qonnx.transformation.infer_shapes import InferShapes +from qonnx.util.basic import qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MoveTransposePastScalarMul @@ -71,7 +72,7 @@ def test_move_transpose_past_scalar_mul(perm, scalar, data_layout): value_info=[a0], ) - model = helper.make_model(graph, producer_name="mv_transpose_model") + model = qonnx_make_model(graph, producer_name="mv_transpose_model") model = ModelWrapper(model) # initialize values diff --git a/tests/transformation/streamline/test_round_thresholds.py b/tests/transformation/streamline/test_round_thresholds.py index 1ec5f02e878a540a89cc37179b2e6dd76ede882c..85c60b37d5193de7ed2f38b9da6eb2e9b35b0150 100644 --- a/tests/transformation/streamline/test_round_thresholds.py +++ b/tests/transformation/streamline/test_round_thresholds.py @@ -32,6 +32,7 @@ import numpy as np from onnx import TensorProto, helper from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper +from qonnx.util.basic import qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline import RoundAndClipThresholds @@ -46,7 +47,7 @@ def test_round_thresholds(): "MultiThreshold", ["v", "thresholds"], ["out"], domain="qonnx.custom_op.general" ) graph_def = helper.make_graph([node_def], "test_model", [v, thresholds], [out]) - model_def = helper.make_model(graph_def) + model_def = qonnx_make_model(graph_def) model = ModelWrapper(model_def) threshold_val = np.asarray([[-1.1], [0.7], [2.3], [5.1]], dtype=np.float32) model.set_initializer("thresholds", threshold_val) diff --git a/tests/transformation/streamline/test_scale_resize_nhwc.py b/tests/transformation/streamline/test_scale_resize_nhwc.py index f10930f4e7d5aeb98a60630e7e4f48adfc371d59..5e107448f8d8cc78d572f846496ed541591dfe05 100644 --- a/tests/transformation/streamline/test_scale_resize_nhwc.py +++ b/tests/transformation/streamline/test_scale_resize_nhwc.py @@ -9,7 +9,7 @@ from qonnx.core.datatype import DataType from qonnx.core.modelwrapper import ModelWrapper from qonnx.transformation.infer_data_layouts import InferDataLayouts from qonnx.transformation.infer_shapes import InferShapes -from qonnx.util.basic import gen_finn_dt_tensor +from qonnx.util.basic import gen_finn_dt_tensor, qonnx_make_model import finn.core.onnx_exec as oxe from finn.transformation.streamline.reorder import MakeScaleResizeNHWC @@ -58,7 +58,7 @@ def create_resize_transpose(ifm_dim, ifm_ch, scales, mode, idt): value_info=[outp_up, param, roi], ) - model = oh.make_model(graph, producer_name="resize_model1") + model = qonnx_make_model(graph, producer_name="resize_model1") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) model.set_tensor_datatype("outp", idt) @@ -113,7 +113,7 @@ def create_transpose_resize(ifm_dim, ifm_ch, scales, mode, idt): value_info=[outp_tr, param, roi], ) - model = oh.make_model(graph, producer_name="resize_model2") + model = qonnx_make_model(graph, producer_name="resize_model2") model = ModelWrapper(model) model.set_tensor_datatype("inp", idt) model.set_tensor_datatype("outp", idt) @@ -180,7 +180,7 @@ def 
create_transpose_resize_transpose(ifm_dim, ifm_ch, scales, mode, idt):
         value_info=[outp_up, outp_tr, param, roi],
     )
 
-    model = oh.make_model(graph, producer_name="resize_model3")
+    model = qonnx_make_model(graph, producer_name="resize_model3")
     model = ModelWrapper(model)
     model.set_tensor_datatype("inp", idt)
     model.set_tensor_datatype("outp", idt)
diff --git a/tutorials/fpga_flow/README.md b/tutorials/fpga_flow/README.md
index 63ca6ac832c556b3e47a15fc3207683886796f23..2aaad0423b7d49c3d6760243fe1b1c1899b9030e 100644
--- a/tutorials/fpga_flow/README.md
+++ b/tutorials/fpga_flow/README.md
@@ -4,7 +4,7 @@ This example demonstrates how to bring a FINN compiled model into the Vivado FPG
 
 If you are new to the command-line flow, more information can be found [here](https://finn.readthedocs.io/en/latest/command_line.html).
 
-This demo was created using Vivado 2020.1.
+This demo was created using Vivado 2022.1.
 
 ## Compiling the Model in FINN
 
@@ -26,7 +26,7 @@ Prior to running, ensure the following prerequisites have been met:
 - Install FINN and prerequisites. The [Getting Started](https://finn.readthedocs.io/en/latest/getting_started.html#quickstart) section of the FINN documentation might be helpful for this.
 - Ensure you have the `FINN_XILINX_PATH` and `FINN_XILINX_VERSION` env variables set appropriately for your install. For example:
 > export FINN_XILINX_PATH=/opt/Xilinx
-> export FINN_XILINX_VERSION=2020.1
+> export FINN_XILINX_VERSION=2022.1
 - Set the env variable for your `finn` install top directory (where you cloned the FINN compiler repo):
 > export FINN_ROOT=/home/foo/finn
 
@@ -112,7 +112,7 @@ testbench generators.
 
 There are any number of ways to bring the stitched IP into a larger design.
 
-FINN already packages the stitched IP block design as a standalone IP-XACT component, which you can find under `${FINN_ROOT}/tutorials/fpga_flow/output_tfc_w0a1_fpga/stitched_ip/ip`. You can add this to the list of IP repos and use it in your own Vivado designs. A good reference for this is [UG1119](https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug1119-vivado-creating-packaging-ip-tutorial.pdf)
+FINN already packages the stitched IP block design as a standalone IP-XACT component, which you can find under `${FINN_ROOT}/tutorials/fpga_flow/output_tfc_w0a1_fpga/stitched_ip/ip`. You can add this to the list of IP repos and use it in your own Vivado designs. A good reference for this is [UG1119](https://www.xilinx.com/content/dam/xilinx/support/documents/sw_manuals/xilinx2022_1/ug1119-vivado-creating-packaging-ip-tutorial.pdf)
 
 Keep in mind that all of the User IP repos included in the Stitched IP project (from `$FINN_HOST_BUILD_DIR`, which is normally located under `/tmp/finn_dev_<username>`) need to also be brought in as IP repos to any project using the stitched IP. It would be prudent to copy those IP repos to an appropriate archive location. You should also set the `FINN_ROOT` environment variable to point to the compiler installation directory, as some of the build scripts will
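For context on the compile step this tutorial refers to: the same finn.builder API exercised by the new test_split_large_fifos test above can drive a build directly from Python. A trimmed sketch reusing option values from that test; the output directory, board, and model path are placeholders rather than the tutorial's actual settings:

    import finn.builder.build_dataflow as build
    import finn.builder.build_dataflow_config as build_cfg

    cfg = build_cfg.DataflowBuildConfig(
        output_dir="output_tfc_w0a1_fpga",
        synth_clk_period_ns=10.0,
        board="Pynq-Z1",
        target_fps=10000,
        shell_flow_type=build_cfg.ShellFlowType.VIVADO_ZYNQ,
        generate_outputs=[
            build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
            build_cfg.DataflowOutputType.STITCHED_IP,
        ],
    )
    build.build_dataflow_cfg("model.onnx", cfg)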