LPYOLO on PYNQ-Z2
LPYOLO on PYNQ-Z2 is a build note from the Face Frenzy project, where I explored turning a low-precision YOLO face detector into an FPGA bitstream. The goal was to generate the FINN accelerator and keep the option of using the board's buttons, switches, and LEDs in the same overlay.
The final working setup uses the 2W4A LPYOLO model. I tried the larger variants too, but the PYNQ-Z2 was not exactly thrilled about them.
Overview
This part of the project covered three main things:
- building the LPYOLO FPGA bitstream with FINN and Vivado
- generating the PYNQ-compatible driver and bitstream package
- restoring GPIO access so the FINN overlay could still be used with the board's physical inputs and LEDs
Build Setup
The working toolchain was locked to FINN v0.10.1, Vivado 2022.2, PYNQ 3.1, and a Jupyter notebook workflow inside the FINN Docker environment.
The rest of this post is intentionally closer to a technical build log than a polished tutorial. It keeps the exact commands, patches, and fixes that made the setup reproducible.
Runtime Corrections From The Face Frenzy Integration
The build below produces a valid FINN accelerator, but the first version of the app-level integration had a few wrong assumptions. The current Face Frenzy runtime uses these rules:
- Put the final bitstream, HWH, and generated driver under
/home/xilinx/face-frenzy/fpga/, not the older/home/xilinx/lpyolo/...paths used in some standalone LPYOLO notes. - Run overlay load, GPIO, FINN execution, smoke tests, and the app as root on PYNQ-Z2. Over SSH, non-root execution can fail with
RuntimeError: Could not open device with index '0'. - After
sudo -i, source the PYNQ environment again:
source /etc/profile.d/pynq_venv.sh
source /etc/profile.d/xrt_setup.sh
- Install OpenCV through apt on PYNQ armv7:
sudo apt update
sudo apt install -y python3-opencv
Do not use pip install opencv-python on the board; it can try to build OpenCV from source.
- Install the generated FINN driver's Python dependency:
python3 -m pip install bitstring
- The generated
UINT4input packer in FINN's generic Python path is too slow for full 416x416 frames on the Z2. The Face Frenzy repo patchesfpga/driver_base.pywith a vectorized UINT4 pack path. Without it, execution can look "stuck" even though DMA has not started yet. - HDMI output is not part of the final FINN+GPIO bitstream. The game uses the browser UI at
http://<board-ip>:5000/for live video, boxes, count, and controls. - A passing overlay load is not enough. The useful smoke tests are:
python3 scripts/fpga-exec-smoke.py --timeout 30
FACE_FRENZY_MANUAL_DMA_SMOKE=0 python3 scripts/fpga-exec-smoke.py --timeout 30
python3 scripts/fpga-frame-debug.py --out fpga-frame-debug.jpg
At the time of writing, FPGA execution is confirmed at about 0.13-0.15 s per frame on the board. A good execute smoke looks like both IODMAs reaching 0x0e and then returning to idle 0x04.
Face box decoding is app-side. Raw (1, 13, 13, 18) INT8 output can produce obvious garbage geometry, such as full-height vertical stripes or full-width horizontal bars, if the output scale/layout is wrong or the confidence threshold is too low. The Face Frenzy detector now applies face-shape sanity filters and can be tuned with:
find /home/xilinx -name scale.npy 2>/dev/null
FACE_FRENZY_CONF_THRESHOLD=0.05
FACE_FRENZY_OUTPUT_SCALE=64
FACE_FRENZY_YOLO_LAYOUT=auto
FACE_FRENZY_SCORE_MODE=product
FACE_FRENZY_CHANNEL_ORDER=rgb
If the FINN deploy package contains scale.npy, copy it into fpga/scale.npy.
FaceDetector.py loads it automatically unless FACE_FRENZY_OUTPUT_SCALE is
set. If no scale file is present, the runtime defaults to 64, which is a
better starting point for this INT8 head than 16.
Use scripts/fpga-frame-debug.py for calibration. It captures one camera frame,
runs one FPGA inference, prints the chosen layout and raw/kept box counts, and
writes an annotated JPEG. Extra decoder probes:
python3 scripts/fpga-frame-debug.py --sweep
python3 scripts/fpga-frame-debug.py --cell 6 6
python3 scripts/fpga-frame-debug.py --save-raw /tmp/raw_output.npy
Repository
Resources and acknowledgements
This build was pieced together from a few useful starting points:
- End-to-end LPYOLO deployment tutorial - a useful reference for building LPYOLO with an older FINN version.
- finn-quantized-yolo - the original repository for the quantized LPYOLO models.
- Model mirror - backup downloads for the models if the original OneDrive links stop working.
Build Notes
Version Lock
| Component | Version |
|---|---|
| FINN compiler | v0.10.1 |
| Vivado / Vitis HLS | 2022.2 — installed inside WSL2 |
| Host OS | Windows + WSL2 (Ubuntu 22.04) |
| Docker | Docker Desktop with WSL2 backend |
| PYNQ runtime | v3.1.0 |
| Python (board) | 3.10 (bundled with PYNQ 3.1) |
| Python (FINN Docker) | 3.10 |
| Board | PYNQ-Z2 (XC7Z020-1CLG400C) |
| Model | LPYOLO 2W4A (2-bit weights, 4-bit activations) |
Part A - Host Machine Setup
A1. Install WSL2
In an admin PowerShell:
wsl --install
Reboot when prompted. This installs WSL2 with Ubuntu 22.04 by default.
A2. Install Docker Desktop
- Download from https://docs.docker.com/desktop/install/windows-install/
- Run the installer — select Use WSL2 instead of Hyper-V
- Open Docker Desktop settings:
- Resources → WSL Integration → enable your Ubuntu distro
- Resources → Advanced → set at least 8 GB RAM and 4 CPUs
- Verify inside WSL2:
docker run hello-world
A3. Install Vivado 2022.2 Inside WSL2
Vivado must be installed INSIDE WSL2, not in Windows. The FINN Docker container mounts the WSL2 Vivado installation.
-
Download the Linux installer from: https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools/2022-2.html
Get Xilinx Unified Installer 2022.2: Linux Self Extracting Web Installer
-
Install it:
cp /mnt/c/Users/YOU/Downloads/Xilinx_Unified_2022.2_*_Lin64.bin ~/
chmod +x ~/Xilinx_Unified_2022.2_*_Lin64.bin
~/Xilinx_Unified_2022.2_*_Lin64.bin
-
In the installer GUI:
- Product: Vivado
- Edition: Vivado ML Standard (free, no licence needed for Zynq 7020)
- Devices: check Zynq-7000 only — deselect everything else to save ~30 GB
- Install location:
/tools/Xilinx
-
Installation takes 1–2 hours and needs ~45 GB.
-
Verify:
source /tools/Xilinx/Vivado/2022.2/settings64.sh
vivado -version # must print: Vivado v2022.2
A4. Clone FINN
cd ~
git clone https://github.com/Xilinx/finn.git
cd finn
git checkout 9d299689f2ec0895f208b8bfe3bcdcf6f450181a
git submodule update --init --recursive
A5. Patch the FINN Dockerfile
Two fixes are required or the Docker build will fail. Both edits are in ~/finn/docker/Dockerfile.finn.
Fix 1 — pip timeout:
sed -i 's/RUN pip install -r \/tmp\/requirements.txt/RUN python3 -m pip install --default-timeout=120 --retries 10 --no-cache-dir -r \/tmp\/requirements.txt/' ~/finn/docker/Dockerfile.finn
Fix 2 — anyio version conflict. Find the Jupyter install line:
grep -n "jupyter" ~/finn/docker/Dockerfile.finn
Open the file in a text editor and add this line immediately after the Jupyter install line:
RUN pip install "anyio==3.7.1"
A6. Set Environment Variables
Create ~/finn-env.sh. All four env vars below are required — skipping any one will cause a failure later.
#!/bin/bash
export FINN_XILINX_PATH=/tools/Xilinx
export FINN_XILINX_VERSION=2022.2
export PYNQ_BOARD=Pynq-Z2
export FINN_HOST_BUILD_DIR=$HOME/finn_build
export FINN_DOCKER_GPU=0
# Optional: more parallelism (uses more RAM)
export NUM_DEFAULT_WORKERS=4
# CRITICAL: forwards XILINX_LOCAL_USER_DATA into the container, which
# forces Vivado to read its Tcl store from the install dir (read-only)
# instead of the user home (shared/corruptible between parallel Vivado
# instances during HLS). Without this, HLS export_design will randomly
# fail with: "Unable to load Tcl app xilinx::xsim"
export XILINX_LOCAL_USER_DATA=no
export FINN_DOCKER_EXTRA=" -e XILINX_LOCAL_USER_DATA=no "
Source it in every new WSL2 terminal before working with FINN:
source ~/finn-env.sh
A7. Patch FINN to use area-optimized Vivado strategies
FINN v0.10.1 hardcodes Flow_PerfOptimized_high / Performance_ExtraTimingOpt in its Vivado synthesis templates, which favour speed over area. On the Z2 this leaves the design a few hundred LUTs short of fitting. Switch to area-optimized strategies before the first build:
sed -i 's/set_property strategy Flow_PerfOptimized_high/set_property strategy Flow_AreaOptimized_high/' ~/finn/src/finn/transformation/fpgadataflow/templates.py
sed -i 's/set_property strategy Performance_ExtraTimingOpt/set_property strategy Area_Explore/' ~/finn/src/finn/transformation/fpgadataflow/templates.py
# Insert four placer-relaxation lines before launch_runs: ignore minor LUT over-utilization,
# relax control-set packing, enable physical opt passes (recover LUTs post-route).
sed -i '/^launch_runs -to_step write_bitstream impl_1/i set_param drc.disableLUTOverUtilError 1\nset_property STEPS.SYNTH_DESIGN.ARGS.CONTROL_SET_OPT_THRESHOLD 16 [get_runs synth_1]\nset_property STEPS.PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]\nset_property STEPS.POST_ROUTE_PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]\n' ~/finn/src/finn/transformation/fpgadataflow/templates.py
# clear the cached bytecode so the edit takes effect
rm -f ~/finn/src/finn/transformation/fpgadataflow/__pycache__/templates.cpython-310.pyc
Verify:
grep -nE "strategy|disableLUTOverUtilError|CONTROL_SET_OPT|PHYS_OPT_DESIGN.IS_ENABLED|launch_runs" ~/finn/src/finn/transformation/fpgadataflow/templates.py
Expected output (line numbers may vary):
set_property strategy Flow_AreaOptimized_high [get_runs synth_1]
set_property strategy Area_Explore [get_runs impl_1]
set_param drc.disableLUTOverUtilError 1
set_property STEPS.SYNTH_DESIGN.ARGS.CONTROL_SET_OPT_THRESHOLD 16 [get_runs synth_1]
set_property STEPS.PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]
set_property STEPS.POST_ROUTE_PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]
launch_runs -to_step write_bitstream impl_1
If you had Jupyter running with FINN already imported, restart the kernel after editing.
A8. Build and Test the FINN Docker Image
cd ~/finn
source ~/finn-env.sh
./run-docker.sh quicktest
Builds the Docker image (~20–30 min first time) and runs the self-test suite. Must print PASSED before continuing.
A9. Launch the Jupyter Notebook Server
This is the workflow for all subsequent build work:
cd ~/finn
source ~/finn-env.sh
./run-docker.sh notebook
Container starts and prints a Jupyter URL. Open it in a browser. All subsequent steps happen in notebook cells.
Critical: the finn_build directory must exist before launching Docker or you will hit a ghost-inode bug where directories appear to exist but can't be written:
mkdir -p ~/finn_build
If you ever rm -rf ~/finn_build while the container is running, you MUST exit and relaunch the container — otherwise all subsequent writes fail with misleading FileNotFoundError errors even after recreating the directory.
Part B - Prepare the Model and Config Files
B1. Get the ONNX Model
- Go to: https://github.com/sefaburakokcu/finn-quantized-yolo
- In the README table, find the 2w4a row and download the ONNX file from OneDrive
- Place it in a new folder under FINN's notebooks dir:
mkdir -p ~/finn/notebooks/lpyolo
cp /mnt/c/Users/YOU/Downloads/2w4a.onnx ~/finn/notebooks/lpyolo/
Open Jupyter, navigate to notebooks/lpyolo/, and create a new notebook there (e.g. build_lpyolo.ipynb) for all subsequent cells.
B2. Create the Folding Config
This config is the working one for Z2. Earlier folding configs online (from the LPYOLO paper era) target FINN v0.7 with different node names (StreamingFCLayer_Batch_0 etc.) and silently fail to apply in v0.10.1, leading to PE=1/SIMD=1 defaults which blow out BRAM and LUTs by 10×+.
The key insights baked into this config:
- Names must be
MVAU_N(pre-specialize). Afterstep_specialize_layers, names get wiped — so folding MUST happen before specialize. mem_modevalue is"internal_decoupled"in v0.10.1, not the old"decoupled".- For Z2's 53,200 LUT and 280 RAMB18 budget, PE/SIMD values from the LPYOLO paper are too aggressive. These values were tuned down until
place_designactually fits. MVAU_6uses"resType": "dsp"to offload one of the larger compute layers to DSP slices, relieving LUT pressure.StreamingFIFOdefault depth of 32 keeps FIFO RAM usage negligible;auto_fifo_depthsis disabled in the build config because it otherwise sizes FIFOs to full tensor-size (1M+ entries, which blows RAMB36 usage to thousands even aftersplit_large_fifos).
Save as ~/finn/notebooks/lpyolo/Pynq-Z2_folding_config.json:
{
"Defaults": {
"depth": [32, ["StreamingFIFO"]]
},
"MVAU_0": { "PE": 4, "SIMD": 3, "ram_style": "block", "resType": "lut", "mem_mode": "internal_decoupled", "runtime_writeable_weights": 0 },
"MVAU_1": { "PE": 4, "SIMD": 4, "ram_style": "block", "resType": "lut", "mem_mode": "internal_decoupled", "runtime_writeable_weights": 0 },
"MVAU_2": { "PE": 4, "SIMD": 4, "ram_style": "block", "resType": "lut", "mem_mode": "internal_decoupled", "runtime_writeable_weights": 0 },
"MVAU_3": { "PE": 2, "SIMD": 4, "ram_style": "block", "resType": "lut", "mem_mode": "internal_decoupled", "runtime_writeable_weights": 0 },
"MVAU_4": { "PE": 1, "SIMD": 4, "ram_style": "block", "resType": "lut", "mem_mode": "internal_decoupled", "runtime_writeable_weights": 0 },
"MVAU_5": { "PE": 1, "SIMD": 4, "ram_style": "block", "resType": "lut", "mem_mode": "internal_decoupled", "runtime_writeable_weights": 0 },
"MVAU_6": { "PE": 2, "SIMD": 4, "ram_style": "block", "resType": "dsp", "mem_mode": "internal_decoupled", "runtime_writeable_weights": 0 },
"MVAU_7": { "PE": 1, "SIMD": 4, "ram_style": "block", "resType": "lut", "mem_mode": "internal_decoupled", "runtime_writeable_weights": 0 },
"MVAU_8": { "PE": 1, "SIMD": 4, "ram_style": "block", "resType": "lut", "mem_mode": "internal_decoupled", "runtime_writeable_weights": 0 }
}
Dimension / divisibility verification (MH/MW are the actual LPYOLO 2W4A layer dims after FINN's streamlining):
| Node | MH | MW | PE | MH%PE | SIMD | MW%SIMD |
|---|---|---|---|---|---|---|
| MVAU_0 | 8 | 72 | 4 | 0 | 3 | 0 |
| MVAU_1 | 16 | 72 | 4 | 0 | 4 | 0 |
| MVAU_2 | 32 | 144 | 4 | 0 | 4 | 0 |
| MVAU_3 | 56 | 288 | 2 | 0 | 4 | 0 |
| MVAU_4 | 104 | 504 | 2 | 0 | 4 | 0 |
| MVAU_5 | 208 | 936 | 1 | 0 | 4 | 0 |
| MVAU_6 | 56 | 208 | 2 | 0 | 4 | 0 |
| MVAU_7 | 104 | 504 | 2 | 0 | 4 | 0 |
| MVAU_8 | 18 | 936 | 3 | 0 | 4 | 0 |
All divisibility constraints satisfied.
Part C - Run the FINN Build in Jupyter
All cells below run in the notebook created in step B1.
C1. Cell 1: Verify Environment
import os
print("FINN_XILINX_PATH: ", os.environ.get("FINN_XILINX_PATH"))
print("FINN_XILINX_VERSION: ", os.environ.get("FINN_XILINX_VERSION"))
print("XILINX_LOCAL_USER_DATA:", os.environ.get("XILINX_LOCAL_USER_DATA"))
print("cwd: ", os.getcwd())
Expected output:
FINN_XILINX_PATH: /tools/Xilinx
FINN_XILINX_VERSION: 2022.2
XILINX_LOCAL_USER_DATA: no
cwd: /workspace/finn/notebooks/lpyolo
If XILINX_LOCAL_USER_DATA is None: you didn't source ~/finn-env.sh before launching the container, or the FINN_DOCKER_EXTRA line is missing from it. Exit the container, fix, relaunch.
C2. Cell 2: Define the Build
This is the full build. The critical quirks baked in:
- Custom
step_rename_empty_nodesinserted after specialize: FINN v0.10.1 wipes node names duringstep_specialize_layers, and without names the generated HLS Tcl has emptyset_topcalls which fail. auto_fifo_depths=False, split_large_fifos=False: see B2 rationale.- Step order: folding BEFORE specialize (so node names are still
MVAU_N).
import os
os.chdir("/home/svenons/finn/notebooks/lpyolo")
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg
from finn.builder.build_dataflow_steps import *
from qonnx.core.modelwrapper import ModelWrapper
def step_rename_empty_nodes(model, cfg):
for i, n in enumerate(model.graph.node):
if not n.name or n.name.strip() == "":
n.name = f"{n.op_type}_{i}"
return model
cfg = build.DataflowBuildConfig(
output_dir = "output_full",
folding_config_file = "Pynq-Z2_folding_config.json",
auto_fifo_depths = False,
split_large_fifos = False,
synth_clk_period_ns = 10.0,
board = "Pynq-Z2",
shell_flow_type = build_cfg.ShellFlowType.VIVADO_ZYNQ,
save_intermediate_models = True,
generate_outputs = [
build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
build_cfg.DataflowOutputType.BITFILE,
build_cfg.DataflowOutputType.PYNQ_DRIVER,
build_cfg.DataflowOutputType.DEPLOYMENT_PACKAGE,
],
steps = [
step_qonnx_to_finn,
step_tidy_up,
step_streamline,
step_convert_to_hw,
step_create_dataflow_partition,
step_target_fps_parallelization,
step_apply_folding_config,
step_specialize_layers,
step_rename_empty_nodes,
step_minimize_bit_width,
step_generate_estimate_reports,
step_hw_codegen,
step_hw_ipgen,
step_set_fifo_depths,
step_create_stitched_ip,
step_measure_rtlsim_performance,
step_out_of_context_synthesis,
step_synthesize_bitfile,
step_make_pynq_driver,
step_deployment_package,
]
)
build.build_dataflow_cfg("2w4a.onnx", cfg)
C3. Run the Cell - What to Expect
Total runtime: 2–6 hours. Progress prints per step:
Running step: step_qonnx_to_finn [1/20]
Running step: step_tidy_up [2/20]
Running step: step_streamline [3/20]
Running step: step_convert_to_hw [4/20]
Running step: step_create_dataflow_partition [5/20]
Running step: step_target_fps_parallelization [6/20]
Running step: step_apply_folding_config [7/20]
Running step: step_specialize_layers [8/20]
Running step: step_rename_empty_nodes [9/20]
Running step: step_minimize_bit_width [10/20]
Running step: step_generate_estimate_reports [11/20]
Running step: step_hw_codegen [12/20]
Running step: step_hw_ipgen [13/20] ← LONGEST, 1–2 hours
Running step: step_set_fifo_depths [14/20]
Running step: step_create_stitched_ip [15/20]
Running step: step_measure_rtlsim_performance [16/20]
Running step: step_out_of_context_synthesis [17/20]
Running step: step_synthesize_bitfile [18/20] ← Vivado impl, ~30 min
Running step: step_make_pynq_driver [19/20]
Running step: step_deployment_package [20/20]
Completed successfully
If a step fails, the error shows immediately — no need to wait for the rest. Common failures and fixes are in the troubleshooting table.
C4. Verify the Output
import os
deploy_dir = "output_full/deploy"
for f in ["finn-accel.bit", "finn-accel.hwh", "driver.py"]:
path = f"{deploy_dir}/{f}"
print(f"{path}: {'OK' if os.path.isfile(path) else 'MISSING'}")
# scale.npy may be one folder deeper
import subprocess
r = subprocess.run(["find", deploy_dir, "-name", "scale.npy"], capture_output=True, text=True)
print("scale.npy paths:", r.stdout.strip())
Part D - Copy the Build Output
From WSL2:
scp -r ~/finn/notebooks/lpyolo/output_full/deploy/ xilinx@pynq:/home/xilinx/lpyolo/
Part E - GPIO Access Alongside the FINN Overlay
Loading the FINN overlay unloads the base overlay, which means the board interfaces provided by base.bit stop working. For buttons, switches, and LEDs that is fixable by adding AXI GPIO IP to the FINN design. HDMI output is different: it is not just a GPIO pin bundle, but a full video pipeline from the base overlay, including the HDMI controller / video DMA path and its supporting clocks and constraints. The GPIO merge below does not recreate that pipeline, so HDMI out will not work from the final FINN+GPIO bitstream. Obvious in hindsight, slightly annoying in practice.
You cannot have both bitstreams loaded simultaneously on a Z2. Two workable patterns:
E1. Swap overlays on demand
Load the base overlay for GPIO operations, then load the FINN overlay for inference, then swap back. Each overlay load takes ~1 second. Good for interactive/test use, bad for realtime apps that need both.
from pynq import Overlay
def use_gpio():
return Overlay("base.bit") # restores buttons, LEDs, switches, HDMI
def use_fpga_inference():
from pynq import Overlay
return Overlay("/home/xilinx/lpyolo/finn-accel.bit")
E2. Rebuild the FINN bitstream with GPIO merged in (Vivado)
To have GPIO (BTN0–3, SW0–1, LD0–3, RGB LEDs) work while the FINN accelerator is loaded, you need a single bitstream that contains both the FINN dataflow partition and the extra AXI GPIO blocks. There is no way around this on the Z2 — every user-facing button, switch, and LED is routed through the PL. This section walks through adding those GPIO blocks in Vivado.
This still does not restore HDMI. The PYNQ-Z2 HDMI output in base.bit depends on the base overlay's video subsystem, not only on board pin constraints. Adding that subsystem back would be a separate video-design integration task, and was not possible with the same model, as far as I checked.
Total time: 60–90 minutes on top of the FINN build.
E2.1. Prerequisites: WSL2 locale and Vivado system libraries
Vivado 2022.2 on Ubuntu 22.04 needs a few packages that aren't there by default, or it fails to launch:
# Fix locale
sudo apt-get install -y locales
sudo locale-gen en_US.UTF-8
sudo update-locale LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
# libtinfo5 (removed from Ubuntu 22.04 main repos, fetch manually)
wget http://archive.ubuntu.com/ubuntu/pool/universe/n/ncurses/libtinfo5_6.3-2ubuntu0.1_amd64.deb
sudo dpkg -i libtinfo5_6.3-2ubuntu0.1_amd64.deb
sudo apt-get install -y libncurses5
WSL2's WSLg has OpenGL rendering issues that freeze Vivado on complex dialogs. Force software rendering before launching:
export LIBGL_ALWAYS_SOFTWARE=1
export _JAVA_OPTIONS="-Dsun.java2d.opengl=false"
E2.2. Install the PYNQ-Z2 board files
Vivado needs the TUL PYNQ-Z2 board definition for the pin constraints of buttons, switches, and LEDs:
cd ~
git clone https://github.com/Xilinx/XilinxBoardStore.git
sudo mkdir -p /tools/Xilinx/Vivado/2022.2/data/boards/board_files/
sudo cp -r ~/XilinxBoardStore/boards/TUL/pynq-z2 /tools/Xilinx/Vivado/2022.2/data/boards/board_files/
Verify:
ls /tools/Xilinx/Vivado/2022.2/data/boards/board_files/pynq-z2/
# Should show: A.0
Note the path is TUL/pynq-z2, not Xilinx/pynq-z2 — the Z2 is a TUL product, not Xilinx.
E2.3. Open the FINN-generated Vivado project
FINN leaves a complete Vivado project behind from its synthesis step. Find it:
ls -d /home/svenons/finn_build/vivado_zynq_proj_*/
There will be one directory. Launch Vivado on its .xpr file:
source /tools/Xilinx/Vivado/2022.2/settings64.sh
vivado /home/svenons/finn_build/vivado_zynq_proj_XXXXXX/finn_zynq_link.xpr &
The GUI opens with the existing block design containing the Zynq PS, FINN dataflow partitions (StreamingDataflowPartition_0/1/2), DMA engines (idma0, odma0), and AXI infrastructure.
In Tools → Settings → General → Project device, confirm the board is set to TUL PYNQ-Z2. If not, select it and click OK.
E2.4. Add AXI GPIO IPs to the block design
Open the block design from the Flow Navigator: IP INTEGRATOR → Open Block Design.
You'll add four AXI GPIO IPs — one each for buttons, switches, LEDs, and RGB LEDs.
For each of the four, repeat these steps:
- Right-click in empty canvas area → Add IP
- Search for
AXI GPIO, double-click to add it - Double-click the new
axi_gpio_0block in the diagram to open its "Re-customize IP" dialog - On the Board tab, set GPIO → Board Interface to one of these (naming depends on the board file):
| Block to add | Board Interface value | Rename to |
|---|---|---|
| 1st | btns 4bits (check Enable Interrupt at the bottom) |
btns_gpio |
| 2nd | sws 2bits |
switches_gpio |
| 3rd | leds 4bits |
leds_gpio |
| 4th | rgb led (exact label varies — whatever the board file uses for 6-bit RGB) |
rgbleds_gpio |
- Click OK to close the dialog
- In the block diagram, right-click the block → Rename... → use the name from the table above → OK
After all four are added, click the green "Run Connection Automation" banner at the top of the Diagram view. In the dialog:
- Check the top-level "All Automation" checkbox (or manually expand each GPIO block and check both
S_AXIandGPIOsub-items) - Click OK
Vivado will wire each block's S_AXI through the existing AXI interconnect to the PS, and route each GPIO interface out to external pins using the board file's constraints. The AXI interconnect auto-expands to accommodate the new masters.
E2.5. Validate and save
- Tools → Validate Design (or F6) — must return 0 errors
- File → Save Block Design (Ctrl-S)
In the Sources tab (top-left), expand top_wrapper and find top_i : top (top.bd) underneath it. Right-click it → Generate Output Products...:
- Synthesis Options: Global (not Out of Context)
- Click Generate (~1 min)
Note: FINN ships a pre-generated HDL wrapper. Do not click "Create HDL Wrapper" — it exists already. But the shipped wrapper is out of date after you add new IPs, so Generate Output Products writes a second wrapper under finn_zynq_link.gen/.... You must swap the project's top file to this new wrapper in E2.6, otherwise synthesis runs against the stale one.
E2.6. Replace the stale wrapper with the regenerated one
This is the single most common way E2 silently fails. FINN originally ships a static top_wrapper.v under finn_zynq_link.srcs/sources_1/imports/hdl/ that matches the block design before GPIO was added. Generate Output Products writes a new wrapper under finn_zynq_link.gen/sources_1/bd/top/hdl/ with the button/LED ports, but the project still lists the old one as a source. Synthesis then runs against the old wrapper and produces a bitstream where BTN0–3 / SW0–1 / LD0–3 / RGB LEDs are internally connected but never routed to the physical FPGA pins.
Symptom: ol.btns_gpio.read(0) always returns 0 regardless of which button you press, and ol.leds_gpio.write(0, 0xF) does not turn any LED on, even though ip_dict shows btns_gpio / leds_gpio present.
Confirm before fixing — look at both wrappers:
find /home/svenons/finn_build/vivado_zynq_proj_*/ -name top_wrapper.v
grep -E "btn|sws|leds|rgb" \
/home/svenons/finn_build/vivado_zynq_proj_*/finn_zynq_link.srcs/sources_1/imports/hdl/top_wrapper.v
grep -E "btn|sws|leds|rgb" \
/home/svenons/finn_build/vivado_zynq_proj_*/finn_zynq_link.gen/sources_1/bd/top/hdl/top_wrapper.v | head
If the imports file returns nothing and the gen file lists btns_4bits_tri_i, leds_4bits_tri_io, etc., the wrapper is stale. Fix via the Tcl console in Vivado:
remove_files [get_files */imports/hdl/top_wrapper.v]
add_files -norecurse [glob */finn_zynq_link.gen/sources_1/bd/top/hdl/top_wrapper.v]
set_property top top_wrapper [current_fileset]
update_compile_order -fileset sources_1
Then proceed to the re-synthesis in E2.7.
E2.7. Re-synthesize and generate bitstream
The FINN project already has the synthesis/implementation strategies we want. In the Tcl console (bottom of window) paste:
reset_run synth_1
launch_runs synth_1 -jobs 4
wait_on_run synth_1
launch_runs impl_1 -to_step write_bitstream -jobs 4
Wait ~45 minutes total.
Once it finishes, before exporting hardware, run a sanity check:
cat /home/svenons/finn_build/vivado_zynq_proj_*/finn_zynq_link.runs/impl_1/top_wrapper_io_placed.rpt \
| grep -E "D19|D20|L19|L20"
The "Signal Name" column (2nd one) on each of those four rows must be non-empty — it should read something like btns_4bits_tri_i[0]. If the column is still blank, the wrapper is still stale and synthesis used the wrong one; redo E2.6.
If implementation fails with "write_bitstream out-of-date" errors:
reset_run impl_1
launch_runs impl_1 -to_step write_bitstream -jobs 4
If synthesis fails with "black box" errors about StreamingDataflowPartition_X_0, it means the output products regenerated but the dataflow partition OOC checkpoints weren't re-linked. Fix:
reset_runs [get_runs -filter {IS_SYNTHESIS}]
launch_runs synth_1 -jobs 4
When everything succeeds, Vivado shows "Bitstream Generation successfully completed." Click Cancel on that dialog.
E2.8. Export the hardware
In the Tcl console:
write_hw_platform -fixed -include_bit -force -file /home/svenons/top_wrapper.xsa
Then in a WSL2 terminal (outside Vivado):
cd /tmp
mkdir -p xsa_extract && cd xsa_extract
unzip -o /home/svenons/top_wrapper.xsa
ls *.bit *.hwh
You'll see top_wrapper.bit and top.hwh (ignore top_smartconnect_0_0.hwh — that's a sub-IP export).
E2.9. Copy the updated bitstream files
Rename to match what the FINN driver expects, copy over:
cp /tmp/xsa_extract/top_wrapper.bit ~/finn-accel.bit
cp /tmp/xsa_extract/top.hwh ~/finn-accel.hwh
scp ~/finn-accel.bit ~/finn-accel.hwh xilinx@pynq:/home/xilinx/lpyolo/deploy/bitfile/
For the Face Frenzy repo, copy them into fpga/ instead:
scp ~/finn-accel.bit ~/finn-accel.hwh xilinx@pynq:/home/xilinx/face-frenzy/fpga/
Also copy scale.npy from the FINN deploy package if it exists:
find ~/finn/notebooks/lpyolo/output_full/deploy -name scale.npy
scp /path/to/scale.npy xilinx@pynq:/home/xilinx/face-frenzy/fpga/scale.npy
E2.10. Use the restored GPIO from Python
Once the new bitstream is loaded with Overlay("finn-accel.bit"), the GPIO IPs are accessible as named attributes on the overlay object:
from pynq import Overlay
ol = Overlay("/home/xilinx/lpyolo/deploy/bitfile/finn-accel.bit")
# Buttons: read the 4-bit value; each bit = one BTNx
btns = ol.btns_gpio.read(0) # bit 0 = BTN0, bit 3 = BTN3
btn0_pressed = bool(btns & 0x1)
# Switches: 2-bit value
sws = ol.switches_gpio.read(0)
sw0 = bool(sws & 0x1)
sw1 = bool(sws & 0x2)
# LEDs: write a 4-bit value, bit N = LDN
ol.leds_gpio.write(0, 0b1010) # LD1 and LD3 on
# RGB LEDs: 6-bit value; bit order depends on board file —
# typically [LD4_B, LD4_G, LD4_R, LD5_B, LD5_G, LD5_R]
ol.rgbleds_gpio.write(0, 0b000100) # LD4 green
# The FINN accelerator is still there. In Face Frenzy, use the repo helper.
from FaceDetector import load_finn_accelerator
accel = load_finn_accelerator("fpga/finn-accel.bit", "fpga")
# ...use accel.execute(...) as before
The FINN accelerator and the GPIO now coexist in a single bitstream. Loading the overlay configures both, and you can use them independently in the same Python script.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
ERROR [Synth 8-5834] Design needs N RAMB18 which is more than device capacity of 280 |
Folding didn't apply — node names mismatched | Confirm config in B2 is exact; verify step order has folding BEFORE specialize |
AssertionError: mem_mode = decoupled not in {...internal_decoupled...} |
Old v0.7 mem_mode value | Use"internal_decoupled" in the folding config |
set_top: Argument 'name' requires a value! in HLS log |
Empty node names after specialize | step_rename_empty_nodes in the steps list |
ERROR [Common 17-685] Unable to load Tcl app xilinx::xsim |
Vivado tclstore corruption from parallel HLS | export FINN_DOCKER_EXTRA=" -e XILINX_LOCAL_USER_DATA=no " in finn-env.sh |
ERROR [DRC UTLZ-1] LUT as Logic over-utilized |
Folding too aggressive, design too large | Lower PE on the biggest MVAU layers (usually MVAU_5 which is MH=208) |
ERROR [DRC UTLZ-1] RAMB18/RAMB36 over-utilized with 1000+ BRAMs needed |
auto_fifo_depths=True sized FIFOs to tensor size |
Setauto_fifo_depths=False and use "Defaults": {"depth": [32, ["StreamingFIFO"]]} in folding config |
FileNotFoundError on makedirs when dir exists |
Ghost-inode bug afterrm -rf finn_build while container was running |
Exit and relaunch container |
FINN build fails atstep_hw_ipgen on multiple nodes |
Vivado not mounted in container | VerifyFINN_XILINX_PATH=/tools/Xilinx before ./run-docker.sh |
Value '524288' is out of range for FIFO depth |
auto_fifo_depths sized a FIFO beyond Vivado's 32768 cap | Disableauto_fifo_depths or enable split_large_fifos=True |
| Buttons / LEDs / switches blank after loading FINN overlay | PL overlay replaced the base overlay | See Part E |
| HDMI out blank after loading the FINN+GPIO overlay | HDMI pipeline from base.bit was not recreated |
Use base.bit for HDMI, or integrate a full video subsystem separately |
RuntimeError: Could not open device with index '0' on PYNQ |
Overlay/XRT access attempted without root | Use sudo -i, then source pynq_venv.sh and xrt_setup.sh again |
ModuleNotFoundError: No module named 'bitstring' |
Generated FINN packing dependency missing from root PYNQ env | Install bitstring inside the sourced root PYNQ environment |
pip install opencv-python tries to compile for hours |
PYNQ-Z2 is armv7 and pip wheels are unreliable | Install python3-opencv with apt instead |
accel.execute(...) appears stuck before DMA starts |
Generic FINN UINT4 packing path is too slow on the board | Use the Face Frenzy fpga/driver_base.py fast UINT4 pack patch |
| FPGA execute passes but face count stays at zero | App-side decoder scale/layout/channel order is not calibrated | Run scripts/fpga-frame-debug.py and tune FACE_FRENZY_OUTPUT_SCALE / layout env vars |
| WSL2 out of disk during Vivado install | Default vhdx is 256GB but partitioned smaller | Follow A2 to resize |