OpenOPU

OPEN Processing Unit: a software/hardware tool-chain for the acceleration of general deep learning algorithms

Academic research of CPU has been facilitated greatly by open source toolchain and resources such as instruction set (ISA), compiler, and instruction/microarchitecture level simulation. Examples include SimpleScalar, GEM5, and more recently, RISC-V toolset. Yet, such open-source and complete eco-system does not exist for general machine learning algorithms. OpenOPU is meant to be an open-source and complete eco-system for machine learning hardware research, including: ISA with executable specifications, compiler with formal verification, instruction level (functional) and microarchitecture level (cycle-accurate) simulation, parametrized modules in RTL and Chisel, and FPGA emulation and development boards.

Our first release includes OPU (open processing unit for ML) for edge inference on FPGA. Future releases will extend to ISA and microarchitecture for both training and inference and for cloud AI computing as well as in-network AI computing, considering FPGA, SOC and 3D FPGA/SOC.

Currently, OPU is being used for AI research at UCLA, University of Michigan and Stanford University. (Last updated on Jul 14, 2020)

OpenOPU: AI on a Chip for CNN, GCN, and Transformers.

Recent Publications

OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks

TVLSI 2019

A domain-specific FPGA overlay processor, named OPU to accelerate CNN networks. (Last updated on Jul 14, 2020)

Uni-OPU: An FPGA Based Uniform Accelerator for Convolutional and Transposed Convolutional Networks

TVLSI 2020

The first full software / hardware stack, called Uni-OPU, for an efficient uniform hardware acceleration of different types of transposed convolutional (TCONV) networks and conventional convolutional (CONV) networks. (Last updated on Jul 14, 2020)

Light-OPU: An FPGA Overlay Processor for Lightweight Convolutional Neural Networks

FPGA 2020 (Best Paper Candidate)

An FPGA-based overlay processor with a corresponding compilation flow for general LW-CNN accelerations, called Light-OPU. (Last updated on Jul 14, 2020)

MP-OPU: A Mixed Precision FPGA-based Overlay Processor for Convolutional Neural Networks

FPL 2021

A Mixed Precision FPGA-based Overlay Processor that effectively accelerates the inference of mixed precision models.

Lightweight FPGA-based Graph Convolutional Network Accelerator

TRETS 2022

A lightweight FPGA-based accelerator with a software-hardware co-designed process to tackle irregularity in computation and memory access in GCN inference.

Lasa: Abstraction and Specialization for Productive and Performant Linear Algebra on FPGAs

FCCM 2023

(To be updated)

Overlay Processor for Transformer-based Machine Learning

Under review

Optimizes data layout with dataflow, the first in-depth study on overlay processor considering both vision and language transformer models.

Low Precision Floating-point Arithmetic for High Performance FPGA-based CNN Acceleration

29 Feb 2020

A low precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome re-training and accuracy limitations. (Last updated on Jul 14, 2020)

NPE: An FPGA-based Overlay Processor for Natural Language Processing

13 Apr 2021

An FPGA-based overlay processor for NLP model inference at the edge (Last updated on Apr 15, 2021)

Heterogeneous Dual-Core Overlay Processor for Light-Weight CNNs

FCCM21 (post)

A heterogeneous dual-core architecture where one core is optimized for regular convolution layers and the other for depth-wise convolution layers.

Low Precision Floating-point Arithmetic for High Performance FPGA-based CNN Acceleration

TRETS 2021

Utilizes low-precision floating point operations to efficiently accelerate CNN, the first work that can fit four 8-bit multiplications for inference in one DSP while maintaining comparable accuracy without any retraining.

Simple Yet Effective Accelerator For GCN Training

FPL 2022

An FPGA-based GCN accelerator, named SkeletonGCN, including multiple software-hardware co-optimizations to improve training efficiency.

ITLE: Token-packing for Transformers with Variable-Length Inputs

FPL 2023

(To be updated)

MCore-OPU: An FPGA-based Multi-Core Overlay Processor for Transformer-based Models

Under Review