List of FPGA Papers

This is a list of academic papers covering all sorts of FPGA-related topics, mostly from a system researcher's point of view.

Not all the listed papers are good; some merely repeat history. That is why I started this list: to first gain a thorough understanding of this area. If you see any papers missing, please comment below and I will add them accordingly.

Website version: link.

Virtualization

Scheduling

Questions to cover:

  • Can we partition an FPGA bitstream into multiple small ones? If so, how?
    • Are there any tools that could automatically partition an FPGA bitstream into multiple small ones without developer intervention?
    • Partitioning adds extra communication overhead and may lower overall performance, but it should greatly increase scheduling flexibility in terms of area allocation.
  • How to do hardware context-switch?
  • How can we preemptively de-schedule an already deployed bitstream? What needs to be taken care of?
  • How to deal with area fragmentation?
    • FPGA area fragmentation differs from traditional memory fragmentation, simply because a) FPGA area is two-dimensional, b) relocating a bitstream may not be possible, and c) each bitstream has a fixed area requirement.
  • How to allocate FPGA areas?
  • How to relocate already deployed bitstreams to make room for a new one?
  • What scheduling algorithms shall we use?
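To make the fragmentation question above concrete, here is a toy 2D first-fit placement model in Python. The `FpgaGrid` class and its interface are hypothetical, not taken from any of the listed papers; real schedulers must also respect reconfigurable-region boundaries. The point it illustrates: once deployed bitstreams cannot be relocated, a placement request can fail even though enough total area is free.

```python
class FpgaGrid:
    """Toy model of FPGA area as a grid of cells (hypothetical interface)."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        # occupied[y][x] holds the id of the bitstream using that cell, or None.
        self.occupied = [[None] * width for _ in range(height)]

    def _fits(self, x, y, w, h):
        if x + w > self.width or y + h > self.height:
            return False
        return all(self.occupied[y + dy][x + dx] is None
                   for dy in range(h) for dx in range(w))

    def place(self, bid, w, h):
        """First-fit: scan top-to-bottom, left-to-right for a free w x h
        rectangle. Returns the (x, y) anchor or None. A deployed bitstream
        cannot be moved afterwards, which is what makes fragmentation hard."""
        for y in range(self.height):
            for x in range(self.width):
                if self._fits(x, y, w, h):
                    for dy in range(h):
                        for dx in range(w):
                            self.occupied[y + dy][x + dx] = bid
                    return (x, y)
        return None

    def free(self, bid):
        for row in self.occupied:
            for x, cell in enumerate(row):
                if cell == bid:
                    row[x] = None

grid = FpgaGrid(4, 4)
grid.place("A", 2, 4)       # left half
grid.place("B", 2, 2)       # top-right quadrant
grid.place("C", 2, 2)       # bottom-right quadrant
grid.free("A")              # 8 cells free, all in a 2-wide column
grid.place("D", 4, 2)       # fails: needs width 4, free area is only 2 wide
```

Even though half the chip is free after `free("A")`, the 4x2 request cannot be satisfied without relocating B or C, which motivates the relocation and allocation questions above.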

Offline:

Online:

NoC

Network-on-Chip on FPGA.

Memory Hierarchy

Papers that deal with BRAM, registers, on-board DRAM, and host DRAM.

Dynamic Memory Allocation

malloc() and free() for FPGA on-board DRAM.
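For intuition, such an allocator can be as simple as a first-fit free list with coalescing, managed on the host and handing out on-board DRAM addresses. The sketch below is a minimal hypothetical version; the `DramAllocator` interface is illustrative, not from any of the listed papers.

```python
class DramAllocator:
    """Hypothetical host-side free-list allocator for FPGA on-board DRAM."""

    def __init__(self, base, size):
        self.free_list = [(base, size)]  # sorted list of (addr, size) holes
        self.allocated = {}              # addr -> size, for fpga_free()

    def fpga_malloc(self, size):
        # First-fit: take the first hole large enough, splitting it.
        for i, (addr, hole) in enumerate(self.free_list):
            if hole >= size:
                if hole == size:
                    self.free_list.pop(i)
                else:
                    self.free_list[i] = (addr + size, hole - size)
                self.allocated[addr] = size
                return addr
        return None  # out of on-board memory

    def fpga_free(self, addr):
        size = self.allocated.pop(addr)
        self.free_list.append((addr, size))
        self.free_list.sort()
        # Coalesce adjacent holes to fight fragmentation.
        merged = [self.free_list[0]]
        for a, s in self.free_list[1:]:
            pa, ps = merged[-1]
            if pa + ps == a:
                merged[-1] = (pa, ps + s)
            else:
                merged.append((a, s))
        self.free_list = merged

heap = DramAllocator(0x0, 1 << 20)  # pretend 1 MiB of on-board DRAM
a = heap.fpga_malloc(4096)
b = heap.fpga_malloc(4096)
heap.fpga_free(a)
heap.fpga_free(b)                   # coalesces back into one 1 MiB hole
```

The interesting design questions in the papers are exactly what this sketch glosses over: where the metadata lives (host vs. FPGA), and how allocation interacts with DMA and multiple accelerators.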

Integrate with Host Virtual Memory

Papers that deal with the OS Virtual Memory System (VMS). Note that all these papers introduce some form of MMU into the FPGA so that the FPGA can work with the host VMS. This added MMU is similar to a CPU's MMU and an RDMA NIC's internal cache. The VMS itself still runs inside Linux (including page faults, swapping, TLB shootdowns, and so on). What would really stand out is implementing the VMS inside the FPGA.
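To make the added-MMU idea concrete, here is a minimal sketch of the FPGA-side structure these papers share: a small translation cache (TLB) mapping host virtual pages to physical pages, with misses served by the host VMS. All names here (`FpgaTlb`, `host_translate`) are illustrative assumptions, not any paper's actual interface.

```python
PAGE_SHIFT = 12  # assume 4 KiB pages

class FpgaTlb:
    """Hypothetical FPGA-side TLB; the slow path asks the host VMS."""

    def __init__(self, capacity, host_translate):
        self.capacity = capacity
        self.host_translate = host_translate  # slow path: vpn -> pfn
        self.entries = {}                     # vpn -> pfn, FIFO-evicted
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.entries:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                # Evict the oldest entry (dicts preserve insertion order).
                self.entries.pop(next(iter(self.entries)))
            self.entries[vpn] = self.host_translate(vpn)
        return (self.entries[vpn] << PAGE_SHIFT) | offset

# Pretend host page table: map each virtual page n to physical frame n + 100.
tlb = FpgaTlb(capacity=2, host_translate=lambda vpn: vpn + 100)
tlb.translate(0x1234)   # miss: fetched from the host, then cached
tlb.translate(0x1ABC)   # hit: same virtual page
```

This also shows why TLB shootdown matters: when Linux unmaps or swaps a page, any cached `entries` for it on the FPGA must be invalidated too.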

Integrate with Host OSs

Summary

A summary of the current FPGA virtualization status. Prior art mainly focuses on: 1) how to virtualize on-chip BRAM (e.g., CoRAM, LEAP Scratchpad); 2) how to work with the host, specifically how to use host DRAM and host virtual memory; 3) how to schedule bitstreams inside an FPGA chip; 4) how to provide certain services to make FPGA programming easier (mostly by working with the host OS).

Languages, Runtime, and Framework

Innovations in the toolchain space.

Xilinx HLS

High-Level Languages and Platforms

Integrate with Frameworks

  • Map-reduce as a Programming Model for Custom Computing Machines, FCCM'08
    • This paper proposes a model to translate MapReduce code written in C into code that could run on an FPGA and a GPU. Many details are omitted, and they do not actually have the compiler.
    • Single-host framework; everything is in the FPGA and GPU.
  • Axel: A Heterogeneous Cluster with FPGAs and GPUs, FPGA'10
    • A distributed MapReduce framework targeting clusters with CPUs, GPUs, and FPGAs. The main contribution is the idea of scheduling FPGA/GPU jobs.
    • Distributed Framework.
  • FPMR: MapReduce Framework on FPGA, FPGA'10
    • A MapReduce framework on a single host's FPGA. You need to write Verilog/HLS processing logic to hook into their framework. The framework mainly includes a data-transfer controller and a simple scheduler that enables certain blocks at certain times.
    • Single-host framework, everything is in FPGA.
  • Melia: A MapReduce Framework on OpenCL-Based FPGAs, IEEE'16
    • Another framework, written in OpenCL; users program in OpenCL as well. Similar to previous work, it is more about framework design than specific algorithms on the FPGA.
    • Single-host framework; everything is in the FPGA, though they include a discussion of running on multiple FPGAs.
    • There are four MapReduce-on-FPGA papers here, and I believe there are more. The marriage between MapReduce and FPGAs is not hard to understand: an FPGA can be viewed as another core with different capabilities. The real question is, given an FPGA's reprogramming time and limited on-board memory, how to design good scheduling algorithms and data-movement/caching mechanisms. These papers give some hints on this.
  • UCLA: When Apache Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration, HotCloud'16
  • UCLA: Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale, SoCC'16
    • A system that hooks FPGAs into Spark.
    • There is a line of work that hooks FPGAs into big-data processing frameworks (e.g., Spark), so that the FPGA implementation and the scale-out software can be developed separately. Spark can schedule FPGA jobs across machines and take care of scale-out, failure handling, etc. But I personally think this line of work is really just an extension of the ReconOS/FUSE/BORPH line of work. The main reason: both lines try to integrate jobs running on the CPU with jobs running on the FPGA, so the CPU and FPGA have an easier way to talk, or, put another way, a better division of labor. Whether single-machine (like ReconOS, Melia) or distributed (like Blaze, Axel), they are essentially the same.
  • UCLA: Heterogeneous Datacenters: Options and Opportunities, DAC'16
    • Follow-up work of Blaze. A nice comparison of big and wimpy cores.

Cloud Infrastructure

Applications

Programmable Network

Database

Storage

Machine Learning

  • Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, FPGA'15
  • From High-Level Deep Neural Models to FPGAs, ISCA'16
  • Deep Learning on FPGAs: Past, Present, and Future, arXiv'16
  • Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC, FPT'16
  • FINN: A Framework for Fast, Scalable Binarized Neural Network Inference, FPGA'17
  • In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA'17
  • Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs, FPGA'17
  • A Configurable Cloud-Scale DNN Processor for Real-Time AI, ISCA'18
  • A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks, MICRO'18
  • DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs, ICCAD'18
  • FA3C: FPGA-Accelerated Deep Reinforcement Learning, ASPLOS'19
  • Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval, ATC'19

Graph

  • A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing, ISCA'15
  • Energy Efficient Architecture for Graph Analytics Accelerators, ISCA'16
  • Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search, FPGA'17
  • FPGA-Accelerated Transactional Execution of Graph Workloads, FPGA'17
  • An FPGA Framework for Edge-Centric Graph Processing, CF'18

Key-Value Store

  • Achieving 10Gbps line-rate key-value stores with FPGAs, HotCloud'13
  • Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached, ISCA'13
  • An FPGA Memcached Appliance, FPGA'13
  • Scaling out to a Single-Node 80Gbps Memcached Server with 40 Terabytes of Memory, HotStorage'15
  • KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC, SOSP'17
  • Ultra-Low-Latency and Flexible In-Memory Key-Value Store System Design on CPU-FPGA, FPT'18

Bio

Consensus

  • Consensus in a Box: Inexpensive Coordination in Hardware, NSDI'16

Video Processing

  • TODO

Blockchain

  • TODO

Micro-services

  • TODO

FPGA Internal

General

Partial Reconfiguration

Logical Optimization and Technology Mapping

Place and Route

RTL2FPGA