OP: 大多数88

[Electrical Engineering and Systems Science] DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration


Posted by 大多数88 (OP, employment-verified) on 2022-3-13 16:42:00 from a mobile device

Abstract (translated):
Overlays offer field-programmable gate arrays (FPGAs) fast development cycles and remove many of the challenges of the traditional FPGA hardware design flow. However, this usually comes with a significant performance burden, so overlays are rarely adopted in practical applications. In this paper, we tailor an overlay to a specific application domain and show how to preserve its full programmability without paying the performance overhead traditionally associated with overlays. Specifically, we introduce an overlay for deep neural network inference with only about 1% overhead to support the control and reprogramming logic, using a lightweight very-long instruction word (VLIW) network. In addition, we implement a sophisticated domain-specific graph compiler that compiles deep learning languages such as Caffe or Tensorflow to easily target our overlay. We show how our graph compiler performs architecture-driven software optimizations to significantly boost the performance of both convolutional and recurrent neural networks (CNNs/RNNs): compared with naive implementations, we demonstrate a 3x improvement on ResNet-101 and a 12x improvement for long short-term memory (LSTM) cells. Finally, we describe how we can tailor our hardware overlay and use our graph compiler to achieve about 900 fps for GoogLeNet on an Intel Arria 10 1150, the fastest reported on comparable FPGAs.
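The paper does not publish its compiler, but the kind of architecture-driven graph rewriting described above can be illustrated with a small, self-contained sketch. The toy Node representation and the fuse_conv_bias_relu pass below are assumptions for illustration only, showing how a graph compiler might collapse a conv -> bias -> ReLU chain into one fused operation before emitting overlay instructions:

# Toy illustration (not the paper's compiler): fuse conv -> bias -> relu chains
# in a small dataflow graph so they can be issued as a single overlay operation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    op: str                                        # e.g. "conv", "bias", "relu", "fused_conv"
    inputs: List["Node"] = field(default_factory=list)
    name: str = ""

def fuse_conv_bias_relu(output_nodes: List[Node]) -> List[Node]:
    """Rewrite relu(bias(conv(x))) sub-chains into a single 'fused_conv' node."""
    def rewrite(node: Node) -> Node:
        node.inputs = [rewrite(i) for i in node.inputs]
        if (node.op == "relu" and len(node.inputs) == 1
                and node.inputs[0].op == "bias" and len(node.inputs[0].inputs) == 1
                and node.inputs[0].inputs[0].op == "conv"):
            conv = node.inputs[0].inputs[0]
            return Node(op="fused_conv", inputs=conv.inputs, name=f"fused({conv.name})")
        return node
    return [rewrite(n) for n in output_nodes]

# Example: x -> conv1 -> bias1 -> relu1 becomes x -> fused(conv1)
x = Node(op="input", name="x")
conv1 = Node(op="conv", inputs=[x], name="conv1")
bias1 = Node(op="bias", inputs=[conv1], name="bias1")
relu1 = Node(op="relu", inputs=[bias1], name="relu1")
print([n.name for n in fuse_conv_bias_relu([relu1])])   # ['fused(conv1)']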
---
English title:
《DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration》
---
Authors:
Mohamed S. Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O'Connell, Nitika Shanker, Joseph Chu, Ian Prins, Joshua Fender, Andrew C. Ling, Gordon R. Chiu
---
Latest submission year:
2018
---
Classification:

Primary category: Computer Science
Secondary category: Distributed, Parallel, and Cluster Computing
Description: Covers fault-tolerance, distributed algorithms, stability, parallel computation, and cluster computing. Roughly includes material in ACM Subject Classes C.1.2, C.1.4, C.2.4, D.1.3, D.4.5, D.4.7, E.1.
--
Primary category: Computer Science
Secondary category: Hardware Architecture
Description: Covers systems organization and hardware architecture. Roughly includes material in ACM Subject Classes C.0, C.1, and C.5.
--
Primary category: Electrical Engineering and Systems Science
Secondary category: Signal Processing
Description: Theory, algorithms, performance analysis and applications of signal and data analysis, including physical modeling, processing, detection and parameter estimation, learning, mining, retrieval, and information extraction. The term "signal" includes speech, audio, sonar, radar, geophysical, physiological, (bio-)medical, image, video, and multimodal natural and man-made signals, including communication signals and data. Topics of interest include: statistical signal processing, spectral estimation and system identification; filter design, adaptive filtering / stochastic learning; (compressive) sampling, sensing, and transform-domain methods including fast algorithms; signal processing for machine learning and machine learning for signal processing applications; in-network and graph signal processing; convex and nonconvex optimization methods for signal processing applications; radar, sonar, and sensor array beamforming and direction finding; communications signal processing; low power, multi-core and system-on-chip signal processing; sensing, communication, analysis and optimization for cyber-physical systems such as power grids and the Internet of Things.
--

---
English abstract:
  Overlays have shown significant promise for field-programmable gate-arrays (FPGAs) as they allow for fast development cycles and remove many of the challenges of the traditional FPGA hardware design flow. However, this often comes with a significant performance burden resulting in very little adoption of overlays for practical applications. In this paper, we tailor an overlay to a specific application domain, and we show how we maintain its full programmability without paying for the performance overhead traditionally associated with overlays. Specifically, we introduce an overlay targeted for deep neural network inference with only ~1% overhead to support the control and reprogramming logic using a lightweight very-long instruction word (VLIW) network. Additionally, we implement a sophisticated domain specific graph compiler that compiles deep learning languages such as Caffe or Tensorflow to easily target our overlay. We show how our graph compiler performs architecture-driven software optimizations to significantly boost performance of both convolutional and recurrent neural networks (CNNs/RNNs) - we demonstrate a 3x improvement on ResNet-101 and a 12x improvement for long short-term memory (LSTM) cells, compared to naive implementations. Finally, we describe how we can tailor our hardware overlay, and use our graph compiler to achieve ~900 fps on GoogLeNet on an Intel Arria 10 1150 - the fastest ever reported on comparable FPGAs.
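The abstract reports the 12x LSTM speedup without detailing here which graph-level rewrites produce it. One generic, well-known optimization of this kind is to fuse the four per-gate matrix multiplies of an LSTM cell into a single larger one, so the accelerator sees one big, efficient GEMM. The NumPy sketch below only illustrates that general idea; the shapes and names are assumptions, not taken from the paper:

# Minimal NumPy sketch (assumed, not from the paper): fuse the four LSTM gate
# GEMMs into a single larger GEMM and verify the results match the naive form.
import numpy as np

rng = np.random.default_rng(0)
batch, hidden = 4, 128
x_h = rng.standard_normal((batch, 2 * hidden))            # [x_t, h_{t-1}] concatenated

# Naive form: four separate weight matrices, one small GEMM per gate.
W_i, W_f, W_g, W_o = (rng.standard_normal((2 * hidden, hidden)) for _ in range(4))
naive = [x_h @ W for W in (W_i, W_f, W_g, W_o)]

# Fused form: stack the weights once at compile time, issue a single GEMM,
# then slice the result into the four gate pre-activations.
W_fused = np.concatenate([W_i, W_f, W_g, W_o], axis=1)    # (2*hidden, 4*hidden)
gates = np.split(x_h @ W_fused, 4, axis=1)

assert all(np.allclose(a, b) for a, b in zip(naive, gates))
print("fused GEMM matches the four naive GEMMs")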
---
PDF link:
https://arxiv.org/pdf/1807.06434

Keywords: neural network, FPGA, compiler, inference, architecture

