SwePub
Sök i LIBRIS databas

  Utökad sökning

onr:"swepub:oai:DiVA.org:kth-252646"
 

Sökning: onr:"swepub:oai:DiVA.org:kth-252646" > An Efficient Stream...

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

Chen, Qinyu (författare)
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China.
Fu, Yuxiang (författare)
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China.
Song, Wenqing (författare)
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China.
visa fler...
Cheng, Kaifeng (författare)
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China.
Lu, Zhonghai (författare)
KTH,Elektronik och inbyggda system
Zhang, Chuan (författare)
Southeast Univ, Natl Mobile Commun Res Lab, Nanjing 210096, Jiangsu, Peoples R China.
Li, Li (författare)
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China.
visa färre...
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China Elektronik och inbyggda system (creator_code:org_t)
2019-03-27
2019
Engelska.
Ingår i: Electronics. - : MDPI. - 2079-9292. ; 8:4
  • Tidskriftsartikel (refereegranskat)
Abstract Ämnesord
Stäng  
  • Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition, speech processing, as well as in many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on the embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs are proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse grain task partitioning (CGTP) strategy, the proposed accelerator with heterogeneous computing units, supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which can reduce the power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with the low power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to well support the streaming architecture. The accelerator is implemented with TSMC 40 nm technology with a core size of . It can achieve TOPS/W energy efficiency and area efficiency at 100.1mW, which makes it a promising design for the embedded devices.

Ämnesord

TEKNIK OCH TEKNOLOGIER  -- Elektroteknik och elektronik (hsv//swe)
ENGINEERING AND TECHNOLOGY  -- Electrical Engineering, Electronic Engineering, Information Engineering (hsv//eng)

Nyckelord

low bit-width convolutional neural networks
parallel streaming architecture
coarse grain task partitioning
reconfigurable
VLSI

Publikations- och innehållstyp

ref (ämneskategori)
art (ämneskategori)

Hitta via bibliotek

Till lärosätets databas

Sök utanför SwePub

Kungliga biblioteket hanterar dina personuppgifter i enlighet med EU:s dataskyddsförordning (2018), GDPR. Läs mer om hur det funkar här.
Så här hanterar KB dina uppgifter vid användning av denna tjänst.

 
pil uppåt Stäng

Kopiera och spara länken för att återkomma till aktuell vy