
Teraflop FPGA Design

Martin Langhammer
2011 2011 IEEE 20th Symposium on Computer Arithmetic  
This article will review devices and methods for achieving consistent high performance system implementations in floating point.  ...  FUSED DATAPATH MAPPING Fused datapath methodology uses rules to create functional clusters, where the normalization and denormalization are merged among multiple operators [1].  ...  INTRODUCTION Many operator libraries have been designed for FPGAs; a brief survey of these shows that the most commonly used operators (multiply and add/subtract) have similar areas, performance levels  ... 
doi:10.1109/arith.2011.32 dblp:conf/arith/Langhammer10 fatcat:tg6iwj6a65chvgrshoaydzgpnq

Tools and Techniques for Efficient High-Level System Design on FPGAs [article]

Adrian J. Chung, Kathryn Cobden, Mark Jervis, Martin Langhammer, Bogdan Pasca
2014 arXiv   pre-print
The combination of 7 floating-point precisions, fused-datapath support, custom operator support and automated folding allows exploring the best tradeoffs between accuracy, size and throughput.  ...  In order for FPGAs to be successful outside traditional markets, tools which enable software programmers to achieve high levels of system performance while abstracting away the FPGA-specific details are  ...  IMPLEMENTATIONS USING IEEE-754 OPERATOR ASSEMBLY AND THE PRESENTED FUSED DATAPATH TECHNIQUE ON STRATIXV, TARGETING SINGLE PRECISION AND A CUSTOM 35 BIT FRACTION FORMAT Type Precision Performance  ... 
arXiv:1408.4797v1 fatcat:uonfo5musfb2hh7t2o7dtic5oa

FPGA Implementation of Double Precision Floating Point Multiplier

Mohd. Abdullah, Bharti Chourasia
2022 International Journal on Recent and Innovation Trends in Computing and Communication  
High speed computation is the need of today's generation of Processors.  ...  Especially in the field of signal processing, multiplication and division operations are widely used in many applications.  ...  It needs a large variety of matrix processes, and also the ability to perform a series of matrix operations with the same structure.  ... 
doi:10.17762/ijritcc.v10i12.5896 fatcat:o6ijkmcvmzh4jhspy5fwvwhcwa

A mixed-precision fused multiply and add

Nicolas Brunie, Florent de Dinechin, Benoit de Dinechin
2011 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR)  
The floating-point fused multiply and add, computing R=AB+C with a single rounding, is now an IEEE-754 standard operator.  ...  Like the standard FMA operator, the proposed mixed-precision operator computes AB+C with a single rounding, and fully supports subnormals.  ...  , complex operations, range reductions, multiple-precision operations, and others [8].  ... 
doi:10.1109/acssc.2011.6189977 dblp:conf/acssc/BrunieDD11 fatcat:2yjz6ikeuvbfthr6etvvqz7leq
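
The "single rounding" property mentioned in the snippet above can be illustrated with a small sketch. This is not the mixed-precision operator from the paper; it only emulates a same-precision fused multiply-add in software by forming AB+C exactly with rational arithmetic and rounding once to a double. The function names `fma_exact` and `mul_then_add` are hypothetical, chosen for this illustration.

```python
from fractions import Fraction


def fma_exact(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, then round once.

    Every finite double is an exact rational, so Fraction arithmetic gives the
    exact result; float() then performs the single, correctly rounded conversion.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def mul_then_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded, then the sum is rounded again."""
    return a * b + c


# Inputs chosen so that the exact product is 1 - 2**-60, which rounds to 1.0
# in double precision before the addition in the unfused version.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

fused = fma_exact(a, b, c)
unfused = mul_then_add(a, b, c)
```

With these inputs the unfused result is 0.0, because the intermediate rounding of the product discards the 2^-60 term, while the emulated fused result keeps it and returns -2^-60.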

Latency Sensitive FMA Design

Sameh Galal, Mark Horowitz
2011 2011 IEEE 20th Symposium on Computer Arithmetic  
The implementation of merged floating-point multiply-add operations can be optimized in many ways.  ...  The cascade design has the same area and energy budget as a traditional fused multiply-add (FMA).  ...  INTRODUCTION A high performance floating-point unit is a major component of modern CPU and GPU designs.  ... 
doi:10.1109/arith.2011.26 dblp:conf/arith/GalalH10 fatcat:zdzrod6iuzga7btd5kfsavhabu

From SODA to scotch: The evolution of a wireless baseband processor

Mark Woh, Yuan Lin, Sangwon Seo, Scott Mahlke, Trevor Mudge, Chaitali Chakrabarti, Richard Bruce, Danny Kershaw, Alastair Reid, Mladen Wilder, Krisztian Flautner
2008 2008 41st IEEE/ACM International Symposium on Microarchitecture  
Ardbeg's redesign process can be grouped into the following three major areas: optimizing the wide SIMD datapath, providing long instruction word (LIW) support for SIMD operations, and adding application-specific  ...  Ardbeg also provides modest LIW support by allowing two SIMD operations to issue in the same cycle.  ...  Acknowledgment We thank the anonymous referees for their useful comments and suggestions.  ... 
doi:10.1109/micro.2008.4771787 dblp:conf/micro/WohLSMMCBKRWF08 fatcat:2vlng2aqazfbnhz4yjvxwqv3za

High-Level Design Tools for Floating Point FPGAs

Deshanand P. Singh, Bogdan Pasca, Tomasz S. Czajkowski
2015 Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '15  
In the matrix-matrix multiplication algorithm, shown in Figure 2, the multiplication is performed using blocks of data, where on  ...  Matrix-Matrix Multiplication pseudo code.  ... 
doi:10.1145/2684746.2689079 dblp:conf/fpga/SinghPC15 fatcat:xk2qbx244fczffdbycnfaaedv4

A Full-stack Accelerator Search Technique for Vision Applications [article]

Dan Zhang, Safeen Huda, Ebrahim Songhori, Quoc Le, Anna Goldie, Azalia Mirhoseini
2021 arXiv   pre-print
, software scheduling, and compiler passes such as operation fusion and tensor padding.  ...  When evaluated on EfficientNet, ResNet50v2, and OCR inference performance relative to a TPU-v3, designs generated by FAST optimized for single workloads can improve Perf/TDP (peak power) by over 6x in  ...  PE systolic arrays perform a matrix-vector multiply each cycle. Vector and scalar PEs can be modeled by setting systolic array X and/or Y dims to 1.  ... 
arXiv:2105.12842v1 fatcat:mtunvjdcdrcr5pc5bpyfye7mea

A Fused Hybrid Floating-Point and Fixed-Point Dot-Product for FPGAs [chapter]

Antonio Roldao Lopes, George A. Constantinides
2010 Lecture Notes in Computer Science  
Results using a high-end Xilinx FPGA and an order 150 dot-product demonstrate that, for equivalent accuracy metrics, it is possible to utilize 3.8 times fewer resources, operate at 1.62 times faster clock  ...  In this paper we present a dot-product implementation which operates using a hybrid floating-point and fixed-point number system.  ...  This operation is also a building block in other fundamental algebraic operations such as matrix-by-vector, and matrix-by-matrix multiplications.  ... 
doi:10.1007/978-3-642-12133-3_16 fatcat:f4jbk43ygfaspa7rfdfke5kmrq

Customizing wide-SIMD architectures for H.264

S. Seo, M. Woh, S. Mahlke, T. Mudge, S. Vijay, C. Chakrabarti
2009 2009 International Symposium on Systems, Architectures, Modeling, and Simulation  
operation support to increase the processing performance, and a fast programmable crossbar to support complex data permutation patterns.  ...  Several customized features have been added to improve the processing performance and lower the power consumption.  ...  Fused Operation Based on this analysis, we propose to fuse the frequently used instruction pairs.  ... 
doi:10.1109/icsamos.2009.5289229 dblp:conf/samos/SeoWMMVC09 fatcat:b7ur6xpzwfbuxe4ovjbimfr53i

Custom FPGA-based soft-processors for sparse graph acceleration

Nachiket Kapre
2015 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)  
We develop the processor RTL using Vivado High-Level Synthesis and also provide an assembler and compilation flow to configure the processor instruction and data memories.  ...  FPGA-based soft processors customized for operations on sparse graphs can deliver significant performance improvements over conventional organizations (ARMv7 CPUs) for bulk synchronous sparse graph algorithms  ...  We use SpMV (streaming multiply-accumulate datapath) to quantify performance on our architecture.  ... 
doi:10.1109/asap.2015.7245698 dblp:conf/asap/Kapre15 fatcat:mqos2rxf4zdkxcsq2hf6q3xji4

Exploiting On-chip Memory Bandwidth in the VIRAM Compiler [chapter]

David Judd, Katherine Yelick, Christoforos Kozyrakis, David Martin, David Patterson
2001 Lecture Notes in Computer Science  
It combines vector processing with mixed logic and DRAM to achieve high performance with relatively low energy, area, and design complexity.  ...  Many architectural ideas that appear to be useful from a hardware standpoint fail to achieve wide acceptance due to lack of compiler support.  ...  We are also very grateful for the support provided by the Cray, Inc. compiler group in helping us use and modify their compiler.  ... 
doi:10.1007/3-540-44570-6_8 fatcat:yo6zqdknxfcabgvbf6ghsh5qye

A Customized Processor for Energy Efficient Scientific Computing

Ankit Sethia, Ganesh Dasika, Trevor Mudge, Scott Mahlke
2012 IEEE transactions on computers  
It is now possible to assemble a system that provides several TFLOPs of performance on scientific applications for the cost of a high-end laptop computer.  ...  To combat these challenges, this paper presents the PEPSC architecture, an architecture customized for the domain of data parallel dense matrix style scientific application where power efficiency is the  ...  This research was supported by the US National Science Foundation grant CNS-0964478 and ARM Ltd.  ... 
doi:10.1109/tc.2012.144 fatcat:6wb7y7femfftlh5geqsqbm37wy

Modified Booth Recoder for Efficient Add-Multiply Operator

Aparna V. Kale, Prof. Patil M. D.
2015 IJARCCE  
The fusion of the two operators results in the Fused Add-Multiply (FAM) operator.  ...  It consists of a recoding table which is used to minimize the partial products of the multiplier. An adder and the multiplier operator of the unit are combined to form a single add-multiply unit.  ...  Multipliers were introduced to perform the multiplication operation of the arithmetic circuits using add and shift operations.  ... 
doi:10.17148/ijarcce.2015.45115 fatcat:4wg7pbe2o5cz7ahuv6dyft24sm
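
As a rough illustration of how a recoding table reduces the number of partial products, the following sketch implements standard radix-4 (modified) Booth recoding of a two's-complement multiplier. It is not the fused add-multiply recoder described in the paper; the function names `booth_radix4_digits` and `booth_multiply` and the 8-bit default width are assumptions made for this example.

```python
def booth_radix4_digits(y: int, bits: int = 8) -> list[int]:
    """Recode a signed `bits`-wide multiplier into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2} and is derived from three overlapping
    bits: d_i = y[2i-1] + y[2i] - 2*y[2i+1], with y[-1] taken as 0.
    Digits are returned least significant first.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even for radix-4 recoding")
    if not -(1 << (bits - 1)) <= y < (1 << (bits - 1)):
        raise ValueError("multiplier does not fit in the given width")
    u = y & ((1 << bits) - 1)  # two's-complement bit pattern
    prev = 0                   # the implicit bit y[-1]
    digits = []
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, bits: int = 8) -> int:
    """Multiply x by y using one partial product per Booth digit of y."""
    digits = booth_radix4_digits(y, bits)
    return sum(d * x * 4**i for i, d in enumerate(digits))
```

An 8-bit multiplier yields four Booth digits, so only four partial products are summed instead of eight, and each partial product is one of 0, ±x, or ±2x, all of which are cheap to form in hardware with shifts and negation.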

AnySP

Mark Woh, Sangwon Seo, Scott Mahlke, Trevor Mudge, Chaitali Chakrabarti, Krisztian Flautner
2009 SIGARCH Computer Architecture News  
These three operating modes provide high throughput across varying application types.  ...  The current generation of devices employs a combination of general-purpose processors, digital signal processors, and hardwired accelerators to provide giga-operations-per-second performance on milliWatt  ...  We also thank the anonymous referees for their useful comments and suggestions.  ... 
doi:10.1145/1555815.1555773 fatcat:m3psv47xdbgvjcvfu2gzlqe5eu