### Software Helping Hardware Innovation

Michael O'Boyle, University of Edinburgh, EPSRC Senior Research Fellow

What are the key challenges in your area?

- exploiting potential of accelerators
- enabling all code to utilise any accelerator

Why should ARM care?

- enabling general acceleration opens up hardware innovation
- sell IP to customers knowing their customers' code will work

What are the opportunities for ARM?

- innovate, design and develop accelerator IP others will miss

How could ARM help?

- engage with research; ARM researchers + funded PhDs
- explore impact on accelerator roadmap

How does your work compare against rest of world?

- World-leading compiler group at Edinburgh
- Top conferences: PLDI, ASPLOS, HPCA, Micro, CGO, NeurIPS
- 3 Best paper awards: ACM GPCE20, HPCA21, ASPLOS21
- "highest ranked software" DARPA ERI SDH program

## Matching Hardware to Software - Hardware Defined Software (HDS)

Other

- Neural Architecture Search as Program Transformation Exploration
- Software Defined Hardware (SDH)

**Beyond Simple Acceleration**

### Hardware/software contract breaking down

### Technology trends mean

- Hardware is specialised or heterogeneous
- Great: up to 100,000x performance/energy gains

### No free lunch

- Software cannot fit on new hardware

### Heterogeneous crisis

- Hardware stalls as software cannot fit

Rethink the contract







### New Application/Legacy Code







# Taming the Hardware Zoo?





# Language Approach

[Figure labels: New Application/Legacy Code, User rewrites, Parallel Language, Write new compiler]

A universal parallel language + optimising compiler per ISA/platform + smart runtime/glue?







# DSL approach



Many specialised languages. Rewrite and hope it works on your (next) machine?



Good performance is hard to get even with well defined parallel language CUDA/OpenCL

### GPU-Accelerated Libraries

GPU-accelerated libraries provide highly-optimized algorithms and functions you can incorporate into your applications, with minimal changes to your existing code. Many support drop-in compatibility to replace industry standard CPU-only libraries such as MKL, IPP, FFTW and widely-used libraries. Some also feature automatic multi-GPU performance scaling.

- **AmgX**: A simple path to accelerated core solvers, providing up to 10x acceleration in the computationally intense linear solver portion of simulations; very well suited for implicit unstructured methods.
- **nvGRAPH**: nvGRAPH Analytics Library is a GPU-accelerated graph analytics library.
- **cuDNN**: NVIDIA cuDNN is a GPU-accelerated library of primitives for deep neural networks; it is designed to be integrated into higher-level machine learning frameworks.
- **GIE**: NVIDIA GPU Inference Engine is a high performance neural network inference library for deep learning applications.
- **NPP**: NVIDIA Performance Primitives is a GPU-accelerated library with a very large collection of 1000's of image processing primitives and signal processing primitives.
- **cuFFT**: NVIDIA CUDA Fast Fourier Transform Library (cuFFT) provides a simple interface for computing FFTs up to 10x faster, without having to develop your own custom GPU FFT implementation.
- **IndeX Framework**: NVIDIA IndeX Framework is a real-time scalable visualization plug-in for ParaView.
- **FFmpeg**: FFmpeg is a popular open-source multi-media framework with a library of plugins that can be applied to various parts of the audio and video processing pipelines.
- **CHOLMOD**: GPU-accelerated CHOLMOD is part of the SuiteSparse linear algebra package by Prof. Tim Davis. SuiteSparse is used extensively throughout industry and academia.
- **cuSOLVER**: A collection of dense and sparse direct solvers which deliver significant acceleration for Computer Vision, CFD, Computational Chemistry, and Linear Optimization applications.
- **CULA Tools**: GPU-accelerated linear algebra library by EM Photonics that utilizes CUDA to dramatically improve the computation speed of sophisticated mathematics.
- **cuSPARSE**: NVIDIA CUDA Sparse (cuSPARSE) Matrix library provides a collection of basic linear algebra subroutines used for sparse matrices that delivers over 8x performance boost.
- **MAGMA**: A collection of next gen linear algebra routines. Designed for heterogeneous GPU-based architectures. Supports current LAPACK and BLAS standards.
- **Rogue Wave IMSL Fortran Numerical Library**: Developed by RogueWave, a comprehensive set of mathematical and statistical functions that offloads work to GPUs.
- **cuRAND**: The CUDA Random Number Generation library performs high quality GPU-accelerated random number generation (RNG) over 5x faster than typical CPU-only code.
- **CUDA Math Library**: An industry proven, highly accurate collection of standard mathematical functions, providing high performance on NVIDIA GPUs.
- **Thrust**: A powerful, open source library of parallel algorithms and data structures. Perform GPU-accelerated sort, scan, transform, and reductions with just a few lines of code.
- **NVBIO**: A GPU-accelerated C++ framework for High-Throughput Sequence Analysis for both short and long read alignment.
- **NVIDIA Video Codec SDK**: Accelerate video compression with the NVIDIA Video Codec SDK. This SDK includes documentation and code samples that illustrate how to use NVIDIA's NVENC and NVDEC hardware in GPUs to accelerate encode, decode, and transcode of H.264 and HEVC video formats.
- **Paralution**: Library for sparse iterative methods with special focus on multi-core and accelerator technology such as GPUs.
- **HiPLAR**: HiPLAR (High Performance Linear Algebra in R) delivers high performance linear algebra (LA) routines for the R platform for statistical computing using the latest software libraries for heterogeneous architectures.
- **Triton Ocean SDK**: Triton provides real-time visual simulation of the ocean and bodies of water for games, simulation, and training applications.
- **OpenCV**: OpenCV is the leading open source library for computer vision, image processing and machine learning, and now features GPU acceleration for real-time operation.
- **cuBLAS**: NVIDIA CUDA BLAS Library (cuBLAS) is a GPU-accelerated version of the complete standard BLAS library that delivers up to 17x faster performance than the latest MKL BLAS.
- **Geometry Performance Primitives (GPP)**: GPP is a computational geometry engine that is optimized for GPU acceleration, and can be used in advanced Graphical Information Systems (GIS), Electronic Design Automation (EDA), computer vision, and motion planning solutions.
- **ArrayFire**: Comprehensive, open source GPU function library. Includes functions for math, signal and image processing, statistics, and many more. Interfaces for C, C++, Java, R and Fortran.


### Rather than building a new optimising compiler for each platform

### Pick the best Library/API/DSL and FIT the code to it


Libraries/DSLs are the new ISA





- Interface nearer to algorithm
- Interface complex and changeable








### Detect code structures that match interface

### Use IO grey-box program synthesis
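One way to picture input/output-driven matching is to run a piece of legacy code and a set of candidate library calls on the same random inputs, and keep only a candidate whose outputs agree. The sketch below is an illustration under assumptions, not the tool behind these slides: `legacy_kernel`, `candidates` and `find_match` are hypothetical names, NumPy stands in for an accelerator library, and agreement on random inputs is evidence of equivalence rather than proof.

```python
import numpy as np

def legacy_kernel(a, b):
    """Hypothetical legacy code: a hand-written triple loop (a matrix multiply)."""
    n, k = a.shape
    m = b.shape[1]
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out

# Candidate library calls and argument bindings to try.
# NumPy is only a stand-in for an accelerator API here.
candidates = {
    "a * b (elementwise)": lambda a, b: a * b,
    "matmul(b, a)": lambda a, b: np.matmul(b, a),
    "matmul(a, b)": lambda a, b: np.matmul(a, b),
    "matmul(a.T, b)": lambda a, b: np.matmul(a.T, b),
}

def find_match(kernel, candidates, trials=5, size=4, seed=0):
    """Return the first candidate whose outputs agree with `kernel` on random inputs."""
    rng = np.random.default_rng(seed)
    tests = [(rng.random((size, size)), rng.random((size, size))) for _ in range(trials)]
    for name, call in candidates.items():
        if all(np.allclose(kernel(a, b), call(a, b)) for a, b in tests):
            return name
    return None

matched = find_match(legacy_kernel, candidates)
print(matched)
```

Square test matrices keep every candidate's output the same shape, so a mismatch shows up as differing values rather than a shape error. A real matcher would also need to search over argument shapes, strides and layouts.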





# It works





# Automatically matches accelerator libraries to legacy code

[Figure: speedup (x) per benchmark, including Parboil SGEMM]

### [ASPLOS18] [PACT19] [GPCE20] [PACT21]



### Automatically matching APIs frees up hardware creativity







### Space of Interesting Programs





### +Performance/Cost Estimate
















## Neural Architecture Search as Program Transformation Exploration

# How we deploy neural networks



Neural Architecture Search

[ASPLOS 2021 Distinguished paper]

### Optimising Compiler



# Unifying the optimisation steps





for i in range(10):
 for j in range(20):
  A[i][j] = i * j







# Very different networks loosely "equivalent"



# Can we find better equivalent networks?





Can we characterise loosely equivalent? Then mix with compiler transformations



# **Example optimisation target: convolution**



```
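import numpy as np

# Illustrative sizes; the slide leaves CI, CO, OH, OW, KH, KW symbolic.
CI, CO, OH, OW, KH, KW = 4, 4, 8, 8, 3, 3
I = np.random.rand(CI, OH + KH - 1, OW + KW - 1)  # input feature maps
W = np.random.rand(CI, CO, KH, KW)                # weights
O = np.zeros((CO, OH, OW))                        # output, reduced over ci

for ci in range(CI):
  for co in range(CO):
    for oh in range(OH):
      for ow in range(OW):
        for kh in range(KH):
          for kw in range(KW):
            # accumulate over input channels and the kernel window
            O[co][oh][ow] += W[ci][co][kh][kw] * I[ci][oh + kh][ow + kw]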
```

for ci in range(4):
 for co in range(4):
  spatial_conv(O, W, I, co, ci)

interchange (ci, co)
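The claim that interchanging the `ci` and `co` loops leaves the result unchanged can be checked numerically. The following is a minimal sketch under assumptions: the slides do not define `spatial_conv`, so the body below is a guess at a helper that runs the four inner spatial and kernel loops for one channel pair, and all sizes are illustrative.

```python
import numpy as np

def spatial_conv(O, W, I, co, ci):
    """Assumed helper: accumulate one (co, ci) channel pair over the spatial and kernel loops."""
    KH, KW = W.shape[2], W.shape[3]
    OH, OW = O.shape[1], O.shape[2]
    for oh in range(OH):
        for ow in range(OW):
            for kh in range(KH):
                for kw in range(KW):
                    O[co, oh, ow] += W[ci, co, kh, kw] * I[ci, oh + kh, ow + kw]

rng = np.random.default_rng(0)
CI = CO = 4
OH = OW = 5
KH = KW = 3
I = rng.random((CI, OH + KH - 1, OW + KW - 1))
W = rng.random((CI, CO, KH, KW))

# Original order: ci outermost.
O_ci_co = np.zeros((CO, OH, OW))
for ci in range(CI):
    for co in range(CO):
        spatial_conv(O_ci_co, W, I, co, ci)

# After interchange (ci, co): co outermost.
O_co_ci = np.zeros((CO, OH, OW))
for co in range(CO):
    for ci in range(CI):
        spatial_conv(O_co_ci, W, I, co, ci)

same = np.allclose(O_ci_co, O_co_ci)
print(same)
```

Interchange is legal here because every iteration only accumulates into `O`, so the order in which channel pairs are visited does not change the sums.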











## Transformations

### Bottlenecking

for co in range(4//B):
 for ci in range(4):
  spatial_conv(O, W, I, co, ci)

$T(c_o, J') = (c_o', J') \mid c_o' < C_o/B$

### Grouping

$T(c_o, c_i, J'') = (g, c_o/G, c_i/G, J'')$
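Both transformations can be read as changes to which channel pairs the loop nest visits. The sketch below counts those pairs for the 4x4 example; it assumes `B = 2` and `G = 2`, and the way grouped iterators are mapped back to global channel indices (`g * (Co // G) + co`) is an assumption made for illustration. Each pair stands for one call to `spatial_conv`.

```python
def original(Co, Ci):
    """Channel pairs visited by the untransformed loop nest."""
    return [(co, ci) for co in range(Co) for ci in range(Ci)]

def bottleneck(Co, Ci, B):
    """Bottlenecking: restrict the output-channel iterator to co' < Co/B."""
    return [(co, ci) for co in range(Co // B) for ci in range(Ci)]

def group(Co, Ci, G):
    """Grouping: split both channel iterators into G groups; pairs never cross groups."""
    pairs = []
    for g in range(G):
        for co in range(Co // G):
            for ci in range(Ci // G):
                pairs.append((g * (Co // G) + co, g * (Ci // G) + ci))
    return pairs

n_original = len(original(4, 4))
n_bottleneck = len(bottleneck(4, 4, B=2))
n_group = len(group(4, 4, G=2))
print(n_original, n_bottleneck, n_group)
```

Unlike interchange, neither transformation preserves the computed function: both drop work, which is why they belong to the network-design side of the search.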

## **Transformation space**

- Compiler optimisations: interchange, tile, group, unroll, prefetch, split, fuse
- Network optimisations: bottleneck

Can mix and match to give new convolutions



# Example: unrolled group convolution

- "group" the remainder

T: $[Co, Ci, H, W, Kh, Kw] \rightarrow [H, W, Co, Ci, Kh, Kw] \rightarrow [H(b), W, Co, Ci, Kh, Kw] \rightarrow [W, H(b), Co, Ci, Kh, Kw] \rightarrow [W(b), H(b), Co, Ci, Kh, Kw] \rightarrow [Co, Ci, H(b), W(b), Kh, Kw]$

Spatial bottlenecking is bottlenecking plus interchange
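The chain of loop orders above can be replayed as a sequence of two primitive steps. This is a sketch under a modelling assumption that is not stated on the slide: bottlenecking is only ever applied to the outermost iterator, so interchange is used to bring `H` and then `W` to the front. The names `interchange`, `bottleneck_outermost` and `show` are hypothetical.

```python
def interchange(nest, order):
    """Reorder the loop nest; `order` lists iterator names outermost first."""
    by_name = {name: (name, b) for name, b in nest}
    assert sorted(order) == sorted(by_name), "interchange must be a permutation"
    return [by_name[name] for name in order]

def bottleneck_outermost(nest):
    """Mark the outermost iterator as bottlenecked (its range is shrunk)."""
    name, _ = nest[0]
    return [(name, True)] + nest[1:]

def show(nest):
    """Render the nest in the slide's notation, e.g. [H(b), W, Co, Ci, Kh, Kw]."""
    return "[" + ", ".join(n + ("(b)" if b else "") for n, b in nest) + "]"

nest = [(n, False) for n in ["Co", "Ci", "H", "W", "Kh", "Kw"]]
steps = [show(nest)]
for op in [
    lambda n: interchange(n, ["H", "W", "Co", "Ci", "Kh", "Kw"]),
    bottleneck_outermost,                                   # bottleneck H
    lambda n: interchange(n, ["W", "H", "Co", "Ci", "Kh", "Kw"]),
    bottleneck_outermost,                                   # bottleneck W
    lambda n: interchange(n, ["Co", "Ci", "H", "W", "Kh", "Kw"]),
]:
    nest = op(nest)
    steps.append(show(nest))

print(" -> ".join(steps))
```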



# **CIFAR-10:** average 4x speedup over best





[Figure: panels (a) ResNet-34 and (c) DenseNet-161; platform label mGPU; bars TVM, NAS, Ours]

- TVM: compiling the original network with TVM
- NAS: NAS-compression, compiled
- Ours: TVM with our additional transformations as options



# ImageNet: order of magnitude improvement

Ported to Transmuter (see next topic)

DARPA workloads:

- reduce execution time by 80.6%
- reduce energy by 79.4%

## Software Defined Hardware (SDH)

# Transmuter: Software Defined Hardware

### DARPA

- ARM, U Michigan, ASU, Edinburgh
- Python Stack, NumPy/SciPy acceleration
- Software monitors hardware and reconfigures

Coarse Grain Reconfigurable Multi-core

- **ARM M-Class cores**
- Fast reconfiguration
- Reconfigure cache/scratchpad, interconnect

### **Prodigy**

Software-assisted prefetcher (HPCA21 Best Paper)

### **SparseAdapt**

Runtime reconfiguration (Micro21)







## **Prodigy Operation**

## Irregular Memory Accesses

[Figure labels. Software: Program (`int bfs()`), Compiler Analysis, Annotated (`regNode(...)`), DIG. Hardware: Programmable Prefetching Hardware, Ctrl, Mem]

# **Effect on Performance**

[Figure: speedup vs. no prefetching]

- On average, 2.6x speedup compared to no prefetching
- Reduction in DRAM stalls by 80.3% and branch stalls by 65.3%






# SparseAdapt: Hardware Reconfiguration

Model learns mapping $f: X \rightarrow Y$

- X: the space of performance counters
- Y: best micro-architectural configurations
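A mapping from performance counters to configurations can be sketched with any classifier. The slides do not say which model is used, so the example below substitutes a 1-nearest-neighbour lookup purely as a stand-in; the counter values, the two counter dimensions and the configuration names `"cache"` and `"scratchpad"` are all made up for illustration.

```python
import numpy as np

# Made-up training data: each row of X_train is a vector of performance counters
# (say, cache miss rate and memory bandwidth utilisation); Y_train holds the label
# of the configuration that was best for that sample. Illustrative values only.
X_train = np.array([[0.05, 0.20], [0.10, 0.25], [0.60, 0.80], [0.70, 0.90]])
Y_train = ["cache", "cache", "scratchpad", "scratchpad"]

def predict_config(counters):
    """f: X -> Y, here a 1-nearest-neighbour lookup standing in for the learned model."""
    distances = np.linalg.norm(X_train - np.asarray(counters), axis=1)
    return Y_train[int(np.argmin(distances))]

def run(counter_trace):
    """Monitor counters each epoch and reconfigure only when the prediction changes."""
    current, reconfigurations = None, 0
    for counters in counter_trace:
        best = predict_config(counters)
        if best != current:
            current, reconfigurations = best, reconfigurations + 1
    return current, reconfigurations

final_config, n_reconfig = run([[0.06, 0.22], [0.08, 0.21], [0.65, 0.85], [0.66, 0.88]])
print(final_config, n_reconfig)
```

Reconfiguring only when the predicted configuration changes keeps the number of switches low, which matters when each switch has a cost.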

### 2.9x energy-efficiency







## **Beyond Simple Acceleration**


# **Big-step Acceleration: FFT**

Matching complex accelerators is challenging

- Behaviour unlikely to match user code
- FFT acceleration a good example

Need to bridge gap

- Applied to GitHub code
- Significant speedups

See Jackson Woodruff presentation





# **Beyond fixed function: Neural Compilation**

Significant accelerators will be programmable

- Likely to have a specialised programming language

Can we learn how to translate existing code into the new language?

- Proof of concept: learn C -> x86
- Used Transformer model

Surprising results!





```
triangle_sum:
.LFB0:
 .cfi_startproc
 pushq %rbp
 .cfi_def_cfa_offset 16
 .cfi_offset 6, -16
 movq %rsp, %rbp
 .cfi_def_cfa_register 6
 movl %edi, -20(%rbp)
 movl $0, -12(%rbp)
 movl $1, -8(%rbp)
 jmp .L2
.L5:
 movl $1, -4(%rbp)
 jmp .L3
.L4:
 movl -4(%rbp), %eax
 addl %eax, -12(%rbp)
 addl $1, -4(%rbp)
.L3:
 movl -4(%rbp), %eax
 cmpl -8(%rbp), %eax
 jl .L4
 addl $1, -8(%rbp)
.L2:
 movl -8(%rbp), %eax
 cmpl -20(%rbp), %eax
 jl .L5
 movl -12(%rbp), %eax
 popq %rbp
 .cfi_def_cfa 7, 8
 ret
 .cfi_endproc
```
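For readers less used to x86, the listing can be read back as two nested counting loops. The Python below is a hand translation offered as a reading aid, not the C source used in the experiment (the slides do not show it); the variable names `total`, `i` and `j` are assumptions standing for the stack slots at offsets -12, -8 and -4.

```python
def triangle_sum(n):
    """Hand translation of the listing; n arrives in %edi and is stored at -20(%rbp)."""
    total = 0           # slot -12 initialised to 0
    i = 1               # slot -8 initialised to 1
    while i < n:        # .L2: compare i with n, jump to .L5 if less
        j = 1           # .L5: slot -4 initialised to 1
        while j < i:    # .L3: compare j with i, jump to .L4 if less
            total += j  # .L4: add j into total
            j += 1      #      increment j
        i += 1          # increment i after the inner loop
    return total        # total is loaded into %eax and returned

result = triangle_sum(5)
print(result)
```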



# Conclusion

Matching Hardware to Software - enables hardware innovation

Expressing NAS as program transformation - generates new designs

Software can help hardware improve performance

- prefetching and reconfiguration

Going beyond simple acceleration requires new approaches









