# Revolutionizing Mobile and Cloud via Coherence

Vijay Nagarajan



#### Nicolai Oswald



#### Theo Olausson



#### Boris Grot





#### Adarsh Patil



#### Mahesh Dananjaya





#### Antonis Katsarakis



#### Vasilis Gavrielatos



#### Dan Sorin





**Tobias Grosser** 







Figure: protocol state diagram (states shown: Shared, Invalid)









#### Concurrency!















#### Hierarchy!









# Existing approach and its limitations

- Suppose one wants to build a multiprocessor SoC
  - Read ~100-500 page prose document (e.g., CHI).
  - Implement protocol by hand in Verilog



Copyright © 2014, 2017-2020 Arm Limited or its affiliates. All rights reserved. ARM IHI 0050E.a (ID081920)



# Existing approach and its limitations

- Suppose one wants to build a multiprocessor SoC
  - Read ~100-300 page prose document (TileLink, CHI).
  - Implement protocol by hand in Verilog
- Limitations
  - Prose = Imprecise
  - Non-exhaustive and conservative
  - Only MOESI














## Outline

- Background and Motivation
- Concurrency: ProtoGen
- Hierarchy: HieraGen
- Heterogeneity: HeteroGen
- Coherence for the cloud



## Cache Coherence



• SWMR: single-writer, multiple-reader invariant
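The SWMR invariant can be phrased as a simple check over the per-cache states of one line. The sketch below is a minimal illustration, assuming three hypothetical state names (`"M"` for read-write, `"S"` for read-only, `"I"` for no permission); the function name `swmr_holds` is likewise an assumption made for this example.

```python
from typing import Mapping


def swmr_holds(states: Mapping[str, str]) -> bool:
    """Return True if the states of ONE cache line satisfy SWMR.

    `states` maps a cache name to its state for that line.
    Assumed states: "M" (read-write), "S" (read-only), "I" (invalid).
    """
    writers = sum(1 for s in states.values() if s == "M")
    readers = sum(1 for s in states.values() if s == "S")
    # Either readers only (possibly none), or one writer and no readers.
    return writers == 0 or (writers == 1 and readers == 0)


print(swmr_holds({"c0": "M", "c1": "I"}))  # True: single writer
print(swmr_holds({"c0": "S", "c1": "S"}))  # True: multiple readers
print(swmr_holds({"c0": "M", "c1": "S"}))  # False: writer and reader coexist
```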



## Consistency-directed Cache Coherence



THE UNIVERSITY of EDINBURGH

- "Cache coherence protocols for distributed shared memory multiprocessors are notoriously difficult to design" [ICFS 1996]
- "... difficult to design and implement correctly" [ASPLOS 2017]
- "The coherence problem is difficult ... requires coordinating events across ..." [IEEE Concurrency 2000]
- "... protocols are notoriously difficult to design and veri..." [... Memory Systems, 2004]
- "... designing and verifying a new hardware coherence ... difficult" [... Interface for Efficient Heterogeneous ...]



## Bugs in the Wild

| No. | Errata Description |
|-----|--------------------|
| 63  | TLB Flush Filter Causes Coherency Problem in Multiprocessor Systems |
| 69  | Multiprocessor Coherency Problem with Hardware Prefetch Mechanism |
| 70  | Microcode Patch Loading in 64-bit Mode Fails To Use EDX |

Erratum 63 (excerpt, partly obscured on the slide):

- Description: If the TLB flush filter is enabled in a multi... between the page tables in memory and the translations in ... after software ...
- Unpredictable system failure.
- Suggested Workaround: In MP systems, disable the TLB flush filter by setting HWCR.FFDIS (bit 6 of MSR 0xC001_0015).
- Fix Planned: Yes

From AnandTech: "... coherency was broken and manually disabled on the Galaxy S 4. The implications are serious from a power consumption (and performance) standpoint."



# Why is Coherence Hard?

- Concurrency
- Hierarchy
- Heterogeneity






### physical atomicity





### Atomic S to M Transition





# **Transient States**

S → $\mathsf{SM}^{\mathsf{AD}}$ on store / send GetM; $\mathsf{SM}^{\mathsf{AD}}$ → M on recv Ack

#### non-atomic transaction
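The non-atomic S-to-M upgrade can be sketched as a tiny state machine: the store does not complete immediately, so the line waits in a transient state until the acknowledgement arrives. This is a minimal sketch, assuming a single `Ack` is enough to complete the upgrade; the class name `CacheLine` and the spelling `"SM_AD"` are hypothetical names for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CacheLine:
    state: str = "S"                               # start in stable state S
    sent: List[str] = field(default_factory=list)  # messages sent so far

    def store(self) -> None:
        """Processor store: begins the S -> M transaction."""
        if self.state == "S":
            self.sent.append("GetM")   # ask for write permission
            self.state = "SM_AD"       # transient: waiting for the Ack
        elif self.state != "M":
            raise RuntimeError(f"store not modelled in state {self.state}")

    def recv(self, msg: str) -> None:
        """Incoming message: completes the transaction on Ack."""
        if self.state == "SM_AD" and msg == "Ack":
            self.state = "M"
        else:
            raise RuntimeError(f"unexpected {msg} in state {self.state}")


line = CacheLine()
line.store()
print(line.state)  # SM_AD
line.recv("Ack")
print(line.state)  # M
```

Between `store()` and `recv("Ack")` the line is in neither stable state, which is the window in which other transactions can interleave.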



#### **Concurrent Transactions**








#### non-atomic transactions + concurrency = complexity



### To Summarize...

- Stable state protocols assume physically atomic transactions
- Need to support concurrency for performance
- Transient states required to provide logically atomic transactions



### Key realization...

- Stable state protocol is a sequential specification
- The final protocol is a non-blocking concurrent implementation
- Transient states are synchronization operations
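To make the "sequential specification" reading concrete, the sketch below models a stable-state protocol in which every transaction is applied in one indivisible step, so no transient states are needed. It is a toy MSI-like model written for this example; the function name `apply_atomically` and the state names are assumptions.

```python
from typing import Dict


def apply_atomically(states: Dict[str, str], cache: str, op: str) -> Dict[str, str]:
    """Apply one load/store as a single atomic step and return the new states."""
    new = dict(states)
    if op == "load":
        if new[cache] == "I":
            # Downgrade a current writer (if any), then become a sharer.
            for c, s in new.items():
                if s == "M":
                    new[c] = "S"
            new[cache] = "S"
    elif op == "store":
        # Invalidate every other copy, then take write permission.
        for c in new:
            if c != cache:
                new[c] = "I"
        new[cache] = "M"
    else:
        raise ValueError(f"unknown operation: {op}")
    return new


s = {"c0": "I", "c1": "I"}
s = apply_atomically(s, "c0", "store")
print(s)  # {'c0': 'M', 'c1': 'I'}
s = apply_atomically(s, "c1", "load")
print(s)  # {'c0': 'S', 'c1': 'S'}
```

A real implementation cannot update all caches in one step; the concurrent protocol has to produce the same outcomes as some sequence of these atomic steps.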





#### No wonder cache coherence protocols are Hard!



# Insight



# Demystifying Transient States

How do transient states provide logical atomicity?

- Convey directory serialization order to caches
- Transient states ensure that caches obey this order
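The two points above can be illustrated with a toy directory that serializes racing write requests and tells the affected caches about it through forwarded messages. This is a sketch under simplifying assumptions (one line, write requests only, no data or acknowledgement messages); `Directory`, `get_m`, and `infer_order` are hypothetical names.

```python
from typing import List, Optional, Set, Tuple

Msg = Tuple[str, str]  # (message kind, cache whose request caused it)


class Directory:
    def __init__(self, sharers: Set[str]) -> None:
        self.owner: Optional[str] = None
        self.sharers = set(sharers)
        self.order: List[str] = []  # serialization order of GetM requests

    def get_m(self, requester: str) -> List[Tuple[str, Msg]]:
        """Serialize one GetM and return the (destination, message) pairs sent."""
        self.order.append(requester)
        out: List[Tuple[str, Msg]] = []
        if self.owner is not None:
            out.append((self.owner, ("Fwd-GetM", requester)))
        for s in sorted(self.sharers - {requester}):
            out.append((s, ("Inv", requester)))
        self.owner, self.sharers = requester, set()
        return out


def infer_order(msg: Msg) -> Tuple[str, str]:
    """What a cache with a pending GetM learns from an incoming message."""
    kind, other = msg
    if kind == "Inv":
        # I was still a sharer when the directory handled `other`.
        return (other, "me")
    if kind == "Fwd-GetM":
        # The directory already made me owner before handling `other`.
        return ("me", other)
    raise ValueError(kind)


d = Directory({"c0", "c1"})
print(d.get_m("c0"))  # [('c1', ('Inv', 'c0'))]
print(d.get_m("c1"))  # [('c0', ('Fwd-GetM', 'c1'))]
print(d.order)        # ['c0', 'c1']
```

Here `c1` receives `Inv` while its own request is pending and concludes that `c0` was ordered first; `c0` receives `Fwd-GetM` and concludes the opposite, so both agree with the directory.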

#### ProtoGen automates by leveraging this insight!



### How does cache infer serialization order?





### How to resolve name conflicts?





#### Rename Messages





### ProtoGen Summary

- Infer serialization order from incoming messages
- Rename messages in order to achieve this
- React like in stable state
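One way to picture the three steps is the toy classifier below. It is an illustration written for this summary, not ProtoGen's actual algorithm: given a stable-state table, it asks whether a message arriving mid-transaction belongs to the start state (the other request was ordered earlier) or to the end state (ordered later), and flags a name conflict when both states handle the same message name. All names (`classify`, `AMBIGUOUS`, `RENAMED`) are hypothetical.

```python
from typing import Dict, Tuple

# (stable state, incoming message) -> (next state, response)
Table = Dict[Tuple[str, str], Tuple[str, str]]

# Both S and M react to "Inv": a cache moving S -> M cannot tell which applies.
AMBIGUOUS: Table = {
    ("S", "Inv"): ("I", "Inv-Ack"),
    ("M", "Inv"): ("I", "Data"),
}

# After renaming, the message sent to an owner has its own name.
RENAMED: Table = {
    ("S", "Inv"): ("I", "Inv-Ack"),
    ("M", "Fwd-GetM"): ("I", "Data"),
}


def classify(table: Table, start: str, end: str, msg: str) -> str:
    """Classify `msg` received while moving from stable `start` to stable `end`."""
    in_start = (start, msg) in table
    in_end = (end, msg) in table
    if in_start and in_end:
        return "conflict"    # serialization order not inferable: rename
    if in_start:
        return "earlier"     # react as in the start state
    if in_end:
        return "later"       # finish own transaction, then react as in end state
    return "unexpected"


print(classify(AMBIGUOUS, "S", "M", "Inv"))      # conflict
print(classify(RENAMED, "S", "M", "Inv"))        # earlier
print(classify(RENAMED, "S", "M", "Fwd-GetM"))   # later
```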



### ProtoGen Tool



#### ProtoGen is as good as (or better than) manually generated protocols



\*ISCA'18, IEEE Top Picks Honourable mention




#### Hierarchical protocols





# The Complexity of Hierarchical protocols





# HieraGen\* tool flow





\*ISCA'20





#### HeteroGen

- How do you stitch together two different protocols?
  - HieraGen should work!

- What is the correctness condition?
  - MOESI-style protocols...SC
  - RC-style protocols...RC
  - Compound consistency models!



[under submission]

# Compound Consistency

- Correctness condition for heterogeneous coherence
- Foundation for heterogeneous consistency
- Each cluster can assume its own memory model














#### Datacentre Distributed datastores

In-memory with read/write API

Backbone of online services








# Distributed datastores

In-memory with read/write API

Backbone of online services

Need:

**Consistency (Programmability)** 

**High performance** 

Fault tolerance (Availability)










# The problem



Replication  $\Rightarrow$  Performance vs Consistency

Strong Consistency

Performance




# Existing Solution: Multiple Consistency Levels (MCL)

| Provider        | Datastore  |
|-----------------|------------|
| AWS             | Amazon DB  |
| Google          | App Engine |
| Yahoo!          | PNUTS      |
| Twitter         | Manhattan  |
| Microsoft Azure | Pileus     |

Figure (labels only): Strong/Weak, Strong/Weak Read, MCL Replicated KVS

What about programmability?



#### Datacentre Distributed datastore






#### Datacentre replication = Fast-path coherence + slow-path consensus?



## Coherence-inspired Hermes





\*ASPLOS'20, IEEE Top Picks Honourable mention

## Coherence-inspired Hermes



- Local reads
- Fast, concurrent writes
- Protocol reliable but blocking
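The flavour of coherence-style replication can be sketched as an invalidate-then-validate write with local reads. The code below is a single-process illustration under stated assumptions, not the Hermes protocol as published: messages are plain method calls, there are no failures, and the names `Replica`, `on_inv`, `on_val`, and `write` are hypothetical.

```python
from typing import Any, List, Tuple

Timestamp = Tuple[int, int]  # (version, writer id); writer id breaks ties


class Replica:
    def __init__(self) -> None:
        self.value: Any = None
        self.ts: Timestamp = (0, 0)
        self.valid = True

    def on_inv(self, ts: Timestamp, value: Any) -> bool:
        """Invalidate the key and buffer the new value; always acknowledge."""
        if ts > self.ts:  # a newer write wins over an older concurrent one
            self.ts, self.value, self.valid = ts, value, False
        return True

    def on_val(self, ts: Timestamp) -> None:
        """Re-validate only if this is still the latest write seen."""
        if ts == self.ts:
            self.valid = True

    def read(self) -> Any:
        """Local read: served without messages when the key is valid."""
        if not self.valid:
            raise RuntimeError("key invalidated; retry after validation")
        return self.value


def write(writer_id: int, coordinator: Replica,
          replicas: List[Replica], value: Any) -> Timestamp:
    """Coordinator-driven write; `replicas` includes the coordinator."""
    ts = (coordinator.ts[0] + 1, writer_id)
    acks = [r.on_inv(ts, value) for r in replicas]
    if all(acks):  # every replica has invalidated: safe to validate
        for r in replicas:
            r.on_val(ts)
    return ts


nodes = [Replica() for _ in range(3)]
write(1, nodes[0], nodes, "x")
print([n.read() for n in nodes])  # ['x', 'x', 'x']
```

Because a read fails while a key is invalidated, a write that never reaches its validation step leaves readers blocked, which is where the "blocking" concern on the slide comes from.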





#### Can we do the same for KVSes?







# A Replicated KVS with

- Release Consistency
- High Availability



Kite Replicated KVS



\*PPoPP'20, Best paper nomination

#### Rethinking Datacentre Memory





# Revolutionizing Mobile/Cloud via Coherence

- Raise abstraction of coherence protocol design and automate
  - Concurrency
  - Hierarchy
  - Heterogeneity
- Datacentre coherence is a great opportunity, but it needs a new family of high-performance, fault-tolerant coherence protocols.

