## Response Time Analysis of Dataflow Applications on a Many-Core Processor with Shared-Memory and Network-on-Chip

Matthieu Moy

LIP (Univ. Lyon 1)

November 2018

## CASH: Topics - People

Optimized (software/hardware) compilation for HPC software with data-intensive computations.

 $\rightsquigarrow$  Means: dataflow IR, static analyses, optimisations, simulation.



Christophe Alias, Laure Gonnord, Ludovic Henrio, Matthieu Moy http://www.ens-lyon.fr/LIP/CASH/

#### Outline

1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
- 3 Models Definition
- 4 Interferences and NoC Communications

#### 5 Evaluation

6 Conclusion and Future Work

## Outline

#### 1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
- 3 Models Definition
- 4 Interferences and NoC Communications
- 5 Evaluation
- 6 Conclusion and Future Work

## Time-critical, compute intensive applications





- Hard Real-Time: we must guarantee that task execution completes before deadline
- Compute-intensive
- Space/power bounded

## Performance Vs Predictability





# Many-core

## Lots of simple cores

Kalray MPPA (Massively Parallel Processor Array):

- 256 cores
- No cache consistency
- No out-of-order execution
- No branch prediction
- No timing anomaly
- Predictable NoC

## $\Rightarrow$ good fit for real-time?

#### Kalray's business model

1 A https://www.kalrayinc.com



#### Press releases

Sept 18 Kalray unveils artificial intelligence capabilities for autonomous vehicles based on Baidu's Apollo open platform July 9 Inside Secure Technology Chosen to Secure Kalray's Intelligent Processors for Autonomous Vehicles and Next-Generation Data Centers June 25 Kalray unveils its certified intelligent NVME-oF solutions with server and storage leader AIC at ISC 2018 June 7 Kalray has raised €43.5M: the most significant IPO since Europeyt Growth

#### Upcoming events

7 SOPHLA 2018 · SPRINGBOARD FOR NOV ARTIFICIAL INTELLIGENCE 12 SC18 NOV 5 NVME DEVELOPER DAYS DEC 7 EMBEDDED-SEC18 DEC

#### Latest tweets



#IA : Les 15 pépites en #Europe l'intérêt des fonds d'investisse 4 dans le lot : @Actility @O @dataiku et @Kalravinc https:// /NXpvtqXM42

7 / 28



#### High-level Data-Flow Application Model

Synchronous hypothesis: computation/communication in 0-time



| <br><u> </u> | <br>57 |
|--------------|--------|

#### High-level Data-Flow Application Model

Synchronous hypothesis: computation/communication in 0-time



#### High-level Data-Flow Application Model

Synchronous hypothesis: computation/communication in 0-time





→ Take into account all levels in Worst-Case Execution Time (WCET) analysis and programming model

## Outline

#### 1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
  - 3 Models Definition
  - 4 Interferences and NoC Communications
  - 5 Evaluation
- 6 Conclusion and Future Work

#### **Execution of Synchronous Data Flow Programs**



High level representation

Industrialized as SCADE (1993)

heavily used in avionics and nuclear

#### **Execution of Synchronous Data Flow Programs**



Multi/Many-core code generation



static non-preemptive scheduling

High level representation



## Parallel code generation from Lustre/SCADE (pseudo-code)



## Contribution

- Previous work:
  - Predictable execution model within each cluster
  - Mathematical model of arbitration for memory accesses
  - Algorithm to compute a time-triggered schedule (fix-point resolution)
- This talk:
  - Multi-cluster application
  - Time-triggered schedule taking Network on Chip (NoC) accesses into account

## Outline

1 Critical, Real-Time and Many-Core

2 Parallel code generation and analysis

#### 3 Models Definition

4 Interferences and NoC Communications

#### 5 Evaluation

6 Conclusion and Future Work

## Architecture Model



- Kalray MPPA 256 Bostan
- 16 compute clusters + 4 I/O clusters
- Dual NoC (Network on Chip)

#### Architecture Model



#### Per cluster:

- 16 cores + 1 Resource Manager
- NoC Tx, NoC Rx, Debug Unit
- 16 shared memory banks (total size: 2 MB)

#### Architecture Model





#### Per cluster:

- 16 cores + 1 Resource Manager
- NoC Tx, NoC Rx, Debug Unit
- 16 shared memory banks (total size: 2 MB)
- Multi-level bus arbiter per memory bank







- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks

- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank



- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks

- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank





- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks

- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank





- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks

- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank

## **NoC Communications**



## **NoC Communications**



#### Steps:

- 1 Read from memory
- 2 Write to TX's buffer
- 3 Start NoC transfer
- 4 Data transmission through the NoC
- 5 Write to memory

## **NoC Communications**



#### Steps:

- 1 Read from memory
- 2 Write to TX's buffer
- 3 Start NoC transfer
- 4 Data transmission through the NoC
- 5 Write to memory

#### Interference:

- Same as other reads
- 2, 3 One TX channel per sender
  - $\Rightarrow$  independent accesses.
- Interferences in each router
  - $\rightarrow$  network calculus
- 5 High-priority interference  $\Rightarrow$   $\land$



- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order

• For each task 
$$\tau_i$$
:  
WCRT<sub>i</sub> = WCET<sub>i</sub> +  $\sum_{j \neq i}$  interference<sub>i,j</sub>



- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- For each task  $\tau_i$ : WCRT<sub>i</sub> = WCET<sub>i</sub> +  $\sum_{j \neq i}$  interference<sub>i,j</sub>





- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- For each task  $\tau_i$ : WCRT<sub>i</sub> = WCET<sub>i</sub> +  $\sum_{j \neq i}$  interference<sub>i,j</sub>





- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- For each task  $\tau_i$ : WCRT<sub>i</sub> = WCET<sub>i</sub> +  $\sum_{j \neq i}$  interference<sub>i,j</sub>





- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- For each task  $\tau_i$ : WCRT<sub>i</sub> = WCET<sub>i</sub> +  $\sum_{j \neq i}$  interference<sub>i,j</sub>





- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- For each task  $\tau_i$ : WCRT<sub>i</sub> = WCET<sub>i</sub> +  $\sum_{j \neq i}$  interference<sub>i,j</sub>



# Outline

1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
- 3 Models Definition

#### 4 Interferences and NoC Communications

#### 5 Evaluation

6 Conclusion and Future Work

# Reminder: NoC Communications



#### Interference:

- Same as other reads
- 2, 3 One TX channel per sender
  - $\Rightarrow$  independent accesses.
- 4 Interferences in each router
  - $\rightarrow$  network calculus
- 5 High-priority interference  $\Rightarrow$   $\Lambda$

# **Reminder: NoC Communications**



#### Interference:

- Same as other reads
- 2, 🖪 One TX channel per sender
  - $\Rightarrow$  independent accesses.
- 4 Interferences in each router
  - $\rightarrow$  network calculus
- 5 High-priority interference  $\Rightarrow$   $\Lambda$

#### lssue:

Predict the possible execution time of 5 as precisely as possible.

Issue 1: overapproximation of RX execution interval



Issue 1: overapproximation of RX execution interval



Issue 1: overapproximation of RX execution interval



Issue 1: overapproximation of RX execution interval



Issue 2: circular dependency



Issue 2: circular dependency



- EOT = only one control register access
- Preload code to avoid instruction cache miss

## 3-phase Tasks Analysis

- Compute:
  - Fits in previous work model
- Copy to TX:
  - Force non-interfering schedule (add artificial dependencies if needed)
- Start NoC transfer (EOT):
  - No interference
- On the RX side:
  - RX can only start after "Start NoC transfer" has started
    ⇒ edge from "Copy to TX" to "RX" in the task dependency graph.

# Outline

1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
- 3 Models Definition
- 4 Interferences and NoC Communications

#### 5 Evaluation

6 Conclusion and Future Work

## Example Application: Naive Schedule



## Example Application: Improved Schedule



| Task             | WCRT  |          | Release date |          |
|------------------|-------|----------|--------------|----------|
|                  | Naive | Improved | Naive        | Improved |
| Application      | 680   | 650      |              |          |
| RX1              | 230   | 120      | 0            | 220      |
| RX <sub>2</sub>  | 580   | 350      | 0            | 200      |
| n <sub>3</sub>   | 180   | 160      | 0            | 0        |
| n <sub>4</sub>   | 120   | 120      | 180          | 160      |
| n <sub>2</sub>   | 100   | 100      | 10           | 0        |
| n <sub>5</sub>   | 100   | 100      | 580          | 550      |
| n <sub>8</sub>   | 100   | 100      | 330          | 340      |
| n <sub>1</sub>   | 110   | 110      | 0            | 0        |
| n <sub>6</sub>   | 100   | 100      | 0            | 0        |
| n <sub>7</sub>   | 100   | 100      | 100          | 100      |
| n <sub>9</sub>   | 100   | 100      | 220          | 220      |
| n <sub>10</sub>  | 100   | 100      | 240          | 240      |
| Cpy <sub>1</sub> | 110   | 110      | 110          | 110      |
| Cpy <sub>2</sub> | 100   | 100      | 100          | 100      |
| EOT <sub>1</sub> | 20    | 20       | 220          | 220      |
| EOT <sub>2</sub> | 20    | 20       | 200          | 200      |





#### Improved:





# Outline

1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
- 3 Models Definition
- 4 Interferences and NoC Communications

#### 5 Evaluation

#### 6 Conclusion and Future Work

## **Conclusion and Future Work**

- Code generation and real-time analysis for many-core (Kalray MPPA 256)
  major challenge for industry and research
- Hard Real-Time  $\Rightarrow$  simplicity, predictability  $\Rightarrow$  static, time-driven schedule
- Critical  $\Rightarrow$  traceability  $\Rightarrow$  no aggressive optimization

Our work:

- Understand and model the precise architecture of MPPA
- Extension of Multi-Core Response Time Analysis framework
- Integration analysis ↔ code generation
- Future Work:
  - Model Kalray MPPA3 chip (new NoC, new arbiters)
  - Improve the static scheduling algorithm:  $O(n^4)$  currently, we can do better.
  - Integration with RTOS?

### BACKUP

Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10



<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10$  (response time in isolation)

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10 \text{ (response time in isolation)}$  $R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$ 

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10$  (response time in isolation)

 $R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$ 

 $R_2 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 = 80$ 

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10$  (response time in isolation)

 $R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$ 

 $R_2 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 = 80$ 

 $R_3 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 + 0 = 80$  (fixed-point)

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

### The Global Picture

