Skip to content

July log

root edited this page Aug 1, 2018 · 1 revision

2018 July 11th

-- Sachille

Things to note when accessing arrays in Seashell

func mmul(a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)) {

  for (let i = 0..3) {
    for (let j = 0..3) unroll 3 {
      c[i] := a[i*3+j] + b[i*3+j];
    }
  }

}

This example should fail because even though the unrolled version can be mapped on to hardware directly, unrolling would need some muxing. But that condition is not currently checked by the type system.

However, this example fails in the current system because in order to use indexing of i, it needs to be static as i is unrolled. (Pointed out by Ted in Slack) This can be seen with the tests run on vvadd, vvadd example. Compare that with 1st instance which fails.

Therefore, it is important to note the two indexing methods in seashell, which are

  1. physical indexing i.e. array c[static bank integer][variable or static integer bank index]
  2. logical indexing i.e. array c[logical index variable or static integer] cannot be used interchangeably in loops.

If a loop is not unrolled, the designer has to specify the bank. This is evident when arrays are banked. But this is necessary even when they are not. Therefore, note any array used in a regular loop needs static banking i.e. for a non banked array c[0][i].

If a loop is banked, programmer can still use banked arrays, which can lead to issues such as

for i = 1..n unroll K:
  A[0][0] := i

being type checked by seashell. (Pointed out by Adrian in Slack)

This issue is avoided by using unroll in the non unrolled arrays. Edited multaccess example

How to deal with multiple write access issues?

The original expectation of the previous example is to highlight this issue, where unrolling the inner loop will force hardware to access the same element multiple times as write, forcing HLS to put some muxes to select the last access to write. However, this is a non-issue for array reads, as a single element can be fanned out to multiple computes. We arrived at a decision to restrict only writes and allow such reads in this example.

2018 July 12th

Inconsistency between explicit and implicit access in terms of multiple read access

This approach does create in inconsistency between unrolled access and non-unrolled access, as non-unrolled bank access requires even reads to be single access as highlighted by Ted here, inconsistency. One motivation to maintain this inconsistency would be, programmer already aware of hardware duplication when using unroll, which allows implicit banking to add some more wires to propagate the read data. But in explicit access, programmer is aware of this multiple wires by using explicit local variables.

Why do we need nested loops?

This led to the question, is it worthwhile to support nested loops, as a programmer, albeit with some difficulty, should be able to write the same program with one loop structure. The second nested loop issue being, handling multiple loop unrolls. This would lead to complex expressions within array indices, but would make type checking a single loop easier.

But it turns out, this benefit in terms of simple loop would be lost when type checking the array indices. Type systems are on the dumb side and would benefit from having simpler indices to operate with.

How to deal with multi dimensional arrays in seashell

With nested loop unrolling thus handled with

  • multiple access
  • nested loops
  • nested loop unrolling

the next issue is to handle compute-reduction (?) operations and a neat way to handle multi-dimensional logical arrays to bring them down to the physical arrays seashell has.

Issue with operations like += is that these require two access to the same memory element both as read and as write in the same operation. In hardware terms, this requires two registers (memory elements). I also need to understand more about why the implicit commutative property of + which is no longer followed in a += be a concern here. These operations can be handled with parallel compute and ultimately reduce method operators.

Then, concerning multi-dimensional arrays, logical dimensionality is at play for applications such as matrix multiplication or convolution, but actual hardware arrays are 1D. However, hardware optimizations such as unrolling depend on this logical dimensionality as they dictate the access pattern and thus better locality. How this translation can be done is the question. It has some subtlety to it beyond that, which I need to refresh.

2018 July 16th

Array partitioning seems to need different types of partitioning

We have come to the conclusion we need nested loops and also that we need to support nested loop unrolling.

a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)

for (let i = 0..3) {
  for (let j = 0..3) unroll 3 {
    c[i] := a[i*3+j] + b[i*3+j];
  }
}

The current array partitioning we assume is interleaved partitioning, cyclical like this

0 3 6 1 4 7 2 5 8

, which works alright for unrolling consecutive elements in single loop unroll.

for (let j = 0..9) unroll 3 {
  c:= a[j];
}

is equivalent to

for (let j = 0..3) unroll 3 {
  c:= a[j*3+0];
  c:= a[j*3+1];
  c:= a[j*3+2];
}

But the type system doesn't currently support an unrolling of the following nature,

for (let j = 0..3) unroll 3 {
  c:= a[0*3+j];
  c:= a[1*3+j];
  c:= a[2*3+j];
}

which would need a banking in blocks like

0 1 2 3 4 5 6 7 8

Therefore, we need at least two types of banking (maybe 3 as HLS has, pg. 148 UG902 (v2018.2) June 6, 2018).

Furthermore, considering an example like

a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)

for (let i = 0..3) {
  for (let j = 0..3) unroll 3 {
    c[i] := a[i*3+j] + b[i*3+j];
  }
}

this has further impact. Notice the arrays need to be banked cyclically for this access, validating the type check.

However, if we change the code to,

a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)

for (let i = 0..3) unroll 3 {
  for (let j = 0..3) {
    c[i] := a[i*3+j] + b[i*3+j];
  }
}

or

a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)

for (let i = 0..3) {
  for (let j = 0..3) unroll 3 {
    c[i] := a[j*3+i] + b[j*3+i];
  }
}

we need a blockwise partitioning. It also shows we need a mechanism to figure out whether the block partitioning we use is accurate.

Need a coherent method to access logically multi-D arrays

Logically multi arrays are extremely useful to FPGA apps. Simple 1D apps are usually trivial enough for GP-CPUs to outperform FPGAs. But CPUs struggle when it comes to complex memory accesses. FPGAs, with the capability to bank memories as they please, allow exploiting spatial locality. Also, it is simpler to index an array as [i][j] than always be computing the correct 1D index such as [i*3+j].

However, this representation must be robust enough to handle arbitrary number of dimensions (i.e. map to hardware), and also maintain capabilities we gave for the 1D variant.

In order to capture these requirements, I am considering an example of a 3D array with partial reordering across two dimensions.

a: float[32] bank(4);

for (let n = 0..1) {
  for (let i = 0..3) unroll 2 {
    for (let j = 0..3) unroll 2 {
      access(a[ n*16 + i*4 + j ]);
    }
  }
}

The parameters here are

  • array size- p1

  • banks- p2

  • number of loop elements- p3, p5, p7

  • largest loop element- p9, p10, p11

  • unroll factors- p4, p6, p8

  • intermediate index multiplier- v

  • largest index multiplier- w

Array mapping is determined by p1,v, w and p4,p6,p8

Possible array definitions are,

# current version, declaring physical parameters
a: float [32] bank(4);

# new version with logical parameters
a: float [2]:[4]bank(2):[4]bank(2);

Additional parameters for the possible new version,

  • dimension sizes- d1,d2,d3
  • dimension banks- b1,b2,b3

We'll consider n to be the frames, therefore we have two frames each of size 16. These frames would look like,

n=0 n=1
0 1 2 3 0 1 2 3
4 5 6 7 4 5 6 7
8 9 10 11 8 9 10 11
12 13 14 15 12 13 14 15

Default mapping for the three loop program without unrolling would be,

00 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 01 11 21 31 41 51 61 71 81 91 101 111 121 131 141 151

Considering b3 only, we know it needs interleaving. To be more general, we need to partition into two in columns (j is the lowest loop, so interleaved by v/b3= 2), leading to a memory like this,

00 10 40 50 80 90 120 130 01 11 41 51 81 91 121 131 20 30 60 70 100 110 140 150 21 31 61 71 101 111 141 151

Considering b2 only, i.e. blockwise partitioning, we need to partition in terms of rows like (interleaved by w/b2= 8),

00 10 20 30 40 50 60 70 01 11 21 31 41 51 61 71 80 90 100 110 120 130 140 150 81 91 101 111 121 131 141 151

Adding both unroll factors lead to needing 4 banks, leading to a memory (first interleave by 8 to two banks, then interleave each by 2),

00 10 40 50 01 11 41 51 20 30 60 70 21 31 61 71 80 90 120 130 81 91 121 131 100 110 140 150 101 111 141 151

Now we have the physical mapping and the parameters to do that, 1st decision would be how to extract these parameters from user program.

  • b1, b2 and b3 are provided bank factors
  • v= (p7 or) d3
  • w= (p5*p7 or) d2*d3

For now, we feel the new version is clear as the programmer gets to explicitly say the banking they need.

Next decision we need is how to represent implicit and explicit array declarations with multiD arrays.

For the sake of visualization, this is how explicitly unrolled version would look like with a funky explicit bank in each dimension.

a: float[32] bank(4);

for (let n' = 0..1) {
  for (let i' = 0..1) {
    for (let j' = 0..1) {
      access( a [[0][n']] [[0][i']] [[0][j']] );
    }
  }
}

However, this is a horrible explicit representation,

  1. first of all it's not explicit, actual hardware has just 1D arrays as we illustrated earlier
  2. we need banks for each dimension as definition doesn't tell us which dimension to partition in
  3. it's super ugly and confusing

Alternate version with new array declaration is,

a: float [2]:[4]bank(2):[4]bank(2);  

for (let n' = 0..1) {
  for (let i' = 0..1) {
    for (let j' = 0..1) {
      access( a [[0]][n'*4 + i'*2 + j'] );
    }
  }
}

This is a better hardware representation, but very difficult for a programmer to figure out which bank and which element. But we would not want programmers to really use this explicit version, unless for really fine control. Therefore, this seems a good enough trade-off. (The above example uses [[..]] for banks, which is only a temporary symbol. Other suggestions for explicit banking include {..}{..} and why not {..}[..])

Therefore, we think maintaining this explicit access is prudent.

Concluding this note, it is of use to jot down translation from implicit access to physical access in HLS for C code generation and explicit access in seashell for type checking.

a: float [d1]:[d2]bank(b2):[d3]bank(b3); 

for (let n = 0..1) {
  for (let i = 0..3) unroll p6 {
    for (let j = 0..3) unroll p8 {
      access(a[n][i][j]);
    }
  }
}

Implicit access a[n][i][j] will be converted to,
physical access a[n*d3*d2 + i*d3 + j]
explicit access a{(i%b2)*b3 + j%b3}[n*(d3/b3)*(d2/b2) + (i/b2)*d3/b3 + j/b3] (need to verify)

  • I went through the example with the assumption b1=p4, b2=p6 and b3=p8. But it seems to me I needn't worry about the unroll factors but use the banking factor for partitioning and translation. Unrolling factor should just be 'safe' from a hardware perspective. Example is altered accordingly.

2018 July 18th

Examples where unroll factor can deviate from array partitioning

This probably needs more thought, but it should be valid to partition an array more than it's unroll factor, i.e. unrolling should probably be less than the bank factor as each unroll would need a dedicated RW port. A possible example for this,

a: float [8] bank(2); 
for (let i = 0..3) {
    access(a[i] + a[i+4]);
}

unrolling this would be independent of this access,

a: float [8] bank(8); 
for (let i = 0..3) unroll 4 {
    access(a[i] + a[i+4]);
}

A version like this could also be similarly mapped to hardware, but I need to check how smart the compiler can be to do it without any muxing,

a: float [10] bank(2); 
for (let i = 0..7) {
    access(a[i] + a[i+1]);
}

Note array a only has 9 elements but I'm specifying it larger to have a symmetric partition for now.

An attempt to humor an example with more unroll than banking required,

a: float [4], b: float [4]; 
for (let i = 0..7) unroll 2 {
  if (i%2==0)
    access(a[i/2]);
  else 
    access(b[i/2]);
}

This is a very bad example, for one we don't need an if condition here, and even then it is possible unroll 2 would reflect in a different array banking which will be needed to make this useful. Disregarding the fact that this is a bad program, this example is only to consider a valid design where unroll > banking.

Concerning muxes

Summarizing notes on slack from Adrian, the key idea of a MUX/DEMUX is to accept a dynamic signal as a select and use it to forward a selected data line (Multiplex) or forward data to a selected output (Demultiplex). A general representation of it, taken from the slack thread is, function mymux(i: int) { return mux(i) {expr1}{expr2}{expr3} }.

However, it is handy to come up with a version which could access an arbitrary memory bank, essentially in an expression like A[i] where A is partitioned in terms of i. We can write this as, (again from slack thread)

# writing in general representation
function Am(i: int) { return mux(i) {A[0]}{A[1]}{A[2]}{A[3]} }

# simplify with the prior knowledge of # of banks
function Am(i: int) { return mux(i) {A[i]} }

# simplifying this to a banked array template
mux n Am(A);

One thing to consider regardless of the general or array specific version is how to use this as a mux and demux. Adrian suggests muxes to be used both as 'lvalues' and 'rvalues'. lvalues and rvalues. This might result in following expressions being correcly identified as a mux or a demux.

a: float[10] bank(5);
mux 5 Am(A)
mux 5 Adm(A)

for (let i = 0..4) {
  for (let j = 0..1) {
# mux on A selects the correct input and send only that to c
    c = Am(A[i][j]);
    d = c*10 +7;
# demux on A takes the input and selects the correct output to go to
    Adm(A[i][j]) = d;
  }
}

Examples for muxes

We already considered one example of muxes with arbitrary access to banks. Another common example is if conditions.

Linear search

a: float[100] bank(5);
let float b = 20;
let out = 100;
for (let i = 0..99) {
  if(A[i]==b) { 
    out := i;
    i := 100;
  }
}

Binary search

a: float[100] bank(5);
let float b = 20;
let L = 0;
let R= 99;
let i= 0;
let out= 100;
while (L<=R){
  i:= (L+R)/2;
  if(A[i]<b)
    L:= i+1;
  else if(A[i]>b)
    R:= i-1;
  else
    out= i;
}

Multiple writes with lvalue and rvalue

After getting to know a type system can be aware of lvalues and rvalues, I'm curious if this can be used to treat memory reads and writes differently. Adrian said on slack, 'maybe the type of writable memories could be “consumed” from the context differently depending on whether it’s used as an lvaue or an rvalue'.

Examples to clarify the differences in multiple access

Assumption for this note is banking is only done in cyclic interleaving.
Let's consider the baseline valid design (we can access multiple banks in parallel),

a: float[9] bank(3); c: float[9] bank(3);

for (let i = 0..3) {
  c[0][i]:= a[0][i];  
  c[1][i]:= a[1][i];  
  c[2][i]:= a[2][i];  
}

We cannot access same bank in parallel due to resource limitation,
for read

a: float[9] bank(3); c: float[9] bank(3);

for (let i = 0..3) {
  c[0][i]:= a[i][0];  
  c[1][i]:= a[i][1];  
  c[2][i]:= a[i][2];  
}

and for write

a: float[9] bank(3); c: float[9] bank(3);

for (let i = 0..3) {
  c[i][0]:= a[0][i];  
  c[i][1]:= a[1][i];  
  c[i][2]:= a[2][i];  
}

We can access same location in a bank for read, it's just more fanout

a: float[9] bank(3); c: float[9] bank(3);

for (let i = 0..3) {
  c[0][i]:= a[i][0];  
  c[1][i]:= a[i][0];  
  c[2][i]:= a[i][0];  
}

or

a: float[9] bank(3); c: float[9] bank(3);

for (let i = 0..3) {
  c[0][i]:= a[0][i];  
  c[1][i]:= a[0][i];  
  c[2][i]:= a[0][i];  
}

But we cannot access the same element for write, because it is physically impossible to do that, (therefore, hardware would put some Muxes and pick which would go in)

a: float[9] bank(3); c: float[9] bank(3);

for (let i = 0..3) {
  c[i][0]:= a[0][i];  
  c[i][0]:= a[1][i];  
  c[i][0]:= a[2][i];  
}

or

a: float[9] bank(3); c: float[9] bank(3);

for (let i = 0..3) {
  c[0][i]:= a[0][i];  
  c[0][i]:= a[1][i];  
  c[0][i]:= a[2][i];  
}

2018 July 20th

Compute-reduce for Matmul

According to wikipedia reduction operation is reducing an array to a single element. It generally defines a reduction operator as an operator,

  • which can reduce an array to a single element
  • final result can be obtained from partial tasks created

with the operation following these requirements,

  • commutative- a # b = b # a
  • associative- (a # b) # c = a # (b # c)

Commutativity deals with the order of tasks and associativity deals with how to produce partial tasks.

This works fine with say vvadd,
C= A + B, as A + B = B + A

but becomes an issue for matrix multiplication
where A * B != B * A

The order of the tasks have to be preserved to get the correct result. (I'm not all clear about why this would matter in an HLS example, as order of operators are not changed anyway.)

Considering an example for HLS,

a: float[9] bank(3); 
let c= 0;

for (let i = 0..8) {
  c := a[i] + c;  
}

low complexity hardware would have one adder interating over a reading previous copy of c as a register.

Let's consider an unrolled version,

a: float[9] bank(3); 
let c= 0;

for (let i = 0..8) unroll 3 {
  c := a[i] + c;  
}

This can have multiple hardware implementations, one could be,

a: float[9] bank(3); 
let c= 0;

for (let i = 0..2) {
  w1 := a[0][i] + c;  
  w2 := a[1][i] + w1; 
  c := a[2][i] + w2; 
}

as is evident from this hardware description w1 and w2 create data dependencies and prevent running all 3 expressions in parallel. We'll get a benefit from not using the register, but it'll still be asynchronously sequential.

Another implementation would be,

a: float[9] bank(3); 
let c= 0;

for (let i = 0..2) {
  w1 := a[0][i] + c;  
  w2 := a[1][i] + a[2][i]; 
  c :=  w1 + w2; 
}

This still has the same dependencies, but the first two expressions can run in parallel completing the operation in 2 hardware unit cycles than 3 from the previous example.

Neither of these versions worries about the feedback loop due to c. Depending on the operation, which here all we care about is the addition of all elements in a, we can rewrite this as,

a: float[9] bank(9);
b: float[3] bank(3); 
let c= 0;

for (let i = 0..0) {
  b[0][0] := a[0][0] + a[1][0] + a[2][0]; 
  b[1][0] := a[3][0] + a[4][0] + a[5][0]; 
  b[2][0] := a[6][0] + a[7][0] + a[8][0]; 
}

for (let i = 0..0) {
  c := b[0][0] + b[1][0] + b[2][0];  
}

This version needs 4 hardware unit cycles to complete, same as what it would take for a maximum parallel version similar to prior examples. This would take 8 adders, one less from other version, but more importantly, we don't have the feedback loop anymore.

Picking which version to implement seems like a tradeoff. It'll be interesting to see what HLS makes of all this.

2018 July 26th

Multi D index types

The note on Adrian's blog provide an intro to index types.