
HLS-study-project

Vitis HLS 2024.2
part=xcku035-fbva676-2-e

What is HLS?

High-Level Synthesis converts an algorithm written in a high-level language into RTL code, which can then be used to implement the design in FPGA hardware.

HLS pragma

A pragma is a keyword that passes directives to Vitis HLS; these directives help optimize the hardware design and control the behavior of the generated RTL. They can be used for performance tuning, resource allocation, and design-flow optimization.
Below are the pragmas I have studied and used so far (the Vitis HLS documentation also describes them in detail):

pragma HLS pipeline

void sum_array(int in[8], int* out) {
#pragma HLS PIPELINE
    // Function-level PIPELINE: a new invocation can start every II cycles.
    int total = 0;
    for (int i = 0; i < 8; i++) {
        total += in[i];
    }
    *out = total;
}

Executes a loop or function as a pipeline, improving throughput.
Pipelining splits a computation into multiple stages and lets the stages work in parallel, so each clock cycle can accept new input and produce a new result.
You can set the Initiation Interval (II) yourself: it is the number of clock cycles between starting successive iterations, and II=1 gives the best throughput.
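For example, a minimal sketch of an explicit II constraint (the function name and the II value are chosen for illustration only):

// Relax the pipeline target: start a new loop iteration every 2 cycles.
// Useful when an II=1 target cannot be met due to dependencies or memory ports.
void sum_array_ii2(int in[8], int* out) {
    int total = 0;
    for (int i = 0; i < 8; i++) {
#pragma HLS PIPELINE II=2
        total += in[i];
    }
    *out = total;
}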

pragma HLS unroll

void sum_array_unroll(int in[8], int* out) {
    int total = 0;
    for (int i = 0; i < 8; i++) {
        #pragma HLS UNROLL
        total += in[i];
    }
    *out = total;
}

Unlike pipeline, unroll expands the loop into multiple parallel compute units. After unrolling, the operations of all loop iterations execute simultaneously in hardware, instead of one at a time as in software.
As the synthesis reports below show, unroll is usually more efficient than pipeline (lower latency), but it also consumes more resources.

pragma HLS array_partition

array_partition structurally splits an array: a large array is divided into several smaller pieces so that they can be accessed simultaneously in hardware.
In hardware, an array is mapped by default to a single memory with only one or two ports, so only one read/write can happen at a time. When all data comes from the same memory, port contention prevents HLS from parallelizing accesses.
The fix: split the array into multiple memory slices, each with its own ports.

#pragma HLS array_partition variable=<array_name> type=<partition_type> factor=<factor> dim=<dimension>

type=complete (every element becomes its own register) / block (split into contiguous blocks) / cyclic (elements interleaved round-robin); block and cyclic must be paired with factor=
dim= selects which array dimension to partition (dim=0 partitions all dimensions)
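For instance, a sketch of a block partition (the factor here is chosen purely for illustration):

// Split in[8] into 2 physical memories: in[0..3] and in[4..7].
// Two slices means two sets of ports, so two reads can happen per cycle.
#pragma HLS array_partition variable=in type=block factor=2 dim=1

Below is the complete partition used with unroll in this project: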

void sum_array_unroll(int in[8], int* out) {
#pragma HLS array_partition variable=in complete
    int total = 0;
    for (int i = 0; i < 8; i++) {
        #pragma HLS UNROLL
        total += in[i];
    }
    *out = total;
}

unroll is often combined with array_partition, because unroll only tells the tool that you want the loop expanded; whether the expansion actually helps depends on whether the data can be accessed in parallel.

pragma HLS DATAFLOW

DATAFLOW lets multiple functions or loops execute concurrently in hardware, like several pipelines chained together.
The classic pattern is a read -> compute -> write chain; a sketch (the stream types, SIZE, and the helper functions are placeholders):

// Each stage becomes its own process; stages overlap across transactions.
// hls::stream requires <hls_stream.h>.
void top(hls::stream<int>& input_stream, hls::stream<int>& output_stream) {
#pragma HLS DATAFLOW
    int buf[SIZE], result[SIZE];              // channels between stages
    read_input(input_stream, buf);            // stage 1: read
    compute(buf, result);                     // stage 2: compute
    write_output(result, output_stream);      // stage 3: write
}

Applied in the two-dense-layer model below:
void dense_model(int W1[HIDDEN_DIM][IN_DIM], int W2[OUT_DIM][HIDDEN_DIM],
                 int b1[HIDDEN_DIM], int b2[OUT_DIM], int x[IN_DIM], int y[OUT_DIM]) {
#pragma HLS DATAFLOW
    int h[HIDDEN_DIM];
#pragma HLS array_partition variable=h complete

    dense1(W1, x, b1, h);
    dense2(W2, h, b2, y);
}


Getting started in Vitis

  1. Create a workspace to hold the components you will build.
  2. Now you can create a component.
  3. Add the .cpp file you want to synthesize, plus your own testbench (this step can also be skipped for now).
  4. Set the target board/part.
  5. Setup is complete.
     If you skipped step 3, you can add the files after creation (this is what I do).
     Remember to set the top function (the unit that HLS synthesizes).

Walkthrough: sum_array as an example

#include "ap_int.h"

void sum_array_unroll(int in[8], int* out) {
#pragma HLS array_partition variable=in complete
    int total = 0;
    for (int i = 0; i < 8; i++) {
        #pragma HLS UNROLL
        total += in[i];
    }
    *out = total;
}
//testbench
#include <iostream>
using namespace std;

void sum_array_unroll(int in[8], int* out);

int main() {
    int in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int result;
    sum_array_unroll(in, &result);
    cout << "sum = " << result << endl;
    if (result == 36) {
        cout << "PASS" << endl;
        return 0;
    } else {
        cout << "FAIL" << endl;
        return 1;
    }
}

Next, run C Simulation to check the test output and make sure the program behaves correctly.

 sum = 36
 PASS
 INFO: [SIM 211-1] CSim done with 0 errors.
 INFO: [SIM 211-3] *************** CSIM finish ***************
 INFO: [HLS 200-112] Total CPU user time: 2 seconds. Total CPU system time: 1 seconds. Total elapsed time: 7.338 seconds; peak allocated memory: 265.996 MB.
 INFO: [vitis-run 60-791] Total elapsed time: 0h 0m 12s
 C-simulation finished successfully

Then run C Synthesis to generate the Verilog code and reports.
In the syn -> report folder you will find a "<filename>_synth.rpt".

// version without unroll
+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+------------------------------------------------+
    |  Latency (cycles) |  Latency (absolute) |  Interval |                    Pipeline                    |
    |   min   |   max   |    min   |    max   | min | max |                      Type                      |
    +---------+---------+----------+----------+-----+-----+------------------------------------------------+
    |       10|       10|  0.100 us|  0.100 us|    9|    9|  loop auto-rewind stp (delay=1 clock cycles(s))|
    +---------+---------+----------+----------+-----+-----+------------------------------------------------+
* Summary: 
+-----------------+---------+------+--------+--------+-----+
|       Name      | BRAM_18K|  DSP |   FF   |   LUT  | URAM|
+-----------------+---------+------+--------+--------+-----+
|DSP              |        -|     -|       -|       -|    -|
|Expression       |        -|     -|       0|      63|    -|
|FIFO             |        -|     -|       -|       -|    -|
|Instance         |        -|     -|       -|       -|    -|
|Memory           |        -|     -|       -|       -|    -|
|Multiplexer      |        -|     -|       0|      45|    -|
|Register         |        -|     -|      41|       -|    -|
+-----------------+---------+------+--------+--------+-----+
|Total            |        0|     0|      41|     108|    0|
+-----------------+---------+------+--------+--------+-----+
// version with unroll
+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+---------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
    |   min   |   max   |    min   |    max   | min | max |   Type  |
    +---------+---------+----------+----------+-----+-----+---------+
    |        0|        0|      0 ns|      0 ns|    1|    1|       no|
    +---------+---------+----------+----------+-----+-----+---------+
* Summary: 
+-----------------+---------+------+--------+--------+-----+
|       Name      | BRAM_18K|  DSP |   FF   |   LUT  | URAM|
+-----------------+---------+------+--------+--------+-----+
|DSP              |        -|     -|       -|       -|    -|
|Expression       |        -|     -|       0|     245|    -|
|FIFO             |        -|     -|       -|       -|    -|
|Instance         |        -|     -|       -|       -|    -|
|Memory           |        -|     -|       -|       -|    -|
|Multiplexer      |        -|     -|       -|       -|    -|
|Register         |        -|     -|       -|       -|    -|
+-----------------+---------+------+--------+--------+-----+
|Total            |        0|     0|       0|     245|    0|
+-----------------+---------+------+--------+--------+-----+

The reports show that the unrolled version is clearly faster but uses noticeably more resources,
which demonstrates how much impact a #pragma can have.
We can also run C/RTL Cosimulation to verify the generated hardware against the C testbench.

Dense Layer

#include "ap_int.h"

#define IN_DIM  8
#define OUT_DIM 4

void dense(float W[OUT_DIM][IN_DIM], float x[IN_DIM], float b[OUT_DIM], float y[OUT_DIM]) {
#pragma HLS array_partition variable=W type=complete
#pragma HLS array_partition variable=x type=complete
#pragma HLS array_partition variable=b type=complete
#pragma HLS array_partition variable=y type=complete
#pragma HLS PIPELINE II=1

    for (int i = 0; i < OUT_DIM; i++) {
#pragma HLS UNROLL
        float acc = b[i];
        for (int j = 0; j < IN_DIM; j++) {
#pragma HLS UNROLL
            acc += W[i][j] * x[j];
        }
        y[i] = (acc > 0) ? acc : 0;
    }
}

Two dense layers

#include "ap_int.h"

#define IN_DIM  8
#define HIDDEN_DIM 4
#define OUT_DIM 2

void dense1(int W1[HIDDEN_DIM][IN_DIM], int x[IN_DIM], int b1[HIDDEN_DIM], int h[HIDDEN_DIM]) {
#pragma HLS array_partition variable=W1 type=complete
#pragma HLS array_partition variable=x type=complete
#pragma HLS array_partition variable=b1 type=complete
#pragma HLS array_partition variable=h type=complete
#pragma HLS PIPELINE II=1

    for (int i = 0; i < HIDDEN_DIM; i++) {
#pragma HLS UNROLL
        int acc = b1[i];
        for (int j = 0; j < IN_DIM; j++) {
#pragma HLS UNROLL
            acc += W1[i][j] * x[j];
        }
        if (acc < 0) acc = 0;
        h[i] = acc;
    }
}

void dense2(int W2[OUT_DIM][HIDDEN_DIM], int h[HIDDEN_DIM], int b2[OUT_DIM], int y[OUT_DIM]) {
#pragma HLS array_partition variable=W2 type=complete
#pragma HLS array_partition variable=h type=complete
#pragma HLS array_partition variable=b2 type=complete
#pragma HLS array_partition variable=y type=complete
#pragma HLS PIPELINE II=1

    for (int i = 0; i < OUT_DIM; i++) {
#pragma HLS UNROLL
        int acc = b2[i];
        for (int j = 0; j < HIDDEN_DIM; j++) {
#pragma HLS UNROLL
            acc += W2[i][j] * h[j];
        }
        y[i] = acc;
    }
}

void dense_model(int W1[HIDDEN_DIM][IN_DIM], int W2[OUT_DIM][HIDDEN_DIM],
                 int b1[HIDDEN_DIM], int b2[OUT_DIM], int x[IN_DIM], int y[OUT_DIM]) {
#pragma HLS DATAFLOW
    int h[HIDDEN_DIM];
#pragma HLS array_partition variable=h complete

    dense1(W1, x, b1, h);
    dense2(W2, h, b2, y);
}

A dense layer computes a linear combination of its inputs: y = activation(W·x + b).
Every output neuron is connected to all input neurons,
so the layer can learn an arbitrary linear transformation, which makes it suitable for feature extraction, mappings, and the final classifier stage.
It can later be used in tasks such as CNNs.
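As a quick check, a minimal testbench sketch for dense_model (the weights and inputs are arbitrary toy values I chose; the expected output is recomputed on the host with the same formula):

#include <iostream>
using namespace std;

#define IN_DIM  8
#define HIDDEN_DIM 4
#define OUT_DIM 2

void dense_model(int W1[HIDDEN_DIM][IN_DIM], int W2[OUT_DIM][HIDDEN_DIM],
                 int b1[HIDDEN_DIM], int b2[OUT_DIM], int x[IN_DIM], int y[OUT_DIM]);

int main() {
    int W1[HIDDEN_DIM][IN_DIM], W2[OUT_DIM][HIDDEN_DIM];
    int b1[HIDDEN_DIM], b2[OUT_DIM], x[IN_DIM], y[OUT_DIM];

    // Deterministic toy initialization.
    for (int i = 0; i < HIDDEN_DIM; i++) {
        b1[i] = 1;
        for (int j = 0; j < IN_DIM; j++) W1[i][j] = (i + j) % 3;
    }
    for (int i = 0; i < OUT_DIM; i++) {
        b2[i] = 1;
        for (int j = 0; j < HIDDEN_DIM; j++) W2[i][j] = (i == j) ? 1 : 0;
    }
    for (int j = 0; j < IN_DIM; j++) x[j] = j;

    dense_model(W1, W2, b1, b2, x, y);

    // Host-side reference: h = ReLU(W1*x + b1), y_ref = W2*h + b2.
    int h[HIDDEN_DIM], y_ref[OUT_DIM];
    for (int i = 0; i < HIDDEN_DIM; i++) {
        int acc = b1[i];
        for (int j = 0; j < IN_DIM; j++) acc += W1[i][j] * x[j];
        h[i] = (acc < 0) ? 0 : acc;
    }
    for (int i = 0; i < OUT_DIM; i++) {
        int acc = b2[i];
        for (int j = 0; j < HIDDEN_DIM; j++) acc += W2[i][j] * h[j];
        y_ref[i] = acc;
    }

    bool pass = true;
    for (int i = 0; i < OUT_DIM; i++) {
        cout << "y[" << i << "] = " << y[i] << " (expected " << y_ref[i] << ")" << endl;
        if (y[i] != y_ref[i]) pass = false;
    }
    cout << (pass ? "PASS" : "FAIL") << endl;
    return pass ? 0 : 1;
}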

Attention Score & Softmax

#include "ap_fixed.h"
#include <hls_math.h>

#define DIM 4

typedef ap_fixed<16, 6> data_t;  // 16 bits total: 6 integer bits (incl. sign), 10 fractional bits

// ----- Compute Q·K^T -----
void attention_score(data_t Q[DIM], data_t K[DIM], data_t* score_out) {
#pragma HLS array_partition variable=Q complete
#pragma HLS array_partition variable=K complete

    data_t score = 0;
    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        score += Q[i] * K[i];
    }
    *score_out = score;
}

// ----- Softmax over fixed-length 1D input -----
void softmax(data_t input[DIM], data_t output[DIM]) {
#pragma HLS array_partition variable=input complete
#pragma HLS array_partition variable=output complete

    data_t max_val = input[0];
    for (int i = 1; i < DIM; i++) {
#pragma HLS UNROLL
        if (input[i] > max_val) max_val = input[i];
    }

    data_t sum = 0;
    data_t exp_val[DIM];
#pragma HLS array_partition variable=exp_val complete

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        exp_val[i] = hls::exp(input[i] - max_val);
        sum += exp_val[i];
    }

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        output[i] = exp_val[i] / sum;
    }
}

The dot product of Q with each K vector gives their relevance score.
Softmax then converts the scores into a probability distribution that sums to 1,
helping the model build meaningful contextual relationships.
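A minimal host-side check for softmax (the input values are arbitrary; the tolerance allows for ap_fixed quantization error):

#include <iostream>
#include <cmath>
#include "ap_fixed.h"

#define DIM 4
typedef ap_fixed<16, 6> data_t;

void softmax(data_t input[DIM], data_t output[DIM]);

int main() {
    data_t in[DIM] = {1.0, 2.0, 0.5, -1.0};
    data_t out[DIM];
    softmax(in, out);

    // The outputs should form a probability distribution: sum ~= 1.
    double sum = 0;
    for (int i = 0; i < DIM; i++) {
        std::cout << "out[" << i << "] = " << out[i].to_double() << std::endl;
        sum += out[i].to_double();
    }
    if (std::fabs(sum - 1.0) < 0.05) {
        std::cout << "PASS (sum = " << sum << ")" << std::endl;
        return 0;
    }
    std::cout << "FAIL (sum = " << sum << ")" << std::endl;
    return 1;
}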

Multi-Head Attention -> Transformer Block


// One attention head: scores = Q_proj·K_proj^T, weights = softmax(scores),
// out = weights·V_proj. Uses attention_score and softmax defined above.
void attention_head(
    data_t Q_proj[HEAD_DIM], data_t K_proj[DIM][HEAD_DIM], data_t V_proj[DIM][HEAD_DIM], data_t out[HEAD_DIM]) {
#pragma HLS array_partition variable=Q_proj complete
#pragma HLS array_partition variable=K_proj complete dim=2
#pragma HLS array_partition variable=V_proj complete dim=2
#pragma HLS array_partition variable=out complete

    data_t scores[DIM];
#pragma HLS array_partition variable=scores complete

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        attention_score(Q_proj, K_proj[i], &scores[i]);
    }

    data_t weights[DIM];
    softmax(scores, weights);

    for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
        out[i] = 0;
        for (int j = 0; j < DIM; j++) {
#pragma HLS UNROLL
            out[i] += weights[j] * V_proj[j][i];
        }
    }
}

void multi_head_attention(
    data_t Q[DIM], data_t K[DIM][DIM], data_t V[DIM][DIM],
    data_t W_Q[HEADS][HEAD_DIM][DIM],
    data_t W_K[HEADS][HEAD_DIM][DIM],
    data_t W_V[HEADS][HEAD_DIM][DIM],
    data_t W_O[DIM][HEADS * HEAD_DIM],
    data_t output[DIM]) {

#pragma HLS array_partition variable=Q complete
#pragma HLS array_partition variable=K complete dim=2
#pragma HLS array_partition variable=V complete dim=2
#pragma HLS array_partition variable=W_Q complete dim=2
#pragma HLS array_partition variable=W_K complete dim=2
#pragma HLS array_partition variable=W_V complete dim=2
#pragma HLS array_partition variable=W_O complete dim=2
#pragma HLS array_partition variable=output complete

    data_t concat_heads[HEADS * HEAD_DIM];
#pragma HLS array_partition variable=concat_heads complete

    for (int h = 0; h < HEADS; h++) {
#pragma HLS UNROLL
        data_t Q_proj[HEAD_DIM], K_proj[DIM][HEAD_DIM], V_proj[DIM][HEAD_DIM];
        data_t head_out[HEAD_DIM];
#pragma HLS array_partition variable=Q_proj complete
#pragma HLS array_partition variable=K_proj complete dim=2
#pragma HLS array_partition variable=V_proj complete dim=2
#pragma HLS array_partition variable=head_out complete

        for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
            Q_proj[i] = 0;
            for (int j = 0; j < DIM; j++) Q_proj[i] += W_Q[h][i][j] * Q[j];
        }
        for (int m = 0; m < DIM; m++) {
            for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
                K_proj[m][i] = 0;
                for (int j = 0; j < DIM; j++) K_proj[m][i] += W_K[h][i][j] * K[m][j];
            }
        }
        for (int m = 0; m < DIM; m++) {
            for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
                V_proj[m][i] = 0;
                for (int j = 0; j < DIM; j++) V_proj[m][i] += W_V[h][i][j] * V[m][j];
            }
        }

        attention_head(Q_proj, K_proj, V_proj, head_out);

        for (int i = 0; i < HEAD_DIM; i++) {
#pragma HLS UNROLL
            concat_heads[h * HEAD_DIM + i] = head_out[i];
        }
    }

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        output[i] = 0;
        for (int j = 0; j < HEADS * HEAD_DIM; j++) {
#pragma HLS UNROLL
            output[i] += W_O[i][j] * concat_heads[j];
        }
    }
}


#include "ap_fixed.h"
#include <hls_math.h>
#include "multi_head_attention.h"

#define DIM 4
#define HEADS 2
#define HEAD_DIM 2
#define FF_DIM 4

typedef ap_fixed<16, 6> data_t;

void multi_head_attention(
    data_t Q[DIM], data_t K[DIM][DIM], data_t V[DIM][DIM],
    data_t W_Q[HEADS][HEAD_DIM][DIM],
    data_t W_K[HEADS][HEAD_DIM][DIM],
    data_t W_V[HEADS][HEAD_DIM][DIM],
    data_t W_O[DIM][HEADS * HEAD_DIM],
    data_t output[DIM]);

void dense_ffn(data_t input[DIM], data_t W1[FF_DIM][DIM], data_t b1[FF_DIM],
               data_t W2[DIM][FF_DIM], data_t b2[DIM], data_t output[DIM]) {
#pragma HLS array_partition variable=input complete
#pragma HLS array_partition variable=output complete
#pragma HLS array_partition variable=W1 complete dim=2
#pragma HLS array_partition variable=W2 complete dim=2
#pragma HLS array_partition variable=b1 complete
#pragma HLS array_partition variable=b2 complete

    data_t hidden[FF_DIM];
#pragma HLS array_partition variable=hidden complete

    for (int i = 0; i < FF_DIM; i++) {
#pragma HLS UNROLL
        hidden[i] = b1[i];
        for (int j = 0; j < DIM; j++) hidden[i] += W1[i][j] * input[j];
        if (hidden[i] < 0) hidden[i] = 0;
    }

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        output[i] = b2[i];
        for (int j = 0; j < FF_DIM; j++) output[i] += W2[i][j] * hidden[j];
    }
}
void transformer_block(
    data_t Q[DIM], data_t K[DIM][DIM], data_t V[DIM][DIM],
    data_t W_Q[HEADS][HEAD_DIM][DIM],
    data_t W_K[HEADS][HEAD_DIM][DIM],
    data_t W_V[HEADS][HEAD_DIM][DIM],
    data_t W_O[DIM][HEADS * HEAD_DIM],
    data_t W1[FF_DIM][DIM], data_t b1[FF_DIM],
    data_t W2[DIM][FF_DIM], data_t b2[DIM],
    data_t output[DIM]) {

#pragma HLS array_partition variable=Q complete
#pragma HLS array_partition variable=K complete dim=2
#pragma HLS array_partition variable=V complete dim=2
#pragma HLS array_partition variable=output complete

    data_t attn_out[DIM];
    data_t add1[DIM];
    data_t ffn_out[DIM];
#pragma HLS array_partition variable=attn_out complete
#pragma HLS array_partition variable=add1 complete
#pragma HLS array_partition variable=ffn_out complete

    multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, attn_out);

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        add1[i] = Q[i] + attn_out[i];
    }

    dense_ffn(add1, W1, b1, W2, b2, ffn_out);

    for (int i = 0; i < DIM; i++) {
#pragma HLS UNROLL
        output[i] = add1[i] + ffn_out[i];
    }
}
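
In equation form, the block above computes the following (note the residual uses the raw Q vector, and this version has no layer normalization):

add1   = Q + MultiHeadAttention(Q, K, V)
output = add1 + FFN(add1)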

Multi-Head Attention
Each head performs scaled dot-product attention:
score computation: score = Q·K^T (the textbook form also divides by √d_k; attention_score above omits this scaling)
softmax normalization
weighted sum with the value matrix V
The outputs of all heads are concatenated and passed through the linear projection W_O.
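For reference, the standard per-head formula with the scaling included:

head_out = softmax(Q·K^T / √d_k)·V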

Feed-Forward Network (FFN)
Two dense layers with a ReLU activation:
FFN(x) = max(0, W1·x + b1)·W2 + b2
With this, an HLS-synthesizable Transformer block is complete.

Goals & Progress (updated weekly)

4/15
Start working toward implementing stable diffusion on Vitis.
Learn to convert a model into C++ HLS, the way hls4ml does.
Think about new directions (e.g. hls4rl, HLS for obstacle detection).
4/22
Completed a transformer encoder block project,
including:
dense layer (tested successfully)
layer normalization (tested successfully)
gelu (tested successfully)
residual normalization (tested successfully)
multi-head attention (tested successfully)
Finally, everything was integrated into the transformer encoder block shown in the diagram.
Once testing is finished, all results will be pushed to GitHub.
With this architecture in place, I can start exploring new topics,
e.g. continuing toward Stable Diffusion and porting/accelerating Edge AI models.
