Accelerating Nanos Unikernel Applications with the NVIDIA GPU Driver

2024-06-19

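Preface

Artificial intelligence has advanced rapidly over the past two years, and making full use of hardware performance to run applications more efficiently has become an important topic. Unikernels, as lightweight operating systems, have drawn attention for their efficiency and security.

This article shows how to integrate an existing AI application into a Nanos Unikernel and use the NVIDIA driver to provide CUDA support.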

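Introduction

In an earlier article, https://lyd.im/archives/nanos-unikernel-tutorials , I introduced several Unikernel solutions. As far as my research shows, Nanos is currently the only one that supports the NVIDIA GPU driver, which makes it much easier to run deep learning applications in a Unikernel.

Nanos loads the NVIDIA driver into the kernel as a klib. Klibs are the Nanos plugin mechanism and provide additional functionality to the kernel.

The Nanos NVIDIA driver lives at https://github.com/nanovms/gpu-nvidia . It is a modified version of NVIDIA's open-source driver, adapted to the Nanos kernel. The current driver version is 535.113.01.

Nanos currently supports GPU integration on two platforms: Google Cloud (GCP) and a local machine. This article focuses on the local setup. Before starting, make sure the machine has at least one NVIDIA GPU that is supported by NVIDIA's open-source driver, and that GPU passthrough is already enabled. See my earlier article https://lyd.im/archives/pve-8-2-gpu-passthrough for reference.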

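Building Klibs

There are currently two ways to obtain the gpu_nvidia klib: build it manually, or use the klib that Nanos builds automatically every night.

The official build can be downloaded from https://storage.googleapis.com/nanos/release/nightly/gpu-nvidia-x86_64.tar.gz .

To build it manually, follow these steps:

Clone the Nanos kernel repository and build it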

```
git clone https://github.com/nanovms/nanos
cd nanos
make
```

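Clone the Nanos NVIDIA driver repository and build it. The NANOS_DIR parameter must point to the nanos directory from the previous step.

The build output is located at kernel-open/_out/Nanos_x86_64/gpu_nvidia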
```
git clone https://github.com/nanovms/gpu-nvidia
cd gpu-nvidia
make NANOS_DIR=/root/nanos
```

Building the Nanos Application

Two of the simplest applications from CUDA Samples, bandwidthTest and deviceQuery, are used here as tests.

Create the project directory

```
mkdir cuda-samples-nanos && cd cuda-samples-nanos
```

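Add the klib

Either the official build or your own build of the klib works. The official build is used here.

The archive unpacks to a gpu_nvidia file plus nvidia/535.113.01/gsp_ga10x.bin and nvidia/535.113.01/gsp_tu10x.bin. The .bin files are GPU System Processor (GSP) firmware: gsp_ga10x.bin is for GPUs based on the Ampere architecture and gsp_tu10x.bin is for GPUs based on the Turing architecture. Keep one or both, depending on the architecture of your GPU.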
```
wget https://storage.googleapis.com/nanos/release/nightly/gpu-nvidia-x86_64.tar.gz
tar -vxf gpu-nvidia-x86_64.tar.gz && rm -rf gpu-nvidia-x86_64.tar.gz
mkdir klibs
mv gpu_nvidia klibs/
```
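
If only one firmware image is needed, the other can be deleted after unpacking. The following is a minimal sketch, assuming the nvidia/535.113.01 layout described above. keep_firmware is a hypothetical helper name and is not part of ops or Nanos.

```shell
# Hypothetical helper: keep a single GSP firmware image and delete the rest.
# $1: firmware suffix to keep (ga10x or tu10x)
# $2: firmware directory (defaults to the layout from the nightly archive)
keep_firmware() {
    keep="$1"
    fwdir="${2:-nvidia/535.113.01}"
    # Refuse to delete anything if the image to keep is not there.
    if [ ! -f "$fwdir/gsp_$keep.bin" ]; then
        echo "gsp_$keep.bin not found in $fwdir" >&2
        return 1
    fi
    for f in "$fwdir"/gsp_*.bin; do
        if [ "$f" != "$fwdir/gsp_$keep.bin" ]; then
            rm -f "$f"
        fi
    done
}
```

For example, running `keep_firmware ga10x` from the project directory leaves only gsp_ga10x.bin in place.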
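
Build cuda-samples

The bandwidthTest and deviceQuery applications are available if you select CUDA Demo Suite when installing CUDA, in which case they can be found in the samples directory under the CUDA installation path. Alternatively, you can build them yourself.

The manual build is shown here. The binaries end up in cuda-samples/bin/x86_64/linux/release/ and need to be copied into the project directory.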
```
git clone https://github.com/NVIDIA/cuda-samples
cd cuda-samples/Samples/1_Utilities/bandwidthTest/
make
cd ../deviceQuery/
make
```
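
The copy step mentioned above is not shown in the commands. A minimal sketch of it follows, assuming the release directory named earlier. copy_samples is a hypothetical helper, and both paths passed to it depend on where the repository and the project directory actually live.

```shell
# Hypothetical helper: copy the two sample binaries into the project directory.
# $1: path to the cuda-samples checkout
# $2: path to the Nanos project directory (cuda-samples-nanos in this article)
copy_samples() {
    src="$1/bin/x86_64/linux/release"
    dst="$2"
    for bin in bandwidthTest deviceQuery; do
        # Stop at the first binary that is missing or cannot be copied.
        cp "$src/$bin" "$dst/" || return 1
    done
}
```

With both directories side by side, this could be called as `copy_samples ./cuda-samples ./cuda-samples-nanos`.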
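
Prepare the shared library dependencies

Both programs need only libcuda.so.1. Install the driver at the same version as the klib, 535.113.01, and the library can then be found in /usr/lib/x86_64-linux-gnu/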
```
mkdir -p usr/lib
cp /usr/lib/x86_64-linux-gnu/libcuda.so.1 ./usr/lib
```
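
Because the library has to match the driver version of the klib, it can be worth checking what libcuda.so.1 resolves to before copying it. The following is a minimal sketch, assuming the usual driver packaging in which libcuda.so.1 is a symlink to a file named libcuda.so.<driver version>. libcuda_version is a hypothetical helper.

```shell
# Hypothetical helper: print the driver version encoded in the name of the
# file that a libcuda.so.1 symlink points to.
# $1: path to libcuda.so.1
libcuda_version() {
    real=$(readlink -f "$1") || return 1
    base=$(basename "$real")
    # Strip the "libcuda.so." prefix, leaving only the version string.
    echo "${base#libcuda.so.}"
}
```

Calling `libcuda_version /usr/lib/x86_64-linux-gnu/libcuda.so.1` should print 535.113.01 when the matching driver is installed. Plain cp follows the symlink, so the copy placed in usr/lib is the real library file.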

Edit the configuration file

Create a new config.json with the following content
```
{
  "KlibDir": "./klibs",
  "Klibs": [
    "gpu_nvidia"
  ],
  "Dirs": [
    "nvidia",
    "usr"
  ],
  "RunConfig": {
    "GPUs": 1
  }
}
```
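KlibDir sets the directory that holds the klibs, Klibs names the klib to load (gpu_nvidia), Dirs maps the nvidia and usr directories into the root directory of Nanos, and RunConfig.GPUs sets the number of GPUs to pass through.

Check

The project should now have the following directory structure. Check that nothing is missing.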

```
.
├── bandwidthTest
├── config.json
├── deviceQuery
├── klibs
│   └── gpu_nvidia
├── nvidia
│   ├── 535.113.01
│   │   ├── gsp_ga10x.bin
│   │   └── gsp_tu10x.bin
│   └── LICENSE
└── usr
    └── lib
        └── libcuda.so.1
```
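
The same check can be scripted. The following is a minimal sketch, assuming the layout in the tree above and that at least one GSP firmware image was kept. check_layout is a hypothetical helper and is not part of ops.

```shell
# Hypothetical helper: report files from the expected layout that are missing.
# $1: project directory (defaults to the current directory)
# Returns 0 when everything is present, 1 otherwise.
check_layout() {
    dir="${1:-.}"
    missing=0
    for f in \
        bandwidthTest \
        deviceQuery \
        config.json \
        klibs/gpu_nvidia \
        usr/lib/libcuda.so.1
    do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # One firmware image is enough; ls fails if the pattern matches nothing.
    if ! ls "$dir"/nvidia/535.113.01/gsp_*.bin >/dev/null 2>&1; then
        echo "missing: nvidia/535.113.01/gsp_*.bin"
        missing=1
    fi
    return "$missing"
}
```

Running `check_layout` in the project directory prints one line per missing file and returns a non-zero status if anything is absent.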

Run

Run the ELF file with ops run. The -c parameter specifies the configuration file, and the -n parameter runs with the nightly version.

```
ops run deviceQuery -c config.json -n

running local instance
booting /root/.ops/images/deviceQuery ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (circleci@027ee46c5f57)  Fri Aug 16 02:11:27 AM UTC 2024
Loaded the UVM driver, major device number 0.
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4060"
  CUDA Driver Version / Runtime Version          12.2 / 12.2
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 7734 MBytes (8109293568 bytes)
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            2505 MHz (2.50 GHz)
  Memory Clock rate:                             8501 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 25165824 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 4
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.2, NumDevs = 1, Device0 = NVIDIA GeForce RTX 4060
Result = PASS
```

```
ops run bandwidthTest -c config.json -n

running local instance
booting /root/.ops/images/bandwidthTest ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (circleci@027ee46c5f57)  Fri Aug 16 02:11:27 AM UTC 2024
Loaded the UVM driver, major device number 0.
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce RTX 4060
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12905.6

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     13203.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     232091.7

en1: assigned FE80::4466:88FF:FE1F:2F9
Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
```
