Porting a YOLO Inference Service to the Nanos Unikernel

2024-11-27

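The previous article, https://lyd.im/archives/accelerating-nanos-unikernel-applications-with-nvidia-gpu-drivers , covered how to use an NVIDIA GPU in Nanos to accelerate deep-learning inference services. This article puts that into practice by porting the latest YOLOv11 inference service to the Nanos unikernel.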

Environment setup

  1. First, building on the previous article, prepare the required gpu_nvidia klib and the NVIDIA driver files:

     .
     ├── klibs
     │   └── gpu_nvidia
     └── nvidia
         ├── 535.113.01
         │   ├── gsp_ga10x.bin
         │   └── gsp_tu10x.bin
         └── LICENSE
    
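  2. A Python 3.10 interpreter (preferably 3.10.6, the same version as the eyberg/python:3.10.6 package used below)

Creating the Python environment

  1. First, create a Python virtual environment. It will later be mapped into Nanos as the runtime environment for YOLO.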

    python -m venv .local --prompt yolo
    source .local/bin/activate
    
  2. Install PyTorch and Ultralytics

    pip install torch==2.3.0+cu121 torchvision==0.18.0+cu121 torchaudio==2.3.0+cu121 -f https://mirror.sjtu.edu.cn/pytorch-wheels/torch_stable.html
    pip install ultralytics
    
  3. Write a minimal working Python script, main.py

     from ultralytics import YOLO
     
     model = YOLO("yolo11n.pt")
     results = model("https://ultralytics.com/images/bus.jpg")
    
  4. Write the configuration file, config.json

    {
       "KlibDir": "./klibs",
       "Klibs": [
           "gpu_nvidia"
       ],
       "RunConfig": {
           "GPUs": 1
       },
       "Dirs": [
           "nvidia",
           ".local"
       ],
       "Args": [
           "main.py"
       ]
     }
    
  5. Try running the program

    ops pkg load eyberg/python:3.10.6 -c config.json -n
    

    Unsurprisingly, something went wrong:

    running local instance
     booting /root/.ops/images/python3.10 ...
     [0.257582] en1: assigned 10.0.2.15
     NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
     NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (circleci@690d85c32591)  Wed Nov 27 02:11:20 AM UTC 2024
     Loaded the UVM driver, major device number 0.
     Traceback (most recent call last):
       File "/.local/lib/python3.10/site-packages/numpy/_core/__init__.py", line 23, in <module>
         from . import multiarray
       File "/.local/lib/python3.10/site-packages/numpy/_core/multiarray.py", line 10, in <module>
         from . import overrides
       File "/.local/lib/python3.10/site-packages/numpy/_core/overrides.py", line 8, in <module>
         from numpy._core._multiarray_umath import (
     ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory
     
     During handling of the above exception, another exception occurred:
     
     Traceback (most recent call last):
       File "/.local/lib/python3.10/site-packages/numpy/__init__.py", line 114, in <module>
         from numpy.__config__ import show as show_config
       File "/.local/lib/python3.10/site-packages/numpy/__config__.py", line 4, in <module>
         from numpy._core._multiarray_umath import (
       File "/.local/lib/python3.10/site-packages/numpy/_core/__init__.py", line 49, in <module>
         raise ImportError(msg)
     ImportError: 
     
     IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
     
     Importing the numpy C-extensions failed. This error can happen for
     many reasons, often due to issues with your setup or how NumPy was
     installed.
     
     We have compiled some common reasons and troubleshooting tips at:
     
         https://numpy.org/devdocs/user/troubleshooting-importerror.html
     
     Please note and check the following:
     
       * The Python version is: Python3.10 from ""
       * The NumPy version is: "2.1.3"
     
     and make sure that they are the versions you expect.
     Please carefully study the documentation linked above for further help.
     
     Original error was: libstdc++.so.6: cannot open shared object file: No such file or directory
     
     
     The above exception was the direct cause of the following exception:
     
     Traceback (most recent call last):
       File "//main.py", line 1, in <module>
         from ultralytics import YOLO
       File "/.local/lib/python3.10/site-packages/ultralytics/__init__.py", line 11, in <module>
         from ultralytics.models import NAS, RTDETR, SAM, YOLO, FastSAM, YOLOWorld
       File "/.local/lib/python3.10/site-packages/ultralytics/models/__init__.py", line 3, in <module>
         from .fastsam import FastSAM
       File "/.local/lib/python3.10/site-packages/ultralytics/models/fastsam/__init__.py", line 3, in <module>
         from .model import FastSAM
       File "/.local/lib/python3.10/site-packages/ultralytics/models/fastsam/model.py", line 5, in <module>
         from ultralytics.engine.model import Model
       File "/.local/lib/python3.10/site-packages/ultralytics/engine/model.py", line 7, in <module>
         import numpy as np
       File "/.local/lib/python3.10/site-packages/numpy/__init__.py", line 119, in <module>
         raise ImportError(msg) from e
     ImportError: Error importing numpy: you should not try to import numpy from
             its source directory; please exit the numpy source tree, and relaunch
             your python interpreter from there.
    

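Handling the errors

Missing shared libraries

A close reading of the output above reveals the root cause: the shared library libstdc++.so.6 is missing, and supplying it fixes the error. The final message about importing numpy from its source directory is misleading here; the "Original error was" line names the real problem. Any later error containing cannot open shared object file: No such file or directory has the same cause, and each missing library has to be added in turn.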
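
Because the same message recurs for every missing library, it can help to pull the library names out of a captured log automatically. The following is a minimal sketch; the function name missing_libraries and the regular expression are illustrative assumptions, not part of ops or NumPy.

```python
import re

# Matches the dynamic loader message, for example:
# "ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory"
MISSING_LIB = re.compile(r"(\S+\.so[.\d]*): cannot open shared object file")


def missing_libraries(log_text: str) -> list[str]:
    """Return the unique library names reported missing, in order of first appearance."""
    seen: dict[str, None] = {}
    for match in MISSING_LIB.finditer(log_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


sample = """\
ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory
Original error was: libstdc++.so.6: cannot open shared object file: No such file or directory
"""
print(missing_libraries(sample))  # the library is listed once even though it is reported twice
```

Only the first missing library is reported per run, so the log has to be re-captured after each fix.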

  1. Create the directory

    mkdir -p usr/lib
    
  2. Supply the dependencies: copy the shared libraries listed below into the usr/lib directory just created. Most of them can be found in the host's /usr/lib/x86_64-linux-gnu/ directory.

    usr/
      └── lib
         ├── libbsd.so.0
         ├── libbz2.so.1.0
         ├── libcuda.so.1
         ├── libexpat.so.1
         ├── libffi.so.7
         ├── libfribidi.so.0
         ├── libgcc_s.so.1
         ├── libGLdispatch.so.0
         ├── libglib-2.0.so.0
         ├── libGL.so.1
         ├── libGLX.so.0
         ├── libgthread-2.0.so.0
         ├── liblzma.so.5
         ├── libmd.so.0
         ├── libnvidia-ml.so.1
         ├── libnvJitLink.so.12
         ├── libpcre2-8.so.0
         ├── libstdc++.so.6
         ├── libutil.so.1
         ├── libuuid.so.1
         ├── libX11.so.6
         ├── libXau.so.6
         ├── libxcb.so.1
         └── libXdmcp.so.6
    
  3. Modify the configuration file to map the new directory

    "Dirs": [
         "nvidia",
         ".local",
         "usr"
     ]
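
Locating each library by hand is tedious, so the host's ldd tool can be used to find where a given library lives. The sketch below assumes a glibc-based Linux host with ldd on the PATH; parse_ldd and copy_dependencies are hypothetical helper names. It copies only the libraries named in wanted, so libraries the base package already provides are not overwritten.

```python
import shutil
import subprocess
from pathlib import Path


def parse_ldd(output: str) -> dict[str, str]:
    """Map each library name in ldd output to its resolved path on the host."""
    resolved = {}
    for line in output.splitlines():
        # Typical line: "libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x...)"
        if "=>" not in line:
            continue
        name, _, rest = line.strip().partition(" => ")
        host_path = rest.split(" (")[0].strip()
        if host_path.startswith("/"):  # skips "not found" entries
            resolved[name] = host_path
    return resolved


def copy_dependencies(extension: Path, dest: Path, wanted: set[str]) -> list[str]:
    """Copy the libraries in `wanted` that `extension` links against into `dest`."""
    result = subprocess.run(
        ["ldd", str(extension)], capture_output=True, text=True, check=True
    )
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name, host_path in parse_ldd(result.stdout).items():
        if name in wanted:
            # shutil.copy follows symlinks, so the real file is stored under the soname.
            shutil.copy(host_path, dest / name)
            copied.append(name)
    return copied
```

A typical call would pass a compiled extension from the virtual environment (for instance the NumPy _multiarray_umath module under .local/lib/python3.10/site-packages/numpy/_core/), Path("usr/lib"), and the set of names seen in the error messages. Note that ldd reports only link-time dependencies; libraries loaded at runtime through dlopen will still surface one at a time.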
    

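Insufficient disk space

A No space left on device error means the Nanos disk image is full. Nanos allocates a fairly small disk by default, so a larger one has to be requested in the configuration file.
Add the following to the configuration file: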

"BaseVolumeSz": "6g"
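
To pick a sensible value rather than guessing, the size of the directories that will be mapped into the image can be measured first. This is a rough sketch: tree_size_bytes is a hypothetical helper, and skipping symlinks is an assumption about what ends up in the image, so leave some headroom on top of the reported total.

```python
import os
from pathlib import Path


def tree_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = Path(dirpath) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


if __name__ == "__main__":
    # Directories listed under "Dirs" in config.json at this point in the article.
    for directory in ("nvidia", ".local", "usr"):
        path = Path(directory)
        if path.exists():
            print(f"{directory}: {tree_size_bytes(path) / 2**30:.2f} GiB")
```

The virtual environment with PyTorch and its CUDA libraries is by far the largest of the mapped directories, which is why the default volume size is not enough.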

Unsupported operating system

Traceback (most recent call last):
  File "//main.py", line 1, in <module>
    from ultralytics import YOLO
  File "/.local/lib/python3.10/site-packages/ultralytics/__init__.py", line 11, in <module>
    from ultralytics.models import NAS, RTDETR, SAM, YOLO, FastSAM, YOLOWorld
  File "/.local/lib/python3.10/site-packages/ultralytics/models/__init__.py", line 3, in <module>
    from .fastsam import FastSAM
  File "/.local/lib/python3.10/site-packages/ultralytics/models/fastsam/__init__.py", line 3, in <module>
    from .model import FastSAM
  File "/.local/lib/python3.10/site-packages/ultralytics/models/fastsam/model.py", line 5, in <module>
    from ultralytics.engine.model import Model
  File "/.local/lib/python3.10/site-packages/ultralytics/engine/model.py", line 11, in <module>
    from ultralytics.cfg import TASK2DATA, get_cfg, get_save_dir
  File "/.local/lib/python3.10/site-packages/ultralytics/cfg/__init__.py", line 12, in <module>
    from ultralytics.utils import (
  File "/.local/lib/python3.10/site-packages/ultralytics/utils/__init__.py", line 817, in <module>
    USER_CONFIG_DIR = Path(os.getenv("YOLO_CONFIG_DIR") or get_user_config_dir())  # Ultralytics settings dir
  File "/.local/lib/python3.10/site-packages/ultralytics/utils/__init__.py", line 789, in get_user_config_dir
    raise ValueError(f"Unsupported operating system: {platform.system()}")
ValueError: Unsupported operating system: Nanos

The traceback shows that the failure occurs in the call to get_user_config_dir. Here is part of that function's source:

    if WINDOWS:
        path = Path.home() / "AppData" / "Roaming" / sub_dir
    elif MACOS:  # macOS
        path = Path.home() / "Library" / "Application Support" / sub_dir
    elif LINUX:
        path = Path.home() / ".config" / sub_dir
    else:
        raise ValueError(f"Unsupported operating system: {platform.system()}")

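The function picks the configuration directory based on the operating system type. In Nanos, the uname system call reports the system name as Nanos by default, so execution falls through to the else branch and raises the error. There are two ways to fix this.

  1. Modify the configuration file to set the system name returned by uname explicitly. Nanos is not Linux, but it runs Linux binaries, so it can safely report itself as Linux: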

    "ManifestPassthrough": {
       "uname": {
         "sysname": "Linux"
       }
     }
    
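  2. Alternatively, the traceback shows that before calling get_user_config_dir, the code first reads the YOLO_CONFIG_DIR environment variable to find the configuration directory, so setting that variable in the configuration file also resolves the error: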

    "Env": {
         "YOLO_CONFIG_DIR": "/.config"
     }
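
To see why both fixes work, the lookup can be modelled in a few lines. This is a simplified sketch of the logic quoted above, not the Ultralytics source: resolve_config_dir is a hypothetical function, the "Ultralytics" sub-directory default is an assumption, and only the Linux branch is reproduced.

```python
from pathlib import Path
from typing import Mapping


def resolve_config_dir(env: Mapping[str, str], system: str, sub_dir: str = "Ultralytics") -> Path:
    """Model of `os.getenv("YOLO_CONFIG_DIR") or get_user_config_dir()`."""
    override = env.get("YOLO_CONFIG_DIR")
    if override:
        # Fix 2: the environment variable short-circuits the OS check entirely.
        return Path(override)
    if system == "Linux":
        # Fix 1: with uname reporting "Linux", this branch is taken.
        return Path.home() / ".config" / sub_dir
    raise ValueError(f"Unsupported operating system: {system}")


print(resolve_config_dir({"YOLO_CONFIG_DIR": "/.config"}, "Nanos"))
try:
    resolve_config_dir({}, "Nanos")
except ValueError as exc:
    print(exc)
```

The environment variable affects only Ultralytics, while changing the uname result affects every library that inspects the platform name, so the first fix is broader and the second is more targeted.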
    

Failure to parse processor information from /proc/cpuinfo

Error in cpuinfo: failed to parse processor information from /proc/cpuinfo
Traceback (most recent call last):
  File "//main.py", line 3, in <module>
    model = YOLO("yolo11n.pt")
  File "/.local/lib/python3.10/site-packages/ultralytics/models/yolo/model.py", line 23, in __init__
    super().__init__(model=model, task=task, verbose=verbose)
  File "/.local/lib/python3.10/site-packages/ultralytics/engine/model.py", line 145, in __init__
    self._load(model, task=task)
  File "/.local/lib/python3.10/site-packages/ultralytics/engine/model.py", line 285, in _load
    self.model, self.ckpt = attempt_load_one_weight(weights)
  File "/.local/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 912, in attempt_load_one_weight
    model = (ckpt.get("ema") or ckpt["model"]).to(device).float()  # FP32 model
  File "/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in float
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
  File "/.local/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 258, in _apply
    self = super()._apply(fn)
  File "/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in <lambda>
    return self._apply(lambda t: t.float() if t.is_floating_point() else t)
RuntimeError: Failed to initialize cpuinfo!

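This error is caused by the missing /proc/cpuinfo file. Before running, PyTorch checks basic CPU information through the cpuinfo library, which reads it from /proc/cpuinfo. By design, a unikernel has little need for the proc filesystem, so the file does not exist. We can supply it manually.

  1. Create the directory and copy the file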

    mkdir proc
    cp /proc/cpuinfo ./proc
    
  2. Modify the configuration file to add the mapping

     "Dirs": [
         "nvidia",
         ".local",
         "usr",
         "proc"
     ]
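
Before rebuilding the image, it can be worth confirming that the copied file actually contains processor entries, since an empty or truncated copy would reproduce the same parse failure. A minimal sketch, assuming the x86 /proc/cpuinfo layout with one "processor : N" line per logical CPU; count_processors is a hypothetical helper.

```python
from pathlib import Path


def count_processors(cpuinfo_text: str) -> int:
    """Count the 'processor : N' entries in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


if __name__ == "__main__":
    copied = Path("proc/cpuinfo")
    if copied.exists():
        print(f"{copied}: {count_processors(copied.read_text())} processor entries")
```

Keep in mind that the copied file is a static snapshot of the host, so the CPU count and flags it reports may differ from what the Nanos guest is actually given.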
    

Final result

With the errors above resolved, the project directory looks like this:

.
├── .local
├── config.json
├── klibs
│   └── gpu_nvidia
├── main.py
├── nvidia
│   ├── 535.113.01
│   │   ├── gsp_ga10x.bin
│   │   └── gsp_tu10x.bin
│   └── LICENSE
├── proc
│   └── cpuinfo
└── usr
    └── lib
        ├── libbsd.so.0
        ├── libbz2.so.1.0
        ├── libcuda.so.1
        ├── libexpat.so.1
        ├── libffi.so.7
        ├── ......

The configuration file is as follows:

{
    "KlibDir": "./klibs",
    "Klibs": [
        "gpu_nvidia"
    ],
    "RunConfig": {
        "GPUs": 1
    },
    "Dirs": [
        "nvidia",
        ".local",
        "usr",
        "proc"
    ],
    "Args": [
        "main.py"
    ],
    "BaseVolumeSz": "6g",
    "Env": {
        "YOLO_CONFIG_DIR": "/.config"
    }
}
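
Since every entry in "Dirs" must exist on the host when the image is built, a quick check before invoking ops can catch a forgotten directory. The sketch below assumes the entries are relative to the directory containing config.json, which matches the layout used in this article; missing_dirs is a hypothetical helper.

```python
import json
from pathlib import Path


def missing_dirs(config_path: Path) -> list[str]:
    """Return the "Dirs" entries from an ops config that do not exist next to it."""
    config = json.loads(config_path.read_text())
    root = config_path.parent
    return [entry for entry in config.get("Dirs", []) if not (root / entry).is_dir()]


if __name__ == "__main__":
    config_file = Path("config.json")
    if config_file.exists():
        absent = missing_dirs(config_file)
        print("all mapped directories exist" if not absent else f"missing: {absent}")
```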

Running

ops pkg load eyberg/python:3.10.6 -c config.json -n
running local instance
booting /root/.ops/images/python3.10 ...
[0.263430] en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (circleci@690d85c32591)  Wed Nov 27 02:11:20 AM UTC 2024
Loaded the UVM driver, major device number 0.
[2.106690] en1: assigned FE80::B49A:8BFF:FE75:215C
Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/.config/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
Downloading https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11n.pt to 'yolo11n.pt'...
100%|██████████| 5.35M/5.35M [00:00<00:00, 9.39MB/s]

Downloading https://ultralytics.com/images/bus.jpg to 'bus.jpg'...
100%|██████████| 134k/134k [00:00<00:00, 835kB/s]
image 1/1 /bus.jpg: 640x480 4 persons, 1 bus, 72.3ms
Speed: 1.8ms preprocess, 72.3ms inference, 235.7ms postprocess per image at shape (1, 3, 640, 480)
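
The per-stage timings in the last line can be summed to get the end-to-end latency per image. The following sketch parses that line as printed above; stage_times_ms is a hypothetical helper and the regular expression assumes the "<number>ms <stage>" format shown in this log.

```python
import re


def stage_times_ms(speed_line: str) -> dict[str, float]:
    """Extract '<number>ms <stage>' pairs from an Ultralytics 'Speed:' log line."""
    return {stage: float(ms) for ms, stage in re.findall(r"([\d.]+)ms (\w+)", speed_line)}


line = (
    "Speed: 1.8ms preprocess, 72.3ms inference, 235.7ms postprocess "
    "per image at shape (1, 3, 640, 480)"
)
times = stage_times_ms(line)
total = sum(times.values())
print(f"total: {total:.1f} ms, inference share: {times['inference'] / total:.0%}")
# prints "total: 309.8 ms, inference share: 23%"
```

In this single run, postprocessing takes roughly three quarters of the total time, so one image is not enough to judge steady-state inference speed.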