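Fast and Portable Llama2 Inference on the Heterogeneous Edge

The Rust + Wasm stack can be a powerful alternative to Python for AI inference.

Compared with Python, a Rust + Wasm application can be 1/100 of the size and 100x the speed. Most importantly, it can run securely anywhere with full hardware acceleration, without any change to the binary code. Rust is the language of AGI[1].

We created a very simple Rust program[2] (40 lines of code) that runs inference on llama2 models at native speed. When compiled to Wasm, the binary application[3] (only 2MB) is fully portable across devices with heterogeneous hardware accelerators. The Wasm runtime (WasmEdge[4]) also provides a safe and secure execution environment for the cloud. In fact, the WasmEdge Runtime works seamlessly with container tools[5] to orchestrate and execute the portable application across many different devices.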

Video: running the Llama2 large language model with a 2MB application


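This work is based on the llama.cpp project[8] created by Georgi Gerganov[7]. We adapted the original C++ program to run on Wasm. It works with model files in the GGUF format[9].

Run the Llama 2 model on your own computer

Step 1. Install WasmEdge and the GGML plugin

Use the following command on a Linux or Mac (M1/M2) computer to install everything. See here for more details[10].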

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml

Step 2. Download the prebuilt Wasm application and the model

curl -LO https://github.com/second-state/WasmEdge-WASINN-examples/raw/master/wasmedge-ggml-llama-interactive/wasmedge-ggml-llama-interactive.wasm

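You also need to download a llama2 model in GGUF format. The example below downloads a llama2 7B model that is tuned for chat and quantized to 5-bit weights (see here for more information[11]).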

curl -LO https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf
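
As a rough sanity check on the download size, the weight storage of a quantized model can be estimated as parameters × bits per weight / 8. The sketch below is illustrative only. The function names are hypothetical, and the figure of 7 billion parameters is a round-number assumption. Real GGUF files are somewhat larger, because quantization formats such as Q5_K_M also store per-block scale factors and keep some tensors at higher precision.

```rust
/// Estimate the storage needed for model weights, in bytes.
/// `params` is the number of weights; `bits_per_weight` is the nominal quantization width.
/// Per-block scale factors and file metadata are ignored, so treat this as a lower bound.
fn approx_weight_bytes(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0
}

/// Convert bytes to gibibytes (1 GiB = 2^30 bytes).
fn bytes_to_gib(bytes: f64) -> f64 {
    bytes / (1024.0 * 1024.0 * 1024.0)
}
```

With 7 billion parameters at 5 bits each, `approx_weight_bytes(7.0e9, 5.0)` gives about 4.4 GB (roughly 4.07 GiB), which is in the same range as the 4560.96 MB that the model loader reports in the log output later in this article.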

Step 3. Run it!

Use WasmEdge to run the Wasm inference application and pass the GGUF model to it. You can now type questions to chat with the model.

wasmedge --dir .:. --nn-preload default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf wasmedge-ggml-llama-interactive.wasm default

Question:
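
The --nn-preload value in the command above is a colon-separated string, and its first field (default) is the same name that is passed to the Wasm application as its last argument. A minimal sketch of how such a string could be split into its parts is shown below. The struct, the function, and the field names (alias, encoding, target, path) are illustrative labels chosen here, not WasmEdge identifiers, and this is not WasmEdge's actual parser.

```rust
/// The four colon-separated parts of an `--nn-preload` value.
/// Field names are illustrative labels, not WasmEdge identifiers.
#[derive(Debug, PartialEq)]
struct PreloadSpec {
    alias: String,    // name the Wasm app refers to, e.g. "default"
    encoding: String, // model backend, e.g. "GGML"
    target: String,   // execution target, e.g. "CPU"
    path: String,     // model file path
}

/// Split `alias:encoding:target:path` into a `PreloadSpec`.
/// `splitn(4, ':')` keeps any further colons inside the path intact.
/// Returns `None` when fewer than four fields are present.
fn parse_preload(spec: &str) -> Option<PreloadSpec> {
    let mut parts = spec.splitn(4, ':');
    let alias = parts.next()?.to_string();
    let encoding = parts.next()?.to_string();
    let target = parts.next()?.to_string();
    let path = parts.next()?.to_string();
    if alias.is_empty() || path.is_empty() {
        return None;
    }
    Some(PreloadSpec { alias, encoding, target, path })
}
```

For the command above, parse_preload("default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf") yields the alias default, the encoding GGML, the target CPU, and the model file name as the path.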

Here is a complete conversation example.

wasmedge --dir .:. --nn-preload default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf wasmedge-ggml-llama-interactive.wasm default

Question:
Who is the "father of the atomic bomb"?
Answer:
The "father of the atomic bomb" is a term commonly associated with physicist J. Robert Oppenheimer. Oppenheimer was a leading figure in the development of the atomic bomb during World War II, serving as the director of the Manhattan Project, which was responsible for the development and deployment of the first nuclear weapons. He is often referred to as the "father of the atomic bomb" due to his significant contributions to the field of nuclear physics and his leadership in the development of the bomb.
Question:
Was he a communist?
Answer:
J. Robert Oppenheimer was not a communist. While he was associated with the Manhattan Project, which was a secret government project, and was involved in the development of the atomic bomb, he was not a member of the Communist Party or had any known political affiliations with communism. Oppenheimer was a physicist and a scientist, and his primary focus was on the development of nuclear weapons during World War II.

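Optional: configure the model

You can use environment variables to configure model execution.

For example, options such as a context length of 4k tokens, which is the standard for llama2, and a maximum of 1k tokens per response can be set this way. The command below sets LLAMA_LOG=1, which tells WasmEdge to print the model's logs and statistics at the runtime level.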

LLAMA_LOG=1 wasmedge --dir .:. --nn-preload default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf wasmedge-ggml-llama-interactive.wasm default

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
... ...
llm_load_tensors: mem required  = 4560.96 MB (+  256.00 MB per state)
...................................................................................................
Question:
Who is the "father of the atomic bomb"?
llama_new_context_with_model: kv self size  =  256.00 MB
... ...
llama_print_timings:      sample time =     3.35 ms /   104 runs   (    0.03 ms per token, 31054.05 tokens per second)
llama_print_timings: prompt eval time =  4593.10 ms /    54 tokens (   85.06 ms per token,    11.76 tokens per second)
llama_print_timings:        eval time =  3710.33 ms /   103 runs   (   36.02 ms per token,    27.76 tokens per second)
Answer:
The "father of the atomic bomb" is a term commonly associated with physicist J. Robert Oppenheimer. Oppenheimer was a leading figure in the development of the atomic bomb during World War II, serving as the director of the Manhattan Project, which was responsible for the development and deployment of the first nuclear weapons. He is often referred to as the "father of the atomic bomb" due to his significant contributions to the field of nuclear physics and his leadership in the development of the bomb.
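
The per-token figures in the llama_print_timings lines follow directly from the total time and the token count on each line. The short sketch below reproduces that arithmetic; the function names are hypothetical and are not part of llama.cpp.

```rust
/// Average milliseconds per token, given total elapsed time and token count.
fn ms_per_token(total_ms: f64, tokens: u32) -> f64 {
    total_ms / tokens as f64
}

/// Throughput in tokens per second for the same measurement.
fn tokens_per_second(total_ms: f64, tokens: u32) -> f64 {
    tokens as f64 / (total_ms / 1000.0)
}
```

For the eval line in the log (3710.33 ms over 103 runs), these give about 36.02 ms per token and 27.76 tokens per second, matching the reported values. The prompt eval line (4593.10 ms over 54 tokens) works out to about 11.76 tokens per second in the same way.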

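Llama on the edge. Image generated by Midjourney.

Why not Python?

Large language models like llama2 are typically trained using Python (e.g., PyTorch, TensorFlow, and JAX). But using Python for inference applications, which account for about 95% of the compute in AI, would be a serious mistake.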

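Python packages have complex dependencies[12]. They are difficult to set up and use.
Python dependencies are huge. Docker images for Python or PyTorch are typically several GBs[13] or even tens of GBs[14]. That is especially problematic for AI inference on edge servers or devices.
Python is a very slow language, up to 35,000x slower than compiled languages such as C, C++, and Rust.[15]
Because Python is slow, most of the actual workload must be delegated to[16] native shared libraries beneath the Python wrappers. That makes Python inference applications great for demos but very hard[17] to modify under the hood for business-specific needs.
The heavy dependency on native libraries, together with complex dependency management, makes it very hard to port Python AI programs across devices while still taking advantage of each device's unique hardware features.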

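Commonly used Python packages in the LLM toolchain directly conflict with each other.

Chris Lattner[18], known for LLVM, TensorFlow, and the Swift language, gave a great interview[19] on a recent startup podcast. He discussed why Python is great for model training but the wrong choice for inference applications.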

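Advantages of Rust + Wasm

The Rust + Wasm stack provides a unified cloud computing infrastructure that spans devices, edge clouds, on-premises servers, and the public cloud. It is a strong alternative to the Python stack for AI inference applications. No wonder Elon Musk said that Rust is the language of AGI.

Ultra lightweight. The inference application is just 2MB with all dependencies included. That is less than 1% of the size of a typical PyTorch container.
Very fast. Native C/Rust speed applies to every part of the inference application: pre-processing, tensor computation, and post-processing.
Portable. The same Wasm bytecode application runs on all major computing platforms, with support for heterogeneous hardware acceleration.
Easy to set up, develop, and deploy. There are no more complex dependencies. Build a single Wasm file with standard tools on your laptop and deploy it everywhere!
Safe and cloud-ready. The Wasm runtime is designed to isolate untrusted user code. It can be managed by container tools and easily deployed on cloud-native platforms.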

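The Rust inference program

Our demo inference program is written in Rust and compiled to Wasm. The Rust source code[20] is very simple, at just 40 lines. The program manages user input, tracks the conversation history, converts the text into llama2's chat template, and runs inference operations using the WASI NN standard API[21].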


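use std::env;
use std::io;

// Helper that the listing calls but does not show: read one non-blank line from stdin.
fn read_input() -> String {
    loop {
        let mut line = String::new();
        let n = io::stdin().read_line(&mut line).expect("Failed to read line");
        if n == 0 {
            std::process::exit(0); // EOF: end the chat
        }
        if !line.trim().is_empty() {
            return line;
        }
    }
}

fn main() {
    // The first argument is the model alias registered with --nn-preload (e.g. "default").
    let args: Vec<String> = env::args().collect();
    let model_name: &str = &args[1];

    // Look up the preloaded model by name and create an execution context for it.
    let graph =
        wasi_nn::GraphBuilder::new(wasi_nn::GraphEncoding::Ggml, wasi_nn::ExecutionTarget::AUTO)
            .build_from_cache(model_name)
            .unwrap();
    let mut context = graph.init_execution_context().unwrap();

    let system_prompt = String::from("<<SYS>>You are a helpful, respectful and honest assistant. Always answer as short as possible, while being safe. <</SYS>>");
    let mut saved_prompt = String::new();

    loop {
        println!("Question:");
        let input = read_input();
        // The system prompt is only included in the first turn of the conversation.
        if saved_prompt.is_empty() {
            saved_prompt = format!("[INST] {} {} [/INST]", system_prompt, input.trim());
        } else {
            saved_prompt = format!("{} [INST] {} [/INST]", saved_prompt, input.trim());
        }

        // Set prompt to the input tensor.
        let tensor_data = saved_prompt.as_bytes().to_vec();
        context
            .set_input(0, wasi_nn::TensorType::U8, &[1], &tensor_data)
            .unwrap();

        // Execute the inference.
        context.compute().unwrap();

        // Retrieve the output.
        let mut output_buffer = vec![0u8; 1000];
        let output_size = context.get_output(0, &mut output_buffer).unwrap();
        let output = String::from_utf8_lossy(&output_buffer[..output_size]).to_string();
        println!("Answer:\n{}", output.trim());

        // Append the answer so the next turn sees the whole conversation.
        saved_prompt = format!("{} {} ", saved_prompt, output.trim());
    }
}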
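
The way the listing above accumulates the conversation can be isolated into two small pure functions, which makes the prompt format easier to inspect and test without a model. This is a sketch that mirrors the format! calls in the listing; the function names are hypothetical and do not appear in the original source.

```rust
/// Append one user turn to the running prompt, using the same template as the listing.
/// The system prompt is only included when the history is still empty.
fn append_user_turn(saved_prompt: &str, system_prompt: &str, input: &str) -> String {
    if saved_prompt.is_empty() {
        format!("[INST] {} {} [/INST]", system_prompt, input.trim())
    } else {
        format!("{} [INST] {} [/INST]", saved_prompt, input.trim())
    }
}

/// Append the model's answer so that the next turn carries the full history.
fn append_answer(saved_prompt: &str, output: &str) -> String {
    format!("{} {} ", saved_prompt, output.trim())
}
```

Because every turn re-sends the entire history, the prompt grows with each exchange, so a long conversation will eventually approach the model's context length.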

To build the application yourself, just install the Rust compiler and add the wasm32-wasi compiler target.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup target add wasm32-wasi

Then check out the source project and run the Cargo command to build the Wasm file from the Rust source.

# Clone the source code
git clone https://github.com/second-state/WasmEdge-WASINN-examples/
cd WasmEdge-WASINN-examples/wasmedge-ggml-llama-interactive/

# Build the Rust program
cargo build --target wasm32-wasi --release

# Copy the resulting Wasm file to the current directory
cp target/wasm32-wasi/release/wasmedge-ggml-llama-interactive.wasm .

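Running in the cloud or on the edge

Once you have the Wasm bytecode file, you can deploy it on any device that supports the WasmEdge Runtime. The device only needs WasmEdge with the GGML plugin[22] installed. We currently have GGML plugins for generic Linux, Ubuntu Linux, and Mac M1/M2.

Built on llama.cpp[23], the WasmEdge GGML plugin automatically takes advantage of whatever hardware acceleration is available on the device to run your llama2 models. For example, the Mac OS build of the GGML plugin uses the Metal API to run the inference workload on the M1/M2's built-in GPU. The Linux CPU build of the GGML plugin uses the OpenBLAS library to detect and use advanced computational features of modern CPUs, such as SIMD instruction sets like AVX.

That is how we achieve portability across heterogeneous AI hardware and platforms without sacrificing performance.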
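
To illustrate what runtime detection of CPU features looks like, here is a small native Rust sketch that uses the standard library's is_x86_feature_detected! macro. It only illustrates the idea. It is not how OpenBLAS or the GGML plugin performs detection, the function name is hypothetical, and the code is meant to run natively on the host rather than inside Wasm.

```rust
/// Return the names of a few SIMD feature sets detected on the host CPU.
/// On non-x86 targets the list is simply empty.
fn detected_simd_features() -> Vec<&'static str> {
    #[allow(unused_mut)]
    let mut found = Vec::new();
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Each check queries the CPU at run time, not at compile time.
        if std::is_x86_feature_detected!("sse2") {
            found.push("sse2");
        }
        if std::is_x86_feature_detected!("avx") {
            found.push("avx");
        }
        if std::is_x86_feature_detected!("avx2") {
            found.push("avx2");
        }
    }
    found
}
```

A native library can use checks of this kind to choose a faster code path at startup, which is why the same Wasm file does not need to be rebuilt for each CPU: the hardware-specific choice happens in the host plugin.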

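What's next

While the WasmEdge GGML tooling is usable today (and is indeed used by our cloud-native customers), it is still in its early stages. If you are interested in contributing to the open source projects and shaping the direction of future LLM inference infrastructure, here is some low-hanging fruit you could contribute[24] to!

Add GGML plugins for more hardware and OS platforms. Nvidia CUDA is obviously an important target, and we will get to it soon. But we are also interested in TPUs, ARM NPUs, and other specialized AI chips on Linux and Windows.
Support more llama.cpp[25] configurations. We currently support passing some configuration options from Wasm to the GGML plugin. But we would like to support all the options that GGML provides!
Support the WASI NN API in other Wasm-compatible languages. We are particularly interested in Go, Zig, Kotlin, JavaScript, C, and C++.
Support text streaming in model results. As you can see, the current WASI NN standard API returns the entire inference result at once. We would like to create an alternative that returns the text one character at a time, for a typewriter-like experience.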
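
The typewriter effect mentioned in the last item can be imitated on the client side once the complete answer has arrived. The sketch below does only that: it replays a finished string one character at a time and does not stream anything from the model. The function name and the delay parameter are hypothetical.

```rust
use std::io::{self, Write};
use std::thread;
use std::time::Duration;

/// Write `text` to `out` one character at a time, flushing after each character
/// and pausing for `delay` in between, to imitate a typewriter.
fn typewrite<W: Write>(out: &mut W, text: &str, delay: Duration) -> io::Result<()> {
    let mut buf = [0u8; 4]; // large enough for any UTF-8 encoded char
    for ch in text.chars() {
        out.write_all(ch.encode_utf8(&mut buf).as_bytes())?;
        out.flush()?;
        if !delay.is_zero() {
            thread::sleep(delay);
        }
    }
    Ok(())
}
```

In the inference loop, a call such as typewrite(&mut io::stdout(), output.trim(), Duration::from_millis(20)) could replace the println! that prints the answer. The user still waits for the whole inference to finish first, which is exactly the limitation a streaming API would remove.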

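Other AI models

As a lightweight, fast, portable, and secure alternative to Python, WasmEdge and WASI NN can also be used to build inference applications around popular AI models other than large language models. For example,

The mediapipe-rs project[26] provides Rust + Wasm APIs for Google's mediapipe[27] suite of TensorFlow models.
The WasmEdge YOLO[28] project provides Rust + Wasm APIs for working with YOLOv8[29] PyTorch models.
The WasmEdge ADAS demo[30] shows how to perform road segmentation in self-driving cars using Intel's OpenVINO models.
The WasmEdge Document AI project will provide[31] Rust + Wasm APIs for a suite of popular OCR and document processing models.

Lightweight AI inference on the edge has only just begun!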

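About WasmEdge

WasmEdge is a lightweight, secure, high-performance, extensible, and OCI-compatible software container and runtime. It is currently a CNCF sandbox project. WasmEdge is used in SaaS, cloud native, service mesh, edge computing, edge cloud, microservices, stream data processing, and other areas.

GitHub: https://github.com/WasmEdge/WasmEdge

Website: https://wasmedge.org/

Discord: https://discord.gg/U4B5sFTkFc

Documentation: https://wasmedge.org/docs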

References

[1] Rust is the language of AGI: https://blog.stackademic.com/why-did-elon-musk-say-that-rust-is-the-language-of-agi-eb36303ce341
[2] Rust program: https://github.com/second-state/WasmEdge-WASINN-examples/tree/master/wasmedge-ggml-llama-interactive
[3] Binary application: https://github.com/second-state/WasmEdge-WASINN-examples/blob/master/wasmedge-ggml-llama-interactive/wasmedge-ggml-llama-interactive.wasm
[4] WasmEdge: https://github.com/WasmEdge/WasmEdge
[5] WasmEdge Runtime works seamlessly with container tools: https://wasmedge.org/docs/start/build-and-run/docker_wasm
[7] Georgi Gerganov: https://github.com/ggerganov
[8] llama.cpp project: https://github.com/ggerganov/llama.cpp
[9] GGUF format: https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md
[10] See here for more details: https://github.com/second-state/WasmEdge-WASINN-examples/blob/master/wasmedge-ggml-llama-interactive/README.md#requirement
[11] See here for more information: https://www.secondstate.io/articles/convert-pytorch-to-gguf/
[12] Python packages have complex dependencies: https://x.com/santiviquez/status/1676677829751177219
[13] Several GBs: https://hub.docker.com/r/pytorch/pytorch/tags
[14] Tens of GBs: https://github.com/pytorch/serve/issues/1420
[15] Up to 35,000x slower: https://www.modular.com/blog/how-mojo-gets-a-35-000x-speedup-over-python-part-1
[16] Delegated to: https://x.com/gdb/status/1676726449934331904
[17] Great for demos but very hard: https://podcasts.apple.com/ph/podcast/expanding-ai-chip-capabilities-beyond-nvidia-with/id315114957?i=1000627798935
[18] Chris Lattner: https://en.wikipedia.org/wiki/Chris_Lattner
[19] Great interview: https://www.youtube.com/watch?v=ap0VLOPyGqM
[20] Rust source code: https://github.com/second-state/WasmEdge-WASINN-examples/tree/master/wasmedge-ggml-llama-interactive
[21] WASI NN standard API: https://github.com/WebAssembly/wasi-nn
[22] WasmEdge with the GGML plugin: https://github.com/second-state/WasmEdge-WASINN-examples/tree/master/wasmedge-ggml-llama-interactive#requirement
[23] llama.cpp: https://github.com/ggerganov/llama.cpp
[24] Contribute: https://wasmedge.org/docs/contribute/overview
[25] llama.cpp: https://github.com/ggerganov/llama.cpp
[26] mediapipe-rs project: https://github.com/WasmEdge/mediapipe-rs
[27] mediapipe suite of TensorFlow models: https://developers.google.com/mediapipe
[28] WasmEdge YOLO: https://github.com/WasmEdge/WasmEdge/issues/2768
[29] YOLOv8: https://ultralytics.com/yolov8
[30] WasmEdge ADAS demo: https://github.com/second-state/WasmEdge-WASINN-examples/tree/master/openvino-road-segmentation-adas
[31] WasmEdge Document AI project: https://github.com/WasmEdge/WasmEdge/issues/2356

Page updated: 2024-02-24
