Turning Hermes into an API Service: Wrapping Inference with FastAPI

A step-by-step guide to wrapping a Hermes model as an OpenAI-compatible API service with FastAPI, including streaming output and concurrent request handling.

Contents

  1. Why wrap the model in an API
  2. Option 1: FastAPI + llama-cpp-python
  3. Option 2: vLLM (recommended for production)
  4. Adding authentication and rate limiting
  5. Request queuing and timeout handling
  6. Docker deployment
  7. Performance benchmarks

Why wrap the model in an API

Running Hermes locally is convenient for experiments, but as soon as multiple applications or users need the model, you need an API service.

Wrapping Hermes as an API brings several benefits:

  1. Unified interface — frontend, backend, and mobile clients all call the model through the same API
  2. Shared resources — one GPU server serves many clients at once
  3. OpenAI-compatible format — existing OpenAI SDK code works after changing only the base_url
  4. Observability — logging, monitoring, and rate limiting are easy to bolt on

Option 1: FastAPI + llama-cpp-python

A good fit for personal use or small internal team services where concurrency is modest.

pip install fastapi uvicorn llama-cpp-python sse-starlette
"""Hermes API server -- compatible with the OpenAI Chat Completions format."""
import json
import time
import uuid
from typing import Optional

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from sse_starlette.sse import EventSourceResponse
from llama_cpp import Llama

MODEL_PATH = "./Hermes-3-Llama-3.1-8B-Q5_K_M.gguf"
MODEL_NAME = "hermes-3-8b"

app = FastAPI(title="Hermes API", version="1.0.0")
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

# Load once at startup; n_gpu_layers=-1 offloads all layers to the GPU
llm = Llama(model_path=MODEL_PATH, n_ctx=8192, n_gpu_layers=-1, chat_format="chatml", verbose=False)


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    model: str = MODEL_NAME
    messages: list[Message]
    temperature: float = Field(default=0.7, ge=0, le=2)
    top_p: float = Field(default=0.9, ge=0, le=1)
    max_tokens: int = Field(default=1024, ge=1, le=8192)
    stream: bool = False
    stop: Optional[list[str]] = None


@app.get("/v1/models")
async def list_models():
    return {"object": "list", "data": [{"id": MODEL_NAME, "object": "model", "created": int(time.time()), "owned_by": "local"}]}


@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    try:
        messages = [{"role": m.role, "content": m.content} for m in request.messages]

        if request.stream:
            return EventSourceResponse(stream_response(messages, request), media_type="text/event-stream")

        response = llm.create_chat_completion(
            messages=messages, temperature=request.temperature,
            top_p=request.top_p, max_tokens=request.max_tokens, stop=request.stop
        )

        return {
            "id": f"chatcmpl-{uuid.uuid4().hex[:8]}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": request.model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": response["choices"][0]["message"]["content"]},
                "finish_reason": response["choices"][0].get("finish_reason", "stop")
            }],
            "usage": response["usage"]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


async def stream_response(messages, request):
    completion_id = f"chatcmpl-{uuid.uuid4().hex[:8]}"
    stream = llm.create_chat_completion(
        messages=messages, temperature=request.temperature,
        top_p=request.top_p, max_tokens=request.max_tokens, stop=request.stop, stream=True
    )
    for chunk in stream:
        delta = chunk["choices"][0].get("delta", {})
        finish_reason = chunk["choices"][0].get("finish_reason")
        data = {
            "id": completion_id, "object": "chat.completion.chunk",
            "created": int(time.time()), "model": request.model,
            "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}]
        }
        yield {"data": json.dumps(data, ensure_ascii=False)}
    yield {"data": "[DONE]"}


@app.get("/health")
async def health_check():
    return {"status": "ok", "model": MODEL_NAME}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Calling it with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming
response = client.chat.completions.create(
    model="hermes-3-8b",
    messages=[
        {"role": "system", "content": "You are a Python expert."},
        {"role": "user", "content": "Explain what a decorator is."}
    ]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="hermes-3-8b",
    messages=[{"role": "user", "content": "Write a poem about spring."}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

That is the payoff of the OpenAI-compatible format: existing code switches to a local Hermes model by changing only the base_url.
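If a client cannot use the OpenAI SDK, the SSE stream can also be consumed by hand. Below is a minimal sketch of reassembling the `chat.completion.chunk` payloads the server emits; the `reassemble_sse` helper and the sample lines are illustrative, not part of the server code:

```python
import json

def reassemble_sse(lines):
    """Collect the assistant text from a sequence of SSE 'data:' payloads,
    stopping at the [DONE] sentinel."""
    parts = []
    for line in lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

# Illustrative chunks in the shape produced by stream_response above
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": null}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": ", world"}, "finish_reason": "stop"}]}',
    'data: [DONE]',
]
print(reassemble_sse(sample))  # Hello, world
```

In a real client the lines would come from iterating over the HTTP response body instead of a hard-coded list.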

Option 2: vLLM (recommended for production)

vLLM ships with a built-in OpenAI-compatible API server, so you don't even need to write any code:

python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --host 0.0.0.0 --port 8000 \
    --dtype bfloat16 --max-model-len 8192 \
    --gpu-memory-utilization 0.85

vLLM's core advantage is continuous batching: multiple concurrent requests are dynamically merged into a single batch and inferred together, keeping GPU utilization far higher than serial processing. Under heavy concurrency its throughput can be 5-10x that of llama-cpp-python.

For choosing a quantization scheme, see the complete Hermes quantization guide.

Adding authentication and rate limiting

from fastapi import Depends, HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from collections import defaultdict
import time

security = HTTPBearer()
VALID_API_KEYS = {
    "sk-hermes-key-001": {"name": "frontend-app", "rate_limit": 60},
    "sk-hermes-key-002": {"name": "backend-service", "rate_limit": 200},
}
request_counts = defaultdict(list)  # api_key -> request timestamps in the window

def verify_api_key(credentials: HTTPAuthorizationCredentials = Security(security)):
    api_key = credentials.credentials
    if api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")

    config = VALID_API_KEYS[api_key]
    now = time.time()
    # Sliding 60-second window: drop timestamps older than a minute
    request_counts[api_key] = [t for t in request_counts[api_key] if t > now - 60]

    if len(request_counts[api_key]) >= config["rate_limit"]:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    request_counts[api_key].append(now)
    return api_key
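The sliding-window logic can be exercised in isolation to see the cutoff behavior. A self-contained sketch; `allow_request` and the fixed timestamps are illustrative, not part of the server above:

```python
from collections import defaultdict

request_counts = defaultdict(list)

def allow_request(api_key: str, rate_limit: int, now: float) -> bool:
    """Sliding 60-second window: keep only timestamps from the last minute,
    reject once the window already holds rate_limit entries."""
    window = [t for t in request_counts[api_key] if t > now - 60]
    request_counts[api_key] = window
    if len(window) >= rate_limit:
        return False
    window.append(now)
    return True

# With a limit of 3 requests per minute, the 4th call in the same minute is rejected
results = [allow_request("sk-demo", 3, now=100.0 + i) for i in range(4)]
print(results)  # [True, True, True, False]
```

In the actual server the same check runs as a FastAPI dependency, e.g. `api_key: str = Depends(verify_api_key)` in the endpoint signature.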

Request queuing and timeout handling

import asyncio
from contextlib import asynccontextmanager

from fastapi import HTTPException

inference_semaphore = asyncio.Semaphore(2)  # at most 2 inferences in flight
MAX_QUEUE_SIZE = 50
request_queue_size = 0

@asynccontextmanager
async def managed_inference():
    global request_queue_size
    request_queue_size += 1
    if request_queue_size > MAX_QUEUE_SIZE:
        request_queue_size -= 1
        raise HTTPException(status_code=503, detail="Server busy, please retry later")
    try:
        async with asyncio.timeout(120):  # asyncio.timeout requires Python 3.11+
            await inference_semaphore.acquire()
            try:
                yield
            finally:
                inference_semaphore.release()
    except asyncio.TimeoutError:
        raise HTTPException(status_code=504, detail="Request timed out")
    finally:
        request_queue_size -= 1
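The semaphore's effect can be verified with a toy workload: even when many tasks start at once, at most two run concurrently. A self-contained sketch, where `fake_inference` is an illustrative stand-in for a model call:

```python
import asyncio

async def main():
    semaphore = asyncio.Semaphore(2)  # same limit as inference_semaphore
    running = 0
    peak = 0

    async def fake_inference(i):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)   # record the highest concurrency seen
            await asyncio.sleep(0.01)   # stand-in for model inference
            running -= 1

    await asyncio.gather(*(fake_inference(i) for i in range(10)))
    return peak

peak = asyncio.run(main())
print(peak)  # 2: never more than two "inferences" in flight
```

In the server, the endpoint body would wrap the actual inference call in `async with managed_inference():` to get the same bound plus queue-overflow and timeout handling.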

Docker deployment

FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
WORKDIR /app
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY api_server.py .
COPY models/ ./models/
EXPOSE 8000
CMD ["python3", "api_server.py"]
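The requirements.txt the Dockerfile copies is not shown in the post; a plausible version listing the packages installed earlier (unpinned here, but pinning versions is advisable in production):

```
fastapi
uvicorn
llama-cpp-python
sse-starlette
```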
# docker-compose.yml
version: '3.8'
services:
  hermes-api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Performance benchmarks

API serving performance for Hermes-3 8B (single request, 256 output tokens):

Hardware         Quantization          First-token latency   Throughput (tokens/s)
RTX 3060 12GB    Q4_K_M (llama.cpp)    ~200ms                ~25-35
RTX 4090 24GB    Q5_K_M (llama.cpp)    ~80ms                 ~60-80
RTX 4090 24GB    FP16 (vLLM)           ~50ms                 ~80-100
A100 40GB        AWQ (vLLM)            ~30ms                 ~120-150
A100 80GB        FP16 (vLLM)           ~25ms                 ~130-160

Which option to pick depends on your use case: FastAPI + llama-cpp-python is enough for individuals and small teams; for public-facing services or high concurrency, use vLLM. For Hermes's core capabilities and the differences between versions, see the Hermes getting-started guide.

The cocoloop community also has plenty of hands-on deployment write-ups and pitfall notes worth browsing when you hit a concrete problem.

Join the discussion

Questions or thoughts about this article? Plenty of developers discuss Hermes-related topics in the cocoloop community. Come join the conversation.

Go to the cocoloop community →