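Running an AI Agent on a $5 VPS: An Ultra-Low-Cost Hermes Deployment

Who Says AI Has to Burn Money

Mention running large models and most people picture a server room lined with A100 GPUs, fans roaring and the power meter spinning.

In practice, many AI use cases need nowhere near that much compute. If all you want is an AI assistant that stays online around the clock and handles everyday tasks such as answering questions, analyzing text, and running a few scripts automatically, a $5 VPS is enough.

This article shows how to get Hermes Agent running on a cheap server with 1 core and 1 GB of RAM, in a form that is actually usable rather than a toy.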

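Overview: Don't Run the Model Locally

First, drop one assumption: deploying an AI agent on a VPS does not mean running the model on that VPS.

A 1-core, 1 GB server really cannot run a large model (even a small 1.5B quantized model is a stretch). But you can split the work like this:

VPS: runs the agent framework, handles user requests, executes tool calls, and maintains conversation state
Model inference: handled by a free or low-cost remote API

With this architecture the VPS only runs a single Python process that uses a few tens of MB of memory, which leaves plenty of headroom.

Of course, if your VPS is a bit larger (2 cores, 4 GB), you can also consider running a tiny quantized model locally. I'll cover both options.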

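Option A: VPS + Free API

Free Inference APIs

There are currently several ways to call Hermes models for free or at very low cost:

Hugging Face Inference API: comes with a monthly free quota and supports Hermes models. Speed is mediocre, but adequate for non-real-time use.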

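Cloudflare Workers AI: offers a free daily allocation (10,000 Neurons per day at the time of writing; a Neuron is a unit of compute, not a request count) and supports a range of open-source models. Latency is low, which suits real-time chat.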

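OpenRouter: aggregates multiple model providers, and some models have a free tier. Hermes 3 is available there.

Local quantization + remote calls: if you have a spare computer at home, run the model there and have the VPS call it remotely over an API.

Setup Steps

Step 1: Buy a VPS

A few inexpensive choices: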

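Provider	Price	Specs	Notes
RackNerd	$10/year	1 core, 1 GB	Frequent promotions
BandwagonHost	$50/year	1 core, 1 GB	Good stability
Vultr	$5/month	1 core, 1 GB	Billed hourly
DigitalOcean	$4/month	512 MB	Lowest-tier Droplet

Pick Ubuntu 22.04 or Debian 12. Other distributions work too, but these two have the best community support.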

Step 2: Base Environment Setup

# Update the system
apt update && apt upgrade -y

# Install Python and the required tools
apt install -y python3 python3-pip python3-venv git nginx

# Create the project directory
mkdir -p /opt/hermes-agent
cd /opt/hermes-agent

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install fastapi uvicorn httpx pydantic

Step 3: Write the Agent Core

# agent.py
import httpx
import json
from datetime import datetime

class HermesAgent:
    def __init__(self, api_config):
        self.api_url = api_config["url"]
        self.api_key = api_config.get("key", "")
        self.model = api_config["model"]
        self.conversations = {}
        self.tools = self._register_tools()

    def _register_tools(self):
        return {
            "get_time": {
                "description": "Get the current time",
                "function": lambda: datetime.now().strftime(
                    "%Y-%m-%d %H:%M:%S"
                )
            },
            "web_search": {
                "description": "Search the internet for information",
                "function": self._web_search
            },
            "run_command": {
                "description": "Run a system command",
                "function": self._safe_command
            }
        }

    async def _call_model(self, messages):
        """Call the remote inference API"""
        async with httpx.AsyncClient(timeout=60) as client:
            headers = {}
            if self.api_key:
                headers["Authorization"] = f"Bearer {self.api_key}"

            resp = await client.post(
                self.api_url,
                headers=headers,
                json={
                    "model": self.model,
                    "messages": messages,
                    "temperature": 0.7,
                    "max_tokens": 2048
                }
            )
            # Fail loudly on HTTP errors (quota exceeded, bad key, etc.)
            resp.raise_for_status()
            data = resp.json()
            return data["choices"][0]["message"]["content"]

    async def _web_search(self, query):
        """Simple web search (using a free search API)"""
        async with httpx.AsyncClient() as client:
            resp = await client.get(
                "https://api.duckduckgo.com/",
                params={"q": query, "format": "json"}
            )
            data = resp.json()
            results = []
            for item in data.get("RelatedTopics", [])[:3]:
                if "Text" in item:
                    results.append(item["Text"])
            return "\n".join(results) if results else "No relevant results found"

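    def _safe_command(self, cmd):
        """Safe command execution (exact-match whitelist)"""
        import shlex
        import subprocess
        # Exact matches only. A prefix check combined with shell=True
        # would let input like "date; rm -rf /" through.
        allowed_commands = {
            "date", "uptime", "df", "df -h", "free", "free -m",
            "cat /proc/loadavg", "whoami"
        }
        try:
            argv = shlex.split(cmd)
        except ValueError:
            return "Command not allowed"
        if " ".join(argv) not in allowed_commands:
            return "Command not allowed"
        result = subprocess.run(
            argv,  # argument list, no shell involved
            capture_output=True, text=True,
            timeout=10
        )
        return result.stdout or result.stderr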

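    async def chat(self, session_id, user_input):
        """Handle one turn of a user conversation"""
        if session_id not in self.conversations:
            self.conversations[session_id] = [
                {"role": "system", "content": """You are Hermes Agent,
an AI assistant running on a server. You can:
- Answer all kinds of questions
- Search the internet for up-to-date information
- Run basic system commands to check server status
Keep answers concise and practical, with no filler."""}
            ]

        history = self.conversations[session_id]
        history.append({"role": "user", "content": user_input})

        # Keep the history at a reasonable length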
        if len(history) > 21:
            history = [history[0]] + history[-20:]
            self.conversations[session_id] = history

        response = await self._call_model(history)
        history.append({"role": "assistant", "content": response})

        return response
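
The agent above registers tools in `self.tools`, but `chat` never invokes them. Below is a minimal sketch of how tool calls could be wired in. It assumes the model emits Hermes-style `<tool_call>{...}</tool_call>` blocks whose JSON has `name` and `arguments` keys; check the model card for the exact format your model uses. The helper names `parse_tool_calls` and `dispatch_tool` are hypothetical and not part of the original code.

```python
import inspect
import json
import re

# Assumed format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract (name, arguments) pairs from a model reply; skip malformed JSON."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and "name" in payload:
            calls.append((payload["name"], payload.get("arguments") or {}))
    return calls


async def dispatch_tool(tools, name, arguments):
    """Run a registered tool by name; handles both sync and async functions."""
    entry = tools.get(name)
    if entry is None:
        return f"Unknown tool: {name}"
    result = entry["function"](**arguments)
    if inspect.isawaitable(result):
        result = await result
    return str(result)
```

Inside `chat`, one possible flow is: call the model, run `parse_tool_calls` on the reply, pass each call to `dispatch_tool(self.tools, name, arguments)`, append the results to the history, and call the model once more for the final answer.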

Step 4: Web Service

# server.py
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

from agent import HermesAgent

app = FastAPI()

# Configure the remote API (OpenRouter as an example)
agent = HermesAgent({
    "url": "https://openrouter.ai/api/v1/chat/completions",
    "key": "your-api-key-here",
    "model": "nousresearch/hermes-3-llama-3.1-8b"
})

class ChatRequest(BaseModel):
    session_id: str
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    response = await agent.chat(req.session_id, req.message)
    return {"response": response}

@app.get("/health")
async def health():
    return {"status": "ok"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Step 5: Run as a systemd Service

# /etc/systemd/system/hermes-agent.service
[Unit]
Description=Hermes AI Agent
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/opt/hermes-agent
ExecStart=/opt/hermes-agent/venv/bin/python server.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target


systemctl daemon-reload
systemctl enable hermes-agent
systemctl start hermes-agent

Step 6: Nginx Reverse Proxy + HTTPS

# /etc/nginx/sites-available/hermes
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 120s;
    }
}


# Install certbot and set up free HTTPS
apt install -y certbot python3-certbot-nginx
certbot --nginx -d your-domain.com

At this point you have an AI agent service reachable over HTTPS. Total cost: VPS $5/month + domain $1/month + API calls $0 (within the free quota).

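Option B: Local Quantized Model (2 Cores / 4 GB and Up)

If your VPS is somewhat larger, or you don't want to depend on an external API, consider running a tiny quantized model locally.

What Models Can It Run

VPS specs	Runnable model	Inference speed
1 core, 1 GB	Essentially nothing	n/a
1 core, 2 GB	Qwen2 0.5B Q4	~5 tokens/s
2 cores, 4 GB	Hermes 3 8B Q2	~2 tokens/s
4 cores, 8 GB	Hermes 3 8B Q4	~5 tokens/s

Pay attention to the inference speed. What does 2 tokens/s mean in practice? Generating a 200-character answer (on the order of 100 tokens) takes about 50 seconds. It works, but the experience is mediocre. If you can live with that speed, or your use case doesn't need real-time responses (scheduled jobs, batch processing), it is entirely workable.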
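
The arithmetic behind that estimate can be sketched as follows. The function name is hypothetical, and the 100-token figure is an assumption implied by the 50-second estimate; real token counts depend on the tokenizer and the language of the reply.

```python
def estimate_generation_seconds(output_tokens, tokens_per_second):
    """Rough time to generate a reply, ignoring prompt processing time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Speeds taken from the table above; 100 output tokens is an assumed reply size
for label, speed in [("2 cores, 4 GB (Q2)", 2), ("4 cores, 8 GB (Q4)", 5)]:
    seconds = estimate_generation_seconds(100, speed)
    print(f"{label}: {seconds:.0f} s per 100-token reply")
```

This ignores the time spent processing the prompt, which grows with conversation history and can be significant on a slow CPU.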

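Deploying with llama.cpp

llama.cpp is one of the most efficient options for CPU inference today:

# Build llama.cpp (recent versions build with CMake; older checkouts used make)
apt install -y build-essential cmake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)

# Download the quantized model (Q2_K here, the smallest file)
wget https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF/resolve/main/Hermes-3-Llama-3.1-8B.Q2_K.gguf

# Start the API server
./build/bin/llama-server \
  -m Hermes-3-Llama-3.1-8B.Q2_K.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 2048 \
  -t 2 \
  --mlock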

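Parameter notes:

-c 2048: sets the context length to 2048 tokens. Larger values use more memory; on a 2 GB machine, don't go above 2048.
-t 2: sets the thread count; match it to the number of CPU cores.
--mlock: locks the model in memory so it isn't swapped out to disk.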
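
To see why the context length matters for memory, here is a rough estimate of the KV cache size. It is a sketch under stated assumptions: the defaults are the published Llama 3.1 8B architecture figures (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache. The function name is hypothetical, and the result excludes model weights and compute buffers.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Approximate KV cache size: keys and values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_ctx


for ctx in (1024, 2048, 8192):
    mib = kv_cache_bytes(ctx) / 1024**2
    print(f"-c {ctx}: about {mib:.0f} MiB of KV cache")
```

Under these assumptions, `-c 2048` costs about 256 MiB on top of the model weights, and `-c 8192` about 1 GiB, which is why small machines should keep the context short.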

Memory Optimization Tips

A 1 GB machine is genuinely tight; every MB counts.

Enable swap: swap is slow, but it prevents OOM kills:

# Create a 2 GB swap file
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab

Trim system services:

# Turn off services you don't need
systemctl disable snapd
systemctl disable unattended-upgrades
systemctl stop snapd

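Tune the Nginx configuration:

# /etc/nginx/nginx.conf, main context
worker_processes 1;

# Goes inside the events { } block
worker_connections 128;

# Reduce buffer sizes (goes inside the http { } or server { } block)
proxy_buffer_size 4k;
proxy_buffers 4 4k;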

Option C: Hybrid Architecture (Recommended)

The most cost-effective setup is actually a hybrid: handle simple tasks with the local small model and forward complex ones to the remote API.

import httpx

class HybridAgent:
    def __init__(self):
        self.local_url = "http://localhost:8080/v1/chat/completions"
        self.remote_url = "https://openrouter.ai/api/v1/chat/completions"
        self.remote_key = "your-key"

    async def route_request(self, messages, complexity="auto"):
        """Pick a model based on task complexity"""
        if complexity == "auto":
            # Crude complexity check: message length + task keywords
            last_msg = messages[-1]["content"]
            lowered = last_msg.lower()
            is_simple = (
                len(last_msg) < 200
                and "code" not in lowered
                and "analy" not in lowered
                and "translat" not in lowered
            )
            complexity = "simple" if is_simple else "complex"

        if complexity == "simple":
            return await self._call_local(messages)
        else:
            return await self._call_remote(messages)

    async def _call_local(self, messages):
        """Call the local small model"""
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(
                self.local_url,
                json={
                    "messages": messages,
                    "temperature": 0.7,
                    "max_tokens": 512
                }
            )
            return resp.json()["choices"][0]["message"]["content"]

    async def _call_remote(self, messages):
        """Call the remote large model"""
        async with httpx.AsyncClient(timeout=60) as client:
            resp = await client.post(
                self.remote_url,
                headers={
                    "Authorization": f"Bearer {self.remote_key}"
                },
                json={
                    "model": "nousresearch/hermes-3-llama-3.1-70b",
                    "messages": messages,
                    "temperature": 0.7,
                    "max_tokens": 2048
                }
            )
            return resp.json()["choices"][0]["message"]["content"]

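The benefit of this approach: everyday simple Q&A is handled by the local model and consumes no API quota, while complex tasks are automatically escalated to the remote large model to preserve answer quality.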
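
One gap in this routing is that a simple request fails outright if the local server is down or too slow. A minimal sketch of a fallback wrapper follows; `call_with_fallback` is a hypothetical helper, and it assumes both backends are async callables that take the message list, like `_call_local` and `_call_remote`.

```python
import asyncio


async def call_with_fallback(primary, fallback, messages, timeout=120):
    """Try the primary backend; on any error or timeout, use the fallback."""
    try:
        return await asyncio.wait_for(primary(messages), timeout=timeout)
    except Exception:
        # Covers connection errors, bad responses, and timeouts alike
        return await fallback(messages)
```

In `route_request`, the simple branch could then read `return await call_with_fallback(self._call_local, self._call_remote, messages)`, so a local outage costs some API quota instead of dropping the request.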

Security Hardening

Exposing an AI agent on the public internet means security cannot be ignored.

Basic Security Measures

# Change the SSH port
sed -i 's/#Port 22/Port 2222/' /etc/ssh/sshd_config
systemctl restart sshd

# Install fail2ban to block brute-force attempts
apt install -y fail2ban
systemctl enable fail2ban

# Configure the UFW firewall
ufw allow 2222/tcp
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable

API Authentication

Add simple token validation to the agent API:

from fastapi import Header, HTTPException

API_TOKEN = "your-secret-token-here"

# Replace the original /chat handler in server.py with this one.
# If both stay registered, the first (unauthenticated) route wins.
@app.post("/chat")
async def chat(
    req: ChatRequest,
    authorization: str = Header(...)
):
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401)
    response = await agent.chat(req.session_id, req.message)
    return {"response": response}
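
Two optional refinements to the token check are sketched below: reading the token from an environment variable instead of hard-coding it, and comparing in constant time so response timing leaks nothing about the token. The variable name `HERMES_API_TOKEN` and both helper names are assumptions made for illustration.

```python
import hmac
import os


def load_api_token(env_var="HERMES_API_TOKEN"):
    """Read the API token from the environment; refuse to start without one."""
    token = os.environ.get(env_var, "")
    if not token:
        raise RuntimeError(f"{env_var} is not set")
    return token


def is_authorized(authorization_header, token):
    """Constant-time check of an 'Authorization: Bearer <token>' header."""
    expected = f"Bearer {token}".encode()
    supplied = (authorization_header or "").encode()
    return hmac.compare_digest(supplied, expected)
```

The handler would then call `is_authorized(authorization, API_TOKEN)` and raise the 401 when it returns `False`.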

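Rate Limiting

Guard against abusive request floods:

from fastapi import Request
from fastapi.responses import JSONResponse
from collections import defaultdict
import time

request_counts = defaultdict(list)

@app.middleware("http")
async def rate_limit(request: Request, call_next):
    # Behind the Nginx proxy, request.client.host is always 127.0.0.1,
    # so prefer the X-Real-IP header set in the Nginx config
    client_ip = request.headers.get("x-real-ip") or request.client.host
    now = time.time()

    # Drop expired records
    request_counts[client_ip] = [
        t for t in request_counts[client_ip]
        if now - t < 60
    ]

    # At most 20 requests per minute
    if len(request_counts[client_ip]) >= 20:
        return JSONResponse(
            status_code=429,
            content={"error": "Too many requests, please try again later"}
        )

    request_counts[client_ip].append(now)
    return await call_next(request)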
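
The `request_counts` dictionary keeps an entry for every IP it has ever seen, which can add up on a small machine. The following sketch shows the same sliding-window idea as a class with an explicit `prune` step. The class name, its parameters, and the injectable `clock` are hypothetical additions, not part of the original code.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Per-client sliding-window limiter with bounded memory."""

    def __init__(self, max_requests=20, window_seconds=60, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self.hits = {}  # client key -> deque of request timestamps

    def allow(self, key):
        """Record a request and return True if it is within the limit."""
        now = self.clock()
        queue = self.hits.setdefault(key, deque())
        while queue and now - queue[0] >= self.window:
            queue.popleft()
        if len(queue) >= self.max_requests:
            return False
        queue.append(now)
        return True

    def prune(self):
        """Delete clients with no requests inside the window; return the count."""
        now = self.clock()
        stale = [k for k, q in self.hits.items()
                 if not q or now - q[-1] >= self.window]
        for key in stale:
            del self.hits[key]
        return len(stale)
```

The middleware can call `limiter.allow(client_ip)` in place of the list bookkeeping, and `limiter.prune()` can run every few minutes from a background task.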

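Real-World Results

I tested this for a month on a RackNerd $10/year VPS (1 core, 768 MB RAM). The numbers:

Average response time: 1.5-3 seconds using the remote API option
Daily request volume: roughly 50-100 conversations
Memory usage: Python process + Nginx, steady at around 200 MB
Monthly API cost: $0, within the Cloudflare Workers AI free allocation
Stability: zero downtime over the month

For personal use, this setup is entirely sufficient. If you need stronger local inference, see the local deployment approach in "Building a Private AI Study Assistant with Hermes" (用 Hermes 打造私人 AI 学习助手).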

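Cost Comparison

Option	Monthly cost	Inference capability	Latency	Best for
VPS + free API	$1-5	Strong (remote large model)	1-3 s	Personal assistant, small tools
VPS with local small model	$3-10	Weak (2B quantized)	5-30 s	Offline tasks, privacy-sensitive work
Hybrid	$3-10	Flexible	1-30 s	Recommended
Cloud GPU (A10)	$200+	Strong	<1 s	Production
ChatGPT Plus	$20	Strong	1-5 s	Personal use

The $5 setup pays off most over the long run. You own a fully controllable AI service endpoint that you can customize to your needs, free of the restrictions commercial services impose.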

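Final Thoughts

The core idea behind low-cost AI agent deployment is layering: separate inference from serving, use a free or low-cost remote API for inference, and a cheap VPS for serving. The architecture is flexible enough that you can swap the inference backend at any time. Moving from a free API to a self-hosted GPU server only changes the API configuration passed to HermesAgent; the agent code itself stays untouched.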
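
One way to make that backend swap a pure configuration change is to read the API settings from environment variables. This is a sketch only: the variable names `HERMES_API_URL`, `HERMES_API_KEY`, and `HERMES_MODEL` and the `load_api_config` helper are assumptions, and the defaults are the OpenRouter values used in server.py.

```python
import os

# Defaults mirror the OpenRouter example in server.py
DEFAULTS = {
    "url": "https://openrouter.ai/api/v1/chat/completions",
    "model": "nousresearch/hermes-3-llama-3.1-8b",
}


def load_api_config(env=None):
    """Build the HermesAgent config dict from environment variables."""
    env = os.environ if env is None else env
    return {
        "url": env.get("HERMES_API_URL", DEFAULTS["url"]),
        "key": env.get("HERMES_API_KEY", ""),
        "model": env.get("HERMES_MODEL", DEFAULTS["model"]),
    }
```

With this in place, server.py would create the agent with `HermesAgent(load_api_config())`, and pointing it at a different backend means editing `Environment=` lines in the systemd unit and restarting the service. It also keeps the API key out of the source file.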

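To learn more about the technical background and capability limits of Hermes models, read "What Is Hermes" (Hermes 是什么). If you have questions or want to share your own low-cost deployment experience, come chat in the cocoloop community.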
