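Tools
- ollama: downloads and manages models
- DeepSeek-R1: the LLM used in this guide
- Nomic-Embed-Text (embedding model): encodes the chunked text corpus into vectors for the vector store
I. Starting Ollama
1. Pull the image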
docker pull ollama/ollama:0.5.13-rc6
2. Configure Docker to use the GPU
1) Install nvidia-container-toolkit
To run on an Nvidia GPU, install nvidia-container-toolkit first:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
2) Configure Docker to use the Nvidia runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
3. Start the container
1) Start with docker run
CPU only:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama_1 ollama/ollama:0.5.13-rc6
With GPU:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama_1 ollama/ollama:0.5.13-rc6
2) Start with docker compose
docker-compose.yml
version: "3"
services:
  ollama1:
    image: ollama/ollama:0.5.13-rc6
    container_name: ollama_1
    restart: "no"
    ports:
      - 11434:11434
    volumes:
      - "./.ollama:/root/.ollama"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
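II. Installing deepseek-r1
Run the ollama commands in this section inside the container (for example after docker exec -it ollama_1 bash).
1. Install deepseek-r1:7b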
ollama run deepseek-r1:7b
2. Install nomic-embed-text
nomic-embed-text is a text embedding model: it converts text into vectors that can be compared for similarity.
ollama pull nomic-embed-text
3. List models
root@509d39be4053:/# ollama list
NAME                       ID              SIZE      MODIFIED
nomic-embed-text:latest    0a109f422b47    274 MB    26 hours ago
deepseek-r1:7b             0a8c26691023    4.7 GB    26 hours ago
deepseek-r1:1.5b           a42b25d8c10a    1.1 GB    26 hours ago
root@509d39be4053:/# ollama ps
NAME              ID              SIZE      PROCESSOR          UNTIL
deepseek-r1:7b    0a8c26691023    6.1 GB    58%/42% CPU/GPU    4 minutes from now
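The PROCESSOR column shows how the loaded model is split between CPU and GPU. A minimal sketch for reading that column in a monitoring script follows. The helper name parse_processor is hypothetical, and the two formats it handles ("58%/42% CPU/GPU" and a single value such as "100% GPU") are assumed from observed output rather than from a documented contract.

```python
def parse_processor(value: str) -> dict[str, int]:
    """Parse the PROCESSOR column of `ollama ps` into a percentage per device."""
    percents, _, devices = value.strip().partition(" ")
    shares = [int(p.rstrip("%")) for p in percents.split("/")]
    names = devices.split("/")
    if len(shares) != len(names):
        raise ValueError(f"unexpected PROCESSOR value: {value!r}")
    return dict(zip(names, shares))


# Value taken from the `ollama ps` output above
split = parse_processor("58%/42% CPU/GPU")
print(split)  # {'CPU': 58, 'GPU': 42}
```

A large CPU share usually means the model did not fit into GPU memory, which costs generation speed.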
4. Model information
1) deepseek-r1:7b does not support tools (no tools entry under Capabilities)
2) qwen3:4b supports tools
root@01c01ee81203:~# ollama show deepseek-r1:7b
  Model
    architecture        qwen2
    parameters          7.6B
    context length      131072
    embedding length    3584
    quantization        Q4_K_M

  Capabilities
    completion
    thinking

  Parameters
    stop    "<|begin▁of▁sentence|>"
    stop    "<|end▁of▁sentence|>"
    stop    "<|User|>"
    stop    "<|Assistant|>"

  License
    MIT License
    Copyright (c) 2023 DeepSeek
    ...
root@01c01ee81203:~# ollama show qwen3:4b
  Model
    architecture        qwen3
    parameters          4.0B
    context length      262144
    embedding length    2560
    quantization        Q4_K_M

  Capabilities
    completion
    tools
    thinking

  Parameters
    top_k             20
    top_p             0.95
    repeat_penalty    1
    stop              "<|im_start|>"
    stop              "<|im_end|>"
    temperature       0.6

  License
    Apache License
    Version 2.0, January 2004
    ...
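To check tool support from a script instead of reading the output by eye, the Capabilities section can be extracted from the text that ollama show prints. This is only a sketch: the function names are hypothetical, and the set of section headers is an assumption based on the two outputs above.

```python
# Section headers assumed from the `ollama show` output above
SECTION_HEADERS = {"Model", "Capabilities", "Parameters", "System", "License"}


def parse_capabilities(show_output: str) -> list[str]:
    """Return the entries listed under 'Capabilities' in `ollama show` text output."""
    capabilities = []
    in_section = False
    for raw in show_output.splitlines():
        line = raw.strip()
        if line in SECTION_HEADERS:
            # Entering or leaving a section
            in_section = line == "Capabilities"
            continue
        if in_section and line:
            capabilities.append(line)
    return capabilities


def supports_tools(show_output: str) -> bool:
    return "tools" in parse_capabilities(show_output)


# Shortened copies of the two outputs above
deepseek = "Model\narchitecture qwen2\nCapabilities\ncompletion\nthinking\nParameters\n"
qwen3 = "Model\narchitecture qwen3\nCapabilities\ncompletion\ntools\nthinking\nParameters\n"
print(supports_tools(deepseek), supports_tools(qwen3))  # False True
```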
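5. Building a local knowledge base
When first working with large language models (LLMs), their power is obvious, but so is a weakness: the model sometimes states nonsense with complete confidence. These hallucinations are seriously misleading in everyday use, and in a production environment the lack of accuracy can make a model unusable. Why does this happen? Put simply, the model is designed for general use and lacks the relevant domain knowledge, so its answers can be wrong. In Shannon's terms, reducing information entropy requires bringing in more information.
From this angle there are two options. The first is to train the model further on domain knowledge, or to fine-tune it. Training is enormously expensive, and fine-tuning needs newly labeled data and substantial compute, so neither is realistic for an individual. The second is to supply background knowledge along with the question so that the model can answer from it. The second option is the principle behind a knowledge base.
This approach is called RAG (Retrieval-Augmented Generation), a framework that combines retrieval with a generative model to improve the accuracy and relevance of generated content. The core idea: before generating an answer, retrieve relevant information from an external knowledge base, then combine the retrieved results with the user's input to guide the model toward a more reliable answer.
In short: build a vector knowledge base from existing documents and internal knowledge, and at question time pass the relevant entries to the model together with the question so that it answers more accurately. RAG combines information retrieval with LLM technology.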
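What are the benefits?
- Business knowledge stays on the local machine, which reduces the risk of data leaks;
- Questions are answered together with business knowledge, which reduces hallucinations;
- Answers draw on business knowledge and up-to-date information, so they can be more current;
- No retraining or fine-tuning is required, which keeps costs down.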
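The retrieve-then-generate flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular tool implements it: the function names are hypothetical, the "vector store" is a plain list, and the 3-dimensional vectors are toy values standing in for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: list[float], store: list[tuple[str, list[float]]], top_k: int = 2) -> list[str]:
    """Return the top_k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Toy knowledge base: (chunk text, made-up vector)
store = [
    ("Refunds are processed within 7 days.", [0.9, 0.1, 0.0]),
    ("The office is closed on Sundays.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded question
chunks = retrieve(query_vec, store, top_k=2)
print(build_prompt("How long does a refund take?", chunks))
```

In a real setup the vectors would come from the embedding model, and the resulting prompt would be sent to the LLM.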
III. A graphical interface for Ollama
1. page-assist
Page Assist can be installed as a Google Chrome extension; download it from https://github.com/n4ze3m/page-assist/releases
Ollama settings: [screenshot]
RAG settings: [screenshot]
Usage: [screenshot]
IV. GPU usage
Check GPU usage with nvidia-smi. With no model loaded:
zxm@zxm-pc:~$ nvidia-smi
Tue Mar  4 23:29:00 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1060 3GB    Off | 00000000:01:00.0  On |                  N/A |
| 35%   35C    P0              29W / 120W |    266MiB /  3072MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1781      G   /usr/lib/xorg/Xorg                          149MiB |
|    0   N/A  N/A      1927      G   /usr/bin/gnome-shell                         30MiB |
|    0   N/A  N/A     73461      G   ...seed-version=20250228-151446.092000       82MiB |
+---------------------------------------------------------------------------------------+
With Ollama running a model on the GPU:
zxm@zxm-pc:~$ nvidia-smi
Tue Mar  4 23:31:21 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1060 3GB    Off | 00000000:01:00.0  On |                  N/A |
| 35%   43C    P2              78W / 120W |   2339MiB /  3072MiB |     18%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1781      G   /usr/lib/xorg/Xorg                          149MiB |
|    0   N/A  N/A      1927      G   /usr/bin/gnome-shell                         31MiB |
|    0   N/A  N/A     73461      G   ...seed-version=20250228-151446.092000       78MiB |
|    0   N/A  N/A     75704      C   /usr/bin/ollama                            2074MiB |
+---------------------------------------------------------------------------------------+
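For scripting, nvidia-smi can print selected fields as CSV via --query-gpu and --format=csv,noheader,nounits, which is easier to parse than the table. The sketch below parses that CSV and compares the two readings shown above. The helper names are hypothetical, and read_gpu_memory only works on a machine with an Nvidia driver installed.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader,nounits"]


def parse_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines (in MiB) into one tuple per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and parse its CSV output (requires an Nvidia driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)


# Memory-Usage values from the two tables above: idle vs. model loaded
idle_used, total = parse_memory("266, 3072\n")[0]
loaded_used, _ = parse_memory("2339, 3072\n")[0]
print(f"GPU memory grew by {loaded_used - idle_used} MiB of {total} MiB")
# prints: GPU memory grew by 2073 MiB of 3072 MiB
```

The difference is close to the 2074MiB that the process table attributes to /usr/bin/ollama.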
V. HTTP API and OpenAI compatibility
1. List local models
Request:
curl http://localhost:11434/api/tags
Response:
{
  "models": [
    {
      "name": "qwen3:14b",
      "model": "qwen3:14b",
      "modified_at": "2025-10-24T15:34:59.157623395Z",
      "size": 9276198565,
      "digest": "bdbd181c33f2ed1b31c972991882db3cf4d192569092138a7d29e973cd9debe8",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "qwen3",
        "families": [
          "qwen3"
        ],
        "parameter_size": "14.8B",
        "quantization_level": "Q4_K_M"
      }
    },
    {
      "name": "nomic-embed-text:latest",
      "model": "nomic-embed-text:latest",
      "modified_at": "2025-10-23T15:44:00.04103599Z",
      "size": 274302450,
      "digest": "0a109f422b47e3a30ba2b10eca18548e944e8a23073ee3f3e947efcf3c45e59f",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "nomic-bert",
        "families": [
          "nomic-bert"
        ],
        "parameter_size": "137M",
        "quantization_level": "F16"
      }
    },
    {
      "name": "deepseek-r1:7b",
      "model": "deepseek-r1:7b",
      "modified_at": "2025-10-23T15:41:22.653128255Z",
      "size": 4683075440,
      "digest": "755ced02ce7befdb13b7ca74e1e4d08cddba4986afdb63a480f2c93d3140383f",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "qwen2",
        "families": [
          "qwen2"
        ],
        "parameter_size": "7.6B",
        "quantization_level": "Q4_K_M"
      }
    }
  ]
}
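A client typically needs only a few of these fields. The following sketch condenses a response of the shape shown above into one line per model. The function names are hypothetical; fetch_tags assumes the server from this guide is listening on localhost:11434 and is not called here.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.load(resp)


def summarize_models(tags_response: dict) -> list[str]:
    """Build one summary line per model: name, parameter size, quantization, size on disk."""
    lines = []
    for model in tags_response.get("models", []):
        details = model.get("details", {})
        size_gb = model["size"] / 1e9  # bytes to decimal gigabytes
        lines.append(
            f'{model["name"]}: {details.get("parameter_size", "?")} params, '
            f'{details.get("quantization_level", "?")}, {size_gb:.1f} GB'
        )
    return lines


# One entry copied from the response above (fields trimmed)
sample = {
    "models": [
        {
            "name": "deepseek-r1:7b",
            "size": 4683075440,
            "details": {"parameter_size": "7.6B", "quantization_level": "Q4_K_M"},
        }
    ]
}
for line in summarize_models(sample):
    print(line)  # deepseek-r1:7b: 7.6B params, Q4_K_M, 4.7 GB
```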
2. Show model information
curl http://localhost:11434/api/show -d '{
  "model": "deepseek-r1:7b"
}'
3. Call the embedding model
Request:
curl http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text:latest",
    "prompt": "你好"
  }'
Response (embedding vector truncated for brevity):
{
  "embedding": [
    -0.12307237833738327,
    0.29820987582206726,
    -3.833275556564331,
    0.06487877666950226,
    1.4995296001434326,
    0.23798543214797974,
    -0.7658764123916626,
    -0.44865143299102783,
    -0.40256860852241516,
    -1.2598717212677002,
    -0.8907259702682495,
    1.7141786813735962,
    0.1831144392490387,
    0.16616633534431458,
    0.13582943379878998,
    -1.0568212270736694,
    0.05641965940594673,
    -1.422386884689331,
    -0.9263020753860474,
    1.2330042123794556,
    -0.8702852725982666,
    0.8141596913337708,
    -0.19736900925636292,
    -0.8921308517456055,
    4.122570514678955,
    -0.3852195739746094,
    0.8616183400154114,
    1.2724435329437256,
    -0.07922960817813873,
    0.4311417043209076,
    0.24930191040039062,
    -0.8231167793273926,
    -0.39267492294311523,
    0.3824201822280884,
    -2.01654052734375
  ]
}
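When building a knowledge base, each document is first cut into chunks and every chunk is sent to this endpoint. The sketch below shows one way to do that with the standard library. The function names, chunk size, and overlap are illustrative assumptions; only the URL and the model/prompt fields come from the request above. embed needs a running Ollama server and is not called here.

```python
import json
import urllib.request


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += size - overlap
    return chunks


def embedding_request(chunk: str, model: str = "nomic-embed-text:latest",
                      base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build the POST request for /api/embeddings without sending it."""
    body = json.dumps({"model": model, "prompt": chunk}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(chunk: str) -> list[float]:
    """Send the request and return the vector (requires a running Ollama server)."""
    with urllib.request.urlopen(embedding_request(chunk)) as resp:
        return json.load(resp)["embedding"]


chunks = chunk_text("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.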
4. Chat completions
1) Ollama (OpenAI-compatible endpoint)
Request:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "你是谁?"
      }
    ]
  }'
Response:
{
  "id": "chatcmpl-141",
  "object": "chat.completion",
  "created": 1761238914,
  "model": "deepseek-r1:7b",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\u003cthink\u003e\n我是DeepSeek-R1,一个由深度求索公司开发的智能助手,我会尽我所能为您提供帮助。\n\u003c/think\u003e\n\n我是DeepSeek-R1,一个由深度求索公司开发的智能助手,我会尽我所能为您提供帮助。"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 53,
    "total_tokens": 65
  }
}
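In the response above the content field carries the model's reasoning inside <think>...</think> (the \u003c and \u003e sequences are JSON escapes for < and >), followed by the actual answer. An application usually wants to show only the answer. A minimal sketch for separating the two follows; the function name is hypothetical, and it assumes at most one think block per message.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty if no <think> block is present."""
    match = THINK_RE.search(content)
    if not match:
        return "", content.strip()
    thinking = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return thinking, answer


# Same escaping as in the response above, with placeholder text
raw = r'{"content": "\u003cthink\u003e\nreasoning\n\u003c/think\u003e\n\nanswer"}'
content = json.loads(raw)["content"]  # JSON decoding turns \u003c into "<"
thinking, answer = split_thinking(content)
print(repr(thinking), repr(answer))  # 'reasoning' 'answer'
```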
2) Qwen via DashScope (hosted OpenAI-compatible endpoint)
Request:
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "你是谁?"
      }
    ]
  }'
Response:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "我是通义千问,阿里巴巴集团旗下的超大规模语言模型。我能够回答问题、创作文字,如写故事、公文、邮件、剧本等,还能进行逻辑推理、编程,表达观点,玩游戏等。我支持多种语言,包括但不限于中文、英文、德语、法语、西班牙语等。如果你有任何问题或需要帮助,欢迎随时告诉我!"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 79,
    "total_tokens": 101,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  },
  "created": 1761236409,
  "system_fingerprint": null,
  "model": "qwen-plus",
  "id": "chatcmpl-c6ba1546-40ee-4978-914c-62b7dbe23efd"
}