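How do you run multiple models concurrently and manage resources in Ollama? - Interview Question

Ollama can keep several models loaded and serve requests to them concurrently, which is useful when an application has to handle many requests at once or route work to different models.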

1. List running models:

bash
# List the models currently loaded in memory
ollama ps

Example output:

shell
NAME       ID           SIZE      PROCESSOR    UNTIL
llama3.1   1234567890   4.7GB     100% GPU     4 minutes from now
mistral    0987654321   4.2GB     100% GPU     2 minutes from now
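
For monitoring scripts it can help to turn that table into structured data. The following is a minimal sketch, assuming the columns are separated by runs of two or more spaces as in the sample above; `parse_ps` and `SAMPLE` are hypothetical names introduced here and are not part of Ollama.

```python
import re

# Sample text copied from the example output above
SAMPLE = """\
NAME       ID           SIZE      PROCESSOR    UNTIL
llama3.1   1234567890   4.7GB     100% GPU     4 minutes from now
mistral    0987654321   4.2GB     100% GPU     2 minutes from now
"""

def parse_ps(text):
    """Turn the table printed by `ollama ps` into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Assumption: columns are separated by two or more spaces
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line.strip())
        rows.append(dict(zip(header, cells)))
    return rows

for row in parse_ps(SAMPLE):
    print(row["NAME"], row["SIZE"], row["PROCESSOR"])
```

Against a live server, the text could come from `subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout` instead of `SAMPLE`.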

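2. Handling concurrent requests:

Ollama queues and schedules concurrent requests on the server side, so the client needs no extra setup. How many requests actually run in parallel, and how many models stay loaded at once, depends on available memory and on server settings such as OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS: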

python
import ollama
import concurrent.futures

def generate_response(prompt, model):
    response = ollama.generate(model=model, prompt=prompt)
    return response['response']

# Run several requests concurrently
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [
        executor.submit(generate_response, "Tell me a joke", "llama3.1"),
        executor.submit(generate_response, "Explain AI", "mistral"),
        executor.submit(generate_response, "Write code", "codellama")
    ]
    
    for future in concurrent.futures.as_completed(futures):
        print(future.result())
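
In the example above, one failed request (for instance a model that has not been pulled) raises from `future.result()` and stops the loop. The sketch below shows one way to collect a result or an error per task while keeping the input order. `run_all` and `fake_generate` are hypothetical helpers introduced here; `fake_generate` stands in for a real call so the block runs without an Ollama server.

```python
import concurrent.futures

def run_all(tasks, generate_fn, max_workers=3):
    """Run (prompt, model) tasks concurrently.

    Returns one (prompt, model, result, error) tuple per task, in the
    same order as the input list.
    """
    results = [None] * len(tasks)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {
            executor.submit(generate_fn, prompt, model): i
            for i, (prompt, model) in enumerate(tasks)
        }
        for future in concurrent.futures.as_completed(future_to_index):
            i = future_to_index[future]
            prompt, model = tasks[i]
            try:
                results[i] = (prompt, model, future.result(), None)
            except Exception as exc:  # one failure should not hide the others
                results[i] = (prompt, model, None, exc)
    return results

def fake_generate(prompt, model):
    # Hypothetical stand-in for a real model call
    if model == "missing-model":
        raise RuntimeError(f"model {model!r} not found")
    return f"[{model}] reply to: {prompt}"

tasks = [
    ("Tell me a joke", "llama3.1"),
    ("Explain AI", "mistral"),
    ("Write code", "missing-model"),
]
for prompt, model, result, error in run_all(tasks, fake_generate):
    print(model, "->", result if error is None else f"failed: {error}")
```

To use it with a real server, pass the `generate_response` function defined earlier as `generate_fn`.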

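3. Configure concurrency parameters:

Concurrency is controlled by environment variables on the Ollama server rather than in the Modelfile (num_parallel is not a Modelfile parameter):

bash
# OLLAMA_NUM_PARALLEL: maximum number of requests each loaded model processes in parallel
# OLLAMA_MAX_LOADED_MODELS: maximum number of models kept loaded at the same time
# Each parallel slot needs its own context memory, so higher values use more RAM/VRAM
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=3 ollama serve

Per-model runtime options such as the batch size can be set in the Modelfile:

dockerfile
FROM llama3.1

# Set the batch size
PARAMETER num_batch 512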

4. Use different models for different tasks:

python
import ollama

# Route each type of task to a different model
def process_request(task_type, input_text):
    if task_type == "chat":
        return ollama.generate(model="llama3.1", prompt=input_text)
    elif task_type == "code":
        return ollama.generate(model="codellama", prompt=input_text)
    elif task_type == "analysis":
        return ollama.generate(model="mistral", prompt=input_text)
    # Fail loudly instead of silently returning None
    raise ValueError(f"Unknown task type: {task_type}")

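5. Switching and unloading models:

bash
# Unload a model manually (frees memory)
ollama stop llama3.1

# Load the model again (an empty prompt preloads it without opening a chat session)
ollama run llama3.1 ""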

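6. Resource management strategies:

Memory management:

Monitor memory usage (for example with ollama ps)
Tune the level of concurrency to the available hardware
Periodically unload models that are rarely used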
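
The last point can be sketched as a least-recently-used policy: unload the models that have been idle longest until the rest fit a memory budget. This is only an illustration. `plan_unloads` is a hypothetical helper, the timestamps are made up, the codellama size is an invented figure, and the other two sizes come from the sample `ollama ps` output.

```python
def plan_unloads(loaded, budget_gb):
    """Pick models to unload, least recently used first, until the
    total size fits within budget_gb.

    loaded: list of (name, size_gb, last_used) tuples, where last_used
    is any sortable timestamp.
    """
    total = sum(size for _, size, _ in loaded)
    to_unload = []
    # Oldest last_used first
    for name, size, _ in sorted(loaded, key=lambda m: m[2]):
        if total <= budget_gb:
            break
        to_unload.append(name)
        total -= size
    return to_unload

loaded = [
    ("llama3.1", 4.7, 300),   # used most recently
    ("mistral", 4.2, 100),    # idle the longest
    ("codellama", 3.8, 200),  # size is an illustrative guess
]
print(plan_unloads(loaded, budget_gb=9.0))
```

Each returned name could then be passed to `ollama stop`.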

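GPU allocation:

dockerfile
# Number of model layers to offload to the GPU
PARAMETER num_gpu 35

# Offload all layers to the GPU (any value at or above the model's layer count)
PARAMETER num_gpu 99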
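
When the whole model does not fit in VRAM, a back-of-the-envelope estimate can give a starting value for num_gpu. The sketch below is a rough heuristic and not how Ollama itself decides. It assumes all layers are the same size and reserves a fixed amount for the context cache and other overhead. `estimate_num_gpu`, the 32-layer count, and the 1 GB reserve are illustrative assumptions.

```python
import math

def estimate_num_gpu(model_size_gb, total_layers, free_vram_gb, reserve_gb=1.0):
    """Rough estimate of how many layers fit in VRAM.

    Assumes every layer is the same size and keeps reserve_gb free for
    the context cache and other overhead. Both are simplifications.
    """
    if total_layers <= 0 or model_size_gb <= 0:
        raise ValueError("model size and layer count must be positive")
    per_layer_gb = model_size_gb / total_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model has
    return min(total_layers, math.floor(usable_gb / per_layer_gb))

# A 4.7 GB model with a hypothetical 32 layers
print(estimate_num_gpu(4.7, 32, free_vram_gb=8.0))
print(estimate_num_gpu(4.7, 32, free_vram_gb=4.0))
```

Treat the result as a first guess and adjust it after checking the PROCESSOR column of `ollama ps`.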

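7. Advanced concurrency pattern:

python
import ollama
from queue import Queue
import threading

class ModelPool:
    def __init__(self, models):
        self.models = set(models)
        self.queue = Queue()

    def worker(self):
        while True:
            task = self.queue.get()
            try:
                if task is None:  # sentinel: stop this worker
                    break
                model, prompt = task
                response = ollama.generate(model=model, prompt=prompt)
                print(f"{model}: {response['response'][:50]}...")
            except Exception as exc:
                # Keep the worker alive if a single request fails
                print(f"Task {task!r} failed: {exc}")
            finally:
                # Always mark the task done so queue.join() can return
                self.queue.task_done()

    def start_workers(self, num_workers=3):
        for _ in range(num_workers):
            threading.Thread(target=self.worker, daemon=True).start()

    def add_task(self, model, prompt):
        if model not in self.models:
            raise ValueError(f"Model not in pool: {model}")
        self.queue.put((model, prompt))

    def wait(self):
        # Block until every queued task has been processed
        self.queue.join()

# Use the model pool
pool = ModelPool(["llama3.1", "mistral", "codellama"])
pool.start_workers(3)

pool.add_task("llama3.1", "Hello")
pool.add_task("mistral", "Hi")
pool.add_task("codellama", "Write code")

# Without this, the daemon workers are killed as soon as the main thread exits
pool.wait()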