How do you use streaming responses in Ollama to generate text in real time? - Interview Question

Ollama supports streaming responses, which matter for any application that needs to display generated content as it is produced.

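1. Enabling streaming:

Set "stream": true in the API request. The REST API streams by default, so the parameter is spelled out here for clarity; set "stream": false to receive a single JSON object instead: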

bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Tell me a story about AI",
  "stream": true
}'
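
To make the response format concrete: with streaming on, the endpoint returns newline-delimited JSON, one object per line, each carrying a text fragment in "response", followed by a final object with "done" set to true. Below is a minimal sketch of reassembling such a stream. The sample lines are hand-written stand-ins rather than captured output, and join_stream is a hypothetical helper name.

```python
import json

# Illustrative stand-in for a streamed /api/generate body (not captured output).
# Real objects carry more fields, such as "model" and "created_at".
SAMPLE_LINES = [
    '{"response": "Once", "done": false}',
    '{"response": " upon", "done": false}',
    '{"response": " a time", "done": false}',
    '{"response": "", "done": true}',
]

def join_stream(lines):
    """Concatenate the "response" fragments of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)  # each line is one complete JSON object
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            break  # the final object marks the end of generation
    return "".join(parts)

print(join_stream(SAMPLE_LINES))
```

The same loop works on lines read from a live connection, since each line can be parsed on its own without waiting for the rest of the body.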

2. Python streaming example:

python
import ollama

# Stream generated text; with stream=True the call returns an iterator of chunks
for chunk in ollama.generate(model='llama3.1', prompt='Tell me a story', stream=True):
    print(chunk['response'], end='', flush=True)

# Streaming chat
messages = [
    {'role': 'user', 'content': 'Explain quantum computing'}
]

for chunk in ollama.chat(model='llama3.1', messages=messages, stream=True):
    if 'message' in chunk:
        print(chunk['message']['content'], end='', flush=True)
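
In a multi-turn chat, the streamed fragments usually have to be collected as well as printed, so the full reply can be appended to messages before the next turn. A minimal sketch of that pattern follows. stream_reply is a hypothetical helper, and the fake chunks are an assumption: they stand in for what ollama.chat(..., stream=True) yields, shaped like the dicts indexed in the example above, so the snippet runs without a server.

```python
def stream_reply(chunks, messages, emit=print):
    """Emit each fragment as it arrives, then record the full reply in messages."""
    parts = []
    for chunk in chunks:
        text = chunk['message']['content']
        parts.append(text)
        emit(text)
    reply = ''.join(parts)
    # Store the assistant turn so the next request carries the whole conversation.
    messages.append({'role': 'assistant', 'content': reply})
    return reply

# Fake chunks standing in for a live stream (illustrative content only).
fake_chunks = [
    {'message': {'role': 'assistant', 'content': 'Qubits '}},
    {'message': {'role': 'assistant', 'content': 'can be in superposition.'}},
]
messages = [{'role': 'user', 'content': 'Explain quantum computing'}]
stream_reply(fake_chunks, messages, emit=lambda t: print(t, end='', flush=True))
```

With a running server, the first argument would be the iterator returned by the streaming chat call instead of fake_chunks.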

3. JavaScript streaming example:

javascript
// The npm package is named 'ollama'; it exports the Ollama client class
import { Ollama } from 'ollama'

const client = new Ollama()

// Streaming generation
const stream = await client.generate({
  model: 'llama3.1',
  prompt: 'Tell me a story',
  stream: true
})

for await (const chunk of stream) {
  process.stdout.write(chunk.response)
}

4. Handling the stream with the requests library:

python
import requests
import json

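# stream=True tells requests not to download the whole body up front.
# timeout is (connect, read) in seconds; the values here are illustrative.
response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3.1',
        'prompt': 'Hello, how are you?',
        'stream': True
    },
    stream=True,
    timeout=(5, 60)
)
response.raise_for_status()  # fail early on HTTP errors

for line in response.iter_lines():
    if line:  # skip empty lines
        data = json.loads(line)  # each line is one complete JSON object
        print(data.get('response', ''), end='', flush=True)
        if data.get('done'):  # the final object marks the end of generation
            break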

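5. Advantages of streaming:

Better user experience: content appears as it is generated, so the perceived wait is shorter
Lower client-side memory use: chunks can be processed and discarded without buffering the complete response
Faster time to first token: output starts as soon as the first tokens are generated
Better interactivity: users see partial results early and can react to them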
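
The time-to-first-token point can be checked by timing a stream directly. The sketch below is one way to do that, under stated assumptions: measure_stream and slow_chunks are hypothetical names, and slow_chunks only simulates a model stream with fixed delays so the snippet runs without a server.

```python
import time

def measure_stream(chunks, clock=time.perf_counter):
    """Return (seconds until first chunk, total seconds, chunk count)."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = clock() - start  # latency the user perceives with streaming
        count += 1
    total = clock() - start  # latency the user perceives without streaming
    return first, total, count

def slow_chunks(n, delay):
    """Simulated stream: yields n chunks, `delay` seconds apart."""
    for i in range(n):
        time.sleep(delay)
        yield f"chunk{i}"

first, total, count = measure_stream(slow_chunks(5, 0.01))
print(f"first chunk after {first:.3f}s, all {count} chunks after {total:.3f}s")
```

Passing the iterator from a real streaming call in place of slow_chunks gives the corresponding numbers for an actual model.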

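6. Things to watch for when handling streams:

Parse the body line by line: each line is an independent JSON object
Handle dropped connections and implement retry logic
Consider adding timeouts
Implement cancellation so that generation can be stopped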
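
The timeout and cancellation points can be sketched as a thin wrapper around any chunk iterator. This is an illustration under stated assumptions rather than a definitive implementation: limited_stream is a hypothetical helper, and it only checks its conditions when a chunk arrives, so it does not interrupt a read that is already blocked (a read timeout on the HTTP client covers that case). Whether the server stops generating once the client disconnects is worth verifying for your setup.

```python
import threading
import time

def limited_stream(chunks, cancel_event, max_seconds, clock=time.monotonic):
    """Yield chunks until the source ends, cancel_event is set, or time runs out."""
    deadline = clock() + max_seconds
    try:
        for chunk in chunks:
            if cancel_event.is_set() or clock() > deadline:
                break
            yield chunk
    finally:
        # Release the underlying stream if it supports close() (generators do).
        close = getattr(chunks, 'close', None)
        if close is not None:
            close()

cancel = threading.Event()
received = []
source = iter(['a', 'b', 'c', 'd'])  # stand-in for a live chunk iterator
for i, chunk in enumerate(limited_stream(source, cancel, max_seconds=30)):
    received.append(chunk)
    if i == 1:
        cancel.set()  # e.g. the user pressed "stop"
print(received)
```

Using a threading.Event keeps the cancel signal safe to set from another thread, such as a UI handler.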

7. Advanced stream handling:

python
import ollama
from queue import Queue
from threading import Thread

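def stream_to_queue(queue, model, prompt):
    """Producer: run the streaming call in a worker thread and push each fragment."""
    try:
        for chunk in ollama.generate(model=model, prompt=prompt, stream=True):
            queue.put(chunk['response'])
    finally:
        # Sentinel marking the end of the stream; sent even if the call fails,
        # so the consumer never blocks forever on queue.get()
        queue.put(None)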

# Consume the stream through a queue, decoupling generation from display
queue = Queue()
thread = Thread(target=stream_to_queue, args=(queue, 'llama3.1', 'Tell me a story'))
thread.start()

while True:
    chunk = queue.get()
    if chunk is None:
        break
    print(chunk, end='', flush=True)

thread.join()