May 28, 07:01
How to implement multi-model concurrent execution and resource management in Ollama?
Ollama can keep several models loaded and serve requests to them concurrently, which is useful for applications that handle many requests at once or route different tasks to different models.
1. View Running Models:
```bash
# View currently loaded models
ollama ps
```
Output example:
```shell
NAME        ID            SIZE     PROCESSOR    UNTIL
llama3.1    1234567890    4.7GB    100% GPU     4 minutes from now
mistral     0987654321    4.2GB    100% GPU     2 minutes from now
```
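The same information can be read programmatically. The sketch below assumes the server's `/api/ps` endpoint at the default address `http://localhost:11434` and that each entry reports `name`, `size`, and `size_vram` in bytes. `summarize_running` and `fetch_running` are illustrative helper names, not part of Ollama, and the sample data is made up.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default server address


def summarize_running(models):
    """Return (name, size in GB, percent of the model held in GPU memory) per entry."""
    rows = []
    for entry in models:
        size = entry["size"]
        vram = entry.get("size_vram", 0)
        gpu_percent = round(100 * vram / size) if size else 0
        rows.append((entry["name"], round(size / 1e9, 1), gpu_percent))
    return rows


def fetch_running(base_url=OLLAMA_URL):
    """Query the running server; requires Ollama to be up."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp)["models"]


# Hypothetical entries shaped like the server's response
sample = [
    {"name": "llama3.1", "size": 4_700_000_000, "size_vram": 4_700_000_000},
    {"name": "mistral", "size": 4_200_000_000, "size_vram": 2_100_000_000},
]
for name, gb, gpu in summarize_running(sample):
    print(f"{name}: {gb} GB, {gpu}% GPU")
```

With a live server, `summarize_running(fetch_running())` would produce the same kind of rows for the models that are actually loaded.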
2. Concurrent Request Processing:
Ollama automatically handles concurrent requests without additional configuration:
```python
import concurrent.futures

import ollama


def generate_response(prompt, model):
    response = ollama.generate(model=model, prompt=prompt)
    return response['response']


# Execute multiple requests concurrently
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [
        executor.submit(generate_response, "Tell me a joke", "llama3.1"),
        executor.submit(generate_response, "Explain AI", "mistral"),
        executor.submit(generate_response, "Write code", "codellama")
    ]

    for future in concurrent.futures.as_completed(futures):
        print(future.result())
```
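Because `as_completed` yields results in completion order, it can be hard to tell which answer belongs to which request, and one failed call raises out of the loop. A possible variant, sketched under the assumption that the generation call is passed in as a plain callable (`run_all` and `ollama_generate` are hypothetical names):

```python
import concurrent.futures


def run_all(requests, generate, max_workers=3):
    """Run (model, prompt) pairs concurrently; return results in input order.

    `generate` is any callable taking (model, prompt) and returning text.
    Each result is a (model, text, error) tuple; exactly one of text/error is None.
    """
    results = [None] * len(requests)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {
            executor.submit(generate, model, prompt): i
            for i, (model, prompt) in enumerate(requests)
        }
        for future in concurrent.futures.as_completed(future_to_index):
            i = future_to_index[future]
            model, _ = requests[i]
            try:
                results[i] = (model, future.result(), None)
            except Exception as exc:  # one failed request should not hide the others
                results[i] = (model, None, exc)
    return results


def ollama_generate(model, prompt):
    """Adapter around the ollama client; needs a running server."""
    import ollama  # imported lazily so run_all can be tested without a server
    return ollama.generate(model=model, prompt=prompt)["response"]
```

A call such as `run_all([("llama3.1", "Tell me a joke"), ("mistral", "Explain AI")], ollama_generate)` would then return one tuple per request, in the order the requests were given.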
3. Configure Concurrency Parameters:
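The number of parallel requests is a server-level setting, not a Modelfile parameter: Ollama reads it from the `OLLAMA_NUM_PARALLEL` environment variable when the server starts, and `OLLAMA_MAX_LOADED_MODELS` caps how many models stay loaded at once. The Modelfile can still set the batch size:
```dockerfile
FROM llama3.1

# Set batch size
PARAMETER num_batch 512
```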
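Server-wide concurrency limits are read from environment variables at startup. A minimal sketch, assuming the variables `OLLAMA_NUM_PARALLEL`, `OLLAMA_MAX_LOADED_MODELS`, and `OLLAMA_MAX_QUEUE` as described in Ollama's FAQ; `build_server_env` is a hypothetical helper and the values are only illustrative:

```python
import os


def build_server_env(num_parallel=4, max_loaded_models=3, max_queue=512, base_env=None):
    """Return a copy of the environment with Ollama's concurrency variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)            # parallel requests per model
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)  # models kept loaded at once
    env["OLLAMA_MAX_QUEUE"] = str(max_queue)                  # queued requests before rejecting
    return env


env = build_server_env(num_parallel=4, max_loaded_models=2, base_env={})
print(env)

# To launch the server with these settings (requires Ollama to be installed):
# import subprocess
# subprocess.Popen(["ollama", "serve"], env=build_server_env(num_parallel=4))
```

Higher values need more memory, since each parallel slot enlarges the context the model has to hold.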
4. Use Different Models for Different Tasks:
```python
import ollama


# Use different models to process different types of tasks
def process_request(task_type, input_text):
    if task_type == "chat":
        return ollama.generate(model="llama3.1", prompt=input_text)
    elif task_type == "code":
        return ollama.generate(model="codellama", prompt=input_text)
    elif task_type == "analysis":
        return ollama.generate(model="mistral", prompt=input_text)
    # Fail loudly instead of silently returning None for an unknown task type
    raise ValueError(f"Unknown task type: {task_type}")
```
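As the number of task types grows, the `if`/`elif` chain can be replaced by a lookup table. A small sketch using the same three task types; `MODEL_FOR_TASK` and `pick_model` are hypothetical names introduced here:

```python
# Task type -> model name, mirroring the mapping used in this section
MODEL_FOR_TASK = {
    "chat": "llama3.1",
    "code": "codellama",
    "analysis": "mistral",
}


def pick_model(task_type, default=None):
    """Return the model for a task type; fall back to `default` if given, else raise."""
    try:
        return MODEL_FOR_TASK[task_type]
    except KeyError:
        if default is not None:
            return default
        raise ValueError(f"Unknown task type: {task_type!r}") from None


print(pick_model("code"))  # codellama
```

The request itself then becomes a single call, for example `ollama.generate(model=pick_model(task_type), prompt=input_text)`.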
5. Model Switching and Unloading:
```bash
# Manually unload model (free memory)
ollama stop llama3.1

# Reload model
ollama run llama3.1
```
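Unloading can also be requested over the HTTP API by sending a prompt-less request with a `keep_alive` value. The sketch below assumes the `/api/generate` endpoint on the default address and the documented meanings of `keep_alive` (`0` unloads immediately, `-1` keeps the model loaded indefinitely); `keep_alive_request` and `send` are illustrative helpers.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default server address


def keep_alive_request(model, keep_alive):
    """Build a prompt-less /api/generate body that only changes how long the model stays loaded."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def send(body, base_url=OLLAMA_URL):
    """POST the body to the server; requires Ollama to be running."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


unload_now = keep_alive_request("llama3.1", 0)   # unload immediately
pin_loaded = keep_alive_request("llama3.1", -1)  # keep loaded indefinitely
print(unload_now)
print(pin_loaded)
```

Calling `send(unload_now)` against a running server should have the same effect as `ollama stop llama3.1`.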
6. Resource Management Strategies:
Memory Management:
- Monitor memory usage
- Adjust concurrency based on hardware resources
- Regularly unload unused models
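One way to adjust concurrency to the hardware is to cap the number of in-flight requests on the client side. This is a sketch of that idea, not an Ollama feature: `ConcurrencyLimiter` is a hypothetical class, and `fake_generate` stands in for `ollama.generate` so the example runs without a server.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyLimiter:
    """Cap how many calls run at once, independent of the thread pool size."""

    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous calls observed

    def run(self, fn, *args):
        with self._slots:  # blocks while `limit` calls are already in flight
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._active -= 1


def fake_generate(model, prompt):
    """Stand-in for ollama.generate so the sketch runs without a server."""
    time.sleep(0.01)
    return f"{model}: {prompt}"


limiter = ConcurrencyLimiter(limit=2)
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [
        executor.submit(limiter.run, fake_generate, "llama3.1", f"q{i}")
        for i in range(8)
    ]
    results = [f.result() for f in futures]

print(f"peak concurrency: {limiter.peak}")
```

The limit would typically be chosen from the memory left over after the models are loaded, and lowered if `ollama ps` shows models spilling from GPU to CPU.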
GPU Allocation:
```dockerfile
# Specify the number of layers to offload to the GPU
PARAMETER num_gpu 35

# Or offload every layer: a value above the model's layer count uses the GPU completely
PARAMETER num_gpu 99
```
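The same settings can be supplied per request instead of being baked into a Modelfile. A sketch assuming the `options` argument of `ollama.generate` accepts `num_gpu` and `num_batch`, as the API's option list suggests; `generation_options` and `generate_with_options` are hypothetical helpers:

```python
def generation_options(num_gpu=None, num_batch=None):
    """Collect only the options that were set, so server defaults apply to the rest."""
    candidates = {"num_gpu": num_gpu, "num_batch": num_batch}
    return {key: value for key, value in candidates.items() if value is not None}


def generate_with_options(model, prompt, **kwargs):
    """Send one request with per-request options; needs a running server."""
    import ollama  # imported lazily so generation_options can be used on its own
    return ollama.generate(model=model, prompt=prompt, options=generation_options(**kwargs))


print(generation_options(num_gpu=35))
```

Changing `num_gpu` between requests may cause the model to be reloaded, so it is usually better to settle on one value per model.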
7. Advanced Concurrency Patterns:
```python
import threading
from queue import Queue

import ollama


class ModelPool:
    def __init__(self, models):
        self.models = models
        self.queue = Queue()

    def worker(self):
        while True:
            task = self.queue.get()
            if task is None:  # Sentinel: stop this worker
                self.queue.task_done()
                break
            model, prompt = task
            try:
                response = ollama.generate(model=model, prompt=prompt)
                print(f"{model}: {response['response'][:50]}...")
            except Exception as exc:
                print(f"{model}: request failed: {exc}")
            finally:
                # Always mark the task done so queue.join() cannot hang
                self.queue.task_done()

    def start_workers(self, num_workers=3):
        for _ in range(num_workers):
            threading.Thread(target=self.worker, daemon=True).start()

    def add_task(self, model, prompt):
        self.queue.put((model, prompt))


# Use model pool
pool = ModelPool(["llama3.1", "mistral", "codellama"])
pool.start_workers(3)

pool.add_task("llama3.1", "Hello")
pool.add_task("mistral", "Hi")
pool.add_task("codellama", "Write code")

# Wait for all tasks; daemon threads are killed when the main thread exits
pool.queue.join()
```
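When more models are in rotation than fit in memory at once, alternating between them can force repeated loads and unloads. One possible mitigation, offered here as an assumption rather than something Ollama prescribes, is to reorder queued tasks so that requests for the same model run back to back. `group_by_model` is a hypothetical helper:

```python
from collections import defaultdict


def group_by_model(tasks):
    """Reorder (model, prompt) tasks so each model's prompts run back to back.

    Models keep the order in which they first appear; prompts keep their
    relative order within a model.
    """
    grouped = defaultdict(list)
    for model, prompt in tasks:
        grouped[model].append(prompt)
    return [(model, prompt) for model, prompts in grouped.items() for prompt in prompts]


tasks = [
    ("llama3.1", "Hello"),
    ("mistral", "Hi"),
    ("llama3.1", "How are you?"),
    ("mistral", "Bye"),
]
for model, prompt in group_by_model(tasks):
    print(model, prompt)
```

The trade-off is latency: a request for a rarely used model waits until the batch for the current model has finished.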