I don't know yet what goes here, as I'm not used to blogging. But as time goes by I'll figure out whether I should stick to tweeting, or whether, in this world of AI, you're just asking for more slop.
bainymx
Software Engineer
Self-hosting LLMs gives you complete control over your AI infrastructure. Here's a minimal setup for Llama 2 7B.

First, create a virtual environment and install the dependencies:
```bash
python -m venv llm-env
source llm-env/bin/activate
# accelerate is needed for device_map="auto"; gunicorn is used for serving later
pip install torch transformers accelerate fastapi uvicorn gunicorn
```
Load the model and tokenizer (the meta-llama repos on Hugging Face are gated, so accept the license and authenticate with `huggingface-cli login` first):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision: roughly 13-14 GB of weights for 7B
    device_map="auto",          # needs accelerate; places the weights on your GPU(s)
)
```
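Before putting an API in front of it, a quick smoke test is worth running; here's a minimal sketch (the prompt is just an example):

```python
# Tokenize a prompt on the model's device, generate, and decode the result.
prompt = "Explain self-hosting in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```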
Then wrap it in a FastAPI app. Put the loading code above and the following in main.py, since that's the module the Gunicorn command below points at:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    max_tokens: int = 256

@app.post("/chat")
async def chat(request: ChatRequest):
    # Move the encoded prompt to the model's device and unpack it into generate()
    inputs = tokenizer(request.message, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": response}
```
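With the server running (uvicorn main:app during development), you can try the endpoint; the prompt and the default port 8000 are just illustrative:

```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What can I use a self-hosted LLM for?", "max_tokens": 128}'
```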
Use Gunicorn with Uvicorn workers to serve it:

```bash
gunicorn main:app -w 2 -k uvicorn.workers.UvicornWorker
```
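Keep in mind that each Gunicorn worker is a separate process that loads its own copy of the model, so with -w 2 you need roughly twice the GPU memory; size the worker count to the hardware you have.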
You now have a private, scalable LLM API. Consider adding rate limiting, authentication, and monitoring for production use.
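As a starting point for the authentication part, here's a minimal sketch using a FastAPI dependency that checks an API key sent in a request header; the X-API-Key header and the LLM_API_KEY environment variable are naming choices for this example, not anything the stack prescribes:

```python
import os
from fastapi import Depends, Header, HTTPException

# Illustrative: compare the client's X-API-Key header against a key kept in an
# environment variable; swap in whatever secret management you actually use.
API_KEY = os.environ.get("LLM_API_KEY", "")

def require_api_key(x_api_key: str = Header(default="")) -> None:
    if not API_KEY or x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

# Declare the dependency on the existing route to protect it:
@app.post("/chat", dependencies=[Depends(require_api_key)])
async def chat(request: ChatRequest):
    ...  # same body as before
```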