The Missing Layer in Your Local AI Strategy

Fine-tuning doesn't add knowledge. It changes behaviour.

Bottom Line: Fine-tuning doesn't make small models smarter. Retrieval gives them your knowledge. The agent runtime gives them discipline. Understanding which tool solves which problem is what separates a frustrating experiment from a genuinely useful one.

I hoped fine-tuning would add knowledge. It doesn't. Let me explain.

That was the first thing I got wrong. And it sent me down the wrong path in thinking about how to make local models more capable. If you're experimenting with small models and finding gaps in what they can do, there's a reasonable chance you're making the same assumption.

Here's what actually changed my thinking.

What Fine-Tuning Actually Does

Researchers at German clinical institutions took a small open-source model and fine-tuned it on tumour taxonomy data. The base model had near-zero accuracy on ICD-10 coding outputs. The model had the clinical knowledge. What it lacked was the ability to format valid taxonomy codes without hallucinating the structure.

Fine-tuning fixed the output behaviour. Near-zero to production-grade accuracy, achieved by constraining the output space to valid formats rather than adding knowledge.

That's what instruction fine-tuning does, and it's what most people mean when they talk about fine-tuning a model: changing how it behaves, not what it knows. The output format, the vocabulary, the conventions, how it structures responses. If your problem is that the model doesn't know your domain, fine-tuning is the wrong tool. If your problem is that the model knows the domain but doesn't respond in the format or style you need, fine-tuning is exactly the right tool.
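To make the behaviour-versus-knowledge point concrete, here is a minimal sketch of what a behaviour-shaping training set can look like. The field names ("instruction", "input", "output"), the placeholder notes, and the placeholder codes are all assumptions for illustration, not the format used in the study above or required by any particular tool. The thing to notice is that every target output has the same strict shape.

```python
import json

# Hypothetical record layout: "instruction" / "input" / "output" is a common
# convention for instruction-tuning data, but it is an assumption here.
def make_record(note: str, code: str) -> dict:
    return {
        "instruction": "Return only the classification code for this note.",
        "input": note,
        "output": code,  # always the bare code, never a sentence around it
    }

def to_jsonl(records: list[dict]) -> str:
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Placeholder examples only; real seed data would come from your own domain.
records = [
    make_record("Placeholder clinical note A", "CODE-001"),
    make_record("Placeholder clinical note B", "CODE-002"),
]
jsonl = to_jsonl(records)
print(jsonl)
```

Nothing in these records teaches the model a new fact. They only demonstrate, over and over, the one output shape you want back.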

The distinction matters more than it sounds. Most people arrive at fine-tuning looking to add knowledge. That's not what you get. And spending weeks trying to train knowledge into a model that already has it, or one that should be getting it from retrieval, is time lost.

The tooling to do this yourself is more accessible than most people realise. Unsloth, built by Daniel and Michael Han, cuts VRAM overhead by up to 70 per cent through custom kernels and runs training two to five times faster than standard pipelines. There's a browser-based Studio interface that requires no Python setup to get started. A fine-tuning run for a 7B model on 1,000 training examples can complete in minutes on a mid-range consumer GPU. The barrier to trying it is lower than the technical reputation of the field suggests.

One thing to know before you generate training data: if you're using consumer AI interfaces to build your seed examples, the default terms on most consumer tiers retain rights to use that data for future model training. Use API tiers or run the generation locally. The output looks the same. The data rights can be contractually different.

What Actually Solves the Knowledge Problem

If fine-tuning isn't for knowledge, retrieval is.

Retrieval-augmented generation gives the model access to your actual domain content at the moment it needs it. Your documentation, your processes, your product specifications, your internal knowledge base: pulled in at inference time, not baked in during training. The model doesn't need to have been trained on your business. It needs access to it when it's thinking.
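As a rough illustration of "pulled in at inference time", the sketch below ranks a handful of documents against a question and places the best matches into the prompt. It uses plain word-count cosine similarity as a stand-in for an embedding model, and the documents, function names, and prompt wording are all made up for the example.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Put retrieved content in front of the question at inference time."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Made-up knowledge base entries for illustration.
docs = [
    "Refunds are processed within 14 days of a return request.",
    "The warehouse ships orders on weekdays only.",
    "Support is available by email around the clock.",
]
prompt = build_prompt("How long do refunds take?", docs)
print(prompt)
```

The model never has to be trained on the refund policy. It only has to read it when the question arrives.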

I've built applications using embedding models and knowledge graphs to support this kind of retrieval. The honest observation is that RAG approaches have evolved considerably over the past eighteen months. What was current practice a year ago looks different now. The principle has stayed stable: the model reasons, the retrieval layer finds.

The place most implementations go wrong isn't the model. It's the retrieval architecture. Queries that loop without converging because the document set doesn't contain what they're searching for. Tool calls that cascade when the model hits uncertainty. Context windows filled with raw retrieval output that dilutes the model's attention until it starts drifting from its instructions.

All of these trace back to retrieval infrastructure, not to the model itself. Enforce stopping criteria. Compress retrieval outputs before they enter the context window. Build telemetry early enough to see the warning signs before they compound. The model performs to the quality of the information it receives, not just its own training.
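Those three habits can be sketched in a few lines. This is an illustrative outline under my own assumptions, not a production design: `search` is a hypothetical callable you would supply, truncation stands in for real summarisation, and the limits are arbitrary numbers.

```python
from dataclasses import dataclass

@dataclass
class Telemetry:
    rounds: int = 0
    chars_in_context: int = 0
    stop_reason: str = ""

def compress(snippet: str, max_chars: int = 200) -> str:
    """Crude stand-in for summarisation: collapse whitespace, then truncate."""
    return " ".join(snippet.split())[:max_chars]

def bounded_retrieval(search, query, max_rounds=3, budget_chars=600):
    """Call search(query, round_no) until one of three stopping criteria fires."""
    seen, context, tel = set(), [], Telemetry()
    for round_no in range(max_rounds):
        tel.rounds += 1
        new = [s for s in search(query, round_no) if s not in seen]
        if not new:  # nothing new came back: looping further will not converge
            tel.stop_reason = "no_new_results"
            break
        for s in new:
            seen.add(s)
            c = compress(s)
            if tel.chars_in_context + len(c) > budget_chars:
                tel.stop_reason = "context_budget"
                return context, tel
            context.append(c)
            tel.chars_in_context += len(c)
    else:
        tel.stop_reason = "max_rounds"  # loop ended without hitting a break
    return context, tel

# Hypothetical search backend that keeps returning the same snippet.
def fake_search(query, round_no):
    return ["Policy   document:\n refunds take 14 days."]

context, tel = bounded_retrieval(fake_search, "refund time")
print(context, tel)
```

The telemetry object is the part worth keeping even if everything else changes. A stop reason and a round count per query are enough to show which searches never converge.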

The Layer Most People Skip Entirely

The model alone and the model inside a well-built agent runtime are not the same thing.

I use Agent Zero for my own work. It's lesser known than some of the more discussed frameworks, but it gives me what I actually need: the autonomy I'm prepared to extend to an agent on longer-horizon tasks, and the control I still want over how it operates. That balance matters when you're working with local models. The reliability floor is different to a frontier model API. The runtime is what imposes structure on top of that variability.

What an agent runtime does: it manages tool calling, handles reasoning loops, enforces stopping criteria, and manages context across a task. A capable model without a runtime is like a capable person with no workflow. The intelligence is there. The discipline to apply it consistently isn't.
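Stripped to its core, that loop looks something like the sketch below. It is not how Agent Zero or any other framework is implemented. The tool registry, the decision format, and the scripted stand-in for a model are all hypothetical, chosen so the example runs without a real model.

```python
from typing import Callable

# Hypothetical tool registry: plain Python functions keyed by name.
TOOLS: dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}

def scripted_model(history: list[str]) -> dict:
    """Stand-in for a local model: asks for one tool call, then answers."""
    if not any(h.startswith("tool_result:") for h in history):
        return {"action": "tool", "name": "word_count", "arg": history[0]}
    count = history[-1].split(":", 1)[1]
    return {"action": "final", "answer": f"The task has {count} words."}

def run_agent(task: str, model, max_steps: int = 5) -> str:
    """Reasoning loop with tool dispatch and a hard step budget."""
    history = [task]  # the context the runtime carries across the task
    for _ in range(max_steps):
        decision = model(history)
        if decision["action"] == "final":
            return decision["answer"]
        tool = TOOLS.get(decision["name"])
        if tool is None:
            history.append(f"tool_result:unknown tool {decision['name']}")
            continue
        history.append(f"tool_result:{tool(decision['arg'])}")
    return "Stopped: step budget exhausted."  # the stopping criterion

answer = run_agent("count the words in this task", scripted_model)
print(answer)
```

The step budget is the discipline. A model that keeps asking for tools gets cut off by the runtime instead of cascading indefinitely.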

For personal and experimental use, Agent Zero and Hermes are worth understanding. For production orchestration at scale, LangGraph is the framework to know. The specific tool matters less than the principle: the model is one component in a system, and the system determines what it can actually do in practice.

The Right Order

If I were building this from scratch with what I know now, the sequence would be this.

Start with the runtime. Pick a framework, connect it to your model, and run the same task you've been running through a raw interface. See what changes. The structured reasoning loop alone often surfaces what was missing without touching the model at all.

Add retrieval second. Give the model access to your actual domain content. A basic embedding setup retrieving from a clean, well-organised knowledge base consistently outperforms a sophisticated retrieval system working with poorly structured data. Get the content right before optimising the retrieval method.

Fine-tune last, and only for a specific reason. Before going there, check whether the problem is really a prompt problem, or one that well-built skills files would solve. Better system instructions or targeted skills files solve more than people expect. A structured instruction file that defines the ask, the action, and the expected output format often resolves inconsistency without touching the model at all. Work through those options first. Only when you've tightened the instructions and the output is still wrong in ways you can't control does fine-tuning become the right question to ask. The specific problems it addresses well: inconsistent output formatting, the wrong vocabulary, a response structure your downstream process can't use. When retrieval and instructions can't fix it, fine-tuning potentially can.
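One way to tell whether the output is "still wrong in ways you can't control" is to measure it. The sketch below assumes your downstream process expects JSON with a fixed set of keys. The key names and sample outputs are invented for illustration. Run it on a batch of real outputs before and after tightening the instructions, and see whether the failure rate moves.

```python
import json

# Hypothetical output contract for the downstream process.
REQUIRED_KEYS = {"summary", "category"}

def is_valid(raw: str) -> bool:
    """True if the output parses as a JSON object containing every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)

def failure_rate(outputs: list[str]) -> float:
    """Share of outputs the downstream process could not use."""
    if not outputs:
        return 0.0
    return sum(not is_valid(o) for o in outputs) / len(outputs)

# Invented sample outputs: two usable, two not.
samples = [
    '{"summary": "Invoice overdue", "category": "billing"}',
    "Sure! Here is the summary you asked for...",
    '{"summary": "Password reset request"}',
    '{"summary": "Delivery delayed", "category": "logistics"}',
]
rate = failure_rate(samples)
print(f"{rate:.0%} of outputs failed the format check")
```

If better instructions bring that number close to zero, the fine-tuning question never needs asking. If it stays stubbornly high, you have both the evidence and the beginnings of a test set.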

The corrected mental model is simpler than the one I started with. The runtime gives the model discipline. Retrieval gives it knowledge. Fine-tuning shapes how it applies both. In that order, for those reasons.

What This Changes

The small models available now are genuinely capable in ways they weren't twelve months ago. But the capability gap most people experience isn't just about the model weights. It's about the architecture around them.

I was looking at the model as the thing to improve. The model was the least controllable variable. The runtime, the retrieval layer, and the output constraints are all things you can directly shape. That's where the work actually is.

The capability is already in the model. The architecture is how you reach it.

Framework Application

The .solved Execution Model applies directly here:

Uncover: Identify whether your local AI gap is a lack of domain knowledge (needs RAG) or a behavioural formatting issue (needs fine-tuning).
Unpack: Structure and clean your domain content first. RAG will only perform as well as the database it is querying.
Bridge: Select and connect your model to a structured agent runtime (like Agent Zero or LangGraph) to establish execution discipline.
Embed: Deploy the runtime and retrieval layers locally or on an isolated VPS, ensuring strict data sovereignty.
Ideate: Refine system instructions and custom skills files continuously before committing to the complexity of custom fine-tunes.

The Bottom Line

The performance of your local AI is a product of both the model weights and the architecture you build around them. Understanding the sequence (runtime first to establish discipline, retrieval second to add domain knowledge, and fine-tuning last to shape behaviour) allows you to systematically isolate where the actual gaps are.

Escaping proprietary dependencies doesn't require a massive training budget. It requires a deliberate, step-by-step approach to structuring your stack.

Next Steps: If you are exploring smaller, fit-for-purpose models to solve a problem in your business, reach out to Intent Solved to map out the right architecture.

Related Resources

The Frontier Model Dependency You Haven't Priced In - The article that precedes this one. On why open source models are worth understanding before you actually need them.
Unsloth - The fine-tuning framework referenced in this article. Browser-based Studio interface, up to 70 per cent VRAM reduction, significantly faster than standard training pipelines.
Agent Zero - The agent runtime used personally for long-horizon local model tasks.
The .solved Execution Framework - The five-step framework Intent Solved uses to move organisations from Pilot Purgatory to production-grade AI capability.