Introduction
As you may have seen in my previous post, I’ve finally built a machine powerful enough to run Open Source AI models locally.
At this point, some of my goals have been achieved, others still require work, and a few —after gaining real experience— might simply not be worth the effort.
I’ll try not to make this excessively technical. Instead, I think the best approach is to revisit my expectations, see which ones have been met, which ones could be met in the short term, and which were entirely off the mark.
Goals
Building a work assistant
This is, without a doubt, the area where I’ve made the most progress. Naturally, it was the most urgent one: I work at least eight hours a day, so any improvement pays off immediately. I designed and implemented… well, it was entirely implemented by LLMs through Vibe Coding (they have improved a lot since the post I wrote about the topic some time ago).
It’s still not as complete as I would like. My idea wasn’t just to give it access to manuals, distribution lists and assorted documentation, but also to meeting recordings and my personal notes. And this is where I’ve hit several roadblocks. First, I’d need to record the meetings. If I do it from my computer, I have to warn everyone, which feels awkward. If I use Teams’ built-in recording, I need to download the file and feed it to a transcription model. And for the assistant to be truly useful, I’d need to train it to recognize who is speaking at each moment. Technically possible, but impractical and not scalable. Also, even though OneNote is great for taking notes and having them everywhere, it’s a closed-source product and exporting them automatically is far from easy.
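To be fair, the transcription step itself is the easy part. Here is a minimal sketch using openai-whisper, assuming the Teams recording has already been downloaded (meeting.mp4 is a made-up filename); identifying who is speaking would need a separate diarization model on top, which is exactly where it stops being practical:

```python
# Minimal transcription sketch with openai-whisper (pip install openai-whisper).
# "meeting.mp4" is a hypothetical filename for a downloaded Teams recording;
# whisper relies on ffmpeg to extract the audio track.
import whisper

model = whisper.load_model("medium")   # larger models transcribe better, but slower
result = model.transcribe("meeting.mp4")

# Timestamped segments. Note there is no speaker attribution here:
# diarization (who spoke when) requires a separate model entirely.
for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s] {seg['text'].strip()}")
```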
Even with these limitations, and despite the poor state of the documentation, I’m satisfied. It doesn’t do the work for me, but it helps me understand errors faster and get unstuck. Later on (especially during holidays) I’ll probably try to train it on the app’s code so that it can tell me what changes are needed to adapt an implementation to a specific requirement (dreaming is still free… for now).
The project also served to build the “skeleton” I’m using for every other agent I want to create (personal agent, financial agent…).
Personal assistant
This one hasn’t progressed as much as I expected. And to explain why, I need to give you a quick overview of what I found when I stepped into the wonderful (no scare quotes needed, though it is equally complicated) world of DIY Open Source LLMs.
Like the rest of the Open Source ecosystem, it’s modular, flexible, varied, customizable, inexpensive… but it comes at a cost: fragmentation and the challenge of making everything fit together. I’ve run into many compatibility issues between library versions. New LLMs appear using capabilities unsupported by inference frameworks. When support finally arrives, those features might conflict with your drivers; and if you install the latest version of the drivers, it breaks other parts of the pipeline. After investing a ridiculous amount of time, you end up giving up and settling for fewer features until things mature.
This problem —along with the ones I’ll mention later— makes the gap between private services and what you can realistically do locally much bigger than I expected. And not only because of raw compute limitations.
And it’s not just compatibility issues. The models you can run locally, even with powerful hardware, are small, limited, and not very intelligent. You can compensate partially by refining prompts, but only up to a point. I’m not ruling out that some of the blame is mine, but this matches what I experienced when testing Open Source models in the cloud (Qwen, DeepSeek…). They’re cheap, yes, but you quickly start missing ChatGPT: incorrect answers, getting stuck even when corrected, and struggling with tasks that should be simple. These models can fail to generate even a valid JSON file, which is tolerable if you’re reading the output on screen, but not if it has to feed an external service. And as you’ll see later, this impacts many more areas than it seems at first glance.
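To give you an idea of what that means in practice, this is the kind of guardrail you end up writing around every structured call. A minimal sketch, where llm_call is a stand-in for whatever inference client you actually use:

```python
import json

def ask_for_json(llm_call, prompt, max_retries=3):
    """Retry loop for models that don't reliably emit valid JSON.
    llm_call is a hypothetical stand-in for your inference client:
    it takes a prompt string and returns the raw model text."""
    message = prompt
    for _ in range(max_retries):
        raw = llm_call(message).strip()
        # Small local models love wrapping JSON in markdown fences.
        if raw.startswith("```"):
            raw = raw.strip("`").removeprefix("json").strip()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Feed the parse error back so the model can correct itself.
            message = (f"{prompt}\n\nYour previous reply was not valid JSON "
                       f"({err}). Reply with raw JSON only, no prose.")
    raise ValueError("model never produced valid JSON")
```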
On top of that, I’ve realized private services include far more engineering than I thought, well beyond the model itself. Take the very first feature that made LLMs actually useful: web search. I implemented it, and it works reasonably well. But even though the pipeline is the same, cloud services achieve much more reliable results. My code works —it just doesn’t shine. It may look simple, but I assure you it isn’t, and the model’s intelligence heavily determines success.
For example, I had to explicitly tell it that when I ask for a comparison between two items and it lacks information, it must perform a separate web search for each. Before that, it simply looked for existing comparisons online. If I asked it to compare a Mercedes A-Class with an Audi A5, it would search for direct comparison pages. Now it mostly does it right —but not always. And if it first needs to determine today’s date to choose the most up-to-date sources, it often fails.
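To illustrate (this is a paraphrase, not my literal prompt), the fix amounts to spelling out a rule the model really should infer on its own:

```python
# Paraphrase of the kind of rule I had to add to the system prompt;
# the exact wording in my project differs.
COMPARISON_RULE = """
When the user asks you to compare two or more items and you lack the
facts, run a separate web search for EACH item and build the comparison
yourself. Do not rely on finding a ready-made comparison page.
"""
```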
Another part of the problem is hardware limitations: I can only fetch and process a small number of results, and only a subset makes it to the model. Also, for privacy (and cost) reasons, I’m using the Brave Search API instead of Google…
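For reference, the search call itself is the trivial part; everything around it is where the engineering lives. A minimal sketch against the Brave endpoint (BRAVE_API_KEY is an assumed environment variable name, and the low result count is deliberate):

```python
import os
import requests

def brave_search(query: str, count: int = 5) -> list[dict]:
    """Minimal Brave Search API call. The low result count is deliberate:
    everything fetched here still has to be filtered and squeezed into
    a local model's context window."""
    resp = requests.get(
        "https://api.search.brave.com/res/v1/web/search",
        headers={"X-Subscription-Token": os.environ["BRAVE_API_KEY"]},
        params={"q": query, "count": count},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"title": r["title"], "url": r["url"], "snippet": r.get("description", "")}
        for r in resp.json()["web"]["results"]
    ]
```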
I also assume professional services cache results. With millions of users, overlapping queries are guaranteed. I can’t rely on that, meaning every search starts from scratch. This adds overhead and forces me to manually filter which results are reliable, accessible, recent, etc.
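Locally, the best you can do is cache your own repeated queries within a session. A tiny sketch of the idea, nothing like the cross-user hit rates a big provider gets:

```python
import time

class SearchCache:
    """Tiny TTL cache so repeated queries in one session don't hit the
    search API again. Big providers can share a cache across millions
    of users; locally you only ever benefit from your own repeats."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, query):
        hit = self._store.get(query)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]
        return None

    def put(self, query, results):
        self._store[query] = (time.time(), results)
```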
There are also very few good multimodal models (text + image). The ones that exist are outdated or too large to run locally. vLLM recently added beta support for suspending and resuming models in GPU memory. This could help by switching models depending on the input, but since the response quality isn’t good enough to begin with, I haven’t bothered testing it.
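For the record, the beta API looks roughly like this, going by vLLM’s sleep-mode documentation at the time of writing (the method names may well change, and the Qwen model is just an example):

```python
from vllm import LLM

# vLLM's sleep mode is beta: the flag and methods below reflect the
# currently documented API and may change between releases.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

llm.sleep(level=1)   # offload weights to CPU RAM and free the KV cache
# ...load and run a multimodal model on the freed VRAM here...
llm.wake_up()        # restore the text model without a full reload
```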
On top of that, I genuinely believe private models are far more refined and better aligned than Open Source ones —beyond what benchmarks reflect. It could still be that I’m not giving them enough instructions, but experience tells me that’s not the full story.
And honestly, this is where I’m stuck. It’s simple: I can’t get anywhere near the quality of private services, even for relatively simple tasks. For some queries it doesn’t matter, and I could use the local assistant, but since I often end up switching to private models when things get slightly more complex, the habit makes me forget the local option exists.
I haven’t given up. There’s plenty to improve (different models, different routing logic, better instructions…). But I didn’t expect the gap to be this large, and yes —it’s disappointing.
Avoiding censorship
No major progress here either —and this one was a particularly unpleasant surprise. Very few Open Source models are uncensored. I only found one decent candidate (surprisingly, from Google) and couldn’t integrate it with my project. It’s ironic that in such a “free” and alternative ecosystem, uncensored models are so hard to find. But in any case, as I mentioned earlier, if the model isn’t intelligent enough, the lack of censorship doesn’t help much. What’s the point of having an uncensored model if its search or reasoning abilities aren’t trustworthy?
Vibe Coding assistant
This was supposed to be my second assistant, but ironically it’s the one where I’ve made the least progress. The idea was to replicate something similar to Claude Code or OpenAI Codex: a terminal-based Vibe Coding interface with access to all project files. But I haven’t managed to get these interfaces to communicate with my project. These tools are also Open Source, so it may just be another integration issue. And even if I solved it, with the Qwen models I usually run, I doubt I’d use them seriously given the limitations I described earlier. They might be useful for simple actions (copying files, removing files, spinning up Docker containers, managing repositories), but I wouldn’t trust them to write meaningful code.
Learning
Let’s finish with a clear win. I’m learning a lot through these projects, and although the possibilities aren’t as broad as I first imagined, they still have plenty of useful applications. Focusing only on LLMs, the only thing left for me to learn is how to perform light training (“finetuning”) so that models can retain some domain-specific information without having to search documentation every time.
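The natural candidate seems to be a parameter-efficient method like LoRA, which trains small adapter matrices on top of frozen weights instead of touching the whole model. A sketch with Hugging Face’s peft library, using a Qwen model purely as a placeholder:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model: any local causal LM would do. LoRA trains a
# few small adapter matrices instead of the full weights, which is what
# makes "light" finetuning feasible on a single GPU.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

config = LoraConfig(
    r=8,                                   # adapter rank: smaller = cheaper, less capacity
    lora_alpha=16,                         # scaling factor for the adapter updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```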
Conclusion
The assistants I’ve built can summarize text, but they’re not yet intelligent enough to be truly useful. Summaries are fine for some tasks, but not enough to replace private services. So for now, I mainly use this project to search across different work-related sources, but for personal use it offers little value.
Maybe in a few months or years —if Open Source models improve and private services become more “polluted” with commercial content or lose impartiality (which is already happening)— it will make sense to have your own agent. But today, with subscription prices being what they are, running your own assistant doesn’t make much sense unless privacy is essential.
The same applies to services that use AI for investing (stocks, crypto…). It could be an interesting project, but if the LLMs I can run locally aren’t smart enough for smaller tasks, I’m definitely not trusting them with investment advice. There are affordable services out there that make reinventing the wheel pointless.
In short: unless you need an extremely specific and mechanical task done locally, cloud services will give you far more utility today.