I recently bought a new mini PC to host my homelab services. The new host is something like 10x more powerful than the old one.

        Old            New
CPU     Core i5-3427   i9-12900HK
RAM     16GB           32GB
Disk    60GB mSATA     1TB NVMe
GPU     HD 4000        Iris Xe

It’s just sitting here doing nothing most of the time, so I thought I might give running some AI stuff on it a go.

Previously I had set up one of the earlier versions of Stable Diffusion to run on my desktop. It was not fast, but it was interesting. With my GTX 1080 I could generate an image about every 2 minutes. Or maybe ten seconds. I forget. The whole business has evolved so much that my old GPU can’t load the models any longer.

But I always hear about people running agents on Raspberry Pis and such. Surely this high-end (mobile) processor should be able to put in some work?

Anyways, here’s the compose.

services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - type: volume
        source: ollama_data
        target: /root/.ollama
    restart: unless-stopped
  anythingllm:
    image: mintplexlabs/anythingllm
    environment:
      STORAGE_DIR: /app/server/storage
    volumes:
      - type: volume
        source: anythingllm_storage
        target: /app/server/storage

# named volumes referenced above
volumes:
  ollama_data:
  anythingllm_storage:

Standing up Chat

There are two parts to this. The first is AnythingLLM, which is, very roughly, a framework for doing “AI” things. It lets you set up prompts, workspaces with files and images already available to users, mediate tools and MCP connections, etc. It appears to be quite powerful and flexible, but I’m mostly using it because it puts a reasonable web frontend between me and the LLM models and has some basic authn/authz attached.
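You can wire AnythingLLM to the ollama container entirely through its web UI, but it can also be pinned in the compose file with a few extra environment variables. A minimal sketch, assuming the LLM_PROVIDER / OLLAMA_BASE_PATH / OLLAMA_MODEL_PREF variable names from AnythingLLM’s docker env example (the model is just whichever one you want as the default):

  anythingllm:
    environment:
      STORAGE_DIR: /app/server/storage
      # point AnythingLLM at the ollama service by its compose DNS name
      LLM_PROVIDER: ollama
      OLLAMA_BASE_PATH: http://ollama:11434
      OLLAMA_MODEL_PREF: gemma3:12b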

The second part is ollama, which is a way of downloading and executing LLM models. It handles loading them into memory (system or GPU) and is remarkably smart about system usage. For example, it identifies when your system is hyperthreaded or has Efficiency cores and adjusts its level of parallelism to maximize throughput given those constraints. When I run it, instead of executing on all 20 or so “cores” I see in Linux, it only forks out about 6 threads, because that’s the number of real performance cores I have.
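Those defaults can be nudged if you need to. Ollama reads a handful of environment variables at startup, so the knobs drop straight into the compose file; a sketch using the documented OLLAMA_NUM_PARALLEL, OLLAMA_MAX_LOADED_MODELS, and OLLAMA_KEEP_ALIVE settings (the values here are only examples, not what I’ve tuned to):

  ollama:
    environment:
      OLLAMA_NUM_PARALLEL: "1"       # serve one request at a time
      OLLAMA_MAX_LOADED_MODELS: "1"  # keep only one model resident
      OLLAMA_KEEP_ALIVE: "5m"        # unload an idle model after five minutes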

I loaded up deepseek-r1:latest (suggested by the ollama docs) and punched in my first self-hosted LLM interaction. It was a little bizarre.

> This is just a test message to see if this is configured correctly.

<think> Hmm, the user sent a simple test message saying "This is just a test
message..." They're probably checking if our configuration works properly or
verifying that I can follow their instructions.  Okay, seems like they're
testing my basic functionality here - whether I understand and respond
appropriately to an initial greeting or confirmation. Since they didn't actually
ask any substantive question yet, this appears to be more of a system check than
a real query.  I should keep my response simple and friendly since it's just a
test. No need for complex answers or analysis unless the user asks follow-up
questions later. A basic acknowledgment will show that I can both receive
messages and wait appropriately for their actual question. </think> 

Hello! How can I help you today?

As a friend said: “AI has developed anxiety”.

I tried a number of different test prompts and models after this. Mostly I was experimenting with how quickly I could generate responses and what kinds of models would fit on the host. This mini PC does other things than just crunch AI numbers, so I needed to keep RAM usage constrained.
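Compose has a blunt instrument for that too: a memory cap on the ollama service, so a model that’s too big gets reined in inside the container instead of eating the whole host. A sketch using the compose mem_limit key (the 20g figure is a placeholder, not something I’ve carefully chosen):

  ollama:
    mem_limit: 20g  # hard cap on the container's memory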

I tried llama 3.1 8B, deepseek-r1 8B, and gemma3 12B for “chat”. It’s not apples to apples since gemma3 is a bit larger, but I found that for my use case gemma3 gave the most reasonable responses and conversations. It runs at about 5 tokens per second, which is dreadfully slow compared to even the worst free-tier ChatGPT-esque platforms, but it’s actually about as fast as talking to a real human.

It’s even pretty reasonable at parsing photos, although that does take about 60 seconds per image. It’s fast enough for that purpose that I might install an AI photo grouping/categorization app and turn it loose for a few hours each night to see what happens.

I’d say the biggest downside to self-hosting these models is that their knowledge window is limited. I asked gemma3 about Brian Herbert’s Dune sequels and it gave me a weird mishmash of Frank Herbert’s 2nd through 4th novels instead. I might try hooking up a snapshot of Wikipedia (or live search) to supplement the knowledge base.
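AnythingLLM’s workspace documents are probably the path of least resistance there: embed the articles into its vector store and let retrieval fill the gaps. If I go that route I’d point its embedder at the same Ollama instance, which (assuming the EMBEDDING_* variable names from AnythingLLM’s env example and the nomic-embed-text model) would look something like:

  anythingllm:
    environment:
      # use the same ollama instance to embed uploaded documents
      EMBEDDING_ENGINE: ollama
      EMBEDDING_BASE_PATH: http://ollama:11434
      EMBEDDING_MODEL_PREF: nomic-embed-text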

And then I tried to self-host Tab Complete…

The next thing I wanted to figure out was whether I could get Tab Complete working. As far as I can tell, AnythingLLM doesn’t do this. But Continue does support pointing it at a self-hosted Ollama, so I gave that a whirl.

This was surprisingly difficult to get working. For one, the docs for Continue were misleading. They really want you to pay them for this and then use public LLM providers. And for reasonably good reason: everything I tried within this ecosystem was slow. In the case of Tab Complete, unusably slow. I did a couple of test runs and in general it took about 10 seconds to generate a single line of code. There is almost certainly some optimization to be done here, and running this on even a single GPU would probably be fairly performant, but using Continue was such a pain that I decided not to bother.

Embedded Chat was a bit more functional, so I’ll likely play around with that a little more.

Here’s the Continue model config I ended up with.

name: Local Agent
version: 1.0.0
schema: v1
models:
  - name: Gemma3 12B
    provider: ollama
    model: gemma3:12b
    apiBase: https://ollama.mydomain.com
    apiKey: CaddyBearerToken
    roles:
      - chat
      - edit
      - apply
  - name: qwen2.5-coder 1.5b
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: https://ollama.mydomain.com
    apiKey: CaddyBearerToken
    roles:
      - apply
      - autocomplete
      - chat
      - edit

context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase