Are Local LLMs Ready for Production?

Tim Janik


In 2018 I recreated this blog with an SSG (Static Site Generator) in Python based on pandoc (asciidoctor for older pages), git timestamps and Jinja2 templates. Even though it cached pandoc invocations, building still took too long for my taste and lately I didn’t really feel at ease with modifying the code without a type checker in the loop.

Much of the original work went into the Jinja2 templates for posts, pages, atom, rss2 and sitemap generation, and I wanted that work to be preserved in a rewrite. So for a while, I dabbled with local LLMs (at first using Qwen3.5-27B) transforming the Jinja2 templates into various languages, to get a feel for what a non-Python version could look like. Ultimately, I ended up picking Go because of its type safety, performance and support for easy parallelism, and approached the rewrite with Qwen3.6-27B as follows:

1. Transform the Jinja2 templates into Go templates with some scaffolding to test all templating features. The LLM even generated temporary Python test code for the old Jinja2 templates on the fly to produce reference output that it targeted with the new Go templates.
2. Give the LLM the markdown sources, Go templates and have it create the iris ssg command with the aim to reproduce the output of the old Python generated website.
3. Let the model iteratively improve the code, using HTML renderings from the original Python code as golden master.
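To make the first and third steps more concrete, here is a minimal sketch of rendering a Go template and checking it against a reference string. The `Post` type, the template text and the golden string are all made up for illustration and are not taken from the actual site code; only the standard `html/template` package is assumed.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// Post is a hypothetical stand-in for the data a post template receives.
type Post struct {
	Title string
	Tags  []string
}

// postTmpl shows roughly what a Jinja2 snippet such as
//   <h1>{{ post.title }}</h1>{% for t in post.tags %}<span>{{ t }}</span>{% endfor %}
// could look like after translation to Go template syntax.
const postTmpl = `<h1>{{ .Title }}</h1>{{ range .Tags }}<span>{{ . }}</span>{{ end }}`

// renderPost parses and executes the template; html/template escapes
// the inserted values for the HTML context automatically.
func renderPost(p Post) (string, error) {
	t, err := template.New("post").Parse(postTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// golden plays the role of the reference output from the old generator.
	golden := `<h1>Hello &amp; welcome</h1><span>go</span><span>ssg</span>`

	got, err := renderPost(Post{Title: "Hello & welcome", Tags: []string{"go", "ssg"}})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println("matches golden master:", got == golden)
}
```

A byte-for-byte comparison like this is the strictest possible objective; in practice some normalization before comparing is usually needed.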

The results were far from perfect, but close enough to build on, especially with a golden master to serve as an objective. Once the LLM deemed the output good enough and ascribed the remaining differences to semantically insignificant HTML cosmetics, I reviewed it myself and found several broken links, wrong dates, missing figcaptions, wrongly hardcoded values, etc. But I had to pick all those issues out of a 16k line diff.

An LLM has limited attention; giving it a 16k line diff with comparatively few significant differences can only get you so far. So after that, my job was first and foremost to pick out the significant differences and ask it to fix those one by one, where each of those fixes usually only took a few dozen lines. In addition, I asked it to generate a few simple scripts to unescape and normalize the generated HTML, which eliminated much of the noise in the diff, so I had an easier time picking out the significant changes myself.
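The normalization scripts themselves are not shown here, so the following is only a sketch of the general idea, under the assumption that most of the diff noise comes from entity escaping and whitespace between tags. The function name `normalizeHTML` is hypothetical. Note that unescaping entities is lossy, so output like this is only suitable for diffing, never for publishing.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	betweenTags = regexp.MustCompile(`>\s+<`) // whitespace between adjacent tags
	spaceRun    = regexp.MustCompile(`\s+`)   // any run of whitespace
)

// normalizeHTML reduces cosmetic differences so that a line-based diff
// of two renderings mostly shows meaningful changes.
func normalizeHTML(s string) string {
	s = html.UnescapeString(s)                    // &amp; and & compare equal
	s = betweenTags.ReplaceAllString(s, "><")     // drop inter-tag whitespace
	s = spaceRun.ReplaceAllString(s, " ")         // collapse remaining runs
	s = strings.ReplaceAll(s, "><", ">\n<")       // one tag boundary per line
	return strings.TrimSpace(s)
}

func main() {
	oldOut := "<p>Tom &amp; Jerry</p>\n\n  <p>x&#39;s</p>"
	newOut := "<p>Tom & Jerry</p><p>x's</p>"
	fmt.Println(normalizeHTML(oldOut) == normalizeHTML(newOut))
}
```

Running both the reference and the new rendering through the same normalizer before diffing keeps the comparison symmetric.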

Using a Qwen3.6-27B-MTP model on an RTX 4090 (24GB VRAM) at 4-bit quantization with llama.cpp (which landed MTP support only last week), I was able to discuss development options and iterate on planning steps at close to 80 tokens per second, which felt nicely responsive and interactive. After that, I’d switch to a slower model configuration with longer context to have it carry out the development.

Left to their own devices, the Qwen LLMs can work long and persistently to reach a specific objective (e.g. minimize the diff against the HTML reference rendering), but will happily muddle all the logic into a single ever-growing source file. So along the way of improving the code to fix remaining bugs, I made several requests to increase modularity. At this point, excerpt generation, mailbox handling (for blog post comments), pandoc processing, adoc processing, globstar matching, page classification, etc. are all separate modules. The modules are easier to review, independently testable and not intertwined with each other.
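To illustrate what a small, independently testable module can look like, here is a sketch of a globstar matcher. It is not the matcher from the actual code base; the function names and the exact semantics (`**` matches zero or more whole path segments, other segments go through the standard `path.Match`) are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchGlobstar reports whether name matches pattern, where both use "/"
// as separator and a "**" segment matches zero or more path segments.
func matchGlobstar(pattern, name string) bool {
	return matchSegs(strings.Split(pattern, "/"), strings.Split(name, "/"))
}

// matchSegs matches pattern segments against name segments recursively.
func matchSegs(pat, segs []string) bool {
	if len(pat) == 0 {
		return len(segs) == 0
	}
	if pat[0] == "**" {
		// Try letting "**" swallow 0, 1, 2, ... leading segments.
		for i := 0; i <= len(segs); i++ {
			if matchSegs(pat[1:], segs[i:]) {
				return true
			}
		}
		return false
	}
	if len(segs) == 0 {
		return false
	}
	ok, err := path.Match(pat[0], segs[0])
	if err != nil || !ok {
		return false // malformed patterns are treated as non-matching
	}
	return matchSegs(pat[1:], segs[1:])
}

func main() {
	fmt.Println(matchGlobstar("posts/**/*.md", "posts/2018/ssg.md"))
	fmt.Println(matchGlobstar("posts/**/*.md", "pages/about.md"))
}
```

Because the function has no dependencies on the rest of a generator, it can be reviewed and tested in isolation, which is the point of splitting it out.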

Ending up with a suitable architecture requires a steering hand and constant review. Part of the reason is that splitting out modules is often a judgement call: not every 10-line piece of logic is worth isolating, but when you suspect a 7-line snippet will grow and needs to be reusable in the future, or you have a clear idea about layering and separation of concerns, you need to tell the LLM to factor this out before the mess becomes unmanageable. It also helps to have the LLM review the code base every now and then, collect stats on it and make refactoring suggestions that you can decide on.

A particular strength in developing with LLMs is the ease and speed at which a few dozen lines of temporary test code may be generated on the fly to (dis-)prove an approach or reach an intermediate objective. An ambitious developer would also invest time to write temporary code to ensure an approach works (or just out of curiosity), but not on the scale of writing, running and throwing away several scripts/programs per hour.

The outcome of the porting effort is a single binary that is trivial to deploy, with parallelized and much faster site builds. The generated HTML is semantically equivalent to the output of the old Python code, plus fixes for several smaller bugs that the old output had (e.g. TZ handling).
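How the build is parallelized is not described above, so the following is just one common way to do it in Go: a fixed pool of goroutines pulling page indices from a channel. `renderPage` is a hypothetical placeholder for the expensive per-page work (e.g. a pandoc call plus templating).

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// renderPage is a hypothetical placeholder for rendering one page.
func renderPage(src string) string {
	return "<article>" + strings.ToUpper(src) + "</article>"
}

// renderAll renders all sources concurrently and keeps the input order.
func renderAll(sources []string) []string {
	out := make([]string, len(sources))
	jobs := make(chan int)
	var wg sync.WaitGroup

	// One worker per CPU; each worker writes only to its own out[i],
	// so no additional locking is needed.
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = renderPage(sources[i])
			}
		}()
	}
	for i := range sources {
		jobs <- i
	}
	close(jobs) // lets the workers' range loops terminate
	wg.Wait()
	return out
}

func main() {
	for _, page := range renderAll([]string{"first post", "second post"}) {
		fmt.Println(page)
	}
}
```

Bounding the number of workers keeps memory and subprocess counts predictable, which matters if each page spawns an external converter.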

So, are local LLMs ready for production grade rewrites?

Well, at least for me, recent Open Weights models can definitely do the heavy lifting. Using local LLMs for technical research or for quickly coding up a new approach opens astonishing new opportunities, and they are becoming indispensable in my development workflow. But an LLM cannot relieve you of setting clear objectives, enforcing good architecture and setting the guard rails.

Does the LLM have to be LOCAL?

Personally, I cannot imagine paying an LLM hosting provider for this. If I allow my development processes to be radically changed and reorganized around a new central tool, I need that tool to be reliably available without any kind of outside control or potential for interference. I also need to be able to turn to my tools at any time, for any length of time, online or offline, and without having to worry about arbitrary monthly quotas expiring while I am sleeping. To me, peace of mind is incomparably more important than quota maxxing at the bleeding edge frontier; local LLMs give me that.
