Swival is the AI agent I actually wanted

Swival is a CLI AI agent, and Swival 1.0.0 has just been tagged.

People are going to ask the obvious question.

Why build a new agent when Codex, Claude Code and Opencode already exist, are well established, and already look good enough for most people?

Because I wanted an agent that fixes the things existing agents still get wrong in actual daily use.

Privacy, local models, and not leaking secrets to a provider

Current agents are built around genuinely incredible models.

But I still don’t trust companies such as Anthropics with my data. For open source work, fine. For closed-source work, or anything sensitive or personal, I think the default posture of most current tools just isn’t acceptable.

Using an AI agent inevitably leaks internal information. Sometimes a lot of it.

That includes access tokens, internal project names, URLs, company names, and all the little bits of context that look harmless until they aren’t.

Mitigating that risk is something I have cared about for a long time. I even gave a Zigtoberfest talk about that exact topic.

So I wanted an agent that does two things properly.

First, it needs mitigations for leaking secrets to providers.

That means transparently encrypting secrets before sending them to providers, then decrypting them locally so that models can still reason about them without actually seeing them. And it means being able to block and redact specific strings such as internal project names, company names and URLs.

The fact that current agents, including the ones heavily used by corporations, still don’t ship these features is, to me, irresponsible and unacceptable.

Swival has transparent secret encryption and outbound LLM filters specifically to reduce how much information leaves your machine.

Second, it needs to work well with open source models. Not as a checkbox. Actually well.

Local models have predictable behavior and predictable cost. They don’t suddenly get worse because a provider changed something. They don’t suddenly get more expensive because pricing moved. You control them, not a third party.

And of course, they also don’t leak sensitive data to anyone.

Open source models are getting good fast. Gemma 4, Qwen 3.5, and GLM-5.1 make that pretty obvious. Plenty of exciting models are uploaded to Hugging Face every day.

At the same time, efficient local inference is turning into a basic requirement for modern devices, and it’s only going to improve. Apple M5 chips are a good illustration of where this is going.

No, local models aren’t a replacement for everything. But they’re already good enough for a lot of work, they have a bright future, and they can be fine-tuned.

Even modern small models have surprisingly strong agentic capabilities.

The frustrating part is that most agents are still optimized and tested mainly for frontier models. Even the ones that advertise support for many providers and local models usually behave badly with local models. Tools are used poorly. Context is managed poorly. Everything gets slower. Output quality gets unreliable. Then people blame the model instead of the tool.

I wanted an agent that performs well with any model. From large frontier models all the way down to small local models with a small context window that anyone can run on their own machine.

And if it fails, I want the first instinct to be improving the agent so that it helps the model deliver as much as it can, not immediately declaring the model useless.

That’s one of the main reasons Swival isn’t just for frontier models. A lot of that comes from excellent context management.

Swival has a /compact command. But in practice, it’s rarely needed, if at all. The agent keeps trying to deliver regardless of the constraints, and the context window isn’t something you should have to babysit during a long session.

And when I want to test new models, there’s nothing more convenient than HuggingFace CPU-less inference. So I wanted that to be trivial as well.

With Swival, it’s as easy as:

swival --provider huggingface --model zai-org/GLM-5.1

Agent-to-Agent is too useful to remain niche

The A2A protocol is great.

People who actually use it know how powerful it is, and they usually don’t want to go back to a single isolated agent.

Unfortunately, for most people, A2A is still one of these things they have vaguely heard about but never really use, mostly because mainstream tools such as Claude Code still don’t support it.

That’s a shame, because A2A changes what an agent can be.

With A2A, you can run multiple agents with different configurations and different models, and let tasks naturally reach for the right one.

So instead of stuffing documentation and skills into one local agent, you can have a dedicated documentation agent with direct access to the material, while other agents don’t need to carry all of that context.

Then, when an agent needs to know how to do something, it asks the documentation agent to research it and return a concise, accurate answer instead of dumping blind grep results.

And of course, that specialized documentation agent can run a small, cheap, local model.

That’s the larger idea. Don’t restrict yourself to one model, or even a tiny set of related models. Use as many models as you want, including open source ones, depending on what needs to be done.

I wanted A2A to be simple enough that people would actually use it.

Swival comes with built-in support for A2A and can act both as a server and as a client.

You can set up a network of specialized agents in minutes.

Open source, readable, powerful, and not a circus

Existing agents have become ridiculously bloated over time.

Claude Code is enormous. It flickers. It crashes. New versions keep adding gadgets I don’t care about while making the whole thing even heavier.

I don’t want a kitchen sink. I want a small, reliable tool.

Swival is lean, fully open source, doesn’t depend on any company, and isn’t optimized around a specific provider. It’s totally free and open, and has nothing to sell.

It’s written in Python because Python is simple, readable and maintainable. Nothing is obfuscated. Anyone can read the code, understand it, and modify it for their own needs.

And this isn’t a toy. It’s a workhorse. A boring one, which is exactly what I want here.

It focuses on the tools a developer actually needs, not on gimmicks. But it still has the features you would expect from a modern agent, and then some.

Benchmarking needs a real environment

I also wanted to run benchmarks.

I wanted a tool to evaluate models, settings, skills, MCP servers, and similar pieces on real-world tasks, in an environment that actually resembles how a user works.

A lot of benchmarking tools aren’t designed that way.

They either assume tools optimized for specific models, or they provide an environment that doesn’t feel much like the real thing.

And if you want to learn anything useful from evaluations, you need traces. Detailed ones. Accurate ones.

You need to be able to look at what happened and understand how a model behaved under different conditions.

Swival comes with strong reporting features. Combined with calibra, you can compare traces, diff them, and run evaluations that are actually meaningful.

Evaluating many configurations can burn through a lot of tokens.

That’s yet another reason I cared so much about making the agent work well with open source models running locally. For evaluations, cost is often more important than wall-clock speed.

Small models are great

The real problem is that models produce terrible code and then confidently tell you everything is fine.

Watching a model generate code is impressive. It’s hard not to be impressed when you type a prompt and a feature, or sometimes a whole project, appears in one shot.

And the final report is always soothing. Everything is done. Everything works.

Of course.

The reality is that AI-generated code is almost always poor quality.

It may compile. It may appear to work. But from a correctness perspective, it’s often terrible.

You may be very happy with the code generated by Claude Code with Opus 4.6 max pro high thinking max, and maybe even want to deploy it to production, merge it into open source projects, or write triumphant blog posts about it.

But there’s a good chance that the code is inefficient, buggy, hard to maintain, and going to cost you later.

There’s a trivial experiment anyone can try.

Ask your favorite agent to generate code, or even just a plan.

Then, in a separate environment, ask another AI agent, even one running the same model, to review that code or plan.

It’s very likely to find issues immediately. Sometimes critical ones.

As much as I like AI, in my own projects I refuse pull requests blindly generated by tools such as Claude Code for exactly that reason. And in a company context, I wouldn’t deploy that output to production either.

There are two ways to significantly improve quality and confidence.

First, write the tests first, then force the agent not to declare the task complete until the tests pass.

The tests don’t even have to be part of the application’s formal test suite. A simple shell script with curl commands can be enough.

What matters is that this becomes a contract the agent has to satisfy.

That’s much stricter than a prompt, because a contract can’t be hand-waved away or interpreted creatively.

Second, use a loop with an LLM-as-a-judge.

Let one agent write code, documentation or a plan. Then let another agent review that work against the original instructions, and force the implementer to retry until the reviewer thinks it’s correct.

Swival makes both of these approaches trivial because they’re built in.

Before starting a task, you can give the agent a script that will act as a reviewer.

That reviewer can be another swival instance with a custom configuration. Or, even more simply, you can start with --self-review, and tasks will be reviewed by the same instance and same model in a dedicated context.

There’s nothing else to wire together.

One of the most interesting things to watch is how bad the initial output of an LLM agent can be, especially for code, and how honest and picky a model can suddenly become when it’s reviewing its own output without realizing it.

After a couple of iterations, the code, plan or documentation is often far better than the first attempt.

This is also one of the main reasons I wrote a new agent at all.

I don’t want to use AI to generate a mountain of code just so I can brag about productivity if the result is unreliable, insecure and unmaintainable.

I wanted an agent that optimizes for quality rather than raw time savings.

It can be slow. It can be expensive. But I want the output to be something I can trust and deploy to production.

Long sessions shouldn’t make the agent dumber

Another thing I wanted was continuity.

I wanted an agent that remembers what I did before, and what it did before.

When I come back to the same project the next day, I want the agent to remember prior work without filling the live context with junk.

Swival does that in a way that feels much more natural than in other agents.

I also wanted it to stop making the same mistakes twice.

So Swival has a /learn command: at the end of a session, the agent can reflect on the issues it ran into and write concise instructions about how to avoid repeating them. And once those learnings exist, it will keep updating them automatically.

That has turned out to be much more effective than premade agent skills. Or, more accurately, it’s an extremely effective way to produce the right skills, because the agent discovers what it actually needs from real sessions instead of from speculation.

Modern features, but without the usual mess

Skills, MCP, parallel subagents and similar capabilities are table stakes for serious agent use now. Of course Swival supports all of that.

But I also wanted it to avoid the usual security and reliability mistakes.

So tool and MCP output are explicitly tagged as untrusted in order to reduce prompt-injection risk.

And markdown comments are ignored, so what you see in a rendered skill isn’t different from what the agent actually interprets.

There’s another common failure mode I have always found silly.

If an MCP command or tool returns a large output, many agents either stuff the whole thing into the context window or fail in some awkward way.

I wanted an agent that handles that properly. Swival writes large outputs to a temporary file and lets the agent access them in chunks later instead.

I also wanted a clean way to share files such as agent memories and AGENTS.md across multiple devices working on the same project, without committing them into a Git repository.

Swival has lifecycle hooks specifically for that sort of workflow.

Arbitrary commands are easy too. In ~/.config/swival/commands/, you can place either scripts or plain files. Then ! command_name will inject either the content of the file or the output of the script into the prompt.

Yes, other agents have versions of this. But I wanted it to be trivial from a user perspective.

Not five overlapping systems with five different names for basically the same thing. Just one simple mechanism.

The same goes for shell command inspection and rewriting.

I didn’t want people to need to learn some complicated generic hook system.

In Swival, enabling command middleware is safe and straightforward.

And more importantly, I wanted the agent to be usable programmatically, not just from the CLI.

I also didn’t want that to require a separate SDK with its own worldview and its own behavior. Everything the CLI agent does should be accessible in a consistent way from Python code.

This is why Swival can be used as a CLI, but also as a library. It exposes a very simple API so anyone can build custom agents, or more general applications, on top of a batteries-included agentic environment.

Small things matter

Some of the things I cared about aren’t glamorous.

They’re just the kind of rough edges that make a daily tool annoying.

For example, markdown rendering for LLM output looks nice, but I dislike the fact that copy-pasting rendered output often strips the markdown markers.

I also don’t love the idea of an agent accidentally deleting files.

These are small things. But they matter if you actually use the tool every day.

Swival renders LLM markdown output while preserving the formatting tags.

So the output looks good, but can still be copied and pasted without losing the markdown.

And even in full YOLO mode, it has safety guards against dangerous commands, plus built-in support for the AgentFS copy-on-write filesystem overlay.

Also, when a file is deleted using Swival’s own tools, it isn’t actually deleted. It’s moved to a Trash directory instead.

I have never personally seen an agent delete the wrong file. But if it ever does happen, I want recovery to be possible.

Why I use it

At this point, I use Swival almost exclusively. It’s reliable, and I’m happy with the output I get from it.

I use open source models as much as possible, both locally and via HuggingFace inference endpoints.

But when I need a frontier model, I use it with GPT-5.4. This is a fantastic model. It works amazingly well with a regular ChatGPT subscription, and I have never hit any usage limit.

If you are happy with your current agent, there are no reasons to switch.

But I would still strongly encourage you to try Swival. Maybe even use it alongside the agent you already use.

Because even with the very same models, it’s likely to find bugs your other agent never found. That isn’t magic: different agents expose different environments to models, and models behave differently in those environments.

I have used the audit command from the swival-command repository, and it found bugs and vulnerabilities in pretty much every code base I tried it on, including code bases that had already been audited with Codex and Claude Code.

That’s the kind of difference I care about. Whether the tool helps me find real problems and ship better work.

So yes, give it a try.

What’s next

Version 1.0.0 doesn’t mean Swival is done.

It means it now checks all the boxes from my original todo list, and it’s stable enough for daily use.

The API is also unlikely to see major breaking changes, so it’s something you can reasonably rely on if you want to build applications and agents on top of it.

There are still many planned changes and features. But Swival will remain driven by real-world usage and user feedback, not by a roadmap for the sake of having a roadmap.

Related projects are going to keep expanding as well.

Right now, they’re:

Calibra, to evaluate models, MCP servers, skills, and related configurations
Skillscheck, a linter for agent SKILL.md files
Agent Skill Lint, a Visual Studio extension for agent skills linting
Swival commands, a repository for user-contributed commands

Swival exists because I wanted an agent that takes privacy seriously, works with the models I actually want to run, gets better over long sessions, and pushes code quality up instead of pretending quality doesn’t matter.

Most agents still optimize for marketing. I wanted one that optimizes for the work.

That’s Swival.