
Future Direction

Andi Lamprecht · 20 min read · Draft

This document describes where Metis is going. It is organized by problem rather than by feature, and every section has the same shape: a named gap between the platform today and the vision, a description of the ideal state that closes the gap, and one or more examples of how other platforms have solved adjacent versions of the same problem.

These are not designs. They are not commitments to ship on a particular timeline. The examples drawn from other platforms are inspirations for shape, not prescribed implementations. ADS was the original vision that named most of these problems; Hamster and superpowers are later-discovered references that embody adjacent solutions to some of them. We are not building ADS-as-originally-specified, we are not building Hamster, and we are not building superpowers. We are building Metis, which draws from each and goes somewhere of its own.

The document closes with a section on the tensions we intend to keep living with rather than resolve. Those are not problems we plan to fix; they are balances the platform needs to hold continuously.

Problem 1 — Intent Capture Is Ad-Hoc

The gap. Conversations are the only first-class place where intent lives in Metis today. A product owner who wants to describe a piece of work starts a conversation, types into it, and hopes the AI’s first interpretation is close to what they meant. There is no durable, typed, versioned “brief” object — no artifact that says “this is what we decided we are building, this is who approved it, this is when we locked that in.” Downstream work inherits whatever shape the conversation happened to have.

The ideal state. Intent is captured in a first-class artifact with its own authoring surface — something a product owner can draft, a stakeholder can review and comment on, an AI can help refine, and a human approver can formally sign off on before any engineering work starts. The artifact is versioned, linked to downstream requirements and plans, and carries the identity of every contributor and approver. Starting engineering work without an approved brief is not just discouraged; the platform’s workflows refuse to proceed without one.
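
To make that concrete, the sketch below shows one way such a brief might look as a typed, versioned artifact. It is illustrative only: every field name, identifier, and value is an assumption made for the example, not a schema the platform has committed to.

```yaml
# Illustrative sketch of a brief artifact. Field names are hypothetical,
# not a committed schema; the point is that the artifact is typed,
# versioned, linked, and carries identities for contributors and approvers.
kind: brief
id: brief-0142                   # placeholder identifier
version: 3                       # versions are explicit; downstream work pins one
title: Self-serve workspace invitations
intent: |
  Let workspace admins invite teammates without filing a support ticket.
audience: workspace administrators
contributors:
  - name: Product Owner A        # placeholder identity
    role: author
  - name: Staff Engineer B       # placeholder identity
    role: reviewer
approval:
  approver: Product Owner A
  decision: approved
  approved_version: 3
links:
  requirements: []               # populated once decomposition runs (Problem 2)
```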

Examples in the wild. The ADS vision called this artifact a PRD, gave it the structure of a classic product requirements document with business context and acceptance criteria, and put it behind a PRD Approval gate owned by a product-owner role — with downstream SLRs, Implementation Plans, and work items all pinned to the approved PRD version. Hamster, which came after the ADS vision was written, arrived at a similar shape from a different direction: it calls the artifact a “brief,” positions it as distinct from a traditional PRD, and frames it as capturing intent, audience, and rationale rather than specification. Both shapes agree on the fundamentals: a named artifact, a named approver, a named approval, and downstream work that cannot start without it.

Problem 2 — Requirements Decomposition Has No Collaborative Gate

The gap. An approved brief should produce requirements. Today, Metis does not distinguish requirements from plans; both live inside workflow runs or as Markdown in the repository’s docs/superpowers/plans/ directory, both are produced by AI with minimal formalization, and neither is a surface where a product or architecture stakeholder reviews the decomposition before engineering starts. The result is that the step between “this is what we want” and “this is the work we will do” is where the most expensive rework happens — because nobody formally reviewed the decomposition in a setting that let them push back.

The ideal state. An approved brief flows into a requirements decomposition workflow. The output is a set of durable, scoped, individually-reviewable requirements that describe what the system must do, each linked to its originating brief. Stakeholders review the decomposition — not the implementation — through a collaborative editing surface, and an explicit approval gate blocks the transition to planning. Requirements that fail review get sent back to decomposition with feedback; requirements that pass review become the input to planning.
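
A compact sketch helps show where the approve and reject paths lead. One requirement is shown; the field and node names are entirely hypothetical.

```yaml
# Illustrative decomposition output and its review gate.
requirement:
  id: req-0212
  brief: brief-0142@3             # pinned to the approved brief version
  statement: Admins can revoke an invitation before it is accepted.
  acceptance_criteria:
    - Revoked invitations cannot be redeemed.
    - Revocation is recorded in the audit log.
decomposition_gate:
  reviewers: [product, architecture]
  on_approve: planning            # approved requirements become the input to planning
  on_reject: decompose            # sent back with the reviewers' feedback
```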

Examples in the wild. The ADS vision introduced System-Level Requirements as a named artifact between PRDs and Implementation Plans, with an explicit Requirements Mapping stage that produced SLRs from an approved PRD and a coverage matrix that showed which SLR acceptance criteria were satisfied by which plan work items. Hamster’s plan object is a looser combination of the same idea — tasks, subtasks, dependencies, and acceptance criteria — but it preserves the concept of a reviewable decomposition that humans align on before work begins.

Problem 3 — Traceability Is Implicit

The gap. The relationships between a conversation, a workflow run, a PR, and a release are knowable today only by walking the database manually. Nobody who needs them — a compliance auditor, a new engineer joining a project, a product owner who wants to know “what shipped in the last release,” an administrator investigating an incident — has a surface where they can start from any artifact and walk to the related artifacts in either direction. The links exist in the schema; they are not exposed as a first-class graph, not indexed for cross-project queries, and not navigable from the UI.

The ideal state. Every artifact the platform produces — brief, requirement, plan, workflow run, artifact file, PR, release, ADR, retrospective — is a node in a queryable graph, with explicit bidirectional links to its parents and children. A human or an agent can start from any node and navigate the graph to answer questions like “what approved this change,” “what work items are covered by this release,” “where did this requirement originate,” or “what requirements were satisfied by this PR.” The graph is indexed for cross-project queries and surfaced through a navigable UI.
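
As a purely illustrative sketch, a node in that graph might carry explicit parent and child links, and a query might walk them. The node shape, link names, and query syntax below are assumptions made for the example, not an existing Metis API.

```yaml
# Illustrative sketch of one node in the traceability graph and a query
# against it. Everything here is hypothetical structure.
node:
  kind: pull_request
  id: pr-8831
  parents:                        # links back toward intent
    - {kind: plan, id: plan-0490}
    - {kind: requirement, id: req-0212}
  children:                       # links forward toward delivery
    - {kind: release, id: rel-2025-06}
# The kind of question the graph should answer:
#   "What requirements were satisfied by this PR?"
query:
  start: {kind: pull_request, id: pr-8831}
  follow: parents
  until: {kind: requirement}
```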

Examples in the wild. The ADS vision called this the requirements hierarchy with first-class trace links, pinned to specific versions at each step, with a coverage matrix at the boundary between requirements and plans so that reviewers could see exactly which SLR acceptance criteria were covered by which Implementation Plan work items. Hamster, discovered later, arrived at a comparable shape under the name “context graph”: an automatically-maintained link structure connecting conversations, decisions, documents, briefs, plans, and people, surfaced as the central model for navigating a workspace. Both share the foundational idea: traceability is the graph, the graph is queryable, and the queryability is what makes the platform’s compliance and observability claims real.

Problem 4 — Multi-Agent Critique Isn’t Built In

The gap. The pattern of having one agent draft a piece of work and a second agent critique it before a human reviews it is known to produce better output than either agent alone. The pattern is well-understood, the cost is modest, and the uplift is substantial — but in Metis today, it is a pattern a workflow author has to build by hand inside a workflow, with bespoke nodes, bespoke prompt plumbing, and bespoke reconciliation logic. There is no bundled node type, no standard critique pattern, and no telemetry that says when critique is paying off.

The ideal state. Multi-agent critique is a first-class primitive in the workflow engine. A node type like critique: takes an upstream artifact and a critiquing-agent configuration and produces a critique artifact. A node type like reconcile: takes a drafted artifact and a critique artifact and produces a reconciled artifact. Bundled workflows incorporate these primitives for common patterns — plan critique, PR-description critique, ADR critique. Telemetry tracks when the pattern lifts outcomes and when it does not, so the platform can surface whether the extra tokens are earning their keep on a given workflow or project.
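
A workflow fragment makes the primitive concrete. Only the node type names critique and reconcile come from the paragraph above; every other key, agent, and provider in the sketch is a hypothetical stand-in, since no schema for these nodes exists yet.

```yaml
# Illustrative workflow fragment using the proposed critique/reconcile
# primitives. Surrounding keys (agents, inputs, providers) are hypothetical.
nodes:
  draft_plan:
    type: agent
    provider: provider-a          # placeholder
    prompt: Draft an implementation plan for the approved requirements.
  critique_plan:
    type: critique
    input: draft_plan             # upstream artifact to critique
    agent:
      provider: provider-b        # a different model, deliberately
      stance: adversarial-review
  reconcile_plan:
    type: reconcile
    draft: draft_plan
    critique: critique_plan
  human_review:
    type: approval_gate
    input: reconcile_plan
```

Using a different provider for the critique step mirrors the asymmetric-agent experiment described in the next paragraph.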

Examples in the wild. The pattern has been personally validated through a small experiment of using Codex to critique Claude’s plans and vice versa, with reconciled outputs consistently stronger than either unreviewed input. The ADS vision approached the same pattern through its tiered agent roles — a Supervisor agent with broader context reviewing a Developer agent with narrower context, for example — and through its escalation gates, where a failing piece of work was routed to a different agent with different authority. The underlying shape is the same: asymmetric agents, a structured disagreement, and a reconciliation step.

Problem 5 — Work Sequencing and WIP Are Unmanaged

The gap. Workflow runs start when dispatched. Nothing limits how many are in flight simultaneously. Nothing sequences them by dependency, priority, or cost budget. Nothing shows a human “here is everything currently in flight across my projects, and here is what will run next.” The absence of these controls does not matter when one person is using the platform; it matters quickly at team and organization scale, where concurrent runs compete for tokens, worker capacity, and human review attention.

The ideal state. A platform-level work queue with explicit WIP limits, configurable per project, per owner, or per cost budget. A visual board where a human can see and reorganize what is next. AI-assisted sequencing that analyzes dependencies between work items and proposes an execution order. Cost budgets as a first-class platform constraint that the queue honors automatically. The combination gives humans a lightweight, kanban-shaped control surface for managing the platform’s throughput without having to count individual runs in their heads.
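
One way to picture that control surface is a per-project queue policy. The sketch below is illustrative only; the key names, limits, and budget figures are assumptions, not proposed defaults.

```yaml
# Illustrative queue policy for a single project. WIP limits, priority,
# and budgets are declared once and enforced by the platform.
queue:
  project: payments-service       # placeholder project
  wip_limits:
    concurrent_runs: 4
    per_owner: 2
    awaiting_human_review: 6      # backpressure before the review queue grows
  budgets:
    tokens_per_day: 5000000
    usd_per_week: 250
  sequencing:
    strategy: dependency-aware    # AI-assisted ordering proposal
    human_override: allowed       # the board stays reorderable by hand
```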

Examples in the wild. Kanban itself — not as an implementation but as a decades-old principle — is the foundation: limit work in progress, pull rather than push, make flow visible. Hamster's initiative-and-plan hierarchy provides a team-sized example of organizing work for a human-paced queue. The ADS vision called out the requirements kanban with explicit gate columns as a shape for this, with drag-and-drop reorganization on non-gated transitions.

Problem 6 — Intake Is Manual

The gap. Bugs filed in GitHub Issues, feature requests mentioned in Slack, incidents reported by email, compliance findings surfaced by external scanners — all of these are signals that should become Metis work items. Today, a human has to translate each of those signals into a Metis conversation or workflow invocation manually. The dormant Slack and GitHub-webhook adapters in the tree are the kind of integration the platform needs, but they are not wired in, and the intake pattern they would support is not yet a first-class platform capability.

The ideal state. External events — GitHub Issues, Slack messages, Linear tickets, email, webhook payloads from arbitrary external systems — arrive at the platform, run through an intake workflow that classifies them, extracts the intent, drafts a brief, and routes the brief to the appropriate human for triage. A bug filed in a connected repository can become a draft brief and a queued work item automatically, ready for a human to refine or reject. The human still owns the decision to pursue the work, but the translation from external signal to platform artifact is no longer their job.
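
An intake workflow of that shape might look roughly like the sketch below. The trigger sources echo the examples above; the node names and keys are hypothetical.

```yaml
# Illustrative intake workflow: classify, draft a brief, route to a human.
trigger:
  sources: [github_issue, slack_message, webhook]
nodes:
  classify:
    type: agent
    prompt: Classify the signal as bug, feature request, incident, or compliance finding.
  draft_brief:
    type: agent
    input: classify
    prompt: Extract the intent and draft a brief for human triage.
  route_for_triage:
    type: approval_gate
    input: draft_brief
    assignee: project-owner       # the human still owns the go/no-go decision
    on_approve: enqueue_work_item
    on_reject: discard_with_note
```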

Examples in the wild. The ADS vision treated intake as its own stage in the requirements pipeline, with a dedicated Ideas tab as the lightweight ambient-intake surface that humans could promote from, and with explicit promotion rules that carried context into the downstream PRD drafting step. Hamster’s connections layer is a later, shipping example of the ingress side of this: Linear, Jira, Slack, Notion, Figma, Google Drive, GitHub, and Cursor, with bidirectional sync in several cases. Both shapes treat intake as its own stage with its own workflow; neither treats external-signal-to-platform-artifact translation as a manual human chore.

Problem 7 — No Reusable Team Patterns

The gap. Teams on Metis share workflows through the platform-level resource layer, and projects can override platform-level workflows with their own. What is missing is the level between “individual workflow” and “whole team’s collection of workflows” — the pattern, the shape of how a particular kind of work gets done in a particular organization. A team that has a proven five-step PR review process, with specific critique agents, specific approval gates, and specific compliance outputs, should be able to name that pattern, version it, and instantiate it in a new workflow with a single reference. Today, every workflow author instead copies the pattern by hand each time.

The ideal state. A first-class “pattern” or “blueprint” concept at a level above individual workflows. Teams define patterns once, version them, and parameterize them. New workflows compose from patterns rather than duplicating them. When a pattern evolves, every workflow that instantiated it can be notified and optionally updated. The pattern is the unit of organizational learning; the workflow is the unit of specific-work definition.
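
Here is a hypothetical sketch of a versioned pattern and a workflow that instantiates it, using the five-step PR review example from above. The pattern and uses keywords are assumptions, not existing syntax.

```yaml
# Illustrative sketch: a named, versioned pattern and one instantiation.
pattern:
  name: org/pr-review
  version: 2.1.0
  parameters: [critique_provider, compliance_output]
  steps: [draft_review, critique, reconcile, approval_gate, publish_compliance]
---
workflow:
  name: payments-service/pr-review
  uses: org/pr-review@2.1.0       # one reference instead of a hand copy
  with:
    critique_provider: provider-b
    compliance_output: change-record
```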

Examples in the wild. The ADS vision carried this idea implicitly through its agent role definitions — a Supervisor, an Engineering Lead, a QA agent, a Story Writer, and so on — each of which was itself a reusable organizational pattern captured as a named role with scope, credentials, and authority. Hamster later made the same idea explicit at the workflow-composition level, under the name “blueprint”: reusable project patterns capturing team methodology, parameterized for a new project. The superpowers skills framework implements a comparable idea at a finer grain — a skill is a reusable named process that any session invokes and parameterizes. All three solve the same underlying problem: turn learned patterns into organizational assets, versioned and reusable, rather than institutional memory that has to be re-learned by each new team.

Problem 8 — No Workflow Editor

The gap. Workflows in Metis today are authored in YAML in a repository or through a form-driven builder in the web UI. This works for engineers comfortable with YAML and structured data. It does not work for product, QA, design, or compliance stakeholders who need to shape a workflow — for example, to insert an approval gate, add a critique node, change a provider, or add a retrospective step — but for whom YAML is a friction barrier. The result is that workflow authorship is gated by technical literacy in a way that limits who can shape the platform’s behavior.

The ideal state. A visual workflow editor with a DAG canvas, a node palette, parameter forms for each node type, inline validation, and a preview of the compiled workflow. A stakeholder can compose a workflow by dragging nodes, connecting them, configuring their parameters through a form, and saving — without touching YAML. The visual editor and the YAML authoring path produce the same underlying artifact, with lossless conversion in both directions so that a workflow authored in one surface can be edited in the other.

Examples in the wild. The ADS vision described a flow engine visualization as a Phase-2 deliverable and assumed a canvas-based authoring experience. Industry tools in adjacent spaces — n8n, Temporal’s UI, Azure Logic Apps, GitHub Actions’ visual editor — all provide some version of this for their respective workflow models. None of them is a drop-in inspiration because none of them has the agent-and-isolation model Metis workflows carry, but the general pattern of “DAG canvas plus node parameter forms” is well-understood.

Problem 9 — Worker Runtime Is V1

The gap. Workers are one-shot, ephemeral, and cold-started per run. Each run incurs the full cost of a container spin-up, a fresh clone of the target repository, a fresh installation of dependencies, and a fresh materialization of platform resources. For small repositories and occasional runs, this is fine. For larger repositories, for runs that could benefit from checkpointing across iterations, or for organizations running high workflow volume, the cost is substantial and the latency is noticeable.

The ideal state. A worker runtime that supports a warm container pool for common workflow shapes, persistent worktree volumes that preserve state across runs in the same isolation environment, git-cache optimization so that large-repo clones are incremental rather than full, cross-run checkpoint resume so that a multi-step workflow can pick up where it left off after a transient failure, and cost-aware scheduling that trades latency for cost when the budget is constrained. The runtime scales down to the current one-shot model when none of these optimizations are needed.
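
Mapped onto a hypothetical runtime profile, those optimizations might be expressed as configuration. All of the key names below are assumptions, and the default profile would simply reproduce today's one-shot behavior.

```yaml
# Illustrative runtime profile mapping the listed optimizations onto config keys.
worker_runtime:
  warm_pool:
    enabled: true
    size: 3                       # pre-started containers for common workflow shapes
  worktree_volumes:
    persistent: true              # reuse clone and dependencies across runs
  git:
    cache: shared                 # incremental fetches for large repositories
  checkpoints:
    resume_across_runs: true      # pick up after a transient failure
  scheduling:
    policy: cost-aware            # trade latency for cost under tight budgets
```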

Examples in the wild. Persistent worker pools are the standard shape in most long-lived job systems — Temporal, Sidekiq, Celery — though most of those systems do not carry the per-run isolation requirements Metis does. Container-warm-pool patterns are well-known in serverless platforms. The ADS vision called out GKE Autopilot with namespace-per-PR and TTL teardown as an approach to ephemeral PR environments, which is an adjacent pattern with similar operational concerns.

Problem 10 — Integration Surface Is Narrow

The gap. Metis clones repositories and calls LLMs. That is the integration surface. The platform does not push status back to external work trackers (Linear, Jira), does not broadly ingest events from external tools, does not read from or write to external document stores (Notion, Google Drive), does not publish telemetry to external observability systems, and does not offer a first-class integration-author interface for the next such integration a team needs. Every integration added so far has been a one-off.

The ideal state. A first-class integration framework — a typed, versioned adapter interface that any engineer can use to add a new external integration without modifying the platform core. The framework supports event ingress (external signal arrives, intake workflow runs), event egress (Metis event fires, external system receives), two-way sync where appropriate (Linear ticket state reflects Metis run state and vice versa), and credential management through the existing secret store. Adding a new integration is a small amount of adapter code and a configuration entry; it is not a core-platform change.
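
As an illustration of the “adapter code plus a configuration entry” claim, the sketch below shows what such a configuration entry might look like for a Linear adapter. The capability names, fields, and secret reference are assumptions made for the example.

```yaml
# Illustrative configuration entry for one adapter built against the
# proposed framework. Nothing here is an existing Metis schema.
integration:
  adapter: linear
  version: 1
  capabilities:
    ingress: true                 # Linear ticket -> intake workflow
    egress: true                  # Metis run state -> ticket status
    two_way_sync: true
  credentials:
    secret_ref: secrets/linear-api-token   # resolved via the existing secret store
  routing:
    on_ingress: workflows/intake
```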

Examples in the wild. The ADS vision did not name the integration framework as a first-class component, but it assumed several of the integrations the framework would need to support — GitHub App, Slack App, notifications to external approvers — and treated each as part of the platform’s connective tissue. Hamster’s connections layer is the closest published example of breadth: nine integrations out of the box, bidirectional sync where appropriate, credential management in-platform. More traditional iPaaS platforms — Zapier, Make, Tray — sit at a different point in the tradeoff space, optimizing for no-code configuration at the cost of depth. Metis sits closer to Hamster’s shape, because the integrations in question are depth-integrations with the platform’s workflow model, not generic event forwarders.

Problem 11 — Evidence and Trust Calibration Are Informal

The gap. The platform emits events. The platform writes audit logs. The platform records workflow outcomes. What the platform does not do is turn that evidence into an observable trust model — per workflow, per node type, per agent, per provider — that administrators can use to extend or constrain automation. Every workflow is treated the same regardless of its history; every approval gate enforces a human review regardless of whether the pattern behind that gate has been observed to produce reliable output thousands of times. Speed should be a function of trust, and trust should be a function of evidence; today, speed is a function of a workflow author’s preferences.

The ideal state. A trust-calibration model that tracks per-workflow and per-node-type metrics — success rates, human-override rates, cost per successful outcome, time-to-human-intervention — and exposes those metrics through dashboards for administrators and workflow authors. A feedback loop turns sustained evidence into progressive automation: a workflow that has succeeded hundreds of times without a human-corrected output can have its human review gate converted to a notification; a workflow that has regressed in quality can have its gate restored. Every extension and restriction of automation is logged and reversible.
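
Expressed as a hypothetical gate policy, the feedback loop might look like the sketch below; the thresholds and key names are placeholders, not recommended values.

```yaml
# Illustrative gate policy expressing "speed as a function of trust."
gate_policy:
  workflow: pr-description-critique
  metrics_window: 90d
  promote_to_notification_when:
    successful_runs: 500
    human_override_rate_below: 0.02
  restore_gate_when:
    human_override_rate_above: 0.05
  audit:
    log_every_change: true        # every extension or restriction is recorded
    reversible: true
```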

Examples in the wild. The ADS vision called out a progressive gate trust model with four modes — a Mode 1 where agents worked the bench with full human supervision, and later modes earning more autonomy as evidence accumulated — and defined compliance events (gate decisions, PR merges, release rollbacks) as the substrate for measuring that evidence. The superpowers verification-before-completion skill is the human-scale version of the same idea: refuse to claim work is done without evidence. Both shapes agree on the link between measurement and autonomy; the ADS shape is the one most directly applicable to Metis.

Problem 12 — Governance Artifacts Are External

The gap. ADRs, change records, compliance reports, incident retrospectives, release notes, and other governance artifacts are produced today — if at all — outside Metis, in Confluence or similar, and linked back manually when they are linked at all. This means that the evidence chain that ADS-style governance assumes — intent to requirement to plan to run to code to release, all linked, all queryable, all auditable — has a hole in it wherever governance artifacts are supposed to sit. The hole is not an engineering gap; it is a governance gap.

The ideal state. Governance artifacts are produced by workflows, reviewed by humans through platform approval gates, linked into the platform’s context graph, and searchable from any of the other artifacts they relate to. An ADR is a workflow output, not a Confluence page. A release retrospective is a workflow that runs after release, producing an artifact that links back through the context graph to the brief, the requirements, the plans, and the runs. A change record is a workflow output, generated from the platform’s own activity, not a human task on top of the platform’s activity.

The retrospective case is a particularly sharp example. A post-delivery retrospective should be a workflow: it runs after release, it gathers the artifacts produced during the delivery, it produces a structured retrospective artifact with findings and follow-up actions, it routes the artifact to the team for review, and the follow-up actions become new briefs that re-enter the platform at the top. The workflow is the vehicle; the retrospective is the output; the follow-up is the next piece of intent. Nothing about that loop needs to leave the platform.
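
Sketched as a workflow definition, with hypothetical node types and keys, that loop might look like this:

```yaml
# Illustrative retrospective workflow tracing the loop described above.
trigger:
  event: release_published
nodes:
  gather_evidence:
    type: query_graph             # collect the brief, requirements, plans, and runs for the release
  draft_retrospective:
    type: agent
    input: gather_evidence
    output: retrospective         # structured findings and follow-up actions
  team_review:
    type: approval_gate
    input: draft_retrospective
  open_follow_ups:
    type: create_briefs           # follow-up actions re-enter the platform as new intent
    input: team_review
```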

Examples in the wild. The ADS vision called out the compliance audit trail as a permanent record in the platform, with typed compliance events written via the platform API, and treated it as separate from observability. Industry practice in regulated environments — financial services, aviation, healthcare — routinely builds change records, incident reports, and audit trails as platform-produced artifacts rather than human-authored ones, for the simple reason that human-authored governance artifacts are less reliable and less traceable than platform-produced ones.

Tensions the Platform Will Keep Living With

The problems above are gaps the platform intends to close. The following are not gaps; they are balances the platform needs to hold continuously, and any attempt to resolve them by picking a side would damage the platform. They are listed here so that design decisions that touch them are made in the open, knowing the tension exists.

Global-versus-Project Consistency

Some capabilities belong to everyone, everywhere. Some belong to a single project. A workflow that enforces our organization’s PR review standards should be consistent across projects; a workflow that encodes a specific team’s release rhythm should be specific to that team. The platform must make both kinds possible, make the seams visible, and make overrides traceable — without forcing every team into a standardized process and without allowing every team to re-invent fundamentals. Deciding which capabilities fall on which side of this line is ongoing work, not a one-time decision.

Human-Paced Budgets versus Agent Throughput

Agents can generate work faster than humans can review it. Without a constraint, the platform will accumulate a growing backlog of AI-generated work awaiting human review, and the review queue will become the bottleneck that defines platform throughput. A cost budget — in tokens, in dollars, in calendar time, in human-attention minutes — is the right constraint, but budgeting is a political activity as much as an engineering one. The platform needs to surface cost in a way that makes tradeoffs negotiable rather than invisible, and it needs to give administrators levers to adjust throughput without having to manually cap individual runs.

Flexibility versus Guardrails

The platform wants to be powerful enough that a new team can shape it to their specific process, and safe enough that no team can accidentally do catastrophic damage to itself or to others. Those two goals pull in opposite directions at every design decision. A new node type adds power; it also adds the possibility of new failure modes. A new workflow configuration option adds flexibility; it also adds the surface area for misconfiguration. The platform must keep adding power while simultaneously keeping guardrails strong — and must be willing to refuse requests that would trade safety for flexibility in ways we do not want to support.

Reusability versus Team Specificity

A pattern captured as an organization-wide asset is only useful if it is general enough to apply across teams. A pattern tuned for one team’s specific workflow is only useful if it is specific enough to reflect that team’s actual process. The same tension that plays out at the capability level — consistency versus autonomy — plays out inside the resource library as well. A skill that is too general is a skill nobody uses because it does not fit their case; a skill that is too specific is a skill only one team benefits from. Keeping the library coherent and useful is an ongoing curatorial responsibility, not a one-time library-design exercise.


None of these tensions will go away, and none of them should. The goal of the platform is not to eliminate them but to make them legible — to surface the tradeoffs to the humans who need to decide, to record the decisions in a form the platform can remember, and to leave room for the decisions to be revisited as evidence accumulates. A platform that holds its tensions honestly is more durable than one that resolves them prematurely.
