AI-first E2E testing for mobile apps.
Describe your business flows. Agents rove your live app to find bugs, UX flaws, and broken paths.
Install, init, smoke.
Three commands. No harness to wire up. No selectors to maintain.
```shell
$ dart pub global activate roveflow
✓ activated roveflow 0.1.0
$ roveflow init
✓ detected Flutter project at .
✓ wrote .claude/skills/roveflow/
✓ wrote .mcp.json · docs/roveflow/scenarios.md
next: run /roveflow --only=cold-setup inside Claude Code.
```
From flow to result.
Hand the agents a flow. They rove your live app the way users would, finding their own path to satisfy it. Along the way they report the bugs, UX flaws, and broken paths they hit.
You write the flow
Describe what should happen in plain English. Any flow you'd hand a user: book an appointment, sign up a new account, reset a password.
Agents rove your live app
Agents launch your app and find their own path to satisfy the flow. If none works, that's a finding in itself.
Findings with evidence
Every scenario returns pass, fail, or an explicit skip, plus the path it took and any bug or UX flaw it hit along the way. With a recording.
Your users don't follow the script.
Real users take paths you didn't write a test for. Roving agents find them first.
Scripted suites
- Only cover paths you thought to write
- Break when a button or route moves
- Silent on the paths users actually take
- Grow expensive to maintain
Roving agents
- Cover the paths users actually take
- Adapt when the UI changes
- Surface edge cases before ship
- Stay useful as the app grows
Different modes for different use cases.
Same engine, different briefing. The less you tell the agents, the more autonomously they rove.
| Mode | What you tell the agents | What they report | Status |
|---|---|---|---|
| Smoke | “Book an appointment for next Tuesday at 10am. Hints: New booking → 10am slot → Confirm. Expect the confirmation screen.” | pass / fail / skipped per scenario | Available |
| UX emulator | “You're a first-time user trying to book an appointment. Report where you get confused or stuck.” | usability findings | Coming next |
| Coverage expander | “Visit every screen reachable from home. Check each one. Report anything broken.” | screen inventory with health status | Coming next |
| Chained flows | “Book an appointment, reschedule it to the next day, then cancel it. No hints. Find each step on your own.” | pass / fail against the flow | Coming next |
| Crash hunter | “Use the app however you like. Report anything that crashes, freezes, or locks up.” | crashes with reproduction steps | Coming next |
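A chained-flows briefing could reuse the same scenario shape smoke mode uses — a sketch, assuming the `id` / `goal` / `pass` / `fail` keys carry over to the coming mode (this scenario is hypothetical, not shipped syntax):

```yaml
# Hypothetical chained-flows scenario; keys mirror the smoke-mode format.
id: book-reschedule-cancel
goal: Book an appointment, reschedule it to the next day, then cancel it.
# no waypoints: agents find each step on their own
pass: the appointment no longer appears after cancellation
fail: any step cannot be completed
```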
Down the table: less direction, more autonomy.
Anatomy of a rove.
You write a flow. Agents run it. You get a structured result. Replay any failing step.
You write the flow.
A short YAML block. Plain English for the flow and its pass/fail conditions. No selectors.
```yaml
id: open-detail
goal: Navigate from home to the detail screen.
waypoints: # optional
  - tap_text: "Open Detail"
  - reach_screen: "Detail"
pass: "Detail" app-bar title visible
fail: navigation does not occur
```
Agents take the controls.
Agents tap, scroll, and check the UI on your live app.
You get a structured result.
pass, fail, or an explicit skip per scenario. Every run saves screenshots, notes, and a trace to docs/roveflow/runs/.
```markdown
# Roveflow smoke · 2026-04-17T10:32Z

| Scenario     | Result              | Duration |
|--------------|---------------------|----------|
| cold-setup   | ✓ pass              | 00:18    |
| open-detail  | ✓ pass              | 00:21    |
| back-to-home | ⊘ skipped: no_data  | 00:12    |

## Result states
- pass · fail
- skipped: no_data · setup_failed · setup_lost
```
Every state is actionable. No scenario ever "just fails".
Record today. Replay next.
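Because every run writes a markdown report, a CI job can gate on it. A minimal sketch in Python, assuming the table layout above — scenario name in the first column, result in the second; the `✗` marker and the function itself are illustrative, not part of the Roveflow CLI:

```python
# Hypothetical CI gate: scan a Roveflow run report (markdown table) and
# collect scenarios whose result is an explicit fail. Skipped scenarios
# (no_data, setup_failed, setup_lost) do not break the build.
def failing_scenarios(report_text: str) -> list[str]:
    failing = []
    for line in report_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # prose, headings, blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 2:
            continue
        # "✓ pass" -> "pass", "⊘ skipped: no_data" -> "skipped", "✗ fail" -> "fail"
        result = cells[1].lstrip("✓✗⊘").strip().split(":")[0].strip()
        if result == "fail":
            failing.append(cells[0])
    return failing
```

Exit non-zero from the CI step whenever the returned list is non-empty.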
Every session records to disk. Walk back through the steps and see what the agent saw. Deterministic replays come next.
```
roveflow-runner → open-detail
▸ tap_text("Open Detail")
  waypoint · tap_text: Open Detail
▸ reach_screen check
  waypoint · reach_screen: Detail
▸ navigate_back
  ended_at_home: true
result: pass · 3 tool calls · 00:21
```
Write the flow. Skip the scaffolding.
Scripted tests couple to every key and label. Rename a button, rename a test. Scenarios describe the flow instead.
```dart
// integration_test/open_detail_test.dart
import 'package:flutter_test/flutter_test.dart';
import 'package:integration_test/integration_test.dart';
import 'package:nav_app/main.dart';

void main() {
  IntegrationTestWidgetsFlutterBinding.ensureInitialized();

  testWidgets('open the detail screen', (tester) async {
    await tester.pumpWidget(const NavApp());
    await tester.pumpAndSettle();
    expect(find.byKey(const ValueKey('home')), findsOneWidget);

    await tester.tap(find.byKey(const ValueKey('open_detail_button')));
    await tester.pumpAndSettle();
    expect(find.byKey(const ValueKey('detail_body_text')), findsOneWidget);
    expect(find.text('Detail'), findsOneWidget);
    // Renaming any key above breaks this test.
  });
}
```

```yaml
id: open-detail
goal: Navigate from home to the detail screen.
waypoints: # optional
  - tap_text: "Open Detail"
  - reach_screen: "Detail"
pass: "Detail" visible
fail: navigation does not occur
```

Rename “Open Detail” to “View More”. The scenario still passes. The Dart test doesn't.
Ship faster. Break nothing.
CLI in your terminal. Slash command in Claude Code. A report you can read without grepping.
```shell
$ dart pub global activate roveflow
✓ activated roveflow 0.1.0
$ roveflow init
✓ detected Flutter project at .
✓ wrote .claude/skills/roveflow/ · commands · agents
✓ wrote .mcp.json · docs/roveflow/scenarios.md
next: run /roveflow --only=cold-setup inside Claude Code.
```
One activate. One init. Zero source edits.
roveflow init lays down the Claude Code skill, agent, MCP config, and scenarios file. Everything regenerable.
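Pieced together from the init output and the run paths mentioned above, the generated layout looks roughly like this (the comments and the exact contents of the skill directory are assumptions):

```
.claude/
  skills/roveflow/     # skill, commands, agents
.mcp.json              # MCP server config
docs/roveflow/
  scenarios.md         # your flows, in scenario YAML
  runs/                # screenshots, notes, and traces per run
```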
Already handled.
The first questions teams ask about AI-driven testing, each with Roveflow's answer.
Isn't AI testing too slow?
Each scenario runs in 30 to 60 seconds. Scenarios run in parallel across devices, capped at the device count. Lightweight agents keep per-scenario latency and cost low, so the suite optimizes for wall-clock time overall.
Won't this burn through tokens?
Pick the model. Haiku keeps runs cheap. Sonnet or Opus take on the hard flows. The skill gives agents enough context that smaller models hold up. Bring your own Anthropic key. No markup.
What if the model silently changes?
Pin model versions per run. Recorded replays come next, so a run's evidence survives a model bump.
Will results flip between runs?
Every run takes a fresh path, the way a real user would. That's the whole point: it finds what a fixed script cannot.
Start roving.
Install the CLI. Point it at your mobile app. Catch the bugs no one wrote a test for.