I recently sat through a demo for a startup that promised to scrape publicly available documents using an AI agent.
To be fair, this was not a cold demo. We gave them the target URL ahead of time. The documents were public. The scope was known.
Thirty minutes later, their agent had successfully retrieved... three documents.
Not edge-case documents either. These were the obvious ones.
At that point, the demo was painful -- not because the team did not know what they were doing, but because it exposed something deeper about how fragile many "agentic" systems still are in practice.
A Simple Counter-Experiment
After the demo, curiosity got the better of me.
I decided to try the same task using Testronaut -- with only a couple of small adjustments to how the mission was written and how navigation retries were handled.
Same goal. Same public docs.
Within minutes, the agent had collected all 30 documents.
Including the large ones.
No hand-holding. No pre-indexed shortcuts. Just better agent flow control and a clearer sense of what mattered at each step.
That alone was interesting -- but what came next was more telling.
Removing the Crutch
Next, I removed the biggest advantage: the URL.
I rewrote the mission text to say, in effect:
"For this jurisdiction, find where public documents are hosted, identify the relevant ones, and download them."
- No prior knowledge of where the documents lived.
- No consistent site structure.
- Different jurisdictions, different platforms, different quirks.
Testronaut navigated each site, found the documents, and downloaded them successfully.
That is when it stopped feeling like a demo win and started feeling like a systems lesson.
Scaling the Boring Way (On Purpose)
Eventually, I ended up with a large list of jurisdictions.
So I did what any reasonable engineer would do:
I wrote a small Bash script that looped over the list and kicked off the mission for each one.
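Something in the spirit of the sketch below -- note that the `testronaut` command, its flags, and the file layout are illustrative assumptions, not the tool's actual interface:

```bash
#!/usr/bin/env bash
# Loop over a list of jurisdictions and kick off one mission per line.
# NOTE: the `testronaut` invocation, its flags, and the paths here are
# hypothetical stand-ins -- adapt them to however your agent runner is invoked.
# Assumes jurisdiction names are filesystem-safe (no slashes or spaces).
set -euo pipefail

mkdir -p logs downloads

while IFS= read -r jurisdiction; do
  [ -z "$jurisdiction" ] && continue        # skip blank lines
  echo "=== ${jurisdiction} ==="
  testronaut run \
    --mission missions/collect-public-docs.md \
    --var "jurisdiction=${jurisdiction}" \
    --output "downloads/${jurisdiction}" \
    > "logs/${jurisdiction}.log" 2>&1 \
    || echo "${jurisdiction}" >> failed.txt  # record the miss, keep going
done < jurisdictions.txt
```

Nothing clever: one mission per jurisdiction, one log file per run, and a list of failures to look at when I got back.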
Then I went to watch Zootopia 2 with my kids.
Four hours later, I came back.
- About 100 jurisdictions processed.
- Documents discovered.
- Files downloaded.
- Logs intact.
No heroics. No babysitting.
The Part That Really Stuck With Me
Here is the part that reframed things for me:
The token cost was roughly $25/hour -- across the four-hour run and about 100 jurisdictions, that works out to something like a dollar per jurisdiction.
That is it.
When you contrast that with AI startups raising millions -- often to solve this exact class of problem -- it forces an uncomfortable question:
How much of the challenge is actually AI capability... and how much is agent design, flow control, and operational discipline?
The Real Takeaway
This is not a victory lap.
The startup whose demo I watched is not incompetent. They are building in a space that is genuinely hard. Agentic systems are brittle when poorly constrained.
But that is the point.
The difference between:
- an agent that stalls after 3 documents, and
- an agent that quietly processes 100 jurisdictions unattended
was not a breakthrough model, secret training data, or massive infrastructure spend.
It was:
- better mission framing
- clearer success criteria
- retry logic that respects reality (sketched below)
- and treating agents like systems, not magic
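A minimal sketch of what "retry logic that respects reality" looks like: bounded attempts, a pause between them, and a visible failure record rather than an infinite loop or a single silent give-up. In my runs this handling lived inside the agent's mission flow rather than in the shell, and the `testronaut` invocation below is again a hypothetical stand-in, used purely to illustrate the pattern:

```bash
# Bounded retry with backoff around a single mission run.
# The `testronaut` command and its flags are hypothetical stand-ins.
run_with_retries() {
  local jurisdiction="$1"
  local max_attempts=3
  local attempt=1

  while (( attempt <= max_attempts )); do
    if testronaut run \
         --mission missions/collect-public-docs.md \
         --var "jurisdiction=${jurisdiction}"; then
      return 0                              # success: stop retrying
    fi
    echo "Attempt ${attempt} failed for ${jurisdiction}, backing off..." >&2
    sleep $(( attempt * 30 ))               # wait longer after each failure
    (( attempt++ ))
  done

  echo "${jurisdiction}" >> failed.txt      # give up visibly, not silently
  return 1
}
```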
Agentic flows are powerful -- but only when we stop treating them like demos and start treating them like engineering.
And sometimes, the best way to see that clearly is to fix something small... then go watch a movie.