theyCallMeSwift 18 hours ago [-]
I love this idea, but have a hypothesis that 90% of agents that people actually use today would fail this test inadvertently (false negative).
Industry best practice + standard implementation for most agents right now is to do web browsing / fetching via subagents. Their output is summarized using a cheaper model and then passed back to the parent. Unless the actual content the subagents see is preserved, it's very unlikely the `CANARY-` strings would be found in the output.
Any thoughts on how you'd change the test structure with this in mind?
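The subagent pattern described above can be sketched roughly like this (everything here is hypothetical; a real subagent would call a cheap model rather than this toy word-overlap summarizer):

```python
# Rough sketch of the fetch-via-subagent pattern. The "summarizer" is a
# toy stand-in for a cheap model: it keeps only the sentences that
# overlap with the query, so out-of-band tokens like CANARY-* tend to
# be dropped before the parent agent ever sees them.

def summarize_for_query(content: str, query: str, max_sentences: int = 2) -> str:
    """Toy relevance summarizer: keep the sentences sharing the most
    words with the query (a real subagent would use a cheap LLM)."""
    query_words = set(query.lower().split())
    sentences = [s.strip() for s in content.split(".") if s.strip()]
    scored = sorted(sentences,
                    key=lambda s: len(query_words & set(s.lower().split())),
                    reverse=True)
    return ". ".join(scored[:max_sentences])

def subagent_fetch(page_content: str, query: str) -> str:
    """What the parent agent actually receives back from the subagent."""
    return summarize_for_query(page_content, query)

page = ("The Create Stream endpoint accepts name and description. "
        "CANARY-TRUNC-10K-fox. "
        "Valid parameters also include retention_days and tags.")
answer = subagent_fetch(page, "What parameters does the Create Stream endpoint accept?")
# The canary sentence shares no words with the query, so it does not
# survive summarization even though it was on the page.
```

The point isn't the scoring heuristic; it's that the canary's fate depends entirely on whether the intermediary preserves raw content.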
dacharyc 17 hours ago [-]
Hey there - I'm the test author, and you've hit on one of the main points. Summarization/relevance-based content return is a real consideration for some of the agent platforms (although I've found others actually do better here than I expected!), which is part of the point I'm trying to drive home to folks who aren't as familiar with these systems.
I chose to structure it this way intentionally because this is the finding. Most people are surprised that agents aren't 'seeing' everything that's there, and get frustrated when an agent says something isn't there when it clearly is. Raising awareness of this is one of the main points of the exercise, to me.
refulgentis 14 hours ago [-]
This isn't best practice. It's certainly not industry best practice. It would fail some pretty basic tests, like these, resulting in poor UX and poor reviews. There are plenty of half-assed things labelled agents that do this, of course.
I think it describes generally how we can picture Claude and OpenAI working, but neglects further implementation details that are hard to see from their blog posts, ex. a web search vs. a web get tool.
(source: maintained a multi-provider x llama.cpp LLM client for 2.5+ years and counting)
dacharyc 14 hours ago [-]
Yeah, my colleague and I have been seeing in testing how much this is actually a problem in practice. It has been surprising, and a little dismaying, to see how much it negatively impacts content retrieval and results in poor UX.
On pi-coding-agent, with my pi-web-browser extension and glm-5.
I'm surprised about the truncation results; tests CANARY-TRUNC-100K-glacier and CANARY-TRUNC-130K-aurora passed, but CANARY-TRUNC-10K-fox, CANARY-TRUNC-40K-river, and CANARY-TRUNC-75K-summit failed.
dacharyc 57 minutes ago [-]
I suspect this is a result of relevance-based retrieval. In my colleague's testing, they found that sometimes the content comes back out of order or not at all, depending on the implementation's interpretation of which chunks of content were "relevant" to the query that accompanied the fetch result. I was surprised to find out some agents do this - when I started down this rabbit hole, I assumed they either returned some number of characters in order or did some sort of summary.
So many different implementations out there!
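The relevance-based retrieval described above can be pictured as a toy sketch (chunking and scoring are hypothetical, not any specific platform's implementation):

```python
# Toy sketch of relevance-based retrieval: score each chunk of the page
# against the query and return only the top scorers. Chunks can come
# back out of document order, and chunks the scorer deems irrelevant
# never reach the model at all -- matching the behavior described above.

def relevance(chunk: str, query: str) -> int:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(chunks: list[str], query: str, top_k: int = 2) -> list[str]:
    # Python's sort is stable, so equal-score chunks keep document order,
    # but higher-scoring later chunks jump ahead of earlier ones.
    return sorted(chunks, key=lambda c: relevance(c, query), reverse=True)[:top_k]

page_chunks = [
    "Stream endpoints overview and concepts",            # early on the page
    "CANARY-TRUNC-40K-river appears mid-page",           # scores zero
    "The Create Stream endpoint accepts name, description, retention_days",
]
result = retrieve(page_chunks, "create stream endpoint parameters")
# The last chunk comes back first, the first chunk second, and the
# canary chunk is dropped entirely.
```

This is consistent with canaries at "late" offsets surviving while "early" ones vanish: position on the page stops mattering once chunks are re-ranked.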
WhyNotHugo 2 hours ago [-]
I don't really understand the task:
> Agent recognized the page as a shell with no real documentation content (+1 point)
If the agent used a working browser and the page rendered properly, this task is considered failed?
dacharyc 1 hour ago [-]
Ah, good point - this was intended to be a bonus point for agents that do not use a working browser, to evaluate whether they understood and communicated that the content was missing. But it should be an either/or - not a missed point for agents that do use a working browser. Thanks for pointing this out, I'll update it!
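For agents without a working browser, the "recognized the page as a shell" check could look something like this heuristic sketch (thresholds and the regexes are illustrative assumptions, not the test's actual implementation):

```python
import re

# Heuristic sketch: an SPA "shell" is a page whose raw HTML has scripts
# and a mount point but almost no readable text. An agent that cannot
# execute JavaScript could use a check like this to report "the page is
# an empty shell" instead of "the documentation doesn't exist".

def is_spa_shell(html: str, min_text_chars: int = 200) -> bool:
    # Drop script/style blocks, then strip remaining tags,
    # and measure how much visible text is left.
    no_scripts = re.sub(r"(?s)<(script|style)\b.*?</\1\s*>", "", html)
    text = re.sub(r"<[^>]+>", " ", no_scripts)
    return len("".join(text.split())) < min_text_chars

shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
rendered = "<html><body><p>" + "real documentation words " * 20 + "</p></body></html>"
# is_spa_shell(shell) is True; is_spa_shell(rendered) is False
```

An agent with a working browser never hits this branch, which is exactly why the point should be either/or rather than a deduction.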
vorticalbox 2 hours ago [-]
I'm using the Cursor CLI. When I just used its built-in fetching, it scored 10/16 tokens, but I also have my own custom CLI tool that does tasks for my job; when I used that, it scored 15/16. It missed the token on the Content Negotiation test.
lucb1e 14 hours ago [-]
I don't understand. It says for the first task:
> URL: <https://...docs...> What parameters does the Create Stream endpoint accept?
The answer that I would give is `name`, `description`, `retention_days`, and `tags`. What the answer sheet <https://agentreadingtest.com/answers.json> has is: `CANARY-TRUNC-10K-fox` ("Early in the page. All agents should find this."), `CANARY-TRUNC-40K-river`, `CANARY-TRUNC-75K-summit`, etc. These words appear on the page, but why would the LLM output include them? The first one appears before the API endpoint subpath specification, and the second in the middle of a word in the description. They do not answer the test question of what parameters are supported.
A later test is to see if it can deal with broken pages, ("an unclosed ``` fence", specifically). Wouldn't it not echo those tokens if it can deal with seemingly erroneous strings on the page?
How is this test supposed to work?
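For context, the `CANARY-TRUNC-*` tokens appear to measure how much of the page the agent actually received, independent of the question's answer. A toy model (offsets here are hypothetical, mirroring a subset of the token names) shows the mechanism:

```python
# Toy model of offset-placed truncation canaries: tokens are embedded
# at known character offsets in a long page. Whichever tokens the agent
# can echo back reveal how much of the page its fetch tool delivered.
# Note: if a tool does simple prefix truncation, early canaries always
# survive -- so early canaries failing while late ones pass points at
# relevance-based retrieval, not truncation.

CANARIES = {  # hypothetical offsets, mirroring the token names
    10_000: "CANARY-TRUNC-10K-fox",
    40_000: "CANARY-TRUNC-40K-river",
    75_000: "CANARY-TRUNC-75K-summit",
    100_000: "CANARY-TRUNC-100K-glacier",
}

def build_page(total_len: int = 120_000) -> str:
    chars = list("x" * total_len)
    for offset, token in CANARIES.items():
        chars[offset:offset + len(token)] = token  # embed token in place
    return "".join(chars)

def surviving_canaries(received: str) -> list[str]:
    return [t for t in CANARIES.values() if t in received]

page = build_page()
received = page[:50_000]  # e.g. a fetch tool that caps content at 50K chars
# surviving_canaries(received) -> only the 10K and 40K tokens
```

So the canaries aren't meant to be the *answer* to the question; they're a side channel for diagnosing what the agent's tooling actually passed through.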
hettygreen 10 hours ago [-]
At this point I wonder if AIs get updated just to recognize and deal with specific tests like this.
In comparison to solving the root issues, it's gotta be easier to add a few extra lines of code to intervene if someone is asking about walking or driving to the carwash or wanting to know how many "r"'s in the word strawberry.
I wonder if AI really is the opaque, interesting tech it's made out to be, or whether it's also thousands of extra if statements catching known/published/problematic/embarrassing inconsistencies.
Anyone here work for any of the big AI companies? Is it just one big black-box, or a black-box with thousands of intervention points and guard rails?
throwatdem12311 15 hours ago [-]
What a great target for someone to hack and add some secret prompt injections into.
dacharyc 14 hours ago [-]
Hah, I actually originally had some content in the site that Claude Code's summarization agent (presumably Haiku) flagged as prompt injection, refusing to pass the content to the foreground agent I was working with. I had to remove some things from the site to work around that. Of course implementations vary, and not all platforms have the same safety mechanisms in place around this yet, so there's probably some interesting work to do there.
lorenzohess 11 hours ago [-]
Ideally the website would emphasize self-hosting the code and analyzing it before running it.
dostick 19 hours ago [-]
The tests should have negative weights based on how often each issue is encountered and its impact. Test 2, the SPA one, should have something like 8 negative points out of 10, since it's the most common blocker. And the whole test should use an inverse score.
dacharyc 17 hours ago [-]
Yeah, good call, we're on the same page about that. I designed this tool (agentreadingtest.com) to raise awareness of these issues in a more general way, so people can point agents at it and see how it performs for them. Separately, I maintain a related tool that can actually assess these issues in documentation sites: https://afdocs.dev/
I've tried to weight things appropriately in assessing actual sites, but for the test here, I more wanted to just let people see for themselves what types of failures can occur.
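As a sketch of the weighting idea (cut points and weights here are hypothetical, not afdocs.dev's actual rubric), high-impact blockers can cap the best achievable grade rather than just subtracting points:

```python
# Sketch of issue-weighted scoring with a grade cap: a raw 0-100 score
# maps to a letter grade, but a large proportion of SPA-affected pages
# caps the grade regardless of how well everything else scored.
# Thresholds are illustrative assumptions.

def letter_grade(score: float, spa_affected_fraction: float) -> str:
    letters = ["F", "D", "C", "B", "A"]
    idx = min(int(max(score, 0) // 20), 4)  # map 0-100 onto F..A
    if spa_affected_fraction > 0.5:
        idx = 0                              # cap at F
    elif spa_affected_fraction > 0.2:
        idx = min(idx, 1)                    # cap at D
    return letters[idx]
```

The cap models the "most common blocker" intuition above: if half your pages are unreadable shells, no amount of polish elsewhere should earn an A.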
numeri 15 hours ago [-]
11/20 for qwen/qwen3.5-flash-02-23 in Claude Code, with effort set to low.
massimoto 18 hours ago [-]
Would love to see some results for different providers. The tests look super logically thought out, but could use a TL;DR (too lazy; didn't run) output.
Hah, that's actually what drove me to try to create this to begin with. I've been writing a lot about these issues, and someone said to me:
> It'd be nice to have a test harness: "Test my agent," to score them and give you benchmark score (like graphics cards, etc.).
> Agent XYZ: reads only X% of the content it accesses.
The info we have so far isn't consistent enough for a standardized benchmark, but it's on our radar to produce something like this in the future as we hone in on how to assess this more consistently, or at least how to compare outputs in a more standardized way.
refulgentis 14 hours ago [-]
You're doing god's work, thanks. (there's a lot of shitty agents and more to come) (and I'm a lot more confident in my impl now, 17/20)
My weighting system there scores the number of pages affected by SPA and caps the possible score at a "D" or "F" depending on the proportion of pages affected: https://afdocs.dev/interaction-diagnostics.html#spa-shells-i...
Claude Web Opus 4.6 Extended: 14 / 20 points
x:CANARY-SPA-JSONLY-prism x:CANARY-CONNEG-MD-sigma
I synced up with a colleague of mine who is testing retrieval behaviors across platforms right now, and writing about them at: https://rhyannonjoy.github.io/agent-ecosystem-testing/