Benchmarks

SAK needs benchmarks to establish engineering credibility. Benchmarks are not only for ranking models: they define product capability, catch regressions, and back up sales claims.

Memory benchmarks

Memory benchmarks can be divided into two types:

  1. Long-context understanding benchmarks that test model accuracy at long context limits.
  2. Agent-specific memory benchmarks that test long conversations, cross-session retrieval, state evolution, and memory attribution.

LoCoMo

LoCoMo is a useful starting point for conversational memory evaluation and fits early GUM benchmark design. It covers single-hop, multi-hop, temporal, commonsense, and adversarial questions, and its protocol can combine Recall@k with LLM-as-a-Judge scoring.
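As a minimal sketch of the retrieval half of that protocol, Recall@k can be computed as the fraction of gold memory items that appear in the top-k retrieved items. The function below is illustrative, not LoCoMo's reference implementation; the LLM-as-a-Judge half (grading the final answer with a judge model) is noted in a comment rather than implemented.

```python
from typing import List

def recall_at_k(retrieved_ids: List[str], gold_ids: List[str], k: int) -> float:
    """Fraction of gold memory items found in the top-k retrieved items.

    Sketch implementation: assumes string ids and a ranked retrieval list.
    """
    if not gold_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for gold in gold_ids if gold in top_k)
    return hits / len(gold_ids)

# LLM-as-a-Judge complements this retrieval metric: a judge model scores
# the agent's final answer for correctness; that call is omitted here.
```

In practice the two scores diverge in useful ways: high Recall@k with low judge scores points at generation problems, while the reverse suggests the model is answering from parametric knowledge instead of memory.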

GUM should test:

  • Cross-session factual recall.
  • Expired facts and state changes.
  • Explanation of which memories were used.
  • Avoidance of hallucinations caused by incorrect memories.

Web Agent benchmarks

Web Agent benchmarks should cover search, extraction, Textify, and browser execution:

  • Search relevance and source credibility.
  • Extraction fidelity from webpages and PDFs to Markdown.
  • Dynamic page handling for SPAs, pagination, tables, and login states.
  • Action reliability for click, type, scroll, and monitor workflows.
  • Cost and latency across tokens, runtime, and retry behavior.
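The cost-and-latency bullet implies per-run bookkeeping across tokens, runtime, and retries. A minimal sketch of that accumulator follows; the `RunStats` name and the per-1k-token price are assumptions for illustration, not SAK's actual accounting.

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    """Accumulates cost and latency signals over one benchmark run."""
    tokens: int = 0
    runtime_s: float = 0.0
    retries: int = 0

    def record(self, tokens: int, runtime_s: float, retried: bool = False) -> None:
        """Record one agent step: tokens used, wall time, and whether it was a retry."""
        self.tokens += tokens
        self.runtime_s += runtime_s
        if retried:
            self.retries += 1

    def cost_usd(self, usd_per_1k_tokens: float = 0.002) -> float:
        """Token cost at an assumed flat rate; real pricing varies by model."""
        return self.tokens / 1000 * usd_per_1k_tokens
```

Tracking retries separately matters: a run that succeeds only after several retries can look accurate while hiding a reliability and cost problem.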