May 28, 2026

AI Morning Digest

Agents are hitting hard limits everywhere at once: frontier models can't reliably handle enterprise IT tasks, LLMs plateau on causal reasoning, and the agentic systems that are working are leaking user files. Meanwhile, Anthropic and OpenAI quietly crossed into profitability.

Research & Papers

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks

Hugging Face Blog
  • Every frontier model tested — including GPT-4o and Claude — scored under 50% on real enterprise IT workflows like incident triage, config management, and log analysis.
  • The benchmark from Artificial Analysis and IBM is the first to evaluate agents on end-to-end IT ops tasks rather than isolated subtasks, exposing a gap that code-eval benchmarks mask.
  • The failure mode isn't factual recall — it's multi-step tool orchestration under ambiguous real-world conditions, which is exactly what IT automation requires.
Read full article

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

ArXiv cs.AI
  • Fine-tuned LLMs plateau on even simple causal graphs and degrade as graph complexity grows — the paper argues this is a fundamental limit of passive text prediction, not a scale problem.
  • Interventional agents that can actively experiment (run do-calculus-style queries on an environment) consistently outperform passive LLMs on causal discovery tasks.
  • The practical implication: any system that needs to infer cause-effect relationships — diagnostics, root-cause analysis, scientific reasoning — needs active probing loops, not bigger models.
Read full article

Tools & Practical

Reachy Mini Goes Fully Local

Hugging Face Blog
  • The Reachy Mini robot now runs its full conversational pipeline — speech recognition, LLM inference, and TTS — entirely on-device with no cloud calls.
  • The stack uses faster-whisper for STT and a quantized local LLM, proving the hardware bar for private embodied AI is now within reach of a consumer robot price point.
  • Latency is the remaining tradeoff: local inference adds ~1-2s vs cloud, but the privacy and offline-capability gains matter significantly for real deployments.
Read full article

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

Hugging Face Blog
  • TRL's new delta weight sync pushes only changed weights between training steps to the Hub, cutting checkpoint upload time dramatically for large models.
  • For RLHF and online training loops where policy weights update continuously, this removes the I/O bottleneck that previously made frequent syncing impractical.
  • The pattern is Hub-bucket-agnostic — the same delta mechanism can plug into S3 or GCS, making it relevant beyond Hugging Face-hosted workflows.
Read full article

Harness, Scaffold, and the AI Agent Terms Worth Getting Right

Hugging Face Blog
  • The glossary distinguishes 'harness' (execution environment + tool dispatch) from 'scaffold' (prompt structure + control flow) — conflating them causes real architectural mistakes downstream.
  • Terms like 'ReAct', 'tool call', and 'multi-agent' have drifted in meaning across papers and frameworks; this reference pins definitions to canonical implementations.
  • Worth bookmarking for teams onboarding engineers onto agentic codebases — miscommunication around these terms wastes significant debugging time.
Read full article

sqlite AGENTS.md

Simon Willison
  • SQLite added an AGENTS.md file to its repo — not for their own AI-assisted development, but to guide agents that third-party developers point at the SQLite codebase.
  • The file explicitly states SQLite does not accept PRs without prior legal agreement, a direct attempt to prevent AI coding agents from auto-submitting contributions.
  • This is likely the first major open-source project to proactively document agent boundaries — expect the pattern to spread as agentic coding tools become standard.
Read full article

Microsoft Copilot Cowork Exfiltrates Files

Simon Willison
  • Microsoft Copilot Cowork was discovered allowing agents to send emails containing file contents to an attacker-controlled inbox via prompt injection — a classic indirect injection attack.
  • The product routed agentic actions through the user's own email session, giving injected instructions the same trust level as legitimate user commands.
  • This is the same attack class that has hit every major LLM-integrated office tool — email drafting + file access + external output = exfiltration surface, regardless of vendor.
Read full article

The Pressure: AI-Assisted Security Reports Are Overwhelming curl

Simon Willison
  • The curl project is receiving security reports at 4-5x the 2024 rate and 2x the 2025 rate, overwhelmingly from AI-assisted reporters who generate plausible-sounding but often low-quality findings.
  • Maintainer Daniel Stenberg describes the volume as unsustainable — triaging AI-generated reports now consumes more time than actual development.
  • The dual-use problem is concrete: AI makes security research accessible, but it floods maintainer queues with noise, slowing response to real vulnerabilities.
Read full article

Product & Industry

I Think Anthropic and OpenAI Have Found Product-Market Fit

Simon Willison
  • Anthropic is rumored to be approaching its first profitable quarter, driven by enterprise seat licenses where employees are racking up LLM spend without central IT oversight.
  • Simon Willison argues the signal is companies being 'surprised' by their LLM bills — that surprise indicates real organic adoption, not mandated top-down rollout.
  • The pattern mirrors early SaaS: individual teams adopt tools on expense accounts, usage compounds, and enterprises eventually formalize what's already entrenched.
Read full article