Algorithm & Blues
AI research translated into decisions executives can actually make. Published every Sunday since May 2025. No hype, no filler — one clear argument per issue.
Benchmarking is the de facto measure of LLM performance with leaderboards tracking reasoning scores, hallucination rates, etc. These metrics have been useful for comparing systems that respond to prom…
1 week ago • From Voluntary Principles to Enforceable Obligations
Anthropic’s latest research looks at telemetry from deployed agents, and the findings offer a grounded view of how autonomy is developing in day-to-day workflows.
Most agent benchmarks grade the choreography, i.e., whether the model called the right API, in the right order, with the right parameters.
Historically, we’ve treated prompt injection as a content problem… someone slips an instruction into a document or a ticket, but you clean the input, add a filter, maybe a classifier, and call it done…
With the advent of frameworks like MCP, it’s reasonable to assume that if an agent can access the right information, it will use it. Build the tool, package the skill, expose the API, and trust the mo…
Enterprises often deploy AI agents with basic monitoring, give them API access, watch the logs, and review incidents after they happen.
As organizations expand AI agent use for workflow automation, code generation, and discovery, they hit a fundamental bottleneck: the Generate-Execute-Feedback loop. Because agents are predictive rathe…
A couple of weeks ago OpenAI admitted something that has stuck with me; prompt injection is unlikely to ever be fully solved or eliminated.
One of the more subtle shifts happening with AI has little to do with models getting smarter and more to do with how work moves inside organizations. Increasingly, AI is being used less as a decision-…
A recent research effort consolidated more than a dozen AI risk frameworks into a unified mitigation taxonomy, shifting emphasis from abstract concerns to concrete mitigation mechanisms. Most of the r…
Conversations about AI tend to center on models… what they are, how powerful they’ve become, whether they’re explainable, whether they’re approved. That emphasis is understandable. It’s also increasin…
Forrester published an interesting piece this month identifying what they call three “functional planes” in enterprise agentic AI: one for building agents, one for orchestrating them inside business w…
With reasoning models, there’s a natural expectation that they should accurately outline the steps behind their answers. When those steps are visible, they appear reviewable and governable, much like…
We talk a lot about agentic AI as if autonomy alone defines the shift. In practice, agency requires structure before it delivers capability; these systems can produce actions, but without clear decisi…
A recent paper studies how decision quality and adoption vary depending on whether there is alignment between the system’s reasoning model and the user’s cognitive approach. (https://lnkd.in/eNbuSGWT)
A new study on generative AI governance highlights a pattern that many large organizations are working through. A multinational firm with fifty autonomous business units issued a unified set of AI pri…
LLMs solve graduate-level math… but they also fail at basic logic. A new paper from Oxford explains why this happens and suggests important factors for evaluating models in production.
Large reasoning models (LRMs) perform impressively on many benchmarks, but a new study using the Deep Reasoning Dataset (DeepRD) shows the illusion fades once problems grow more complex. The authors g…
A new paper warns that self-evolving LLM agents can “mis-evolve”, drifting into behaviors or architectures that diverge from their design goals. (arxiv.org/abs/2509.26354)
Current popular AI agent architectures do not support dynamic structural evolution after deployment. Agents execute tasks, adapt to their surroundings, and coordinate with other systems, but their und…
Ok, so this might be a tad controversial.. A new paper from DeepFlow on human–AI collaboration explores the concept of the Manager Agent—an autonomous system designed to coordinate workflows between…
There have been several recent studies that point to organizational culture as the determining factor in the successful use of AI.
There has been much talk about recent research from MIT revealing 95% of enterprises report no measurable profit impact from their AI deployments. After three years of “AI transformation” initiatives…
Benchmarks look impressive on slides in that they tell you what a model can do in a controlled environment. But once agents are running code pulling data, or moving tickets, the real test begins. The…
The pace of change for AI, perceived or otherwise, continues to accelerate. I recently read a piece claiming “vibe coding is dead” and that we’ve entered the era of multi-agent swarms coordinating lik…
A new study on socio-algorithmic norms caught my eye this week…
Most AI “guardrails” feel like the software equivalent of a baby gate: designed to keep trouble contained, but easily toppled with a determined push. The past year of jailbreaks and “DAN” prompts prov…
Enterprise automation has always carried a paradox: the more we automate, the more we need to explain.
🎸🎶 Algoithm & Blues Vol. 14: Can AI Govern AI? Exams Say…Maybe.
In early AI thought-experiments, an agent would observe its world, choose a path, and act with purpose. It was autonomy well defined—not too broad to be meaningless, not so narrow it couldn’t capture…
Most dialogue systems still treat user emotion as background noise. That’s a costly blind spot, especially in customer-facing scenarios. A new paper from Feng et al. (Emotionally Intelligent Task-orie…
A new paper from researchers across OpenAI, DeepMind, Anthropic, Meta, Amazon, the UK AI Safety Institute and others argues that because today’s reasoning models often “think out loud” in natural lang…
“Reasoning LLMs” promise to navigate multi-step solutions and complexity with fidelity. Apple’s new paper, “The Illusion of Thinking,” shows how quickly that breaks.
Most enterprise GenAI platforms still operate like actors stuck in a never-ending improv exercise: every new prompt resets the scene, and each response feels oddly familiar, yet decoupled from past pe…
Large language models (LLMs) are getting better at solving problems in ways that break the traditional AI playbook. A new study, “LLM-First Search: Self-Guided Exploration of the Solution Space” (arXi…
We’ve spent decades refining enterprise automation by scripting processes to the finest detail. Yet each rule written in stone creates another exception—a stranded transaction left waiting in someone’…
A recent paper, “Reasoning Like an Economist,” continues to upend the “bigger is better” LLM narrative.
Most financial services firms may be overlooking a more effective AI strategy by focusing on massive trillion-parameter models when smaller, specialized models designed specifically for financial appl…
Forget scaling models bigger—last week, researchers from UCSB suggested the future of AI might actually lie in evolutionary loops, not brute-force parameters. In their paper, “Agents of Change: Self-E…
One of the biggest challenges with today’s language models is that scaling usually requires starting fresh—burning through massive compute budgets just to increase model size or improve performance. B…
Transformers can quote the whole internet—but they still forget what you told them two screens ago. Hopfield 2.0 might finally give them a ‘hippocampus’.
(This is the inaugural post in a new Sunday series that will dive deep into the AI-nerdverse, enter at your own risk 🙃)