Introducing Claude Sonnet 4.5
Anthropic has released Claude Sonnet 4.5, a new frontier model with major improvements in coding, computer use, reasoning, and math. The model has state-of-the-art performance on benchmarks like SWE-bench Verified and OSWorld, which measure real-world software coding and computer-task abilities. Alongside the model, Anthropic released Claude Code 2.0 and added new features to the Claude apps. -
Give your AI eyes: Introducing Chrome DevTools MCP
The Chrome DevTools MCP is a new tool that allows AI coding assistants to see and interact with a live Chrome browser through the Model Context Protocol, giving AI “eyes” to observe and debug web applications in real time. The tool acts as a bridge between AI agents (like Cursor, Claude, and Gemini) and Chrome’s DevTools capabilities, letting them navigate pages, inspect DOM elements, analyze performance, simulate user interactions, and debug issues based on actual browser feedback rather than assumptions. -
Google DeepMind unveils its first “thinking” robotics AI
Google DeepMind’s Gemini Robotics project has announced a pair of new models that work together to create the first robots that ‘think’ before acting. -
How AWS S3 serves 1 petabyte per second on top of slow HDDs
AWS S3 is a scalable multi-tenant storage service with APIs to store and retrieve objects. It offers extremely high availability and durability at a relatively low cost. What started as a service optimized for backups and media storage for e-commerce websites has grown into the main storage system used for analytics and machine learning on massive data lakes. A growing trend is for entire data infrastructures to be built on S3, which gives them the benefit of stateless nodes. -
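The headline figure of 1 PB/s only makes sense through massive parallelism across many slow drives. A back-of-envelope sketch, assuming roughly 150 MB/s of sequential read throughput per HDD (an illustrative figure, not from the article):

```python
# How many HDDs must be read concurrently to serve 1 PB/s,
# assuming ~150 MB/s sequential throughput per drive?
TARGET_BYTES_PER_S = 10**15       # 1 petabyte per second
HDD_BYTES_PER_S = 150 * 10**6     # assumed per-drive streaming rate

drives_needed = TARGET_BYTES_PER_S / HDD_BYTES_PER_S
print(f"{drives_needed:,.0f} drives reading concurrently")  # ~6.7 million
```

The point of the estimate is that no single drive matters: spreading each object across millions of spindles turns many slow devices into one very fast aggregate.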
Qwen3-Omni (GitHub Repo)
Qwen3-Omni is a newly released, multilingual, omni-modal foundation model by Alibaba Cloud’s Qwen team capable of processing and generating text, images, audio, and video in real time, with state-of-the-art performance across modalities. -
Open SWE (GitHub Repo)
Open SWE is an open-source, cloud-based coding agent built with LangGraph that autonomously understands codebases and implements solutions, from planning to pull requests. It has human-in-the-loop feedback, parallel task execution, and automatic issue/pull request management. Users can initiate tasks via a UI or directly from GitHub issues with specific labels. -
The art of prototyping
Just because AI lets anyone spin up an app in minutes doesn’t mean they understand prototyping - real prototyping is about asking the right questions, not just building things. The best prototypes are disposable tools designed to answer one specific question: does this solve a real problem (Role), does it feel right to use (Look and Feel), or can we actually build it (Implementation)? Pick one lens, build just enough to learn something, then throw it away. -
Meta’s Open LLM for Code and World Modeling
Meta’s CWM is a 32B decoder-only LLM trained on code execution traces and reasoning tasks to explore world models in code generation. -
How is it possible that Claude Sonnet 4.5 is able to work for 30 hours to build an app like Slack?
Claude Sonnet 4.5’s system prompt reveals how it is able to work for 30 hours. This post goes into detail about how it works. A copy of the system prompt is available in the thread. -
DeepSeek-V3.1-Terminus launches with improved agentic tool use and reduced language mixing errors
DeepSeek released V3.1-Terminus, which strengthens its Code Agent and Search Agent and addresses users’ feedback about Chinese/English mix-ups. The model shows benchmark gains in agentic tool tasks (BrowseComp, SWE Verified, etc.) and offers both “chat” and “reasoner” modes for different use cases. Terminus remains open-weight under an MIT license, making it viable for customization and deployment both via API and local environments. -
Abundant Intelligence
OpenAI wants to create a factory that can produce a gigawatt of new AI infrastructure every week. The project will take several years and require innovation at every level of the stack. A lot of the infrastructure will be built in the US. The company will release more details about its plans and partners over the next couple of months. -
Grok 4 Training Resource Footprint
Grok 4 is the largest (known) training run to date. Researchers estimate it cost $490 million, required enough electricity to support a 4,000-person town for a year, and had a carbon footprint roughly equivalent to annual emissions from 3 airplanes. -
Tool Calls Are Expensive And Finite
Tool calling is many orders of magnitude more costly than calling a plain old function from code. Agentic systems should be designed around the limited number of tool calls an agent can effectively make. Using a tool call to add two numbers once probably doesn’t matter, but scaling the problem up to 1,000 numbers will require a long wait and may exceed context window limits. Calling a function many times in a loop is one of the most common ways to solve a problem with code. -
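The 1,000-number example can be made concrete. A minimal sketch, where the per-call latency is an assumed illustrative figure, not a measurement:

```python
# One in-process function call sums 1,000 numbers instantly; doing the
# same work via per-addition tool calls would need 999 model round-trips.
nums = list(range(1, 1001))

total = sum(nums)              # a single plain function call

TOOL_CALLS = len(nums) - 1     # one add(a, b) tool call per pairwise sum
LATENCY_S = 2.0                # assumed seconds per LLM tool round-trip
wait = TOOL_CALLS * LATENCY_S
print(total, wait)             # 500500, and ~33 minutes of waiting
```

The loop costs microseconds in code; the same computation expressed as tool calls costs minutes of latency plus a context window full of call transcripts.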
OpenAI upgrades Codex with a new version of GPT-5
OpenAI’s new model, GPT-5-Codex, spends its ‘thinking’ time more dynamically than previous models. It can spend up to seven hours on a coding task. The model is rolling out now in Codex products to all ChatGPT Plus, Pro, Business, Edu, and Enterprise users. It will be made available to API customers in the future. -
The second wave of MCP: Building for LLMs, not developers
Teams that shift from API-shaped tools to workflow-shaped tools see meaningful improvements in reliability and efficiency. MCP works best when tools handle complete user intentions rather than exposing individual API operations. Large language models don’t work like developers - they have to constantly rediscover which tools exist, how to use them, and in what order, so building tools around workflows produces better results. -
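The contrast can be sketched in plain Python (not a real MCP SDK; all function names are hypothetical): an API-shaped surface forces the model to discover and chain three calls in the right order, while a workflow-shaped tool captures the whole intention in one call.

```python
# API-shaped: mirrors individual backend endpoints; the model must
# figure out the three-step chain itself on every conversation.
def find_customer(email):
    return {"id": 42, "email": email}

def list_orders(customer_id):
    return [{"order_id": 7, "status": "shipped"}]

def get_tracking(order_id):
    return {"order_id": order_id, "eta": "2025-10-02"}

# Workflow-shaped: one tool, one complete user intention
# ("where is my latest order?"), with the chaining done in code.
def track_latest_order(email):
    customer = find_customer(email)
    latest = list_orders(customer["id"])[-1]
    return get_tracking(latest["order_id"])

print(track_latest_order("a@example.com"))
```

The workflow-shaped version moves the orchestration out of the model’s context and into ordinary code, which is exactly where it is cheap and deterministic.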
Qwen3-Next
Qwen3-Next is a new model architecture featuring a sparse Mixture-of-Experts design with hybrid attention and multi-token prediction. Its 80B-parameter base model activates only 3B parameters at inference, enabling 10x faster throughput for long-context tasks. -
Nvidia unveils new GPU designed for long-context inference
Nvidia has announced a new GPU called the Rubin CPX designed for context windows larger than 1 million tokens. The GPU, meant to be used as part of a broader ‘disaggregated inference’ infrastructure approach, is optimized for the processing of large sequences of context. It performs better on long-context tasks like video generation and software development. The Rubin CPX will be available at the end of 2026. -
Why do we take LLMs seriously as a potential source of biorisk?
Anthropic details why it implemented ASL-3 safety protections against biological weapons development, noting Claude now exceeds expert performance on virology evaluations, and controlled trials showed significant uplift with Opus 4 for acquiring bioweapons compared to using just the open internet. It is a low-probability risk, but the impact is so high - a single incident could mean thousands of deaths - that it still warrants serious evaluation and restriction. -
The summer of vibe coding is over
Companies like Anysphere face soaring LLM inference costs, forcing price hikes and prompting exit strategies like reverse acqui-hires. The coding AI market’s rapid growth, driven by reasoning models, is now pressured by expensive compute, leading vendors to adopt usage-based pricing. -
OpenAI for Science
OpenAI for Science is a project within OpenAI aimed at building an AI-powered platform to accelerate scientific discovery. The project intends to prove that AI models are ready to advance fundamental science and research all over the world. This thread shares several examples of how GPT-5 has already helped scientists. More about OpenAI for Science will be unveiled in the coming months. -
Parallel AI Agents Are a Game Changer
Parallel agents are able to work on multiple problems simultaneously rather than sequentially, making a real difference in productivity.
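A minimal sketch of the parallel-vs-sequential difference using Python’s asyncio, with each “agent” simulated by a sleep standing in for a long-running task (the task names and timings are illustrative assumptions):

```python
import asyncio

async def agent(task, seconds):
    # Stand-in for a long-running agent; sleep simulates the work.
    await asyncio.sleep(seconds)
    return f"{task} done"

async def main():
    # Three tasks run concurrently: wall time is roughly the slowest
    # single task, not the sum of all three.
    return await asyncio.gather(
        agent("refactor", 0.3),
        agent("write tests", 0.2),
        agent("update docs", 0.1),
    )

print(asyncio.run(main()))
# ['refactor done', 'write tests done', 'update docs done']
```

Run sequentially, the three tasks would take the sum of their durations; run in parallel, the wall-clock time collapses to the longest one, which is the productivity gain the item describes.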
This Month in Tech: September 2025
TLDR of the TLDR: September 2025 in Tech