ALWAYS UP-TO-DATE GUIDE

Release Notes

Guide update history - new content added regularly

Last updated: May 27, 2026
2026

May 2026

May 27, 2026

Major Restructure: New Chapter 6 dedicated to AWS + renumber 6-15 → 7-16

MAJOR

Structural reorganization of the guide to separate cloud-specific content from provider-agnostic content. Chapter 5 (Terraform) returns to being pure Terraform, and all AWS-specific content that was spread across 3.15 (AWS MCP), 5.15 (CFN/CDK), 12.6 (Bedrock), and 12.7 (Least-Privilege IAM) is now consolidated in a new dedicated Chapter 6: "AWS with AI: Deep Dive". Section 5.9.5 (Hard Guardrail IAM) stays in the Terraform chapter and also appears duplicated as 6.2 in the AWS context. Previous chapters 6-15 have been renumbered to 7-16. Old links (chapter-5.html#5.15, chapter-12.html#12.6, etc.) continue to work via JS redirect-stubs. This restructure prepares the ground for future dedicated chapters on GCP and Azure.

+ New Chapter 6, "AWS with AI: Deep Dive", with 6 sections: 6.1 AWS MCP Server (full deep dive, previously in 3.15), 6.2 Hard Guardrail Plan-Only IAM (duplicated from 5.9.5), 6.3 CloudFormation and CDK with AWS MCP (moved from 5.15), 6.4 Claude Code on Bedrock (moved from 12.6), 6.5 Least-Privilege IAM for AWS MCP (moved from 12.7), 6.6 Closing + roadmap for GCP/Azure
+ Renumbered chapters 6-15 → 7-16: Kubernetes (6→7), CI/CD (7→8), Observability (8→9), Container Security (9→10), FinOps (10→11), Runbook RAG (11→12), Guardrails (12→13), GitOps (13→14), Governance (14→15), Conclusion (15→16). All ~80 cross-references in prose and code updated via migration script
+ JS redirect-stubs at old positions (#5.15, #12.6 [now chapter-13.html#13.6], #12.7 [now chapter-13.html#13.7]): keep the anchor valid but auto-redirect to the new position in Ch. 6, avoiding 404s on previously shared links
+ Section 3.15 reduced to intro + install command + forward-ref to Ch. 6.1. The MCP catalog stays in Ch. 3, but the AWS-specific deep dive lives in the dedicated chapter
+ Section 5.9.5 gains a bidirectional cross-link: blue callout at the top of the section in Ch. 5 pointing to its duplicate at Ch. 6.2, making the intentional duplication explicit and reinforcing the Hard Guardrail's importance in both disciplines
+ Sidebar, TOC, landing, and release notes updated with the new topology: new AWS Deep Dive card in orange, renumbered chapter badges, AWS icon in the side menu
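
The renumbering rule described in this entry can be sketched in a few lines of Python. This is an illustration only, not the guide's actual migration script: the `MOVED` mapping is taken from the bullets above, while the regex, the function name, and the assumption that references always appear as "section N.M" in prose are hypothetical.

```python
import re

# Sections that moved into the new Chapter 6 (mapping taken from the notes above).
MOVED = {"5.15": "6.3", "12.6": "6.4", "12.7": "6.5"}

# Hypothetical reference format: the word "section" followed by N.M or N.M.K...
SECTION_REF = re.compile(r"\b([Ss]ection\s+)(\d+)\.(\d+)((?:\.\d+)*)")


def renumber(text: str) -> str:
    """Rewrite section references to the post-restructure numbering."""

    def fix(match: re.Match) -> str:
        prefix, rest = match.group(1), match.group(4)
        chapter, section = int(match.group(2)), match.group(3)
        key = f"{chapter}.{section}"
        if key in MOVED:
            # Moved section: point at its new home in Chapter 6.
            return f"{prefix}{MOVED[key]}{rest}"
        if 6 <= chapter <= 15:
            # Old chapters 6-15 shift up by one.
            return f"{prefix}{chapter + 1}.{section}{rest}"
        return match.group(0)  # Chapters 1-5 are untouched.

    # re.sub makes a single pass, so a rewritten "6.x" is never shifted again.
    return SECTION_REF.sub(fix, text)
```

Requiring the "section" prefix keeps version strings such as "Opus 4.7" or "v6.33" out of the rewrite, at the cost of missing bare references.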
May 25, 2026

Chapter 5: New Section 5.17, From Monolith to Composable (Refactoring CLAUDE.md into Skills and Subagents)

MAJOR

A brand new section that closes the pedagogical arc of chapter 5. After the reader builds the monolithic CLAUDE.md in 5.8 (Step 8, ~530 lines with 13 architectural patterns) and learns subagents + skills in 5.13, section 5.17 shows how to refactor the monolith into a composed system: a slim ~80-line CLAUDE.md (identity + safety + tooling + pointers) plus 3 subagents (terraform-architect, terraform-cost-reviewer, terraform-security-reviewer) and 5 skills (tf-scaffold-stack, tf-variables-review, tf-naming-review, tf-cross-stack, tf-outputs-review). Includes diagnosis with an always-on vs on-demand table, the partition rule, a walkthrough migrating 3 concrete patterns, the final directory structure, and when NOT to refactor. Connects with section 5.9.5 (Hard Guardrail), 5.13 (Subagents and Skills), chapter 6 (same principle), and chapter 15 (book-level pattern). Step 8 of section 5.8 gains a forward-reference pointing to 5.17.

+ Full new section 5.17 with 6 subsections: 5.17.1 Diagnosis (table of the cost of loading on-demand content as always-on across 70% of the current CLAUDE.md), 5.17.2 The Partition Rule (always-on vs on-demand, and skill vs subagent), 5.17.3 Walkthrough migrating 3 patterns (Variable Pattern → skill, New Stack Checklist → skill+subagent, Cost Analysis Workflow → subagent), 5.17.4 The final ~80-line slim CLAUDE.md shown in full, 5.17.5 Final directory structure, 5.17.6 When NOT to refactor (low-discoverability and attention-dilution signals, when to migrate)
+ Forward-reference in Step 8 of section 5.8: a blue callout explaining that the monolithic CLAUDE.md is intentional for learning purposes, with a direct link to section 5.17 where the reader learns to refactor after seeing subagents (5.13) and agent teams (5.14)
+ New public template terraform-architect-pack/ in devopsai-templates: a composed pack with a README, the slim CLAUDE.md (~80 lines), 3 subagents under .claude/agents/, and 5 skills under .claude/skills/. Coexists with the monolithic CLAUDE-terraform-architect.md template; each serves a different project maturity stage
+ Update on the monolithic CLAUDE-terraform-architect.md: a blockquote at the top distinguishing "monolithic" from "composed" and pointing to the new pack as the natural evolution
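
To make the composed layout concrete, the sketch below lists the files such a pack would contain. The subagent and skill names come from this entry; the exact file names (`README.md`, one `.md` file per subagent, a `SKILL.md` inside each skill folder) are assumptions based on common Claude Code conventions and may differ from the published template.

```python
from pathlib import PurePosixPath

# Names taken from the release note; file naming below is an assumption.
SUBAGENTS = [
    "terraform-architect",
    "terraform-cost-reviewer",
    "terraform-security-reviewer",
]
SKILLS = [
    "tf-scaffold-stack",
    "tf-variables-review",
    "tf-naming-review",
    "tf-cross-stack",
    "tf-outputs-review",
]


def pack_layout(root: str = "terraform-architect-pack") -> list[str]:
    """Return the relative paths of a composed pack: slim CLAUDE.md + agents + skills."""
    base = PurePosixPath(root)
    paths = [base / "README.md", base / "CLAUDE.md"]
    # One markdown definition per subagent (assumed naming).
    paths += [base / ".claude" / "agents" / f"{name}.md" for name in SUBAGENTS]
    # One folder per skill holding a SKILL.md (assumed naming).
    paths += [base / ".claude" / "skills" / name / "SKILL.md" for name in SKILLS]
    return [str(p) for p in paths]
```

Printing `"\n".join(pack_layout())` gives a quick checklist to compare against a real checkout of the template.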
May 22, 2026

Chapter 5 — Production-Grade CLAUDE.md, Hard Guardrail IAM, and Infracost via plugin

MAJOR

Three substantive additions to chapter 5 in this release. Section 5.8 got a new Step 8 dedicated to a Production-Grade CLAUDE.md, with excerpts of Tooling Strategy and Stack Architecture, plus a callout in Step 2 distinguishing the minimal template (good for initial testing) from the production-grade template (canonical reference for real projects). Section 5.9.5 is an entirely new addition focused on actual defense-in-depth: a Hard Guardrail Plan-Only IAM Role for production. And section 5.10 (Infracost integration) stopped recommending manual CLI installation and now uses the official Claude Code plugin, which removes the PATH wiring and separate authentication steps.

+ New subsection 5.8 Step 8 — Production-Grade CLAUDE.md: excerpts of Tooling Strategy (matrix of which agent commands consult Context7 before writing HCL, module naming patterns, validation rules) and Stack Architecture (numbered stacks, the assume_role + default_tags workspace-aware pattern, native S3 backend without DynamoDB lock). A callout in Step 2 makes it clear that the walkthrough's initial template is minimal for testing and points to Step 8 when the project evolves toward production
+ New subsection 5.9.5 — Hard Guardrail Plan-Only IAM Role for Production: a table of the 5 defense-in-depth layers (CLAUDE.md soft, Hook PreToolUse soft harness, IAM Plan-Only hard AWS, Permission Boundary hard AWS, SCP Organizations hard org), an IAM policy JSON ready to paste with explicit Deny for all destructive actions Terraform would need to apply (Create, Delete, Modify, Put, Update on RDS, EC2, S3, DynamoDB, IAM, and more), and workspace-aware role assumption inside the provider block so the plan-only role is used in production and the full-access role in dev/staging
+ Section 5.10 (Infracost integration) updated: the recommendation moved from a manual CLI (brew install infracost/tap/infracost or curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh + infracost auth login) to the official Claude Code plugin (claude plugin marketplace add infracost/agent-skills + claude plugin install infracost@infracost). The flow now uses native slash commands: /infracost:breakdown, /infracost:fix-tags, /infracost:optimize. No more PATH wiring or separate authentication
+ New cross-references to section 13.7: the tiered IAM strategy for AWS MCP described in section 13.7 (sandbox dev / production read-only / production write-controlled) now has direct application to the Terraform case via section 5.9.5. The reader sees the same principle applied at two different entry points and understands the pattern is canonical, not specific to a single tool
+ New public template in devopsai-templates: CLAUDE-terraform-architect.md: 450+ lines with a Senior Infrastructure Architect agent definition plus 13 architectural patterns (numbered stacks, native S3 backend, assume_role + default_tags workspace-aware, the variables-in-objects pattern, resource label naming conventions, count for lists, Name-only tags, kebab-case file organization, splat outputs, cross-stack state, new-stack checklist, plus the Plan-Only IAM Role for production and Infracost via plugin). The template is versioned at github.com/filipemotta/devopsai-templates and ready to copy as a CLAUDE.md in any Terraform project
+ Canonical defense-in-depth pattern established: CLAUDE.md (soft, instructs the model) + Hook PreToolUse (soft, client-side harness) + IAM Plan-Only (hard, AWS contract) + Permission Boundary (hard, AWS contract) + SCP Organizations (hard, organizational contract). Each layer addresses a different failure mode. Once the reader internalizes this pattern, guardrails stop being treated as a single decision and start being designed as layered security architectures
+ Subnav-ch5 updated: section 5.9 was wrapped in a <details> block in the product sidebar to host the new 5.9.5 Hard Guardrail IAM Plan-Only sub-entry. Readers walking through the chapter clearly see there is extra depth inside the Workspace Safety topic
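
As a rough illustration of the explicit-Deny idea behind the plan-only role, the following Python sketch generates a policy document from the verbs and services named in this entry. It is not the policy from section 5.9.5: the `Sid`, the function names, and the use of verb wildcards are assumptions, and the verb list is not exhaustive (for example, `ec2:TerminateInstances` and `ec2:RunInstances` do not start with any of these verbs and would need to be listed separately).

```python
import json

# Verbs and services named in the release note; real policies need a fuller list.
DENIED_VERBS = ["Create", "Delete", "Modify", "Put", "Update"]
SERVICES = ["rds", "ec2", "s3", "dynamodb", "iam"]


def plan_only_deny_statement() -> dict:
    """Build one explicit-Deny statement covering mutating verbs per service."""
    actions = [f"{svc}:{verb}*" for svc in SERVICES for verb in DENIED_VERBS]
    return {
        "Sid": "DenyMutatingActions",  # hypothetical statement id
        "Effect": "Deny",
        "Action": actions,
        "Resource": "*",
    }


def plan_only_policy() -> dict:
    """Wrap the Deny statement in a standard IAM policy document."""
    return {"Version": "2012-10-17", "Statement": [plan_only_deny_statement()]}


if __name__ == "__main__":
    print(json.dumps(plan_only_policy(), indent=2))
```

A Deny-only document grants nothing by itself; the role would still need read permissions from a separate Allow policy so that `terraform plan` can refresh state.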
May 21, 2026

Chapter 5 restructured — MCP decision, from-zero walkthrough, explicit multi-cloud

MAJOR

Substantive restructure of three sections of chapter 5 to fix an architectural confusion. The chapter now establishes clearly that Terraform is multi-cloud by design (provider-agnostic) and offers a concrete decision tree for the question that keeps showing up in real teams: "which MCPs do I install to get started?". Section 5.8 stopped being an AWS-specific case and became an evolutionary walkthrough applicable to any cloud, starting from mkdir and ending at the second module reusing the accumulated context.

+ Section 5.3 rewritten: "MCP Applied to Terraform — Choosing Your Architecture". It now presents a decision tree with three architectural paths (pure Terraform MCP / AWS MCP Server / Hybrid) with clear criteria for when to use each. Emphasizes that Terraform supports AWS, GCP, Azure, Cloudflare, and dozens of other providers, so the MCP choice depends on the team profile, not on the tool
+ Section 5.5 rewritten: "Installing the Right MCPs". Installing the Terraform MCP is the default for any multi-cloud team, and the AWS MCP Server (introduced in 3.15.1) comes in as an optional add-on for AWS-only shops that want call_aws. Step 3 of the section makes the explicit cross-reference to 3.15.1
+ Section 5.8 transformed: from "Practical Case AWS" to "Practical Walkthrough: From Zero to First Module (Generic, Cloud-Agnostic)". It is a step-by-step evolutionary path applicable to any provider (AWS, GCP, Azure, Cloudflare, etc.). It starts with mkdir and git init, then asks Claude to generate the repository CLAUDE.md, writes the first module with validate and plan, runs apply in an isolated sandbox, and ends with a second module reusing the context Claude has already built up in the project
+ Clear pointer for when to migrate to Spec-Driven: the 5.8 walkthrough ends by pointing to section 5.16 (Spec-Driven Development applied to Terraform). The reader knows when the manual flow hits its limit and the SDD framework starts paying off
+ For AWS-only shops: direct guidance for when the AWS MCP Server (section 3.15.1) covers your needs without needing to install the Terraform MCP separately. For multi-cloud shops or shops using Terraform as the common language, the Terraform MCP is the right investment
+ Filesystem MCP removed as a recommendation: the previous recommendation of @modelcontextprotocol/server-filesystem is gone from 5.5 (it was redundant in Claude Code, which already has native Read, Write, Edit, and Bash). In its place, an explanatory callout about when that MCP still makes sense (Claude Desktop, Cursor without equivalent native tools, headless SDK integrations in environments where the filesystem is not exposed)
+ Section 5.12 cleanup (troubleshooting): the checklist now verifies the Terraform MCP installation instead of the filesystem MCP, aligning the troubleshooting flow with the new default architectural decision
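
One possible reading of the three-path decision (pure Terraform MCP / AWS MCP Server / Hybrid) is sketched below. The criteria are an interpretation of the bullets above, not the decision tree from section 5.3, and the function name, inputs, and return labels are hypothetical.

```python
def choose_mcp_setup(providers: set[str], wants_call_aws: bool) -> str:
    """Pick an MCP setup from the team's providers and its need for call_aws.

    Assumed rules, paraphrased from the release note:
    - AWS-only shop that wants call_aws -> AWS MCP Server alone may be enough.
    - Multi-cloud team that also wants call_aws -> install both (hybrid).
    - Everyone else -> Terraform MCP, the default for multi-cloud teams.
    """
    aws_only = providers == {"aws"}
    if aws_only and wants_call_aws:
        return "aws-mcp-server"
    if "aws" in providers and wants_call_aws:
        return "hybrid"
    return "terraform-mcp"
```

For example, `choose_mcp_setup({"aws", "gcp"}, wants_call_aws=True)` returns `"hybrid"` under these assumed rules.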
May 18, 2026

Section 13.7 — AWS MCP Server: Least-Privilege IAM and Operational Guardrails

MAJOR

Critical security addition to chapter 13. Ever since the AWS MCP Server hit GA (May 2026, section 3.15), teams across the world have been attaching AdministratorAccess to the role used by Claude Code just to make the agent "work". This section is blunt: AdministratorAccess + call_aws is a production outage waiting to happen. The agent inherits the user's IAM permissions and no automatic sandbox exists for MCP tools. Section 13.7 delivers the full playbook to avoid this, in seven practical subsections with ready-to-paste policies.

+ 13.7.1 — The Blast Radius Problem: why call_aws inherits the user's IAM, why no automatic sandbox exists for MCP tools, and what happens when a misinterpreted prompt meets overly broad permissions (delete on production DynamoDB, terminate of critical EC2 instances, drop of RDS tables)
+ 13.7.2 — Three Deployment Tiers: official matrix for configuring the agent at each stage. Tier 1 development sandbox (isolated account, synthetic data). Tier 2 production read-only (investigation, troubleshooting, FinOps). Tier 3 production write-controlled (surgical subset of mutating actions with runtime guardrails). When to promote and when to stay put
+ 13.7.3 — Ready-to-Use IAM Templates: three complete JSON policies ready to paste. Policy A: dev sandbox (broad inside an isolated account). Policy B: production read-only (Describe*, List*, Get* only). Policy C: production write-controlled (explicit deny for destructive actions like iam:Delete*, ec2:TerminateInstances, rds:Delete*, dynamodb:DeleteTable, s3:DeleteBucket)
+ 13.7.4 — Defense in Depth (Five Layers): the IAM policy is only layer 1. The other four: permission boundary attached to the agent role, SCP at the AWS Organization level so that not even a compromised root escapes the contract, CloudTrail for forensic logging, Bedrock Guardrails for prompt and response filtering. Each layer covers a different failure mode
+ 13.7.5 — Agent vs Human via IAM Condition Keys: how to differentiate the agent from the human inside the same principal using aws:PrincipalTag, aws:RequestTag and sts:RoleSessionName. Concrete example: write in production only allowed when aws:PrincipalTag/AgentMode=false. The same role blocks mutation when the session is the agent and allows it when the human is logged in with MFA
+ 13.7.6 — CloudTrail and Real-Time Alarms: EventBridge rule capturing eventName matching a list of destructive actions (TerminateInstances, DeleteDBInstance, DeleteTable, etc.) and userIdentity.sessionContext.sessionIssuer.userName matching the agent role. Immediate notification on Slack via SNS + Lambda, with a direct link to the event in CloudTrail. Reaction in seconds, not hours
+ 13.7.7 — Anti-Destructive Hook Inside Claude Code Itself: PreToolUse hook in .claude/settings.json that intercepts call_aws invocations and blocks destructive patterns before they leave the machine. Validation script that inspects the payload (action name + ARN) and returns exit 2 if the action is on the blocklist. Last line of defense, executed client-side
+ Red callout in section 3.15.1: direct pointer from the AWS MCP Server introduction to section 13.7, with the sentence "before deploying call_aws in any account with real data, read the least-privilege IAM playbook in section 13.7". Explicit connection between the technical entry point and the security playbook
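
The client-side check from 13.7.7 could look roughly like the sketch below. It assumes the hook receives a JSON object with `tool_name` and `tool_input` fields and that a return code of 2 blocks the call, as the bullet describes. The blocklist entries come from this entry; the `cli_command` field used in the usage example and the `mcp__aws__call_aws` tool name are hypothetical, since the real payload shape of call_aws is not shown here.

```python
import json

# Destructive actions named in section 13.7's bullets; extend for real use.
BLOCKLIST = ["TerminateInstances", "DeleteDBInstance", "DeleteTable", "DeleteBucket"]


def _normalize(text: str) -> str:
    # Make "terminate-instances" and "TerminateInstances" compare equal.
    return text.lower().replace("-", "").replace("_", "")


def decide(hook_input_json: str) -> int:
    """Return 2 (block) for destructive call_aws payloads, 0 (allow) otherwise."""
    event = json.loads(hook_input_json)
    if not event.get("tool_name", "").endswith("call_aws"):
        return 0  # Not the AWS tool: out of scope for this hook.
    # Scan the whole tool input so the check does not depend on field names.
    payload = _normalize(json.dumps(event.get("tool_input", {})))
    for action in BLOCKLIST:
        if _normalize(action) in payload:
            return 2
    return 0


# Wire-up for a real hook script (left as a comment so the sketch is import-safe):
#   import sys; sys.exit(decide(sys.stdin.read()))
```

Substring matching errs on the side of blocking: `DeleteBucket` also catches `delete-bucket-policy`, which is usually acceptable for a last line of defense but worth knowing about.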
May 18, 2026

Section 4.17 — Headless Claude Code: From CLI to SDK to Embedded Orchestrators

MAJOR

New senior-level section (~2,300 words) that closes a critical gap: until now the guide treated Claude Code as an interactive IDE tool. But in DevOps, most of the value shows up when it runs without a terminal — inside GitHub Actions, Lambdas reacting to CloudWatch Alarms, or external orchestrators like Hermes. Section 4.17 maps the three official Anthropic paths to run Claude Code non-interactively, shows three real production patterns, teaches CI authentication without friction, and establishes mandatory cost guardrails. Includes a critical heads-up: starting June 15, 2026, the Agent SDK and claude -p on subscription plans (Pro, Max, Team) will draw from a separate credit pool from interactive sessions — anyone running orchestrators in production needs to plan ahead.

+ 4.17.1 — The "Claude Code as Library" Pattern: why moving the agent from the IDE to the pipeline is what separates pilots from production. Difference between interactive mode (REPL), headless CLI, and embedded SDK
+ 4.17.2 — The CLI Route (claude -p): Print Mode with --output-format stream-json, --allowedTools, --max-turns, --permission-mode acceptEdits. Concrete PR review example via cat diff.patch | claude -p in a 12-line workflow
+ 4.17.3 — The SDK Route (Agent SDK): claude-agent-sdk (Python) and @anthropic-ai/claude-agent-sdk (TypeScript) packages. query() function, event streaming via async iterator, options allowed_tools, max_turns, system_prompt, mcp_servers. When to choose SDK vs CLI
+ 4.17.4 — Three Orchestrator Patterns (real world): (1) GitHub Actions with automated PR review via anthropics/claude-code-action; (2) AWS Lambda event-driven reacting to a CloudWatch Alarm — initial investigation with AWS MCP Server and Slack posting; (3) Hermes Agent — open-source external orchestrator with persistent memory, multi-project routing by capabilities
+ 4.17.5 — CI Authentication + Cost Guardrails: claude setup-token (long-lived OAuth token, 1 year) for subscription plans; ANTHROPIC_API_KEY for direct API; AWS Bedrock + Vertex AI for enterprise deployment. Mandatory cost guardrails: --max-budget-usd, --max-turns, --allowedTools as a whitelist. Heads-up about June 15, 2026 separate credit pool
+ 4.17.6 — When NOT to Use Headless: six honest anti-patterns — initial exploration, cross-cutting refactor, tasks requiring active human approval, huge contexts (>200k tokens), critical production without guardrails, brand-new code from scratch. Section closes reminding the reader that headless is orchestration, not autonomy
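
For the CLI route, a small helper that assembles the headless command can keep the guardrail flags in one place. The sketch below only builds the argument list and does not invoke Claude Code. The flags are the ones named in 4.17.2; the function name and defaults are hypothetical, and the default output format is `json` rather than `stream-json` because streaming output may need additional flags (check `claude --help` for your version).

```python
def build_headless_command(
    prompt: str,
    allowed_tools: list[str],
    max_turns: int = 5,
    output_format: str = "json",
) -> list[str]:
    """Assemble an argv list for a non-interactive `claude -p` run."""
    return [
        "claude",
        "-p", prompt,                               # print (headless) mode
        "--output-format", output_format,
        "--allowedTools", ",".join(allowed_tools),  # explicit tool allowlist
        "--max-turns", str(max_turns),              # bound the agentic loop
    ]
```

In a CI job the list would typically be passed to `subprocess.run(cmd, input=diff_text, text=True, capture_output=True)`, which mirrors the `cat diff.patch | claude -p` pattern from the section.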
May 14, 2026

AWS MCP Server (GA, May 2026) — The New Canonical Entry Point + Section 3.15 Restructure

MAJOR

Between May 6 and 9, 2026, AWS and Anthropic announced the general availability of the AWS MCP Server, an official, free-of-charge MCP server that becomes the new canonical entry point for ~80% of AWS interactions from Claude Code. The guide was restructured end-to-end: section 3.15 was split into four subsections, section 5.15 gained a new decision section, and chapters 7 and 9 received repositioning callouts. The central philosophy is augment, not replace: the awslabs specialized servers remain valuable for task-oriented workflows.

+ 4 powerful tools: call_aws (access to 15,000+ AWS APIs via Python SDK), run_script (sandboxed Python executed server-side for multi-step orchestration), search_documentation and read_documentation (live AWS docs, so the model's knowledge cutoff stops being a limit)
+ Section 3.15 restructured into 4 subsections: 3.15.1 AWS MCP Server (GA) as canonical entry point • 3.15.2 awslabs ecosystem as specialized servers • 3.15.3 decision tree to choose between them • 3.15.4 new patterns unlocked (knowledge-cutoff bypass + run_script orchestration)
+ Section 5.15.6 (new): "When to use call_aws vs specialized awslabs servers" — IaC decision table + side-by-side example of S3 operations (raw call_aws vs s3-tables-mcp-server) showing real trade-offs
+ Callouts in chapter 7 and chapter 9: existing sections (eks-mcp-server, CloudWatch, Cost Explorer) now position the awslabs servers as specialized tools that complement, rather than compete with, the canonical entry point
+ New teaching examples: knowledge-cutoff bypass via read_documentation while exploring S3 Vectors (announced post-model-cutoff) + run_script orchestrating orphan EBS snapshot cleanup server-side in a single call
+ Free of charge, IAM-controlled: the MCP server itself costs nothing, you pay only for the underlying AWS resources, and permissions are controlled by the user/CI IAM role. Part of the AWS Agent Toolkit with plugins aws-core, aws-agents, aws-data-analytics
+ "Augment, not replace" stance: we keep the awslabs ecosystem (terraform-mcp, eks-mcp, cloudwatch, cost-explorer) as reference for task-oriented flows. The AWS MCP Server becomes the default for 80% of cases; the rest stays with specialized servers
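
The orphan EBS snapshot example mentioned above boils down to a simple selection step. The sketch below shows only that step on plain data, under the assumption that "orphan" means a snapshot whose source volume no longer exists. The `SnapshotId` and `VolumeId` keys follow the shape of EC2 snapshot descriptions; the function name is hypothetical, and the script actually run through run_script is not shown in these notes.

```python
def find_orphan_snapshots(snapshots: list[dict], existing_volume_ids: list[str]) -> list[str]:
    """Return ids of snapshots whose source volume is no longer present."""
    existing = set(existing_volume_ids)
    return [
        snap["SnapshotId"]
        for snap in snapshots
        if snap.get("VolumeId") not in existing
    ]


# Illustrative data only; ids are made up.
sample = [
    {"SnapshotId": "snap-aaa", "VolumeId": "vol-111"},
    {"SnapshotId": "snap-bbb", "VolumeId": "vol-999"},  # source volume is gone
]
orphans = find_orphan_snapshots(sample, ["vol-111", "vol-222"])
```

A real cleanup would also exclude snapshots that back registered AMIs and would report candidates for review before deleting anything.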
May 7, 2026

Section 13.6 — Claude Code on Amazon Bedrock: Enterprise Deployment for Regulated Industries

MAJOR

Following the Apr/2026 Anthropic + AWS announcement, Claude Code now runs natively on the customer's own Amazon Bedrock account — all inference stays within the customer-controlled AWS perimeter, nothing flows to Anthropic servers. This section explains why that change unlocks adoption in banks, hospitals, defense, telecom and government, what the architecture looks like, and how to set up a senior-grade deployment from scratch with federated IAM, model pinning, per-team cost tagging and Bedrock Guardrails. Includes a critical disambiguation between "Claude on Bedrock" (compliance, hosted in the customer's AWS) and "Claude Platform on AWS" (procurement-only via AWS Marketplace, infra remains at Anthropic).

+ 13.6.1 — The Compliance Gap: why "where the prompts travel" is the gate that separates an authorized pilot from a CISO veto in regulated industries (PCI-DSS, HIPAA, LGPD, SOC2, FedRAMP)
+ 13.6.2 — Bedrock-native architecture: diagram of the inference flow inside the AWS perimeter, data at rest and in transit under customer control, KMS BYOK, VPC endpoints (PrivateLink) — plus the callout explaining what is not "Claude on Bedrock"
+ 13.6.3 — Senior 8-step setup: enable model access in Bedrock, create an IAM role with least-privilege, SSO federation (Identity Center), model pinning via inference profile (no silent upgrades), Application Inference Profiles for per-team cost tagging, Bedrock Guardrails, observability via CloudTrail + CloudWatch, and CLAUDE_CODE_USE_BEDROCK=1 with ANTHROPIC_BEDROCK_BASE_URL
+ 13.6.4 — The Mantle endpoint (May/2026): a new AWS feature that exposes Bedrock as an OpenAI-compatible endpoint, simplifying integrations with legacy tools and third-party SDKs without losing inference isolation
+ 13.6.5 — Real-world Q2 Code (Banking): walkthrough of the actual Claude Code rollout for a digital bank's platform team, covering banking regulation, SoX, and a control-mapping table tying Bedrock Guardrails + IAM to compliance requirements
+ 13.6.6 — What This Means for Your Career: why mastering regulated AI deployments is the next senior differentiator — knowing how to prompt is not enough; you need to know how to convince the CISO
May 6, 2026

Spec-Driven Development — Full Framework Applied to Terraform, Kubernetes and Observability

MAJOR

Strategic update: the guide now covers the Spec-Driven Development (SDD) framework end-to-end, with a methodological foundation plus three practical applications. With AWS Q Developer EOL pushing teams toward Kiro, and Anthropic only recommending the Skills+Subagents+Hooks pattern (without shipping a first-party feature), the guide positions itself as the canonical reference for native SDD in Claude Code: the official pattern, ready to use.

+ Section 4.16 — Spec-Driven Agents: framework foundation. The three pillars (Contracts, Agents, Runtime), the canonical seven-phase workflow, Claude Code implementation, Kiro vs Spec Kit vs Native landscape, and when NOT to use SDD (honest anti-patterns)
+ Section 5.16 — Spec-Driven IaC: applied to Terraform with full walkthrough of a Multi-AZ VPC for a PCI environment, drift detection as spec validation, and tooling (IBM iac-spec-kit + native Claude Code)
+ Section 7.19 — Spec-Driven Kubernetes: manifests + policy from requirements. payment-service walkthrough (PCI latency-critical tier) generating Deployment, NetworkPolicy, PDB, HPA and Kyverno — with the spec acting as a Dev ↔ Platform contract
+ Section 9.15 — SLOs as Specs: applied to observability. Versioned slo.md becomes the source of truth that auto-generates Prometheus rules + Grafana dashboards + runbooks. SLO changes become regenerable
+ Public templates in devopsai-templates: CLAUDE-spec-driven.md, 4 skills (/spec-create, /spec-execute, /spec-status, /spec-validate), 5 subagents (requirements, design, tasks, implementation, spec-validator) and the enforce-spec.sh hook (PreToolUse). All plug-and-play in any project
+ Canonical positioning: aligned with Anthropic's official pattern (Skills + Subagents + Hooks for multi-phase workflows) — Anthropic recommends the pattern without shipping a first-party feature; the guide closes the gap with concrete, Context7-validated implementation
Stack: Claude Code (Skills + Subagents + Hooks), Terraform, Kubernetes (Kyverno, NetworkPolicy, PDB, HPA), Prometheus, Grafana. Model: Opus 4.7. Templates: github.com/filipemotta/devopsai-templates
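
To show what "SLOs as specs" can mean in practice, here is a minimal sketch that turns an SLO target into an error budget and a burn-rate alert rule. It is not the generator from section 9.15. The metric name `http_requests_total`, its labels, the alert name, and the 14.4x one-hour burn-rate threshold (a commonly used fast-burn value for a 30-day window) are all assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate_alert(service: str, slo_target: float, burn_rate: float = 14.4) -> dict:
    """Build a Prometheus-style alerting rule as a plain dict."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    expr = (
        f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[1h])) '
        f'/ sum(rate(http_requests_total{{service="{service}"}}[1h])) '
        f"> {burn_rate} * {budget:.6f}"
    )
    return {
        "alert": "ErrorBudgetBurn",  # hypothetical alert name
        "expr": expr,
        "for": "5m",
        "labels": {"severity": "page", "service": service},
    }
```

For a 99.9% target over 30 days, the budget works out to 0.001 × 43,200 minutes = 43.2 minutes, which is the kind of number a regenerated runbook could quote directly from the spec.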
April 2026 (6 updates)
Apr 30, 2026

Section 4.15 — Managed Agents API: File-Based vs Stateful Subagents

NEW

Anthropic launched the Managed Agents API (Apr/2026, beta managed-agents-2026-04-01). Rather than a passing mention, the guide now has an opinionated section honestly comparing both approaches — file-based (our default across all chapters) vs managed (server-side stateful alternative) — and helps the reader decide when each makes sense.

+ Four core concepts: Agent, Environment, Session, Events — each with its file-based analog
+ Comparison table with 12 criteria: where it runs, state, portability, MCP auth via Vaults, cost, observability, compliance
+ Decision tree: 4 questions that settle the choice between file-based and managed
+ Practical example: Stateful Incident War Room — handoff between on-calls in different shifts (BR → US → EU) with memory store, vaults, agent_toolset_20260401, direct comparison vs manual INCIDENT_NOTES.md
+ Real caveats: beta, full vendor lock-in, limited observability, different pricing, stdio MCPs don't work. Clear stance: file-based remains the guide's default
Apr 30, 2026

Opus 4.7 — Section 3.13 Rewrite + Task Budgets in Agent Teams

UPDATE

Anthropic released Claude Opus 4.7 with important changes for agentic use. Section 3.13 was rewritten from scratch to reflect what changed; sections 5.14, 7.14 and 7.18 got Task Budget examples for Agent Teams; and the devopsai-templates were updated.

+ Adaptive Thinking off by default: must enable with thinking: {type:"adaptive"} — behavior changed vs 4.6
+ New xhigh level: recommended for coding/agentic. Thinks more than high; better for Agent Teams + repeated tool calling
+ Task Budgets (beta): token budget for the entire agentic loop. output_config.task_budget + header task-budgets-2026-03-13. Per-session cost cap with no abrupt cutoff.
+ Strict effort at low/medium: model strictly scopes to the request — good for cost, but watch out for under-thinking
+ Behavior changes: spawns fewer subagents by default (must request parallelization explicitly), more regular progress updates, better memory, real-time cybersec safeguards
+ Sections 5.14, 7.14, 7.18: practical task_budget examples in Agent Teams for Terraform, EKS upgrade and multi-region GKE
+ Cross-guide audit: claude-opus-4-6 references updated to claude-opus-4-7 across all chapters and templates (Fast Mode kept as 4.6-exclusive)
Apr 29, 2026

GCP MCP Servers + GKE — Full Coverage of Google's Ecosystem

NEW

Four new sections balance the guide's cloud coverage: 3.16 introduces Google's remote-HTTP approach (vs AWS local-stdio), 7.16 compares GKE vs EKS through an AI lens, 7.17 walks through all 27 GKE MCP tools in a real incident-response flow, and 7.18 shows a multi-region Agent Team with automatic reconciliation.

+ Section 3.16: ADC + OAuth setup, segmented endpoints (read/full/delete), reference table of 14+ GCP MCP servers, senior examples for Cloud Run, Logging, Monitoring (PromQL!) and Asset Inventory
+ Section 7.16: GKE Autopilot vs EKS, Workload Identity vs IRSA, ComputeClass vs Karpenter, multi-cloud decision table
+ Section 7.17: Full practical incident response with 10 GKE MCP tool calls — read-only for triage, controlled promotion to /mcp full, validation via Cloud Monitoring
+ Section 7.18: Agent Team provisioning GKE across 3 regions in parallel (~14min vs 40h manual), with reconciliation pass that catches cross-region drift before incidents
+ Bonus: reference to google/skills repo as canonical example of vendor-published Skills — aligned with the pattern taught in 3.10
Stack: GKE MCP (27 tools, 3 endpoints), Cloud Logging, Cloud Monitoring (PromQL), Cloud Asset Inventory, Resource Manager, Cloud Run. Auth via ADC + roles/mcp.toolUser. Native audit via Cloud Audit Logs.
Apr 13, 2026

RAG vs Long Context — When You Need a Vector DB

NEW

New section 12.10 covering the 2023→2026 paradigm shift: with 1M token context windows, grep + long context solves most internal doc search cases without a vector DB.

+ lazy_rag.py: Complete implementation with ripgrep + Claude API — zero infrastructure, ~50 lines
+ MCP Server: runbook_mcp.py with FastMCP — search and read runbooks as Claude Code tools
+ Decision tree: <50MB → grep | 50-500MB → BM25 | 500MB+ → full RAG
+ Cost comparison: Setup, infra, cost/query, freshness, debuggability
Stack: ripgrep, FastMCP, Anthropic API, prompt caching. Inspired by Claude Code's architecture (Glob/Grep/Read without vector DB)
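
The size-based decision tree above maps directly to a small function. The thresholds are the ones listed in this entry; the function name and return labels are hypothetical, and corpus size is only one of the criteria the section compares.

```python
def choose_retrieval(corpus_mb: float) -> str:
    """Pick a retrieval approach from corpus size, using the thresholds above."""
    if corpus_mb < 50:
        return "grep"      # grep + long context, no index to maintain
    if corpus_mb < 500:
        return "bm25"      # lightweight lexical index
    return "full-rag"      # 500MB and up: embeddings + vector DB
```

For example, a 120MB runbook repository lands in the `"bm25"` band under these thresholds.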
Apr 09, 2026

PagerDuty MCP — AI-Powered Incident Response

NEW

New section 9.14 dedicated to the official PagerDuty MCP Server (70+ tools). Closes the complete observability cycle: Detect (CloudWatch) → Visualize (Grafana) → Respond (PagerDuty).

+ 3AM Scenario: Complete P1 incident triage in 3 minutes — automatic correlation between incidents, on-call, metrics, and history
+ Assisted Postmortem: Claude drafts timeline, root cause, recurrence analysis, and action items automatically
+ 70+ tools: Incidents, on-call, escalation, workflows, change events, status pages
Stack: PagerDuty/pagerduty-mcp-server (official), uvx, read-only by default
Apr 08, 2026

ArgoCD MCP — AI-Connected GitOps

NEW

Section 14.6 expanded with the official ArgoCD MCP Server (Argo Labs). Claude talks directly to the ArgoCD API — lists apps, reads logs, syncs deployments, and executes resource actions in real time.

+ Before vs After: Sync failure investigation from ~5 min (manual) to ~30 sec (conversational)
+ 2 real scenarios: Kyverno webhook blocking deploy diagnosis + coordinated multi-app rollback
+ Security: Read-only mode (MCP_READ_ONLY=true) for production, write enabled in dev/staging
Stack: argoproj-labs/mcp-for-argocd (official), npx, 13 exposed tools
March 2026 (7 updates)
Mar 27, 2026

Practical Skills Across All Chapters

NEW

Added 12 practical skills complementing existing subagents. Each chapter now has the complete pattern: Subagent (interactive conversation) + Skills (quick, repeatable actions).

+ Ch 7 K8s: /k8s-debug (cluster diagnostics) + /k8s-review (manifest review)
+ Ch 8 CI/CD: /pipeline-debug (pipeline failures) + /pipeline-review (workflow security)
+ Ch 9 Obs: /incident-debug (incident diagnosis) + /cost-review (cost optimization)
+ Ch 10 Sec: /security-scan (security scanning) + /iam-review (IAM permissions review)
+ Templates: 12 skills published to devopsai-templates repository
Mar 27, 2026

Cursor → Claude Code Migration (All Chapters)

REVISION

Complete migration of 82 Cursor IDE references across 12 chapters. The guide now focuses on Claude Code CLI + VS Code as primary tools, with Cursor only as a compatible alternative.

.cursorrules → CLAUDE.md: All guardrails migrated to universal format
Prompts & subagents: "In Cursor" → "In Claude Code", creation via .claude/agents/
Ch 15 Governance: cursorrules-templates/ → claude-md-templates/ with CLAUDE-sre.md, etc.
Chapters affected (current numbering): 1, 2, 3, 4, 5, 7, 8, 9, 10, 14, 15, 16
Mar 27, 2026

Chapter 5 Terraform — Complete Revision

UPDATE

Migrated Cursor references to Claude Code, added practical skills, and updated AWS Provider versions (v5+/v6).

Cursor → Claude Code: .cursorrules migrated to CLAUDE.md, generic prompts, Cursor as alternative only
+ Section 5.13 expanded: New skills /terraform-review (security + costs) and /terraform-plan (blast radius)
AWS Provider: References updated from v5 to v5+/v6 (current provider: v6.33)
+ Templates: skills/terraform-review and skills/terraform-plan in public repository
Mar 26, 2026

Karpenter + AI: Intelligent Node Scaling

NEW

New section 7.15 dedicated to Karpenter v1 focusing on how AI helps with configuration, troubleshooting, consolidation and cost optimization. Senior-level content based on 2025-2026 best practices research.

+ 7.15.1: Core Concepts - NodePool, EC2NodeClass, NodeClaim (v1 API), CAS → Karpenter migration
+ 7.15.2: AI-Generated NodePool - Business context prompt, complete YAMLs (On-Demand + Spot)
+ 7.15.3: Consolidation - 3 strategies, Disruption Budgets, Spot-to-Spot (15+ instance types)
+ 7.15.4: Troubleshooting with AI - Real scenarios (nodes not provisioning, NodeClaim stuck, consolidation stalled)
+ 7.15.5: Cost Optimization - Utilization analysis, Spot strategies, Grafana dashboards
+ 7.15.6: CLAUDE.md for Karpenter - Practical template with safety rules and debugging steps
+ 7.15.7: Advanced Patterns - Drift Detection, Topology Spread, GPU, Node Expiration (TTL)
Mar 25, 2026

Subagents & Agent Teams — Complete Rewrite with Official Documentation

UPDATE

Sections 4.12, 4.13, and 4.14 rewritten based on updated official Claude Code documentation. Removed Cursor IDE references; added complete subagent frontmatter, a Subagents vs Agent Teams comparison table, and significantly expanded content.

Section 4.12: Agent in Action — Migrated from .cursorrules to CLAUDE.md, instructions via VS Code + Claude Code CLI
Section 4.13: Subagents — Complete frontmatter (14 fields), persistent memory, hooks, isolation, background tasks, @-mention
Section 4.14: Agent Teams — Comparison table, display modes, plan approval, quality gates, shutdown/cleanup
Removed: All Cursor IDE references (now consistent with VS Code + Claude Code CLI)
Fixed: Agent Teams doesn't require Opus 4.6 — works with any model (requires Claude Code v2.1.32+)
Mar 9, 2026

AWS MCP Servers — Complete Ecosystem for DevOps/Cloud

NEW

4 new sections covering the complete AWS MCP Servers ecosystem for DevOps — from the unified gateway (Core MCP) to specialized tools for observability, IaC, and FinOps.

+ Section 3.15: AWS Core MCP Server — Unified gateway with 13 roles (finops, monitoring, container-orchestration, etc.)
+ Section 5.15: CloudFormation & CDK with AWS MCP — Terraform alternatives with official MCPs
+ Section 9.11: CloudWatch MCP Server — Native AWS observability (alarms, logs, metrics)
+ Section 9.12: Cost Explorer MCP — Intelligent FinOps with AI (cost analysis, forecasting, alerts)
+ Reference table: 11+ individual AWS MCPs mapped by guide chapter
+ Subagent: finops-analyst for automated cost analysis
MCPs: awslabs.core-mcp-server, awslabs.cloudwatch-mcp-server, awslabs.cfn-mcp-server, awslabs.aws-iac-mcp-server, awslabs.cost-explorer-mcp-server, awslabs.aws-api-mcp-server
Mar 9, 2026

Grafana MCP + OpenTelemetry — Multi-Cloud Observability

NEW

New section on the official Grafana MCP Server (Grafana Labs) combined with OpenTelemetry for vendor-neutral, multi-cloud observability. Includes complete LGTM stack and practical investigation scenarios.

+ Section 9.13: Grafana MCP + OpenTelemetry — dashboards, PromQL, Loki, Tempo, alerts and incidents via MCP
+ LGTM Stack: Complete Docker Compose with OTel Collector + Prometheus + Loki + Tempo + Grafana
+ 3 scenarios: Latency investigation, signal correlation (traces↔logs↔metrics), dashboard creation via AI
+ Comparison: Grafana MCP vs CloudWatch MCP — when to use each
Stack: grafana/mcp-grafana, OpenTelemetry Collector, Prometheus, Loki, Tempo, grafana/otel-lgtm
February 2026 (8 updates)
Feb 19, 2026

Claude Code SDK — Programmatic Automation

NEW

New section on the Claude Code SDK for programmatic automation: drive the same Claude Code you run in your terminal from TypeScript/JavaScript or the headless CLI. Includes practical DevOps examples with cross-references to existing chapter content.

+ Section 8.14: Claude Code SDK — Programmatic Automation (comparison table: GitHub Action vs SDK vs Headless CLI)
+ Git Hook: Pre-commit security review with headless CLI (connects with 8.4 and 8.7)
+ TypeScript SDK: Batch flaky test analyzer (connects with 8.5)
+ Quality Gate: Examples for GitLab CI and Jenkins (connects with 8.3 and 8.6)
+ Internal CLI: Pipeline analyzer with security, performance and reliability analyses (connects with 8.10)
Feb 13, 2026

Hooks: Automation & Guardrails in Claude Code

NEW

New section on Claude Code Hooks — automations that run in response to events (PreToolUse, PostToolUse, Stop, SessionStart). Includes practical examples for each DevOps domain.

+ Section 3.14 — Hooks: Event types, anatomy, variables and complete examples
+ Section 5 — Terraform-specific hook (block destroy, auto-validate .tf)
+ Section 7 — Kubernetes-specific hook (protect namespaces, validate manifests)
+ Section 8 — CI/CD-specific hook (block force push, validate workflows)
+ Section 9 — Observability-specific hook (audit log, incident verification)
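
To give a feel for the "block destroy" idea in the Terraform hook listed above, here is a minimal sketch of a PreToolUse-style check. It assumes the hook receives the pending tool call as a JSON object with `tool_name` and `tool_input.command` fields and that returning exit code 2 blocks the call; the pattern list and the function names `should_block` and `handle_event` are hypothetical, not the guide's actual hook.

```python
import json
import re
import sys

# Hypothetical deny-list for illustration; the guide's real hook may differ.
BLOCKED_PATTERNS = [
    r"\bterraform\s+destroy\b",
    r"\bterraform\s+apply\b.*\s-destroy\b",
]


def should_block(command: str) -> bool:
    """Return True if the shell command matches a destructive Terraform pattern."""
    return any(re.search(pattern, command) for pattern in BLOCKED_PATTERNS)


def handle_event(raw_json: str) -> int:
    """Decide on one hook event; returns the exit code the hook would use.

    Assumed convention: 0 lets the tool call proceed, 2 blocks it and the
    message on stderr is reported back.
    """
    event = json.loads(raw_json)
    if event.get("tool_name") != "Bash":
        return 0  # only shell commands are inspected in this sketch
    command = event.get("tool_input", {}).get("command", "")
    if should_block(command):
        print(f"Blocked destructive Terraform command: {command}", file=sys.stderr)
        return 2
    return 0


# Small demo with a hand-written payload instead of real hook input.
sample = json.dumps(
    {"tool_name": "Bash", "tool_input": {"command": "terraform destroy -auto-approve"}}
)
print("exit code:", handle_event(sample))
```

When wired up as a real hook script, the same function would be called as `sys.exit(handle_event(sys.stdin.read()))`. Keeping the decision in a pure function makes the deny-list easy to unit test before it guards a real repository.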
Feb 9, 2026

Claude Opus 4.6: Agent Teams, Adaptive Thinking and Compaction API

NEW

Complete coverage of the new features in Claude Opus 4.6, including Agent Teams for multi-agent collaboration, Adaptive Thinking for reasoning control, the Compaction API for context management, and Fast Mode for 2.5x faster output.

+ Section 3.13 — Opus 4.6: Adaptive Thinking, Compaction API and Fast Mode
+ Section 4.14 — Agent Teams: Multiple Agents Working in Parallel
+ Section 5.14 — Agent Teams: Multi-Module Infrastructure Refactoring
+ Section 7.14 — Agent Teams: Multi-Agent EKS Upgrade Validation
+ Section 9.10 — Agent Teams: Automated War Room for P1 Incidents
~ Sections 8, 9 — Fast Mode and Opus 4.6 alternative mentions
~ Section 15 — Updated Opus 4.6 reference + Agent Teams capability
~ Technical fixes: Trivy, FastMCP, AWS MCP Servers, K8sGPT, Go version
Feb 7, 2026

Context7: Real-Time Documentation via MCP

NEW

New section on Context7, a free MCP Server that injects official up-to-date documentation into the AI's context, eliminating hallucinations caused by deprecated APIs or outdated syntax.

+ Section 3.12: Context7 — Real-Time Documentation via MCP (architecture, installation, practical usage)
+ Examples: Terraform providers, Kubernetes APIs, GitHub Actions, ArgoCD/Helm
+ Section 4.5: Mention of Context7 usage with AI agents
Install: claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
Feb 5, 2026

Immersive Chapter 1 Rewrite

UPDATE

Chapter 1 completely rewritten as an immersive, interactive experience, with 8 redesigned sections, clickable elements, and a transformative narrative.

+ Section 1.1: "3 AM. 47 Alerts." — Visceral hook with real incident scenario
+ Section 1.4: Interactive tab system with 5 before/after scenarios (Terraform, K8s, CI/CD, Incident, FinOps)
+ Section 1.6: Interactive flip cards — 8 pain points with solutions and corresponding chapters
+ Section 1.8: Animated implementation timeline (Week 1 to Month 3)
New structure: 1.1 Hook | 1.2 Human Glue | 1.3 AI & Standardization | 1.4 Interactive Transformation | 1.5 What's Different | 1.6 Pain Points Map | 1.7 What You'll Build | 1.8 Journey
Feb 4, 2026

AI-Assisted EKS/Kubernetes Upgrade

NEW

New complete section on AI-assisted EKS cluster upgrades, including deprecated API detection, addon compatibility matrix, and CI/CD automation.

+ Section 7.13: AI-Assisted EKS/Kubernetes Upgrade
+ Tools: Pluto, Kubent, eksup, EKS Cluster Insights
+ MCP Server: Amazon EKS MCP Server (AWS Managed)
+ Skill: /eks-upgrade-check for automated verification
+ Template: CLAUDE-eks-upgrade.md for upgrade projects
+ CI/CD: GitHub Actions for continuous compatibility verification
Addons covered: VPC CNI, CoreDNS, kube-proxy, Karpenter, Cluster Autoscaler, EBS/EFS CSI, External-DNS, External-Secrets, Cert-Manager, Metrics Server, Ingress-NGINX, AWS LB Controller, ArgoCD
Feb 4, 2026

Section 12 Update - RAG

UPDATE

Complete rewrite of the RAG for Runbooks chapter with detailed tool comparison, step-by-step implementation with Qdrant, evaluation metrics, and troubleshooting.

+ Comparison: Qdrant vs Pinecone vs Elasticsearch vs ChromaDB
+ Recommendation: Qdrant (open-source, Rust, native hybrid search)
+ Implementation: Complete step-by-step from zero to working RAG
+ Integration: Slack bot with /runbook command
+ Metrics: RAGAS, Recall@K, Precision@K, MRR, Faithfulness
+ Troubleshooting: Common problems and solutions
Frameworks: LlamaIndex vs LangChain | Embeddings: text-embedding-3-small | Evaluation: RAGAS, Braintrust, LangSmith
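
For readers unfamiliar with the retrieval metrics named above, the following is a minimal sketch of Precision@K, Recall@K, and MRR using their standard definitions. It is not the chapter's evaluation code; the runbook IDs and function names are made up for illustration.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(queries: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of 1/rank of the first relevant document per query (0 if none)."""
    if not queries:
        return 0.0
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)


# Hypothetical runbook IDs, used only to exercise the functions.
query_a = (["rb-dns", "rb-disk", "rb-oom", "rb-cert"], {"rb-disk", "rb-cert"})
query_b = (["rb-oom", "rb-dns"], {"rb-oom"})

print("precision@3:", precision_at_k(*query_a, k=3))
print("recall@3:", recall_at_k(*query_a, k=3))
print("MRR:", mean_reciprocal_rank([query_a, query_b]))
```

Faithfulness and the other RAGAS scores depend on an LLM judge and are not reproduced here; these three need only ranked IDs and a labeled set of relevant documents per query.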
Feb 4, 2026

Section 12 - RAG Expanded

UPDATE

RAG chapter expansion with multi-source integration (Jira, GitHub, Confluence, Slack) and Agent/MCP creation for Claude to use RAG without hallucinating.

+ Section 12.5: Integrating Multiple Sources (Jira, GitHub, Confluence, Slack)
+ Section 12.6: Creating Agent/MCP that Uses RAG (Anti-Hallucination)
+ Loaders: Complete code for Jira, GitHub, Confluence and Slack
+ MCP Server: FastMCP with query_runbooks tool for Claude Desktop
+ Subagent: Alternative for Claude Code with anti-hallucination prompt
Sources: Jira API, GitHub API, Confluence API, Slack API | Framework: FastMCP | Chapter now has 10 sections
January 2026 (3 updates)
Jan 28, 2026

GitHub Actions with AI

Expanded section on intelligent CI/CD with GitHub Actions, including self-diagnosing workflows and Claude Code integration.

+ Section 8.13: Intelligent GitHub Actions
+ Workflows: Ready-to-use templates
+ Integration: Claude Code in pipelines
Jan 15, 2026

MCP Servers on Kubernetes

New section on hosting MCP Servers on Amazon EKS with autoscaling, RBAC/IRSA security, and Cognito integration.

+ Section 7.12: Hosting MCP Servers on Kubernetes/EKS
+ Deploy: Complete manifests for EKS
+ Security: RBAC, IRSA, Cognito
Jan 2, 2026

DevOps Skills and Templates

Launch of the skills and templates repository for Claude Code, with ready-to-use automations for DevOps projects.

+ Repository: devopsai-templates on GitHub
+ 8 Skills: Terraform, Kubernetes, CI/CD, Observability, Security, FinOps, GitOps, DevOps General
+ Templates: CLAUDE.md for different scenarios
2025
December 2025 (1 update)
Dec 15, 2025

Official Guide Launch

LAUNCH

Launch of the first complete version of "The AI-Native DevOps Engineer" guide with 15 chapters covering the entire DevOps cycle with AI.

+ 15 Chapters: Complete content from Terraform to Governance
+ 300+ Pages: End-to-end practical examples
+ Tools: Claude Code, K8sGPT, MCP Servers, Cursor IDE
