Guide update history - new content added regularly
Structural reorganization of the guide to separate cloud-specific content from provider-agnostic content. Chapter 5 (Terraform) returns to being pure Terraform, and all AWS-specific content that was spread across 3.15 (AWS MCP), 5.15 (CFN/CDK), 12.6 (Bedrock), and 12.7 (Least-Privilege IAM) is now consolidated in a new dedicated Chapter 6: "AWS with AI ... Deep Dive". Section 5.9.5 (Hard Guardrail IAM) stays in the Terraform chapter and also appears duplicated as 6.2 in the AWS context. Previous chapters 6-15 have been renumbered to 7-16. Old links (chapter-5.html#5.15, chapter-12.html#12.6, etc.) continue to work via JS redirect-stubs. This restructure prepares the ground for future dedicated chapters on GCP and Azure.
A brand new section that closes the pedagogical arc of chapter 5. After the reader builds the monolithic CLAUDE.md in 5.8 (Step 8, ~530 lines with 13 architectural patterns) and learns subagents + skills in 5.13, section 5.17 shows how to refactor the monolith into a composed system: a slim ~80-line CLAUDE.md (identity + safety + tooling + pointers) plus 3 subagents (terraform-architect, terraform-cost-reviewer, terraform-security-reviewer) and 5 skills (tf-scaffold-stack, tf-variables-review, tf-naming-review, tf-cross-stack, tf-outputs-review). Includes diagnosis with an always-on vs on-demand table, the partition rule, a walkthrough migrating 3 concrete patterns, the final directory structure, and when NOT to refactor. Connects with section 5.9.5 (Hard Guardrail), 5.13 (Subagents and Skills), chapter 6 (same principle), and chapter 15 (book-level pattern). Step 8 of section 5.8 gains a forward-reference pointing to 5.17.
terraform-architect-pack/ in devopsai-templates: a composed pack with a README, the slim CLAUDE.md (~80 lines), 3 subagents under .claude/agents/, and 5 skills under .claude/skills/. Coexists with the monolithic CLAUDE-terraform-architect.md template; each serves a different project maturity stage
CLAUDE-terraform-architect.md: a blockquote at the top distinguishing "monolithic" from "composed" and pointing to the new pack as the natural evolution
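As a rough illustration of the composed pack described above, the sketch below lists the files such a pack might contain. The subagent and skill names come from the entry itself; the `README.md` filename and the per-file layout (`.claude/agents/<name>.md`, `.claude/skills/<name>/SKILL.md`) are assumptions based on common Claude Code conventions, not a listing of the actual repository.

```python
from pathlib import PurePosixPath

# Names taken from the changelog entry above.
SUBAGENTS = [
    "terraform-architect",
    "terraform-cost-reviewer",
    "terraform-security-reviewer",
]
SKILLS = [
    "tf-scaffold-stack",
    "tf-variables-review",
    "tf-naming-review",
    "tf-cross-stack",
    "tf-outputs-review",
]


def pack_layout(root: str = "terraform-architect-pack") -> list[str]:
    """Return the relative paths a composed pack might contain.

    ASSUMPTION: file names and nesting are illustrative, not copied
    from the real devopsai-templates repository.
    """
    base = PurePosixPath(root)
    paths = [base / "README.md", base / "CLAUDE.md"]
    # One Markdown file per subagent.
    paths += [base / ".claude" / "agents" / f"{name}.md" for name in SUBAGENTS]
    # One directory per skill, each holding a SKILL.md.
    paths += [base / ".claude" / "skills" / name / "SKILL.md" for name in SKILLS]
    return [str(p) for p in paths]


if __name__ == "__main__":
    for path in pack_layout():
        print(path)
```

The point of the split is visible in the listing: only the slim CLAUDE.md is always loaded, while the agents and skills are separate files that can be pulled in on demand.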
Three substantive additions to chapter 5 in this release. Section 5.8 got a new Step 8 dedicated to a Production-Grade CLAUDE.md, with excerpts of Tooling Strategy and Stack Architecture, plus a callout in Step 2 distinguishing the minimal template (good for initial testing) from the production-grade template (canonical reference for real projects). Section 5.9.5 is an entirely new addition focused on actual defense-in-depth: a Hard Guardrail Plan-Only IAM Role for production. And section 5.10 (Infracost integration) stopped recommending manual CLI installation and now uses the official Claude Code plugin, which removes the PATH wiring and the separate authentication step.
Infracost setup moved from manual CLI installation (brew install infracost/tap/infracost or curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh + infracost auth login) to the official Claude Code plugin (claude plugin marketplace add infracost/agent-skills + claude plugin install infracost@infracost). The flow now uses native slash commands: /infracost:breakdown, /infracost:fix-tags, /infracost:optimize. No more PATH wiring or separate authentication.
CLAUDE-terraform-architect.md: 450+ lines with a Senior Infrastructure Architect agent definition plus 13 architectural patterns (numbered stacks, native S3 backend, assume_role + default_tags workspace-aware, the variables-in-objects pattern, resource label naming conventions, count for lists, Name-only tags, kebab-case file organization, splat outputs, cross-stack state, new-stack checklist, plus the Plan-Only IAM Role for production and Infracost via plugin). The template is versioned at github.com/filipemotta/devopsai-templates and ready to copy as a CLAUDE.md in any Terraform project
<details> block in the product sidebar to host the new 5.9.5 Hard Guardrail IAM Plan-Only sub-entry. Readers walking through the chapter clearly see there is extra depth inside the Workspace Safety topic
Substantive restructure of three sections of chapter 5 to fix an architectural confusion. The chapter now establishes clearly that Terraform is multi-cloud by design (provider-agnostic) and offers a concrete decision tree for the question that keeps showing up in real teams: "which MCPs do I install to get started?". Section 5.8 stopped being an AWS-specific case and became an evolutionary walkthrough applicable to any cloud, starting from mkdir and ending at the second module reusing the accumulated context.
call_aws. Step 3 of the section makes the explicit cross-reference to 3.15.1
mkdir and git init, then asks Claude to generate the repository CLAUDE.md, writes the first module with validate and plan, runs apply in an isolated sandbox, and ends with a second module reusing the context Claude has already built up in the project
@modelcontextprotocol/server-filesystem is gone from 5.5 (it was redundant in Claude Code, which already has native Read, Write, Edit, and Bash). In its place, an explanatory callout about when that MCP still makes sense (Claude Desktop, Cursor without equivalent native tools, headless SDK integrations in environments where the filesystem is not exposed)
Critical security addition to chapter 12. Since the AWS MCP Server hit GA (May 2026, section 3.15), teams around the world have been attaching AdministratorAccess to the role used by Claude Code just to make the agent "work". This section is blunt: AdministratorAccess + call_aws is a production outage waiting to happen. The agent inherits the user's IAM permissions and no automatic sandbox exists for MCP tools. Section 13.7 delivers the full playbook to avoid this, in seven practical subsections with ready-to-paste policies.
call_aws inherits the user's IAM, why no automatic sandbox exists for MCP tools, and what happens when a misinterpreted prompt meets overly broad permissions (delete on production DynamoDB, terminate of critical EC2 instances, drop of RDS tables)
Describe*, List*, Get* only). Policy C: production write-controlled (explicit deny for destructive actions like iam:Delete*, ec2:TerminateInstances, rds:Delete*, dynamodb:DeleteTable, s3:DeleteBucket)
aws:PrincipalTag, aws:RequestTag and sts:RoleSessionName. Concrete example: write in production only allowed when aws:PrincipalTag/AgentMode=false. The same role blocks mutation when the session is the agent and allows it when the human is logged in with MFA
eventName matching a list of destructive actions (TerminateInstances, DeleteDBInstance, DeleteTable, etc.) and userIdentity.sessionContext.sessionIssuer.userName matching the agent role. Immediate notification on Slack via SNS + Lambda, with a direct link to the event in CloudTrail. Reaction in seconds, not hours
PreToolUse hook in .claude/settings.json that intercepts call_aws invocations and blocks destructive patterns before they leave the machine. Validation script that inspects the payload (action name + ARN) and returns exit 2 if the action is on the blocklist. Last line of defense, executed client-side
call_aws in any account with real data, read the least-privilege IAM playbook in section 13.7". Explicit connection between the technical entry point and the security playbook
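The client-side validation script mentioned in the playbook above could look roughly like the sketch below. It is a minimal illustration, not the guide's actual script. The blocklist reuses the destructive actions named in the entry. The `tool_name` and `tool_input` keys follow the JSON that Claude Code passes to hooks on stdin. The `service` and `operation` fields inside `tool_input` are hypothetical, since the real call_aws payload shape is not given here, and the ARN check is omitted for brevity.

```python
import fnmatch
import json
import sys

# Destructive actions named in the playbook entry above.
BLOCKED_ACTIONS = [
    "iam:Delete*",
    "ec2:TerminateInstances",
    "rds:Delete*",
    "dynamodb:DeleteTable",
    "s3:DeleteBucket",
]


def is_blocked(action: str) -> bool:
    """Return True when the action matches any blocklist pattern."""
    return any(fnmatch.fnmatchcase(action, pattern) for pattern in BLOCKED_ACTIONS)


def evaluate(event: dict) -> int:
    """Return the hook exit code: 2 blocks the tool call, 0 lets it through."""
    # ASSUMPTION: the MCP tool name ends with "call_aws".
    if not event.get("tool_name", "").endswith("call_aws"):
        return 0
    tool_input = event.get("tool_input", {})
    # ASSUMPTION: "service" and "operation" are hypothetical payload fields.
    action = f"{tool_input.get('service', '')}:{tool_input.get('operation', '')}"
    if is_blocked(action):
        # stderr is what the agent sees when a hook exits with code 2.
        print(f"Blocked destructive AWS action: {action}", file=sys.stderr)
        return 2
    return 0


def main() -> int:
    """Read the hook event from stdin and return the exit code."""
    return evaluate(json.load(sys.stdin))
```

To use it as a hook, the script would end with `sys.exit(main())` and be registered under a PreToolUse matcher in .claude/settings.json. A client-side check like this complements the IAM deny policies rather than replacing them, because it only sees calls that pass through this one machine.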
New senior-level section (~2,300 words) that closes a critical gap: until now the guide treated Claude Code as an interactive IDE tool. But in DevOps, most of the value shows up when it runs without a terminal — inside GitHub Actions, Lambdas reacting to CloudWatch Alarms, or external orchestrators like Hermes. Section 4.17 maps the three official Anthropic paths to run Claude Code non-interactively, shows three real production patterns, teaches CI authentication without friction, and establishes mandatory cost guardrails. Includes a critical heads-up: starting June 15, 2026, the Agent SDK and claude -p on subscription plans (Pro, Max, Team) will draw from a separate credit pool from interactive sessions — anyone running orchestrators in production needs to plan ahead.
claude -p): Print Mode with --output-format stream-json, --allowedTools, --max-turns, --permission-mode acceptEdits. Concrete PR review example via cat diff.patch | claude -p in a 12-line workflow
claude-agent-sdk (Python) and @anthropic-ai/claude-agent-sdk (TypeScript) packages. query() function, event streaming via async iterator, options allowed_tools, max_turns, system_prompt, mcp_servers. When to choose SDK vs CLI
anthropics/claude-code-action; (2) AWS Lambda event-driven reacting to a CloudWatch Alarm — initial investigation with AWS MCP Server and Slack posting; (3) Hermes Agent — open-source external orchestrator with persistent memory, multi-project routing by capabilities
claude setup-token (long-lived OAuth token, 1 year) for subscription plans; ANTHROPIC_API_KEY for direct API; AWS Bedrock + Vertex AI for enterprise deployment. Mandatory cost guardrails: --max-budget-usd, --max-turns, --allowedTools as a whitelist. Heads-up about June 15, 2026 separate credit pool
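One way to make the cost guardrails above hard to forget is to build the headless command in a single helper. The sketch below only assembles the argument list and does not run anything. The flags are the ones named in this entry; the default values and the choice of `json` output are assumptions, so check `claude --help` for the options your installed version supports.

```python
def build_headless_command(
    prompt: str,
    allowed_tools: list[str],
    max_turns: int = 5,
    max_budget_usd: float = 1.0,
) -> list[str]:
    """Build an argv list for a non-interactive `claude -p` run.

    ASSUMPTION: default limits are illustrative, not recommendations.
    """
    if not allowed_tools:
        # Treat the tool list as a whitelist: refuse to run without one.
        raise ValueError("allowed_tools must name at least one tool")
    if max_turns < 1 or max_budget_usd <= 0:
        raise ValueError("max_turns and max_budget_usd must be positive")
    return [
        "claude",
        "-p",
        prompt,
        "--output-format",
        "json",
        "--allowedTools",
        ",".join(allowed_tools),
        "--max-turns",
        str(max_turns),
        "--max-budget-usd",
        str(max_budget_usd),
    ]


if __name__ == "__main__":
    print(build_headless_command("Review the attached diff", ["Read", "Grep"]))
```

The resulting list can be passed to `subprocess.run(cmd, input=diff_text, text=True, capture_output=True)` in a CI job. Keeping every limit as a required part of the command means a run without a budget cap cannot be launched by accident.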
Between May 6 and 9, 2026, AWS and Anthropic announced the general availability of the AWS MCP Server, an official, free MCP server that becomes the new canonical entry point for ~80% of AWS interactions from Claude Code. The guide was restructured end-to-end: section 3.15 was split into four subsections, section 5.15 gained a new decision section, and section 7 and section 9 received repositioning callouts. The central philosophy is augment, not replace: the awslabs specialized servers remain valuable for task-oriented workflows.
call_aws (access to 15,000+ AWS APIs via Python SDK), run_script (sandboxed Python executed server-side for multi-step orchestration), search_documentation and read_documentation (live AWS docs, which sidesteps the model's knowledge cutoff)
awslabs ecosystem as specialized servers • 3.15.3 decision tree to choose between them • 3.15.4 new patterns unlocked (knowledge-cutoff bypass + run_script orchestration)
call_aws vs specialized awslabs servers" — IaC decision table + side-by-side example of S3 operations (raw call_aws vs s3-tables-mcp-server) showing real trade-offs
eks-mcp-server, CloudWatch, Cost Explorer) repositioning awslabs servers as specialized — they complement, not compete with, the canonical entry point
read_documentation while exploring S3 Vectors (announced post-model-cutoff) + run_script orchestrating orphan EBS snapshot cleanup server-side in a single call
aws-core, aws-agents, aws-data-analytics
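The orphan EBS snapshot cleanup mentioned above comes down to a filtering step like the one sketched here. This is an assumption-laden illustration, not the guide's run_script example. It treats a snapshot as orphaned when its source volume no longer exists, and it expects plain dictionaries shaped like the items returned by the EC2 describe_snapshots and describe_volumes APIs. A real cleanup would also need to exclude snapshots still referenced by AMIs.

```python
def find_orphan_snapshots(snapshots: list[dict], volumes: list[dict]) -> list[str]:
    """Return IDs of snapshots whose source volume no longer exists.

    ASSUMPTION: "orphan" means the snapshot's VolumeId is not among
    the currently existing volumes. Other definitions are possible.
    """
    live_volume_ids = {volume["VolumeId"] for volume in volumes}
    return [
        snapshot["SnapshotId"]
        for snapshot in snapshots
        if snapshot.get("VolumeId") not in live_volume_ids
    ]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample_snapshots = [
        {"SnapshotId": "snap-1", "VolumeId": "vol-a"},
        {"SnapshotId": "snap-2", "VolumeId": "vol-gone"},
    ]
    sample_volumes = [{"VolumeId": "vol-a"}]
    print(find_orphan_snapshots(sample_snapshots, sample_volumes))
```

Running the list, filter, and delete steps in one server-side script is what saves the round trips. The filter itself stays small enough to review before any delete call is added.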
Following the Apr/2026 Anthropic + AWS announcement, Claude Code now runs natively on the customer's own Amazon Bedrock account — all inference stays within the customer-controlled AWS perimeter, nothing flows to Anthropic servers. This section explains why that change unlocks adoption in banks, hospitals, defense, telecom and government, what the architecture looks like, and how to set up a senior-grade deployment from scratch with federated IAM, model pinning, per-team cost tagging and Bedrock Guardrails. Includes a critical disambiguation between "Claude on Bedrock" (compliance, hosted in the customer's AWS) and "Claude Platform on AWS" (procurement-only via AWS Marketplace, infra remains at Anthropic).
CLAUDE_CODE_USE_BEDROCK=1 with ANTHROPIC_BEDROCK_BASE_URL
Strategic update: the guide now covers the Spec-Driven Development (SDD) framework end-to-end, with methodological foundation plus three practical applications. With the AWS Q Developer EOL pushing teams toward Kiro, and Anthropic recommending the Skills+Subagents+Hooks pattern without shipping a first-party feature, the guide positions itself as the canonical reference for native SDD in Claude Code: the officially recommended pattern, ready to use.
slo.md becomes the source of truth that auto-generates Prometheus rules + Grafana dashboards + runbooks. When an SLO changes, those artifacts are regenerated from the file
/spec-create, /spec-execute, /spec-status, /spec-validate), 5 subagents (requirements, design, tasks, implementation, spec-validator) and the enforce-spec.sh hook (PreToolUse). All plug-and-play in any project
Anthropic launched the Managed Agents API (Apr/2026, beta managed-agents-2026-04-01). Rather than a passing mention, the guide now has an opinionated section honestly comparing both approaches — file-based (our default across all chapters) vs managed (server-side stateful alternative) — and helps the reader decide when each makes sense.
INCIDENT_NOTES.md
Anthropic released Claude Opus 4.7 with important changes for agentic use. Section 3.13 was rewritten from scratch to reflect what changed; sections 5.14, 6.14 and 6.18 got Task Budget examples for Agent Teams; and the devopsai-templates were updated.
thinking: {type:"adaptive"} — behavior changed vs 4.6
xhigh level: recommended for coding/agentic. Thinks more than high; better for Agent Teams + repeated tool calling
output_config.task_budget + header task-budgets-2026-03-13. Per-session cost cap with no abrupt cutoff.
task_budget examples in Agent Teams for Terraform, EKS upgrade and multi-region GKE
claude-opus-4-6 references updated to claude-opus-4-7 across all chapters and templates (Fast Mode kept as 4.6-exclusive)
Four new sections balance the guide's cloud coverage: 3.16 introduces Google's remote-HTTP approach (vs AWS local-stdio), 6.16 compares GKE vs EKS through an AI lens, 6.17 walks through all 27 GKE MCP tools in a real incident-response flow, and 6.18 shows a multi-region Agent Team with automatic reconciliation.
google/skills repo as canonical example of vendor-published Skills — aligned with the pattern taught in 3.10
New section 12.10 covering the 2023→2026 paradigm shift: with 1M token context windows, grep + long context solves most internal doc search cases without a vector DB.
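The grep-plus-long-context approach can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the section's implementation. It assumes runbooks are Markdown files under one directory, uses a case-insensitive substring match in place of grep, and caps the result with a hypothetical character budget standing in for a token limit.

```python
from pathlib import Path


def collect_context(docs_dir: str, keyword: str, max_chars: int = 200_000) -> str:
    """Concatenate every Markdown file that mentions keyword, up to max_chars.

    ASSUMPTION: a character budget approximates the model's context limit.
    """
    needle = keyword.lower()
    parts: list[str] = []
    used = 0
    for path in sorted(Path(docs_dir).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle not in text.lower():
            continue  # Skip files that never mention the keyword.
        chunk = f"--- {path} ---\n{text}\n"
        if used + len(chunk) > max_chars:
            break  # Stop before exceeding the budget.
        parts.append(chunk)
        used += len(chunk)
    return "".join(parts)


if __name__ == "__main__":
    # "docs" is a hypothetical directory name.
    print(len(collect_context("docs", "kafka")))
```

The returned string is pasted into the prompt as-is. There is no embedding step, index, or database to maintain, which is the trade-off the section argues for when the document set fits in the context window.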
New section 9.14 dedicated to the official PagerDuty MCP Server (70+ tools). Closes the complete observability cycle: Detect (CloudWatch) → Visualize (Grafana) → Respond (PagerDuty).
Section 14.6 expanded with the official ArgoCD MCP Server (Argo Labs). Claude talks directly to the ArgoCD API — lists apps, reads logs, syncs deployments, and executes resource actions in real time.
Added 12 practical skills complementing existing subagents. Each chapter now has the complete pattern: Subagent (interactive conversation) + Skills (quick, repeatable actions).
Complete migration of 82 Cursor IDE references across 12 chapters. The guide now focuses on Claude Code CLI + VS Code as primary tools, with Cursor only as a compatible alternative.
Migrated Cursor references to Claude Code, added practical skills, and updated AWS Provider versions (v5+/v6).
New section 7.15 dedicated to Karpenter v1 focusing on how AI helps with configuration, troubleshooting, consolidation and cost optimization. Senior-level content based on 2025-2026 best practices research.
Sections 4.12, 4.13, and 4.14 rewritten based on updated official Claude Code documentation. Removed Cursor IDE references, added complete subagent frontmatter and a Subagents vs Agent Teams comparison table, and significantly expanded the content.
4 new sections covering the complete AWS MCP Servers ecosystem for DevOps — from the unified gateway (Core MCP) to specialized tools for observability, IaC, and FinOps.
New section on the official Grafana MCP Server (Grafana Labs) combined with OpenTelemetry for vendor-neutral, multi-cloud observability. Includes complete LGTM stack and practical investigation scenarios.
New section on the Claude Code SDK for programmatic automation — use the same Claude Code from your terminal via TypeScript/JavaScript or headless CLI. Includes practical DevOps examples with cross-references to existing chapter content.
New section on Claude Code Hooks — automations that run in response to events (PreToolUse, PostToolUse, Stop, SessionStart). Includes practical examples for each DevOps domain.
Complete coverage of the new Claude Opus 4.6 features, including Agent Teams for multi-agent collaboration, Adaptive Thinking for reasoning control, Compaction API for context management, and Fast Mode for 2.5x faster output.
New section on Context7, a free MCP Server that injects official up-to-date documentation into the AI's context, eliminating hallucinations caused by deprecated APIs or outdated syntax.
Section 1 completely rewritten as an immersive and interactive experience, with 8 redesigned sections, clickable elements, and a transformative narrative.
New complete section on AI-assisted EKS cluster upgrades, including deprecated API detection, addon compatibility matrix, and CI/CD automation.
Complete rewrite of the RAG for Runbooks chapter with detailed tool comparison, step-by-step implementation with Qdrant, evaluation metrics, and troubleshooting.
RAG chapter expansion with multi-source integration (Jira, GitHub, Confluence, Slack) and Agent/MCP creation for Claude to use RAG without hallucinating.
Expanded section on intelligent CI/CD with GitHub Actions, including self-diagnosing workflows and Claude Code integration.
New section on hosting MCP Servers on Amazon EKS with autoscaling, RBAC/IRSA security, and Cognito integration.
Launch of the skills and templates repository for Claude Code, with ready-to-use automations for DevOps projects.
Launch of the first complete version of "The AI-Native DevOps Engineer" guide with 15 chapters covering the entire DevOps cycle with AI.
When you purchase the guide, you get access to all current content and future updates for 1 year.
Buy Guide - $57 | 14-day guarantee | 1 year access | Updates included