15 Chapters + 150 Sections of AI Applied to DevOps, SRE and Platform Engineering
Understand the real problem and AI fundamentals for DevOps
And Why You Need to Lead It
1.1 3 AM. 47 Alerts.
1.2 DevOps as Human Glue
1.3 Why AI Fails Without Standardization
1.4 See the Transformation
1.5 What Makes This Guide Different
1.6 Critical Pain Points
1.7 What You Will Build
1.8 The Journey Begins
LLMs, Claude, MCP and Agent Frameworks
2.1 What Are Large Language Models (LLMs)
2.2 Claude, GPT and Other Models: When to Use Each
2.3 Critical Limitations You Need to Know
2.4 Model Context Protocol (MCP): The Bridge Between LLM and Tools
2.5 Agent Frameworks: LangChain, CrewAI and AutoGen
2.6 Architectural Decisions of This Guide
VS Code, Claude Code CLI and Environment Configuration
3.1 The IDE as DevOps Command Center
3.2 VS Code + Claude Code: The Recommended Setup
3.3 Essential Extensions for DevOps
3.4 Configuration Files: .cursorrules, CLAUDE.md, settings.json
3.5 Starting Your Environment: claude init
3.6 Copilots, Chat and Agents in the IDE
3.7 MCP: The Model Context Protocol
3.8 Common Anti-Patterns in IDE AI Usage
3.9 Agent in Action - Initial IDE Setup
3.10 Claude Skills: Complete Guide
3.11 Building Your Own MCP Server
3.12 Context7: Real-Time Documentation via MCP
3.13 Opus 4.7: Adaptive Thinking, Task Budgets and xhigh Effort
3.14 Hooks: Automation & Guardrails in Claude Code
3.15 AWS MCP Servers — From Ecosystem to Canonical
› 3.15.1 AWS MCP Server (GA, May 2026) — The New Canonical Entry Point
› 3.15.2 The awslabs Ecosystem — Specialized Servers Worth Keeping
› 3.15.3 Decision Tree — When to Use What
› 3.15.4 New Patterns Unlocked by GA
3.16 GCP MCP Servers — Google's Remote Approach
From Concept to Practice with Specialized Subagents
4.1 What Is (and Isn't) an AI Agent
4.2 Practical Difference: Prompt, Script, Copilot and Agent
4.3 Anatomy of a DevOps Agent
4.4 Types of Agents in the DevOps World
4.5 MCP Applied to Agents
4.6 Minimum Guardrails for Production
4.7 When NOT to Use Agents
4.8 Anatomy of an Agent Execution (Step-by-Step)
4.9 Anti-Patterns: Where Agents Fail (Real Cases)
4.10 Debugging: When the Agent Makes Mistakes
4.11 Metrics: How to Measure Agent ROI
4.12 Agent in Action — Configuring the DevOps Agent
4.13 Specialized Subagents: Creating an AI Team
4.14 Agent Teams: Multiple Agents Working in Parallel
4.15 Managed Agents API — Anthropic's Stateful Alternative
4.16 Spec-Driven Agents: From Vibe Coding to Engineering Discipline
› 4.16.1 The Problem with Vibe Coding
› 4.16.2 The Three Pillars: Contracts, Agents, Runtime
› 4.16.3 The Canonical Workflow
› 4.16.4 Implementing in Claude Code
› 4.16.5 Landscape: Kiro vs Spec Kit vs Native Claude Code
› 4.16.6 When NOT to Use Spec-Driven
4.17 Headless Claude Code — From CLI to SDK to Embedded Orchestrators
› 4.17.1 The Claude Code as Library Pattern
› 4.17.2 The CLI Route — claude -p (Print Mode)
› 4.17.3 The SDK Route — Agent SDK (Python + TypeScript)
› 4.17.4 Three Orchestrator Patterns — Real-World Examples
› 4.17.5 CI Authentication and Cost Guardrails
› 4.17.6 When NOT to Use Headless Mode
Terraform and Kubernetes with AI in practice
Intelligent Infrastructure in Practice
5.1 Terraform in the Real World (The Silent Pain)
5.2 Where AI Really Helps in Terraform
5.3 MCP Applied to Terraform: Choosing Your Architecture
5.4 Installation Guide: VS Code, Cursor and Claude Code CLI
5.5 Installing the Right MCPs
5.6 Guardrails Configuration: Protecting Production
5.7 The "End-to-End" Flow (Supervised)
5.8 Practical Walkthrough: From Zero to First Module (Generic, Cloud-Agnostic)
5.9 Workspace Safety: AI as Environment Guardian
› 5.9.5 Hard Guardrail Plan-Only IAM
5.10 FinOps: AI as Cost Analyst
5.11 Agent in Action - The Infrastructure Architect
5.12 Terraform MCP Troubleshooting
5.13 Specialized Subagent: terraform-reviewer
5.14 Agent Teams: Multi-Module Infrastructure Refactoring
5.15 Beyond Terraform — CloudFormation & CDK with AWS MCP
› 5.15.6 When to Use call_aws vs Specialized awslabs Servers
5.16 Spec-Driven IaC: From Business Requirements to Modules
› 5.16.1 Why IaC Is the Killer App for Spec-Driven
› 5.16.2 The Three Pillars Applied to Terraform
› 5.16.3 Walkthrough: Multi-AZ VPC for PCI Environment
› 5.16.4 Drift Detection as Spec Validation
› 5.16.5 Tooling: IBM iac-spec-kit and Native Claude Code
5.17 From Monolith to Composable: Refactoring CLAUDE.md into Skills and Subagents
› 5.17.1 Diagnosis: Why Refactor
› 5.17.2 The Partition Rule: Always-On vs On-Demand
› 5.17.3 Walkthrough: Migrating 3 Concrete Patterns
› 5.17.4 The Final CLAUDE.md (~80 lines)
› 5.17.5 Final Directory Structure
› 5.17.6 When NOT to Refactor
MCP Server, IAM Hard Guardrail, CFN/CDK, Bedrock, and Least-Privilege
6.1 AWS MCP Server (deep dive)
6.2 Hard Guardrail Plan-Only IAM (duplicated from 5.9.5)
6.3 CloudFormation and CDK with AWS MCP
6.4 Claude Code on Bedrock
6.5 Least-Privilege IAM for AWS MCP
6.6 Closing ... future GCP and Azure
Operation, Policies and Intelligent Scaling
7.1 Kubernetes: The Distributed Operating System
7.2 K8sGPT: From CLI to Continuous Monitoring
7.3 MCP for Kubernetes: Giving "Eyes" to the Agent
7.4 MCP Installation Verification
7.5 Real End-to-End Case: Payment Service Down
7.6 Policies as Code: Kyverno + AI
7.7 Intelligent Autoscaling: KEDA + AI
7.8 Intelligent HPA/VPA: AI-Guided Configuration
7.9 Deployment Strategies: Canary and Blue-Green with AI
7.10 Specialized Subagent: k8s-troubleshoot
7.12 Hosting MCP Servers on Kubernetes/EKS
7.13 AI-Assisted EKS/Kubernetes Upgrade
7.14 Agent Teams: Multi-Agent EKS Upgrade Validation
7.15 Karpenter + AI: Intelligent Node Scaling
7.16 GKE vs EKS — When Each Makes Sense
7.17 GKE MCP in Action — The 27 Tools in a Real Flow
7.18 Agent Team — Multi-Region GKE Provisioning
7.19 Spec-Driven Kubernetes: From Requirements to Manifests + Policy
› 7.19.1 The Manifests-Without-Intent Problem
› 7.19.2 The Three Pillars Applied to Kubernetes
› 7.19.3 Real Walkthrough: payment-service Deployment
› 7.19.4 The Spec as Dev ↔ Platform Contract
7.11 Chapter Conclusion
CI/CD, Observability and Troubleshooting
Pipelines as Product
8.1 Pipelines as Product
8.2 Intelligent Test Selection (Predictive Test Selection)
8.3 Pipeline Failure Auto-Triage
8.4 Supply Chain Security
8.5 Flakiness: Unstable Tests
8.6 Pipeline Generation with AI
8.7 Pipeline Security
8.8 When NOT to Use AI in CI/CD
8.9 ROI of AI in CI/CD
8.10 Subagent: ci-security-analyst
8.11 Final Configuration
8.12 End-to-End Practical Scenario
8.13 GitHub Actions with Claude Code
8.14 Claude Code SDK — Programmatic Automation
8.15 Conclusion
From Signal Overload to Intelligent Action
9.1 The Problem of Signal Overload
9.2 Logs, Metrics and Traces Correlation
9.3 AI Support for On-Call
9.4 MTTR Reduction with Assisted Decision
9.5 Resource Forecasting with Prophet
9.6 Real Incident Practical Case
9.7 Incident Agent Architecture
9.8 Persona and Subagent Configuration
9.9 Limitations and When NOT to Use AI
9.10 Agent Teams: Automated War Room for P1 Incidents
9.11 CloudWatch MCP Server — Native AWS Observability
9.12 Cost Explorer MCP — Intelligent FinOps with AI
9.13 Grafana MCP + OpenTelemetry — Multi-Cloud Observability
9.14 Chapter Conclusion
9.15 SLOs as Specs: Auto-Generating Alerts and Dashboards
› 9.15.1 The SLO Document Gap
› 9.15.2 The Three Pillars Applied to Observability
› 9.15.3 Real Walkthrough: payment-service SLO
› 9.15.4 The Compounding Effect: SLO Changes Become Regenerable
Security, FinOps and Runbook RAG
Intelligent Vulnerability Triage
10.1 The Problem of Security at Scale
10.2 Intelligent Vulnerability Triage
10.3 Triage System Architecture
10.4 Automated Prioritization: From Detection to Action
10.5 Secrets Management with AI
10.6 Specialized Subagent: security-auditor
10.7 Agent Configuration (.cursorrules)
10.8 Limitations and When NOT to Use AI
10.9 Chapter Conclusion
Intelligent Cloud Cost Reduction
11.1 The Structural Problem of FinOps
11.2 Intelligent FinOps Architecture
11.3 Implementation: Essential Components
11.4 FinOps ROI with AI
11.5 Conclusion
Operational Knowledge Instantly Accessible
12.1 The Problem of Distributed Documentation
12.2 Fundamentals: RAG, BM25 and Embeddings
12.3 Stack Selection: Detailed Comparison (2025-2026)
12.4 Step-by-Step Implementation with Qdrant + LlamaIndex
12.5 Integrating Multiple Sources: Jira, GitHub, Confluence, Slack
12.6 Creating an Agent/MCP that Uses RAG (Anti-Hallucination)
12.7 Slack Integration for Quick Access
12.8 RAG Metrics and Evaluation
12.9 Limitations and Troubleshooting
12.10 Conclusion and Implementation Checklist
Security, GitOps and Organizational Adoption
Human-in-the-Loop and Responsibility
13.1 Why AI Without Limits Becomes Risk
13.2 Human-in-the-Loop in Practice
13.3 Simple Guardrails That Work
13.4 Responsibility Remains Human
13.5 Security Checklist
13.6 Claude Code on Amazon Bedrock: Enterprise Deployment for Regulated Industries
› 13.6.1 The Compliance Gap
› 13.6.2 Architecture: Bedrock-Native Inference
› 13.6.3 Senior Setup: From Zero to Production
› 13.6.4 The Mantle Endpoint
› 13.6.5 Real-World Pattern: Q2 Code (Banking)
› 13.6.6 What This Means for Your Career
13.7 AWS MCP Server: Least-Privilege IAM and Operational Guardrails
› 13.7.1 The Blast Radius Problem
› 13.7.2 Three Deployment Tiers
› 13.7.3 Ready-to-Use IAM Templates
› 13.7.4 Defense in Depth: Five Layers
› 13.7.5 Agent vs Human: IAM Condition Keys
› 13.7.6 CloudTrail and Real-Time Alarms
› 13.7.7 PreToolUse Anti-Destructive Hook in Claude Code
ArgoCD, Flux and Intelligent Automation
14.1 What Is GitOps (Recap)
14.2 Where AI Adds Value in GitOps Flow
14.3 Automated IaC PR Review
14.4 Manifest Generation with Claude
14.5 Drift Detection and Correction
14.6 Integration with ArgoCD and Flux
14.7 Production-Ready Prompts
14.8 Where to Place Prompts and How to Automate
14.9 Guardrails: What NOT to Automate
How to Scale AI in DevOps at the Enterprise
15.1 The Problem of Uncoordinated Adoption
15.2 Governance Architecture: Repositories and Structure
15.3 CLAUDE.md: The Single Source of Truth
15.4 Adoption Architecture: People and Processes
What's coming and how to prepare
Future Vision and Your Action Plan
16.0 What You've Seen in This Guide
16.1 What Changed (and What Didn't)
16.2 What's Coming in the Next 2-3 Years
16.3 How to Prepare (Practical Actions)
16.4 The Real Risks (Not the Hype)
16.5 The Inconvenient Truth
16.6 Your 6-Month Plan
16.7 The Final Principle
16.8 You're Ready. Start Tomorrow.
300+ pages of technical content applied to real production