AWS DOP-C02: DevOps Engineer Professional Study Guide
The AWS Certified DevOps Engineer - Professional (DOP-C02) validates advanced skills in building and operating automated, resilient systems on AWS, spanning CI/CD pipelines, infrastructure as code, monitoring, incident response, and security automation. It is a 180-minute, 75-question exam (scored 100-1000, passing at 750) aimed at engineers with two or more years of provisioning, operating, and managing AWS environments. Expect scenario-heavy questions that require choosing between similar valid services based on cost, blast radius, and operational overhead.
Domain 1: SDLC Automation
- CodePipeline orchestrates source, build, test, and deploy stages; each stage can hold multiple actions that run in parallel (same runOrder) or sequentially (incrementing runOrder).
- Build the artifact once in an early stage and promote the same versioned artifact through every environment; never rebuild per stage, which risks environment drift.
- CodeBuild runs commands from a buildspec.yml; the install, pre_build, build, and post_build phases run in order, and the artifacts section defines which files become the output artifact.
- CodeBuild builds have a maximum timeout of 8 hours; a build exceeding it is terminated, so split long builds or offload work.
- To reach private resources (RDS, private endpoints) from a build, configure the CodeBuild project to run inside the VPC's private subnets with a security group; CodeBuild then loses default public internet access unless a NAT gateway is present.
- Inject secrets into CodeBuild via the buildspec env/secrets-manager (or parameter-store) mapping so values are pulled at runtime and not echoed in logs or stored in the template.
- Add a manual approval action in a stage to gate promotion (for example between dev and prod); approvals can publish to an SNS topic for notification.
- For CloudFormation deploys, use two CloudFormation actions separated by a manual approval: create the change set, pause for approval, then execute the change set, so reviewers see the diff before changes apply.
- Trigger pipelines on source events rather than slow periodic polling: use an EventBridge rule on the repository state change for CodeCommit, or the webhook-based change detection that CodeConnections provides for third-party repositories.
- Lambda deployments use versions and aliases with CodeDeploy traffic shifting (canary or linear) so traffic moves gradually behind the alias.
- Authenticate Docker to ECR with: aws ecr get-login-password --region <r> | docker login --username AWS --password-stdin <acct>.dkr.ecr.<r>.amazonaws.com.
- For monorepos, split into per-service pipelines triggered only on changes to each service's path to reduce unnecessary runs.
- Cross-account artifact sharing requires the pipeline's S3 artifact bucket to use a customer managed KMS key whose key policy and the bucket policy both grant the consuming account's role access.
- CodeDeploy appspec.yml lifecycle hook scripts each have a configurable timeout; if a hook exceeds it the deployment fails, so set adequate timeouts for long-running hooks.
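The runOrder behavior described for CodePipeline in this domain can be modeled in a few lines. This is a minimal sketch, not how CodePipeline itself is implemented: the `plan_stage` helper and the action names are hypothetical, and the dictionaries only mimic the `name` and `runOrder` fields of a stage's action declarations.

```python
from collections import defaultdict


def plan_stage(actions):
    """Group a stage's actions into batches by runOrder.

    Actions sharing a runOrder form one batch (they run in parallel);
    batches are returned in ascending runOrder (they run sequentially).
    """
    batches = defaultdict(list)
    for action in actions:
        # An action without an explicit runOrder is treated as runOrder 1.
        batches[action.get("runOrder", 1)].append(action["name"])
    return [sorted(batches[order]) for order in sorted(batches)]


# Hypothetical test stage: two checks in parallel, then a report step.
test_stage = [
    {"name": "UnitTests", "runOrder": 1},
    {"name": "Lint", "runOrder": 1},
    {"name": "PublishReport", "runOrder": 2},
]

plan = plan_stage(test_stage)
print(plan)
```

Each inner list is one parallel batch, so a stage with many independent checks finishes in roughly the time of its slowest action rather than the sum of all of them.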
Domain 2: Configuration Management and IaC
- A single CloudFormation stack supports up to 500 resources; refactor large templates into nested stacks or modules to stay under the limit and improve reuse.
- Share values across stacks by declaring Outputs with Export and consuming them via Fn::ImportValue; an exported value cannot be changed or deleted while another stack imports it.
- Set DeletionPolicy: Retain (and UpdateReplacePolicy: Retain) on stateful resources like databases, and enable stack termination protection, to prevent accidental data loss.
- Some property updates force a replacement: CloudFormation creates a new resource and deletes the old one, which can drop data unless a Snapshot or Retain policy is set.
- Detect configuration drift with aws cloudformation detect-stack-drift, then review per-resource drift status to find out-of-band changes.
- Preview changes safely with change sets: aws cloudformation create-change-set --stack-name <s> --change-set-name <c> --template-body file://tpl.yaml, then execute after review.
- When a stack is stuck in UPDATE_ROLLBACK_FAILED, use ContinueUpdateRollback (optionally skipping resources that cannot roll back) to return it to a stable state.
- Store configuration in SSM Parameter Store; use SecureString for secrets (aws ssm put-parameter --name /app/db/password --value <v> --type SecureString), which encrypts with KMS.
- Run commands across fleets with Run Command (aws ssm send-command --document-name AWS-RunShellScript --targets Key=tag:Env,Values=prod ...) targeting instances by tag.
- Use State Manager associations to continuously enforce desired configuration on managed instances on a schedule, preventing drift.
- Prefer immutable infrastructure: bake a tested AMI (EC2 Image Builder) and replace instances rather than mutating live ones, which avoids config drift and gives repeatable deploys.
- Govern multi-account environments with AWS Control Tower to enroll accounts and apply preventive (SCP-based) and detective (Config-based) guardrails.
- Validate CDK constructs at synth time with CDK Aspects or cdk-nag to catch policy and compliance violations before deployment.
- Throttle StackSets rollouts with a low FailureToleranceCount/Percentage and MaxConcurrentCount so a faulty change halts after only a few account or Region failures.
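Reviewing per-resource drift status, as described in this domain, usually comes down to filtering the drift results for anything not in sync. The sketch below assumes records shaped like the per-resource drift results CloudFormation returns (a `LogicalResourceId` and a `StackResourceDriftStatus`); the `out_of_band_changes` helper and the sample resources are hypothetical, and no AWS call is made.

```python
def out_of_band_changes(drifts):
    """Return logical IDs of resources that were modified or deleted
    outside CloudFormation, preserving the input order."""
    drifted_statuses = {"MODIFIED", "DELETED"}
    return [
        d["LogicalResourceId"]
        for d in drifts
        if d["StackResourceDriftStatus"] in drifted_statuses
    ]


# Hypothetical drift results for illustration only.
sample_drifts = [
    {"LogicalResourceId": "AppBucket", "StackResourceDriftStatus": "IN_SYNC"},
    {"LogicalResourceId": "WebSecurityGroup", "StackResourceDriftStatus": "MODIFIED"},
    {"LogicalResourceId": "AlarmTopic", "StackResourceDriftStatus": "DELETED"},
    {"LogicalResourceId": "CustomThing", "StackResourceDriftStatus": "NOT_CHECKED"},
]

changed = out_of_band_changes(sample_drifts)
print(changed)
```

Resources reported as NOT_CHECKED are left out on purpose: that status means drift detection did not evaluate the resource, not that it is unchanged.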
Domain 3: Resilient Cloud Solutions
- An Auto Scaling group behind a load balancer with health checks provides self-healing: failed instances are terminated and replaced automatically to maintain desired capacity.
- Spread the Auto Scaling group across multiple Availability Zones so the loss of one AZ does not take down the application.
- Use ELB health checks (not only EC2 status checks) on an ASG so application-layer failures, not just hardware faults, mark instances unhealthy.
- Increase the ASG health check grace period when instances need long boot/initialization time, so they are not killed before the app is ready.
- Disaster recovery strategies trade cost for RTO/RPO: backup and restore (cheapest, slowest), pilot light, warm standby (scaled-down running copy), and multi-site active/active (fastest, costliest).
- Warm standby keeps scaled-down compute pre-provisioned in a second Region with continuous data replication, ready to scale up on failover.
- Use Aurora Global Database for sub-second cross-Region replication and fast promotion of the secondary during a regional failover.
- Route 53 health checks with failover routing automatically direct traffic to a healthy endpoint; latency-based routing optimizes for performance across Regions.
- AWS Global Accelerator uses endpoint health checks and the AWS backbone to redirect traffic automatically to healthy regional endpoints with static anycast IPs.
- Decouple components with Amazon SQS, and attach a dead-letter queue through a redrive policy: once a message has been received more than maxReceiveCount times without being deleted, SQS moves it to the DLQ, so poison messages are isolated, not lost.
- Blue/green deployment runs a new environment alongside the old, switches traffic after validation, and keeps the old environment for fast rollback.
- Configure CodeDeploy automatic rollback by associating CloudWatch alarms with the deployment group so a triggered alarm reverts the deployment.
- CodeDeploy for ECS supports a termination wait time that keeps the original (blue) task set running for a configured number of minutes after the cutover for rollback.
- Use scheduled scaling for known recurring spikes and predictive scaling to add capacity ahead of forecasted demand, so instances are already in service when the load arrives instead of launching after it.
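The dead-letter queue setup mentioned in this domain is configured through a single queue attribute. The following is a minimal sketch that only builds that attribute; the `redrive_attributes` helper and the queue ARN are hypothetical, and nothing here calls AWS.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the queue attribute that attaches a dead-letter queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the policy as a JSON string under the RedrivePolicy key.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical DLQ ARN for illustration only.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:111122223333:orders-dlq", 3)
print(attrs["RedrivePolicy"])
```

The resulting dictionary is in the shape one would pass as the queue attributes when creating or updating the source queue. A low maxReceiveCount isolates poison messages quickly, while a higher one tolerates transient consumer failures.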
Domain 4: Monitoring and Logging
- Publish custom metrics with aws cloudwatch put-metric-data --namespace MyApp --metric-name <m> --value <v>; metrics are namespace plus name plus dimensions.
- CloudWatch standard-resolution custom metrics aggregate to 1-minute granularity; set StorageResolution=1 for high-resolution 1-second metrics at higher cost.
- Create a metric from logs with put-metric-filter; a metric filter only emits a data point when a line matches, so configure missing-data treatment (or a default value of 0) for reliable alarming.
- Reduce alarm noise by requiring M out of N datapoints to breach over the evaluation window rather than alarming on a single spike.
- Composite alarms combine multiple alarms with AND/OR logic to alert only on meaningful correlated conditions and to suppress alarm storms.
- EC2 emits CloudWatch metrics every 5 minutes by default (basic monitoring); enable detailed monitoring for 1-minute granularity on standard EC2 metrics.
- Memory and disk usage are not default EC2 metrics; install and configure the CloudWatch agent (amazon-cloudwatch-agent) to push those metrics and application logs.
- Query logs with CloudWatch Logs Insights: aws logs start-query with a query string like 'fields @timestamp, @message | filter @message like /ERROR/'.
- Set log retention explicitly (aws logs put-retention-policy --retention-in-days 30); log groups default to never expire, which grows storage cost indefinitely.
- Use the CloudWatch embedded metric format (EMF) to emit structured JSON logs from which CloudWatch automatically extracts custom metrics.
- AWS X-Ray provides end-to-end request tracing and a service map showing per-service latency and errors to locate the slow or failing segment.
- X-Ray samples by default (a reservoir of a fixed number of traces per second plus a percentage of the remaining requests), so not every request is traced; adjust the sampling rules to capture more.
- Use an organization CloudTrail trail (multi-Region, auto-applies to new Regions) delivering to a central S3 bucket for complete cross-account API auditing.
- Use CloudWatch cross-account observability to share metrics, logs, and traces from many source accounts into a central monitoring account.
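The embedded metric format bullet in this domain is easier to remember with a concrete log line. This is a hedged sketch of one EMF record built by hand; the `emf_record` helper, the metric name, and the dimension are hypothetical, and in practice a client library may build the same structure.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit="Count", dimensions=None):
    """Build one embedded-metric-format log line as a JSON string."""
    dimensions = dimensions or {}
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing the dimension key names.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value sits at the top level under the metric's name.
        metric_name: value,
    }
    # Dimension values also live at the top level, next to the metric value.
    record.update(dimensions)
    return json.dumps(record)


# Hypothetical metric and dimension for illustration only.
line = emf_record(
    "MyApp", "CheckoutLatency", 182, "Milliseconds", {"Service": "checkout"}
)
print(line)
```

Writing such a line to a log stream that CloudWatch ingests (for example, standard output in a Lambda function) lets the same record serve as both a searchable log entry and a custom metric data point.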
Domain 5: Incident and Event Response
- EventBridge is a serverless event bus: rules match event patterns from AWS services, custom apps, and SaaS, then route matching events to targets like Lambda, SNS, SQS, and SSM Automation.
- Codify remediation as SSM Automation documents (runbooks) and trigger them automatically from alarms or events to remove human delay from recovery.
- Separate notification from remediation: CloudWatch alarm to SNS notifies on-call, while EventBridge to Lambda or SSM Automation performs the automated fix.
- Match GuardDuty findings with an EventBridge rule (source aws.guardduty, detail-type GuardDuty Finding) that invokes a Lambda or SSM Automation runbook to remediate.
- Make remediation actions idempotent so re-running causes no extra harm, and add guardrails (approval steps, scope limits) for high-impact actions.
- Add an EventBridge target dead-letter queue and Lambda retry/DLQ so events that fail to deliver or process are captured for reprocessing instead of lost.
- Use EventBridge archive and replay to retain matched events and re-emit them later for recovery or testing.
- For Lambda async invocations, configure an on-failure destination (for example an SQS queue, SNS topic, Lambda function, or EventBridge event bus) to capture failed events after retries are exhausted.
- Use EventBridge Scheduler (or scheduled rules) for cron/rate-based triggers, such as a nightly runbook, with timing managed by AWS.
- AWS Security Hub aggregates findings across security services and emits them to EventBridge to drive automated response workflows.
- Standard SQS queues deliver at least once, so consumers must be idempotent; use a FIFO queue with content-based deduplication when ordering and exactly-once processing matter (duplicates are suppressed within the 5-minute deduplication interval).
- Orchestrate multi-step incident workflows with AWS Step Functions state machines for branching, retries, and human-approval steps.
- Prevent unintended remediation on protected resources by adding a tag-based condition or branch in the Automation document, or by scoping the triggering rule to skip protected tags.
- Use AWS Chatbot (Amazon Q Developer in chat applications) to deliver alerts to Slack/Chime and to run approved runbooks from chat.
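The GuardDuty rule described in this domain pairs an event pattern with a remediation target. The sketch below shows that pattern and a deliberately simplified matcher that handles exact values only; the `matches` helper is hypothetical and does not reproduce EventBridge's full matching syntax (prefix, numeric, and similar operators are left out), and the sample events are illustrative.

```python
GUARDDUTY_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
}


def matches(pattern, event):
    """Simplified EventBridge-style matching: exact values only.

    A list in the pattern matches if the event field equals any listed
    value; a nested dict in the pattern is matched recursively.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Illustrative events; real events carry many more fields.
finding = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {"severity": 8, "type": "UnauthorizedAccess:EC2/SSHBruteForce"},
}
state_change = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"state": "stopped"},
}

print(matches(GUARDDUTY_PATTERN, finding))
print(matches(GUARDDUTY_PATTERN, state_change))
```

Fields that the pattern does not mention are ignored, which is why a broad pattern like this one matches every GuardDuty finding; narrowing it under `detail` is how a rule is scoped to the findings that should trigger automated remediation.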
Domain 6: Security and Compliance
- Grant pipeline and compute permissions via IAM roles (CodeBuild service role, EC2 instance profile, OIDC for external CI), never long-lived access keys checked into config.
- Federate external CI (such as GitHub Actions) to an IAM role using OIDC for short-lived temporary credentials; create the provider with aws iam create-open-id-connect-provider --url https://token.actions.githubusercontent.com --client-id-list sts.amazonaws.com.
- Store secrets in Secrets Manager (with built-in rotation via a Lambda) or SSM Parameter Store SecureString; both encrypt at rest with KMS and are retrieved via IAM-authorized calls.
- Enable rotation with aws secretsmanager rotate-secret --secret-id <id> --rotation-lambda-arn <arn> --rotation-rules ...; use the multi-user (alternating users) strategy for zero-downtime database rotation and have apps re-fetch the secret on auth failure.
- Service Control Policies (SCPs) set the maximum permissions for an account; an action an SCP does not allow is denied even if an identity-based policy explicitly allows it.
- S3 account-level Block Public Access overrides bucket policies; a public-granting bucket policy is still blocked when account BPA is enabled.
- Use AWS Config managed rules (for example S3_BUCKET_PUBLIC_READ_PROHIBITED) with automatic remediation via an SSM Automation document to detect and revert violations.
- Set up cross-account access for a third party with an IAM role the vendor assumes that includes an ExternalId condition in the trust policy to prevent the confused-deputy problem.
- For pipeline cross-account deploys, create a scoped IAM role in each target account whose trust policy allows only the pipeline account/role to assume it.
- Enable default S3 encryption with KMS: aws s3api put-bucket-encryption ... ApplyServerSideEncryptionByDefault with SSEAlgorithm aws:kms and a key ARN.
- KMS enforces a mandatory 7 to 30 day waiting period before key deletion, during which you can cancel the scheduled deletion to restore access.
- Enable CloudTrail log file integrity validation (digest files with hashing) and store logs in S3 with Object Lock for tamper-evident, immutable audit records.
- Scan container images with Amazon ECR enhanced scanning (powered by Amazon Inspector) and add a pipeline check that fails the stage when critical findings exist before deploy.
- Prevent secret leakage by retrieving credentials at runtime from Secrets Manager/Parameter Store via the build role and running automated secret scanning on commits/PRs to block credentials before merge.
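The OIDC federation bullet in this domain covers creating the provider; the role that GitHub Actions assumes also needs a trust policy scoped to a specific repository. This is a minimal sketch of such a policy, assuming the commonly used `aud` and `sub` claim conditions; the `github_oidc_trust_policy` helper, the account ID, and the repository name are hypothetical.

```python
import json


def github_oidc_trust_policy(account_id, repo, branch="main"):
    """Build a role trust policy limited to one repository branch."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience must be STS, matching the provider's client ID.
                        f"{provider}:aud": "sts.amazonaws.com",
                        # Subject pins the role to one repo and branch.
                        f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Hypothetical account and repository for illustration only.
policy = github_oidc_trust_policy("111122223333", "example-org/example-repo")
print(json.dumps(policy, indent=2))
```

Without the `sub` condition, any repository able to obtain a token from the same provider could assume the role, so scoping it is what keeps the short-lived credentials limited to the intended pipeline.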
AWS DOP-C02 exam tips
- Read for the qualifier: questions ask for the MOST cost-effective, LEAST operational overhead, or FASTEST recovery. Multiple options are technically valid, so let the qualifier eliminate the rest.
- Default to managed and serverless over self-managed: EventBridge over cron on EC2, SSM Automation over custom scripts, and Secrets Manager rotation over hand-rolled rotation reduce operational overhead.
- Know the rollback and traffic-shifting story for each compute type: CodeDeploy blue/green and canary/linear for Lambda and ECS, CloudWatch-alarm-based automatic rollback, and the ECS termination wait time.
- Memorize the hard numbers: CodeBuild 8-hour build max, CloudFormation 500 resources per stack, KMS 7-30 day deletion window, EC2 basic monitoring 5 minutes vs detailed 1 minute, and high-resolution metric StorageResolution=1.
- When a question mentions cross-account or cross-Region, immediately think KMS key policies plus bucket/role trust policies, organization CloudTrail trails, and CloudWatch cross-account observability.
Study guide FAQ
How is the DOP-C02 exam structured and scored?
It is 75 questions (multiple choice and multiple response) over 180 minutes, scored on a scaled range of 100 to 1000 with a passing score of 750. The six domains are weighted: SDLC Automation carries the most (22%), followed by Configuration Management and IaC and Security and Compliance (17% each), Resilient Cloud Solutions and Monitoring and Logging (15% each), and Incident and Event Response (14%).
Should I take the Associate exams before attempting DOP-C02?
AWS no longer requires prerequisites, but the Professional exam assumes the depth of the SysOps Administrator and Developer Associate exams. Most candidates do best with two or more years of hands-on AWS experience plus comfort with CI/CD, IaC, and operations before sitting this exam.
How much CloudFormation and CDK do I really need to know?
A lot. Domain 2 expects you to know stack limits, nested stacks, change sets, drift detection, DeletionPolicy and UpdateReplacePolicy, Fn::ImportValue cross-stack references, StackSets concurrency controls, and how to recover a stack stuck in UPDATE_ROLLBACK_FAILED. Be able to reason about when an update forces resource replacement.
What is the most common reason people fail this exam?
Picking a technically correct answer that ignores the qualifier. Several options will work, but only one is the most cost-effective, lowest-overhead, or fastest-recovery choice the scenario demands. Practice distinguishing between similar options (SQS standard vs FIFO queues, warm standby vs pilot light, basic vs detailed monitoring) on those exact dimensions.