Google Cloud Professional Cloud DevOps Engineer Study Guide
The Google Cloud Professional Cloud DevOps Engineer exam validates your ability to apply Site Reliability Engineering (SRE) principles and to build, deploy, and operate reliable services on Google Cloud using CI/CD pipelines, observability, and incident management. It is aimed at DevOps and SRE practitioners who balance feature velocity against reliability through error budgets, automation, and Google Cloud's Operations suite. The exam is 2 hours and contains roughly 50-60 questions; Google reports the result as pass or fail and does not publish a passing score.
Domain 1: Bootstrapping a Cloud Development Environment
- An error budget equals 1 minus the SLO: a 99.9% availability SLO yields a 0.1% error budget, the allowable unreliability the team may 'spend' over a rolling window before slowing risky releases.
- Toil is work that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth; SRE teams aim to keep toil below 50% of their time.
- When the error budget is exhausted, the policy is to freeze risky feature releases and redirect engineering effort to reliability work until the budget recovers.
- Set the SLO to the lowest level that keeps users happy; each additional 'nine' of reliability dramatically increases cost and effort, and 100% reliability is unrealistic.
- Shared ownership of reliability between developers and SREs aligns incentives so teams build services that are operable, observable, and debuggable, avoiding 'throw it over the wall' handoffs.
- Surplus error budget can be deliberately spent on calculated risk such as faster rollouts, chaos experiments, or planned maintenance.
- Cloud Shell is a free, browser-based, on-demand VM with the Cloud SDK (gcloud, kubectl, etc.) preinstalled; only the 5 GB $HOME directory persists and the VM terminates after about 20 minutes of inactivity.
- Cloud Workstations provides managed, container-based, reproducible and isolated development environments defined by a workstation configuration, suitable for teams needing consistent IDE setups.
- gcloud config set project PROJECT_ID sets the active project; gcloud config set compute/region us-central1 sets the default compute region for the active configuration.
- gcloud auth login performs an interactive browser OAuth flow for the CLI, while gcloud auth application-default login sets up Application Default Credentials (ADC) for local application code.
- Named CLI configurations are managed with gcloud config configurations create and gcloud config configurations list, letting you switch between project/account/region profiles.
- gcloud services enable run.googleapis.com enables an API (here the Cloud Run Admin API) before the corresponding service can be used.
- The recommended environment topology is a separate GCP project per environment (dev/staging/prod) under a folder hierarchy, governed centrally by organization policies.
- Keyless authentication is preferred: use Workload Identity Federation for CI/CD systems and Application Default Credentials for local development instead of long-lived service account key files.
- Grant elevated roles using IAM Conditions for time-bounded access or just-in-time elevation rather than permanent broad grants, following least privilege.
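The error budget arithmetic in the list above can be made concrete with a short calculation. This is a minimal sketch, not an official formula library: the function names are hypothetical, and the 30-day default window is an assumption chosen for illustration.

```python
def error_budget(slo: float) -> float:
    """Return the error budget as a fraction: 1 minus the SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the budget permits over the window.

    The 30-day default is an illustrative assumption, not a mandated value.
    """
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = error_budget(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(0.999):.1f} minutes of downtime allowed")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of budget remaining")
```

Under these assumptions, a 99.9% SLO over 30 days leaves roughly 43 minutes of total outage, which is why each extra 'nine' is so expensive: 99.99% leaves only about 4.3 minutes.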
Domain 2: Building Applications
- Cloud Build is Google Cloud's fully managed CI/CD platform; it executes ordered steps defined under the steps: field in cloudbuild.yaml, runs tests, builds Docker images, and can trigger deployments via Cloud Deploy.
- Artifact Registry is the recommended managed registry for container images and language packages (Maven, npm, Python); it supports per-repository IAM and automated vulnerability scanning, superseding the deprecated Container Registry.
- gcloud builds submit --tag builds an image and pushes it to Artifact Registry; gcloud artifacts repositories create --repository-format=docker creates a Docker repo; gcloud auth configure-docker configures Docker auth to Artifact Registry.
- A canary deployment routes a small slice (typically 1-5%) of real traffic to the new version to observe error rate, latency, and business metrics before a full rollout.
- A blue/green deployment keeps two identical environments and switches all traffic at once, enabling near-instant rollback by flipping back to the previous environment.
- Binary Authorization is a deploy-time control for GKE and Cloud Run that blocks images lacking valid cryptographic attestations from trusted authorities, enforcing a trusted supply chain.
- SLSA-aligned build provenance combined with Binary Authorization attestations created during the Cloud Build pipeline proves images passed required checks before deployment.
- Infrastructure as Code with Terraform or Google's Infrastructure Manager (managed Terraform) manages GCP resources declaratively, enabling peer-reviewed, reproducible, versioned environment changes.
- Application code should use Application Default Credentials via an attached service account or Workload Identity rather than embedding credentials, so secrets never live in the image.
- Secret Manager stores application secrets and is referenced at runtime; Memorystore for Redis is the managed choice for low-latency session and hot-data caching.
- Firestore is the serverless, autoscaling document database well suited to serverless and event-driven applications.
- Optimize build speed and cost with build caching (reuse unchanged steps and cache layers via Kaniko or the Cloud Build cache), parallelize independent steps, and order test stages so the fastest tests run first and fail early.
- Configure Artifact Registry cleanup policies that automatically delete stale image versions or keep only the most recent ones, controlling storage cost and clutter.
- Externalizing configuration (twelve-factor style) and containerizing the application make builds reproducible, testable, and portable across environments.
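The canary strategy in the list above depends on comparing the new version against the current one before widening the rollout. The following is a minimal sketch of such a gate, assuming request counts, error counts, and p95 latency have already been collected for both versions. The `WindowStats` type, the `canary_verdict` function, and every threshold are hypothetical illustrations, not part of Cloud Deploy or any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics observed for one version over the same time window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: WindowStats,
    canary: WindowStats,
    max_error_rate_delta: float = 0.005,  # assumed tolerance, tune per service
    max_latency_ratio: float = 1.2,       # assumed tolerance, tune per service
    min_requests: int = 500,              # assumed minimum sample size
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary."""
    # Too little canary traffic to judge: keep observing.
    if canary.requests < min_requests:
        return "wait"
    # Canary fails noticeably more often than the baseline.
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    # Canary is noticeably slower than the baseline.
    if baseline.p95_latency_ms > 0 and (
        canary.p95_latency_ms / baseline.p95_latency_ms > max_latency_ratio
    ):
        return "rollback"
    return "promote"


if __name__ == "__main__":
    stable = WindowStats(requests=100_000, errors=100, p95_latency_ms=180.0)
    new = WindowStats(requests=2_000, errors=3, p95_latency_ms=190.0)
    print(canary_verdict(stable, new))
```

Comparing against a live baseline rather than a fixed threshold keeps the gate meaningful when overall traffic conditions shift during the rollout.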
Domain 3: Deploying Applications
- An SLI is a measured indicator (e.g., the fraction of requests served successfully under 200ms) and an SLO is the target set on that SLI (e.g., 99.9% of requests succeed); the SLI is the measurement and the SLO is the goal.
- The best user-facing SLIs are request success rate and latency as experienced by the user, measured at the serving edge or load balancer, not internal-only signals.
- Burn-rate alerting fires when the error budget is consumed faster than sustainable; multi-window, multi-burn-rate alerts (a fast-burning short window plus a slower long window) reduce false alarms while catching fast degradation early.
- Cloud Run runs stateless containers serverlessly, scales to zero, and supports first-class traffic splitting across revisions by percentage.
- gcloud run deploy --source builds and deploys directly from source; gcloud run deploy --allow-unauthenticated --region deploys a public service; gcloud run services update-traffic --to-revisions splits traffic to a named revision.
- Set Cloud Run ingress to 'internal' to restrict a service to traffic from within the VPC and from internal load balancers; a Serverless VPC Access connector controls outbound (egress) traffic from the service into the VPC, not ingress.
- Google Kubernetes Engine (GKE) manages containerized workloads with autoscaling, self-healing, and rolling updates; Autopilot mode further offloads node management.
- A Cloud Build trigger connected to a repository (e.g., GitHub, Cloud Source Repositories) automatically starts builds on Git push or pull request events.
- Cloud Monitoring provides time-series metrics, dashboards, and SLO-based alerting policies; Cloud Logging centralizes structured logs from GCP services and applications; Cloud Trace (with OpenTelemetry) captures distributed latency traces.
- The three observability pillars are metrics, logs, and traces; metrics show health over time, traces reveal latency across microservice calls, and logs provide detailed event context.
- Control logging and monitoring cost by adding log exclusion filters, raising the minimum application log level (drop DEBUG in prod), reducing metric cardinality by dropping unused labels, and keeping Cloud Logging retention short while routing logs through a sink to cheaper long-term storage.
- Alert on user-facing SLO and symptom signals (latency, error rate) rather than on every internal cause metric, to reduce alert fatigue and pages that are not actionable.
- Cloud Deploy is the managed continuous delivery service that orchestrates progressive promotion through targets (dev to staging to prod) with approvals and built-in rollback.
- Pair automated test gates that block failing builds with progressive delivery (canary or blue/green) and automated rollback to limit blast radius when a defect reaches production.
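The burn-rate alerting bullet above can be illustrated with a small calculation. Burn rate is the observed error ratio divided by the error budget, so a burn rate of 1 spends the budget exactly over the SLO period. This is a minimal sketch with hypothetical function names; the 14.4 threshold assumes a 30-day period and a policy of paging when 2% of the budget is spent within one hour, a commonly cited convention rather than a required value.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo)


def burn_rate_threshold(
    budget_fraction: float, window_hours: float, period_hours: float = 720.0
) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`.

    The 720-hour (30-day) period is an illustrative assumption.
    """
    return budget_fraction * period_hours / window_hours


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo: float,
    threshold: float = 14.4,  # assumed: 2% of a 30-day budget in 1 hour
) -> bool:
    """Page only if BOTH windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which avoids paging on an already-resolved spike.
    """
    return (
        burn_rate(long_window_error_ratio, slo) >= threshold
        and burn_rate(short_window_error_ratio, slo) >= threshold
    )


if __name__ == "__main__":
    # 2% errors over the last hour and 3% over the last five minutes.
    print(should_page(0.02, 0.03, slo=0.999))
```

A complete policy would typically pair this fast-burn check with a second, lower threshold evaluated over longer windows to catch slow degradation.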
Domain 4: Integrating Google Cloud Services
- Autoscaling automatically matches capacity to load: GKE Horizontal Pod Autoscaler scales pods on CPU or custom metrics, Cloud Run scales instances on request concurrency, and Managed Instance Groups autoscale Compute Engine VMs.
- The GKE Cluster Autoscaler adds nodes when pods cannot be scheduled and removes underutilized nodes after safely evicting pods, keeping the node pool right-sized.
- Improve GKE resilience by spreading pods across zones and nodes with pod anti-affinity in multi-zone clusters, and set resource requests and limits so the scheduler places pods well and prevents noisy neighbors.
- Memorystore for Redis is a fully managed in-memory cache; caching hot reads with a TTL aligned to data update frequency cuts database load and serves data with sub-millisecond latency.
- Cloud CDN caches static content at edge points of presence in front of the global HTTP(S) load balancer, reducing user latency and origin load; cache behavior is driven by Cache-Control headers.
- Cloud Pub/Sub is fully managed asynchronous messaging that decouples producers from consumers, smoothing traffic spikes; Cloud Tasks also provides reliable async task execution with per-task control.
- Secret Manager is the correct service for storing application secrets for runtime retrieval with IAM-controlled access and versioning.
- Apigee and API Gateway provide managed API gateways that authenticate, throttle, and monitor APIs in front of backend services.
- Cloud Functions (Cloud Run functions) run lightweight event-driven code triggered by Cloud Storage events, Pub/Sub messages, or HTTP; Cloud Scheduler runs cron jobs that invoke endpoints or publish to Pub/Sub.
- Serverless VPC Access connector or Direct VPC egress lets Cloud Run and Cloud Functions reach private resources inside a VPC, such as a Cloud SQL private IP or internal services.
- Choose cost-effective compute: Spot (preemptible) VMs fit fault-tolerant restartable batch jobs, while Committed Use Discounts (CUDs) reduce cost for steady, predictable always-on workloads.
- Cut Cloud Run cost by lowering min-instances toward zero during off-peak hours; cut Cloud Storage cost with Object Lifecycle Management rules that transition data to colder classes or delete it.
- Reduce BigQuery cost and scan volume by selecting only needed columns (avoid SELECT *) and partitioning and clustering tables so queries scan less data.
- Relieve a saturated Cloud SQL primary by adding read replicas for read-only traffic and a caching layer for hot repeated reads; smooth bursty load with a Pub/Sub queue feeding autoscaling Cloud Run workers.
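The caching advice in the list above (hot reads with a TTL in front of the database) follows the cache-aside pattern. The sketch below uses an in-memory dictionary as a stand-in for a managed cache such as Memorystore so that it runs without any external service. The `TTLCache` class and the `load` callback are hypothetical names, and a real deployment would call a Redis client instead.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for a managed cache; illustrative only."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, load: Callable[[str], Any]) -> Any:
        """Return the cached value, or call `load` on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: entry has not expired
        value = load(key)  # cache miss: fall through to the database
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    def slow_database_read(key: str) -> str:
        # Hypothetical placeholder for a Cloud SQL query.
        return f"row-for-{key}"

    cache = TTLCache(ttl_seconds=60)
    print(cache.get_or_load("user:42", slow_database_read))  # loads from source
    print(cache.get_or_load("user:42", slow_database_read))  # served from cache
```

Choosing the TTL is the main design decision: a value close to how often the underlying data changes keeps staleness bounded while still absorbing most repeated reads.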
Domain 5: Managing Deployed Applications
- A blameless postmortem documents the timeline and contributing factors of an incident and produces tracked action items without attributing fault, creating psychological safety so engineers report issues honestly.
- Recurrence is prevented only by completing owned, tracked postmortem action items with due dates; these fix the root cause, add earlier-detection monitoring, and automate remediation.
- Effective incident response assigns a clear Incident Commander plus defined roles (communications lead, scribe, operations responders) and follows a structured communication plan to avoid chaos.
- Pre-written, validated runbooks for known failure modes reduce mean time to recovery (MTTR) by encoding diagnosis, remediation, rollback, and escalation steps any on-call engineer can follow.
- During an active incident, restoring service availability takes priority over cost optimization; mitigate first (e.g., roll back, add capacity), then investigate root cause afterward.
- Roll back a Cloud Run regression by routing 100% of traffic to a previous known-good revision using gcloud run services update-traffic --to-revisions, since prior revisions are retained.
- Cloud Monitoring alerting policies trigger notifications through notification channels when metrics such as error rate exceed a threshold; gcloud beta monitoring channels create --channel-content-from-file creates a channel.
- Error Reporting automatically captures, deduplicates, and groups application exceptions with stack traces and notifies on new error types.
- gcloud logging read with a resource.type filter retrieves logs (e.g., Cloud Run logs); gcloud logging sinks create with a BigQuery destination routes logs to a dataset for analysis or retention.
- Define alert policies as code using the Terraform google_monitoring_alert_policy resource or gcloud alpha monitoring policies create for repeatable, version-controlled alerting.
- The recommended request-based SLI is the ratio of good requests to total valid requests, measured at the serving edge so that it reflects real user experience; for availability, 'good' means served successfully, and for latency, 'good' means served faster than a threshold.
- Multi-window, multi-burn-rate error budget alerts are the recommended SLO alerting design, balancing fast detection of severe burn against precision for slower degradation.
- Set up layered alerting: SLO and symptom alerts for reliability, resource-utilization alerts for capacity, plus budget and anomaly alerts via Cloud Billing budgets for cost.
- Route compliance and audit logs to a dedicated, access-controlled log sink (e.g., to BigQuery or Cloud Storage) and use exclusion filters to keep noisy non-essential logs out of the _Default bucket.
Google Cloud Professional Cloud DevOps Engineer exam tips
- Almost every question is a scenario that pits reliability against velocity; default to SRE doctrine (error budgets, SLOs based on user-facing SLIs, and toil reduction) rather than ad hoc heroics.
- Know the deployment strategies cold: canary (gradual percentage of real traffic), blue/green (instant switch and rollback), and rolling updates, and which GCP service implements each (Cloud Run traffic splitting, GKE, Cloud Deploy).
- Memorize the Operations suite mapping: Cloud Monitoring = metrics/alerts/dashboards, Cloud Logging = logs/sinks, Cloud Trace = distributed traces, Error Reporting = grouped exceptions, Cloud Profiler = performance.
- When a question asks for the 'most secure' or 'recommended' authentication, prefer keyless options: Workload Identity Federation for CI/CD and Application Default Credentials with attached service accounts, never long-lived JSON keys.
- For incident scenarios, pick blameless postmortems, defined incident roles, runbooks, and mitigate-then-fix; for alerting scenarios, pick multi-window multi-burn-rate SLO alerts over static threshold alerts on internal causes.
Study guide FAQ
How is the error budget calculated and what happens when it runs out?
The error budget is 1 minus the SLO, so a 99.9% availability SLO gives a 0.1% budget over a rolling window. When the budget is healthy, teams ship features aggressively; when it is exhausted, the policy is to freeze risky releases and prioritize reliability work until the budget recovers.
What is the difference between an SLI, an SLO, and an SLA?
An SLI is the measured indicator (e.g., percentage of requests succeeding under 200ms), an SLO is the internal target set on that SLI (e.g., 99.9%), and an SLA is the external contractual commitment to customers, usually set looser than the SLO so you can act on budget burn before breaching the contract.
How much hands-on Google Cloud experience does this exam assume?
Google recommends about three years of industry experience including at least one year managing solutions on Google Cloud. You should be comfortable with gcloud commands, Cloud Build, Cloud Deploy, GKE, Cloud Run, the Operations suite, and applying SRE practices, not just memorizing definitions.
Is this exam more about tools or about SRE concepts?
Both, but SRE concepts dominate the framing. Many questions describe a situation and ask for the practice that best balances reliability and velocity (error budgets, toil reduction, blameless postmortems), then expect you to choose the correct Google Cloud service or command to implement it.