Google Cloud Professional Cloud Architect Study Guide
The Google Cloud Professional Cloud Architect exam validates your ability to design, develop, and manage robust, secure, scalable, and dynamic cloud solutions on Google Cloud, balancing business and technical requirements. It is aimed at experienced cloud architects who translate stakeholder needs into reliable architectures and guide the technical implementation across compute, storage, networking, security, and operations. The 2-hour exam has roughly 50-60 questions, is scenario-heavy, often references case studies, and rewards trade-off thinking over rote service memorization. Google does not publish an official passing score; a scaled score of around 700 is the commonly cited threshold.
Domain 1: Designing and Planning a Cloud Solution Architecture
- A VPC network is a global resource; subnets are regional resources, each tied to one region with its own IP CIDR range. Resources across regions in the same VPC communicate privately without extra peering.
- Compute choice maps to how much infrastructure you want to manage: Compute Engine (full VM control), GKE (managed Kubernetes/containers), Cloud Run (stateless containers, scales to zero, billed for the resources used while handling requests), App Engine (fully managed platform).
- Cloud Storage holds unstructured objects (images, video, backups); Cloud SQL serves regional MySQL/PostgreSQL/SQL Server; Cloud Spanner provides horizontally scalable, globally consistent relational data; BigQuery is the serverless analytics data warehouse.
- Shared VPC designates a host project that owns the network and subnets, while service projects attach to it - network admins manage firewall rules, routes, and subnets centrally while teams deploy in their own projects. A common pattern is one host project with separate service projects per environment.
- VPC Network Peering requires a peering configuration on BOTH sides (vpc-a to vpc-b AND vpc-b to vpc-a) and does not allow overlapping IP ranges or transitive routing.
- Overlapping CIDR ranges cannot be routed directly between networks; you must re-IP one side or use Private Service Connect to expose specific endpoints.
- Memorystore is the managed in-memory cache supporting Redis and Memcached, used to offload repeated reads from databases and bring read latency down to the sub-millisecond range.
- Filestore provides managed NFS (POSIX-compatible shared file system) for multiple Compute Engine VMs or GKE pods needing concurrent shared file access.
- Pub/Sub is the global messaging service for decoupling producers from consumers, with at-least-once delivery by default and a 7-day default message retention.
- Sole-tenant nodes provide dedicated physical Compute Engine hardware for a single customer, used for compliance, licensing (BYOL), or isolation requirements.
- Use Migrate to Virtual Machines (formerly Migrate for Compute Engine) to lift-and-shift VMs from on-prem or other clouds into Compute Engine with minimal downtime.
- For latency-sensitive single-region users, a regional bucket close to users beats a multi-region bucket; multi-region buckets favor global availability and durability over single-region latency.
- Object Lifecycle Management can transition objects between storage classes automatically - for example Standard for 30 days then transition to Archive for long-term cold retention.
- Microservices architecture is favored when independent scaling, independent deployment, and fault isolation are priorities; a monolith is simpler when those are not required.
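The rule that peered VPC networks cannot have overlapping IP ranges is easy to check before any peering is created. The following is a minimal sketch using only the Python standard library; the subnet names and CIDR ranges are hypothetical and stand in for a real address plan.

```python
import ipaddress
from itertools import combinations


def find_overlaps(subnets: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of subnet names whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


# Hypothetical ranges for two VPC networks that a team wants to peer.
planned = {
    "vpc-a/us-central1": "10.0.0.0/20",
    "vpc-a/europe-west1": "10.0.16.0/20",
    "vpc-b/us-east1": "10.0.8.0/22",  # falls inside 10.0.0.0/20
    "vpc-b/asia-east1": "10.1.0.0/20",
}

for left, right in find_overlaps(planned):
    print(f"overlap: {left} <-> {right}")
```

Any pair reported here would have to be re-addressed, or exposed through Private Service Connect instead of peering.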
Domain 2: Managing and Provisioning a Solution Infrastructure
- A regional Managed Instance Group (MIG) spreads VMs across zones and, with an autoscaler, scales on CPU, load-balancing utilization, or custom Cloud Monitoring metrics; autohealing recreates failed instances using a health check.
- The Global external Application Load Balancer is a Layer 7 load balancer with a single anycast IP that routes users to the nearest healthy backend, supports URL-based routing, SSL termination, Cloud CDN, and Google Cloud Armor.
- Internal passthrough Network Load Balancer (Layer 4) handles internal TCP/UDP traffic; choose passthrough NLB for non-HTTP protocols and Application LB for HTTP(S) features.
- Infrastructure as Code via Terraform or Google's native Infrastructure Manager describes desired state declaratively, enabling version control, peer review, and repeatable, consistent provisioning.
- Private Google Access is a subnet-level setting letting VMs with only internal IPs reach Google APIs (Cloud Storage, BigQuery, Pub/Sub) over Google's private network.
- Cloud NAT gives private VMs (no external IP) outbound-only internet access for patches and third-party APIs; it is managed and does not allow unsolicited inbound connections.
- GKE Autopilot manages nodes, scaling, and node pools automatically and bills per pod resource request; GKE Standard gives you node pool control and node-level billing.
- Cloud Interconnect (Dedicated or Partner) provides high-bandwidth private connectivity to on-prem; Cloud VPN provides encrypted tunnels over the public internet (HA VPN offers a 99.99% SLA).
- Artifact Registry is the recommended store for container images and language packages on GCP, superseding the deprecated Container Registry.
- Private Service Connect lets consumers reach producer-published services (and Google APIs) privately via an internal IP, without VPC peering or exposing the producer network.
- A compact placement policy provisions VMs physically close together for low network latency (HPC, tightly coupled workloads); a spread policy maximizes fault isolation.
- Local SSD offers the highest IOPS/lowest latency block storage but is ephemeral (data lost on stop/terminate); use it for scratch/cache, not durable data.
- Reduce Cloud SQL primary load by adding a read-cache layer with Memorystore, adding read replicas for reporting, and offloading heavy analytics to BigQuery.
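- Configure the autoscaler's initialization (formerly cool-down) period so new VMs are not measured before they finish starting, rely on the stabilization period to avoid premature scale-in, and consider predictive autoscaling to pre-provision ahead of forecast demand.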
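To make the idea of target-based autoscaling with a stabilization window concrete, here is a simplified model. It is not Google's actual autoscaler algorithm: the class name, the integer-percent inputs, and the window measured in evaluation steps are all assumptions made for illustration.

```python
import math
from collections import deque


def recommended_size(current_size: int, utilization_pct: int, target_pct: int) -> int:
    """Size needed to bring average utilization back to the target."""
    return max(1, math.ceil(current_size * utilization_pct / target_pct))


class StabilizedAutoscaler:
    """Scales out immediately, but scales in only to the peak recent recommendation."""

    def __init__(self, target_pct: int, window: int) -> None:
        self.target_pct = target_pct
        self.recent = deque(maxlen=window)  # last `window` recommendations

    def step(self, current_size: int, utilization_pct: int) -> int:
        self.recent.append(
            recommended_size(current_size, utilization_pct, self.target_pct)
        )
        return max(self.recent)


# Hypothetical load: one CPU spike followed by a quiet period.
scaler = StabilizedAutoscaler(target_pct=60, window=3)
size, sizes = 4, []
for cpu_pct in [90, 30, 30, 30, 30]:
    size = scaler.step(size, cpu_pct)
    sizes.append(size)
print(sizes)
```

In this run the group grows to 6 on the spike and holds that size until the spike leaves the window, which is the smoothing effect a stabilization period is meant to provide.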
Domain 3: Designing for Security and Compliance
- Follow least privilege: grant predefined or custom IAM roles scoped to specific services/actions rather than broad basic roles (Owner/Editor/Viewer, formerly called primitive roles), and grant roles to Google Groups rather than individual users.
- Workloads should authenticate using service accounts with least-privilege roles, never user credentials, and avoid downloaded service account keys where possible.
- Workload Identity (for GKE) and Workload Identity Federation (for external workloads) let pods and external workloads call Google APIs using short-lived, auto-rotated credentials without exported service account keys; on GKE, bind the Kubernetes service account to the Google service account.
- VPC Service Controls create a security perimeter around managed services (BigQuery, Cloud Storage) to prevent data exfiltration - data cannot move outside the perimeter even with valid IAM credentials.
- CMEK via Cloud KMS gives you cryptographic control over data at rest - you can rotate, disable, or destroy keys independently of Google; CSEK lets you supply your own raw key material.
- Organization Policy constraints enforce guardrails org-wide, such as restricting resource locations, disabling external IPs (compute.vmExternalIpAccess), and limiting allowed machine types.
- Binary Authorization enforces that only trusted, attested (signed) container images can be deployed to GKE or Cloud Run.
- Google Cloud Armor attaches security policies to the external Application Load Balancer for DDoS protection, WAF rules, rate limiting, and edge IP/geo filtering.
- Secret Manager stores secrets (API keys, passwords) with versioning, IAM-based access control, and audit logging; add a version via 'gcloud secrets versions add'.
- Sensitive Data Protection (formerly Cloud DLP) discovers, classifies, and de-identifies sensitive data (PII, PCI) in storage and streams.
- IAM Conditions enable context-aware access, granting roles only when conditions (time, resource attribute, request IP) are met.
- Resource hierarchy is Organization > Folders > Projects > Resources; IAM policies are inherited downward and are additive (a deny needs an explicit IAM deny policy).
- Cloud Audit Logs capture Admin Activity (always on), Data Access, System Event, and Policy Denied logs; Data Access logs (except BigQuery) are disabled by default and must be enabled.
- An Organization Policy resource-location constraint restricts where new resources can be created to satisfy data residency and compliance requirements.
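The additive, downward inheritance of IAM allow policies can be modeled in a few lines. This sketch is a toy model, not the IAM API: the hierarchy, group names, and bindings are hypothetical, and deny policies and IAM Conditions are deliberately left out.

```python
from collections import defaultdict

# Hypothetical hierarchy: child -> parent (None marks the organization root).
PARENT = {
    "org/example": None,
    "folder/prod": "org/example",
    "project/payments": "folder/prod",
}

# Hypothetical allow bindings set directly on each node: member -> roles.
BINDINGS = {
    "org/example": {"group:auditors@example.com": {"roles/logging.viewer"}},
    "folder/prod": {"group:sre@example.com": {"roles/monitoring.viewer"}},
    "project/payments": {"group:sre@example.com": {"roles/container.developer"}},
}


def effective_roles(resource: str) -> dict[str, set[str]]:
    """Union the allow bindings on a resource and all of its ancestors."""
    merged: dict[str, set[str]] = defaultdict(set)
    node = resource
    while node is not None:
        for member, roles in BINDINGS.get(node, {}).items():
            merged[member] |= roles
        node = PARENT[node]
    return dict(merged)


for member, roles in sorted(effective_roles("project/payments").items()):
    print(member, sorted(roles))
```

Because the union only ever grows on the way down, a role granted at the organization or folder level cannot be removed by a project-level allow policy, which is why broad grants high in the hierarchy deserve extra scrutiny.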
Domain 4: Analyzing and Optimizing Processes
- Committed Use Discounts (CUDs) trade a 1- or 3-year commitment to resource usage for a lower price: up to about 57% off most resource types on a 3-year term. They are ideal for predictable, steady-state baseline workloads.
- Spot VMs (the successor to preemptible VMs) offer up to ~90% off but can be preempted with ~30 seconds notice; use them for fault-tolerant, interruptible batch, HPC, and data pipelines.
- Sustained Use Discounts apply automatically, with no commitment or action required, when a VM on an eligible machine family (for example N1 or N2, but not E2) runs for more than 25% of the billing month.
- A common cost pattern is to cover baseline demand with CUDs and absorb spikes with autoscaled on-demand instances; for batch, use Spot VMs / Dataflow FlexRS for worker pools.
- Recommender / Active Assist surfaces rightsizing and idle-resource recommendations - rightsize VMs down to a smaller machine type matching observed utilization.
- An SLI is a measured indicator (latency, availability, error rate); an SLO is a target/threshold set on an SLI; an SLA is the external contractual commitment with consequences.
- Reduce BigQuery bytes scanned (and cost) by partitioning tables (e.g., by date) and filtering on the partition column, clustering on high-cardinality filter columns, and selecting only the columns you need instead of using SELECT *.
- Choose BigQuery on-demand (per-TB scanned) for sporadic queries; choose BigQuery Editions with a committed slot (capacity) reservation for predictable, high-volume workloads.
- Persistent disks bill for provisioned capacity regardless of VM state (even stopped); delete or snapshot-and-delete unneeded disks to stop charges.
- Object Lifecycle Management rules transition objects to Nearline/Coldline/Archive after an age threshold (e.g., to Coldline/Archive after 30 days) to cut storage cost on infrequently accessed data.
- Cloud CDN caches content at edge PoPs near users, reducing latency and origin egress cost for cacheable static content.
- Cloud Billing budgets with threshold alerts and API/service quotas provide proactive spend guardrails; budgets alert but do not automatically cap spend.
- An error budget (derived from the SLO) quantifies acceptable unreliability and balances reliability investment against feature velocity - spend it on releases and slow down when it burns too fast.
- Nearline targets ~monthly access (30-day minimum), Coldline ~quarterly (90-day minimum), and Archive yearly/long-term (365-day minimum); each adds retrieval fees and early-deletion charges.
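The minimum storage durations above translate directly into early-deletion charges. The sketch below encodes only the durations stated in this section; the class-selection rule is a rough rule of thumb assumed for illustration, and it ignores retrieval fees and actual per-GB prices.

```python
# Minimum storage durations in days, taken from the storage-class bullet above.
MIN_DURATION_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def billable_days(storage_class: str, days_stored: int) -> int:
    """Days charged for an object that is deleted after `days_stored` days."""
    return max(days_stored, MIN_DURATION_DAYS[storage_class])


def class_for_access_interval(days_between_reads: int) -> str:
    """Pick the coldest class whose minimum duration fits the access interval."""
    best = "STANDARD"
    for name, minimum in MIN_DURATION_DAYS.items():  # ordered warm -> cold
        if days_between_reads >= minimum:
            best = name
    return best


print(billable_days("NEARLINE", 10))   # deleted early, still billed for the minimum
print(class_for_access_interval(100))  # roughly quarterly access
```

An object deleted from Nearline after 10 days is still billed for 30, so lifecycle rules that move short-lived data into cold classes can raise costs rather than lower them.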
Domain 5: Managing Implementation
- Cloud Build is Google Cloud's serverless, fully managed CI/CD service that runs build steps from a cloudbuild.yaml to compile code, run tests, build images, push to Artifact Registry, and deploy to Cloud Run, GKE, or App Engine.
- Cloud Build triggers fire on repository events - create one bound to a repo and branch (e.g., --branch-pattern=^main$) to automate builds on push to main.
- Cloud Deploy is the managed continuous delivery service for progressive releases (with promotion, canary strategies, and required manual approvals) to GKE and Cloud Run.
- Speed up pipelines with build caching - cache dependencies/layers via Cloud Storage or use the Kaniko cache for Docker layer reuse.
- Canary deployment shifts a small percentage of traffic to a new version, validates with metrics, then progressively rolls forward or automatically rolls back on regression.
- Cloud Run achieves canary releases through revision traffic splitting - assign weighted percentages across revisions; the Application LB URL map can also weight traffic for canaries.
- Grant the Cloud Build service account only the specific roles it needs (e.g., Kubernetes Engine Developer to deploy to GKE) rather than weakening cluster security or over-granting.
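- Use 'terraform plan' to detect configuration drift, then reconcile it: run 'terraform apply' to restore the declared configuration, or update the code to accept the change. Use 'terraform import' to bring resources that were created outside Terraform into state.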
- A blue-green deployment runs two identical environments and switches all traffic at once for instant cutover and rollback; canary shifts traffic gradually.
- Policy-as-code (OPA, HashiCorp Sentinel, or custom build steps) validates infrastructure in the pipeline, blocking oversized machine types, missing autoscaling, or policy violations before production.
- Improve CI/CD security and repeatability with least-privilege service accounts and pipeline-as-code stored in version control, plus automated load/performance tests in staging.
- Use Migrate to Virtual Machines for VM lift-and-shift migrations and Migrate to Containers to modernize VM workloads into GKE containers.
- App Engine and Cloud Run support gradual traffic migration and one-command rollback to a previous version/revision.
- A Cloud Deploy target can require manual approval before a release is promoted to it (e.g., from staging to production), which enforces change control.
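The canary pattern described in this domain (shift a little traffic, validate, then roll forward or back) can be sketched as a small control loop. This is an illustrative model only: the function name, the step percentages, and the error-rate callback are hypothetical and do not correspond to any Cloud Deploy or Cloud Run API.

```python
from typing import Callable


def canary_rollout(
    steps: list[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> dict[str, int]:
    """Walk through canary traffic percentages; roll back on the first regression."""
    current = 0
    for percent in steps:
        current = percent
        if error_rate_at(percent) > max_error_rate:
            return {"stable": 100, "canary": 0}  # rollback: all traffic to stable
    return {"stable": 100 - current, "canary": current}


def healthy(_percent: int) -> float:
    """Hypothetical metric source: the new version stays healthy."""
    return 0.001


def regresses_at_half(percent: int) -> float:
    """Hypothetical metric source: errors spike once the canary takes 50%."""
    return 0.001 if percent < 50 else 0.05


print(canary_rollout([5, 25, 50, 100], healthy, max_error_rate=0.01))
print(canary_rollout([5, 25, 50, 100], regresses_at_half, max_error_rate=0.01))
```

A blue-green cutover is the degenerate case with a single step of 100, which is why it gives instant switchover but no gradual validation.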
Domain 6: Ensuring Solution and Operations Reliability
- Eliminate zonal single points of failure with a regional MIG distributing instances across multiple zones behind a load balancer; add the MIG as a backend to a backend service with health checks.
- Protect against full regional outages by replicating data and deploying capacity across multiple regions with cross-region failover - a multi-zone design alone does not survive a regional outage.
- Cloud SQL high availability (regional) keeps a synchronous standby in another zone in the same region with automatic failover; it does NOT protect against a full regional outage - use cross-region replicas for that.
- Health checks with autohealing continuously probe instances and automatically recreate VMs that fail, reducing mean time to recovery without manual intervention.
- Take regular, automated disk snapshots (stored redundantly) to recover from data loss; schedule them with a resource policy (e.g., daily snapshot schedule with a retention period).
- Set SLOs from user needs and define SLIs measured on percentiles (p95/p99 latency), not just averages, because averages hide tail latency that real users experience.
- Configure Cloud Monitoring alerting policies on relevant metrics/SLOs; SLO burn-rate alerts fire when the error budget is being consumed too quickly.
- A 99.9% availability SLO over a 30-day month allows about 43 minutes of downtime; the corresponding error budget gates risky changes when depleted.
- DR strategies trade cost against RTO/RPO: backup-and-restore is cheapest but has the highest RTO; warm standby costs more with lower RTO; hot/active-active is most expensive with near-zero RTO.
- Cloud Trace provides distributed tracing to find latency across microservice call paths; Cloud Profiler gives low-overhead, continuous CPU/memory profiling to locate code hotspots.
- Predictive autoscaling pre-provisions capacity ahead of forecasted demand and scales down afterward, avoiding cold-start latency during predictable spikes.
- Use the four golden signals (latency, traffic, errors, saturation) and the SRE error-budget model to balance reliability investment against feature work without over-engineering.
- Cloud Monitoring, Cloud Logging, Error Reporting, Cloud Trace, and Cloud Profiler together form Google Cloud Observability (formerly the Cloud operations suite, and before that Stackdriver).
- Multi-region Spanner configurations and multi-region Cloud Storage/BigQuery locations provide built-in cross-region redundancy; regional services require explicit cross-region replication to survive a regional outage.
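The error-budget arithmetic in this domain is worth working through once by hand. The sketch below reproduces the 99.9% over 30 days figure and adds the standard SRE burn-rate ratio; the function names and the sample error ratio are assumptions for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1 - slo)


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Days a full budget lasts if the current burn rate is sustained."""
    return window_days / rate


budget = error_budget_minutes(0.999)        # 99.9% over a 30-day window
rate = burn_rate(observed_error_ratio=0.005, slo=0.999)

print(f"budget: {budget:.1f} min")
print(f"burn rate: {rate:.1f}x")
print(f"exhausted in: {days_until_exhausted(rate):.1f} days")
```

A burn rate of 1 spends the budget exactly over the window; a burn-rate alert fires when the ratio stays above a chosen multiple for long enough to threaten the SLO.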
Google Cloud Professional Cloud Architect exam tips
- Read scenario questions for the dominant constraint - cost, latency, compliance/data residency, RTO/RPO, or operational overhead - then eliminate options that violate it before comparing the rest.
- Know the compute decision tree cold: Compute Engine vs GKE vs Cloud Run vs App Engine, driven by how much infrastructure the team wants to manage and the workload/scaling model.
- Default to managed and serverless services (Cloud Run, BigQuery, Pub/Sub, Cloud SQL) when the scenario emphasizes minimizing operational overhead, unless a hard requirement forces otherwise.
- Distinguish regional vs multi-region resilience: Cloud SQL HA and regional MIGs survive a zone failure but NOT a region failure - cross-region replication is required for regional DR.
- Memorize the cost-optimization ladder: Spot VMs (interruptible batch), CUDs (steady baseline), Sustained Use Discounts (automatic), rightsizing via Recommender, and storage lifecycle/class transitions.
Study guide FAQ
How many questions are on the exam and what score do I need to pass?
The exam runs 2 hours and contains roughly 50-60 multiple-choice and multiple-select questions, including some tied to provided case studies. Google does not publish an official pass mark, but the score is scaled and the commonly cited passing threshold is around 700.
Do I need to memorize the official case studies before the exam?
Yes. Google publishes case studies (such as EHR Healthcare, Helicopter Racing League, Mountkirk Games, and TerramEarth) and a portion of exam questions reference them. Study each company's business and technical requirements ahead of time so you do not waste exam minutes re-reading them, and practice mapping their requirements to GCP services.
How technical is this exam compared to the Associate Cloud Engineer?
It is more architectural and trade-off oriented than hands-on. You still must recognize gcloud commands, Terraform snippets, and service limits, but most questions test choosing the best design given competing business and technical constraints rather than step-by-step implementation. Real-world architecture experience helps significantly.
How long is the certification valid and how do I renew it?
The Professional Cloud Architect certification is valid for two years. To renew, you retake the current version of the exam (Google typically opens a renewal window starting about 60 days before expiration), since GCP services and best practices evolve continuously.