Google Cloud Professional Data Engineer Study Guide
The Google Cloud Professional Data Engineer exam validates your ability to design data processing systems, build and operationalize batch and streaming pipelines, store and prepare data for analysis, and enable machine learning on Google Cloud. It targets practitioners who design, build, secure, and maintain data systems. The exam runs 120 minutes; Google reports only a pass or fail result and does not publish an official passing score, though roughly 70% correct is a commonly cited unofficial target. Expect heavily scenario-based questions that ask you to pick the most cost-effective, scalable, and operationally sound service for a given access pattern, latency, and consistency requirement.
Domain 1: Designing Data Processing Systems
- Choose a data store primarily by access pattern (analytical OLAP vs operational/transactional) and by latency/throughput needs; structured relational data favors Cloud SQL or Spanner, columnar analytics favor BigQuery, and high-throughput key-based access favors Bigtable.
- Cloud Bigtable delivers single-digit-millisecond latency at high QPS and scales horizontally to petabytes, making it the choice for time-series, IoT, and other key-based operational workloads.
- Cloud Spanner is the only Google database that gives global horizontal write scaling with strong (external) consistency via TrueTime; a multi-region config provides zero RPO and up to 99.999% availability.
- BigQuery is a serverless, columnar, petabyte-scale warehouse for analytics; it separates storage from compute and charges on-demand by bytes scanned or by reserved slots.
- Use streaming (real-time) processing when sub-second to second-level latency is required; Pub/Sub plus Dataflow is the canonical pattern for continuous event processing.
- Bigtable row key design that promotes a high-cardinality field and reverses the timestamp (e.g., sensorId#reverseTimestamp) spreads writes across tablets and avoids hotspots from sequential keys.
- Reduce Dataproc cost with ephemeral, job-scoped clusters plus autoscaling, and add Spot VMs (the successor to preemptible VMs) as secondary workers while keeping standard primary workers; Spot pricing can cut the compute cost of those secondary workers by roughly 60-90%.
- Dataproc Serverless for Spark runs on-demand Spark batches with no cluster to provision, autoscale, or tear down.
- Spanner and Bigtable both support autoscaling that adjusts compute/node count to actual load, which reduces over-provisioning.
- Mix BigQuery pricing models for cost: on-demand for ad hoc spiky workloads, and slot reservations (Editions) for steady, predictable reporting workloads.
- For interactive dashboards, BI Engine reservations plus materialized views lower both latency and cost on recurring queries.
- In Dataflow, use event-time windowing with watermarks and allowed lateness to correctly attribute and handle late-arriving events.
- Achieve exactly-once semantics with Pub/Sub to Dataflow using the BigQuery Storage Write API in exactly-once mode, or windowed stateful deduplication keyed on a unique event ID within the arrival window.
- BigLake tables governed by Dataplex provide unified governance and fine-grained access over data stored in Cloud Storage while querying it in place.
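The row key guidance above (sensorId#reverseTimestamp) can be illustrated with a minimal sketch. It assumes epoch-millisecond timestamps and a 13-digit zero-padded reversal; `make_row_key` and `MAX_TS_MILLIS` are hypothetical names chosen for illustration and are not part of any Bigtable client library.

```python
# Illustrative sketch only: make_row_key and MAX_TS_MILLIS are hypothetical
# names, not part of any Bigtable client library.

# Assumed ceiling for epoch-millisecond timestamps (13 digits).
MAX_TS_MILLIS = 10**13 - 1


def make_row_key(sensor_id: str, event_ts_millis: int) -> str:
    """Build a key shaped like sensorId#reverseTimestamp."""
    if not 0 <= event_ts_millis <= MAX_TS_MILLIS:
        raise ValueError("timestamp outside the assumed range")
    # Subtracting from a fixed ceiling makes newer events sort first.
    reversed_ts = MAX_TS_MILLIS - event_ts_millis
    # Zero-pad so lexicographic order matches numeric order.
    return f"{sensor_id}#{reversed_ts:013d}"


if __name__ == "__main__":
    older = make_row_key("sensor-42", 1_700_000_000_000)
    newer = make_row_key("sensor-42", 1_700_000_060_000)
    # Sorting the keys places the newer reading ahead of the older one.
    print(sorted([older, newer]))
```

Leading with the high-cardinality sensor ID spreads writes from different sensors across the key space, while the reversed timestamp keeps the most recent readings for one sensor at the start of a prefix scan.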
Domain 2: Ingesting and Processing the Data
- Pub/Sub is the managed, scalable messaging service for high-volume event ingestion with at-least-once delivery; it decouples producers from consumers in event-driven pipelines.
- Dataflow (managed Apache Beam) provides unified batch and streaming processing with autoscaling workers and exactly-once processing semantics.
- Configure a Pub/Sub dead-letter topic with --dead-letter-topic and --max-delivery-attempts so messages that exceed max delivery attempts are routed aside for inspection instead of blocking the subscription.
- Enable Pub/Sub message ordering and publish with an ordering key when consumers require in-order delivery of related messages.
- Windowing groups unbounded streaming data into finite windows for aggregation; fixed (tumbling), sliding (hopping), and session windows are the main Beam strategies, alongside the default single global window.
- Datastream is the managed change data capture (CDC) service that replicates changes from sources such as MySQL, PostgreSQL, Oracle, and SQL Server into Google Cloud for real-time sync.
- Columnar file formats Parquet and ORC are best for analytical scans; Avro is row-based but schema-rich and good for streaming/serialization and schema evolution.
- The BigQuery Storage Write API offers higher throughput and lower cost than legacy streaming inserts; switch to batch load jobs (free) when latency tolerance is hours rather than seconds.
- Dataflow Streaming Engine offloads window state and shuffle to a managed backend, enabling smaller workers and better autoscaling; Dataflow Shuffle does the same for batch jobs.
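- Control Dataflow cost by bounding autoscaling to match steady load: cap the worker count with --max-workers on gcloud dataflow commands (the equivalent Beam pipeline option is --max_num_workers in Python or --maxNumWorkers in Java) and, where the job type supports it, set a minimum worker count as well.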
- Pub/Sub Lite (provisioned-capacity messaging) was historically the cheapest option for predictable, fixed-rate streams, but it is deprecated: it has been closed to new customers since September 2024 and is scheduled to be turned down on March 18, 2026. Migrate predictable workloads to Pub/Sub or Google Cloud Managed Service for Apache Kafka.
- Set a Pub/Sub subscription acknowledgement deadline up to 600 seconds with gcloud pubsub subscriptions create --ack-deadline=600 for long-running message processing.
- Create a Pub/Sub BigQuery subscription that writes directly to a table with --bigquery-table and --use-topic-schema, avoiding a separate Dataflow job for simple ingestion.
- Launch a parameterized Dataflow Flex Template with gcloud dataflow flex-template run using --template-file-gcs-location and --parameters.
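To make the three windowing strategies listed above concrete, here is a plain-Python sketch of how an event timestamp maps to fixed, sliding, and session windows. It does not use the Apache Beam SDK; the function names are hypothetical, timestamps are assumed to be integer seconds, and windows are half-open intervals [start, end).

```python
# Plain-Python illustration of window assignment. This does not use the
# Apache Beam SDK; all function names here are hypothetical.
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) in seconds


def fixed_window(ts: int, size: int) -> Window:
    """Return the single tumbling window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """Return every hopping window of length size, started every period, that contains ts."""
    windows = []
    start = ts - ts % period  # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Group timestamps into sessions that close after gap seconds of inactivity."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event falls inside the open session: extend its end.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions


if __name__ == "__main__":
    print(fixed_window(125, 60))
    print(sliding_windows(125, 60, 30))
    print(session_windows([10, 20, 100, 105], gap=30))
```

Note that one event belongs to exactly one fixed window, to several overlapping sliding windows, and to a session whose boundaries depend on neighboring events.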
Domain 3: Storing the Data
- Cloud Storage is the object store for data lakes holding raw or unstructured data of any format, with unlimited scale and tight integration with BigQuery and Dataflow.
- Cloud SQL is managed MySQL, PostgreSQL, and SQL Server for regional transactional applications; add read replicas to scale reads and offload reporting from the primary.
- Cloud Spanner provides global scale with strong consistency and horizontal write scaling; choose it when a single relational system must span regions with zero downtime schema changes.
- Bigtable is a wide-column NoSQL store for low-latency, high-throughput operational and time-series workloads; BigQuery is the OLAP SQL warehouse - know this pairing cold.
- Avoid sequential or monotonically increasing Bigtable row keys (timestamps, auto-increment IDs) because they cause hotspots on a single tablet; use field promotion and salting/reversal instead.
- Cloud Storage storage classes by access frequency: Standard (frequent access), Nearline (accessed less than about once a month), Coldline (less than about once a quarter), and Archive (less than about once a year); use the Archive class to minimize at-rest cost for data accessed annually or less.
- Nearline has a 30-day minimum storage duration, Coldline a 90-day minimum, and Archive a 365-day minimum; deleting objects earlier triggers early-deletion charges.
- Object Lifecycle Management rules with SetStorageClass automatically transition objects to cheaper classes by age, and Autoclass moves objects between classes automatically based on actual access patterns.
- Consolidating many tiny objects into fewer larger files significantly cuts per-object operation (Class A/B) costs.
- BigLake and BigQuery external tables let you query Cloud Storage files in place without loading them into BigQuery storage.
- Memorystore for Redis or Memcached is the managed in-memory cache for frequently read data and low-latency lookups.
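- Create a regional Spanner instance with gcloud spanner instances create --config=regional-us-central1 --nodes=2 (the command also takes an instance ID and a --description).
- Apply a Cloud Storage lifecycle policy with gsutil lifecycle set config.json gs://my-bucket, or with the newer equivalent gcloud storage buckets update gs://my-bucket --lifecycle-file=config.json.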
- Create a Cloud SQL read replica with gcloud sql instances create --master-instance-name, and a Bigtable SSD instance with gcloud bigtable instances create --cluster-config and --cluster-storage-type=SSD.
- Bigtable autoscaling adjusts node count to actual CPU and storage utilization, balancing performance against cost without manual resizing.
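The lifecycle bullets above refer to a config.json file containing SetStorageClass rules. The sketch below generates such a file's contents in Python. The age thresholds (30, 90, and 365 days) are assumptions picked to mirror the minimum storage durations, not recommended values, and `build_lifecycle_config` is a hypothetical helper.

```python
# Sketch of generating a Cloud Storage lifecycle configuration.
# build_lifecycle_config is a hypothetical helper; the thresholds used in
# the example are illustrative assumptions.
import json
from typing import Dict


def build_lifecycle_config(transitions: Dict[int, str]) -> dict:
    """Map {age_in_days: storage_class} to a lifecycle config dictionary."""
    rules = []
    for age_days, storage_class in sorted(transitions.items()):
        rules.append(
            {
                "action": {"type": "SetStorageClass", "storageClass": storage_class},
                "condition": {"age": age_days},
            }
        )
    return {"rule": rules}


if __name__ == "__main__":
    config = build_lifecycle_config({30: "NEARLINE", 90: "COLDLINE", 365: "ARCHIVE"})
    # Save this output as config.json before applying it to a bucket.
    print(json.dumps(config, indent=2))
```

Where access patterns are unpredictable, Autoclass may be a simpler choice than hand-tuned age rules.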
Domain 4: Preparing and Using Data for Analysis
- On-demand BigQuery pricing is driven by bytes scanned; partitioning and clustering tables prune data before scanning and are the primary levers to cut query cost.
- Partition tables by a date/timestamp or integer-range column so filters skip irrelevant partitions; cluster on frequently filtered/joined columns (e.g., partition by event_date and cluster by customer_id) to physically co-locate rows.
- Selecting only the columns you need (never SELECT *) reduces bytes read because BigQuery uses columnar storage.
- Materialized views precompute and store query results/aggregations, returning faster and cheaper answers for repeated queries and refreshing incrementally.
- BigQuery query results caching returns cached results at no charge when the query text and underlying data are unchanged.
- BI Engine accelerates interactive dashboards with in-memory analysis and caching, dramatically lowering dashboard query latency.
- BigQuery ML (BQML) lets analysts create, train, evaluate, and invoke models directly in SQL with CREATE MODEL and ML.PREDICT, avoiding data export for common model types.
- Restrict access without exposing base tables using authorized views, row-level security (row access policies), and column-level security (policy tags enforced via Dataplex/Data Catalog).
- External (federated) and BigLake tables query Cloud Storage data (Parquet, Avro, ORC, CSV, JSON) in place, avoiding an ETL load step for exploration.
- BigQuery time travel lets you query or restore table data from a recent retention window (default 7 days, configurable 2-7 days).
- Set a maximum bytes billed limit on a query to prevent runaway cost from an accidental full-table scan.
- A baseline slot commitment (BigQuery Editions reservation) delivers flat, predictable cost for steady analytics workloads versus variable on-demand billing.
- Clustering both sides of a join on the join key reduces shuffle by co-locating sorted blocks, speeding up large joins.
- Reduce storage cost with long-term storage pricing (tables or partitions that have not been modified for 90 consecutive days are automatically billed roughly 50% less; querying them does not reset the clock) and with partition expiration to drop old partitions; use partition decorators for incremental loads that target only the new partition.
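The effect of partition pruning and column selection on on-demand cost can be worked through with simple arithmetic. In the sketch below, the price per TiB is a placeholder assumption (check current pricing for your region), columns are assumed to be equal in size, and all function names are hypothetical.

```python
# Back-of-the-envelope estimate of on-demand bytes scanned and cost.
# All names are hypothetical; the price is an assumed placeholder and the
# equal-size-column model is a simplification.
GIB = 2**30
TIB = 2**40

ASSUMED_PRICE_PER_TIB = 6.25  # placeholder, not an authoritative price


def scanned_bytes(partition_bytes: int, partitions_read: int,
                  columns_read: int, total_columns: int) -> float:
    """Bytes scanned when reading some partitions and some equal-size columns."""
    return partition_bytes * partitions_read * columns_read / total_columns


def on_demand_cost(bytes_scanned: float, price_per_tib: float) -> float:
    """Cost of a query billed purely by bytes scanned."""
    return bytes_scanned / TIB * price_per_tib


def within_limit(bytes_scanned: float, maximum_bytes_billed: int) -> bool:
    """Mimic a maximum-bytes-billed guard: True if the query would be allowed."""
    return bytes_scanned <= maximum_bytes_billed


if __name__ == "__main__":
    # One year of 10 GiB daily partitions with 20 columns.
    full = scanned_bytes(10 * GIB, 365, 20, 20)    # SELECT * with no date filter
    pruned = scanned_bytes(10 * GIB, 7, 3, 20)     # 7 days, 3 columns
    print(f"full scan:   {full / GIB:.1f} GiB")
    print(f"pruned scan: {pruned / GIB:.1f} GiB")
    print(f"cost ratio:  {full / pruned:.0f}x")
    print("pruned allowed under 50 GiB cap:", within_limit(pruned, 50 * GIB))
```

The model ignores clustering, minimum billing increments, and cached results, so treat it as a way to reason about relative savings rather than to predict an invoice.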
Domain 5: Maintaining and Automating Data Workloads
- Cloud Composer is the managed Apache Airflow service that orchestrates DAG-based pipelines in Python, with retries and task dependencies and deep integrations to BigQuery, Dataflow, and Cloud Storage.
- Dataflow templates (classic and Flex) package pipeline code as reusable, parameterized artifacts in Cloud Storage so operators can launch jobs without source access, schedulable from Cloud Scheduler or Composer.
- Dataform manages SQL-based ELT transformation pipelines inside BigQuery with version control, dependencies, and testing.
- BigQuery scheduled queries automate recurring SQL; create one with bq mk --transfer_config --data_source=scheduled_query --params.
- Cloud Monitoring collects pipeline metrics (latency, throughput, error rates) and drives alerting policies, while Cloud Logging captures structured worker logs for troubleshooting.
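- Apply least privilege with dataset-scoped roles such as roles/bigquery.dataViewer and roles/bigquery.dataEditor rather than broad project-level grants. Note that gcloud projects add-iam-policy-binding --member --role binds a role at the project level; grant dataset-level access through the dataset's own access controls instead, for example with a BigQuery GRANT statement on the dataset.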
- Data governance combines Dataplex and Data Catalog (discovery, tagging, lineage, policy tags) with IAM for least-privilege resource access.
- Protect BigQuery data with customer-managed encryption keys (CMEK) configured through Cloud KMS when you need control over key rotation and access beyond Google-managed keys.
- For predictable BigQuery cost use capacity-based pricing with slot reservations; a committed baseline plus an autoscaling max balances guaranteed capacity against efficiency.
- Cloud Billing budgets with alerts plus detailed billing export to BigQuery give per-team spend visibility and proactive alerting.
- The BigQuery Storage Read API provides high-throughput parallel reads for large exports; reads are billed by bytes read under the API's own pricing rather than as on-demand query charges.
- Right-size Cloud Composer environments and offload simple scheduling to native triggers (Cloud Scheduler, Pub/Sub, Eventarc) to reduce always-on orchestration cost.
- Create a Composer 2 environment with gcloud composer environments create --location --image-version, then deploy DAGs with gcloud composer environments storage dags import.
- Export a Cloud SQL database to a SQL dump in Cloud Storage with gcloud sql export sql for backups and migrations.
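The orchestration ideas above (DAG-ordered tasks with dependencies and retries) can be sketched without Airflow using the standard library's `graphlib` module (Python 3.9+). This is a conceptual illustration, not how Cloud Composer is implemented; `run_pipeline` and the task names are hypothetical.

```python
# Conceptual sketch of DAG ordering plus retries using only the standard
# library. This is not Airflow or Cloud Composer code; run_pipeline and the
# task names are hypothetical.
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, Tuple


def run_pipeline(
    dependencies: Dict[str, Set[str]],
    tasks: Dict[str, Callable[[], None]],
    max_retries: int = 2,
) -> Tuple[List[str], Dict[str, int]]:
    """Run tasks in dependency order, retrying each up to max_retries times."""
    # dependencies maps each task to the set of tasks it must wait for.
    order = list(TopologicalSorter(dependencies).static_order())
    attempts: Dict[str, int] = {}
    for name in order:
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                tasks[name]()
                break  # task succeeded, move to the next one
            except Exception:
                if attempt == max_retries + 1:
                    raise  # retries exhausted, fail the pipeline
    return order, attempts


if __name__ == "__main__":
    deps = {"load": {"extract"}, "transform": {"load"}, "report": {"transform"}}
    noop = lambda: None
    order, attempts = run_pipeline(
        deps, {"extract": noop, "load": noop, "transform": noop, "report": noop}
    )
    print(order)
    print(attempts)
```

A real scheduler would add backoff between retries, parallel execution of independent tasks, and persistent state, which is much of what a managed orchestrator provides.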
Google Cloud Professional Data Engineer exam tips
- Read each scenario for the qualifying constraints first - words like global, strongly consistent, sub-millisecond, serverless, cheapest, and least operational overhead usually map to exactly one service and eliminate the distractors.
- Memorize the storage decision tree: BigQuery for analytics/OLAP, Bigtable for high-throughput key/time-series, Spanner for global relational with strong consistency, Cloud SQL for regional relational, Cloud Storage for unstructured/data-lake, Memorystore for caching.
- When a question asks how to lower BigQuery cost, default to partitioning, clustering, selecting fewer columns, materialized views, result caching, and slot reservations versus on-demand - and watch for maximum-bytes-billed limits.
- For streaming questions, know the Pub/Sub-to-Dataflow-to-BigQuery pattern, windowing types (fixed, sliding, session), watermarks and allowed lateness, dead-letter topics, and the Storage Write API exactly-once mode.
- Eliminate answers that violate best practices: sequential Bigtable row keys, always-on Dataproc clusters, broad project-level IAM grants, and SELECT * are almost always wrong choices.
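As a memorization aid, the storage decision tree from the tips above can be written as a small function. The parameter names and the order of the checks are assumptions made for illustration; real scenarios often involve trade-offs this sketch does not capture.

```python
# Memorization aid only: a simplified encoding of the storage decision tree.
# choose_store and its parameters are hypothetical; the check order is an
# assumption, not official guidance.
def choose_store(
    caching: bool = False,
    unstructured: bool = False,
    analytical: bool = False,
    key_based_high_throughput: bool = False,
    relational: bool = False,
    global_strong_consistency: bool = False,
) -> str:
    if caching:
        return "Memorystore"
    if unstructured:
        return "Cloud Storage"
    if analytical:
        return "BigQuery"
    if key_based_high_throughput:
        return "Bigtable"
    if relational and global_strong_consistency:
        return "Spanner"
    if relational:
        return "Cloud SQL"
    raise ValueError("no matching workload characteristic given")


if __name__ == "__main__":
    print(choose_store(analytical=True))
    print(choose_store(relational=True, global_strong_consistency=True))
    print(choose_store(key_based_high_throughput=True))
```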
Study guide FAQ
How long is the exam and what score do I need to pass?
You have 120 minutes to answer roughly 50-60 multiple-choice and multiple-select questions. Google reports a pass or fail result and does not publish an official passing score; about 70% correct is a commonly cited unofficial target. There is no penalty for wrong answers, so answer every question.
How much hands-on Google Cloud experience does Google recommend before taking it?
Google recommends roughly 3 or more years of industry experience including 1 or more years designing and managing solutions on Google Cloud. The exam is deeply scenario-based, so practical familiarity with BigQuery, Dataflow, Pub/Sub, Bigtable, and Dataproc matters far more than rote memorization.
Do I need to know machine learning and BigQuery ML in depth?
You need a working understanding rather than deep ML expertise. Know when to use BigQuery ML (SQL-based models like linear/logistic regression and forecasting), Vertex AI for custom and managed ML, and pre-trained APIs. Expect questions on choosing the right tool and on preventing issues like overfitting and data leakage rather than deriving algorithms.
How current is the exam, and does it still cover legacy services?
The exam tracks current Google Cloud services and naming, so expect Dataplex, BigLake, Datastream, Dataform, the BigQuery Storage Write/Read APIs, and BigQuery Editions slot reservations. Some legacy names and services have been replaced: Data Studio is now Looker Studio, and legacy streaming inserts are superseded by the Storage Write API. Favor the modern service in answers unless a question explicitly constrains otherwise.