Google Cloud Professional Data Engineer Practice Exam Questions

What the Google Cloud Professional Data Engineer exam covers

- Designing Data Processing Systems: 149 questions
- Ingesting and Processing the Data: 189 questions
- Storing the Data: 151 questions
- Preparing and Using Data for Analysis: 143 questions
- Maintaining and Automating Data Workloads: 173 questions

Free Google Cloud Professional Data Engineer sample questions

A sample of 10 questions with answers and explanations. Sign up free to practice all 805.

Question 1: Designing Data Processing Systems

Which service is a serverless, petabyte-scale data warehouse for analytics?
- A. BigQuery (Correct)
- B. Cloud SQL
- C. Bigtable
- D. Cloud Spanner
✓ Correct answer: A

BigQuery is Google Cloud's serverless, petabyte-scale data warehouse specifically designed for analytics workloads. It provides fully managed SQL query capabilities with automatic scaling and no infrastructure management. BigQuery uses columnar storage and parallel execution to achieve fast analytical queries on massive datasets.

Why the other options are wrong
- B. Cloud SQL is a managed relational database for transactional (OLTP) workloads, not a serverless petabyte-scale analytics warehouse.
- C. Bigtable is a wide-column NoSQL store for high-throughput key lookups, not a SQL analytics warehouse.
- D. Cloud Spanner is a globally distributed relational database optimized for transactional consistency, not petabyte-scale analytical queries.

Question 2: Storing the Data

A Cloud Storage lifecycle policy is configured to delete objects from the Coldline class after 45 days to save money, but the bill shows unexpected early-deletion charges. What explains this?
- A. Lifecycle deletions always trigger network egress charges, which appear on the bill as unexpected early-deletion line items
- B. Coldline requires Object Versioning to be enabled, and the retained noncurrent versions are what is being billed extra
- C. Coldline has a 90-day minimum storage duration, so deleting at 45 days incurs early-deletion charges for the remaining days (Correct)
- D. Lifecycle rules cannot delete Coldline objects at all, so the objects persist and continue accruing storage charges
✓ Correct answer: C

Cloud Storage applies a minimum storage duration to its colder storage classes: 30 days for Nearline, 90 days for Coldline, and 365 days for Archive. When a lifecycle rule deletes a Coldline object at 45 days, short of the 90-day minimum, Cloud Storage bills the remaining 45 days as an early-deletion charge. To avoid this, set the lifecycle delete rule for Coldline objects to trigger at or after 90 days.

Why the other options are wrong
- A. Deletion does not incur egress; the charge is an early-deletion fee tied to Coldline's minimum storage duration, not network traffic.
- B. Coldline does not require Object Versioning, and versioning is unrelated to the minimum-duration early-deletion charge seen here.
- D. Lifecycle rules can and do delete Coldline objects; the deletion itself is what triggers the minimum-duration fee.
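
The arithmetic behind the early-deletion charge can be sketched in a few lines. This is only an illustration of the day count, using the minimum durations quoted above; the function name `early_deletion_days` is hypothetical and the sketch does not compute any price.

```python
# Minimum storage durations in days, as quoted in the explanation above.
MIN_STORAGE_DAYS = {"NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def early_deletion_days(storage_class: str, age_days: int) -> int:
    """Return how many days are billed as an early-deletion charge.

    Classes that are not listed (for example Standard) are treated as
    having no minimum storage duration.
    """
    minimum = MIN_STORAGE_DAYS.get(storage_class.upper(), 0)
    return max(0, minimum - age_days)


print(early_deletion_days("COLDLINE", 45))  # 45 days still billed
print(early_deletion_days("COLDLINE", 90))  # 0, no early-deletion charge
```

Deleting at or after the minimum duration brings the billed remainder to zero, which is why moving the rule to 90 days removes the charge.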

Question 3: Preparing and Using Data for Analysis (Select all that apply)

A query uses SELECT * on a large table and is expensive. Besides selecting fewer columns, which TWO actions reduce on-demand bytes scanned for repeated analytics? (Choose TWO)
- A. Cast the partition column inside the WHERE clause
- B. Use a wildcard table query across all date shards
- C. Create a materialized view that precomputes the needed aggregation (Correct)
- D. Add clustering on the most frequently filtered columns (Correct)
✓ Correct answer: C, D

A materialized view precomputes and stores aggregation or join results, so queries against it read far fewer bytes than scanning the full base table. Clustering physically co-locates rows that share the same clustering column values into adjacent storage blocks, allowing BigQuery to skip entire blocks when a filter on a clustering column is present. Both techniques directly reduce the volume of data scanned on re-execution of analytical queries, lowering on-demand costs.

Why the other options are wrong
- A. Casting the partition column inside the WHERE clause defeats partition pruning by preventing the query planner from evaluating the filter at planning time, which increases bytes scanned rather than reducing them.
- B. Using a wildcard table query across all date shards forces BigQuery to scan every shard matching the wildcard pattern, typically scanning more data than a properly partitioned table with a direct filter.
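
Why co-locating rows lets an engine skip blocks can be seen in a toy model. The sketch below is not BigQuery's actual storage layout or pruning logic; it assumes only that each block records the minimum and maximum value of the filtered column, and the helper `blocks_scanned` is hypothetical.

```python
def blocks_scanned(values: list[str], block_size: int, target: str) -> int:
    """Count blocks whose [min, max] range could contain the target value.

    A block is skipped only when its min/max metadata proves that the
    target cannot be present.
    """
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return sum(1 for block in blocks if min(block) <= target <= max(block))


# 100 rows with the region values interleaved, as in an unclustered table.
regions = ["apac", "emea", "latam", "na"] * 25

unclustered = blocks_scanned(regions, 10, "emea")
clustered = blocks_scanned(sorted(regions), 10, "emea")  # rows co-located by region

print(unclustered, clustered)  # 10 3
```

In the interleaved layout every block may contain the target, so nothing can be skipped. After sorting, only the three blocks that overlap the "emea" range need to be read.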

Question 4: Storing the Data

A Bigtable cluster serving a time-series IoT app shows severe read/write hotspotting on a few nodes while others are idle. Row keys are currently 'sensorTimestamp' (timestamp prefix). What change best fixes the hotspot?
- A. Increase the Bigtable garbage-collection max-versions setting so old cell versions are retained longer
- B. Redesign row keys to avoid monotonically increasing prefixes, e.g., field-promote/salt with sensorID before the timestamp (Correct)
- C. Switch the cluster from SSD to HDD storage so writes are spread across cheaper slower disks
- D. Add more column families to the table so the write load is spread across more families
✓ Correct answer: B

Bigtable stores rows in lexicographic order by row key and splits the table into contiguous key ranges (tablets) that it distributes across nodes. When row keys begin with a monotonically increasing timestamp, all new writes land on the tablet holding the latest key range, causing a write hotspot. The fix is to place a high-cardinality prefix, such as the sensorID, before the timestamp so that writes are distributed across many tablets. Field promotion and salting are both valid strategies to achieve this distribution.

Why the other options are wrong
- A. Raising max-versions changes how many cell versions are kept and does nothing to spread a timestamp-prefixed hotspot across nodes.
- C. Moving from SSD to HDD only lowers throughput and cost; it does not fix a row-key hotspot concentrated on a few nodes.
- D. Extra column families do not redistribute rows across nodes, so a monotonically increasing key still hotspots the same tablet.
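
The effect of key order on write distribution can be simulated without a Bigtable cluster. The sketch below assumes a `#` separator, zero-padded timestamps, and hand-picked split points standing in for tablet boundaries; all names and values are hypothetical, and real tablet splits are managed by Bigtable itself.

```python
import bisect
from collections import Counter


def timestamp_first(sensor_id: str, ts: int) -> str:
    """Row key with the monotonically increasing timestamp as the prefix."""
    return f"{ts:010d}#{sensor_id}"


def sensor_first(sensor_id: str, ts: int) -> str:
    """Row key with the sensor ID promoted in front of the timestamp."""
    return f"{sensor_id}#{ts:010d}"


def writes_per_range(keys: list[str], split_points: list[str]) -> Counter:
    """Assign each key to a contiguous sorted key range, mimicking tablets."""
    return Counter(bisect.bisect_right(split_points, key) for key in keys)


sensors = [f"sensor-{i:03d}" for i in range(8)]
# 40 new writes: 8 sensors reporting at 5 consecutive timestamps.
new_writes = [(s, 1_700_000_100 + t) for t in range(5) for s in sensors]

# Assumed boundaries derived from older data, giving four key ranges each.
ts_splits = ["1700000025", "1700000050", "1700000075"]
id_splits = ["sensor-002", "sensor-004", "sensor-006"]

hot = writes_per_range([timestamp_first(s, ts) for s, ts in new_writes], ts_splits)
spread = writes_per_range([sensor_first(s, ts) for s, ts in new_writes], id_splits)

print(dict(hot))     # {3: 40}
print(dict(spread))  # {0: 10, 1: 10, 2: 10, 3: 10}
```

With the timestamp first, every new write sorts after the last boundary and lands in the final range. With the sensor ID first, the same writes fall evenly across all four ranges.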

Question 5: Designing Data Processing Systems

After migrating an enterprise data warehouse to BigQuery, each regional analyst must see only the rows for their own region within a shared table, without maintaining separate copies. Which BigQuery feature enforces this in the design?
- A. Column-level access control that masks sensitive fields based on the querying user's policy tags
- B. Row-level security policies that filter rows based on the querying user's identity (Correct)
- C. Authorized views that expose a per-region SELECT query while hiding the base table from analysts
- D. Partitioning the table by region so each analyst queries only their own partition directly
✓ Correct answer: B

BigQuery row-level security (RLS) policies let you attach filter predicates to a table that are transparently applied whenever a user queries it. When a policy is granted to a user or group, BigQuery automatically restricts the rows they can see to those matching the policy's filter (for example, on a region column), with no separate table copies required. This is the canonical feature for per-user row filtering within a single shared table.

Why the other options are wrong
- A. Column-level access with policy tags restricts which columns a user can read, not which rows, so it cannot limit an analyst to their own region's records.
- C. Authorized views require a separate view per region and manual grants, which does not enforce per-user row filtering within one shared table.
- D. Partitioning is a storage and cost-pruning mechanism, not a security control, and does nothing to prevent an analyst from querying another region's partition.
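
A row access policy is created with a DDL statement of the form `CREATE ROW ACCESS POLICY ... ON ... GRANT TO (...) FILTER USING (...)`. The sketch below only builds that statement as text; the policy name, table `analytics.sales`, group address, and `region` column are all hypothetical, and the helper does no escaping, so it should be fed trusted constants only.

```python
def row_access_policy_ddl(policy: str, table: str, grantee: str, region: str) -> str:
    """Build a BigQuery row access policy statement for one region.

    All arguments are interpolated verbatim, so they must be trusted
    constants rather than user input.
    """
    return (
        f"CREATE ROW ACCESS POLICY {policy}\n"
        f"ON `{table}`\n"
        f"GRANT TO ('{grantee}')\n"
        f"FILTER USING (region = '{region}');"
    )


ddl = row_access_policy_ddl(
    policy="emea_only",                         # hypothetical policy name
    table="analytics.sales",                    # hypothetical dataset.table
    grantee="group:emea-analysts@example.com",  # hypothetical group
    region="EMEA",
)
print(ddl)
```

One such policy per regional group keeps a single shared table while each group sees only the rows its filter allows.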

Question 6: Ingesting and Processing the Data (Select all that apply)

You are planning the networking and security for a Dataflow pipeline that must read from a Cloud SQL instance and write to BigQuery without exposing data to the public internet. Which TWO design choices align with best practices? (Choose TWO)
- A. Run the Dataflow workers with no external IP addresses inside the VPC (Correct)
- B. Use Private Google Access (and private connectivity to Cloud SQL) so workers reach Google APIs and the database privately (Correct)
- C. Assign a public IP to every worker and allow 0.0.0.0/0 ingress
- D. Disable VPC firewall rules entirely to simplify connectivity
✓ Correct answer: A, B

Removing external IP addresses from Dataflow workers (running them as internal-only VMs) eliminates a public internet attack surface. Private Google Access enables those workers to reach Google APIs, including the BigQuery API, over Google's internal network without traversing the public internet. Cloud SQL can be reached via Private IP or Cloud SQL Auth Proxy within the same VPC, so the entire data path from source to sink stays within private networking.

Why the other options are wrong
- C. Assigning a public IP to every worker and allowing 0.0.0.0/0 ingress exposes each worker VM directly to the internet, creating a large attack surface that violates the requirement of keeping data off the public internet.
- D. Disabling VPC firewall rules entirely removes all network-level controls and allows unrestricted ingress and egress on the worker subnet, which reopens the public internet exposure the design is specifically meant to prevent.
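
For a pipeline written with the Apache Beam Python SDK, the internal-only worker setup is typically expressed through pipeline options. The sketch below only assembles the flags; the project, region, and subnetwork names are hypothetical, and Private Google Access is assumed to be enabled on the subnet itself, since it is a subnet setting rather than a pipeline option.

```python
def private_worker_flags(project: str, region: str, subnetwork: str) -> list[str]:
    """Return Dataflow pipeline flags for workers without external IPs."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        # Workers are placed in this subnet of the VPC.
        f"--subnetwork=regions/{region}/subnetworks/{subnetwork}",
        # Workers get internal IP addresses only.
        "--no_use_public_ips",
    ]


flags = private_worker_flags("my-project", "us-central1", "dataflow-private")
print(" ".join(flags))
```

The resulting list can be passed to the pipeline's option parser or appended to the launch command.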

Question 7: Ingesting and Processing the Data

A team wants Cloud Build to automatically rebuild and stage a Dataflow Flex Template image whenever code is merged to the main branch, then notify the deployment workflow. Which trigger configuration is appropriate?
- A. A Cloud Build trigger on the repository's main branch push that runs build steps to build the image, push to Artifact Registry, and build the Flex Template spec (Correct)
- B. A manual Cloud Build run that a developer is expected to remember to start by hand after every merge into the main branch of the repository
- C. A Cloud Scheduler job that triggers a Cloud Build image rebuild every hour regardless of whether any code was actually merged to main
- D. A BigQuery scheduled query configured to rebuild the container image and stage the Flex Template spec whenever it runs on its cadence
✓ Correct answer: A

Cloud Build supports repository triggers that fire automatically on a push to a specified branch. The build steps can build the Docker container image, push it to Artifact Registry with a versioned tag, and run gcloud dataflow flex-template build to generate and store the updated Flex Template spec file, making the pipeline fully automated. A downstream notification step or Pub/Sub message can then inform the deployment workflow that a new template is ready.

Why the other options are wrong
- B. A manual run depends on a person remembering after each merge, so it is not the automatic, merge-driven rebuild the requirement calls for.
- C. An hourly rebuild wastes builds and does not correlate to merges, rebuilding even when nothing changed and lagging real merges.
- D. A BigQuery scheduled query only runs SQL and cannot build container images or Flex Template specs.
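
The three build steps described above could look roughly like the following build config, generated here as JSON (Cloud Build accepts JSON as well as YAML config files). This is a sketch under assumptions: the Artifact Registry path, bucket, template name, and `metadata.json` file are hypothetical, and the branch trigger itself is configured separately on the repository.

```python
import json

# Hypothetical image and spec locations; $PROJECT_ID and $SHORT_SHA are
# substitutions that Cloud Build fills in for triggered builds.
IMAGE = "us-central1-docker.pkg.dev/$PROJECT_ID/pipelines/my-flex-template:$SHORT_SHA"
SPEC = "gs://my-templates-bucket/my-flex-template.json"

build_config = {
    "steps": [
        # 1. Build the container image from the repository root.
        {"name": "gcr.io/cloud-builders/docker", "args": ["build", "-t", IMAGE, "."]},
        # 2. Push the tagged image to Artifact Registry.
        {"name": "gcr.io/cloud-builders/docker", "args": ["push", IMAGE]},
        # 3. Generate the Flex Template spec file and store it in Cloud Storage.
        {
            "name": "gcr.io/google.com/cloudsdktool/cloud-sdk",
            "entrypoint": "gcloud",
            "args": [
                "dataflow", "flex-template", "build", SPEC,
                "--image", IMAGE,
                "--sdk-language", "PYTHON",
                "--metadata-file", "metadata.json",
            ],
        },
    ],
}

print(json.dumps(build_config, indent=2))
```

Tagging the image with the commit SHA keeps each template spec tied to the exact code that produced it.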

Question 8: Storing the Data

A Dataplex lake spans Cloud Storage and BigQuery. The operations team wants to be alerted automatically whenever a scheduled data-quality scan on the curated zone fails (for example, when null-customer-ID checks exceed a threshold). What is the recommended way to wire this alerting?
- A. Publish Dataplex data-quality scan results to Cloud Logging/Pub/Sub and create a Cloud Monitoring or log-based alert that notifies on failures (Correct)
- B. Have an on-call engineer manually open the Dataplex console each morning to read the latest data-quality scan results and raise an incident by hand
- C. Configure a Cloud Storage lifecycle rule that automatically deletes the rows that fail the data-quality checks so the failing records simply disappear
- D. Rely on the existing BigQuery slot reservation utilization alerts to indirectly detect when a curated-zone data-quality scan has failed its checks
✓ Correct answer: A

Dataplex data-quality scan runs emit results as structured log entries in Cloud Logging and can publish events to Pub/Sub. By creating a log-based metric or a Cloud Monitoring alerting policy that watches for scan failure events or rules exceeding failure thresholds, the operations team receives automated notifications via email, PagerDuty, Slack, or other channels. This pattern follows the Google Cloud observability model and requires no custom polling code.

Why the other options are wrong
- B. Manually checking the console each morning is not automatic alerting and can miss failures for hours.
- C. A lifecycle rule deletes objects by age and cannot target failing rows, and deleting data hides problems rather than alerting on them.
- D. Slot reservation utilization measures compute usage and has no relationship to data-quality scan pass/fail results.

Question 9: Maintaining and Automating Data Workloads

A daily orchestrated pipeline must load files from Cloud Storage, run a Dataflow Flex Template, then run a BigQuery transformation, with per-step retries, dependency ordering, alerting on failure, and historical backfills. Which Google Cloud service is purpose-built for this?
- A. Cloud Scheduler firing three independent HTTP targets on staggered timers that assume each prior step has finished before the next starts
- B. Cloud Composer (managed Apache Airflow) authoring the steps as a dependency-ordered DAG with retries and alerting (Correct)
- C. A single chained BigQuery scheduled query that calls the Flex Template and load in sequence using SQL scripting statements
- D. A Pub/Sub topic with three subscribers that each react to the same message and run their step in parallel without ordering
✓ Correct answer: B

Cloud Composer is Google Cloud's managed Apache Airflow service, purpose-built for multi-step pipeline orchestration. It natively supports dependency ordering between heterogeneous tasks (Cloud Storage, Dataflow, BigQuery), configurable retries per task, alerting callbacks on failure, and the Airflow backfill command for re-running historical intervals, all without custom glue code.

Why the other options are wrong
- A. Cloud Scheduler on independent timers cannot enforce true step dependencies, per-step retries, or backfills; it only assumes ordering by timing.
- C. A BigQuery scheduled query cannot natively launch a Dataflow Flex Template or orchestrate GCS loads with dependency ordering and backfills.
- D. Pub/Sub fan-out runs subscribers in parallel with no ordering guarantee, so it cannot sequence load then Dataflow then BigQuery.
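
What "dependency ordering with per-step retries" means can be shown with a toy runner built on the Python standard library. This is not Airflow or Composer code; the task names, the `run_pipeline` helper, and the simulated transient failure are all hypothetical, and a real DAG would express the same shape with Airflow operators.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks: dict, dependencies: dict, max_retries: int = 2) -> list:
    """Run tasks in dependency order, retrying each failed task.

    Returns a log of (task name, attempt number that succeeded).
    """
    log = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt))
                break
            except Exception:
                if attempt == max_retries + 1:
                    raise  # retries exhausted: surface the failure


    return log


calls = {"dataflow": 0}


def flaky_dataflow() -> None:
    """Simulated step that fails once before succeeding."""
    calls["dataflow"] += 1
    if calls["dataflow"] == 1:
        raise RuntimeError("transient failure")


tasks = {
    "load_files": lambda: None,
    "run_dataflow": flaky_dataflow,
    "bq_transform": lambda: None,
}
# Each key lists the tasks that must finish before it starts.
dependencies = {"run_dataflow": {"load_files"}, "bq_transform": {"run_dataflow"}}

log = run_pipeline(tasks, dependencies)
print(log)  # [('load_files', 1), ('run_dataflow', 2), ('bq_transform', 1)]
```

The downstream step runs only after its upstream step has succeeded, including after a retry, which is the guarantee that staggered timers or unordered fan-out cannot give.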

Question 10: Ingesting and Processing the Data

Analysts need an interactive, visual data-preparation tool to explore messy datasets and define cleaning/transformation 'recipes' that Google then executes as managed pipelines, without writing Beam or SQL by hand. Which Google Cloud service is designed for this?
- A. Cloud Dataprep (by Trifacta), an interactive visual data-wrangling and preparation tool (Correct)
- B. Dataproc Jupyter notebooks running hand-written PySpark cells to clean and transform each dataset
- C. BigQuery BI Engine, an in-memory acceleration layer that speeds up interactive dashboard queries
- D. Cloud Composer DAGs authored in Python to orchestrate the cleaning and transformation steps on a schedule
✓ Correct answer: A

Cloud Dataprep (by Trifacta) lets analysts visually profile and explore data, then build transformation 'recipes' through a point-and-click interface; the recipes are executed as managed pipelines (on Dataflow) without the user writing Beam or SQL. It targets exactly the non-coder, exploratory data-preparation workflow described. The other options require coding (PySpark, Python DAGs) or serve a different purpose (BI Engine accelerates queries, not data prep).

Why the other options are wrong
- B. PySpark notebooks require writing code by hand, which is precisely what the analysts want to avoid with a visual recipe tool.
- C. BI Engine accelerates query response times and has no data-wrangling or transformation-recipe capability.
- D. Composer orchestrates pipelines but offers no interactive visual interface for defining cleaning recipes.

Google Cloud Professional Data Engineer practice exam FAQ

How many questions are in the Google Cloud Professional Data Engineer practice exam on CertGrid?

CertGrid has 805 practice questions for Google Cloud Professional Data Engineer, covering 5 exam domains. The real Google Cloud Professional Data Engineer exam has 50 to 60 questions in 120 minutes. CertGrid's timed mock is a fixed 50 questions.

What is the passing score for Google Cloud Professional Data Engineer?

Google does not publish a fixed passing score for this exam; CertGrid uses readiness scoring for practice. You have about 120 minutes to complete the exam. CertGrid tracks your readiness against the exam objectives so you know where to focus.

Are these official Google Cloud Professional Data Engineer exam questions?

No. CertGrid is an independent practice platform. We do not provide real or leaked exam questions. Our questions are original and designed to help you practice the concepts, scenarios, and difficulty style of the Google Cloud Professional Data Engineer exam.

Can I practice Google Cloud Professional Data Engineer for free?

Yes. You can start practicing Google Cloud Professional Data Engineer for free with daily practice and sample questions. Paid plans unlock full timed exams, complete explanations, and domain analytics.

What CertGrid is (and is not)

CertGrid is an independent IT certification practice platform for Azure, AWS, Google, Cisco, Security, Linux, Kubernetes, Terraform, and other certification tracks. It provides objective-mapped practice questions, readiness scoring, weak-domain drills, and explanations to help learners understand what to study next.

Independent & original. CertGrid is an independent practice platform and is not affiliated with or endorsed by Google. Questions are original practice items designed to mirror certification concepts and exam style. CertGrid does not provide official exam questions or braindumps.