DP-203: Azure Data Engineer Associate Study Guide
DP-203 validates the skills of an Azure Data Engineer Associate: designing and implementing data storage, building batch and streaming data processing pipelines, and securing, monitoring, and optimizing data solutions on Azure. It targets data professionals working with Azure Synapse Analytics, Data Factory, Azure Databricks, Stream Analytics, Data Lake Storage Gen2, and related services. You have 120 minutes and need a scaled score of 700 to pass.
Domain 1: Design and Implement Data Storage
- Azure Data Lake Storage Gen2 is built on Blob Storage with a hierarchical namespace enabled; this namespace enables true directories and atomic, efficient folder operations, and is required for analytics workloads. StorageV2 (general-purpose v2) is the account type that supports it.
- In a Synapse dedicated SQL pool a table is split across exactly 60 distributions. Hash distribution spreads rows by a chosen column, round-robin spreads evenly with no key, and replicate copies the full table to every distribution (best for small dimension tables under ~2 GB).
- Choose a hash-distribution key that is high-cardinality, evenly distributed, frequently used in joins/GROUP BY, and not updated; this minimizes data skew and avoids costly data movement during joins.
- If you create a table in a dedicated SQL pool without specifying an index, it defaults to a clustered columnstore index (CCI), which gives high compression and fast analytical scans over large tables.
- Run ALTER INDEX REORGANIZE on a clustered columnstore index to compress open/closed rowgroups and improve scan performance; use REBUILD for heavier fragmentation. Small rowgroups (under ~100K rows) hurt columnstore efficiency.
- Partition large fact tables by a date column (for example monthly boundaries) to enable partition elimination and efficient partition switching; avoid over-partitioning because each of the 60 distributions further splits the data.
- A Type 2 slowly changing dimension preserves history by inserting a new row with a new surrogate key plus EffectiveStartDate, EffectiveEndDate, and an IsCurrent flag; Type 1 overwrites without keeping history.
- Delta Lake provides ACID transactions and time travel through a transaction log (Delta Log) of ordered, atomic JSON commits; you query historical versions with VERSION AS OF or TIMESTAMP AS OF.
- Parquet is a columnar format with built-in schema metadata and per-row-group min/max statistics, enabling column pruning and row-group skipping; for streaming append-heavy ingestion Avro (row-based) is often preferred, then converted to Parquet for analytics.
- Serverless SQL pools cannot create persistent managed tables; they query external data (Parquet, CSV, Delta) using schema-on-read views and lake databases. Persistent storage requires a dedicated SQL pool or external storage with a metadata layer.
- To query external files from a SQL pool you create three objects: an EXTERNAL DATA SOURCE (location), an EXTERNAL FILE FORMAT (for example FORMAT_TYPE = PARQUET), and a DATABASE SCOPED CREDENTIAL, then CREATE EXTERNAL TABLE.
- Load data into a dedicated SQL pool with the COPY statement (preferred, simplest) or PolyBase, staging files in ADLS Gen2 or Blob Storage; COPY supports many formats and parallel ingestion without external table setup.
- A medallion lakehouse layers data into Bronze (raw ingested), Silver (cleansed and conformed), and Gold (curated business-level aggregates) zones to progressively improve data quality.
- Secure stored data with customer-managed keys (CMK) in Azure Key Vault for control of the key lifecycle, plus virtual network service endpoints, private endpoints, and storage firewall rules to restrict network access. Azure Blob lifecycle management policies automatically tier or delete aging data.
Domain 2: Develop Data Processing
- Implement incremental (delta) loading with the high-watermark pattern: a Lookup activity reads the last watermark from a control table, a Copy activity query filters rows where LastModifiedDate is greater than the watermark, and a final step updates the control table with the new maximum value.
- Azure Data Factory trigger types: Schedule (wall-clock recurrence), Tumbling Window (fixed non-overlapping windows that support backfill of historical periods and dependencies on other tumbling-window triggers), and Storage Event (fires on blob created/deleted in ADLS Gen2 or Blob Storage).
- Run the OPTIMIZE command on a Delta table to bin-pack many small files into fewer larger files (targeting roughly 1 GB) and fix the small-file problem; OPTIMIZE ZORDER BY co-locates related data for better skipping. For Parquet outside Delta, compact with an ADF pipeline or Spark job.
- Azure Stream Analytics windowing functions: Tumbling (fixed, non-overlapping), Hopping (fixed size with a smaller hop, so windows overlap, for example 5-minute size with 1-minute hop), Sliding (emits on event arrival), Session (groups by gaps), and Snapshot.
- In Stream Analytics, configure late-arrival tolerance (watermark/event ordering policy) so out-of-order and late events still land in the correct window; for example a 5-minute tumbling window with a 3-minute late-arrival delay.
- Stream Analytics joins streaming data with slowly changing static data using a reference data input from Blob Storage or Azure SQL Database.
- In Spark Structured Streaming, enable checkpointing by setting checkpointLocation to a reliable path in ADLS Gen2; this persists processed offsets and operator state, providing exactly-once processing and resumption after failure.
- Use a broadcast join (broadcast the smaller DataFrame) in Spark to avoid a shuffle when joining a large DataFrame with a small one; the small side is replicated to every executor.
- Read CSV in Spark with an explicit StructType schema (rather than relying on inferSchema) for performance and correctness, setting header to true or false to match the file.
- ADF control-flow activities: Get Metadata (check file existence or properties), If Condition (branch on an expression), ForEach (iterate), Execute Pipeline (invoke a child pipeline such as Pipeline B from Pipeline A), and Lookup (retrieve a config or watermark value).
- ADF mapping data flow transformations: Conditional Split (route rows by predicate, for example null checks), Derived Column (compute or replace values such as defaulting nulls), Alter Row (tag rows as insert/update/delete/upsert), Aggregate, Join, and Window.
- Use the mapping data flow Debug feature with a small data sample to preview the output of each transformation step interactively before running in production.
- Configure pagination rules on a REST source dataset in ADF Copy activity to automatically retrieve all pages from a paged API.
- T-SQL/Spark analytic functions for time-series logic include LAG and LEAD (access prior/next rows) and windowed aggregates with OVER (PARTITION BY ... ORDER BY ...); these support comparisons across ordered events.
Domain 3: Secure, Monitor, and Optimize Data Storage and Processing
- Microsoft Purview discovers and classifies sensitive data by registering each data source and running a scan with the built-in (default) system classification rules, which detect national IDs, credit card numbers, and similar PII across Synapse, ADLS Gen2, and Azure SQL Database. Apply glossary terms with asset assignments for business context.
- Synapse dedicated SQL pool workload management uses workload classifiers to map queries (by user/role/label) to workload groups; set min_percentage_resource to reserve memory and importance to schedule prioritized queries ahead of lower-importance ones.
- Workload isolation: a workload group can cap concurrency and reserve a minimum resource percentage so one team or job cannot starve critical workloads.
- Resource classes (static like staticrc10 or dynamic like largerc) control memory granted per query in a dedicated SQL pool; assigning a user to a larger resource class gives more memory (helpful for large loads or CCI builds) but reduces concurrency.
- Row-level security (RLS) uses a security policy with a filter predicate function to automatically restrict which rows a user sees, based on their identity or role.
- Dynamic data masking (DDM) obfuscates column values at query time (for example aX@XXXX.com for email) for unprivileged users while privileged users see real data; it does not encrypt the stored data.
- Restrict column access with GRANT SELECT on only the allowed columns and DENY SELECT on sensitive columns (such as salary); a DENY always overrides a GRANT.
- Transparent Data Encryption (TDE) encrypts data at rest and is enabled by default on dedicated SQL pools; Always Encrypted protects specific columns end-to-end so even administrators cannot read plaintext.
- Restrict network access to Synapse and storage using IP firewall rules, managed private endpoints, and the Synapse workspace managed identity for secure, credential-free access to other Azure resources.
- ADLS Gen2 uses POSIX-like access control lists (ACLs) for read/write/execute on files and folders, layered with Azure RBAC at the container/account level; ACLs give fine-grained directory-level control RBAC alone cannot.
- Optimize dedicated SQL pool query performance by hash-distributing frequently joined large tables on the join key, replicating small dimension tables, creating and updating statistics, and using materialized views for repeated aggregations. Stale results often trace to out-of-date statistics or a cached result set.
- Fix data skew or slow distributed queries by choosing a higher-cardinality, more evenly distributed hash key, which reduces uneven distribution loading and data movement.
- Monitor with Azure Monitor diagnostic settings: send ADF and Synapse logs to a Log Analytics workspace, create alert rules on failed pipeline runs (and action groups for notification), and build dashboards with Azure Monitor Workbooks; ADF also exposes trigger run history and pipeline/activity-level run history.
- Tune Spark jobs by enabling autoscale with a max node count and bounding per-job allocation via spark.dynamicAllocation.maxExecutors, and increase executor instances and memory per executor for memory-bound workloads.
DP-203 exam tips
- Decide table distribution by size and role: replicate small dimensions, hash-distribute large facts on a high-cardinality join key, and use round-robin only for staging; many exam questions hinge on spotting data skew or unnecessary data movement.
- Match the windowing function to the requirement: tumbling for non-overlapping aggregates, hopping for overlapping windows (size plus smaller hop), sliding for event-driven output, and session for activity bursts; watch the size-versus-hop wording closely.
- When a question mentions exactly-once or fault tolerance in streaming, the answer almost always involves checkpointing to a reliable location (ADLS Gen2) plus watermarking for late data.
- For security scenarios, map the exact requirement to one feature: RLS for rows, DDM or column GRANT/DENY for columns, TDE for at-rest, Always Encrypted for end-to-end, CMK in Key Vault for key control, and ACLs for ADLS Gen2 folders.
- Read carefully for incremental-load cues (watermark column, last-modified date, control table): the expected pattern is Lookup plus filtered Copy plus update-watermark, not a full reload.
Study guide FAQ
What is the difference between a Synapse dedicated SQL pool and a serverless SQL pool?
A dedicated SQL pool provisions and bills for reserved compute (measured in DWUs) and stores managed, persistent tables optimized with clustered columnstore indexes and distributions, ideal for a data warehouse. A serverless SQL pool is pay-per-query compute with no provisioned resources; it cannot create persistent managed tables and instead queries external files (Parquet, CSV, Delta) in the data lake using schema-on-read and lake databases.
When should I use Avro versus Parquet?
Use Avro (a row-based format) for write-heavy streaming ingestion and message schemas where whole records are written and read together. Use Parquet (columnar) for analytical query workloads, because its column storage and per-row-group statistics enable column pruning and row-group skipping. A common pattern is to land data as Avro and convert it to Parquet (or Delta) for analytics.
How does the high-watermark incremental load pattern work in Azure Data Factory?
You store the last successfully processed value (often a LastModifiedDate or an increasing ID) in a control table. A Lookup activity reads that watermark, a Copy activity runs a source query filtering rows greater than the watermark, and a final activity updates the control table with the new maximum value. This loads only changed rows instead of reloading the entire dataset on every run.
How much do I need to know about coding versus design for DP-203?
Expect a mix. You should recognize and reason about T-SQL (distribution, indexing, RLS, GRANT/DENY), PySpark/Spark SQL (StructType schemas, broadcast joins, structured streaming with checkpointing), Stream Analytics query patterns (windowing, reference data), and KQL-style monitoring concepts. Equally important is design judgment: choosing distribution strategies, partitioning, file formats, lakehouse zones, and the right security or monitoring feature for a stated requirement.