DP-900: Azure Data Fundamentals Study Guide
The DP-900: Azure Data Fundamentals exam validates foundational knowledge of core data concepts and how they are implemented using Microsoft Azure data services. It is aimed at candidates beginning to work with data in the cloud - including business stakeholders, students, and anyone preparing for role-based Azure data certifications - and requires no prior hands-on experience. The exam covers data concepts, relational data, non-relational data, and analytics workloads on Azure.
Domain 1: Describe Core Data Concepts
- Structured data conforms to a fixed schema with defined columns and data types (relational tables); semi-structured data has some organization via tags or keys but variable fields (JSON, XML, key-value); unstructured data has no predefined model (images, video, audio, PDFs, free-form text).
- OLTP (Online Transaction Processing) systems handle many short, atomic, real-time read/write transactions, use highly normalized schemas to minimize redundancy, and are optimized for fast inserts, updates, and deletes.
- OLAP (Online Analytical Processing) systems are optimized for complex analytical queries over large volumes of historical data, use denormalized schemas (star/snowflake), and support aggregation and trend analysis rather than frequent transactional updates.
- The ACID properties guarantee reliable transactions: Atomicity (all-or-nothing), Consistency (valid state to valid state), Isolation (concurrent transactions do not interfere), and Durability (committed transactions survive failures).
- Durability is achieved through transaction logs and write-ahead logging, ensuring committed changes persist even after power loss, hardware failure, or crashes.
- ETL (Extract, Transform, Load) transforms data before loading it into the target; ELT (Extract, Load, Transform) loads raw data first and transforms it using the target system's compute, which suits scalable cloud platforms and preserves raw data.
- Batch processing collects data over a period and processes it as a group on a schedule; stream processing ingests and processes data continuously in near real time as events arrive.
- Normalization organizes relational data into multiple related tables to reduce redundancy and prevent insertion, update, and deletion anomalies, following normal forms (1NF, 2NF, 3NF).
- CSV (comma-separated values) is a text-based, row-oriented format that separates fields with commas; JSON and XML are common semi-structured formats; Avro, Parquet, and ORC are common big-data file formats (Parquet and ORC are columnar).
- The data engineer designs and builds data pipelines and storage; the data analyst explores, visualizes, and reports on data to drive business decisions; the database administrator manages databases, availability, security, and performance.
- A data analyst's core work is turning existing data into business value using SQL, BI tools, and visualization platforms to build dashboards and reports for stakeholders.
- Relational databases enforce a schema and relationships via primary and foreign keys; non-relational (NoSQL) stores trade rigid schema for flexibility and horizontal scalability.
- A data lake stores large volumes of raw data in many formats (structured, semi-structured, unstructured) cheaply; a data warehouse stores curated, structured data optimized for analytical querying.
- Transactional workloads prioritize fast, consistent individual record operations; analytical workloads prioritize reading and aggregating large historical datasets.
Domain 2: Identify Considerations for Relational Data
- Azure SQL Database is a fully managed PaaS offering for new cloud applications, with built-in high availability (99.99% SLA), automated backups with point-in-time restore, automatic patching, and no OS access.
- Azure SQL Managed Instance provides near 100% compatibility with on-premises SQL Server (SQL Server Agent, cross-database queries, SSIS, CLR) and is the best choice for lift-and-shift migrations with minimal code changes.
- SQL Server on Azure Virtual Machines is IaaS, giving full control over the OS and SQL Server instance; choose it when you need OS-level access or features not available in the managed offerings.
- Azure SQL Database elastic pools let multiple databases share a pool of compute (DTUs or vCores), optimizing cost when databases have varying, non-overlapping peak usage patterns.
- Azure offers managed open-source relational databases: Azure Database for MySQL and Azure Database for PostgreSQL - choose the one matching the application's existing engine for minimal migration changes. (Azure Database for MariaDB was retired on 19 September 2025; migrate MariaDB workloads to Azure Database for MySQL Flexible Server.)
- Purchasing/compute models include the DTU model and the vCore model; the provisioned compute tier reserves capacity for predictable workloads, while serverless auto-scales and can pause for intermittent workloads.
- SQL command categories: DDL (Data Definition Language) defines schema - CREATE, ALTER, DROP; DML (Data Manipulation Language) manipulates data - SELECT, INSERT, UPDATE, DELETE; DCL (Data Control Language) - GRANT, REVOKE manages permissions.
- A primary key uniquely identifies each row and enforces both uniqueness and a NOT NULL constraint; a foreign key references a primary key in another table to enforce referential integrity.
- An index improves the speed of data retrieval; create an index on a column frequently used in WHERE clauses or joins (for example, an index on a ProductName column you search often) - clustered indexes define physical row order, nonclustered indexes are separate structures.
- A view is a saved query presenting data as a virtual table; a stored procedure is a precompiled set of SQL statements that can accept parameters, contain control-flow logic, and execute efficiently as a unit.
- Normalization (1NF, 2NF, 3NF) minimizes redundancy and improves data integrity by ensuring each fact is stored once and tables are linked by keys.
- Managed relational services reduce administrative overhead by handling automated backups, automatic patching/updates, built-in high availability, and intelligent performance tuning.
- Relationships are typically one-to-many (most common), one-to-one, or many-to-many (resolved with a junction/bridge table); these are enforced through primary and foreign keys.
- Azure SQL Database point-in-time restore allows recovery to any time within the configured backup retention period, and geo-replication/failover groups provide cross-region resilience.
Domain 3: Describe Considerations for Non-Relational Data
- Azure Cosmos DB is a globally distributed, multi-model NoSQL database guaranteeing single-digit millisecond latency at the 99th percentile, automatic multi-region replication, and elastic horizontal scaling.
- Cosmos DB APIs include API for NoSQL (Core, the native API using SQL-like queries), API for MongoDB, API for Apache Cassandra, API for Apache Gremlin (graph), and API for Table - the API is chosen at account creation and cannot be changed afterward.
- Use the Cosmos DB API for MongoDB to migrate existing MongoDB apps with minimal change, and the Gremlin (graph) API for traversing relationships such as finding routes or connections between entities.
- Cosmos DB offers five consistency levels from strongest to weakest: Strong, Bounded Staleness, Session (default), Consistent Prefix, and Eventual - stronger consistency increases latency, weaker consistency improves performance and availability.
- A partition key determines how data is logically and physically distributed across partitions; choose one with high cardinality and even access distribution to avoid hot partitions and enable horizontal scale.
- Azure Blob Storage stores large unstructured objects (files, images, video, backups) in a flat namespace of containers and blobs - it is object storage, not tabular storage.
- Blob Storage has three access tiers: Hot (frequent access, low storage cost, higher access cost), Cool (infrequent, minimum 30-day retention), and Archive (lowest storage cost, highest retrieval cost and latency, offline). Lifecycle management policies can move blobs between tiers automatically.
- Blob types are block blobs (most files and documents), append blobs (logging scenarios), and page blobs (random-access files such as VM disks).
- Azure Data Lake Storage Gen2 is built on top of Azure Blob Storage and adds a hierarchical namespace (true directories) plus POSIX-style ACLs, optimized for big-data analytics workloads.
- Azure Table Storage is a schemaless NoSQL key-value store; entities in the same table can have different properties and are uniquely identified by a partition key plus a row key.
- Azure Files provides fully managed file shares accessible over SMB and NFS, supporting lift-and-shift of applications that expect a file share.
- Graph databases store data as nodes (entities) and edges (relationships) and are queried by traversing those relationships - ideal for social networks, recommendations, and route/path finding.
- Document databases store semi-structured documents (typically JSON) where each document can have a different shape, and are queried by document fields rather than fixed columns.
- NoSQL stores are designed to scale horizontally (scale out across many servers) rather than vertically, trading rigid relational structure and joins for flexibility and massive scale.
Domain 4: Describe an Analytics Workload
- Azure Synapse Analytics is an integrated analytics platform combining dedicated SQL pools (MPP - massively parallel processing - for data warehousing), serverless SQL pools (on-demand querying of data lake files), Apache Spark pools (big-data processing), and Synapse Pipelines for orchestration.
- Dedicated SQL pools use massively parallel processing to run large data-warehouse queries; serverless SQL pools let you query files in the data lake on demand without provisioning capacity, charging per data processed.
- Azure Data Factory is Azure's cloud ETL/ELT orchestration service for building, scheduling, and monitoring data pipelines that move and transform data between sources and destinations.
- A self-hosted integration runtime is a gateway installed on-premises that lets Azure Data Factory securely access on-premises data sources without exposing them to the public internet.
- Azure Stream Analytics is a fully managed real-time analytics engine that ingests continuous event streams (often from Azure Event Hubs or IoT Hub) and processes them with SQL-like queries and time windows.
- Azure Event Hubs is a big-data streaming ingestion service for high-throughput event data; IoT Hub adds bidirectional device communication for IoT telemetry.
- A star schema has a central fact table directly surrounded by denormalized dimension tables, giving simple single-hop joins optimized for fast analytical queries.
- A snowflake schema normalizes the dimension tables into multiple related tables, reducing redundancy and storage at the cost of additional joins during queries.
- Fact tables store measurable, quantitative numeric data (measures) such as sales amounts and quantities, along with foreign keys to dimensions; dimension tables store descriptive attributes used to filter, group, and label facts.
- Power BI Desktop is the free authoring tool for connecting to data sources, building data models, and creating reports; the Power BI service (app.powerbi.com) is the cloud platform for publishing, sharing, and scheduling refresh of reports and dashboards.
- A Power BI dataset (semantic model) is the data model layer - tables, relationships, and measures - that reports and dashboards are built on; it can combine multiple sources and refresh on a schedule.
- Common Power BI visuals include bar/column charts, line charts, pie charts, KPI cards, maps/geographic visuals, tables and matrices, plus interactive slicers and filters for data exploration.
- A modern analytics pipeline follows stages: data ingestion from source systems, data storage (often a data lake), data transformation/processing, data modeling, and finally data visualization and reporting.
- Analytical (OLAP) data warehouses consolidate historical data from multiple sources using denormalized star/snowflake schemas to optimize read and aggregation performance rather than transactional throughput.
DP-900 exam tips
- DP-900 is conceptual, not hands-on - focus on knowing what each Azure service does and when to choose it (for example SQL Database vs. Managed Instance vs. SQL on a VM, or Blob vs. Data Lake Gen2 vs. Table Storage).
- Memorize the mapping of workload type to service: OLTP relational to Azure SQL Database, big-data/document NoSQL to Cosmos DB, object/file storage to Blob/Data Lake Gen2, data warehousing to Synapse, ETL orchestration to Data Factory, and real-time streaming to Stream Analytics.
- Be precise on terminology that the exam tests as true/false or multiple-select: ACID properties, ETL vs. ELT order, structured vs. semi-structured vs. unstructured, star vs. snowflake, and DDL vs. DML vs. DCL.
- Watch for multiple-response questions (Select all / pick two or three) and case-style scenarios that describe a business need - read the requirement carefully and match it to the single best service or feature.
- You need a scaled score of 700 (out of 1000) to pass and have about 45 minutes; pace yourself, answer every question (there is no penalty for guessing), and flag uncertain items for review.
Study guide FAQ
How long is the DP-900 exam and what score do I need to pass?
The exam runs about 45 minutes and you must reach a scaled score of 700 out of 1000 to pass. The question mix typically includes single-answer multiple choice, multiple-response (select all that apply), and true/false style items; there is no penalty for guessing, so answer everything.
Do I need hands-on Azure experience or coding skills to pass DP-900?
No. DP-900 is a fundamentals exam that tests conceptual understanding rather than hands-on tasks. You should recognize core data concepts and know what each Azure data service does and when to use it. A free Azure trial to click through Blob Storage, Cosmos DB, and Power BI helps reinforce concepts but is not required.
What is the difference between Azure SQL Database, SQL Managed Instance, and SQL Server on a VM?
Azure SQL Database is fully managed PaaS best for new cloud apps with minimal admin overhead. SQL Managed Instance is also PaaS but offers near 100% on-premises SQL Server compatibility, making it ideal for lift-and-shift migrations. SQL Server on Azure VMs is IaaS, giving full OS and instance control when you need features or access the managed options do not provide.
When should I use Azure Cosmos DB versus Blob Storage versus Data Lake Storage Gen2?
Use Cosmos DB for globally distributed, low-latency NoSQL data such as documents, key-value, graph, or column-family workloads. Use Blob Storage for large unstructured objects like images, video, and backups in a flat container/blob namespace. Use Data Lake Storage Gen2 (built on Blob with a hierarchical namespace and ACLs) when you need a file-system structure and fine-grained access for big-data analytics.