DP-100: Azure Data Scientist Associate Study Guide
DP-100 (Azure Data Scientist Associate) validates your ability to design machine learning solutions, prepare data, train and evaluate models, and operationalize them with Azure Machine Learning. It targets data scientists and ML engineers who use the Azure ML SDK v2, CLI v2, and studio to run experiments and deploy models. Expect heavy emphasis on the Azure ML workspace, compute, jobs, MLflow tracking, endpoints, and MLOps retraining patterns.
Domain 1: Design and Prepare a Machine Learning Solution
- The Azure Machine Learning workspace is the top-level resource that centralizes data assets, models, compute, jobs, environments, and endpoints for a team's ML work.
- A workspace provisions or references companion resources: Azure Storage (default datastore), Azure Key Vault (secrets), Application Insights, and Azure Container Registry (which is typically created on first use, when an environment image needs to be built).
- A compute instance is a managed single-user dev workstation (pre-configured Jupyter/VS Code) for interactive authoring; a compute cluster is multi-node managed compute for training and batch jobs that can autoscale.
- Set a compute cluster's minimum nodes to 0 so it scales down to zero when idle, eliminating idle cost; it scales up on demand when jobs are queued.
- Low-priority (Spot) VM nodes in a compute cluster cut cost substantially but can be pre-empted, so use them for fault-tolerant or interruptible training workloads.
- Datastores are connection configurations that reference an Azure storage account (Blob, ADLS Gen2, File); data assets are named, versioned references to specific data for reproducibility and lineage.
- Data asset versioning lets an experiment reference an exact data version, providing reproducibility and lineage so any result can be reproduced or audited later.
- Datastore authentication options include account key, SAS token, and managed identity; SAS tokens give time-limited, scoped access without exposing account keys and work with managed virtual networks. Revoking a SAS token before it expires requires a stored access policy or rotating the account key.
- Training code must reference data through Azure ML datastores or data assets, not paths mounted on a local dev machine, because compute clusters cannot reach the developer's local filesystem.
- Use managed identity with RBAC role assignments (for example Storage Blob Data Reader) to grant compute access to storage without embedding credentials.
- Azure RBAC role assignments control workspace access; common built-in roles include AzureML Data Scientist, AzureML Compute Operator, Owner, Contributor, and Reader.
- Responsible AI covers interpretability/transparency (explaining predictions) and fairness/bias assessment; Azure ML provides a Responsible AI dashboard for these analyses.
- Cost controls include co-locating compute and storage in the same region (avoid egress), Azure Reserved VM Instances for steady workloads, Cool/Cold blob tiers for infrequent data, and resource tags with budgets and cost alerts.
- Create a workspace via CLI v2 with az ml workspace create --name <name> --resource-group <rg>; manage compute, data, and jobs with the az ml command group.
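The cost effect of setting minimum nodes to 0 and using low-priority nodes can be checked with simple arithmetic. The sketch below is illustrative only: the function name, the hourly rate, the busy hours, and the discount are hypothetical values chosen for the example, not Azure prices.

```python
def monthly_cluster_cost(busy_hours, nodes, hourly_rate,
                         min_nodes=0, hours_in_month=730.0, discount=0.0):
    """Rough monthly cost of a compute cluster.

    `nodes` run during `busy_hours`; `min_nodes` stay allocated (and billed)
    for the remaining idle hours. `discount` is a fractional price reduction,
    for example for low-priority nodes. All inputs are hypothetical.
    """
    rate = hourly_rate * (1.0 - discount)
    idle_hours = hours_in_month - busy_hours
    return rate * (nodes * busy_hours + min_nodes * idle_hours)


# Hypothetical workload: 4 nodes at 3.00 per node-hour, busy 60 hours a month.
scale_to_zero = monthly_cluster_cost(60, 4, 3.0, min_nodes=0)   # 720.0
one_idle_node = monthly_cluster_cost(60, 4, 3.0, min_nodes=1)   # 2730.0
low_priority = monthly_cluster_cost(60, 4, 3.0, min_nodes=0, discount=0.8)

print(scale_to_zero, one_idle_node, round(low_priority, 2))
```

In this made-up scenario, a single always-on idle node costs more than all the actual training hours combined, which is why minimum nodes of 0 is a recurring answer in cost questions.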
Domain 2: Explore Data and Train Models
- Hyperparameters (learning rate, tree depth, regularization strength, batch size, number of layers) are set before training and control how the algorithm learns; model parameters are learned during training.
- Split data into training (fit parameters), validation (tune hyperparameters and select models), and test (final unbiased generalization estimate) sets so that tuning decisions do not leak into the final performance estimate.
- Cross-validation (for example k-fold) trains and validates across multiple data folds to give a more robust performance estimate than a single train/validation split.
- Feature engineering transforms raw data into informative features (polynomial terms, binning, temporal components, encodings) to make patterns more learnable.
- Regression metrics include MSE, RMSE, MAE, and R-squared; classification metrics include accuracy, precision, recall, F1, and AUC (area under the ROC curve).
- For imbalanced classification (for example a 5% positive class), raw accuracy is misleading; optimize F1 or AUC, which reflect performance on the minority class.
- Overfitting (low training error, poor generalization) is reduced with regularization, cross-validation, simpler models, early stopping, and more representative data.
- Azure ML hyperparameter sweep jobs search hyperparameter spaces; sampling options are grid, random, and Bayesian (which uses prior trial results to choose the next trial).
- An early-termination policy stops poorly performing sweep runs; a bandit policy terminates runs whose primary metric falls outside a slack factor/amount from the best run, saving compute.
- MLflow tracking integrated with Azure ML logs metrics, parameters, and artifacts (models, plots) across jobs/runs for comparison in the SDK and studio.
- An Azure ML environment (curated or custom Docker/conda) packages dependencies so a training run is reproducible across compute targets.
- Automated ML (AutoML) systematically tries multiple algorithms, preprocessing steps, and hyperparameters for classification, regression, time-series forecasting, and more, selecting the best model by a primary metric.
- Distributed training scales a single job across multiple nodes/GPUs; PyTorch uses DistributedDataParallel (DDP) and frameworks like Horovod also support data-parallel training.
- For GPU training throughput, keep data close to compute and feed it through a prefetching or caching data loader, use mixed precision (FP16) to engage Tensor Cores, and use distributed data-parallel training across GPUs.
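To see why raw accuracy misleads on imbalanced data, the metrics can be computed by hand from confusion-matrix counts. The sketch below uses a hypothetical 100-row dataset with a 5% positive class; the function name and the counts are assumptions for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    Precision, recall, and F1 are reported as 0.0 when their denominator
    is zero (a common convention, chosen here for simplicity).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Model A predicts "negative" for every row: 5 positives missed, 95 negatives right.
always_negative = classification_metrics(tp=0, fp=0, fn=5, tn=95)

# Model B finds 4 of the 5 positives at the price of 6 false alarms.
useful_model = classification_metrics(tp=4, fp=6, fn=1, tn=89)

print(always_negative)  # accuracy 0.95, f1 0.0
print(useful_model)     # accuracy 0.93, f1 about 0.53
```

Model A wins on accuracy while detecting nothing; F1 ranks the two models the way the business problem would.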
Domain 3: Prepare a Model for Deployment
- Registering a model creates a named, versioned artifact with metadata linking it to the job, metrics, and data that produced it, enabling deployment, rollback, and audit.
- A managed online endpoint serves real-time, low-latency predictions over a REST API with the model kept in memory to avoid cold-start delays.
- A batch endpoint scores large volumes of data asynchronously (on a schedule or on demand) on a compute cluster, ideal when per-record latency is tolerable.
- A scoring/entry script (typically score.py) defines init() to load the model once and run() to process each inference request, including preprocessing and postprocessing.
- An inference environment (curated or custom Docker/conda) pins the runtime and dependencies so inference is reproducible and matches training.
- An online endpoint can front multiple deployments behind one scoring URI, with traffic allocation percentages set per deployment to enable blue/green and canary rollouts.
- Set traffic allocation per deployment (for example az ml online-endpoint update --name <ep> --traffic "blue=90 green=10") to gradually shift live traffic to a new model version, ending at "blue=0 green=100" once the new deployment is validated.
- Managed online endpoints provide managed infrastructure, autoscaling, key/token authentication, SSL/TLS, and logging without you managing VMs.
- Enable autoscaling rules so endpoint instance count tracks demand within min/max bounds; set a minimum instance count to keep warm instances and reduce cold starts.
- Improve online latency by loading the model once in init() (not per request), right-sizing the instance SKU, and tuning max concurrent requests per instance.
- Reduce model size and inference cost with quantization or distillation, and use smaller container images to cut build time, deployment cost, and cold-start latency.
- Increase batch scoring throughput by raising mini-batch size and the instance/worker count to parallelize across the cluster.
- Model explainability tools reveal which features drive predictions; SHAP gives local (per-prediction) and global (aggregate) feature importance, and permutation importance is another option.
- Register a model with the CLI: az ml model create --name <name> --version 1 --path ./model --type custom_model (other types include mlflow_model and triton_model).
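The init()/run() contract of a scoring script can be sketched as follows. This is a minimal illustration, not a production script: the model.json file and its weights/bias layout are hypothetical stand-ins so the example needs only the standard library, whereas real scripts usually load a serialized model with a library such as joblib or MLflow. AZUREML_MODEL_DIR is the environment variable Azure ML sets to the model folder; the exact subpath under it depends on how the model was registered.

```python
import json
import os

_model = None  # populated once by init(), reused by every run() call


def init():
    """Called once when the deployment starts: load the model into memory."""
    global _model
    model_dir = os.environ["AZUREML_MODEL_DIR"]  # set by Azure ML at deploy time
    # Hypothetical artifact: a JSON file holding linear-model weights and a bias.
    with open(os.path.join(model_dir, "model.json"), encoding="utf-8") as f:
        _model = json.load(f)


def run(raw_data):
    """Called per request: parse the JSON payload, predict, return the result."""
    rows = json.loads(raw_data)["data"]          # preprocessing: decode input
    weights, bias = _model["weights"], _model["bias"]
    predictions = [
        sum(w * x for w, x in zip(weights, row)) + bias
        for row in rows
    ]
    return {"predictions": predictions}          # must be JSON-serializable
```

Because the file read happens in init() rather than run(), each request pays only for the prediction itself, which is the latency point made in the list above.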
Domain 4: Deploy and Retrain a Model
- Managed online endpoints expose models over REST with low latency and automatic scaling for real-time prediction; batch endpoints handle scheduled, high-volume asynchronous scoring.
- MLOps operationalizes the ML lifecycle: Azure ML pipelines orchestrate multi-step workflows (data prep, train, evaluate, register) for reproducible, schedulable retraining.
- CI/CD integration with GitHub Actions or Azure DevOps invokes the Azure ML CLI/SDK to trigger pipeline runs on code or data changes, automating retraining and deployment.
- Data drift is a shift in the production data distribution relative to training data; it degrades accuracy over time and signals a need to retrain.
- Azure ML monitors deployed models by comparing recent inference data against a baseline to detect data drift and performance degradation.
- Common retraining triggers are arrival of fresh labeled data, a recurring schedule, and monitoring that detects drift or a drop in production performance.
- Prefer triggering retraining on a sensible schedule (for example weekly) or on detected drift rather than overly frequent (hourly) runs that waste compute.
- Use canary deployments to send a small traffic percentage to a candidate model first, compare its monitored metrics, then shift full traffic; a blue/green rollout instead switches all traffic to the new deployment once it has been validated.
- Capture inference inputs, outputs, latency, and error rates as production telemetry; log metrics and latency to Application Insights to feed drift detection and troubleshooting.
- Lineage from registered data and model assets plus tracked runs supports governance, audit, and root-cause investigation of production issues.
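- Right-size cost by using GPU compute clusters that autoscale to zero between scheduled runs and configuring endpoint autoscaling with a low minimum instance count that scales up on demand; managed online endpoints cannot scale to zero instances, so delete or consolidate deployments that sit unused.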
- Manage and forecast spend with Microsoft Cost Management using budgets, cost alerts, and resource tags.
- Monitor CPU/GPU and memory utilization plus request latency from endpoint metrics; add instances or tune concurrency and autoscale thresholds to address bottlenecks.
- Key CLI v2 operations: az ml online-endpoint invoke --name <ep> --request-file sample.json, az ml online-deployment update --name <deployment> --endpoint-name <ep> --set instance_count=3, and az ml schedule create -f schedule.yml for scheduled pipelines.
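Comparing recent inference data against a baseline can be illustrated with a generic drift statistic. The sketch below computes the Population Stability Index (PSI) for one numeric feature; it is an assumption-laden illustration of the idea, not Azure ML's monitoring implementation, and the function name, bin count, and the 0.2 alert threshold (a common rule of thumb) are choices made for this example.

```python
import math


def population_stability_index(baseline, recent, bins=10, eps=1e-6):
    """PSI between two non-empty samples of one numeric feature.

    Bin edges come from the baseline range; values outside that range are
    clamped into the first or last bin. Empty bins are floored at `eps`
    so the logarithm stays defined.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(recent)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [float(v) for v in range(100)]   # hypothetical training feature
shifted = [v + 30.0 for v in baseline]      # hypothetical production feature

no_drift = population_stability_index(baseline, baseline)
drift = population_stability_index(baseline, shifted)

DRIFT_THRESHOLD = 0.2  # rule-of-thumb value, an assumption for this sketch
print(no_drift, drift > DRIFT_THRESHOLD)
```

A check like this could feed a retraining trigger: run it on a schedule against recent inference data and start the retraining pipeline only when the statistic crosses the chosen threshold.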
DP-100 exam tips
- Master the v2 vocabulary: workspace, datastore, data asset, environment, compute instance vs compute cluster, command/sweep/pipeline jobs, and managed online vs batch endpoints; many questions hinge on choosing the right resource for a scenario.
- Know the online-vs-batch endpoint decision cold: real-time, low-latency REST scoring means a managed online endpoint; high-volume, latency-tolerant, scheduled scoring means a batch endpoint.
- Watch for cost-optimization questions: compute clusters scaling to zero (min nodes 0), Spot/low-priority VMs, autoscale min/max, reserved instances, and same-region co-location are recurring correct answers.
- On imbalanced data, never pick raw accuracy; choose F1 or AUC. Be ready to map metrics to problem types (RMSE/R-squared for regression, precision/recall/F1/AUC for classification).
- Expect yes/no "does this solution meet the goal" series and az ml CLI v2 syntax questions; read each scenario carefully and recognize valid command flags like --traffic, --set, and --request-file.
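For the metric-mapping tip, it can help to compute the regression metrics once by hand. The sketch below derives MSE, RMSE, MAE, and R-squared from their definitions; the function name and the four sample values are hypothetical, and it assumes the targets are not all identical (otherwise R-squared is undefined).

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R-squared for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)  # mse 0.5, mae 0.5, r2 0.9, rmse about 0.707
```

RMSE and MAE are in the units of the target, while R-squared is unitless, which is often the deciding detail in a metric-selection question.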
Study guide FAQ
Does DP-100 use the Azure ML SDK v1 or v2?
The current exam centers on the v2 experience: the Python SDK v2, CLI v2 (az ml command group), and Azure ML studio. Learn v2 concepts like data assets, environments, command/sweep/pipeline jobs, and managed online/batch endpoints rather than the deprecated v1 Estimator and inference-config patterns.
How much coding is on the exam?
You should be able to read and reason about Python (SDK v2) and az ml CLI commands and YAML job specs, but you write little to no code from scratch. Most items are scenario-based multiple choice asking you to pick the right service, configuration, metric, or command.
What is the passing score and format?
You need 700 out of 1000 to pass. The exam runs about 100 minutes with roughly 40 to 60 questions in mixed formats: multiple choice, multiple-select, drag-and-drop ordering, and case studies, including the 'does this solution meet the goal' yes/no series.
How do I choose between AutoML and a hyperparameter sweep job?
Use AutoML when you want Azure ML to automatically try many algorithms and preprocessing/hyperparameter combinations and surface the best model with minimal code. Use a sweep job when you already have a chosen algorithm/training script and want to tune its hyperparameters using sampling (grid, random, Bayesian) with an early-termination policy such as bandit.
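The bandit early-termination rule can be sketched in a few lines. This is an illustration of the decision logic for a metric that is being maximized, not the service's code: the function name is hypothetical, and the slack-factor threshold of best / (1 + slack_factor) follows the way the policy is commonly documented.

```python
def bandit_should_terminate(run_metric, best_metric,
                            slack_factor=None, slack_amount=None):
    """Return True if a run should be stopped under a bandit-style policy.

    Assumes the primary metric is maximized. Exactly one of `slack_factor`
    (a ratio) or `slack_amount` (an absolute distance) must be given.
    """
    if (slack_factor is None) == (slack_amount is None):
        raise ValueError("specify exactly one of slack_factor or slack_amount")
    if slack_factor is not None:
        threshold = best_metric / (1.0 + slack_factor)
    else:
        threshold = best_metric - slack_amount
    return run_metric < threshold


# Best run so far reports 0.80; slack_factor 0.2 gives a cutoff near 0.667.
print(bandit_should_terminate(0.60, 0.80, slack_factor=0.2))  # True
print(bandit_should_terminate(0.70, 0.80, slack_factor=0.2))  # False
```

A smaller slack value terminates more aggressively and saves more compute, at the risk of stopping runs that would have improved later.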