AWS MLA-C01: Machine Learning Engineer Associate Study Guide
The AWS Certified Machine Learning Engineer - Associate (MLA-C01) validates your ability to build, deploy, and operationalize ML solutions on AWS, spanning data preparation, model development, deployment/orchestration, and monitoring/security. It is aimed at engineers with at least one year of hands-on experience using Amazon SageMaker and related AWS services. The exam is 130 minutes, contains scored and unscored questions, and requires a scaled score of 720 out of 1000 to pass.
Domain 1: Data Preparation for ML
- Amazon S3 is the primary, most cost-effective store for large-scale raw ML training data in a data lake; it provides 11 nines of durability and integrates directly with SageMaker, Athena, and Glue.
- SageMaker Data Wrangler is a visual, low-code tool for data preparation and feature engineering; it offers 300+ built-in transforms and can export generated code as a SageMaker Pipelines step, Python script, or Feature Store job.
- SageMaker Feature Store has an online store for low-latency real-time inference lookups and an offline store (backed by S3, in Parquet) for batch training, ensuring the same feature definitions serve both training and inference.
- When creating a SageMaker Feature Group, --record-identifier-feature-name names the unique record column and --event-time-feature-name names the required timestamp feature used to track record versions over time.
- Fit scalers and encoders on the training split only, then apply the fitted transform to validation and test sets; fitting on the full dataset causes data leakage and inflated metrics.
- Split data into training (teaches the model), validation (guides hyperparameter tuning and early stopping), and test (unbiased final generalization estimate) sets to avoid overfitting to evaluation data.
- Address class imbalance with resampling (oversampling the minority class, undersampling the majority, or SMOTE to synthesize minority examples) or by applying class weights during training.
- Encode categorical variables with one-hot encoding (unordered categories), label/ordinal encoding (ordered categories), or learned embeddings (high cardinality) so that categorical values become usable by numeric algorithms.
- Handle missing values with imputation (mean, median, or model-based) or informed removal; handle outliers with capping, transformation, or removal depending on whether they are errors or signal.
- AWS Glue runs serverless Spark-based ETL at scale for ML data prep, and Amazon Athena runs serverless SQL queries directly over S3 data using a pay-per-scanned-data model.
- SageMaker Ground Truth assists data labeling (manual, automated, or human-in-the-loop); label quality directly affects model accuracy.
- aws s3 cp ./data s3://my-bucket/train/ --recursive copies a local folder, including all subfolders, to S3; aws s3api head-object retrieves an object's metadata without downloading it; and aws s3 ls --recursive --summarize lists objects and totals their count and size.
- ShardedByS3Key evenly distributes distinct S3 object files across multiple training instances, while FullyReplicated copies the entire dataset to every instance.
- Pipe input mode streams data directly from S3 into the training container without first downloading to the EBS volume, reducing startup time and disk needs; compacting many small JSON files into larger Parquet files partitioned by date improves training throughput and lowers Athena query cost.
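The leakage rule above (fit on the training split only, then apply the fitted transform elsewhere) can be sketched in plain Python. This is a minimal illustration using only the standard library, not a SageMaker or Data Wrangler API; the function names and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def fit_standard_scaler(train_values):
    """Learn the mean and standard deviation from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        # A constant feature would otherwise cause division by zero.
        sigma = 1.0
    return mu, sigma


def apply_standard_scaler(values, mu, sigma):
    """Apply previously fitted statistics; never refit on validation or test."""
    return [(v - mu) / sigma for v in values]


# Hypothetical single-feature splits, used only for illustration.
train = [10.0, 12.0, 14.0, 16.0]
validation = [11.0, 15.0]
test = [13.0, 40.0]  # includes a value far outside the training range

# Statistics come from the training split and are reused unchanged.
mu, sigma = fit_standard_scaler(train)
train_scaled = apply_standard_scaler(train, mu, sigma)
validation_scaled = apply_standard_scaler(validation, mu, sigma)
test_scaled = apply_standard_scaler(test, mu, sigma)
```

Because the test split never influences mu or sigma, the extreme test value shows up as a large scaled value instead of silently shifting the statistics the model was trained against.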
Domain 2: ML Model Development
- Amazon SageMaker is AWS's end-to-end managed ML platform covering data labeling, feature engineering, notebooks, distributed training, tuning, evaluation, model registry, and scalable inference; built-in algorithms train and deploy without managing servers.
- SageMaker Automatic Model Tuning (AMT) runs parallel/sequential training jobs using Bayesian optimization or random search over a defined hyperparameter space to maximize or minimize a target metric.
- In AMT, set HyperParameterTuningJobObjective Type to Minimize for losses/error metrics and Maximize for accuracy/F1; define search spaces with ContinuousParameter(0.001, 0.2), IntegerParameter, and CategoricalParameter(['relu','tanh']).
- Use precision (fraction of positive predictions correct), recall (fraction of actual positives captured), and F1 (their harmonic mean) for binary classification; AUC-ROC summarizes discrimination across thresholds.
- For imbalanced binary classification prefer F1 or AUC-PR (precision-recall) over plain accuracy, which is misleading when one class dominates.
- For regression use RMSE or MAE (error in the target's units) or R-squared (proportion of variance explained); RMSE penalizes large errors more heavily than MAE.
- Overfitting shows low training error but high validation/test error; combat it with more representative data, regularization (L1/L2, dropout), early stopping, and cross-validation.
- Underfitting shows high error on both training and validation data, indicating the model is too simple or undertrained; address it with more capacity, features, or training time.
- Regularization reduces overfitting: L1 (Lasso) drives weights to zero for feature selection, L2 (Ridge) shrinks weights smoothly, and dropout randomly disables neurons during training.
- k-fold cross-validation rotates the held-out validation fold across k folds, training on the remaining folds each time, to give a more robust generalization estimate than a single split, at higher compute cost.
- Transfer learning fine-tunes a large pre-trained model on a smaller task-specific dataset, achieving strong results with less data and compute than training from scratch.
- SageMaker Clarify detects pre-training data bias (e.g., class imbalance) and post-training prediction bias (e.g., disparate impact), and computes SHAP-based feature attributions for explainability.
- SageMaker Model Registry versions models, tracks lineage, and manages an approval status (PendingManualApproval, Approved, Rejected) to gate governed deployment.
- Amazon Bedrock serves foundation models via API for generative AI workloads.
- Submit a SageMaker training job with sagemaker_client.create_training_job(...), specifying AlgorithmSpecification.TrainingImage; with the SageMaker Python SDK, set instance_count=4 on the Estimator for distributed training.
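To see why plain accuracy misleads on imbalanced data, the precision, recall, and F1 definitions above can be computed by hand. The sketch below uses only the standard library; the function name and the label arrays are hypothetical examples, not output from any real model.

```python
def binary_classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a ratio is undefined (no predicted or actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced labels: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Model A always predicts the majority class.
always_negative = [0] * 100
metrics_a = binary_classification_metrics(y_true, always_negative)

# Model B raises 6 false alarms but catches 4 of the 5 positives.
model_b = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1
metrics_b = binary_classification_metrics(y_true, model_b)
```

Model A scores higher on accuracy (0.95 versus 0.93) while finding no positives at all, whereas F1 ranks Model B clearly ahead, which is the behavior the guidance on imbalanced classification relies on.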
Domain 3: Deployment and Orchestration
- SageMaker real-time endpoints serve synchronous, low-latency predictions over a persistent HTTPS endpoint via the InvokeEndpoint API; suited to interactive applications needing millisecond responses.
- SageMaker Batch Transform scores large datasets in S3 offline, writes results back to S3, and terminates compute when done, avoiding the cost of a persistent endpoint; start it with aws sagemaker create-transform-job.
- SageMaker Asynchronous Inference queues requests, supports large payloads (up to 1 GB) and long processing times, writes results to S3, and can scale to zero when idle.
- SageMaker Serverless Inference provisions compute per request and scales to zero between requests, making it cost-effective for intermittent or bursty traffic with no charges when idle.
- SageMaker Multi-Model Endpoints host many models behind one endpoint, loading them on demand to share infrastructure and reduce cost when individual models receive sparse traffic.
- Before serving traffic you must create-model (container + artifacts + role) and then create-endpoint-config; the endpoint is then created/updated from that config.
- Production variants with traffic weights on a single endpoint enable A/B testing of two model versions, splitting live traffic by configured weight.
- Blue/green deployment with traffic shifting moves traffic from old to new; in BlueGreenUpdatePolicy the Linear routing type shifts a fixed percentage per step, while Canary shifts a small first batch then the remainder.
- Invoke a deployed model with aws sagemaker-runtime invoke-endpoint specifying --endpoint-name, --content-type, and --body; apply a new config to a live endpoint with aws sagemaker update-endpoint.
- SageMaker Pipelines orchestrates ML workflow steps (processing, training, evaluation, registration, deployment) as a versioned directed acyclic graph with full lineage tracking.
- Build CI/CD MLOps with SageMaker Pipelines plus the Model Registry, integrated with CodePipeline and CodeBuild for repeatable, automated promotion of approved models.
- Trigger retraining when new data arrives, on a schedule, or when drift/performance degradation is detected; an EventBridge rule can launch a SageMaker Pipeline as its target via aws events put-targets.
- Configure endpoint auto scaling against the scalable dimension sagemaker:variant:DesiredInstanceCount, using a target-tracking policy on a metric such as SageMakerVariantInvocationsPerInstance.
- Set the auto scaling minimum capacity high enough to protect p99 latency during spikes and allow scale-in during low traffic to control cost.
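The endpoint auto scaling bullets above can be sketched as two request payloads. The dictionaries below are intended to mirror the keyword arguments of the Application Auto Scaling operations register_scalable_target and put_scaling_policy as exposed by boto3; the helper name, endpoint name, variant name, capacities, and target value are hypothetical, and the sketch only builds the payloads without calling AWS.

```python
def build_autoscaling_requests(endpoint_name, variant_name,
                               min_capacity, max_capacity,
                               target_invocations_per_instance):
    """Build request payloads for target-tracking endpoint auto scaling."""
    if min_capacity > max_capacity:
        raise ValueError("min_capacity must not exceed max_capacity")

    # Fields shared by both requests identify the endpoint variant to scale.
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }

    # Payload that sets the allowed instance-count range.
    register_target = {
        **common,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }

    # Payload that tracks invocations per instance toward a target value.
    scaling_policy = {
        **common,
        "PolicyName": f"{endpoint_name}-{variant_name}-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType":
                    "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return register_target, scaling_policy


# Hypothetical endpoint and capacity values, for illustration only.
register_target, scaling_policy = build_autoscaling_requests(
    "my-endpoint", "AllTraffic", min_capacity=2, max_capacity=6,
    target_invocations_per_instance=100,
)
```

With credentials and an existing endpoint, these payloads could be passed as keyword arguments to a boto3 "application-autoscaling" client. Keeping the minimum above one instance reflects the point about protecting tail latency during spikes.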
Domain 4: Monitoring and Security
- SageMaker Model Monitor captures inference requests/responses (Data Capture), compares live distributions against a training baseline, and emits CloudWatch metrics and violation reports for data-quality, model-quality, bias, and feature-attribution drift.
- Data drift occurs when production input feature distributions diverge from the training distribution due to seasonality or upstream changes, degrading accuracy and signaling the need to retrain.
- Create a monitoring schedule with aws sagemaker create-monitoring-schedule, referencing a config that points to the computed baseline statistics and constraints.
- Attach a least-privilege IAM execution role to training jobs and endpoints granting only the needed S3 actions and bucket prefixes; AWS rotates temporary role credentials so no static keys live in code.
- Attach managed policies with aws iam attach-role-policy --role-name SageMakerExecRole --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess, but prefer scoped custom policies for production least privilege.
- SageMaker writes inference container stdout/stderr to CloudWatch Logs under /aws/sagemaker/Endpoints and training logs under /aws/sagemaker/TrainingJobs; retrieve them with aws logs get-log-events.
- Alarm on endpoint metrics such as ModelLatency and Invocations, which SageMaker publishes to the AWS/SageMaker namespace, using aws cloudwatch put-metric-alarm (note that ModelLatency is reported in microseconds, so set thresholds accordingly).
- Use KMS customer-managed keys (CMKs) to encrypt SageMaker S3 artifacts, training-instance EBS volumes, and SageMaker-managed storage, giving control over key rotation, policies, and audit via CloudTrail.
- Run SageMaker jobs and endpoints in a VPC with private subnets and VPC interface endpoints (PrivateLink) so traffic to S3/SageMaker stays off the public internet.
- create-training-job ... --enable-network-isolation blocks the container's outbound network access, preventing data exfiltration during training.
- Pass --vpc-config '{"Subnets":["subnet-abc"],"SecurityGroupIds":["sg-123"]}' to place jobs in a VPC; combine with S3 bucket policies and VPC endpoint policies restricting access to approved buckets.
- Achieve multi-team isolation by assigning each team a distinct IAM execution role scoped via least-privilege policies to its own S3 prefixes and KMS keys.
- AWS CloudTrail records SageMaker management (control-plane) API activity for audit and compliance, capturing who did what and when.
- Optimize ML cost by right-sizing instances, enabling auto scaling, using serverless/async inference where suitable, and batching predictions.
- Add an evaluation/quality gate in the pipeline that blocks model registration unless metrics meet a threshold on a holdout set.
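The least-privilege guidance above (an execution role limited to the needed S3 actions and bucket prefixes) can be illustrated with a small policy builder. This is a sketch under stated assumptions: the helper name, the statement IDs, the team-a prefix, and the choice of GetObject/PutObject are hypothetical, and a real role would typically also need permissions for logs, KMS, and other services.

```python
import json


def build_prefix_scoped_s3_policy(bucket, prefix):
    """Return an IAM policy document limited to one prefix in one bucket."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so restrict it by prefix.
                "Sid": "ListOnlyTeamPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {
                    "StringLike": {"s3:prefix": [f"{prefix}/*"]}
                },
            },
            {
                # Object reads and writes are limited to keys under the prefix.
                "Sid": "ReadWriteTeamObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and team prefix, for illustration only.
policy = build_prefix_scoped_s3_policy("my-bucket", "team-a/")
policy_json = json.dumps(policy, indent=2)
```

Generating one such document per team, each with its own prefix, is one way to express the multi-team isolation pattern without granting broad managed policies.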
AWS MLA-C01 exam tips
- Match the inference modality to the workload: real-time for low-latency interactive calls, Batch Transform for bulk offline scoring, Asynchronous for large payloads/long jobs, Serverless for intermittent traffic, and Multi-Model Endpoints to consolidate many sparsely used models.
- Watch for data leakage in data-prep questions: scalers, encoders, and imputers must be fit on training data only, then applied to validation and test.
- Default to least privilege and managed credentials: scoped IAM execution roles, KMS CMKs for encryption, VPC + network isolation, and CloudTrail for audit are the recurring 'most secure' answers.
- Pick metrics by problem type and balance: F1/AUC-PR for imbalanced classification, AUC-ROC for general discrimination, and RMSE/MAE/R-squared for regression - never plain accuracy on imbalanced data.
- Know the CLI specifics: create-model and create-endpoint-config precede create-endpoint; invoke-endpoint needs --endpoint-name/--content-type/--body; and EventBridge plus SageMaker Pipelines is the canonical drift-triggered retraining pattern.
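The first tip, matching the inference option to the workload, can be written down as a small decision helper. This is a deliberately simplified sketch: the function name, the flags, and the order in which they are checked are assumptions made for illustration, and real exam scenarios usually weigh additional factors such as cost and latency targets.

```python
def choose_inference_option(*, bulk_offline_dataset=False,
                            large_payload_or_long_running=False,
                            many_sparse_models=False,
                            intermittent_traffic=False):
    """Map coarse workload traits to a SageMaker inference option."""
    if bulk_offline_dataset:
        # Score a whole dataset in S3 and release compute afterwards.
        return "Batch Transform"
    if large_payload_or_long_running:
        # Queue requests and write results to S3.
        return "Asynchronous Inference"
    if many_sparse_models:
        # Share one endpoint across many rarely invoked models.
        return "Multi-Model Endpoint"
    if intermittent_traffic:
        # Scale to zero between bursts of requests.
        return "Serverless Inference"
    # Default: persistent endpoint for low-latency interactive calls.
    return "Real-time endpoint"


# Example scenarios, hypothetical and for illustration only.
nightly_scoring = choose_inference_option(bulk_offline_dataset=True)
video_processing = choose_inference_option(large_payload_or_long_running=True)
per_tenant_models = choose_inference_option(many_sparse_models=True)
internal_tool = choose_inference_option(intermittent_traffic=True)
chat_frontend = choose_inference_option()
```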
Study guide FAQ
What score do I need to pass the MLA-C01 and how is the exam structured?
You need a scaled score of 720 out of 1000. The exam runs 130 minutes and includes a mix of scored and unscored questions across four domains; the unscored items are used to evaluate future questions and do not affect your result. The largest weight is on Data Preparation (28%), followed by Model Development (26%), Monitoring/Security (24%), and Deployment/Orchestration (22%).
How much hands-on experience and what background should I have?
AWS recommends at least one year of hands-on experience with Amazon SageMaker and related AWS ML services, plus general familiarity with the ML lifecycle. You should be comfortable with Python, basic data engineering on S3/Glue/Athena, and the SageMaker SDK and CLI commands for training, tuning, and deploying models.
How much coding and CLI knowledge does the exam expect?
Expect to recognize and reason about SageMaker Python SDK and AWS CLI usage rather than write code from scratch. Know commands like aws sagemaker create-transform-job, aws sagemaker-runtime invoke-endpoint, aws sagemaker update-endpoint, and aws s3 cp --recursive, as well as Estimator parameters such as instance_count and training input settings such as Pipe input mode and the ShardedByS3Key distribution type.
Is this exam about building algorithms or about operationalizing ML on AWS?
It is heavily MLOps and engineering focused. You apply ML concepts (metrics, overfitting, bias) but most questions test how to prepare data, train and tune with SageMaker, deploy via the right inference option, automate with Pipelines and the Model Registry, and monitor and secure models in production - not deriving algorithms by hand.