Prometheus Certified Associate (PCA) Study Guide
The Prometheus Certified Associate (PCA) is a vendor-neutral CNCF exam that validates foundational observability and Prometheus skills, including metrics fundamentals, PromQL, exporters, instrumentation, alerting, and dashboards. It is a 90-minute, multiple-choice exam aimed at developers, SREs, and platform engineers who deploy or query Prometheus and want to demonstrate working monitoring knowledge.
Domain 1: Observability Concepts
- Observability is built on three pillars: metrics (numeric time-series), logs (discrete timestamped event records carrying rich context), and traces (following one request across services to locate latency or errors).
- Prometheus is a metrics-based, time-series monitoring and alerting system that pulls (scrapes) metrics over HTTP from target endpoints, conventionally /metrics.
- A time series is uniquely identified by its metric name plus its full set of label key/value pairs; changing any label value creates a new, distinct series.
- Cardinality is the number of distinct time series a metric produces across all its label-value combinations; high-cardinality labels (user IDs, request IDs, raw URLs) explode memory and TSDB series counts.
- An SLI (Service Level Indicator) is a measured metric such as availability or latency; an SLO (Service Level Objective) is the target you commit to for that SLI; the error budget is the unreliability the SLO allows (e.g., 0.5% of requests for a 99.5% SLO) and can be spent before the SLO is breached.
- The USE method inspects Utilization, Saturation, and Errors for each resource; it is best suited to resources like CPU, memory, disk, and network.
- The RED method tracks Rate (requests/sec), Errors, and Duration (latency); it is best suited to request-driven services.
- Saturation measures how full a resource is or how much headroom remains (e.g., queue depth, CPU run-queue length), and is a leading indicator of impending overload.
- For high-cardinality per-event data such as per-request analysis, a tracing system is more appropriate than encoding the data as metric labels.
- When a target disappears or stops returning a series, Prometheus marks the series stale (inserts a staleness marker) so it returns no data rather than carrying the last value forever.
- Summary quantiles are computed client-side per instance and are not additive, so they cannot be mathematically aggregated across instances; histogram buckets can be summed across instances and quantiles computed server-side with histogram_quantile.
- Labels are stored in the TSDB and exposed through the query API, so putting PII into metric names or labels both retains sensitive data broadly and explodes cardinality; minimize collection of sensitive data.
- Dimensional metrics let you slice and aggregate one metric by labels (per service, instance, status code) rather than creating a separate metric per attribute combination.
- Metrics are quantitative measurements collected and analyzed over time, and are the most storage-efficient pillar for trends and alerting compared to logs and traces.
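The cardinality and error-budget bullets above come down to simple arithmetic. The following is a minimal sketch, not part of any Prometheus API: the function names and the label counts are hypothetical and chosen only for illustration.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: the product of the number
    of distinct values each label can take (hypothetical helper)."""
    return prod(label_value_counts.values())


def allowed_failures(slo_target, total_requests):
    """Error budget expressed as a number of requests that may fail."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(slo_target, total_requests, failed_requests):
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    return failed_requests / allowed_failures(slo_target, total_requests)


if __name__ == "__main__":
    bounded = {"method": 5, "status": 10, "handler": 40}
    print(worst_case_series(bounded))  # 2000

    # Adding an unbounded label such as a user ID multiplies the bound.
    unbounded = dict(bounded, user_id=100_000)
    print(worst_case_series(unbounded))  # 200000000

    # A 99.5% SLO over one million requests allows roughly 5000 failures.
    print(allowed_failures(0.995, 1_000_000))
    print(budget_consumed(0.995, 1_000_000, 2_500))
```

The product is a worst-case bound; real series counts are lower when not every label combination occurs, but a single unbounded label still dominates the total.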
Domain 2: Prometheus Fundamentals
- Prometheus uses a pull model: it scrapes HTTP /metrics endpoints exposed by instrumented apps or exporters, so targets do not need to know any collector address.
- The built-in up metric is 1 if the last scrape of a target succeeded and 0 if it failed, making it the primary target-health signal.
- Key startup flags: --config.file=/etc/prometheus/prometheus.yml sets the config, --storage.tsdb.retention.time=15d sets local retention (15d is the default), and --web.enable-lifecycle enables the /-/reload and /-/quit HTTP endpoints.
- With --web.enable-lifecycle on, reload config without restarting via curl -X POST http://localhost:9090/-/reload; validate config first with promtool check config prometheus.yml.
- global.scrape_interval (commonly 15s) sets the default scrape frequency; shorter intervals improve resolution and detection speed but increase storage, CPU, and network cost.
- Dynamic targets come from service discovery (kubernetes_sd_configs, consul_sd_configs, ec2_sd_configs, etc.) combined with relabel_configs in the scrape job to filter, rename, and set labels.
- For long-term and horizontally scalable storage, Prometheus supports remote_write to backends such as Thanos (via its Receive component), Cortex, or Mimir, which can then be queried through a single global endpoint.
- Federation pulls a small, aggregated subset of metrics (e.g., precomputed recording rules) from one Prometheus into a higher-level Prometheus; it is not a substitute for full long-term storage of raw samples.
- High availability is achieved by running two or more identically configured Prometheus replicas scraping the same targets and evaluating the same rules; Alertmanager deduplicates their alerts.
- The Pushgateway is only for short-lived or batch jobs that cannot be scraped; overusing it makes it a single point of failure, loses the per-target up signal, and retains stale metrics after instances die, masking crashes.
- Use recording rules to precompute the aggregates you actually query from raw series, reducing query cost on dashboards and alerts.
- Set sample_limit (and label/series limits) on a scrape job so an over-limit scrape is rejected entirely; when a scrape exceeds the limit the whole scrape fails and the target is marked up=0 with none of its samples ingested.
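- If honor_labels is true, label values exposed by the target win on a name conflict and Prometheus does not overwrite them with the configured job/instance labels; with the default (false), the conflicting scraped labels are renamed with an exported_ prefix instead.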
- When a counter decreases, rate() and increase() treat the drop as a counter reset and assume the counter restarted from zero rather than counting a negative value.
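The counter-reset behaviour described above can be illustrated with a simplified model. This sketch is an assumption-laden approximation rather than Prometheus's actual implementation: it deliberately omits the extrapolation to the window edges that rate() and increase() perform, and the function names are hypothetical.

```python
def simple_increase(samples):
    """Reset-aware increase over a list of (timestamp, value) samples.

    Simplified: no extrapolation to the window boundaries.
    Returns None when fewer than two samples are available, mirroring
    the fact that rate() yields no result in that case.
    """
    if len(samples) < 2:
        return None
    total = 0.0
    prev = samples[0][1]
    for _, cur in samples[1:]:
        if cur < prev:
            # Counter went down: treat it as a restart from zero,
            # so the whole current value counts as new increase.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


def simple_rate(samples):
    """Per-second average rate between the first and last sample."""
    inc = simple_increase(samples)
    if inc is None:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return inc / elapsed


if __name__ == "__main__":
    # The counter resets between t=15 and t=30 (15 -> 3).
    window = [(0, 10), (15, 15), (30, 3), (45, 8)]
    print(simple_increase(window))  # 13.0  (5 + 3 + 5)
    print(simple_rate(window))
```

Without reset handling, a naive last-minus-first calculation on the same window would report a negative increase of -2.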
Domain 3: PromQL
- Filter a series with label matchers: = (exact), != (not equal), =~ (regex match), and !~ (regex not match), e.g., http_requests_total{job="api",status=~"5.."}.
- rate() computes the per-second average rate of a counter over a range, is counter-reset aware, and extrapolates to the window edges; it is the standard choice for smoothed dashboards and alert thresholds.
- irate() computes the instantaneous per-second rate from only the last two samples in a range and is best for fast-moving, volatile graphs, not alerting.
- Choose a rate() range window at least about 4x the scrape_interval so it always contains multiple samples; with fewer than two samples in the window rate() returns no result.
- increase() gives the total rise of a counter over a range (rate multiplied by the window) and is convenient for human-readable counts.
- Aggregation operators (sum, avg, min, max, count, stddev, topk, bottomk, quantile, count_values, group) collapse an instant vector, optionally grouped with by(...) or without(...).
- Compute a 95th-percentile latency from a histogram with histogram_quantile(0.95, sum by (le) (rate(metric_bucket[5m]))); always preserve the le label when aggregating buckets.
- Compute an error ratio with sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])).
- Vector matching joins by identical label sets; when one side has extra labels use on()/ignoring() together with group_left/group_right for many-to-one or one-to-many matches, e.g., rate(app_requests_total[5m]) * on(service) group_left(team) service_info.
- predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0 forecasts whether a resource (such as free disk) will be exhausted within a future window based on the recent trend.
- label_replace(v, "new_label", "$1", "src_label", "(.*)") creates or rewrites a label using a regex capture against an existing label.
- absent(metric{job="x"}) and absent_over_time(...) return a result only when the matching series is missing, which is how you alert on data that has disappeared entirely.
- topk(3, ...) and sort_desc(...) rank series; for example sort_desc(sum by (job)(rate(http_requests_total[5m]))) orders jobs by request rate, and wrapping the same aggregation in topk(3, ...) keeps only the three busiest jobs.
- An unexpected 'many-to-many matching not allowed' or 'multiple matches for labels' error usually means the two sides do not pair up one-to-one: restrict the matching labels with on()/ignoring() and, for many-to-one or one-to-many joins, add group_left/group_right.
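To see why histogram buckets aggregate across instances while summary quantiles do not, it can help to model what histogram_quantile does. The sketch below is a simplified approximation of the linear-interpolation idea, not the server's exact code: it assumes classic cumulative buckets with positive upper bounds, at least one finite bucket plus +Inf, and 0 < q <= 1. The helper names and sample data are hypothetical.

```python
import math
from collections import defaultdict


def sum_buckets(*instances):
    """Analogue of sum by (le): add cumulative bucket counts per le."""
    merged = defaultdict(float)
    for buckets in instances:
        for le, count in buckets.items():
            merged[le] += count
    return dict(merged)


def histogram_quantile(q, buckets):
    """Estimate a quantile from {le: cumulative_count} buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    ordered = sorted(buckets.items())
    total = ordered[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for le, count in ordered:
        if count >= rank:
            if math.isinf(le):
                # Quantile falls in +Inf: report the highest finite bound.
                return prev_le
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return math.nan


if __name__ == "__main__":
    instance_a = {0.1: 30, 0.5: 50, 1.0: 55, math.inf: 55}
    instance_b = {0.1: 20, 0.5: 40, 1.0: 45, math.inf: 45}
    merged = sum_buckets(instance_a, instance_b)
    print(histogram_quantile(0.95, merged))
```

Because the merged counts are plain sums keyed by le, dropping the le label during aggregation would leave nothing to interpolate over, which is why the label must be preserved.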
Domain 4: Instrumentation and Exporters
- Direct instrumentation uses a Prometheus client library (Go, Java, Python, etc.) to define and expose metrics on a /metrics endpoint that Prometheus scrapes.
- The four core metric types are Counter (monotonically increasing, e.g., requests_total), Gauge (goes up and down, e.g., temperature), Histogram (bucketed observations with _bucket/_sum/_count), and Summary (client-side quantiles).
- Prefer histograms over summaries when you need to aggregate across instances, because histogram bucket counts are additive and percentiles are computed at query time with histogram_quantile.
- Follow naming conventions: use base units (seconds, bytes), append the _total suffix to counters, and keep a consistent label set, since a metric name must always carry the same label names or collection errors occur.
- Avoid high-cardinality labels such as user IDs, request IDs, or raw URLs; normalize URLs into a bounded route/handler label like /users/:id to prevent a cardinality explosion and possible OOM.
- node_exporter exposes host/OS metrics (CPU, memory, disk, filesystem, network) and listens on :9100 by default.
- blackbox_exporter probes endpoints over HTTP, HTTPS, TCP, ICMP, and DNS; Prometheus scrapes the exporter's /probe endpoint, passing the real target in the ?target= URL parameter and selecting a module with &module= (e.g., http_2xx).
- For blackbox/probe-style exporters, relabeling copies the target into the __param_target label (which becomes ?target=) and points __address__ at the exporter so the exporter, not the target, is scraped.
- cAdvisor (exposed via the kubelet) provides per-container resource metrics, and kube-state-metrics exposes Kubernetes object state; both are standard in Kubernetes monitoring.
- snmp_exporter monitors network devices: deploy it, configure it to poll the device via SNMP, and have Prometheus scrape the exporter.
- Validate exposition-format output with promtool check metrics, typically by piping a scrape, e.g., curl -s localhost:9100/metrics | promtool check metrics.
- In Kubernetes, use kubernetes_sd_configs with role: pod plus relabel_configs to act on pod annotations and labels for discovery and scrape configuration.
- Use relabel_configs to set stable job and instance labels and meaningful labels from service discovery, avoiding volatile values like ephemeral pod IPs as the instance label.
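- Targets dropped by relabel_configs (action: drop, or not matched by action: keep) are removed after service discovery and before scraping, so they are never contacted and consume no scrape resources; metric_relabel_configs, by contrast, runs after the scrape and filters individual series before ingestion.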
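The probe-exporter relabeling described above is easier to follow as a sequence of label rewrites. The sketch below simulates the three usual steps in plain Python rather than in prometheus.yml. The exporter hostname is hypothetical, 9115 is assumed as the blackbox_exporter port, and modelling the module as a __param_module label is a simplification (scrape configs more commonly set it under params).

```python
from urllib.parse import urlencode


def apply_probe_relabeling(discovered_address, exporter_address, module):
    """Simulate the common three-step relabeling for probe exporters."""
    labels = {
        "__address__": discovered_address,  # what discovery returned
        "__param_module": module,           # becomes ?module=
    }
    # Step 1: copy the real target into the URL parameter label.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the real target visible as the instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the scrape at the exporter, not the target.
    labels["__address__"] = exporter_address
    return labels


def scrape_url(labels, metrics_path="/probe"):
    """Build the URL Prometheus would request from the final labels."""
    prefix = "__param_"
    params = {
        name[len(prefix):]: value
        for name, value in labels.items()
        if name.startswith(prefix)
    }
    return f"http://{labels['__address__']}{metrics_path}?{urlencode(params)}"


if __name__ == "__main__":
    labels = apply_probe_relabeling(
        "https://example.com",
        "blackbox.example.internal:9115",  # hypothetical exporter address
        "http_2xx",
    )
    print(labels["instance"])
    print(scrape_url(labels))
```

The order of the steps matters: the target has to be copied out of __address__ before __address__ is overwritten with the exporter's address.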
Domain 5: Alerting and Dashboards
- Alerting rules live in Prometheus rule files and are evaluated by Prometheus itself; when an expression matches, Prometheus sends alerts to Alertmanager.
- Alertmanager handles routing to receivers (email, Slack, PagerDuty, webhook), plus grouping, deduplication, silences, and inhibition; Prometheus only generates the alerts.
- The for clause requires the alert condition to stay true continuously for a sustained duration before the alert transitions from pending to firing, which reduces flapping.
- While an alert is pending, any single evaluation that returns no matching sample resets it to inactive and restarts the for timer.
- Recording rules precompute frequently used or expensive PromQL expressions into new time series, e.g., a success-ratio series, making dashboards and alerts faster and consistent.
- evaluation_interval controls how often both recording and alerting rules are evaluated; it is configured globally (independent of scrape_interval) and can be overridden per rule group with the group's interval field.
- Test rules offline with promtool test rules tests.yml and validate rule syntax with promtool check rules.
- Connect Prometheus to Alertmanager with an alerting: block containing alertmanagers: and static_configs (or service discovery) in prometheus.yml; start Alertmanager with --config.file=/etc/alertmanager/alertmanager.yml.
- group_wait sets how long to buffer initial alerts in a new group before the first notification; group_interval governs follow-up sends for new alerts in an existing group; repeat_interval is how long to wait before re-sending a still-firing, unchanged alert.
- Inhibition (inhibit_rules) automatically mutes lower-severity dependent alerts while a related higher-severity source alert is firing, based on matching labels.
- Prefer symptom-based alerts tied to SLOs (error-rate or latency burn rate) for paging, and reserve cause-based alerts for diagnostics to reduce noise.
- Multi-window multi-burn-rate alerting pairs a fast-burn alert (short window, high burn-rate threshold) to page on severe budget consumption with a slow-burn alert for gradual erosion.
- Alert on missing data with expr: absent(up{job="x"}) and an appropriate for, so you are notified when an entire series stops reporting.
- Use a routing tree with child routes matching on labels (e.g., team) plus a top-level default receiver, and aggregate at the service level to avoid one alert per instance.
- Grafana is the de facto dashboarding tool for Prometheus, using PromQL queries against Prometheus as a data source.
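The pending-to-firing behaviour of the for clause can be modelled as a small state machine. This is a minimal sketch under stated assumptions: evaluations happen at a fixed interval, each evaluation result is reduced to a boolean, and options such as keep_firing_for are ignored. The function name is hypothetical.

```python
def alert_states(results, eval_interval, for_duration):
    """Return the alert state after each rule evaluation.

    results       -- booleans: did the expression return a sample?
    eval_interval -- seconds between evaluations
    for_duration  -- seconds the condition must hold before firing
    """
    states = []
    active_since = None
    for i, matched in enumerate(results):
        now = i * eval_interval
        if not matched:
            # One empty evaluation resets the alert and the for timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_duration else "pending")
    return states


if __name__ == "__main__":
    # 1m evaluation interval, for: 2m, with one gap at the third evaluation.
    history = [True, True, False, True, True, True]
    print(alert_states(history, eval_interval=60, for_duration=120))
    # ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

In this run the single empty evaluation delays firing by three intervals, which is the flapping-versus-latency trade-off to keep in mind when choosing a for value.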
Prometheus Certified Associate (PCA) exam tips
- Master the difference between rate() (smoothed, counter-reset aware, for alerts and dashboards) and irate() (last two samples, for volatile graphs), and remember a range window should be at least ~4x the scrape_interval so it always holds multiple samples.
- Be fluent in histogram_quantile with sum by (le) for percentiles, and know why histograms aggregate across instances while summary quantiles do not; this distinction is heavily tested.
- Memorize Alertmanager timing semantics: group_wait vs group_interval vs repeat_interval, plus what inhibition, silences, and grouping each do.
- Know the standard exporters and their default ports/roles (node_exporter :9100, blackbox_exporter probe model with ?target=, cAdvisor, kube-state-metrics, snmp_exporter) and the relabeling that wires probe exporters together.
- Practice reading prometheus.yml: scrape_configs, relabel_configs, service discovery, sample_limit, honor_labels, and the lifecycle flags (--web.enable-lifecycle, /-/reload), since config interpretation questions are common.
Study guide FAQ
What format and passing requirements does the PCA exam have?
The PCA is a multiple-choice, online proctored exam lasting 90 minutes. A score of 75% or higher is required to pass. There is no hands-on lab component, but many questions present config snippets or PromQL expressions to interpret.
How much PromQL do I need to know?
PromQL is roughly a quarter of the exam and is the single highest-value area to study. You should be comfortable with label matchers, rate/irate/increase, aggregation with by/without, histogram_quantile, vector matching with on/ignoring and group_left/group_right, and functions like predict_linear, absent, and label_replace.
Do I need to know Grafana and Alertmanager for the exam?
Yes, at a foundational level. Expect questions on Alertmanager routing, grouping, deduplication, inhibition, silences, and notification timing, and on the roles of recording vs alerting rules. Grafana appears as the standard dashboarding layer that queries Prometheus via PromQL.
What are the most common conceptual traps to watch for?
Common traps include confusing SLI/SLO/error budget, assuming summary quantiles can be aggregated across instances (they cannot), misusing the Pushgateway for long-running services, embedding high-cardinality or PII values in labels, and forgetting that the for clause resets if any evaluation returns no data.