Alert configuration should control business behavior, not system structure
Alert configuration should make business behavior reviewable: wording, thresholds, variants, labels, routing, timing, and feedback options. Lifecycle guarantees belong in code.
Hardcoded business behavior slows the system down
Alerting systems change because the business learns.
The first version of a rule is rarely the final version. A threshold is too sensitive. A message is too vague. A variant should apply to one customer segment but not another. A response option should close the alert. A selected answer should add a label, remove a tag, or trigger a follow-up. A schedule should avoid Fridays. A rule should pause during a seasonal window. One owner group needs a lower dispatch limit because its queue is already full.
None of that is unusual. It is normal rule tuning once alerts meet real work. When that behavior is hardcoded, small changes wait behind releases.
A threshold stays too sensitive. A message stays vague. A response option stays wrong. Not because anyone agrees with the current behavior, but because fixing it means touching code.
The alerting system gets less accurate for a stupid reason: it becomes annoying to improve.
Configuration should keep those changes visible, reviewable, and cheap. It should cover business judgment without turning every wording edit into a release.
Not everything should be configurable
The opposite failure is just as common.
A higher-up gets tired of hardcoded rules and decides everything should be configurable. Soon the alerting system has tables for behavior that should have stayed in code, half-documented condition languages, runtime switches that can break contracts, and configuration rows that only one person understands.
That doesn’t remove complexity. It just moves it out of the repo and into data, where it is usually harder to search, test, review, and reason about. Code has types, tests, pull requests, local search, explicit dependencies, and deployment history. Configuration only has those properties when the system is designed to give them back.
Lifecycle behavior should not be casually configurable. The alert state model, candidate retry behavior, audit schema, payload contract, writer operation types, idempotency rules, and failure classification are the parts that keep the system reliable while business behavior changes around them.
The point is not that configuration is good and code is bad. Business behavior belongs in configuration. System structure belongs in code.
Business behavior changes more often than system structure
Business behavior changes because it reflects judgment. Wording, thresholds, routing, variants, labels, schedules, response options, blackout windows, and dispatch limits depend on how the organization wants to handle a situation. Those choices shift as the rule runs, users respond, exceptions appear, and the business decides which signals are actually worth attention.
System structure changes more slowly. A candidate still needs a status. A writer still needs to log side effects. A payload still needs validation before delivery. A retry still needs to distinguish a temporary failure from a permanent rejection. A response still needs to attach to the alert that created it. Those are reliability decisions, not tuning decisions.
Mixing the layers creates drag. Hardcoded thresholds make ordinary tuning too slow. Configurable state transitions make the system fragile. Message templates in code make wording changes expensive. Retry semantics in a table make failures harder to understand when something breaks.
The boundary should follow change frequency and blast radius. Frequent business changes deserve reviewed configuration. Structural guarantees deserve code. Business owners can control the knobs that express judgment: thresholds, wording, timing, routing, and response options. Engineering owns the machinery that keeps the workflow reliable under retries, rejected payloads, missing context, duplicate candidates, and downstream failures.
Templates, labels, variants, and thresholds
Templates belong in configuration when the message is part of business behavior. Body text, section wording, localized labels, answer text, short labels, placeholders, and instructions often change once recipients start using the alert. Those edits should not require a new binary or a delivery layer rewrite.
That does not mean templates should be free-form magic. Allowed variables should be explicit. A template should fail validation when it references a placeholder the payload does not provide. Otherwise a harmless wording edit can ship a broken alert, because apparently strings are where reliability goes to die.
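A minimal sketch of that validation step, assuming templates use Python-style `{name}` placeholders. The allow-list contents and the `template_errors` helper are hypothetical names chosen for illustration, not part of any described system.

```python
from string import Formatter

# Hypothetical allow-list: the placeholders the payload is assumed to provide.
ALLOWED_PLACEHOLDERS = {"account_name", "owner_name", "days_overdue"}


def template_errors(template: str, allowed: set = ALLOWED_PLACEHOLDERS) -> list:
    """Return the problems found in a template; an empty list means it is valid."""
    try:
        used = {
            field_name
            for _, field_name, _, _ in Formatter().parse(template)
            if field_name is not None  # None marks plain text with no placeholder
        }
    except ValueError:
        # Unbalanced braces cannot be parsed at all.
        return ["<malformed template>"]
    # Anything used but not on the allow-list is rejected, sorted for stable output.
    return sorted(used - allowed)
```

Run at save time, a check like this means a wording edit that references an unknown placeholder is rejected before it becomes active, instead of failing at delivery.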
Variants also belong in configuration when they represent business scope. A rule may run differently by product group, account type, customer segment, region, value threshold, eligibility window, or owner group. Those variants should be visible because they define which situations create work.
Labels and tags are configuration when they express downstream business state. Selecting an answer might label an account, mark an opportunity for follow-up, or clear a previous flag. The action should be explicit and tied to the response option. The writer should execute a small set of known operations, not infer side effects from answer text.
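One way to keep the writer on a small set of known operations is to parse each configured action into a closed enum. The sketch below assumes three operation names and a simple row shape; all of them are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    # Hypothetical closed set of side effects the writer knows how to perform.
    ADD_LABEL = "add_label"
    REMOVE_LABEL = "remove_label"
    CREATE_FOLLOW_UP = "create_follow_up"


@dataclass(frozen=True)
class AnswerAction:
    answer_id: str
    operation: Operation
    target: str


def parse_action(row: dict) -> AnswerAction:
    """Turn a configuration row into a typed action.

    Raises ValueError when the row names an operation outside the enum.
    """
    return AnswerAction(
        answer_id=row["answer_id"],
        operation=Operation(row["operation"]),
        target=row["target"],
    )


def apply_action(labels: set, action: AnswerAction) -> set:
    """Apply a label operation to a set of labels and return the new set."""
    if action.operation is Operation.ADD_LABEL:
        return labels | {action.target}
    if action.operation is Operation.REMOVE_LABEL:
        return labels - {action.target}
    # Follow-ups do not change labels in this sketch.
    return labels
```

The configuration row chooses which operation runs for which answer. It cannot introduce a new kind of side effect, because an unknown operation name fails at parse time.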
Response options belong near the same surface. Which answers are visible, which require a note, which close the alert, which create a reminder, and which are available to specific owner groups are business decisions. The system can expose them as configuration while still enforcing the response contract in code.
Thresholds, schedules, dispatch limits, blackout windows, grouping fields, and owner rules belong there for the same reason. They decide when the system creates work, who receives it, and how much pressure one recipient can absorb.
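As a rough illustration, these settings can be modeled as plain typed values that a dispatch check reads. The field names and the `can_dispatch` helper below are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DispatchSettings:
    # Hypothetical configuration values for one rule or owner group.
    blackout_weekdays: frozenset  # Monday=0 ... Sunday=6
    blackout_ranges: tuple        # pairs of (start_date, end_date), inclusive
    max_open_per_owner: int


def can_dispatch(settings: DispatchSettings, day: date, open_for_owner: int) -> bool:
    """Decide whether a new alert may be dispatched to an owner on a given day."""
    if day.weekday() in settings.blackout_weekdays:
        return False
    if any(start <= day <= end for start, end in settings.blackout_ranges):
        return False
    # The owner is at capacity once the open count reaches the limit.
    return open_for_owner < settings.max_open_per_owner
```

Avoiding Fridays, pausing over a seasonal window, or lowering one group's limit then becomes a data change against this shape, with no edit to the check itself.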
Configuration tables as reviewable control surfaces
Configuration tables are useful when they are narrow, typed, validated, and aligned with the alert lifecycle.
A practical configuration surface separates alert settings, variants, scope conditions, templates, allowed placeholders, response options, answer-specific actions, actor restrictions, schedules, repeat rules, dispatch limits, attachments, and blackout windows. Each part should have one job. Each part should be reviewable without opening the scheduler, enrichment logic, or writer.
The separation matters. A threshold change should not touch the audit model. A localized wording change should not touch routing. A response option should be able to attach a controlled writeback without changing enrichment. A schedule change should not require editing detection logic.
Reviewability is what separates configuration from runtime soup. The system should make active rows visible. It should show which variant a rule belongs to, when it was updated, which answer or label it references, which owner group it affects, and which payload fields it depends on.
It should also validate references before bad configuration reaches users. Unknown answer IDs, unsupported attachment types, disallowed placeholders, invalid owner groups, unresolved labels, and missing entity references should fail before they create broken alerts. Configuration is only safe when the system refuses vague instructions.
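A sketch of what reference validation might look like for response-option rows. The row keys (`answer_id`, `owner_group`, `label`) and the function name are illustrative assumptions; a real system would check whichever references its own tables carry.

```python
def validate_response_options(rows, known_answer_ids, known_owner_groups, known_labels):
    """Return human-readable errors; an empty list means the rows may be activated."""
    errors = []
    for index, row in enumerate(rows):
        answer_id = row.get("answer_id")
        if answer_id not in known_answer_ids:
            errors.append(f"row {index}: unknown answer_id {answer_id!r}")

        owner_group = row.get("owner_group")
        if owner_group not in known_owner_groups:
            errors.append(f"row {index}: invalid owner_group {owner_group!r}")

        # A label is optional, but when present it must resolve.
        label = row.get("label")
        if label is not None and label not in known_labels:
            errors.append(f"row {index}: unresolved label {label!r}")
    return errors
```

Collecting every error instead of stopping at the first gives a reviewer the full list of broken rows in one pass.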
Keeping lifecycle logic in code
The alert lifecycle should stay in code because it defines the system’s guarantees.
Candidate creation, claiming work with an execution ID, retry backoff, dead-letter behavior, enrichment suppression, writer invocation, payload validation, audit row construction, and state transitions are structural concerns. They need predictable behavior under failure. They need tests, explicit ownership, and one searchable place where the rules live.
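To make "one searchable place where the rules live" concrete, here is a minimal sketch of a code-owned transition table. The status names and allowed transitions are hypothetical; the point is that they are declared once, in code, and enforced on every change.

```python
from enum import Enum


class CandidateStatus(Enum):
    # Hypothetical candidate states for illustration.
    PENDING = "pending"
    CLAIMED = "claimed"
    DELIVERED = "delivered"
    RETRY_SCHEDULED = "retry_scheduled"
    DEAD = "dead"


# The only place that says which state changes are legal.
ALLOWED_TRANSITIONS = {
    CandidateStatus.PENDING: {CandidateStatus.CLAIMED},
    CandidateStatus.CLAIMED: {
        CandidateStatus.DELIVERED,
        CandidateStatus.RETRY_SCHEDULED,
        CandidateStatus.DEAD,
    },
    CandidateStatus.RETRY_SCHEDULED: {CandidateStatus.CLAIMED, CandidateStatus.DEAD},
    CandidateStatus.DELIVERED: set(),  # terminal
    CandidateStatus.DEAD: set(),       # terminal
}


class InvalidTransition(Exception):
    """Raised when code attempts a state change the lifecycle does not allow."""


def transition(current: CandidateStatus, target: CandidateStatus) -> CandidateStatus:
    """Return the target status if the move is legal, otherwise raise."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

Because the table is code, a change to it goes through tests and review, and no configuration row can reopen a delivered candidate by accident.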
Payload contracts follow the same rule. A writer should accept a known shape and map it into the external business system. It should not discover at runtime that a configuration row invented a new operation type. An enrichment service should know how to build message sections, questions, filters, attachments, and actions. Configuration can choose values inside the contract. It should not rewrite the contract.
Some parameters can sit around structural logic without owning it. Maximum attempts, timeout overrides, dispatch limits, schedules, and repeat windows can be configurable. The behavior itself should remain code-owned. A table can say a rule repeats after seven days. Code should decide how that interacts with alert history, reminders, pending candidates, suppression, and audit.
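The split between a configured value and code-owned behavior might look like the sketch below. `RepeatConfig`, `AlertHistory`, and the specific precedence of the checks are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RepeatConfig:
    # The only thing the table controls in this sketch.
    repeat_after_days: int


@dataclass(frozen=True)
class AlertHistory:
    # Hypothetical summary of state the code already tracks.
    last_sent_at: Optional[datetime]
    has_open_alert: bool
    has_pending_candidate: bool
    suppressed: bool


def may_repeat(config: RepeatConfig, history: AlertHistory, now: datetime) -> bool:
    """Code decides how the configured window interacts with existing state."""
    if history.suppressed or history.has_open_alert or history.has_pending_candidate:
        return False
    if history.last_sent_at is None:
        return True
    return now - history.last_sent_at >= timedelta(days=config.repeat_after_days)
```

Changing seven days to fourteen is a configuration edit. Changing whether an open alert blocks a repeat is a code change, with the review that implies.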
Failure handling needs the same discipline. Configuration can enable a dispatch lane or set a timeout. Code should decide what counts as transient failure, what counts as permanent rejection, when a candidate is dead, and what must be logged before the request returns. Those decisions are the difference between an alerting system and a scheduled notification script with better branding.
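A minimal sketch of that division, assuming the downstream system answers over HTTP. The mapping of status codes to failure kinds is one plausible choice, not a rule from the source, and `max_attempts` stands in for the configurable parameter.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


def classify_failure(http_status: Optional[int]) -> FailureKind:
    """Classify a failed delivery. None means no response was received at all."""
    if http_status is None:
        return FailureKind.TRANSIENT  # timeout or connection error
    if http_status in (408, 429) or http_status >= 500:
        return FailureKind.TRANSIENT  # the downstream may recover
    return FailureKind.PERMANENT      # e.g. a rejected payload will not fix itself


def next_step(kind: FailureKind, attempt: int, max_attempts: int) -> str:
    """Return "retry" or "dead". max_attempts comes from configuration."""
    if kind is FailureKind.PERMANENT:
        return "dead"
    if attempt >= max_attempts:
        return "dead"
    return "retry"
```

Configuration supplies the number. Code owns the meaning of transient, permanent, and dead.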
The boundary that prevents chaos
Configure the business-facing parts: wording, labels, variants, thresholds, scope conditions, routing rules, dispatch timing, per-owner limits, grouping fields, blackout windows, attachments, reminder eligibility, response options, and controlled writeback actions attached to those responses.
Keep the structural system in code: lifecycle, state transitions, candidate processing, retry classification, idempotency, payload validation, enrichment contracts, writer behavior, audit schema, and the small set of allowed side-effect operations.
That boundary lets the alerting system change without becoming arbitrary. Business behavior can move quickly because it is configuration. Reliability can stay stable because the structural rules are code.
Configuration should make common business changes cheap. Code should keep the workflow from collapsing into a pile of switches no one can safely touch.