Stack: Ruby 3+, Rails 7+
Audience: Backend engineers building or maintaining production-grade Rails services
Goal: Add real-time observability and on-call alerting to a critical business process
Part 3: Hooking It All Together — Rake Task + Cron
3.1 Rake Task
Create lib/tasks/billing.rake:
```ruby
namespace :billing do
  desc "Run billing health check: emit Datadog metrics and alert if unhealthy"
  task health_check: :environment do
    Monitoring::BillingHealthCheck.new(
      billing_week: BillingWeek.current
    ).run
  end
end
```
Run it manually:
```bash
bundle exec rake billing:health_check
```
3.2 Cron Script
Create scripts/cron/billing_health_check.sh:
```bash
#!/bin/bash
source /apps/myapp/current/scripts/env.sh
bundle exec rake billing:health_check
```
Using Healthchecks.io (or similar) to wrap the cron gives you a second layer of alerting: if the cron doesn’t ping within the expected window, you get an alert – even if the app never starts.
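As a sketch of that ping pattern, here is a hypothetical `hc_ping_url` helper. Healthchecks.io's ping endpoints are `https://hc-ping.com/<uuid>`, with `/start` and `/fail` suffixes for run-started and run-failed signals; the wrapper shape below is an illustration, not a prescribed implementation:

```ruby
require 'uri'

# Hypothetical helper: build the Healthchecks.io ping URL for a check UUID.
# '/start' signals the run began; no suffix signals success; '/fail' signals failure.
def hc_ping_url(uuid, suffix = '')
  URI("https://hc-ping.com/#{uuid}#{suffix}")
end

# In the cron wrapper you would ping around the rake task, e.g.:
#   Net::HTTP.get_response(hc_ping_url(uuid, '/start'))
#   ... bundle exec rake billing:health_check ...
#   Net::HTTP.get_response(hc_ping_url(uuid))          # success
#   Net::HTTP.get_response(hc_ping_url(uuid, '/fail')) # failure
```

If the success ping never arrives within the check's grace period, Healthchecks.io alerts you, independently of the app itself.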
3.3 Crontab Entry
```bash
# Run billing health check every Thursday at 5:30 AM
30 5 * * 4 /apps/myapp/current/scripts/cron/billing_health_check.sh
```
⚠️ Important for managed deployments: If your crontab is version-controlled but not auto-deployed (e.g., Capistrano without cron management), changes to the file in your repo do not automatically update the server. Always verify with `crontab -l` after deploying.
Part 4: Building the Datadog Dashboard
Once metrics are flowing, set up a dashboard for at-a-glance visibility.
4.1 Create the Dashboard
- Datadog → Dashboards → New Dashboard
- Name it: “Billing Health Monitor”
- Click + Add Widgets
4.2 Add Timeseries Widgets
For each metric, add a Timeseries widget:
| Widget title | Metric | Visualization |
|---|---|---|
| Unbilled Orders | billing.unbilled_orders | Line chart |
| Missing Billing Records | billing.missing_billing_records | Line chart |
| Failed Charges | billing.failed_charges | Line chart |
Widget configuration:
- Graph: select metric → `billing.unbilled_orders`
- Display as: Line
- Timeframe: Set to “Past 1 Week” or “Past 1 Month” after data starts flowing (not “Past 1 Hour” which shows nothing between weekly runs)
4.3 Add Reference Lines (Optional but Useful)
For the unbilled orders widget, add a constant line at your alert threshold:
- In the widget editor → Markers → Add marker at `y = 10` (your `BILLING_UNBILLED_THRESHOLD`)
- Color it red to make the threshold visually obvious
4.4 Where to Find Your Custom Metrics
- Metric Explorer: app.datadoghq.com/metric/explorer — type `billing.` to autocomplete and graph any metric
- Metric Volume: app.datadoghq.com/metric/volume — confirms Datadog has received the metric (appears within 2-5 minutes of first emission)
Part 5: Testing the Integration End-to-End
5.1 Test Datadog Metrics (no alerts, safe in any env)
```ruby
# Rails console
require 'datadog/statsd'

host = ENV.fetch('DD_AGENT_HOST', '127.0.0.1')
statsd = Datadog::Statsd.new(host, 8125)
statsd.gauge('billing.unbilled_orders', 0)
statsd.gauge('billing.missing_billing_records', 0)
statsd.gauge('billing.failed_charges', 0)
statsd.close
puts "Sent — check /metric/explorer in Datadog in ~2-3 minutes"
```
5.2 Test PagerDuty (staging)
```ruby
# Rails console — staging
# First, verify the key exists:
Rails.application.credentials[:staging][:pagerduty_billing_integration_key].present?

# Then trigger a test incident:
svc = Monitoring::BillingHealthCheck.new(billing_week: BillingWeek.current)
svc.send(:trigger_pagerduty, "TEST: Billing health check — staging validation #{Time.current}")

# Remember to resolve the incident in the PagerDuty UI immediately after!
```
5.3 Test PagerDuty (production) — Preferred Method
Use PagerDuty’s built-in test instead of triggering from code:
- PagerDuty → Services → Billing Pipeline → Integrations
- Find the integration → click “Send Test Event”
This fires through the same pipeline without touching your app or risking a real alert.
5.4 Test PagerDuty (production) — via Rails Console
If you must test via code in production, use a unique dedup key so it doesn’t collide with real billing alerts, and coordinate with your on-call engineer first:
```ruby
svc = Monitoring::BillingHealthCheck.new(billing_week: BillingWeek.current)

Pagerduty::Wrapper.new(
  integration_key: svc.send(:pagerduty_integration_key)
).client.incident("billing-health-test-#{Time.current.to_i}").trigger(
  summary: "TEST ONLY — please ignore — integration validation",
  source: "rails-console",
  severity: "critical"
)
```
5.5 Test the Full Service Class (production, after billing has run)
Once billing has completed successfully for the week, all counts will be 0 and no PagerDuty alert will fire:
```ruby
result = Monitoring::BillingHealthCheck.new(billing_week: BillingWeek.current).run
puts result
# => { unbilled_orders_count: 0, missing_billing_records_count: 0, failed_charges_count: 0, ... }
```
Common Gotchas
1. StatsD is Fire-and-Forget
UDP has no acknowledgment. If the agent isn’t running, your statsd.gauge() calls return normally with no error. Always verify the agent is reachable by checking for your metric in the Datadog UI after sending — don’t rely on exception-free code as proof of delivery.
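A quick demonstration of this with Ruby's standard `UDPSocket` (port 9999 here is an arbitrary port assumed to have no listener, purely for illustration):

```ruby
require 'socket'

# A UDP datagram "sends" successfully even when nothing is listening on the
# destination port: the OS reports bytes handed to the network stack, not
# delivery. This is exactly why a down Datadog agent never raises in app code.
sock = UDPSocket.new
payload = 'billing.unbilled_orders:0|g'          # DogStatsD gauge wire format
bytes = sock.send(payload, 0, '127.0.0.1', 9999) # port with no listener
sock.close
bytes # equals payload.bytesize, even though nothing received the datagram
```

The send call reports success regardless, so the only real confirmation of delivery is the metric showing up in Datadog.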
2. Metric Volume vs Metric Explorer
- Metric Volume (`/metric/volume`): Confirms Datadog received the metric. Good for first-time setup verification.
- Metric Explorer (`/metric/explorer`): Lets you actually graph and analyze the metric over time. This is where you do your monitoring work.
3. Rescue Around Everything
Both `emit_datadog_metrics` and `trigger_pagerduty` should have rescue blocks. Your monitoring code must never crash your main business process: a job that completes but fails to alert is better than a job that crashes because the alert itself raised an exception.
```ruby
def emit_datadog_metrics(results)
  # ... emit metrics
rescue => e
  Rails.logger.error("Failed to emit Datadog metrics: #{e.message}")
  # Do NOT re-raise — monitoring failure is never a reason to abort the job
end
```
4. Environment Parity for the Datadog Agent
In production the agent runs as a sidecar or daemon. In local development and staging, it often doesn’t. This is fine — just make sure your code uses ENV.fetch('DD_AGENT_HOST', '127.0.0.1') so the host is configurable per environment, and don’t be alarmed when staging metrics don’t appear in Datadog.
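If the conditional `ENV.fetch` isn't enough, one option is a null-object client for agent-less environments. `NullStatsd` below is a hypothetical class, not part of the dogstatsd-ruby gem; it accepts the same calls the code in this series makes and simply does nothing:

```ruby
# Hypothetical null-object metrics client for environments without a Datadog
# agent (local dev, some staging setups). It mirrors the subset of the
# Datadog::Statsd interface used here, so callers need no per-env branching.
class NullStatsd
  def gauge(_metric, _value, **_opts); nil; end
  def increment(_metric, **_opts); nil; end
  def close; nil; end
end

statsd = NullStatsd.new
statsd.gauge('billing.unbilled_orders', 0) # safe no-op in dev/staging
statsd.close
```

A small factory that returns `Datadog::Statsd.new(...)` in production and `NullStatsd.new` elsewhere keeps the business code identical across environments.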
5. PagerDuty Dedup Keys Prevent Double-Paging
If your cron job or health check can run more than once for the same underlying issue (retry logic, manual reruns), always use a stable dedup_key tied to the resource and time period — not a timestamp. A timestamp-based key creates a new PagerDuty incident on every run.
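A minimal sketch of the idea, using a hypothetical `billing_dedup_key` helper keyed on the billing week's start date rather than the current time:

```ruby
require 'date'

# Hypothetical helper: derive a stable dedup key from the resource and the
# billing period. Reruns for the same week produce the same key, so PagerDuty
# folds them into one incident instead of paging again.
def billing_dedup_key(week_start)
  "billing-health-#{week_start.strftime('%Y-%m-%d')}"
end

first_run = billing_dedup_key(Date.new(2024, 6, 3))
rerun     = billing_dedup_key(Date.new(2024, 6, 3))
first_run == rerun # same key on rerun: deduplicated, no double page
```

Contrast this with the timestamp-based key used in the console test above, which is deliberately unique so the test never collides with a real incident.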
Summary
| Concern | Tool | How |
|---|---|---|
| Custom business metrics | Datadog StatsD | Datadog::Statsd#gauge via local agent (UDP) |
| APM / request tracing | Datadog ddtrace | Datadog.configure initializer |
| Metric visualization | Datadog Dashboards | Timeseries widgets per metric |
| Critical alert on failure | PagerDuty Events API v2 | Pagerduty::Wrapper + dedup key |
| Secondary notification | Google Chat / Slack webhook | HTTP POST to webhook URL |
| Scheduled execution | Cron + Rake | Shell script wrapping bundle exec rake |
| Cron liveness monitoring | Healthchecks.io | Ping before/after cron run |
Both integrations together give you a complete observability loop: your scheduled jobs run on time, emit metrics to Datadog for trending and analysis, and page the right engineer via PagerDuty the moment something goes wrong — before any customer notices.
Further Reading
- Datadog DogStatsD Ruby Docs
- Datadog ddtrace Rails Integration
- PagerDuty Events API v2 Reference
- PagerDuty Ruby Gem
- Healthchecks.io — Cron Monitoring
Happy Integration!