Stack: Ruby 3+, Rails 7+
Audience: Backend engineers building or maintaining production-grade Rails services
Goal: Add real-time observability and on-call alerting to a critical business process
Part 3: Hooking It All Together — Rake Task + Cron
3.1 Rake Task
Create lib/tasks/billing.rake:
```ruby
namespace :billing do
  desc "Run billing health check: emit Datadog metrics and alert if unhealthy"
  task health_check: :environment do
    Monitoring::BillingHealthCheck.new(
      billing_week: BillingWeek.current
    ).run
  end
end
```
Run it manually:
```bash
bundle exec rake billing:health_check
```
3.2 Cron Script
Create scripts/cron/billing_health_check.sh:
```bash
#!/bin/bash
source /apps/myapp/current/scripts/env.sh
bundle exec rake billing:health_check
```
Using Healthchecks.io (or similar) to wrap the cron gives you a second layer of alerting: if the cron doesn’t ping within the expected window, you get an alert – even if the app never starts.
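As a sketch of that ping pattern, here is a hypothetical `hc_ping_url` helper. Healthchecks.io's ping endpoints are `https://hc-ping.com/<uuid>`, with `/start` and `/fail` suffixes for run-started and run-failed signals; the wrapper shape below is an illustration, not a prescribed implementation:

```ruby
require 'uri'

# Hypothetical helper: build the Healthchecks.io ping URL for a check UUID.
# '/start' signals the run began; no suffix signals success; '/fail' signals failure.
def hc_ping_url(uuid, suffix = '')
  URI("https://hc-ping.com/#{uuid}#{suffix}")
end

# In the cron wrapper you would ping around the rake task, e.g.:
#   Net::HTTP.get_response(hc_ping_url(uuid, '/start'))
#   ... bundle exec rake billing:health_check ...
#   Net::HTTP.get_response(hc_ping_url(uuid))          # success
#   Net::HTTP.get_response(hc_ping_url(uuid, '/fail')) # failure
```

If the success ping never arrives within the check's grace period, Healthchecks.io alerts you, independently of the app itself.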
3.3 Crontab Entry
```bash
# Run billing health check every Thursday at 5:30 AM
30 5 * * 4 /apps/myapp/current/scripts/cron/billing_health_check.sh
```
⚠️ Important for managed deployments: If your crontab is version-controlled but not auto-deployed (e.g., Capistrano without cron management), changes to the file in your repo do not automatically update the server. Always verify with `crontab -l` after deploying.
Part 4: Building the Datadog Dashboard
Once metrics are flowing, set up a dashboard for at-a-glance visibility.
4.1 Create the Dashboard
- Datadog → Dashboards → New Dashboard
- Name it: “Billing Health Monitor”
- Click + Add Widgets
4.2 Add Timeseries Widgets
For each metric, add a Timeseries widget:
| Widget title | Metric | Visualization |
|---|---|---|
| Unbilled Orders | billing.unbilled_orders | Line chart |
| Missing Billing Records | billing.missing_billing_records | Line chart |
| Failed Charges | billing.failed_charges | Line chart |
Widget configuration:
- Graph: select metric → `billing.unbilled_orders`
- Display as: Line
- Timeframe: Set to “Past 1 Week” or “Past 1 Month” after data starts flowing (not “Past 1 Hour” which shows nothing between weekly runs)
4.3 Add Reference Lines (Optional but Useful)
For the unbilled orders widget, add a constant line at your alert threshold:
- In the widget editor → Markers → Add marker at `y = 10` (your `BILLING_UNBILLED_THRESHOLD`)
- Color it red to make the threshold visually obvious
4.4 Where to Find Your Custom Metrics
- Metric Explorer: app.datadoghq.com/metric/explorer — type `billing.` to autocomplete and graph any metric
- Metric Volume: app.datadoghq.com/metric/volume — confirms Datadog has received the metric (appears within 2-5 minutes of first emission)
Part 5: Testing the Integration End-to-End
5.1 Test Datadog Metrics (no alerts, safe in any env)
```ruby
# Rails console
require 'datadog/statsd'

host = ENV.fetch('DD_AGENT_HOST', '127.0.0.1')
statsd = Datadog::Statsd.new(host, 8125)
statsd.gauge('billing.unbilled_orders', 0)
statsd.gauge('billing.missing_billing_records', 0)
statsd.gauge('billing.failed_charges', 0)
statsd.close
puts "Sent — check /metric/explorer in Datadog in ~2-3 minutes"
```
5.2 Test PagerDuty (staging)
```ruby
# Rails console — staging
# First, verify the key exists:
Rails.application.credentials[:staging][:pagerduty_billing_integration_key].present?

# Then trigger a test incident:
svc = Monitoring::BillingHealthCheck.new(billing_week: BillingWeek.current)
svc.send(:trigger_pagerduty, "TEST: Billing health check — staging validation #{Time.current}")

# Remember to resolve the incident in the PagerDuty UI immediately after!
```
5.3 Test PagerDuty (production) — Preferred Method
Use PagerDuty’s built-in test instead of triggering from code:
- PagerDuty → Services → Billing Pipeline → Integrations
- Find the integration → click “Send Test Event”
This fires through the same pipeline without touching your app or risking a real alert.
5.4 Test PagerDuty (production) — via Rails Console
If you must test via code in production, use a unique dedup key so it doesn’t collide with real billing alerts, and coordinate with your on-call engineer first:
```ruby
svc = Monitoring::BillingHealthCheck.new(billing_week: BillingWeek.current)

Pagerduty::Wrapper.new(
  integration_key: svc.send(:pagerduty_integration_key)
).client.incident("billing-health-test-#{Time.current.to_i}").trigger(
  summary: "TEST ONLY — please ignore — integration validation",
  source: "rails-console",
  severity: "critical"
)
```
5.5 Test the Full Service Class (production, after billing has run)
Once billing has completed successfully for the week, all counts will be 0 and no PagerDuty alert will fire:
```ruby
result = Monitoring::BillingHealthCheck.new(billing_week: BillingWeek.current).run
puts result
# => { unbilled_orders_count: 0, missing_billing_records_count: 0, failed_charges_count: 0, ... }
```
Common Gotchas
1. StatsD is Fire-and-Forget
UDP has no acknowledgment. If the agent isn’t running, your statsd.gauge() calls return normally with no error. Always verify the agent is reachable by checking for your metric in the Datadog UI after sending — don’t rely on exception-free code as proof of delivery.
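A quick demonstration of this with Ruby's standard `UDPSocket` (port 9999 here is an arbitrary port assumed to have no listener, purely for illustration):

```ruby
require 'socket'

# A UDP datagram "sends" successfully even when nothing is listening on the
# destination port: the OS reports bytes handed to the network stack, not
# delivery. This is exactly why a down Datadog agent never raises in app code.
sock = UDPSocket.new
payload = 'billing.unbilled_orders:0|g'          # DogStatsD gauge wire format
bytes = sock.send(payload, 0, '127.0.0.1', 9999) # port with no listener
sock.close
bytes # equals payload.bytesize, even though nothing received the datagram
```

The send call reports success regardless, so the only real confirmation of delivery is the metric showing up in Datadog.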
2. Metric Volume vs Metric Explorer
- Metric Volume (`/metric/volume`): Confirms Datadog received the metric. Good for first-time setup verification.
- Metric Explorer (`/metric/explorer`): Lets you actually graph and analyze the metric over time. This is where you do your monitoring work.
3. Rescue Around Everything
Both `emit_datadog_metrics` and `trigger_pagerduty` should have rescue blocks. Your monitoring code must never crash your main business process: a job that completes but fails to alert is better than a job that crashes because the alert itself raised an exception.
```ruby
def emit_datadog_metrics(results)
  # ... emit metrics
rescue => e
  Rails.logger.error("Failed to emit Datadog metrics: #{e.message}")
  # Do NOT re-raise — monitoring failure is never a reason to abort the job
end
```
4. Environment Parity for the Datadog Agent
In production the agent runs as a sidecar or daemon. In local development and staging, it often doesn’t. This is fine — just make sure your code uses ENV.fetch('DD_AGENT_HOST', '127.0.0.1') so the host is configurable per environment, and don’t be alarmed when staging metrics don’t appear in Datadog.
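If the conditional `ENV.fetch` isn't enough, one option is a null-object client for agent-less environments. `NullStatsd` below is a hypothetical class, not part of the dogstatsd-ruby gem; it accepts the same calls the code in this series makes and simply does nothing:

```ruby
# Hypothetical null-object metrics client for environments without a Datadog
# agent (local dev, some staging setups). It mirrors the subset of the
# Datadog::Statsd interface used here, so callers need no per-env branching.
class NullStatsd
  def gauge(_metric, _value, **_opts); nil; end
  def increment(_metric, **_opts); nil; end
  def close; nil; end
end

statsd = NullStatsd.new
statsd.gauge('billing.unbilled_orders', 0) # safe no-op in dev/staging
statsd.close
```

A small factory that returns `Datadog::Statsd.new(...)` in production and `NullStatsd.new` elsewhere keeps the business code identical across environments.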
5. PagerDuty Dedup Keys Prevent Double-Paging
If your cron job or health check can run more than once for the same underlying issue (retry logic, manual reruns), always use a stable dedup_key tied to the resource and time period — not a timestamp. A timestamp-based key creates a new PagerDuty incident on every run.
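A minimal sketch of the idea, using a hypothetical `billing_dedup_key` helper keyed on the billing week's start date rather than the current time:

```ruby
require 'date'

# Hypothetical helper: derive a stable dedup key from the resource and the
# billing period. Reruns for the same week produce the same key, so PagerDuty
# folds them into one incident instead of paging again.
def billing_dedup_key(week_start)
  "billing-health-#{week_start.strftime('%Y-%m-%d')}"
end

first_run = billing_dedup_key(Date.new(2024, 6, 3))
rerun     = billing_dedup_key(Date.new(2024, 6, 3))
first_run == rerun # same key on rerun: deduplicated, no double page
```

Contrast this with the timestamp-based key used in the console test above, which is deliberately unique so the test never collides with a real incident.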
Summary
| Concern | Tool | How |
|---|---|---|
| Custom business metrics | Datadog StatsD | Datadog::Statsd#gauge via local agent (UDP) |
| APM / request tracing | Datadog ddtrace | Datadog.configure initializer |
| Metric visualization | Datadog Dashboards | Timeseries widgets per metric |
| Critical alert on failure | PagerDuty Events API v2 | Pagerduty::Wrapper + dedup key |
| Secondary notification | Google Chat / Slack webhook | HTTP POST to webhook URL |
| Scheduled execution | Cron + Rake | Shell script wrapping bundle exec rake |
| Cron liveness monitoring | Healthchecks.io | Ping before/after cron run |
Both integrations together give you a complete observability loop: your scheduled jobs run on time, emit metrics to Datadog for trending and analysis, and page the right engineer via PagerDuty the moment something goes wrong — before any customer notices.
Further Reading
- Datadog DogStatsD Ruby Docs
- Datadog ddtrace Rails Integration
- PagerDuty Events API v2 Reference
- PagerDuty Ruby Gem
- Healthchecks.io — Cron Monitoring
Happy Integration!