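What Are the Performance Monitoring Metrics for a CDN, and How Do You Monitor CDN Performance? - Interview Question

Why CDN Performance Monitoring Matters

Performance monitoring is how you keep CDN service quality and user experience under control. Tracking the key CDN metrics in real time lets you detect and resolve problems early, tune the CDN configuration, and improve overall performance.

Core Monitoring Metrics

1. Latency Metrics

Response Time

Definition: the time from when the user issues a request until the complete response has been received

Key metrics:

TTFB (Time to First Byte): time until the first byte of the response arrives
TTLB (Time to Last Byte): time until the last byte of the response arrives
Total response time: duration of the full request/response cycle

Target values:

Static content: <100ms
Dynamic content: <500ms
API requests: <200ms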
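
To make the latency metrics concrete, here is a minimal sketch of deriving TTFB and TTLB from request timestamps and checking a 95th-percentile value against the static-content target. The field names (startMs, firstByteMs, lastByteMs), the sample values, and the nearest-rank percentile method are assumptions chosen for illustration.

```javascript
// Derive TTFB and TTLB (in milliseconds) from three timestamps of one request.
// The field names are hypothetical; use whatever your timing source provides.
function timingMetrics({ startMs, firstByteMs, lastByteMs }) {
  return {
    ttfb: firstByteMs - startMs,
    ttlb: lastByteMs - startMs,
  };
}

// Nearest-rank percentile: the smallest sample that is >= p% of all samples.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up TTFB samples in milliseconds, including one slow outlier.
const ttfbSamples = [42, 55, 61, 48, 230, 51, 47, 58, 49, 53];
const p95 = percentile(ttfbSamples, 95);
console.log(`p95 TTFB: ${p95} ms, within the 100 ms static target: ${p95 < 100}`);
```

Percentiles are used here rather than the mean because a single slow outlier can hide behind a healthy-looking average.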

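Network Latency

Definition: the time data spends in transit across the network

How to measure:

bash
# Measure round-trip latency with ping (-c 5 sends five probes, then exits)
ping -c 5 cdn.example.com

# Show per-hop latency along the route with traceroute
traceroute cdn.example.com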

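2. Throughput Metrics

Bandwidth Utilization

Definition: the share of total available bandwidth that is actually in use

Formula:

text
Bandwidth utilization = (current bandwidth / total bandwidth) × 100%

Monitoring dimensions:

Edge node bandwidth
Origin-fetch (back-to-origin) bandwidth
Overall bandwidth utilization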
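
The formula can be applied per monitoring dimension. The sketch below assumes bandwidth is reported in Mbps; the link names and figures are invented for illustration.

```javascript
// Bandwidth utilization as a percentage of total capacity.
function bandwidthUtilization(currentMbps, totalMbps) {
  if (!(totalMbps > 0)) {
    throw new RangeError('totalMbps must be a positive number');
  }
  return (currentMbps * 100) / totalMbps;
}

// Hypothetical readings for the edge and origin-fetch dimensions.
const links = [
  { name: 'edge', currentMbps: 5000, totalMbps: 10000 },
  { name: 'origin-fetch', currentMbps: 250, totalMbps: 1000 },
];

for (const link of links) {
  const pct = bandwidthUtilization(link.currentMbps, link.totalMbps);
  console.log(`${link.name}: ${pct.toFixed(1)}% of capacity in use`);
}
```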

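Request Volume

Key metrics:

QPS (Queries Per Second): number of queries handled per second
RPS (Requests Per Second): number of requests handled per second (used here interchangeably with QPS)
Peak QPS: the highest per-second request count observed

Monitoring example:

javascript
// Count requests and report the count once per second
let requestCount = 0
setInterval(() => {
  console.log(`QPS: ${requestCount}`)
  requestCount = 0
}, 1000)

// Increment the counter for every incoming request
function handleRequest(request) {
  requestCount++
  // Handle the request...
}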

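3. Availability Metrics

Node Availability

Definition: the proportion of time a node is up and serving traffic normally

Formula:

text
Node availability = (uptime / total time) × 100%

Target values:

Single node: >99.9%
CDN as a whole: >99.99%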
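
It helps to translate these percentages into a downtime budget. Over a 30-day window (43,200 minutes), 99.9% allows about 43.2 minutes of downtime and 99.99% allows about 4.32 minutes. The sketch below shows the arithmetic; the function names and the 30-day window are assumptions.

```javascript
// Availability as a percentage of the observation window.
function availabilityPercent(uptimeSeconds, totalSeconds) {
  return (uptimeSeconds * 100) / totalSeconds;
}

// Maximum downtime (in minutes) that still meets the target over the window.
function downtimeBudgetMinutes(targetPercent, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - targetPercent / 100);
}

console.log(`99.9% over 30 days:  ${downtimeBudgetMinutes(99.9, 30).toFixed(2)} min`);
console.log(`99.99% over 30 days: ${downtimeBudgetMinutes(99.99, 30).toFixed(2)} min`);
```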

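Failover Time

Definition: the time from a node failing until its traffic has been switched to other nodes

Target values:

Failure detection: <5 seconds
Traffic switchover: <10 seconds
Total failover: <15 seconds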
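
As a rough way to reason about these budgets, the sketch below assumes a simplified model in which a node is declared failed after a fixed number of consecutive failed health checks sent at a fixed interval. Real systems also have probe timeouts and DNS or routing propagation delays, which this model ignores. All parameter names are hypothetical.

```javascript
// Check a health-check configuration against the failover targets above.
// Simplified model: detection time = probe interval x consecutive failures.
function checkFailoverBudget({ intervalSec, failureThreshold, switchSec }) {
  const detectionSec = intervalSec * failureThreshold;
  const totalSec = detectionSec + switchSec;
  return {
    detectionSec,
    totalSec,
    detectionOk: detectionSec < 5,
    switchOk: switchSec < 10,
    totalOk: totalSec < 15,
  };
}

// Probing every second and failing after 3 misses leaves room in the budget.
console.log(checkFailoverBudget({ intervalSec: 1, failureThreshold: 3, switchSec: 8 }));
```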

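4. Cache Metrics

Cache Hit Rate

Definition: the share of all requests that are served from the CDN cache

Formula:

text
Cache hit rate = (cache-hit requests / total requests) × 100%

Target values:

Static content: >95%
Dynamic content: >70%
Overall: >90%

Optimization strategy:

nginx
# Set a long cache lifetime for static assets.
# "immutable" is only safe when file names are versioned (e.g. contain a content hash).
location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}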

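Origin-Fetch Rate

Definition: the share of all requests that the CDN has to forward to the origin server

Formula:

text
Origin-fetch rate = (origin-fetch requests / total requests) × 100%

Target value: <10%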
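
Both cache ratios can be derived from per-request cache statuses. The sketch below makes a simplifying assumption: every request whose status is not HIT counts as an origin fetch. Real CDNs report further statuses (for example stale responses served from cache), so treat this as an approximation. The sample statuses are invented.

```javascript
// Compute cache hit rate and origin-fetch rate from a list of cache statuses.
// Assumption: any status other than 'HIT' required a trip to the origin.
function cacheRatios(statuses) {
  const total = statuses.length;
  if (total === 0) {
    return { hitRate: NaN, originFetchRate: NaN };
  }
  const hits = statuses.filter((status) => status === 'HIT').length;
  return {
    hitRate: (hits * 100) / total,
    originFetchRate: ((total - hits) * 100) / total,
  };
}

const sampleStatuses = ['HIT', 'HIT', 'MISS', 'HIT', 'HIT', 'BYPASS', 'HIT', 'HIT'];
const ratios = cacheRatios(sampleStatuses);
console.log(`hit rate: ${ratios.hitRate}%, origin-fetch rate: ${ratios.originFetchRate}%`);
```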

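5. Error Metrics

HTTP Error Rate

Definition: the proportion of requests that return a 4xx or 5xx status code

Key error classes:

4xx: client errors (e.g. 404 Not Found)
5xx: server errors (e.g. 502 Bad Gateway)

Target value: <1%

Timeout Rate

Definition: the proportion of requests that time out

Target value: <0.1%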
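
A minimal sketch of computing both error metrics from request records follows. The record shape ({ status, timedOut }) is an assumption, as is the choice to count a timed-out request only toward the timeout rate, not the HTTP error rate.

```javascript
// Compute HTTP error rate and timeout rate (both in percent) from request records.
// Each record is assumed to look like { status: 200, timedOut: false }.
function errorRates(requests) {
  const total = requests.length;
  let clientErrors = 0;
  let serverErrors = 0;
  let timeouts = 0;
  for (const request of requests) {
    if (request.timedOut) {
      timeouts += 1; // no usable status code, so count it separately
    } else if (request.status >= 500) {
      serverErrors += 1;
    } else if (request.status >= 400) {
      clientErrors += 1;
    }
  }
  return {
    clientErrorRate: (clientErrors * 100) / total,
    serverErrorRate: (serverErrors * 100) / total,
    httpErrorRate: ((clientErrors + serverErrors) * 100) / total,
    timeoutRate: (timeouts * 100) / total,
  };
}

// Ten made-up requests: one 404, one 502, one timeout, seven successes.
const sampleRequests = [
  { status: 404, timedOut: false },
  { status: 502, timedOut: false },
  { status: 0, timedOut: true },
  ...Array.from({ length: 7 }, () => ({ status: 200, timedOut: false })),
];
console.log(errorRates(sampleRequests));
```

Splitting 4xx from 5xx is useful in practice because a spike in 4xx often points at clients or broken links, while 5xx usually points at the CDN or the origin.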

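Monitoring Tools and Platforms

1. Built-in CDN Monitoring

Monitoring offered by the major CDN providers:

Cloudflare Analytics

Features:

Real-time traffic monitoring
Request analysis
Threat detection
Performance reports

Usage example:

javascript
// Fetch analytics data through the Cloudflare API.
// {zone_id} and {api_token} are placeholders to replace with real values.
// Note: Cloudflare has deprecated this zone analytics endpoint in favor of
// its GraphQL Analytics API, so check the current API docs before relying on it.
const response = await fetch('https://api.cloudflare.com/client/v4/zones/{zone_id}/analytics/dashboard', {
  headers: {
    'Authorization': 'Bearer {api_token}'
  }
})
const data = await response.json()
console.log(data)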

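AWS CloudFront Metrics

Features:

Request count statistics
Bytes transferred statistics
Error rate monitoring
Latency monitoring

CloudWatch integration:

bash
# Fetch CloudFront metrics with the AWS CLI.
# CloudFront publishes its metrics to CloudWatch in us-east-1 with a
# Region=Global dimension; {distribution_id} is a placeholder.
aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace AWS/CloudFront \
  --metric-name Requests \
  --dimensions Name=DistributionId,Value={distribution_id} Name=Region,Value=Global \
  --start-time 2026-02-19T00:00:00Z \
  --end-time 2026-02-19T23:59:59Z \
  --period 3600 \
  --statistics Sum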

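2. Third-Party Monitoring Tools

Pingdom

Features:

Website performance monitoring
Availability monitoring
Page speed testing
Alert notifications

Highlights:

Global network of monitoring probes
Detailed performance reports
Easy to use

New Relic

Features:

Application performance monitoring (APM)
Infrastructure monitoring
User experience monitoring
Error tracking

Highlights:

Full-stack monitoring
Real-time data
Powerful analytics

Datadog

Features:

Infrastructure monitoring
Application performance monitoring
Log management
Security monitoring

Highlights:

Unified platform
Extensive integrations
Flexible alerting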

3. Self-Hosted Monitoring

Prometheus + Grafana

Architecture:

text
CDN → Exporter → Prometheus → Grafana
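
The exporter in this pipeline is the piece that turns CDN counters into something Prometheus can scrape. As a minimal sketch, the function below renders counters in the Prometheus text exposition format. The input shape ({ name, help, value }) and the sample values are assumptions; a real exporter would normally use a client library and serve this text over HTTP.

```javascript
// Render counter metrics in the Prometheus text exposition format.
// Each metric is described by a HELP line, a TYPE line, and a sample line.
function renderMetrics(counters) {
  const lines = [];
  for (const { name, help, value } of counters) {
    lines.push(`# HELP ${name} ${help}`);
    lines.push(`# TYPE ${name} counter`);
    lines.push(`${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

// Hypothetical counter values gathered from CDN logs or a provider API.
const exposition = renderMetrics([
  { name: 'cdn_requests_total', help: 'Total number of CDN requests', value: 1000 },
  { name: 'cdn_cache_hits', help: 'Number of requests served from cache', value: 930 },
]);
console.log(exposition);
```

Prometheus would then scrape this output from the exporter's metrics endpoint at each scrape interval.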

Configuration example:

Prometheus configuration (prometheus.yml):

yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cdn'
    static_configs:
      - targets: ['cdn-exporter:9090']

Grafana dashboard:

json
{
  "dashboard": {
    "title": "CDN Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(cdn_requests_total[5m])"
          }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "targets": [
          {
            "expr": "sum(rate(cdn_cache_hits[5m])) / sum(rate(cdn_requests_total[5m])) * 100"
          }
        ]
      }
    ]
  }
}

ELK Stack (Elasticsearch, Logstash, Kibana)

Uses:

Log collection and analysis
Performance monitoring
Error tracking

Configuration example:

Logstash configuration (logstash.conf):

conf
input {
  file {
    path => "/var/log/cdn/access.log"
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "cdn-logs-%{+YYYY.MM.dd}"
  }
}

Collecting Monitoring Data

1. Log Collection

Access log format:

nginx
log_format cdn '$remote_addr - $remote_user [$time_local] '
                '"$request" $status $body_bytes_sent '
                '"$http_referer" "$http_user_agent" '
                'rt=$request_time uct="$upstream_connect_time" '
                'uht="$upstream_header_time" urt="$upstream_response_time" '
                'cache=$upstream_cache_status';

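Key fields:

request_time: total time spent processing the request
upstream_connect_time: time spent establishing the connection to the upstream server
upstream_header_time: time until the response header is received from the upstream server
upstream_response_time: time until the full response is received from the upstream server
upstream_cache_status: cache status (e.g. HIT/MISS/BYPASS)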
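
To show how these fields can be pulled out of a log line, here is a minimal parsing sketch for the log format defined above. The regular expression, the function names, and the sample lines are assumptions for illustration. It expects a single value per upstream timing field, which is a simplification, because nginx can log several comma-separated values when more than one upstream is contacted.

```javascript
// Regular expression matching one line of the "cdn" log format shown above.
const CDN_LOG_RE = /^(\S+) - (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+) "([^"]*)" "([^"]*)" rt=([\d.]+) uct="([^"]*)" uht="([^"]*)" urt="([^"]*)" cache=(\S+)$/;

// nginx typically logs "-" for an empty variable, e.g. upstream timings on a cache hit.
function toSeconds(value) {
  return value === '-' ? null : Number(value);
}

// Parse one access log line; returns null when the line does not match the format.
function parseCdnLogLine(line) {
  const m = CDN_LOG_RE.exec(line);
  if (m === null) return null;
  return {
    remoteAddr: m[1],
    request: m[4],
    status: Number(m[5]),
    bodyBytesSent: Number(m[6]),
    requestTime: Number(m[9]),
    upstreamConnectTime: toSeconds(m[10]),
    upstreamHeaderTime: toSeconds(m[11]),
    upstreamResponseTime: toSeconds(m[12]),
    cacheStatus: m[13],
  };
}

// Made-up sample line for a request served from cache.
const sampleLine = '203.0.113.7 - - [19/Feb/2026:10:15:32 +0000] "GET /static/app.js HTTP/1.1" 200 5120 "-" "Mozilla/5.0" rt=0.004 uct="-" uht="-" urt="-" cache=HIT';
console.log(parseCdnLogLine(sampleLine));
```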

2. Metrics Collection

Custom metrics collection:

javascript
// Use the Prometheus client library for Node.js
const client = require('prom-client');

// Define a histogram for request durations
const httpRequestDuration = new client.Histogram({
  name: 'cdn_http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'code']
});

// Record one observation: start the timer, do the work, then stop it with labels
const end = httpRequestDuration.startTimer();
// Handle the request...
end({ method: 'GET', route: '/api/data', code: 200 });

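3. Real-Time Monitoring

Real-time push over WebSocket:

javascript
// Push monitoring data to clients in real time over WebSocket.
// getCurrentQPS, getAverageLatency and getCacheHitRate are assumed to be
// defined elsewhere in the application.
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
  // Send a metrics snapshot every second
  const interval = setInterval(() => {
    const metrics = {
      qps: getCurrentQPS(),
      latency: getAverageLatency(),
      cacheHitRate: getCacheHitRate()
    };
    ws.send(JSON.stringify(metrics));
  }, 1000);

  // Stop sending when the client disconnects
  ws.on('close', () => {
    clearInterval(interval);
  });
});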

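Alerting

1. Alert Rules

Common alert rules:

High latency alert

yaml
# Prometheus alerting rules
groups:
  - name: cdn_alerts
    rules:
      - alert: HighLatency
        # p95 computed from the cdn_http_request_duration_seconds histogram
        expr: histogram_quantile(0.95, sum by (le) (rate(cdn_http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s"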

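Low cache hit rate alert

yaml
- alert: LowCacheHitRate
  # Use rate() so the ratio reflects recent traffic, not lifetime counter totals
  expr: sum(rate(cdn_cache_hits[5m])) / sum(rate(cdn_requests_total[5m])) * 100 < 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Low cache hit rate"
    description: "Cache hit rate is {{ $value }}%"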

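High error rate alert

yaml
- alert: HighErrorRate
  # Use rate() so the ratio reflects recent traffic, not lifetime counter totals
  expr: sum(rate(cdn_errors_total[5m])) / sum(rate(cdn_requests_total[5m])) * 100 > 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value }}%"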
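
The `for` clause in these rules means the condition must hold continuously for that long before the alert fires, which filters out short spikes. The sketch below imitates that pending-then-firing behavior in plain JavaScript. It is an illustration of the idea, not Prometheus's actual implementation, and the function and parameter names are hypothetical.

```javascript
// Create an evaluator that fires only after the value has stayed above the
// threshold for at least forSeconds of consecutive evaluations.
function makeAlert({ threshold, forSeconds }) {
  let pendingSince = null; // timestamp when the condition first became true
  return function evaluate(timestampSec, value) {
    if (value > threshold) {
      if (pendingSince === null) pendingSince = timestampSec;
      return timestampSec - pendingSince >= forSeconds; // true means firing
    }
    pendingSince = null; // condition cleared, so the pending period resets
    return false;
  };
}

// Error rate above 1% must persist for 5 minutes (300 s) before firing.
const highErrorRate = makeAlert({ threshold: 1, forSeconds: 300 });
console.log(highErrorRate(0, 2.5));   // condition just became true: pending
console.log(highErrorRate(150, 2.1)); // still pending
console.log(highErrorRate(300, 1.8)); // held for 300 s: firing
```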

2. Alert Notifications

Notification channels:

Email notifications

yaml
# Alertmanager configuration
receivers:
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'  # placeholder; do not commit real credentials

SMS notifications

yaml
receivers:
  - name: 'sms'
    webhook_configs:
      - url: 'https://sms.example.com/send'
        send_resolved: true

Instant messaging tools

yaml
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#cdn-alerts'
        username: 'CDN Alert Bot'

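Performance Optimization Recommendations

1. Optimizing from Monitoring Data

Latency optimization

Analyze the request paths that show high latency
Optimize the caching strategy
Adjust CDN node configuration

Cache optimization

Identify content with a low cache hit rate
Adjust TTL settings
Optimize the cache key configuration

Bandwidth optimization

Analyze the content that consumes the most bandwidth
Enable compression
Optimize images and video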
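
As one example of turning monitoring data into an optimization target, the sketch below groups parsed log entries by path and lists the paths whose hit rate falls below a threshold. The entry shape ({ path, cache }), the default thresholds, and the sample data are assumptions.

```javascript
// Find frequently requested paths whose cache hit rate is below maxHitRate (%).
function lowHitRatePaths(entries, { minRequests = 100, maxHitRate = 80 } = {}) {
  const stats = new Map();
  for (const { path, cache } of entries) {
    const s = stats.get(path) || { requests: 0, hits: 0 };
    s.requests += 1;
    if (cache === 'HIT') s.hits += 1;
    stats.set(path, s);
  }
  return [...stats.entries()]
    .map(([path, s]) => ({
      path,
      requests: s.requests,
      hitRate: (s.hits * 100) / s.requests,
    }))
    // Ignore rarely requested paths: their hit rate is mostly noise.
    .filter((row) => row.requests >= minRequests && row.hitRate < maxHitRate)
    .sort((a, b) => a.hitRate - b.hitRate); // worst offenders first
}

// Tiny invented data set; minRequests is lowered so it produces output.
const entries = [
  { path: '/static/app.js', cache: 'HIT' },
  { path: '/static/app.js', cache: 'HIT' },
  { path: '/api/profile', cache: 'MISS' },
  { path: '/api/profile', cache: 'MISS' },
  { path: '/img/banner.png', cache: 'HIT' },
  { path: '/img/banner.png', cache: 'MISS' },
];
console.log(lowHitRatePaths(entries, { minRequests: 2 }));
```

Paths that surface here are natural candidates for the TTL and cache key adjustments listed above.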

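2. A/B Testing

Testing different configurations:

javascript
// A/B test two caching strategies by splitting users deterministically.
// hashUserId is assumed to return a stable non-negative integer for a user ID.
function getCacheStrategy(userId) {
  const hash = hashUserId(userId);
  if (hash % 2 === 0) {
    return 'strategy-a'; // long cache lifetime
  } else {
    return 'strategy-b'; // short cache lifetime
  }
}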
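
The example above leaves hashUserId undefined. One possible implementation, shown here purely as an assumption, is the 32-bit FNV-1a hash, which is deterministic and cheap to compute. The bucketOf helper is also hypothetical and generalizes the even/odd split to any number of variants.

```javascript
// 32-bit FNV-1a hash of a user ID; always returns a non-negative integer.
function hashUserId(userId) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  const bytes = new TextEncoder().encode(String(userId));
  for (const byte of bytes) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // reinterpret as an unsigned 32-bit integer
}

// Map a user to one of `buckets` variants; the same user always gets the same one.
function bucketOf(userId, buckets) {
  return hashUserId(userId) % buckets;
}

for (const userId of ['user-1', 'user-2', 'user-3']) {
  console.log(`${userId} -> variant ${bucketOf(userId, 2)}`);
}
```

A deterministic hash matters here because a user who switches between strategies mid-test would contaminate both groups' cache metrics.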

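3. Capacity Planning

Forecasting from historical data:

python
# Forecast request volume with a time-series model
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load historical data (assumes one row per day with a 'requests' column)
data = pd.read_csv('cdn_metrics.csv')

# Fit an ARIMA(5, 1, 0) model
model = ARIMA(data['requests'], order=(5, 1, 0))
model_fit = model.fit()

# Forecast the next 7 periods (7 days for daily data)
forecast = model_fit.forecast(steps=7)
print(forecast)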

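Interview Key Points

When answering this question, emphasize that you:

Know the core CDN monitoring metrics and their target values
Are familiar with the mainstream monitoring tools and platforms
Can design a scheme for collecting monitoring data
Understand why alerting matters
Have experience using monitoring data to drive performance optimization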