CDN 故障排查的流程是什么？有哪些常用工具？ - 面试题

CDN 故障排查的重要性

CDN 作为网站和应用的流量入口，其故障会直接影响用户体验和业务可用性。掌握 CDN 故障排查的方法和技巧，能够快速定位和解决问题，最小化故障影响。

常见 CDN 故障类型

1. 访问失败

症状：

用户无法访问网站
返回 5xx 错误
连接超时

可能原因：

CDN 节点故障
DNS 解析问题
源站故障
网络连接问题

2. 性能下降

症状：

响应时间变慢
频繁缓冲（视频）
加载时间延长

可能原因：

缓存命中率低
网络拥塞
源站负载高
CDN 节点过载

3. 内容不一致

症状：

用户看到旧内容
不同地区看到不同内容
更新后未生效

可能原因：

缓存未刷新
TTL 设置过长
缓存键配置错误
多 CDN 配置不一致

4. 安全问题

症状：

遭受 DDoS 攻击
恶意爬虫访问
数据泄露

可能原因：

安全配置不当
防护策略不足
漏洞未修复

故障排查流程

1. 确认故障范围

检查步骤：

1. 确认用户范围

bash
# 检查是否有大量用户报告
# 查看监控数据
# 分析错误日志

2. 确认地理范围

bash
# 检查是否特定地区受影响
# 使用地理位置工具
# 分析访问日志

3. 确认时间范围

bash
# 检查故障开始时间
# 查看时间序列数据
# 对比历史数据

2. 检查 CDN 状态

检查项：

1. CDN 节点状态

bash
# 检查节点健康状态
curl -I https://cdn.example.com/health

# 检查多个节点
for node in node1 node2 node3; do
  curl -I https://$node.example.com/health
done

2. CDN 控制台

查看节点状态
检查告警信息
分析流量图表

3. CDN API

javascript
// 使用 CDN API 检查状态
const response = await fetch('https://api.cdn.com/status', {
  headers: {
    'Authorization': 'Bearer {api_token}'
  }
})
const status = await response.json()
console.log(status)

3. 检查 DNS 解析

检查步骤：

1. 检查 DNS 解析

bash
# 检查域名解析
dig example.com

# 检查特定 DNS 服务器
dig @8.8.8.8 example.com

# 检查 CNAME 记录
dig CNAME cdn.example.com

2. 检查 DNS 传播

bash
# 检查多个 DNS 服务器
for dns in 8.8.8.8 1.1.1.1 114.114.114.114; do
  echo "DNS: $dns"
  dig @$dns example.com
done

3. 检查 DNS 缓存

bash
# 清除本地 DNS 缓存
# macOS
sudo dscacheutil -flushcache
sudo killall -HUP mDNSResponder

# Linux
sudo systemctl restart nscd

4. 检查网络连接

检查步骤：

1. 检查网络延迟

bash
# Ping 测试
ping cdn.example.com

# Traceroute 测试
traceroute cdn.example.com

# MTR 测试（结合 ping 和 traceroute）
mtr cdn.example.com

2. 检查端口连接

bash
# 检查 HTTP 端口
telnet cdn.example.com 80

# 检查 HTTPS 端口
telnet cdn.example.com 443

# 使用 nc 测试
nc -zv cdn.example.com 443

3. 检查 SSL/TLS

bash
# 检查 SSL 证书
openssl s_client -connect cdn.example.com:443 -servername cdn.example.com

# 检查 SSL 证书有效期
echo | openssl s_client -connect cdn.example.com:443 2>/dev/null | openssl x509 -noout -dates

5. 检查缓存状态

检查步骤：

1. 检查缓存命中率

bash
# 分析访问日志
grep "HIT" access.log | wc -l
grep "MISS" access.log | wc -l

# 计算缓存命中率
hit_count=$(grep "HIT" access.log | wc -l)
total_count=$(wc -l < access.log)
hit_rate=$((hit_count * 100 / total_count))
echo "Cache hit rate: $hit_rate%"

2. 检查缓存键

bash
# 检查缓存键配置
nginx -T | grep proxy_cache_key

# 分析缓存键差异
grep "cache_key" access.log | sort | uniq -c

3. 检查缓存过期

bash
# 检查缓存 TTL
curl -I https://cdn.example.com/file.jpg | grep -i cache-control

# 检查缓存过期时间
curl -I https://cdn.example.com/file.jpg | grep -i expires

6. 检查源站状态

检查步骤：

1. 直接访问源站

bash
# 直接访问源站测试
curl -I https://origin.example.com/file.jpg

# 检查源站响应时间
time curl https://origin.example.com/file.jpg

2. 检查源站负载

bash
# 检查 CPU 使用率
top

# 检查内存使用率
free -h

# 检查磁盘使用率
df -h

# 检查网络连接
netstat -an | grep ESTABLISHED | wc -l

3. 检查源站日志

bash
# 检查错误日志
tail -f /var/log/nginx/error.log

# 检查访问日志
tail -f /var/log/nginx/access.log

# 检查慢查询
tail -f /var/log/mysql/slow.log

常用排查工具

1. 网络诊断工具

Ping

bash
# 基本使用
ping cdn.example.com

# 指定次数
ping -c 10 cdn.example.com

# 指定包大小
ping -s 1024 cdn.example.com

Traceroute

bash
# 基本使用
traceroute cdn.example.com

# 使用 ICMP
traceroute -I cdn.example.com

# 指定端口
traceroute -p 443 cdn.example.com

MTR

bash
# 基本使用
mtr cdn.example.com

# 指定报告模式
mtr -r -c 10 cdn.example.com

# 保存到文件
mtr -r -c 10 cdn.example.com > mtr_report.txt

2. HTTP 调试工具

Curl

bash
# 基本请求
curl https://cdn.example.com/file.jpg

# 查看响应头
curl -I https://cdn.example.com/file.jpg

# 查看详细信息
curl -v https://cdn.example.com/file.jpg

# 查看请求和响应头
curl -i https://cdn.example.com/file.jpg

# 指定请求头
curl -H "User-Agent: Mozilla/5.0" https://cdn.example.com/file.jpg

# 查看响应时间
curl -w "@curl-format.txt" -o /dev/null -s https://cdn.example.com/file.jpg

curl-format.txt：

shell
     time_namelookup:  %{time_namelookup}\n
        time_connect:  %{time_connect}\n
     time_appconnect:  %{time_appconnect}\n
    time_pretransfer:  %{time_pretransfer}\n
       time_redirect:  %{time_redirect}\n
  time_starttransfer:  %{time_starttransfer}\n
                     ----------\n
          time_total:  %{time_total}\n

Wget

bash
# 基本下载
wget https://cdn.example.com/file.jpg

# 查看详细信息
wget -d https://cdn.example.com/file.jpg

# 保存响应头
wget -S https://cdn.example.com/file.jpg

# 指定超时
wget -T 10 https://cdn.example.com/file.jpg

3. 浏览器开发者工具

Network 面板

查看请求详情：

请求 URL
请求方法
请求头
响应头
响应时间
状态码

查看瀑布图：

请求时间线
等待时间
下载时间
总时间

Console 面板

查看错误信息：

JavaScript 错误
网络错误
资源加载错误

4. 日志分析工具

ELK Stack

Elasticsearch 查询：

json
// 查询特定错误
{
  "query": {
    "match": {
      "status": 502
    }
  }
}

// 查询特定时间段
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2026-02-19T00:00:00",
        "lte": "2026-02-19T23:59:59"
      }
    }
  }
}

Kibana 可视化：

请求量趋势图
错误率分布图
响应时间分布图

AWStats

分析访问日志：

bash
# 生成报告
awstats.pl -config=cdn -update

# 查看报告
awstats.pl -config=cdn -output

故障排查案例

案例 1：网站访问缓慢

问题描述：用户反馈网站加载缓慢

排查步骤：

1. 检查 CDN 节点

bash
# Ping 测试
ping cdn.example.com

# 检查响应时间
curl -w "@curl-format.txt" -o /dev/null -s https://cdn.example.com/

2. 检查缓存命中率

bash
# 分析访问日志
grep "MISS" access.log | wc -l

3. 检查源站

bash
# 直接访问源站
curl -w "@curl-format.txt" -o /dev/null -s https://origin.example.com/

4. 解决方案：

提高缓存命中率
优化源站性能
增加 CDN 节点

案例 2：内容更新未生效

问题描述：更新内容后，用户仍看到旧内容

排查步骤：

1. 检查缓存 TTL

bash
# 检查缓存控制头
curl -I https://cdn.example.com/file.jpg | grep -i cache-control

2. 检查缓存键

bash
# 检查缓存键配置
nginx -T | grep proxy_cache_key

3. 检查缓存状态

bash
# 检查缓存是否命中
curl -I https://cdn.example.com/file.jpg | grep -i x-cache

4. 解决方案：

刷新 CDN 缓存
使用版本控制
调整 TTL 设置

案例 3：HTTPS 证书错误

问题描述：浏览器提示证书错误

排查步骤：

1. 检查 SSL 证书

bash
# 检查证书信息
openssl s_client -connect cdn.example.com:443 -servername cdn.example.com

# 检查证书有效期
echo | openssl s_client -connect cdn.example.com:443 2>/dev/null | openssl x509 -noout -dates

2. 检查证书链

bash
# 检查证书链完整性
openssl s_client -connect cdn.example.com:443 -showcerts

3. 解决方案：

更新 SSL 证书
配置完整的证书链
检查证书配置

故障预防措施

1. 监控告警

关键指标：

节点可用性
响应时间
错误率
缓存命中率

告警配置：

yaml
# Prometheus 告警规则
groups:
  - name: cdn_alerts
    rules:
      - alert: HighErrorRate
        expr: cdn_errors_total / cdn_requests_total * 100 > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

2. 健康检查

定期检查：

bash
# 健康检查脚本
#!/bin/bash
while true; do
  status=$(curl -s -o /dev/null -w "%{http_code}" https://cdn.example.com/health)
  if [ $status -ne 200 ]; then
    echo "Health check failed: $status"
    # 发送告警
  fi
  sleep 60
done

3. 备份和容灾

备份策略：

定期备份配置
备份 SSL 证书
备份 DNS 记录

容灾方案：

多 CDN 策略
源站冗余
自动故障转移

面试要点

回答这个问题时应该强调：

掌握系统化的故障排查流程
熟练使用各种排查工具
有实际的故障排查经验
能够快速定位和解决问题
有故障预防和容灾的意识