May 27, 17:37

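Prometheus Pull vs. Push: How to Choose?

How Prometheus collects data is one of the most important choices in a monitoring architecture. The differences between the pull and push models directly affect how reliable, maintainable, and scalable the monitoring system is.

Pull mode: the native Prometheus approach

In pull mode, the Prometheus server actively sends HTTP requests to each target and scrapes metrics from its /metrics endpoint. This has been the core model of Prometheus since its initial design.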

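How it works: Prometheus reads the targets listed under scrape_configs, requests each target's metrics endpoint at the configured interval (the built-in default is 1m; 15s is a common setting in example configurations), and stores the returned samples in its time series database.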
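The workflow above can be sketched as a minimal prometheus.yml. The job name and target address are hypothetical placeholders, and the 15s interval is only an illustrative choice:

```yaml
# Minimal sketch; 'web-app' and its address are made-up names.
global:
  scrape_interval: 15s        # applies to every job that does not override it
scrape_configs:
  - job_name: 'web-app'
    metrics_path: /metrics    # this is also the default path
    static_configs:
      - targets: ['web-app.internal:8080']
```

With this file, Prometheus would request http://web-app.internal:8080/metrics every 15 seconds and attach job="web-app" to every sample it stores.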

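The core advantage is that control sits with the collector. Prometheus fully owns the scrape schedule and the target list. If a target goes down, Prometheus records the failed scrape (the target's up metric drops to 0), which alerting rules can act on, and it never ingests stale or fabricated data.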
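A failed scrape only becomes an alert once a rule says so. The following is a rough sketch of such a rule file, assuming it is loaded through rule_files in prometheus.yml. The group name, alert name, 2m hold time, and severity label are illustrative assumptions:

```yaml
# Sketch of an alerting rule on the built-in `up` metric.
groups:
  - name: target-availability
    rules:
      - alert: TargetDown
        expr: up == 0          # the most recent scrape of this target failed
        for: 2m                # require the failure to persist before firing
        labels:
          severity: critical
        annotations:
          summary: 'Scrape of {{ $labels.instance }} (job {{ $labels.job }}) is failing'
```

The for clause keeps a single missed scrape from paging anyone. How long to wait depends on the scrape interval in use.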

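Service discovery is what makes pull mode practical at scale. In Kubernetes, Prometheus discovers new Pods and Services automatically through the Kubernetes API, so the configuration never has to be updated by hand: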

yaml
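scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod    # discover every Pod through the Kubernetes API
    relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # If a Pod sets prometheus.io/path, use it instead of the default /metrics
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)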

Beyond Kubernetes, Prometheus also supports Consul, DNS SRV, EC2, Azure, and other service discovery mechanisms, so it fits a range of infrastructure environments.
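As a sketch of what two of these mechanisms look like in configuration, the fragment below uses Consul and DNS SRV discovery. The server address, service names, and DNS record name are hypothetical:

```yaml
# Sketch only; addresses and names are placeholders.
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.internal:8500'
        services: ['web', 'api']    # omit to discover every registered service
  - job_name: 'dns-srv'
    dns_sd_configs:
      - names: ['_metrics._tcp.example.internal']
        type: SRV                   # host and port come from the SRV records
```

In both cases the target list follows the registry or DNS zone, so adding or removing an instance needs no change to prometheus.yml.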

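Pull mode fits long-running services best: web services, databases, message queues, and the like. These services expose a stable port that Prometheus can scrape continuously.

Push mode: a supplement for short-lived tasks

In Prometheus, push does not mean that applications send data straight to Prometheus. It goes through the Pushgateway: the application pushes its metrics to the Pushgateway, and Prometheus then scrapes the Pushgateway.

Why this design? Some tasks live for less time than the scrape interval. A batch job that runs for only 5 seconds may well finish before a 15s scrape cycle ever reaches it. The Pushgateway solves this: the task pushes its metrics before it exits, and Prometheus scrapes them from the Pushgateway afterwards.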

bash
# Push metrics to the Pushgateway
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/batch_task/instance/task01
# TYPE batch_duration_seconds gauge
batch_duration_seconds 3.14
# TYPE batch_status counter
batch_status{result="success"} 1
EOF

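The use cases for push mode are narrow and well defined:

Batch jobs: scheduled batch runs, data cleansing, report generation, and other short-lived work
Ad hoc scripts: one-off tool scripts that cannot keep a port open
Firewall-isolated environments: the target sits on an internal network that Prometheus cannot reach directly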

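But the Pushgateway brings problems of its own. It is a single point of failure: while it is down, pushes fail, and unless it was started with --persistence.file, the metrics it already held are lost on restart. More importantly, data on the Pushgateway never expires on its own. The metrics of a task that stopped long ago stay there, and Prometheus keeps scraping the stale values. Set honor_labels: true in the scrape config for the Pushgateway so that the pushed job and instance labels are preserved, and manage the data lifecycle explicitly: the Pushgateway has no built-in expiry setting, so stale groups have to be deleted through its HTTP API (a DELETE request to the same /metrics/job/... URL that was used for the push).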
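A minimal sketch of the scrape job for the Pushgateway itself, assuming it is reachable at pushgateway:9091 as in the push example above:

```yaml
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true    # keep the job/instance labels that the tasks pushed
    static_configs:
      - targets: ['pushgateway:9091']
```

Without honor_labels: true, Prometheus would rename the pushed labels to exported_job and exported_instance and label every series with the Pushgateway's own job and instance, which hides which task the data came from.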

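Core differences between the two modes

At the architecture level, the fundamental difference is who holds the initiative:

Pull: the collector sets the pace. A target only has to expose a port and does not need to know Prometheus exists. If the target goes down, the collector notices on the next scrape.
Push: the reporter sets the pace. The target pushes actively and the collector receives passively. If the target goes down, the collector has no way of knowing.

In terms of data quality:

In pull mode, a failed scrape is simply a failure and produces no false data
In push mode, old data on the Pushgateway can be scraped again and again, which distorts the metrics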
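One way to catch stale pushed data is to alert on push_time_seconds, a metric the Pushgateway exposes for each group with the time of its last successful push. The sketch below assumes the batch_task job from the earlier push example, a scrape job with honor_labels: true, and a task expected to run at least once a day. The alert name and thresholds are illustrative:

```yaml
# Sketch of a staleness alert for pushed metrics.
groups:
  - name: pushgateway-staleness
    rules:
      - alert: BatchTaskPushStale
        # Seconds since the last successful push for this group
        expr: time() - push_time_seconds{job="batch_task"} > 86400
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'No push from {{ $labels.job }}/{{ $labels.instance }} in over 24h'
```

This does not remove the stale values. It only makes their age visible, so the group can be deleted or the failed task investigated.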

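In terms of operational complexity:

Pull mode depends on service discovery and target reachability, so the network design has to guarantee connectivity from Prometheus to every target
Push mode depends on the availability of the Pushgateway and on a data cleanup strategy, and running the Pushgateway is itself extra operational work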

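How to choose: a decision guide

One rule sums up the choice: use pull whenever you can, and use push only when pull will not work.

In concrete terms:

Long-running service that is reachable over the network → Pull
Kubernetes / Docker environment → Pull + service discovery
Task lifetime shorter than the scrape interval → Push (Pushgateway)
Target on an isolated network that Prometheus cannot reach → Push (or deploy a Prometheus Agent that uses remote write)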
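For the isolated-network case, the agent option can be sketched as follows: a Prometheus instance inside the isolated network scrapes locally and forwards samples outward with remote_write. The target address and the remote URL are hypothetical, the instance is assumed to be started in agent mode (the command-line flag for that depends on the Prometheus version), and the receiving side is assumed to accept remote write requests:

```yaml
# Sketch of an agent-side config; all hostnames are placeholders.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'isolated-services'
    static_configs:
      - targets: ['app-1.internal:8080']
remote_write:
  - url: 'https://prometheus.example.com/api/v1/write'
```

Scraping still happens by pull inside the isolated network, so a failed scrape is still detected there. Only the outbound connection to the central system crosses the firewall.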

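In real production environments, mixing the two is the norm. Core business services use pull, batch jobs go through the Pushgateway, and both coexist in the same Prometheus deployment. The key is not to overuse the Pushgateway: it exists for short-lived tasks only, so do not push metrics from long-running services to it.

The scrape interval also needs balancing: too short and it adds network and storage load, too long and it can miss transient spikes. A general recommendation is 15s to 1min, with 10s for critical services and 2-5min for low-frequency metrics.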
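These tiers can be expressed with per-job overrides of the global interval. The sketch below uses hypothetical job names and targets, and the values simply mirror the ranges suggested above:

```yaml
# Sketch; job names and targets are placeholders.
global:
  scrape_interval: 30s            # default for ordinary services
scrape_configs:
  - job_name: 'checkout-api'      # critical service, finer resolution
    scrape_interval: 10s
    static_configs:
      - targets: ['checkout.internal:8080']
  - job_name: 'capacity-report'   # slow-moving metrics
    scrape_interval: 2m
    static_configs:
      - targets: ['capacity.internal:9100']
```

By default, queries look back at most 5 minutes for the latest sample of a series, so intervals close to that bound can show gaps in graphs and alerts after a single missed scrape.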

Tags: Prometheus