February 17, 23:35

What are the common tools, configuration methods, and best practices for Linux system monitoring and alerting?

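Monitoring and alerting keep a Linux system stable by exposing resource pressure and failures early, before they turn into outages. Doing this well means knowing both the command-line tools used for on-host inspection and the alerting systems that watch hosts continuously.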

System monitoring tools:

CPU monitoring:
- top: view CPU usage and per-process information in real time
- htop: interactive process viewer with per-core meters, scrolling, and a process tree view
- mpstat: display usage of each CPU core (sysstat package)
- sar: system activity reporter that can record and replay historical data (sysstat package)
- vmstat: report run queue length, context switches, and the CPU time breakdown alongside memory statistics

Memory monitoring:
- free: display total, used, and available memory and swap
- vmstat: view swap-in/swap-out activity, buffers, and cache
- ps aux: view per-process memory usage (%MEM, VSZ, RSS)
- pmap: view the memory mappings of a process

Disk monitoring:
- df: view disk space usage per filesystem
- du: view the size of a directory or file
- iostat: view disk I/O statistics per device (sysstat package)
- iotop: view disk I/O usage per process in real time (usually requires root)

Network monitoring:
- ip (replaces the deprecated ifconfig): view network interface configuration
- ss (replaces the deprecated netstat): view network connections and listening ports
- nethogs: view network bandwidth usage by process
- tcpdump: capture and analyze network traffic
- iftop: display bandwidth usage per connection on an interface in real time

Process monitoring:
- ps: view process status
- top/htop: monitor processes in real time
- pgrep: find process IDs by name or other attributes
- pidstat: monitor per-process CPU, memory, and I/O usage (sysstat package)
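
To make the CPU figures reported by tools such as top and mpstat less opaque, here is a minimal sketch of how a utilization percentage can be derived from two samples of the aggregate `cpu` line in /proc/stat. The function name `cpu_busy_pct` is made up for this illustration, and counting iowait as idle time is one common convention rather than the only one.

```bash
#!/bin/bash
# cpu_busy_pct: percentage of non-idle CPU time between two samples of the
# aggregate "cpu" line from /proc/stat.
# Columns after the label: user nice system idle iowait irq softirq steal ...
cpu_busy_pct() {
    # $1 = earlier sample line, $2 = later sample line
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i   # user through steal
            idle = $5 + $6                                     # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Usage on a live system (two samples, one second apart):
#   a=$(grep "^cpu " /proc/stat); sleep 1; b=$(grep "^cpu " /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

The counters in /proc/stat are cumulative since boot, so only the difference between two samples says anything about current load.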

Performance analysis tools:

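strace: trace the system calls and signals of a process
ltrace: trace library function calls
perf: kernel-supported profiler for CPU sampling, hardware counters, and tracepoints
sysdig: capture and filter system-level events for monitoring and troubleshooting
eBPF (extended Berkeley Packet Filter): run sandboxed programs inside the kernel for low-overhead tracing, typically through front ends such as bcc and bpftrace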

Log monitoring:

/var/log/messages: main system log (CentOS/RHEL)
/var/log/syslog: main system log (Debian/Ubuntu)
/var/log/auth.log: authentication log (Debian/Ubuntu)
/var/log/secure: authentication and security log (CentOS/RHEL)
journalctl: query and follow the systemd journal
logrotate: rotate, compress, and prune log files
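
As one example of turning these logs into a monitoring signal, the sketch below counts failed SSH password attempts per source address. It assumes the usual OpenSSH message wording ("Failed password for ... from <ip> port ..."); the function name `failed_ssh_by_ip` is invented for this example.

```bash
#!/bin/bash
# failed_ssh_by_ip: read sshd log lines on stdin and print "<count> <ip>",
# highest count first.
failed_ssh_by_ip() {
    awk '/Failed password for/ {
             # The source address is the token right after the word "from".
             for (i = 1; i < NF; i++)
                 if ($i == "from") { count[$(i + 1)]++; break }
         }
         END { for (ip in count) print count[ip], ip }' |
        sort -k1,1nr -k2,2
}

# Usage (pick the source that exists on your distribution):
#   failed_ssh_by_ip < /var/log/auth.log        # Debian/Ubuntu
#   failed_ssh_by_ip < /var/log/secure          # CentOS/RHEL
#   journalctl -u ssh --no-pager | failed_ssh_by_ip
```

The systemd unit is typically named `ssh` on Debian/Ubuntu and `sshd` on CentOS/RHEL, so adjust the `journalctl -u` argument accordingly.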

Monitoring and alerting systems:

Nagios: host and service check monitoring system with a large plugin ecosystem
Zabbix: distributed monitoring system with agents, templates, and built-in alerting
Prometheus: pull-based monitoring system with a built-in time series database
Grafana: dashboard and data visualization platform
ELK Stack (Elasticsearch, Logstash, Kibana): log collection, search, and visualization
Datadog: hosted (SaaS) cloud monitoring platform
New Relic: hosted application performance monitoring (APM)

Prometheus monitoring:

Data collection: Prometheus periodically scrapes metrics over HTTP from exporters running alongside each target
Common Exporters:
- node_exporter: system metrics
- mysqld_exporter: MySQL metrics
- nginx_exporter: Nginx metrics
- redis_exporter: Redis metrics
Configuration file: /etc/prometheus/prometheus.yml
Alert rules: define alert conditions as PromQL expressions in rule files referenced from prometheus.yml
Alert management: Alertmanager deduplicates, groups, and routes alerts to notification channels
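
To show what a PromQL-based alert rule looks like in practice, here is a small sketch that writes a rule file for a disk usage alert. The metric names come from node_exporter; the group name, alert name, 80% threshold, and 10 minute hold time are illustrative choices, and `write_disk_alert_rule` is a helper invented for this example.

```bash
#!/bin/bash
# write_disk_alert_rule: create a Prometheus rule file with one disk usage alert.
write_disk_alert_rule() {
    # $1 = path of the rule file to create
    # The quoted heredoc keeps {{ $labels.instance }} literal for Prometheus.
    cat > "$1" <<'EOF'
groups:
  - name: node_alerts
    rules:
      - alert: HostDiskUsageHigh
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem on {{ $labels.instance }} is above 80% full"
EOF
}
```

A typical workflow would be to generate the file, validate it with `promtool check rules <file>`, and then list it under `rule_files:` in /etc/prometheus/prometheus.yml before reloading Prometheus.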

Grafana visualization:

Data source configuration: supports Prometheus, Elasticsearch, etc.
Dashboards: create custom monitoring panels
Alerts: define alert rules on data source queries (older Grafana versions attached alerts to individual dashboard panels)
Templates: use variables to create dynamic dashboards

Alert notification methods:

Email: SMTP email notification
SMS: SMS gateway
Instant messaging: Slack, DingTalk, WeCom (Enterprise WeChat)
Phone: automated voice call
Webhook: custom web callback
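
A webhook notification is usually just an HTTP POST with a JSON body. The sketch below assumes a receiver that accepts a Slack-style payload of the form `{"text": "..."}`; the function names are made up for this example, the webhook URL is whatever your chat tool issues, and the escaping only covers single-line messages.

```bash
#!/bin/bash
# json_escape: escape backslashes and double quotes for use inside a JSON string.
# Assumption: the message is a single line (newlines are not escaped here).
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_payload: produce a Slack-style JSON body.
build_payload() {
    # $1 = alert level, $2 = message text
    printf '{"text":"[%s] %s"}' "$(json_escape "$1")" "$(json_escape "$2")"
}

# send_alert: POST the payload to a webhook URL.
send_alert() {
    # $1 = webhook URL, $2 = alert level, $3 = message text
    curl -fsS --max-time 10 -X POST \
         -H 'Content-Type: application/json' \
         -d "$(build_payload "$2" "$3")" "$1"
}

# Usage:
#   send_alert "$WEBHOOK_URL" CRITICAL "Disk usage is 91% on web1"
```

Keeping payload construction separate from the HTTP call makes the formatting easy to check without contacting a real endpoint.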

Alert strategies:

Alert levels: Critical, Warning, Info
Alert thresholds: set reasonable thresholds based on business requirements
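Alert suppression: silence or inhibit dependent alerts to avoid alert storms (for example, mute per-service alerts while the host itself is down)
Alert aggregation: merge related alerts into a single notification
Alert escalation: automatically escalate to the next responder when an alert is not acknowledged within a set time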
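
For script-based alerting without Alertmanager, suppression can be approximated with a simple cooldown: send a notification only if the same alert has not been sent recently. This is a minimal sketch, assuming one small state file per alert key; `should_notify` and the state directory layout are inventions for this example.

```bash
#!/bin/bash
# should_notify: succeed (exit status 0) when the alert may be sent, fail when
# it is still inside its cooldown window.
should_notify() {
    # $1 = state directory, $2 = alert key (a plain name without slashes),
    # $3 = cooldown in seconds, $4 = current epoch seconds (defaults to now)
    local state_file="$1/$2.last"
    local now="${4:-$(date +%s)}"
    local last
    if [ -f "$state_file" ]; then
        last=$(cat "$state_file")
        if [ $((now - last)) -lt "$3" ]; then
            return 1    # inside the cooldown window: suppress
        fi
    fi
    printf '%s\n' "$now" > "$state_file"   # remember when we last notified
    return 0
}

# Usage:
#   mkdir -p /var/tmp/alert-state
#   if should_notify /var/tmp/alert-state disk_root 1800; then
#       echo "CRITICAL: disk usage high"   # replace with the real notification
#   fi
```

The optional fourth argument exists mainly so the logic can be exercised with fixed timestamps.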

Custom monitoring scripts:

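Write Shell/Python scripts to collect metrics that existing tools do not cover
Run them on a schedule with cron (or a systemd timer)
Output format: match what the consumer expects, such as Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) or the Prometheus text exposition format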

Example:

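```bash
#!/bin/bash
# Check root filesystem usage (Nagios exit codes: 0 OK, 2 CRITICAL, 3 UNKNOWN)
# -P keeps each filesystem on one line even when the device name is long
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
# Bail out if df returned nothing or a non-numeric value
case "$DISK_USAGE" in
    ''|*[!0-9]*) echo "UNKNOWN: could not read disk usage"; exit 3 ;;
esac
if [ "$DISK_USAGE" -gt 80 ]; then
    echo "CRITICAL: Disk usage is ${DISK_USAGE}%"
    exit 2
fi
echo "OK: Disk usage is ${DISK_USAGE}%"
exit 0
```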
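
The same measurement can also be exposed to Prometheus instead of Nagios. One way is the node_exporter textfile collector, which reads `*.prom` files from a directory. The sketch below assumes a metric name (`custom_disk_used_percent`) and a directory (/var/lib/node_exporter/textfile) chosen for illustration; the directory has to match the `--collector.textfile.directory` flag node_exporter was started with.

```bash
#!/bin/bash
# format_disk_metric: print one gauge in the Prometheus text exposition format.
format_disk_metric() {
    # $1 = mount point, $2 = used percentage (integer)
    printf '# HELP custom_disk_used_percent Percentage of filesystem space in use.\n'
    printf '# TYPE custom_disk_used_percent gauge\n'
    printf 'custom_disk_used_percent{mountpoint="%s"} %s\n' "$1" "$2"
}

# write_disk_metric: measure root filesystem usage and publish it as disk.prom.
write_disk_metric() {
    # $1 = textfile collector directory
    local dir="$1" pct tmp
    pct=$(df -P / | awk 'NR==2 {gsub(/%/, "", $5); print $5}')
    tmp=$(mktemp "$dir/disk.prom.XXXXXX") || return 1
    format_disk_metric "/" "$pct" > "$tmp"
    chmod 644 "$tmp"    # mktemp creates the file 0600; node_exporter must be able to read it
    # A rename within one filesystem is atomic, so a half-written file is never scraped.
    mv "$tmp" "$dir/disk.prom"
}

# Example cron entry (hypothetical install path), every 5 minutes:
#   */5 * * * * /usr/local/bin/disk_metric.sh
# where the script ends with:
#   write_disk_metric /var/lib/node_exporter/textfile
```

Writing to a temporary name first matters because the collector only picks up files ending in `.prom`, so the partially written file is ignored until the rename.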

Monitoring best practices:

Comprehensive monitoring: cover key metrics such as CPU, memory, disk, network
Reasonable sampling: choose collection intervals and retention periods that balance detail against data volume
Alert classification: separate urgent alerts that need immediate action from routine ones
Alert convergence: deduplicate and group repeated alerts
Regular maintenance: purge expired data and keep monitoring rules up to date
Documentation: maintain monitoring configuration documentation
Testing and verification: regularly test alerting mechanisms

Common monitoring metrics:

System metrics: CPU usage, memory usage, disk usage, network traffic
Application metrics: request count, response time, error rate, concurrency
Business metrics: order count, user count, transaction amount
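
As an illustration of an application metric, the error rate can be estimated directly from a web server access log. This sketch assumes the common "combined" log format used by Nginx and Apache, in which the HTTP status code is the ninth whitespace-separated field; `error_rate` is a name made up for this example.

```bash
#!/bin/bash
# error_rate: read access log lines on stdin and print the share of 5xx
# responses as a percentage with two decimals.
error_rate() {
    awk '{ total++; if ($9 ~ /^5[0-9][0-9]$/) errors++ }
         END {
             if (total == 0) print "0.00"
             else printf "%.2f\n", 100 * errors / total
         }'
}

# Usage:
#   tail -n 1000 /var/log/nginx/access.log | error_rate
```

A custom log format moves the status code to a different field, in which case the `$9` reference needs adjusting.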

Troubleshooting process:

1. Confirm alert information
2. View system monitoring data
3. Check related service status
4. Analyze log files
5. Identify root cause
6. Implement remediation measures
7. Verify fix effectiveness
8. Summarize lessons learned
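
The data-gathering steps of this process (monitoring data, service status, logs) can be partly scripted so that the same snapshot is collected every time an alert fires. The following is a rough sketch, not a complete runbook; the `section` and `triage` function names and the particular set of commands are choices made for this example.

```bash
#!/bin/bash
# section: print a header, then run the command if its binary is installed.
section() {
    local title="$1"; shift
    printf '== %s ==\n' "$title"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        printf '(skipped: %s not installed)\n' "$1"
    fi
}

# triage: collect a first-look snapshot of the host.
triage() {
    section "Load and uptime"      uptime
    section "Memory"               free -m
    section "Disk space"           df -hP
    section "Listening sockets"    ss -tuln
    section "Failed systemd units" systemctl --failed --no-pager
    section "Recent errors"        journalctl -p err -n 20 --no-pager
}

# Usage:
#   triage | tee "/tmp/triage-$(date +%Y%m%d-%H%M%S).txt"
```

Saving the output with a timestamp gives the later root cause analysis and the lessons-learned write-up a record of what the system looked like at the time.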

Tags: Linux