February 17, 23:35
What are the common tools, configuration methods, and best practices for Linux system monitoring and alerting?
Monitoring and alerting are essential for keeping Linux systems stable, and both depend on knowing the available monitoring tools and alerting mechanisms.
System monitoring tools:
- CPU monitoring:
  - top: view CPU usage and process information in real time
  - htop: interactive process viewer with scrolling, sorting, and a tree view
  - mpstat: display usage for each CPU core
  - sar: system activity reporter; can record and replay historical data
  - vmstat: report run queue, context switch, and CPU time statistics alongside virtual memory
- Memory monitoring:
  - free: display memory and swap usage
  - vmstat: view swap activity, buffers, and cache
  - ps aux: view per-process memory usage
  - pmap: view the memory mappings of a process
- Disk monitoring:
  - df: view disk space usage per filesystem
  - du: view the size of a directory or file
  - iostat: view disk I/O statistics
  - iotop: view per-process disk I/O in real time
- Network monitoring:
  - ifconfig/ip: view network interface configuration (ip replaces the deprecated ifconfig)
  - netstat/ss: view network connections and listening ports (ss replaces the deprecated netstat)
  - nethogs: view network bandwidth usage by process
  - tcpdump: capture and analyze network traffic
  - iftop: display per-connection bandwidth usage in real time
- Process monitoring:
  - ps: view process status
  - top/htop: monitor processes in real time
  - pgrep: find process IDs by name or other attributes
  - pidstat: monitor per-process CPU, memory, and I/O usage
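Most of the tools above read their numbers from /proc and from the filesystem, so a quick health snapshot can be scripted without any extra packages. The following is a minimal sketch, not a replacement for the tools listed; the function names are made up for illustration, and it assumes a Linux kernel recent enough to expose MemAvailable in /proc/meminfo.

```shell
#!/bin/bash
# Hypothetical helper names; each function prints a single number.

# 1-minute load average, read straight from the kernel.
load_1min() {
    cut -d ' ' -f1 /proc/loadavg
}

# Percentage of memory in use, derived from MemTotal and MemAvailable (kB).
mem_used_percent() {
    awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
         END { if (t > 0) printf "%d\n", (t - a) * 100 / t }' /proc/meminfo
}

# Percentage of space used on the filesystem that holds the given path.
# -P forces the POSIX one-line-per-filesystem output format.
disk_used_percent() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# One-line summary combining the three values.
snapshot() {
    printf 'load1=%s mem_used=%s%% disk_used=%s%%\n' \
        "$(load_1min)" "$(mem_used_percent)" "$(disk_used_percent /)"
}
```

Calling `snapshot` prints one line with the current load, memory, and root filesystem usage, which is convenient to append to a log file from cron.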
Performance analysis tools:
- strace: trace system calls and signals
- ltrace: trace library function calls
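- perf: CPU profiler and tracer built on kernel performance counters
- sysdig: capture and filter system calls for system-level troubleshooting
- eBPF: extended Berkeley Packet Filter, an in-kernel framework for low-overhead tracing, used through tools such as bcc and bpftrace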
Log monitoring:
- /var/log/messages: main system log (CentOS/RHEL)
- /var/log/syslog: main system log (Debian/Ubuntu)
- /var/log/auth.log: authentication log (Debian/Ubuntu)
- /var/log/secure: authentication and security log (CentOS/RHEL)
- journalctl: systemd log viewing tool
- logrotate: log rotation tool
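As an illustration of log monitoring, the sketch below counts failed SSH password attempts in an authentication log and lists the most frequent source addresses. It assumes the standard sshd message text ("Failed password for ... from ADDRESS port ..."); the function names are hypothetical, and the log path is passed in because it differs between distributions.

```shell
#!/bin/bash
# Count lines that record a failed SSH password attempt.
# $1 = path to the authentication log
count_failed_logins() {
    grep -c 'Failed password' "$1"
}

# List the source addresses with the most failed attempts.
# $1 = path to the authentication log, $2 = how many to show (default 5)
top_failed_sources() {
    grep 'Failed password' "$1" \
        | sed -n 's/.* from \([^ ]*\) port .*/\1/p' \
        | sort | uniq -c | sort -rn | head -n "${2:-5}"
}
```

For example, `count_failed_logins /var/log/auth.log` on Debian/Ubuntu or `count_failed_logins /var/log/secure` on CentOS/RHEL. On systems that log only to the journal, the same filters can be applied to the output of `journalctl -u ssh --since "1 hour ago"` (the unit is named sshd on some distributions).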
Monitoring and alerting systems:
- Nagios: enterprise-level monitoring system
- Zabbix: distributed monitoring system
- Prometheus: time series database and monitoring system
- Grafana: data visualization platform
- ELK Stack (Elasticsearch, Logstash, Kibana): log analysis and visualization
- Datadog: cloud monitoring platform
- New Relic: application performance monitoring
Prometheus monitoring:
- Data collection: Prometheus scrapes metrics over HTTP from exporters
- Common exporters:
  - node_exporter: system metrics
  - mysqld_exporter: MySQL metrics
  - nginx_exporter: Nginx metrics
  - redis_exporter: Redis metrics
- Configuration file: /etc/prometheus/prometheus.yml
- Alert rules: define alert conditions using PromQL
- Alert management: Alertmanager handles alert notifications
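To make the alert-rule item concrete, the sketch below writes a small rule file that fires when the root filesystem has less than 20% free space for ten minutes. The metric names come from node_exporter; the group name, alert name, threshold, duration, and the function name `write_disk_alert_rule` are assumptions chosen for illustration. The heredoc delimiter is quoted so the shell leaves `$labels` untouched.

```shell
#!/bin/bash
# Write an example Prometheus alerting rule file.
# $1 = destination path for the rule file
write_disk_alert_rule() {
    cat > "$1" <<'EOF'
groups:
  - name: node-disk
    rules:
      - alert: RootDiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 20% free space on / ({{ $labels.instance }})"
EOF
}
```

The generated file would then be listed under `rule_files` in prometheus.yml, and `promtool check rules <file>` can validate it before Prometheus is reloaded.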
Grafana visualization:
- Data source configuration: supports Prometheus, Elasticsearch, etc.
- Dashboards: create custom monitoring panels
- Alerts: define alert rules on the queries that back dashboard panels
- Templates: use variables to create dynamic dashboards
Alert notification methods:
- Email: SMTP email notification
- SMS: SMS gateway
- Instant messaging: Slack, DingTalk, Enterprise WeChat
- Phone: voice notification
- Webhook: custom web callback
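A webhook notification is usually just an HTTP POST with a JSON body. The sketch below builds such a body and sends it with curl. The payload fields (severity, host, message) and the function names are hypothetical; real services such as Slack or DingTalk define their own payload formats, so the body would need to be adapted. The escaping helper handles only backslashes and double quotes, not newlines or control characters.

```shell
#!/bin/bash
# Escape backslashes and double quotes so a value can sit inside a JSON string.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the JSON body. $1 = severity, $2 = host, $3 = message
build_payload() {
    printf '{"severity":"%s","host":"%s","message":"%s"}' \
        "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

# POST the payload. $1 = webhook URL, $2..$4 = severity, host, message
# --fail makes curl return non-zero on HTTP errors; --max-time bounds the wait.
send_alert() {
    curl --silent --show-error --fail --max-time 10 \
        -H 'Content-Type: application/json' \
        -d "$(build_payload "$2" "$3" "$4")" \
        "$1"
}
```

A call would look like `send_alert "$WEBHOOK_URL" critical "$(hostname)" "Disk usage above 90%"`, where `WEBHOOK_URL` is a placeholder for the endpoint provided by the receiving service.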
Alert strategies:
- Alert levels: Critical, Warning, Info
- Alert thresholds: set reasonable thresholds based on business requirements
- Alert suppression: avoid alert storms
- Alert aggregation: merge related alerts for notification
- Alert escalation: escalate automatically when an alert stays unacknowledged past a set time
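Two of these strategies, levels with thresholds and suppression, are easy to sketch in a script. The example below maps a value to a level and uses a timestamp file to suppress repeat notifications for the same alert within an interval. The function names, the state directory, and the idea of one state file per alert key are illustrative assumptions; systems such as Alertmanager implement grouping and inhibition far more completely.

```shell
#!/bin/bash
# Map a numeric value to an alert level.
# $1 = value, $2 = warning threshold, $3 = critical threshold
classify() {
    value=$1 warn=$2 crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo CRITICAL
    elif [ "$value" -ge "$warn" ]; then
        echo WARNING
    else
        echo OK
    fi
}

# Return 0 (notify) if no notification for this key was recorded within the
# last $3 seconds, otherwise return 1 (suppress).
# $1 = alert key, $2 = state directory, $3 = minimum interval in seconds
should_notify() {
    key=$1 state_dir=$2 interval=$3
    stamp="$state_dir/$key.last"
    now=$(date +%s)
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
        if [ $((now - last)) -lt "$interval" ]; then
            return 1
        fi
    fi
    # Record this notification time for later calls.
    echo "$now" > "$stamp"
    return 0
}
```

A check script could combine them as `level=$(classify "$usage" 80 90)` followed by `should_notify disk_root /var/tmp/alert-state 1800 && echo "$level"`, which sends at most one notification per half hour for that key.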
Custom monitoring scripts:
- Write Shell/Python scripts to collect metrics
- Execute regularly using cron
- Output format: match what the consumer expects, such as Nagios plugin exit codes or the Prometheus text exposition format
- Example:
```bash
#!/bin/bash
# Check root filesystem usage. Exit codes follow the Nagios plugin
# convention: 0 = OK, 2 = CRITICAL.
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
    echo "CRITICAL: Disk usage is ${DISK_USAGE}%"
    exit 2
fi
echo "OK: Disk usage is ${DISK_USAGE}%"
exit 0
```
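The example above reports through Nagios-style exit codes. For Prometheus, the same measurement can instead be printed in the text exposition format and picked up by the node_exporter textfile collector. The sketch below is one way to do that; the metric name `custom_disk_used_percent` and the function name are made up for illustration.

```shell
#!/bin/bash
# Print disk usage for one mount point in Prometheus text exposition format.
# $1 = mount point
emit_disk_metric() {
    mount=$1
    usage=$(df -P "$mount" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    printf '# HELP custom_disk_used_percent Percentage of disk space used.\n'
    printf '# TYPE custom_disk_used_percent gauge\n'
    printf 'custom_disk_used_percent{mountpoint="%s"} %s\n' "$mount" "$usage"
}
```

Run from cron, the output would typically be written to a temporary file and then moved into the directory given to node_exporter with `--collector.textfile.directory`, so that a scrape never reads a half-written file. The file must end in `.prom`; the directory itself is site-specific.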
Monitoring best practices:
- Comprehensive monitoring: cover key metrics such as CPU, memory, disk, network
- Reasonable sampling: choose collection intervals that capture problems without producing excessive data
- Alert classification: distinguish urgent alerts from routine ones
- Alert deduplication: avoid sending the same alert repeatedly
- Regular maintenance: clean expired data, update monitoring rules
- Documentation: maintain monitoring configuration documentation
- Testing and verification: regularly test alerting mechanisms
Common monitoring metrics:
- System metrics: CPU usage, memory usage, disk usage, network traffic
- Application metrics: request count, response time, error rate, concurrency
- Business metrics: order count, user count, transaction amount
Troubleshooting process:
- Confirm alert information
- View system monitoring data
- Check related service status
- Analyze log files
- Identify root cause
- Implement remediation measures
- Verify fix effectiveness
- Summarize lessons learned
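The first few steps of this process (view monitoring data, check service status, analyze logs) can be gathered into a single triage script so the same information is collected every time. The sketch below is one possible arrangement; the function names are hypothetical, and tools that are not installed are skipped rather than treated as errors.

```shell
#!/bin/bash
# Print a header, then run the command if it is installed.
run_if_present() {
    printf '== %s\n' "$*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(skipped: $1 not installed)"
    fi
}

# Collect a first-pass view of the system and of one service.
# $1 = systemd unit name to inspect
triage() {
    service=$1
    run_if_present uptime                       # load averages
    run_if_present free -m                      # memory and swap, in MiB
    run_if_present df -h                        # disk space per filesystem
    run_if_present systemctl status "$service" --no-pager
    run_if_present journalctl -u "$service" --since "1 hour ago" --no-pager -n 50
}
```

For example, `triage nginx > /tmp/triage.txt` saves the output so it can be attached to the incident notes used in the final step.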