May 30, 01:39
What is distributed tracing? What are the mainstream tracing tools? How do they work?
Distributed tracing follows a request's call chain across multiple services, which makes it a key tool for locating problems quickly and analyzing performance in distributed systems:
Core Concepts:
1. Trace
- A complete request call chain
- The entire process from client initiating request to final response
- Contains multiple Spans
2. Span
- A specific call operation
- Includes start time, end time, operation name, etc.
- Spans form a call tree through parent-child relationships
3. Span ID
- Uniquely identifies a Span
- Used to build the call chain
4. Trace ID
- Uniquely identifies a complete trace
- All related Spans share the same Trace ID
5. Parent Span ID
- Identifies the parent Span of the current Span
- Used to build call hierarchy
6. Annotation
- Records timestamps of key events
- Zipkin-style examples: CS (Client Send), SR (Server Receive), SS (Server Send), CR (Client Receive)
7. Baggage
- Key-value data passed along the call chain
- Used to pass context information between services
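The way Trace ID, Span ID, and Parent Span ID combine into a call tree can be sketched with a small data model. This is only an illustration under simplifying assumptions, not the data model of any particular tracer: the `SpanRecord` and `TraceTree` classes and their fields are hypothetical, and the sketch assumes every span in the list belongs to one trace with a single root.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical span record: one operation inside a trace. */
final class SpanRecord {
    final String traceId;      // shared by every span in the same trace
    final String spanId;       // unique per span
    final String parentSpanId; // null for the root span
    final String operation;
    final long startMicros;
    final long endMicros;

    SpanRecord(String traceId, String spanId, String parentSpanId,
               String operation, long startMicros, long endMicros) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.operation = operation;
        this.startMicros = startMicros;
        this.endMicros = endMicros;
    }

    long durationMicros() {
        return endMicros - startMicros;
    }
}

/** Rebuilds the call tree of one trace from a flat list of spans. */
final class TraceTree {
    private final Map<String, List<SpanRecord>> childrenByParent = new HashMap<>();
    private SpanRecord root;

    TraceTree(List<SpanRecord> spans) {
        for (SpanRecord span : spans) {
            if (span.parentSpanId == null) {
                root = span; // the span without a parent starts the trace
            } else {
                childrenByParent
                    .computeIfAbsent(span.parentSpanId, k -> new ArrayList<>())
                    .add(span);
            }
        }
    }

    SpanRecord root() {
        return root;
    }

    List<SpanRecord> childrenOf(String spanId) {
        return childrenByParent.getOrDefault(spanId, new ArrayList<>());
    }

    /** Renders the tree as indented lines, one span per line. */
    String render() {
        StringBuilder out = new StringBuilder();
        render(root, 0, out);
        return out.toString();
    }

    private void render(SpanRecord span, int depth, StringBuilder out) {
        if (span == null) {
            return;
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(span.operation)
           .append(" (").append(span.durationMicros()).append("us)\n");
        for (SpanRecord child : childrenOf(span.spanId)) {
            render(child, depth + 1, out);
        }
    }
}
```

Spans are stored flat and linked only by IDs, which is why each service can report its own spans independently and the backend can still reassemble the tree later.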
Main Tracing Tools:
1. Zipkin
- Features: Open-sourced by Twitter, based on Google Dapper paper
- Advantages:
  - Mature and stable, active community
  - Supports multiple languages
  - Friendly visualization interface
- Disadvantages:
  - Average storage performance
  - Relatively simple functionality
- Applicable Scenarios: Small and medium distributed systems
2. Jaeger
- Features: Open-sourced by Uber, compatible with the Zipkin API
- Advantages:
  - Excellent performance, supports high concurrency
  - Supports multiple storage backends
  - More complete functionality
- Disadvantages:
  - Newer than Zipkin, so it has a shorter track record
- Applicable Scenarios: Distributed systems with high performance requirements
3. SkyWalking
- Features: Open source project that originated in China and is now hosted by the Apache Software Foundation, focused on APM
- Advantages:
  - Comprehensive features (tracing, performance monitoring, log analysis)
  - Good Java support
  - Complete Chinese documentation
- Disadvantages:
  - Relatively weak support for other languages
- Applicable Scenarios: Microservice architectures that mainly use Java
4. Pinpoint
- Features: Open-sourced by Naver, focused on Java
- Advantages:
  - Non-intrusive: instruments bytecode through a Java agent, so application code does not change
  - Detailed performance analysis
- Disadvantages:
  - Focused on Java, with limited support for other languages
  - High resource usage
- Applicable Scenarios: Java single-language environments
5. OpenTelemetry
- Features: Hosted by CNCF, unified observability standard
- Advantages:
  - Unified API and SDK
  - Multi-language support
  - Compatible with multiple backends
- Disadvantages:
  - Relatively new, ecosystem still developing
- Applicable Scenarios: Projects requiring unified observability standards
Implementation Principles:
1. Context Propagation
- Pass Trace ID and Span ID during service calls
- Pass through HTTP headers, RPC metadata, etc.
- Example:
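```java
// gRPC context propagation: attach the trace ID to the call as request metadata.
// Uses io.grpc.Metadata and io.grpc.stub.MetadataUtils; the "x-trace-id" key is illustrative.
Metadata headers = new Metadata();
headers.put(Metadata.Key.of("x-trace-id", Metadata.ASCII_STRING_MARSHALLER), traceId);
stub.withInterceptors(MetadataUtils.newAttachHeadersInterceptor(headers))
    .withDeadlineAfter(timeout, TimeUnit.MILLISECONDS)
    .sayHello(request);
```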
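For HTTP, the same idea can be sketched with the W3C Trace Context `traceparent` header, which carries the trace ID, the caller's span ID, and a sampling flag in one value. The `TraceParent` class below is a hypothetical helper written for illustration, not an API of any tracing library, and it simplifies the specification: it accepts only version `00` and does not reject all-zero IDs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that encodes and decodes a W3C-style traceparent header value. */
final class TraceParent {
    // Layout: version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String spanId;
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Serializes the context for an outgoing request header. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    /** Returns null for a missing or malformed header so the caller can start a new trace. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        // The lowest flag bit marks the trace as sampled
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return new TraceParent(m.group(1), m.group(2), sampled);
    }

    /** Context for a downstream call: same trace ID and sampling decision, new span ID. */
    TraceParent child(String newSpanId) {
        return new TraceParent(traceId, newSpanId, sampled);
    }
}
```

A service would parse the incoming header, create its own span with the parsed span ID as the parent, and send `child(...).toHeader()` on each outgoing call, so every hop shares the Trace ID.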
2. Interceptor/Filter
- Intercept at request entry and exit
- Record call start and end times
- Example:
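```java
import java.util.UUID;
import javax.servlet.http.HttpServletRequest;   // jakarta.servlet.http on Spring Boot 3+
import javax.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.HandlerInterceptor;

@Component
public class TraceInterceptor implements HandlerInterceptor {

    // Header name is illustrative; use whatever your services agree on
    private static final String TRACE_HEADER = "X-Trace-Id";

    @Override
    public boolean preHandle(HttpServletRequest request,
                             HttpServletResponse response,
                             Object handler) {
        // Reuse the caller's trace ID when present so the chain is not broken
        String traceId = request.getHeader(TRACE_HEADER);
        if (traceId == null || traceId.isEmpty()) {
            traceId = generateTraceId();
        }
        MDC.put("traceId", traceId);
        return true;
    }

    @Override
    public void afterCompletion(HttpServletRequest request,
                                HttpServletResponse response,
                                Object handler, Exception ex) {
        // Clear the MDC so the ID does not leak into the next request on a pooled thread
        MDC.remove("traceId");
    }

    private String generateTraceId() {
        return UUID.randomUUID().toString().replace("-", "");
    }
}
```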
3. Sampling Strategy
- Fixed Sampling Rate: Sample at a fixed proportion
- Dynamic Sampling: Dynamically adjust based on request characteristics
- Error Priority: Prioritize sampling error requests
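A fixed-rate sampler with error priority might look like the following sketch. `TraceSampler` is a hypothetical class, not part of any tracing library. It assumes the decision is derived from the Trace ID so that every service reaches the same verdict for the same trace, and that the caller already knows whether the request failed (in practice that usually requires deciding after the request completes). Dynamic sampling is not covered here.

```java
/** Hypothetical sampler: fixed-rate sampling keyed on the trace ID, with errors always kept. */
final class TraceSampler {
    private static final int BUCKETS = 10_000;
    private final int keptBuckets;

    TraceSampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.keptBuckets = (int) Math.round(probability * BUCKETS);
    }

    boolean shouldSample(String traceId, boolean isError) {
        if (isError) {
            return true; // error priority: failed requests are always recorded
        }
        // String.hashCode() is specified by the language, so the same trace ID
        // lands in the same bucket on every JVM
        int bucket = Math.floorMod(traceId.hashCode(), BUCKETS);
        return bucket < keptBuckets;
    }
}
```

Keying on the Trace ID rather than on a random number avoids traces where only some services recorded their spans.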
4. Data Reporting
- Asynchronous reporting to avoid affecting business performance
- Support batch reporting to reduce network overhead
- Support multiple transport protocols (HTTP, gRPC, Kafka)
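Asynchronous, batched reporting can be sketched with a bounded queue between the request threads and a background sender. `BatchingReporter` is a hypothetical class, spans are represented as already-encoded strings for brevity, and the `sender` callback stands in for whichever transport (HTTP, gRPC, Kafka) is actually used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical reporter: request threads enqueue spans, a background task sends them in batches. */
final class BatchingReporter {
    private final BlockingQueue<String> queue;
    private final int batchSize;
    private final Consumer<List<String>> sender; // stand-in for the real transport

    BatchingReporter(int capacity, int batchSize, Consumer<List<String>> sender) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Called on the request thread: never blocks, and drops the span when the queue is full. */
    boolean report(String encodedSpan) {
        return queue.offer(encodedSpan);
    }

    /** Called from a background thread: sends up to batchSize spans in one call. */
    int flush() {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            sender.accept(batch);
        }
        return batch.size();
    }
}
```

One way to drive it is to call `flush()` periodically from a `ScheduledExecutorService`. Dropping spans when the queue is full is a deliberate trade-off in this sketch: losing some trace data is preferred over slowing down business requests.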
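Spring Cloud Sleuth Integration Example:
```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Zipkin server; @EnableZipkinServer is provided by the zipkin-server dependency
@SpringBootApplication
@EnableZipkinServer
public class ZipkinServerApplication {
    public static void main(String[] args) {
        SpringApplication.run(ZipkinServerApplication.class, args);
    }
}
```
```yaml
# Client configuration (application.yml)
spring:
  zipkin:
    base-url: http://localhost:9411
  sleuth:
    sampler:
      probability: 0.1  # 10% sampling rate
```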
Use Cases:
1. Performance Analysis
- Identify slow queries and slow services
- Analyze performance bottlenecks in call chains
- Optimize system performance
2. Troubleshooting
- Quickly locate problematic services
- Track error propagation paths
- Analyze root causes of failures
3. Dependency Analysis
- Understand service dependencies
- Identify unreasonable calls
- Optimize service architecture
4. Capacity Planning
- Analyze system load distribution
- Predict resource requirements
- Optimize resource allocation
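As an illustration of the dependency analysis use case, service-to-service edges can be derived from parent-child span relationships. This is a minimal sketch with hypothetical names (`DependencyGraph`, `ServiceSpan`); it assumes each span records the service that produced it and that all spans of the trace are available in one list.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that derives "caller -> callee" edges from the spans of a trace. */
final class DependencyGraph {

    /** Minimal span view: IDs plus the name of the service that produced the span. */
    static final class ServiceSpan {
        final String spanId;
        final String parentSpanId; // null for the root span
        final String service;

        ServiceSpan(String spanId, String parentSpanId, String service) {
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.service = service;
        }
    }

    /** Counts cross-service calls; spans whose parent is in the same service are ignored. */
    static Map<String, Integer> edges(List<ServiceSpan> spans) {
        Map<String, String> serviceBySpan = new HashMap<>();
        for (ServiceSpan span : spans) {
            serviceBySpan.put(span.spanId, span.service);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (ServiceSpan span : spans) {
            String parentService = serviceBySpan.get(span.parentSpanId);
            if (parentService != null && !parentService.equals(span.service)) {
                counts.merge(parentService + " -> " + span.service, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Aggregating these counts over many traces gives a dependency map in which unexpected edges, or unusually high call counts on one edge, point at calls worth reviewing.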
Best Practices:
- Choose a sampling rate that balances tracing overhead against observability
- Combine traces with logs and metrics to form a complete observability system
- Review trace data regularly to find and fix performance problems
- Use one Trace ID format across all systems so a request can be followed end to end
- Protect sensitive information: avoid putting sensitive data into span tags or baggage