May 30, 01:39

What is distributed tracing? What are the mainstream tracing tools? How do they work?

Distributed tracing follows a single request as it crosses multiple services and records the resulting call chain. It is a key tool for locating problems quickly and analyzing performance in distributed systems.

Core Concepts:

1. Trace

A complete request call chain
The entire process from client initiating request to final response
Contains multiple Spans

2. Span

A single operation within a trace, such as one RPC or one database query
Includes start time, end time, operation name, etc.
Spans form a call tree through parent-child relationships

3. Span ID

Uniquely identifies a Span
Used to build the call chain

4. Trace ID

Uniquely identifies a complete trace
All related Spans share the same Trace ID

5. Parent Span ID

Identifies the parent Span of the current Span
Used to build call hierarchy

6. Annotation

Records timestamps of key events within a Span
For example, Zipkin's core annotations: CS (Client Send), SR (Server Receive), SS (Server Send), CR (Client Receive)

7. Baggage

Key-value data passed along the call chain
Used to pass context information between services
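The relationship between Trace ID, Span ID, and Parent Span ID can be sketched as a small data model. This is a minimal illustration, not any particular tool's schema: the class names `SpanRecord` and `TraceTree` and all field names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical span record; field names are illustrative only. */
final class SpanRecord {
    final String traceId;       // shared by every span in the same trace
    final String spanId;        // unique per span
    final String parentSpanId;  // null for the root span
    final String operationName;
    final long startMicros;
    final long endMicros;

    SpanRecord(String traceId, String spanId, String parentSpanId,
               String operationName, long startMicros, long endMicros) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.operationName = operationName;
        this.startMicros = startMicros;
        this.endMicros = endMicros;
    }

    long durationMicros() {
        return endMicros - startMicros;
    }
}

final class TraceTree {
    /** Groups spans by parent span ID so the call tree can be walked top-down. */
    static Map<String, List<SpanRecord>> childrenByParent(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> children = new HashMap<>();
        for (SpanRecord s : spans) {
            if (s.parentSpanId != null) {
                children.computeIfAbsent(s.parentSpanId, k -> new ArrayList<>()).add(s);
            }
        }
        return children;
    }

    /** Renders the call tree as indented lines, one line per span. */
    static String render(SpanRecord root, Map<String, List<SpanRecord>> children) {
        StringBuilder out = new StringBuilder();
        render(root, children, 0, out);
        return out.toString();
    }

    private static void render(SpanRecord span, Map<String, List<SpanRecord>> children,
                               int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(span.operationName)
           .append(" (").append(span.durationMicros()).append("us)\n");
        for (SpanRecord child : children.getOrDefault(span.spanId, Collections.emptyList())) {
            render(child, children, depth + 1, out);
        }
    }
}
```

A tracing backend does essentially this grouping when it draws a trace: it collects all spans with one Trace ID, then links them through their Parent Span IDs.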

Main Tracing Tools:

1. Zipkin

Features: Open-sourced by Twitter, based on Google's Dapper paper
Advantages:
- Mature and stable, active community
- Supports multiple languages
- Friendly visualization interface
Disadvantages:
- Average storage performance
- Relatively simple functionality
Applicable Scenarios: Small and medium distributed systems

2. Jaeger

Features: Open-sourced by Uber, compatible with Zipkin API
Advantages:
- Excellent performance, supports high concurrency
- Supports multiple storage backends
- More complete functionality
Disadvantages:
- Relatively new
Applicable Scenarios: Distributed systems with high performance requirements

3. SkyWalking

Features: Open source project that originated in China (now an Apache top-level project), focused on APM
Advantages:
- Comprehensive features (tracing, performance monitoring, log analysis)
- Good Java support
- Complete Chinese documentation
Disadvantages:
- Relatively weak support for other languages
Applicable Scenarios: Microservice architecture mainly using Java

4. Pinpoint

Features: Open-sourced by Naver, focused on Java
Advantages:
- Non-invasive: bytecode instrumentation requires no changes to application code
- Detailed performance analysis
Disadvantages:
- Primarily targets Java; agents for other languages are less mature
- High resource usage
Applicable Scenarios: Java single-language environment

5. OpenTelemetry

Features: Hosted by CNCF, unified observability standard
Advantages:
- Unified API and SDK
- Multi-language support
- Compatible with multiple backends
Disadvantages:
- Relatively new, ecosystem still developing
Applicable Scenarios: Projects requiring unified observability standards

Implementation Principles:

1. Context Propagation

Pass the Trace ID and Span ID along with every service call
Carry them in HTTP headers, RPC metadata, message headers, etc.

Example:

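java
import io.grpc.Metadata;
import io.grpc.stub.MetadataUtils;

// gRPC context propagation: the trace ID crosses the wire as call metadata.
// "x-trace-id" is an illustrative key name, not a standard header.
Metadata headers = new Metadata();
headers.put(Metadata.Key.of("x-trace-id", Metadata.ASCII_STRING_MARSHALLER), traceId);
stub.withInterceptors(MetadataUtils.newAttachHeadersInterceptor(headers))
    .withDeadlineAfter(timeout, TimeUnit.MILLISECONDS)
    .sayHello(request);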
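For HTTP, one widely used header format is the W3C Trace Context `traceparent` header: `version-traceid-parentid-flags`, with a 32-hex-character trace ID and a 16-hex-character span ID. The sketch below parses and emits version `00` only. The class name `TraceParent` and its methods are assumptions made for the example, not a library API.

```java
import java.security.SecureRandom;

/** Minimal sketch of the W3C "traceparent" header, version 00 only. */
final class TraceParent {
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId;   // 32 lowercase hex characters
    final String spanId;    // 16 lowercase hex characters
    final boolean sampled;  // lowest bit of the flags field

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Parses an incoming header value; rejects anything that is not a well-formed version 00 value. */
    static TraceParent parse(String header) {
        String[] p = header.trim().split("-");
        if (p.length != 4
                || !p[0].equals("00")
                || !p[1].matches("[0-9a-f]{32}")
                || !p[2].matches("[0-9a-f]{16}")
                || !p[3].matches("[0-9a-f]{2}")) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        boolean sampled = (Integer.parseInt(p[3], 16) & 1) == 1;
        return new TraceParent(p[1], p[2], sampled);
    }

    /** Context for an outgoing call: same trace ID, fresh span ID, same sampling flag. */
    TraceParent newChild() {
        return new TraceParent(traceId, randomHex(8), sampled);
    }

    /** Serializes the context for the outgoing request header. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    private static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (byte b : buf) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

A service would call `parse` on the incoming header, create a child context with `newChild` for each downstream call, and send `toHeader()` with that call. The span ID in the outgoing header becomes the Parent Span ID on the receiving side.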

2. Interceptor/Filter

Intercept at request entry and exit
Record call start and end times

Example:

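java
import java.util.UUID;
import javax.servlet.http.HttpServletRequest;   // jakarta.servlet.http on Spring Boot 3+
import javax.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.HandlerInterceptor;

@Component
public class TraceInterceptor implements HandlerInterceptor {
    // Illustrative header name, not a standard one.
    private static final String TRACE_HEADER = "X-Trace-Id";

    @Override
    public boolean preHandle(HttpServletRequest request,
                             HttpServletResponse response,
                             Object handler) {
        // Reuse the caller's trace ID so the chain stays connected;
        // generate a new one only when this service is the entry point.
        String traceId = request.getHeader(TRACE_HEADER);
        if (traceId == null || traceId.isEmpty()) {
            traceId = UUID.randomUUID().toString().replace("-", "");
        }
        MDC.put("traceId", traceId);
        return true;
    }

    @Override
    public void afterCompletion(HttpServletRequest request,
                                HttpServletResponse response,
                                Object handler, Exception ex) {
        // Clear the MDC so the pooled thread does not leak the ID into the next request.
        MDC.remove("traceId");
    }
}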
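Recording start and end times around a call, and linking nested calls to their parent, can be sketched without any web framework. The `SimpleTracer` class below is hypothetical and only illustrates the mechanism: a per-thread stack tracks the active span, and timing is captured in a `finally` block so failed calls are recorded too.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical in-process tracer: times a call and remembers which call enclosed it. */
final class SimpleTracer {
    static final class Finished {
        final String name;
        final String parent;  // null when the call had no enclosing call
        final long nanos;

        Finished(String name, String parent, long nanos) {
            this.name = name;
            this.parent = parent;
            this.nanos = nanos;
        }
    }

    // One stack of active operation names per thread.
    private final ThreadLocal<Deque<String>> active = ThreadLocal.withInitial(ArrayDeque::new);

    // Completed calls, in the order they finished (innermost first).
    final List<Finished> finished = Collections.synchronizedList(new ArrayList<>());

    /** Runs body, recording its elapsed time and its parent operation. */
    <T> T trace(String name, Supplier<T> body) {
        Deque<String> stack = active.get();
        String parent = stack.peek();  // null for the outermost call
        stack.push(name);
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            stack.pop();
            finished.add(new Finished(name, parent, elapsed));
        }
    }
}
```

An interceptor or filter plays the role of `trace` here: it wraps the handler, so application code does not need to record timings itself.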

3. Sampling Strategy

Fixed Sampling Rate: Sample at a fixed proportion
Dynamic Sampling: Dynamically adjust based on request characteristics
Error Priority: Prioritize sampling error requests
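The three strategies can be combined in one decision function. The sketch below is an assumption-laden illustration: the class `SamplingDecision`, its parameters, and the "slow request" rule are invented for the example. Note that error-priority sampling needs the outcome of the request, so the decision is made after the request finishes rather than at its start.

```java
/** Hypothetical sampling policy combining fixed-rate, dynamic, and error-priority rules. */
final class SamplingDecision {
    private static final int BUCKETS = 10_000;

    /**
     * Fixed rate: hash the trace ID into a bucket and keep it if the bucket is below the threshold.
     * The result is deterministic per trace ID, so every service reaches the same decision.
     */
    static boolean sampleByRate(String traceId, double rate) {
        if (rate >= 1.0) {
            return true;
        }
        if (rate <= 0.0) {
            return false;
        }
        int bucket = Math.floorMod(traceId.hashCode(), BUCKETS);
        return bucket < (int) (rate * BUCKETS);
    }

    /**
     * Error priority and a simple dynamic rule: always keep failed or slow requests,
     * and fall back to the fixed rate for everything else.
     */
    static boolean shouldKeep(String traceId, boolean hadError, long durationMillis,
                              double baseRate, long slowThresholdMillis) {
        if (hadError) {
            return true;
        }
        if (durationMillis >= slowThresholdMillis) {
            return true;
        }
        return sampleByRate(traceId, baseRate);
    }
}
```

Hashing the trace ID instead of drawing a random number per service avoids partial traces, where one service keeps a trace that another has dropped.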

4. Data Reporting

Asynchronous reporting to avoid affecting business performance
Support batch reporting to reduce network overhead
Support multiple transport protocols (HTTP, gRPC, Kafka)
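Asynchronous batch reporting is usually built on a bounded in-memory queue. In the sketch below, the class `BatchingReporter` and its parameters are hypothetical, and the `sender` callback stands in for whatever transport is used (HTTP, gRPC, Kafka). Spans are represented as already-encoded strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical span reporter: buffers spans in memory and sends them in batches. */
final class BatchingReporter {
    private final BlockingQueue<String> queue;
    private final int batchSize;
    private final Consumer<List<String>> sender;  // stand-in for the real transport

    BatchingReporter(int capacity, int batchSize, Consumer<List<String>> sender) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Called on the request path. Never blocks; returns false if the span was dropped because the buffer is full. */
    boolean report(String encodedSpan) {
        return queue.offer(encodedSpan);
    }

    /** Drains up to batchSize spans and sends them in one call. Returns the number of spans sent. */
    int flushOnce() {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            sender.accept(batch);
        }
        return batch.size();
    }

    /** Starts a daemon thread that keeps flushing, sleeping whenever the queue is empty. */
    Thread startBackgroundFlusher(long idleSleepMillis) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    if (flushOnce() == 0) {
                        Thread.sleep(idleSleepMillis);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "span-reporter");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

Dropping spans when the buffer is full is a deliberate trade-off in this sketch: losing some trace data is preferred over blocking business requests.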

Spring Cloud Sleuth Integration Example:

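java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Zipkin server embedded in a Spring Boot application.
// @EnableZipkinServer comes from the zipkin-server dependency; this embedded style is legacy,
// and the Zipkin project recommends running its prebuilt server instead.
@SpringBootApplication
@EnableZipkinServer
public class ZipkinServerApplication {
    public static void main(String[] args) {
        SpringApplication.run(ZipkinServerApplication.class, args);
    }
}

yaml
# Client configuration (application.yml of each traced service)
spring:
  zipkin:
    base-url: http://localhost:9411
  sleuth:
    sampler:
      probability: 0.1  # 10% sampling rate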

Use Cases:

1. Performance Analysis

Identify slow queries and slow services
Analyze performance bottlenecks in call chains
Optimize system performance

2. Troubleshooting

Quickly locate problematic services
Track error propagation paths
Analyze root causes of failures

3. Dependency Analysis

Understand service dependencies
Identify unreasonable calls
Optimize service architecture

4. Capacity Planning

Analyze system load distribution
Predict resource requirements
Optimize resource allocation

Best Practices:

Set a sampling rate that balances tracing overhead against observability
Combine tracing with logs and metrics to form a complete observability system
Review trace data regularly to find performance problems
Use one Trace ID format across all systems so traces can be followed end to end
Protect sensitive information: avoid putting sensitive data in span tags or baggage

Tags: RPC