Observability in Microservices

Now we will dive deeper into the details of observability.
Why is observability required in a service?
Once a service is deployed to production, you need to observe a few critical parameters to deliver a good quality of service to your customers:
- Monitor its performance in terms of latency and throughput.
- Monitor how efficiently the service uses its resources.
- Alert the developers or the ops team when something goes wrong, e.g. a disk filling up or a service crashing.
- Troubleshoot a problem and identify its root cause.
The following patterns help make a service observable:
- Health check API: Expose an endpoint that reports the health of the service.
- Log aggregation: Log service activity and write the logs to a centralized place.
- Distributed tracing: Assign each external request a request ID and trace the flow of the request through the system.
- Exception tracking: Report exceptions to an exception tracking service, which de-duplicates them, alerts the developers, and tracks the resolution of each exception.
- Application metrics: The service maintains metrics and exposes them to a metrics server.
- Audit logging: Log user actions.
Health Check API pattern
When a service is deployed in the production environment, you don’t want to route requests to it until it is healthy, that is, ready to accept them.
E.g. a service might take a few seconds to initialize while it establishes database connections, populates caches, etc. It’s pointless to route a request to the service before it is ready, because the request will fail.

The health check API pattern has two components:
- Implementation of the health check endpoint
- Deployment infrastructure that periodically invokes the health check endpoint
Health Check API
The service exposes a health check endpoint, such as GET /health, which returns the health of the service.
The Spring Boot Actuator library implements a GET /actuator/health endpoint, which returns 200 if the service is healthy and 503 otherwise.
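As an illustration of the endpoint described above, here is a minimal sketch using only the Python standard library. It is not how Spring Boot Actuator is implemented. The check names in CHECKS, the evaluate_health and serve helpers, and the port are all hypothetical; real checks would ping the database, the cache, and so on.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical readiness checks; each returns True when its dependency is usable.
CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "cache": lambda: True,
}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check and return (HTTP status code, JSON-serializable body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "UP" if check() else "DOWN"
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = "DOWN"
    healthy = all(state == "UP" for state in results.values())
    body = {"status": "UP" if healthy else "DOWN", "checks": results}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling serve() and then requesting GET /health on the chosen port returns 200 while every check passes and 503 as soon as one fails.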
Invocation of health check API
The deployment infrastructure periodically invokes the endpoint to determine whether the service is healthy and takes the appropriate action if it is not. Kubernetes (K8s) is a good example.
Service registries such as Netflix Eureka can also be configured to make use of the health check API.
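The invoking side can be sketched in a similar way. The HealthMonitor class, the http_probe helper, and the failure_threshold parameter below are hypothetical names chosen for illustration; they do not reproduce the behavior of Kubernetes or Eureka, only the general idea of polling an endpoint and deciding whether an instance should receive traffic.

```python
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> Callable[[], bool]:
    """Build a probe that reports True only when the URL answers with 200."""
    def probe() -> bool:
        # urlopen raises for 4xx/5xx responses and connection errors.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    return probe


class HealthMonitor:
    """Tracks probe results and decides whether an instance is routable."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3):
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.routable = False  # no traffic until the first successful probe

    def poll_once(self) -> bool:
        try:
            ok = self.probe()
        except Exception:
            ok = False
        if ok:
            self.consecutive_failures = 0
            self.routable = True
        else:
            self.consecutive_failures += 1
            # Tolerate brief blips; only stop routing after repeated failures.
            if self.consecutive_failures >= self.failure_threshold:
                self.routable = False
        return self.routable
```

A scheduler would call poll_once() at a fixed interval, for example with HealthMonitor(http_probe("http://127.0.0.1:8080/health")), and remove the instance from rotation while routable is False.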
Log aggregation pattern
Logs are a valuable troubleshooting tool and a good starting point for finding out what is wrong with an application.
But in a microservice architecture there are multiple services, and each service runs multiple instances, so the logs are scattered across many places. Hence, using logs in a microservice architecture is challenging.
The solution is log aggregation.
Log aggregation
Aggregate the logs of all the services in a centralized database that supports searching and alerting.
ELK is a popular stack for log aggregation.

The ELK stack consists of three open source products:
- Elasticsearch: A text-search-oriented NoSQL database that is used as the logging server.
- Logstash: A log pipeline that aggregates the service logs and writes them to Elasticsearch.
- Kibana: A visualization tool for Elasticsearch.
Another tool used across the industry for log aggregation is Coralogix.
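Aggregated logs are easier to search when each entry is machine-parseable and says which service and instance produced it. The sketch below shows one way a service might emit such entries with Python's standard logging module. The JsonFormatter class, the build_logger helper, the field names, and the service and instance names are assumptions made for illustration, not a format required by ELK or Coralogix.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object per line."""

    def __init__(self, service: str, instance: str):
        super().__init__()
        self.service = service
        self.instance = instance

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "instance": self.instance,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the stack trace inside the same entry.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(service: str, instance: str, stream) -> logging.Logger:
    logger = logging.getLogger(f"{service}.{instance}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, instance))
    logger.addHandler(handler)
    return logger


# The buffer stands in for stdout or a file that a log pipeline would read.
buffer = io.StringIO()
log = build_logger("order-service", "instance-1", buffer)
log.info("order %s created", 42)
print(buffer.getvalue(), end="")
```

With every instance writing entries in the same shape, the log pipeline can index the service and instance fields and filter on them in the central store.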
Distributed Tracing pattern
In a microservice architecture, when a request lands on the first service (or the API gateway), that service invokes other services, which may in turn invoke more services, before the result is returned to the client.
When an incident occurs, it is tedious to trace the overall flow of a request that is distributed across services.
Distributed tracing solves this problem.
Distributed Tracing
Assign each external request a unique ID and record how it flows through the system from one service to the next in a centralized server that provides visualization and analysis.
Zipkin is a popular distributed tracing server.
Distributed tracing has two components:
- Trace: Represents an external request and consists of one or more spans.
- Span: Represents a request to an internal service, with the following properties: operation name, attributes, start timestamp, and end timestamp.
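The relationship between traces and spans can be sketched as a small data model. This is an in-memory illustration only: the Span and Tracer classes and their field names are hypothetical and do not follow Zipkin's API. The key point is that every span created while handling a request carries the same trace ID, and each child span points to its parent.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str               # shared by every span of one external request
    span_id: str
    parent_id: Optional[str]    # None for the root span
    operation: str
    attributes: Dict[str, str] = field(default_factory=dict)
    start: float = 0.0
    end: Optional[float] = None


class Tracer:
    """Collects finished spans in memory; a real tracer would ship them to a server."""

    def __init__(self):
        self.finished: List[Span] = []

    def start_span(self, operation: str, parent: Optional[Span] = None,
                   **attributes: str) -> Span:
        # A child reuses the parent's trace ID; a root span starts a new trace.
        trace_id = parent.trace_id if parent else uuid.uuid4().hex
        parent_id = parent.span_id if parent else None
        return Span(trace_id, uuid.uuid4().hex[:16], parent_id, operation,
                    dict(attributes), start=time.time())

    def finish(self, span: Span) -> None:
        span.end = time.time()
        self.finished.append(span)

    def trace(self, trace_id: str) -> List[Span]:
        """Return all recorded spans that belong to one trace."""
        return [s for s in self.finished if s.trace_id == trace_id]


tracer = Tracer()
root = tracer.start_span("GET /orders/1", service="api-gateway")
child = tracer.start_span("load order", parent=root, service="order-service")
tracer.finish(child)
tracer.finish(root)
```

Between real services, the trace ID and parent span ID would travel with the outgoing request, typically in request headers, so that the next service can continue the same trace.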

Metrics Pattern
The metrics system collects metrics that capture critical information about every component of the system.
There are two main categories of metrics:
- Infrastructure-level metrics: CPU, memory, and disk utilization.
- Application-level metrics: Number of requests, latency, etc.
Services report metrics to a central server that provides aggregation, visualization, and alerting.
Collecting service level metrics
A service exposes an endpoint for its metrics. E.g. in a Spring Boot application, you can include the Micrometer metrics library so that the application exposes a metrics endpoint.
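To make the idea concrete, here is a minimal sketch of what a service might keep in memory and return from its metrics endpoint. It is not Micrometer's API. The MetricsRegistry class and the metric and label names are hypothetical, and the output is only loosely modeled on the name{label="value"} text format that Prometheus scrapes.

```python
from collections import defaultdict
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


class MetricsRegistry:
    """Holds counters keyed by metric name and label set."""

    def __init__(self):
        self._counters: Dict[Tuple[str, LabelSet], float] = defaultdict(float)

    def increment(self, name: str, amount: float = 1.0, **labels) -> None:
        # Sort the labels so the same label set always maps to the same key.
        key = (name, tuple(sorted((k, str(v)) for k, v in labels.items())))
        self._counters[key] += amount

    def render(self) -> str:
        """Return one line per counter, suitable as a metrics endpoint body."""
        lines = []
        for (name, labels), value in sorted(self._counters.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = f"{{{label_text}}}" if label_text else ""
            lines.append(f"{name}{suffix} {value}\n")
        return "".join(lines)


registry = MetricsRegistry()
registry.increment("http_requests_total", method="GET", status="200")
registry.increment("http_requests_total", method="GET", status="200")
registry.increment("http_requests_total", method="POST", status="500")
print(registry.render(), end="")
```

The request-handling code calls increment() as it works, and the metrics endpoint simply returns render().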
Deliver metrics to the metrics service
A service can deliver metrics to the metrics service in two ways: push or pull.
Push model
The service instance sends the metrics to the metrics service. AWS CloudWatch is an example of the push model.
Pull model
The metrics service (or an agent running locally) invokes a service API to retrieve the metrics from the service instance. Prometheus, a popular framework for monitoring and alerting, uses the pull model.
Metric Sample
A metric sample has the following properties:
- Name
- Value
- Timestamp
- Dimensions: optional name-value pairs, such as the instance that produced the sample
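The sample properties and the two delivery models described above can be put side by side in a short sketch. All class and method names here (MetricSample, ServiceInstance, MetricsService, push_to, receive, scrape) are hypothetical; the in-process calls stand in for what would be network requests in a real system.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MetricSample:
    name: str
    value: float
    timestamp: float
    dimensions: Dict[str, str] = field(default_factory=dict)


class ServiceInstance:
    def __init__(self, instance_id: str):
        self.instance_id = instance_id
        self.request_count = 0

    def handle_request(self) -> None:
        self.request_count += 1

    def sample(self) -> List[MetricSample]:
        """Snapshot the current metric values."""
        return [MetricSample("requests_total", float(self.request_count),
                             time.time(), {"instance": self.instance_id})]

    def push_to(self, send: Callable[[List[MetricSample]], None]) -> None:
        # Push model: the instance decides when to send its samples.
        send(self.sample())


class MetricsService:
    def __init__(self):
        self.samples: List[MetricSample] = []

    def receive(self, samples: List[MetricSample]) -> None:
        # Target of the push model.
        self.samples.extend(samples)

    def scrape(self, instance: ServiceInstance) -> None:
        # Pull model: the metrics service asks the instance for its samples.
        self.samples.extend(instance.sample())


instance = ServiceInstance("order-1")
instance.handle_request()
instance.handle_request()

metrics_service = MetricsService()
instance.push_to(metrics_service.receive)   # push
metrics_service.scrape(instance)            # pull
```

In both models the same samples arrive; the difference is which side initiates the transfer and therefore which side needs to know the other's address.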
Exception Tracking pattern
Whenever a service reports an exception, you should identify its root cause.
The traditional way to investigate exceptions is to look through the logs.
There are a few problems with the traditional approach:
- Log files are oriented around single-line entries, whereas an exception's stack trace spans multiple lines.
- There is no mechanism to track the resolution of exceptions that occurred in the past.
- There is no mechanism to de-duplicate exceptions, so the same error is reported repeatedly, which is confusing.
A better approach is to use an exception tracking service.
Exception tracking
Services report exceptions to a central service that de-duplicates exceptions, generates alerts, and manages the resolution of exceptions.
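One possible way to de-duplicate is to derive a fingerprint from the exception type and the code locations in its stack trace, and to alert only the first time a fingerprint is seen or when it recurs after being marked resolved. The sketch below assumes that approach; the fingerprint function, the ExceptionTracker class, and the example payment-service failure are hypothetical and do not describe any particular exception tracking product.

```python
import hashlib
import traceback
from collections import defaultdict
from typing import Callable


def fingerprint(exc: BaseException) -> str:
    """Identify an exception by its type and the locations in its stack trace."""
    frames = traceback.extract_tb(exc.__traceback__)
    parts = [type(exc).__name__]
    parts += [f"{f.filename}:{f.name}:{f.lineno}" for f in frames]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]


class ExceptionTracker:
    def __init__(self, alert: Callable[[str], None]):
        self.alert = alert
        self.occurrences = defaultdict(int)
        self.resolved = set()

    def report(self, service: str, exc: BaseException) -> str:
        key = fingerprint(exc)
        self.occurrences[key] += 1
        # Alert on the first occurrence, or on a regression after resolution.
        if self.occurrences[key] == 1 or key in self.resolved:
            self.resolved.discard(key)
            self.alert(f"[{service}] {type(exc).__name__}: {exc} (id={key})")
        return key

    def resolve(self, key: str) -> None:
        self.resolved.add(key)


alerts = []
tracker = ExceptionTracker(alert=alerts.append)


def charge_card():
    raise ValueError("card declined")  # hypothetical failing operation


def attempt() -> str:
    try:
        charge_card()
    except ValueError as exc:
        return tracker.report("payment-service", exc)


ids = {attempt() for _ in range(3)}
```

The three failures above come from the same code path, so they share one fingerprint and produce a single alert, while the occurrence count still records all three.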
Audit logging pattern
Audit logging is important to track the user’s activity.
Each audit log entry identifies:
- The user who performed the operation.
- The operation that was performed.
- The business entity on which the operation was performed.
Audit logs help with customer support, ensure compliance, and detect suspicious behavior in the system.
There are different ways to implement audit logging:
- Add audit logging code to the business logic
- Use aspect-oriented programming
- Use event sourcing
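Python has no built-in aspect-oriented framework, but a decorator gives a rough approximation of the second option: the audit code wraps the business method instead of living inside it. In this sketch the audited decorator, the AUDIT_LOG list, and the cancel_order function are hypothetical, and the decorator assumes that the first two arguments of the wrapped function are the user ID and the entity ID.

```python
import functools
import time
from typing import List

AUDIT_LOG: List[dict] = []  # stand-in for a durable audit store


def audited(operation: str, entity_type: str):
    """Record who performed which operation on which entity."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user_id, entity_id, *args, **kwargs):
            result = func(user_id, entity_id, *args, **kwargs)
            # Reached only if the business logic did not raise.
            AUDIT_LOG.append({
                "timestamp": time.time(),
                "user": user_id,
                "operation": operation,
                "entity": f"{entity_type}:{entity_id}",
            })
            return result
        return wrapper
    return decorator


@audited(operation="cancel_order", entity_type="Order")
def cancel_order(user_id, order_id, reason):
    # Business logic only; no audit code here.
    return f"order {order_id} cancelled: {reason}"


message = cancel_order("alice", 42, reason="duplicate")
```

This version records only successful operations. Whether failed attempts should also be audited is a design decision that depends on the compliance requirements.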