Service Meshes: The Hidden Costs
Over the last five years, service meshes have gone from a niche technology to a common architectural pattern. A service mesh, often deployed as a sidecar container running a reverse proxy, moves the logic for handling common API call patterns, whether against REST endpoints or other RPC protocols, out of application and library code, making it easier to standardize and update. For patterns like retries, circuit breakers, availability-zone affinity, or canaries, implementing each policy once instead of in every framework in every language makes a microservices migration much easier.
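To make concrete what "implementing policies once" saves you from, here is a minimal sketch of one such policy, a circuit breaker, as it might be hand-written inside each application. The thresholds and class shape are illustrative, not taken from any particular library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: opens after `max_failures`
    consecutive failures, fails fast while open, and allows a trial
    call through after `reset_after` seconds. Illustrative only."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            # Half-open: let one trial call through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Every service in every language ends up carrying some version of this; a mesh hoists it into the proxy so it is written and tuned once.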
For example, without a service mesh, a Python application might implement a retry policy for some 500 responses but not others. In a different Python application, a client library may implement its own fallback logic to other services that might be able to provide the same information. If a new microservice written in Go needs to access the services already accessed from Python, it has to carefully reimplement at least very similar retry and fallback policies.
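The per-application divergence described above might look like this in code. The retryable status set, exception type, and helper names are illustrative; the point is that each team makes these choices independently:

```python
import time

# This application's policy choice: retry only these 500-class
# codes. Another team's service may retry a different set.
RETRYABLE = {502, 503, 504}

class UpstreamError(Exception):
    """Illustrative error carrying an HTTP status code."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retry(do_request, attempts=3, backoff=0.1):
    """Retry a subset of 500-class responses with linear backoff."""
    for attempt in range(attempts):
        try:
            return do_request()
        except UpstreamError as err:
            if err.status not in RETRYABLE or attempt == attempts - 1:
                raise
            time.sleep(backoff * (attempt + 1))

def call_with_fallback(primary, fallback):
    """Client-library-style fallback to a second service that can
    serve the same information."""
    try:
        return call_with_retry(primary)
    except Exception:
        return call_with_retry(fallback)
```

A Go service talking to the same upstreams would have to replicate both the retryable set and the fallback ordering, or behave subtly differently.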
When, down the road, the fallback service referenced by the Python and Go libraries is to be decommissioned, library updates must be rolled out to every service still running an older version. When a service hasn’t updated its dependencies in months or even years, every such update introduces risk. In practice, service providers often have to maintain old versions for years, until either all the libraries are updated or the dependent services have themselves been decommissioned.
Compared to this, using a service mesh looks like a smart choice. And it is, but it doesn’t come without costs. The sidecar pattern adds a fixed incremental cost to each service’s compute and memory footprint. While easier to update than an embedded library, the service mesh proxy still needs to be updated regularly. The control plane that distributes routing information to the proxies must remain highly available even while other parts of the system are failing. And the mesh itself can add as much as 45% to P90 latency, since it performs full layer 7 parsing and routing on every request.
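That fixed per-pod cost compounds across a fleet. A back-of-the-envelope calculation with illustrative numbers (the service counts, memory, and CPU figures below are assumptions, not measurements of any particular proxy):

```python
# Hypothetical fleet: every pod carries a sidecar proxy.
services = 100                 # assumed number of microservices
replicas_per_service = 4       # assumed replicas each
sidecar_mem_mb = 120           # assumed proxy memory per pod
sidecar_cpu_millicores = 100   # assumed CPU reservation per pod

pods = services * replicas_per_service
total_mem_gb = pods * sidecar_mem_mb / 1024
total_cpu_cores = pods * sidecar_cpu_millicores / 1000

print(f"{pods} sidecars ~ {total_mem_gb:.0f} GiB RAM, "
      f"{total_cpu_cores:.0f} CPU cores")
# 400 sidecars ~ 47 GiB RAM, 40 CPU cores
```

None of that capacity serves a single application request; it is pure overhead you pay before the first retry policy fires.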
These are the obvious costs, but there are subtler ones. Unlike the library example, the application cannot see where the connection was ultimately routed, or even which endpoints were eligible for routing. If an error comes back, it isn’t clear where it originated or what the application can do about it. “Well, I guess the service mesh tried and failed” isn’t a particularly satisfying answer when you’re trying to build more sophisticated resilience strategies. If a request takes longer than it should, answering whether the culprit is the network, the downstream service, or the mesh itself is far from easy.
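With Envoy-based meshes, one common heuristic is to check for the `x-envoy-upstream-service-time` response header, which Envoy attaches only when an upstream actually replied; its absence on a 5xx suggests the proxy itself generated the response. A sketch of that check (it is a heuristic, not a guarantee, and header behavior depends on mesh configuration):

```python
def classify_5xx(status, headers):
    """Best-effort guess at where a 5xx came from, based on headers
    an Envoy sidecar attaches. Heuristic only: responses generated
    locally by the proxy (upstream connect failure, circuit broken
    in the proxy, etc.) lack the upstream timing header."""
    if status < 500:
        return "not-an-error"
    # Normalize header names; assumes a simple dict of headers.
    headers = {k.lower(): v for k, v in headers.items()}
    # Envoy adds this header only when an upstream actually replied.
    if "x-envoy-upstream-service-time" in headers:
        return "upstream-service"
    return "mesh-or-connectivity"
```

For example, `classify_5xx(503, {})` points at the mesh or connectivity, while `classify_5xx(503, {"x-envoy-upstream-service-time": "87"})` points at the downstream service. Even this crude split requires the application to know mesh implementation details, which is exactly the coupling the mesh was supposed to remove.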
Similarly, a resource exhaustion problem can affect the application and the sidecar simultaneously, making it even harder to understand the cause or find a solution. Monitoring the sidecars and correlating that data with application data becomes its own set of headaches even at modest scale, especially when the sidecar telemetry is heavily sampled or delayed.
Service meshes introduce a gap between what your application sees and what is actually happening to a request. The benefit of simplifying your resilience strategy is real, but make sure your developers and operators don’t end up flying blind. eBPF-based monitoring can make network activity and local resource contention visible by default. Ask your observability vendor the hard questions about unified application and service mesh visibility to make sure you’re getting the whole picture.