OpenX's Kubernetes Migration

OpenX is a global leader in programmatic advertising, creating an online marketplace that connects premium publishers and advertisers. Over the past two years, they made the shift from operating five physical datacenters to running their entire delivery stack in Kubernetes on Google Cloud Platform in multiple regions across the world.

OpenX at a Glance

  • OpenX is a global advertising marketplace
  • Transitioned from on-premises datacenters to Kubernetes and public cloud
  • Running thousands of instances across 6 regions

Challenges

  • Visibility into network reliability and performance issues on a per service basis
  • Measurement of network traffic and cost to advertising partners
  • DNS / service discovery misconfigurations

Results

  • Set up cluster-wide in minutes
  • Automatic visibility into network connection failures, packet loss, network latency between services and to external advertising partners
  • Detailed measurement of per service network bandwidth

The Challenge

Mark Chodos, SRE at OpenX, noted that "once we completed the move we started working on making sure we had good operational visibility into the cloud environment and there was a lot of focus on optimization of workloads from a performance and cost perspective. One of the things that we certainly had in the on-premise infrastructure was significant instrumentation of the network traffic. However, in the public cloud with the network abstracted from us, that visibility went away and we need to find ways to regain it."

Network reliability, network performance, and bandwidth costs emerged as business-critical issues in the new environment. Mark noted that "if there is a latency increase or connection failures or retransmissions to an advertising partner, it can really impact revenue. It's critical that we can troubleshoot issues like this." Beyond debugging problems, OpenX also discovered that one of their largest costs was network egress to different advertising partners. They needed a means of measuring, analyzing, and optimizing that cost to improve profitability.

The Solution

Flowmill integrated seamlessly into OpenX's Kubernetes environment. Mark noted that the setup process has "been incredibly easy. Download a helm chart and deploy it. In a matter of minutes, we had Flowmill up and running." The results were equally impressive. Shortly after deployment, Mark and his team identified "misconfigurations causing a high rate of DNS errors between services, a misconfiguration we traced back to our transition to GCP and did not have visibility into."

OpenX used Flowmill to "drill down to connections between our service and an advertising partner to look at what is going on, including connection failures, packet loss, changes in round trip time." Flowmill also allowed OpenX "to analyze network traffic on a per demand-side partner [DSP] basis to compare what we are spending with what we are making on the relationship."

Today, Flowmill monitors well over 1000 cloud instances in 6 public cloud regions. It is used by the SRE team and support organization to troubleshoot issues, and by the finance team to analyze cloud computing costs and advertising partner profitability.

"Flowmill was set up in a matter of minutes and it allowed us to drill down to connections between our service and an advertising partner to look at what was going on, including connection failures, packet loss, and changes in round trip time."

Mark Chodos, Site Reliability Engineer, OpenX