Nine Entertainment Manages Network Costs with Flowmill

Nine Entertainment is the largest media company in Australia with holdings in radio, television, and digital media. Starting in 2016 with its Publishing business, it strongly embraced Kubernetes and began shifting all its applications towards microservice architectures. Find out how Flowmill helped Nine save tens of thousands of dollars by reducing cross-zone traffic.


Nine Entertainment at a Glance

  • Nine is the largest media company in Australia
  • Fully embraced Kubernetes and public cloud (AWS)
  • Running a modern microservices architecture across over 100 instances in 3 availability zones

Challenges:

  • Measuring and attributing network transfer costs to Kubernetes services
  • Identifying cross-zone traffic patterns affecting service latency and cost
  • Troubleshooting network behavior affecting Kubernetes and specific microservices

Results:

  • Tens of thousands of dollars in network cost savings from reduced cross-zone traffic
  • Alerts in minutes to proactively identify changes in traffic patterns between services
  • Detailed network traffic breakdowns so costs can be attributed across services and teams
  • Set up cluster-wide in minutes

Nine Entertainment is the largest media company in Australia with holdings in radio, television, and digital media. Starting in 2016 with its Publishing business, it strongly embraced Kubernetes and began shifting all its applications towards microservice architectures. The team built a small number of Kubernetes clusters and distributed them across three availability zones in AWS. The environment was fully automated with an extensive CI pipeline and Slack-driven workflows for cluster and application management.

The Challenge

The team had done “a lot of work around observability” and embraced open source, investing heavily in Prometheus / Grafana for metrics, Jaeger for tracing, and an ELK stack for logs. However, Michael Lorant, a Senior Systems Engineer, noted, “We still couldn’t get visibility into the network. It was still a very immature area in the Kubernetes ecosystem and we couldn’t make this work well with the traditional approach. Even once we embraced a service mesh, we found there is even more complexity in getting visibility into what’s going on in the network.”

Network cost attribution quickly became an acute problem. As Michael noted, services like EC2 have continuously decreased in price over the years, while network transfer costs have not. Yet while the team had detailed metrics on CPU, memory, and I/O consumption, they could not connect network transfer costs with specific services or teams within the company. Michael explained, “We needed to know where traffic was being generated, both inbound and outbound, so we could attribute and optimize the cost of running things in our clusters.”

Michael also pointed out that network problems continuously emerged in their environment: “In one instance, we had a bug in the componentry that sent logs to Kinesis that would retry over and over again, effectively maxing the links in our instances. We would get a bill a couple of days later with a massive increase in our network costs. This happened multiple times. Each occurrence was thousands of dollars and we still didn’t know where it occurred.”
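A retry storm like the one Michael describes can be costed out with back-of-the-envelope arithmetic. The figures below (a 10 Gbps instance link fully saturated, AWS cross-AZ transfer billed at $0.01/GB in each direction) are illustrative assumptions, not Nine’s actual numbers, but they show how a maxed link translates into thousands of dollars within days:

```python
# Back-of-envelope cost of a retry loop saturating an instance's network link.
# Assumptions (illustrative, not Nine's actual figures):
#   - link speed: 10 Gbps, fully saturated by retried log traffic
#   - AWS cross-AZ transfer: $0.01/GB out + $0.01/GB in = $0.02/GB total

LINK_GBPS = 10          # gigabits per second
COST_PER_GB = 0.02      # USD per GB, both directions combined
SECONDS_PER_DAY = 86_400

gb_per_day = LINK_GBPS / 8 * SECONDS_PER_DAY   # bits -> bytes, then per day
daily_cost = gb_per_day * COST_PER_GB

print(f"{gb_per_day:,.0f} GB/day -> ${daily_cost:,.0f}/day")
# -> 108,000 GB/day -> $2,160/day
```

Under these assumptions a single saturated link costs on the order of $2,000 per day, consistent with Michael’s “thousands of dollars” per occurrence.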

The Solution

Flowmill integrated seamlessly into Nine’s Kubernetes clusters. After deploying the Helm charts for the Flowmill agents, Michael noted, “It was one of the fastest POCs we’ve ever done.” He added, “Flowmill gave us visibility into an area we generally don’t look at. We never delved too deeply into the networking itself. We never really realized how important it was. No one, not even Amazon, provides good tooling and detail in this area or a means of turning this detail into knowledge.”

Network cost was one of the largest areas of impact within the organization. Michael noted, “We were interested in cross-AZ traffic and finding our largest offenders there. We identified our Elasticsearch and analytics clusters were bouncing things around, and it highlighted how much that was really costing us in network traffic. We also identified areas that have latency issues, because cross-AZ traffic does have a latency penalty. We finally understood who was being affected by it.”

Flowmill helped the team identify cross-zone network traffic issues that existed outside of their production environment. Some examples include:

  • A bug in their logging infrastructure that generated 10 TB/day of unnecessary cross-zone traffic
  • A video solution still in development moving huge amounts of traffic
  • A database in the CI environment that had a master in a single zone but multiple replicas in other zones, adding latency and cost
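To put the first item in perspective, 10 TB/day of cross-AZ traffic compounds quickly at AWS’s standard list rate of $0.01/GB charged in each direction (used here as an assumption; actual negotiated rates may differ):

```python
# Rough monthly cost of 10 TB/day of unnecessary cross-AZ traffic.
# Assumption: AWS cross-AZ transfer billed at $0.01/GB out + $0.01/GB in.

TB_PER_DAY = 10
GB_PER_TB = 1_000        # decimal terabytes, as AWS bills per GB
COST_PER_GB = 0.02       # USD per GB, both directions combined

daily = TB_PER_DAY * GB_PER_TB * COST_PER_GB
monthly = daily * 30

print(f"${daily:,.0f}/day, ${monthly:,.0f}/month")
# -> $200/day, $6,000/month
```

At roughly $6,000/month, eliminating this one bug alone accounts for tens of thousands of dollars annually, in line with the savings reported above.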

Outside of cost and latency, Michael noted that Flowmill has also given his team early warning of potential external issues. “When you are getting a DDoS attack, network traffic goes up as well. Flowmill is a great indicator of something wrong in your environment.”
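The kind of early-warning signal Michael describes can be sketched as a simple baseline-deviation check on per-service throughput samples. This is an illustrative sketch only (the three-sigma threshold and sample window are assumptions, not Flowmill’s actual alerting logic):

```python
from statistics import mean, stdev

def traffic_alert(history_gbps, current_gbps, sigma=3.0):
    """Flag throughput that deviates sharply from the recent baseline.

    history_gbps: recent per-interval throughput samples (>= 2 points)
    current_gbps: the latest sample
    Returns True when current exceeds mean + sigma * stddev of the baseline.
    """
    baseline = mean(history_gbps)
    spread = stdev(history_gbps)
    return current_gbps > baseline + sigma * spread

# Steady traffic around 1 Gbps, then a sudden surge:
normal = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0]
print(traffic_alert(normal, 1.1))   # ordinary wiggle: False, no alert
print(traffic_alert(normal, 8.0))   # DDoS-scale spike: True, alert
```

A real system would track this per service and per direction so a spike can be attributed immediately, rather than discovered on the bill days later.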

Flowmill gave us visibility into an area we generally don’t look at. We never delved too deeply into the networking itself. We never really realized how important it was. No one, not even Amazon, provides good tooling and detail in this area or a means of turning this detail into knowledge.

Michael Lorant, Senior Systems Engineer, Nine Entertainment