I find Google to be one of the most interesting companies in the IT business, especially when it comes to infrastructure, managing it, and using it for fun and profit (guess who owns the most servers? More about Google architecture).
Google published a very interesting paper about their distributed monitoring infrastructure called Dapper. An excellent review was posted on highscalability.com. I found it very interesting, and decided to add some notes to it.
Dapper has been used by Google for the last two years, and “… is part of our basic machine image, making it present on virtually every server at Google [Totaling thousands of different applications]”
Now this is a critical piece of infrastructure that operates at incredibly high data rates. Storing the sheer amount of data is a difficult task, even for Google.
Just to give a basic feeling of complexity:
“A web-search example will illustrate some of the challenges such a system needs to address …. In total, thousands of machines and many different services might be needed to process one universal search query. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.”
So, now on to some details and noteworthy citations:
- Data is written to local log files, pulled by the collection infrastructure and stored in one of several regional Bigtable repositories.
- A trace is laid out as a single Bigtable row, with each column corresponding to a Span (Dapper’s term for a basic units of work in a trace)
- Span Ids are probabilistically (!) unique64-bit integers
- Google production servers generate more than 1 terabyte of sampled trace data per day.
- Throughput is so high (tens of thousands per second per process) that Google decided to sample the data by keeping only a fraction (1/1024) of it
- The benefits of increased trace data density must then be weighed against the cost of machines and disk storage for the Dapper repositories. Sampling a high fraction of requests also brings the Dapper collectors uncomfortably close to the write throughput limit for the Dapper Bigtable repository
- Our experience at Google leads us to believe that, for high-throughput services, aggressive sampling does not hinder most important analyses. [...] If a notable execution pattern surfaces once in such systems, it will surface thousands of times
- However, lower traffic workloads may miss important events at such low sampling rates, while tolerating higher sampling rates with acceptable performance overheads.
- We are in the process of deploying an adaptive sampling scheme that is parameterized not by a uniform sampling probability, but by a desired rate of sampled traces per unit time.
More about Dapper:
- Nearly all of Google’s inter-process communication is built around a single RPC framework with bindings in both C++ and Java. We have instrumented that framework to define spans around all RPCs.
- The core instrumentation is less than 1000 lines of code in C++ and under 800 lines in Java
- Dapper also allows application developers to enrich Dapper traces with additional information that may be useful to monitor higher level system behavior or to help in debugging problems.
More about performance:
- The daemon never uses more than 0.3% of one core of a production machine during collection, and has a very small memory Trace data collection is responsible for less than 0:01% of the network traffic in Google’s production environment.
- Each span corresponds to 426 bytes on average
- Root span creation and destruction takes 204 nanoseconds on average
- Unrealistically heavy load testing benchmarks with data rates reaching to 2M/sec (!), with only 0.267% CPU Core usage. (I wonder if this stands for 2M spans or 2M/426=4694 traces per second?)
- Writes to local disk are the most expensive operation in Dapper’s runtime library, but their visible overhead is much reduced since each disk write coalesces multiple log file write operations and executes asynchronously with respect to the traced application.
Here’s a screenshot of Dapper user interface as shown in the article: