Here are my notes on the influential paper “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center”. My notes pertain only to the original paper itself, and not to improvements or changes in the theory or implementation of Apache Mesos since 2010.
Mesos is “a platform for sharing commodity clusters between multiple diverse cluster computing frameworks”. A framework is a “software system that manages and executes one or more jobs on a cluster”, for example Hadoop, Spark, or MPI. Frameworks are responsible for running tasks, for example running a machine learning algorithm. When multiple frameworks run on a cluster without a platform like Mesos, there are often unintended consequences: for example, one framework may grab resources for a job that would have been better suited for another framework’s job.
Multiple frameworks often run on a single cluster because different frameworks are best suited for different kinds of computational workloads: for example, Spark for iterative workloads on shared data, or Flink for streaming workloads. Mesos shares cluster resources across frameworks with the goals of high utilization and efficient data sharing. Cluster resources can be shared without a platform like Mesos, for example by simply partitioning the nodes in the cluster among frameworks, or by allocating virtual machines to each framework, but utilization and efficiency suffer.
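To make the utilization argument concrete, here is a small sketch comparing a static partition with a fully shared pool. The node counts and demands are made-up numbers for illustration, and the shared case optimistically assumes that any node can serve any framework (it ignores data locality and scheduling overhead).

```python
from typing import Dict


def static_utilization(partition: Dict[str, int], demand: Dict[str, int]) -> float:
    """Fraction of nodes in use when each framework is confined to its own partition."""
    used = sum(min(partition[name], demand.get(name, 0)) for name in partition)
    return used / sum(partition.values())


def shared_utilization(total_nodes: int, demand: Dict[str, int]) -> float:
    """Fraction of nodes in use when all frameworks draw from one shared pool."""
    return min(total_nodes, sum(demand.values())) / total_nodes


# Hypothetical 100-node cluster split evenly between two frameworks.
partition = {"hadoop": 50, "spark": 50}
demand = {"hadoop": 80, "spark": 10}  # nodes each framework could use right now

static = static_utilization(partition, demand)  # (50 + 10) / 100
shared = shared_utilization(100, demand)        # min(100, 90) / 100
print(f"static: {static:.0%}, shared: {shared:.0%}")
```

In this toy case the static partition leaves 40 nodes idle while Hadoop is starved, which is the kind of waste that fine-grained sharing is meant to avoid.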
Determining how to share cluster resources between frameworks is especially challenging because individual frameworks already manage resources themselves. Cluster managers must either work with, augment, or replace these framework capabilities. For example, Hadoop’s Fair Scheduler assigns cluster nodes to jobs so that all jobs “get, on average, an equal share of resources”. Mesos does not seek to replace framework schedulers; it seeks to harmonize them so that total cluster utilization and efficiency are maximized, even though framework schedulers are unaware of each other’s existence. Mesos does this in a non-intrusive way by adopting a two-phase approach:
- Mesos decides how many resources to offer each framework,
- Frameworks decide which resources to accept and which tasks should run on them (using their own scheduler).
There are several advantages to this approach:
- Frameworks can continue to use their own schedulers.
- Mesos can accommodate newly developed frameworks.
- The Mesos implementation itself can be kept simple (since concerns are separated).
- Mesos is scalable, because Mesos does not attempt to compute a global schedule for all tasks across all frameworks.
The primary disadvantage is that Mesos is denied the ability to globally optimize task allocation across frameworks.
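The separation of concerns described above can be sketched as two independent functions, one per phase. This is only an illustration: the function names are hypothetical, resources are reduced to a single CPU count, and the “fewest CPUs first” policy in phase one is an assumption made for the example, not something these notes attribute to Mesos.

```python
from typing import Dict, List


def pick_framework(allocated_cpus: Dict[str, float]) -> str:
    """Phase 1 (assumed policy): offer next to the framework holding the fewest CPUs.

    Names are sorted first so that ties break alphabetically.
    """
    return min(sorted(allocated_cpus), key=lambda name: allocated_cpus[name])


def choose_tasks(offered_cpus: float, pending: Dict[str, float]) -> List[str]:
    """Phase 2 (framework side): greedily pick pending tasks that fit the offer.

    `pending` maps a task name to the CPUs it needs. An empty result means
    the framework declines the offer.
    """
    chosen = []
    for task, cpus in pending.items():
        if cpus <= offered_cpus:
            chosen.append(task)
            offered_cpus -= cpus
    return chosen


allocated = {"spark": 4, "hadoop": 2, "mpi": 2}
target = pick_framework(allocated)  # "hadoop" and "mpi" tie; alphabetical order wins
tasks = choose_tasks(3, {"map-1": 2, "map-2": 2, "reduce-1": 1})
print(target, tasks)
```

Neither function needs to know how the other works, which is why frameworks can keep their own schedulers. It also shows the disadvantage: `pick_framework` never sees the pending task lists, so it cannot optimize globally.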
Figures 2 and 3 in the paper are useful visual depictions of the Mesos offer process. Here is a simplified architectural diagram:
There are two components to the Mesos architecture: masters and workers. Masters are responsible for issuing offers to framework schedulers and for coordinating between workers and framework schedulers. Workers are responsible for running tasks on cluster resources.
The two-phase approach for task scheduling and execution is summarized in Figure 3 in the original paper. The process begins with workers reporting available resources. You can think of these “reports” as tuples (w_i, r_1, …, r_n) where w_i identifies the worker, and r_1, …, r_n represent resource attributes. For example, r_1 may represent the number of CPUs, r_2 may represent memory, r_3 the presence or absence of a GPU, and so on. Armed with the knowledge of the capabilities of the cluster, the master can begin issuing offers to framework schedulers. An offer is also a tuple (w_i, r_1, …, r_n): it’s a record that represents resources that a scheduler can choose to use.
At this point, the framework scheduler can choose to either accept or reject the offer. Frameworks decide to accept or reject based on the pending list of tasks that need to be executed by the framework. There are legitimate reasons for rejecting offers even if tasks are pending; for example, pending tasks may require a GPU but the offer does not include one.
When an offer is accepted, the framework scheduler sends back a list of tuples (t_i, u_1, …, u_n), with t_i identifying a task to be executed, and u_1, …, u_n representing the resources that the task will use when it is executed. The master can then send the tasks to workers for execution. It also “adjusts the books” so that future resource offers will account for the running tasks. When tasks are completed, the master is notified so that it can then account for these newly available resources.
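The offer cycle described above can be sketched as a toy simulation. This is not the Mesos API: the class names (`Master`, `FrameworkScheduler`, `Offer`, `TaskRequest`), the resource keys, and the greedy acceptance rule are all hypothetical and chosen only to mirror the tuples in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Resources = Dict[str, float]  # e.g. {"cpus": 4, "mem_gb": 16, "gpus": 0}


@dataclass
class Offer:
    worker: str           # w_i
    resources: Resources  # r_1, ..., r_n


@dataclass
class TaskRequest:
    task_id: str      # t_i
    needs: Resources  # u_1, ..., u_n


def fits(needs: Resources, available: Resources) -> bool:
    return all(available.get(k, 0) >= v for k, v in needs.items())


@dataclass
class FrameworkScheduler:
    pending: List[TaskRequest]

    def respond(self, offer: Offer) -> List[TaskRequest]:
        """Return the tasks to launch on this offer; an empty list rejects it."""
        remaining = dict(offer.resources)
        accepted = []
        for task in list(self.pending):
            if fits(task.needs, remaining):
                for k, v in task.needs.items():
                    remaining[k] = remaining.get(k, 0) - v
                accepted.append(task)
                self.pending.remove(task)
        return accepted


@dataclass
class Master:
    free: Dict[str, Resources] = field(default_factory=dict)
    running: Dict[str, Tuple[str, Resources]] = field(default_factory=dict)

    def report(self, worker: str, resources: Resources) -> None:
        """A worker reports its available resources."""
        self.free[worker] = dict(resources)

    def offer_to(self, scheduler: FrameworkScheduler, worker: str) -> List[str]:
        """Offer a worker's free resources, then adjust the books for launched tasks."""
        launched = scheduler.respond(Offer(worker, dict(self.free[worker])))
        for task in launched:
            for k, v in task.needs.items():
                self.free[worker][k] = self.free[worker].get(k, 0) - v
            self.running[task.task_id] = (worker, task.needs)
        return [task.task_id for task in launched]

    def task_finished(self, task_id: str) -> None:
        """Return a completed task's resources to the free pool."""
        worker, needs = self.running.pop(task_id)
        for k, v in needs.items():
            self.free[worker][k] = self.free[worker].get(k, 0) + v


master = Master()
master.report("w1", {"cpus": 4, "mem_gb": 16, "gpus": 0})
scheduler = FrameworkScheduler([
    TaskRequest("train", {"cpus": 2, "mem_gb": 8, "gpus": 1}),  # needs a GPU
    TaskRequest("etl", {"cpus": 2, "mem_gb": 4}),
])
launched = master.offer_to(scheduler, "w1")
print(launched, master.free["w1"])
```

Here the GPU task stays pending because the offer from `w1` has no GPU, while the other task is launched and its resources are deducted until `task_finished` is called. A real master would also handle multiple frameworks, failures, and offer timeouts, all of which this sketch leaves out.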
It might be helpful to compare this process to home mortgages. In this world, Mesos plays the role of a mortgage broker. A Mesos offer represents the terms of a mortgage, offered to lenders (schedulers). An approval constitutes an agreement by a lender to fund the loan.
As the paper notes, the ability of frameworks to reject offers is an important extensibility point that allows each framework to account for its own considerations, without burdening Mesos with the details.
The process of brokering offers and launching tasks is the heart of Mesos. There are a number of important additional considerations, of course: how to handle long-running or “zombie” tasks, task isolation, robustness, and fault tolerance. Mesos relies on existing framework or cluster node mechanisms to handle these considerations when possible, and adds simple policies to Mesos itself when this is not possible. This all falls under the general design principle of keeping Mesos simple. These mechanisms are described in Section 3 of the paper. The details are interesting but are not fundamental to understanding the design.
As noted earlier, Mesos takes a decentralized approach: offers are made to frameworks, and the frameworks schedule accepted offers. Frameworks are (implicitly) incented by Mesos to adopt certain policies in order to improve throughput. These incentives are given in Section 4.4:
- Use short tasks,
- Use resources as soon as they are allocated,
- Be able to scale down,
- Do not accept unknown resources.
Frameworks that follow these guidelines yield high utilization when managed by Mesos.
Mesos does not claim to be the only viable solution for cluster resource management. For example, in a traditional HPC-style cluster environment with specialized, largely homogeneous hardware and fixed-size jobs, centralized scheduling may be more appropriate. In a grid computing environment where geographically separate and separately administered resources are marshaled together for a computation (as my colleagues and I did for the famed “nug30” problem back in 2000), additional layers may need to sit on top of a platform such as Mesos.
Nonetheless, for many modern cluster workloads, especially those for large-scale machine learning, Mesos is an excellent choice.