Notes on Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Here are my notes on the influential paper “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center”. My notes pertain only to the original paper itself, not to improvements or changes in the theory or implementation of Apache Mesos since 2010.

Mesos is “a platform for sharing commodity clusters between multiple diverse cluster computing frameworks”. A framework is a “software system that manages and executes one or more jobs on a cluster”, for example Hadoop, Spark, or MPI. Frameworks are responsible for running tasks, for example running a machine learning algorithm. When multiple frameworks run on a cluster without a platform like Mesos, there are often unintended consequences; for example, one framework may grab resources that would have been better used by another framework’s job.

Multiple frameworks often run on a single cluster because different frameworks are best suited to different kinds of computational workloads: for example, Spark for iterative workloads on shared data, or Flink for streaming workloads. Mesos shares cluster resources across frameworks with the goals of high utilization and efficient data sharing. Cluster resources can be shared without a platform like Mesos, for example by statically partitioning the cluster’s nodes among frameworks, or by allocating virtual machines to each framework, but utilization and efficiency suffer.

Determining how to share cluster resources between frameworks is especially challenging because individual frameworks already manage resources themselves. Cluster managers must either work with, augment, or replace these framework capabilities. For example, Hadoop’s Fair Scheduler assigns cluster nodes to jobs so that all jobs “get, on average, an equal share of resources”. Mesos does not seek to replace framework schedulers; it seeks to harmonize them so that total cluster utilization and efficiency are maximized, even though framework schedulers are unaware of each other’s existence. Mesos does this in a non-intrusive way by adopting a two-phase approach:

  1. Mesos decides how many resources to offer each framework,
  2. Frameworks decide which resources to accept and which tasks should run on them (using their own scheduler).
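
To make the division of labor concrete, here is a minimal sketch of phase 1 in Python. This is not the Mesos implementation (which is written in C++ and makes the allocation policy pluggable); the equal-share rule and every name below are my own illustrative assumptions.

    # Hypothetical sketch of phase 1: the master deciding which framework
    # should receive each resource offer. The equal-share policy below is an
    # illustrative assumption, not the paper's exact allocation algorithm.

    def pick_framework(frameworks, usage):
        """Offer to the framework currently using the smallest share of the
        cluster, approximating an equal-share policy."""
        return min(frameworks, key=lambda f: usage.get(f, 0.0))

    def make_offers(free_resources_by_worker, frameworks, usage):
        """Phase 1 only: turn each worker's free resources into an offer
        addressed to one framework. Phase 2 (accept/reject and task
        selection) happens inside the framework's own scheduler."""
        offers = []
        for worker_id, resources in free_resources_by_worker.items():
            offers.append({"framework": pick_framework(frameworks, usage),
                           "worker": worker_id,
                           "resources": resources})
        return offers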

There are several advantages to this approach:

  • Frameworks can continue to use their own schedulers.
  • Mesos can accommodate newly developed frameworks.
  • The Mesos implementation itself can be kept simple (since concerns are separated).
  • Mesos is scalable, because Mesos does not attempt to compute a global schedule for all tasks across all frameworks.

The primary disadvantage is that Mesos is denied the ability to globally optimize task allocation across frameworks.

Figures 2 and 3 in the paper are useful visual depictions of the Mesos offer process. Here is a simplified architectural diagram:

[Figure: simplified Mesos architecture diagram]

There are two components to the Mesos architecture: masters and workers. Masters are responsible for issuing resource offers to framework schedulers and for interacting with workers. Workers are responsible for running tasks on cluster resources.

The two-phase approach for task scheduling and execution is summarized in Figure 3 in the original paper. The process begins with workers reporting available resources. You can think of these “reports” as tuples (w_i, r_1, r_2, …, r_n) where w_i identifies the worker, and r_1, …, r_n represent resource attributes. For example, r_1 may represent the number of CPUs, r_2 may represent memory, r_3 the presence or absence of a GPU, and so on.

Armed with the knowledge of the capabilities of the cluster, the master can begin issuing offers to framework schedulers. An offer is also a tuple (w_i, r_1, …, r_n) – it’s a record that represents resources that a scheduler can choose to use. At this point, the framework scheduler can choose to either accept or reject the offer. Frameworks decide to accept or reject based on the pending list of tasks that need to be executed by the framework. There are legitimate reasons for rejecting offers even if tasks are pending; for example, pending tasks may require a GPU but the offer does not include one.

When an offer is accepted, the framework scheduler sends back a list of tuples (t_i, u_1, …, u_n), with t_i identifying a task to be executed, and u_1, …, u_n representing the resources that the task will consume when it is executed. The master can then send the tasks to workers for execution. It also “adjusts the books” so that future resource offers will account for the running tasks. When tasks are completed, the master is notified so that it can then account for these newly available resources.
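
Here is a hypothetical sketch of that exchange in Python. It is not the real Mesos API (Mesos is implemented in C++ with bindings for several languages); resource vectors are modeled as plain dicts rather than tuples, and every class and function name below is invented for illustration.

    # Invented classes tracing the message flow described above:
    # report -> offer -> accept (or reject) -> launch -> completion.

    class Master:
        def __init__(self):
            self.free = {}  # worker_id -> resources still available

        def report(self, worker_id, resources):
            # Workers report what they have, e.g. {"cpus": 4, "mem_gb": 8, "gpus": 0}.
            self.free[worker_id] = dict(resources)

        def offer(self, worker_id):
            # An offer is just a record of resources a scheduler may choose to use.
            return {"worker": worker_id, "resources": dict(self.free[worker_id])}

        def launch(self, offer, tasks):
            # "Adjust the books": subtract what the accepted tasks will use,
            # so future offers reflect only what is still free.
            for task in tasks:
                for name, amount in task["usage"].items():
                    self.free[offer["worker"]][name] -= amount

        def completed(self, worker_id, task):
            # A finished task's resources become offerable again.
            for name, amount in task["usage"].items():
                self.free[worker_id][name] += amount

    class FrameworkScheduler:
        def __init__(self, pending):
            self.pending = pending  # e.g. [{"task_id": "t1", "usage": {"cpus": 2, "mem_gb": 4}}]

        def resource_offer(self, offer):
            """Accept by returning tasks to run on the offered resources, or
            reject by returning an empty list (say, no GPU in the offer)."""
            accepted, remaining = [], dict(offer["resources"])
            for task in self.pending:
                if all(remaining.get(k, 0) >= v for k, v in task["usage"].items()):
                    accepted.append(task)
                    for k, v in task["usage"].items():
                        remaining[k] -= v
            self.pending = [t for t in self.pending if t not in accepted]
            return accepted

In this toy model, a full round trip is report, offer, resource_offer, then launch, with completed closing the loop when a task finishes.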

It might be helpful to compare this process to home mortgages. In this world, Mesos plays the role of a mortgage broker. A Mesos offer represents the terms of a mortgage, offered to lenders (schedulers). An approval constitutes an agreement by a lender to fund the loan.

As the paper notes, the ability of frameworks to reject offers is an important extensibility point: it allows a framework to account for its own considerations without burdening Mesos with the details.

The process of brokering offers and launching tasks is the heart of Mesos. There are a number of important additional considerations, of course: how to handle long running or “zombie” tasks, task isolation, robustness, and fault tolerance. Mesos relies on existing framework or cluster node mechanisms to handle these considerations when possible, and adds simple policies to Mesos itself when this is not possible. This all falls under the general design principle of keeping Mesos simple. These mechanisms are described in Section 3 of the paper. The details are interesting but are not fundamental to understanding the design.

As noted earlier, Mesos takes a decentralized approach: offers are made to frameworks, and the frameworks schedule tasks onto the offers they accept. Frameworks are (implicitly) incented by Mesos to adopt certain policies in order to improve throughput. These incentives are given in Section 4.4 of the paper:

  • Use short tasks,
  • Use resources as soon as they are allocated,
  • Be able to scale down,
  • Do not accept unknown resources.

Frameworks that follow these guidelines yield high utilization when managed by Mesos.
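
Two of these incentives translate directly into code a framework author might write. The helpers below are hypothetical (the names, resource types, and threshold are mine, not from the paper or from Mesos).

    # Incentive: do not accept resource types the framework cannot use.
    KNOWN_RESOURCES = {"cpus", "mem_gb", "gpus"}

    def should_accept(offer_resources):
        """Decline offers containing unknown resource types rather than
        letting them sit idle under this framework's name."""
        return set(offer_resources) <= KNOWN_RESOURCES

    # Incentive: prefer many short tasks over a few long ones, so resources
    # return to the pool (and can be re-offered) quickly.
    def split_into_short_tasks(total_records, records_per_task=10000):
        return [(start, min(start + records_per_task, total_records))
                for start in range(0, total_records, records_per_task)]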

Mesos does not claim to be the only viable solution for cluster resource management. For example, in a traditional HPC-style cluster environment with specialized, largely homogeneous hardware and fixed-size jobs, centralized scheduling may be more appropriate. In a grid computing environment, where geographically separate and separately administered resources are marshaled together for a computation (as my colleagues and I did for the famed “nug30” problem back in 2000), additional layers may need to sit on top of a platform such as Mesos.

Nonetheless, for many modern cluster workloads, especially large-scale machine learning, Mesos is an excellent choice.

2014 In Review: Five Data Science Trends

2014 was another transformative, exciting year for data science. Summarizing even one measly year of progress is difficult! Here, in no particular order, are five trends that caught my attention. They are likely to continue in 2015.

Adoption of higher productivity analytics programming environments. Traditional languages and environments such as C, C++, and SAS are diminishing in importance as R, Python, and Scala ascend. It is not that data scientists are dumping the old stuff; it is that a flood of new data scientists have entered the fray, overwhelmingly choosing more modern environments. These newer systems provide language conveniences as well as a rich set of built-in (or easy-to-install) libraries that handle higher-abstraction analytics tasks. Modern data scientists don’t want to write CSV read routines, JSON parsers, SQL INSERTs, or logging systems. R is notable in the sense that its productivity gains come from its packages and community support, not from the language itself. R’s clunkiness will be its downfall as Python, Scala, and other languages gain greater traction within the analytics community and narrow the library gap.

Machine learning is king. Data science, however you define it, is a broad discipline, but the thunder belongs to machine learning. Or rather, machine learning is becoming synonymous with data science. Optimization, for example, is increasingly being described as a backend algorithm supporting prediction or classification. Objective functions are described as “learning functions”, and so on. We are collectively inventing definitions and classifications as we go along, so in some sense this is merely semantics. There is a real problem, however: we risk ignoring the collective wisdom of past masters, throwing rather broad but shallow techniques at models with rather particular structure. Somewhere in a coffee shop right now, somebody is using a genetic algorithm to solve a network model.

Visualization is everywhere. To the point that it’s starting to get annoying. Whether in solutions such as Tableau or TIBCO Spotfire, add-ons such as Excel Power View or Frontline’s XLMiner Visualization app (plug!), or programming libraries such as matplotlib or d3.js, the machinery to build good visualizations is maturing. The explosion of infographics in popular media has raised expectations: users expect visualizations to guide and inform them at all stages of the analytics lifecycle. As you have no doubt seen, the problem is that it is so easy to build shitty viz: bar charts with no content; furious heat maps signifying nothing. We’ll start to see broader, if unarticulated, consensus on appropriate visualizations and visual metaphors for the results of quantitative analysis. I hope.

Spark is supplanting Hadoop. Apache Spark wins on performance, has well-designed streaming, text analytics, machine learning, and data access libraries, and has huge community momentum. This was all true at the beginning of 2014, but now at the end of 2014 we are starting to see a breakthrough in industry investment. Hadoop isn’t going anywhere, but in 2015 many new, “big time” big data projects will be built on Spark. The more flexible graph-based pipeline at the heart of Spark is begging to be exploited by great data scientists – what will 2015 bring?

Service oriented architectures are coming to analytics. The ubiquity of REST-based endpoints in web development, combined with a new culture of code sharing, has engendered a new “mixtape” development paradigm. A kid (or even an older guy…) can whip out their MacBook, create a web service on django, deploy it on AWS using Elastic Beanstalk, connect it to interactive visualizations in an afternoon, and submit an app that night. Amazon has built a $1B+ side business on the strength of cloud-only services, and these same forces will drive analytics forward. The prime mover in data science is not big data. It is cloud. RESTful, service-based analytics services will explode.

Spark Summit Keynote Notes

Here is a summary of my key impressions from the Day 1 keynotes of the 2014 Spark Summit.

This year’s Spark Summit was one of the deepest, most interesting technical conferences I have attended, and I don’t say that lightly. It is easy to get caught up in the excitement of a conference filled with enthusiasts, but trust me when I say that conventional MapReduce–based Hadoop is over and technologies like Spark will be part of the tipping point that will turn Big Data hype into real applications and much more widespread deployment. Hadoop is legacy.

Spark, like Hadoop, is more than one “thing”. The base component of Spark is a cluster computation engine that is like MapReduce on steroids. Instead of the simple two-stage “map then reduce” computational model, Spark supports more general DAG-structured computational flows (Microsoft watchers will remember Dryad). This, in itself, is a big innovation, especially for analytics scenarios. Indeed, Spark has been shown to be 10, 100, even 1000 times faster than Hadoop on a number of real workloads. More important than this, in my view, is that Spark includes higher level libraries for data access and analytics, surfaced in clean, consistent APIs that are available in three languages: Java, Scala, and Python. Three important Spark components are MLLib, a machine learning library; GraphX, a graph processing library; and Spark SQL, introduced at this conference. An analogue for those familiar with the Microsoft ecosystem is the .Net Framework – .Net provides languages, a runtime, and a set of libraries together. The integration of the pieces makes each much more useful.
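
As a small illustration of what a DAG-structured flow looks like in practice, here is a toy PySpark job of my own (the input path is a placeholder, and this is not code from the Summit). It chains several transformations rather than a single map-then-reduce pass.

    # Toy PySpark job: several chained transformations form a DAG rather
    # than a single "map then reduce" pass. The input path is a placeholder.
    from pyspark import SparkContext

    sc = SparkContext(appName="dag-example")

    lines = sc.textFile("hdfs:///data/logs/*.txt")          # placeholder path
    words = lines.flatMap(lambda line: line.split())
    counts = (words.map(lambda w: (w.lower(), 1))
                   .reduceByKey(lambda a, b: a + b))         # shuffle stage boundary
    frequent = counts.filter(lambda kv: kv[1] >= 100)        # keep common words only
    top = frequent.takeOrdered(20, key=lambda kv: -kv[1])    # top 20 by count

    print(top)
    sc.stop()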

The Summit is organized and principally sponsored by Databricks (tagline: “making big data easy”). This is the company founded by the Berkeley-based creators of Spark. Ion Stoica, CEO of Databricks, kicked off Monday’s festivities, introducing Databricks Cloud, a web-based Spark workbench for doing big data analytics. You can find screenshots on the Databricks Cloud site, or on twitter. Key points:

  • Databricks Cloud is currently in a private beta.
  • It’s a standalone web interface.
  • It has a command-line “REPL” interface.
  • The examples I saw were in Scala (which is kind of like a mix of F# and Java).
  • You can bring in data from Amazon S3 or other sources using Spark SQL (more on that in future posts).
  • It includes pre-canned datasets such as a twitter snapshot/firehose (can’t tell which).
  • You can do SQL queries right from the REPL (a rough sketch follows this list).
  • It has incredibly simple, clean looking visualizations tied to results.
  • You can drag and drop dashboard components that correspond to REPL / program outputs. You can configure how often these components are “refreshed”.
  • We were presented a live 10-minute demo to create a dashboard to filter live tweets, based on a similarity model authored with the help of MLLib, and trained on Wikipedia.
  • Databricks Cloud would be quite useful even as a standalone, single node analytics workbench, but recall that all of this is running on top of Spark, on the cluster, without the user doing any explicit “parallel programming”.
  • Everything you create in the workbench is Spark 1.0 compliant, meaning you can move it over to any other Spark 1.0 distribution without changes.
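
For the Spark SQL bullets above, the workflow looks roughly like the sketch below. This is my own hedged example rather than the demo code: the S3 path and table name are placeholders, and the exact Spark SQL method names shifted around in the early releases.

    # Rough sketch of loading data and querying it with SQL from the REPL.
    # Paths and names are placeholders; early Spark SQL releases renamed
    # some of these methods, so treat this as illustrative only.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="sql-example")
    sqlContext = SQLContext(sc)

    tweets = sqlContext.jsonFile("s3n://my-bucket/tweets/*.json")  # placeholder S3 path
    tweets.registerTempTable("tweets")

    top_langs = sqlContext.sql("""
        SELECT lang, COUNT(*) AS n
        FROM tweets
        GROUP BY lang
        ORDER BY n DESC
        LIMIT 10
    """).collect()

    print(top_langs)
    sc.stop()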

The development plan is sound, and there is a ton of corporate support for Spark from Cloudera, Hortonworks, Databricks, SAP, IBM, and others. If time permits I will summarize some of the other keynotes and sessions.