Notes on Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Here are my notes on the influential paper “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center”. My notes pertain only to the original the paper itself, and not improvements or changes in the theory or implementation of Apache Mesos since 2010.

Mesos is “a platform for sharing commodity clusters between multiple diverse cluster computing frameworks”. A framework is a “software system that manages and executes one or more jobs on a cluster”, for example Hadoop, Spark, or MPI. Frameworks are responsible for running tasks, for example running a machine learning algorithm. When multiple frameworks run on a cluster without a platform like Mesos, there are often unintended consequences, for example one framework may grab resources for a job that would gave been better suited for another framework’s job.

Multiple frameworks often run on a single cluster because different frameworks are best suited for different kinds of computational workloads. For example, Spark for iterative workloads on shared data, or Flink for streaming workloads. Mesos shares cluster resources across frameworks with the goals of high utilization and efficient data sharing. Cluster resources can be shared without a framework, for example by simply partitioning the nodes in cluster to frameworks, or by allocating virtual machines to each framework, but utilization and efficiency suffer.

Determining how to share cluster resources between frameworks is especially challenging because individual frameworks already manage resources themselves. Cluster managers must either work with, augment, or replace these framework capabilities. For example, Hadoop’s Fair Scheduler assigns cluster nodes to jobs so that all jobs “get, on average, an equal share of resources”. Mesos does not seek to replace framework schedulers; it seeks to harmonize them so that total cluster utilization and efficiency is maximized, even though framework schedulers are unaware of each other’s existence. Mesos does this in a non-intrusive way by adopting a two phase approach:

  1. Mesos decides how many resources to offer each framework,
  2. Frameworks decide which resources to accept and which tasks should run on them (using their own scheduler).

There are several advantages to this approach:

  • Frameworks can continue to use their own schedulers.
  • Mesos can accommodate newly developed frameworks.
  • The Mesos implementation itself can be kept simple (since concerns are separated).
  • Mesos is scalable, because Mesos does not attempt to compute a global schedule for all tasks across all frameworks.

The primary disadvantage is that Mesos is denied the ability to globally optimize task allocation across frameworks.

Figures 2 and 3 in the paper are useful visual depictions of the Mesos offer process. Here is a simplified architectural diagram:

Mesos

There are two components to the Mesos architecture: masters and workers. Masters are responsible for issuing offers to workers and interacting with workers and framework schedulers. Workers are responsible for running tasks on cluster resources.

The two phase approach for task scheduling and execution is summarized in Figure 3 in the original paper. The process begins with workers reporting available resources. You can think of these “reports” as tuples (w_i, r_1, r_2, …r_n) where w_i identifies the worker, and r_1, …, r_n represent resource attributes. For example, r_1 may represent the number of CPUs, r_2 may represent memory, r_3 the presence or absence of a GPU, and so on. Armed with the knowledge of the capabilities of the cluster, the master can begin issuing offers to framework schedulers. An offer is also a tuple (w_i, r_1, …, r_n) – it’s a record that represents resources that a scheduler can choose to use. At this point, the framework scheduler can choose to either accept or reject the offer. Frameworks decide to accept or reject based on the pending list of tasks that need to be executed by the framework. There are legitimate reasons for rejecting offers even if tasks are pending; for example pending tasks may require a GPU but the offer does not include one. When an offer is accepted, the framework scheduler sends back a list of tuples (t_i, u_1, …, u_n), with t_i identifying a task to be executed, and u_i representing the resources that will be utilized by the task when it is executed. The master can then send the tasks to workers for execution. It also “adjusts the books” so that future resource offers will account for the running tasks. When tasks are completed, the master is notified so that it can then account for these newly available resources.

It might be helpful to compare this process to home mortgages. In this world, Mesos plays the role of a mortgage broker. A Mesos offer represents the terms of a mortgage, offered to lenders (schedulers). An approval constitutes an agreement by a lender to fund the loan.

As the paper notes, the ability for frameworks to reject offers is an important extensibility point that allows for frameworks to account for its own considerations, without burdening Mesos with the details.

The process of brokering offers and launching tasks is the heart of Mesos. There are a number of important additional considerations, of course: how to handle long running or “zombie” tasks, task isolation, robustness, and fault tolerance. Mesos relies on existing framework or cluster node mechanisms to handle these considerations when possible, and adds simple policies to Mesos itself when this is not possible. This all falls under the general design principle of keeping Mesos simple. These mechanisms are described in Section 3 of the paper. The details are interesting but are not fundamental to understanding the design.

As noted earlier, Mesos takes a decentralized approach: offers are made to frameworks, and the frameworks schedule accepted offers. Frameworks are (implicitly) incented by Mesos to adopt certain policies in order to improve throughput. These incentives are given in Section 4.4:

  • Uses short tasks,
  • Uses resources as soon as they are allocated,
  • Ability to scale down,
  • Does not accept unknown resources.

Frameworks that follow these guidelines yield high utilization when managed by Mesos.

Mesos does not claim to be the only viable solution for cluster resource management. For example, in a traditional HPC-style cluster environment with specialized, largely homogeneous hardware and fixed-size jobs, centralized scheduling may be more appropriate. In a grid computing environment where geographically separate and separately administered resources are marshaled together for a computation (like me and my colleagues did for the famed “nug30” problem back in 2000), additional layers may need to sit on top of a framework such as Mesos.

Nonetheless for many modern cluster workloads, especially those for large scale machine learning, Mesos is an excellent choice.

Finding Optimal State Capitol Tours on the Cloud with NEOS

My last article showed you how to find an optimal tour of all 48 continental US state capitols using operations research. I used the Python API of the popular Gurobi solver to create and solve a traveling salesman problem (TSP) model in a few seconds.

In this post I want to show you how to use Concorde, the world’s best TSP solver for free on the cloud using the NEOS optimization service. In less than 100 lines of Python code, you can find the best tour. Here it is:

TSP_Tour48_Bokeh

Using NEOS is pretty easy. You need to do three things to solve an optimization problem:

  1. Create a NEOS account.
  2. Create an input file for the problem you want to solve.
  3. Give the input file to NEOS, either through their web interface, or by calling an API.

Let’s walk through those steps for the state capitol problem. If you just want to skip to the punchline, here is my code.

Concorde requires a problem specification in the TSPLIB format. This is a text based format where we specify the distances between all pairs of cities. Recall that Randy Olson found the distances between all state capitols using the Google Maps API in this post. Here is a file with this information. Using the distances, I created a TSPLIB input file with the distance matrix – here it is.

The next step is to submit the file to NEOS. Using the xmlrpc Python module, I wrote a simple wrapper to submit TSPLIB files to NEOS. The NEOS submission is an XML file that wraps the contents of the TSPLIB data, and also tells NEOS that we want to use the Concorde solver. The XML file is given to NEOS via an XML-RPC call. NEOS returns the results as a string – the end of the string contains the optimal tour. Here is the body of the primary Python function that carries out these steps:

def solve_tsp_neos_concorde(dist):
xml = make_neos_concorde(dist)
neos = NeosClient()
result = neos.run(xml)
return tour_from_neos_concorde_result(result)

When I run this code, I obtain the same tour as in my initial post. Hooray! You can also extend my code (which is based on NEOS documentation) to solve many other kinds of optimization models.

Computing Optimal Road Trips Using Operations Research

Randy Olson recently wrote about an approach for finding a road trip that visits all 48 continental US state capitols. Randy’s approach involves genetic algorithms and is accompanied by some very effective visualizations. Further, he examines how the length of these road trips varies as the number of states visited increases. While the trips shown in Randy’s post are very good, they aren’t quite optimal. In this post I’d like to show how you can find the shortest possible road trips in Python using the magic science of operations research. I suggest you read Randy’s post first to get up to speed!

An “optimal road trip” is an ordering of the 48 state capitols that results in the smallest possible driving distance as determined by Google Maps. This is an example of what is known as the Traveling Salesman Problem (TSP). In his work, Randy has made a couple of simplifying assumptions that I will also follow:

  • The driving distance from capitol A to capitol B is assumed to be the same as from B to A. We know this isn’t 100% true because of the way roads work. But close enough.
  • We’re optimizing driving distance, not driving time. We could easily optimize “average” driving time using data provided by Google. Optimizing expected driving time given a specified road trip start date and time is actually pretty complicated given that we don’t know what the future will bring: traffic jams, road closures, storms, and so on..

These aren’t “bugs”, just simplifying assumptions. Randy used the Google Maps API to get driving distances between state capitols – here’s the data file. Note that Google Maps returns distances in kilometers so you’ll need to convert to miles if that’s your preference.

Randy’s approach to solve this problem was to use a genetic algorithm. Roughly speaking, a genetic algorithm starts with a whole bunch of randomly generated tours, computes their total distances, and repeatedly combines and modifies them to find better solutions. Following the analogy to genetics, tours with smaller total distances are more likely to be matched up with other fit tours to make brand new baby tours. As Randy showed in his post, within 20 minutes his genetic algorithm is able to produce a 48 state tour with a total length of 13,310 miles.

It turns out that we can do better. An inaccuracy in Randy’s otherwise great article is the claim that it’s impossible to find optimal tours for problems like these. You don’t have to look at all possible 48 city road trips to find the best one – read this post by Michael Trick. What we can do instead is rely on the insanely effective field of operations research and its body of 50+ years of work. In an operations research approach, we build a model for our problem based on the decisions we want to make, the overall objective we have in mind, and restrictions and constraints on what constitutes a solution. This model is then fed to operations research software (optimization solvers) that use highly tuned algorithms to find provably optimal solutions. The algorithms implemented solvers rule out vast swaths of possible tours in a brutally efficient manner, making the seemingly impossible routine.

The best-in-class TSP solver is Concorde, and is based on an operations research approach. You don’t need Concorde to solve this TSP – a 48 city road trip is puny by operations research standards. I have chosen to use the Gurobi solver because it is a very powerful solver that includes an easy-to-use Python API, and it has a cloud version. Gurobi even includes an example that covers this very problem! The key to their model is to define a yes-no decision variable for each pair of state capitols. A “yes” value for a decision variable indicates that pair of cities is on the optimal tour. The model also needs to specify the rules for what it means to be an optimal tour:

  • The shorter the total distance of the tour (which is determined by the distances between all of the “yes” pairs of cities), the better. This is the objective (or goal) that we seek to optimize.
  • The traveller will arrive at each capitol from another capitol, and will leave for another capitol. In other words, exactly two decision variables involving a capitol will be “yes”.
  • Tours don’t have cycles: visiting Boise more than once is not allowed.

(Sidebar: If you are not used to building optimization models then the code probably won’t make much sense and you may have no idea what the hell the Gurobi tutorial is talking about. No offense, Gurobi, the tutorial is very well written! The challenge of writing optimization models, which involves writing down precise mathematical relationships involving the decision variables you want solved, is what prevents many computer scientists and data scientists from using the fruits of operations research more often. This is especially the case when the models for classical problems such as the TSP require things like “lazy constraints” that even someone experienced with operations research may not be familiar with. I wrote about this in more detail here. On the other hand, there are a lot of great resources and tutorials out there and it’s simply good practice to rely on proven approaches that provide efficient, accurate results. This is what good engineers do. Anyway, the point is that you can easily steal Gurobi’s sample for this problem and replace their “points” variable with the distances from the data file above. If I wanted to do this with an open source package, or with Concorde itself I could have done it that way too.)

My code, based on Gurobi’s example, is able to find a tour with a total length of 12930 miles, about 380 miles shorter than the original tour. What’s more, it takes seconds to find the answer. Here is my Python code. Here is the tour – click here to explore it interactively.

TSP_Tour48

A text file with the tour is here and a GPX file of the tour is here courtesy of gpsvisualizer.com. This optimal tour is very close to the one the genetic algorithm came up with. Here is a screenshot for reference:

TSP_Tour48Olson

An interesting twist is that Randy extends the problem to consider both the driving distance and the number of states visited. If we are willing to do a tour of, say, 10 states, then clearly the total distance for the tour will be much shorter than a 48 state tour. Randy has a nice animation showing tours of differing numbers of states, as well as a chart that plots the number of states visited against the estimated driving time. This curve is called the efficient frontier – you sometimes see similar curves in financial models.

The modified problem of finding the shortest tour involving K of 48 total state capitols can also be solved by Gurobi. I extended the optimization model slightly:

  • Introduce new yes-no decision variables for each capitol: “yes” if the capitol is one of the lucky K to be visited.
  • Exactly K of the new decision variables should be “yes”.
  • Fix up the original model to make sure we don’t worry about the other N-K cities not on our mini tour.

(I also had to modify the model because I am running on the cloud and the “lazy constraints” mentioned in Gurobi’s TSP writeup don’t work in the cloud version of Gurobi.)

With this new code in place I can call it for K=3…47 and get this optimal efficient frontier curve:

TSP_Pareto

The distances and tours for all of these mini tours are given here.

What have we learned here? In about 200 lines of Python code we were able to efficiently find provably optimal solutions for the original road trip problem, as well as the “pareto optimization” extension. If you’re a data scientist, get familiar with operations research principles because it will certainly pay off!

The Logit and Sigmoid Functions

If you mess around with machine learning for long enough, you’ll eventually run into the logit and sigmoid functions. These are useful functions when you are working with probabilities or trying to classify data.

Given a probability p, the corresponding odds are calculated as p / (1 – p). For example if p=0.25, the odds are 3 to 1: 0.25/0.75 = 3.

The logit function is simply the logarithm of the odds: logit(x) = log(x / (1 – x)). Here is a plot of the logit function:

logit

The value of the logit function heads towards infinity as p approaches 1 and towards negative infinity as it approaches 0.

The logit function is useful in analytics because it maps probabilities (which are values in the range [0, 1]) to the full range of real numbers. In particular, if you are working with “yes-no” (binary) inputs it can be useful to transform them into real-valued quantities prior to modeling. This is essentially what happens in logistic regression.

The inverse of the logit function is the sigmoid function. That is, if you have a probability p, sigmoid(logit(p)) = p. The sigmoid function maps arbitrary real values back to the range [0, 1]. The larger the value, the closer to 1 you’ll get.

The formula for the sigmoid function is σ(x) = 1/(1 + exp(-x)). Here is a plot of the function:

sigmoid

The sigmoid might be useful if you want to transform a real valued variable into something that represents a probability. This sometimes happens at the end of a classification process. (As Wikipedia and other sources note, the term “sigmoid function” is used to refer to a class of functions with S-shaped curves. In most machine learning contexts, “sigmoid” usually refers specifically to the function described above.)

There are other functions that map probabilities to reals (and vice-versa), so what’s so special about the logit and sigmoid? One reason is that the logit function has the nice connection to odds described at the beginning of the article. A second is that the gradients of the logit and sigmoid are simple to calculate (try it and see). The reason why this is important is that many optimization and machine learning techniques make use of gradients, for example when estimating parameters for a neural network.

The biggest drawback of the sigmoid function for many analytics practitioners is the so-called “vanishing gradient” problem. You can read more about this problem here (and here), but the point is that this problem pertains not only to the sigmoid function, but any function that squeezes real values to the [0, 1] range. In neural networks, where the vanishing gradient problem is particularly annoying, it is often a good idea to seek alternatives as suggested here.

Checkpointing and Reusing TensorFlow Models

In my last two posts I introduced TensorFlow and wrote a very simple predictive model. In doing so I introduced many of the key concepts of TensorFlow:

  • The Session, the core of the TensorFlow object model,
  • Computational graphs and some of their elements: placeholders, variables, and Tensors,
  • Training models by iteratively calling Session.run on Optimization objects.

In this post I want to show you can save and re-use the results of your TensorFlow models. As we discussed last time, training a model means finding variable values that suit a particular purpose, for example finding a slope and intercept that defines a line that best fits a series of points. Training a model can be computationally expensive because we have to search for the best variable values through optimization. Suppose we want to use the results of this trained model over and over again, but without re-training the model each time. You can do this in TensorFlow using the Saver object.

A Saver object can save and restore the values of TensorFlow Variables. A typical scenario has three steps:

  1. Creating a Saver and telling the Saver which variables you want to save,
  2. Save the variables to a file,
  3. Restore the variables from a file when they are needed.

A Saver deals only with Variables. It does not work with placeholders, sessions, expressions, or any other kind of TensorFlow object. Here is a simple example that saves and restores two variables:

def save(checkpoint_file=’hello.chk’):
    with tf.Session() as session:
        x = tf.Variable([42.0, 42.1, 42.3], name=’x’)
        y = tf.Variable([[1.0, 2.0], [3.0, 4.0]], name=’y’)
        not_saved = tf.Variable([-1, -2], name=’not_saved’)
        session.run(tf.initialize_all_variables())

        print(session.run(tf.all_variables()))
        saver = tf.train.Saver([x, y])
        saver.save(session, checkpoint_file)

def restore(checkpoint_file=’hello.chk’):
    x = tf.Variable(-1.0, validate_shape=False, name=’x’)
    y = tf.Variable(-1.0, validate_shape=False, name=’y’)
    with tf.Session() as session:
        saver = tf.train.Saver()
        saver.restore(session, checkpoint_file)
        print(session.run(tf.all_variables()))

def reset():
    tf.reset_default_graph()

Try calling save(), reset() and then restore(), and compare the outputs to verify everything worked out. When you create a Saver, you should specify a list (or dictionary) of Variable objects you wish to save. (If you don’t, TensorFlow will assume you are interested in all the variables in your current session.) The shapes and values of these values will be stored in binary format when you call the save() method, and retrieved on restore(). Notice in my last function, when I create x and y, I give dummy values and say validate_shape=False. This is because I want the saver to determine the values and shapes when the variables are restored. If you’re wondering why the reset() function is there, remember that computational graphs are associated with Sessions. I want to “clear out” the state of the Session so I don’t have multiple x and y objects floating around as we call save and restore().

When you use Saver in real models, you should keep a couple of facts in mind:

  1. If you want to do anything useful with the Variables you restore, you may need to recreate the rest of the computational graph.
  2. The computational graph that you use with restored Variables need not be the same as the one that you used when saving. That can be useful!
  3. Saver has additional methods that can be helpful if your computation spans machines, or if you want to avoid overwriting old checkpoints on successive calls to save().

At the end of this post I have include a modification of my line fitting example to optionally save and restore model results. I’ve highlighted the interesting parts. You can call it like this:

fit_line(5, checkpoint_file=’vars.chk’)
reset()
fit_line(5, checkpoint_file=’vars.chk’, restore=True)

With this version, I could easily “score” new data points x using my trained model.

def fit_line(n=1, log_progress=False, iter_scale=200,
             restore=False, checkpoint_file=None):
    with tf.Session() as session:
        x = tf.placeholder(tf.float32, [n], name=’x’)
        y = tf.placeholder(tf.float32, [n], name=’y’)
        m = tf.Variable([1.0], name=’m’)
        b = tf.Variable([1.0], name=’b’)
        y = tf.add(tf.mul(m, x), b) # fit y_i = m * x_i + b
        y_act = tf.placeholder(tf.float32, [n], name=’y_’)

        # minimize sum of squared error between trained and actual.
        error = tf.sqrt((y – y_act) * (y – y_act))
        train_step = tf.train.AdamOptimizer(0.05).minimize(error)

        x_in, y_star = make_data(n)

        saver = tf.train.Saver()
        feed_dict = {x: x_in, y_act: y_star}
        if restore:
            print(“Loading variables from ‘%s’.” % checkpoint_file)
            saver.restore(session, checkpoint_file)
            y_i, m_i, b_i = session.run([y, m, b], feed_dict)
        else:
            init = tf.initialize_all_variables()
            session.run(init)
            for i in range(iter_scale * n):
                y_i, m_i, b_i, _ = session.run([y, m, b, train_step],
                                               feed_dict)
                err = np.linalg.norm(y_i – y_star, 2)
                if log_progress:
                    print(“%3d | %.4f %.4f %.4e” % (i, m_i, b_i, err))

            print(“Done training! m = %f, b = %f, err = %e, iter = %d”
                  % (m_i, b_i, err, i))
            if checkpoint_file is not None:
                print(“Saving variables to ‘%s’.” % checkpoint_file)
                saver.save(session, checkpoint_file)

        print(”      x: %s” % x_in)
        print(“Trained: %s” % y_i)
        print(” Actual: %s” % y_star)

A Simple Predictive Model in TensorFlow

In my previous post I provided a simple introduction to TensorFlow. In this post I’d like to take the next step and build a predictive model so I can highlight some key TensorFlow concepts.

This model will fit a line y = m * x + b to a series of points (x_i, y_i). This code is not the best way fit a line – it’s just an example. In our code, we’ll generate points with small random deviations from a line with known slope and intercept. Our test will be to see if we can recover these known values using TensorFlow. Here is a picture of our training data:

TensorFlowData

My last post explained that there are often four phases to TensorFlow programs: creating a model, getting the input data, running the model, and processing the output. In our model we want to find a slope m and intercept b that best fits our input data. What do we mean by “best fit”? We mean values m, b that give the smallest sum of squared error between the predicted and actual y_i. The way we do this in TensorFlow is create this expression, and then repeatedly run a Session that adjusts the values of m and b to make the error smaller using an optimizer.

There are two functions below: one to generate test data, and another to create and run the TensorFlow model:

def make_data(n):
    np.random.seed(42) # To ensure same data for multiple runs
    x = 2.0 * np.array(range(n))
    y = 1.0 + 3.0 * (np.array(range(n)) + 0.1 * (np.random.rand(n) – 0.5))
    return x, y

def fit_line(n=1, log_progress=False):
    with tf.Session() as session:
        x = tf.placeholder(tf.float32, [n], name=’x’)
        y = tf.placeholder(tf.float32, [n], name=’y’)
        m = tf.Variable([1.0], trainable=True) # training variable: slope
        b = tf.Variable([1.0], trainable=True) # training variable: intercept
        y = tf.add(tf.mul(m, x), b) # fit y_i = m * x_i + b

        # actual values (for training)
        y_act = tf.placeholder(tf.float32, [n], name=’y_’)

        # minimize sum of squared error between trained and actual.
        error = tf.sqrt((y – y_act) * (y – y_act))
        # train_step = tf.train.GradientDescentOptimizer(0.01).minimize(error)
        train_step = tf.train.AdamOptimizer(0.05).minimize(error)

        # generate input and output data with a little random noise.
        x_in, y_star = make_data(n)

        init = tf.initialize_all_variables()
        session.run(init)
        feed_dict = {x: x_in, y_act: y_star}
        for i in range(30 * n):
            y_i, m_i, b_i, _ = session.run([y, m, b, train_step], feed_dict)
            err = np.linalg.norm(y_i – y_star, 2)
            if log_progress:
                print(“%3d | %.4f %.4f %.4e” % (i, m_i, b_i, err))

        print(“Done! m = %f, b = %f, err = %e, iterations = %d”
              % (m_i, b_i, err, i))
        print(”      x: %s” % x_in)
        print(“Trained: %s” % y_i)
        print(” Actual: %s” % y_star)

Hopefully make_data is fairly clear. The function fit_line takes two input arguments:

  • n: the number of points to generate
  • log_progress: whether to display TensorFlow’s progress in finding the right slope m and intercept b.

After we create a TensorFlow session, our next two steps are to create placeholders for our input x and output y, similar to our first example. These are both Tensors of size n since that’s how many data points we have. The next line creates a TensorFlow variable to represent the slope m. A variable is a value that is retained between calls to Session.run(). If the value is an input or an output from the model, we don’t want a variable – we want a placeholder. If the value remains constant during our computation, we don’t want a variable – we want a tf.constant. We want variables when we want TensorFlow to train the value based on some criteria in our model. Notice when we create the Variable objects we supply initial values for the variable, and a “trainable” flag. Providing TensorFlow with initial values for a variable informs TensorFlow of the dimensionality and type – in our case m and b are single dimensional Tensors of size 1, but they could just as easily be multidimensional and/or integer.

The next expression assigns y the value m * x. We want to do this on an elementwise basis: we have a series of points (x_i, y_i) that we want to train against scalar values m and b. The TensorFlow functions add and mul operate on their arguments on an elementwise basis with broadcasting: using + and * would not have the intended effect.

Now that we have a model for our predicted values y, we want to compute the sum of squared error. This is accomplished using Tensor arithmetic and tf.sqrt. Here is a picture of our computational graph to this point:

TensorFlowGraph

Here comes the next new concept: optimization. We have specified our model, and the error in the model, but now we want TensorFlow to find the best possible values for m and b given the error expression. Optimization is carried out in TensorFlow by repeatedly calling Session.run() with an Optimization object “fed” as input. An Optimization carries out logic that adjusts the variables in a way that will hopefully improve the value of the error expression. In our case we will use an AdamOptimizer object. The parameter to AdamOptimizer controls how much the optimizer adjusts the variables on each call – larger is more aggressive. All Optimizer objects have a minimize() method that lets you pass in the expression you want to optimize. You can see that the train_step, the value returned by the AdamOptimizer, is passed into the Session.run() call.

Let’s explain briefly how the optimization works. A single call to the Optimizer does not adjust variables all the way to their optimal values; a call represents a single step towards an optimum. If you want to learn more about the specific logic that AdamOptimizer uses during a step, look at the TensorFlow documentation, or if you are ambitious, read the paper. The key ingredient is the gradient of the variables that you are trying to optimize. TensorFlow computes gradients by creating computational graph elements for the gradient expressions and evaluating them – have a look at this stackoverflow response for details. Again, TensorFlow can do this because it has a symbolic representation of the expressions you’re trying to compute (it’s in the picture above). Since a call to an optimizer is a single step, Session.run() must be called repeatedly in a loop to get suitable values. In the picture below I have plotted the values of the error (MSE) and m (Slope) expressions for the first 50 steps.

TensorFlowConvergence

If you have past experience with optimization you may wonder why I am running the optimizer for a fixed number of steps rather than having a more sensible stopping criterion. The answer is I am keeping it simple – feel free to extend the example if you like. You may also observe that this code is not very efficient or accurate in fitting points in a line. That’s not TensorFlow’s fault – it’s my fault for writing such a contrived example. In many real world examples the actual computational graph represents a complicated neural network.

Much of the remaining code to create the input and output arrays and call session.run should be familiar to you if you worked through my first post. When we complete our loop of Session.run() calls we print out our final slope and intercept, as well as the trained and actual y values.

With luck, I will be able to continue this series to use TensorFlow to build and run a neural network to solve a problem that is closer to a real-world scenario.

An Introduction To TensorFlow

This post walks through a simple Google TensorFlow example.

Getting Started

TensorFlow is an open source library for analytics. It’s particularly useful for building deep learning systems for predictive models involving natural language processing, audio, and images.

The TensorFlow site provides instructions for downloading and installing the package. Loosely speaking, here’s what you need to do to get started on a Windows machine:

  • Get comfortable with Python.
  • Install docker.
  • Run the “development” image for TensorFlow. The development images contains all of the samples on the TensorFlow site. The command I used was
    docker run -i -t gcr.io/tensorflow/tensorflow:latest-devel /bin/bash

Running the development image “latest-devel” will provide you with code for all of the examples on the TensorFlow site. You don’t strictly speaking have to use docker to get started with TensorFlow, but that’s what worked for me.

A Simple TensorFlow Program

I think the TensorFlow tutorials are too complicated for a beginner, so I’m going to present a simple TensorFlow example that takes input x, adds one to it, and stores it in an output array y. Many TensorFlow programs, including this one, have four distinct phases:

  1. Create TensorFlow objects that model the calculation you want to carry out,
  2. Get the input data for the model,
  3. Run the model using the input data,
  4. Do something with the output.

I have marked these phases in the code below.

import numpy as np
import tensorflow as tf
import math

def add_one():
with tf.Session() as session:
# (1)
x = tf.placeholder(tf.float32, [1], name=’x’) # fed as input below
y = tf.placeholder(tf.float32, [1], name=’y’) # fetched as output below
b = tf.constant(1.0)
y = x + b # here is our ‘model’: add one to the input.

        x_in = [2] # (2)
y_final = session.run([y], {x: x_in}) # (3)
print(y_final) # (4)

The first line in add_one creates a TensorFlow Session object. Sessions contain “computational graphs” that represent calculations to be carried out. In our example, we want to create a computational graph that represents adding the constant 1.0 to an input array x. Here is a picture:

TensorFlow1

The next two lines create “placeholders” x and y. A placeholder is an interface between a computational graph element and your data. Placeholders can represent input or output, and in my case  x represents the value to send in, and y represents the result. The second argument of the placeholder function is the shape of the placeholder, which is a single dimensional Tensor with one entry. You can also provide a name, which is useful for debugging purposes.

The next line creates the constant b using tf.constant. As we will see in future examples, there are other TensorFlow functions for addition, multiplication, and so on. Using these helper functions you can assemble a very wide range of functions that involve inputs, outputs, and other intermediate values. In this example, we’re keeping it very simple.

The next line, y = x + b, is the computational model we want TensorFlow to calculate. This line does not actually compute anything, even though it looks like it should. It simply creates data structures (called “graph elements”) that represent the addition of x and b, and the assignment of the result to the placeholder y. Each of the items in my picture above is a graph element. These graph elements are processed by the TensorFlow engine when Session.run() is called. Part of the magic of TensorFlow is to efficiently carry out graph element evaluation, even for very large and complicated graphs.

Now that the model is created, we turn to assembling the input and running the model. Our model has one input x, so we create a list x_in that will be associated with the placeholder x. If you think of a TensorFlow model as a function in your favorite programming language, the placeholders are the arguments. Here we want to “pass” x_in as the value for the “parameter” x. This is what happens in the session.run() call. The first argument is a list of graph elements that you would like TensorFlow to evaluate. In this case, we’re interested in evaluating the output placeholder y, so that’s what we pass in. Session.run will return an output value for each graph element that you pass in as the first argument, and the value will correspond to the evaluated value for that element. In English this means that y_final is going to be an array that has the result: x + 1. The second argument to run is a dictionary that specifies the values for input placeholders. This is where we associate the input array x_in with the placeholder x.

When Session.run() is called, TensorFlow will determine which elements of the computational graph need to be evaluated based on what you’ve passed in. It will then carry out the computations and then bind result values accordingly. The final line prints out the resulting array.

This example is one of the simplest ones I could think of that includes all four key phases. It’s missing many of the core features of TensorFlow! In particular, machine learning models usually train certain values to predict or classify something, but we’re not doing that here. In my next post I will walk through another example shows how to train parameters in a simple predictive model.