Optimization Methods for Large-Scale Machine Learning

Hey, so I mostly read a 93 page paper. The topic is a worthy one: optimization methods for large-scale machine learning. Deep learning powers best in class speech, image, and text intelligence on the web today, and deep learning is in turn powered by optimization. I will summarize “Optimization Methods for Large-Scale Machine Learning” by Bottou, Curtis, and Nocedal over the next few posts because it provides a useful operations research-centered evaluation of an important area in machine learning. In general, machine learning practitioners don’t know shit about operations research, and vice versa. This paper, along with work of Stephen Wright at Wisconsin (check out this talk), will certainly help to remedy this situation. I also predict that this paper will spur new advances in deep learning.

Here goes, and remember that I’m trying to summarize 93 pages!

The title of the paper is quite broad, but the focus is primarily on the use of the stochastic gradient descent (SGD) method (and variants) in deep learning applications. If you don’t have any previous experience with these topics, this series may not be for you, but I will try to summarize anyway. The term “deep learning” describes a range of machine learning algorithms that are used to classify or predict. Deep learning is primarily distinguished by:

  1. The use of much more input data than is typical for machine learning,
  2. Models that have many internal layers of data manipulation and transformation,
  3. A reliance on parallel and GPU processing.

Training a deep learning algorithm involves finding model parameters that produce effective predictors or classifiers. Finding the values of variables that produce the best results for a particular objective (or “goal”) is the job of optimization. The stochastic gradient descent method is so-named because it repeatedly takes steps in the direction of steepest descent, which is defined by the gradient of the objective we want to optimize. If we think of the objective function as a hilly field, then the gradient always points in the steepest direction down, when we examine the immediate area around where we stand. The “stochastic” part of SGD applies because rather than looking at all of the samples over which the objective function is defined, we only look one (or a few) randomly determined sample. As compared to using the full gradient, this approach takes less time to take a single step, but the step is possibly less effective in improving the value of our objective function. In theory and practice we can establish that often the tradeoff is worth it. Characterizing these tradeoffs more concretely is one of the objectives of the paper. As as supplement to the paper and this post, check out this great post by Sebastian Ruder for an overview of gradient descent algorithms for machine learning.

In my next post in this series, I will cover Section 2 which describes the selection of a prediction function that is useful for modeling but practical for model training at scale.

Updated 8/2/2016 to correctly summarize SGD. Thanks J-F!

Notes on Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Here are my notes on the influential paper “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center”. My notes pertain only to the original the paper itself, and not improvements or changes in the theory or implementation of Apache Mesos since 2010.

Mesos is “a platform for sharing commodity clusters between multiple diverse cluster computing frameworks”. A framework is a “software system that manages and executes one or more jobs on a cluster”, for example Hadoop, Spark, or MPI. Frameworks are responsible for running tasks, for example running a machine learning algorithm. When multiple frameworks run on a cluster without a platform like Mesos, there are often unintended consequences, for example one framework may grab resources for a job that would gave been better suited for another framework’s job.

Multiple frameworks often run on a single cluster because different frameworks are best suited for different kinds of computational workloads. For example, Spark for iterative workloads on shared data, or Flink for streaming workloads. Mesos shares cluster resources across frameworks with the goals of high utilization and efficient data sharing. Cluster resources can be shared without a framework, for example by simply partitioning the nodes in cluster to frameworks, or by allocating virtual machines to each framework, but utilization and efficiency suffer.

Determining how to share cluster resources between frameworks is especially challenging because individual frameworks already manage resources themselves. Cluster managers must either work with, augment, or replace these framework capabilities. For example, Hadoop’s Fair Scheduler assigns cluster nodes to jobs so that all jobs “get, on average, an equal share of resources”. Mesos does not seek to replace framework schedulers; it seeks to harmonize them so that total cluster utilization and efficiency is maximized, even though framework schedulers are unaware of each other’s existence. Mesos does this in a non-intrusive way by adopting a two phase approach:

  1. Mesos decides how many resources to offer each framework,
  2. Frameworks decide which resources to accept and which tasks should run on them (using their own scheduler).

There are several advantages to this approach:

  • Frameworks can continue to use their own schedulers.
  • Mesos can accommodate newly developed frameworks.
  • The Mesos implementation itself can be kept simple (since concerns are separated).
  • Mesos is scalable, because Mesos does not attempt to compute a global schedule for all tasks across all frameworks.

The primary disadvantage is that Mesos is denied the ability to globally optimize task allocation across frameworks.

Figures 2 and 3 in the paper are useful visual depictions of the Mesos offer process. Here is a simplified architectural diagram:


There are two components to the Mesos architecture: masters and workers. Masters are responsible for issuing offers to workers and interacting with workers and framework schedulers. Workers are responsible for running tasks on cluster resources.

The two phase approach for task scheduling and execution is summarized in Figure 3 in the original paper. The process begins with workers reporting available resources. You can think of these “reports” as tuples (w_i, r_1, r_2, …r_n) where w_i identifies the worker, and r_1, …, r_n represent resource attributes. For example, r_1 may represent the number of CPUs, r_2 may represent memory, r_3 the presence or absence of a GPU, and so on. Armed with the knowledge of the capabilities of the cluster, the master can begin issuing offers to framework schedulers. An offer is also a tuple (w_i, r_1, …, r_n) – it’s a record that represents resources that a scheduler can choose to use. At this point, the framework scheduler can choose to either accept or reject the offer. Frameworks decide to accept or reject based on the pending list of tasks that need to be executed by the framework. There are legitimate reasons for rejecting offers even if tasks are pending; for example pending tasks may require a GPU but the offer does not include one. When an offer is accepted, the framework scheduler sends back a list of tuples (t_i, u_1, …, u_n), with t_i identifying a task to be executed, and u_i representing the resources that will be utilized by the task when it is executed. The master can then send the tasks to workers for execution. It also “adjusts the books” so that future resource offers will account for the running tasks. When tasks are completed, the master is notified so that it can then account for these newly available resources.

It might be helpful to compare this process to home mortgages. In this world, Mesos plays the role of a mortgage broker. A Mesos offer represents the terms of a mortgage, offered to lenders (schedulers). An approval constitutes an agreement by a lender to fund the loan.

As the paper notes, the ability for frameworks to reject offers is an important extensibility point that allows for frameworks to account for its own considerations, without burdening Mesos with the details.

The process of brokering offers and launching tasks is the heart of Mesos. There are a number of important additional considerations, of course: how to handle long running or “zombie” tasks, task isolation, robustness, and fault tolerance. Mesos relies on existing framework or cluster node mechanisms to handle these considerations when possible, and adds simple policies to Mesos itself when this is not possible. This all falls under the general design principle of keeping Mesos simple. These mechanisms are described in Section 3 of the paper. The details are interesting but are not fundamental to understanding the design.

As noted earlier, Mesos takes a decentralized approach: offers are made to frameworks, and the frameworks schedule accepted offers. Frameworks are (implicitly) incented by Mesos to adopt certain policies in order to improve throughput. These incentives are given in Section 4.4:

  • Uses short tasks,
  • Uses resources as soon as they are allocated,
  • Ability to scale down,
  • Does not accept unknown resources.

Frameworks that follow these guidelines yield high utilization when managed by Mesos.

Mesos does not claim to be the only viable solution for cluster resource management. For example, in a traditional HPC-style cluster environment with specialized, largely homogeneous hardware and fixed-size jobs, centralized scheduling may be more appropriate. In a grid computing environment where geographically separate and separately administered resources are marshaled together for a computation (like me and my colleagues did for the famed “nug30” problem back in 2000), additional layers may need to sit on top of a framework such as Mesos.

Nonetheless for many modern cluster workloads, especially those for large scale machine learning, Mesos is an excellent choice.

Finding Optimal State Capitol Tours on the Cloud with NEOS

My last article showed you how to find an optimal tour of all 48 continental US state capitols using operations research. I used the Python API of the popular Gurobi solver to create and solve a traveling salesman problem (TSP) model in a few seconds.

In this post I want to show you how to use Concorde, the world’s best TSP solver for free on the cloud using the NEOS optimization service. In less than 100 lines of Python code, you can find the best tour. Here it is:


Using NEOS is pretty easy. You need to do three things to solve an optimization problem:

  1. Create a NEOS account.
  2. Create an input file for the problem you want to solve.
  3. Give the input file to NEOS, either through their web interface, or by calling an API.

Let’s walk through those steps for the state capitol problem. If you just want to skip to the punchline, here is my code.

Concorde requires a problem specification in the TSPLIB format. This is a text based format where we specify the distances between all pairs of cities. Recall that Randy Olson found the distances between all state capitols using the Google Maps API in this post. Here is a file with this information. Using the distances, I created a TSPLIB input file with the distance matrix – here it is.

The next step is to submit the file to NEOS. Using the xmlrpc Python module, I wrote a simple wrapper to submit TSPLIB files to NEOS. The NEOS submission is an XML file that wraps the contents of the TSPLIB data, and also tells NEOS that we want to use the Concorde solver. The XML file is given to NEOS via an XML-RPC call. NEOS returns the results as a string – the end of the string contains the optimal tour. Here is the body of the primary Python function that carries out these steps:

def solve_tsp_neos_concorde(dist):
xml = make_neos_concorde(dist)
neos = NeosClient()
result = neos.run(xml)
return tour_from_neos_concorde_result(result)

When I run this code, I obtain the same tour as in my initial post. Hooray! You can also extend my code (which is based on NEOS documentation) to solve many other kinds of optimization models.

Computing Optimal Road Trips Using Operations Research

Randy Olson recently wrote about an approach for finding a road trip that visits all 48 continental US state capitols. Randy’s approach involves genetic algorithms and is accompanied by some very effective visualizations. Further, he examines how the length of these road trips varies as the number of states visited increases. While the trips shown in Randy’s post are very good, they aren’t quite optimal. In this post I’d like to show how you can find the shortest possible road trips in Python using the magic science of operations research. I suggest you read Randy’s post first to get up to speed!

An “optimal road trip” is an ordering of the 48 state capitols that results in the smallest possible driving distance as determined by Google Maps. This is an example of what is known as the Traveling Salesman Problem (TSP). In his work, Randy has made a couple of simplifying assumptions that I will also follow:

  • The driving distance from capitol A to capitol B is assumed to be the same as from B to A. We know this isn’t 100% true because of the way roads work. But close enough.
  • We’re optimizing driving distance, not driving time. We could easily optimize “average” driving time using data provided by Google. Optimizing expected driving time given a specified road trip start date and time is actually pretty complicated given that we don’t know what the future will bring: traffic jams, road closures, storms, and so on..

These aren’t “bugs”, just simplifying assumptions. Randy used the Google Maps API to get driving distances between state capitols – here’s the data file. Note that Google Maps returns distances in kilometers so you’ll need to convert to miles if that’s your preference.

Randy’s approach to solve this problem was to use a genetic algorithm. Roughly speaking, a genetic algorithm starts with a whole bunch of randomly generated tours, computes their total distances, and repeatedly combines and modifies them to find better solutions. Following the analogy to genetics, tours with smaller total distances are more likely to be matched up with other fit tours to make brand new baby tours. As Randy showed in his post, within 20 minutes his genetic algorithm is able to produce a 48 state tour with a total length of 13,310 miles.

It turns out that we can do better. An inaccuracy in Randy’s otherwise great article is the claim that it’s impossible to find optimal tours for problems like these. You don’t have to look at all possible 48 city road trips to find the best one – read this post by Michael Trick. What we can do instead is rely on the insanely effective field of operations research and its body of 50+ years of work. In an operations research approach, we build a model for our problem based on the decisions we want to make, the overall objective we have in mind, and restrictions and constraints on what constitutes a solution. This model is then fed to operations research software (optimization solvers) that use highly tuned algorithms to find provably optimal solutions. The algorithms implemented solvers rule out vast swaths of possible tours in a brutally efficient manner, making the seemingly impossible routine.

The best-in-class TSP solver is Concorde, and is based on an operations research approach. You don’t need Concorde to solve this TSP – a 48 city road trip is puny by operations research standards. I have chosen to use the Gurobi solver because it is a very powerful solver that includes an easy-to-use Python API, and it has a cloud version. Gurobi even includes an example that covers this very problem! The key to their model is to define a yes-no decision variable for each pair of state capitols. A “yes” value for a decision variable indicates that pair of cities is on the optimal tour. The model also needs to specify the rules for what it means to be an optimal tour:

  • The shorter the total distance of the tour (which is determined by the distances between all of the “yes” pairs of cities), the better. This is the objective (or goal) that we seek to optimize.
  • The traveller will arrive at each capitol from another capitol, and will leave for another capitol. In other words, exactly two decision variables involving a capitol will be “yes”.
  • Tours don’t have cycles: visiting Boise more than once is not allowed.

(Sidebar: If you are not used to building optimization models then the code probably won’t make much sense and you may have no idea what the hell the Gurobi tutorial is talking about. No offense, Gurobi, the tutorial is very well written! The challenge of writing optimization models, which involves writing down precise mathematical relationships involving the decision variables you want solved, is what prevents many computer scientists and data scientists from using the fruits of operations research more often. This is especially the case when the models for classical problems such as the TSP require things like “lazy constraints” that even someone experienced with operations research may not be familiar with. I wrote about this in more detail here. On the other hand, there are a lot of great resources and tutorials out there and it’s simply good practice to rely on proven approaches that provide efficient, accurate results. This is what good engineers do. Anyway, the point is that you can easily steal Gurobi’s sample for this problem and replace their “points” variable with the distances from the data file above. If I wanted to do this with an open source package, or with Concorde itself I could have done it that way too.)

My code, based on Gurobi’s example, is able to find a tour with a total length of 12930 miles, about 380 miles shorter than the original tour. What’s more, it takes seconds to find the answer. Here is my Python code. Here is the tour – click here to explore it interactively.


A text file with the tour is here and a GPX file of the tour is here courtesy of gpsvisualizer.com. This optimal tour is very close to the one the genetic algorithm came up with. Here is a screenshot for reference:


An interesting twist is that Randy extends the problem to consider both the driving distance and the number of states visited. If we are willing to do a tour of, say, 10 states, then clearly the total distance for the tour will be much shorter than a 48 state tour. Randy has a nice animation showing tours of differing numbers of states, as well as a chart that plots the number of states visited against the estimated driving time. This curve is called the efficient frontier – you sometimes see similar curves in financial models.

The modified problem of finding the shortest tour involving K of 48 total state capitols can also be solved by Gurobi. I extended the optimization model slightly:

  • Introduce new yes-no decision variables for each capitol: “yes” if the capitol is one of the lucky K to be visited.
  • Exactly K of the new decision variables should be “yes”.
  • Fix up the original model to make sure we don’t worry about the other N-K cities not on our mini tour.

(I also had to modify the model because I am running on the cloud and the “lazy constraints” mentioned in Gurobi’s TSP writeup don’t work in the cloud version of Gurobi.)

With this new code in place I can call it for K=3…47 and get this optimal efficient frontier curve:


The distances and tours for all of these mini tours are given here.

What have we learned here? In about 200 lines of Python code we were able to efficiently find provably optimal solutions for the original road trip problem, as well as the “pareto optimization” extension. If you’re a data scientist, get familiar with operations research principles because it will certainly pay off!

The Logit and Sigmoid Functions

If you mess around with machine learning for long enough, you’ll eventually run into the logit and sigmoid functions. These are useful functions when you are working with probabilities or trying to classify data.

Given a probability p, the corresponding odds are calculated as p / (1 – p). For example if p=0.25, the odds are 3 to 1: 0.25/0.75 = 3.

The logit function is simply the logarithm of the odds: logit(x) = log(x / (1 – x)). Here is a plot of the logit function:


The value of the logit function heads towards infinity as p approaches 1 and towards negative infinity as it approaches 0.

The logit function is useful in analytics because it maps probabilities (which are values in the range [0, 1]) to the full range of real numbers. In particular, if you are working with “yes-no” (binary) inputs it can be useful to transform them into real-valued quantities prior to modeling. This is essentially what happens in logistic regression.

The inverse of the logit function is the sigmoid function. That is, if you have a probability p, sigmoid(logit(p)) = p. The sigmoid function maps arbitrary real values back to the range [0, 1]. The larger the value, the closer to 1 you’ll get.

The formula for the sigmoid function is σ(x) = 1/(1 + exp(-x)). Here is a plot of the function:


The sigmoid might be useful if you want to transform a real valued variable into something that represents a probability. This sometimes happens at the end of a classification process. (As Wikipedia and other sources note, the term “sigmoid function” is used to refer to a class of functions with S-shaped curves. In most machine learning contexts, “sigmoid” usually refers specifically to the function described above.)

There are other functions that map probabilities to reals (and vice-versa), so what’s so special about the logit and sigmoid? One reason is that the logit function has the nice connection to odds described at the beginning of the article. A second is that the gradients of the logit and sigmoid are simple to calculate (try it and see). The reason why this is important is that many optimization and machine learning techniques make use of gradients, for example when estimating parameters for a neural network.

The biggest drawback of the sigmoid function for many analytics practitioners is the so-called “vanishing gradient” problem. You can read more about this problem here (and here), but the point is that this problem pertains not only to the sigmoid function, but any function that squeezes real values to the [0, 1] range. In neural networks, where the vanishing gradient problem is particularly annoying, it is often a good idea to seek alternatives as suggested here.

Checkpointing and Reusing TensorFlow Models

In my last two posts I introduced TensorFlow and wrote a very simple predictive model. In doing so I introduced many of the key concepts of TensorFlow:

  • The Session, the core of the TensorFlow object model,
  • Computational graphs and some of their elements: placeholders, variables, and Tensors,
  • Training models by iteratively calling Session.run on Optimization objects.

In this post I want to show you can save and re-use the results of your TensorFlow models. As we discussed last time, training a model means finding variable values that suit a particular purpose, for example finding a slope and intercept that defines a line that best fits a series of points. Training a model can be computationally expensive because we have to search for the best variable values through optimization. Suppose we want to use the results of this trained model over and over again, but without re-training the model each time. You can do this in TensorFlow using the Saver object.

A Saver object can save and restore the values of TensorFlow Variables. A typical scenario has three steps:

  1. Creating a Saver and telling the Saver which variables you want to save,
  2. Save the variables to a file,
  3. Restore the variables from a file when they are needed.

A Saver deals only with Variables. It does not work with placeholders, sessions, expressions, or any other kind of TensorFlow object. Here is a simple example that saves and restores two variables:

def save(checkpoint_file=’hello.chk’):
    with tf.Session() as session:
        x = tf.Variable([42.0, 42.1, 42.3], name=’x’)
        y = tf.Variable([[1.0, 2.0], [3.0, 4.0]], name=’y’)
        not_saved = tf.Variable([-1, -2], name=’not_saved’)

        saver = tf.train.Saver([x, y])
        saver.save(session, checkpoint_file)

def restore(checkpoint_file=’hello.chk’):
    x = tf.Variable(-1.0, validate_shape=False, name=’x’)
    y = tf.Variable(-1.0, validate_shape=False, name=’y’)
    with tf.Session() as session:
        saver = tf.train.Saver()
        saver.restore(session, checkpoint_file)

def reset():

Try calling save(), reset() and then restore(), and compare the outputs to verify everything worked out. When you create a Saver, you should specify a list (or dictionary) of Variable objects you wish to save. (If you don’t, TensorFlow will assume you are interested in all the variables in your current session.) The shapes and values of these values will be stored in binary format when you call the save() method, and retrieved on restore(). Notice in my last function, when I create x and y, I give dummy values and say validate_shape=False. This is because I want the saver to determine the values and shapes when the variables are restored. If you’re wondering why the reset() function is there, remember that computational graphs are associated with Sessions. I want to “clear out” the state of the Session so I don’t have multiple x and y objects floating around as we call save and restore().

When you use Saver in real models, you should keep a couple of facts in mind:

  1. If you want to do anything useful with the Variables you restore, you may need to recreate the rest of the computational graph.
  2. The computational graph that you use with restored Variables need not be the same as the one that you used when saving. That can be useful!
  3. Saver has additional methods that can be helpful if your computation spans machines, or if you want to avoid overwriting old checkpoints on successive calls to save().

At the end of this post I have include a modification of my line fitting example to optionally save and restore model results. I’ve highlighted the interesting parts. You can call it like this:

fit_line(5, checkpoint_file=’vars.chk’)
fit_line(5, checkpoint_file=’vars.chk’, restore=True)

With this version, I could easily “score” new data points x using my trained model.

def fit_line(n=1, log_progress=False, iter_scale=200,
             restore=False, checkpoint_file=None):
    with tf.Session() as session:
        x = tf.placeholder(tf.float32, [n], name=’x’)
        y = tf.placeholder(tf.float32, [n], name=’y’)
        m = tf.Variable([1.0], name=’m’)
        b = tf.Variable([1.0], name=’b’)
        y = tf.add(tf.mul(m, x), b) # fit y_i = m * x_i + b
        y_act = tf.placeholder(tf.float32, [n], name=’y_’)

        # minimize sum of squared error between trained and actual.
        error = tf.sqrt((y – y_act) * (y – y_act))
        train_step = tf.train.AdamOptimizer(0.05).minimize(error)

        x_in, y_star = make_data(n)

        saver = tf.train.Saver()
        feed_dict = {x: x_in, y_act: y_star}
        if restore:
            print(“Loading variables from ‘%s’.” % checkpoint_file)
            saver.restore(session, checkpoint_file)
            y_i, m_i, b_i = session.run([y, m, b], feed_dict)
            init = tf.initialize_all_variables()
            for i in range(iter_scale * n):
                y_i, m_i, b_i, _ = session.run([y, m, b, train_step],
                err = np.linalg.norm(y_i – y_star, 2)
                if log_progress:
                    print(“%3d | %.4f %.4f %.4e” % (i, m_i, b_i, err))

            print(“Done training! m = %f, b = %f, err = %e, iter = %d”
                  % (m_i, b_i, err, i))
            if checkpoint_file is not None:
                print(“Saving variables to ‘%s’.” % checkpoint_file)
                saver.save(session, checkpoint_file)

        print(”      x: %s” % x_in)
        print(“Trained: %s” % y_i)
        print(” Actual: %s” % y_star)

A Simple Predictive Model in TensorFlow

In my previous post I provided a simple introduction to TensorFlow. In this post I’d like to take the next step and build a predictive model so I can highlight some key TensorFlow concepts.

This model will fit a line y = m * x + b to a series of points (x_i, y_i). This code is not the best way fit a line – it’s just an example. In our code, we’ll generate points with small random deviations from a line with known slope and intercept. Our test will be to see if we can recover these known values using TensorFlow. Here is a picture of our training data:


My last post explained that there are often four phases to TensorFlow programs: creating a model, getting the input data, running the model, and processing the output. In our model we want to find a slope m and intercept b that best fits our input data. What do we mean by “best fit”? We mean values m, b that give the smallest sum of squared error between the predicted and actual y_i. The way we do this in TensorFlow is create this expression, and then repeatedly run a Session that adjusts the values of m and b to make the error smaller using an optimizer.

There are two functions below: one to generate test data, and another to create and run the TensorFlow model:

def make_data(n):
    np.random.seed(42) # To ensure same data for multiple runs
    x = 2.0 * np.array(range(n))
    y = 1.0 + 3.0 * (np.array(range(n)) + 0.1 * (np.random.rand(n) – 0.5))
    return x, y

def fit_line(n=1, log_progress=False):
    with tf.Session() as session:
        x = tf.placeholder(tf.float32, [n], name=’x’)
        y = tf.placeholder(tf.float32, [n], name=’y’)
        m = tf.Variable([1.0], trainable=True) # training variable: slope
        b = tf.Variable([1.0], trainable=True) # training variable: intercept
        y = tf.add(tf.mul(m, x), b) # fit y_i = m * x_i + b

        # actual values (for training)
        y_act = tf.placeholder(tf.float32, [n], name=’y_’)

        # minimize sum of squared error between trained and actual.
        error = tf.sqrt((y – y_act) * (y – y_act))
        # train_step = tf.train.GradientDescentOptimizer(0.01).minimize(error)
        train_step = tf.train.AdamOptimizer(0.05).minimize(error)

        # generate input and output data with a little random noise.
        x_in, y_star = make_data(n)

        init = tf.initialize_all_variables()
        feed_dict = {x: x_in, y_act: y_star}
        for i in range(30 * n):
            y_i, m_i, b_i, _ = session.run([y, m, b, train_step], feed_dict)
            err = np.linalg.norm(y_i – y_star, 2)
            if log_progress:
                print(“%3d | %.4f %.4f %.4e” % (i, m_i, b_i, err))

        print(“Done! m = %f, b = %f, err = %e, iterations = %d”
              % (m_i, b_i, err, i))
        print(”      x: %s” % x_in)
        print(“Trained: %s” % y_i)
        print(” Actual: %s” % y_star)

Hopefully make_data is fairly clear. The function fit_line takes two input arguments:

  • n: the number of points to generate
  • log_progress: whether to display TensorFlow’s progress in finding the right slope m and intercept b.

After we create a TensorFlow session, our next two steps are to create placeholders for our input x and output y, similar to our first example. These are both Tensors of size n since that’s how many data points we have. The next line creates a TensorFlow variable to represent the slope m. A variable is a value that is retained between calls to Session.run(). If the value is an input or an output from the model, we don’t want a variable – we want a placeholder. If the value remains constant during our computation, we don’t want a variable – we want a tf.constant. We want variables when we want TensorFlow to train the value based on some criteria in our model. Notice when we create the Variable objects we supply initial values for the variable, and a “trainable” flag. Providing TensorFlow with initial values for a variable informs TensorFlow of the dimensionality and type – in our case m and b are single dimensional Tensors of size 1, but they could just as easily be multidimensional and/or integer.

The next expression assigns y the value m * x. We want to do this on an elementwise basis: we have a series of points (x_i, y_i) that we want to train against scalar values m and b. The TensorFlow functions add and mul operate on their arguments on an elementwise basis with broadcasting: using + and * would not have the intended effect.

Now that we have a model for our predicted values y, we want to compute the sum of squared error. This is accomplished using Tensor arithmetic and tf.sqrt. Here is a picture of our computational graph to this point:


Here comes the next new concept: optimization. We have specified our model, and the error in the model, but now we want TensorFlow to find the best possible values for m and b given the error expression. Optimization is carried out in TensorFlow by repeatedly calling Session.run() with an Optimization object “fed” as input. An Optimization carries out logic that adjusts the variables in a way that will hopefully improve the value of the error expression. In our case we will use an AdamOptimizer object. The parameter to AdamOptimizer controls how much the optimizer adjusts the variables on each call – larger is more aggressive. All Optimizer objects have a minimize() method that lets you pass in the expression you want to optimize. You can see that the train_step, the value returned by the AdamOptimizer, is passed into the Session.run() call.

Let’s explain briefly how the optimization works. A single call to the Optimizer does not adjust variables all the way to their optimal values; a call represents a single step towards an optimum. If you want to learn more about the specific logic that AdamOptimizer uses during a step, look at the TensorFlow documentation, or if you are ambitious, read the paper. The key ingredient is the gradient of the variables that you are trying to optimize. TensorFlow computes gradients by creating computational graph elements for the gradient expressions and evaluating them – have a look at this stackoverflow response for details. Again, TensorFlow can do this because it has a symbolic representation of the expressions you’re trying to compute (it’s in the picture above). Since a call to an optimizer is a single step, Session.run() must be called repeatedly in a loop to get suitable values. In the picture below I have plotted the values of the error (MSE) and m (Slope) expressions for the first 50 steps.


If you have past experience with optimization you may wonder why I am running the optimizer for a fixed number of steps rather than having a more sensible stopping criterion. The answer is I am keeping it simple – feel free to extend the example if you like. You may also observe that this code is not very efficient or accurate in fitting points in a line. That’s not TensorFlow’s fault – it’s my fault for writing such a contrived example. In many real world examples the actual computational graph represents a complicated neural network.

Much of the remaining code to create the input and output arrays and call session.run should be familiar to you if you worked through my first post. When we complete our loop of Session.run() calls we print out our final slope and intercept, as well as the trained and actual y values.

With luck, I will be able to continue this series to use TensorFlow to build and run a neural network to solve a problem that is closer to a real-world scenario.