Chapter 6: Optimization
This supplementary video to Chapter 6 of The Rise of Artificial Intelligence presents the main concepts related to optimization, which are illustrated algorithmically through the magic square example and the complex business problem of distributing cars to auction sites. The presentation concludes with an overview of additional topics such as dynamic environments, multi-objective optimization, risk and variability, and global optimization.
Some of material in this video is based on a complex business problem that's used as a running example. The following article provides a full explanation of this problem as well as its complexities:
Michalewicz, Z., Schmidt, M., Michalewicz, M., and Chiriac, C., A Decision-Support System based on Computational Intelligence: A Case Study, IEEE Intelligent Systems, Vol.20 (4), July-August 2005, pp.44 – 49
Click here to download Chapters 1 & 2 of The Rise of Artificial Intelligence: Real-world applications for revenue and margin growth, and please contact us to request a soft copy of any other chapter of the book.
Transcript (reading time: 24:55 min)
Hi, this is Zbigniew Michalewicz, I am one of the co-authors of The Rise of Artificial Intelligence, and this is a supplementary video to Chapter 6 of the book. And the topic of today's presentation is optimization. Some of the presented material for this chapter, the previous two chapters and the following chapter is based on the complex business problem of distributing cars that I would keep using as the running example. The recommended reading is this brief article, which is also available from the same website as this presentation.
The outline of the topic is the following: We'll start with general remarks on optimization and then talk about the main concept behind optimization, and in particular, the three key components of any optimization problem. Then we'll look at one example that involves the use of Magic Squares. Then we'll move directly to the car distribution problem. We will talk in detail about optimization for distributing cars, how it can be done, and towards the end, the last few minutes, we will talk about a few additional topics related to optimization.
Many problems fall into the category of optimization problems, and every time we talk about the best solution among many possible solutions, we really talk about optimization. Every time you try to select something "best" out of many, it is basically optimization. And organizations want to optimize everything: minimize cost, maximize profit, minimize risk, maximize efficiency – making the best possible decision, whatever this decision might be. The fundamental concept behind any optimization discussion is the concept of the search space. If we enumerate all possible decisions, all possible plans, all possible scenarios, this is the "search space," and we have to select the best scenario to hit some objectives. For example, for the car distribution problem, each of these dots would represent a possible distribution of cars among all auction sites, and we'd like to select the "best" distribution. So the next important concept is quality measure. To do optimization, we have to have some quality measure by which we can tell the solutions apart. We should be able to say "this solution is better than that solution because the quality measure told me so."
Without quality measure, there is no way we can optimize anything because we can't distinguish between different solutions. So having this huge set of possible solutions, the quality measure is the key component that allows for optimization. An additional concept is that of feasibility. Any problem has many constraints – some of these constraints are satisfied and some of them are not satisfied. Thus, it's meaningful to talk about feasible solutions, which are those that satisfy problem-specific constraints, and unfeasible solutions are those that don't. Of course, there are a variety of grades, there are hard constraints and soft constraints, and not so hard constraints and a variety of them, but for this distinction is be more than sufficient for this presentation. So the optimization task can be formulated in very simple language: find the highest quality, feasible solution. So out of many possible solutions (and the number of solutions can easily go into the trillions, as we'll see) we're after the one solution that is not only feasible, but of the highest quality.
Now there is some complication and we'll talk about that towards the end of the presentation. Sometimes we have more than one quality measure. For example, we would like to maximize margin on a product, but at the same time we'd like to maximize the volume and we know that margin and volume work against each other. So suddenly, we might be comparing two different solutions. One solution is better on one dimension, the other solution is better on the other dimension.
And the question is, which of these two is "better"? And suddenly the answer is not that clear as it was when we had a single objective. But they are methods to deal with this situation in a very efficient way.
The key step in any optimization process is the transformation of one solution into another. Let's say we have a particular distribution of cars, a particular schedule, or a particular promotional plan – whatever it might be –and we evaluate the solution and know the quality measure. Then, during the optimization process, we transform this solution into a new solution, and we evaluate it. And either we do it on an individual basis by processing a single solution, and the optimizer is always paying attention to the single current solution, trying to improve this solution, or we may have several solutions, we say a population of solutions, and we are processing this population together. And sometimes we have two or three current solutions we use to create a new solution which is displayed here: Parts of these two solutions are glued together to generate a new solution. So how to generate this new solution or a set of new solutions from the current one is really the key concept in classifying many Artificial Intelligence-based optimization methods.
We talk about genetic algorithm and simulated annealing, tabu search, particle swarm and ant systems, which we discussed earlier. There are more similarities than differences between these methods, with the main similarity being that we are processing either a single or many solutions. And the main difference is the method of generating a new set of solutions that take advantage of all knowledge we have, because we don't have unlimited time to arrive at the optimum solution (which we'll talk about very, very soon).
So the three key components to remember (aside from the search space with a solution) for any optimization method are 1) representation of the solution: Whether the single dot in the search space is a schedule, a promotional plan, or a distribution of all cars, in this case 7,000 cars across all auctions. How will we represent this solution? With a matrix? Or collection of values of some sort? How to represent the solution is a key question for any optimizer. The second component is the objective. Once we have a particular solution, how can we evaluate the quality of this solution? How can we measure the quality? And the third component are constraints, which make some solutions feasible and other solutions not feasible. And we have to keep all three components in mind. Now, let's start with one particular example: Let's consider the problem of constructing a magic square, which is a square with numbers from one to, in this case, 100, because we're talking about a ten by ten square.
If the square is smaller, like 5 by 5, we would be dealing with 25 numbers, and our task is to rearrange these numbers, to swap them around in such a way that when we're done, the total of each row, each column and each diagonal is exactly the same. And if we think about this, the problem is far from trivial. There are many, many possible swaps we can make and whatever new arrangement we arrive at, we have to evaluate this solution.
We can find the total of each row, each column and each diagonal. And let me show some simple magic square, which is 3 by 3. In this square the numbers are 1, 2, 3, 4, 5, 6, 7, 8, 9, and we have totals for each column, each row, and each diagonal.
When we start optimizing, in a fraction of a second at generation nine, we get an error of zero, which mean that all totals for each row, column, and diagonal is exactly the same. And indeed this was an easy problem. But we can make this problem a little bit harder. We can look at 100 x 100 problem. This will take a significant amount of time to do by hand. Again, we can use the optimizer and at generation 235, in a couple of seconds, the error is again zero and the solution is found.
We can add these numbers anyway we like: diagonal wise, vertical, horizontal, and the total is always 505. Or we make a magic square of 20 by 20. Here we're talking about 400 variables at this moment. We are optimizing and at generation 663, again the error is zero, and the totals are the same at 2010. The problem is solved.
Or we can move to much higher dimensions, let's say 50 by 50; and if we go to 200, then we're talking about 40,000 variables! And we can also play with constraints. For example, in this 50 by 50 problem with 2,500 variables, we would like to have these numbers from 1 to 9 grouped together into one small square, just for fun, just to introduce some constraints, just to make this optimization task a little bit harder. So we run the optimizer and again, this time it will take a little bit longer.
Probably it was like seven seconds or so, at generation 1554, the error is zero. We can add these numbers left to right, and let me search for a constraint: yes, here it is 1, 2, 3, 4, 5, 6, 7, 8, 9. This is a feasible solution, the constraint is satisfied, and the problem is solved. So let's quit the Magic Square, and let's return to the presentation and talk about the complexity of this problem.
First of all, let's talk about some numbers. The number of seconds since the Big Bang, which is 13.8 billion years ago or so, is not that large. This is really surprising. This is the number of seconds from Big Bang; the estimated number of planets in the known universe is a little bit larger. But the number of ways that we can arrange a 52-card deck is really enormous. Think about that. We have an ordinary card deck with 52 cards, and the number of different arrangements – the order of these 52 cards – is astonishing.
And then if we have this silly 10 by 10 square with numbers from one to 100, so the number of possible squares with different arrangements of these numbers is nine, followed by 157 zeros. So this is our search space: this is the number of possible solutions, and we have to zoom in and find the one with the highest quality, the one with an error of zero, which is feasible. So how is it done? One method, for example, would be to use evolutionary algorithms, which would work like this: This is our solution, our structure – it's a square of 100 numbers. And this is the dot we talked about earlier, so we know that the average is 505 of all these numbers. So if we're aiming at getting the same number for each row, each column and each diagonal, we're aiming at 505. And because of that, we can easily evaluate this initial solution.
Of course, it's not a solution. It's just an additional square with numbers; it's our starting position in the quest of finding the optimal solution. So the evaluation function is very simple: we take into account the total of each row. This was 55, the total of the 2nd row, the total of the 3rd row, 255, and so on, and we subtract 505, and look at the deviation, whether positive or negative, we don't care. And we do this for every row, every column, and both diagonals – this is why we have 22 components of this evaluation function – and we are getting 2,750. This is the margin of error. This is how far we are from this ideal 505. If the total for each row, each column, and each diagonal is 505, then this evaluation would result in zero. So we know our evaluation score.
Pay attention we're trying to minimize this number because the evaluation score is the error, and we're trying to find the optimum solution with an error of zero - we can't go below that. So what we can do with a miserable solution that has an error of 2,750? Let's talk about this important step, how to generate a new solution based on the current solution. What we can do is run a so-called mutation.
We can identify two random places in the square – let's say we identify 14 and 87 – and then we swap them. 87 would go in the place of 14, and 14 would replace 87. And suddenly we have a new solution. We made this transition from one solution to another. So let's evaluate this new solution. Now the numbers would be slightly different in the 2nd row, in the 9th row, the 4th column, and the 7th column. The diagonals probably wouldn't change, and the error is 2,690, which is better than the previous one, which means we made a step in the right direction and the new solution is better.
As I said, the goal is to reduce the evaluation score to zero, because the goal is to reduce error and we can run many generations of this process. This is the first generation, second generation, third generation, and so on. And of course, in this example, as we saw with the 10 by 10 magic square, there are too many possibilities. This process wouldn't be that efficient. The number of possible solutions for this 10 by 10 magic square is 9, followed by 157 zeroes.
So if we just follow the simple process that I outlined minutes ago, we'd probably wait for a very, very long time. So we can think about an intelligent initialization: instead of starting with square 1, 2, 3, 4 or 5 up to 100, we initialize this square randomly in that way, and most likely we'd reduce the error. Then instead of processing a single solution and generate a new solution, we can process a population of solutions and exchange information.
Then we can design intelligent operators. I indicated that we could take 14 and 87 and swap their places, but we could also pay more attention to rows and columns with higher errors. And if we have a higher error, then we'd have a better probability of improvement. And then the evaluation may include a penalty component which reflects a degree of constraint satisfaction. If we keep these numbers 1, 2, 3, 4, 5, 6, 7, 8, and 9 far apart, then we add an additional penalty, making this error even bigger. So there are many ways how we can handle this situation. With this explanation, let's return to our running example, the complex business problem of distributing cars. Again, a very brief summary: GMAC leasing company, part of General Motors, they are getting back cars at the end of leases and rentals. The total annual number of returned cars is 1.2 million, which translates into a few thousand cars – between between four and seven thousand cars – every single business day.
And the team in Detroit of 23 analysts are making decisions on how to distribute these cars in an intelligent way. So the last time we talked about the prediction issues of this problem, and now let's talk about optimization issues. Even if we get a perfect prediction from the predictive model – how much we'll get for this car at this auction – there are very significant optimization issues: the number of possible distributions, the transportation schedules are very complex.
And there is a component of risk, we have complex constraints and business rules, we have to take everything into account, including the volume effect and other issues related to the predictive model to find the best distribution. So this was the situation: We have distribution centers that collect cars, we have auction sites, and every day the team in Detroit is making up to 7,000 individual decisions: this particular car sitting here should be shipped together with many other cars to this particular auction site.
So last time I talked about the interaction between the optimizer and predictive model, and as I indicated, the optimizer would ask the predictive model to evaluate a solution. This is predictive model is playing the role of an evaluation function, returning a quality measure score. So if the optimizer says "hey, what about this particular distribution?" The predictive model would say "for this distribution, you'll be getting $213 average lift per car."
And as I indicated last time, this process of generating a new distribution and consulting the predictive model whether the distribution is any good, is the key step of the optimizer. So the optimizer is also the key, not just the predictive model, because the number of possible distributions of 7000 cars among 57 auction sites is– listen carefully– 1 followed by 12,291 zeros. So this is our search space, this is the set of all possible distributions.
And some of these distributions are absolutely crazy: We send all cars from all distribution centers to one auction site. This is a possible distribution and so on. But the number of possible distribution is beyond astronomical. So the number of times the optimizer asks the predictive model to evaluate a particular distribution must be limited. Even if the valuation takes a fraction of a second, it would take many, many lives of the entire universe to explore all possible distributions. So even if we could evaluate trillions of different distributions within the allocated time, it's still a tiny, tiny fraction of the search space.
This is why we need some intelligence in the search. On top of that, the optimizer should return a feasible solution – so on top of everything, the optimizer should be able to handle all business rules and constraints which emerge from knowledge, from the earlier level of our pyramid. After data preparation, data analysis, collecting all information, we get some knowledge, we get some glimpses of wisdom, and we'd like to guide the optimizer in a particular direction. For example, we discovered that it doesn't pay to transport cars for more than 200 miles because risk factors are too high: probability of accidents, delays, and many, many other things. So we'd like to influence the optimizer by telling it "whatever you do, don't transport any car for a more than 200 miles." Then through some additional developments, we might find that BMWs, for whatever reason, aren't popular in the Boston area. So we'd like to direct the optimizer again, by saying "no BMW should go to the auction in Boston." Or red cars aren't popular in North Florida, so we say "no red car should go to Jacksonville," or "no car with low odometer reading should go to New Jersey," or "no old cars, no car produced earlier than 2005 should go to Knoxville," and so on.
So we have a variety of no little variety of pieces of wisdom and you would like to share this wisdom with the optimizer to make the right direction. So let me show you how this is done. Let's return to the car distribution system. Yes, this is our car distribution system. As explained earlier, these are the cars, they are sitting at the distribution center waiting for distribution. And we can run our optimizer and this optimized route considered these millions, if not more, different distributions; for every distribution the system we would evaluate the merits of the proposed distribution.
And as we saw last time, for each particular decision, the predictive model would tell us the predicted sale price for each particular car for each particular auction. However, before we run the optimizer, we may consider moving to the configuration screen. We can look at auctions, we can select a particular auction, let's say, ADESA Dallas, and then we can implement our knowledge here. We can say "if mileage odometer reading is below 10,000 miles, we shouldn't send cars like that to Dallas" because people over there don't appreciate cars with very low mileage.
And vice versa: if a car has 120,000 miles or more, it's also a bad decision, because people over there wouldn't buy a car with such high mileage. We can do the same with years by saying "no car produced in 2005 or earlier should go to this auction site." As I indicated earlier, we can select BMWs – let's say the 3 Series – and we add it to the exclusion list. No BMW 3 Series would go there. Or we can add all BMW models, so that no BMW would go there. And we can do the same with color: any green car, for example, would be excluded. We can influence or change the auction schedule. Also, we can look at the global constraints and say "the maximum transportation distance is 500 miles." Let's see what would happen if this is the maximum transportation distance? And we can share this knowledge for any auction site. We can select any auction site, and whether it is Buffalo or Knoxville or Los Angeles, we can share our knowledge with system and we can use such screen for what-if analysis.
So we set our business rules and constraints, we run the optimizer, and effectively we are asking the question "what would happen if I don't allow any red cars in California? What would happen if I limited transportation distances to 1000 miles? What would happen if I do this or that?" Also, this tiny button – use this auction or not, is of a strategic significance, because there are 200 auction sites in the United States and they operated on 57, and who said that 57 is the best number? Using this feature, they can explore possibilities using agent-based simulation. We can explore the possibility of using 62 auction sites or 51 auction sites. And at the end of the run, the system will produce the final result, the final distribution, final sales prices with estimation of volume effect and total distance travelled and transportation costs and net price and the lift.
So let's return to our presentation. How does it work? Let's discuss the evolutionary algorithm to see what was happening behind the screen. This is our potential solution: Let's say that this box contains up to 7,000 numbers because we are distributing up to 7,000 cars. All these cars are numbered: car number 1, car number 2, car number 3. And the numbers in the box correspond to an auction site index. The first interpretation of this screen is the following: This is a possible solution, where the 1st car should be sent to auction 12, the 2nd car should be sent to auction 43, the 3rd car should be sent to auction 28, and so on.
So this is a representation of our solution. So these are numbers of auction sites from 1 to 57. And again, the number of possible distributions is enormous, 1 followed by 12,000 zeroes. So let's say that we'll try using an evolutionary algorithm, then we have a population of 100 possible solutions. We can initialize it randomly or use some more intelligent method, and then we can run the mutation. We identify some randomly selected numbers where each number is a decision: this 45 means that the car 1, 2, 3, 4, 5, 6, 7, this car number 7 is directed to auction 45, and we impose a mutation and we replace this number 45 by another randomly selected number. So we say "this car number 7, instead of going to auction 45, should go to auction 27." And the same here: instead of directing this car to auction 7, we are directing it to auction 23. And here we are mutating in two places, nothing wrong with that. 12 is replaced by 32 and 11 is replaced by 17. Then we can try also so-called crossover. We take two solutions and then cut them in random points and take the first part of the first solution and the second part of the second solution. You are grouping them together and we arrive at the offspring of the new solution. This is how we generate new solutions. As I mentioned it earlier, this is one of the key points of optimization: how to generate new solutions from the current solutions.
So here in an evolutionary algorithm, we do it through mutation and crossover. So let's say we have this population of 100 candidate solutions, we evaluate all the solutions – like the average lift for this distribution is 193, and 211 for another distribution – and we keep going. Then we select the better individuals. These evaluations would most likely result in different numbers. Still, we might be very far from the optimum solution, but at least we can identify better solutions and weaker solutions, with better solutions serving as parents, and on these parents we are run mutation and crossover and generate offsprings. So we take two solutions, cut them together and take a yellow part from the first solution and a yellow part from the second solution to create a new offspring, a new solution. Or we can glue together the green parts and so on, while randomly mutating different parts. And then we make the replacement: the offspring replace the weak parents, and then we run the process of evolution again.
We have to know the merit of each solution and we repeat the process: we select parents, we mutate them, we you apply crossover. These solutions generate offspring and the offspring replaces the weaker parents. We evaluate new solutions and we keep going, generation by generation, and this is how it was working. Every distribution was presented during this process. The best individual found so far was evaluated together with other individuals for a variety of purposes, and the system asked the predictive model for the quality measure score. Finally, at the end of the run, the system couldn't improve anything further and the optimal average lift that was found was $245 per car.
And the results were really impressive in that the average net lift after four months of testing was around $250 per car, 23 analysts went down to 2 that now operate the system in minutes, not man-days. This was, as we see, a data-driven decision that was consistent where the predicted versus actual comparisons were performed, creating a closed loop for improvement. New data were coming in, new feedback was coming in, which we'll talk more about in Chapter 7.
A couple of additional comments: if we look at our search space and quality measure, the number of solutions is enormous, and we can visualize this as a surface. Each little point on the horizontal surface is a possible solution. And here we have a surface of quality scores. But in the real-world situation, that's even more complex because the surface moves, because we operate in a dynamic environment. So one of the challenging questions is to create optimizers that can operate in a dynamic environment.
Another point is multi-objective optimization, as very often we have more than one objective: we would like, let's say, maximize margin, while at the same time maximizing the sales volume, and these two objectives work against each other. And all these five solutions, all these five dots are optimal in the sense, but it's very difficult to tell which is better and why. We can talk only about trade-offs. But this is a separate topic of multi-year objective optimization.
An additional comment is related to variability and risk. We may have two different solutions and have some evaluation function that indicated solution A is little bit better, the quality measure score is a little bit higher. And very often this is a very interesting question: which of these two solutions should be selected for implementation? And most people would agree that solution A is because if the evaluation function is genuine, providing some real quality measure, then A solution is clearly better.
However, this solution may be sitting on a very narrow peak, meaning that any deviations from the underlying assumptions – a small delay or small inaccuracy – may result in a very rapid downturn of the evaluation function. On the other hand, the other solution is much more stable if there is some delay or some measurement wasn't accurate, such as damage level for the car wasn't measured that precisely and so on – this solution is much more stable. So on top of everything, we should take into account the shape of the surface we saw earlier, and also include these considerations in the final recommendation.
And the final comment is connected with global optimization. Very often we are solving a limited problem like optimizing operation in the mine, in the coal industry, or we are optimizing logistics, or we're optimizing port operations. And very often when we optimize one silo of a business operation, we talk about local optimization. And sometimes the best solution for one silo may have a negative impact on other silos. So the name of the game is to develop a global optimizer in the sense that we take the whole operation of an organization, looking at each aspect, at each silo, optimizing each silo separately and together at the same time, so these silos talk to each other, which is essential to providing the greatest benefit. This concludes the presentation for Chapter 6, and in Chapter 7, we'll talk about new data, learning, and updating models. Thank you.