Chapter 4: Data
This supplementary video to Chapter 4 of The Rise of Artificial Intelligence presents basic terminology related to data (e.g. datasets, records, variables, and values), as well as an overview of data transformation and composition, variable selection, data reduction, data normalization, and missing data. This video also distinguishes between data, information, and knowledge, which represent the three lower levels of the problem-to-decision pyramid.
Some of the material in this video is based on a complex business problem that's used as a running example. The following article provides a full explanation of this problem as well as its complexities:
Michalewicz, Z., Schmidt, M., Michalewicz, M., and Chiriac, C., A Decision-Support System based on Computational Intelligence: A Case Study, IEEE Intelligent Systems, Vol.20 (4), July-August 2005, pp.44 – 49
Click here to download Chapters 1 & 2 of The Rise of Artificial Intelligence: Real-world applications for revenue and margin growth, and please contact us to request a soft copy of any other chapter of the book.
Transcript (reading time: 19:10 min)
Hi, this is Zbigniew Michalewicz, I am one of the co-authors of The Rise of Artificial Intelligence, and this is a supplementary video to Chapter 4 of the book. During this presentation, we'll talk about data, which is the topic of Chapter 4. However, I will also touch on the next two levels of the pyramid, information and knowledge. Let me start with a general remark. Some of the presented material for this chapter and the next three chapters is based on one particular, very complex business problem, and I will use this problem as a running example.
There is one particular article listed here, available from our website, which is recommended reading because it describes this complex business problem in detail. The outline of this presentation is as follows: we'll start with the car distribution example, then talk about terminology, data sets, records, variables and values, followed by data preparation, data transformation, composition, variable selection, and the handling of missing data. Towards the end we'll make a distinction between data, information and knowledge, and then conclude by returning to the car distribution example.
So, GMAC is a leasing company: they lease cars across the nation, and these cars are returned to GMAC at the end of the lease. When returned, these cars are sold at auction sites across the United States. The annual turnover is over a million cars, which translates into a daily number of between 4,000 and 7,000 cars. Every single day, the remarketing team of 23 data analysts in Detroit makes a decision for every single car that's ready to be shipped to auction.
They have to make a manual decision: which is the best auction site to send this particular car to? And the problem is very complex. We'll discuss the complexity of this problem when we talk about prediction and optimization in chapters five and six, and you can also read about the complexity of this problem in the article I mentioned earlier. So the cars are collected at distribution centers, where they're sent from dealerships; there, the paperwork is prepared and minor repairs are performed if necessary.
Whenever the car is ready for shipment, the remarketing team in Detroit makes the final decision on where to ship the car. For example, this particular Pontiac Grand Prix, which is sitting and ready for shipment at this particular location, would be shipped to this particular auction site or that particular auction site. So a few thousand cars are available for shipment every day, meaning that a few thousand individual decisions need to be made every day. The team has to come up with the best possible destination, and "best" means the destination that would provide the highest selling price.
So today we'll look at available data and we'll talk about how it's transformed into information and knowledge. And the main purpose of such transformations is to prepare for building models, in particular, predictive models and optimization models. At the end of the day, we'd like to have a better understanding of the different possibilities for putting an intelligent system together. So all these exercises connected with data and transformations only have one goal: to build a clever model that would help us make better decisions.
So, let's talk about data. As far as cars are concerned, we have a variety of variables: we have the VIN number for each car, the make and model, mileage, year of production, and color. These cars may also have a damage level from zero to ten, and some indicators for options, such as power windows, sunroof, alloy wheels, and so on. We also have a variety of data connected with the auction sites.
In one particular auction in Houston, Texas, we might have a maximum inventory of 8,500 cars that we can keep there. We may have some information about the average number of people who attend auctions. We also need some information on the distance between all distribution centers and all auction sites. So the distance between the first distribution center and the first auction site is 850 miles. The distance between the second distribution center and the first auction site is 220 miles, and so on. This will be important for estimating transportation costs.
Also, it's very important to have a calendar of auctions because auctions are usually run every second week. These dates are very important because if you miss the auction by a single day, the car will sit for an additional two weeks, suffering depreciation costs. So the calendar of auctions will be a very important variable for the optimal distribution of cars. Also, we have historical data on sales. We know that a particular car with a particular VIN number was sold at auction site 47 in Houston, Texas on 28th February 2002, and the price was $11,900.
And this information is also extended by the identifiers of other cars that were processed that day. The reason for this is called the volume effect. If we have too many cars of the same type that are sold together, they would depress the price. So the volume effect is an important aspect for consideration, which is helpful later on in modeling. And the remarketing team also has access to external historical data: for example, they know the temperature that day or rainfall and the gas prices and additional information like color preferences for this area of the country.
And some of these variables are really important. If we have significant rainfall, clearly the prices would be lower because fewer people would attend the auction. So the basic term is dataset. The dataset is simply a collection of records for cars. We have all these records that follow the structure which I described a little bit earlier. So each record would consist of the VIN number, make and model, mileage, color, power windows, and so on.
And when we talk about records, it's customary to talk about variables and their values – so VIN number, make, model, mileage – these are variables, and then each variable would have a value from a particular range. It might be a string of characters like the VIN number, it may be the year of production, it might be any integer number for mileage and so on. And the values for options are indicated by binary variables: "yes" or "no."
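A record with these variables and value types could be represented roughly as follows. This is only a minimal sketch: the class name `CarRecord`, the field names, and all sample values (including the placeholder VINs) are illustrative assumptions, not the structure of the actual dataset.

```python
from dataclasses import dataclass


@dataclass
class CarRecord:
    vin: str              # string of characters
    make: str             # nominal label
    model: str            # nominal label
    year: int             # year of production
    mileage: int          # integer number of miles
    color: str            # nominal label
    damage_level: int     # 0 (no damage) to 10
    power_windows: bool   # binary option indicator: yes/no
    sunroof: bool         # binary option indicator: yes/no
    alloy_wheels: bool    # binary option indicator: yes/no


# A dataset is simply a collection of such records (all values are made up).
dataset = [
    CarRecord("SAMPLEVIN00000001", "Pontiac", "Grand Prix", 2001, 31500,
              "silver", 0, True, False, True),
    CarRecord("SAMPLEVIN00000002", "Toyota", "Corolla", 2000, 42000,
              "blue", 2, True, True, False),
]
```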
Talking about the values of variables or variables themselves, we can look at variables from a slightly different perspective: we can say some variables are numerical, which are easy variables to deal with because we can compare the value of one variable against the value of the same variable in another record – for example, which of two cars has the higher mileage. Some variables are nominal: there are labels attached to the variable, like color (white, yellow, green, silver, and so on).
Some variables are binary: "0/1," "yes/no," "true/false." Very often there are also other variables like free text: for example, a free-text description of the damage, or a description of special features of the car that are not included in the standard set of attributes. Very often when we are dealing with nominal variables, we convert them into numbers, simply because computers are much better with numbers than with strings of characters. So one possibility might be to assign "1" for white, "2" for yellow, "3" for orange, and so on.
Having fourteen colors, let's say we assign "14" for black. Another possibility would be to convert these labels into binary strings: white would be "1" followed by 13 zeros, yellow would be "01" followed by 12 zeros, and so on, with black being 13 zeros followed by "1" in the 14th position. So there are many possibilities for how to do this. Also, variable transformation and variable composition are very important concepts, because some variables may require transformation. For example, age might be much more important than year of birth for a human, and for a dataset of cars, the age of a car might be more important than its year of production: this is a three-year-old car, this is a five-year-old car.
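The two ways of encoding color described above can be sketched as follows. Only the positions of white, yellow, orange, and black come from the talk; the other ten colors in `COLORS`, and the function names, are assumptions made to fill out the list of fourteen.

```python
# Fourteen color labels. White, yellow, orange (1-3) and black (14) follow
# the talk; the colors in between are assumed for illustration.
COLORS = ["white", "yellow", "orange", "red", "green", "blue", "silver",
          "gray", "brown", "beige", "purple", "gold", "maroon", "black"]


def integer_code(color):
    """Encode a color as a number from 1 to 14."""
    return COLORS.index(color) + 1


def one_hot(color):
    """Encode a color as a 14-character binary string with a single '1'."""
    position = COLORS.index(color)
    return "".join("1" if i == position else "0" for i in range(len(COLORS)))


print(integer_code("black"))   # 14
print(one_hot("yellow"))       # 01000000000000
```

One reason to prefer the binary strings is that integer codes suggest an ordering (black "greater than" white) that has no real meaning for colors.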
So having only a year of production, it might be smart to transform this variable into age. The same goes for variable composition. It may be meaningful to create a new variable, which by itself may not be present in the dataset, but which we can construct from other variables. Take average miles per year: we take the total mileage on the odometer, divide it by the age of the car, and get an average number of miles per year, which may be a very important and useful variable for predictive modeling, when we try to predict the price of the car at a particular auction site.
Also, variable selection is a very important concept because oftentimes there's a significant number of variables and the key issue is to select the meaningful variables, and only the meaningful variables. If we miss a meaningful variable, that's very bad, for example, if we don't include the make of the car.
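The transformation (year of production into age) and the composition (average miles per year) mentioned above amount to two small calculations. The sketch below uses hypothetical function names, an assumed reference year of 2005, and one assumption about cars less than a year old.

```python
def car_age(year_of_production, current_year):
    """Transform year of production into age in years."""
    return current_year - year_of_production


def miles_per_year(mileage, age_years):
    """Compose a new variable from two existing ones: mileage and age."""
    # Assumption: a car under one year old is treated as one year old,
    # so that the division is always defined.
    return mileage / max(age_years, 1)


age = car_age(2001, current_year=2005)   # 4
average = miles_per_year(48_000, age)    # 12000.0
```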
Without the make, we have no clue whether the car is a Toyota, a Honda, a Mazda, a Holden, and so on. Clearly, the make and model of the car would be one of the key variables and this is very obvious. But with a large number of variables, it's not that obvious which variables are more important and which of them are less important. And if we keep including irrelevant variables, then it's very bad for the predictive model because we're introducing noise and it's much harder to train the model, and the accuracy of the model will usually be much, much poorer.
And the issue is the following: If we have just 20 key variables, we have over a million possible subsets of variables to choose from. The question is, how should we organize these variables into subsets? Let's say one possible subset would be to select variables 2, 3, 5, 7, and 11. Another possible subset is to select variables 3, 5, 6, 11, 12, 17, and 19. And if we look at all these possible subsets of variables we'd like to include in the predictive model, then the number of these possible subsets is quite significant: with 100 variables, the number of possible subsets is huge.
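The counts quoted above follow from the fact that each variable is either in a subset or not, so n variables give 2^n subsets. A short check, with a hypothetical helper name and a toy list of four variables:

```python
from itertools import combinations


def count_subsets(n_variables):
    """Number of subsets of n variables (including the empty subset)."""
    return 2 ** n_variables


print(count_subsets(20))    # 1048576
print(count_subsets(100))   # 1267650600228229401496703205376

# Listing every subset is only practical for a handful of variables.
variables = ["make", "mileage", "age", "color"]
all_subsets = [subset
               for size in range(len(variables) + 1)
               for subset in combinations(variables, size)]
print(len(all_subsets))     # 16
```

With 100 variables the count is roughly 1.27 × 10^30, which is why evaluating a model on every subset is out of the question.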
So we can't evaluate each subset and we can't evaluate the precision of the model built on all different subsets. Instead, we have to use principal component analysis or other methods to identify these variables. Also, this variable selection process is part of the data reduction process. Very often we have huge datasets and very often it's useful to train the model on subsets of variables. So basically, again, the concept is very much the same: we remove non-essential data and this is a very healthy approach from a predictive modeling point of view.
One possibility is to remove some variables. Let's say we discover that this binary variable "power windows" is of no significance whatsoever, so we can remove it. Another possibility would be to cluster some values which are present in the dataset into ranges. For example, instead of dealing with every single possible integer for mileage, we can group them together, for example, mileage between 10,000 and 20,000.
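Both reductions described above, dropping a variable and grouping values into ranges, are simple to express. In this sketch the function names and the fixed range width of 10,000 miles are assumptions; the talk only gives "between 10,000 and 20,000" as an example range.

```python
def drop_variable(record, name):
    """Return a copy of the record without the given variable."""
    return {key: value for key, value in record.items() if key != name}


def mileage_range(mileage, width=10_000):
    """Map an exact mileage to the half-open range [lower, upper) it falls in."""
    lower = (mileage // width) * width
    return (lower, lower + width)


record = {"make": "Pontiac", "mileage": 12_500, "power_windows": True}
reduced = drop_variable(record, "power_windows")
reduced["mileage_range"] = mileage_range(reduced.pop("mileage"))
# reduced == {"make": "Pontiac", "mileage_range": (10000, 20000)}
```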
Another possibility would be to remove some records, because it's not always the case that "the more data, the better." One particular technique we can use is incremental sampling, where we train a model on a subset of data, gradually increase the number of records, and measure improvements in the predictive model; once we see the improvement has stopped – we keep adding records and there's no improvement – then it's time to stop. An additional aspect of data preparation is data normalization. For some variables, it may be meaningful to scale them into a specific range, let's say from 0 to 1.
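The incremental sampling idea can be sketched as a loop that grows the training subset and stops once the error on held-out records no longer improves. Everything specific here is an assumption for illustration: the synthetic price data, the simple straight-line model, and the `start`, `step`, `min_gain`, and `patience` parameters.

```python
import random


def fit_line(points):
    """Least-squares fit of price = slope * mileage + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, points):
    """Average absolute difference between predicted and actual price."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)


def incremental_sampling(train, holdout, start=20, step=20,
                         min_gain=5.0, patience=2):
    """Grow the training subset until the hold-out error stops improving.

    Returns a list of (records_used, holdout_error) pairs.
    """
    history, best, stale = [], float("inf"), 0
    size = start
    while size <= len(train):
        error = mean_abs_error(fit_line(train[:size]), holdout)
        history.append((size, error))
        if best - error > min_gain:
            best, stale = error, 0      # still improving
        else:
            stale += 1                  # no meaningful improvement
            if stale >= patience:
                break                   # time to stop adding records
        size += step
    return history


# Synthetic data (made up): price falls with mileage, plus random noise.
rng = random.Random(0)
records = []
for _ in range(260):
    mileage = rng.uniform(5_000, 90_000)
    price = 20_000 - 0.1 * mileage + rng.gauss(0, 500)
    records.append((mileage, price))

history = incremental_sampling(records[:200], records[200:])
for size, error in history:
    print(f"{size:4d} records -> mean absolute error {error:8.1f}")
```

The `patience` parameter is a design choice: it avoids stopping on a single unlucky step where the error happens not to improve.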
The reason for normalizing is very simple. If we look at the age of a car, which is usually a single-digit integer (a two-year-old car, a five-year-old car, and so on), versus mileage, which is in the tens of thousands of miles, then the ranges of these two variables are very, very different. To introduce some uniformity into predictive modeling, you may wish to put them on a common scale, so that both the age of the cars and their mileage are represented on a scale of 0 to 1, with "1" meaning the largest mileage we have in the dataset, and the same for age, with "1" corresponding to the oldest car, which might be a nine-year-old car.
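One common way to do this scaling is min-max normalization, sketched below. The talk only says that "1" corresponds to the largest value in the dataset; mapping the smallest value to 0 is an assumption of this sketch, as are the function name and the sample values.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]   # all values identical: nothing to spread
    return [(v - lowest) / (highest - lowest) for v in values]


ages = [2, 5, 9]                       # years
mileages = [20_000, 55_000, 90_000]    # miles

scaled_ages = min_max_scale(ages)            # the nine-year-old car maps to 1.0
scaled_mileages = min_max_scale(mileages)    # [0.0, 0.5, 1.0]
```

After scaling, both variables live on the same 0-to-1 range, so neither dominates simply because its raw numbers are larger.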
Also, we can convert some numbers into different numbers; for example, instead of keeping $400 in damage for this particular car, we can convert that either into a damage level or a damage category: categories 0, 1, 2, 3, 4 and 5, with "5" being very, very significant damage. Also, it's important to develop an approach for missing data because in almost any dataset, some values aren't recorded.
Color might be missing, mileage might be missing, and so on. So either we introduce an additional value for the variable – apart from white, silver, red, black, we can have "unknown" as a value, because it's missing – or we can just remove the records with missing values. However, we have to be very careful with that. Everything depends on the proportion of missing values in the whole dataset. In some cases, we might lose over 90% of records, which isn't healthy.
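The options for missing values discussed in this part of the talk (an extra "unknown" value, dropping records, and the replacement by a mean or by an age-based estimate that comes up next) can be sketched as follows. Records are plain dictionaries with `None` marking a missing value; the function names and the three sample cars are assumptions, while the 12,000 miles per year figure is the one used in the talk.

```python
def label_unknown(records, variable):
    """Replace a missing nominal value with the extra label 'unknown'."""
    return [{**r, variable: "unknown" if r[variable] is None else r[variable]}
            for r in records]


def drop_incomplete(records):
    """Drop records with any missing value; also report the fraction lost."""
    kept = [r for r in records if all(v is not None for v in r.values())]
    lost = 1 - len(kept) / len(records) if records else 0.0
    return kept, lost


def impute_mean(records, variable):
    """Replace missing values with the mean of the recorded ones.

    Assumes at least one record has a value for the variable.
    """
    known = [r[variable] for r in records if r[variable] is not None]
    mean = sum(known) / len(known)
    return [{**r, variable: mean if r[variable] is None else r[variable]}
            for r in records]


def impute_mileage_from_age(records, miles_per_year=12_000):
    """Estimate a missing mileage as age times an assumed yearly mileage."""
    return [{**r, "mileage": r["age"] * miles_per_year
             if r["mileage"] is None else r["mileage"]}
            for r in records]


cars = [
    {"color": "white", "age": 3, "mileage": 30_000},
    {"color": None, "age": 2, "mileage": 24_000},
    {"color": "red", "age": 4, "mileage": None},
]

kept, lost = drop_incomplete(cars)   # only 1 of 3 records survives
```

Checking `lost` before dropping anything is a cheap guard against throwing away most of the dataset.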
So another possibility would be to do some replacement. For example, we can take some mean value for this variable. Let's say the mean value for mileage would be "27,000" miles, and whenever we find a missing value, we insert "27,000" miles. It will be even better if we can estimate the value of this variable on the basis of additional data which we may have. For example, we can assume that the average car travels 12,000 miles per year, and if we know that the car with missing mileage was four years old, we can then estimate its mileage at 48,000 miles, which is a more meaningful replacement.
An additional possibility is to use agent-based systems to fill in the gap. And in the book, we discuss one particular application of an agent-based system where we have a collection of birds: they are distributed randomly in the environment, the location is random and the initial direction of their flight is random, and we just impose two very simple rules:
If two birds of the same species – meaning that both of them are either blue or red – come close to each other, then since they're the same species, they get even closer together, and if two birds of different species come close to each other, then they get further apart. And we can wonder, with a random start, what would be the emergent behavior, what would happen?
So we build a simple simulation model and we observe the emergent behavior, and we can see what would happen, how they cluster together, and we can reach some conclusions. We can make a variety of additional observations. This is a very interesting approach that we can use to simulate a variety of things. We can simulate, for example, the number of individuals attending the auction site where the variables are season, weather, proximity to some cities, and a variety of other factors. And the rules we get are based on historical data and our observations of what happened in the past.
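A toy version of the two-rule bird simulation might look like the sketch below. It is not the model from the book: the interaction `radius`, the `pull` strength, the constant `speed`, the fixed headings, and the unbounded 2-D world are all assumptions chosen to keep the example short.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Bird:
    x: float
    y: float
    heading: float   # direction of flight in radians
    species: str     # "blue" or "red"


def random_flock(n, size, rng):
    """Place n birds at random positions with random headings and species."""
    return [Bird(rng.uniform(0, size), rng.uniform(0, size),
                 rng.uniform(0, 2 * math.pi), rng.choice(["blue", "red"]))
            for _ in range(n)]


def step(birds, radius=10.0, pull=0.1, speed=1.0):
    """Advance the flock by one tick and return the new list of birds."""
    moved = []
    for bird in birds:
        dx = dy = 0.0
        for other in birds:
            if other is bird:
                continue
            ox, oy = other.x - bird.x, other.y - bird.y
            if math.hypot(ox, oy) <= radius:
                # Rule 1: same species -> move closer.
                # Rule 2: different species -> move further apart.
                sign = 1.0 if other.species == bird.species else -1.0
                dx += sign * pull * ox
                dy += sign * pull * oy
        moved.append(Bird(bird.x + dx + speed * math.cos(bird.heading),
                          bird.y + dy + speed * math.sin(bird.heading),
                          bird.heading, bird.species))
    return moved


flock = random_flock(40, 100.0, random.Random(42))
for _ in range(100):
    flock = step(flock)
```

All birds are updated from the same snapshot of positions, so the result does not depend on the order in which they are processed. Plotting `flock` by species after a number of steps is one way to look for the clustering the talk describes.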
But we can model a variety of situations and we can arrive at some results that would fill in the gap for us.
The final aspect I would like to mention is time dependency: very often when people do time series modeling, they assume that observations are present at regular intervals. For example, every week or every quarter you have some sales data; in the stock market, you can take readings every day or every hour, and so on. With cars, it is a very different story.
For example, we can look at one particular car, let's say a blue Toyota Corolla with mileage around 30,000, and then we look at one particular auction site and realize that one car of this type was sold in mid-April, another two in early May, and another one in late August. And now we're making a prediction for mid-October. So it's not really classic time series modeling, where everything is very regular; we need, again, some clever techniques to come up with an accurate prediction.
Let's talk now a little bit more about information, and later about knowledge. We use all this data, process it, and get some information. For example, we can take one particular make and model – the Pontiac Grand Prix – and concentrate on cars between 20,000 and 40,000 miles, produced in 2001, with no accidents or with a damage level of "0." And we can look at the average price of this particular type of car across a variety of auction sites.
And we can see, on the one hand, we have auction site 14 where these cars aren't that popular, with an average price of around $11,000, and on the other extreme, we have auction site 18, where we can sell the same car for more than $12,500. So we get some very significant information that would be very useful in our distribution decision of where we should send each car. However, we have to remember the volume effect; for example, it seems that the best destination for this particular car is auction site 18. However, if we send 300 Pontiac Grand Prix to auction site 18, then the price might fall below $10,000 because of the volume effect, that is, because of too much supply. We may get some additional external data on consumer preferences in the surrounding neighborhoods. So in the area around auction site 18, we can look at the colors of new cars sold, which provides information on consumer color preferences.
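Turning sales records into the per-site averages described above is a filter followed by a group-and-average step. In this sketch the function names and every sales record are made up; the prices were simply chosen so that the averages land near the figures quoted for auction sites 14 and 18.

```python
from collections import defaultdict


def average_price_by_site(sales, make, model, year, min_miles, max_miles):
    """Average sale price per auction site for undamaged cars of one type."""
    totals = defaultdict(lambda: [0.0, 0])   # site -> [sum of prices, count]
    for sale in sales:
        if (sale["make"] == make and sale["model"] == model
                and sale["year"] == year
                and min_miles <= sale["mileage"] <= max_miles
                and sale["damage_level"] == 0):
            totals[sale["site"]][0] += sale["price"]
            totals[sale["site"]][1] += 1
    return {site: total / count for site, (total, count) in totals.items()}


def sale(site, make, model, year, mileage, damage_level, price):
    """Build one historical sales record (hypothetical structure)."""
    return {"site": site, "make": make, "model": model, "year": year,
            "mileage": mileage, "damage_level": damage_level, "price": price}


sales = [
    sale(14, "Pontiac", "Grand Prix", 2001, 25_000, 0, 10_900),
    sale(14, "Pontiac", "Grand Prix", 2001, 38_000, 0, 11_100),
    sale(18, "Pontiac", "Grand Prix", 2001, 22_000, 0, 12_700),
    sale(18, "Pontiac", "Grand Prix", 2001, 35_000, 0, 12_500),
    sale(18, "Pontiac", "Grand Prix", 2001, 61_000, 0, 9_800),   # mileage too high
    sale(14, "Toyota", "Corolla", 2001, 30_000, 0, 8_900),       # different car type
]

averages = average_price_by_site(sales, "Pontiac", "Grand Prix", 2001,
                                 20_000, 40_000)
best_site = max(averages, key=averages.get)
# averages == {14: 11000.0, 18: 12600.0}; best_site == 18
```

Note that a plain average like this says nothing about the volume effect: it ranks sites by past prices, not by what would happen if many similar cars were sent to the same site.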
This might also be an important indicator for some auction sites that we'll return to during the next video. We can then look at one particular auction site and, again, look at one particular make and model, with a particular mileage and year of production and no accident record, and look at the different months of the year, the number of cars that were sold, and the average price. And for some reason, September was very, very successful: despite the fact that the number of cars sold was the highest (18), the price was also very high.
It might be because of some seasonal factors, or there may have been some other external events that contributed to that, which would require deeper investigation. We can look at the total number of cars sold in the nation and in different states, and we can look at the number of people attending auctions during the summertime or wintertime; this particular piece of information might be quite significant. Again, the more people attend auctions, the higher the price; the correlation is very, very clear.
Also, if we look at depreciation from the day the car is delivered to the auction site to the day the car is sold, it seems that a few auction sites, in Nevada and Idaho and in the northeast, have very long waiting periods. And this should also be taken into account when we distribute the cars, trying to maximize the auction price.
So let's talk about knowledge: we can gain a variety of very important insights.
For example, we can discover that green cars are very unpopular in Texas: for some reason, people there simply don't like green cars, and there is significant evidence for this in the historical data. So we can make a note saying: "try to stay away from sending any green cars to Houston." Or we might discover that people over there don't appreciate cars with high mileage, so we can make another note. This is the knowledge we discover by analyzing historical data: for example, finding that if a car has over 80,000 miles, it's probably a mistake to send it to this auction site. The same may apply to old cars (cars produced in 2002 or earlier), or to current events such as bad press or people boycotting BMWs, Audis, and so on. We can make a note that no BMWs and no Audis go to this auction site, and keep these cars on an exclusion list.
The same applies to the volume effect. As I have indicated a few times already, there is a strong correlation between the price of a car and the number of similar cars sold at a particular auction on a particular day, for example, the number of Pontiac Grand Prix. If we send 200 almost identical cars to this auction site, we would like to understand what would happen to the price, which is a very important consideration when we build a predictive model.
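Notes like these can be written down as explicit rules and checked before a car is assigned to a site. The sketch below uses a hypothetical rule format; attaching all three rules to "Houston" is an assumption made for the example, since the talk does not say which site the BMW and Audi exclusion applies to.

```python
# Knowledge captured as per-site rules (structure and site assignment assumed).
RULES = {
    "Houston": {
        "excluded_colors": {"green"},
        "max_mileage": 80_000,
        "excluded_makes": {"BMW", "Audi"},
    },
}


def allowed(car, site, rules=RULES):
    """Return True if no rule for this site excludes the car."""
    site_rules = rules.get(site, {})
    if car["color"] in site_rules.get("excluded_colors", set()):
        return False
    if car["make"] in site_rules.get("excluded_makes", set()):
        return False
    if car["mileage"] > site_rules.get("max_mileage", float("inf")):
        return False
    return True


green_car = {"make": "Pontiac", "color": "green", "mileage": 30_000}
silver_car = {"make": "Pontiac", "color": "silver", "mileage": 30_000}

print(allowed(green_car, "Houston"))    # False
print(allowed(silver_car, "Houston"))   # True
```

Rules of this kind only filter out bad destinations; choosing the best of the remaining sites still needs the price information and the volume effect discussed above.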
The remarketing team in Detroit is manually making up to 7,000 decisions a day. At this particular distribution center, they have 510 cars. And for each single car, they have to make a decision: which car should be shipped to which particular auction site. And they have to take all information and all knowledge into account. They take into account seasonality and weather forecasts, distances and transportation costs, depreciation values and the volume effect – everything humanly possible – and they are doing the best they can.
So basically, so far we have discussed how decisions are made using data, information and knowledge. Of course, everything is supported by additional material. They have access to Black Book, Blue Book, with the valuation of different colors which are modified by mileage and year of production and so on. They are getting regular reports from individual auction sites.
So they have data, information, knowledge, and their task is to make the best possible decision. And the question is, can we make better decisions by climbing up the pyramid? If we add a predictive model, if we add an optimization model, can we beat a human team by making smarter decisions, providing significant value for the company? This would be the topic for the next couple of videos. Thank you.