Chief Economist at Google Inc.
Breaking the Wall of Economic Uncertainty. How Online Data Can Help us Understand the Economy
So, last night, we had a very nice reception, and I noticed that a lot of wine was being consumed. So, I thought this audience might be experts on the following question: What day of the week are the most searches for “hangover”? I guess that is Kater in German. How many people think Sunday? How many people think Monday? How many people think Tuesday? It is a very sober crowd, I think. How would you find out the answer to this question? Well, of course, you would go to Google. In this case, it is an application called “Google Trends”. It is open to the public; anybody can use it. You can do this at home. At “Google Trends” I typed in the term “hangover”, and I put in the date, and you can see a very regular pattern. It peaks every Sunday, and that big spike that you see is, in fact, January 1st. (Laughter) Now, if you scroll down the screen a bit, you can see the geographic distribution (Laughter and applause), and in this case you would notice that New York is the hangover capital of the United States. Finally, you can do comparisons.
In the next chart I show the searches for “Vodka” and the searches for “hangover”, and you can see that they are separated by exactly one day. So, let this be a lesson to us all. One other example that is kind of fun is to look at searches for a historical term, “Civil War”, in the U.S., and you can see that this has a very, very regular pattern; it repeats itself year after year. Why is that? Some people say “holiday”, some people say “education”. Well, anyone who has taught at a university knows that the peak occurs three days before the term paper is due, and it repeats itself in a very regular pattern.
Now, both of those are fun examples, but let me give you a slightly more serious example about unemployment. When economists measure unemployment, they look at two numbers. One number is the initial claims for unemployment: what happens when people first become unemployed. They go to the unemployment office, and they file for benefits. The other number that people look at is the unemployment rate: how many people are unemployed at a given point in time. Of course, these numbers vary significantly according to whether or not we are in a recession. So, in this chart, the grey bar is the recession. The red line is the unemployment rate, and the black line is the initial claims, the people filing for unemployment benefits. Now you will notice, if you look at that chart, that the initial claims peak right at the end of every recession. It is the best single indicator for the end of a recession, and, in fact, it tends to peak about four to six months before the unemployment rate peaks. So, people watch that number very closely. They are very interested in the behaviour of those initial claims for unemployment.
Now, you might ask yourself: if you became unemployed, what would you do? Perhaps one of the first things you would do is go to your computer and say, “Where is the unemployment office? How do I file for unemployment? How much are unemployment benefits? How does the system work?” We have a tool at Google, which you can use, where you can enter any time series, any series that you want, and find the queries that are the most correlated with that series. So, in this particular case, I entered the initial claims for unemployment, the official numbers downloaded from the government, and I came back with the answer. It said that the queries most correlated with that series are, in fact, the queries on “unemployment office”, which is very natural under the circumstances.
So, the question you might ask is: could you use those queries to try to predict the initial claims for unemployment in this particular example? As we all know, prediction is hard, especially about the future. So, in fact, we will lower the bar a little bit. We won’t try to predict the distant future. What we will do is what economists call “nowcasting”: we will focus on trying to understand the current state of the economy. That is still quite valuable, because the official statistics are released with a lag. So, if you have an idea what the statistics will look like before they are released, it gives you a little leg up in terms of understanding the current state of the economy. So, you can apply some statistical methods. You look at the initial claims for unemployment this week, and you might specify that they depend on the initial claims for unemployment last week, plus this new variable, the queries on the term “unemployment office”. The way to do this from a statistical point of view is to estimate the model up until time “t”, forecast the next week out of sample, and repeat through all the observations you have; then compare a baseline model to the model with the queries.
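The procedure just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the model used in the talk: the data is synthetic, the function name rolling_nowcast_mae and the parameter start are hypothetical, and the baseline is assumed to be a simple regression on last week's value. The sketch fits each model on the data before week t, forecasts week t, and compares the out-of-sample mean absolute error of the two models.

```python
import numpy as np

def rolling_nowcast_mae(y, x=None, start=20):
    """Expanding-window one-step-ahead forecasts; returns out-of-sample MAE.

    y: target series (e.g. weekly initial claims).
    x: optional same-week predictor (e.g. a query index); None gives the baseline.
    start: first week to forecast (hypothetical choice, leaves data for fitting).
    """
    y = np.asarray(y, dtype=float)
    if x is not None:
        x = np.asarray(x, dtype=float)
    errors = []
    for t in range(start, len(y)):
        # Fit on weeks 1..t-1: intercept, last week's y, and (optionally) this week's x.
        cols = [np.ones(t - 1), y[:t - 1]]
        if x is not None:
            cols.append(x[1:t])
        X = np.column_stack(cols)
        beta, *_ = np.linalg.lstsq(X, y[1:t], rcond=None)
        # Forecast week t, which was not used in the fit.
        row = [1.0, y[t - 1]]
        if x is not None:
            row.append(x[t])
        errors.append(abs(y[t] - np.dot(row, beta)))
    return float(np.mean(errors))

# Synthetic data (an assumption for illustration, not real claims or query data).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + 0.5 * x[t] + rng.normal(scale=0.3)

base = rolling_nowcast_mae(y)      # baseline: lagged value only
aug = rolling_nowcast_mae(y, x)    # baseline plus the query variable
improvement = 100 * (base - aug) / base
print(f"baseline MAE {base:.3f}, with queries {aug:.3f}, improvement {improvement:.1f}%")
```

Because the synthetic predictor is built into the series by construction, the improvement here is far larger than anything one should expect from real data; the point is only the shape of the comparison.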
When you do that, you find that you get about an 8.7% improvement in the out-of-sample mean absolute error. Typically, if you do this for other sorts of statistics, you will find something similar. You can find an improvement in the forecast accuracy, the very short-term forecast, the nowcast, of somewhere between 5 and 10%. So, here is a little picture of what it looks like. The initial claims for unemployment are the black line. The red line is just a baseline, simple model, and the green line incorporates this additional variable from the Google searches.
Another example is consumer sentiment. So, this is a measure that is conducted by the University of Michigan. They call up several thousand people each month, and they say, “Are you and your family better off than you were a year ago? Do you and your family expect to be better off next year than you are now?” So, it is these questions that they are using to gauge the confidence or the sentiment of the individuals involved. You might think that the kinds of queries that people are entering into search engines would, in fact, also reflect to some degree the consumer sentiment. Now, before, I used specific queries, but in this case I am going to use categories of queries. I plug this data, the consumer sentiment, into a statistical system, and the system finds those categories of queries that are the most predictive of consumer sentiment. So, we start with a trend, which is just a rough fit of a line to the time series: the red line is the trend; the blue dots are the consumer sentiment, and the bars down below are the errors in terms of that fit. Now we start adding these query categories. These are queries on financial planning. You can see now the red line is approaching the blue dots a little bit better. The errors have gone down a bit. We add in queries on investing, and now the line has really zoomed in on those dots; the errors are getting smaller. We add business news queries.
When the recession hit, people were very anxious during the financial crisis, so there were a lot of queries on business news. Then we add in search engines, the same idea, and we get even closer; now the errors have become almost uniform: we are fitting just about as well during the recession as during the other periods. Finally, we add in energy and utility queries, and now we see that we have really a pretty good fit over the entire range.
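The sequence of charts above, a trend line followed by one query category after another, can be mimicked with a small Python sketch. It is an illustration under loud assumptions: the data is synthetic, the helper fit_mae is hypothetical, and the categories are simply added in the order mentioned in the talk using ordinary least squares. The talk does not describe how its statistical system selects the categories, so no selection step is attempted here.

```python
import numpy as np

def fit_mae(y, regressors):
    """In-sample mean absolute error of a least-squares fit of y on an intercept plus regressors."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean(np.abs(y - X @ beta)))

# Synthetic monthly data (an assumption for illustration, not Michigan survey or query data).
rng = np.random.default_rng(1)
n = 120
trend = np.arange(n, dtype=float)
names = ["financial planning", "investing", "business news",
         "search engines", "energy and utilities"]
cats = {name: rng.normal(size=n) for name in names}
sentiment = (90.0 - 0.1 * trend + 2.0 * sum(cats.values())
             + rng.normal(scale=0.5, size=n))

# Start with the trend alone, then add one category at a time and record the error.
used = [trend]
maes = [fit_mae(sentiment, used)]
print(f"trend only: MAE {maes[0]:.2f}")
for name in names:
    used.append(cats[name])
    maes.append(fit_mae(sentiment, used))
    print(f"+ {name}: MAE {maes[-1]:.2f}")
```

These are in-sample errors, which is what the bars in the charts show; judging predictive value properly would need an out-of-sample comparison.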
Of course, this is just prediction. I am not asserting causality or anything like that. We are just trying to see: can we use this real time data, the data that is available on a more or less continuous basis, to help improve our estimates of what the official data will look like when it is released in a month or two? Now, this has stimulated a lot of economic research. The central banks are particularly interested in nowcasting the economy, because it is their responsibility to try to respond to changes in economic conditions. So, there have been research reports from the Bank of England, the Bank of Chile, the Bank of Israel, the Bank of Spain, and the European Central Bank. All these places are trying to utilise the data to understand employment, unemployment, inflation, retail sales, auto sales, travel destinations, and so on.
So, there are many possible ways you can mine this data to understand social phenomena. Let me also emphasize that it is not just Google that has this data; in fact, there are many private sector companies that have real time data. For example, if you look at credit card companies, they can tell you how much was charged on their credit cards yesterday. If you look at shipping companies, like UPS and FedEx, they can tell you how many packages were sent on a given day from a given region. If you look at large retailers like Wal-Mart, Target, and so on, you can find out exactly how much consumers spent in those chains on a day-by-day, and in some cases even hour-by-hour, basis. Now, that private sector data, where companies have spent the last decade building real time data systems to measure the performance of their organisation on a daily or hourly basis, is much, much higher frequency than the data you see from the government agencies. The government agencies have historical data; it is very carefully corrected. It is very labour intensive, and, in fact, it tends to be rather low frequency: monthly data or quarterly data. So, being able to combine that real time data that comes from the private sector with the official statistics from the public sector should give us a way to understand the functioning of the economy better than we do now. Thank you very much.