Professor, School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, CH
Breaking the Wall of Data Deluge. How Efficient Data Exploration Enables New Scientific Discoveries.
Hello, welcome, and thank you very much for this opportunity to be “normal”. I am a computer scientist, and I work on data management within computer science, but I am a firm believer in interdisciplinary science. Today I would like to introduce to you the parts of technology that I think can make a huge impact on the world of tomorrow – and, of course, today. So, let’s talk about technology, shall we?
We toast bread. We like that. In 1800 we would have this device to toast bread. Through the years, it took us about 200 years, in fact, to come up with other models, more sophisticated models. That is about as sophisticated as it is going to get, right? So, this scientific progress is modest, relative to other fields, for example, the car industry. It took only 120 years to go from a model that runs 13 km/hr, all the way up to a model that runs 250 km/hr – a factor of 20 in 120 years. That is much more impressive than with the toaster industry, but there is absolutely no industry, no technology that progresses faster than computer technology.
In 1946, the first computer – its name was ENIAC – took up about the space of a six-bedroom apartment, could compute at the rate of 100 kHz and weighed about 27 tons. That is how computations started to be done electronically. The evolution of transistors and our integration trends, which allow us to pack more and more transistors in the same chip area – every 18 months we double how many transistors we pack into the same chip area, in fact – led us to different computers. The last one I am showing is up there; it is a telephone, a smart phone in fact, which runs at about 2 GHz, takes a few square centimetres worth of space, and can do about a million times more computations than ENIAC in the same time. That is the smallest computer you can buy today. That is really amazing. Computer technology is the fastest growing technology.
But there is one trend that comes from this fastest growing technology and grows even faster than that technology. That is the data – the data that we collect in the world. We collect data when we go to the supermarket; we collect data when we go to the doctor; we collect data when we click on the internet. There is data collected right now. Someone is filming me; that is going to be data on a computer. That trend of data collection, the red line, grows much faster than the blue line, which is our ability to process that data despite the technological trend that I showed before – the impressive growth of computer technology.
So, we need to bridge that gap. Scientific data are an amazing reason why we need to bridge that gap. I am going to relate to the previous speaker now and make the same case but for a different reason. The sequencing of the human DNA was a big breakthrough, but it was an expensive exercise. In 2000 it cost about $10,000. Now, that cost has gone down to $1.00. So, in only ten years, we had four orders of magnitude of cost drop. This means that all of a sudden we can sequence everything. We can take a little water, do gene sequencing there, and we can take a little water from ten metres away, follow the evolution of bacteria, follow trends, and follow anything through the data because it is so cheap. Technology makes that possible. There is a lot of data being gathered. These are just two of the databases that store base pairs – TRACE has raw data of about two trillion pairs. Then there is also the Sequence Read Archive that has next-generation data – raw data of about 25 trillion pairs. These are the actual lines; they are not even plottable in comparison to the cost drop.
There is one trend, however, that doesn’t follow those. This is how we process that data, what we do with that data. As an example, take the whole human genome shotgun sequence – the trend of which is much more modest than the data collection trend. So, what we want to do is bridge that gap. As a computer scientist, I am proud that we are able to give the world the technology to gather all of this data. But I am not very happy that we do not yet have the technology to process all this data for the benefit of humanity. So, as a database person, as a data management person, I feel responsible. My work is in trying to bridge that gap through efficient data management to be able to harness all of this data and turn it into useful information.
So, I am going to give you just a couple of examples of my work so far in this area. We are going to switch gears for a little bit here and go to astronomy. From the beginning of time, people have wondered what is out there. Today, because of technology, we have elaborate telescopes that give us a lot of information about celestial orbits. This is an example of a future telescope, called the Large Synoptic Survey Telescope, that is going to record data from the sky at about 20 petabytes every night. A petabyte is a very large amount of data. I am not going to bore you with details here, but that is what we do. As computer scientists, we gather data and then we scare people with it.
So, the data looks like this. This is really what the data looks like. Each line records what type of object was observed, when it was photographed, and what interesting information there is about it. You have the astronomers that are wondering about what is happening in the sky. The astronomer wants to go look at that data and say: “Tell me which galaxies are fast moving?” He or she can do that if the data is of a modest size. The astronomer can just go look at every single line of that data, find which are the galaxies, and then figure out when they were observed, match them together and do the necessary processing. But as the data grows, this is not scalable. You cannot do it as efficiently. You can still do it. You can go through a lot of data, but that is going to take a lot of time. So, what we want is to find methods. My work is to find methods; for example, come up with structures that can be applied on that data, such as the indexing structure, which is like an index finger, and can show the astronomer where the galaxies are without him or her having to go through all the data to discover them.
There are two nice things about it. The first one is that you go directly to the data that you care about. The second one is that the time the methods we discovered take is not directly dependent on, and does not grow commensurately with, how much data you have. So, their performance is not tied to the amount of data. So we disassociate ourselves from the very impressive trend of data growth, and we are able to come up with really performant methods for going through large amounts of data. That is a very simple example that allows us to do things fast.
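To make the index idea concrete, here is a minimal sketch in Python. It is only an illustration: the record layout, the object names and the function names are hypothetical and do not come from any real telescope catalogue, and a plain in-memory lookup table stands in for the far more elaborate index structures used in real systems. It contrasts looking at every line with jumping straight to the galaxies.

```python
from collections import defaultdict

# Hypothetical observation records: (object_id, object_type, observed_at).
observations = [
    ("obj-1", "star", 10),
    ("obj-2", "galaxy", 11),
    ("obj-3", "star", 12),
    ("obj-4", "galaxy", 13),
    ("obj-5", "quasar", 14),
]

def find_by_scan(rows, wanted_type):
    """Look at every row; the work grows with the total number of rows."""
    return [row for row in rows if row[1] == wanted_type]

def build_index(rows):
    """Build a lookup from object type to the positions of the matching rows."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[1]].append(position)
    return index

def find_by_index(rows, index, wanted_type):
    """Jump straight to the matching rows; the work grows with the number of matches."""
    return [rows[position] for position in index.get(wanted_type, [])]

# The index is built once and then reused for every question the astronomer asks.
type_index = build_index(observations)
galaxies = find_by_index(observations, type_index, "galaxy")
```

Both functions return the same rows. The difference is that the scan touches all the data on every question, while the indexed lookup touches only the rows it returns.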
Going back to medicine, we started a proposal for a project recently, headed by EPFL. This project encompasses about 165 PIs. It is called the Human Brain Project. It is about simulating the human brain, being able to harness the data, starting from the molecules, all the way up to the brain and cognition – going through neurons, synapses and micro and meso and macro circuits, all the way up to the whole picture of the brain – then being able to harness all of that data and understand how the brain works.
This is, by and large, a data integration problem. We need to be able to integrate data knowledge through patterns and routes, and through models that we have, to understand how the brain works. As a very simple example, imagine a neuroscientist looking at this picture, seeing something interesting, and wanting to drag his or her mouse around the area that they are interested in and go directly in there and get more information – go deeply in there and get more information. That is not even possible if you have to go through the entire set of data to figure out where this area is and get the data that is needed to show the neuroscientist. If I take too long to do that, the neuroscientist is just going to lose context. He or she won’t even remember why they asked for that piece of data. I have to do this instantaneously. That is very important. That is why efficient data management is key here.
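One way to picture answering such a region request without going through the entire data set is a simple grid index, sketched below in Python. This is an assumption made for illustration, not the method used in the project: the cell size, the two-dimensional coordinates, the neuron names and the function names are all hypothetical.

```python
from collections import defaultdict

CELL_SIZE = 10.0  # hypothetical grid cell width, in the same units as the coordinates

def cell_of(x, y):
    """Return the grid cell (column, row) that contains the point (x, y)."""
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))

def build_grid(points):
    """Place every (name, x, y) point into the bucket of its grid cell."""
    grid = defaultdict(list)
    for name, x, y in points:
        grid[cell_of(x, y)].append((name, x, y))
    return grid

def query_region(grid, x_min, y_min, x_max, y_max):
    """Visit only the cells that overlap the selected rectangle."""
    first_col, first_row = cell_of(x_min, y_min)
    last_col, last_row = cell_of(x_max, y_max)
    found = []
    for col in range(first_col, last_col + 1):
        for row in range(first_row, last_row + 1):
            for name, x, y in grid.get((col, row), []):
                # A cell may stick out of the rectangle, so check each point exactly.
                if x_min <= x <= x_max and y_min <= y <= y_max:
                    found.append(name)
    return sorted(found)

# Hypothetical neuron positions and a rectangle dragged by the neuroscientist.
neurons = [("n1", 2.0, 3.0), ("n2", 15.0, 18.0), ("n3", 17.5, 12.0), ("n4", 48.0, 51.0)]
grid = build_grid(neurons)
selected = query_region(grid, 12.0, 10.0, 20.0, 20.0)  # ["n2", "n3"]
```

The work done by the query depends on the size of the selected area and how many points fall inside it, not on how many points exist elsewhere in the data.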
So, bridging is the key. Communication is the key, communication between computer science and all the other domain sciences. The brain scientist has to tell me what they need in order for me to be able to provide the appropriate methods for them to find it – the astronomer the same thing. Interdisciplinary research is a very important thing in itself, but it is also needed in order to bridge the gap between data collection and data processing.
I am going to end with a duality of my science, which is essential – its service nature to other sciences, and at the same time its introspective nature to cope with the growth of the underlying technology. Building algorithms to do work is a difficult thing. You have to build a sequence of steps, describe what you need to do and then get to the target.
Consider cooking, for example. Cooking is a tedious process for many. It is an enjoyable process for a lot of people. A chef who wants to cook a seven-course meal has a timeframe. Now, this seven-course meal requires ten hours to do. The chef says: “I need ten hours.” The restaurant owner says: “That is not good enough; I need it in fifteen minutes. I am going to give you 40 sous chefs who are going to do it for you in due time, because 10 hours divided by 40 gives us 15 minutes. So, just divide the work among them.” Isn’t that simple? Well, that is not really very simple, right? The reason why it is not simple is that you cannot really divide the work just like that. The flavours don’t mix if you don’t mix them correctly, if you just divide the ingredients around. But there is another even subtler problem. Say that you are smart as a chef and that you divided the work correctly. What if there is only one saltshaker? Putting salt on the dish is a process that takes a split second, but putting salt when 40 people are waiting to put salt basically makes 39 people wait. So, all this parallelism doesn’t work. Critical resources have to be used correctly. That is a very difficult problem, because a human brain is used to working through a sequence of steps.
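The arithmetic of the kitchen can be sketched in a few lines of Python. The first function is the restaurant owner's calculation. The second is a simple Amdahl's-law-style estimate, which is an assumption added here for illustration and not something from the talk: it supposes that some part of the work needs the single saltshaker and therefore cannot be shared out. The figure of 20 minutes for that part is made up.

```python
def ideal_minutes(total_minutes, workers):
    """The restaurant owner's estimate: all work divides evenly among the workers."""
    return total_minutes / workers

def minutes_with_shared_step(parallel_minutes, shared_minutes, workers):
    """Amdahl-style estimate: only the parallel part shrinks as workers are added.

    shared_minutes is work that needs the one shared resource (the saltshaker),
    so it is done one worker at a time no matter how many workers there are.
    """
    return parallel_minutes / workers + shared_minutes

total = 10 * 60  # the ten-hour meal, in minutes

owner_estimate = ideal_minutes(total, 40)  # 15.0 minutes

# Hypothetical split: 580 minutes divide cleanly, 20 minutes need the saltshaker.
realistic_estimate = minutes_with_shared_step(580, 20, 40)  # 34.5 minutes
```

Under these made-up numbers, a shared step that is only a small fraction of the total work more than doubles the fifteen-minute estimate, and adding more sous chefs can never bring the meal under the 20 minutes spent at the saltshaker.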
This is what a computer looks like today. It is a highly parallel machine, and that is why we need to get past that hurdle. Finally, data-driven science is what we have today. We are bridging computer science with all the domain sciences to harness data into useful information. Thank you.