domingo, 17 de marzo de 2013

Heavy duty ETL: Extraction Transformation and Load

These days, many people are talking about Big Data. However, very few talk about how big is Big Data and about all the different components that need to be considered prior, during, and after running a system based on Big Data. Many don't know the time it takes to extract the data. Others forget that they need to validate it. And only a few mention the loading tools they use to speed up the process of loading millions of rows into a system (not to mention normalizing the data while you are loading it). Therefore, I thought our "little" project could be of interest. This post will explain all the different issues you need to consider before implementing a system with Big Data.

Data Volumes
The amount of data that we use for the Climate Viz project is astronomical. Just to give an idea, we are collecting temperature data from satellites at a 0.25 resolution, which means that there are 864,000 data points being collected every 3 hours. At the end of just one day, we get close to 7 million data points. Just for one day! And we haven't mentioned transformation and other data aggregation statistics that we need for the project.

Extraction
Every project always comes with dirty work. And extracting and loading the files is the dirty part of this project. Visualizing the data in maps and writing the UI controls is the easy part. But getting all the data in and validating it, that's where the pain begins.

We use a two-step process for extracting the files from NASA Giovanni. Setting up a wget script that downloads the GRIB files was easy. Then we process them with pygrib and slice them in order to have them ready for GAE.

For data is never going to change, we generate master tables. In our case, latitude and longitude are fixed for every resolution and can be generated programatically. Two for loops ( x and  y ) and you think you are done but, when you are dealing with Big Data, that translates into several troubles. First consider that the loops could take so long that you could reach timeouts as part of the request and post methods of the web layer. Then, the same thing could happen even if you make a dummy get method to generate all the different tasks using the queue.

Validation
Once you start dealing with data collected from different sources, the first need is to validate them. Most likely you would like to have a visualization tool that compares and contrasts your data. Unfortunately, you can't use excel because it has a limit of 65,535 rows (far too limited for big data sources).

Thus, we are left with building our own tools to validate our data. Think about it because it makes a big difference. Google Maps was a useful interface for us. Also, Google drive is another option for loading big files.

Transfer 
Of course, transferring data from your source to the destination is another issue. If you have a normal ETL tool, then your problem is solved. However, the rest of us have to deal with several issues.  For each POST, there is a limit in both the size of the file we are sending as well as the time it takes to process it. I had to play around with how fast GAE was processing the files. I started with 15000 lines per file and had to go down to 200 lines. Otherwise, I received a timeout error in the operation for 5K, 2K, and 800. Luckily, I was generating files with a test script but at times is not easy to generate data.

I also generated a script to scan the files and post them automatically (using multi-part). Even with a queue of 40 tasks a second, the time to process the files is agonisingly slow (5 seconds per 100 lines) and it could probably take several months to load. :(

Another trick that I used was to slice the files to upload.  Each file contained a line order that will help me aggregate the data later.

Conclusion
I hope the ideas exposed in this post will help you in your Big Data projects.