Friday, April 26, 2013

Techniques for Displaying Temperature for Large Data Sets

In our project we are looking for ways to display temperature for the whole Earth. Although the problem seems simple at first, we are struggling to preserve the fidelity of the data in sets with a high volume of data points. We are currently using the heatmap layer from the Google Maps API to display the points. However, this type of graphical representation is better suited for showing the density of points than for showing a range of temperature values. An example is provided.


We are going to keep looking for ways to generate a rasterized image of global temperature at different resolutions. In the meantime, we also want to test different chart representations of the datasets, for example with D3 (http://d3js.org/).
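
Here is a minimal sketch of the rasterization idea, assuming the temperatures are already available as a 2D NumPy array on a regular lat/lon grid (the array file name and output names are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 0.25-degree grid: rows = latitude, columns = longitude.
temps = np.load('temps_20120101_0000.npy')   # assumed shape, e.g. (600, 1440)

# Render the grid as an image; striding gives coarser resolutions of the same data.
for step, name in [(1, 'full'), (2, 'half'), (4, 'quarter')]:
    plt.imsave('temperature_%s.png' % name, temps[::step, ::step],
               cmap='coolwarm', origin='upper')
```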

Sunday, March 17, 2013

Heavy-Duty ETL: Extraction, Transformation, and Load

These days, many people are talking about Big Data. However, very few talk about how big Big Data really is, or about all the different components that need to be considered before, during, and after running a Big Data system. Many don't know how long it takes to extract the data. Others forget that they need to validate it. And only a few mention the loading tools they use to speed up the process of loading millions of rows into a system (not to mention normalizing the data while loading it). Therefore, I thought our "little" project could be of interest. This post explains the different issues you need to consider before implementing a Big Data system.

Data Volumes
The amount of data that we use for the Climate Viz project is astronomical. Just to give an idea, we are collecting temperature data from satellites at 0.25-degree resolution, which means that 864,000 data points are collected every 3 hours. At the end of just one day, we get close to 7 million data points. Just for one day! And that is before counting the transformations and the aggregated statistics that we also need for the project.
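
As a quick back-of-the-envelope check of those numbers, assuming the grid covers 360 degrees of longitude and 150 degrees of latitude (GLDAS covers roughly 60S to 90N; the exact extent is an assumption here):

```python
# 0.25-degree grid over an assumed 360 x 150 degree extent.
lon_points = int(360 / 0.25)                      # 1440
lat_points = int(150 / 0.25)                      # 600
points_per_snapshot = lon_points * lat_points     # 864,000
points_per_day = points_per_snapshot * 8          # one snapshot every 3 hours
print(points_per_snapshot, points_per_day)        # 864000 6912000
```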

Extraction
Every project comes with dirty work, and extracting and loading the files is the dirty part of this project. Visualizing the data on maps and writing the UI controls is the easy part. But getting all the data in and validating it, that's where the pain begins.

We use a two-step process for extracting the files from NASA Giovanni. Setting up a wget script to download the GRIB files was easy. We then process them with pygrib and slice them so that they are ready for GAE.
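
A minimal sketch of that first step, driving wget from Python and sanity-checking the download with pygrib (the URL is a placeholder, not the real Giovanni path):

```python
import subprocess
import pygrib

# Placeholder URL; the real Giovanni path and file naming are different.
URL = 'http://example.nasa.gov/data/GLDAS_NOAH025SUBP_3H.A20120101.0000.grb'
FILENAME = URL.rsplit('/', 1)[1]

subprocess.check_call(['wget', '-nc', URL])   # -nc: skip files already downloaded

# Quick sanity check that the file opens and contains GRIB messages.
grbs = pygrib.open(FILENAME)
print(grbs.messages, 'messages in', FILENAME)
grbs.close()
```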

For data that is never going to change, we generate master tables. In our case, latitude and longitude are fixed for every resolution and can be generated programmatically. Two for loops (x and y) and you think you are done, but when you are dealing with Big Data that translates into several troubles. First, the loops can take so long that you hit timeouts in the GET and POST handlers of the web layer. The same thing can happen even if you use a dummy GET handler to generate all the different tasks through the task queue.
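
A rough sketch of how the master-table generation can be split into task-queue batches on GAE so no single request hits the deadline; the handler URLs, entity, and grid extent are assumptions, not our actual code:

```python
from google.appengine.api import taskqueue
from google.appengine.ext import db, webapp

class GridPoint(db.Model):
    """Hypothetical master-table entity for one lat/lon cell."""
    lat = db.FloatProperty()
    lon = db.FloatProperty()

class EnqueueHandler(webapp.RequestHandler):
    """Fast GET that only enqueues work: one task per latitude row."""
    def get(self):
        for row in range(600):                        # 0.25-degree rows, 60S..90N assumed
            taskqueue.add(url='/tasks/gen_row', params={'row': row})

class GenRowHandler(webapp.RequestHandler):
    """Each task writes a single latitude row, staying well under the deadline."""
    def post(self):
        row = int(self.request.get('row'))
        lat = -60 + row * 0.25
        points = [GridPoint(lat=lat, lon=-180 + col * 0.25) for col in range(1440)]
        for i in range(0, len(points), 400):          # batch puts to respect datastore limits
            db.put(points[i:i + 400])
```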

Validation
Once you start dealing with data collected from different sources, the first need is to validate it. Most likely you would like a visualization tool that compares and contrasts your data. Unfortunately, you can't use Excel, because older versions have a limit of 65,536 rows (far too small for big data sources).

Thus, we are left building our own tools to validate our data. Think about it, because it makes a big difference. Google Maps was a useful interface for us, and Google Drive is another option for loading big files.

Transfer 
Of course, transferring data from the source to the destination is another issue. If you have a regular ETL tool, your problem is solved. The rest of us have to deal with several constraints. For each POST there is a limit on both the size of the file being sent and the time it takes to process it. I had to experiment with how fast GAE could process the files: I started with 15,000 lines per file and had to go down to 200, because 5K, 2K, and even 800 lines still produced timeout errors. Luckily, I was generating the files with a test script, but it is not always easy to regenerate data.
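
A minimal sketch of the slicing step, assuming plain CSV input; the 200-line chunk size comes from the experiments above, and the extra order column (mentioned further below) is what makes later aggregation possible. File names are placeholders:

```python
import csv

CHUNK = 200  # lines per uploaded file, found empirically above

def write_part(prefix, part, rows):
    with open('%s_%04d.csv' % (prefix, part), 'w') as dst:
        csv.writer(dst).writerows(rows)

def slice_file(path, prefix):
    """Split a big CSV into CHUNK-sized files, tagging each row with its line order."""
    with open(path) as src:
        chunk, part = [], 0
        for order, row in enumerate(csv.reader(src)):
            chunk.append([order] + row)
            if len(chunk) == CHUNK:
                write_part(prefix, part, chunk)
                chunk, part = [], part + 1
        if chunk:
            write_part(prefix, part, chunk)

slice_file('temperatures_2012.csv', 'part')   # placeholder input file
```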

I also wrote a script to scan the files and post them automatically (using multipart uploads). Even with a queue of 40 tasks per second, processing the files is agonizingly slow (5 seconds per 100 lines), and it could probably take several months to load everything. :(
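
A minimal sketch of such an uploader, assuming the requests library and a hypothetical /upload handler on the GAE side (the endpoint and form field name are assumptions):

```python
import glob
import requests

UPLOAD_URL = 'http://example.appspot.com/upload'   # hypothetical endpoint

# Post every sliced file as a multipart upload, in order.
for path in sorted(glob.glob('part_*.csv')):
    with open(path, 'rb') as f:
        resp = requests.post(UPLOAD_URL, files={'datafile': f})
    resp.raise_for_status()
    print('uploaded', path)
```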

Another trick I used was to slice the files before uploading them. Each file contains a line-order index that helps me aggregate the data later.

Conclusion
I hope the ideas presented in this post help you with your Big Data projects.


Monday, January 21, 2013

Geospatial queries

After working on decoding the GLDAS files, our next step is optimizing the geospatial queries for the project. On one hand, we need the right UI controls as part of the map interface. On the backend, we need the ability to run the bounding-box queries required for the different resolutions.

The first part is under control: we are able to detect the zooming behavior through the Google Maps API in the JavaScript code.

For the latter part we are going to use Geomodel, a nice open-source project, to do the geospatial work.
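
A rough sketch of how the bounding-box queries could look with Geomodel on GAE; the entity, property names, and call signatures follow the library's examples but are assumptions here, not our final code:

```python
from google.appengine.ext import db
from geo import geomodel, geotypes   # modules shipped with the Geomodel project

class TemperaturePoint(geomodel.GeoModel):
    """Hypothetical entity: GeoModel adds the location and geocell properties."""
    temperature = db.FloatProperty()

def save_point(lat, lon, kelvin):
    point = TemperaturePoint(location=db.GeoPt(lat, lon), temperature=kelvin)
    point.update_location()   # recompute geocells before saving
    point.put()

def points_in_view(north, east, south, west):
    """Fetch the points inside the current map viewport."""
    query = TemperaturePoint.all()
    box = geotypes.Box(north, east, south, west)
    return geomodel.GeoModel.bounding_box_fetch(query, box, max_results=1000)
```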

Thursday, January 10, 2013

Validating our data points

Happy new year 2013!

During the holiday weeks we validated the data points extracted from the GRIB files.

Giuseppe and Mario took the average temperature for the month of June and compared it against the city where they live. To do that, they found the latitude and longitude of the city and looked it up in the spreadsheet. So far the temperatures make sense.

In the meantime, Jyothi has been mastering the Google Maps API event handlers for zooming in and out of the maps. This will help us reload the heatmap layer with different resolutions depending on the zooming behavior. The idea is to have sets of data points at different resolutions.

Friday, November 16, 2012

Our First Map

We are happy to announce that we have our first Climate Visualization Map:

http://innova-t.appspot.com/marovi/listPoints/

The Process
We have created a semi-automatic process. We download the GRIB file to a local Linux box and then process it to extract:
  • initial timestamp
  • latitude and longitude
  • temperature in Kelvin
The Excel file that is generated is the input to the GAE datastore.
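
A minimal sketch of that extraction, assuming pygrib and a file name that encodes the timestamp; the output here is a plain CSV standing in for the spreadsheet, and the file name and message index are placeholders:

```python
import csv
import pygrib

GRIB_FILE = 'GLDAS_NOAH025SUBP_3H.A20120101.0000.grb'   # placeholder name
TIMESTAMP = '2012-01-01T00:00'                           # derived from the file name

grbs = pygrib.open(GRIB_FILE)
grb = grbs[1]                        # message index depends on the product
values = grb.values                  # 2D array of temperatures in Kelvin
lats, lons = grb.latlons()
grbs.close()

with open('points.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['timestamp', 'lat', 'lon', 'kelvin'])
    rows, cols = values.shape
    for i in range(rows):
        for j in range(cols):
            writer.writerow([TIMESTAMP, lats[i, j], lons[i, j], values[i, j]])
```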

Styles and Heatmap Layer
Maps are only good if they display what we want clearly. That's why we were inspired by several talks from the Google Maps developer team and included some of their recommendations in our maps. Of course, we turned on the heatmap layer, which is part of the Google Maps API. We configured the terrain map type and inverted the lightness to display water masses and countries more clearly, and we set a custom opacity. The data points were also important: we turned off dissipation for a better rendering of the temperature.

And of course, we used the Style Map Wizard for testing our styles.

Content
The map displays only 1,000 data points from a set of 150K temperature measurements taken during the first four months of 2012. We plan to integrate more than 30 years' worth of data, so this is a good start. We were able to validate that the coordinates are correct (temperatures are displayed on top of land masses) and that they make sense. The points displayed show that Australia is warmer than America, probably due to the hour of the day.

Monday, November 5, 2012

Understanding GRID files

Abstract
Notes on interpreting the GRIB and HDF-EOS (HDF5) file formats.

Grid Data
Our files are mostly grids, which means they represent data points taken at some moment in time and contain a set of projection equations. We do not care that much about the particular projection used (Mercator, for example) because libraries do the hard work for us. However, we are interested in how to extract the actual value along with its latitude and longitude. I will try to explain what I did with the files so I don't forget in the future.

Basics
There are three important features of a grid file: data fields, dimensions, and the projection. The data fields are rectilinear arrays of two or more dimensions and are related to each other by common geolocation.

Dimensions are used to relate data fields to each other and to the geolocation information. To be interpreted properly, each data field must make use of two predefined dimensions "XDim" and "YDim".

Projections are used to encode and interpret geolocation. A projection provides a convenient way to encode geolocation information as a set of mathematical equations capable of transforming earth coordinates into x-y coordinates on a sheet of paper.

Most data sources are websites like Giovanni, which provide GRIB or HDF-EOS files. We use those file formats to extract the value we are interested in: average surface temperature.

Extracting Values
Thanks to libraries like pygrib we are able to read the GRIB files and extract:
  • lat, lng
  • measurement

The files that we are currently processing are taken at 3-hour intervals, so we just have to take the first measurement and iterate forward. The date is given by the file name.

You can read the "headers" of the file, which tell you which index to use to extract the measurement of interest.
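
A small sketch of that inspection step with pygrib; the file name is a placeholder, and the field name used in the select call is an assumption that depends on the product:

```python
import pygrib

grbs = pygrib.open('sample.grb')   # placeholder file name

# Print every message "header" so we can spot the field we want and its index.
for grb in grbs:
    print(grb.messagenumber, grb.name, grb.units)

# Once the name is known, messages can also be selected directly by name.
grbs.rewind()
matches = grbs.select(name='Surface temperature')   # field name is an assumption
print(matches[0].values.shape)
grbs.close()
```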

Saturday, October 20, 2012

Data Extraction Plans

This week we are planning to work on extracting the data from the GRIB files. After reviewing the file spec, we are going to use the average temperature over the last 3-hour period.

We also reviewed the Google Maps API, in particular the heatmap layer.

The Python library we are going to use is pygrib. It requires several other libraries, such as numpy, in order to work.

We are starting to develop the data model that will allow us to extract the data along the following dimensions (a sketch of the model follows after the lists below):
  • lat/lon
  • city (for the United States): we are going to get an approximation
  • timestamp when the data point was collected
  • month, week, day, hour (averaged values)
Regarding the data source, some important points:
  • each coordinate has its US state (the resolution is about 10 km)
  • the time period is an instant (every 3 hours)
  • it is the average surface temperature, code 138, from the GLDAS model
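
Here is a rough sketch of what such a model could look like as GAE datastore entities; the class and property names are hypothetical, not the project's actual schema:

```python
from google.appengine.ext import db

class TemperatureReading(db.Model):
    """Hypothetical entity for one extracted data point."""
    lat = db.FloatProperty(required=True)
    lon = db.FloatProperty(required=True)
    city = db.StringProperty()           # approximate, United States only
    state = db.StringProperty()          # US state for the coordinate
    timestamp = db.DateTimeProperty()    # instant the data point was collected
    kelvin = db.FloatProperty()          # average surface temperature (code 138)

class TemperatureAggregate(db.Model):
    """Hypothetical pre-computed averages by month, week, day, or hour."""
    lat = db.FloatProperty()
    lon = db.FloatProperty()
    period = db.StringProperty(choices=('month', 'week', 'day', 'hour'))
    period_start = db.DateTimeProperty()
    kelvin_avg = db.FloatProperty()
```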

We are also reviewing the use of Fusion Tables for displaying the geodata.

We'll keep you posted.