Crowdsourcing Weather Data With Amazon Redshift
Have you ever tried setting up a personal weather station? Collecting digital weather data in your backyard or on your rooftop has recently become an easy thing to do. Sharing it is easy also: the Citizen Weather Observer Program (CWOP) will transmit your backyard data to NOAA to help with forecasts. All you have to do is say yes during station setup!
A Niagara of interesting weather data has poured out of CWOP. You can see what other people are sending in real time at http://wxqa.com/shortform.html. NOAA provides an overview: https://madis-data.ncep.noaa.gov/MadisSurface/. (CWOP stations are labeled APRSWXNET.)
The problem is archiving all the variety of what gets sent through CWOP. The basic, typical variables are archived at NOAA, but rarer observations such as solar radiation measurements were not included in archiving programs at the outset. Yet there is an opportunity that should not be missed: an archive of solar radiation data could help support solar energy development. Volunteers started saving CWOP’s solar radiation observations in February 2009, zipping up the data every few weeks and posting the files at Google Drive. This raw data archive is selectable at http://wxqa.com/lum_search.htm.
Download the Top 14 Performance Tuning Techniques for Amazon Redshift
To make access easier, the volunteer team then developed an Amazon Redshift archive, presenting the data in scientific units instead of APRS codes. Storing lots of raw data is one of the typical use cases for Amazon Redshift. The Redshift archive also includes supplementary information. Most important is a model “clear-sky” value of solar radiation: one model value to accompany each observation. Comparison readily highlights data problems.
The Redshift database is now 500 million observations, each with 39 fields (weather, location, time, model value and so on). Redshift compression fits all that into about 50 GB, one dc2.large node.
The data flow to the archive begins with the raw data stored in Google Drive. New files from Google Drive are downloaded into an EC2 instance running R. There the raw data is parsed. The raw files are uploaded to S3 and from there they are copied to Redshift.
Below is an example of the COPY command for the upload S3 to Redshift. This is a somewhat tolerant upload useful for working with crowdsourced data. The issue is that the raw data is a little bit glitchy – a few unexpected EOFs in every million observations – but we want to include it as one column in the archive. So the COPY command tells Redshift to eliminate strange characters, replacing each with a marker so we can identify these edits later.
COPY cwoparchive FROM 's3://cwop/lpic/LPIC220171231.txt'
NULL 'NA' DELIMITER '\t' ACCEPTINVCHARS AS '^';
The database structure is just four tables.
(i) The main table contains a half billion observations.
(ii and iii) Two tables provide smaller, random data selections: one million rows and ten million rows. This lets users work with smaller datasets.
(iv) Last is a table of station-day summaries. Each line includes station name, a date, maximum and minimum L, the number of observations that day, and how they compared to model “clear-sky” values on average.
The SQL sample below shows how the summary table is used to select data of a certain quality – in this case, the user is selecting from a certain latitude band on a certain day:
select * from public.cwoparchive c, public.stationdays s
where c.stationname = s.stationname and
c.latitude = s.latitude and
c.longitude = s.longitude and nobs > 100 and c.latitude > 40.
and c.latitude < 45 and z='2010-01-01' and quality < 6;
In the following snippet, an R user connects to the database within R, and draws a graph:
cwop_db<-src_postgres(host='my-cluster-cwop-solar.cix687k0xvpq.us-west-2.redshift.amazonaws.com', port='5439', dbname='dev',
cwoparchive <- tbl(cwop_db, "cwoparchive")
millionpoints <- tbl(cwop_db, "millionpoints")
tenmillionpoints <- tbl(cwop_db, "tenmillionpoints")
group_by (archivedate) %>%
p2+xlab("")+ylab("")+ggtitle("number of stations")
Updates of the archive are quarterly. They would be more often if it were not so awkward to move data from Google Drive to an EC2 via command line. It appears this is about to become a lot easier: there is a new package for R, googledrive, that speeds command line access to Google Drive from R.
So there is our Redshift archive. At the moment it is open a few hours a day and always available to be cloned. If you clone the data and spin up your own cluster, make sure you first read up on the 3 things to avoid when setting up an Amazon Redshift cluster. We continue to develop ways to help users select the best data for their applications. Most importantly, while we potter along adding data selection tools, the archive of donated observations grows larger. It is becoming a uniquely valuable storehouse of surface observations of solar radiation. We hope it will support solar energy development. Everyone is welcome to use it.