Crowdsourcing Weather Data With Amazon Redshift

Have you ever tried setting up a personal weather station? Collecting digital weather data in your backyard or on your rooftop has become an easy thing to do, and sharing it is just as easy: the Citizen Weather Observer Program (CWOP) will transmit your backyard data to NOAA to help with forecasts. All you have to do is say yes during station setup!

A Niagara of interesting weather data has poured out of CWOP. You can see what other people are sending in real time at http://wxqa.com/shortform.html. NOAA provides an overview: https://madis-data.ncep.noaa.gov/MadisSurface/. (CWOP stations are labeled APRSWXNET.)

The problem is archiving the full variety of what gets sent through CWOP. The basic, typical variables are archived at NOAA, but rarer observations, such as solar radiation measurements, were not included in archiving programs at the outset. Yet there is an opportunity that should not be missed: an archive of solar radiation data could help support solar energy development. Volunteers started saving CWOP’s solar radiation observations in February 2009, zipping up the data every few weeks and posting the files on Google Drive. The raw data archive can be searched at http://wxqa.com/lum_search.htm.

To make access easier, the volunteer team then developed an Amazon Redshift archive, presenting the data in scientific units instead of APRS codes. Storing lots of raw data is one of the typical use cases for Amazon Redshift. The Redshift archive also includes supplementary information. Most important is a model “clear-sky” value of solar radiation: one model value to accompany each observation. Comparison readily highlights data problems.

The Redshift database now holds 500 million observations, each with 39 fields (weather, location, time, model value, and so on). Redshift compression fits all of that into about 50 GB, which runs on a single dc2.large node.

The data flow to the archive begins with the raw data stored in Google Drive. New files are downloaded from Google Drive to an EC2 instance running R, where the raw data is parsed. The parsed files are then uploaded to S3 and copied from there into Redshift.
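As a rough sketch of the EC2 step, this is how the parse-and-upload stage might look in R using the aws.s3 package. The file names, bucket name, and parse_cwop() are all illustrative stand-ins, not the archive's actual code:

```r
# Sketch of the EC2-side step: parse a raw CWOP file in R,
# write a clean CSV, and push it to S3 ready for COPY.
library(aws.s3)

raw_lines <- readLines("lum_20171001.txt")
parsed <- parse_cwop(raw_lines)   # hypothetical stand-in for the real parser
write.csv(parsed, "lum_20171001.csv", row.names = FALSE)

# Upload the parsed file to S3 (bucket name is hypothetical)
put_object(
  file   = "lum_20171001.csv",
  object = "parsed/lum_20171001.csv",
  bucket = "cwop-solar-archive"
)
```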

Below is an example of the COPY command for the S3-to-Redshift upload. It is a somewhat tolerant upload, useful for working with crowdsourced data. The issue is that the raw data is a little glitchy – a few unexpected EOFs in every million observations – but we want to include it as one column in the archive. So the COPY command tells Redshift to eliminate strange characters, replacing each with a marker so we can identify these edits later.
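A minimal sketch of such a command, with illustrative table, bucket, and IAM role names (the archive's real names may differ). ACCEPTINVCHARS is the Redshift COPY option that swaps each invalid character for a chosen marker, and MAXERROR lets a handful of bad rows pass without aborting the load:

```sql
-- Tolerant load: invalid UTF-8 bytes become '^' markers instead of
-- killing the COPY; up to 100 unparseable rows are skipped.
COPY solar_obs
FROM 's3://cwop-solar-archive/parsed/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
CSV
GZIP
ACCEPTINVCHARS AS '^'
MAXERROR 100;
```

Searching the loaded rows for the marker character later makes it easy to see exactly which observations were edited on the way in.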

The database structure is just four tables.

(i) The main table contains a half billion observations.

(ii and iii) Two tables provide smaller, random data selections: one million rows and ten million rows. This lets users work with smaller datasets.

(iv) Last is a table of station-day summaries. Each line includes the station name, a date, maximum and minimum L (the solar radiation reading), the number of observations that day, and how they compared to model “clear-sky” values on average. A sketch of what this table might look like follows the list.
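As a hypothetical sketch – column names, types, and key choices here are illustrative, not the archive's actual DDL – the summary table might look like this, and the fixed random samples in (ii) and (iii) could be built once with a CTAS:

```sql
-- Illustrative DDL for the station-day summary table.
CREATE TABLE station_day_summary (
    station        VARCHAR(16),
    obs_date       DATE,
    max_lum        REAL,      -- daily maximum solar radiation
    min_lum        REAL,      -- daily minimum solar radiation
    n_obs          INTEGER,   -- observations received that day
    avg_obs_model  REAL       -- mean ratio of observed to clear-sky model
)
DISTKEY (station)
SORTKEY (obs_date);

-- One way the one-million-row random sample table could be built:
CREATE TABLE solar_obs_sample_1m AS
SELECT * FROM solar_obs ORDER BY RANDOM() LIMIT 1000000;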

The SQL sample below shows how the summary table is used to select data of a certain quality – in this case, the user is selecting from a certain latitude band on a certain day:
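A sketch of such a query, assuming the illustrative table and column names from above: the summary table supplies the quality filter (station-days whose observations stayed close to the clear-sky model), and the main table supplies the matching observations.

```sql
-- Pull observations from a latitude band on one day, keeping only
-- station-days whose average agreed with the clear-sky model to ~10%.
SELECT o.station, o.obs_time, o.latitude, o.luminosity, o.clear_sky
FROM solar_obs o
JOIN station_day_summary s
  ON s.station  = o.station
 AND s.obs_date = TRUNC(o.obs_time)
WHERE TRUNC(o.obs_time) = '2017-06-21'
  AND o.latitude BETWEEN 30.0 AND 40.0
  AND s.avg_obs_model BETWEEN 0.9 AND 1.1   -- quality filter
ORDER BY o.station, o.obs_time;
```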

In the following snippet, an R user connects to the database within R, and draws a graph:
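A minimal sketch, assuming the DBI and RPostgres packages (Redshift speaks the PostgreSQL wire protocol, so a Postgres driver generally works) and placeholder connection details, station name, and column names:

```r
# Connect to the Redshift cluster from R, pull one station-day of
# observations, and plot observed solar radiation against the model.
library(DBI)
library(RPostgres)

con <- dbConnect(
  RPostgres::Postgres(),
  host     = "example-cluster.abc123.us-east-1.redshift.amazonaws.com",
  port     = 5439,
  dbname   = "wxarchive",
  user     = "wxuser",
  password = Sys.getenv("REDSHIFT_PASSWORD")
)

obs <- dbGetQuery(con, "
  SELECT obs_time, luminosity, clear_sky
  FROM solar_obs
  WHERE station = 'DW1234'
    AND TRUNC(obs_time) = '2017-06-21'
  ORDER BY obs_time")

# Observed radiation as a solid line, clear-sky model as a dashed line
plot(obs$obs_time, obs$luminosity, type = "l",
     xlab = "Time (UTC)", ylab = "Solar radiation (W/m^2)")
lines(obs$obs_time, obs$clear_sky, lty = 2)

dbDisconnect(con)
```

Gaps or dips below the dashed clear-sky curve make data problems (or clouds) easy to spot at a glance.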

Updates of the archive are quarterly. They would happen more often if it were not so awkward to move data from Google Drive to an EC2 instance via the command line. That is about to become a lot easier: there is a new R package, googledrive, that speeds up command-line access to Google Drive from R.
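A sketch of how the download step might look with googledrive (the file-name pattern is illustrative):

```r
# Find recent archive zips in Google Drive and download them
# to the EC2 instance for parsing.
library(googledrive)

drive_auth()   # interactive OAuth the first time it runs
zips <- drive_find(pattern = "\\.zip$", n_max = 25)
for (i in seq_len(nrow(zips))) {
  drive_download(zips[i, ], overwrite = TRUE)
}
```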

So there is our Redshift archive. At the moment it is open a few hours a day and always available to be cloned. If you clone the data and spin up your own cluster, make sure you first read up on the 3 things to avoid when setting up an Amazon Redshift cluster. We continue to develop ways to help users select the best data for their applications. Most importantly, while we potter along adding data selection tools, the archive of donated observations grows larger. It is becoming a uniquely valuable storehouse of surface observations of solar radiation. We hope it will support solar energy development. Everyone is welcome to use it.
