GeoWave EMR Quickstart Guide
The GeoWave EMR Quickstart Guide is similar to the standard Quickstart Guide, except that it is run in an Amazon EMR environment. Amazon EMR is a platform that simplifies the creation and management of multi-node clusters. There are also Jupyter and Zeppelin Notebook examples for users looking to try out GeoWave in that manner.
Environment Setup
See the AWS Environment Setup Guide for setting up an EMR cluster to use with this guide.
Preparation
Install GeoWave
This guide assumes that GeoWave has already been installed and is available on the command-line. See the Installation Guide for help with the installation process.
Create Working Directory
In order to keep things organized, create a directory on your system that can be used throughout the guide. The guide will refer to this directory as the working directory.
$ mkdir quickstart
$ cd quickstart
Download Sample Data
We will be using data from the GDELT Project in this guide. For more information about the GDELT Project, please visit the GDELT Project website.
Download one or more ZIP files from the GDELT Event Repository into a new gdelt_data folder in the working directory. The examples in this guide use all of the data from February 2016 (201602 prefix).
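The download step can be scripted. The sketch below only prints one URL per day of a month. The base URL is an assumption about where the GDELT Event Repository serves its files, the file naming pattern (YYYYMMDD.export.CSV.zip) follows the ZIP names used in this guide, and the function names are made up for illustration.

```shell
# Base URL of the GDELT event files (assumption, override if it differs).
BASE_URL="${BASE_URL:-http://data.gdeltproject.org/events}"

# Print the number of days in a month: days_in_month YEAR MONTH
days_in_month() {
  month=${2#0}   # drop a leading zero, e.g. 02 -> 2
  case $month in
    4|6|9|11) echo 30 ;;
    2)
      # Leap year: divisible by 4 and not by 100, or divisible by 400.
      if [ $(($1 % 4)) -eq 0 ] && [ $(($1 % 100)) -ne 0 ] || [ $(($1 % 400)) -eq 0 ]; then
        echo 29
      else
        echo 28
      fi ;;
    *) echo 31 ;;
  esac
}

# Print one URL per day: gdelt_urls YEAR MONTH (MONTH as two digits)
gdelt_urls() {
  last_day=$(days_in_month "$1" "$2")
  day=1
  while [ "$day" -le "$last_day" ]; do
    printf '%s/%s%s%02d.export.CSV.zip\n' "$BASE_URL" "$1" "$2" "$day"
    day=$((day + 1))
  done
}

# List the files for February 2016.
gdelt_urls 2016 02
```

To actually download the files, the output could be piped to a download tool, for example `gdelt_urls 2016 02 | xargs -n 1 wget -P gdelt_data`, assuming wget is installed.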
Download Styles
Later in the guide, we will be visualizing some data using GeoServer. For this, we will be using some styles that have been created for the demo.
Download the following styles to your working directory: KDEColorMap.sld and SubsamplePoints.sld.
When finished, you should have a directory structure similar to the one below.
quickstart
|- KDEColorMap.sld
|- SubsamplePoints.sld
|- gdelt_data
| |- 20160201.export.CSV.zip
| |- 20160202.export.CSV.zip
| |- 20160203.export.CSV.zip
| |- 20160204.export.CSV.zip
.
.
.
After all the data and styles have been downloaded, we can continue.
Vector Demo
Before starting the vector demo, make sure that your working directory is the current active directory in your command-line tool.
Configure GeoWave Data Store
Depending on which key/value store was configured in the EMR setup, execute the appropriate command to add the store to the GeoWave configuration, replacing $HOSTNAME with the Master public DNS of the EMR cluster:
- Accumulo
$ geowave store add gdelt --gwNamespace geowave.gdelt -t accumulo --zookeeper $HOSTNAME:2181 --instance accumulo --user geowave --password geowave
- HBase
$ geowave store add gdelt --gwNamespace geowave.gdelt -t hbase --zookeeper $HOSTNAME:2181
- Cassandra
$ geowave store add gdelt --gwNamespace geowave.gdelt -t cassandra --contactPoints $HOSTNAME:2181
This command adds a connection to the key/value store on EMR under the name gdelt for use in future commands. It configures the connection to put all data for this named store under the geowave.gdelt namespace.
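The three store commands differ only in their connection options. As a rough sketch, a small helper can print the right command for a given store type so it can be reviewed before running. The helper name store_add_cmd and the host master.example.com are made up for illustration; the flags are copied from the commands in this section.

```shell
# Print the "geowave store add" command for a store type and host.
# store_add_cmd is a made-up helper; the flags are copied from this guide.
store_add_cmd() {
  base="geowave store add gdelt --gwNamespace geowave.gdelt -t $1"
  case $1 in
    accumulo)  echo "$base --zookeeper $2:2181 --instance accumulo --user geowave --password geowave" ;;
    hbase)     echo "$base --zookeeper $2:2181" ;;
    cassandra) echo "$base --contactPoints $2:2181" ;;
    *)         echo "unsupported store type: $1" >&2; return 1 ;;
  esac
}

# master.example.com stands in for the Master public DNS of the cluster.
store_add_cmd hbase master.example.com
```

The helper only prints the command. Copy the printed line into the terminal once it looks right.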
Add an Index
Before ingesting any data, we need to create an index that describes how the data will be stored in the key/value store. For this example we will create a simple spatial index.
$ geowave index add gdelt gdelt-spatial -t spatial --partitionStrategy round_robin --numPartitions 32
This command adds a spatial index to the gdelt data store with an index name of gdelt-spatial, which will be used to reference this index in future commands. It configures the index to use a round robin partitioning strategy with 32 partitions.
Ingest Data
GeoWave has many commands that facilitate ingesting data into a GeoWave data store. For this example, we want to ingest GDELT data from the local file system, so we will use the ingest localToGW command. We will use a bounding box that roughly surrounds Germany to limit the amount of data ingested for the example.
$ geowave ingest localToGW -f gdelt --gdelt.cql "BBOX(geometry,5.87,47.2,15.04,54.95)" ./gdelt_data gdelt gdelt-spatial
This command specifies the input format as GDELT using the -f option, filters the input data using a CQL bounding box filter, and specifies the input directory for all of the files. Finally, we tell GeoWave to ingest the data to the gdelt-spatial index in the gdelt data store. GeoWave creates an adapter for the new data with the type name gdeltevent, which we can use to refer to this data in other commands. The ingest should take about 3-5 minutes.
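The CQL bounding box filter takes the geometry attribute followed by the minimum longitude, minimum latitude, maximum longitude, and maximum latitude. As a minimal sketch, the filter string can be built from named values, which makes it easier to swap in a different area of interest. The function name bbox_cql and the variable GERMANY_BBOX are made up for illustration.

```shell
# Build a CQL BBOX filter: bbox_cql ATTRIBUTE MIN_LON MIN_LAT MAX_LON MAX_LAT
bbox_cql() {
  printf 'BBOX(%s,%s,%s,%s,%s)\n' "$1" "$2" "$3" "$4" "$5"
}

# Same filter as in the ingest command of this guide (roughly Germany).
GERMANY_BBOX=$(bbox_cql geometry 5.87 47.2 15.04 54.95)
echo "$GERMANY_BBOX"

# To ingest with it:
#   geowave ingest localToGW -f gdelt --gdelt.cql "$GERMANY_BBOX" ./gdelt_data gdelt gdelt-spatial
```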
Query the Data
Now that the data has been ingested, we can make queries against it. The GeoWave programmatic API provides a large variety of options for issuing queries, but for the purposes of this guide, we will use the query language support that is available for vector data. This query language provides a simple way to perform some of the most common types of queries using a well-known syntax. To demonstrate this, perform the following query:
$ geowave query gdelt "SELECT * FROM gdeltevent LIMIT 10"
This command tells GeoWave to select all attributes from the gdeltevent type in the gdelt data store, but limits the output to 10 features. After running this command, you should get a result that is similar to the following:
+-------------------------+-----------+------------------------------+----------+-----------+----------------+----------------+-------------+-------------------------------------------------------------------------------------------------------+
| geometry                | eventid   | Timestamp                    | Latitude | Longitude | actor1Name     | actor2Name     | countryCode | sourceUrl                                                                                             |
+-------------------------+-----------+------------------------------+----------+-----------+----------------+----------------+-------------+-------------------------------------------------------------------------------------------------------+
| POINT (15.0395 50.1904) | 510693819 | Thu Feb 11 00:00:00 EST 2016 | 50.1904  | 15.0395   | CZECH          | THAILAND       | EZ          | http://praguemonitor.com/2016/02/11/czech-zoo-acquires-rare-douc-langur-monkeys                       |
| POINT (15.0395 50.1904) | 510694920 | Thu Feb 11 00:00:00 EST 2016 | 50.1904  | 15.0395   | THAILAND       | CZECH          | EZ          | http://praguemonitor.com/2016/02/11/czech-zoo-acquires-rare-douc-langur-monkeys                       |
| POINT (14.7186 50.4983) | 508121628 | Wed Feb 03 00:00:00 EST 2016 | 50.4983  | 14.7186   |                | LEBANON        | EZ          | http://praguemonitor.com/2016/02/03/plane-pick-five-czechs-leave-lebanon-wednesday                    |
| POINT (14.7186 50.4983) | 508121971 | Wed Feb 03 00:00:00 EST 2016 | 50.4983  | 14.7186   | POLICE         |                | EZ          | http://praguemonitor.com/2016/02/03/plane-pick-five-czechs-leave-lebanon-wednesday                    |
| POINT (14.7186 50.4983) | 508122060 | Wed Feb 03 00:00:00 EST 2016 | 50.4983  | 14.7186   | CZECH          |                | EZ          | http://praguemonitor.com/2016/02/03/plane-pick-five-czechs-leave-lebanon-wednesday                    |
| POINT (14.7186 50.4983) | 508122348 | Wed Feb 03 00:00:00 EST 2016 | 50.4983  | 14.7186   | FOREIGN MINIST | LEBANON        | EZ          | http://praguemonitor.com/2016/02/03/plane-pick-five-czechs-leave-lebanon-wednesday                    |
| POINT (14.7186 50.4983) | 508122668 | Wed Feb 03 00:00:00 EST 2016 | 50.4983  | 14.7186   | LEBANON        |                | EZ          | http://praguemonitor.com/2016/02/03/plane-pick-five-czechs-leave-lebanon-wednesday                    |
| POINT (14.7186 50.4983) | 508122669 | Wed Feb 03 00:00:00 EST 2016 | 50.4983  | 14.7186   | LEBANON        |                | EZ          | http://praguemonitor.com/2016/02/03/plane-pick-five-czechs-leave-lebanon-wednesday                    |
| POINT (14.7186 50.4983) | 508122679 | Wed Feb 03 00:00:00 EST 2016 | 50.4983  | 14.7186   | LEBANON        | FOREIGN MINIST | EZ          | http://praguemonitor.com/2016/02/03/plane-pick-five-czechs-leave-lebanon-wednesday                    |
| POINT (14.7186 50.4983) | 508579066 | Thu Feb 04 00:00:00 EST 2016 | 50.4983  | 14.7186   | CZECH          | MEDIA          | EZ          | http://www.ceskenoviny.cz/zpravy/plane-with-five-czechs-flying-from-beirut-to-prague-ministry/1311188 |
+-------------------------+-----------+------------------------------+----------+-----------+----------------+----------------+-------------+-------------------------------------------------------------------------------------------------------+
We can see right away that these results are tagged with the country code EZ, which corresponds to the Czech Republic. Since our area of interest is around Germany, perhaps we only want to see events that are tagged with the GM country code. We can do this by adding a WHERE clause to the query.
$ geowave query gdelt "SELECT * FROM gdeltevent WHERE countryCode='GM' LIMIT 10"
Now the results show only events that have the GM country code.
+-------------------------+-----------+------------------------------+----------+-----------+------------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------+
| geometry                | eventid   | Timestamp                    | Latitude | Longitude | actor1Name | actor2Name | countryCode | sourceUrl                                                                                                                 |
+-------------------------+-----------+------------------------------+----------+-----------+------------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------+
| POINT (13.0333 47.6333) | 508836788 | Fri Feb 05 00:00:00 EST 2016 | 47.6333  | 13.0333   | GERMANY    |            | GM          | http://www.thespreadit.com/gold-bar-lake-keep-69589/                                                                      |
| POINT (13.0333 47.6333) | 508836797 | Fri Feb 05 00:00:00 EST 2016 | 47.6333  | 13.0333   | GERMANY    | ALBERT     | GM          | http://www.thespreadit.com/gold-bar-lake-keep-69589/                                                                      |
| POINT (13.0333 47.6333) | 508837466 | Fri Feb 05 00:00:00 EST 2016 | 47.6333  | 13.0333   | ALBERT     | GERMANY    | GM          | http://www.thespreadit.com/gold-bar-lake-keep-69589/                                                                      |
| POINT (12.9 47.7667)    | 508569746 | Thu Feb 04 00:00:00 EST 2016 | 47.7667  | 12.9      |            | GERMAN     | GM          | http://www.ynetnews.com/articles/0,7340,L-4762071,00.html                                                                 |
| POINT (12.9 47.7667)    | 508574449 | Thu Feb 04 00:00:00 EST 2016 | 47.7667  | 12.9      | COMPANY    | GOVERNMENT | GM          | http://www.i24news.tv/en/news/international/101671-160204-holocaust-survivors-sue-hungary-for-deportation-of-500-000-jews |
| POINT (12.9 47.7667)    | 508665355 | Thu Feb 04 00:00:00 EST 2016 | 47.7667  | 12.9      | HUNGARY    | GERMANY    | GM          | http://www.jns.org/news-briefs/2016/2/4/14-holocaust-survivors-sue-hungary-in-us-court                                    |
| POINT (12.9 47.7667)    | 508773863 | Fri Feb 05 00:00:00 EST 2016 | 47.7667  | 12.9      |            | GERMAN     | GM          | http://jpupdates.com/2016/02/04/14-holocaust-survivors-sue-hungary-in-u-s-court/                                          |
| POINT (12.9 47.7667)    | 508775266 | Fri Feb 05 00:00:00 EST 2016 | 47.7667  | 12.9      | HUNGARY    | GERMANY    | GM          | http://jpupdates.com/2016/02/04/14-holocaust-survivors-sue-hungary-in-u-s-court/                                          |
| POINT (12.9 47.7667)    | 509245139 | Sat Feb 06 00:00:00 EST 2016 | 47.7667  | 12.9      |            | GERMAN     | GM          | https://theuglytruth.wordpress.com/2016/02/06/hungary-holocaust-survivors-sue-hungarian-government/                       |
| POINT (12.9 47.7667)    | 509327879 | Sun Feb 07 00:00:00 EST 2016 | 47.7667  | 12.9      |            | LARI       | GM          | http://blackgirllonghair.com/2016/02/the-black-victims-of-the-holocaust-in-nazi-germany/                                  |
+-------------------------+-----------+------------------------------+----------+-----------+------------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------+
If we want to see how many events belong to the GM country code, we can perform an aggregation query.
$ geowave query gdelt "SELECT COUNT(*) FROM gdeltevent WHERE countryCode='GM'"
+----------+
| COUNT(*) |
+----------+
| 81897    |
+----------+
We can also perform multiple aggregations on the same data in a single query. The following query counts how many entries have actor1Name set and how many have actor2Name set.
$ geowave query gdelt "SELECT COUNT(actor1Name), COUNT(actor2Name) FROM gdeltevent"
+-------------------+-------------------+
| COUNT(actor1Name) | COUNT(actor2Name) |
+-------------------+-------------------+
| 93750             | 80608             |
+-------------------+-------------------+
We can also do bounding box aggregations. For example, if we wanted to see the bounding box of all the data that has HUNGARY set as the actor1Name, we could do the following:
$ geowave query gdelt "SELECT BBOX(*), COUNT(*) AS total_events FROM gdeltevent WHERE actor1Name='HUNGARY'"
+------------------------------------------+--------------+
| BBOX(*)                                  | total_events |
+------------------------------------------+--------------+
| Env[6.1667 : 14.7174, 47.3333 : 53.5667] | 408          |
+------------------------------------------+--------------+
In these examples, each query was output to the console, but there are options on the command that allow the query results to be output to several formats, including GeoJSON, shapefile, and CSV.
For more information about queries, see the queries section of the User Guide.
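When the same kind of query is needed for several values, the query string can be generated rather than typed by hand. The sketch below builds the COUNT(*) query used in this section for any country code. The function name count_query is made up for illustration, and the country codes in the loop are simply the two that appear in this guide.

```shell
# Build a COUNT(*) query for one GDELT country code: count_query CODE
count_query() {
  printf "SELECT COUNT(*) FROM gdeltevent WHERE countryCode='%s'\n" "$1"
}

# Print the query for each code of interest.
for code in GM EZ; do
  count_query "$code"
done

# To run one of them against the gdelt store:
#   geowave query gdelt "$(count_query GM)"
```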
Kernel Density Estimation (KDE)
We can also perform analytics on data that has been ingested into GeoWave. In this example, we will perform the Kernel Density Estimation (KDE) analytic.
$ geowave analytic kde --featureType gdeltevent --minLevel 5 --maxLevel 26 --minSplits 32 --maxSplits 32 --coverageName gdeltevent_kde --hdfsHostPort ${HOSTNAME}:8020 --jobSubmissionHostPort ${HOSTNAME}:8032 --tileSize 1 gdelt gdelt
This command tells GeoWave to perform a Kernel Density Estimation on the gdeltevent type. It specifies that the KDE should be run at zoom levels 5-26 and that the new raster generated should be under the type name gdeltevent_kde. It also specifies that the minimum and maximum splits should be 32, which is the number of partitions that were created for the index. It then points the analytic to the HDFS and resource manager ports on the EMR cluster. Finally, it specifies the input and output data store as our gdelt store. It is possible to output the results of the KDE to a different data store, but for this demo, we will use the same one. The KDE can take 5-10 minutes to complete due to the size of the dataset.
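Because the split count should match the number of index partitions, it can help to keep that number in one place. The following sketch prints the KDE command with both split options taken from a single variable. The names NUM_PARTITIONS, MASTER_DNS, and kde_cmd, as well as the host master.example.com, are made up for illustration; the options themselves are copied from the KDE command in this section.

```shell
# Number of partitions used when the gdelt-spatial index was created.
NUM_PARTITIONS=32
# Stand-in for the Master public DNS of the EMR cluster (hypothetical value).
MASTER_DNS="${MASTER_DNS:-master.example.com}"

# Print the KDE command so it can be reviewed before running.
kde_cmd() {
  echo "geowave analytic kde --featureType gdeltevent --minLevel 5 --maxLevel 26" \
       "--minSplits $NUM_PARTITIONS --maxSplits $NUM_PARTITIONS" \
       "--coverageName gdeltevent_kde" \
       "--hdfsHostPort $MASTER_DNS:8020 --jobSubmissionHostPort $MASTER_DNS:8032" \
       "--tileSize 1 gdelt gdelt"
}

kde_cmd
```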
Visualizing the Data
Now that we have prepared our vector and KDE data, we can visualize it by using the GeoServer plugin. GeoWave provides an embedded GeoServer with the command-line tools.
Configure GeoServer
Because GeoServer is running on the EMR cluster, we need to configure GeoWave to communicate with it. Execute the following command, replacing $HOSTNAME with the Master public DNS of the EMR cluster:
$ geowave config geoserver "$HOSTNAME:8000"
Add Layers
GeoWave provides commands that make adding layers to a GeoServer instance a simple process. In this example, we can add both the gdeltevent and gdeltevent_kde types to GeoServer with a single command.
$ geowave gs layer add gdelt --add all
This command tells GeoWave to add all raster and vector types from the gdelt data store to GeoServer.
Add Styles
We already downloaded the styles that we want to use to visualize our data as part of the preparation step. The KDEColorMap style will be used for the heatmap produced by the KDE analytic. The SubsamplePoints style will be used to efficiently render the points from the gdeltevent type. All we need to do is add them to GeoServer.
$ geowave gs style add kdecolormap -sld KDEColorMap.sld
$ geowave gs style add SubsamplePoints -sld SubsamplePoints.sld
Now we can update our layers to use these styles.
$ geowave gs style set gdeltevent_kde --styleName kdecolormap
$ geowave gs style set gdeltevent --styleName SubsamplePoints
View the Layers
The GeoServer web interface can be accessed in your browser:
- ${Master_public_DNS}:8000/geoserver/web
Log in to see the layers.
- Username: admin
- Password: geoserver
Select "Layer Preview" from the menu on the left side. You should now see our two layers in the layer list.
Click on the OpenLayers link by any of these layers to see them in an interactive map.
gdeltevent - Shows all of the GDELT events in a bounding box around Germany as individual points. Clicking on the map preview will show you the feature data associated with the clicked point.
[Figure: gdeltevent Layer]
gdeltevent_kde - Shows the heat map produced by the KDE analytic in a bounding box around Germany.
For this screenshot, the background color of the preview was set to black by appending
[Figure: gdeltevent_kde Layer]
Raster Demo
In this demo, we will be looking at Band 8 of Landsat raster data around Berlin, Germany. See USGS.gov for more information about Landsat 8.
Install GDAL
The Landsat 8 extension for GeoWave uses GDAL (Geospatial Data Abstraction Library), an image processing library, to process raster data. In order to use GDAL, native libraries need to be installed on the system. More information can be found on the GDAL website.
GeoWave provides a way to install GDAL libraries with the following command:
$ geowave raster installgdal
Configure GeoWave Data Stores
Before continuing the demo, make sure that your working directory is the current active directory in your command-line tool.
For this demo, we will be using two data stores. One will be used for vector data, and the other will be used for raster data. Again, replace $HOSTNAME with the Master public DNS of the EMR cluster:
- Accumulo
$ geowave store add -t accumulo -z $HOSTNAME:2181 landsatraster --gwNamespace geowave.landsat_raster -i accumulo -u geowave -p geowave
$ geowave store copycfg landsatraster landsatvector --gwNamespace geowave.landsat_vector
- HBase
$ geowave store add -t hbase -z $HOSTNAME:2181 landsatraster --gwNamespace geowave.landsat_raster
$ geowave store copycfg landsatraster landsatvector --gwNamespace geowave.landsat_vector
- Cassandra
$ geowave store add -t cassandra --contactPoints $HOSTNAME:2181 landsatraster --gwNamespace geowave.landsat_raster --batchWriteSize 15
$ geowave store copycfg landsatraster landsatvector --gwNamespace geowave.landsat_vector
These commands create a store for the raster data, and then copy that store configuration, changing only the namespace for the vector data store. The result is that the data for both stores will be on the same key/value store, but under different namespaces, so GeoWave will treat them as separate data stores.
Add an Index
Before ingesting our raster data, we will add a spatial index to both of the data stores.
$ geowave index add -t spatial -c EPSG:3857 landsatraster spatial-idx
$ geowave index add -t spatial -c EPSG:3857 landsatvector spatial-idx
This is similar to the command we used to add an index in the vector demo, but we have added an additional option to specify the Coordinate Reference System (CRS) of the data. Geospatial data often uses a CRS that is tailored to the area of interest. This can be a useful option if you want to use a CRS other than the default. After these commands have been executed, we will have spatial indices named spatial-idx on both data stores.
Analyze Available Data
We can now see what Landsat 8 data is available for our area of interest.
$ geowave util landsat analyze --nbestperspatial true --nbestscenes 1 --usecachedscenes true --cql "BBOX(shape,13.0535,52.3303,13.7262,52.6675) AND band='B8' AND cloudCover>0" -ws ./landsat
This command tells GeoWave to analyze the B8 band of Landsat raster data over a bounding box that roughly surrounds Berlin, Germany. It prints out aggregate statistics for the area of interest, including the average cloud cover, date range, number of scenes, and the size of the data. Data for this operation is written to the landsat directory (specified by the -ws option), which can be used by the ingest step.
Ingest the Data
Now that we have analyzed the available data, we are ready to ingest it into our data stores.
$ geowave util landsat ingest --nbestperspatial true --nbestscenes 1 --usecachedscenes true --cql "BBOX(shape,13.0535,52.3303,13.7262,52.6675) AND band='B8' AND cloudCover>0" --crop true --retainimages true -ws ./landsat --vectorstore landsatvector --pyramid true --coverage berlin_mosaic landsatraster spatial-idx
There is a lot to this command, but you'll see that it is quite similar to the analyze command, with some additional options. The --crop option causes the raster data to be cropped to our CQL bounding box. The --vectorstore landsatvector option specifies the data store in which to put the vector data (scene and band information). The --pyramid option tells GeoWave to create an image pyramid for the raster, which is used for more efficient rendering at different zoom levels. The --coverage berlin_mosaic option tells GeoWave to use berlin_mosaic as the type name for the raster data. Finally, we specify the output data store for the raster, and the index to store it on.
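Since the analyze and ingest commands share the same scene selection options, one way to keep them in sync is a small wrapper function. This is only a sketch: the function name landsat and the variable LANDSAT_CQL are made up for illustration, and it assumes that the order of the options on the command line does not matter, because the ingest-only options are appended after the shared ones.

```shell
# CQL filter shared by the analyze and ingest steps (copied from this guide).
LANDSAT_CQL="BBOX(shape,13.0535,52.3303,13.7262,52.6675) AND band='B8' AND cloudCover>0"

# Run a Landsat subcommand with the shared options; extra arguments are appended.
# Usage: landsat SUBCOMMAND [EXTRA OPTIONS...]
landsat() {
  subcommand=$1
  shift
  geowave util landsat "$subcommand" \
    --nbestperspatial true --nbestscenes 1 --usecachedscenes true \
    --cql "$LANDSAT_CQL" -ws ./landsat "$@"
}

# Analyze, then ingest with the additional ingest-only options:
#   landsat analyze
#   landsat ingest --crop true --retainimages true --vectorstore landsatvector \
#     --pyramid true --coverage berlin_mosaic landsatraster spatial-idx
```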
Visualizing the Data
We will once again use GeoServer to visualize our ingested data.
Configure GeoServer
GeoServer should already be configured from the previous demo, but if not, go ahead and configure it now:
$ geowave config geoserver "$HOSTNAME:8000"
Add Layers
Just like with the vector demo, we can use the GeoWave CLI to add our raster data to GeoServer. We will also add the vector metadata from the vector data store.
$ geowave gs layer add landsatraster --add all
$ geowave gs layer add landsatvector --add all
View the Layers
When we go back to the Layer Preview page in GeoServer, we will see three new layers: band, berlin_mosaic, and scene.
Click on the OpenLayers link by any of these layers to see them in an interactive map.
berlin_mosaic - Shows the mosaic created from the raster data that fit our specifications. This mosaic is made of 5 images.
[Figure: berlin_mosaic Layer]
band/scene - Shows representations of the vector data associated with the images. The band and scene layers are identical in this demo.
[Figure: band and scene Layers]