The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes. Uniquely geared toward big data analytics, Greenplum Database is powered by the world’s most advanced cost-based query optimizer delivering high analytical query performance on large data volumes. https://www.greenplum.org
This repository demonstrates how to use Greenplum with PXF to read text data from hadoop.
- Pre-requisites
- Start Docker-compose
- Configure Greenplum
- Configure Cloudera
- Use PSQL to read data via PXF
- 8GB memory or more allocated to docker
- docker-compose
- Cloudera docker image
- GPDB 5.x docker image
Once you have cloned this repository, you can run the command ./runDocker.sh -t usecase1 -c up
, in order to start both Greenplum and Cloudera docker instances.
The assumption: docker and docker-compose are already installed on your machine.
$ ./runDocker.sh -t usecase1 -c up
WARNING: Found orphan containers (postgresql) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
Recreating gpdbsne ... done
Recreating quickstart.cloudera ... done
Attaching to gpdbsne, quickstart.cloudera
gpdbsne | /etc/sysconfig/run-parts
gpdbsne | /usr/bin/run-parts
gpdbsne | Running /docker-entrypoint.d
gpdbsne | /docker-entrypoint.d/startInit.sh:
gpdbsne |
gpdbsne | init is running
gpdbsne | /docker-entrypoint.d/startSSH.sh:
You can use this command docker exec -it gpdbsne bin/bash
to access Greenplum docker instance. Or you can run this script accessDockerGPDBSNE.sh
.
For example:
$ docker exec -it gpdbsne bin/bash
[root@gpdbsne /]#
Once you have access to Greenplum docker instance, you can create database, table with some sample data.
- Use gpadmin user by using
su - gpadmin
- Disable user impersonation by editing /home/gpadmin/pxf/conf/pxf-env.sh
[gpadmin@gpdbsne conf]$ vi pxf-env.sh
Make sure you change PXF_USER_IMPERSONATION=false
export PXF_USER_IMPERSONATION=false
Or you can run this script /code/usecase1/pxf/disablePXFUserImpersonation.sh
that performs the same operation.
3. Create a configuration for Hadoop cluster by copying template files
$ cp /home/gpadmin/pxf/templates/*.xml /home/gpadmin/pxf/servers/default
- Change
/home/gpadmin/pxf/servers/default/core-site.xml
to use "quickstart.cloudera".
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://quickstart.cloudera:8020</value>
</property>
</configuration>
Or you can run this script /code/usecase1/pxf/addClouderaToPXFCoreSiteXML.sh
that performs the same operation.
- Create sample database with the name "pxf_db"
The scripts to create database and sample data is found at
/code/usercase1/pxf/
.
Next, run the command /code/usecase1/pxf/setupDB.sh
$ /code/usecase1/pxf/setupDB.sh
GRANT
GRANT
CREATE EXTERNAL TABLE
- Verify database and table is created Use `psql -U gpadmin -d pxf_db -c "\dE"``
[gpadmin@gpdbsne ~]$ psql -U gpadmin -d pxf_db -c "\dE"
List of relations
Schema | Name | Type | Owner | Storage
--------+---------------------+-------+---------+----------
public | pxf_hdfs_textsimple | table | gpadmin | external
(1 row)
This section describes how to setup Cloudera.
- You can use this command
docker exec -it quickstart.cloudera bin/bash
to access Cloudera docker instance. For example:
$ docker exec -it quickstart.cloudera bin/bash
[root@quickstart /]#
- This example shows how to use a script to creates sample text file.
[root@quickstart cloudera]# /code/usecase1/cloudera/setupHdfsTextSimpleExample.sh
Using root to run this script.
Found hadoop /data directory
Recursively delete files under /data
Deleted /data
## pxf_hdfs_simple.txt is created under this directory /data/pxf_examples/
Found 1 items
-rw-r--r-- 1 root supergroup 100 2018-05-07 22:29 /data/pxf_examples/pxf_hdfs_simple.txt
[root@quickstart cloudera]#
Make sure you are accessing GPDB docker instance.
- Use psql on GPDB docker instance
$
$ psql -U gpadmin -d pxf_db -c "select * from pxf_hdfs_textsimple;"
location | month | num_orders | total_sales
-------------+-------+------------+-------------
Prague | Jan | 101 | 4875.33
Rome | Mar | 87 | 1557.39
Bangalore | May | 317 | 8936.99
Beijing | Jul | 411 | 11600.67
(4 rows)
- Verify the number of records in this table pxf_hdfs_textsimple
$ psql -U gpadmin -d pxf_db -c "select count(*) from pxf_hdfs_textsimple;"
count
-------
4
(1 row)