Pig Design Patterns is a comprehensive guide that will enable readers to readily use design patterns that simplify the creation of complex data pipelines in various stages of data management. This book focuses on using Pig in an enterprise context, bridging the gap between theoretical understanding and practical implementation. Each chapter contains a set of design patterns that pose and then solve technical challenges that are relevant to the enterprise use cases.
The book covers the journey of Big Data from the time it enters the enterprise to its eventual use in analytics, in the form of a report or a predictive model. By the end of the book, readers will appreciate Pig’s real power in addressing each and every problem encountered when creating an analytics-based data product. Each design pattern comes with a suggested solution, analyzing the trade-offs of implementing the solution in a different way, explaining how the code works, and the results.
The following code performs the task of connecting to MongoDB, setting up the connection, loading the MongoDB native file, parsing it, and retrieving only the specified schema in the MongoLoader constructor by mapping the fields of the MongoDB document with the fields specified in the schema. This abstraction is performed by just one call to the MongoLoader function.
/*
Register the mongo jar files to be able to use MongoLoader UDF
Load the contents of files starting with NASDAQ_daily_prices_ into a Pig relation stock_data
*/
stock_data= LOAD '/user/cloudera/pdp/datasets/mongo/NASDAQ_daily_prices/NASDAQ_daily_prices_*' USING org.apache.pig.piggybank.storage.CSVLoader() as (exchange:chararray, stock_symbol:chararray, date:chararray, stock_price_open:chararray, stock_price_high:chararray, stock_price_low:chararray, stock_price_close:chararray, stock_volume:chararray, stock_price_adj_close:chararray);
/*
* Some processing logic goes here which is deliberately left out to improve readability
*/
/*
Store data to MongoDB by specifying the MongoStorage serializer. The MongoDB URI nasdaqDB.store_stock is the document collection created to hold this data.
*/
STORE stock_data INTO 'mongodb://slave1/nasdaqDB.store_stock' using MongoStorage();
The following code snippet illustrates the ingestion of the HBase data into a Pig relation:
/*
Load data from HBase table retail_transactions, it contains the column families transaction_details, customer_details and product_details.
The : operator is used to access columns in a column family.
First parameter to HBaseStorage is the list of columns and the second parameter is the list of options
The option -loadkey true specifies the rowkey should be loaded as the first item in the tuple, -limit 500 specifies the number of rows to be read from the HBase table
*/
transactions = LOAD 'hbase://retail_transactions'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(