Original poster: Lisrelchen

[Data Science] Pig Design Patterns

11
ReneeBK posted on 2017-01-16 06:49:03
The HBase ingress and egress pattern
The egress code

The following code illustrates storing the contents of a Pig relation into an HBase table:

  /*
  Load the transactions dataset using PigStorage into the relation transactions
  */
  transactions = LOAD '/user/cloudera/pdp/datasets/hbase/transactions.csv' USING PigStorage(',') AS (
      listing_id: chararray,
      transaction_date: chararray,
      customer_id: int,
      age: chararray,
      residence_area: chararray,
      product_subclass: int,
      product_id: long,
      amount: int,
      asset: int,
      sales_price: int);

  /*
  * Some processing logic goes here which is deliberately left out to improve readability
  */

  /*
  Use HBaseStorage to store data from the Pig relation transactions into the HBase table hbase://retail_transactions.
  The first field, listing_id, becomes the HBase row key; the remaining fields are mapped to the three column families transaction_details, customer_details, and product_details.
  */
  STORE transactions INTO 'hbase://retail_transactions' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('transaction_details:transaction_date customer_details:customer_id customer_details:age customer_details:residence_area product_details:product_subclass product_details:product_id product_details:amount product_details:asset product_details:sales_price');
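
The post shows only the egress half of the pattern. For completeness, here is a minimal sketch (not from the original post) of the ingress direction, reading the same table back with HBaseStorage; the '-loadKey true' option prepends the row key to each returned tuple, and the values are declared as chararray here for simplicity:

  /*
  Sketch: load two columns plus the row key from the retail_transactions table
  */
  stored_txns = LOAD 'hbase://retail_transactions' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('transaction_details:transaction_date customer_details:customer_id', '-loadKey true') AS (listing_id:chararray, transaction_date:chararray, customer_id:chararray);
  DUMP stored_txns;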

12
ReneeBK posted on 2017-01-16 06:50:54
The Hive ingress and egress patterns
Importing data using RCFile

The following code illustrates the usage of HiveColumnarLoader, which loads data from a Hive table stored in the RCFile format:

  /*
  Register the Piggybank jar file to be able to use the UDFs in it
  */
  REGISTER '/usr/share/pig/contrib/piggybank/java/piggybank.jar';

  -- Register the Hive common and exec jars
  REGISTER '/usr/lib/hive/lib/hive-common-0.11.0.1.3.0.0-107.jar';
  REGISTER '/usr/lib/hive/lib/hive-exec-0.11.0.1.3.0.0-107.jar';

  /*
  Load the retail_transactions_rc RCFile, specifying the names of the table's columns and their types in the constructor of HiveColumnarLoader.
  */
  transactions = LOAD '/apps/hive/warehouse/transactions_db.db/retail_transactions_rc' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('transaction_no int,transaction_date string,cust_no int,amount double,category string,product string,city string,state string,spendby string');

  /*
  * Some processing logic goes here which is deliberately left out to improve readability
  */

  /*
  Display the contents of the relation transactions on the console
  */
  DUMP transactions;
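
A quick way to verify what the loader produced before writing real processing logic is to inspect the derived schema and project a couple of columns; a minimal sketch using only standard Pig operators:

  -- Print the schema Pig derives from the HiveColumnarLoader constructor string
  DESCRIBE transactions;

  -- Project a subset of columns for a cheap sanity check
  amounts = FOREACH transactions GENERATE transaction_no, amount;
  DUMP amounts;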

13
ReneeBK posted on 2017-01-16 06:51:38
The Hive ingress and egress patterns
Importing data using HCatalog

The following code illustrates the loading of data from Hive using HCatalog:

  /*
  Specify the table name as the input to the HCatLoader function provided by HCatalog.
  This function abstracts the storage location, file type, and schema from the user, and takes only the table name as input
  */
  transactions = LOAD 'transactions_db.retail_transactions' USING org.apache.hcatalog.pig.HCatLoader();

  /*
  * Some processing logic goes here which is deliberately left out to improve readability
  */

  /*
  Display the contents of the relation transactions on the console
  */
  DUMP transactions;
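
HCatLoader needs the HCatalog jars and the Hive metastore configuration on Pig's classpath; on most distributions, running the script with pig -useHCatalog takes care of this. One practical benefit of going through the metastore is partition pruning: a FILTER on a partition column placed directly after the LOAD lets HCatLoader read only the matching partitions. A minimal sketch, assuming (hypothetically) that the table is partitioned by a chararray column named txn_date:

  -- txn_date is a hypothetical partition column; filtering on it right after the LOAD enables partition pruning
  recent = FILTER transactions BY txn_date == '2017-01-16';
  DUMP recent;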

14
ReneeBK posted on 2017-01-16 06:52:09
The Hive ingress and egress patterns
The following code illustrates the egress of data to Hive using HCatStorer:

  -- Register the Piggybank and hcatalog-pig-adapter jars
  REGISTER '/usr/share/pig/contrib/piggybank/java/piggybank.jar';
  REGISTER '/usr/lib/hcatalog/share/hcatalog/hcatalog-pig-adapter.jar';

  /*
  Load the transactions dataset into the relation transactions
  */
  transactions = LOAD '/user/cloudera/pdp/datasets/hive/retail_transactions.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (transaction_no:int, transaction_date:chararray, cust_no:int, amount:double, category:chararray, product:chararray, city:chararray, state:chararray, spendby:chararray);

  /*
  * Some processing logic goes here which is deliberately left out to improve readability
  */

  /*
  Specify the Hive table name transactions_db.retail_transactions as the input to the HCatStorer function.
  The contents of the relation transactions are stored into the Hive table.
  */
  STORE transactions INTO 'transactions_db.retail_transactions' USING org.apache.hcatalog.pig.HCatStorer();
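
HCatStorer also accepts a partition specification as a constructor argument, which is how writes are directed at one partition of a partitioned table. A minimal sketch, assuming (hypothetically) that the target table is partitioned by a column named load_date:

  -- 'load_date=20170116' is a hypothetical partition spec; the key must match a partition column of the target table
  STORE transactions INTO 'transactions_db.retail_transactions' USING org.apache.hcatalog.pig.HCatStorer('load_date=20170116');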

15
ReneeBK posted on 2017-01-16 06:53:57
The mainframe ingestion pattern
The following is a Java code snippet of VSAMLoader, which is a custom loader implementation:

  @Override
  public ResourceSchema getSchema(String arg0, Job arg1) throws IOException {
    ...
    while (it.hasNext()) {
      // Get the next key/value pair
      Map.Entry pairs = (Map.Entry) it.next();
      String key = (String) pairs.getKey();
      String value = (String) pairs.getValue();
      // For Group and Alphanumeric types in the copybook, return the Pig-compliant type chararray
      if (value.equals("class net.sf.cb2java.copybook.Group")
          || value.equals("class net.sf.cb2java.copybook.AlphaNumeric")) {
        fieldSchemaList.add(new FieldSchema(key, org.apache.pig.data.DataType.CHARARRAY));
      }
      // For the Decimal type in the copybook, return the Pig-compliant type integer
      else if (value.equals("class net.sf.cb2java.copybook.Decimal")) {
        fieldSchemaList.add(new FieldSchema(key, org.apache.pig.data.DataType.INTEGER));
      }
      // Otherwise return the default bytearray
      else {
        fieldSchemaList.add(new FieldSchema(key, org.apache.pig.data.DataType.BYTEARRAY));
      }
    }
    return new ResourceSchema(new Schema(fieldSchemaList));
  }
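
For context, a custom LoadFunc like this is used from Pig the same way as the built-in loaders; a minimal sketch, where the jar path, the package-qualified class name com.mainframe.VSAMLoader, and the copybook argument are all hypothetical placeholders:

  -- Hypothetical usage sketch: register the jar containing the custom loader and load a VSAM export with it
  REGISTER '/home/cloudera/pdp/jars/vsamloader.jar';
  vsam_data = LOAD '/user/cloudera/pdp/datasets/vsam/customers.dat' USING com.mainframe.VSAMLoader('customers.cpy');
  DUMP vsam_data;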

16
ReneeBK posted on 2017-01-16 06:54:57
The code for simple JSON

The code to load JSON files using JsonLoader is shown as follows:

  /*
  Use the JsonLoader UDF; it takes the JSON schema as a parameter and loads the contents of the JSON file emails.json into the relation enron_emails
  */
  enron_emails = LOAD '/user/cloudera/pdp/datasets/json/emails.json' USING JsonLoader('body:chararray, from:chararray, tos:chararray, ccs:chararray, bccs:chararray, date:chararray, message_id:chararray, subject:chararray');

  /*
  * Some processing logic goes here which is deliberately left out to improve readability
  */

  /*
  Display the contents of the relation enron_emails on the console
  */
  DUMP enron_emails;

It is important to note that JsonLoader does not use the AS clause to supply the schema; the schema string is passed to the loader's constructor instead.
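
The fields named in the constructor string are still addressable by name downstream, exactly as if they had come from an AS clause; a minimal sketch using only standard Pig operators:

  -- Project two of the fields declared in the JsonLoader constructor
  senders = FOREACH enron_emails GENERATE from, subject;
  DUMP senders;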

17
ReneeBK posted on 2017-01-16 06:55:20
The code for nested JSON

The Pig script to load nested JSON is shown as follows; we use the elephant-bird libraries to accomplish this:

  /*
  Register the elephant-bird and JSON jar files
  */
  REGISTER '/home/cloudera/pdp/jars/elephant-bird-core-3.0.5.jar';
  REGISTER '/home/cloudera/pdp/jars/elephant-bird-pig-3.0.5.jar';
  REGISTER '/home/cloudera/pdp/jars/json-simple-1.1.1.jar';

  /*
  Use elephant-bird's JsonLoader for loading a nested JSON file.
  The parameter '-nestedLoad' enables the nested loading operation
  */
  emails = LOAD '/user/cloudera/pdp/datasets/json/emails.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

  /*
  * Some processing logic goes here which is deliberately left out to improve readability
  */

  /*
  Display the contents of the relation emails on the console
  */
  DUMP emails;
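
With -nestedLoad, elephant-bird's JsonLoader returns each record as a single map field (named json), with nested objects appearing as nested maps; a minimal sketch of extracting values with Pig's map-dereference operator #, using key names from the emails.json schema shown in the previous post:

  -- Dereference top-level keys of the json map; a nested object would come back as a map that can be dereferenced again
  email_meta = FOREACH emails GENERATE (chararray) json#'from' AS sender, (chararray) json#'subject' AS subject;
  DUMP email_meta;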

18
ReneeBK posted on 2017-01-16 06:56:00
The egress code

The following code egresses data stored in a Pig relation to the JSON format:

  /*
  Load the JSON file using JsonLoader into the relation enron_emails
  */
  enron_emails = LOAD '/user/cloudera/pdp/datasets/json/emails.json' USING JsonLoader('body:chararray, from:chararray, tos:chararray, ccs:chararray, bccs:chararray, date:chararray, message_id:chararray, subject:chararray');

  /*
  * Some processing logic goes here which is deliberately left out to improve readability
  */

  /*
  Use JsonStorage to store the contents of the relation to a JSON file
  */
  STORE enron_emails INTO '/user/cloudera/pdp/output/json/output.json' USING JsonStorage();
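
Pig's bundled JsonStorage also records the relation's schema in a side file alongside the output (treat this as an assumption for your Pig version), which lets JsonLoader read the data back without a schema string; a minimal sketch:

  -- Read the stored output back; with no constructor argument, JsonLoader is expected to pick up the schema side file written by JsonStorage (assumption about the bundled JSON support)
  emails_back = LOAD '/user/cloudera/pdp/output/json/output.json' USING JsonLoader();
  DESCRIBE emails_back;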

19
ReneeBK posted on 2017-01-16 07:08:54
Pig script

The following is the Pig script illustrating the implementation of this pattern:

  /*
  Register the datatypeinferer and custom storage jar files
  */
  REGISTER '/home/cloudera/pdp/jars/datatypeinfererudf.jar';
  REGISTER '/home/cloudera/pdp/jars/customdatatypeinfererstorage.jar';

  /*
  Load the transactions dataset into the relation transactions
  */
  transactions = LOAD '/user/cloudera/pdp/datasets/data_profiling/transactions.csv' USING PigStorage(',') AS (transaction_id:long, transaction_date:chararray, cust_id:chararray, age:chararray, area:chararray, prod_subclass:int, prod_id:long, amt:int, asset:int, sales_price:int, phone_no:chararray, country_code:chararray);

  /*
  Infer the data type of the field cust_id by invoking the DataTypeInfererUDF.
  It returns a tuple with the inferred data type.
  */
  data_types = FOREACH transactions GENERATE com.profiler.DataTypeInfererUDF(cust_id) AS inferred_data_type;

  /*
  Compute the count of each data type, the total count, and the percentage.
  The data type with the highest count is considered the dominant data type
  */
  grpd = GROUP data_types BY inferred_data_type;
  inferred_type_count = FOREACH grpd GENERATE group AS inferred_type, COUNT(data_types) AS count;
  grpd_inf_type_count_all = GROUP inferred_type_count ALL;
  total_count = FOREACH grpd_inf_type_count_all GENERATE SUM(inferred_type_count.count) AS tot_sum, MAX(inferred_type_count.count) AS max_val;
  percentage = FOREACH inferred_type_count GENERATE inferred_type AS type, count AS total_cnt, CONCAT((chararray) ROUND(count * 100.0 / total_count.tot_sum), '%') AS percent, (count == total_count.max_val ? 'Dominant' : 'Other') AS inferred_dominant_other_datatype;
  percentage_ord = ORDER percentage BY inferred_dominant_other_datatype ASC;

  /*
  CustomDatatypeInfererStorage extends StoreFunc. All the abstract methods have been overridden to implement logic that writes the contents of the relation into a file in a custom, report-like format.
  The results are stored on HDFS in the directory datatype_inferer
  */
  STORE percentage_ord INTO '/user/cloudera/pdp/output/data_profiling/datatype_inferer' USING com.profiler.CustomDatatypeInfererStorage('cust_id', 'chararray');
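
One detail worth calling out: total_count.tot_sum and total_count.max_val are used inside a FOREACH over a different relation. This is Pig's scalar projection, which works here because total_count holds exactly one tuple (it comes from GROUP ... ALL), so Pig substitutes its fields as constants. A minimal sketch for inspecting the intermediate relations while developing the script:

  -- Check the intermediate shapes: total_count must contain a single tuple for the scalar projection to succeed
  DESCRIBE inferred_type_count;
  DESCRIBE total_count;
  DUMP total_count;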
