Level 1
Scenario:
There are two files, “Big_Data” and “Needed_Data”. The “Needed_Data” file contains a list of fields that need to be pulled from the “Big_Data” file. When the code runs, it should create a file called “Output_Data” that contains all the needed fields plus the primary key of the “Big_Data” file.
Input files
File one:
• File Name: Big_Data
• File Type: SAS dataset
• Records: 10 million records.
• Variables: 5 thousand variables per record.
• Primary key: Account_number
File two:
• File Name: Needed_Data
• File Type: SAS dataset
• Records: 1 to X number.
• Variables: 1
• Varname:
o Keep_list: Contains the name of a single variable that would be on the Big_Data file.
Example data:
Keep_list
Apples
Oranges
Grapes
Processing requirement:
Output file “Output_Data” should contain all the fields that were requested in the “Needed_Data” file plus the primary key.
Output and Usage requirement:
None.
Error handling requirement:
None.
Suggestion:
For now assume the “Needed_Data” file will always contain variables that are on the “Big_Data” file.
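One possible approach to Level 1 is sketched below. It is not the only way to meet the requirement. It assumes both datasets are in the WORK library, and the macro variable name keep_vars is made up for illustration.

```sas
/* Sketch only: assumes Big_Data and Needed_Data are in WORK.      */
/* keep_vars is a hypothetical macro variable name.                */
proc sql noprint;
    /* Collect the requested names into one space-separated list. */
    select Keep_list
        into :keep_vars separated by ' '
        from Needed_Data;
quit;

/* KEEP= on the input dataset limits which variables are read,     */
/* so only the key and the requested fields reach Output_Data.     */
data Output_Data;
    set Big_Data (keep=Account_number &keep_vars);
run;
```

A macro variable can hold at most 65,534 characters, so a very long request list may need a different technique.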
Level 2
All requirements are identical to Level 1 except for the following changes.
Input files
File two:
• File Name: Needed_Data
• File Type: SAS dataset
• Records: 1 to X number.
• Variables: 3
• Varname:
o Keep_list: Name of a single variable that is on the “Big_Data” file.
o Where_list: The expected value of the variable in the keep list.
o Rename_list: The name the variable should be given in the “Output_Data” file.
Example data:
Keep_list   Where_list   Rename_list
Apples      Red          Ambrosia
Oranges     Orange
Grapes      Green        Seedless
Processing requirement:
Output file “Output_Data” should contain all the fields that were requested in the “Needed_Data” file plus the primary key. Output fields should be renamed where a new name was requested.
Example Output_Data:
Account_number
Ambrosia
Oranges
Seedless
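The renaming step could be sketched as follows. This sketch assumes that both datasets are in the WORK library, that every requested variable exists on Big_Data, and that at least one row of Needed_Data has a non-blank Rename_list. The macro variable names keep_vars and rename_pairs are made up for illustration. Where_list is left unused because the processing requirement does not say how it should be applied.

```sas
/* Sketch only: keep_vars and rename_pairs are hypothetical names. */
proc sql noprint;
    /* All requested variables, under their Big_Data names. */
    select Keep_list
        into :keep_vars separated by ' '
        from Needed_Data;

    /* Build old=new pairs only for rows that ask for a new name. */
    select catx('=', Keep_list, Rename_list)
        into :rename_pairs separated by ' '
        from Needed_Data
        where Rename_list is not missing;
quit;

/* KEEP= is applied before RENAME=, so KEEP= uses the old names. */
data Output_Data;
    set Big_Data (keep=Account_number &keep_vars
                  rename=(&rename_pairs));
run;
```

With the example data, rename_pairs would resolve to Apples=Ambrosia Grapes=Seedless, and Oranges would keep its name.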
Output and Usage requirement:
None.
Error handling requirement:
Do not expect every field requested in the “Needed_Data” file to be on the “Big_Data” file. If a field is missing, it should not appear on the “Output_Data” file, and a note should be added to the log indicating that the field was not available. The code should then continue with the remaining fields.
Suggestion:
Do not assume the “Needed_Data” file will always contain variables that are on the “Big_Data” file.
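One way to meet the error handling requirement is to compare the requested names with DICTIONARY.COLUMNS before building the output. The sketch below assumes both datasets are in the WORK library. The names _request_check, on_big_data, keep_vars, rename_pairs and build_output are made up for illustration, and Where_list is again left unused.

```sas
/* Sketch only: all helper names here are hypothetical.             */
/* Start empty, because INTO leaves a macro variable unchanged      */
/* when the query returns no rows.                                  */
%let keep_vars =;
%let rename_pairs =;

proc sql noprint;
    /* Flag each requested name as present (1) or absent (0). */
    create table _request_check as
    select n.Keep_list,
           n.Rename_list,
           case when c.name is missing then 0 else 1 end as on_big_data
    from Needed_Data as n
         left join
         (select name from dictionary.columns
          where libname = 'WORK' and memname = 'BIG_DATA') as c
         on upcase(n.Keep_list) = upcase(c.name);

    select Keep_list
        into :keep_vars separated by ' '
        from _request_check
        where on_big_data = 1;

    select catx('=', Keep_list, Rename_list)
        into :rename_pairs separated by ' '
        from _request_check
        where on_big_data = 1 and Rename_list is not missing;
quit;

/* Write one log line per unavailable field, then carry on. */
data _null_;
    set _request_check;
    where on_big_data = 0;
    put 'NOTE: Requested field ' Keep_list 'is not on Big_Data and was skipped.';
run;

/* Add RENAME= only when there is at least one pair to rename. */
%macro build_output;
    data Output_Data;
        set Big_Data (keep=Account_number &keep_vars
            %if %length(&rename_pairs) > 0 %then rename=(&rename_pairs); );
    run;
%mend build_output;
%build_output
```

If none of the requested fields exist, this sketch still produces Output_Data with only Account_number.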