In most real-life problems, you need to break the problem down into several components and apply a combination of techniques. These techniques are applied in sequence, forming a pipeline in which the output of one algorithm stage becomes the input to the next stage.
Consider a system that you are designing to identify the number plate of a speeding vehicle. For this, you will have to design four components:
- Speed detection: This component will monitor the speed of the oncoming/passing vehicles and will trigger whenever the speed goes above the prescribed speed limit. Depending on the policy, instead of just a speeding/not speeding decision, it might record the actual speed. And, for speeding vehicles, it will capture the image of the car.
- Number plate segmentation: This component will extract the portion of the camera image that contains the number plate. The rest of the car image is not of much importance.
- Identifying the vehicle number: This component will recognize the characters on the number plate – in order to identify the registration number of the vehicle.
- Generating the ticket: This component will access the transport department’s records to generate the ticket for the vehicle and post it to the right contact address.
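A minimal sketch of how these four stages might be wired together is shown below. The function names, the `Ticket` type, and the placeholder return values are hypothetical, not a description of any real system.

```python
# Hypothetical sketch of the four-stage pipeline; stages 2 and 3 are stubs.
from dataclasses import dataclass

@dataclass
class Ticket:
    registration_number: str
    speed_kmph: float

def detect_speeding(frame, speed_kmph, limit_kmph=80):
    """Stage 1: trigger only when the measured speed exceeds the limit."""
    return frame if speed_kmph > limit_kmph else None

def segment_number_plate(frame):
    """Stage 2: crop the part of the image that contains the plate (stub)."""
    return frame

def recognize_characters(plate_image):
    """Stage 3: OCR the plate crop into a registration number (stub)."""
    return "KA-01-AB-1234"   # placeholder value for illustration only

def generate_ticket(registration_number, speed_kmph):
    """Stage 4: look up the transport department's records and issue a ticket."""
    return Ticket(registration_number, speed_kmph)

def process(frame, speed_kmph):
    frame = detect_speeding(frame, speed_kmph)
    if frame is None:
        return None                          # not speeding: nothing more to do
    plate = segment_number_plate(frame)      # output of one stage ...
    number = recognize_characters(plate)     # ... becomes input to the next
    return generate_ticket(number, speed_kmph)
```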
As you implement and evaluate your system, you find that the whole system provides an accuracy of only 68%. Obviously, you want to increase the overall accuracy. The question is: which component of the system should you improve first?
# 1 Ceiling Analysis
Just like any other analysis involving performance tuning, you need to identify a single parameter (or figure of merit) that you will use to evaluate your system. In the current example, we take the ability to accurately fetch the registration details of speeding vehicles as the parameter of interest.
- First, assume that the first component (speed detection) works with 100% accuracy by manually feeding the correct data to the second component; this correct data may have been obtained by hand.
- Now see how the accuracy of the whole system improves. Say it rises to 72%. This means that by improving the first component you can gain at most 4%; this 4% represents the ceiling of the improvement available from the first component.
- Next, assume that the second component also generates output with 100% accuracy (i.e., both the first and second components are working with 100% accuracy). Say the overall accuracy of the whole system now rises to 80%. This means that by improving the second component (or stage), you can get a maximum improvement of 8%; this 8% again represents the ceiling of the improvement available from the second stage.
- Repeating this exercise for each remaining stage gives the ceiling of every component, so you can focus your effort on the component whose improvement promises the largest gain in overall accuracy.
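A minimal sketch of the bookkeeping behind ceiling analysis follows, assuming you can substitute hand-labelled (ground-truth) output for each stage in turn. The 68%, 72%, and 80% figures come from the text; the remaining numbers are hypothetical placeholders.

```python
# Ceiling analysis: substitute ground truth for each stage in turn, record the
# end-to-end accuracy, and read off each stage's ceiling as the jump it causes.
measurements = [
    ("baseline (no stage perfect)",        0.68),
    ("speed detection made perfect",       0.72),
    ("+ plate segmentation made perfect",  0.80),
    ("+ character recognition perfect",    0.95),  # hypothetical value
    ("+ ticket generation perfect",        1.00),  # hypothetical value
]

previous = measurements[0][1]
for stage, accuracy in measurements[1:]:
    ceiling = accuracy - previous
    print(f"{stage:38s} ceiling = {ceiling:.0%}")
    previous = accuracy
```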
# 2 Data Quality
In the world of machine learning, the richness of the training data matters far more for good predictions than minor improvements to the algorithms.
- Data models used in business intelligence are dependent on integrity models of relational databases with rules based on functional dependencies of the data.
- Significant work has been done on sanitizing data in business applications, with satisfactory techniques for replacing null values, identifying missing data, and identifying duplicate data. These techniques are based on rules created by domain experts from their understanding of data dependencies, consistency requirements, and functional requirements. Rule sets are created based on the requirements of the domain; as the number of data sources increases, the rule sets become extensive, complex, and hard to create.
- The rules for data cleaning are also dependent on each other and need to work in tandem.
- For example, finding duplicate data works better if the missing data has been replenished; on the other hand, missing data is better replenished if the duplicates are identified first. It often helps to execute the rules together to improve the quality of the data, and the order of rule execution often makes a difference to the resulting data quality.
- Machine learning techniques such as Bayesian models can be used to apply multiple rules together to improve data quality.
- Missing data or nulls in structured data can be filled in with the known median (or another statistical parameter) for that variable. Depending on the domain or application, it may instead make sense to remove the records that contain nulls (a small pandas sketch of these cleaning steps follows this list).
- For example, in a time series or a sequence, a missing value can be filled with a copy of the previous value or with the average of the values adjacent to it. In non-sequential data, such as user shopping preferences, a data point with a missing or uncertain user ID may simply be deleted from the dataset.
- How duplicate data is detected depends on the domain.
- In a simple case, if the user ID matches an existing entry in the database and only one entry per user ID is expected, then any additional entry is a duplicate. Depending on the domain model, the entries can be merged or one of them deleted.
- Data profiling and cleaning are performed based on domain knowledge.
- For example, the user name John Doe can also be written as Doe John or J.D. Using other information available in the dataset, such entries can be resolved to the same user or to different users. Complex rules based on domain information are used to resolve such conflicts and to profile the data.
- Anomalous or incorrect data, whether due to errors in entry or to incorrect labelling, also results in similar errors in modeling.
- Anomaly detection techniques discussed in Chap. 13 can be used to remove anomalous data. Domain rule-based techniques can also be used to remove the anomalous data points.
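As a concrete illustration of the cleaning steps above (filling missing values, removing duplicates, and dropping anomalous points), here is a minimal pandas sketch. The table, column names, and thresholds are hypothetical.

```python
import pandas as pd

# A hypothetical purchase table; the values are made up for illustration.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", None, "u3"],
    "amount":  [120.0, 120.0, None, 75.0, 10_000.0],
})

# Missing data: fill numeric nulls with the column median, and drop rows whose
# user_id is missing (as suggested above for non-sequential data).
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.dropna(subset=["user_id"])

# Duplicate data: if only one entry per user_id is expected, keep the first
# (a real system might merge the entries instead, per the domain model).
df = df.drop_duplicates(subset=["user_id"], keep="first")

# Anomalous data: a simple domain rule; the anomaly-detection techniques of
# Chap. 13 could be used here instead.
df = df[df["amount"] < 5_000]   # hypothetical rule: single purchase >= 5,000 is suspect

print(df)
```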
# 2.1 Unstructured Data
Unstructured data, also termed Big Data, requires additional techniques to clean it and make it useful for machine learning algorithms. The data quality and precision requirements change with the application.
- For example, an application that identifies fashion trends to decide the most effective store display needs only generic trends based on past customer profiles and the pace of change in tastes, whereas deciding the credit terms to extend to a customer requires a more precise financial profile of that customer, along with the default trends within their class of profile.
# 2.2 Getting Data
It is very important to spend some time thinking of ways to collect a lot of labelled data.
Often, you can create many more instances of data from a set of already available data. For example, for character recognition, you can create a large set of labelled training data through the following (an augmentation sketch appears after this list):
- Use many different fonts for the character.
- Put a random background behind these characters.
- Apply a slight rotation, or blur some portions of the character.
- Apply some distortion, e.g., stretch the character only vertically.
- You can also use a crowdsourcing platform such as Amazon Mechanical Turk or Figure Eight (formerly called CrowdFlower) to get a huge quantity of labelled data.
- Or, you could collect and label your own data, which can be very time consuming. The best option is to tap into an already existing dataset (if available) that you can re-purpose.
- Also, make sure that the data you collect for training is a fair representation of the sample space on which you finally intend to apply your algorithm.
- For example, suppose you want to create a facial-recognition application that is supposed to work across the globe. You will want to ensure that it is trained on a large dataset representing a wide variety of demographic variation across ethnicity, age, gender, dressing styles, etc. There have been embarrassing cases of face-recognition systems not working well for people of specific ethnic backgrounds!
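As a minimal sketch of the character-augmentation ideas listed above, here is how extra labelled samples might be generated with Pillow. The font paths, parameter ranges, and function name are hypothetical choices, not prescriptions from the text.

```python
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def synthesize_character(char, font_path, size=64):
    """Render one character and apply simple label-preserving augmentations."""
    # Random background shade instead of a clean white canvas.
    background = random.randint(160, 255)
    image = Image.new("L", (size, size), color=background)
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, size=random.randint(32, 48))
    draw.text((8, 8), char, fill=0, font=font)

    # Slight rotation and blur; both keep the label unchanged ("A" stays "A").
    image = image.rotate(random.uniform(-10, 10), fillcolor=background)
    image = image.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 1.0)))

    # Vertical-only stretch as a simple distortion, then crop back to size.
    image = image.resize((size, int(size * random.uniform(1.0, 1.3))))
    return image.crop((0, 0, size, size)), char   # (augmented image, label)

# Example usage: many fonts x many random augmentations per character.
# fonts = ["/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"]   # hypothetical path
# dataset = [synthesize_character("A", f) for f in fonts for _ in range(100)]
```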
# 3 Improvisations over Gradient Descent
Plain gradient descent can meander toward the solution and can get stuck in local minima of the cost curve; this section covers some improvisations for reaching the minimum of the cost curve faster.
- The improvisations mentioned here will help jump over such local minima – in addition to reducing the meandering of the solution.
# 3.1 Momentum
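The text does not spell out the update rule, so here is the standard momentum update as a minimal sketch; the hyperparameter defaults are typical values, not taken from the text.

```python
def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    """One momentum update: accumulate an exponentially weighted velocity of
    past gradients and step along it, which damps zig-zagging (meandering)."""
    velocity = beta * velocity + grad   # common formulation; some texts scale grad by (1 - beta)
    theta = theta - lr * velocity
    return theta, velocity
```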
# 3.2 RMSProp
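Similarly, a minimal sketch of the standard RMSProp update, which scales each parameter's step by a running average of its squared gradients; again, the defaults are typical values.

```python
import numpy as np

def rmsprop_step(theta, grad, sq_avg, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: divide by the root of the running mean of squared
    gradients, so directions with large gradients get smaller steps."""
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)
    return theta, sq_avg
```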
# 3.3 ADAM (Adaptive Moment Estimation)
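ADAM combines the two ideas above. Here is the standard update with bias correction as a hedged sketch, with typical default hyperparameters.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update: momentum-style first moment plus RMSProp-style second
    moment, each bias-corrected for the early steps (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```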
# 4 Software Stacks
# 5 Choice of Hardware
















