In most real-life problems, you need to break the problem down into several components and apply a combination of techniques. These techniques are applied in sequence, forming a pipeline in which the output of one algorithm stage becomes the input to the next stage.
Consider a system that you are designing to identify the number plate of a speeding vehicle. For this, you will have to design four components:
- Speed detection: This component will monitor the speed of the oncoming/passing vehicles and will trigger whenever the speed goes above the prescribed speed limit. Depending on the policy, instead of just a speeding/not speeding decision, it might record the actual speed. And, for speeding vehicles, it will capture the image of the car.
- Number plate segmentation: This component will extract the portion of the camera image that contains the number plate. The rest of the car image is not of much importance.
- Identifying the vehicle number: This component will recognize the characters on the number plate – in order to identify the registration number of the vehicle.
- Generating the ticket: This component will access the transport department’s records to generate the ticket for the vehicle and post it to the right contact address.
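A minimal sketch of how these four stages might be wired together is shown below. The function names, the `Ticket` type, and the placeholder return values are hypothetical, not a description of any real system.

```python
# Hypothetical sketch of the four-stage pipeline; stages 2 and 3 are stubs.
from dataclasses import dataclass

@dataclass
class Ticket:
    registration_number: str
    speed_kmph: float

def detect_speeding(frame, speed_kmph, limit_kmph=80):
    """Stage 1: trigger only when the measured speed exceeds the limit."""
    return frame if speed_kmph > limit_kmph else None

def segment_number_plate(frame):
    """Stage 2: crop the part of the image that contains the plate (stub)."""
    return frame

def recognize_characters(plate_image):
    """Stage 3: OCR the plate crop into a registration number (stub)."""
    return "KA-01-AB-1234"   # placeholder value for illustration only

def generate_ticket(registration_number, speed_kmph):
    """Stage 4: look up the transport department's records and issue a ticket."""
    return Ticket(registration_number, speed_kmph)

def process(frame, speed_kmph):
    frame = detect_speeding(frame, speed_kmph)
    if frame is None:
        return None                          # not speeding: nothing more to do
    plate = segment_number_plate(frame)      # output of one stage ...
    number = recognize_characters(plate)     # ... becomes input to the next
    return generate_ticket(number, speed_kmph)
```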
As you implement and evaluate your system, you find that the whole system provides an accuracy of only 68%. Obviously, you want to increase the overall accuracy. The question is: which component of the system should you improve first?
# 1 Ceiling Analysis
Just like any other analysis involving performance tuning, you need to identify a single parameter (or figure of merit) that you will use to evaluate your system. In the current example, we take the ability to accurately fetch the registration details of speeding vehicles as the parameter of interest.
- First, assume that the first component (speed detection) works with 100% accuracy by manually feeding the correct data to the second component; this correct data may have been obtained by hand.
- Now see how the accuracy of the whole system improves. Say it rises to 72%. This means that by improving the first component you can gain at most 4%; this 4% represents the ceiling of the improvement available from the first component.
- Next, assume that the second component also generates output with 100% accuracy (i.e., both the first and second components are working with 100% accuracy). Say the overall accuracy of the whole system now rises to 80%. This means that by improving the second component (or stage), you can get a maximum improvement of 8%; this 8% again represents the ceiling of the improvement available from the second stage.
- Repeating this exercise for each remaining stage gives the ceiling of every component, so you can focus your effort on the component whose improvement promises the largest gain in overall accuracy.
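A minimal sketch of the bookkeeping behind ceiling analysis follows, assuming you can substitute hand-labelled (ground-truth) output for each stage in turn. The 68%, 72%, and 80% figures come from the text; the remaining numbers are hypothetical placeholders.

```python
# Ceiling analysis: substitute ground truth for each stage in turn, record the
# end-to-end accuracy, and read off each stage's ceiling as the jump it causes.
measurements = [
    ("baseline (no stage perfect)",        0.68),
    ("speed detection made perfect",       0.72),
    ("+ plate segmentation made perfect",  0.80),
    ("+ character recognition perfect",    0.95),  # hypothetical value
    ("+ ticket generation perfect",        1.00),  # hypothetical value
]

previous = measurements[0][1]
for stage, accuracy in measurements[1:]:
    ceiling = accuracy - previous
    print(f"{stage:38s} ceiling = {ceiling:.0%}")
    previous = accuracy
```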
# 2 Data Quality
In the world of machine learning, the richness of the training data matters far more for good predictions than minor improvements to the algorithms.
- Data models used in business intelligence are dependent on integrity models of relational databases with rules based on functional dependencies of the data.
- Significant work has been done on sanitizing data in business applications, with satisfactory techniques for replacing null values, identifying missing data, and identifying duplicate data. These techniques are based on rules created by domain experts from their understanding of data dependencies, consistency requirements, and functional requirements. Rule sets are created based on the requirements of the domain; as the number of data sources increases, the rule sets become extensive, complex, and hard to create.
- The rules for data cleaning are also dependent on each other and need to work in tandem.
- For example, finding duplicate data works better if the missing data has been replenished; on the other hand, missing data is better replenished if the duplicates are identified first. It often helps to execute the rules together to improve the quality of the data, and the order of rule execution often makes a difference to the resulting data quality.
- Machine learning techniques such as Bayesian models can be used to apply multiple rules together to improve data quality.
- Missing data or nulls in structured data can be filled in with the known median (or another statistical parameter) for that variable. Depending on the domain or application, it may instead make sense to remove the records that contain nulls (a small pandas sketch of these cleaning steps follows this list).
- For example, in a time series or a sequence, a missing value can be filled with a copy of the previous value or with the average of the values adjacent to it. In non-sequential data, such as user shopping preferences, a data point with a missing or uncertain user ID may simply be deleted from the dataset.
- How duplicate data is detected depends on the domain.
- In a simple case, if the user ID matches an existing entry in the database and only one entry per user ID is expected, then any additional entry is a duplicate. Depending on the domain model, the entries can be merged or one of them deleted.
- Data profiling and cleaning are performed based on domain knowledge.
- For example, the user name John Doe can also be written as Doe John or J.D. Using other information available in the dataset, such entries can be resolved to the same user or to different users. Complex rules based on domain information are used to resolve such conflicts and to profile the data.
- Anomalous or incorrect data, whether due to errors in entry or to incorrect labelling, also results in similar errors in modeling.
- Anomaly detection techniques discussed in Chap. 13 can be used to remove anomalous data. Domain rule-based techniques can also be used to remove the anomalous data points.
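As a concrete illustration of the cleaning steps above (filling missing values, removing duplicates, and dropping anomalous points), here is a minimal pandas sketch. The table, column names, and thresholds are hypothetical.

```python
import pandas as pd

# A hypothetical purchase table; the values are made up for illustration.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", None, "u3"],
    "amount":  [120.0, 120.0, None, 75.0, 10_000.0],
})

# Missing data: fill numeric nulls with the column median, and drop rows whose
# user_id is missing (as suggested above for non-sequential data).
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.dropna(subset=["user_id"])

# Duplicate data: if only one entry per user_id is expected, keep the first
# (a real system might merge the entries instead, per the domain model).
df = df.drop_duplicates(subset=["user_id"], keep="first")

# Anomalous data: a simple domain rule; the anomaly-detection techniques of
# Chap. 13 could be used here instead.
df = df[df["amount"] < 5_000]   # hypothetical rule: single purchase >= 5,000 is suspect

print(df)
```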
# 2.1 Unstructured Data
Unstructured data, also termed Big Data, requires additional techniques to clean it and make it useful for machine learning algorithms. The data quality and precision requirements change with the application.
- For example, an application that identifies fashion trends to decide the most effective store display needs only generic trends based on past customer profiles and the pace of change in tastes, whereas deciding the credit terms to extend to a customer requires a more precise financial profile of that customer, along with the default trends within their class of profile.
# 2.2 Getting Data
It is very important to spend some time thinking of ways to collect a lot of labelled data.
Often, you can create many more instances of data from a set of already available data. For example, for character recognition, you can create a large set of labelled training data through the following (an augmentation sketch appears after this list):
- Use many different fonts for the character.
- Put a random background behind these characters.
- Apply a slight rotation, or blur some portions of the character.
- Apply some distortion, e.g., stretch the character only vertically.
- You can also use a crowdsourcing platform such as Amazon Mechanical Turk or Figure Eight (formerly called CrowdFlower) to get a huge quantity of labelled data.
- Or, you could collect and label your own data, which can be very time consuming. The best option is to tap into an already existing dataset (if available) that you can re-purpose.
- Also, make sure that the data you collect for training is a fair representation of the sample space on which you finally intend to apply your algorithm.
- For example, suppose you want to create a facial-recognition application that is supposed to work across the globe. You will want to ensure that it is trained on a large dataset representing a wide variety of demographic variation across ethnicity, age, gender, dressing styles, etc. There have been embarrassing cases of face-recognition systems not working well for people of specific ethnic backgrounds!
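As a minimal sketch of the character-augmentation ideas listed above, here is how extra labelled samples might be generated with Pillow. The font paths, parameter ranges, and function name are hypothetical choices, not prescriptions from the text.

```python
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def synthesize_character(char, font_path, size=64):
    """Render one character and apply simple label-preserving augmentations."""
    # Random background shade instead of a clean white canvas.
    background = random.randint(160, 255)
    image = Image.new("L", (size, size), color=background)
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, size=random.randint(32, 48))
    draw.text((8, 8), char, fill=0, font=font)

    # Slight rotation and blur; both keep the label unchanged ("A" stays "A").
    image = image.rotate(random.uniform(-10, 10), fillcolor=background)
    image = image.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 1.0)))

    # Vertical-only stretch as a simple distortion, then crop back to size.
    image = image.resize((size, int(size * random.uniform(1.0, 1.3))))
    return image.crop((0, 0, size, size)), char   # (augmented image, label)

# Example usage: many fonts x many random augmentations per character.
# fonts = ["/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"]   # hypothetical path
# dataset = [synthesize_character("A", f) for f in fonts for _ in range(100)]
```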
# 3 Improvisations over Gradient Descent
Plain gradient descent can meander toward the solution and can get stuck in local minima of the cost curve; this section covers some improvisations for reaching the minimum of the cost curve faster.
- The improvisations mentioned here will help jump over such local minima – in addition to reducing the meandering of the solution.
# 3.1 Momentum
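The text does not spell out the update rule, so here is the standard momentum update as a minimal sketch; the hyperparameter defaults are typical values, not taken from the text.

```python
def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    """One momentum update: accumulate an exponentially weighted velocity of
    past gradients and step along it, which damps zig-zagging (meandering)."""
    velocity = beta * velocity + grad   # common formulation; some texts scale grad by (1 - beta)
    theta = theta - lr * velocity
    return theta, velocity
```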
# 3.2 RMSProp
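Similarly, a minimal sketch of the standard RMSProp update, which scales each parameter's step by a running average of its squared gradients; again, the defaults are typical values.

```python
import numpy as np

def rmsprop_step(theta, grad, sq_avg, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: divide by the root of the running mean of squared
    gradients, so directions with large gradients get smaller steps."""
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)
    return theta, sq_avg
```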
# 3.3 ADAM (Adaptive Moment Estimation)
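ADAM combines the two ideas above. Here is the standard update with bias correction as a hedged sketch, with typical default hyperparameters.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update: momentum-style first moment plus RMSProp-style second
    moment, each bias-corrected for the early steps (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```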
# 4 Software Stacks
# 5 Choice of Hardware
















