By DataScience.com. Sponsored Post.

In this series, we will explore core concepts related to the study and application of natural language processing. Part one below provides an introduction to the field and explains how to identify lexical units as a means of data preprocessing.

Natural language processing is a set of techniques that allows computers and people to interact. Consider the process of extracting information from some data generating process: A company wants to predict user traffic on its website so it can provide enough compute resources (server hardware) to service demand. Engineers can define the relevant information to be the amount of data requested. Because they control the data generating process, they can add logic to the website that stores every request for data as a variable. Then they can define the unit of measurement as the byte, which in turn allows the information to be represented as integers. With an excellent representation of the information in hand, the engineers can store it in a tabular database so analysts can make predictions based on this historical data.

Natural language processing is the application of the steps above (defining representations of information, parsing that information from the data generating process, and constructing, storing, and using data structures that hold the information) to information embedded in natural languages.

What makes a language natural is precisely what makes natural language processing difficult: the rules governing the representation of information in natural languages evolved without predetermination. These rules can be high level and abstract, such as how sarcasm is used to convey meaning, or quite low level, such as using the character "s" to denote plurality of nouns. Natural language processing involves identifying and exploiting these rules with code to translate unstructured language data into information with a schema.
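For contrast with natural language, the structured website-traffic case described above can be sketched in a few lines. This is only an illustration: the table name `traffic`, its columns, and the request sizes are made-up assumptions, and an in-memory SQLite database stands in for whatever tabular store a real team would use.

```python
import sqlite3

# Hypothetical request log: (hour of day, bytes requested). Values are invented.
requests = [(9, 1200), (9, 800), (10, 1500), (10, 500), (10, 2000)]

# Store each request as a row of integers in a tabular database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (hour INTEGER, bytes_requested INTEGER)")
conn.executemany("INSERT INTO traffic VALUES (?, ?)", requests)

# Aggregate demand per hour, the kind of historical summary an analyst
# could feed into a forecast.
rows = conn.execute(
    "SELECT hour, SUM(bytes_requested) FROM traffic GROUP BY hour ORDER BY hour"
).fetchall()
bytes_per_hour = dict(rows)
conn.close()

print(bytes_per_hour)
```

Here the schema is known before any data arrives. With natural language, the equivalent of the `bytes_requested` column has to be discovered from the text itself.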
Language data may be formal and textual, such as newspaper articles, or informal and auditory, such as a recording of a telephone conversation. Language expressions from different contexts and data sources will have varying rules of grammar, syntax, and semantics. Strategies for extracting and representing information from natural languages that work in one setting often fail in others.

Companies often have access to records of natural language that contain valuable information. Product reviews or even tweets on Twitter can contain specific complaints or feature requests related to a product, which can help prioritize and evaluate proposals. Online marketplaces may have item descriptions available that can help define a taxonomy of products. A digital newspaper may have an archive of online articles that can be used to build a search engine that lets users find relevant content. Representations of natural language can also be useful for building powerful applications, such as bots that respond to questions or software that translates from one language to another.

[Image: Natural language processing can be used to identify specific complaints from text.]

Projects requiring natural language processing are generally organized around these sorts of challenges. Solving them usually requires us to piece multiple subtasks together in sequence, and there may be many approaches to each subtask. The universe of natural language processing methods can be daunting, as it is highly specialized, vast, and somewhat lacking in an overarching conceptual framework. While a complete summary of natural language processing is well beyond the scope of this article, we will cover some concepts that are commonly used in general purpose natural language processing work. We'll assume that we have access to textual data with which to work (not auditory data, which requires the additional step of speech recognition).
As natural languages are generally composed of words, an initial step of many natural language processing projects is identifying words within some raw text. The concept of a word, however, may be too restrictive or ambiguous. The strings "cats" and "cat" are different forms of the same entry in the dictionary; should they be treated equivalently? "Star Wars" has no entry in the dictionary, and though it contains a space, we think of it as a single entity. These are the sorts of challenges involved in defining lexical units, which represent basic elements of a vocabulary.

For a given task, the researcher must define what constitutes an appropriate lexical unit. Should singular and plural forms of a word be considered to belong to the same lexical unit? Assume we are building a question answering system, and receive the following queries:

A: "Find the closest theaters to me."
B: "Find the closest theater to me."

In A, the user is implying that she wants to view multiple theaters, whereas in B, she just wants the single closest theater. Throwing away the distinction between singular and plural will degrade the quality of our application.

Alternatively, assume that we want to summarize the following product reviews:

A: "The product had some connectivity problems."
B: "The product had a connectivity problem."
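The choice of lexical unit discussed above can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the function names `tokenize`, `naive_singular`, and `lexical_units` are hypothetical, the tokenizer is a simple regular expression, and the plural rule just strips a trailing "s", which is far cruder than a real stemmer or lemmatizer.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def naive_singular(token):
    """Strip one trailing 's'. Crude assumption: misfires on words like 'this' or 'lens'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def lexical_units(text, merge_plurals=False, phrases=frozenset()):
    """Map raw text to lexical units.

    merge_plurals: treat singular and plural forms as the same unit.
    phrases: two-word sequences (as lowercase tuples) to keep as one unit.
    """
    tokens = tokenize(text)
    units = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            # Multiword entity such as "Star Wars": emit it as a single unit.
            units.append(" ".join(pair))
            i += 2
        else:
            tok = tokens[i]
            units.append(naive_singular(tok) if merge_plurals else tok)
            i += 1
    return units

query_a = "Find the closest theaters to me."
query_b = "Find the closest theater to me."

# Keeping plurals distinct preserves the difference between the two queries.
print(lexical_units(query_a))
print(lexical_units(query_b))

# Merging plurals makes the two queries indistinguishable.
print(lexical_units(query_a, merge_plurals=True))

# A known phrase is kept together even though it contains a space.
print(lexical_units("I watched Star Wars", phrases={("star", "wars")}))
```

Whether `merge_plurals` should be on is not something the code can decide; it depends on the task, which is the point of defining lexical units per application.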