Chapter 1
Introduction
1.1 Background
Web 2.0 sites allow users to do more than just retrieve information. Instead of merely reading, users are encouraged to comment on published articles, update their status on social networking sites (SNS), and collaborate by editing questions on question-and-answer sites. User-generated content (UGC) is now part of the online experience for billions of Internet users worldwide.
As online shopping grows, E-commerce is expanding internationally. On E-commerce sites, one major type of UGC is review information, which consists of consumers' purchase history, ratings and reviews, product descriptions provided by merchants, and friendships between users. Moreover, the popularity of mobile devices produces a large amount of location-based data. For example, users can check in at restaurants, shops and other places of interest through mobile applications, and share their locations and likes with friends. According to its about page,1 Taobao.com, the most popular consumer-to-consumer (C2C) platform in China, has about 500 million registered users and more than 60 million visitors every day. It sells 48,000 products per minute out of the 800 million products offered per day. Yelp.com, a crowd-sourced local business review site, hosts 67 million reviews,2 while Dianping.com, a Chinese counterpart of Yelp, hosts 60 million reviews,3 according to their websites.
The growth in users, products and trading volume brings a huge volume of UGC to E-commerce sites, and yields heterogeneous or unstructured data from various sources. Among the different types of UGC, online reviews, especially the rating or score4 and the accompanying piece of descriptive text, have a great impact on both consumers and merchants. Consumers comment on the products they bought in terms of quality, service, delivery, etc. Potential buyers can read reviews, in addition to the seller's description, to get a clear picture of a certain product. This helps them judge products and make purchase decisions based on other people's shopping experience. Furthermore, these reviews may reveal user preferences, which can serve as feedback that helps merchants improve their service and recommend products to target users.
However, the noise, heterogeneity and sheer size of online reviews prevent their efficient use. In this book, we focus on solving the problems introduced by these properties, including reviewer quality assessment, product normalization, review organization and recommendation application.
1.2 Challenges
Online shoppers search for and view products. They choose products with high scores, and compare similar products from different merchants. They can get further information on products by reading reviews from other consumers. As reviews are generated without strict or predefined control, they are noisy; as reviews are produced by different applications, especially on mobile terminals, they lack a uniform schema and their volume grows ever faster. To make the use of reviews easy and efficient, several techniques are employed. We learn user credibility to rank products fairly, and apply entity resolution to group similar products, which enhances the product lists presented to users. We select or summarize product reviews for easy information browsing. In addition, personalized recommendations supplement the product list.
Credibility Learning. When shopping online, people first search for the products they want by keywords. A list of products ranked by overall score is returned to users for comparison. Generally, a product's overall score is calculated by averaging all the review scores it has received. People usually click on the top few products to get more information. However, driven by economic benefits, some users produce noisy data, e.g., undeservedly high scores for product promotion or false low scores to damage competitors' reputation, which results in inaccurate evaluation. In the extreme case considered in our work, review spammers intentionally give fake reviews and ratings. Moreover, as reviews accumulate on E-commerce sites, reading them manually to judge their reliability is not feasible. Hence, it is crucial to learn the credibility of reviewers to re-score and rank products/ratings fairly.
Our survey suggests that two observations may explain the noisy data in product scoring: (1) some customers do not give fair reviews; (2) the score and the comment of a review can be inconsistent. The first observation is due to the fake ratings mentioned above or to user behavior characteristics such as habitually giving lenient evaluations. For the second observation, inconsistency means that a review combines a high score with a negative comment. This happens for two reasons. One is that, though giving high scores, consumers may mention some disadvantages which did not affect their shopping experience; the other is pressure from merchants, who seek more high scores and use so-called after-sales services to lean on consumers who assign low scores. Considering these two observations, it is necessary to analyze user review behaviors to assess user credibility.
We design an approach that tackles the above problems and calculates new overall scores for products in two steps. The first step employs a supervised learning method to correct the inconsistency between review scores and comments. The newly predicted scores are then used as customers' review scores for calculating the overall scores. The second step evaluates user credibility, so that the originally assigned ratings of products and shops can be adjusted according to it. We construct a twin-bipartite graph to model the review relationships among users, products and shops, which are not fully exploited in previous work. We design a novel feedback strategy that increases and decreases user credibility iteratively over the graph by comparing individual ratings with collective ratings. The basic idea is that good products deserve high review scores while bad products should be given low scores, and that good customers should assign high scores to good products and low scores to bad products.
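The feedback idea can be illustrated with a small sketch. It is not the twin-bipartite algorithm developed in this book: it uses a single user-product graph, and the update rule, the function name `rescore`, the 1-5 rating scale and the parameter `max_gap` are all assumptions made for illustration. Each product score is a credibility-weighted mean of its ratings, and each user's credibility shrinks as the user's ratings deviate from the collective scores.

```python
def rescore(ratings, max_gap=4.0, iterations=20):
    """Illustrative credibility feedback loop (hypothetical update rule).

    ratings: list of (user, product, score) tuples, scores on a 1-5 scale.
    max_gap: largest possible distance between two scores (5 - 1 = 4).
    Returns (product_scores, user_credibility).
    """
    users = {u for u, _, _ in ratings}
    products = {p for _, p, _ in ratings}
    credibility = {u: 1.0 for u in users}  # every user starts fully trusted
    scores = {}
    for _ in range(iterations):
        # Collective rating: credibility-weighted mean of a product's ratings.
        for p in products:
            rated = [(credibility[u], s) for u, q, s in ratings if q == p]
            total = sum(w for w, _ in rated)
            if total > 0:
                scores[p] = sum(w * s for w, s in rated) / total
            else:
                # All raters have zero credibility: fall back to the plain mean.
                scores[p] = sum(s for _, s in rated) / len(rated)
        # Feedback: users whose individual ratings sit far from the
        # collective ratings lose credibility, and vice versa.
        for u in users:
            gaps = [abs(s - scores[p]) for v, p, s in ratings if v == u]
            mean_gap = sum(gaps) / len(gaps)
            credibility[u] = max(0.0, 1.0 - mean_gap / max_gap)
    return scores, credibility
```

With a few consistent reviewers and one reviewer who rates against the consensus, the dissenting reviewer ends up with lower credibility, and the product scores move away from the plain average toward the ratings of the consistent reviewers.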
Entity Resolution. The rapid growth in both the variety and quantity of products sold online makes product organization difficult, which may also affect user experience. Products without a uniform schema and a strict description specification may lead to search results that are either too few or too many. When searching for a product by keywords, customers would be disappointed to find duplicate products in a result list that is meant to be diverse. And sometimes they want to compare a group of similar products from various merchants, so search results containing products quite different from each other may not be satisfying. Hence, it is critical to apply entity resolution, which identifies instances that represent the same real-world entity, to products on E-commerce sites to further enhance the product list. Moreover, user-generated product descriptions may contain intentional errors introduced for economic purposes, or may have missing values. Mining the unstructured review text to help normalize products is therefore important for product organization.
We design two entity resolution frameworks, one centralized and one distributed, to find products or records that refer to the same entity. We first introduce the centralized one, which achieves product normalization by schema integration and data cleaning. A graph-based method is proposed for schema integration to produce a uniform and meaningful representation of products. Then we conduct data cleaning to create a precise and comprehensive description for each product. The evidence extracted from textual information is utilized for data cleaning, including missing value filling, incorrect value correction and value confirmation. Finally, we distinguish products by clustering on a product similarity matrix learned by logistic regression.
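The final step, clustering on a pairwise similarity matrix, can be sketched as follows. The text above does not say which clustering algorithm is used, so this sketch assumes the simplest option: link every pair whose match probability reaches a threshold and take connected components with a union-find structure. The names `match_probability` and `cluster`, the feature layout and the threshold of 0.5 are hypothetical.

```python
import math


def match_probability(features, weights, bias):
    """Logistic model: probability that two records are the same product.

    features: per-pair similarity features (e.g. title or brand similarity).
    weights, bias: parameters assumed to be learned elsewhere.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def cluster(n, similarity, threshold=0.5):
    """Group records 0..n-1 into connected components of the graph whose
    edges are the pairs with similarity >= threshold (assumed strategy)."""
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Connected components merge records transitively, so one weak link can join two otherwise distinct groups; a stricter clustering method may be preferable when precision matters more than recall.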
However, entity resolution is a compute-intensive job as it needs to compare each pair of records, so the quadratic cost of traditional matching algorithms is not feasible for large datasets. To handle the massive number of products on E-commerce sites, we design a distributed entity resolution framework and implement a fast matching algorithm based on MapReduce. MapReduce, a distributed computing framework, is well suited for entity matching as the pairwise similarity computation can be executed in parallel. Based on the unstructured product description data, we can generate high-dimensional vectors for products. However, high dimensionality may incur the curse of dimensionality when calculating similarity. We propose to transform the high-dimensional vectors into lower-dimensional signatures by a specified locality sensitive hash (LSH) function. We introduce a set of randomized algorithms for signature permutation, ensuring that similar products will be matched with high probability. To reduce redundant computation, which is a pervasive problem for entity resolution on MapReduce, we design our own algorithm to remove redundancy. The resulting entity matching framework balances load well and lowers network transmission.
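To make the signature idea concrete, here is a single-machine sketch of one common LSH family, random hyperplane hashing for cosine similarity, combined with banding to produce candidate pairs. The text above does not specify the hash function or the permutation scheme, so the choice of family, the function names and the parameters `bits` and `band_size` are assumptions; the MapReduce distribution and the redundancy-removal algorithm are not shown.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on.
    Vectors separated by a small angle agree on most bits."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def candidate_pairs(vectors, hyperplanes, band_size):
    """Split each signature into bands and bucket records per band.
    Records sharing at least one band bucket become a candidate pair."""
    buckets = defaultdict(list)
    for key, vec in vectors.items():
        sig = signature(vec, hyperplanes)
        for start in range(0, len(sig), band_size):
            buckets[(start, sig[start:start + band_size])].append(key)
    pairs = set()  # a set emits a pair once even if it collides in many bands
    for keys in buckets.values():
        pairs.update(combinations(sorted(keys), 2))
    return pairs
```

Only the candidate pairs need an exact similarity computation, which is what avoids comparing every pair of records. In a distributed setting the band key would be a natural shuffle key, and the same pair could then reach several reducers, which illustrates the kind of redundancy the paragraph above refers to.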
Review Selection. It is common for people to click on products in the ranking list and read reviews to get product details for decision-making. However, as online reviews have proliferated in recent years, E-commerce sites face the problem of information overload. On the one hand, it is too time-consuming for users to go through all the reviews of a certain product, especially for popular and trending items. On the other hand, an increasing number of users choose to shop via mobile devices. Due to limited screen size, these users prefer to read a small fraction of reviews to make their purchase decision in a short time. To support such applications, it is necessary and urgent to select a subset of reviews that covers useful information for each product and present it to users for reference.
E-commerce sites have adopted several representative review selection methods based on review ranking. Review ranking orders reviews by their helpfulness votes, so as to provide the top-k reviews to users. Helpfulness votes are cast by users for the reviews they find helpful. There are also a number of studies on automatically estimating the quality of reviews [Kim et al. (2006); Liu et al. (2007); Tsur and Rappoport (2009); Hong et al. (2012)], but they have two drawbacks. First, the resulting top-k reviews of a product may contain redundant information while some important attributes may not be covered. Second, since previous experiments [Danescu-Niculescu-Mizil et al. (2009)] show that users tend to consider reviews that follow the mainstream helpful, the resulting top-k reviews may lack opinion diversity. Motivated by these two observations, review selection based on attribute coverage [Tsaparas et al. (2011)] was proposed. It prefers reviews covering as many attributes as possible, but does not reflect the original opinion distribution. Lappas et al. (2012) then proposed selecting a set of reviews that keeps the proportion of positive and negative opinions on each attribute. However, we have found that this approach does not perform well, especially when selecting only a few reviews, because it overlooks attribute importance.
Hence, to improve the overall value of the top-k reviews, we view the top-k reviews as a review set rather than a simple aggregation of reviews. We propose an approach to select a small set of high-quality reviews that cover important attributes and diversified opinions. A Support Vector Machine (SVM) regression function is learned from both textual and user preference features to estimate review quality. We also evaluate the importance of attributes by calculating their weights. To improve the diversity of the resulting top-k reviews, we cluster reviews into different topic groups, and select reviews proportionally from each group to preserve the opinion distribution.
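The proportional selection step can be sketched as below. This is a simplified illustration under stated assumptions, not the full method: each review already carries a group label and a quality estimate (which the approach above would obtain from clustering and SVM regression), attribute weights are ignored, and slots are divided among groups by the largest-remainder rule, which is our choice rather than something stated in the text. The function name `select_reviews` is hypothetical.

```python
def select_reviews(reviews, k):
    """Pick k reviews so that each group is represented in proportion to
    its size, taking the highest-quality reviews within each group.

    reviews: list of (review_id, group, quality) tuples.
    """
    groups = {}
    for rid, group, quality in reviews:
        groups.setdefault(group, []).append((quality, rid))

    total = len(reviews)
    k = min(k, total)

    # Largest-remainder allocation: give each group the integer part of
    # its proportional quota, then hand the remaining slots to the groups
    # with the largest fractional parts.
    quotas = {g: k * len(members) / total for g, members in groups.items()}
    slots = {g: int(q) for g, q in quotas.items()}
    leftover = k - sum(slots.values())
    by_remainder = sorted(groups, key=lambda g: quotas[g] - slots[g], reverse=True)
    for g in by_remainder[:leftover]:
        slots[g] += 1

    selected = []
    for g, members in groups.items():
        best = sorted(members, reverse=True)[:slots[g]]  # highest quality first
        selected.extend(rid for _, rid in best)
    return selected
```

For example, with seven positive and three negative reviews and k = 4, the quotas are 2.8 and 1.2, so three positive reviews and one negative review are selected, which keeps the selection close to the original 70/30 split.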
Review Summarization. Unlike representative review selection, review summarization sketches the overall opinion on the different aspects hidden in reviews. Previous work normally assumes that each individual review has one discussion object. For example, a book review is about a particular book. However, there are reviews with several discussion objects, such as restaurant and trip reviews. User-generated reviews for a restaurant are different from those for a single product, as a restaurant review is generally a mixture of opinions on various dishes. People care about the food, service and environment of restaurants. In particular, they are keen to know the taste, quality and composition of each dish, as they can only try a few dishes per visit. Though previous works have built review summaries on different aspects, they do not consider the latent semantic relationships among them, which reduces the conciseness of the summaries. This results in a brand new review summarization task that extracts information for each dish from the reviews of a certain restaurant. Moreover, only one overall rating is assigned to a dining experience, which makes it difficult to obtain viewpoints on each dish.
To complete the new task, we try to generate product-oriented or dish-oriented summaries,5 each of which consists of an evaluation score and key comments on a product that contain common opinions from past customers. Furthermore, as a summary should be concise enough to fit on the screens of mobile devices, we partition the comments into short snippets and provide a small set of representative snippets. Therefore, our new task includes two subtasks for summarizing the reviews of a product: (1) estimate the product score, which is different from the restaurant's overall score; (2) select descriptive snippets that represent the comments on the product well.
We provide two solutions to product-oriented review summarization. The first solution employs a hierarchical term tree to classify terms semantically. It considers two kinds of aspects: opinion-aspects, which refer to the positive or negative opinions on products, and attribute-aspects, which refer to the description of product features. This solution has three steps: (1) extract product snippets; (2) predict snippet scores; (3) summarize product snippets. In the first step, we extract the words surrounding product mentions as snippets and classify whether the snippets contain opinions. In the second step, we predict the opinion scores of evaluative snippets using several different approaches. After the first two steps, we have candidate snippets with predicted scores for each product. We can then select snippets with respect to the opinion-aspects and attribute-aspects.
The second solution proposes a new bilateral topic model that supports efficient and accurate analysis of review comments from both the rating aspect and the text aspect. As an extension of latent Dirichlet allocation (LDA) [Blei et al. (2003)], our new model features a two-dimensional topic matrix, which incorporates aspects of dishes on one dimension and scores on the other. Every pair of aspect and score forms an individual topic, while the topic-dependent and score-dependent correlations are preserved in the topic matrix. We also derive a new inference algorithm for the model that automatically labels every word in the comments with probabilities over the aspect-score pairs. After model training, a joint algorithm is designed for snippet selection, which exploits the probabilistic information on the possible aspects and scores of snippets on dishes.
Recommendation. So far, there are two major appr...