SECTION II
Social Media
4
TOWARD REALIZING META SOCIAL MEDIA CONTENTS MANAGEMENT SYSTEM IN BIG DATA
Takafumi Nakanishi, Kiyotaka Uchimoto, & Yutaka Kidawara
NATIONAL INSTITUTE OF INFORMATION AND COMMUNICATIONS TECHNOLOGY, JAPAN
1. Background
Even though the amount of social media data is increasing explosively, we understand very little from it and merely become saturated by reading each SNS timeline. In addition, each piece of data is too fragmented. For example, a tweet is a mere sequence of no more than 140 characters. We often obtain news updates or notices from one tweet without understanding the broader perspective of the issue in focus. Most current search and retrieval methods operate on individual pieces of data. For example, current systems match a user’s query against individual pieces of data and return those that correspond to it. We cannot understand an issue by focusing on just one piece of data. Current search and retrieval systems therefore do not fit the current social media situation. Since there are many types of social media on the Web, we must realize a new analysis and visualization method that interconnects various massive heterogeneous social media contents.
Based on the above background, we have to focus on Big data analytics, which is completely different from such current data analytics as data mining technology. The key issues of Big data analytics are heterogeneity, continuity and visualization.
In this chapter, we introduce Big data analytics and describe its features: heterogeneity, continuity and visualization. We also show one application example, called “Topic-Based Browsing of Conversation Tendencies in Twitter,” and present an overview of our proposed meta social media contents management system.
2. What Is Big Data Analytics? How Is It Related to Social Media?
Recently, not only businesspeople but also researchers are focusing on Big data, which is defined by three Vs (Berman, 2013):
Volume: Large amounts of data
Variety: Different forms of data, including traditional database, images, documents and complex records
Velocity: Data content constantly changing through the absorption of complementary data collections and from streaming data from multiple sources
Current research on Big data focuses on high performance computing and parallel distributed processing. However, we have to focus on another aspect, the schemaless data processing issue, which differs from the very large database (VLDB) issue. Schemaless means that schemas cannot be designed in advance. Most current systems have goal-oriented designs, whereas systems for Big data environments cannot fix any goals before user queries arrive.
It is important to discover answers or clues for users in real time. A system has to create an appropriate schema from the data themselves in response to user queries. Until now, data have been organized based on database schema. Currently, only various fragmentary data exist on the Web.
This is a huge paradigm shift. We must create schema and data structures that correspond to the processing required by users after they input queries. We have to shift system design from closed assumptions to open assumptions.
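The idea of creating a schema only after a query arrives can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in this chapter: the function name infer_schema, the record fields, and the sample records are all hypothetical.

```python
def infer_schema(records, query_fields):
    """Build a schema (field -> set of observed type names) at query time.

    Only the fields named in the query are inspected; nothing is fixed
    in advance. Both the function and its inputs are hypothetical.
    """
    schema = {}
    for rec in records:
        for field in query_fields:
            if field in rec:
                schema.setdefault(field, set()).add(type(rec[field]).__name__)
    return schema


# Hypothetical fragmentary records from heterogeneous sources.
records = [
    {"source": "twitter", "text": "Great concert tonight", "lang": "en"},
    {"source": "facebook", "text": "At the baseball game", "likes": 12},
    {"source": "linkedin", "headline": "New role", "likes": "3"},
]

# The schema is derived only for what the user's query asks about.
schema = infer_schema(records, ["text", "likes"])
print(schema)
```

In this sketch the field "likes" is observed with two different types, which is the kind of irregularity a schema fixed in advance would have to rule out.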
Heterogeneity, continuity and visualization are the most critical features of Big data analytics, which provides scale and connection merits based on them. No current data analysis methods are based on open assumptions. Big data analytics provides a new data analysis method based on open assumptions. Below, we discuss the inconsistencies caused by continuing to use the current methods. Figure 4.1 shows the relationship among the three elements (volume, variety and velocity) of Big data’s definition and its analysis features (heterogeneity, continuity and visualization).
Big data analytics is related to the analysis of social media contents because they are one example of Big data. In Big Data Planet (2013), Hewlett-Packard claims that every 60 seconds, social media users generate more than 98,000 tweets on Twitter, 695,000 status updates on Facebook, 11 million instant messages and 217 new mobile Web users.
To analyze all social media, we have to consider and integrate various heterogeneous social media contents, including Facebook, Twitter, LinkedIn, etc. Social media contents can be regarded as human sensors. Each social media user generates fragmented data about feelings and emotions and gives updates about attending events like concerts and baseball games. When discrete social media contents are treated as approximately continuous, we can identify each person’s moving trajectory. We do not want to show every single piece of data as current search engines do; rather, we want to identify the trends of human actions. Our system has to provide a visualization of the overview. Social media analytics includes Big data analytics.
FIGURE 4.1 Relationship among three Vs of Big data definition and Big data analytics definition: heterogeneity, continuity, and visualization.
3. Heterogeneity: Big Data Analytics Features
In Big data analytics, heterogeneity differs from the variety in the Big data definition. Variety in the Big data definition refers to content types such as images, sounds and documents. Heterogeneity refers to data fields such as news, entertainment, technology and science, all of which are semantic aspects.
In Big data analysis, reasonable correlations must be discovered between heterogeneous fields. Currently, semantic Web technologies (Berners-Lee, 2006; Bizer, Heath, & Berners-Lee, 2009; Greaves & Mika, 2008) or association rule extraction technologies (Gonzales, Nakanishi, & Zettsu, 2011) are generally used. However, in Big data analytics, three inconsistencies arise because the Big data environment is an opened assumption, not a closed assumption (Nakanishi, Uchimoto, & Kidawara, 2013).
4. Example of Three Opened Assumptions’ Inconsistencies
We focus on human relationships to represent our example (see Figures 4.2 and 4.3).
First, in Figure 4.2, we present an example of human relationships between AI and DB communities that share fields. ai and bj are researchers. The edges represent the similarity of their research and are symmetric and transitive relationships. When someone adds a symmetric and transitive relationship between a3 and b4, a1 is related to b5 because a1 is related to a3, a3 is related to b4 and b4 is related to b5. Realistically, a1 may also be related to b5.
Next, in Figure 4.3, we illustrate another example of personal relationships between workplace and music communities by assuming that no common fields exist. The edges represent friendships and relationships between coworkers or co-session members, and they are symmetric and transitive relationships. For example, a3 met b4 at a party and they became friends. In this case, we add a symmetric and transitive relationship between a3 and b4. Is it true that a1 is related to b5 when we add such a relationship between a3 and b4 in the graph structure? In the graph structure, a1 is related to b5. However, realistically, a1 and b5 do not share any common ground without other definitions or analysis. In this case, inconsistencies are caused by the previous methods.
FIGURE 4.2 Relationships among persons in communities AI and DB. ai and bj are researchers. When someone adds symmetric and transitive relationships between a3 and b4, it is true that a1 is related to b5 because a1 is related to a3, a3 is related to b4, and b4 is related to b5.
FIGURE 4.3 Relationships among persons in workplace and music communities. ai are co-workers and bj are musicians. When someone adds symmetric and transitive relationships between a3 and b4, it is not true that a1 is related to b5. In the graph structure, it is true that a1 is related to b5. However, realistically, a1 and b5 do not share common ground without other definitions or analysis.
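The graph-level reasoning in both examples can be sketched in a few lines. This is only an illustration under assumptions: the edge lists below are hypothetical and do not reproduce the exact edges of Figures 4.2 and 4.3, and the function name related is ours.

```python
from collections import defaultdict


def related(edges, start, goal):
    """Return True if goal is reachable from start over symmetric edges.

    Reachability stands in for the symmetric and transitive closure
    of the relationships drawn in the figures.
    """
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False


# Hypothetical edges inside each community, plus the added link a3-b4.
community_a = [("a1", "a3"), ("a2", "a3")]
community_b = [("b4", "b5"), ("b4", "b6")]
bridge = [("a3", "b4")]

before = related(community_a + community_b, "a1", "b5")
after = related(community_a + community_b + bridge, "a1", "b5")
print(before, after)  # False True
```

The computation gives the same answer whether the edges mean research similarity or a friendship formed at a party, because the graph carries no information about what the sets have in common. That is where the inconsistency of the second example comes from.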
The difference between the first and second examples is community positioning. Here, we consider a community to be a set and the persons are its elements. For the first example, the AI community is set A and the DB community is set B. Because the two communities share fields, the following is the relation of sets A and B:
A ∩ B ≠ Ø
For the second example, the workplace community is set A and the music community is set B. Because no common fields exist, the following is the relation of sets A and B:
A ∩ B = Ø
These examples illustrate the inconsistencies. The previous methods have been applied to various fields. However, their results have been used for cooperation only in cases where neighboring fields are linked or within the same field. They do not completely apply to linking heterogeneous fields. The previous methods may nevertheless be applied to the second example. In this case, any relation between the sets is only implicit. When elements are added to heterogeneous sets, the elements are treated as having the same order relation. In this case, it is implicitly assumed that A ⊂ B, B ⊂ A, or A ∩ B ≠ Ø.
When it is true that A ∩ B ≠ Ø, we use the previous methods. However, we cannot make real-world inferences for A ∩ B = Ø because set theory is limited. In set theory, we have to define the transitive and order relationships in each attribute based on the relationships in each new scene. Earlier computer schemes such as database and rule-based systems were designed under closed assumptions, in which new scenes do not appear. However, current systems interconnect heterogeneous systems or the data for heterogeneous fields, which are not closed assumptions.
In the Big data era, we must discover the relations for A ∩ B = Ø. We believe that discoveries lead to knowledge. Computers are discovering new relations based on opened assumptions overlooked by humans. We must create a system that discovers relationships when A ∩ B = Ø.
5. Three Opened Assumptions’ Inconsistencies with Two Easy Mathematical Proofs
First, we provide proofs of the inconsistency of order relations between two certain sets.
The preconditions of the proofs are as follows. There are two sets, A = {a1, a2, …, an} and B = {b1, b2, …, bm}, where A ∩ B = Ø. Each set defines its order relation differently.
We prove that we cannot determine the relationship between sets A and B or other relationships when we get relationship f between a1 ∈ A and b1 ∈ B.
Proof: We attempt induction on i and show that bi = f(ai) does not hold in general.
When i = 1, b1 = f(a1) is true by the above condition.
We assume that bk = f(ak) is true when i = k.
When i = k + 1, bk+1 = f(ak+1) is not necessarily true, because set A has one order relation whereas set B has another. bk ≤ bk+1 may not be true even if ak ≤ ak+1 is true, and vice versa. Furthermore, ak ≤ ak+1 and bk ≤ bk+1 may both be untrue. Therefore, although b1 = f(a1) is true, bi = f(ai) is not guaranteed.
[Q.E.D]
We cannot uncover the relation between each heterogeneous set when we discover or link between heterogeneous elements. It is also difficult to identify other relations with clue b1 = f(a1).
Next, we prove the inconsistency of the order relation when someone links to elements of a heterogeneous set. Using the same sets A and B as in the former case, set B has order relation b1 ≤ b2 ≤ b3 ≤ b4 ≤ …. Set B has a transitive relation: if b1 ≤ b2 and b2 ≤ b3 are true, then b1 ≤ b3 is true. Set A has its own order relation.
Proof: We examine whether a1 ≤ b3 is true when we obtain relation a1 ≤ b1. To state the conclusion first, a1 ≤ b3 may not be satisfied. We show a counterexample: Assume a1 = (1, 5), b1 = (2, 1), b2 = (3, 2), and b3 = (4, 3).
The relationship between a1 and b1 compares the first elements.
Then a1 ≤ b1 is true.
The order relation of set B compares the second elements. Then b1 ≤ b2 ≤ b3, and since b1 ≤ b2 and b2 ≤ b3 are true, b1 ≤ b3 is true.
However, a1 ≤ b3 is not true in the order relation of set B, because the second element of a1 (5) is greater than that of b3 (3).
Thus, when a relation such as that between a1 and b1 is added, an inconsistency occurs in which the order and transitive relations of set B are no longer guaranteed.
[Q.E.D]
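The counterexample can be checked mechanically. The following sketch uses the values given in the proof; the function names cross_leq and b_leq are ours, and reading the two relations as comparisons of the first and second tuple elements is an assumption drawn from the wording of the proof.

```python
# Values taken from the counterexample in the proof.
a1 = (1, 5)
b1, b2, b3 = (2, 1), (3, 2), (4, 3)


def cross_leq(x, y):
    """Relation linking a1 and b1: compares the first elements (assumed reading)."""
    return x[0] <= y[0]


def b_leq(x, y):
    """Order relation of set B: compares the second elements (assumed reading)."""
    return x[1] <= y[1]


link_holds = cross_leq(a1, b1)                  # 1 <= 2
chain_holds = b_leq(b1, b2) and b_leq(b2, b3)   # 1 <= 2 and 2 <= 3
transitive_in_b = b_leq(b1, b3)                 # 1 <= 3
extended_to_a1 = b_leq(a1, b3)                  # 5 <= 3

print(link_holds, chain_holds, transitive_in_b, extended_to_a1)
# True True True False
```

Transitivity holds inside set B, but chaining the cross-set link a1 ≤ b1 with B's own order does not yield a1 ≤ b3, because the two relations compare different elements.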
Even if we strictly define the order and transitive relations within a certain set, an inconsistency occurs once a relation with elements outside the set is added.
6. The Three Opened Assumptions’ Inconsistencies
Until now, computer science researchers have based their ideas on closed assumptions by freely linking and interconnecting each object. For example, the interconnection between element ai in set A and element bj in set B for A ∩ B ≠ Ø remains a closed assumption. However, users, especially data-intensive scientists, do not require such knowledge. They have to consider new discovery methods in opened assumptions, where A ∩ B = Ø.
Note that such inconsistencies only occur when extending the current methods introduced in Section 2. We call these inconsistencies the Three Opened Assumptions’ Inconsistencies:
A relation does not guarantee the future.
For example, we can identify relationships among sets through data mining technology. Note that the results only represent relationships in the present data. These relationships are not guaranteed if new records (data) are added to the system. Occasionally, researchers and users predict an uncertain value of a new record using the extracted relationships. However, such usage is incorrect. To show why predicting uncertain values by data mining is not meaningful, we assume that sets A and B are attributes in a relational database and that ai and bi are the attribute values of each set. The data mining result is guaranteed if no updates occur. However, most tables undergo many updates. We assume that the database holds k records, so that each attribute has k values. The system performs data mining and extracts bi = f(ai). This relation f is only guaranteed while there are k records in the database. If the number of records becomes k + 1, relation f is no longer guaranteed. Indexing relations, which are extracted by various methods, is therefore meaningless for predicting uncertain or missing values.
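A small sketch makes the k versus k + 1 argument concrete. It is an illustration under assumptions, not a data mining method from this chapter: the function extract_relation and the record values are hypothetical, and "relation f" is modeled simply as a lookup table that exists only if the present records define a function from A to B.

```python
def extract_relation(records):
    """Return a dict a -> b if the records define a function, else None.

    This stands in for "data mining" over the present records only.
    """
    mapping = {}
    for a, b in records:
        # setdefault stores b for a new a, or returns the stored value.
        if mapping.setdefault(a, b) != b:
            return None
    return mapping


# k = 3 hypothetical records (a_i, b_i).
records = [(1, 10), (2, 20), (3, 30)]
f = extract_relation(records)
print(f)  # {1: 10, 2: 20, 3: 30}

# Record k + 1 arrives and contradicts the extracted relation.
f_after_update = extract_relation(records + [(2, 25)])
print(f_after_update)  # None

# The extracted relation also says nothing about an unseen value of A.
print(f.get(4))  # None
```

The relation extracted from k records is valid for exactly those records. One additional record can invalidate it, and it gives no basis for a value that was never observed.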
A transitive relation is not true when...