1
DIGITAL INSIGHTS AND DIGITAL FICTIONS
In just a quarter of a century, from the mid-1990s, the world has witnessed a transformation of its communication systems resulting in a global network of inter-connections through a combination of wired and wireless technologies that have radically changed how people, businesses, governments and virtually every other kind of organisation engage with each other. The emergence of the Internet during this time enables people to stay constantly connected, provides access to huge amounts of information about anything anyone might need to know, and has created a parallel reality in which people can live their lives. In this setting, people and organisations can put themselves on display, reveal who their family and friends are, disclose their thoughts and feelings, likes and dislikes, and display their interests, opinions, skills and daily activities. At the same time, they leave behavioural data trails through their online search patterns revealing what types of information they have searched for, which web sites they have visited and how long they spent at each location. All of this activity occurs on an open public platform within specific sites such as blogs, micro-blogs and social networks. Today, many of us live out large portions of our lives online. Although this extension of our everyday reality can enable us to connect more readily to others than we can in the offline world, it can also bring unwanted attention from sources that might wish to take advantage of us for their own benefit.
Developments in computer science and the social sciences have made it possible to track public behaviour and opinion through online records that are refreshed minute-by-minute. In addition to our ‘search’ behaviour, many of us reveal other information about ourselves through the comments we post about all manner of things and through the conversations we have with other online users. For those who are interested, therefore, large quantities of freely available data are being produced on a continuous basis that can provide insights into our current preoccupations and concerns, past histories, and our attitudes towards all kinds of entities including commercial products and services, social issues, public policies, government departments, political parties and political candidates, celebrities and other public figures, and many other things. While a lot of these ‘data’ are unstructured, as compared with, for instance, public opinion data obtained from carefully constructed surveys, they are also ‘natural’ in that they comprise people’s use of everyday language.
The data produced by people online are both diverse and extensive. Moreover, most of these data are there for the taking. A data analyst simply needs the right toolkit to mine the information we post about ourselves online. Once suitably equipped, there are potentially untold riches to be discovered that enable any interested parties to profile citizens’ and consumers’ views on all kinds of topics. There is a great deal of market, political, cultural and social intelligence to be obtained in this online world.
Understandably, there has been increased recognition of the potential value of these data sources. Within this often chaotic-appearing chatter, it is possible, by analysing its linguistic qualities, to discover insights into the characters of the people behind it. Whereas questionnaire-based self-completion or interview surveys have been the traditional methods for gathering large amounts of data about people’s attitudes, beliefs and perceptions, the data these methods produce are dwarfed by the ‘big data’ now available: the vast quantities of natural language and image content produced every day by Internet users, especially on ubiquitous micro-blogging and social networking sites such as Facebook, Instagram, MySpace, Pinterest, Snapchat, Twitter, WhatsApp and others, that all contain similar information, albeit in different formats. In the past decade, significant strides have been taken in the creation of research techniques that can not just analyse massive quantities of online data but also lend structure to the unstructured big data in the form of natural language and deliver usable computations that can inform business and public policy decisions, and be used to guide the content and shape of social, marketing and political campaigns (Hamilton et al., 2007; Reinhold & Bhutaia, 2007; Casteleyn et al., 2009).
Some of these online data analysis techniques can ‘read’ the texts of online messages and posts and extract from them indicators of opinion or sentiment. These techniques have come to be referred to, collectively, as ‘sentiment analysis’. In addition, there are other techniques that examine patterns of linguistic nuances such as vocabulary choices, grammatical style and expressions of meaning (semantics) that can often be highly distinctive to each individual language producer. Analysis of natural language discourses can reveal insights into people’s lives and lifestyles, values, beliefs, feelings, perceptions, knowledge and understanding, and even their personality traits (Schillewaert et al., 2009; Thelwall, Wilkinson & Uppal, 2010; Thelwall, Buckley et al., 2010).
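To make the idea of sentiment analysis concrete, the following is a minimal sketch of one simple approach: counting words from a sentiment lexicon. It is an illustration only, not the method of any study cited here. The word list, its weights, the negation rule and the function names are all assumptions made for this example; working systems rely on much larger, validated lexicons or on trained statistical models.

```python
import re

# Hypothetical toy lexicon, invented for illustration. Positive weights mark
# favourable words, negative weights unfavourable ones.
SENTIMENT_LEXICON = {
    "love": 2, "great": 2, "good": 1, "like": 1,
    "bad": -1, "poor": -1, "hate": -2, "terrible": -2,
}

# Words assumed to reverse the sentiment of the word that follows them.
NEGATORS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum the lexicon weights of the words in `text`.

    The sign of a word's weight is flipped when the word directly
    follows a negator (for example, "not good").
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        weight = SENTIMENT_LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            weight = -weight
        score += weight
    return score


def classify(text):
    """Label a post as positive, negative or neutral from its score."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for post in ["I love this phone, great camera",
                 "Not good at all",
                 "The parcel arrived on Tuesday"]:
        print(post, "->", classify(post))
```

Even this toy version shows why accuracy and consistency are open questions for such methods: sarcasm, slang and context-dependent meanings fall outside anything a fixed word list can capture.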
This book will examine online profiling methods that have been created to classify people through their non-verbal and verbal digital behaviour patterns. This work spans a number of academic disciplines and has had a range of real-world applications. There is academic interest in creating the best possible methods for online person profiling in disciplines including computer science, economics, health science, information science, linguistics, mathematics and political science as well as in sociology and psychology. There has also been growing investment in research in this field by the commercial sector with the market research industry playing a leading role in applying new methods for large-scale, economical and quick-turnaround consumer classification and behaviour modelling services for their clients in relation to product and service development, marketing and promotion, and consumption and sales monitoring (Hamilton et al., 2007; Reinhold & Bhutaia, 2007; Casteleyn et al., 2009; Schillewaert et al., 2009).
A great deal of attention has been devoted to the development of data analysis tools that focus on ‘user-generated’ content that appears on people’s online micro-blogging and social networking sites. Users of sites such as Facebook and Twitter generate massive amounts of data every day. Much of this content contains disclosures about users’ opinions expressed in their own words. Users devote considerable space in their posts to revealing their thoughts and feelings about all kinds of things. The new research techniques developed to cope with such data have created a buzz around them because they have been trumpeted as heralding a new future for public opinion polling.
Despite the excitement created by these techniques, ultimately their end-users need assurances that they provide data every bit as good as those provided by traditional polls. Whether this is the case depends upon the nature of the samples from which data are obtained and the ability of linguistic analysis methods to identify ‘feelings’ or ‘opinions’ in an accurate and consistent fashion. Early evidence has shown that these methods have some promise but have not yet proven to be an entirely adequate substitute for more conventional forms of market research data (Hardey, 2011a, 2011b, 2011c).
Natural language comments made online can sometimes replicate fairly accurately offline open-ended comments about the same things. Comparisons of opinions expressed about the same objects in face-to-face in-depth interviews, focus group discussions and chatter in online forums reveal that these different methods can yield similar results. One potential problem with new online chatter is that the anonymity people experience in that setting might encourage them to express more extreme views than they would in face-to-face conversations (Hardey, 2011c). Then, in a focus group setting, there is always a possibility that individual participants succumb to group conformity in the views they give more than they would in an online chat setting where they might be exposed to a much wider range of views, causing them to consider a variety of positions before deciding on their own viewpoint (Gunter et al., 2014).
As already noted, online profiling spans a number of disciplines. Methodologically, the key disciplines are linguistics and computer science. Linguistics provides the methods for analysing textual or natural language data. Computer science provides the techniques for large-scale data analysis. Their hybrid, ‘computer linguistics’, represents a discipline that combines these two analytical approaches and allows sophisticated language analysis to be conducted swiftly on a very large scale. In this book, a third discipline, psychology, is central to online data analysis where researchers seek to discover the psychological characteristics of Internet users. In particular, can computer science and linguistics tools be used to measure the personality profiles of individuals in massive online samples using their online search behaviours and natural language discourses? The interest in personality stems from recognition of its importance in predicting how people behave across a wide range of social settings and how they usually respond to different kinds of social stimulus. A detailed understanding of an individual’s personality traits places those who might wish to communicate specific persuasive messages to this person in a stronger position to know how to influence that individual’s social perceptions and behaviour.
Public attention to the use of computer linguistics methods in computing psychological profiles of individuals from their disclosures online has been dramatically focused by news stories about major political campaigns in the United States and Britain in which such data were allegedly used unethically and possibly illegally by individuals trying to influence the way people voted. It is worth pausing to examine this publicity and the concerns it raised about this whole new area of human profiling before discussing the science behind it.
Online profiling and political insights
A political earthquake occurred in America in the 2016 presidential election. Against the odds, the Republican candidate, property magnate and TV reality star, Donald Trump, was elected as 45th President of the United States. The bombastic, self-promoting Trump beat the more politically experienced Democratic candidate Hillary Clinton in the Electoral College vote, although not in the popular vote. It had been incredible enough to many people that the ‘Grand Old Party’ had selected him as their nominated candidate. Yet, in the primaries in which contenders fought for their party’s nomination, he fought off many more experienced political campaigners and professional, career politicians. According to one commentator, Trump and his campaign team were surprised themselves to have got as far as the head-to-head with Hillary Clinton and they never expected to win (Wolff, 2018).
Trump’s victory came on the back of promises he made to the American people to bring a different approach to running the country. He was not afraid to make controversial statements. His campaign will be remembered especially for a promise to build a wall along the border between the United States and Mexico to keep out illegal migrants and to repatriate Mexicans already in the USA (Pengelly, 2015). He also attracted much ire from the liberal classes for proposing a complete ban on Muslims travelling from a target list of Islamic countries (Pilkington, 2015). His ‘America First’ mantra indicated that he was prepared to withdraw the United States from international trade agreements he regarded as disadvantageous to his country and to pull back from America’s role as the world’s peacekeeper, particularly in relation to its funding of international bodies such as NATO.
He levelled accusations of criminality at his Democratic Party rival, Clinton. He labelled her ‘crooked Hillary’. Yet Trump himself was mired in controversy about his attitude towards women, his tax affairs, which he refused to disclose, contravening convention among incoming Presidents, and accusations that members of his campaign team were consorting with Russian government agents during the campaign and afterwards (Owen, 2017; Porter, 2017; Rothwell & Krol, 2017; Waldman, 2017). The latter accusations were particularly serious given claims that there had been Russian state-sponsored advertising and so-called ‘fake news’ activities on Facebook designed to undermine the Clinton candidacy. After initial denials, Facebook eventually conceded that these activities had taken place on its site (Gambino, 2017). These advertisements began in the summer of 2015 and continued until several months after the new president assumed office. Many of these messages were aimed at polarising the American public’s views on controversial issues such as gun law, immigration, gay rights and racial discrimination. More than 3,300 messages were placed in all (Fredericks, 2017).
However, it was a different role played by Facebook during the presidential campaign that attracted attention after Trump had won the election. A data science firm called Cambridge Analytica, partly owned by wealthy American hedge-fund manager Robert Mercer, claimed to have culled psychological profile data about Facebook users that it had used to guide the targeting of political messages from the Trump campaign machine to voters, especially in constituencies where the vote would be tight. These Facebook data were allegedly combined with ‘big data’ about consumers obtained from other data mining companies and entered into statistical models to produce detailed personal profiles of Facebook users that were then used to guide the targeting of political messages to them designed to influence the way they voted.
The idea here was that users’ Facebook data could be used to identify personality markers that could, in turn, be used to construct a psychographic typology of voters. Such data might reveal insights into the kinds of argument or appeal voters with specific personality profiles would be most sensitive or responsive to. These data, so it was claimed, could not only be extracted from the default profile details users often provided about themselves upon joining Facebook (e.g., demographic details plus information about hobbies and interests, entertainment and leisure preferences, and so on), but also from linguistic analysis of their posts to the site.
Text messages on users’ Facebook walls, private messages to contacts and status updates could be analysed not just in terms of the personal information they supplied, but also by the style of language being used. An analytical methodology had apparently been developed that could identify language style attributes, a sort of idiosyncratic linguistic fingerprint, that could in turn serve as a marker of specific personality traits. In particular, language attributes were identified that could predict what Cambridge Analytica referred to as ‘OCEAN’, an acronym derived from what psychologists today call the ‘Big Five’ personality factors: Openness to Experience, Conscientiousness, Extraversion, Agreeableness and Neuroticism (Funk, 2016). Campaign messages tailored to these idiosyncrasies could then be targeted at those types of voter via their social media sites.
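The general logic of moving from language style to a trait estimate can be sketched in a few lines of code. This is a hypothetical illustration, not a reconstruction of the methodology described above, whose details are not given here. The word lists, the feature names, the weights in `EXTRAVERSION_WEIGHTS` and the function names are all invented for the example. In practice, such weights would have to be estimated from large samples of people who had both written text and completed a validated personality questionnaire.

```python
import re

# Hypothetical word lists, invented for illustration only.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
SOCIAL_WORDS = {"friend", "friends", "party", "we", "us", "together"}


def style_features(text):
    """Extract a few simple language-style features from a post.

    Rates are expressed per word so that long and short posts
    can be compared.
    """
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1  # avoid division by zero for empty posts
    return {
        "first_person_rate": sum(w in FIRST_PERSON for w in words) / n,
        "social_word_rate": sum(w in SOCIAL_WORDS for w in words) / n,
        "mean_word_length": sum(len(w) for w in words) / n,
        "exclamations_per_word": text.count("!") / n,
    }


# Hypothetical weights linking style features to one trait. These numbers
# are made up; they are not empirical estimates from any study.
EXTRAVERSION_WEIGHTS = {
    "social_word_rate": 3.0,
    "exclamations_per_word": 2.0,
    "mean_word_length": -0.1,
}


def trait_score(features, weights):
    """Combine features into a single trait score as a weighted sum."""
    return sum(weight * features[name] for name, weight in weights.items())


if __name__ == "__main__":
    posts = [
        "I love my friends! We had a great party together!",
        "Completed the quarterly documentation review yesterday evening.",
    ]
    for post in posts:
        score = trait_score(style_features(post), EXTRAVERSION_WEIGHTS)
        print(round(score, 3), "<-", post)
```

The weighted sum stands in for whatever statistical model an analyst might actually fit. The point of the sketch is only that style features can be counted automatically and combined into a score, which is what makes profiling feasible across millions of posts.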
An example might be targeting messages directed at people classed as highly Neurotic with fear-arousing messages about the Islamic State to encourage them to support tighter immigration controls and, of course, to vote for the candidate strongest on this policy. Another example might be to produce messages for someone classed as high in Openness to Experience (who welcomes fresh ideas and approaches to problem-solving) that emphasise opening borders to people from overseas who might bring new skills while enriching cultural diversity in the United States.
Despite the impact of the Trump victory on the profile of Cambridge Analytica, its story as a new type of political pollster pre-dates any part the company might have played in the Trump presidential campaign. Before then, the firm worked for another Republican Party contender in the same election, Ted Cruz. It was reported that Cambridge Analytica had ‘harvested’ psychological data from millions of Facebook users on his behalf during the presidential primaries. It was further alleged that the company had obtained these data without the permission of the people that had supplied them. The data were used to find ways of giving Cruz an edge over his key Republican rivals, including Trump, by developing ‘psychographic profiles’ of voters that could help in the design of more eff...