The case, like many others, began with a phone call from a detective.
‘Hello? Dr Heydon? I’m a detective at Mackay CIB, Queensland Police. I am not sure if you’re the right person, or if you can help me, but we’ve got a murder victim and an anonymous threatening letter and we think it was written by the suspect because the spelling is really similar …’
‘Oh, right,’ I think, ‘spelling.’ I sigh inwardly. ‘OK, so you would like me to conduct an analysis to see if your suspect might be the same person who wrote the anonymous letter? Well, the thing about that is, spelling is almost never a very good indicator of authorship. It’s just too similar across authors – even really bad spelling.’
But the detective was insistent: this was really unusual spelling, and it was consistent between the anonymous letter and all the written material he had gathered that was known to be written by the suspect. Could I just take a quick look?
The high-profile case involved a young woman who died from serious injuries after being assaulted near her home. Her boyfriend was the prime suspect, despite having made a prominent appearance in a public march organised to honour the woman’s memory and protest the violence and brutality of her death. Police were convinced the boyfriend had written the anonymous letter, which denounced the victim in graphic and explicit language as a poor lover and an unfaithful girlfriend. The prosecution case would rest in part on the attribution of the authorship of the letter to the suspect, on the basis that the letter was evidence of murderous intent. The suspect had denied writing the letter during his police interview.
I agreed to assess the evidence but I had already drafted the case rejection letter in my head before I’d put down the phone, and by the time the materials arrived on a disk sent by registered post, I had completely dismissed the request from my mind. And the reason I was so sure I couldn’t help? As I had told the detective in Queensland, spelling is rarely a good indicator of authorship. But to understand why this is so, let’s first understand more about this type of analysis.
One of the most common types of forensic linguistic cases is the attribution of the authorship of a given text to an individual: authorship attribution. Typically, as in my Mackay case, the mystery text is an anonymous threatening letter, and it implicates the author in some kind of criminal activity or supports the case against them. In civil matters, authorship attribution is commonly needed where a will or other legal document is contested. Forensic linguists in various jurisdictions have worked on authorship attribution cases, and it is a frequent topic of research published in the International Journal of Speech, Language and the Law and in books on forensic linguistics (Coulthard, 2000; Gibbons, 2012; Larner, 2014; Nini & Grant, 2013; Perkins & Grant, 2018; Shaomin, 2016; Woolls & Coulthard, 1998), as well as in computational linguistics and information technology. However, the origins of authorship attribution are neither legal nor linguistic. Historically, authorship attribution has been the realm of literary mysteries, perhaps most famously the authorship of the plays attributed to William Shakespeare (Trucco, 1988) and of the Federalist papers (Mosteller & Wallace, 1963). In these literary cases, the methods used have evolved from largely stylistic analysis, where texts are described and compared with regard to various literary devices and historical context, to stylometric approaches such as word counts (Stamatatos, 2008). There is not the space here to provide a more complete overview of the many different approaches taken to authorship attribution; however, it is important to note that a considerable amount of work has been done in the computational linguistics and information science fields. Stamatatos (2008) provides a detailed description and critique of the various statistically or computationally supported methods of authorship attribution and of the procedures used to evaluate or test them.
Some approaches to forensic authorship attribution for legal matters are non-statistical and stylistic, an approach advocated in books by McMenamin, among others (McMenamin, 2002). However, several linguists have sharply criticised this method on the grounds that the features chosen for inclusion in the analysis are not selected for their known capacity to separate texts by different authors. For more on this, see David Crystal’s critique of McMenamin (Crystal, 1995). Essentially, critics such as Crystal (1995) and Chaski (2001) find that stylistic analysis fails to address a central problem of authorship attribution: the problem of likelihood. For any feature that we might choose to analyse in the documents, we must be able to say how likely it is that the feature could not have been produced by some other author (Grant, 2007). Stylistic analysis does not attempt to calculate the statistical likelihood that the features identified in both known and questioned texts could have been produced by any other author.
This brings me back to the Mackay murder case, and the problem with spelling, or more properly, misspelling. In ordinary, casual written documents, patterns of misspelling across different authors are so common that misspelling alone is almost always irrelevant as an identifying feature. Experience told me that the pattern of misspellings the detective in Mackay had noticed across the two sets of documents was unlikely to be a strong indicator that they were written by the same person. I was very confident that, given a short amount of time and an internet connection, I could demonstrate that what might have seemed to be unusual patterns of misspelling were in fact extremely common. I would do this using a corpus-based method, as described below.
Authorship attribution relies on being able to correctly group together any texts produced by the same author. Therefore, the linguistic problem to be solved is to find features of the text that vary according to authorship. Sadly, however, there is no known ‘text-fingerprint’1 – a pattern of language use that is unique to each author. Stylistics describes patterns of language that are similar or different between two texts, but does not attempt to calculate how likely it is that these patterns of language might appear in any other author’s texts. This means that a stylistic analysis has no statistical validity, which severely undermines its use in legal cases (Chaski, 2005). In the absence of a reliable and scientifically proven indicator of authorship, it is hard to see how such cases can be solved using linguistic analysis, although so many proposals are made for various computational methods that one is sure to emerge in the near future (Grieve, 2007; Stamatatos, 2008).
As should be clear to readers, I had no illusions about my capacity to provide a reliable finding in the Mackay case and I was especially sceptical of the detective’s assurances that the spelling patterns in the documents would be the key. When non-linguists see spelling patterns as a strong indicator of authorship, they are generally unaware of what constitutes a common feature of language. For example, if both the known and questioned documents (QD) in a case included the word cant (can’t without an apostrophe), this might appear to be a useful marker of authorship. In fact, a search of the Birmingham Blog (BB) Corpus, a collection of 630 million words drawn from blogging websites, shows that around 3.6% of instances of this word are spelled without the apostrophe, which makes it far from unusual.
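To make this kind of corpus check concrete, the short sketch below shows one way such a frequency figure could be produced. It is an illustration only, not the procedure used in the case: it assumes the corpus has been saved locally as a plain-text file (corpus.txt is a placeholder), whereas the BB Corpus itself is normally searched through its own query interface.

```python
# Illustrative sketch of a corpus frequency check for a candidate spelling feature.
# "corpus.txt" is a placeholder for a locally stored plain-text corpus.
import re
from collections import Counter

def variant_share(path: str, standard: str, variant: str) -> float:
    """Return the proportion of `variant` among all occurrences of either spelling."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower().replace("\u2019", "'")  # normalise curly apostrophes
    # Tokenise on letters plus internal apostrophes so that "can't" survives as one token.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    counts = Counter(tokens)
    total = counts[standard] + counts[variant]
    return counts[variant] / total if total else 0.0

if __name__ == "__main__":
    # Hits for "cant" would still need manual checking: cant is also a
    # standard English word (e.g. "thieves' cant"), not only a misspelling.
    share = variant_share("corpus.txt", "can't", "cant")
    print(f"'cant' accounts for {share:.1%} of can't/cant occurrences")
```

A figure of this kind is only a starting point: the analyst must still judge whether the feature is rare enough to carry any evidential weight.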
When the data from the Mackay CIB arrived, I realised that I had misjudged the detective: these really were some of the most unusual spelling patterns I had ever seen, and it was clear right away that this would be one case where misspelling might be used to attribute authorship.
In the following sections I will describe how to undertake one form of corpus analysis for authorship attribution, how to present the findings for a lay audience and how this type of analysis can be applied to legal practice. Throughout this chapter, I will be using the Mackay case file to illustrate various aspects of this type of analysis. Before we turn to the method used to undertake the analysis of the data, we must first address the framing of the research question, and the collection of the data.
Identifying research questions
Like any research project, a forensic case needs a clear question to which the analysis and findings will provide a response. A forensic case is a very specific and focused kind of research project: the objective of the analysis is determined by a legal framework. In authorship attribution cases, the objective of the analysis is to test whether or not the documents in question were written by the same author. The primary research question therefore might be:
Research Question 1. Does the linguistic analysis of the two data sets indicate that they were written by the same author?
In order to answer this question, we first need to answer at least two more questions:
Research Question 2. What features of the data can be used to discriminate between the texts by the known author and texts by any other author?
and
Research Question 3. What method can be used to analyse those features identified in Research Question 2 in order to produce valid results?
The terms of reference for the research questions need to be carefully defined, especially the scope of the term ‘valid results’. For the purposes of providing an expert opinion, the validity of the research results will be determined in part by the court’s rules relating to scientific evidence. There is a tendency for courts to prefer results that can be stated in a numerical form (Grant, 2007), which can be interpreted by the judge or jury as a percentage of probability that the proposition in question is true. In the Mackay case, the results would be described in terms of the likelihood that the two data sets were produced by the same person compared to the likelihood that the two data sets were produced by two different people. From this calculation, the court can determine the probability that the defendant, and not someone else, authored the questioned letter. Note the difference between these two findings: the first (the relative likelihood of common versus different authorship) can be determined using linguistic analysis and is the province of the linguistic expert; the second (the probability that the defendant wrote the letter) is determined using a common-sense analysis and is the province of the judge and/or jury (Coulthard, Johnson, & Wright, 2017; Gibbons, 2012; Grant, 2007).
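The contrast between these two findings can be made explicit using the standard likelihood-ratio framework from forensic science; the formulation below is a general illustration of that framework, not the specific calculation presented in the case.

$$\mathrm{LR} = \frac{P(E \mid H_{\mathrm{same}})}{P(E \mid H_{\mathrm{diff}})}$$

Here $E$ is the linguistic evidence (for example, a shared set of misspellings), $H_{\mathrm{same}}$ is the proposition that the known and questioned documents were written by the same author, and $H_{\mathrm{diff}}$ the proposition that they were written by different authors. The expert reports the likelihood ratio; combining it with the prior odds, which lie outside the linguist’s remit, gives the posterior odds that the defendant wrote the letter:

$$\frac{P(H_{\mathrm{same}} \mid E)}{P(H_{\mathrm{diff}} \mid E)} = \mathrm{LR} \times \frac{P(H_{\mathrm{same}})}{P(H_{\mathrm{diff}})}$$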
Returning to the research questions that we have identified for this project, we might start with Question 2 (‘What features of the data can be used to discriminate between the texts by the known author and texts by any other author?’) and this brings us to the matter of data collection.
In authorship attribution cases, the data available are usually supplied to the analyst by the police or lawyers. There might be times when it is possible or indeed necessary to specify the data required for the analysis, but mostly, the availability of data is predetermined by the facts of the case. This is especially true for the QD (see below). In relation to the documents of known or undisputed authorship, which I will refer to as KD (Known Document/s), there might be a large quantity of documents available, such as all the SMS texts from a mobile phone, or all the emails from an email account, or the quantity of material might be limited to only a handful of documents that were recovered from available sources and verifiably written by the suspect. In the case of the former, it is usual for the analyst to restrict the data set in some way, unless the method allows for a data set of limitless size. For example, a computational analysis can be run on extensive data sets because the process is automated (Stamatatos, 2008). Whichever data selection criteria are applied, these need to be carefully documented and justifiable within the parameters of a valid analysis. In the Mackay case, there was no initial requirement to apply selection criteria to the data collection, as the police supplied all the available KD, although some complicating factors did arise in relation to one of the KD (see further below).
Typically, a case will involve a smaller set of texts – sometimes a single document – which are of unknown or disputed authorship. From here on I will refer to these texts as the QD (Questioned Document/s). The availability of data in the QD set is usually limited but in some cases can be very extensive, such as where the case involves all the SMS texts from a mobile phone and the provenance or ownership of the phone itself is disputed. This is becoming more common in cases involving organised crime and terrorism where offenders use disposable mobile phones or SIM cards to avoid identification. In those cases, the data set required for the analysis might have to be defined and restricted using selection criteria, as described above for the KD data set.
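Where either data set is large, the selection criteria can be made explicit and reproducible. The sketch below is purely illustrative and is not drawn from the case file: it assumes the messages have already been exported to a CSV file (messages.csv, with sender, date and text columns, is a hypothetical format) and it records how many messages each criterion excluded, so that the selection can be documented and justified.

```python
# Illustrative only: the file name, column names, sender label and date range
# are hypothetical, not taken from the actual case materials.
import csv
from datetime import date

def select_known_documents(path, sender, start, end, min_words=3):
    """Apply documented selection criteria to an exported set of messages."""
    kept = []
    log = {"total": 0, "wrong_sender": 0, "out_of_range": 0, "too_short": 0}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects sender, date (YYYY-MM-DD) and text columns
            log["total"] += 1
            if row["sender"] != sender:
                log["wrong_sender"] += 1
            elif not (start <= date.fromisoformat(row["date"]) <= end):
                log["out_of_range"] += 1
            elif len(row["text"].split()) < min_words:
                log["too_short"] += 1
            else:
                kept.append(row["text"])
    return kept, log  # the log records how many messages each criterion excluded

if __name__ == "__main__":
    texts, log = select_known_documents(
        "messages.csv", sender="suspect",
        start=date(2015, 1, 1), end=date(2015, 12, 31),
    )
    print(f"kept {len(texts)} of {log['total']} messages; exclusions: {log}")
```

Keeping a simple exclusion log of this kind is one way of meeting the requirement that the selection criteria be documented and justifiable.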
In the Mackay case, the KD consisted of four texts produced by the suspect in response to a psychological self-help questionnaire (labelled KD1, KD2, KD3 and KD4), and approximately 850 text messages from the suspect’s mobile phone (KD5). The authorship of these documents was not disputed by the defence. The QD for this case was an anonymous typed document, approximately one and a quarter pages in length, which was titled ‘things about [name] i dont like’ and consisted of a list of 28 statements about the victim and her relationship to the author. The language was highly explicit and the document referred mainly to sex acts and intimate details about the victim (see Ethical Considerations in Introduction).
Once the data sets are defined and obtained, the next task relevant to data collection is to identify the features for analysis whose similarities and differences across the two data sets will allow the findings to show how likely it is that the two data sets might have common authorship. This decision is closely related to the method being used for analysis, and it is therefore difficult to discuss one without pre-su...