On Wednesday, February 28th, 2024, the seminar series hosted a talk by Dr. Arthur White, Assistant Professor in Statistics at the School of Computer Science and Statistics, Trinity College Dublin, titled “Cluster Analysis of Linguistic Profiles of Old English Texts”. Details of the talk are below.
Title
Cluster Analysis of Linguistic Profiles of Old English Texts
Abstract
This is ongoing work in collaboration with Mark Faulkner in the School of English. I will present the results from a preliminary cluster analysis applied to a dataset of 135,734 vowel-initial spellings taken from 2,406 Old English texts. Our goal is to systematically compare the texts of the surviving Old English corpus, which is highly fragmented and often of uncertain date and place of production. Surviving manuscripts were also often produced at unknown temporal and geographical removes from the composition of the original text. Due to the innate challenges involved in analysing such a corpus, traditional comparisons of texts could only be performed by human analysts, often in necessarily ad hoc and opaque ways. Modern data science tools now permit, for the first time, a more objective and systematic comparison of texts. Our analysis focuses on the association of different spellings with vowel sounds: texts with similar spelling profiles across multiple vowel sounds are assumed to be inherently similar. After some initial data pre-processing, hierarchical cluster analysis was performed. A very large number of clusters (100 or 250) was selected, and the cluster profiles of specific texts were then investigated to perform an initial qualitative validation of the cluster results. As well as describing the essential data characteristics and presenting initial cluster results, I will also highlight unusual aspects of the data and discuss the potential challenge of formally validating results. As this work is still at quite an early stage, any comments and suggestions regarding potential future directions will be especially welcome.
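As a rough illustration of the approach the abstract describes (spelling profiles per vowel sound, followed by hierarchical clustering cut at a fixed number of clusters), here is a minimal sketch in plain Python. It is not the speakers' method: the abstract does not state the distance measure, the linkage rule, or the pre-processing, so the L1 distance and average linkage used here are assumptions, and the text identifiers, vowel labels, and spellings are invented toy data.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy records of (text_id, vowel_sound, spelling). All values are invented
# for illustration and do not come from the real dataset.
records = [
    ("text_A", "V1", "ea"), ("text_A", "V1", "ea"), ("text_A", "V2", "e"),
    ("text_B", "V1", "ea"), ("text_B", "V2", "e"), ("text_B", "V2", "e"),
    ("text_C", "V1", "a"), ("text_C", "V1", "a"), ("text_C", "V2", "æ"),
    ("text_D", "V1", "a"), ("text_D", "V1", "ea"), ("text_D", "V2", "æ"),
]


def spelling_profiles(records):
    """Map each text to the share of each spelling within each vowel sound."""
    counts = defaultdict(Counter)
    for text_id, vowel, spelling in records:
        counts[text_id][(vowel, spelling)] += 1
    profiles = {}
    for text_id, counter in counts.items():
        per_vowel = Counter()
        for (vowel, _spelling), n in counter.items():
            per_vowel[vowel] += n
        profiles[text_id] = {
            key: n / per_vowel[key[0]] for key, n in counter.items()
        }
    return profiles


def distance(p, q):
    """L1 distance between two profiles (an assumed choice of metric)."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def agglomerate(profiles, n_clusters):
    """Average-linkage agglomerative clustering, stopped at n_clusters."""
    clusters = [[text_id] for text_id in sorted(profiles)]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(distance(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the closest pair; i < j, so deleting index j is safe.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


profiles = spelling_profiles(records)
clusters = agglomerate(profiles, n_clusters=2)
print(clusters)
```

On this toy input the two texts that spell both vowel sounds identically are grouped together, and the two texts that share most of their spellings form the second group. With thousands of texts, a library implementation of hierarchical clustering would be the practical choice; the pairwise search above is only meant to make the steps visible.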