On Friday, September 20th, 2024, the seminar series hosted a talk by Dr. Silvia D’Angelo, Assistant Professor in Statistics at the School of Computer Science and Statistics at Trinity College Dublin, titled “Data compression for clustering high-dimensional discrete data”. Details of the talk are below.
Title
Data compression for clustering high-dimensional discrete data
Abstract
Clustering high-dimensional data is a challenging task, typically addressed in the context of continuous numerical data. Recent literature is exploring strategies to cluster high-dimensional categorical and discrete data, which arise in numerous applications, such as RNA sequencing and text data analysis. We propose a fast and easy-to-implement approach to cluster high-dimensional discrete data, scalable to datasets with thousands of dimensions where other strategies may be computationally infeasible. Our approach relies on reducing the dimension of the data by performing a deterministic compression to a drastically lower dimension. The method employs a lossy compression that reduces the data to a collection of continuous features. We demonstrate that such compressed features can be treated as approximately normally distributed, allowing the application of standard finite Gaussian mixture models for model-based clustering. We discuss the approach and study its performance on a series of simulated scenarios with different dimensions and levels of complexity, involving both categorical and count data. Additionally, we illustrate the method on real-world data.
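For readers who want a concrete picture of the general idea, the following is a minimal sketch and not the speaker's method. The abstract does not specify the compression rule, so the block-sum rule in `compress` is a hypothetical stand-in: it is deterministic and lossy, and each output is a scaled sum of many discrete variables, which is roughly normal by the central limit theorem. The simulated data, the function names, and the small diagonal-covariance EM routine are likewise assumptions made for illustration; in practice one would likely use an established Gaussian mixture implementation.

```python
import math
import random

def compress(row, n_features=2):
    """Hypothetical deterministic, lossy compression: column j is added to
    feature j % n_features, and each sum is scaled by sqrt(block size)."""
    sums = [0.0] * n_features
    counts = [0] * n_features
    for j, x in enumerate(row):
        sums[j % n_features] += x
        counts[j % n_features] += 1
    return [s / math.sqrt(c) for s, c in zip(sums, counts)]

def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fit_diag_gmm(points, k=2, n_iter=50):
    """Fit a k-component Gaussian mixture with diagonal covariances by EM
    and return the most likely component for each point."""
    n, d = len(points), len(points[0])
    # Deterministic start: sort by the first feature and split into k chunks.
    order = sorted(range(n), key=lambda i: points[i][0])
    resp = [[0.0] * k for _ in range(n)]
    for rank, i in enumerate(order):
        resp[i][rank * k // n] = 1.0
    for _ in range(n_iter):
        # M-step: update weights, means, and variances from responsibilities.
        weights, means, variances = [], [], []
        for c in range(k):
            nc = sum(r[c] for r in resp) + 1e-12
            mu = [sum(resp[i][c] * points[i][j] for i in range(n)) / nc
                  for j in range(d)]
            var = [sum(resp[i][c] * (points[i][j] - mu[j]) ** 2
                       for i in range(n)) / nc + 1e-6
                   for j in range(d)]
            weights.append(nc / n)
            means.append(mu)
            variances.append(var)
        # E-step: recompute responsibilities with a stable normalisation.
        for i, p in enumerate(points):
            logp = [math.log(weights[c])
                    + sum(log_normal_pdf(p[j], means[c][j], variances[c][j])
                          for j in range(d))
                    for c in range(k)]
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp[i] = [u / total for u in unnorm]
    return [max(range(k), key=lambda c: r[c]) for r in resp]

# Simulated discrete data (made up for this sketch): two groups of 30
# observations over 500 integer-valued columns with shifted ranges.
rng = random.Random(0)
n_cols = 500
group_a = [[rng.randint(0, 3) for _ in range(n_cols)] for _ in range(30)]
group_b = [[rng.randint(1, 4) for _ in range(n_cols)] for _ in range(30)]

features = [compress(row) for row in group_a + group_b]  # 500 dims -> 2
labels = fit_diag_gmm(features, k=2)
print(labels)
```

The clustering step only ever sees the two compressed features, so its cost does not grow with the original number of columns beyond the single pass needed to compress each row.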