Quentin HAENN
  • PhD student
  • Data Engineering
  • ISAE-ENSMA

Research Activity

Clustering under similarity constraints

My thesis delves into an in-depth study of a specific approach to partitioning under similarity constraints. It focuses on developing and testing algorithms that perform partitioning operations under global constraints on the resulting groups, particularly addressing radius and diameter constraints.

These constraints add a specific business context to the partitioning (clustering) process, ensuring that the formed groups adhere to the predefined constraints. However, algorithms capable of performing such tasks are scarce. Existing literature primarily offers algorithms based on Mixed Integer Linear Programming (MILP) models, which delegate mathematical guarantees and performance responsibilities to underlying solvers. These solvers can be costly and do not immediately adapt to all contexts.

Most partitioning algorithms under similarity constraints rely on optimizations related to the mathematical properties of metric spaces. In contrast, my work focuses on the properties of weighted graphs, allowing the use of dissimilarities rather than metric distances. This extends applicability to various contexts, such as time series clustering.

As part of this thesis, I have developed two radius-constrained clustering algorithms using graph properties, based on minimum cardinality dominating sets. These algorithms are publicly available on GitHub and PyPI, accompanied by rich documentation and common usage examples. A third algorithm, CURGRAPH, is under development to provide an exact solution to the problem of finding the clustering with the smallest possible radius under a given constraint. This algorithm will soon be integrated into the public library.

Additionally, I have developed a Python package to test desirable mathematical properties for clustering algorithms, as identified in the literature. This package enables exhaustive testing of implementation limits and their robustness against potential changes in input data. Although not yet publicly available, its publication is planned.