Rigouste, Loïs (2006) Inference and evaluation of the multinomial mixture model for unsupervised text clustering. PhD thesis, Signal et Images, ENST.
Abstract
In this thesis, we investigate the use of a probabilistic model for unsupervised clustering of text collections. We focus in particular on the multinomial mixture model, with one latent theme variable per document.
Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification, and information extraction. Probabilistic clustering models that build "soft" theme-document associations have recently been proposed. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between that document and each cluster. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space.
The contribution of this study is twofold. First, we present and contrast various estimation procedures for the multinomial mixture model, some of which had not been tested before in this context. Second, we propose a systematic evaluation of the performance of these algorithms, thereby defining a framework for assessing the quality of unsupervised text clustering methods. The comparison with the performance of other classical models demonstrates, in our opinion, the relevance of the simple multinomial mixture model for clustering corpora mainly composed of monothematic documents.
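To make the "soft" theme-document association described in the abstract concrete, the following is a minimal sketch of EM estimation for a multinomial mixture with one latent theme per document. It is an illustration under stated assumptions, not the thesis's implementation (the thesis describes a C++ program, Textclust): the function name `multinomial_mixture_em`, its parameters, the additive `smoothing` constant, and the random Dirichlet initialization are all hypothetical choices made for this example.

```python
import numpy as np

def multinomial_mixture_em(counts, n_themes, n_iter=50, smoothing=1e-2, seed=0):
    """Illustrative EM for a multinomial mixture (hypothetical helper).

    counts: (n_docs, n_words) array of word counts per document.
    Returns (theme_priors, word_probs, resp), where resp[d, k] is the
    posterior probability that document d belongs to theme k.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_docs, _ = counts.shape

    # Assumed initialization: random soft assignments, one row per document.
    resp = rng.dirichlet(np.ones(n_themes), size=n_docs)

    for _ in range(n_iter):
        # M-step: re-estimate theme priors and per-theme word distributions.
        # The additive smoothing (an assumption) keeps every probability
        # strictly positive so that the logarithms below stay finite.
        theme_priors = resp.sum(axis=0) + smoothing
        theme_priors /= theme_priors.sum()
        word_probs = resp.T @ counts + smoothing      # (n_themes, n_words)
        word_probs /= word_probs.sum(axis=1, keepdims=True)

        # E-step: posterior over themes for each document, computed in the
        # log domain to avoid underflow with long documents.
        log_post = np.log(theme_priors) + counts @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

    return theme_priors, word_probs, resp


# Toy corpus (made up for illustration): two documents use only the first
# two words, the other two use only the last two.
toy_counts = np.array([
    [50, 40, 0, 0],
    [10, 8, 0, 0],
    [0, 0, 40, 50],
    [0, 0, 8, 10],
])
priors, word_probs, resp = multinomial_mixture_em(toy_counts, n_themes=2)
labels = resp.argmax(axis=1)  # hard labels derived from the soft vectors
```

Each row of `resp` is the per-document probability vector mentioned above. Taking the arg max gives a hard clustering, while the full row can be used as a low-dimensional representation of the document. EM only finds a local optimum, so results depend on the initialization.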
| Item Type: | PhD Thesis (PhD) |
|---|---|
| Thesis Supervisor: | Cappé, Olivier and Yvon, François |
| Date: | November 2006 |
| Board of examiners: | Lebart, Ludovic and Gaussier, Eric and Sebag, Michèle and Clérot, Fabrice and Robert, Christian |
| Doctoral School: | ED 130 INFORMATIQUE, TELECOMMUNICATIONS ET ELECTRONIQUE (EDITE) |
| Discipline: | Signal et Images |
| Collection (Fonds): | ENST ENST |
| Institution: | ENST |
| Department: | Signal et Images |
| Subjects: | 2. Information and Communication Sciences and Technologies |
| Uncontrolled Keywords: | unsupervised document classification, evaluation, mixture models, multinomial distributions, EM algorithm, MCMC methods |
| ID Code: | 2424 |
| Deposited By: | Loïs Rigouste |
| Deposited On: | 10 May 2007 |
Table of contents
1. Introduction
1.1 Motivations
1.2 Unsupervised classification
1.3 Contributions
1.4 Organization of the thesis
2. State of the art
2.1 Document representation
2.2 Non-probabilistic models
2.3 Probabilistic models
2.4 Relations between the models
2.5 Relevance of the multinomial mixture model
3. Evaluation
3.1 Introduction
3.2 General measures for unsupervised classification
3.3 Perplexity
3.4 Extrinsic evaluation
3.5 Measures of comparison with a manual labeling
3.6 Discussion and methodology
4. Multinomial mixture model
4.1 The model revisited
4.2 Performance of the EM algorithm
4.3 Improving EM performance through dimensionality reduction
4.4 Gibbs sampling
4.5 Generalization of the results
4.6 Semi-supervised and supervised settings
5. Discussion of performance
5.1 Comparison with the K-means algorithm
5.2 Latent Dirichlet allocation
5.3 Interpretation of the themes
5.4 Conclusion
Conclusion
Glossary
Notations
Appendix A. Supplements to Chapter 2
A.1 Mixture model of negative binomial distributions
A.2 Iterative information bottleneck algorithm
Appendix B. Participation in the DEfi Fouille de Textes DEFT'05
B.1 Introduction
B.2 Markov models for segmentation
B.3 Multinomial mixture model
B.4 Using the theme segmenter for DEFT
B.5 Conclusion
Appendix C. The C++ program Textclust
C.1 Introduction
C.2 The BOW library
C.3 Textclust
Bibliography
Index

