Inference and evaluation of the multinomial mixture model for unsupervised text clustering

Rigouste, Loïs (2006) Inference and evaluation of the multinomial mixture model for unsupervised text clustering. PhD thesis, Signal et Images, ENST.

Full text available as:

- these_rigouste.ps ( 3085 Kb )
- these_rigouste.pdf ( 2363 Kb )
Licence: Copyright

Abstract

In this thesis, we investigate the use of a probabilistic model for unsupervised clustering of text collections. We focus in particular on the multinomial mixture model, with one latent theme variable per document.
Unsupervised clustering has become a basic building block of many intelligent text processing applications, such as information retrieval, text classification and information extraction. Probabilistic clustering models, which build "soft" theme-document associations, have recently been proposed. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between that document and each cluster. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space.
The contribution of this study is twofold. First, we present and contrast various estimation procedures for the multinomial mixture model, some of which had not been tested before in this context. Second, we propose a systematic evaluation of the performance of these algorithms, thereby defining a framework for assessing the quality of unsupervised text clustering methods. The comparison with the performance of other classical models demonstrates, in our opinion, the relevance of the simple multinomial mixture model for clustering corpora composed mainly of monothematic documents.
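As an illustration of the per-document probability vector mentioned in the abstract, the following minimal sketch computes the posterior probability of each theme for one document under a multinomial mixture with one latent theme per document. It is not taken from the thesis or from its Textclust program: the function name `cluster_posteriors`, the two-word vocabulary and all parameter values are assumptions chosen only for the example.

```python
import math


def cluster_posteriors(counts, log_prior, log_word_probs):
    """Return P(theme | document) for one document.

    counts:         dict mapping word -> number of occurrences in the document
    log_prior:      list of log P(theme = k), one entry per theme
    log_word_probs: list of dicts (one per theme) mapping word -> log P(word | theme = k)
    """
    # Unnormalised log posterior: log prior plus count-weighted log word probabilities.
    # The multinomial coefficient is identical for every theme, so it cancels out.
    scores = [
        lp + sum(c * lw[w] for w, c in counts.items())
        for lp, lw in zip(log_prior, log_word_probs)
    ]
    # Normalise with the log-sum-exp trick to avoid underflow on long documents.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_norm) for s in scores]


# Hypothetical toy parameters: two themes over a two-word vocabulary.
log_prior = [math.log(0.5), math.log(0.5)]
log_word_probs = [
    {"ball": math.log(0.7), "vote": math.log(0.3)},  # theme 0 (illustrative)
    {"ball": math.log(0.2), "vote": math.log(0.8)},  # theme 1 (illustrative)
]
document = {"ball": 3, "vote": 1}

posterior = cluster_posteriors(document, log_prior, log_word_probs)
print(posterior)
```

The returned vector sums to one and, for this toy document, puts most of its mass on theme 0. In an EM procedure for a mixture model, this computation corresponds to the E-step; the parameter values here are fixed by hand rather than estimated.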

Item Type: PhD Thesis (PhD)
Thesis Supervisors: Cappé, Olivier and Yvon, François
Date: November 2006
Board of examiners: Lebart, Ludovic and Gaussier, Eric and Sebag, Michèle and Clérot, Fabrice and Robert, Christian
Doctoral School: ED 130 INFORMATIQUE, TELECOMMUNICATIONS ET ELECTRONIQUE (EDITE)
Discipline: Signal et Images
Collection (Fonds): ENST
Institution: ENST
Department: Signal et Images
Subjects: 2. Information and Communication Sciences and Technologies
Uncontrolled Keywords: unsupervised document classification, evaluation, mixture models, multinomial distributions, EM algorithm, MCMC methods
ID Code: 2424
Deposited By: Loïs Rigouste
Deposited On: 10 May 2007

Table of contents

1. Introduction
1.1 Motivations
1.2 Unsupervised clustering
1.3 Contributions
1.4 Thesis outline
2. State of the art
2.1 Document representation
2.2 Non-probabilistic models
2.3 Probabilistic models
2.4 Relationships between the models
2.5 Relevance of the multinomial mixture model
3. Evaluation
3.1 Introduction
3.2 General measures for unsupervised clustering
3.3 Perplexity
3.4 Extrinsic evaluation
3.5 Measures of comparison with manual labelling
3.6 Discussion and methodology
4. Multinomial mixture model
4.1 The model revisited
4.2 Performance of the EM algorithm
4.3 Improving EM performance through dimensionality reduction
4.4 Gibbs sampling
4.5 Generalisation of the results
4.6 Semi-supervised and supervised settings
5. Discussion of performance
5.1 Comparison with the K-means algorithm
5.2 Latent Dirichlet allocation
5.3 Interpretation of the themes
5.4 Conclusion
Conclusion
Glossary
Notation
Appendix A. Supplements to Chapter 2
A.1 Mixture model of negative binomial distributions
A.2 Iterative information bottleneck algorithm
Appendix B. Participation in the DEFT'05 text mining challenge (DEfi Fouille de Textes)
B.1 Introduction
B.2 Markov models for segmentation
B.3 Multinomial mixture model
B.4 Using the theme segmenter for DEFT
B.5 Conclusion
Appendix C. The C++ program Textclust
C.1 Introduction
C.2 The BOW library
C.3 Textclust
Bibliography
Index
