Recherche d'une représentation des données efficace pour la fouille des grandes bases de données (Searching for an efficient data representation for mining large databases)

Boullé, Marc (2007) Recherche d'une représentation des données efficace pour la fouille des grandes bases de données. PhD thesis INFORMATIQUE, Département TSI, ENST p.328.

Full text available as:

- BoulleThesis07.pdf ( 4925 Kb )
Licence: Copyright

Alternative Locations: http://perso.rd.francetelecom.fr/boulle/publications/BoulleThesis07.pdf

Abstract

The data preparation step of the data mining process represents 80% of the problem and is both time-consuming and critical for the quality of the modeling. In this thesis, our purpose is to design an evaluation criterion for data representations, in order to automate data preparation.



To overcome this problem, we introduce a nonparametric family of density estimation models, named data grid models. Each variable is partitioned into intervals or into groups of values, according to whether it is numerical or categorical, and the whole data space is partitioned into a grid of cells resulting from the cross-product of the univariate partitions. We then consider density estimation models where the density is assumed constant per data grid cell.
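As an illustration of the grid construction described above, here is a minimal sketch for one numerical and one categorical variable. The function names, the cut points, and the value grouping are hypothetical choices made for this example and are not taken from the thesis; the sketch only counts instances per cell and reports the empirical probability mass of each cell, which is constant within a cell.

```python
from bisect import bisect_right
from collections import Counter


def fit_data_grid(rows, cuts, groups):
    """Count instances per grid cell.

    rows   : iterable of (numerical_value, categorical_value) pairs
    cuts   : sorted cut points defining the intervals of the numerical variable
    groups : dict mapping each categorical value to a group index
    (all three are illustrative inputs, assumed for this sketch)
    """
    counts = Counter()
    for x, c in rows:
        # A cell is the cross-product of one interval and one group of values.
        cell = (bisect_right(cuts, x), groups[c])
        counts[cell] += 1
    return counts


def cell_probability(counts, cell):
    """Empirical probability mass of a cell (constant inside the cell)."""
    total = sum(counts.values())
    return counts[cell] / total


if __name__ == "__main__":
    rows = [(1.0, "red"), (2.5, "blue"), (3.0, "green"),
            (7.2, "red"), (8.1, "blue"), (9.0, "blue")]
    cuts = [5.0]                                 # two intervals: x < 5 and x >= 5
    groups = {"red": 0, "blue": 1, "green": 1}   # two groups of values
    counts = fit_data_grid(rows, cuts, groups)
    for cell in sorted(counts):
        print(cell, counts[cell], cell_probability(counts, cell))
```

The partitions are given by hand here; choosing them automatically is exactly the model selection problem that the rest of the abstract addresses.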



Because of their high expressiveness, data grid models are hard to regularize and to optimize. We exploit a model selection technique based on a Bayesian approach and obtain an exact analytic criterion for the posterior probability of data grid models.
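To give a feel for what an analytic Bayesian criterion can look like, the sketch below scores a partition of one numerical variable into intervals for a classification task. It assumes a hierarchical uniform prior over the number of intervals, the interval bounds, and the class distribution in each interval, and returns the resulting negative log posterior (up to a constant). This formula is an illustrative assumption for the simplest univariate case; the abstract does not state the criterion, and the function names are hypothetical.

```python
from math import lgamma, log


def log_binomial(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def partition_cost(interval_class_counts):
    """Negative log posterior (up to a constant) of an interval partition.

    interval_class_counts : list of lists; entry [i][j] is the number of
    instances of class j falling in interval i. Assumed prior (illustrative):
    uniform over the number of intervals, over the bounds given that number,
    and over the class distribution within each interval.
    """
    n_intervals = len(interval_class_counts)
    n_classes = len(interval_class_counts[0])
    n = sum(sum(row) for row in interval_class_counts)

    cost = log(n)                                          # number of intervals
    cost += log_binomial(n + n_intervals - 1, n_intervals - 1)  # bounds
    for row in interval_class_counts:
        n_i = sum(row)
        cost += log_binomial(n_i + n_classes - 1, n_classes - 1)  # class distribution
        # Likelihood term: log of the multinomial coefficient n_i! / prod(n_ij!)
        cost += lgamma(n_i + 1) - sum(lgamma(c + 1) for c in row)
    return cost


if __name__ == "__main__":
    single = partition_cost([[10, 10]])          # one interval, classes mixed
    split = partition_cost([[10, 0], [0, 10]])   # two pure intervals
    print(single, split)
```

Under these assumptions, a partition that separates the classes cleanly gets a lower cost than a single mixed interval, while every extra interval is penalized by the prior terms, which is how this kind of criterion regularizes expressive models.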

We introduce combinatorial optimization algorithms which leverage the properties of our evaluation criterion and the sparseness of data in high dimensions.

These algorithms have a guaranteed algorithmic complexity, which is super-linear in the sample size.



We evaluate data grid models on numerous data analysis tasks: supervised classification, regression, clustering and coclustering. The results demonstrate the validity of the approach, which makes it possible to detect, automatically and efficiently, fine-grained and reliable information useful for the data preparation step.

Item Type: PhD Thesis (PhD)
PhD Supervisor: Moulines, Eric
Date: 24 September 2007
Board of examiners: Guyon, Isabelle; Robert, Christian; Moulines, Eric; Clérot, Fabrice; Sebag, Michèle; Zighed, Djamel
Ecole Doctorale: ED 130 INFORMATIQUE, TELECOMMUNICATIONS ET ELECTRONIQUE (EDITE)
Discipline: INFORMATIQUE
Collection (Fonds): TELECOM ParisTech (ENST)
Institution: ENST
Department: Département TSI
Subjects: 2. Information and Communication Sciences and Technologies; 1. Mathematics and Applications
Uncontrolled Keywords: Apprentissage; Exploration de données; Statistique Bayesienne; Préparation des données; Sélection de modèles; Machine learning; Data exploration; Bayesianism; Data preparation; Model selection
ID Code: 3023
Deposited By: Marc Boullé
Deposited On: 23 May 2008

