Boullé, Marc (2007) Recherche d'une représentation des données efficace pour la fouille des grandes bases de données. PhD thesis INFORMATIQUE, Département TSI, ENST p.328.
Alternative Locations: http://perso.rd.francetelecom.fr/boulle/publications/BoulleThesis07.pdf
Abstract
The data preparation step of the data mining process represents 80% of the problem and is both time-consuming and critical for the quality of the modeling. In this thesis, our purpose is to design a criterion for evaluating data representations, in order to automate data preparation.
To this end, we introduce a nonparametric family of density estimation models, named data grid models. Each variable is partitioned into intervals or into groups of values, according to whether it is numerical or categorical, and the whole data space is partitioned into a grid of cells resulting from the cross-product of the univariate partitions. We then consider density estimation models where the density is assumed constant per data grid cell.
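As an illustration only, the grid construction described above can be sketched in a few lines of Python. This is not the thesis implementation: the cut points, value groups, function names, and toy records are all hypothetical and hand-picked, whereas the thesis selects the partitions with a Bayesian criterion. The sketch only shows how univariate partitions combine into cells and how a single constant (here, the empirical frequency) is attached to each cell.

```python
from bisect import bisect_right
from collections import Counter


def cell_of(record, partitions):
    """Map a record to its grid cell: one part index per variable.

    `partitions` is a hypothetical, hand-written description of the
    univariate partitions (the thesis optimizes them instead).
    """
    cell = []
    for value, part in zip(record, partitions):
        if part["kind"] == "numerical":
            # Intervals are delimited by sorted cut points;
            # a value equal to a cut point falls in the upper interval.
            cell.append(bisect_right(part["cuts"], value))
        else:
            # Categorical values are mapped to a group index.
            cell.append(part["groups"][value])
    return tuple(cell)


def cell_frequencies(records, partitions):
    """Empirical probability of each non-empty cell (constant per cell)."""
    counts = Counter(cell_of(r, partitions) for r in records)
    n = len(records)
    return {cell: count / n for cell, count in counts.items()}


# Toy data: one numerical and one categorical variable (assumed values).
records = [(1.0, "red"), (2.5, "blue"), (3.0, "green"), (0.5, "red")]
partitions = [
    {"kind": "numerical", "cuts": [2.0]},
    {"kind": "categorical", "groups": {"red": 0, "blue": 1, "green": 1}},
]

frequencies = cell_frequencies(records, partitions)
```

Only non-empty cells are stored, so the memory cost of this sketch grows with the number of records rather than with the full size of the cross-product grid.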
Because of their high expressiveness, data grid models are hard to regularize and to optimize. We exploit a model selection technique based on a Bayesian approach and obtain an exact analytic criterion for the posterior probability of data grid models.
We introduce combinatorial optimization algorithms which leverage the properties of our evaluation criterion and the sparseness of data in large dimension.
These algorithms have a guaranteed algorithmic complexity that is super-linear in the sample size.
We evaluate data grid models on numerous data analysis tasks: supervised classification, regression, clustering, and coclustering. The results demonstrate the validity of the approach, which makes it possible to detect, automatically and efficiently, fine-grained and reliable information useful for the data preparation step.
| Item Type: | PhD Thesis (PhD) |
|---|---|
| PhD Supervisor: | Moulines, Eric |
| Date: | 24 September 2007 |
| Board of examiners: | Guyon, Isabelle and Robert, Christian and Moulines, Eric and Clérot, Fabrice and Sebag, Michèle and Zighed, Djamel |
| Ecole Doctorale: | ED 130 INFORMATIQUE, TELECOMMUNICATIONS ET ELECTRONIQUE (EDITE) |
| Discipline: | INFORMATIQUE |
| Collection (Fonds): | TELECOM ParisTech (ENST) |
| Institution: | ENST |
| Department: | Département TSI |
| Subjects: | 2. Information and Communication Sciences and Technologies; 1. Mathematics and Applications |
| Uncontrolled Keywords: | Apprentissage, Exploration de données, Statistique Bayesienne, Préparation des données, Sélection de modèles; Machine learning, Data exploration, Bayesianism, Data preparation, Model selection |
| ID Code: | 3023 |
| Deposited By: | Marc Boullé |
| Deposited On: | 23 May 2008 |