  • Publication
    Open access
    Automatic Author Profiling and Verification
This thesis mainly addresses the style-based text categorization problem, where the objective is to identify an author's demographics, such as gender, age range, and language variety, from a set of their texts. It also asks whether two writings (chat messages, threatening e-mails, doubtful testimonies, essays, text messages, business memos, fanfiction) were authored by the same person, contrasting the writing styles of the two texts through a vector-difference text representation. We also develop a stable and straightforward iterative feature-reduction paradigm; this reduction results in a more explainable decision. We begin by assessing the effectiveness of several machine learning models using the complete vocabulary. A two-step feature selection technique is then used to design a feature reduction strategy, and after testing the models with these reduced features we can examine how performance varies between the two scenarios. We then test further feature reduction by applying χ² and PMI scoring functions to select the top 300 features. Using document collections from several CLEF-PAN evaluation campaigns, we test our models against multiple baselines and observe that Extra Trees, Random Forest, and Gradient Boosting often produce the best results. Furthermore, empirical evidence shows that the feature set can be effectively condensed with the χ² and PMI scoring methods to about 300 features without compromising performance. Additionally, discarding non-informative features shrinks the document representations, which not only cuts runtime but also improves performance in some cases.
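The feature-reduction step described above (score every vocabulary term against the class labels, then keep only the strongest terms) can be sketched with scikit-learn's χ² selector. The corpus, labels, and `k` below are toy placeholders, not the thesis data, where k = 300:

```python
# Toy sketch of chi-squared feature reduction: score each vocabulary term
# against the profile label, then keep only the top-k terms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "lol gonna see you later", "omg that is so cute",                             # profile 0 (toy)
    "the committee reviewed the annual report", "please find the memo attached",  # profile 1 (toy)
]
labels = [0, 0, 1, 1]

X = CountVectorizer().fit_transform(docs)    # full-vocabulary representation
k = 8                                        # toy value; the thesis reduces to 300 features
selector = SelectKBest(chi2, k=k).fit(X, labels)
X_reduced = selector.transform(X)            # smaller, faster representation
print(X_reduced.shape)                       # (4, 8)
```

A PMI-based ranking would follow the same pattern: `SelectKBest` accepts any scoring function that returns one score per feature.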
With the difference-vector text representation, we demonstrate how confidence-based approaches can benefit classification accuracy in author verification. Small differences between the vector representations indicate higher similarity, while documents with large differences are not authored by the same writer. Several performance measures are reported, including accuracy, area under the curve (AUC), c@1, and Final Score (FS). Our research shows a strong correlation between all of these measures, with FS and AUC being the most strongly correlated. Considering accuracy alone to compare the different text representation methods, our experiments show that the best-scoring models include the TFIDF feature set, since it accounts for both occurrence frequency and the distribution of terms across the collection.
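The verification rule and the c@1 measure can be illustrated in a few lines: each text becomes a TFIDF vector, the norm of the pair's difference vector drives the decision, and pairs falling in an uncertainty band are left unanswered. The texts, distance thresholds, and abstention band here are illustrative assumptions, not the thesis's tuned values:

```python
# Sketch of difference-vector author verification with a c@1 score.
# Texts, thresholds, and the abstention band are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

pairs = [
    ("i really think we should go now", "i think we really must go"),  # same author (toy)
    ("the quarterly figures look strong", "lol omg see you at noon"),  # different authors (toy)
]
truth = [1, 0]  # 1 = same author, 0 = different

vec = TfidfVectorizer().fit([t for pair in pairs for t in pair])

def verify(a, b, low=1.0, high=1.3):
    """Small difference norm -> same author; large -> different; in between -> abstain."""
    da, db = vec.transform([a, b]).toarray()
    dist = np.linalg.norm(da - db)      # norm of the difference vector
    if dist < low:
        return 1
    if dist > high:
        return 0
    return None                         # leave the pair unanswered

def c_at_1(preds, truth):
    """c@1: rewards leaving hard cases unanswered instead of answering them wrongly."""
    n = len(truth)
    nc = sum(p == t for p, t in zip(preds, truth) if p is not None)  # correct answers
    nu = sum(p is None for p in preds)                               # unanswered pairs
    return (nc + nu * nc / n) / n

preds = [verify(a, b) for a, b in pairs]
print(preds, c_at_1(preds, truth))
```

When every pair is answered, c@1 reduces to plain accuracy; abstaining on borderline difference norms is what makes the confidence-based scheme pay off under this measure.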