Showing items 1 - 6 of 6
Publication
Open access

Ontology engineering using formal concept analysis from unstructured textual data

2019, Jabbari, Simin, Stoffel, Kilian

Knowledge extraction, especially from unstructured data such as text, has long been one of the most sought-after capabilities, with applications in almost every industry. Designing and building solutions that can extract knowledge in an almost automated way is far from easy. For decades, researchers have proposed a variety of methodologies and algorithms describing how to give structure to textual data with the ultimate goal of knowledge extraction. One of the key elements of those solutions is the use of an ontology as a graph-like structure for representing knowledge. Building ontologies, especially from textual data, is however not straightforward. To the best of our knowledge, there is not yet a comprehensive methodology describing how to build an ontology from textual data in a given domain of interest so that it can later be used for explicit as well as implicit (semantic) knowledge extraction. In this thesis, we propose a pipeline that starts from the analysis of texts and ends with an ontology equipped with the most informative statements of the text corpus about a given context, intended to be used for knowledge extraction. The pipeline relies on three different yet complementary data analysis methods: (i) natural language processing, (ii) formal concept analysis, and (iii) ontology learning. In a nutshell, the pipeline starts by mining the input text corpus (in a given domain of interest) using state-of-the-art natural language processing techniques. Formal concept analysis is then used to form the concepts and build the hierarchy among them (i.e., a concept lattice) as the cornerstone of the desired ontology. Finally, the most informative statements extracted from the text corpus are embedded into the ontology, which is derived from a set of algorithms proposed in this thesis and applied to the aforementioned concept lattice. To validate the accuracy of the pipeline, we tested it on a few toy examples as well as a real use case in the pharmaceutical industry. We demonstrate that such an engineered ontology can be used to query valuable knowledge and insights from unstructured textual data, and can be employed as the core component of smart search engines with applications in semantic analysis. One advantage of our solution is that it requires little human intervention, in contrast to many existing solutions whose performance depends heavily on the presence of a subject matter expert throughout the ontology engineering process. This does not mean, however, that our pipeline cannot benefit from such human expertise, when available, to further refine the ontology.
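
To make the core FCA step of this pipeline concrete, here is a minimal sketch in Python: a toy corpus, a keyword-based stand-in for the NLP stage, and a naive enumeration of all formal concepts. The corpus, attribute list, and matching rule are illustrative assumptions, not the thesis implementation.

    from itertools import combinations

    # Toy corpus standing in for the input text collection (an assumption).
    corpus = {
        "doc1": "Aspirin is an analgesic and an anti-inflammatory drug.",
        "doc2": "Ibuprofen is an analgesic and an anti-inflammatory drug.",
        "doc3": "Paracetamol is an analgesic drug.",
    }
    attributes = ["analgesic", "anti-inflammatory", "drug"]

    # Simplified NLP stage: mark which attribute terms occur in each document.
    context = {
        doc: {a for a in attributes if a in text.lower()}
        for doc, text in corpus.items()
    }

    def intent(objects):
        """Attributes shared by all given objects (the ' operator on objects)."""
        sets = [context[o] for o in objects]
        return set.intersection(*sets) if sets else set(attributes)

    def extent(attrs):
        """Objects having all given attributes (the ' operator on attributes)."""
        return {o for o, a in context.items() if attrs <= a}

    # Naive FCA: every formal concept arises as (B', B) with B = S' for some
    # object subset S, so enumerating object subsets yields all concepts.
    concepts = set()
    objs = list(context)
    for r in range(len(objs) + 1):
        for combo in combinations(objs, r):
            b = intent(set(combo))
            concepts.add((frozenset(extent(b)), frozenset(b)))

    for ext, inte in sorted(concepts, key=lambda c: len(c[0])):
        print(sorted(ext), "<->", sorted(inte))

The resulting (extent, intent) pairs are the nodes of the concept lattice on which the ontology-learning algorithms of the thesis would then operate.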

Publication
Open access

Ontology extraction from MongoDB using formal concept analysis

2017-10-21, Jabbari, Simin

Using formal concept analysis, we propose a method for engineering an ontology from MongoDB to effectively represent unstructured data. Our method consists of three main phases: (1) generating a formal context from a MongoDB database, (2) applying formal concept analysis to derive a concept lattice from that formal context, and (3) converting the obtained concept lattice into a first prototype of an ontology. We apply our method to the NorthWind database and demonstrate how the proposed mapping rules can be used for learning an ontology from such a database. Finally, we discuss suggestions for improving and generalizing the method to more complex databases.
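
As an illustration of phase (1), the sketch below maps MongoDB-style documents to a binary formal context by nominally scaling each field/value pair into an attribute. The documents stand in for the result of a pymongo find() call on a NorthWind-like collection; the field names and the scaling rule are assumptions, not the paper's exact mapping rules.

    # Documents as they might come back from a pymongo find() call (assumed).
    sample_docs = [
        {"_id": 1, "name": "Chai",  "category": "Beverages", "discontinued": False},
        {"_id": 2, "name": "Tofu",  "category": "Produce",   "discontinued": False},
        {"_id": 3, "name": "Mishi", "category": "Produce",   "discontinued": True},
    ]

    def docs_to_formal_context(docs, id_field="_id"):
        """Each document becomes an object; each field=value pair becomes an attribute."""
        context = {}
        for doc in docs:
            obj = str(doc[id_field])
            attrs = {
                f"{field}={value}"              # nominal scaling of the field
                for field, value in doc.items()
                if field != id_field
            }
            context[obj] = attrs
        return context

    for obj, attrs in docs_to_formal_context(sample_docs).items():
        print(obj, sorted(attrs))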

Publication
Open access

FCA-Based Ontology Learning From Unstructured Textual Data

2018-12-20, Jabbari, Simin

Ontologies are frequently used for representing domain knowledge and have many applications in semantic knowledge extraction. However, learning ontologies, especially from unstructured data, is a difficult yet interesting challenge. In this paper, we introduce a pipeline for learning an ontology from a text corpus in a semi-automated fashion using Natural Language Processing (NLP) and Formal Concept Analysis (FCA). We apply the proposed method to a small corpus of news documents in the IT and pharmaceutical domains. We then discuss the potential applications of the proposed model and ideas on how to improve it further.
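
The sketch below illustrates, under the assumption that the rdflib library is available, how formal concepts coming out of the FCA step could be turned into a first ontology prototype: each concept becomes an OWL class and reverse intent inclusion becomes rdfs:subClassOf. The intents and the class-naming scheme are illustrative assumptions rather than the paper's actual output.

    from rdflib import Graph, Namespace, RDF, RDFS, OWL

    EX = Namespace("http://example.org/onto#")

    # Toy intents produced by an FCA step (assumed for illustration).
    intents = [frozenset(), frozenset({"drug"}), frozenset({"drug", "analgesic"})]

    def class_uri(intent):
        label = "_".join(sorted(intent)) or "Top"
        return EX[f"Concept_{label}"]

    g = Graph()
    g.bind("ex", EX)

    for b in intents:
        g.add((class_uri(b), RDF.type, OWL.Class))

    for b1 in intents:
        for b2 in intents:
            if b2 < b1:  # a strictly larger intent means a more specific concept
                # Both direct and transitive subclass links are asserted for simplicity.
                g.add((class_uri(b1), RDFS.subClassOf, class_uri(b2)))

    print(g.serialize(format="turtle"))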

Publication
Open access

Parallel Execution of Binary-Based NextClosure Algorithm

2016-07-18, Jabbari, Simin

Formal concept analysis (FCA) has become a popular method for analyzing data across various domains, allowing databases to be analyzed regardless of their context. These properties make FCA of great interest in the context of Big Data. However, the complexity of the basic FCA algorithms often prohibits their use in general production tool chains for data analysis. In this paper we show how to overcome some of these problems. In a first step we show how to implement the well-known NextClosure algorithm efficiently in Python (a preferred language for ad-hoc data analysis), yielding an implementation several times faster than other published ones. In a second step we show how our implementation can be parallelized on common hardware while strictly keeping the best sequential algorithm, which differs in an important way from previously published parallel FCA algorithms.
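
A minimal, single-threaded sketch of the binary-based (bitset) NextClosure idea in Python follows; attribute sets and object rows are plain ints used as bitmasks. The toy context is an assumption, and the parallelization discussed in the paper is not shown.

    def closure(attr_mask, rows, full):
        """Return A'' for the attribute bitmask attr_mask."""
        intent = full
        for row in rows:
            if row & attr_mask == attr_mask:   # object has all attributes in attr_mask
                intent &= row                  # intersect its attribute row
        return intent

    def next_closure(A, m, rows, full):
        """Lectically next closed attribute set after bitmask A, or None if A is the last."""
        for i in range(m - 1, -1, -1):
            bit = 1 << i
            if A & bit:
                A &= ~bit                          # drop attribute i and continue
            else:
                B = closure(A | bit, rows, full)
                if ((B & ~A) & (bit - 1)) == 0:    # no smaller attribute was added
                    return B
        return None

    def all_intents(rows, m):
        """Enumerate all closed attribute sets (intents) in lectic order."""
        full = (1 << m) - 1
        intents = [closure(0, rows, full)]
        while True:
            nxt = next_closure(intents[-1], m, rows, full)
            if nxt is None:
                return intents
            intents.append(nxt)

    # Toy context: 4 objects x 3 attributes, one bit per attribute (assumed).
    rows = [0b011, 0b101, 0b111, 0b001]
    for intent in all_intents(rows, 3):
        print(format(intent, "03b"))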

Publication
Open access

A Methodology for Extracting Knowledge about Controlled Vocabularies from Textual Data using FCA-Based Ontology Engineering

2018-12-03, Jabbari, Simin

We introduce an end-to-end methodology (from text processing to querying a knowledge graph) for knowledge extraction from text corpora, with a focus on a list of vocabularies of interest. We propose a pipeline that incorporates Natural Language Processing (NLP), Formal Concept Analysis (FCA), and Ontology Engineering techniques to build an ontology from textual data. We then extract knowledge about the controlled vocabularies by querying that knowledge graph, i.e., the engineered ontology. We demonstrate the significance of the proposed methodology by applying it to a text corpus of 800 news articles and reports about companies and products in the IT and pharmaceutical domains, with a focus on a given list of 250 controlled vocabularies.
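
To illustrate the querying step, the sketch below (again assuming rdflib) builds a tiny graph standing in for the engineered ontology and runs a SPARQL query for everything stated about one controlled-vocabulary term. The namespace, properties, and term are hypothetical placeholders for the real 800-document, 250-term setting.

    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/onto#")

    # A tiny stand-in for the engineered ontology / knowledge graph (assumed).
    g = Graph()
    g.bind("ex", EX)
    g.add((EX.Aspirin, RDF.type, EX.Drug))
    g.add((EX.Aspirin, EX.mentionedWith, Literal("anti-inflammatory")))
    g.add((EX.Ibuprofen, RDF.type, EX.Drug))

    # All statements about one controlled-vocabulary term.
    query = """
        PREFIX ex: <http://example.org/onto#>
        SELECT ?p ?o WHERE { ex:Aspirin ?p ?o . }
    """
    for p, o in g.query(query):
        print(p, o)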

Publication
Open access

A Hybrid Algorithm for Generating Formal Concepts and Building Concept Lattice Using NextClosure and Nourine Algorithms

2016-07-18, Jabbari, Simin

A concept lattice produced from a set of formal concepts represents a concept hierarchy and has many applications in knowledge representation and data mining. Different algorithms have been proposed in the past for efficiently generating formal concepts and building concept lattices. In this paper we introduce the idea of combining existing FCA algorithms with the aim of benefiting from their specific advantages. As an example, we propose a hybrid model that uses the NextClosure (NC) algorithm for generating formal concepts and parts of the Nourine algorithm for building the concept lattice. We compare the proposed hybrid model with two of its counterparts: pure NC and pure Nourine. Our experiments show that the hybrid model always outperforms pure NC and, for very large datasets, can surpass pure Nourine as well.
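
The following is an illustrative sketch of what "building the concept lattice" adds on top of generating the concepts: computing the cover (Hasse) relation between intents. It is a naive cubic-time construction shown only for clarity, not the NextClosure/Nourine hybrid evaluated in the paper, and the example intents are assumptions.

    def covers(intents):
        """Return (lower, upper) pairs where upper covers lower in the concept order.

        A concept with intent b1 lies below one with intent b2 exactly when b2 is a
        proper subset of b1; it is a cover when no intent sits strictly between them."""
        edges = []
        for b1 in intents:
            for b2 in intents:
                if b2 < b1 and not any(b2 < c < b1 for c in intents):
                    edges.append((b1, b2))
        return edges

    # Toy set of intents (assumed), e.g. as produced by NextClosure.
    intents = [frozenset(s) for s in
               [set(), {"a"}, {"b"}, {"a", "b"}, {"a", "b", "c"}]]
    for lower, upper in covers(intents):
        print(sorted(lower), "->", sorted(upper))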