The aim of this tutorial is to present the foundations and modern practical applications of knowledge-based and statistical methods for exploring large document corpora. It will first focus on many of the techniques required for this purpose, including natural language processing tasks, approximate nearest-neighbour methods, clustering algorithms, and probabilistic topic models, and will then describe how a combination of these techniques is being used in practical applications for browsing large multilingual document corpora without the need to translate texts. Participants will be involved in the entire process of creating the necessary resources to finally build a multilingual text search engine.
Searching for similar documents and exploring major themes covered across groups of documents are common activities when browsing collections of scientific papers. This manual, knowledge-intensive task can become less tedious and even lead to unexpected relevant findings if unsupervised algorithms are applied to help researchers. Most text mining algorithms represent documents in a common feature space that abstracts them away from the specific sequence of words used in them. Probabilistic topic models reduce that feature space by annotating documents with thematic information. Over this low-dimensional latent space, some locality-sensitive hashing algorithms have been proposed to perform document similarity search. However, thematic information gets hidden behind hash codes, preventing thematic exploration and limiting the explanatory capability of topics to justify content-based similarities. This paper presents a novel hashing algorithm based on approximate nearest-neighbor techniques that uses hierarchical sets of topics as hash codes. It not only performs efficient similarity searches, but also allows extending those queries with thematic restrictions, explaining the similarity score from the most relevant topics. Extensive evaluations on both scientific and industrial text datasets validate the proposed algorithm in terms of accuracy and efficiency.
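The core idea of topic-based hash codes can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the thresholds, the level weighting, and the function names are assumptions. Each document (a topic-probability vector) is hashed into hierarchical sets of topic ids, so documents sharing top-level topics are thematically closest, and the shared topics themselves explain the score.

```python
# Sketch: hierarchical topic-set hash codes for document similarity search.
# Assumes each document is a topic-probability vector; thresholds are illustrative.

def topic_hash(topic_dist, levels=(0.5, 0.25, 0.1)):
    """Hash a topic distribution into hierarchical sets of topic ids.

    Level 0 holds the most dominant topics, deeper levels weaker ones.
    """
    code = []
    for threshold in levels:
        level = frozenset(t for t, p in enumerate(topic_dist) if p >= threshold)
        code.append(level)
    return tuple(code)

def similarity(code_a, code_b):
    """Score similarity by overlap of topic sets, weighting top levels more."""
    score = 0.0
    for depth, (a, b) in enumerate(zip(code_a, code_b)):
        weight = 1.0 / (depth + 1)          # top levels explain more
        union = a | b
        if union:
            score += weight * len(a & b) / len(union)
    return score

doc1 = [0.6, 0.3, 0.05, 0.05]
doc2 = [0.55, 0.05, 0.3, 0.1]
h1, h2 = topic_hash(doc1), topic_hash(doc2)
# Both documents share topic 0 at the top level, so that topic drives the
# score and also "explains" why the two documents are considered similar.
print(similarity(h1, h2))
```

Because the hash code is a set of topics rather than an opaque bit string, a query can be restricted thematically (e.g. only buckets containing a given topic), which is the explanatory property the abstract highlights.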
paper @K-CAP 2019
With the ongoing growth in the number of digital articles published in an ever wider set of languages, we need annotation methods that enable browsing multilingual corpora. Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations on collections of texts in multiple languages. However, these approaches require theme-aligned training data to create a language-independent space. This constraint limits the scenarios in which this technique can be applied and makes it difficult to scale up to situations where huge collections of multilingual documents are required during the training phase. This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora, or any other type of translation resource. The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels and describes documents by hierarchies of multilingual concepts from independently trained models. Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.
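A minimal sketch of the comparison step may help. All names, topics, and labels below are toy placeholders, not the algorithm's actual resources: each independently trained monolingual model's topics carry cross-lingual concept labels, a document is described by the labels of its top topics, and two documents in different languages are compared by label overlap, with no translation involved.

```python
# Sketch: comparing documents across languages via shared cross-lingual topic
# labels, without translation. Models, topics, and labels here are toy values.

def doc_labels(topic_dist, topic_to_labels, top_n=2):
    """Describe a document by the concept labels of its top-n topics."""
    ranked = sorted(range(len(topic_dist)), key=lambda t: -topic_dist[t])
    labels = set()
    for t in ranked[:top_n]:
        labels |= topic_to_labels[t]
    return labels

def cross_lingual_similarity(labels_a, labels_b):
    """Jaccard overlap of concept-label sets from independently trained models."""
    union = labels_a | labels_b
    return len(labels_a & labels_b) / len(union) if union else 0.0

# Independently trained monolingual models, annotated with shared labels.
en_topics = {0: {"agriculture", "trade"}, 1: {"health"}}
es_topics = {0: {"health", "policy"}, 1: {"agriculture"}}

en_doc = doc_labels([0.7, 0.3], en_topics)           # English document
es_doc = doc_labels([0.2, 0.8], es_topics, top_n=1)  # Spanish document
print(cross_lingual_similarity(en_doc, es_doc))
```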
paper @Oxford Academic
Drug-drug interactions (DDIs) involving antiretrovirals (ARVs) tend to cause harm if unrecognized, especially in the context of multiple co-morbidity and polypharmacy. A database linkage was established between the regional drug dispensing registry of Madrid and the Liverpool HIV DDI database (January-June 2017). Polypharmacy was defined as the use of ≥5 non-HIV medications, and DDIs were classified by a traffic-light ranking for severity. HIV-uninfected controls were also included. A total of 22,945 patients living with HIV (PLWH) and 6,613,506 uninfected individuals had received medications. Antiretroviral therapy regimens were predominantly based on integrase inhibitors (51.96%). Polypharmacy was significantly higher in PLWH (32.94%) than uninfected individuals (22.16%; P<0.001), and this difference was consistently observed across all age strata except for individuals aged ≥75 years. Polypharmacy was more common in women than men in both PLWH and uninfected individuals. The prevalence of contraindicated combinations involving ARVs was 3.18%. Comedications containing corticosteroids, quetiapine, or antithrombotic agents were associated with the highest risk for red-flag DDIs, and the use of raltegravir or dolutegravir-based antiretroviral therapy was associated with an adjusted odds ratio of 0.72 (95% confidence interval: 0.60–0.88; P=0.001) for red-flag DDIs. Polypharmacy was more frequent among PLWH across all age groups except those aged ≥75 years and was more common in women. The persistent detection of contraindicated medications in patients receiving ARVs suggests a likely disconnect between hospital and community prescriptions. Switching to alternative unboosted integrase regimens should be considered for patients with high risk of harm from DDIs.
Easily download articles and legal documents on public procurement.
If you have data described in RDF (e.g. a knowledge base or an ontology) and you want to publish it on the web following REST principles via an API over HTTP, this is your tool!
Innovative open source projects developed by enthusiastic students, organizations and teachers.
We have a huge collection of (unlabelled) documents and we would like to explore the knowledge inside. Imagine that we could run an unsupervised, automated pipeline to generate connections between them: state-of-the-art techniques to programmatically generate annotations for each of the texts inside big collections of documents in a way that is computationally affordable.
Explore large-scale multilingual corpora to discover similar documents while avoiding all pairwise comparisons and translations. Document similarity is calculated on demand from the existing corpus. Some results are provided.
proposal @Junta Castilla y Leon
Based on open data sources, such as those offered by the Junta de Castilla y León, our proposal consists of regulating public lighting according to criteria grouped into profiles: (I) Orange, associated with residential areas; (II) Green, associated with natural spaces; (III) Blue, associated with astronomical observation spaces; (IV) White, associated with areas of low population; (V) Rose, associated with leisure areas; (VI) Yellow, associated with monumental areas; and finally, (VII) Red, associated with critical areas such as hospitals or transport stations. These profiles establish ranges of on/off times and light intensity. Lighting control would be carried out by means of an intensity regulator installed in the public lighting controllers, for example a Raspberry Pi.
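The profile-based regulation above can be sketched as a simple lookup a controller such as a Raspberry Pi could run. All on/off hours and intensity values below are illustrative assumptions, not figures from the proposal.

```python
# Sketch of the profile-based regulation: each profile maps to an on/off window
# and a light-intensity level; all values are illustrative, not the proposal's.

PROFILES = {
    "orange": {"on": 19, "off": 7, "intensity": 0.6},   # residential areas
    "green":  {"on": 20, "off": 6, "intensity": 0.3},   # natural spaces
    "blue":   {"on": 21, "off": 5, "intensity": 0.1},   # astronomical observation
    "white":  {"on": 19, "off": 6, "intensity": 0.4},   # low-population areas
    "rose":   {"on": 19, "off": 2, "intensity": 0.8},   # leisure areas
    "yellow": {"on": 19, "off": 0, "intensity": 0.7},   # monumental areas
    "red":    {"on": 0,  "off": 24, "intensity": 1.0},  # hospitals, stations
}

def target_intensity(profile, hour):
    """Return the dimming level the intensity regulator would apply."""
    p = PROFILES[profile]
    if p["on"] <= p["off"]:                 # window within one day
        lit = p["on"] <= hour < p["off"]
    else:                                   # window crosses midnight
        lit = hour >= p["on"] or hour < p["off"]
    return p["intensity"] if lit else 0.0

print(target_intensity("blue", 23))   # observation zone, strongly dimmed
print(target_intensity("red", 3))     # critical area, always fully lit
```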
Probabilistic topic models are used to create scalable representations of documents that aim to: (1) organize, summarise and search them; (2) explore them through an index of the ideas they contain; and (3) browse them so that documents dealing with specific areas can be found.
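Once documents carry topic distributions, the three goals above reduce to simple operations over those vectors. A minimal sketch with toy data (the documents, distributions, and function names are illustrative assumptions):

```python
# Sketch: organizing and searching documents via their topic distributions.
from math import sqrt

docs = {
    "d1": [0.8, 0.1, 0.1],   # mostly topic 0
    "d2": [0.1, 0.8, 0.1],   # mostly topic 1
    "d3": [0.7, 0.2, 0.1],   # mostly topic 0
}

def organize(docs):
    """(1) Group documents by their dominant topic."""
    groups = {}
    for name, dist in docs.items():
        groups.setdefault(dist.index(max(dist)), []).append(name)
    return groups

def search(docs, query_dist):
    """(2)-(3) Rank documents by cosine similarity of topic vectors."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))
    return sorted(docs, key=lambda n: cos(docs[n], query_dist), reverse=True)

print(organize(docs))                      # topic -> documents index
print(search(docs, [0.9, 0.05, 0.05]))    # documents closest to a topic-0 query
```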
The Corpus Viewer platform relies on natural language processing (NLP), machine learning (ML), and machine translation (MT) techniques to analyze structured metadata and unstructured textual data in large corpora of textual documents. The platform gives decision-makers and policy implementers the ability to analyze the R+D+i information space (mainly patents, scientific publications and public grants) for the implementation of evidence- and knowledge-based policies. It is based, among other techniques, on topic modeling and graph analysis. Added integration with the NLP (English and Spanish) and Topic Models modules from librAIry.
Design and implementation of an algorithm to manage versions of machine learning models.
This tool builds a probabilistic topic model based on the Biterm Topic Model (BTM) from a collection of tweets.
This document reports the results of our experimental study aimed at finding out the impact of different orders of adding documents to datasets for measuring terminological saturation. The motivation for this research activity lies in the fact that real-world document collections are retrospective, so terminological drift over time is often present in such collections. We empirically investigated the proper ways to cope with this temporal drift and its influence on terminological saturation. Our premise was that there could be several different orders of adding documents to the processed datasets, depending on the time of publication: (i) chronological; (ii) reversed chronological; (iii) bi-directional; and (iv) random. Experiments were performed using three different real-world document collections coming from different domains, where the collections of high-quality documents were available as scientific ….
This paper reports on cross-evaluating the two freely available software tools for automated term extraction (ATE) from English texts: NaCTeM TerMine and UPM Term Extractor. The objective was to find the most fitting software for extracting the bags of terms to be part of our instrumental pipeline for exploring terminological saturation in text document collections in a domain of interest. The choice of these particular tools from among the other available ones is explained in our review of the related work in ATE. The approach to measuring terminological saturation is based on the use of the THD algorithm developed in the frame of our OntoElect methodology for ontology refinement. The paper presents the suite of instrumental software modules, the experimental workflow, 2 synthetic and 3 real document collections, the generated datasets, and the set-up of our experiments. Next, the results of the cross-evaluation experiments are presented, analyzed, and discussed. Finally, the paper offers some conclusions and recommendations on the use of ATE software for measuring terminological saturation in retrospective text document collections.
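The saturation loop that such a pipeline runs can be sketched as follows. The term extractor below is a naive stand-in for tools like TerMine or the UPM Term Extractor, and the stopping threshold is an illustrative assumption, not the THD algorithm itself.

```python
# Sketch: add documents to the dataset in a chosen order, extract a bag of
# terms at each step, and stop when the share of new terms drops below a
# threshold. The extractor is a toy stand-in for a real ATE tool.

def extract_terms(text):
    """Toy extractor: lowercase words longer than three characters."""
    return {w.lower().strip(".,") for w in text.split() if len(w) > 3}

def saturation_point(documents, threshold=0.1):
    """Return the step at which the terminology stops growing significantly."""
    seen = set()
    for step, doc in enumerate(documents, start=1):
        terms = extract_terms(doc)
        new = terms - seen
        seen |= terms
        if terms and len(new) / len(terms) < threshold:
            return step          # saturated: few new terms in this document
    return None                  # never saturated within the collection

collection = [
    "Ontology learning extracts terms from documents",
    "Terminological saturation measures term novelty in documents",
    "Ontology terms from documents measure saturation",
]
print(saturation_point(collection, threshold=0.2))
```

Running this with the documents in a different order (chronological, reversed, random) is exactly what lets the study above measure how ordering affects where saturation occurs.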
There are many scenarios where we may want to find pairs of textually similar documents in a large corpus (e.g. a researcher doing a literature review, or an R&D project manager analyzing project proposals). Programmatically discovering those connections can help experts achieve those goals, but brute-force pairwise comparisons are not computationally feasible when the document corpus is too large. Some algorithms in the literature divide the search space into regions containing potentially similar documents, which are later processed separately from the rest in order to reduce the number of pairs compared. However, this kind of unsupervised method still incurs high time costs. In this paper, we present an approach that relies on the results of a topic modeling algorithm over the documents in a collection as a means to identify smaller subsets of documents where the similarity function can then be computed. This approach has proved to produce promising results when identifying similar documents in the domain of scientific publications. We have compared our approach against state-of-the-art clustering techniques and with different configurations of the topic modeling algorithm. Results suggest that our approach outperforms (>0.5) the other analyzed techniques in terms of efficiency.
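The pruning idea can be sketched in a few lines. This is an illustrative toy under the assumption that documents sharing a dominant topic are the only candidate pairs; the real approach's partitioning criteria may differ.

```python
# Sketch: restrict pairwise similarity comparisons to documents that share a
# dominant topic, instead of comparing every pair in the corpus.
from itertools import combinations

def bucket_by_topic(doc_topics):
    """Group document ids by their most probable topic."""
    buckets = {}
    for doc, dist in doc_topics.items():
        buckets.setdefault(dist.index(max(dist)), []).append(doc)
    return buckets

def candidate_pairs(buckets):
    """Only pairs inside the same bucket are ever compared."""
    pairs = set()
    for docs in buckets.values():
        pairs |= set(combinations(sorted(docs), 2))
    return pairs

doc_topics = {
    "a": [0.7, 0.3], "b": [0.6, 0.4],   # dominant topic 0
    "c": [0.2, 0.8], "d": [0.1, 0.9],   # dominant topic 1
}
pairs = candidate_pairs(bucket_by_topic(doc_topics))
# 2 candidate pairs instead of the 6 that brute force would compare.
print(pairs)
```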
This half-day tutorial covers the foundations and modern practical applications of knowledge-based and statistical methods, techniques and models and their combination for exploiting large document corpora. It is focused on the foundations of many of the techniques that can be used for this purpose, including knowledge graphs, word embeddings, neural network methods, probabilistic topic models, and describes how a combination of these techniques is being used in practical applications and commercial projects where the instructors are currently involved.
prototype @Desafio Aporta
Incorporation of natural language processing techniques into the indexing process of open datasets to enrich their characterization and allow their linking with news published in digital media. Given the text of a news item, the system suggests to its author datasets published in datos.gob.es based on three aspects: (i) spatio-temporal proximity: datasets whose content covers the location and/or time in which the news item is framed; (ii) similarity: datasets that deal with issues similar to those discussed in the news item; (iii) reference: datasets that mention people, places or organizations cited in the news item. The latter case will also add new metadata to the existing dataset based on the entities (people, places, organizations) extracted from its content. Ideally, the importance of each line of recommendation will depend on the content of the news item.
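One way to combine the three recommendation signals is a weighted score per dataset. The features, weights, and field names below are illustrative assumptions, not the system's actual scoring model.

```python
# Sketch of mixing the three recommendation signals into a single score.
# All weights and feature values are illustrative placeholders.

def score_dataset(news, dataset, weights=(0.4, 0.4, 0.2)):
    """Weighted mix of spatio-temporal proximity, similarity, and references."""
    w_prox, w_sim, w_ref = weights
    proximity = 1.0 if dataset["location"] == news["location"] else 0.0
    similarity = dataset["topic_overlap"]            # e.g. from a topic model
    shared = news["entities"] & dataset["entities"]  # people, places, orgs
    reference = len(shared) / len(news["entities"]) if news["entities"] else 0.0
    return w_prox * proximity + w_sim * similarity + w_ref * reference

news = {"location": "Madrid", "entities": {"Madrid", "EMT"}}
dataset = {"location": "Madrid", "topic_overlap": 0.5,
           "entities": {"EMT", "buses"}}
print(score_dataset(news, dataset))
```

Making the weights depend on the news content (e.g. boosting proximity for local news) matches the final remark in the description above.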
poster @EACS 2017
This was a cross-sectional population-based study carried out from January 1 to June 30, 2016 in the region of Madrid. Hospital pharmacies dispense antiretrovirals (ARVs), and co-medications (Co-meds) are dispensed mainly by community pharmacies. Refills of medications are done monthly. The following parameters were evaluated: age, sex, and prescription drugs (ARVs and Co-meds). Patients were classified as pediatric patients (<18 years) and adults (≥18 years). ARVs were categorized according to class: nRTIs, nnRTIs, PIs, INSTIs, CCR5 inhibitors and fusion inhibitors. Co-meds were classified according to the Anatomical Therapeutic Chemical (ATC) classification system. Polypharmacy was considered as the intake of ≥5 Co-meds.
paper @ISWC SemSci Workshop
Summaries and abstracts of research papers have been traditionally used for many purposes by scientists, research practitioners, editors, programme committee members or reviewers (e.g. to identify relevant papers to read or publish, cite them, or explore new fields and disciplines). As a result, many paper repositories only store or expose abstracts, which may limit the capacity of finding the right paper for a specific research purpose. Given the size limitations and the concise nature of abstracts, they usually omit explicit references to some contributions and impacts of the paper. Therefore, for certain information retrieval tasks they cannot be considered the most appropriate excerpt of the paper to base these operations on. In this paper we have studied other kinds of summaries, built upon textual fragments falling under certain categories of the scientific discourse, such as outcome, background, approach, etc., to decide which one is most appropriate as a substitute for the original text. In particular, two novel measures are proposed: (1) internal-representativeness, which evaluates how well a summary describes what the full text is about, and (2) external-representativeness, which evaluates the potential of a summary to discover related texts. Results suggest that summaries explaining the method of a scientific article express a more accurate description of the full content than others. In addition, more relevant related articles are also discovered from summaries describing the method, together with those containing the background knowledge or the outcomes of the research paper.
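Both measures can be grounded on vector representations of the texts. The sketch below uses plain bag-of-words cosine similarity as a stand-in for the paper's representations, and averages similarity to related texts for the external measure; these are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch: internal- and external-representativeness over bag-of-words vectors.
from collections import Counter
from math import sqrt

def cosine(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def internal_representativeness(summary, full_text):
    """How well the summary describes what its own full text is about."""
    return cosine(summary, full_text)

def external_representativeness(summary, related_texts):
    """Potential of the summary to discover related texts (mean similarity)."""
    return sum(cosine(summary, t) for t in related_texts) / len(related_texts)

full = "topic models annotate documents with topics for search"
summary = "topic models annotate documents"
related = ["topic models for search", "neural networks for vision"]
print(internal_representativeness(summary, full))
print(external_representativeness(summary, related))
```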
We present librAIry, a novel architecture to store, process and analyze large collections of textual resources, integrating existing algorithms and tools into a common, distributed, high-performance workflow. Available text mining techniques can be incorporated as independent plug&play modules working in a collaborative manner within the framework. In the absence of a pre-defined flow, librAIry leverages the aggregation of operations executed by different components in response to an emergent chain of events. Extensive use of Linked Data (LD) and Representational State Transfer (REST) principles is made to provide individually addressable resources from textual documents. We have described the architecture design and its implementation and tested its effectiveness in real-world scenarios such as collections of research papers, patents or ICT aids, with the objective of providing solutions for decision-makers and experts in those domains. Major advantages of the framework and lessons learned from these experiments are reported.
This paper presents the motivation for, planning of, and very first results of the PhD project by the first author. The objective of the project is to experimentally assess the representativeness (completeness), for knowledge extraction, of a retrospective textual document collection. The collection is chosen to describe a single well circumscribed subject domain. The approach to assess completeness is based on measuring the saturation of the semantic (terminological) footprint of the collection. The goal of this experimental study is to check if the saturation-based approach is valid. The project is performed at the Dept. of Computer Science of Zaporizhzhya National University in cooperation with BWT Group, Universidad Politecnica de Madrid, and Springer-Verlag GmbH.
This deliverable describes the design and implementation of the unified information service whose objective is to simplify the search of relevant content in situations where multiple heterogeneous non-structured information sources are involved, such as web-based Research Objects.
Ontology learning algorithms are used to automatically generate ontologies, usually from unstructured resources. They are especially useful in particular domains where no mature or de-facto ontologies are available. As part of the DRInventor research efforts, we have developed an ontology learning algorithm that aims to generate the domain model underlying the research objects indexed in the platform. Our approach is able to detect a set of relevant terms and relations that annotate the information contained in the corpus's resources. We also present a unified framework for evaluating ontology learning algorithms, which takes into consideration different lexical and taxonomical aspects that are compared against a semi-automatically generated gold standard.
This deliverable describes the different knowledge discovery and recommendation mechanisms implemented in the DRInventor Platform. The objective of these techniques is to provide users with the relevant knowledge available in big corpora of documents in an automatic and timely manner. This way consumers in general and scientists in particular can obtain pertinent suggestions about which resources or specific fragments inside those resources they should be reading next, hence maximising the relevance of the information consumed while minimising the efforts spent on identifying them.
FarolApp is a mobile web application that aims to increase awareness of light pollution by generating illustrative maps for cities and by encouraging citizens and public administrations to provide street light information in a ubiquitous and interactive way using online street views. In addition to the maps, FarolApp builds on existing sources to generate and provide up-to-date data via crowdsourced user annotations. Generated data is available as dereferenceable Linked Data resources in several RDF formats and via a queryable SPARQL endpoint. We propose Live Linked Data, a new approach to publishing data about city infrastructures that tries to keep them synchronized by leveraging the collaboration of citizens. The demo presented in this paper illustrates how FarolApp maintains continuously evolving Linked Data that reflects the current status of city street light infrastructures and uses that data to generate light pollution maps.