Developing a healthcare dataset information resource (DIR) based on Semantic Web

October 18, 2023

An impressive tool that lets users ask a variety of questions about a candidate dataset. It supports the basics, like “how many patients” and “is it open source”, but can also get into more detail: for example, which statistical methods have been used on the dataset (extracted from publications in PubMed) and which data elements are used in it. The focus of the paper isn’t to demonstrate the tool itself but to show how the authors’ application of semantic methods enables this kind of functionality. In that sense, it is a great primer on some Semantic Web basics, like RDF and SPARQL, and on how to combine several disparate ontologies. The extraction of statistical methods from publications is almost a sub-paper of its own, in which they describe their rule-based NER and its results. The paper also covers the basics of the 12 datasets they included, most of which you should read up on and know about if you are a researcher in the health informatics space. Unfortunately, the authors are still limited by the great equalizer: manual extraction. The Discussion section has some good things to say about how to move away from manual curation. Link to paper.
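To make the rule-based NER idea concrete, here is a minimal sketch of how statistical-method mentions might be pulled out of abstract text with a small pattern lexicon. This is not the authors' rule set: the `METHOD_PATTERNS` lexicon, the `extract_methods` function, and the sample sentence are all hypothetical and only illustrate the general technique.

```python
import re

# Hypothetical lexicon (an assumption, not taken from the paper):
# canonical method name -> regex alternatives for its surface forms.
METHOD_PATTERNS = {
    "logistic regression": [r"logistic regression"],
    "Cox proportional hazards": [
        r"cox (?:proportional[- ]hazards?|regression)(?: models?)?"
    ],
    "t-test": [r"(?:student'?s )?t[- ]tests?"],
    "chi-square test": [r"chi[- ]squared? tests?"],
}

# Compile one case-insensitive, word-bounded pattern per canonical name.
COMPILED = [
    (name, re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE))
    for name, patterns in METHOD_PATTERNS.items()
]


def extract_methods(text):
    """Return (start, end, canonical_name, matched_text) tuples in text order."""
    hits = []
    for name, pattern in COMPILED:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), name, match.group(0)))
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "We used logistic regression and Student's t-tests; "
        "survival was modelled with a Cox proportional hazards model."
    )
    for start, end, name, surface in extract_methods(sample):
        print(f"{start:>3}-{end:<3} {name!r} <- {surface!r}")
```

Mapping each surface form to a canonical name is what would let the extracted mentions be linked to an ontology term afterwards; a real system would need a much larger lexicon plus handling for abbreviations and negation.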