Health Informatics is becoming an ever larger part of the modern healthcare sector, but different communities perceive and use the concept differently, which may lead to disagreement about what the term covers and whether it is explicitly discussed on internet forums.
We, therefore, aim to explore the topic of Health Informatics on Wikipedia and obtain an overview of its network as well as knowledge on how it is understood by different actors.
We identify several thematic clusters. It must be noted, however, that the obtained data is biased towards English-language articles drawn solely from the Wikipedia platform.
Furthermore, the algorithms used for interpretation and visualisation of the data are opinionated and shape the result with a certain bias.
Protocol
We crawled and scraped the category of Health Informatics (the seed) on Wikipedia and all its member pages by using a script that connects to the Wikipedia API.
The depth is set to 1, meaning that the script also crawls the member pages of the sub-categories associated with our seed category.
This gives us an initial number of pages of 931.
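The crawl described above can be sketched roughly as follows. This is a minimal illustration and not the script we actually used: it assumes the public MediaWiki endpoint and its `list=categorymembers` query, and the function names, the User-Agent string and the exact category title are made up for the example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_members(category, cmtype):
    """Return the titles of all members of `category` of one kind.

    `cmtype` is "page" or "subcat". The loop follows the API's
    continuation token until the full listing has been read.
    """
    titles, cont = [], {}
    while True:
        params = {
            "action": "query",
            "format": "json",
            "list": "categorymembers",
            "cmtitle": category,
            "cmtype": cmtype,
            "cmlimit": "500",
            **cont,
        }
        url = API_URL + "?" + urllib.parse.urlencode(params)
        # Hypothetical User-Agent; replace with your own contact details.
        req = urllib.request.Request(url, headers={"User-Agent": "hi-crawl-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        titles += [m["title"] for m in data["query"]["categorymembers"]]
        if "continue" not in data:
            return titles
        cont = data["continue"]


def crawl_category(seed, depth, fetch=fetch_members):
    """Collect member pages of `seed` and of its sub-categories.

    depth=0 reads only the seed category; depth=1 also reads the
    member pages of its direct sub-categories, and so on.
    """
    pages, seen, frontier = set(), {seed}, [seed]
    for level in range(depth + 1):
        next_frontier = []
        for cat in frontier:
            pages.update(fetch(cat, "page"))
            if level < depth:
                for sub in fetch(cat, "subcat"):
                    if sub not in seen:
                        seen.add(sub)
                        next_frontier.append(sub)
        frontier = next_frontier
    return pages
```

Calling `crawl_category("Category:Health informatics", depth=1)` would issue live requests to Wikipedia; the `fetch` argument exists so the traversal can be tried out on a small hand-made category tree first.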
The 931 pages were then used as inputs to run two different scripts, one that crawls and scrapes all links in all pages and one script that only explores the links in the body text of all pages.
Both scripts produced two .GEXF files each for use in Gephi; we chose to proceed with the file from each script that contains all pages.
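To make the difference between the two harvesting strategies more concrete, here is a small sketch of how links in the body text could be separated from links that come from templates. It is only an assumption about how such a script might work: it takes raw wikitext, removes `{{...}}` template calls, and then reads the remaining `[[...]]` links. The function names and the list of skipped namespaces are made up for the example.

```python
import re

# Innermost template call: {{ ... }} with no braces inside.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
# [[Target]], [[Target|label]] or [[Target#Section|label]]; group 1 is the target.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# Namespaces that are not article links (illustrative, not exhaustive).
SKIPPED_PREFIXES = ("file:", "image:", "category:")


def strip_templates(wikitext):
    """Remove template calls, innermost first, until none remain."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = TEMPLATE.sub("", wikitext)
    return wikitext


def intext_links(wikitext):
    """Return article titles linked from the body text, in order of appearance."""
    titles = []
    for match in WIKILINK.finditer(strip_templates(wikitext)):
        title = match.group(1).strip()
        if not title or title.lower().startswith(SKIPPED_PREFIXES):
            continue
        # Wikipedia treats the first letter of a title as case-insensitive.
        titles.append(title[0].upper() + title[1:])
    return titles
```

An All_Links style harvest would, by contrast, also keep the links that templates such as navigation boxes add to a page, which is the difference the two networks are meant to expose.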
Diagram showing the sequence/order in which we applied the scripts to harvest data from the 931 pages in our category and sub-categories.
The Health Informatics Network
To get an overview of the category, we chose to visualise the network of the category itself, including subcategories and its member pages in two different versions: one with connections by links in text (InText) and one with connections by all links on a page (All_Links).
On these, we applied the layout ForceAtlas2 to explore the network structure.
Both networks contained multiple disconnected nodes, which the force-directed layout pushed far away from the rest of the graph.
Applying the Giant Component filter solved this.
To reduce noise, we added the partition filter and removed all nodes marked “not a member” from the network, and changed the edge colour to grey to make the nodes stand out.
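For readers who want to reproduce the two filtering steps outside Gephi, the idea can be sketched in plain Python. This is an illustration under our own assumptions, not Gephi's implementation: edges are treated as undirected pairs of page titles, and the function names are made up.

```python
from collections import deque


def giant_component(edges, nodes=()):
    """Return the node set of the largest connected component.

    `edges` is an iterable of (source, target) pairs, read as undirected.
    `nodes` may list extra nodes that have no edges at all.
    """
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    best, seen = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def keep_members(edges, members):
    """Keep only edges whose two endpoints are both category members."""
    return [(a, b) for a, b in edges if a in members and b in members]
```

The first function mirrors what the Giant Component filter does for the layout problem, and the second corresponds to dropping the nodes that are not members of the category.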
A comparison between the two visualisations indicated that the formation of clusters appearing in the All_Links but not in the InText might be caused by the Wikipedia templates, which differentiated the two networks.
For further insight, we continued working with the All_Links network, filtering the degree range to 100–672, followed by rerunning ForceAtlas2 to prevent overlap between nodes.
Next, the node size range was set to 7–50 according to degree, so that a node’s size reflects its number of edges to other nodes.
For a clearer visualisation, we raised the lower bound of the degree range to 150 before applying the modularity statistic, colouring the nodes accordingly and increasing the upper limit of the node size range to 120.
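The degree filter and the degree-based node sizing can be expressed in a few lines. The sketch below assumes undirected edges and a linear mapping from degree to size; Gephi's own ranking may interpolate differently, and the function names are our own.

```python
from collections import Counter


def degrees(edges):
    """Count how many edges touch each node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


def filter_by_degree(edges, low, high):
    """Keep edges between nodes whose degree lies in [low, high]."""
    deg = degrees(edges)
    keep = {n for n, d in deg.items() if low <= d <= high}
    return [(a, b) for a, b in edges if a in keep and b in keep]


def node_sizes(deg, min_size=7, max_size=50):
    """Map degrees linearly onto the size range [min_size, max_size]."""
    if not deg:
        return {}
    lo, hi = min(deg.values()), max(deg.values())
    span = (hi - lo) or 1  # avoid division by zero when all degrees are equal
    return {
        n: min_size + (d - lo) / span * (max_size - min_size)
        for n, d in deg.items()
    }
```

With `filter_by_degree(edges, 100, 672)` followed by `node_sizes(...)`, the lowest-degree node that survives the filter gets size 7 and the highest-degree node gets size 50.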
Visualisation of the network for the 931 Wikipedia pages in the category Health Informatics and its sub-categories as well as pages outside the category, connected by all links on a page, with annotations related to the different visible clusters.
In order to juxtapose the two networks, the same procedure was applied to the InText network, which produced faint clusters compared to the All_Links network.
This led us to suspect that the difference stems from the way the two networks harvest links: the All_Links network includes references in the Wikipedia templates, which creates a media effect.
Looking at some of the articles, we noticed a shared use of templates and links connecting the pages.
The four big clusters might indicate some high degree nodes drawing attention to specific crowds in the network.
Six derivatives of the All_Links network
To get an overview of what types of pages the clusters in the All_Links network contained, we discussed what might be interesting to gain insight into, which resulted in a list of 15 keywords that we wanted to harvest data on in relation to the pages in the network.
We ran the key-word-search script on the “category members .JSON_file” in English and with the wildcard on.
Afterwards, we imported the results into our All_Links network and visualised each keyword to see which were the most interesting, and ended up choosing the following for further investigation: efficien*, legislat*, law, polic*, safe*, secur*.
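A wildcard keyword count of this kind can be approximated with regular expressions. The sketch below is not the script we ran; it assumes that a trailing `*` means "any word beginning with this stem", that keywords without `*` match whole words only, and that matching is case-insensitive.

```python
import re

KEYWORDS = ["efficien*", "legislat*", "law", "polic*", "safe*", "secur*"]


def keyword_pattern(keyword):
    """Turn a keyword with an optional trailing * into a word-level regex."""
    if keyword.endswith("*"):
        stem = re.escape(keyword[:-1])
        return re.compile(r"\b" + stem + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)


def count_keywords(text, keywords=KEYWORDS):
    """Return how often each keyword occurs in `text`."""
    return {kw: len(keyword_pattern(kw).findall(text)) for kw in keywords}
```

Running `count_keywords` on the full text of every page gives one count per keyword and page, which is the value the node sizes in the six keyword networks are based on.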
Six representations of the All_Links network (with the aforementioned layout applied) where the node size is based on the number of times a specific keyword is mentioned in a page in the network.
The nodes are sized by the degree 7–60 and coloured according to modularity.
To better understand how the keywords were used in their respective articles, we manually looked through them.
We see law, legislat* and secur* mostly in the green cluster, in the context of data security and privacy.
Meanwhile, polic* and safe* are mostly used in the purple cluster in the context of medical trials, and efficien* is prominent in both purple and green clusters, in the context of technology and data efficiency, with a slight mention of healthcare quality.
In summary, there are indications of different focus areas in the green and purple clusters: the former mostly deals with concerns about the privacy of data used in health information systems, whereas the latter is more concerned with medicine and clinical trials and only deals with health information systems to a small degree.
Timelines for edit history of two selected Wikipedia pages.
Based on our keyword search, we created two timelines of the pages Medical Record (MR) and Evidence-Based Medicine (EBM), showing revision count and unique users. The keyword search showed that the words legislat* and law are clearly related to MR, while polic* is related to EBM, which is located in another cluster.
The words are somehow related to each other meaning-wise but they seem to be context or area specific.
Revision timelines of the Wikipedia pages Medical record and Evidence-Based Medicine, visualising both the number of revisions (the blue line) and the number of unique users who made the revisions (the orange bars) since 2004 and 2001 respectively, when the articles were first written.
In both diagrams we observe occasional high peaks in the number of revisions within a short amount of time by a relatively low number of users, potentially indicating a dispute within the community.
This can be further examined in the revisions, the talk pages and the comments from the revision csv-file.
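The two series in the timelines, and the "many revisions by few users" pattern, can be derived from the revision data with a short aggregation. The sketch assumes each revision is a dict with an ISO 8601 `timestamp` and a `user` field; these column names are hypothetical and may differ from those in the revision csv-file.

```python
from collections import defaultdict


def monthly_activity(rows):
    """Return {"YYYY-MM": (revision_count, unique_user_count)} sorted by month."""
    revisions = defaultdict(int)
    users = defaultdict(set)
    for row in rows:
        month = row["timestamp"][:7]  # "YYYY-MM" prefix of an ISO 8601 timestamp
        revisions[month] += 1
        users[month].add(row["user"])
    return {m: (revisions[m], len(users[m])) for m in sorted(revisions)}


def flag_bursts(activity, min_revisions, max_users):
    """List months with at least `min_revisions` made by at most `max_users` users."""
    return [
        month
        for month, (n_revisions, n_users) in activity.items()
        if n_revisions >= min_revisions and n_users <= max_users
    ]
```

Rows read with `csv.DictReader` can be passed in directly, provided the file has matching column names. The thresholds are a judgment call; the flagged months are only candidates for a closer look at the talk pages, not evidence of a dispute in themselves.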
We decided to look at revisions between 01–12–2007 and 01–01–2010 in MR.
We later discovered that the page was merged in 2005, which might provide an explanation for why the number of revisions is so low.
However, we did not find anything of interest here, only a short remark regarding the content of the page in relation to ethics, showing bias against certain perspectives on the topic.
Other than that, the talk threads are relatively short, with few and polite replies.
While examining the revisions between 01–07–2009 and 01–01–2010 in EBM, we can see from the Contents table that the “Criticism” section has been the topic of a long debate.
This can explain the rise in revisions.
Among the users, there is one prominent contributor, who has left the majority of the comments on the talk page, often at the end of the threads.
This user’s activity is probably the reason for the low unique user count for this period.
Additionally, in section 2 we notice an argument where a user boldly expresses his displeasure at how his edits were handled, and has a similarly high count of replies, though heavily concentrated on his own thread.
Part of the contents table of the EBM Talk page
A network of co-occurring noun phrases extracted through semantic analysis
A visualisation of a semantic analysis network of the full text pages from the Wikipedia category of Health Informatics.
Each cluster is provided with a number in relation to the description below.
It seems that the semantic analysis reshapes the clusters into smaller but more specific ones, which we will shortly describe.
Firstly, number 1 mainly consists of health information and technology related pages.
It has edges to number 2 made of information exchange and different health record systems as well as edges to number 3 which comprises informatics and different associations engaged within the field of informatics and health sciences.
Secondly, number 4 is concerned with health information and management systems.
This cluster is connected to the network through only two nodes: the health systems in number 5, and the health informatics node within number 3.
The same is the case with number 6 concerning health care and care providers, which is connected to the rest of the network only by the test result node in the orange cluster.
Number 7 is more mixed, with words mentioning different diseases along with data protection, patient safety, medicine and research, while number 8 is centred on the word management.
This reshaping of the clusters might indicate that even though some topics are well connected (and shown in one cluster in the All_Links network), the way they are talked about might be different, resulting in these smaller and more specific groupings/clusters visualised in the network of the semantic analysis.
What might be interesting here is that the node health informatics plays only a minor role in the network, even though it is the name of the Wikipedia category and thus also the title of the category’s main article.
Even though the algorithms of CorText force the clusters apart from each other, we still expected that health informatics would play a bigger role and have a higher node degree.
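The basic idea behind a co-occurrence network can be illustrated with a few lines of Python. This is deliberately much simpler than what CorText does and is not its algorithm: it assumes the noun phrases have already been extracted per page, and it links two phrases whenever they appear on the same page.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents, min_weight=1):
    """Return {(term_a, term_b): weight} for terms sharing a document.

    `documents` is an iterable of term lists, one per page. The weight is
    the number of pages on which both terms occur; pairs are stored in
    alphabetical order so each edge appears only once.
    """
    weights = Counter()
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}
```

In such a network, a term's degree is the number of distinct terms it shares a page with, which gives one concrete reading of the expectation above that a term like health informatics would have a high node degree.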