Shining a light on collaboration with Protein Data Bank and Dimensions

18th October 2021
 | Guest Author

To celebrate the 50th anniversary of the Protein Data Bank, Digital Science has created an interactive network explorer, powered by Dimensions, to better understand research collaboration and visualise relationships between publications.

Launched in 1971, the Protein Data Bank is considered the first open access digital data resource in biology and medicine. It is a single repository that provides access to 3D structures of the molecules of life: proteins, RNA and DNA. At the beginning of 2021, it held more than 170,000 structures, around 150,000 of which had a corresponding primary citation describing the entry in a peer-reviewed journal.

The National Library of Medicine assigns MeSH (Medical Subject Headings) terms from a controlled vocabulary with 16 main branches to index articles for PubMed. MeSH terms are arranged in a hierarchical tree structure whose main branches are named for topics ranging from anatomy, organisms and diseases to health care, publications and geographicals.

We created a co-occurrence network of the publications in each branch, linking publications that share a common MeSH term. Because MeSH terms are very broad, common and sometimes redundant, we kept only the strongest links. The clustering is based on publications, not on structures, which makes the network more accessible and easier to visualise. It also makes the network usable as a qualitative indicator rather than a function of the number of sequences in the database.
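The linking step can be illustrated with a minimal Python sketch. It is not the code behind the explorer: the function name, the toy publications and the `min_shared` threshold are all assumptions made for illustration, and "strongest links" is interpreted here simply as "pairs sharing at least `min_shared` MeSH terms".

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_edges(pub_terms, min_shared=2):
    """Link publications that share MeSH terms.

    pub_terms maps a publication ID to the set of MeSH terms assigned to it.
    Returns {(pub_a, pub_b): number_of_shared_terms}, keeping only pairs that
    share at least `min_shared` terms (an assumed notion of "strongest").
    """
    # Invert the mapping: MeSH term -> publications indexed with that term.
    term_to_pubs = defaultdict(set)
    for pub, terms in pub_terms.items():
        for term in terms:
            term_to_pubs[term].add(pub)

    # Every pair of publications under the same term gains one unit of weight.
    weights = defaultdict(int)
    for pubs in term_to_pubs.values():
        for a, b in combinations(sorted(pubs), 2):
            weights[(a, b)] += 1

    # Drop weak links so that broad, common terms do not connect everything.
    return {pair: w for pair, w in weights.items() if w >= min_shared}


# Toy input (hypothetical publications, terms taken from the article's examples).
toy_pubs = {
    "pub1": {"hydrogen bonds", "thermodynamics", "catalysis"},
    "pub2": {"hydrogen bonds", "thermodynamics"},
    "pub3": {"catalysis", "weak protein interaction"},
    "pub4": {"weak protein interaction"},
}
example_edges = build_cooccurrence_edges(toy_pubs, min_shared=2)
```

In this toy input only `pub1` and `pub2` share two terms, so theirs is the only link that survives; the single-term links to `pub3` and `pub4` are discarded.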

A network visualisation was produced to show how the publications relate to one another. We used the OpenOrd layout, a force-directed layout which aims to better distinguish clusters. Groups were then created by clustering on MeSH term occurrence and frequency.

Highlights of the network

The networks are compiled by the main MeSH terminology branches. As an example view we chose the network for branch G, ‘Phenomena and Processes’, which is dominated by two equally sized, well-populated groups, labelled 82 and 249. Group 82 clusters around the MeSH terms ‘hydrogen bonds’, ‘thermodynamics’ and ‘catalysis’, group 249 around ‘weak protein interaction’. The different shapes of the two groups are immediately visible: 82 forms a ‘dense’ cluster and 249 a ‘widely dispersed’ cluster.

Details of these groups, including the most frequent MeSH terms at group level, are displayed in the menu to the left, with an indication of size and colour code.
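A group-level summary of this kind can be approximated in a few lines of Python. The sketch below is only an illustration under assumptions: the function name, the publication IDs and their term assignments are invented, and only the group labels 82 and 249 are borrowed from the example above.

```python
from collections import Counter


def top_terms_per_group(pub_group, pub_terms, n=2):
    """Return the n most frequent MeSH terms for each group.

    pub_group maps publication ID -> group label.
    pub_terms maps publication ID -> list of MeSH terms.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = {}
    for pub, group in pub_group.items():
        counts.setdefault(group, Counter()).update(pub_terms.get(pub, []))

    summary = {}
    for group, counter in counts.items():
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        summary[group] = [term for term, _ in ranked[:n]]
    return summary


# Toy data: publication IDs and term assignments are hypothetical.
toy_groups = {"pub1": 82, "pub2": 82, "pub3": 249}
toy_terms = {
    "pub1": ["hydrogen bonds", "thermodynamics"],
    "pub2": ["hydrogen bonds", "catalysis"],
    "pub3": ["weak protein interaction"],
}
group_summary = top_terms_per_group(toy_groups, toy_terms, n=2)
```

Sorting by descending count and then by name is a design choice for reproducibility; the explorer's own ranking rule is not described in this post.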

Visualisation of MeSH term clusters within a MeSH term group

Focussing on the ‘dense’ group 82 (green, see screenshot), best described as hydrogen bond, thermodynamics and catalysis, we can point to distinct subclusters of publications within it. These subclusters emerge because the underlying publications group together on the co-occurrence of further MeSH terms.

Using the visual display, these subclusters can easily be inspected further via the group view and refined by clicking and by MeSH term analysis. In our example, hovering with the mouse highlights the MeSH terms for each publication, and at a quick glance one can determine that one subgroup displays MeSH terms related to ‘virus’, whereas another links to ‘human’.

Loose clustering into MeSH term groups

In contrast to the distinct inner clustering seen in the ‘hydrogen bond’ group of publications, the ‘weak protein interaction’ group (249) displays a loose, broad clustering, indicating the broader field of crystallographic research published under these MeSH headings. Consequently, the co-occurrence of shared MeSH terms is less pronounced and the structure within this group is widely spread.

Neighboring MeSH terms

We can now cross-examine MeSH term groups across the ‘Phenomena and Processes’ network in search of combinations of MeSH terms. Using the dropdown menu, individual MeSH terms can be highlighted and inspected, for example in relation to the two large groups used as examples previously. The MeSH term ‘Enzyme Activation’ under ‘Phenomena and Processes’, for instance, covers only a few labelled publications. Of interest is one hit in the group best described as ‘weak protein interaction’, which can be inspected via the browser to the left for more details such as the publication link, structure ID and further MeSH terms.
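The same kind of lookup can be sketched outside the explorer. The Python snippet below is a hypothetical illustration, not the explorer's implementation: the function name and the publication IDs are made up, and it simply reports which groups contain publications indexed with a chosen MeSH term.

```python
def find_term_in_groups(term, pub_terms, pub_group):
    """Return {group label: sorted publication IDs} for publications
    indexed with `term`."""
    hits = {}
    for pub, terms in pub_terms.items():
        if term in terms:
            hits.setdefault(pub_group.get(pub), []).append(pub)
    return {group: sorted(pubs) for group, pubs in hits.items()}


# Toy data: publication IDs and assignments are hypothetical.
toy_terms = {
    "pubA": {"Enzyme Activation", "weak protein interaction"},
    "pubB": {"hydrogen bonds", "catalysis"},
    "pubC": {"weak protein interaction"},
}
toy_groups = {"pubA": 249, "pubB": 82, "pubC": 249}
enzyme_hits = find_term_in_groups("Enzyme Activation", toy_terms, toy_groups)
```

Starting from a rare term and then listing the other terms of each hit is one way to mirror the exploration described above.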

Summary

These network displays allow for an additional and alternative analysis of MeSH terms and their proximity to each other by co-occurrence analysis of the primary publications. In addition to the hierarchical MeSH browser, this facilitates the quick visual analysis of the vicinity of different MeSH terms within 15 distinct network groups.

Both approaches, exploring MeSH terms ‘top down’ from a group (dense or broad) to the level of individual MeSH terms, or ‘bottom up’ by starting with a less frequent MeSH term and subsequently exploring the clusters around it, yield a fresh view and allow for new insights into the relations between MeSH terms while analysing the primary publications in the protein structure literature.