Accurately Analysing Diversity in Research

11th February 2019

Suze Kundu

11th February marks the annual International Day of Women and Girls in Science. Set up by the United Nations back in 2015, the day was created to celebrate the achievements of women in science, technology, engineering and maths (STEM) professions, and to encourage girls in school to pursue STEM-related subjects with a view to contributing to STEM-related careers.

This article was written by Dr Hélène Draux, Research Data Scientist at Digital Science, and Dr Suze Kundu, Head of Public Engagement at Digital Science.

We have released two reports looking at gender representation in research. The first report looked at gender imbalance in cancer research funding and types of cancer research being carried out, while the second report revealed an interactive tool to visualise gender balance in researchers across all UK institutions. The findings of this second report reinforced the need for days such as today, given that there was a gender imbalance across all fields of research, including the Arts and Humanities, which only became more pronounced as fields of research moved towards the Sciences and Engineering.

For this report we used a freely available Gender Guesser Python package to analyse Dimensions data on the gender of researchers. However, the package has a range of limitations. One example, identified in our second report, was that the number of researchers whose gender could not be identified increased as we moved from Arts and Humanities fields of research into STEM fields of research, which generally have a more ethnically diverse range of researchers.

Building a Gender Identifier

Most freely available automatic gender identifier tools or gender guessers rely on English or Western names. They either use the probability of being a man or woman, based on frequency of gender attribution in the US census, or attempt to guess genders based on letter patterns. Given the globalisation of research and mobility of researchers, we did not find these methods reliable enough, and did not want to identify the genders manually. Instead, we downloaded the data for ‘persons’ from Wikidata (the structured data derived from Wikipedia), and counted, for each name, the number of people identified as a man or as a woman. Relying on crowdsourced data has its inconveniences, and we did find vandalised pages, but this was only marginal.

We only included people born after 1940, assuming that the gender associated with a name can change over time. We initially used the field “given_name” from Wikidata as the first name, but realised that many pages describing Asian people had not filled in this field (e.g. 92% of South Korean, 91% of Chinese, and 70% of Indian names). To consolidate the data, when there was no data in the field “given_name” we used the first part of the name of the person. We acknowledge that in some of these countries the given name is not conventionally the first part of the name. However, we used the English version of the name and can reasonably expect that contributors had followed the Western convention of putting the first name first. When this did not happen, or when the first word was a title (e.g. Queen Victoria), this simply created a name (e.g. Queen) that did not exist and would not influence our data matching with existing names.
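The fallback described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code we ran: the field “given_name” comes from Wikidata as described, but the record layout and the “label” and “birth_year” keys are assumptions made for the example.

```python
def born_after_1940(record):
    """Keep only people born after 1940 (the 'birth_year' key is hypothetical)."""
    year = record.get("birth_year")
    return year is not None and year > 1940


def first_name(record):
    """Return the first name for a person record, or None if none can be found.

    Prefers the 'given_name' field; when it is empty, falls back to the first
    word of the English name (the 'label' key is hypothetical).
    """
    given = (record.get("given_name") or "").strip()
    if given:
        return given
    parts = (record.get("label") or "").split()
    return parts[0] if parts else None
```

Called on a record with no “given_name” and the name “Queen Victoria”, the function returns “Queen”, which matches no researcher's first name and is therefore harmless when matching.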

We attributed names to three categories: women, men, and unknown. A name was attributed to a gender if it was shared by more than 10 people and at least 80% of them were of that gender, or if it was shared by more than 1 person and 100% of them were of that gender. Names falling outside these ranges (often unisex names) were categorised as unknown.
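As a rough sketch, the rule can be written as a small Python function. The function name and its inputs are hypothetical, and treating the threshold as "at least 80%" is our reading for this example; integer arithmetic is used to avoid rounding issues at the boundary.

```python
def classify_name(women, men):
    """Classify a first name from the number of women and men who carry it.

    Returns 'woman', 'man', or 'unknown'.
    """
    total = women + men
    majority_count = max(women, men)
    majority = "woman" if women > men else "man"

    # More than 10 people, and at least 80% of them of one gender.
    if total > 10 and 5 * majority_count >= 4 * total:
        return majority
    # More than 1 person, and all of them of one gender.
    if total > 1 and majority_count == total:
        return majority
    # Too few people, or a mixed (often unisex) name.
    return "unknown"
```

For example, a name carried by 9 women and 2 men would be classified as a woman's name, a name carried by 2 men only as a man's name, and a name carried by a single person as unknown.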

Limitations

Data source

Dimensions’ data quality is heavily correlated with the quality of the data source. This means that in some instances the first name has not been properly identified (e.g. if authors have inverted first names and last names, or only the initial is available). In many cultures, it is customary for women to change their family names when they marry. In this case, if they change their name to their husband’s, or if they attach both names together, it will not be possible to uniquely identify them. It is therefore likely that our methodology will over-count the number of women, as the same researcher may be counted under more than one name.

Spelling Variability

Names from languages which do not use the Latin alphabet will have been transliterated into the Latin alphabet, both to fit into Wikidata and when publishing. Since there is no universal transliteration, this can lead to variations which cannot be resolved with simple letter matching. Therefore, for fields where there is a higher percentage of researchers from countries that do not use the Latin alphabet, it will be harder to guess the gender from the name alone.

Country Variability

Genders are not always fixed to names. Some names are unisex, making it impossible to guess the gender from the name alone. There is also variability between countries, where a name can be fixed to one gender in one country (e.g. Simone is a man’s name in Italy), but to another gender in another (e.g. Simone is a woman’s name in France). Since researchers are so mobile, using the country of publication would not necessarily help, as researchers with such names might have moved to a country where their gender would be guessed differently.

Gender as Binary

Finally, we have made the assumption that gender is binary: the gender guesser tool assigns names as man, woman, or unknown, with no regard for gender being considered a spectrum. This could further exclude some researchers who have chosen gender-neutral pronouns, who identify as a gender other than male or female, or who have chosen not to specify their gender.

Further development to improve our understanding of gender inclusion

It is widely agreed that gender equality has not been achieved in most, if not all, research fields. We created the two gender reports to point out and dissect the figures. We believe that more data would bring better understanding, acceptance, and willingness to change. This data could come through a more inclusive dataset, which could be crowdsourced in order to include different cultural aspects, and also from institutions being more open about their research staff demographics. With more transparency would come more ownership of the problem.

If you think you could contribute to the refinement of our data or our programs, we would love to hear from you. We are planning to hold a programming Hackathon for Ada Lovelace Day 2019 in October. During this event we hope to contribute to the creation of a more refined tool capable of identifying the gender of researchers more accurately and inclusively. This would give us a clearer picture of the current demographic of the research landscape across all fields of research, and help us address the challenges of underrepresentation with positive and better informed actions.