Has the ship sailed for Microbiome research?

While going about my doctoral thesis writing (a love-hate relationship), the thought occurred to me that perhaps the very field I’m writing about has already lived past its Golden Era. Or has it?

My knee-jerk reaction was to see if there’s any Python or R package which lets me search abstracts for the keyword microbiome… this turned up pubmed_parser. However, to get it running, I would first have to download a few gigabytes of abstracts in XML from the Open Access subset of PubMed and run my own PySpark…

NOPE! Not going there!

Then it occurred to me perhaps NCBI has something I could use… in the previous post we talked about the esearch API. Hmmm, this could be useful.

So below’s the script which lets me do this:


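Using the usual tidyverse, with xml2 (the XML parser that rvest builds on) for XML parsing and artyfarty to spruce up the plot:

    library(tidyverse) # tibble, dplyr, ggplot2
    library(magrittr)  # for the %$% and %<>% pipes
    library(xml2)      # read_xml() and friends
    library(artyfarty) # because theme_bw is too boring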


Here’s NCBI’s esearch API. Within it, there are the date range options mindate and maxdate.

mindate, maxdate API filter

Date range used to limit a search result by the date specified by datetype. These two parameters (mindate, maxdate) must be used together to specify an arbitrary date range. The general date format is YYYY/MM/DD, and these variants are also allowed: YYYY, YYYY/MM.
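The script below calls sprintf on a searchTerm template built from two variables, api and query, that are not shown in this post. Here is a minimal sketch of what they could look like. The exact parameter choices (db=pubmed, datetype=pdat for publication date, rettype=count to get only the hit count) are my assumptions, not necessarily the original values.

```r
# Assumed definitions: `api` and `query` are used later in the post but never shown
api = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Three %s placeholders, filled in order: keyword, mindate, maxdate
query = "?db=pubmed&term=%s&datetype=pdat&mindate=%s&maxdate=%s&rettype=count"

searchTerm = paste0(api, query)

# Build (but do not fetch) the URL for one search window
example_url = sprintf(searchTerm, "microbiome", 1997L, 1998L)
print(example_url)
```

Pasting the printed URL into a browser should return a small XML document whose Count node holds the number of matching records.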

So we will be searching from 1997 to 2017, a 20-year period.

So let’s begin…

Keyword: Microbiome

along with synonyms such as microbiota

searchTerm=paste0(api, query)

keyword = "microbiome"

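df = mapply(function(start, end){
            # one esearch request per window; the hit count sits in the first <Count> node
            count = read_xml(sprintf(searchTerm, keyword, start, end)) %>%
                xml_find_first("//Count") %>%
                xml_text()
            tibble(count, start, end)
},  start = 1997:2016,
    end = 1998:2017,
    SIMPLIFY = FALSE  # keep a list of tibbles so rbind can stack them
) %>% do.call(rbind,.)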

Keyword: Cancer

Used as a comparison

keyword = "cancer"

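df2 = mapply(function(start, end){
            # same request as above, now with keyword = "cancer"
            count = read_xml(sprintf(searchTerm, keyword, start, end)) %>%
                xml_find_first("//Count") %>%
                xml_text()
            tibble(count, start, end)
},  start = 1997:2016,
    end = 1998:2017,
    SIMPLIFY = FALSE  # keep a list of tibbles so rbind can stack them
) %>% do.call(rbind,.)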

Putting the two together before we start plotting

df %<>% setNames(c("microbiome", "start", "end"))
df %<>% mutate(cancer = as.integer(df2$count))
df$microbiome %<>% as.integer
df %<>% select(start, end, microbiome, cancer)

As you can see, the order of magnitude is rather different between the two, so you’ll have to do some scaling.

start end microbiome cancer
1997 1998 91 116522
1998 1999 110 124613
1999 2000 133 131481
2000 2001 149 139577
2001 2002 196 153651
2002 2003 249 166393
2003 2004 304 170676
2004 2005 419 181504
2005 2006 576 190710
2006 2007 744 198618
2007 2008 955 210488
2008 2009 1285 219686
2009 2010 1741 231079
2010 2011 2610 248046
2011 2012 3899 265171
2012 2013 5607 281240
2013 2014 8211 308483
2014 2015 10951 331775
2015 2016 13439 331631
2016 2017 14058 285408

Since version 2.2.0 of ggplot2, Hadley has included the sec_axis function, which lets you add a secondary axis as long as it is a straightforward one-to-one transformation of the primary axis.

ggplot(df, aes(x=end)) +
    geom_line(aes(y=microbiome, color="Microbiome"), size=1.1) +
    # squeeze the cancer counts onto the microbiome scale by dividing by 20
    geom_line(aes(y=cancer/20, color="Cancer"), size=1.5, linetype="dotted") +
    # the secondary axis multiplies by 20 again so its labels show the real cancer counts;
    # name sets its axis title
    scale_y_continuous(sec.axis = sec_axis(~.*20, name = "Number of Publications [Cancer]")) +
    scale_color_manual("Search Terms", values = pal("five38")) +
    theme_scientific() +
    # place the legend inside the plot area, bottom right
    theme(legend.position = c(0.9, 0.2)) +
    xlab("Year") + ylab("Number of Publications [Microbiome]")
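The factor of 20 in the plot is a hand-picked round number. As a rough sketch of how you might arrive at such a factor from the data instead of eyeballing it, you can compare the peaks of the two series and round the ratio to something tidy. The variable names here (ratio, scale_factor) are mine, and the counts are copied from the table earlier in the post.

```r
# Counts copied from the table in this post
microbiome = c(91, 110, 133, 149, 196, 249, 304, 419, 576, 744,
               955, 1285, 1741, 2610, 3899, 5607, 8211, 10951, 13439, 14058)
cancer = c(116522, 124613, 131481, 139577, 153651, 166393, 170676, 181504,
           190710, 198618, 210488, 219686, 231079, 248046, 265171, 281240,
           308483, 331775, 331631, 285408)

# How many times taller the cancer peak is than the microbiome peak
ratio = max(cancer) / max(microbiome)

# Round to the nearest ten to get a tidy multiplier for the secondary axis
scale_factor = round(ratio, -1)

print(ratio)
print(scale_factor)
```

The same scale_factor would then be used in both places: cancer/scale_factor in geom_line and ~.*scale_factor in sec_axis, so the two always stay in sync.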


There you have it, guys: on the left y-axis the publication count for the keyword “microbiome” and its synonyms like “microbiota”, and on the right y-axis the count for abstracts with the keyword “cancer”. As you can see, publications revolving around the microbiome, or at least associated with it, have been growing at a breakneck pace, faster than cancer and almost exponentially.

For those astute enough, you’ll notice a dip in 2017 for cancer and a slowdown for microbiome. That’s just because we haven’t reached the end of 2017 yet. Close 😉, but there are definitely more papers on their way.
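To put a rough number on “faster than cancer”, here is a small sketch that computes an average compound growth rate per step between the first search window and the last complete one, leaving out the final window because 2017 was still in progress. The cagr helper is my own illustration, not part of the original script, and the counts are taken from the table above.

```r
# Hypothetical helper: average compound growth per step between two counts
cagr = function(first, last, n_steps) (last / first)^(1 / n_steps) - 1

# The first window ends in 1998 and the last complete one in 2016: 18 yearly steps
n_steps = 2016 - 1998

microbiome_growth = cagr(91, 13439, n_steps)
cancer_growth     = cagr(116522, 331631, n_steps)

print(round(microbiome_growth, 3))
print(round(cancer_growth, 3))
```

On these counts that works out to roughly 32% per step for microbiome against about 6% for cancer, which is consistent with what the plot shows.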

Hope this will be helpful for future students! Cheers

