Going about doctoral thesis writing, love hate relationship, the thought occurred to me perhaps the very field I’m writing on about has already lived past its Golden Era, or has it?

A knee jerk reaction then was to see if there’s any python or R package which allows me to search the abstracts with the keyword microbiome… this turned up pubmed_parser. However, in order to get this running, I will have to first download a few gigabytes of abstracts in XML from the Open Access subset of pubmed abstracts and run my own pyspark…

NOPE! Not going there!

Then it occurred to me perhaps NCBI has something I could use… in the previous post we talked about the esearch API. Hmmm, this could be useful.

So below’s the script which lets me do this:

Libraries

using the usual tidyverse, with rvest for XML parsing and artyfarty to spruce up the plot

suppressPackageStartupMessages({
    library(tidyverse)
    library(magrittr)
    library(rvest) # for XML
    library(artyfarty) #because theme_bw is too boring
})

NCBI ESEARCH

Here’s the NBCI’s esearch API. Within it, there’s the date range options, mindate, maxdate.

mindate, maxdate API filter

Date range used to limit a search result by the date specified by datetype. These two parameters (mindate, maxdate) must be used together to specify an arbitrary date range. The general date format is YYYY/MM/DD, and these variants are also allowed: YYYY, YYYY/MM.

So we will be searching between 1997 to 2017, a 20 year period.

So lets begin …..

Keyword: Microbiome

with other synonyms microbiota

#microbiome
api="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
query="db=pubmed&term=%s&mindate=%s&maxdate=%s"
searchTerm=paste0(api, query)

keyword = "microbiome"

df = mapply(function(start, end){
            count = read_xml(sprintf(searchTerm, keyword, start, end)) %>%
            as_list %$%
            Count %>%
            unlist
       tibble(count, start, end)
},  start = 1997:2016,
    end = 1998:2017,
    SIMPLIFY=FALSE
) %>% do.call(rbind,.)

Keyword: Cancer

Used as a comparison

keyword = "cancer"

df2 = mapply(function(start, end){
            count = read_xml(sprintf(searchTerm, keyword, start, end)) %>%
            as_list %$%
            Count %>%
            unlist
       tibble(count, start, end)
},  start = 1997:2016,
    end = 1998:2017,
    SIMPLIFY=FALSE
) %>% do.call(rbind,.)

Putting the two together before we start plotting

df %<>% setNames(c("microbiome", "start", "end"))
df %<>% mutate(cancer = as.integer(df2$count))
df$microbiome %<>% as.integer
df %<>% select(start, end, microbiome, cancer)
df

As you can the see the order is slighly different between the two, you’ll probably have to do some scaling.

start	end	microbiome	cancer
1997	1998	91	116522
1998	1999	110	124613
1999	2000	133	131481
2000	2001	149	139577
2001	2002	196	153651
2002	2003	249	166393
2003	2004	304	170676
2004	2005	419	181504
2005	2006	576	190710
2006	2007	744	198618
2007	2008	955	210488
2008	2009	1285	219686
2009	2010	1741	231079
2010	2011	2610	248046
2011	2012	3899	265171
2012	2013	5607	281240
2013	2014	8211	308483
2014	2015	10951	331775
2015	2016	13439	331631
2016	2017	14058	285408

Since version 2.2.0 of ggplot2, Hadley has included the sec_axis function in the library which lets you add a secondary axis as long as it’s amenable to a straight forward transformation.

ggplot(df, aes(x=end)) +
    geom_line(aes(y=microbiome, color="Microbiome"), size=1.1) +
    geom_line(aes(y=cancer/20, color="Cancer"), size=1.5, linetype="dotted") + 
    # manipulated the cancer values by dividing by 20
    scale_x_continuous(breaks=1998:2017)+
    scale_y_continuous(sec.axis = sec_axis(~.*20, name = "Number of Publications [Cancer]"))+ 
    # restores the division
    # lets we set the axis title
    scale_color_manual("Search Terms",values = pal("five38"))+
    theme_scientific() +
    theme(axis.text.x=element_text(angle=90), 
          legend.position = c(0.9, 0.2)) +
    xlab("Year") + ylab("Number of Publications [Microbiome]")

publications-with-keyword-microbiome

There you have it guys, on the left y-axis the publication count with the keyword “microbiome” and its synonyms like “microbiota” and on the right y-axis the counts for abstract with the keyword “cancer”. As you can see, the growth in publications/articles revolving around microbiome or at least associated to it have been growing at breakneck pace faster than cancer, almost exponential.

For those astute enought, you’ll notice a dip in 2017 for cancer, and the trend is slowing down for microbiome, that’s just cause we haven’t reached the end of 2017 yet, close 😉 but definitely more papers on their way.

Hope this will be helpful for future students! Cheers