Sparklyr Joy
When you have a tonne of event logs to parse, what should your go-to weapon of choice be? In this article I'll share our experience using Spark and sparklyr to tackle this.
At Honestbee 🐝, our event logs are stored in AWS S3, delivered to us by Segment at 40-minute intervals. The Data Science team uses these logs to evaluate and compare the performance of our machine learning models, i.e. canonical A/B testing.
We also use the same logs to track business KPIs such as Click-Through Rate, Conversion Rate and GMV.
Specifically, I will share how we leverage high-memory clusters running Spark to parse the logs generated by our Food Recommender System.
Fig: Whenever an Honestbee customer proceeds to checkout, our ML models do their best to make personalised predictions about which items the customer is most likely to add to cart, especially things they may have missed.
A post mortem requires us to look through the event logs to see which treatment group a user was assigned to, based on a weighted distribution.
Now, LET'S DIVE IN!
Let's begin by importing the necessary libraries.
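We only need two packages for everything below, sparklyr for the Spark interface and dplyr for the data manipulation verbs (a minimal set; add others as you need them):

library(sparklyr)  # Spark interface
library(dplyr)     # data manipulation verbs used throughout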
Connecting to the high-memory Spark cluster
Next, we’ll need to connect with the Spark master node.
Local Cluster
Normally, if you're connecting to a locally installed Spark cluster, you'll set master to local. Luckily, sparklyr already comes with a built-in function to install Spark on your local machine:
sparklyr::spark_install(
version = "2.4.0",
hadoop_version = "2.7"
)
We install Hadoop together with Spark because the module required to read files from the S3 filesystem ships with Hadoop.
Next, you'll connect to the cluster and establish a Spark connection, sc.
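A minimal sketch of a local connection (the driver-memory setting is an assumption, tune it to your machine):

conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "8G"  # assumed value, adjust to your instance

sc <- spark_connect(
  master  = "local",
  version = "2.4.0",
  config  = conf
)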
Caution: At Honestbee we do not have a local cluster; the closest we have is a LARGE EC2 instance, which sometimes gives out. You probably want a managed cluster set up by data engineers or a third-party vendor who knows how to deal with cluster management.
Remote Clusters
Alternatively, there's the option of connecting to a remote cluster via a REST API, i.e. the R process runs not on the master node but on a remote machine. Often these clusters are managed by third-party vendors. At Honestbee, we chose this option and the clusters are provisioned by Qubole under our AWS account. PS: pretty good deal!
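The connection gist itself isn't reproduced here, but a rough sketch of a Livy connection looks something like this (the endpoint URL and credentials are placeholders, not our actual setup):

sc <- spark_connect(
  master  = "https://<your-livy-endpoint>",   # placeholder Livy endpoint
  method  = "livy",
  version = "2.4.0",
  config  = livy_config(
    username = Sys.getenv("LIVY_USER"),       # assumed: credentials via env vars
    password = Sys.getenv("LIVY_PASSWORD")
  )
)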
The connection code above sets up a Spark connection, sc; you will need this object in most of the functions that follow.
Separately, because we are reading from S3, we have to set the S3 access key and secret. This must be done before executing functions like spark_read_json.
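One way to do that (a sketch, assuming the credentials live in environment variables) is to set them on the Hadoop configuration of the underlying SparkContext:

ctx   <- spark_context(sc)
hconf <- invoke(ctx, "hadoopConfiguration")
invoke(hconf, "set", "fs.s3a.access.key", Sys.getenv("AWS_ACCESS_KEY_ID"))
invoke(hconf, "set", "fs.s3a.secret.key", Sys.getenv("AWS_SECRET_ACCESS_KEY"))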
So what are the pros and cons of each? A local cluster is generally better for interactive EDA, since with a remote cluster every operation has to go through the REST API (Livy), which adds latency.
Reading JSON logs
There are essentially two ways to read the logs: in whole batches, or as a stream as they get dumped into your bucket.
There are two functions, spark_read_json and stream_read_json: the former is batched and the latter creates a structured data stream. There are also equivalents (spark_read_parquet and stream_read_parquet) for reading your Parquet files.
Batched
The path should be set with the s3a protocol:
s3a://segment_bucket/segment-logs/<source_id>/1550361600000
json_input <- spark_read_json(
  sc        = sc,
  name      = "logs",
  path      = s3,
  overwrite = TRUE
)
Below is where the magic begins. As you can see, it's a simple query (a sketch of the pipeline follows the list below):
- Filter for all Added to Cart events from the Food vertical
- Select the following columns: CartID, experiment_id, variant (the treatment group) and timestamp
- Remove events where users were not assigned to a model
- Add new columns: fulltime (a human-readable time) and time (the hour of the day)
- Group the logs by recommender service and count the number of rows
- Add a new column event with the value Added to Cart
- Sort by time
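Since the query itself lives in a gist, here is a sketch of what that pipeline might look like in dplyr terms (column names such as vertical, cart_id and variant are assumptions about the Segment schema, as is the timestamp parsing):

hourly_counts <- json_input %>%
  # keep only Added to Cart events from the Food vertical
  filter(event == "Added to Cart", vertical == "Food") %>%
  # keep the columns we care about (names are assumed)
  select(cart_id, experiment_id, variant, timestamp) %>%
  # drop events where the user was not assigned to a model
  filter(!is.na(variant)) %>%
  # fulltime: a readable timestamp; time: the hour of the day
  mutate(
    fulltime = to_timestamp(timestamp),
    time     = hour(fulltime)
  ) %>%
  # count rows per recommender (variant) and hour
  group_by(variant, time) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  # tag the rows with the event name and sort by time
  mutate(event = "Added to Cart") %>%
  arrange(time)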
Spark Streams
Alternatively, you can run the same manipulation on a structured Spark stream and write the results out as the logs arrive.
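A sketch of the streaming variant (same caveats about column names as above): read the bucket as a stream, apply the same kind of aggregation, and expose the result as an in-memory table that can then be queried with tbl:

stream_read_json(sc, path = s3, name = "logs_stream") %>%
  filter(event == "Added to Cart", vertical == "Food") %>%
  mutate(expt = variant) %>%   # assumed mapping of treatment group to expt
  group_by(expt) %>%
  summarise(n = n()) %>%
  # running aggregations need the "complete" output mode for a memory sink
  stream_write_memory(name = "data_stream", mode = "complete")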
You can preview the results from the stream using the tbl function coupled with glimpse.
sc %>%
tbl("data_stream") %>%
glimpse
Observations: ??
Variables: 2
Database: spark_connection
$ expt <chr> "Model_A", "Model_B"
$ n <dbl> 5345, 621
And that's it, folks, on using sparklyr with your event logs.
Model Metadata
With that many models in the wild, it's hard to keep track of what's going on. During my PhD I worked on using graph databases to store data with complex relationships, and we are currently building such a system to store metadata related to our models.
For example:
- Which APIs the models are associated with
- Which Airflow / Argo jobs retrain these models
- What Helm charts and deployment metadata these models have
- And of course metadata like performance and scores
Come talk to us, we are hiring! Data Engineer, Senior Data Scientist