Reverb Data Engineering Coding Challenge

This technical challenge is focused on the issues faced by data engineers at Reverb. There is no one correct approach - the task outline below is meant to start a conversation about domain modeling and tradeoffs.

Background

The Reverb platform generates an immense amount of log data that data engineers are responsible for collecting, parsing, and analyzing. The provided file contains page view logs that look like this:

{
  "userId": 1,
  "sessionId": "{some-uuid}",
  "timestamp": "2018-01-01T00:00:00.000000",
  "url": "/foo/bar",
  "experiments": ["baz", "boz"]
}

In this example line, we can see that each log is identified by a userId and a sessionId. A user can make multiple separate page views within the same session. We can also see that the timestamp of the page view is recorded along with the URL, as well as identifiers for any A/B tests the user may be a part of. Note that although the example here is split across multiple lines for readability, in the file the logs are written one per line. You can assume that the timestamps increase monotonically.
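
As a starting point for the modeling conversation, here is a minimal Python sketch of turning one log line into a typed record. The names PageView and parse_page_view are hypothetical and not part of the challenge. The sketch assumes every line is a complete JSON object with the fields shown above, and that a missing "experiments" key can be treated as an empty list.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class PageView:
    """One page view log line (hypothetical model, not prescribed by the challenge)."""
    user_id: int
    session_id: str
    timestamp: datetime
    url: str
    experiments: Tuple[str, ...]


def parse_page_view(line: str) -> PageView:
    """Parse a single JSON log line into a PageView."""
    record = json.loads(line)
    return PageView(
        user_id=record["userId"],
        session_id=record["sessionId"],
        # The sample timestamp is ISO 8601 with microseconds and no timezone.
        timestamp=datetime.fromisoformat(record["timestamp"]),
        url=record["url"],
        # Assumption: a missing "experiments" key means no experiments.
        experiments=tuple(record.get("experiments", [])),
    )
```

Because the logs are written one per line, a file can be processed lazily by calling parse_page_view on each line as it is read, rather than loading the whole file into memory.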

Specifications

We would like to know a couple of key statistics about our user base that can be computed from the provided data:

  1. What are the top 5 most accessed URLs on the site?
  2. What is the average session time in milliseconds across all users? The median? The max? The min?
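
One possible way to compute both statistics is sketched below in Python. The challenge does not define "session time", so this sketch assumes it is the gap between the first and last page view sharing the same (userId, sessionId) pair, which makes a single-view session last 0 ms. That definition, and the function names top_urls, session_durations_ms, and summarize, are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Dict, Iterable, List, Tuple


def top_urls(urls: Iterable[str], n: int = 5) -> List[str]:
    """Return the n most frequently seen URLs, most viewed first."""
    return [url for url, _count in Counter(urls).most_common(n)]


def session_durations_ms(
    views: Iterable[Tuple[int, str, datetime]]
) -> List[int]:
    """Return one duration in milliseconds per (user_id, session_id) pair.

    Assumption: a session lasts from its first to its last page view.
    """
    bounds: Dict[Tuple[int, str], Tuple[datetime, datetime]] = {}
    for user_id, session_id, ts in views:
        key = (user_id, session_id)
        first, last = bounds.get(key, (ts, ts))
        bounds[key] = (min(first, ts), max(last, ts))
    one_ms = timedelta(milliseconds=1)
    # Floor division of two timedeltas yields a whole number of milliseconds.
    return [(last - first) // one_ms for first, last in bounds.values()]


def summarize(durations: List[int]) -> Dict[str, float]:
    """Compute the four requested statistics over the session durations."""
    return {
        "median": median(durations),
        "mean": mean(durations),
        "max": max(durations),
        "min": min(durations),
    }
```

Whether zero-length sessions should count toward the mean and the minimum, and whether the mean should be rounded, are tradeoffs worth stating explicitly in a submission.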

The program should print its calculations to STDOUT according to the following specification:

{
  "most_viewed_urls": ["url1", "url2", "url3", "url4", "url5"],
  "session_stats": {
    "median": 13942,
    "mean": 10124,
    "max": 34031,
    "min": 921
  }
}

The keys do not have to be in this order so long as they are all present. Note that the units are in milliseconds.
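
A short Python sketch of producing this output follows. The function name format_report is hypothetical, and the values printed in the example run are the placeholder numbers from the specification above, not real results.

```python
import json
from typing import Dict, List


def format_report(most_viewed_urls: List[str], session_stats: Dict[str, float]) -> str:
    """Serialize the results into the JSON shape required by the specification."""
    report = {
        "most_viewed_urls": list(most_viewed_urls),
        # Indexing each key explicitly fails loudly if a statistic is missing.
        "session_stats": {
            key: session_stats[key] for key in ("median", "mean", "max", "min")
        },
    }
    return json.dumps(report, indent=2)


if __name__ == "__main__":
    # Placeholder values copied from the specification example.
    print(format_report(
        ["url1", "url2", "url3", "url4", "url5"],
        {"median": 13942, "mean": 10124, "max": 34031, "min": 921},
    ))
```

Building the report as a dictionary and serializing it once keeps the output valid JSON regardless of key order.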

Regardless of your implementation language, provide an executable called run_analysis that executes your code and prints the result.

Our criteria

We believe in transparency, so here are the criteria we'll be using to evaluate each submission:

  • Functionality — Does your program work according to the specifications of the problem?
  • Modeling — Do your data structures fit the business objects in the problem and is the program's control flow clear?
  • Documentation — Is your code documented in the form appropriate to your implementation language?
  • Language Use — Do you make good use of the features available in the language you chose?
  • Testing — Did you include tests that explain and reinforce the design of your code?

Good luck!