Week 8/12 #DataScienceBootcamp

Week 8 (16.11.-20.11.)

  • Topic: Markov Simulation
  • Lessons: Markov Chains, Vector & Matrix Multiplication, Python Classes, NumPy Arrays, Program Design, Composition & Inheritance, PEP8
  • Project: Simulate the paths of new customers in a supermarket
  • Dataset: internal
  • Code: GitHub

The goal of this project was to create an Markov Chain Monte Carlo (MCMC) simulation of new customers in a supermarket, based on data about customer paths from entrance to checkout through four aisles (fruit, drinks, dairy, and spices) recorded on five days from 7 am to 10 pm.

This project was particularly challenging for two reasons: it involved OOP and team work. So in this post I’ll give you an overview of our workflow.

Planning the project

I teamed up with two colleagues and right from the start we had to plan and divide our tasks. There are many things to track when working on a team project, but they can be classified into three main categories:

IdeaDesignImplementation
user storystate diagramcode-debug cycle
requirements.txtML model architectureskeleton code
workflow trackerclass diagramprototype
entity-relationship diagrambaseline models
flowchart
data flow diagram
pseudocode

Creating the simulation

Markov-Chain Monte Carlo (MCMC) is a method used to approximate the posterior distribution of a parameter of interest by random sampling in a probabilistic space. As the name says, it combines two properties:

  • Markov-Chain uses each random sample (generated by a sequential process) as a stepping stone to generate the next random sample. While each new sample depends on the one before it, new samples do not depend on any samples before the previous one.
  • Monte Carlo estimates the properties of a distribution by examining random samples from the distribution.

The first step in our project was to calculate the transition probability matrix, i.e. the probability that a customer moves from one aisle to another. For example, in our data there was a 28% probability that a customer goes from drinks to dairy, or 51% from fruit to checkout.

Next, we had to generate new customers and simulate their paths based on the calculated transition probabilities. And here comes OOP! We created three classes:

  • a Customer class that simulates a new customers and their path from entrance to checkout,
  • a Supermarket class that manages multiple Customer instances that are currently in the market,
  • a SupermarketMap class which visualizes the supermarket background. We used numpy and OpenCV to make the supermarket visualization, by basically creating and slicing arrays, and inserting icons in specific tiles.

To calculate the customers’ path from entrance to checkout, we used the A* (star) algorithm. This is a search algorithm used for finding the shortest path in environments where there are obstacles along the way (like walls or aisles in the supermarket). It works as a weighted graph, which in this case is our supermarket with 8 nodes (entrance, fruit, spices, dairy, drinks, three checkouts) and edges (paths) between them. A* starts from a specific starting node of a graph (entrance) and aims to find a path to the given goal node (checkout) having the smallest cost (least distance travelled, shortest time, etc.). Specifically, A* selects the path that minimizes

f (n) = g(n) + h(n)

  • n is the next node on the path,
  • g(n) is the cost of the path from the start node to n,
  • h(n) is a heuristic function that estimates the cost of the cheapest path from n to the goal.

A* terminates when the path it chooses to extend is a path from start to goal or if there are no paths eligible to be extended.

Friday Lightning Talk

At the end of the week, we gave a team presentation of our project. I presented the exploratory data analysis and visualization with Plotly, while my colleagues talked about the implementation of the A* algorithm and our experience and lessons learned from pair programming.

Comments are closed.

Website Powered by WordPress.com.

Up ↑

%d bloggers like this: