markovclick


API

Documentation for markovclick API.

Dummy functions

API documentation for markovclick.dummy.

Functions for generating dummy content.

markovclick.dummy.gen_random_clickstream(n_of_streams: int, n_of_pages: int, length: tuple = (8, 12)) → list[source]

Generates a list of random clickstreams, to use in absence of real data.

Parameters:
  • n_of_streams (int) – Number of unique clickstreams to generate
  • n_of_pages (int) – Number of unique pages from which to generate clickstreams.
  • length (tuple) – Range of length for each clickstream.
Returns:

List of clickstreams, to use as dummy data.

Return type:

list
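As a rough illustration of the kind of data described above, the following standard-library sketch builds random clickstreams. The function name sketch_random_clickstreams, the seed argument, the 'P<number>' page codes, and the treatment of length as an inclusive range are assumptions made for illustration. This is not the package's own implementation.

```python
import random


def sketch_random_clickstreams(n_of_streams, n_of_pages, length=(8, 12), seed=None):
    """Hypothetical stand-in: returns a list of lists of page codes."""
    rng = random.Random(seed)
    # Assumption: pages are encoded as 'P1', 'P2', ... as the docs suggest.
    pages = [f"P{i}" for i in range(1, n_of_pages + 1)]
    streams = []
    for _ in range(n_of_streams):
        # Assumption: both ends of the length range are inclusive.
        n_clicks = rng.randint(length[0], length[1])
        streams.append([rng.choice(pages) for _ in range(n_clicks)])
    return streams


streams = sketch_random_clickstreams(n_of_streams=5, n_of_pages=4, seed=0)
```

Each inner list is one clickstream, so the result can be passed around in the same shape that the model class is documented to accept.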

Models

API documentation for markovclick.models.

Models module which holds MarkovClickstream model.

class markovclick.models.MarkovClickstream(clickstream_list: list = None, prefixed=True)[source]

Builds a Markov chain from input clickstreams.

Parameters:clickstream_list (list) – List of clickstream data. Each page should be encoded as a string, prefixed by a letter e.g. ‘P1’
calc_prob_all_routes_to(clickstream: list, end_page: str, clicks: int, cartesian_product=True)[source]

Calculates the probability given an input sequence of page clicks, to reach the specified end state with the specified number of transitions before the end state.

Parameters:
  • clickstream (list) – List (sequence) of states
  • end_page (str) – Desired end state to calculate probability towards
  • clicks (int) – Number of transitions to make after input sequence, before reaching end state.
Returns:

Probability

Return type:

float
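One possible reading of this calculation is to enumerate every sequence of intermediate pages, multiply the transition probabilities along each route, and sum the results. The sketch below follows that reading with a plain dictionary of probabilities. The function name prob_all_routes, its arguments, and the example values are hypothetical, and the package's method may differ in detail.

```python
from itertools import product


def prob_all_routes(prob, start_page, end_page, n_intermediate):
    """Sum the probabilities of all routes from start_page to end_page
    that pass through exactly n_intermediate pages in between."""
    pages = list(prob)
    total = 0.0
    for middle in product(pages, repeat=n_intermediate):
        route = [start_page, *middle, end_page]
        p = 1.0
        for current_page, next_page in zip(route, route[1:]):
            p *= prob[current_page][next_page]
        total += p
    return total


# Hypothetical transition probabilities; each row sums to 1.
prob = {
    "P1": {"P1": 0.5, "P2": 0.5},
    "P2": {"P1": 0.25, "P2": 0.75},
}

# Routes P1 -> P1 -> P2 (0.25) and P1 -> P2 -> P2 (0.375).
total = prob_all_routes(prob, "P1", "P2", n_intermediate=1)
```

The number of routes grows exponentially with the number of intermediate pages, so this brute-force enumeration is only practical for short look-aheads.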

calc_prob_to_page(clickstream: list, verbose=True) → float[source]

Calculates the probability for a sequence of clicks (clickstream) taking place.

Parameters:
  • clickstream (list) – Sequence of clicks (pages), for which to calculate the probability of occurring.
  • verbose (bool, optional) – Defaults to True. Specifies whether the output is printed to the terminal, or simply provided back.
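Under a first-order Markov assumption, the probability of a click sequence is the product of the transition probabilities between consecutive pages. A minimal sketch of that idea, using a hypothetical dictionary of probabilities and a hypothetical function name rather than the package's own objects:

```python
def sequence_probability(prob, clickstream):
    """Multiply the transition probabilities along consecutive page pairs."""
    p = 1.0
    for current_page, next_page in zip(clickstream, clickstream[1:]):
        p *= prob[current_page][next_page]
    return p


# Hypothetical transition probabilities; each row sums to 1.
prob = {
    "P1": {"P1": 0.5, "P2": 0.5},
    "P2": {"P1": 0.25, "P2": 0.75},
}

p = sequence_probability(prob, ["P1", "P2", "P2"])  # 0.5 * 0.75
```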
calculate_pagerank(max_nodes: int = 2, pr_kwargs: dict = {}) → Tuple[networkx.classes.digraph.DiGraph, dict][source]

Calculates the Google PageRank for each of the pages in the Markov chain.

Converts the Markov chain into a directed graph using networkx, and uses its built in functions to calculate the PageRank score for each page represented as a node in the graph.

Parameters:
Returns:

networkx DiGraph object, and associated PageRank scores for each page (node in DiGraph).

Return type:

Tuple[nx.DiGraph, dict]
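To give a feel for what a PageRank score over a transition matrix represents, here is a plain power-iteration sketch in pure Python. It deliberately does not use networkx, so it is not the computation this method performs. The function name, the damping value of 0.85, the fixed iteration count, and the example probabilities are all assumptions for illustration.

```python
def pagerank_sketch(prob, damping=0.85, iterations=100):
    """Power iteration over a dict-of-dicts transition matrix.

    Assumes every row of prob sums to 1 (no dangling pages).
    """
    pages = list(prob)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for target in pages:
            # Rank flowing into target from every source page.
            incoming = sum(rank[src] * prob[src].get(target, 0.0) for src in pages)
            new_rank[target] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical transition probabilities; each row sums to 1.
prob = {
    "P1": {"P1": 0.5, "P2": 0.5},
    "P2": {"P1": 0.25, "P2": 0.75},
}

rank = pagerank_sketch(prob)
```

In this example P2 receives the higher score, because users on either page are more likely to move to (or stay on) P2.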

static cartesian_product(iterable, repeats=1)[source]

Modifies Python’s itertools.product() function to return a list of lists, rather than list of tuples.

Parameters:
  • iterable (list) – List of iterables to assemble Cartesian product from
  • repeats (int) – Number of elements in each list of the Cartesian product
Returns:

List of lists of Cartesian product
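The behaviour described above can be approximated with a thin wrapper over itertools.product(). This sketch assumes that repeats maps onto the repeat argument of itertools.product(); the function name is hypothetical.

```python
from itertools import product


def cartesian_product_as_lists(iterable, repeats=1):
    """Return the Cartesian product as a list of mutable lists."""
    return [list(combo) for combo in product(iterable, repeat=repeats)]


pairs = cartesian_product_as_lists(["P1", "P2"], repeats=2)
# [['P1', 'P1'], ['P1', 'P2'], ['P2', 'P1'], ['P2', 'P2']]
```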

compute_prob_matrix()[source]

Computes the probability matrix for the input clickstream.

count_matrix

Sets attribute to access the count matrix

get_unique_pages(prefixed=True)[source]

Retrieves all the unique pages within the provided list of clickstreams.

initialise_count_matrix()[source]

Initialises an empty count matrix.

static normalise_row(row)[source]

Normalises each row in count matrix, to produce a probability.

To be used when iterating over rows of self.count_matrix. Sum of each row adds up to 1.

Parameters:row – Each row within numpy matrix to act upon.
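A minimal sketch of row normalisation on a plain Python list, rather than a numpy row, might look as follows. The function name is hypothetical, and the handling of an all-zero row (returned unchanged as zeros) is an assumption, since the documentation does not say what happens in that case.

```python
def normalise_row_sketch(row):
    """Divide each count by the row total so the row sums to 1."""
    total = sum(row)
    if total == 0:
        # Assumption: a page with no observed outgoing clicks keeps a zero row.
        return [0.0 for _ in row]
    return [count / total for count in row]


probs = normalise_row_sketch([2, 1, 1])
# [0.5, 0.25, 0.25]
```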
static permutations(iterable, r=None)[source]

Modification of itertools.permutations() function to yield a mutable list rather than an immutable tuple.

Unlike the Cartesian product, this does not return a sequence with repetitions in it.
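The difference from the Cartesian product can be seen in a small sketch that wraps itertools.permutations() and yields lists. The function name is hypothetical.

```python
from itertools import permutations


def permutations_as_lists(iterable, r=None):
    """Yield each permutation as a mutable list instead of a tuple."""
    for combo in permutations(iterable, r):
        yield list(combo)


orderings = list(permutations_as_lists(["P1", "P2", "P3"], r=2))
# No ordering contains the same page twice, e.g. ['P1', 'P1'] never appears.
```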

populate_count_matrix()[source]

Assembles a matrix of counts of transitions from each possible state, to every other possible state.
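Counting transitions amounts to walking each clickstream pairwise and incrementing the cell for (current page, next page). The following pure-Python sketch uses nested lists instead of a numpy matrix. The function name and the alphabetical page ordering are assumptions.

```python
def count_transitions(clickstreams):
    """Return the sorted unique pages and a matrix of transition counts."""
    pages = sorted({page for stream in clickstreams for page in stream})
    index = {page: i for i, page in enumerate(pages)}
    counts = [[0] * len(pages) for _ in pages]
    for stream in clickstreams:
        # Pair each click with the click that follows it.
        for current_page, next_page in zip(stream, stream[1:]):
            counts[index[current_page]][index[next_page]] += 1
    return pages, counts


pages, counts = count_transitions([["P1", "P2", "P2"], ["P2", "P1", "P2"]])
# pages  -> ['P1', 'P2']
# counts -> [[0, 2], [1, 1]]
```

Row i of the result holds the counts of transitions leaving pages[i]; normalising each row then gives transition probabilities.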

prob_matrix

Sets attribute to access the probability matrix

Preprocessing

API documentation for markovclick.preprocessing.

Functions for preprocessing clickstream datasets

class markovclick.preprocessing.Sessionise(df, unique_id_col: str, datetime_col: str, session_timeout: int = 30)[source]

Class with functions to sessionise a pandas DataFrame containing clickstream data.

assign_sessions(n_jobs: int = 1)[source]

Assigns unique session IDs to individual clicks that form the sessions. Supports parallel processing through setting n_jobs to higher than 1.

Parameters:n_jobs (int, optional) – Defaults to 1. If 2 or higher, enables parallel processing.
Returns:Returns sessionised DataFrame, with session IDs stored in session_UUID column.
Return type:pd.DataFrame
datetime_col

Provides access to datetime_col attribute

df

Provides access to df attribute

session_timeout

Provides access to session_timeout attribute

unique_id_col

Provides access to unique_id_col attribute

Visualisation

API documentation for markovclick.viz.

Functions for visualising Markov chain

markovclick.viz.visualise_markov_chain(markov_chain: markovclick.models.MarkovClickstream) → graphviz.dot.Digraph[source]

Visualises Markov chain for clickstream as a graph, with individual pages as nodes, and edges between the first and second most likely nodes (pages). Probabilities for these transitions are annotated on the edges (arrows).

Parameters:markov_chain (MarkovClickstream) – Initialised MarkovClickstream object with probabilities computed.
Returns:
Graphviz Digraph object, which can be rendered as an image or
PDF, or displayed inside a Jupyter notebook.
Return type:Digraph

Usage

Terminology

In the context of this package, a stream is a series of clicks belonging to a given user. The maximum time gap allowed between consecutive clicks within a stream is chosen by the user when assembling the streams; 30 minutes is a commonly used value.

The pages refer to the individual clicks of the user, and thus the pages they visit. Rather than storing the entire URL of each page the user visits, it is better to encode pages using a simple code such as PXX, where XX can be any number. This strategy can also be used to group similar pages under the same code, since modelling them as separate pages is often not useful and leads to an excessively large probability matrix.
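One way to apply such an encoding is to map each URL to a code based on its first path segment, so that similar pages share a code. The sketch below is only an example of the idea: the SECTION_CODES mapping, the fallback code P0, and the example.com URLs are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mapping from the first path segment to a page code.
SECTION_CODES = {"": "P1", "products": "P2", "basket": "P3"}


def encode_page(url, fallback="P0"):
    """Group URLs by their first path segment and return a short page code."""
    path = urlparse(url).path
    section = path.strip("/").split("/")[0]
    return SECTION_CODES.get(section, fallback)


codes = [
    encode_page(url)
    for url in [
        "https://example.com/",
        "https://example.com/products/123",
        "https://example.com/products/456?ref=home",
        "https://example.com/basket",
    ]
]
# Both product pages collapse into the same code, 'P2'.
```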

Build a dummy Markov chain

To start using the package without any data, markovclick can produce dummy data for you to experiment with:

from markovclick import dummy
clickstream = dummy.gen_random_clickstream(n_of_streams=100, n_of_pages=12)

To build a Markov chain from the dummy data:

from markovclick.models import MarkovClickstream
m = MarkovClickstream(clickstream)

The instance m of the MarkovClickstream class provides access to the class’s attributes, such as the probability matrix (m.prob_matrix) used to model the Markov chain, and the list of unique pages (m.pages) featuring in the clickstream.

Visualisation

Visualising as a heatmap

The probability matrix can be visualised as a heatmap as follows:

import seaborn as sns
sns.heatmap(m.prob_matrix, xticklabels=m.pages, yticklabels=m.pages)
_images/heatmap_example.png

Visualising the Markov chain

A Markov chain can be thought of as a graph of nodes and edges, with the edges representing the transitions from each state. markovclick provides a wrapper function around the graphviz package to visualise the Markov chain in this manner.

from markovclick.viz import visualise_markov_chain
graph = visualise_markov_chain(m)

The function visualise_markov_chain() returns a Digraph object, which can be viewed directly inside a Jupyter notebook by simply calling the reference to the object returned. It can also be outputted to a PDF file by calling the render() function on the object.

_images/markov_chain.png

In the graph produced, the nodes representing the individual pages are shown in green, and up to 3 edges from each node are rendered. The first edge, a thick blue arrow, depicts the most likely transition from this page / state to the next page / state. The second edge, a thinner blue arrow, depicts the second most likely transition from this state. Finally, a third edge (light grey) depicts the transition from this page / state back to itself. This edge is only shown if neither of the two most likely transitions is already to itself. For all transitions, the probability is shown next to the edge (arrow).
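The edge-selection rule described above can be sketched without graphviz: rank a page's outgoing transitions, keep the top two, and add the self-transition only if it is not already among them. The function name edges_to_draw and the example probabilities are hypothetical, and this is a sketch of the rule as described rather than the package's plotting code.

```python
def edges_to_draw(prob, page):
    """Return (source, target, probability) tuples for one page's edges."""
    row = prob[page]
    # Targets ordered from most to least likely.
    ranked = sorted(row, key=row.get, reverse=True)
    top_two = ranked[:2]
    edges = [(page, target, row[target]) for target in top_two]
    if page not in top_two:
        # Self-transition shown separately (light grey in the rendered graph).
        edges.append((page, page, row.get(page, 0.0)))
    return edges


# Hypothetical transition probabilities; each row sums to 1.
prob = {
    "P1": {"P1": 0.1, "P2": 0.6, "P3": 0.3},
    "P2": {"P1": 0.2, "P2": 0.5, "P3": 0.3},
    "P3": {"P1": 0.4, "P2": 0.4, "P3": 0.2},
}

edges_p1 = edges_to_draw(prob, "P1")  # three edges, including the self-loop
edges_p2 = edges_to_draw(prob, "P2")  # two edges, self-loop is already the top edge
```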

Clickstream processing with markovclick.preprocessing

markovclick provides functions to process clickstream data such as server logs, which contain unique identifiers such as cookie IDs associated with each click. This allows clicks to be aggregated into groups, whereby clicks from the same browser (identified by the unique identifier) are grouped such that the difference between individual clicks does not exceed the maximum session timeout (typically taken to be 30 minutes).
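The grouping rule can be illustrated with the standard library alone: sort clicks by identifier and time, and start a new session whenever the gap to the previous click from the same identifier exceeds the timeout. The function name, the tuple-based input, and the choice to treat a gap of exactly 30 minutes as the same session are assumptions. The package itself operates on a pandas DataFrame.

```python
from datetime import datetime, timedelta
from uuid import uuid4


def assign_sessions_sketch(clicks, session_timeout=30):
    """clicks: list of (unique_id, timestamp) tuples.

    Returns a list of (unique_id, timestamp, session_id) tuples.
    """
    timeout = timedelta(minutes=session_timeout)
    last_seen = {}
    current_session = {}
    result = []
    for unique_id, timestamp in sorted(clicks):
        previous = last_seen.get(unique_id)
        # New session on the first click, or after a gap longer than the timeout.
        if previous is None or timestamp - previous > timeout:
            current_session[unique_id] = uuid4().hex
        last_seen[unique_id] = timestamp
        result.append((unique_id, timestamp, current_session[unique_id]))
    return result


clicks = [
    ("cookie_a", datetime(2024, 1, 1, 10, 0)),
    ("cookie_a", datetime(2024, 1, 1, 10, 10)),
    ("cookie_a", datetime(2024, 1, 1, 11, 0)),  # 50-minute gap: new session
    ("cookie_b", datetime(2024, 1, 1, 10, 5)),
]
sessions = assign_sessions_sketch(clicks)
```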

Sessionise clickstream data

To sessionise clickstream data, the following code can be used. It requires a pandas DataFrame object.

from markovclick.preprocessing import Sessionise
sessioniser = Sessionise(df, unique_id_col='cookie_id',
            datetime_col='timestamp', session_timeout=30)
class markovclick.preprocessing.Sessionise(df, unique_id_col: str, datetime_col: str, session_timeout: int = 30)[source]

Class with functions to sessionise a pandas DataFrame containing clickstream data.

__init__(df, unique_id_col: str, datetime_col: str, session_timeout: int = 30) → None[source]

Instantiates object of Sessionise class.

Parameters:
  • df (pd.DataFrame) – pandas DataFrame object containing clickstream data. Must contain at least a timestamp column and a unique identifier column, such as cookie ID.
  • unique_id_col (str) – Column name of unique identifier, e.g. cookie_id
  • datetime_col (str) – Column name of timestamp column.
  • session_timeout (int, optional) – Defaults to 30. Maximum time in minutes after which a session is broken.

With a Sessionise object instantiated, the assign_sessions() function can then be called. This function supports multi-processing, enabling you to split the job into multiple processes to take advantage of a multi-core CPU.

sessioniser.assign_sessions(n_jobs=2)

The assign_sessions() function returns the DataFrame, with an additional column added storing the unique identifier for the session. Rows of the DataFrame can then be grouped using this column.

markovclick allows you to model clickstream data from websites as Markov chains, which can then be used to predict the next likely click on a website for a user, given their history and current state.

Requirements

  • Python 3.X
  • numpy
  • matplotlib
  • seaborn (Recommended)
  • pandas

Installation

Install either via the setup.py file:

python setup.py install

or via pip:

pip install markovclick

Tests

Tests can be run using the pytest command from the root directory.

Documentation

To build the documentation, run make html inside the /docs directory, or substitute another output format, e.g. make latex.