In the following notebook an analysis method using Python is presented for exploring and understanding the changes in prices of different items on different websites over time. Product prices are known to vary both on- and offline, depending on the characteristics of potential customers. However, in addition to purposeful manual price manipulation, during which human sellers actively modify prices (e.g. for sales), recent years have seen an increase in online automatic price updates by AI technologies. Algorithmic pricing aims to maximise seller profits by using probabilistic concepts, competition details, user information and user history to exploit business opportunities. For example, the technologies of a company could monitor an item's price across multiple platforms and then optimise the the company's item price by updating it to be lower than all the prices on the other websites. Algorithmic pricing is not a strategy but rather a toolkit for ensuring pricing strategies remain optimal and efficient. However, the exact effect of these technologies on the price management process is not always straightforward. As automated technolgies are employed more and more, expecially by large and established brands, certain markets can become monopolised, restricing access to new and innovative additions. The exact existence and the effect of algorithmic pricing have not properly been researched; thus this notebook provides an initial analysis as well as methods that can be used to assist further investigations into the phenomenon.
The analysis is divided into five main sections:
This specific analysis was conducted on data extracted from x websites between date and date 2021.
The price data for the various products across websites was recorded over time and extracted as follows.
# Add scraper: !python <name>.py
The next step required in order to read the data in Python, is the importation of Pandas, a free, open source data manipulation tool. Furthermore, in preparation for the rest of the analysis, import numpy (which provides access to many mathematical functions and further data manipulation) and matplotlib (an important visualization library).
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
The data - in this case as a csv file - is loaded using Pandas and the appropriate access path. The delimiter is added here to assist with the correct reading of the data.
dataset = pd.read_csv(r"/Users/lsr/Documents/dutch_product_prices.csv", delimiter=';')
The 'head' function provides an overview of the data's different variables (columns). The 'shape' command returns the number of rows and columns in the dataset.
dataset.head()
dataset.shape
Next, we confirm the variable values have the appropriate data types for our analysis. We can see below that all variables have the 'object' type. We want 'price' to be recognised as a float (decimal point number) and 'date' to be recognised as a datetime data type. Therefore, we use the following commands to reset the data types of these variables.
dataset.info()
def conversion(value):
new_value=value.replace(',','.')
return float(new_value)
dataset['price']=dataset['price'].apply(conversion)
dataset.info()
dataset['date']= pd.to_datetime(dataset['date'])
dataset.info()
The following section focuses on the statistical methodologies used to understand the nature of the data and identify patterns within it. The aim is to use the insights generated to specify further investigations into price differences.
To begin, examine the descriptive statistics of the different product prices.
dataset.groupby([dataset['website'], dataset['item']]).describe()
A nice visualisation of the descriptive statistics can be seen obtained using boxplots. Here we can visually compare the distribution of prices for different items on each website.
ax = dataset.groupby(dataset['website']).boxplot(by = 'item', vert=False, figsize = (19,19), fontsize=8)
plt.tight_layout()
Every item has its own price range. To create a new variable for the range, calculated as max-min, once first needs to save the descriptive statistics as a manipulatable table and then make a few changes regarding indices and data types.
summary_table = dataset.groupby([dataset['website'], dataset['item']]).describe().reset_index()
print(summary_table.columns)
As seen above, the columns have MultiIndex indices. It is more convenient to have single indices.
summary_table.columns = ['_'.join(col) for col in summary_table.columns.values]
print(summary_table.columns)
summary_table['price_max'] = summary_table['price_max'].astype(float)
summary_table['price_min'] = summary_table['price_min'].astype(float)
Now the variable 'range' can be calcuated and added.
summary_table.insert(10,"range", (summary_table['price_max']-summary_table['price_min']))
summary_table.head()
It is interesting to separate the products with price ranges of zero, meaning their prices remain exactly constant over time, from the products whose price ranges are non-zero, meaning their prices fluctuate over time. First we count and identify the products with constant prices:
constant_prices = summary_table
constant_prices = constant_prices.loc[constant_prices["range"] == 0]
len(constant_prices) # This counts the number of products
constant_prices
Now we examine the products with fluctuating prices over time:
changing_prices = summary_table
changing_prices = changing_prices.loc[changing_prices["range"] != 0]
len(changing_prices)
changing_prices
To obtain a comparable visualization for each product's price range, we can create a bar graph.
changing_prices['product'] = changing_prices['website_'] + ' ' + changing_prices['item_']
ax = changing_prices.plot(kind = 'bar', x='product', y='range', figsize = (19,8))
ax.set_ylabel('Price Range')
ax.set_xlabel('Product')
ax.set_title('Price ranges for different products with fluctuating prices')
One might want to compare the average price ranges recorded for the different websites.
changing_prices["range"].groupby(changing_prices["website_"]).mean()
This already gives some indication into the websites most and least likely to be employing algorithmic pricing techologies. However, the collection of items recorded for each website are not the same across websites - for example, CoolBlue's recorded items include a laptop while Kruidvat's recorded items do not. Thus, the comparison between overall average price range for the various websites should be interpreted carefully. It may be more meaningful to compare the average price ranges for the same items across different websites, rather than different items across the same websites.
changing_prices["range"].groupby(changing_prices["item_"]).mean()
Here we can see that certain items have large price fluctuations, while other items' prices are far more stable.
It may be suspected that there is a correlation between the price levels of the items and the corresponding price ranges. For example, the more expensive an item, the larger its price range (or vice versa). To determine this, we calculate the correlation between the prices (first maxima and then minima) and the ranges.
x = changing_prices["range"].values
y = changing_prices["price_max"].values
r = np.corrcoef(x, y)
r
x = changing_prices["range"].values
y = changing_prices["price_min"].values
r = np.corrcoef(x, y)
r
Comment on the correlation
Based on the previous section, or perhaps independently, one might want to investigate one specific website and the fluctuation in prices of items offered by it. To begin, specifiy the website's name. This can be changed at any point and the analysis rerun.
individual_website = dataset.loc[(dataset['website']== 'Coolblue')]
We can take a look again at the boxplots to visualize and compare the overall price differences for each item on the website.
ax = individual_website.boxplot(by = 'item', figsize = (19,8))
ax.set_ylabel('Price')
ax.set_xlabel('Item')
\ A meaningful investigation explores how the website's prices for each item changed over time; thus time series plots are included. These can be created for all items individually, or all items can be plotted on one set of axes.
item_array = dataset["item"].unique()
web_items = ["0"]* (len(item_array))
for i in range (len(item_array)):
item_it = individual_website.loc[(individual_website["item"]== item_array[i])]
web_items[i] = item_it
for i in range (len(item_array)):
plt.plot(web_items[i]["date"], web_items[i]["price"])
plt.xlabel('Date')
plt.ylabel('Prices')
plt.title('Time Series plots for Multiple Items')
# Change plot size?
ay = individual_website.groupby(individual_website['item']).plot(kind='scatter', x='date', y='price', figsize =(20,6))
# How to set titles to show the items for each scatterplot?
az = individual_website.groupby(individual_website['item']).plot(kind='line', x='date', y='price', figsize =(20,6))
\ Comment on the time series trends.
Certain items appear on different websites, such as a laptop or grasmaaier. It is interesting to explore the level and consistency of the prices of these items across different brands.
Specify the item of interest:
individual_item = dataset.loc[(dataset['item']== 'laptop')]
individual_item.describe()
To compare the item's price distributions on different websites, one can consider boxplots, violin plots and side-by-side histograms.
ax = individual_item.boxplot(by = 'website', figsize = (19,8))
ax.set_ylabel('Price')
ax.set_xlabel('Website')
web_array = individual_item["website"].unique()
item_webs = ["0"]* (len(web_array))
for i in range (len(web_array)):
web_it = individual_item.loc[(individual_item["website"]== web_array[i])]
item_webs[i] = web_it
for i in range (len(item_webs)):
x = item_webs[i]["price"]
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
bp = ax.violinplot(x)
plt.show()
x = ["0"]* (len(item_webs))
for i in range (len(item_webs)):
x[i] = item_webs[i]["price"]
plt.hist(x[i])
plt.xlabel('Prices')
plt.ylabel('Frequency')
plt.title('Side-by-Side Histogram with Multiple Websites')
# Hmmmm this should not overlap but rather be side-by-side
# Want the above to look like this
x1 = list(individual_item[individual_item["website"]== "Coolblue"]["price"])
x2 = list(individual_item[individual_item["website"]== "Mediamarkt"]["price"])
names = ["Coolblue", "Mediamarkt"]
plt.hist([x1, x2], label=names)
plt.legend()
plt.xlabel('Prices')
plt.ylabel('Frequency')
plt.title('Side-by-Side Histogram with Multiple Items')
Time series plots are again meaningful to notice trends and patterns over time.
for i in range (len(web_array)):
plt.plot(item_webs[i]["date"], item_webs[i]["price"])
plt.xlabel('Date')
plt.ylabel('Prices')
plt.title('Time Series plots for Multiple Websites')
# Change plot size?
\ Or if one wants all the time series plots on separate axes:
ay = individual_item.groupby(individual_item['website']).plot(kind='scatter', x='date', y='price', figsize =(20,6))
az = individual_item.groupby(individual_item['website']).plot(kind='line', x='date', y='price', figsize =(20,6))
Certain items on one website might warrant some individual investigation. To begin we consider a descriptive summary.
dataset['product'] = dataset['website'] + ' ' + dataset['item']
individual_product = dataset.loc[(dataset['product']== 'Coolblue laptop')]
individual_product.describe()
For the product, we calculate the product's average price for each day:
daily_mean = individual_product.groupby(individual_product["date"]).mean()
dates_array = dataset["date"].unique()
plt.plot(dates_array, daily_mean)
Then, using the average prices each day, we calculate the rates of change from one day to the next, and plot these.
# I think the rates calculation could probably be simplified...
mean = [0]* len(dates_array)
for i in range (len(dates_array)):
prod_date = individual_product.loc[(individual_product["date"]== dates_array[i])]
mean[i] = prod_date["price"].mean()
rates = [0]* (len(dates_array)-1)
for i in range (len(dates_array)-1):
each_rate = mean[i+1]/mean[i]*100
rates[i] = each_rate
plt.plot(dates_array[1:len(dates_array)],rates)
Again we can also consider boxplots, violin plots, histograms and density graphs to understand the price distribution.
ax = individual_product.boxplot(figsize = (19,8))
ax.set_ylabel('Price')
x = individual_product["price"]
data_to_plot = [x]
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
bp = ax.violinplot(data_to_plot)
plt.show()
plt.hist(individual_product["price"])
plt.xlabel('Prices')
plt.ylabel('Frequency')
product_mean = individual_product['price'].mean()
product_median = individual_product['price'].median()
ax = individual_product['price'].plot(kind='density', figsize = (14,6))
ax.axvline(product_mean, color = 'red')
ax.axvline(product_median, color = 'green')
ay = individual_product.plot(kind='scatter', x='date', y='price', figsize =(20,6))
az = individual_product.plot(kind='line', x='date', y='price', figsize =(20,6))