Getting started¶

Price discrimination occurs when two users are shown inconsistent prices for the same product (e.g., Travelocity showing a select user a higher price for a particular hotel). This project attempts to examine a potential inconsistency in product search results is due to the client-side state associated with the request.

Within the framework of this project, a script that can be used for getting product information (product prices and details) and investigating a potential algorithmic unfairness was developed.

New!!: Video Tutorial¶

https://youtu.be/ZDsRkwqTNaY

It is recommended for the user to be versed in the following:

Python programming language
Basics of Data Analysis
Be familiar with Pandas and Numpy python libraries
Optionally: familiarity with Selenium python library

For each e-commerce website, there is a script that is structurally identical to the scripts for other websites in the project.

For the sake of simplicity, we are going to take a close look at the script for Bol.com.

For Selenium python library, it is required to choose a webrowser for the webdriver which is a web framework that permits you to execute cross-browser tests. This tool is used for automating web-based application testing to verify that it performs expectedly. In our case, we are automating web-based programmatic collection of data and parsing data from a source. In that project, we are using Firefox webrowser driver.

Importing libraries¶

First of all, we should import the libraries that we need to use. The code cell below imports Pandas, Numpy, Selenium and other utility libraries that will help to visualize the execution process.

Note: we put all warnings that pop up during the execution process of the script off by using the method from the warnings library in order to have less verbose outputs to make the pipeline of the execution more readable.

In [ ]:

# imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import FirefoxOptions

import pandas as pd
import numpy as np
import time
import re
from tqdm import tqdm
import argparse
import warnings
from user_agents import parse
warnings.simplefilter("ignore")

Script usage parameters specification¶

To perform an experiment we need to specify the following parameters to the script:

A name of the experiment
A list of items of products to search
A website URL address
A path to the Firefox webdriver that was downloaded from the external source

Optional:

A user-agent string which is essentially a string that specifies a device and browser acting on behalf of a user to send it to the website to rget a certain webpage layout according to the specified device
A proxy address to specify geolocation of a user to send it to the website

The code cell below defines a function to parse the arguments explained above to execute the script.

In [ ]:

def get_parser():
    # parse parameters
    parser = argparse.ArgumentParser(description='Scrape Bolcom website')
    parser.add_argument("--exp_name", type=str, default="", help="Experiment name")
    parser.add_argument("--items_list", nargs='+', default="", help="List of products to search")
    parser.add_argument("--web_page", type=str, default="", help="Website url")
    parser.add_argument("--exec_path", type=str, default="", help="Path to execute the webdriver")
    parser.add_argument("--ua_string", type=str, default="", help="User agent string to specify to identify/detect devices and browsers")
    parser.add_argument("--proxy", type=str, default="", help="Proxy to mimic IP Address Geolocation")

    return parser

Web crawling function¶

The code cell below defines a function to perform an iteration of data collection for the specified item.

As input parameters, it takes a webdriver, an item to search for, delays which is a list of integer numbers to randomly delay the crawling process to mimic a real user behaviour, and a collected data list where we want to store the information.

First, the script clicks on the banner button to update the page, after which it simulates the behavior of a person on the site by means of a delay. Next, the script finds the information input field (a search bar), inserts the name of the input product there and opens a page with a catalog of this product.

After that, the script again simulates a person falling asleep for 5 seconds, after which it performs the process of collecting the necessary data about products from the catalog, writing it in the list (collected data).

Each line with the product looks like this:

website name (Bol.com in our case)
item name (e.g., sneakers)
product name (e.g., Nike Air Force 1 (PS) Sneakers Kinderen - White/White-White)
seller information (e.g., Sneakersenzo.nl)
time when it was collected
price of the product

Note: if the web driver cannot find the products on the web page, it ends the work by closing the script execution process.

In [1]:

def iteration(driver, item, delays, collected_data):
    # banner button BolCom click to update the search bar
    banner_button = driver.find_element_by_class_name('omniture_main_logo')
    # randomly choose a delay and freeze the execution to mimic a person usage
    delay = np.random.choice(delays)
    time.sleep(delay)
    banner_button.click()   # press ENTER

    delay = np.random.choice(delays)
    time.sleep(delay)

    # put a query in the search bar
    search = driver.find_element_by_name("searchtext")
    search.send_keys(item)  # put it in the search field
    search.submit()   # press ENTER

    time.sleep(5)

    timeout = 30
    try:
        main = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.ID, 'js_items_content')))
        time.sleep(5)
        articles = main.find_elements_by_class_name('product-item--row')                    # get all products from the page

        for article in tqdm(articles):
            price_header = article.find_elements_by_class_name('price-block__price')        # get a price object

            if len(price_header) != 0:
                # process price text
                price = re.sub(r'[\n\r]+', '.', price_header[0].text)                       # get a price text
                price = re.sub("\-", "00", price)

                product_header = article.find_elements_by_class_name('product-title')       # get a product name

                # get a seller name
                try:
                    seller = article.find_elements_by_class_name('product-seller__name')
                    assert seller
                except:
                    seller = article.find_elements_by_class_name('product-seller')

                if len(seller) == 0:    # case if there is no seller specified
                    _seller = 'NaN'
                else:
                    _seller = seller[0].text    # get a seller name text

                # temporary dictionary of the product data
                temp = {
                    'website': "BolCom",
                    'item': item,
                    'product': product_header[0].text,
                    'seller': _seller,
                    'time': pd.to_datetime('now').strftime("%Y-%m-%d %H:%M:%S"),
                    'price': price}

                collected_data.append(temp)                                                     # append the data

    except TimeoutException:
        # driver.quit()
        print("driver has not found products on the webpage")

Web crawling process execution¶

The code cell below defines the main function to perform iterations of data collection over all items in the list that the script received and parsed with get_parser function.

The function defines and initializes a list of the possible delays to mimic user interaction with websites, a webdriver and webdriver options and the item list to search for. Also, here the web driver finds a cookies button on the website and clicks it to accept cookies.

After that, the process of collecting data for all positions that were passed to the script is started. It is important to clarify that if there are any problems with specific items, the script puts this "problematic" item in a special list (skipped list), to which it returns after going through all the items, and again tries to collect information on these items.

At the end, the script creates a dataframe from the collected data and writes and saves it as a csv file.

In [ ]:

def main(params)
    # initialize a list of the possible delays to mimic user interaction with websites
    delays = [1, 2, 3, 4, 5]

    # initialize a list where we store all collected data
    collected_data = []

    # list of items to search
    items_list = params.items_list

    # initalize webdriver options
    profile = webdriver.FirefoxProfile()
    if params.ua_string != '':
        # user agent string
        ua_string = params.ua_string
        # initialize user agent
        user_agent = parse(ua_string)
        print(f"Current user-agent: {user_agent}")
        profile.set_preference("general.useragent.override", ua_string)

    PROXY = params.proxy
    if PROXY != '':
        webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
            "httpProxy": PROXY,
            "ftpProxy": PROXY,
            "sslProxy": PROXY,
            "proxyType": "MANUAL",
        }
        
    
    opts = FirefoxOptions()
    opts.add_argument("--headless")

    # initialize a webdriver
    driver = webdriver.Firefox(firefox_options=opts, firefox_profile=profile)
    # get the url
    driver.get(params.web_page)

    # time to wait a response from the page
    timeout = 30
    # press the button to accept cookies
    try:
        cookies = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.CLASS_NAME, "js-confirm-button")))

        delay = np.random.choice(delays)
        time.sleep(delay)

        cookies.send_keys(Keys.RETURN)  # press ENTER

    except TimeoutException:
        print("Didn't found the button accept cookies.")
        pass

    # initialize a list with failed items
    skipped_items = []

    # collect the data
    for item in tqdm(items_list):
        print("================")
        print(item)
        print("================")
        print("\n")

        try:
            try:
                try:
                    _ = iteration(driver, item, delays, collected_data)

                except:
                    _ = iteration(driver, item, delays, collected_data)

            except:
                try:
                    _ = iteration(driver, item, delays, collected_data)

                except:
                    _ = iteration(driver, item, delays, collected_data)

        except:
            print(f"{item} was skipped")
            skipped_items.append(item)
            pass

    print("Writing csv file...")
    df = pd.DataFrame(collected_data)
    df.to_csv(f'{params.exp_name}' + '_' + pd.to_datetime('now').strftime("%Y-%m-%d %H:%M:%S") + ".csv", index=False)
    print("Writing finished.")

    # close the driver
    driver.quit()

Script execution¶

The code cell below executes the script.

In [ ]:

if __name__ == '__main__':
    parser = get_parser()
    params, unknown = parser.parse_known_args()
#     params.exp_name = 'test27'
#     params.items_list = ['sneakers', 'parfum', 'sandalen', 'horloge', 'rugzak', 'zonnebril', 'kostuum', 'trainingspak', 'badpak', 'jurk', 'overhemd', 'mantel', 'laarzen', 'koptelefoon', 'yogamat', 'sjaal', 'badjas', 'halsketting', 'portemonnee']
#     params.web_page = 'https://www.bol.com/'
#     params.exec_path = ''
    # run the script
    main(params)

Last updated:
Contributors

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search