Getting started¶
Price discrimination occurs when two users are shown inconsistent prices for the same product (e.g., Travelocity showing a select user a higher price for a particular hotel). This project examines whether a potential inconsistency in product search results is due to the client-side state associated with the request.
Within the framework of this project, we developed a script that collects product information (prices and details) and can be used to investigate potential algorithmic unfairness.
For each e-commerce website covered by the project, there is a script that is structurally identical to the scripts for the other websites. For the sake of simplicity, we take a close look at the script for Bol.com.
The Selenium Python library requires choosing a browser for its WebDriver, a framework that drives a real browser and is commonly used for cross-browser testing of web applications. While Selenium is usually used to automate such testing, here we use it to automate the programmatic collection and parsing of data from a website. In this project, we use the Firefox WebDriver.
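As a minimal sketch of this pattern (assuming geckodriver is installed and available on the PATH), a headless Firefox session can be opened like this:

# a minimal sketch: open a headless Firefox session (assumes geckodriver is on PATH)
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

opts = FirefoxOptions()
opts.add_argument("--headless")           # run the browser without a visible window
driver = webdriver.Firefox(options=opts)
driver.get("https://www.bol.com/")
print(driver.title)                       # confirm that the page loaded
driver.quit()                             # always release the browser process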
Importing libraries¶
First of all, we import the libraries that we need. The code cell below imports pandas, NumPy, Selenium, and a few utility libraries (such as tqdm) that help visualize the execution process.
Note: we suppress all warnings that pop up while the script runs, using warnings.simplefilter("ignore"), so that the output is less verbose and the execution pipeline is easier to read.
# imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import FirefoxOptions
import pandas as pd
import numpy as np
import time
import re
from tqdm import tqdm
import argparse
import warnings
from user_agents import parse
warnings.simplefilter("ignore")
Script parameter specification¶
To perform an experiment, we need to pass the following parameters to the script:
- A name of the experiment
- A list of products to search for
- A website URL
- A path to the Firefox webdriver (geckodriver) executable that was downloaded separately
Optional:
- A user-agent string, which identifies the device and browser acting on behalf of the user; it is sent to the website so that the page layout matches the specified device (see the example after this list)
- A proxy address, sent to the website to control the IP geolocation of the user
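For illustration, the sketch below parses a sample iPhone user-agent string with the user_agents library used by the script; the string itself is purely illustrative, not one taken from the project:

from user_agents import parse

# a sample iPhone user-agent string (purely illustrative)
ua_string = ("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) "
             "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1")
user_agent = parse(ua_string)
print(user_agent)            # e.g. "iPhone / iOS 14 / Mobile Safari 14"
print(user_agent.is_mobile)  # True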
The code cell below defines a function to parse the arguments explained above to execute the script.
def get_parser():
# parse parameters
parser = argparse.ArgumentParser(description='Scrape Bolcom website')
parser.add_argument("--exp_name", type=str, default="", help="Experiment name")
parser.add_argument("--items_list", nargs='+', default="", help="List of products to search")
parser.add_argument("--web_page", type=str, default="", help="Website url")
parser.add_argument("--exec_path", type=str, default="", help="Path to execute the webdriver")
parser.add_argument("--ua_string", type=str, default="", help="User agent string to specify to identify/detect devices and browsers")
parser.add_argument("--proxy", type=str, default="", help="Proxy to mimic IP Address Geolocation")
return parser
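As a quick sanity check (the argument values below are illustrative), the parser can also be exercised in-process:

# exercise the parser with illustrative arguments
parser = get_parser()
params, unknown = parser.parse_known_args([
    "--exp_name", "test_run",
    "--items_list", "sneakers", "parfum",
    "--web_page", "https://www.bol.com/",
])
print(params.items_list)  # ['sneakers', 'parfum']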
Web crawling function¶
The code cell below defines a function to perform an iteration of data collection for the specified item.
As input parameters, it takes a webdriver, an item to search for, delays (a list of integers, in seconds, from which random pauses are sampled to mimic real user behaviour), and a collected-data list in which to store the results.
First, the script clicks the banner button to reset the page, then pauses for a random delay to simulate human behaviour on the site. Next, it locates the search bar, types the name of the product into it, and submits the query, which opens the catalog page for that product.
After that, the script pauses again for 5 seconds, then collects the required data about the products in the catalog, appending it to the collected-data list.
Each product record contains:
- website name (Bol.com in our case)
- item name (e.g., sneakers)
- product name (e.g., Nike Air Force 1 (PS) Sneakers Kinderen - White/White-White)
- seller information (e.g., Sneakersenzo.nl)
- time when it was collected
- price of the product
Note: if the web driver cannot find any products on the web page within the timeout, the function reports this and stops the iteration without collecting data.
def iteration(driver, item, delays, collected_data):
    # locate the Bol.com banner button, used to reset the search bar
banner_button = driver.find_element_by_class_name('omniture_main_logo')
# randomly choose a delay and freeze the execution to mimic a person usage
delay = np.random.choice(delays)
time.sleep(delay)
    banner_button.click() # click the banner to reset the page
delay = np.random.choice(delays)
time.sleep(delay)
# put a query in the search bar
search = driver.find_element_by_name("searchtext")
search.send_keys(item) # put it in the search field
search.submit() # press ENTER
time.sleep(5)
timeout = 30
try:
main = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.ID, 'js_items_content')))
time.sleep(5)
articles = main.find_elements_by_class_name('product-item--row') # get all products from the page
for article in tqdm(articles):
price_header = article.find_elements_by_class_name('price-block__price') # get a price object
if len(price_header) != 0:
# process price text
price = re.sub(r'[\n\r]+', '.', price_header[0].text) # get a price text
price = re.sub("\-", "00", price)
product_header = article.find_elements_by_class_name('product-title') # get a product name
# get a seller name
try:
seller = article.find_elements_by_class_name('product-seller__name')
assert seller
                except AssertionError:
seller = article.find_elements_by_class_name('product-seller')
if len(seller) == 0: # case if there is no seller specified
_seller = 'NaN'
else:
_seller = seller[0].text # get a seller name text
# temporary dictionary of the product data
temp = {
'website': "BolCom",
'item': item,
'product': product_header[0].text,
'seller': _seller,
'time': pd.to_datetime('now').strftime("%Y-%m-%d %H:%M:%S"),
'price': price}
collected_data.append(temp) # append the data
except TimeoutException:
# driver.quit()
print("driver has not found products on the webpage")
Web crawling process execution¶
The code cell below defines the main function, which performs an iteration of data collection for every item in the list that the script received and parsed with the get_parser function.
The function initializes the list of possible delays used to mimic user interaction, the webdriver and its options, and the list of items to search for. It also locates the cookie-consent button on the website and clicks it to accept cookies.
After that, data collection starts for all the items passed to the script. If a specific item causes repeated problems, the script puts this "problematic" item into a special skipped_items list, to which it returns after going through all the other items, trying once more to collect their information.
At the end, the script builds a dataframe from the collected data and saves it as a CSV file.
def main(params):
# initialize a list of the possible delays to mimic user interaction with websites
delays = [1, 2, 3, 4, 5]
# initialize a list where we store all collected data
collected_data = []
# list of items to search
items_list = params.items_list
    # initialize webdriver options
profile = webdriver.FirefoxProfile()
if params.ua_string != '':
# user agent string
ua_string = params.ua_string
# initialize user agent
user_agent = parse(ua_string)
print(f"Current user-agent: {user_agent}")
profile.set_preference("general.useragent.override", ua_string)
PROXY = params.proxy
if PROXY != '':
webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
"httpProxy": PROXY,
"ftpProxy": PROXY,
"sslProxy": PROXY,
"proxyType": "MANUAL",
}
opts = FirefoxOptions()
opts.add_argument("--headless")
    # initialize a webdriver, pointing Selenium at the downloaded geckodriver executable
    driver = webdriver.Firefox(options=opts, firefox_profile=profile, executable_path=params.exec_path)
# get the url
driver.get(params.web_page)
# time to wait a response from the page
timeout = 30
# press the button to accept cookies
try:
cookies = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.CLASS_NAME, "js-confirm-button")))
delay = np.random.choice(delays)
time.sleep(delay)
cookies.send_keys(Keys.RETURN) # press ENTER
    except TimeoutException:
        print("Didn't find the accept-cookies button.")
# initialize a list with failed items
skipped_items = []
# collect the data
for item in tqdm(items_list):
print("================")
print(item)
print("================")
print("\n")
        # try the item up to four times before giving up
        attempts = 4
        for attempt in range(attempts):
            try:
                _ = iteration(driver, item, delays, collected_data)
                break
            except Exception:
                # after the final failed attempt, remember the item and move on
                if attempt == attempts - 1:
                    print(f"{item} was skipped")
                    skipped_items.append(item)
    # return to the skipped items and try them once more
    for item in tqdm(skipped_items):
        try:
            _ = iteration(driver, item, delays, collected_data)
        except Exception:
            print(f"{item} was skipped again")
print("Writing csv file...")
df = pd.DataFrame(collected_data)
df.to_csv(f'{params.exp_name}' + '_' + pd.to_datetime('now').strftime("%Y-%m-%d %H:%M:%S") + ".csv", index=False)
print("Writing finished.")
# close the driver
driver.quit()
Script execution¶
The code cell below executes the script.
if __name__ == '__main__':
parser = get_parser()
params, unknown = parser.parse_known_args()
# params.exp_name = 'test27'
# params.items_list = ['sneakers', 'parfum', 'sandalen', 'horloge', 'rugzak', 'zonnebril', 'kostuum', 'trainingspak', 'badpak', 'jurk', 'overhemd', 'mantel', 'laarzen', 'koptelefoon', 'yogamat', 'sjaal', 'badjas', 'halsketting', 'portemonnee']
# params.web_page = 'https://www.bol.com/'
# params.exec_path = ''
# run the script
main(params)
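For example, assuming the script is saved as scrape_bolcom.py (a hypothetical filename) and geckodriver was downloaded to /path/to/geckodriver, it could be launched from the command line like this:

python scrape_bolcom.py \
    --exp_name test_run \
    --items_list sneakers parfum horloge \
    --web_page https://www.bol.com/ \
    --exec_path /path/to/geckodriver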