Offer monitor for otomoto.pl [Service + Architecture]
Many people told me that the way I bought my car was very interesting, so I decided to create a solution for everyone! Using my service you will receive an email with new offers from otomoto.pl 4 times a day for 7 days.
The only limitation in place is the number of result pages you get on otomoto.pl. Select all the filters (make, model, etc.) and see how many pages this generates. If it is fewer than 10, copy the URL and paste it into the form. One note: the emails do not seem to work with Outlook.
The form is available here.
Otodom.pl and gratka.pl are being worked on right now.
This is where the technical part starts.
The script that I created previously was based on an RPi + Telegram. That is not a perfect setup if you want to create a solution available to everyone, so I decided to swap Telegram for good ol’ email, and the RPi for Azure. The cloud helps here, as I will not have to open any ports or worry about constant internet access. The big disadvantage is that I will actually have to pay out of my own pocket – hence the limitation.
The whole solution is based on 3 Azure services:
- Blob storage – functions as a database. I know that this is not perfect, but it is super cheap and I have some Python scripts readily available.
- Logic App – three apps that manage the registration process and the downloading of offers.
- Function – a service that lets me run Python without any server (the serverless thingy). This way I do not have to set up a new VM.
Using those services, I created four separate (yet cooperating) processes:
- Email registration
- Email confirmation
- Sending out new offers
- Subscription removal
Below you can find details of each process and also pieces of code.
1. Email registration
Logic app view:
The registration process is rather easy to follow. A user registers through the form, which sends a GET request to the Logic App. The request passes the user’s email and the URL from otomoto.pl. The URL is checked by a function called GetPages (code below) to get the number of result pages. If it returns more than 10 pages, the user gets an appropriate response. If it returns fewer, files are created in blob storage, the email gets an ID, a confirmation email with a link is sent, and an appropriate response is shown. Blob storage has a policy that deletes these files (the offer URL and the email) after 2 days, so if the user clicks the confirmation link after 2 days, it will no longer work.
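To make the registration call concrete, here is a minimal sketch of how the form’s GET request could be reproduced; the Logic App trigger URL and the parameter names (email, url) are placeholders rather than the real endpoint.

import requests

# Placeholder trigger URL – the real Logic App callback URL (with its signature) is generated by Azure.
LOGIC_APP_TRIGGER_URL = "https://example-region.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke"

# Assumed parameter names: the form passes the user's email and the otomoto.pl search URL.
params = {
    "email": "user@example.com",
    "url": "https://www.otomoto.pl/osobowe/volkswagen/tiguan/",  # any search URL with fewer than 10 result pages
}

response = requests.get(LOGIC_APP_TRIGGER_URL, params=params)
print(response.status_code, response.text)  # the Logic App replies with a human-readable message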
OtomotoScrapper class, which contains all the methods:
import requests
from lxml import html
import os
import json
import time
import datetime


class OtomotoScrapper:
    def __init__(self, url, previous_offers):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
        self.url = url
        self.previous_offers = previous_offers
        self.new_offers = {}
        self.no_of_pages = self.get_number_of_pages()
        self.html_template = """ Super long html_template was here. """

    def get_number_of_pages(self):
        # Retrieve the maximum number of result pages on the website; used when iterating through n pages.
        url = self.url
        request = requests.get(url, headers=self.headers)
        tree = html.fromstring(request.text)
        xpath_offer_details = '//div[@class="offers list"]/article'
        max_page = tree.xpath('//ul[@class="om-pager rel"]/li[last()-1]/a/span/text()')
        offers = tree.xpath(xpath_offer_details)
        if not max_page and offers:
            return 1
        elif max_page:
            max_page = max_page[0].strip()
            return int(max_page)
        else:
            return 0

    def get_offers(self, n):
        url = str(self.url) + "&page=" + str(n)
        request = requests.get(url, headers=self.headers)
        tree = html.fromstring(request.text)
        xpath_offer_details = '//div[@class="offers list"]/article'
        xpath_url = '//div[@class="offers list"]/article/@data-href'
        offer_details = tree.xpath(xpath_offer_details)
        list_of_urls = tree.xpath(xpath_url)
        for i, detail in enumerate(offer_details):
            try:
                if not list_of_urls[i] in self.previous_offers:
                    # Check if the URL was present before; if not, download all the details.
                    self.previous_offers[list_of_urls[i]] = self.get_single_offer(detail)
                    self.new_offers[list_of_urls[i]] = self.get_single_offer(detail)
                    # VIN and phone require separate logic.
                    offer_id = list_of_urls[i].split("-ID")[1].split(".html")[0]
            except Exception as e:
                print(e)

    def get_single_offer(self, html_element):
        # Retrieve all offer details from html_element based on XPath expressions.
        single_offer_details = {}
        single_offer_details['url'] = html_element.xpath('@data-href')[0]
        single_offer_details['name'] = html_element.xpath('div[@class="offer-item__content ds-details-container"]/div[@class="offer-item__title"]/h2/a')[0].text_content().strip()
        single_offer_details['subtitle'] = html_element.xpath('div[@class="offer-item__content ds-details-container"]/div[@class="offer-item__title"]/h3')[0].text_content().strip()
        single_offer_details['price'] = " ".join(html_element.xpath('div[@class="offer-item__content ds-details-container"]/div[@class="offer-item__price"]/div/div/span')[0].text_content().strip().split())
        single_offer_details['foto'] = html_element.xpath('div[@class="offer-item__photo ds-photo-container"]/a/img/@data-srcset')[0].split(';s=')[0]
        single_offer_details['offer_details'] = html_element.xpath('div[@class="offer-item__content ds-details-container"]/*[@class="ds-params-block"]/*[@class="ds-param"]/span/text()')
        single_offer_details['details_string'] = ' • '.join(single_offer_details['offer_details'])
        return single_offer_details

    def get_everything(self):
        # Iterate through all pages, saving everything into previous_offers (later saved to JSON).
        for i in range(1, self.get_number_of_pages() + 1):
            self.get_offers(i)
        return self.new_offers

    def get_vin_and_phone(self, id):
        # Digging in the website's code revealed that the VIN and phone number are available
        # under these URLs without any additional authentication.
        vin_url = "https://www.otomoto.pl/ajax/misc/vin/"
        phone_url = "https://www.otomoto.pl/ajax/misc/contact/multi_phone/{}/0"
        request = requests.get(vin_url + id)
        vin = request.text.replace("\"", "")
        request = requests.get(phone_url.format(id))
        phone = json.loads(request.text)["value"].replace(" ", "")
        return vin, phone

    def create_html(self):
        offer_html = []
        for key in self.new_offers:
            offer_html.append(self.html_template.format(**self.new_offers[key]))
        return ''.join(offer_html)


if __name__ == "__main__":
    previos_offers = {}
    url = "https://www.otomoto.pl/osobowe/volkswagen/tiguan/seg-suv/od-2017/?search%5Bfilter_enum_generation%5D%5B0%5D=gen-ii-2016&search%5Bfilter_float_year%3Ato%5D=2018&search%5Bfilter_float_mileage%3Ato%5D=55000&search%5Bfilter_float_engine_power%3Afrom%5D=160&search%5Bfilter_enum_gearbox%5D%5B0%5D=automatic&search%5Bfilter_enum_gearbox%5D%5B1%5D=cvt&search%5Bfilter_enum_gearbox%5D%5B2%5D=dual-clutch&search%5Bfilter_enum_gearbox%5D%5B3%5D=semi-automatic&search%5Bfilter_enum_gearbox%5D%5B4%5D=automatic-stepless-sequential&search%5Bfilter_enum_gearbox%5D%5B5%5D=automatic-stepless&search%5Bfilter_enum_gearbox%5D%5B6%5D=automatic-sequential&search%5Bfilter_enum_gearbox%5D%5B7%5D=automated-manual&search%5Bfilter_enum_gearbox%5D%5B8%5D=direct-no-gearbox&search%5Bfilter_enum_country_origin%5D%5B0%5D=pl&search%5Border%5D=created_at%3Adesc&search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
    scrapper = OtomotoScrapper(url, previos_offers)
    scrapper.get_everything()
    print(scrapper.create_html())
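The actual html_template is omitted above. Purely as an illustration of what create_html expects, here is a made-up minimal template wired to the keys produced by get_single_offer (the real template is much longer):

# Illustrative only – the production template is far richer than this.
html_template = """
<div class="offer">
  <a href="{url}"><img src="{foto}" alt="{name}"></a>
  <h2><a href="{url}">{name}</a></h2>
  <h3>{subtitle}</h3>
  <p>{price}</p>
  <p>{details_string}</p>
</div>
"""

# Made-up sample offer with the same keys that get_single_offer() returns.
sample_offer = {
    "url": "https://www.otomoto.pl/oferta/example-ID6ABCDE.html",
    "name": "Volkswagen Tiguan",
    "subtitle": "2.0 TSI 4Motion",
    "price": "119 900 PLN",
    "foto": "https://example.com/photo.jpg",
    "offer_details": ["2017", "45 000 km", "Benzyna"],
    "details_string": "2017 • 45 000 km • Benzyna",
}

print(html_template.format(**sample_offer))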
GetPages:
import logging
from ..shared_code import ScrappyScrapper
import azure.functions as func
from lxml import html
import requests


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')

    if name:
        try:
            previos_offers = {}
            scrapper = ScrappyScrapper.OtomotoScrapper(name, previos_offers)
            return func.HttpResponse(str(scrapper.get_number_of_pages()))
        except Exception as e:
            return func.HttpResponse(str(e))
    else:
        return func.HttpResponse(
            "Please pass a name on the query string or in the request body",
            status_code=400
        )
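For reference, this is roughly how the Logic App (or anything else) calls GetPages over HTTP; the function app host name, route and key below are placeholders, and the response is the page count as plain text, which the registration flow compares against the limit.

import requests

# Placeholders – the real function app name and key are not published.
FUNCTION_URL = "https://<function-app>.azurewebsites.net/api/GetPages"
FUNCTION_KEY = "<function-key>"

search_url = "https://www.otomoto.pl/osobowe/volkswagen/tiguan/"  # any otomoto.pl search URL

response = requests.get(FUNCTION_URL, params={"code": FUNCTION_KEY, "name": search_url})
pages = int(response.text)  # e.g. 7
print("Accepted" if pages < 10 else "Too many pages – narrow the search")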
2. Email confirmation
Logic app view:
When a user clicks the confirmation link, a check validates the email’s ID. This (hopefully) makes it impossible to confirm an email address without actual access to it. Afterwards the process calculates the SHA-256 hash of the email (using the Hash function); this is done so that I only have one place to delete from. The files used for the confirmation are then deleted, a welcome email with a subscription-removal link is sent, and a response confirming the process is shown. The last step is downloading all the offers that exist at the moment of confirmation (ScrapHttp).
Hash:
import logging
from hashlib import sha256
import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')

    if name:
        name = name.upper()
        name = sha256(name.encode()).hexdigest()
        return func.HttpResponse(name)
    else:
        return func.HttpResponse(
            "Please pass a name on the query string or in the request body",
            status_code=400
        )
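The hash can be reproduced locally; note the upper-casing, which makes it independent of how the subscriber typed their address. The blob path at the end mirrors the offers/<hash> naming used by ScrapHttp below and is only meant as an illustration.

from hashlib import sha256

email = "Someone@Example.com"

# Same rule as the Hash function: upper-case first, then SHA-256.
email_hash = sha256(email.upper().encode()).hexdigest()

print(email_hash)
print("offers/" + email_hash)  # where ScrapHttp keeps the offers already sent to this subscriber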
ScrapHttp:
import logging
from ..shared_code import blob
from ..shared_code import ScrappyScapper
import azure.functions as func
import json


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')

    if name:
        return func.HttpResponse(download_data_for_email(name))
    else:
        return func.HttpResponse(
            "Please pass a name on the query string or in the request body",
            status_code=400
        )


def download_data_for_email(email):
    po_file = blob.download_blob('offers/' + email)
    if po_file:
        previos_offers = json.loads(po_file)
    else:
        previos_offers = {}
    url = blob.download_blob('validatedqueries/' + email)
    scrapper = ScrappyScapper.OtomotoScrapper(url, previos_offers)
    new_offers = scrapper.get_everything()
    previos_offers = scrapper.previous_offers
    blob.upload_to_blob(json.dumps(previos_offers), 'offers/' + email)
    return scrapper.create_html()
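The blob module imported from shared_code is not shown in this post. Below is a minimal sketch of how such a helper could look with the azure-storage-blob SDK; the container name and the app setting holding the connection string are assumptions, not necessarily what runs in production.

# shared_code/blob.py – a minimal sketch, not the exact module used in production.
import os

from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = os.environ["AzureWebJobsStorage"]  # assumed app setting
CONTAINER_NAME = "scrappy"                             # made-up container name


def _blob_client(path):
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    return service.get_blob_client(container=CONTAINER_NAME, blob=path)


def download_blob(path):
    # Return the blob content as text, or None when it does not exist
    # (download_data_for_email relies on a falsy value for new subscribers).
    try:
        return _blob_client(path).download_blob().readall().decode("utf-8")
    except ResourceNotFoundError:
        return None


def upload_to_blob(data, path):
    # Overwrite the blob with the given string (the serialized offers dictionary).
    _blob_client(path).upload_blob(data, overwrite=True)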
3. Sending out new offers
Logic app view:
This app is triggered a few times a day. For each email hash in a selected directory (directories in blob storage are just a prefix of the blob name), the offers are downloaded. If there are no new offers, nothing happens. If there are new offers, the email address matching the hash is looked up (this information is kept in blob storage) and an email is sent.
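The orchestration itself lives in the Logic App, but its loop can be mirrored in a few lines of Python to make the flow concrete. The container name, the emails/ prefix and the ScrapHttp endpoint below are assumptions for illustration.

import os

import requests
from azure.storage.blob import ContainerClient

# Assumed names – the real container, prefixes and function endpoint are not published.
container = ContainerClient.from_connection_string(
    os.environ["AzureWebJobsStorage"], container_name="scrappy"
)
SCRAP_HTTP_URL = "https://<function-app>.azurewebsites.net/api/ScrapHttp"

# One run of the recurring flow: for every subscriber hash, ask ScrapHttp for the HTML
# of new offers and send an email only if there is anything new.
for item in container.list_blobs(name_starts_with="offers/"):
    email_hash = item.name.split("/", 1)[1]
    new_offers_html = requests.get(SCRAP_HTTP_URL, params={"name": email_hash}).text
    if new_offers_html.strip():
        # The address matching the hash is kept in blob storage (path assumed here);
        # the real Logic App sends the email through a connector.
        address = container.download_blob("emails/" + email_hash).readall().decode("utf-8")
        print(f"Would email {address} with the new offers")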
4. Subscription removal
Logic app view:
This process starts with a check whether the email exists at all. If not, an appropriate response is shown. If the email does exist, another check confirms its ID; this prevents deleting arbitrary emails. Once this is confirmed, the hash is calculated and all the files (the hash/email pair and the email ID) are deleted. The downloaded offers are then copied to an archive. Afterwards a confirmation response is shown.
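As before, the real work happens in a Logic App; the sketch below only mirrors the described steps, reusing the hashing rule from the Hash function and an assumed blob layout (the archive/, emails/ and ids/ prefixes are made up).

import os
from hashlib import sha256

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AzureWebJobsStorage"], container_name="scrappy"  # assumed names
)


def remove_subscription(email):
    # Same hashing rule as the Hash function: upper-case, then SHA-256.
    email_hash = sha256(email.upper().encode()).hexdigest()

    # Copy the offers collected so far into an archive before cleaning up.
    offers = container.download_blob("offers/" + email_hash).readall()
    container.upload_blob("archive/" + email_hash, offers, overwrite=True)

    # Delete the working files: the collected offers, the hash/email pair and the email ID.
    for path in ("offers/" + email_hash, "emails/" + email_hash, "ids/" + email_hash):
        container.delete_blob(path)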
And that is it! My budget is 5 USD per month, so use it! I closely monitor all the emails, as I know the system can probably be bypassed. If someone pushes it to the limits, I will ban them! If you need more frequent notifications, let me know!
I am also starting a newsletter. If you want to be notified of more pieces of work like this, subscribe below.