Offer monitor for otomoto.pl [Service + Architecture]
Many people told me that the way I bought my car was very interesting, so I decided to create a solution for everyone! Using my service you will receive an email with new offers from otomoto.pl 4 times a day for 7 days.
The only limitation in place is the number of result pages you get on otomoto.pl. Select all the filters (make, model, etc.) and see how many pages this generates. If it is fewer than 10, copy the URL and paste it into the form. One note: the emails do not seem to work with Outlook.
The form is available here.
Otodom.pl and gratka.pl are being worked on right now.
This is where the technical part starts.
The script that I created previously was based on an RPi + Telegram. That is not a perfect setup if you want to create a solution available to everyone, so I decided to swap Telegram for good ol’ email, and the RPi for Azure. The cloud helps here, as I will not have to open any ports or worry about constant internet access. The big disadvantage is that I will actually have to pay out of my own pocket – hence the limitation.
The whole solution is based on 3 Azure services:
- Blob storage – functions as a database. I know that this is not perfect, but it is super cheap and I have some Python scripts readily available.
- Logic App – three apps that manage the registration process and the downloading of offers.
- Function – a service that lets me run Python without any server (the serverless thingy). This way I do not have to set up a new VM.
Using those services, I created four separate (yet cooperating) processes:
- Email registration
- Email confirmation
- Sending out new offers
- Subscription removal
Below you can find details of each process and also pieces of code.
1. Email registration
Logic app view:
The registration process is rather easy to follow. A user registers through the form, which sends a GET request to the Logic App. The request passes the user’s email and the URL from otomoto.pl. The URL is checked by a function called GetPages (code below) to get the number of result pages. If it returns more than 10 pages, the user gets an appropriate response. If it returns fewer, files are created in blob storage, the email gets an ID, a confirmation email with a link is sent, and an appropriate response is shown. Blob storage has a policy that deletes these files (the offer URL and the email) after 2 days, so if the user clicks the confirmation link after 2 days, it will no longer work.
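To make the registration call concrete, here is a minimal sketch of how the form’s GET request could be reproduced; the Logic App trigger URL and the parameter names (email, url) are placeholders rather than the real endpoint.

import requests

# Placeholder trigger URL – the real Logic App callback URL (with its signature) is generated by Azure.
LOGIC_APP_TRIGGER_URL = "https://example-region.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke"

# Assumed parameter names: the form passes the user's email and the otomoto.pl search URL.
params = {
    "email": "user@example.com",
    "url": "https://www.otomoto.pl/osobowe/volkswagen/tiguan/",  # any search URL with fewer than 10 result pages
}

response = requests.get(LOGIC_APP_TRIGGER_URL, params=params)
print(response.status_code, response.text)  # the Logic App replies with a human-readable message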
OtomotoScrapper class, which contains all the methods:
import requests
from lxml import html
import os
import json
import time
import datetime


class OtomotoScrapper:
    def __init__(self, url, previous_offers):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
        self.url = url
        self.previous_offers = previous_offers
        self.new_offers = {}
        self.no_of_pages = self.get_number_of_pages()
        self.html_template = """ Super long html_template was here. """

    def get_number_of_pages(self):
        # Retrieve the maximum number of result pages on the website; used when iterating through n pages.
        url = self.url
        request = requests.get(url, headers=self.headers)
        tree = html.fromstring(request.text)
        xpath_offer_details = '//div[@class="offers list"]/article'
        max_page = tree.xpath('//ul[@class="om-pager rel"]/li[last()-1]/a/span/text()')
        offers = tree.xpath(xpath_offer_details)
        if not max_page and offers:
            return 1
        elif max_page:
            max_page = max_page[0].strip()
            return int(max_page)
        else:
            return 0

    def get_offers(self, n):
        url = str(self.url) + "&page=" + str(n)
        request = requests.get(url, headers=self.headers)
        tree = html.fromstring(request.text)
        xpath_offer_details = '//div[@class="offers list"]/article'
        xpath_url = '//div[@class="offers list"]/article/@data-href'
        offer_details = tree.xpath(xpath_offer_details)
        list_of_urls = tree.xpath(xpath_url)
        for i, detail in enumerate(offer_details):
            try:
                if not list_of_urls[i] in self.previous_offers:
                    # Check if the URL was present before; if not, download all the details.
                    self.previous_offers[list_of_urls[i]] = self.get_single_offer(detail)
                    self.new_offers[list_of_urls[i]] = self.get_single_offer(detail)
                    # VIN and phone require separate logic.
                    offer_id = list_of_urls[i].split("-ID")[1].split(".html")[0]
            except Exception as e:
                print(e)

    def get_single_offer(self, html_element):
        # Retrieve all offer details from html_element based on XPath expressions.
        single_offer_details = {}
        single_offer_details['url'] = html_element.xpath('@data-href')[0]
        single_offer_details['name'] = html_element.xpath('div[@class="offer-item__content ds-details-container"]/div[@class="offer-item__title"]/h2/a')[0].text_content().strip()
        single_offer_details['subtitle'] = html_element.xpath('div[@class="offer-item__content ds-details-container"]/div[@class="offer-item__title"]/h3')[0].text_content().strip()
        single_offer_details['price'] = " ".join(html_element.xpath('div[@class="offer-item__content ds-details-container"]/div[@class="offer-item__price"]/div/div/span')[0].text_content().strip().split())
        single_offer_details['foto'] = html_element.xpath('div[@class="offer-item__photo ds-photo-container"]/a/img/@data-srcset')[0].split(';s=')[0]
        single_offer_details['offer_details'] = html_element.xpath('div[@class="offer-item__content ds-details-container"]/*[@class="ds-params-block"]/*[@class="ds-param"]/span/text()')
        single_offer_details['details_string'] = ' • '.join(single_offer_details['offer_details'])
        return single_offer_details

    def get_everything(self):
        # Iterate through all pages, saving everything into previous_offers (later saved to JSON).
        for i in range(1, self.get_number_of_pages() + 1):
            self.get_offers(i)
        return self.new_offers

    def get_vin_and_phone(self, id):
        # Digging in the website's code revealed that the VIN and phone number are available
        # under these URLs without any additional authentication.
        vin_url = "https://www.otomoto.pl/ajax/misc/vin/"
        phone_url = "https://www.otomoto.pl/ajax/misc/contact/multi_phone/{}/0"
        request = requests.get(vin_url + id)
        vin = request.text.replace("\"", "")
        request = requests.get(phone_url.format(id))
        phone = json.loads(request.text)["value"].replace(" ", "")
        return vin, phone

    def create_html(self):
        offer_html = []
        for key in self.new_offers:
            offer_html.append(self.html_template.format(**self.new_offers[key]))
        return ''.join(offer_html)


if __name__ == "__main__":
    previos_offers = {}
    url = "https://www.otomoto.pl/osobowe/volkswagen/tiguan/seg-suv/od-2017/?search%5Bfilter_enum_generation%5D%5B0%5D=gen-ii-2016&search%5Bfilter_float_year%3Ato%5D=2018&search%5Bfilter_float_mileage%3Ato%5D=55000&search%5Bfilter_float_engine_power%3Afrom%5D=160&search%5Bfilter_enum_gearbox%5D%5B0%5D=automatic&search%5Bfilter_enum_gearbox%5D%5B1%5D=cvt&search%5Bfilter_enum_gearbox%5D%5B2%5D=dual-clutch&search%5Bfilter_enum_gearbox%5D%5B3%5D=semi-automatic&search%5Bfilter_enum_gearbox%5D%5B4%5D=automatic-stepless-sequential&search%5Bfilter_enum_gearbox%5D%5B5%5D=automatic-stepless&search%5Bfilter_enum_gearbox%5D%5B6%5D=automatic-sequential&search%5Bfilter_enum_gearbox%5D%5B7%5D=automated-manual&search%5Bfilter_enum_gearbox%5D%5B8%5D=direct-no-gearbox&search%5Bfilter_enum_country_origin%5D%5B0%5D=pl&search%5Border%5D=created_at%3Adesc&search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
    scrapper = OtomotoScrapper(url, previos_offers)
    scrapper.get_everything()
    print(scrapper.create_html())
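The actual html_template is omitted above. Purely as an illustration of what create_html expects, here is a made-up minimal template wired to the keys produced by get_single_offer (the real template is much longer):

# Illustrative only – the production template is far richer than this.
html_template = """
<div class="offer">
  <a href="{url}"><img src="{foto}" alt="{name}"></a>
  <h2><a href="{url}">{name}</a></h2>
  <h3>{subtitle}</h3>
  <p>{price}</p>
  <p>{details_string}</p>
</div>
"""

# Made-up sample offer with the same keys that get_single_offer() returns.
sample_offer = {
    "url": "https://www.otomoto.pl/oferta/example-ID6ABCDE.html",
    "name": "Volkswagen Tiguan",
    "subtitle": "2.0 TSI 4Motion",
    "price": "119 900 PLN",
    "foto": "https://example.com/photo.jpg",
    "offer_details": ["2017", "45 000 km", "Benzyna"],
    "details_string": "2017 • 45 000 km • Benzyna",
}

print(html_template.format(**sample_offer))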
GetPages:
import logging
from ..shared_code import ScrappyScrapper
import azure.functions as func
from lxml import html
import requests


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')

    if name:
        try:
            previos_offers = {}
            scrapper = ScrappyScrapper.OtomotoScrapper(name, previos_offers)
            return func.HttpResponse(str(scrapper.get_number_of_pages()))
        except Exception as e:
            return func.HttpResponse(str(e))
    else:
        return func.HttpResponse(
            "Please pass a name on the query string or in the request body",
            status_code=400
        )
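For reference, this is roughly how the Logic App (or anything else) calls GetPages over HTTP; the function app host name, route and key below are placeholders, and the response is the page count as plain text, which the registration flow compares against the limit.

import requests

# Placeholders – the real function app name and key are not published.
FUNCTION_URL = "https://<function-app>.azurewebsites.net/api/GetPages"
FUNCTION_KEY = "<function-key>"

search_url = "https://www.otomoto.pl/osobowe/volkswagen/tiguan/"  # any otomoto.pl search URL

response = requests.get(FUNCTION_URL, params={"code": FUNCTION_KEY, "name": search_url})
pages = int(response.text)  # e.g. 7
print("Accepted" if pages < 10 else "Too many pages – narrow the search")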
2. Email confirmation
Logic app view:
When a user clicks the confirmation link, a check validates the email’s ID. This (hopefully) makes it impossible to confirm an email address without actual access to it. Afterwards the process calculates the SHA-256 hash of the email (using the Hash function); this is done so that I only have one place to delete from. The files used for the confirmation are then deleted, a welcome email with a subscription-removal link is sent, and a response confirming the process is shown. The last step is downloading all the offers that exist at the moment of confirmation (ScrapHttp).
Hash:
import logging
from hashlib import sha256
import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')

    if name:
        name = name.upper()
        name = sha256(name.encode()).hexdigest()
        return func.HttpResponse(name)
    else:
        return func.HttpResponse(
            "Please pass a name on the query string or in the request body",
            status_code=400
        )
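The hash can be reproduced locally; note the upper-casing, which makes it independent of how the subscriber typed their address. The blob path at the end mirrors the offers/<hash> naming used by ScrapHttp below and is only meant as an illustration.

from hashlib import sha256

email = "Someone@Example.com"

# Same rule as the Hash function: upper-case first, then SHA-256.
email_hash = sha256(email.upper().encode()).hexdigest()

print(email_hash)
print("offers/" + email_hash)  # where ScrapHttp keeps the offers already sent to this subscriber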
ScrapHttp:
import logging
from ..shared_code import blob
from ..shared_code import ScrappyScapper
import azure.functions as func
import json


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')

    if name:
        return func.HttpResponse(download_data_for_email(name))
    else:
        return func.HttpResponse(
            "Please pass a name on the query string or in the request body",
            status_code=400
        )


def download_data_for_email(email):
    po_file = blob.download_blob('offers/' + email)
    if po_file:
        previos_offers = json.loads(po_file)
    else:
        previos_offers = {}
    url = blob.download_blob('validatedqueries/' + email)
    scrapper = ScrappyScapper.OtomotoScrapper(url, previos_offers)
    new_offers = scrapper.get_everything()
    previos_offers = scrapper.previous_offers
    blob.upload_to_blob(json.dumps(previos_offers), 'offers/' + email)
    return scrapper.create_html()
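The blob module imported from shared_code is not shown in this post. Below is a minimal sketch of how such a helper could look with the azure-storage-blob SDK; the container name and the app setting holding the connection string are assumptions, not necessarily what runs in production.

# shared_code/blob.py – a minimal sketch, not the exact module used in production.
import os

from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = os.environ["AzureWebJobsStorage"]  # assumed app setting
CONTAINER_NAME = "scrappy"                             # made-up container name


def _blob_client(path):
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    return service.get_blob_client(container=CONTAINER_NAME, blob=path)


def download_blob(path):
    # Return the blob content as text, or None when it does not exist
    # (download_data_for_email relies on a falsy value for new subscribers).
    try:
        return _blob_client(path).download_blob().readall().decode("utf-8")
    except ResourceNotFoundError:
        return None


def upload_to_blob(data, path):
    # Overwrite the blob with the given string (the serialized offers dictionary).
    _blob_client(path).upload_blob(data, overwrite=True)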
3. Sending out new offers
Logic app view:
This app is triggered a few times a day. For each email hash in a selected directory (directories in blob storage are just a prefix of the blob name), the offers are downloaded. If there are no new offers, nothing happens. If there are new offers, the email address matching the hash is looked up (this information is kept in blob storage) and an email is sent.
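The orchestration itself lives in the Logic App, but its loop can be mirrored in a few lines of Python to make the flow concrete. The container name, the emails/ prefix and the ScrapHttp endpoint below are assumptions for illustration.

import os

import requests
from azure.storage.blob import ContainerClient

# Assumed names – the real container, prefixes and function endpoint are not published.
container = ContainerClient.from_connection_string(
    os.environ["AzureWebJobsStorage"], container_name="scrappy"
)
SCRAP_HTTP_URL = "https://<function-app>.azurewebsites.net/api/ScrapHttp"

# One run of the recurring flow: for every subscriber hash, ask ScrapHttp for the HTML
# of new offers and send an email only if there is anything new.
for item in container.list_blobs(name_starts_with="offers/"):
    email_hash = item.name.split("/", 1)[1]
    new_offers_html = requests.get(SCRAP_HTTP_URL, params={"name": email_hash}).text
    if new_offers_html.strip():
        # The address matching the hash is kept in blob storage (path assumed here);
        # the real Logic App sends the email through a connector.
        address = container.download_blob("emails/" + email_hash).readall().decode("utf-8")
        print(f"Would email {address} with the new offers")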
4. Subscription removal
Logic app view:
This process starts with a check whether the email exists at all. If not, an appropriate response is shown. If the email does exist, another check confirms its ID; this prevents deleting arbitrary emails. Once this is confirmed, the hash is calculated and all the files (the hash/email pair and the email ID) are deleted. The downloaded offers are then copied to an archive. Afterwards a confirmation response is shown.
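As before, the real work happens in a Logic App; the sketch below only mirrors the described steps, reusing the hashing rule from the Hash function and an assumed blob layout (the archive/, emails/ and ids/ prefixes are made up).

import os
from hashlib import sha256

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AzureWebJobsStorage"], container_name="scrappy"  # assumed names
)


def remove_subscription(email):
    # Same hashing rule as the Hash function: upper-case, then SHA-256.
    email_hash = sha256(email.upper().encode()).hexdigest()

    # Copy the offers collected so far into an archive before cleaning up.
    offers = container.download_blob("offers/" + email_hash).readall()
    container.upload_blob("archive/" + email_hash, offers, overwrite=True)

    # Delete the working files: the collected offers, the hash/email pair and the email ID.
    for path in ("offers/" + email_hash, "emails/" + email_hash, "ids/" + email_hash):
        container.delete_blob(path)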
And that is it! My budget is 5 USD per month, so use it! I closely monitor all the emails, as I know the system can probably be bypassed. If someone pushes it to the limits, I will ban them! If you need more frequent notifications, let me know!
I am also starting a newsletter. If you want to be notified of more pieces of work like this, subscribe below.