Offer monitor for otomoto.pl [Service + architecture]

Many people told me that the way I bought my car was very interesting, so I decided to build a solution for everyone! With my service you can receive new otomoto.pl offers by email 4 times a day for 7 days, each message containing only the offers that appeared since the previous one.

The restriction I currently apply is a limit on the number of result pages your query generates. So select all your filters on otomoto.pl and check how many pages of offers you get. If it is fewer than 10, copy the address and paste it into the ready-made page with the solution. One caveat: the emails do not render correctly in Outlook.

You can find it here.

Otodom.pl and gratka.pl are in the works.

This is where the technical part begins.

The script I wrote for my own needs ran on an RPI + Telegram. That is not an ideal setup for a solution meant for everyone. I decided to replace Telegram with email and the RPI with Azure. The cloud makes things easier, because I do not have to expose the RPI to the world or worry about a stable internet connection for it. The clear downside is that I pay for it out of my own pocket, hence all the restrictions.

The whole solution is based on 3 Azure services:

Blob storage: works as the database. I know it is not an ideal solution, but it is cheap and I have ready-made pieces of Python to handle it.
Logic App: three apps that manage the registration process and also trigger the downloading of offers.
Function: a service that lets me run Python without any server (serverless). This way I do not have to set up an additional machine.

Using these services I built four separate (but cooperating) processes:

Email address registration
Email address confirmation
Periodic sending of otomoto.pl offers
Unsubscribing

Below are the details of each process, along with pieces of code.

1. Email address registration

Logic app view:

The registration process is very simple. The user signs up through a form that generates a GET request sent to the logic app. The GET request carries the user's email and their otomoto URL. The URL is checked by the GetPages function (code below), which determines how many result pages the query has. If it has more than 10 pages, the user gets an appropriate response. If it has fewer than 10 pages, files are created in blob storage, an id is assigned to the email, an email with a confirmation link is sent, and a confirmation response is shown. Blob storage has a policy that deletes the files (query and email) after 2 days. If the user clicks the link after those 2 days, the registration will fail.
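The decision made at registration can be sketched in a few lines. This is only an illustration, not the logic app itself: the function name `register`, the `pending/` key prefix, the use of `uuid4` for the id and the plain dict standing in for blob storage are all assumptions. The text does not say what happens at exactly 10 pages, so rejecting 10 or more is an assumption as well.

```python
import uuid

MAX_PAGES = 10  # limit from the text; the behaviour at exactly 10 is assumed


def register(email, url, page_count, storage):
    """Return (accepted, id_or_message) and store a pending registration.

    `storage` is a plain dict standing in for blob storage (assumption).
    """
    if page_count >= MAX_PAGES:
        return False, "The query returns too many pages."
    # Random id that later has to come back in the confirmation link
    token = uuid.uuid4().hex
    storage["pending/" + email] = {"url": url, "id": token}
    return True, token
```

The returned id is what the confirmation link would carry; the 2-day cleanup is left to the storage policy and is not modelled here.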

The OtomotoScrapper class, which contains all the necessary methods:

import json

import requests
from lxml import html


class OtomotoScrapper:
    
    
    def __init__(self, url, previous_offers):
        self.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
        self.url = url
        self.previous_offers = previous_offers
        self.new_offers = {}
        self.no_of_pages = self.get_number_of_pages()
        self.html_template =  """
	Super long html_template was here.
    """
        

        

    def get_number_of_pages(self):
        #This function will just retrieve the maximum number of pages on the website. This is used when iterating through n pages

        url = self.url
        
        request = requests.get(url, headers = self.headers)
        tree = html.fromstring(request.text)
        xpath_offer_details = '//div[@class="offers list"]/article'
        max_page= tree.xpath('//ul[@class="om-pager rel"]/li[last()-1]/a/span/text()')
        offers = tree.xpath(xpath_offer_details)
        if not max_page and offers:
            return 1
        elif max_page:
            max_page = max_page[0].strip()
            #print(max_page)
            return int(max_page)
        else:
            return 0


    def get_offers(self, n):
        #Download page n of the results and record the offers that were not seen before
        #Note: "&page=" assumes that self.url already contains a query string
        url = str(self.url) + "&page=" + str(n)
        request = requests.get(url, headers = self.headers)
        tree = html.fromstring(request.text)

        xpath_offer_details = '//div[@class="offers list"]/article'

        for detail in tree.xpath(xpath_offer_details):
            try:
                offer_url = detail.xpath('@data-href')[0]
                if offer_url not in self.previous_offers: #check if the URL was present before, if not parse all the details
                    #Parse the offer once and store it in both dictionaries
                    offer = self.get_single_offer(detail)
                    self.previous_offers[offer_url] = offer
                    self.new_offers[offer_url] = offer
                    #VIN and phone require separate requests (see get_vin_and_phone)

            except Exception as e:
                #A single malformed offer should not stop the whole page
                print("Could not parse offer:", e)


    def get_single_offer(self,html_element):
        #Extract all offer details from html_element using XPath queries relative to it
        single_offer_details = {}
        single_offer_details['url'] = html_element.xpath('@data-href')[0]

        single_offer_details['name'] = html_element.xpath('div[@class="offer-item__content ds-details-container"]/div[@class="offer-item__title"]/h2/a')[0].text_content().strip()
        single_offer_details['subtitle'] = html_element.xpath('div[@class="offer-item__content ds-details-container"]/div[@class="offer-item__title"]/h3')[0].text_content().strip()
        single_offer_details['price'] = " ".join(html_element.xpath('div[@class="offer-item__content ds-details-container"]/div[@class="offer-item__price"]/div/div/span')[0].text_content().strip().split())
        single_offer_details['foto'] = html_element.xpath('div[@class="offer-item__photo  ds-photo-container"]/a/img/@data-srcset')[0].split(';s=')[0]
        
        single_offer_details['offer_details'] =  html_element.xpath('div[@class="offer-item__content ds-details-container"]/*[@class="ds-params-block"]/*[@class="ds-param"]/span/text()')
        single_offer_details['details_string'] = ' • '.join(single_offer_details['offer_details'])
        return single_offer_details

    
    def get_everything(self):

        #Iterate through all result pages; get_offers updates previous_offers (saved to JSON later) and collects unseen offers in new_offers
        #The page count was already fetched in __init__, so it is reused here instead of requesting it again
        for i in range(1,self.no_of_pages+1):
            self.get_offers(i)

        return self.new_offers
    


    def get_vin_and_phone(self, id):
        #Digging in website's code let me discover that Vin and Phone number are available under those URLs without any additional authentication
        vin_url = "https://www.otomoto.pl/ajax/misc/vin/"
        phone_url = "https://www.otomoto.pl/ajax/misc/contact/multi_phone/{}/0"

        request = requests.get(vin_url+id)

    
        vin = request.text.replace("\"","")
        request = requests.get(phone_url.format(id))

        
        phone = json.loads(request.text)["value"].replace(" ","")

        return vin, phone

    def create_html(self):
        offer_html = []
        
        for key in self.new_offers:

            offer_html.append(self.html_template.format(**self.new_offers[key]))
            
        return ''.join(offer_html)




#get_everything()
#print(get_number_of_pages())
if __name__ == "__main__":
    
    previos_offers={}
    url = "https://www.otomoto.pl/osobowe/volkswagen/tiguan/seg-suv/od-2017/?search%5Bfilter_enum_generation%5D%5B0%5D=gen-ii-2016&search%5Bfilter_float_year%3Ato%5D=2018&search%5Bfilter_float_mileage%3Ato%5D=55000&search%5Bfilter_float_engine_power%3Afrom%5D=160&search%5Bfilter_enum_gearbox%5D%5B0%5D=automatic&search%5Bfilter_enum_gearbox%5D%5B1%5D=cvt&search%5Bfilter_enum_gearbox%5D%5B2%5D=dual-clutch&search%5Bfilter_enum_gearbox%5D%5B3%5D=semi-automatic&search%5Bfilter_enum_gearbox%5D%5B4%5D=automatic-stepless-sequential&search%5Bfilter_enum_gearbox%5D%5B5%5D=automatic-stepless&search%5Bfilter_enum_gearbox%5D%5B6%5D=automatic-sequential&search%5Bfilter_enum_gearbox%5D%5B7%5D=automated-manual&search%5Bfilter_enum_gearbox%5D%5B8%5D=direct-no-gearbox&search%5Bfilter_enum_country_origin%5D%5B0%5D=pl&search%5Border%5D=created_at%3Adesc&search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
    scrapper = OtomotoScrapper(url, previos_offers)
    scrapper.get_everything()
    print(scrapper.create_html())
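
get_offers builds the page address by appending "&page=" to the search URL, which only works when the URL already has a query string. A more defensive variant could look like the sketch below; `with_page` is a hypothetical helper, not part of the class above, and it relies only on the standard library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # Keep every existing parameter (including blank ones) except a stale "page"
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

With this helper the same call works for a bare listing URL and for a URL that already carries filters or an old page number.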

The GetPages function:

import logging
from ..shared_code import ScrappyScrapper

import azure.functions as func
from lxml import html
import requests


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')

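    if name:
        try:
            #"name" carries the otomoto.pl search URL
            previos_offers={}
            scrapper = ScrappyScrapper.OtomotoScrapper(name, previos_offers)
            return func.HttpResponse(str(scrapper.get_number_of_pages()))

        except Exception as e:
            #HttpResponse expects str or bytes, not an exception object
            return func.HttpResponse(str(e), status_code=500)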
        
    else:
        return func.HttpResponse(
             "Please pass a name on the query string or in the request body",
             status_code=400
        )

2. Email address confirmation

Logic app view:

When the user clicks the link in the email, the app checks whether the link contains the same id as the one in blob storage. This should (I hope) prevent confirming an arbitrary email address without access to the mailbox. Once that is verified, a hash of the email is computed (with the Hash function), so that there is only one place from which I need to delete files when a subscription is removed. The files used for registration are also deleted. Then an email with a confirmation and an unsubscribe link is sent, and a confirmation response is returned. Finally, the offers available at the moment of registration are downloaded (the ScrapHttp function).
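A minimal sketch of the confirmation step, under stated assumptions: the two dicts stand in for blob storage, the names `confirm` and `email_hash` are hypothetical, and the constant-time comparison with `hmac.compare_digest` is an addition of this sketch rather than something the logic app is known to do. The hashing mirrors the Hash function from this section (upper-case, then SHA-256).

```python
import hmac
from hashlib import sha256


def email_hash(email):
    # Same normalisation as the Hash function: upper-case, then SHA-256
    return sha256(email.upper().encode()).hexdigest()


def confirm(email, token, pending, subscriptions):
    """Move a pending registration to `subscriptions` if the id matches.

    Returns the email hash on success and None otherwise.
    """
    entry = pending.get(email)
    if entry is None:
        return None  # never registered, or already removed by the 2-day policy
    if not hmac.compare_digest(entry["id"].encode(), token.encode()):
        return None  # the link does not carry the stored id
    del pending[email]  # registration files are no longer needed
    key = email_hash(email)
    subscriptions[key] = {"email": email, "url": entry["url"]}
    return key
```

Because the email is upper-cased before hashing, addresses that differ only in letter case end up under the same key.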

The Hash function:

import logging
from hashlib import sha256
import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')

    if name:
        name = name.upper()
        name = sha256(name.encode()).hexdigest()
        return func.HttpResponse(name)
    else:
        return func.HttpResponse(
             "Please pass a name on the query string or in the request body",
             status_code=400
        )

The ScrapHttp function:

import json
import logging

import azure.functions as func

from ..shared_code import blob
from ..shared_code import ScrappyScrapper


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    #"name" carries the key under which the subscription is stored in blob storage
    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')

    if name:
        return func.HttpResponse(download_data_for_email(name))
    else:
        return func.HttpResponse(
             "Please pass a name on the query string or in the request body",
             status_code=400
        )

def download_data_for_email(email):
    #Load the offers seen so far; start from an empty dict on the first run
    po_file = blob.download_blob('offers/'+email)
    if po_file:
        previous_offers = json.loads(po_file)
    else:
        previous_offers = {}

    #The validated search URL is stored under the same key
    url = blob.download_blob('validatedqueries/'+email)

    scrapper = ScrappyScrapper.OtomotoScrapper(url, previous_offers)
    scrapper.get_everything()

    #Persist the updated set of known offers, then return HTML for the new ones only
    blob.upload_to_blob(json.dumps(scrapper.previous_offers),'offers/'+email)
    return scrapper.create_html()
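
The `blob` helper module is not shown in the post, so the sketch below only illustrates the contract that download_data_for_email relies on: a missing blob yields a falsy value, and the offer history is written back after every run. The in-memory `_store` dict and the function `new_offers_for` are assumptions made for the example; they are not the real helper.

```python
import json

_store = {}  # in-memory stand-in for the blob container (assumption)


def download_blob(name):
    # Returns None when the blob is missing; the caller treats that as "no history"
    return _store.get(name)


def upload_to_blob(data, name):
    _store[name] = data


def new_offers_for(key, current_offers):
    """Return offers not seen in earlier runs and persist the updated history."""
    raw = download_blob("offers/" + key)
    previous = json.loads(raw) if raw else {}
    fresh = {url: offer for url, offer in current_offers.items() if url not in previous}
    previous.update(fresh)
    upload_to_blob(json.dumps(previous), "offers/" + key)
    return fresh
```

Called twice with the same listing, the second call returns an empty dict, which is what keeps an offer from being mailed more than once.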

3. Periodic sending of otomoto.pl offers

Logic app view:

Several times a day the app is triggered automatically. For each email hash in the relevant "directory" (in blob storage a directory is just part of the name), new offers are downloaded. If there are no new offers, nothing happens. Otherwise the email address is looked up for the hash (this information is stored in blob storage) and the email is sent.
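One scheduled run can be pictured as a simple loop. This is a sketch under assumptions, not the logic app definition: `run_cycle`, `fetch_new_html` and `send_email` are hypothetical names, and the dict of subscriptions stands in for the hash "directory" in blob storage. It relies on the fact that create_html returns an empty string when there are no new offers.

```python
def run_cycle(subscriptions, fetch_new_html, send_email):
    """Send one email per subscription that has new offers; return how many were sent.

    subscriptions:  dict mapping email hash -> {"email": address}
    fetch_new_html: callable taking the hash and returning HTML ('' when nothing is new)
    send_email:     callable taking (address, html_body)
    """
    sent = 0
    for key, subscription in subscriptions.items():
        body = fetch_new_html(key)
        if not body:
            continue  # no new offers, so nothing happens for this subscriber
        send_email(subscription["email"], body)
        sent += 1
    return sent
```

Passing the fetch and send steps in as callables keeps the loop testable without Azure or a mail server.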

4. Unsubscribing

Logic app view:

The unsubscribe process starts by checking whether the email exists at all. If it does not, an appropriate message is shown. If the email exists, the app checks whether the id from the request matches the id of the email. This should (I hope) prevent someone from removing an email that is not theirs. Once that is verified, the hash of the email is computed and the files that map the hash to the email are deleted. The downloaded offers are copied to an archive and a confirmation response is sent.
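The same checks can be written down as a short function. As before this is only a sketch: the three dicts stand in for blob storage, the function name and status strings are made up for the example, and the constant-time comparison is an addition of the sketch.

```python
import hmac
from hashlib import sha256


def unsubscribe(email, token, subscribers, offers, archive):
    """Remove a subscription if the id from the link matches; return a status string.

    subscribers: dict mapping email -> id
    offers:      dict mapping email hash -> downloaded offers
    archive:     dict receiving a copy of the offers
    """
    if email not in subscribers:
        return "not found"
    if not hmac.compare_digest(subscribers[email].encode(), token.encode()):
        return "invalid id"
    key = sha256(email.upper().encode()).hexdigest()
    del subscribers[email]          # drop the email <-> hash mapping
    if key in offers:
        archive[key] = offers[key]  # keep a copy of what was downloaded
    return "unsubscribed"
```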

And that's it! My budget is 5 USD a month, so make use of it! I keep an eye on the registered emails all the time, because I know my limits can be bypassed. If someone overdoes it, I will ban them! If you need more frequent notifications, let me know!

I am starting a newsletter. If you would like to receive similar roundups and my analyses by email, you are welcome to join.

2 thoughts on “Offer monitor for otomoto.pl [Service + architecture]”

Bukszpan writes:

8 November 2020 at 18:09

It would be nice if something like this could be done for otodom, for example, or in that project with prices, so that you could set up e.g. 5 offers and get email notifications, like in the keepa browser add-on.

1. Michał writes:
  
  8 November 2020 at 22:28
  
  That is my plan, because it is a matter of copying the whole architecture and swapping in a script for otodom. But there is no time…

Michał Ćwiok

Analytics, cloud, privacy and all that sort of stuff.
