
What is the best way to scrape data from a website?

If you know Python, I recommend the beautiful soup, splinter, and pandas modules.

splinter automates filling in and fetching the web page (it drives a real browser, so it works with pages that need to execute JavaScript); beautifulsoup can then be used to parse the data, and a pandas DataFrame writes the data out in CSV or XLS format.
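As a minimal sketch of that pipeline (the URL and output file name are placeholders, and this assumes splinter can find a Firefox driver on your machine):

    from splinter import Browser
    from bs4 import BeautifulSoup
    import pandas as pd

    # Placeholder URL; substitute the page you actually want to scrape.
    with Browser() as browser:  # a real Firefox session, so javascript runs
        browser.visit("http://example.com/some-table-page")
        soup = BeautifulSoup(browser.html, "html.parser")

    # Parse the first HTML table into a header row and data rows.
    table = soup.find("table")
    header = [th.get_text().strip() for th in table.find_all("th")]
    rows = [[td.get_text().strip() for td in tr.find_all("td")]
            for tr in table.find_all("tr") if tr.find_all("td")]

    # Hand the rows to pandas and write them out as CSV.
    pd.DataFrame(rows, columns=header).to_csv("output.csv", index=False)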

I wrote the code below to do this; it wasn't too difficult. You'll need to install Python and the pandas, beautifulsoup, and splinter modules to use it (see the install command after the usage lines), and you need the Firefox browser installed. Usage is (from the command line):

    python script.py datestart dateend
    python script.py 16-Sep-2014 20-Sep-2014

datestart and dateend are optional.
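If you have pip, installing the dependencies should look something like the line below (note that beautifulsoup is published on PyPI as beautifulsoup4, and splinter drives Firefox through selenium, so you may need to install that too if pip doesn't pull it in automatically):

    pip install splinter beautifulsoup4 pandas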

It should also run by just double-clicking the script if your paths are set up correctly.

There isn't any error handling or the other things you would do in production code, but you get what you pay for :)
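If you do want a bit of robustness, here is a minimal sketch of a retry wrapper (retry_fetch is a hypothetical helper, not part of the script below):

    from time import sleep

    def retry_fetch(fetch, attempts=3, wait=5):
        # Call fetch() up to `attempts` times, pausing between failures, e.g.
        # retry_fetch(lambda: get_flight_table_for_city_on_date("20", "16-Sep-2014"))
        for _ in range(attempts):
            try:
                result = fetch()
                if result is not None:
                    return result
            except Exception:
                pass  # swallow and retry; log the error in real code
            sleep(wait)  # back off before the next attempt
        return None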

It runs a bit slowly: I've added long delays because the site has a slow response time, and it uses a real browser since the page has to execute JavaScript. (One way to trim the delays is sketched after the listing.)

    #!/usr/bin/python
    # -*- coding: cp1252 -*-

    from splinter import Browser
    from bs4 import BeautifulSoup
    from time import sleep
    from datetime import datetime, timedelta
    import pandas as pd
    import os
    import sys

    def get_header_and_columns_from_html(soup):
        # Pull the header cells and the data rows out of the first table on the page.
        table = soup.find('table')
        header = [h.getText().strip() for h in table.findAll('th')]
        all_columns = []
        for row in table.findAll('tr'):
            columns = [c.getText().strip() for c in row.findAll('td')]
            if columns:
                all_columns.append(columns)
        return header, all_columns

    def get_flight_table_for_city_on_date(citycode, date):
        url = "http://bristowgroup.com/clients/flight-status"
        with Browser() as browser:  # real Firefox session, so javascript runs
            browser.visit(url)
            sleep(3)
            # Select the departure city in the drop-down.
            city = '//select[@id="id_base"]/option[@value="{}"]'.format(citycode)
            browser.find_by_xpath(city)._element.click()
            sleep(3)
            # Fill in the date field; the trailing tab commits the value.
            browser.find_by_id('id_request_date').fill(date + "\t")
            sleep(3)
            browser.find_by_name('submit').click()
            if browser.is_text_present('Important Information', wait_time=7):
                soup = BeautifulSoup(browser.html, 'html.parser')
                try:
                    header, all_columns = get_header_and_columns_from_html(soup)
                    return pd.DataFrame(data=all_columns, columns=header)
                except Exception:
                    return None
            return None

    def get_dates():
        months = ['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

        now = datetime.now()
        date = "{}-{}-{}".format(now.day, months[now.month], now.year)

        if len(sys.argv) == 1:        # no arguments: just today
            return [date]
        elif len(sys.argv) == 2:      # one argument: that single date
            return [sys.argv[1]]
        elif len(sys.argv) == 3:      # two arguments: every date in the range
            d1 = datetime.strptime(sys.argv[1], "%d-%b-%Y").date()
            d2 = datetime.strptime(sys.argv[2], "%d-%b-%Y").date()
            delta = d2 - d1
            dates = []
            for i in range(delta.days + 1):
                d = d1 + timedelta(days=i)
                dates.append(d.strftime("%d-%b-%Y").lstrip('0'))
            return dates
        return None

    def create_folder(directory, foldername):
        if not directory:
            directory = os.path.curdir
        outdir = os.path.join(directory, foldername)
        if not os.path.exists(outdir):
            os.makedirs(outdir)
        return outdir

    if __name__ == '__main__':
        # To scrape only specific cities, comment out the ones you don't want.
        cities = {
            "20": "Bergen",
            "47": "Brønnøysund",
            "19": "Den Helder",
            "21": "Hammerfest",
            "30": "Humberside",
            "18": "Norwich",
            "2": "Scatsta",
            "13": "Sola",
        }

        dates = get_dates()

        # Replace with your preferred path; defaults to the directory the
        # script is run from.
        directory = ""
        for citycode in cities:
            for date in dates:
                outdir = create_folder(directory, date)
                flights_table = get_flight_table_for_city_on_date(citycode, date)
                if flights_table is not None:
                    out_path = os.path.join(outdir, "{}_{}.csv".format(date, cities[citycode]))
                    flights_table.to_csv(out_path)
                else:
                    # Leave a marker file so failed fetches are easy to spot.
                    out_path = os.path.join(outdir, "{}_{}_log.txt".format(date, cities[citycode]))
                    with open(out_path, 'w') as f:
                        f.write("failed {} {}".format(date, cities[citycode]))
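As mentioned above, one way to speed this up is to replace the fixed sleep(3) calls with splinter's built-in waits, which poll until an element appears instead of always pausing for the full interval. A sketch, assuming the same element ids as the script above:

    from splinter import Browser

    with Browser() as browser:
        browser.visit("http://bristowgroup.com/clients/flight-status")
        # Poll for up to 10 seconds instead of sleeping a fixed 3 seconds.
        if browser.is_element_present_by_id("id_base", wait_time=10):
            browser.find_by_id("id_request_date").fill("16-Sep-2014\t")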

By Bond, August
