Wednesday, August 13, 2014

Harvesting phone numbers from the homeless.co.il website (Python)

Hi there.
If you have ever thought about such a nasty thing as SMS spamming, you would need a phone number database to do it.
 
Or maybe you just want to call them?

I will show you one technique to harvest phone numbers from... hmm, http://www.homeless.co.il - that will do.

Let's say you need all the phone numbers involved in real estate, OK?
So we go to the homeless website and see a column called 'מכירה תיווך' (sale via broker), with this URL:

http://www.homeless.co.il/saletivuch/

If you look closer, you can find the 'details' (לפרטים) link on the left side of each ad.
If you click it, a popup window opens with a link like:

http://www.homeless.co.il/SaleTivuch/viewad,367948.aspx


Does something come to mind? Any thoughts?

All we need is to grab all the ad numbers from the main URL (including all secondary pages) and parse every viewad URL to extract the phone numbers.

As you can see, there is not just one page of ads; there are about 130 pages.
The only difference in the URL is the page number appended after the main one:

http://www.homeless.co.il/saletivuch/[1-10]
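
Just to make that pattern concrete, here is a quick sketch of how the page URLs would be built (the upper bound of 130 is the rough figure above; in the script below we will read the real page count from the page itself):

base = 'http://www.homeless.co.il/saletivuch/'
page_urls = ['%s%d' % (base, n) for n in range(1, 131)]   # pages 1..130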

Let's do it automatically with Python.
First of all, we will determine how many pages we need to parse.

Take a closer look at the source: the page contains a <select> element with id="jump_page" whose <option> values list every page number.

We will extract the total number of pages with the BeautifulSoup library:
#!/usr/bin/python

from bs4 import BeautifulSoup
import urllib2
import re

path_name = 'saletivuch'

r = urllib2.urlopen('http://www.homeless.co.il/%s/' % path_name)
if r.code == 200:
   got_url = r.read()
   soup = BeautifulSoup(got_url, 'html.parser')

   #determine how many pages we have to get:

   #in one line it would look like this:
   #pages = [i.get('value') for i in soup.find('select', {'id': 'jump_page'}).find_all('option')]

   #but I will write it in a more understandable format:
   jump_page = soup.find('select', {'id': 'jump_page'})

   pages = []
   for option in jump_page.find_all('option'):
      pages.append(option.get('value'))

   print '%d pages to parse' % len(pages)
   

Let's extract the ad IDs from every page:
 

   #get every ad number from every page:

   ads = []
   for i in pages:
      try:
         r = urllib2.urlopen('http://www.homeless.co.il/%s/%s' % (path_name, i))
         inner_url = r.read()
         soup = BeautifulSoup(inner_url, 'html.parser')

         #in one line it would look like this:
         #ads += [row.get('id').split('_')[1] for row in soup.find_all('tr', {'class': 'ad_details'})]

         #every ad sits in a <tr class="ad_details"> whose id holds the
         #ad number after the underscore:
         for row in soup.find_all('tr', {'class': 'ad_details'}):
            ads.append(row.get('id').split('_')[1])
         print 'Ads collected so far: %d' % len(ads)

      #added a ^C exception if you don't want to wait for all 130 pages :)
      except KeyboardInterrupt:
         break
   print 'Total ads collected: %d' % len(ads)


Now let's extract the phone numbers from the '/viewad,XXXXXX.aspx' pages.
There is no special 'class' or 'id' to lock onto, so we will grab them with a regex.
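Before wiring the pattern into the loop, here is a tiny standalone test of it (the sample numbers below are invented for illustration):

import re

pattern = r'\b0[2-9][0-9]?-?[0-9]{7}\b'
sample = 'call 03-1234567 or 0541234567, office ext. 999'
print re.findall(pattern, sample)   # ['03-1234567', '0541234567']

It matches Israeli-style numbers: a leading 0, a one- or two-digit area/operator code, an optional hyphen, and seven digits. Now the extraction loop itself: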
   #extract phone numbers from every ad:

   phones = []
   for i in ads:
      try:
         r = urllib2.urlopen('http://www.homeless.co.il/%s/viewad,%s.aspx' % (path_name, i))
         got_ad = r.read()

         #remember that re.findall() returns a list of matches,
         #so we can simply extend the main list with the new one
         phones += re.findall(r'\b0[2-9][0-9]?-?[0-9]{7}\b', got_ad)

         print 'Total phone numbers collected: %d' % len(phones)
      except KeyboardInterrupt:
         break

   for i in phones: print i,
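
Since the same broker posts many ads, the list will contain plenty of duplicates. A simple way to keep only the unique numbers (my own addition, not part of the original script) is:

   unique_phones = sorted(set(phones))   #set() drops the duplicates
   print '%d unique numbers out of %d collected' % (len(unique_phones), len(phones))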

Adding threading/multiprocessing is necessary; otherwise you will wait a long time for the job to complete.
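Here is a minimal sketch of one way to do it, using a thread pool from multiprocessing.dummy (the fetch_phones helper and the pool size of 10 are my own illustration, not part of the original script):

from multiprocessing.dummy import Pool   #thread-based Pool, same API as multiprocessing

def fetch_phones(ad_id):
   #hypothetical helper: download one ad page and return its phone numbers
   try:
      url = 'http://www.homeless.co.il/%s/viewad,%s.aspx' % (path_name, ad_id)
      return re.findall(r'\b0[2-9][0-9]?-?[0-9]{7}\b', urllib2.urlopen(url).read())
   except urllib2.URLError:
      return []

pool = Pool(10)                           #10 worker threads
results = pool.map(fetch_phones, ads)     #one list of matches per ad
pool.close()
pool.join()

phones = [p for sublist in results for p in sublist]
print 'Total phone numbers collected: %d' % len(phones)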

REMEMBER: SPAM IS ILLEGAL, and I am NOT responsible for how you use the information covered in this post, or any part of it!
