Disclaimer: Wade Cybertech takes no responsibility for how the information on this website is used. All information or ideas are presented for entertainment and educational purposes only.
Thank you for visiting Wade Cybertech. Throughout these pages you will find tutorials, web automation ideas, and general programming help.
Our goal is to provide the best resources for web automation, data mining, and, most importantly, how to do all this with the Python programming language.
Welcome back. In this post we are going to dig into the guts of scraping the yellow pages results pages. There are several tools you can use to make your life simpler. If your browser is Firefox you can use Firebug; if your browser is Chrome you can use the developer tools.
I personally find Chrome’s developer tools to be the best, so that is what we will be using in this tutorial.
So let’s get started. In the last post we looked at the results page to determine where the information that we wanted was located.
With Chrome open, launch the developer tools:
- Click the icon that looks like a piece of paper with a folded edge
- Click develop
- Finally click developer tools
Now if you look at the bottom of your browser window there is a little bar that you need to drag up. Drag the bar up until you can see some HTML, then scroll the page so you can see the start of the results (look for “Refine results by:”).
Next click on the little magnifying glass on the developer tools icon bar. Once that is selected, hover your mouse over “Refine results by:”. You will want to move your mouse a little higher so you see a big blue square wrapping all the results. Once you have the blue box, left-click once.
Now if you look at the HTML pane in developer tools you will see the line selected is:
<div class="rounded-10" id="search-results">…</div>
This is the master container of all our results. Expanding it (click the arrow on the left), we can see the blue box change as we hover our mouse over the div tags in the HTML pane. We find that the div tag:
<div id="results">…</div>
contains the results we are looking for. However, if we dig a little deeper we will see that it is not the tag we want our scraper to look for. Instead there is a sub tag:
<div id="search-content">…</div>
Contained within this tag are all the individual company listings. All the listings are wrapped in their own <div> tags so we can simply create an array of <div> tags. Inside these div tags we want the information found inside:
<div class="listing_content">…</div>
Finally inside the listing content tag we specifically want the information inside:
<div class="info">…</div>
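To make the nesting above concrete, here is a minimal sketch that walks a small piece of sample HTML and collects the text of every info block. It assumes the structure described in this post (search-content, then listing_content, then info). The sample markup, the company names, and the names InfoCollector and extract_infos are made up for illustration, and it uses only Python's standard html.parser module so it runs without any extra installs.

```python
from html.parser import HTMLParser

# Hypothetical sample markup that mirrors the nesting described above;
# the real page contains far more markup than this.
SAMPLE_HTML = """
<div class="rounded-10" id="search-results">
  <div id="results">
    <div id="search-content">
      <div class="listing">
        <div class="listing_content">
          <div class="info"><a href="/acme">Acme Plumbing</a> 555-0100</div>
        </div>
      </div>
      <div class="listing">
        <div class="listing_content">
          <div class="info"><a href="/bolt">Bolt Electric</a> 555-0199</div>
        </div>
      </div>
    </div>
  </div>
</div>
"""


class InfoCollector(HTMLParser):
    """Collect the text of each <div class="info"> found inside
    <div class="listing_content"> inside <div id="search-content">."""

    def __init__(self):
        super().__init__()
        self.stack = []      # one label per currently open <div>
        self.infos = []      # text of each finished info block
        self._buffer = None  # text pieces of the info block being read

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if attrs.get('id') == 'search-content':
            label = 'search-content'
        elif 'listing_content' in classes:
            label = 'listing_content'
        elif 'info' in classes:
            label = 'info'
        else:
            label = 'other'
        # Only start collecting when the info div sits inside both containers
        inside = 'search-content' in self.stack and 'listing_content' in self.stack
        if label == 'info' and inside and self._buffer is None:
            self._buffer = []
        self.stack.append(label)

    def handle_endtag(self, tag):
        if tag != 'div' or not self.stack:
            return
        label = self.stack.pop()
        # Finish the block when the outermost info div closes
        if label == 'info' and self._buffer is not None and 'info' not in self.stack:
            self.infos.append(' '.join(''.join(self._buffer).split()))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_infos(html):
    """Return the whitespace-normalized text of every info block."""
    parser = InfoCollector()
    parser.feed(html)
    return parser.infos


print(extract_infos(SAMPLE_HTML))
```

This is only a sketch of the idea. It does not cope with badly nested markup, which is one reason the series uses a dedicated scraping library for the real scraper.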
At this point I hope you can see how much easier the developer tools make digging through HTML. The final piece is the Next link at the bottom of the page.
Let’s scroll down the page so we can see the page navigation. Then click the magnifying glass again and hover your mouse over the space to the left of the number 1. You will see the entire line highlight.
Now our HTML pane should have the links tag highlighted:
<div class="pagination">…</div>
Now looking at the HTML we have two choices:
We can look at the results-total and use regular expressions to pull out the total number of pages, and then keep track of the page we are currently on. Or a much simpler option is to look for the tag:
<li class="next">…</li>
on every page. If the tag is not found, we are on the last page. If the tag is found, we simply store the href so we can navigate to the next page.
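The simpler option can be sketched in a few lines. This example assumes the next tag wraps an ordinary link, which is how the page looked when I inspected it. The two sample snippets, their URLs, and the names NextLinkFinder and find_next_href are invented for illustration. It uses the standard html.parser module so it runs as is.

```python
from html.parser import HTMLParser

# Hypothetical pagination snippets: one from a middle page, one from the last page
MIDDLE_PAGE = ('<div class="pagination"><ul>'
               '<li><a href="/x?page=1">1</a></li>'
               '<li class="next"><a href="/x?page=2">Next</a></li>'
               '</ul></div>')
LAST_PAGE = ('<div class="pagination"><ul>'
             '<li><a href="/x?page=1">1</a></li>'
             '</ul></div>')


class NextLinkFinder(HTMLParser):
    """Remember the href of the first <a> inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self.in_next = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li' and 'next' in (attrs.get('class') or '').split():
            self.in_next = True
        elif tag == 'a' and self.in_next and self.href is None:
            self.href = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_next = False


def find_next_href(html):
    """Return the Next link's href, or None when we are on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.href


print(find_next_href(MIDDLE_PAGE))
print(find_next_href(LAST_PAGE))
```

A return value of None is the signal that there are no more pages to visit.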
This covers how to rip apart the HTML to find how the page is structured and what our parser must look for.
In the next article we will start to actually create our scraper.
Hi, sorry for the long delay in posting.
A lot of work has come up and I have been working hard trying to get a game finished and launched on my other site: http://www.warplydesigned.com. I will post some more this week and try to keep posting on a more regular basis.
If you have been following along, the last part of the series covered “How to research yellow pages to learn how the website works.” In this section of the series we will be taking a closer look at the yellow pages results page. In the last part we learned how to get to this results page; now let’s target some information on that page.
We want to look at the section of the results page that contains the company listings:
Besides the actual listings we have to be careful to check how many pages of results there are for our search:
Without looking into the source code for this page we can guess how the data we want is structured. From the look of the page, all the listings we want are either in a table or grouped in div tags.
Each group contains:
- A Company title which links to another page with more information about that company
- Street and Address on a single line
- Phone Number
- Website
On the more information page for the selected company you can sometimes find extra phone numbers, fax numbers, or even contact people. Each listing follows the exact same format.
At the very bottom of the listings we see the page numbers if there is more than one results page. Following the page numbers is a Next link going to the next page. You cannot rely on the numbers shown before the Next link to be the total number of pages for your search. The reliable way to tell that you are on the last page is that there will be no Next link. Therefore, instead of having the spider click on each page number, make your bot click only on the Next link; once it cannot find the Next link, you know it is done.
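Here is a rough sketch of that "only follow Next" loop. To keep it runnable without touching the real site, the pages are faked with a dictionary. FAKE_SITE, its URLs, and the function name follow_next_links are all made up for this example. In a real bot, fetch would download the page and find_next would look for the Next link in it.

```python
def follow_next_links(start_url, fetch, find_next, max_pages=100):
    """Visit result pages by following only the Next link.

    fetch(url) returns a page; find_next(page) returns the next URL,
    or None when the page has no Next link.
    """
    pages = []
    seen = set()  # guards against a site that links back to an earlier page
    url = start_url
    while url is not None and url not in seen and len(pages) < max_pages:
        seen.add(url)
        page = fetch(url)
        pages.append(page)
        url = find_next(page)  # None means no Next link: this was the last page
    return pages


# Hypothetical stand-in for the site: three result pages, the last has no Next link
FAKE_SITE = {
    '/results?page=1': {'listings': ['A', 'B'], 'next': '/results?page=2'},
    '/results?page=2': {'listings': ['C', 'D'], 'next': '/results?page=3'},
    '/results?page=3': {'listings': ['E'], 'next': None},
}

pages = follow_next_links('/results?page=1',
                          FAKE_SITE.__getitem__,
                          lambda page: page['next'])
all_listings = [name for page in pages for name in page['listings']]
print(all_listings)
```

The max_pages limit and the seen set are safety nets I added for the sketch so a bot cannot loop forever if the site behaves unexpectedly.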
That covers this part of the series. In the next section we will look at digging into the source code to learn what is really going on.
Welcome to the first part of the yellow pages scraper tutorial series. In this first part we are going to take a closer look at the yellow pages website to see how it works.
Here are some things we will cover:
- How to find what you are looking for.
- Keep a list of the steps it takes to get where you want to go.
- Look for and note any patterns that we find between searches.
- Note how the final information is displayed.
OK, let’s get started. First we need to learn how to find what we are looking for on the yellow pages site.
If you have not done so already open a new browser window to look at http://www.yellowpages.com. You are going to be presented with the new design. (No scraper released covers this at the time of writing).
If you have been thinking about purchasing a yellow pages scraper make sure the company you are buying from first lets you test that it works. There are some yellow pages scrapers that I have tested that do not work at all.
OK back on topic. On the main page of yellowpages.com there are a few different options when finding what you are looking for:
- Type in a category and city/state. (Live search will be displayed helping you make your choice.)
- Browse by event type.
- Browse by state, then by city in that state, then by keyword for that specific city/state combination.
No matter which process you use to find your results, you will find a pattern in the URL: http://www.yellowpages.com/[city]-[state]/[keyword(s)]. There are other patterns in the keywords you need to watch out for, but I will leave that up to you to figure out.
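As a starting point, the URL pattern could be turned into a small helper like the one below. Treat it as a sketch only: the lowercase, hyphen-separated formatting is my assumption about how the site writes cities and keywords, and the names slugify and build_search_url are made up for this example. Check the URLs your own searches produce before relying on it.

```python
import re

BASE_URL = 'http://www.yellowpages.com'


def slugify(text):
    """Lowercase the text and join its alphanumeric runs with hyphens.

    This formatting is an assumption, not something the site documents.
    """
    return '-'.join(re.findall(r'[a-z0-9]+', text.lower()))


def build_search_url(city, state, keywords):
    """Build a URL following the [city]-[state]/[keyword(s)] pattern."""
    return '%s/%s-%s/%s' % (BASE_URL, slugify(city), slugify(state), slugify(keywords))


print(build_search_url('Salt Lake City', 'UT', 'Plumbers'))
```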
The new results page looks a lot better than the old page and seems more specific. (No more ads.)
Now you have a choice: you can either just scrape all of these pages, or you can go one level deeper and scrape the more info page for each company.
OK, now we have a basic understanding of how to use the yellowpages.com website. In the next part of the series we will dive deeper into the search results page to find out how we would scrape the yellow pages results page(s).
Before I learned how to scrape websites, a company asked me to develop an application that would allow them to pick numbers that they liked. After the numbers were picked, they would all be run through an algorithm that generated lottery patterns. I could have won $30 million from this application, but I didn’t risk the $300 it would have cost me to buy tickets. (I know, a very dumb choice.)
Now I was thinking: what if I could make the application even better? Have it grab all the winning lottery numbers for the past, say, 10 years and try to determine the probability of specific numbers coming up. The app would then run the most probable numbers through an algorithm that generates numbers someone should play.
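The counting step could look something like this minimal sketch. The past draws below are made up for illustration (a real version would scrape them), and most_frequent_numbers is just a name I picked. Keep in mind that in a fair lottery past frequency does not predict future draws, so this only shows the bookkeeping.

```python
from collections import Counter

# Made-up past draws for illustration only; a real run would scrape these
past_draws = [
    [3, 11, 19, 24, 37, 42],
    [3, 8, 19, 27, 37, 45],
    [5, 11, 19, 30, 37, 49],
]


def most_frequent_numbers(draws, how_many):
    """Return the numbers drawn most often, most frequent first."""
    counts = Counter(number for draw in draws for number in draw)
    # Sort by count (descending), then by number so ties are deterministic
    ranked = sorted(counts, key=lambda n: (-counts[n], n))
    return ranked[:how_many]


print(most_frequent_numbers(past_draws, 3))
```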
I cannot guarantee anyone would win the lottery but it would still put web scraping and webbots to good use.
Let me know what you think.
For anyone new to Python programming who is looking for a good book to get started, I recommend Beginning Python: From Novice to Professional, Second Edition.
I have learned a lot from this book.
The first part of this book covers the core Python language: lists, tuples, strings, dictionaries, etc. Next it moves on to OOP with data abstraction. But don’t think this book only covers the basics; you also get coverage of files, GUI (wxPython), databases (SQLite), network programming (it talks about Twisted), the web (scraping), testing, extending Python, and finally how to package your programs.
If the book ended at this point you would still get some good coverage of Python. However, it doesn’t end here. Next you move on to the fun part of the book: creating example programs.
Here is a list of the projects you will work on:
- Instant Markup
- Painting a Pretty Picture
- XML for All Occasions
- In the News
- A Virtual Tea Party
- Remote Editing with CGI
- Your Own Bulletin Board
- File Sharing with XML-RPC
- File Sharing II – Now with GUI
- Do-It-Yourself Arcade Game
Lastly, the book provides you with a Python reference and an introduction to Python 3.0, which is the newest version of the language at this time.
Continuing on with the series how to scrape websites, we are going to look at parsing web pages with Python. This is a continuation from the last post “how to download webpages with Python”. If you are looking to follow the PHP or Perl version click on the appropriate link.
With Python you have a variety of options for parsing webpages you have downloaded. However, this series will focus on BeautifulSoup, a great Python library created specifically for scraping websites. You will find that BeautifulSoup simplifies the process of scraping websites; anyone wanting to learn more about how the library works should read BeautifulSoup’s official documentation.
In this tutorial we are going to download the front page of Wade Cybertech, pull out all the post titles and URLs, store them in a set, and finally print the results to show everything worked.
Let’s get started (make sure you download and install BeautifulSoup or this will not work):
# Import the libraries that we require (Python 3 with BeautifulSoup 4)
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the main page and save it into a variable
text = urlopen('http://www.wadecybertech.com').read()

# Create an object with the downloaded page
soup = BeautifulSoup(text, 'html.parser')

# Create an empty set to store our results
posts = set()

# Look through the page for the span header tags
for header in soup('span', 'PostHeader'):
    # Find and store the link tags for the found header
    links = header('a')
    # If no link was found, move on to the next header
    if not links:
        continue
    # Grab the first link
    link = links[0]
    # Add the link text and URL to the posts set
    posts.add('%s (%s)' % (link.string, link['href']))

# Print all the values in the posts set, sorted case-insensitively
print('\n'.join(sorted(posts, key=lambda s: s.lower())))
[NOTE: Indentation is very important when programming with Python]
This concludes the basics of how to download and parse webpages with Python. In the next part of the series we will look at how to submit form data with Python.
I have now seen a few requests for this type of webbot. Someone would like to enter the URL of a website and have the webbot parse all the content, saving all images, CSS, JS, and HTML files and converting all absolute URLs to relative URLs so they can view the website offline.
This could be a very helpful webbot screen scraper to have if you own the website. I will write a tutorial on how to do this; however, if you are doing this to someone else’s site I will not be held responsible. Scraping should be used when you would like specific information from a website to use in your product, not when you want to steal someone else’s entire website to benefit yourself.
Now that I have shared my feelings on this issue I will post a link on this page once the webbot is finished. This webbot will use all the basic concepts this website provides: scraping, spidering all pages, downloading all images, and saving everything into a hierarchy.
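Until the full webbot is written, here is a small sketch of just the absolute-to-relative URL step. It is not the finished webbot: the function name to_relative and the example.com URLs are made up, query strings and fragments are ignored, and links to other hosts are left alone. It uses only the standard library.

```python
import posixpath
from urllib.parse import urlparse


def to_relative(link_url, page_url):
    """Rewrite link_url relative to page_url when both are on the same host.

    Query strings and fragments are dropped in this sketch.
    """
    link = urlparse(link_url)
    page = urlparse(page_url)
    if link.netloc != page.netloc:
        return link_url  # external link: leave it untouched
    link_path = link.path or '/'
    # Directory that contains the page the link appears on
    page_dir = posixpath.dirname(page.path or '/')
    return posixpath.relpath(link_path, start=page_dir)


print(to_relative('http://example.com/images/logo.png',
                  'http://example.com/blog/post.html'))
```

With relative links like these, the saved files can be opened from disk as long as the folder hierarchy matches the paths on the site.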
In this part of the series we are going to see how easy Python makes things. If you have not been following the series you should start from how to scrape websites. If you would like to learn how to download pages with PHP or Perl, follow their links.
If you are new to Python, I am positive that the more you use the language the more you are going to love it. Python has a way of making things very simple, even scraping websites, as you will see throughout this series.
Well, let’s get started. We are going to download and display the home page of Wade Cybertech.
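# Python 3: urlopen lives in urllib.request
from urllib.request import urlopen
# Download the home page; read() returns the raw bytes
text = urlopen('http://www.wadecybertech.com').read()
# Decode the bytes (UTF-8 is assumed here) and print the page
print(text.decode('utf-8', 'replace'))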
And that is all there is to downloading and printing a web page using Python. If you check out the PHP version you will find that it is a lot more complicated.
In the next part of the series we will cover how to parse a downloaded page with Python. If you are new to Python development I recommend the book Beginning Python from Novice to Professional Second Edition.










