  1. #1
    WTF Senior
    Join Date
    Dec 2010
    Posts
    1,542
    Thanks
    97
    Thanked 28 Times in 23 Posts

    Question: using ScrapeBox to harvest pages from a site

    How do I use ScrapeBox to harvest pages from a site? When I want to list all the pages of a site like www.domain.net, what steps must I follow?

  2. #2
    WTF Senior
    Join Date
    Nov 2009
    Posts
    3,899
    Thanks
    172
    Thanked 106 Times in 86 Posts


    Use the link extractor addon and extract internal links only.
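
    For anyone who wants to see what "internal links only" means outside the addon, here is a minimal Python sketch. It is not how ScrapeBox does it internally; the names LinkCollector and internal_links are made up for illustration, and it assumes you already have the page's HTML as a string. It also treats "www." as optional when comparing hosts, which is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen in the HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _host(url):
    # Compare hosts case-insensitively and treat a leading "www." as optional.
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def internal_links(page_url, html):
    """Return sorted, de-duplicated links in html that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        is_web = urlparse(absolute).scheme in ("http", "https")
        if is_web and _host(absolute) == _host(page_url):
            found.add(absolute)
    return sorted(found)
```

    Running this over each page you find, and feeding newly found links back in, is the basic idea behind listing a site's pages.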

  3. The Following User Says Thank You to jesda For This Useful Post:

    Smoke (12-01-2011)

  4. #3
    WTF Senior
    Join Date
    Dec 2010
    Posts
    1,542
    Thanks
    97
    Thanked 28 Times in 23 Posts


    Quote: Originally posted by jesda
    "Use the link extractor addon and extract internal links only."

    I first thought there was a problem with the settings, but now it seems it's the proxies. They all "die" when I use ScrapeBox.

  5. #4
    WTF Lurker
    Join Date
    Apr 2011
    Location
    Indiana, USA
    Posts
    28
    Thanks
    0
    Thanked 3 Times in 3 Posts


    Well, public proxies are going to die; private proxies would be ideal. It doesn't matter for the link extractor anyway, though, because it doesn't use proxies.

    You can also find the sitemap url of the site and use the sitemap addon to pull urls.
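
    If you want to pull the URLs from a sitemap yourself rather than with the addon, a rough Python sketch could look like this. The function name sitemap_urls is just illustrative, and it assumes you have already downloaded the sitemap XML (usually at /sitemap.xml, but that varies by site).

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; strip the namespace part if present.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls
```

    Note that a sitemap index file also uses <loc>, but there the entries point at further sitemaps rather than pages, so you would need to fetch and parse those in turn.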

    You can also scrape Google's results for the site: operator, like:

    site:domain.com

    You can also use a full link instead of just domain.com. Google will return results for inner-page URLs, but trimming to root usually yields the best results. Yahoo doesn't support site:, and AOL is broken at the moment.

    Bing also supports the site: operator.
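
    To show what "trimming to root" means for the query, here is a small Python sketch that turns a URL or bare domain into a site: query. The helper name site_query and its trim_to_root flag are invented for this example; it only builds the query string and does not scrape anything.

```python
from urllib.parse import urlparse


def site_query(url_or_domain, trim_to_root=True):
    """Build a site: query from a full URL or a bare domain."""
    text = url_or_domain.strip()
    # urlparse only fills in the hostname when a scheme (or leading //) is present.
    parsed = urlparse(text if "://" in text else "//" + text)
    host = parsed.hostname or ""
    if trim_to_root:
        return "site:" + host
    # Keep the inner-page path, minus any trailing slash.
    return "site:" + host + parsed.path.rstrip("/")


print(site_query("http://www.domain.com/blog/post.html"))
print(site_query("domain.com/blog/", trim_to_root=False))
```

    The first call drops the path and keeps only the host; the second keeps the inner path for the narrower query.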

  6. #5
    WTF Senior
    Join Date
    Dec 2010
    Posts
    1,542
    Thanks
    97
    Thanked 28 Times in 23 Posts


    For me, it's the public proxies dying as soon as I try to do a PR check on pages.

  7. #6
    WTF Lurker
    Join Date
    Apr 2011
    Location
    Indiana, USA
    Posts
    28
    Thanks
    0
    Thanked 3 Times in 3 Posts


    Yeah, everyone uses public proxies for this, so they die quickly. You can usually PR check a thousand or a few thousand URLs on your own IP before it gets blocked, and it will unblock in a matter of hours or a day. You can also always find new proxy sources; there are countless sources. $30 will get you 10 private proxies, and if you set your connections to 2 you can practically PR check all day long.
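
    The "set your connections to 2" idea is just capping how many requests are in flight at once. As a rough illustration in Python (not ScrapeBox's own code), the sketch below runs a check over a list of URLs with at most two running at a time. The names check_all and max_connections are made up, and check stands in for whatever lookup you actually do.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, check, max_connections=2):
    """Run check(url) for every URL with at most max_connections in flight."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        # pool.map returns results in the same order as the input URLs.
        return dict(zip(urls, pool.map(check, urls)))
```

    Keeping the worker count low trades speed for fewer requests per second, which is the point when you are trying not to get an IP blocked.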

  8. #7
    WTF Senior
    Join Date
    Dec 2010
    Posts
    1,542
    Thanks
    97
    Thanked 28 Times in 23 Posts


    It seems one needs to do a risk assessment on using this and the proxies. At the moment I don't even get as far as using any of the free proxies, because they die.

    Where can one buy proxies, by the way?

  9. #8
    WTF Lurker
    Join Date
    Apr 2011
    Location
    Indiana, USA
    Posts
    28
    Thanks
    0
    Thanked 3 Times in 3 Posts


    I can't post a straight link because of the 25 post thing, but


    scrapeboxfaq .com/ scrapebox-proxies

    Remove the spaces and that has some proxy resources.

    Also check the videos on that site, or the YouTube link on the home page; there is a video on how to find your own free proxy sources.

  10. The Following User Says Thank You to loopline For This Useful Post:

    webdevelopmentkit (12-05-2011)


 
