- 12-01-2011 02:31 PM #1
ScrapeBox to harvest pages from a site
How do I use ScrapeBox to harvest pages from a site? When I want to list all the pages of a site like www.domain.net, what steps must I follow?
- 12-01-2011 06:11 PM #2
-
The Following User Says Thank You to jesda For This Useful Post:
Smoke (12-01-2011)
- 12-02-2011 02:17 AM #3
- 12-03-2011 05:52 PM #4 WTF Lurker
- Join Date
- Apr 2011
- Location
- Indiana, USA
- Posts
- 28
- Thanks
- 0
- Thanked 3 Times in 3 Posts
Well, public proxies are going to die; private proxies would be ideal. It doesn't matter for the link extractor anyway, though, because it doesn't use proxies.
You can also find the site's sitemap URL and use the sitemap addon to pull the URLs.
You can also scrape Google with the site: operator,
like
site:domain.com
You can use a full link instead of just domain.com as well. Google will give results if you use inner-page URLs, but trimming to root usually yields the best results. Yahoo doesn't support site: and AOL is broken ATM.
Bing also supports the site: operator.
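For anyone curious what those two approaches amount to outside of ScrapeBox, here is a minimal Python sketch. It is not how the sitemap addon works internally; it only assumes the site publishes a standard sitemaps.org XML sitemap. The function names `urls_from_sitemap` and `site_query` and the sample sitemap are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap content, standing in for a downloaded sitemap.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.domain.net/</loc></url>
  <url><loc>http://www.domain.net/about.html</loc></url>
</urlset>"""


def urls_from_sitemap(xml_text):
    """Return the text of every <loc> element in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def site_query(url):
    """Trim a URL to its root domain and build a site: query from it."""
    # Add '//' when no scheme is given so urlsplit treats the text as a host.
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return "site:" + host


if __name__ == "__main__":
    for page in urls_from_sitemap(SAMPLE):
        print(page)
    print(site_query("http://www.domain.net/about.html"))
```

The `site_query` helper just does the "trim to root" step by hand, so an inner-page URL ends up as a plain site:domain query that you can paste into the harvester.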
- 12-03-2011 07:08 PM #5
- 12-04-2011 06:32 PM #6 WTF Lurker
Yeah, everyone uses public proxies for this, so they die quickly. You can usually PR check a thousand or a few thousand URLs on your own IP before it gets blocked, and it will unblock in a matter of hours or a day. You can also always find new proxy sources; there are countless sources. $30 will get you 10 private proxies, and if you set your connections to 2 you can practically PR check all day long.
- 12-04-2011 07:21 PM #7
Seems one needs to do a risk assessment on using this and the proxies. At the moment I haven't even gotten as far as using any of the free proxies, because they die.
Where can one buy proxies, btw?
-
- 12-05-2011 01:50 AM #8 WTF Lurker
-
The Following User Says Thank You to loopline For This Useful Post:
webdevelopmentkit (12-05-2011)







