Mar 21st, 2008
Site Ripping Bastards: collecting the proof…
One of my recent posts was about someone ripping off one of my company’s sites, and it looks like this case, at least, may go somewhere. So the powers-that-be want some documentation.
So we decided to give them printouts of all the site pages and a copy of the site on disc.
For “downloading” the offending company’s website, we needed an auto-download program (because doing it by hand would have been a colossal waste of time). Plus, whoever ripped off the site in the first place probably did the same thing (though I really hope they weren’t that smart and spent a ridiculous amount of time doing it manually).
After trying several Firefox extensions and some other trial-crippled software packages, I landed on HTTrack Website Copier, a free, open-source application. You give it the URL of a website and it spiders the site, downloading the pages and images and retaining the linking structure, so that you end up with a navigable (is that a word?) version stored locally. Worked like a champ. There are Windows, Linux and Mac versions available.
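For the curious, here is a rough idea of the link-discovery step that any site spider has to do. This is just a minimal Python sketch, not how HTTrack itself is implemented: the `LinkCollector` class and `same_site_links` function are names I made up for illustration, and it only looks at `href` and `src` attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href/src attribute values from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def same_site_links(page_url, markup):
    """Return absolute URLs found in markup that live on page_url's host."""
    collector = LinkCollector()
    collector.feed(markup)
    host = urlparse(page_url).netloc
    found = []
    for link in collector.links:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, link)
        # Keep only links on the same host, without duplicates.
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found
```

A real copier would then download each of those URLs, repeat the process on every new page, and rewrite the links to point at the local copies, which is exactly the tedious part a dedicated tool saves you from.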
It also works with dynamic sites, so it is handy if you ever have a client who wants their dynamic, database-driven PHP site stored on a CD so they can browse it locally at a tradeshow, a desert island, the Iditarod dog race in Alaska or some other place that doesn’t have internet access. (Of course, it may also really help to have a static version if your database ever craps out on you and you still want a functional site.)
Now for the printouts: doing those page by page would take a while, so here was my workaround:
- Use Yahoo Site Explorer to return a list of pages* it has in its index and use the “Export to TSV” option usually found at the bottom right of the results page. This will give you a full list of the pages that Yahoo has indexed for that site, rather than just the first 10 links in the paginated results.
- Open the TSV file in Excel and save it as HTML
- Open the HTML file in Internet Explorer (I prefer Firefox, but IE is what makes the next step possible)
- Choose to print that one page with all the links and in the print options look for “print linked documents”
- Kick back and reload your paper and toner as necessary
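If you would rather skip the Excel step, the TSV-to-HTML conversion can be scripted. Below is a minimal Python sketch; I am assuming the export has one row per page with the URL sitting in one of the tab-separated columns (I have not pinned down the exact column layout, so the code just takes the first cell in each row that starts with “http”). The function name `tsv_to_link_page` is made up for this example.

```python
import csv
import html
import io


def tsv_to_link_page(tsv_text):
    """Build a bare HTML page with one link per URL found in the TSV text."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    items = []
    for row in rows:
        # Assumption: the first cell that starts with "http" is the page URL.
        # Header rows and blank rows have no such cell and are skipped.
        url = next(
            (cell.strip() for cell in row if cell.strip().startswith("http")),
            None,
        )
        if url:
            safe = html.escape(url, quote=True)
            items.append('<li><a href="%s">%s</a></li>' % (safe, safe))
    return "<html><body><ul>\n%s\n</ul></body></html>" % "\n".join(items)
```

Read the exported file into a string, pass it to the function, and write the result out as an `.html` file. That file can then be opened in Internet Explorer for the print step above.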
This was just a hack for getting this done and I am sure there are other, better options. There is probably a Firefox extension I am not aware of that would have really helped with this. If you know of one, I would love to hear about it and link to it.
*Since I was depending on Yahoo’s index of the site, it is possible I missed some pages, but it gave me more than enough to show what we needed. Just a disclaimer.
Hope this helps anyone who might find themselves in the same boat.
P.S.
There may be some Mac folks out there thinking “An Automator workflow would have been able to take care of all that” and you may be right, but in the 3-4 hours I spent with it I never managed to connect the steps correctly, even after trying the very promising “Download URLs as PDFs” action. Automator was one of the coolest features I had seen in an OS and it made me quite interested in moving to Mac, but so far, for me, it has been nothing but a big tease: it looks like it can do almost anything and makes it seem easy, but the pieces never really fit together the way I need them to. But I’ll keep trying.