We’re in the process of retiring our last production server running NT and ColdFusion (whew!), which meant we needed to get a few old projects ported to our newer Linux machines. The main site, http://aen.walkerart.org/, is marginally database-driven: that is, it pulls random links and projects from a database to make the pages different each time you load them. The admin at the time was nice enough to include MDB dump files from the Microsoft Access(!) project database, and the free mdbtools software was able to extract the schema and generate import scripts. Most of this worked as-is, but I had to tweak the schema by hand.
After the database was ported to MySQL, it was time to convert the ColdFusion to PHP. (Note: the pages still say .cfm so we don’t break links or search engines – it’s PHP running on the server.) Luckily the scripts weren’t doing anything terribly complicated, mostly just selects and loops with some “randomness” thrown in. I added a quick database-abstraction file to handle connections and errors and to sanitize input, and things were up and running quickly.
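For anyone curious what that kind of abstraction layer amounts to, here is a minimal sketch. It is not the PHP file that runs the site: it is written in Python against SQLite so it is self-contained, and the `links` table, its `title`, `url` and `category` columns, and both function names are made up for illustration. The idea carries over, though: one place that opens the connection, one place that turns database errors into something readable, and bound parameters instead of pasting user input into the SQL.

```python
import sqlite3


def open_database(path=":memory:"):
    """Open a connection and return rows that can be read by column name."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    return conn


def random_links(conn, category, limit=5):
    """Return up to `limit` randomly ordered links from one category.

    The table and column names here are hypothetical. The values are
    passed as bound parameters, so they are never interpreted as SQL.
    """
    query = (
        "SELECT title, url FROM links "
        "WHERE category = ? ORDER BY RANDOM() LIMIT ?"
    )
    try:
        return conn.execute(query, (category, int(limit))).fetchall()
    except sqlite3.Error as err:
        # Wrap driver errors so calling pages only handle one exception type.
        raise RuntimeError(f"link query failed: {err}") from err
```

A page would then call something like `random_links(conn, "projects", 3)` and loop over the rows. Sorting by `RANDOM()` is fine for a small table like this one; it would be a poor choice for a very large one.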
… sort of. The site is essentially a repository of links to other projects, and was launched in February 2000. As you might imagine there’s been some serious link rot, and I’m at a bit of a loss on how to approach a solution. Steve Dietz, former New Media curator here at the Walker, has an article discussing this very issue here (ironically, it mentions another Walker-commissioned project that has since suffered link rot. Hmm.).
One strategy Dietz suggests is to update the links by hand as the net evolves. This seems resource-heavy: even if a link-validating bot could automate the checking, someone would still have to curate new links and update the database. I’m not sure we can make that happen.
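The checking half of that job is the easy part to automate. Here is a rough sketch of a link checker in Python, using only the standard library; the function names and the four labels are my own invention, and how each status should be treated is a judgment call. Note what it cannot do: a 200 response only says that some server answered, not that the page is still the project we originally linked to.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status code (or None for no response) to a coarse label."""
    if status is None:
        return "unreachable"
    if 200 <= status < 400:
        return "ok"
    if status in (404, 410):
        return "gone"
    return "error"


def check_link(url, timeout=10):
    """Send a HEAD request for one URL and return (status, label)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        status = err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        status = None
    return status, classify_status(status)
```

Running `check_link` over every URL in the database would produce a list of candidates for a human to review. Some servers reject HEAD requests, so a real bot would probably retry those with GET before declaring a link dead.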
It also occurred to me to build a proxy using the Wayback Machine to try to give the user a view of the internet in early 2000. There’s no API for pulling pages, but archive.org allows you to build a URL to get the copy of a page closest to a specific date, so it seems possible. But this is tricky for other reasons – what if the site actually still exists? Should we go to the live copy or the copy from 2000? Do we need to pull the headers for the URL and only go to archive.org if the response is an error, anywhere from a 404 to a 500? And what if the domain is now owned by a squatter who returns a 200 page of ads? Also, archive.org respects robots.txt, so a few of our links have apparently never been archived and are gone forever. Rough.
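To make the idea concrete, here is a small Python sketch of the URL-building and fallback logic. It assumes the usual Wayback Machine address pattern of `https://web.archive.org/web/<timestamp>/<original URL>`, which redirects to the capture nearest the timestamp; the function names and the "anything outside 200 to 399 falls back to the archive" rule are assumptions of mine, not a settled design.

```python
from datetime import date

# Assumed Wayback Machine prefix; a 14-digit timestamp and the original
# URL are appended to it.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, when):
    """Build a Wayback Machine URL for the capture closest to `when`."""
    timestamp = when.strftime("%Y%m%d") + "000000"  # YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def choose_target(original_url, live_status, when):
    """Pick the live URL if it still answers successfully, else the archive.

    `live_status` is the HTTP status from a header check, or None if the
    request failed outright.
    """
    if live_status is not None and 200 <= live_status < 400:
        return original_url
    return wayback_url(original_url, when)
```

For example, `choose_target("http://example.com/", 404, date(2000, 2, 1))` hands back the archive.org address for early February 2000. This does nothing about the squatter problem, since a page of ads served with a 200 looks exactly like a healthy site, and it cannot help with pages that were never archived.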
In the end, the easy part was porting the code to a new language and server – it works pretty much exactly like it did before, broken links and all. The hard part is figuring out what to do with the rest of the web… I do think I’ll try to build that archive.org proxy someday, but for now the fact that it’s running on stable hardware is good enough.
Thoughts? Anyone already built that proxy and want to share?