Tuesday, September 6, 2016

"Disallow ia_archiver" does not remove pages from archive.org

The experiment (September 2016):

- A site has existed since 2008 and has many snapshots on archive.org.

- The FAQ says:
You can exclude your site from display in the Wayback Machine by placing a robots.txt file on your web server that is set to disallow User-Agent: ia_archiver.
- Made the robots.txt as follows:

User-agent: ia_archiver
Disallow: /

- Within a couple of days, the site snapshots were indeed replaced by a message saying that the archive is not available because of the robots.txt instruction. Success!

- Removed the robots.txt.
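
Before publishing a rule like the one above, it can be checked locally. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_blocked is made up for this illustration, and the check only shows how a standards-following parser reads the rule, not how archive.org itself processes it.

```python
from urllib.robotparser import RobotFileParser

# The exact rule used in the experiment.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if robots_txt forbids user_agent from fetching path.

    Hypothetical helper for illustration; it parses the text locally
    and never contacts any server.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


# ia_archiver is named in the rule; other crawlers are not affected.
print(is_blocked(ROBOTS_TXT, "ia_archiver"))
print(is_blocked(ROBOTS_TXT, "Googlebot"))
```

Passing an empty string instead of ROBOTS_TXT roughly corresponds to the last step of the experiment, when the file was removed: nothing is blocked any more.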

My expectations:

The old pages would be gone from the archive for good, while the current site would be crawled again and added to the archive.

The reality:

All the old pages are back!

The bottom line:

Using robots.txt, one can instruct the Wayback Machine to stop displaying the site (which is exactly what the FAQ says!) but not to remove the pages from the archive.
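
An experiment like this can be repeated without clicking through the web interface by asking the Wayback Machine availability API for the closest snapshot before and after each step. The sketch below makes several assumptions: the endpoint and the JSON shape (archived_snapshots, closest, available, url) are taken from the public API description, the function names are invented here, and whether this API reflects a robots.txt exclusion the same way the web pages do was not tested in the experiment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
API = "https://archive.org/wayback/available"


def closest_snapshot(payload: dict):
    """Return the URL of the closest available snapshot, or None.

    Expects a response shaped like
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(site: str):
    """Query the API for site (needs network access) and return the snapshot URL or None."""
    query = urlencode({"url": site})
    with urlopen(f"{API}?{query}", timeout=30) as response:
        return closest_snapshot(json.load(response))
```

Calling check("example.com") once while the Disallow rule is in place and once after robots.txt is removed would show whether the snapshots are hidden or shown at that moment.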