"Disallow ia_archiver" does not remove pages from archive.org

The experiment (September 2016):

- A site has existed since 2008 and has many snapshots on archive.org.

- The FAQ says: "You can exclude your site from display in the Wayback Machine by placing a robots.txt file on your web server that is set to disallow User-Agent: ia_archiver."
- I made the robots.txt as follows:

      User-agent: ia_archiver
      Disallow: /

- Within a couple of days, indeed, instead of the site's snapshots, the Wayback Machine showed a message saying that the page is not available because of a robots.txt instruction. Success!

- I removed the robots.txt.
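To check whether snapshots reappear after a step like this, one can query the Wayback Machine's Availability API (`https://archive.org/wayback/available?url=...`), which returns JSON describing the closest archived snapshot. A minimal sketch; the `example.com` URL and the sample response below are illustrative, not from the actual experiment:

```python
import json
from urllib.parse import urlencode

WAYBACK_AVAILABILITY_API = "https://archive.org/wayback/available"

def availability_url(site: str) -> str:
    """Build the Availability API query URL for a given site."""
    return WAYBACK_AVAILABILITY_API + "?" + urlencode({"url": site})

def has_snapshot(api_response: str) -> bool:
    """Return True if the API response reports an archived snapshot."""
    data = json.loads(api_response)
    return bool(data.get("archived_snapshots"))

# Abridged response shape for a site that is (still) archived:
sample = ('{"archived_snapshots": {"closest": {"available": true, '
          '"url": "http://web.archive.org/web/20160901000000/'
          'http://example.com/"}}}')
print(availability_url("example.com"))
print(has_snapshot(sample))  # True
```

An empty `archived_snapshots` object means no snapshot is currently being served for that URL, which is what the robots.txt block produced during the experiment.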

My expectations:

The old pages would have disappeared from the archive completely, while the current version of the site would now be crawled and included in the archive.

The reality:

All the old pages are back!

The bottom line:

Using robots.txt, one can instruct the Wayback Machine to stop displaying a site's snapshots (which is exactly what the FAQ says!) but not to remove the pages from the archive.
