Archive.org and the forum


#1

I noticed that archive.org has preserved quite a few threads from the old forum which are still missing in forum.archive.openwrt.org. Are there any efforts under way to restore those threads and posts from archive.org?

The current OpenWrt forum apparently can't be indexed by archive.org. I tried to archive a thread, and the result was pretty much unusable. Are there any automatic redirects that send crawlers to a static version of the Discourse forum contents, which would make it possible to archive the current OpenWrt forum?


#2

Honestly, I didn't pursue the archive.org route very long back when I gathered and scraped the forum content, for several reasons:
a) it was just a fraction of the content compared to other sources (mainly Bing Cache ... Microsoft is good for something after all)
b) it was often badly outdated, and most of all
c) the threads and posts are in about half a dozen different variations of forum software templates, and that made it comparatively cumbersome to isolate the content and its metadata

That being said, I do have a full scrape of everything archive.org had on the old forum tucked away somewhere. If you say there's still something to gather from the archive.org content, I shall have a second look and possibly integrate it with the existing content.


#3

Apparently some threads from the old forum only turned up in archive.org recently. I remember looking for some specific threads back when the old forum had just died, and archive.org didn't have them available. Remembering that archive.org crawl results sometimes only show up months later, I retried accessing those threads on archive.org a few days ago, and lo and behold, they were there. I might be wrong, though.
An example where archive.org has something which is missing in forum.archive.openwrt.org: https://web.archive.org/web/20180226032651/https://forum.openwrt.org/viewtopic.php?id=37368&p=4
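
In case it helps with comparing the two sources, here is a minimal Python sketch that splits a Wayback Machine URL like the one above into its capture timestamp and original URL, then builds the matching address on forum.archive.openwrt.org. It assumes the mirror keeps the old forum's path and query string unchanged, which I have only seen for this one thread. The function names are made up for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Wayback URLs look like https://web.archive.org/web/<timestamp>[modifier]/<original URL>.
# The optional two-letter modifier (e.g. "id_") is tolerated but not returned.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d{4,14})(?:[a-z]{2}_)?/(.+)$"
)


def split_wayback_url(url):
    """Return (timestamp, original_url) for a Wayback Machine capture URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError(f"not a Wayback Machine capture URL: {url}")
    return match.group(1), match.group(2)


def to_archive_mirror(original_url, mirror_host="forum.archive.openwrt.org"):
    """Swap the host for the mirror, keeping path and query as they are.

    Assumption: the mirror uses the same viewtopic.php paths as the old forum.
    """
    parts = urlsplit(original_url)
    return urlunsplit(("https", mirror_host, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    capture = (
        "https://web.archive.org/web/20180226032651/"
        "https://forum.openwrt.org/viewtopic.php?id=37368&p=4"
    )
    timestamp, original = split_wayback_url(capture)
    print(timestamp, original)
    print(to_archive_mirror(original))
```

Running both resulting URLs side by side makes it easy to see whether a given page of a thread exists in one place but not the other.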


#4

I went back to what I could gather from archive.org back in May (about 4000 pages of forum posts), parsed it, and integrated the content with what is already in the archive. All in all, I could restore another 6000-ish forum posts that were missing from the other "big" scrape. In the grand scheme that's not all that much, about 2 or 3% of the total archive, but it may fill in some gaps.

The thread from your example is part of the content that has now been added:
https://forum.archive.openwrt.org/viewtopic.php?id=37368&p=4
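
For anyone curious what "integrating" means here, the idea is roughly a merge keyed on post ID in which posts already in the archive win and the archive.org scrape only fills gaps. The Python below is a simplified sketch of that idea, not the actual tooling; the dictionary layout and the field names (`author`, `body`) are hypothetical.

```python
def merge_posts(existing, recovered):
    """Merge two {post_id: post} mappings.

    Posts already in `existing` are kept untouched; posts from `recovered`
    are added only when their ID is missing. Returns the merged mapping and
    the list of IDs that were newly added.
    """
    merged = dict(existing)  # shallow copy, so the input mapping is not modified
    added = []
    for post_id, post in recovered.items():
        if post_id not in merged:
            merged[post_id] = post
            added.append(post_id)
    return merged, added


if __name__ == "__main__":
    # Hypothetical sample data: post 2 exists in both sources, post 3 only in the scrape.
    archive = {
        1: {"author": "alice", "body": "first post"},
        2: {"author": "bob", "body": "reply (current copy)"},
    }
    wayback_scrape = {
        2: {"author": "bob", "body": "reply (older copy)"},
        3: {"author": "carol", "body": "post missing from the big scrape"},
    }
    merged, added = merge_posts(archive, wayback_scrape)
    print(f"{len(added)} post(s) restored, {len(merged)} in total")
```

Letting the existing copy win reflects the point that archive.org captures are often outdated compared to the other sources.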

Thank you for your input, it is well appreciated.


#5

Thank you very much for doing that extra work, it is really appreciated.


#6

About archiving the current forum with archive.org:
This seems to be an open problem with Discourse in general.
https://meta.discourse.org/t/make-discourse-play-nice-with-the-wayback-machine/34579/36


#7

I believe it's not all that big of a problem. The issue with the old forum is not only that the server went down, but that there is now a gaping hole where the server was, including all its data. As far as I know, the server was provided externally, and neither it nor its owner is accessible anymore.

I believe this cannot and will not happen again, now that the server is set up properly in OpenWrt's own server infrastructure, with proper backups and all.