In collaboration with Wikimedia Portugal, Arquivo.pt has launched a project to preserve the online references contained in Portuguese Wikipedia articles. The aim is to change the references to broken links in Wikipedia articles so that they refer, in perpetuity, to content preserved on Arquivo.pt, thus keeping the referenced information always accessible to Wikipedia users.
Wikipedia articles are among the most widely used online resources for educational purposes. However, they often reference external pages containing important complementary information that has since become unavailable on the original websites. This problem degrades the quality of Wikipedia as a credible and verifiable source of information.
In August 2023, the Arquivo.pt team ran an experiment to measure the percentage of broken external links (those pointing outside the wikipedia.org domain) in Portuguese Wikipedia articles. The results showed that 25% of the external links referenced in the Portuguese Wikipedia were broken.
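As an illustration of how such a measurement could be made, the following is a minimal Python sketch and not the procedure the Arquivo.pt team actually used. It assumes that a link counts as broken when the request fails or returns an HTTP status of 400 or above; the function names, the timeout, and the User-Agent string are hypothetical.

```python
import urllib.request
from typing import Callable, Iterable
from urllib.parse import urlsplit


def is_external(url: str) -> bool:
    """True when the URL points outside the wikipedia.org domain."""
    host = (urlsplit(url).hostname or "").lower()
    return not (host == "wikipedia.org" or host.endswith(".wikipedia.org"))


def is_broken(url: str, timeout: float = 10.0) -> bool:
    """Assumed definition of 'broken': the request fails or returns status >= 400."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status >= 400
    except (OSError, ValueError):
        # HTTP errors, DNS failures, timeouts and malformed URLs all end up here.
        return True


def broken_share(urls: Iterable[str], check: Callable[[str], bool] = is_broken) -> float:
    """Percentage of external links for which `check` reports a broken link."""
    external = [u for u in urls if is_external(u)]
    if not external:
        return 0.0
    broken = sum(1 for u in external if check(u))
    return 100.0 * broken / len(external)
```

Passing a different `check` function makes it possible to try the calculation without network access, or to swap in a stricter definition of a broken link.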
In addition, a link may still lead to available content that is no longer what the Wikipedia article originally intended to reference, either because the domain has since been bought by a third party or because the content was replaced for malicious purposes. This phenomenon is called Content Drift.
To deal with these problems, Arquivo.pt has launched a project in partnership with Wikimedia Portugal with the aim of preserving the online references present in Wikipedia articles in Portuguese. The main goal is to replace broken links in Wikipedia articles with links to content preserved at Arquivo.pt, thus ensuring that the information cited remains accessible to readers and Wikipedia users.
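The replacement step can be pictured with a short Python sketch. It is illustrative only: it assumes that archived pages are addressed as `https://arquivo.pt/wayback/<timestamp>/<original URL>` with a 14-digit timestamp, and it models a citation as a plain dictionary using the `archive-url` and `archive-date` parameter names. The helper names are hypothetical and do not describe the project's actual tooling.

```python
def arquivo_url(original_url: str, timestamp: str) -> str:
    """Build a link to an archived copy (URL pattern is an assumption)."""
    return f"https://arquivo.pt/wayback/{timestamp}/{original_url}"


def with_archive(citation: dict, timestamp: str) -> dict:
    """Return a copy of the citation that also points to the archived version.

    `timestamp` is assumed to be in YYYYMMDDhhmmss form.
    """
    updated = dict(citation)  # leave the caller's dictionary untouched
    updated["archive-url"] = arquivo_url(citation["url"], timestamp)
    updated["archive-date"] = f"{timestamp[0:4]}-{timestamp[4:6]}-{timestamp[6:8]}"
    return updated


citation = {"url": "http://example.com/page", "title": "Example"}
print(with_archive(citation, "20230215000000"))
```

Keeping the original `url` alongside the archived one means the citation still records where the content was first published.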
Preservation of referenced pages on Wikipedia
Arquivo.pt automatically extracted 14 million links from the references in all the articles on the Portuguese Wikipedia. Of these links, only 620 pointed to Arquivo.pt and 744,553 (5.3%) to the Internet Archive. Note that Wikipedia's guide to creating references recommends including a link to a web-archived copy in each citation (parameter arquivourl/archive-url).
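A simplified sketch of this kind of extraction and counting is shown below. It is not the actual Arquivo.pt pipeline: the regular expressions only cover plain `<ref>...</ref>` wikitext, and the list of archive host names is an assumption made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches <ref>...</ref> and <ref name="x">...</ref>, but not self-closing <ref ... />.
REF_PATTERN = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
URL_PATTERN = re.compile(r"https?://[^\s|\]}<>\"]+")

# Assumed host names for the two web archives mentioned in the text.
ARCHIVE_HOSTS = {
    "arquivo.pt": "Arquivo.pt",
    "web.archive.org": "Internet Archive",
    "archive.org": "Internet Archive",
}


def reference_links(wikitext: str) -> list[str]:
    """Collect every URL that appears inside a <ref> element."""
    links = []
    for body in REF_PATTERN.findall(wikitext):
        links.extend(URL_PATTERN.findall(body))
    return links


def classify(url: str) -> str:
    """Label a URL by the archive it points to, or 'other'."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return ARCHIVE_HOSTS.get(host, "other")


def archive_counts(urls: list[str]) -> Counter:
    return Counter(classify(u) for u in urls)


def share(count: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Using the rounded total of 14 million links quoted in the text.
print(share(744_553, 14_000_000))  # 5.3
```

A production extractor would more likely rely on a wikitext parser or on the external-links tables in the Wikimedia dumps rather than on regular expressions.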
On February 15, 2023, Arquivo.pt collected all the pages referenced in Portuguese Wikipedia articles, resulting in a new collection named EAWP42: Collection of external links from wikipedia using the wikimedia dumps, which contains 12 million files (856 GB).
The main result of this project was the creation of a new automatic process for extracting and collecting the external links cited on Portuguese Wikipedia pages. This process is now part of Arquivo.pt's regular collection operations, and Wikipedia citations are collected annually.