In collaboration with Wikimedia Portugal, the FCCN unit of the Foundation for Science and Technology (FCT) has launched a project to preserve online references contained in Portuguese Wikipedia articles. The goal is to change references to broken links in Wikipedia articles so that they permanently reference content preserved on Arquivo.pt, thus keeping the referenced information always accessible to Wikipedia users.
Wikipedia articles are among the most widely used online resources for educational purposes. However, they often reference external pages containing important additional information that has since become unavailable on the original websites. This problem degrades Wikipedia's quality as a credible and verifiable source of information.
In August 2023, FCCN's Arquivo.pt team conducted an experiment to measure the percentage of broken external links (links outside the wikipedia.org domain) in Portuguese Wikipedia articles. The results showed that 25% of the external links referenced in the Portuguese Wikipedia were broken.
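The article does not say how a link was judged to be broken. As a minimal sketch, assuming a link counts as broken when the request fails or returns an HTTP status of 400 or above, the share of broken links could be computed along these lines. The function names and the status rule are illustrative assumptions, not the Arquivo.pt team's method.

```python
from typing import Iterable, Optional


def is_broken(status: Optional[int]) -> bool:
    """Assumed rule: a failed request (None) or any 4xx/5xx status is broken."""
    return status is None or status >= 400


def broken_percentage(statuses: Iterable[Optional[int]]) -> float:
    """Return the percentage of links classified as broken."""
    results = [is_broken(s) for s in statuses]
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


# Hypothetical sample: one HTTP status per external link, None for a failed request.
sample = [200, 404, 200, 200, None, 301, 200, 500]
print(f"{broken_percentage(sample):.1f}% broken")
```

A status-based rule like this only detects links that fail outright; it cannot tell whether a page that still responds shows the content originally cited.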
A further problem is that a link may still lead to available content that is no longer what the Wikipedia article originally intended to cite. This can happen, for example, because the domain has since been purchased by a third party, sometimes with malicious intent. This phenomenon is called Content Drift.
To address these issues, Arquivo.pt launched a project in partnership with Wikimedia Portugal to preserve online references present in Portuguese Wikipedia articles. The main goal is to replace broken links in Wikipedia articles with links that direct to content preserved on Arquivo.pt, thus ensuring that cited information remains accessible to Wikipedia readers and users.
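To make the replacement step concrete, here is a small sketch of how an archived address might be built for a broken link. It assumes Arquivo.pt replay addresses follow the Wayback-style pattern of a 14-digit timestamp followed by the original URL; the function name, the example URL, and the capture date are hypothetical.

```python
from datetime import datetime


def arquivo_url(original_url: str, captured_at: datetime) -> str:
    """Build a replay address, assuming the pattern /wayback/<timestamp>/<url>."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"https://arquivo.pt/wayback/{timestamp}/{original_url}"


# Hypothetical broken link and capture date.
broken_link = "http://example.com/page"
print(arquivo_url(broken_link, datetime(2023, 2, 15, 12, 0, 0)))
```

In practice the timestamp would have to match a capture that actually exists in the archive, ideally one made close to the date the citation was added.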
Preservation of pages referenced in Wikipedia
The Portuguese Wikipedia contains about 1 million articles, and on average 140 pages are edited per day.
Arquivo.pt automatically extracted 14 million links from the references in all Portuguese Wikipedia articles. Of these links, only 620 referenced Arquivo.pt and 744,553 (5.3%) referenced the Internet Archive. Note that Wikipedia's guide to creating references recommends including a link to a web-archived copy in citations (parameter archiveurl/archive-url).
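The kind of count described above can be sketched with a simple pass over an article's wikitext. This is only an illustration: it assumes citations use the parameter names url and archive-url/archiveurl, it does not handle nested templates or localized parameter names, and it is not the extraction process Arquivo.pt actually used.

```python
import re
from urllib.parse import urlparse

# Citation parameters: "|url=..." and "|archive-url=..." or "|archiveurl=...".
URL_PARAM = re.compile(r"\|\s*url\s*=\s*([^|}\s]+)")
ARCHIVE_PARAM = re.compile(r"\|\s*archive-?url\s*=\s*([^|}\s]+)")


def on_domain(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def summarize(wikitext: str) -> dict[str, int]:
    """Count cited links and how many archive links point to each web archive."""
    cited = URL_PARAM.findall(wikitext)
    hosts = [urlparse(u).netloc.lower() for u in ARCHIVE_PARAM.findall(wikitext)]
    return {
        "cited_links": len(cited),
        "arquivo_pt": sum(on_domain(h, "arquivo.pt") for h in hosts),
        "internet_archive": sum(on_domain(h, "archive.org") for h in hosts),
    }


# Hypothetical wikitext with three citations, two of them archived.
sample = (
    "{{cite web |url=http://example.com/a |title=A "
    "|archive-url=https://arquivo.pt/wayback/20230215000000/http://example.com/a}}\n"
    "{{cite web |url=http://example.org/b |title=B "
    "|archiveurl=https://web.archive.org/web/2023/http://example.org/b}}\n"
    "{{cite web |url=http://example.net/c |title=C}}"
)
print(summarize(sample))
```

Run over every article in a dump, tallies like these would give the proportion of citations that already carry an archived copy.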
On February 15, 2023, Arquivo.pt collected all the pages referenced in Portuguese Wikipedia articles. This resulted in a new collection named EAWP42: Collection of external links from wikipedia using the wikimedia dumps, which contains 12 million files (856 GB).
The main result of this project was the creation of a new automated process for extracting and collecting external links cited on Portuguese Wikipedia pages. This process became part of the FCCN unit's collection operation, with Wikipedia citations now being collected annually.