André Mourão, from the Arquivo.pt team, explains the functionality of this FCCN Unit service that allows users to search for images from the past.
Arquivo.pt launched the Dionysius project on March 24th. Can you tell us what this initiative entails?
At Arquivo.pt, we have a model for periodically releasing new versions of our portal.
In each version, we group together all the improvements made, which usually revolve around one central objective.
Dionysius had a special impact because it launched the new version of image search, the result of years of work. We went from a prototype with 22 million searchable images to a system that provides over 1.8 billion, while maintaining the same response speed and ease of use of our web portal.
Since then, as you say, 1.8 billion images from the web's past have become searchable on Arquivo.pt. How do you rate this result?
This process went very well and far exceeded our most optimistic expectations. We processed over 8 billion pages, totaling 520 TB of archived data, covering the period from 1992 to 2019.
In May 2020, we predicted we would find 18 times more images; the end result was an 81-fold increase in the number of searchable images.
This solution is described as "an innovative system" by Arquivo.pt. How does this version innovate compared to what other web archives have done?
Beyond scale, the biggest innovation of this Arquivo.pt search is its focus on extracting relevant information from the pages for each image. For every image on every page, we extracted a textual caption, corresponding to the portion of the page text that is closest to the image.

This is especially relevant on pages that have a lot of images, as it allows users to find the specific image that illustrates their search.
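As an illustration of the idea of captioning an image with the page text closest to it, here is a minimal sketch using only the Python standard library. It is not Arquivo.pt's actual pipeline: the names `_PageScanner` and `nearest_captions` are hypothetical, and "closest" is approximated here as the nearest text chunk in document order, which is an assumption made for simplicity.

```python
from html.parser import HTMLParser


class _PageScanner(HTMLParser):
    """Collects text chunks and <img> sources in document order."""

    def __init__(self):
        super().__init__()
        self.items = []   # entries are ("text", str) or ("img", src)
        self._skip = 0    # depth inside <script>/<style>, whose text is ignored

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.items.append(("img", src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self._skip:
            self.items.append(("text", text))


def nearest_captions(html):
    """Map each image src to the text chunk nearest to it in document order.

    Distance is counted in collected items (a toy heuristic); on a tie,
    the text that appears earlier in the page wins.
    """
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()
    items = scanner.items
    text_positions = [i for i, (kind, _) in enumerate(items) if kind == "text"]
    captions = {}
    for i, (kind, value) in enumerate(items):
        if kind != "img":
            continue
        if text_positions:
            closest = min(text_positions, key=lambda p: abs(p - i))
            captions[value] = items[closest][1]
        else:
            captions[value] = ""
    return captions
```

On a page with many images, each image ends up with its own caption rather than sharing the text of the whole page, which is what lets a search match one specific image.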
Other features worth highlighting are the automatic classification of content that is potentially offensive to users, advanced search with multiple content filters, and programmatic access through APIs, which allows the data collected by Arquivo.pt to be used in innovative projects (https://arquivo.pt/apis).
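To make the API access mentioned above more concrete, the following sketch builds a query URL for the image search. The endpoint `https://arquivo.pt/imagesearch` and the parameter names `q`, `from`, `to`, `siteSearch` and `maxItems` are assumptions that should be checked against the documentation at https://arquivo.pt/apis; the helper `build_image_search_url` is a hypothetical name introduced only for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against https://arquivo.pt/apis before relying on it.
IMAGE_SEARCH_ENDPOINT = "https://arquivo.pt/imagesearch"


def build_image_search_url(query, date_from=None, date_to=None,
                           site=None, max_items=10):
    """Build an image search URL with optional date and site filters.

    Parameter names (q, from, to, siteSearch, maxItems) are assumptions.
    Dates are passed through as strings, e.g. "19960101000000".
    """
    params = {"q": query, "maxItems": max_items}
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    if site:
        params["siteSearch"] = site
    return f"{IMAGE_SEARCH_ENDPOINT}?{urlencode(params)}"


# Example: images matching "Lisboa" archived between 1996 and 2000.
url = build_image_search_url("Lisboa",
                             date_from="19960101000000",
                             date_to="20001231235959")
```

The resulting URL can then be fetched with any HTTP client, for instance `urllib.request.urlopen`, and the response parsed according to the format described in the API documentation.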
What kind of added value and potential does this new feature represent for Arquivo.pt users?
External studies show that around a quarter of generalist web searches are for images. In the case of Arquivo.pt, image searches represent around one-fifth of all searches performed. Searching archived data complements generalist image search engines like Google Images, which focus on images from the present, especially popular and recent content.
Arquivo.pt allows for retrospective searches with a special focus on time. Older versions of images and pages are available for consultation, allowing you to see how pages and images have evolved. Our search allows for greater impartiality in the results returned, as we are not focused on popularity metrics. We also allow for greater granularity in filtering search results (for example, by date, website, or file type).
Arquivo.pt has already given rise to many projects with potential for positive impact on society. In the specific case of this search, a scientific article by Ricardo Campos and co-authors was recently published, in which the Image Search API is used to find images that illustrate the results of the temporal division of a news item.
To give a personal example, I found many records of book reviews written by my great-aunt at the Calouste Gulbenkian Foundation. These records were digitized from the originals from the 1960s and 1970s, placed on the Gulbenkian website, and are now available for consultation and research on Arquivo.pt.
Is there anything you'd like to add?
Arquivo.pt will host an online session where I will talk about how we made these 1.8 billion images searchable. The event will take place on April 23rd at 3:00 p.m. (with free advance registration).
I would like to emphasize that our web portal and APIs are open source, free to access, and available for personal use or for research projects without prior registration.
Finally, I would also like to mention that the Arquivo.pt Award (https://arquivo.pt/premio2021), now in its 4th year, aims to award up to €10,000 to innovative works based on historical information preserved by Arquivo.pt. The works can cover topics from any field (e.g., Education, History, Sociology, Communication, Health, IT), and applications are open until May 4, 2021.