Arquivo.pt presents itself as a way for tools based on Artificial Intelligence (AI) to perform better in the Portuguese language. This digital service from the Foundation for Science and Technology, developed through FCCN, is considered the largest set of textual data in Portuguese in Portugal, available in open access for researchers to train natural language processing (NLP) models.
The need for AI to interpret the complexities of the Portuguese language
Artificial Intelligence encompasses several areas of knowledge, such as linguistics and computer science, and is present in new technologies used daily by people worldwide. When we search for information on the internet, for example, and a response is generated in a specific language, this process uses AI.
Natural language processing is the branch of artificial intelligence that helps computers understand, interpret, and manipulate human language, and it is what allows machines to refine the algorithms that generate these responses tailored to users. However, these models have been developed mostly for English and far less for other languages, such as Portuguese.
The truth is that the more NLP models are trained on a language, the better they can interpret its complexities. However, this is only possible with quality data, and it is precisely here that Arquivo.pt, a digital service from the Foundation for Science and Technology, emerges as a solution.
Arquivo.pt: the largest set of textual data in the Portuguese language
Arquivo.pt presents itself as the largest set of textual data in Portuguese in Portugal, available in open access for researchers to train natural language processing models.
With over 1 petabyte of content preserved since the 1990s, covering everything that can be found on web pages, Arquivo.pt provides not only text but also images, audio files, video files, and various metadata, among other types of content in Portuguese.
Content is accessible through the Arquivo.pt search interface and APIs.
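As a rough illustration of programmatic access, the sketch below builds a full-text search request and reads back the archived items. It is not an official client: the endpoint `https://arquivo.pt/textsearch`, the parameter names (`q`, `maxItems`, `from`, `to`), and the `response_items` key are assumptions that should be checked against the Arquivo.pt API documentation, and the helper names `build_search_url` and `search` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Arquivo.pt full-text search API (verify in the official docs).
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, max_items=10, date_from=None, date_to=None):
    """Build a search URL; the parameter names are assumptions, not confirmed here."""
    params = {"q": query, "maxItems": max_items}
    if date_from:
        params["from"] = date_from  # assumed lower bound on the capture date
    if date_to:
        params["to"] = date_to      # assumed upper bound on the capture date
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def search(query, **kwargs):
    """Fetch results over the network and return the list of archived items."""
    with urlopen(build_search_url(query, **kwargs), timeout=30) as response:
        payload = json.load(response)
    # "response_items" is the assumed key that holds the result list.
    return payload.get("response_items", [])
```

Under these assumptions, calling `search("língua portuguesa", max_items=5)` would return a list of dictionaries describing archived pages, which a researcher could then filter before collecting text for a training corpus.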
GlórIA, a model for the Portuguese language
One of the projects that used Arquivo.pt to obtain large amounts of text is GlórIA, a large language model (LLM) focused on European Portuguese.
“Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese,” explain Ricardo Lopes, João Magalhães and David Semedo, authors of the project and researchers at the Faculty of Science and Technology of NOVA University of Lisbon, in their article GlórIA – A Generative and Open Large Language Model for Portuguese.
The model was trained on 35 billion tokens, the units of text that machines can process, drawn from various sources. Arquivo.pt contributed a collection of 1.4 million archived news items and periodicals in Portuguese.











