Arquivo.pt presents itself as a way to help tools based on Artificial Intelligence (AI) perform better in the Portuguese language. This digital service of the Foundation for Science and Technology (FCT), developed through FCCN, is considered the largest set of textual data in Portuguese in Portugal, available in open access for researchers to train natural language processing (NLP) models.

The need for AI to interpret the complexities of the Portuguese language

Artificial Intelligence encompasses several areas of knowledge, such as linguistics and computer science, and is present in technologies used every day around the world. When we search for information on the internet, for example, and a response is generated in a specific language, that process relies on AI.

Natural language processing is the branch of AI that helps computers understand, interpret, and manipulate human language, and it is what allows machines to refine the algorithms that generate responses tailored to users. However, these models have been developed mostly for English and far less for other languages, such as Portuguese.

The more NLP models are trained on a language, the better they can interpret its complexities. This is only possible with quality data, and it is precisely here that Arquivo.pt emerges as a solution.

Arquivo.pt: the largest set of textual data in the Portuguese language

Arquivo.pt is presented as the largest set of textual data in Portuguese in Portugal, available in open access for researchers to train natural language processing models.

With over 1 petabyte of content preserved since the 1990s, covering everything that can be found on web pages, Arquivo.pt provides not only text but also images, audio files, video files and various metadata, among other types of content in Portuguese.

Content is accessible through the Arquivo.pt search interface and APIs.
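As an illustration of programmatic access, the following is a minimal Python sketch of a full-text query. It assumes the endpoint `https://arquivo.pt/textsearch`, the query parameters `q`, `maxItems`, `from` and `to`, and the JSON fields `response_items`, `title` and `linkToArchive`; none of these are stated in this article, so they should be checked against the official Arquivo.pt API documentation before use. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Arquivo.pt full-text search API (verify in the official docs).
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, max_items=10, date_from=None, date_to=None):
    """Build a search URL. The parameter names used here are assumptions."""
    params = {"q": query, "maxItems": max_items}
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def extract_items(payload):
    """Return (title, archived link) pairs from a decoded JSON response."""
    return [
        (item.get("title", ""), item.get("linkToArchive", ""))
        for item in payload.get("response_items", [])
    ]


def search(query, **kwargs):
    """Send the request and parse the result (requires network access)."""
    with urlopen(build_search_url(query, **kwargs), timeout=30) as response:
        return extract_items(json.load(response))
```

Calling `search("língua portuguesa", max_items=5)` would, under these assumptions, return up to five archived pages matching the query. Keeping URL construction and response parsing in separate functions lets both be checked without contacting the service.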

GlórIA, a model for the Portuguese language

One of the projects that used Arquivo.pt to obtain large amounts of text is GlórIA, a large language model (LLM) focused on European Portuguese.

“Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese,” explain Ricardo Lopes, João Magalhães and David Semedo, authors of the project and researchers at the Faculty of Science and Technology of NOVA University of Lisbon, in their article “GlórIA - A Generative and Open Large Language Model for Portuguese”.

The model used 35 billion tokens, the units of text that machines can process, drawn from various sources; Arquivo.pt contributed a collection of 1.4 million archived news items and periodicals in Portuguese.
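To make the notion of a token concrete, here is a deliberately naive Python sketch that splits text into words and punctuation marks. This is only an illustration: LLMs typically use learned subword tokenizers, and nothing here reflects the tokenizer actually used by GlórIA. The function name `naive_tokens` is hypothetical.

```python
import re


def naive_tokens(text):
    """Split text into word tokens (accented letters included) and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(naive_tokens("A língua portuguesa é rica."))
# ['A', 'língua', 'portuguesa', 'é', 'rica', '.']
```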