Find out the opinion of António Branco, Professor at the Faculty of Sciences of the University of Lisbon and General Director of the PORTULAN CLARIN Research Infrastructure for the Science and Technology of Language, about the Albertina PT-* project.
The advances in Artificial Intelligence have been impressive, especially in its application to Language Technology. This progress is based on machine learning with so-called Large Language Models, such as GPT-3 or ChatGPT, which have attracted so much attention recently.
These networks are gigantic: GPT-3, for example, has 175 billion parameters, the weights of the connections between its artificial neurons. They pick up linguistic regularities when trained, in massive computational processes, on colossal volumes of linguistic data, whether text or audio. In the case of GPT-3, a corpus of around 500 billion tokens of text was used for training.
Once trained, these models can be applied to other language tasks at an unprecedented level of quality: translation, conversation, speech transcription and subtitling, text and speech generation, content analysis and information extraction, among others. When integrated into wider systems, they are transforming diagnostics and healthcare, financial and legal services, gaming and entertainment, education, creativity and culture.
Because of the size of these models, such processing tasks are offered remotely as online services, like search engines, rather than as software installed locally on our devices, like spell-checkers. And because of the size of the resources needed for learning, these services are for now available only from an oligopoly of big tech companies, which can be counted on the fingers of one hand and which alone can access the colossal amounts of computing and data needed for training.
As a result, in the digital age, the use of language, with other humans, organizations, services or artificial devices, will never again take place without this pervasive and deep technological intermediation, which processes our acts of communication and accesses their meaning.
We have enough experience with information search engines, for example, and with their workings and impacts, to intuit the consequences of this technological intermediation for the everyday use of language itself. Technological intermediation in general generates a digital trail of personal data beyond our control. The incessant technological intermediation of human language and communication in particular, funneled through a small global oligopoly, creates alarming risks for individual and collective sovereignty.
Undesirable impacts of emerging technologies are mitigated with more and better technology, not less. Dispersing the supply of these services is crucial to counter the threat posed by their concentration. The answer thus lies in fostering an innovation ecosystem that instead enables timely and widespread access to the resources needed for the appropriation and exploitation of Language Technology by as many individuals and organizations as possible: private and public, small and large, national and international.
In this respect, the RNCA (the Portuguese National Advanced Computing Network) is already playing a major role, particularly through its Advanced Computing Projects Calls: Artificial Intelligence in the Cloud.
I coordinate one of the projects funded by the first edition of this call, in which we seek to contribute to open AI and to the technological preparation of the Portuguese language. One of the results of this project, which I am reporting on here, is Albertina PT-*: a foundation model developed specifically for the Portuguese language, covering both the European variant spoken in Portugal and the American variant spoken in Brazil.
As far as we know, with its 900 million parameters and its level of performance, it constitutes the current state of the art among large encoder-class foundation models for Portuguese that are publicly available in open source, free of charge and under an unrestricted license. A comprehensive presentation of its features and implementation can be found in the paper accepted for publication in the proceedings of EPIA 2023, the annual conference of the Portuguese Association for Artificial Intelligence.
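As an illustration of how an openly released encoder model of this kind can be used, here is a minimal sketch based on the Hugging Face `transformers` library. The model identifier `PORTULAN/albertina-ptpt` is an assumption, not taken from the text above; check the PORTULAN organization on the Hugging Face Hub for the exact, current model names.

```python
def fill_mask(sentence: str, model_id: str = "PORTULAN/albertina-ptpt"):
    """Return the top candidate fillers for the [MASK] slot in `sentence`.

    Assumes the third-party `transformers` library is installed and that
    `model_id` names a publicly hosted Albertina checkpoint; the import is
    done lazily so this module loads even without `transformers` present.
    """
    from transformers import pipeline

    # "fill-mask" is the standard transformers task for encoder models:
    # the model scores vocabulary items for the masked position.
    predictor = pipeline("fill-mask", model=model_id)
    return predictor(sentence)
```

Being an encoder rather than a generative decoder, a model like Albertina lends itself to tasks such as masked-token prediction, text classification or information extraction, typically after fine-tuning, rather than to free-form text generation.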
This is only a first step towards the democratization of this technology, which is key for the future, and towards the promotion of open generative AI, to which the RNCA, I am sure, will continue to make an invaluable contribution.