This Monday the release of the ALIA language models was announced. The initiative has been in development for years, and the first fruits, still modest but promising, are now beginning to appear.
To learn more about ALIA, at Xataka we spoke with Marta Villegas (@MartaVillegasM), head of the Language Technologies Unit at the Barcelona Supercomputing Center (BSC). This allowed us to clarify the status of the project, its objectives and its next challenges.
Competing with ChatGPT is not the point
The first thing we wanted to know is how ALIA had been created, and here Marta Villegas clarified that the model is based on the Llama architecture (Meta's open-source model), "but the model has been trained from scratch and with the initial weights set to zero".


This is important because ALIA is not a Llama-based model that has undergone a refinement or "fine-tuning" process. In those cases, this expert explained, "you start from a model trained with other data and with initialized weights, and you do it to adapt that model to your needs, either because you have more data and want it to be better, or because you want to adapt it to a particular domain."
But here, she told us, "the vocabulary (the set of tokens) is completely different." In other models the corpus, or training data set, may be mostly in English, which means the set of admissible tokens ends up being derived from English. That, Villegas indicates, makes the model adapt less efficiently to other languages.
That is precisely what has been sought with ALIA: reducing the weight of English in order to give more presence to the 35 European languages covered and, especially, to Spanish, Catalan, Basque and Galician.
How ALIA has been trained
The ALIA training process began with some experiments in April 2024. These were necessary because, as Villegas explained, "training is not a matter of feeding in the data, pressing a button and that's it." It also had to be taken into account that MareNostrum 5, the supercomputer located at and managed by the BSC, had just come into operation at full power and there was high demand to use it.


MareNostrum 5
Throughout this training process, the ALIA project has had access to MareNostrum 5's computing capacity in stages. Although for a short period they had access to 512 of the supercomputer's 1,120 specialized nodes, 256 nodes were used for many months, and since September they have been using 128 nodes, "which is a lot," Villegas highlights.
During the training process, she told us, there are so-called "checkpoints" at which it is possible to evaluate how training is going. These "pauses" also allow certain training data to be updated, as in fact happened here: at one point they introduced a new, higher-quality corpus that allowed them to replace some of the data they had.
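As a rough illustration of how such checkpoints work in practice, the sketch below shows a generic training loop that periodically saves its state and runs an evaluation. It is a minimal sketch assuming a PyTorch/Hugging Face-style model; the function names and intervals are illustrative and do not describe the BSC training pipeline.

```python
# Minimal, illustrative checkpointing loop (not the BSC pipeline).
# Saving state periodically lets you evaluate progress, resume after
# interruptions, or continue training with an updated data mix.
import torch

def train_with_checkpoints(model, optimizer, data_loader, eval_fn,
                           checkpoint_every=1000, path="checkpoint.pt"):
    for step, batch in enumerate(data_loader):
        loss = model(**batch).loss        # assumes an HF-style model that returns .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step > 0 and step % checkpoint_every == 0:
            torch.save({"step": step,
                        "model": model.state_dict(),
                        "optimizer": optimizer.state_dict()}, path)
            print(f"step {step}: eval score = {eval_fn(model):.3f}")
```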
This is just the beginning: it’s time to “instruct” and “align” ALIA
Villegas explained to us that ALIA is a foundational model: as it stands, it is not an alternative to ChatGPT. The latter is based on GPT-4, a much more ambitious foundational model that involved far more investment.


Here we must differentiate the foundational model from the "instructed" and "aligned" models we usually interact with. As this expert told us, "ALIA-40b is a foundational model that is not instructed or aligned. For a model to be a ChatGPT, to follow a conversation, have a certain memory and be 'politically correct', the foundational model (which only learns to predict the next token) is 'instructed' by feeding it a large set of texts."
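To make the difference tangible, here is a minimal sketch of what a foundational model does: it simply continues text token by token, with no chat formatting or alignment on top. The repository id is an assumption used for illustration (a small Salamandra base model, far easier to run than ALIA-40b); check the official Hugging Face pages for the actual names.

```python
# Minimal sketch: a foundational (base) model only continues text.
# The repo id below is assumed for illustration; verify it on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-2b"   # assumed id of a small base model from the ALIA family
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "La capital de Galicia es"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# A base model just keeps writing; it does not "answer" the way a chatbot does.
```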
Even so, the goal is to gradually move in that direction. "In March, the instructed version of ALIA-40b is expected to be launched, with a first set of open instructions," Villegas told us. The creation of these instructions, the data that allow models like this to be instructed, will be subcontracted, and around a million euros will be invested in building that instruction set from scratch.
This data will also be published so that it is available to institutions and developers: if it has been paid for with public money, Villegas explains, it is logical for the data to be public as well, something that does not usually happen with the AI models of private companies.
While instructing an AI model provides guidelines on how to respond and defines the context and purpose of those responses, alignment tackles problems such as avoiding discriminatory bias, preventing misinformation or protecting privacy.
It is precisely this lack of alignment that means that using these models in this initial phase can produce responses with errors and biases, which the alignment phase largely mitigates.
ALIA and the competition: it is neither a rival of ChatGPT nor does it intend to be
In fact, Villegas highlights, "the objective is not to compete with ChatGPT; for that we would need 5 billion dollars." ALIA-40b "is a good model, and a chatbot can be built in the future, because the intention is to instruct and align it, but that will take time."


Within the ALIA family we also have the Salamandra models (2B and 7B), smaller and more modest, but which already have their first instructed versions. Their performance and capacity still have room for improvement, but they are good starting points for the future.
It was inevitable to ask how ALIA then intends to compete with other models, both the closed ones developed by private companies and the open-source ones. For her, "there is a demand for intermediate models that everyone can then adapt to their specific use case; not everyone can use ChatGPT, for reasons such as privacy or the use case itself."
Villegas also wanted to highlight how these smaller models can deliver exceptional performance on specific tasks, and can operate with guarantees in terms of security and of not sharing sensitive data.
The objective is not to compete with ChatGPT; for that we would need 5 billion dollars
Not only that, she points out: "we also gain know-how as a country: we have a group of young researchers with great experience in this, and generating this pool of people is important."
Villegas could not give us details on the first two projects to which ALIA will in theory be applied. At the launch, there was talk of an internal chatbot that promises to speed up the work of the Tax Agency, and of a solution for primary care medicine that will allow "an early and more precise diagnosis of heart failure."
ALIA’s next steps
As this expert anticipated, in two or three months we are expected to have an instructed version of ALIA that we can use in a way somewhat closer to how we now use ChatGPT, for example.
For the remainder of the year, apart from this release of the instruction data used to instruct the model, the objective is to have a first aligned version, which will be much closer to what we now have with ChatGPT, Claude or Gemini.
It is also important to have a model with these characteristics because, as Villegas explains, it "allows us to generate synthetic data to train smaller, very specific models, apart from using it in applications of all kinds."
There is another interesting detail: the "large" model, ALIA-40b, can also be used as a kind of "judge" (LLM as a judge) to evaluate the quality and accuracy of the responses generated by other AI models. It is a way to train and align smaller models, which makes the relevance of ALIA-40b as a foundation for the future even clearer.
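As a rough idea of what "LLM as a judge" looks like in code, the sketch below asks a large model to score an answer produced by another model. The prompt format and the repository id are illustrative assumptions, not the BSC evaluation setup, and running a 40B model requires substantial multi-GPU hardware.

```python
# Hedged sketch of "LLM as a judge": a large model scores another model's answer.
# Repo id and prompt are illustrative; this is not the BSC evaluation pipeline.
from transformers import pipeline

judge = pipeline("text-generation", model="BSC-LT/ALIA-40b")  # assumed id; needs a lot of GPU memory

def judge_answer(question: str, answer: str) -> str:
    prompt = (
        "Evalúa la siguiente respuesta del 1 al 5 según su corrección y claridad.\n"
        f"Pregunta: {question}\n"
        f"Respuesta: {answer}\n"
        "Puntuación:"
    )
    return judge(prompt, max_new_tokens=5)[0]["generated_text"]

print(judge_answer("¿Cuál es la capital de Galicia?", "Santiago de Compostela"))
```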
This is what ALIA looks like on the inside
As the data published on Hugging Face indicate, ALIA is a family of multilingual AI models pre-trained from scratch. It has 2B, 7B and 40B variants and is released under an open-source license, specifically Apache 2.0. All of its training scripts, configuration files and weights are available in the GitHub repository.
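Since the weights are openly licensed, they can be fetched directly from the Hugging Face Hub. The snippet below is a minimal sketch; the repository id is assumed for illustration and should be checked against the official model card, and the 40B checkpoints occupy tens of gigabytes.

```python
# Minimal sketch: downloading the published weights from the Hugging Face Hub.
# The repo id is assumed for illustration; check the official model card.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="BSC-LT/ALIA-40b",                      # assumed id; the files are very large
    allow_patterns=["*.json", "*.safetensors"],     # configs and weights only
)
print("Model files downloaded to:", local_path)
```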


A Salamandra-2B translation demo is also available on Hugging Face; it is a good way to test that capability of the model.
For its training, 6.9 trillion "heavily filtered" tokens of text and code in 35 European languages were used. The training data set is also extensively documented in the ALIA-Kit on the project website, which is especially welcome and gives the project full transparency.
All models have been trained on the MareNostrum 5 supercomputer managed by the Barcelona Supercomputing Center – National Supercomputing Center (BSC-CNS). It consists of 1,120 nodes, each of which has four NVIDIA Hopper cards with 64 GB of HBM2 memory, two Intel 8460Y Sapphire Rapids processors, 512 GB of main memory (DDR) and 460 GB of storage.
The "pre-training" data focused on giving more weight to Spanish and the co-official languages (Catalan, Galician, Basque). The English data and code were cut in half, the data in the languages spoken in Spain were doubled, and the rest of the languages covered remained the same. As a result, English represents 39.31% of the data, compared to 16.12% for Spanish, 1.97% for Catalan, 0.31% for Galician and 0.24% for Basque.
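To illustrate how that kind of re-weighting shifts the final mix, here is a small sketch with hypothetical token counts (the real corpus statistics are in the ALIA-Kit). The factors follow the description above: English halved, the languages spoken in Spain doubled, everything else unchanged.

```python
# Illustrative re-weighting of a training mix; the token counts are hypothetical,
# not the real ALIA corpus statistics.
raw_tokens = {"en": 4.0e12, "es": 0.40e12, "ca": 0.05e12,
              "gl": 0.008e12, "eu": 0.006e12, "other": 1.0e12}
factor = {"en": 0.5, "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0, "other": 1.0}

weighted = {lang: n * factor[lang] for lang, n in raw_tokens.items()}
total = sum(weighted.values())
for lang, n in weighted.items():
    print(f"{lang}: {100 * n / total:.2f}% of the final training mix")
```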


This is the language breakdown of the "corpus" used to train ALIA.
The main source of the training data is a dataset called Colossal OSCAR (Open Super-large Crawled Aggregated coRpus), which represents 53% of the total tokens. There are many more datasets, including, for example, CATalog (the largest Catalan dataset in the world) and Legal-ES, with data from the BOE, the Senate and Congress.
Images | BSC | ALIA
In Xataka | The EU wants to close the gap in the race for AI with 750 million euros. And it is good news for Barcelona