We spoke with the creators of ALIA, the 100% Spanish AI, to understand its future
This Monday, the release of the ALIA language models was announced. The initiative has been in development for years, and the first fruits are now beginning to appear: still modest, but promising. To learn more about ALIA, at Xataka we spoke with Marta Villegas (@MartaVillegasM), head of the Language Technologies Unit of the Barcelona Supercomputing Center (BSC). The conversation allowed us to clarify the status of the project, its objectives and its next challenges.

To compete with ChatGPT, nothing

The first thing we wanted to know was how ALIA had been created, and here Marta Villegas clarified that the model is based on the Llama architecture (Meta's open-source model), "but the model has been trained from scratch and with zero initial weights". This is important because ALIA is not a Llama-based model that has undergone a refinement or "fine-tuning" process. In those cases, this expert explained, "you start from a model trained with other data and with initialized weights, and you do it to adapt that model to your needs, either because you have more data and you want it to be better, or perhaps because you want to adapt it to a particular domain."

Here, however, she told us, "the vocabulary (set of tokens) is completely different." In other models the corpus, or training data set, may be mostly in English, which means the set of admissible tokens ends up being computed from English. That, Villegas indicates, makes those models adapt less efficiently to other languages. This is precisely what ALIA pursues: reducing the relevance of English in order to increase that of the 35 languages of the European Union and, especially, Spanish, Catalan, Basque and Galician.

How ALIA has been trained

The ALIA training process began with some experiments in April 2024.
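Villegas's point about the vocabulary can be illustrated with a toy experiment. The sketch below is hypothetical (it is not ALIA's tokenizer, and the two vocabularies are made up): a subword vocabulary computed mostly from English text splits words in other languages into many more tokens, a metric often called "fertility", which wastes context window and compute.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a one-character token
            i += 1
    return tokens

# Toy vocabularies (illustrative assumptions): the English-centric one only
# has small fragments for Spanish words; the multilingual one learned
# Spanish units as whole tokens.
english_centric = {"the", "tion", "research", "in", "ves", "ti", "ga", "c", "ión"}
multilingual = {"investigación", "in", "ves"}

word = "investigación"  # Spanish for "research"
print(tokenize(word, english_centric))  # ['in', 'ves', 'ti', 'ga', 'c', 'ión'] — 6 tokens
print(tokenize(word, multilingual))     # ['investigación'] — 1 token
```

The same sentence therefore costs several times more tokens under an English-centric vocabulary, which is why training the tokenizer from scratch on a multilingual corpus matters.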
These experiments were necessary because, as Villegas explained, "training is not pressing a button after feeding in the data and that's it." It had to be taken into account that MareNostrum 5, the supercomputer hosted and managed by the BSC, had only just come into operation at full power, and there was high demand to use it.

MareNostrum 5

In this training process, the ALIA project has had gradual access to the computing capacity of MareNostrum 5. Although for a short period they had access to 512 of the supercomputer's 1,120 specialized nodes, 256 nodes were used for many months, and since September they have been using 128 nodes, "which is a lot," Villegas highlights.

During the training process, she told us, there are so-called "checkpoints", at which it is possible to evaluate how training is going. These "pauses" also allow certain training data to be updated, as in fact happened in this run: at a given moment they introduced a new high-quality corpus that allowed them to replace some of the data they had.

This is just the beginning: it's time to "instruct" and "align" ALIA

Villegas explained to us that ALIA is a foundational model: it is not prepared to be an alternative to ChatGPT. The latter is based on GPT-4, a much more ambitious foundational model that involved a far larger investment. Here we must differentiate the foundational model from the "instructed" and "aligned" models with which we usually interact. As this expert told us, "ALIA-40b is a foundational model that is not instructed or aligned. For a model to be a ChatGPT, understand the conversation, have a certain memory and be 'politically correct', the foundational model (which only learns to say the next token) is 'instructed' by passing it a bunch of texts." Even so, the goal is to gradually add these capabilities. "In March, the instructed version of ALIA-40b is expected to be launched, with a first set of open instructions," Villegas told us.
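What "instructing" a foundational model means in practice can be sketched as follows. This is a hypothetical illustration: the field names and prompt template below are common conventions, not ALIA's actual format. Instruction–response pairs are flattened into plain text, and the model keeps doing what it already does (predicting the next token), but now on examples that demonstrate how to follow instructions.

```python
# Hypothetical sketch of an open instruction set and its serialization for
# fine-tuning. The examples and the template are illustrative assumptions.
instruction_set = [
    {"instruction": "Translate into Galician: good morning.",
     "input": "",
     "output": "Bos días."},
    {"instruction": "Summarize the following text in one sentence.",
     "input": "ALIA is a family of openly released language models trained "
              "from scratch with special weight on Spanish, Catalan, Basque "
              "and Galician.",
     "output": "ALIA is a family of open language models focused on the "
               "co-official languages of Spain."},
]

TEMPLATE = ("### Instruction:\n{instruction}\n"
            "### Input:\n{input}\n"
            "### Response:\n{output}")

def to_training_text(example):
    """Flatten one instruction-response pair into next-token training text."""
    return TEMPLATE.format(**example)

for example in instruction_set:
    print(to_training_text(example))
    print("=" * 40)
```

Publishing a set like this openly, as Villegas describes, lets any institution run the same fine-tuning step on its own infrastructure.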
These instructions, the data that allow these models to be instructed, are going to be subcontracted, and a million euros will be invested in building that instruction set from scratch. The data will also be published so that it is available to institutions and developers: if it has been paid for with public money, Villegas explains, it is logical that the data should also be public, something that does not usually happen with the AI models of private companies.

While instructing AI models provides guidance on how to respond and defines the context and purpose of those responses, alignment addresses problems such as avoiding discriminatory biases, preventing misinformation or protecting privacy. Precisely this lack of alignment means that using these models in this initial phase can produce responses with errors and biases, which are largely mitigated in the alignment phase.

ALIA and the competition: it is neither a rival of ChatGPT nor does it intend to be

In fact, Villegas highlights, "the objective is not to compete with ChatGPT; for that we would need 5 billion dollars." ALIA-40b "is a good model, and a chatbot can be built in the future, because the intention is to instruct and align it, but that will take time." Within the ALIA family we also have the Salamandra models (2b and 7b), smaller and more modest, but which already have first instructed versions. Their performance and capacity still have room for improvement, but they are good starting points for the future.

It was inevitable to ask how ALIA then intends to compete with other models, both the closed ones developed by private companies and open-source models.
For her, "there is a demand for intermediate models that each organization can then adapt to its specific use case; not everyone can use ChatGPT, for reasons such as privacy or the use case itself." Villegas also wanted to highlight how these smaller models can perform exceptionally well on specific tasks, while operating with strong guarantees of security and without sharing sensitive data. Not only that, she reveals: "we also took out the …"