This DMP is a direct adaptation of the Agence nationale de la recherche (ANR) DMP template (English version), available here.
This DMP is a work in progress that will be updated throughout the lifetime of the project.
The following datasets will be collected and used in LLM4ALL:
The training datasets C1 and C2 will be collected from public sources, especially from the Hugging Face Hub, following the standard procedures now widely adopted by the scientific community training open-source LLMs. The precise licenses will depend on the licenses of the source corpora, but the resulting collection will be as open as possible. The data sources will of course be documented, following the “data card” format typically used on the Hugging Face Hub.
The simulated emergency calls dataset C3 will be compiled by the AP-HP partner, and will be released publicly, which is possible because it is a simulated corpus without any private information.
The reports R1 produced by the project will be open-sourced whenever possible; they will be kept private only exceptionally, upon explicit request from a project partner.
The source code and model weights R2 will be released as open source by default, except upon explicit request from a partner.
The R3 data is designed for open-source release and will be released as such.
For the training datasets, the original source and version will be tracked.
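As an illustration, the sketch below shows one lightweight way such provenance could be recorded; the field names and helper functions are hypothetical, not part of the project's actual tooling. Each source corpus is pinned to a URL, a revision (e.g., a Hub commit hash or tag), and a file checksum.

```python
import hashlib


def sha256_of(path: str) -> str:
    """Compute a SHA-256 checksum to pin an exact file version."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def provenance_record(name: str, url: str, revision: str, path: str) -> dict:
    """Build a provenance entry for one source corpus (illustrative schema)."""
    return {
        "dataset": name,
        "source_url": url,
        "revision": revision,       # e.g., a Hub commit hash or release tag
        "sha256": sha256_of(path),  # checksum of the downloaded file
    }
```

Such records can be stored alongside each corpus (e.g., as JSON) so that any released dataset can be traced back to the exact version of its sources.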
For the model weights, the training scripts will also be released jointly to enable reproducibility.
We will not use any strict standard (such as TEI); rather, we will follow best practices in the LLM training community, which relies on lightweight formats: UTF-8 text, YAML, JSON.
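As an example of such a lightweight format, a minimal JSON metadata record for one corpus might look as follows; all field names and values are illustrative, not a prescribed schema:

```json
{
  "name": "C1-pretraining-corpus",
  "source": "https://example.org/source-corpus",
  "license": "CC-BY-4.0",
  "encoding": "UTF-8",
  "version": "1.0"
}
```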
The notion of data quality may refer to various criteria; we will focus on data sources that are already documented with all required authorizations for publication (also known as “white” data).
The quality of the training and finetuning data will be mainly tracked through perplexity-like measures, which have now become the de facto standard for LLM training.
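To make the criterion concrete, here is a minimal sketch of how a perplexity score is derived from per-token log-probabilities; the function is illustrative, and in practice the score is computed by the training framework over held-out data:

```python
import math


def perplexity(token_logprobs):
    """Perplexity is the exponential of the mean negative log-likelihood
    per token: lower values mean the model is less 'surprised' by the
    evaluated text."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# A model assigning probability 0.25 to every token scores ~4.0.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```

Lower perplexity on held-out data indicates that the model fits the data distribution better, which is why it serves as a practical proxy for data and training quality.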
During the research period, i.e., when the data is heavily modified, data will be stored on the local disks of the GPU grid to enable fast processing, and backed up weekly onto a network disk at LORIA to guard against data loss.
It is not planned to use any sensitive data at this stage.
All data used in the project will come either from existing datasets or from simulators. Consequently, no data will be collected directly from any unprotected data source, such as users. There is thus no need to gather informed consent.
It is planned to release all data with an open-source licence at this stage.
As we will reuse already existing datasets, it is possible that issues are detected at a later time with regard to one of these datasets. If this is the case, we will take all appropriate measures to remove the faulty content from our model, including, if necessary, retraining the model. Options other than retraining will also be studied if required, such as “unlearning” methods that continue training the model so that it forgets a specific piece of information, “corrupting” methods that override the undesired piece of information with large quantities of noisy input, or “editing” methods that directly edit or prune a subset of the weights to remove the faulty knowledge.
We do not plan to specifically handle the various types of bias that may affect the model, as, to the best of our knowledge as of March 2024, there is no widely adopted successful method to handle them in LLMs; however, we will always explicitly and visibly warn the user about these biases, risks and other model limitations.
We do not plan to allocate resources to study the carbon footprint of our methods, but we will contact the ANR project InExtenso, which is focused on these aspects.
The corpora, model weights and documentation will be uploaded to the Hugging Face Hub as soon as they are finalized, to gain visibility.
For long-term preservation, the corpora, model weights and documentation will be uploaded to the French CINES long-term preservation facilities (or their local instance, ORTOLANG). Only the final versions of the models, algorithms and documentation will be uploaded to CINES; intermediate versions will be uploaded to the Hugging Face Hub.
All the models will be trained in PyTorch. Documentation will be made available in Markdown. These are de facto standard formats in the LLM domain. The only specific required software is PyTorch, but numerous open-source scripts also exist to convert from PyTorch to other formats (ONNX, GGUF, TensorFlow…).
Through the CINES (or its local instance, ORTOLANG).
Each work package leader is responsible for the data management in her/his work package.
Every partner and work package will allocate an appropriate part of its resources to ensure that data will be FAIR. Concretely, every individual member of the project is deeply committed to open source and FAIRness: notably, Hugging Face is the de facto world-wide leader in open-source LLMs, and Linagora is the French leader in open-source software. Also note that in the open-source LLM community, best practices and de facto standards exist and are widely recognized and adopted by all LLM practitioners and users. These best practices include: