LLM4ALL DMP

Data Management Plan of the ANR LLM4ALL project

Version: 29th Match 2024

This DMP is a direct adaptation of the Agence nationale de la recherche (ANR): ANR - DMP template (english) available here.

This DMP is a work-in-progress that will be updated during the lifetime of the project

1. Data description and collection or re-use of existing data

How will new data be collected or produced and/or how will existing data be re-used?

The following datasets will be collected and used in LLM4ALL:

C1: Training corpus to pretrain and continue pretrain our foundation French LLMs
- This corpus will consist of an appropriate compilation of existing open-source datasets, and of pre-processing of these datasets to enable training our LLMs on them.
- The existing data sources will be detailed during the course of the project, but we plan to start from a subset of the following sources:
  - ROOTS (corpus used to train Bloom), RedPajama V2, Claire corpus
C2: Finetuning corpus for the LLM dedicated to meetings
- SUMM-RE
C3: Corpus of simulated SAMU calls for the healthcare LLM
R1: Internal and external reports, model documentations
R2: Source code, model weights
R3: publications, blog posts, podcasts, social media…

The training datasets C1 and C2 will be collected from public sources, especially from the Hugginface Hub, following the standard procedures that have been widely adopted nowadays by the scientific community who train open-source LLMs. The precise licenses will depend on the licenses of the source corpora, but the resulting collection will be as open as possible. The data sources will of course be documented, following the “data card” format typically used on the Huggingface hub.

The simulated emergency calls dataset C3 will be compiled by the AP-HP partner, and will be released publicly, which is possible because it is a simulated corpus without any private information.

The reports R1 produced by the project will be open-sourced whenever possible, and will be kept private exceptionally on explicit request from a project partner.

The source codes and model weights R2 will be released in open-source by default, except on explicit request from a partner.

The R3 data is designed to be released in open-source and thus will be.

2. Documentation and data quality

What metadata and documentation (for example the methodology of data collection and way of organising data) will accompany the data?

For the training datasets, the original source and version will be tracked.

For the model weights, the training scripts will also be released jointly to enable reproducibility.

We will not use any strict standard (such as TEI), as we will rather follow best practices in the LLM training community that only exploits lightweight formats: UTF-8 texts, YAML, JSON.

The notion of data quality may refer to various criteria; we will focus on data sources that are already documented with all required authorizations for publication (also known as “white” data).

What data quality control measures will be used?

The quality of the training and finetuning data will be mainly tracked through perplexity-like measures, which has now become the de facto standard for LLM training.

3. Storage and backup during the research process

How will data and metadata be stored and backed up during the research?

During the research period, i.e., when the data is heavily modified, data will be stored on the GPU grid local disks, to enable fast processing, and further copied regularly (every week) onto a saved network disk available at LORIA, as a guarantee to prevent potential loss of data.

How will data security and protection of sensitive data be taken care during the research

It is not planned to use any sensitive data at this stage.

4. Legal and ethical requirements, code of conduct

If personal data are processed, how will compliance with legislation on personal data and on security be ensured?

All data used in the project will come either from existing datasets, or from simulators. Consequently, no data will be directly collected from any un protected data source, users for instance. There is thus no need to gather informed consent.

How will other legal issues, such as intellectual property rights and ownership, be managed? What legislation is applicable?

It is planned to release all data with an open-source licence at this stage.

What ethical issues and codes of conduct are there, and how will they be taken into account?

As we will reuse already existing datasets, it may be possible that, at a later time, issues are detected with regard to one of these datasets. If this is the case, we will realize all appropriate measures to remove the faulty content from our model, including, if necessary, retraining the model. Other options than retraining will also be studied if required, such as “untraining” methods that basically continue training the model so that it forget a specific information, “corrupting” methods that override the undesired piece of information under large quantities of noisy input, or “editing” methods that directly edit or prune a subset of the weights to remove the faulty knowledge.

We do not plan to handle the various types of bias that will affect the model, as there is no widely adopted successfull method to handle them in LLM as far as we know as of March 2024, but we will always explicitly and visibly insist and warn the user about these bias, risks and other model’s limitations.

We do not plan to allocate resources to study the carbon footprint of our methods, but we will contact the ANR project InExtenso, which is focused on these aspects.

The corpora, model weights and documentations will be uploaded as soon as they are finalized in the Huggingface Hub to gain visibility.

How will data for preservation be selected, and where data will be preserved long-term (for example a data repository or archive)?

For long-term preservation, the corpora, model weights and documentations will be uploaded into the French CINES (or its local instance: ORTOLANG) long-term preservation facilities. The final version only of the models, algorithms and documentations will be uploaded into CINES. Intermediate versions will be uploaded into the Huggingface hub.

What methods or software tools are needed to access and use data?

All the models will be trained in pytorch. Documentations will be made available in markdown. These are de facto standard formats in the LLM domain. The only specific required software is pytorch, but there also exists numerous open-source scripts to convert from pytorch to other formats (ONNX, GGUF, tensorflow…)

How will the application of a unique and persistent identifier (such as a Digital Object Identifier (DOI)) to each data set be ensured?

Through the CINES (or its local instance: ORTOLANG).

6. Data management responsibilities and resources

Who (for example role, position, and institution) will be responsible for data management (i.e. the data steward)?

Each Work package leader is responsible for the data management in her/his work package.

What resources (for example financial and time) will be dedicated to data management and ensuring that data will be FAIR (Findable, Accessible, Interoperable, Re-usable)?

Every partner and workpackage will allocate the part of its resources that is suitable to ensure that data will be FAIR. Concretely, every individual member of the project is deeply concerned with open-source and FAIRness. Hence, Huggingface is the company that is the de facto world-wide leader on open-source LLM, and Linagora is the company that is the French leader on open-source software. Also note that in the open-source LLM community, best practices and de facto standards exist and are widely recognized and adopted by every LLM practitioners and users. These best practices include:

Findable: The Huggingface hub is the globally recognized worldwide central repository to host every open-source LLM as well as related documentations and datasets. Our assets will be uploaded there.
Accessible: Every open-source LLM and its related resources are easily accessible on the Huggingface hub; it seems very unlikely, given the success, current levels of funding and business model of the Huggingface company that this hub may close in the following years. However, to further guarantee accessibility, we will also upload the resources developed in the project onto the ORTOLANG/CINES repository, which is less visible than the Huggingface hub but is supported by the French government and offers long-term (10 years) guarantees of accessibility.
Interoperable: Following the current best practices widely adopted by the open-source LLMs community, we will release our LLMs in the pytorch safetensor format, given that open-source and well-known conversion scripts exist to convert this format into other formats that may be more adapted to various use cases and tools (such as ONNX for tensorflow, GGUF for llama.cpp…). The datasets will be released in the most basic interoperable formats (raw text and sounds), while the documentations will be released as raw texts and markdown.
Re-usable: The formats describe above enable to easily reuse the models, which licences will also be as open as possible and derived from the licence of the base model (e.g., openRAIL for Bloom). All documentations will be released under a CC BY licence and datasets CC BY-NC-SA.