Developer documentation for the GGA automated development tools

The gga_load_data Python command-line scripts aim to facilitate the deployment of GGA environments using Docker technology, as well as the loading of genomic data into these environments.

All scripts in the gga_load pipeline are written in Python 3 and can be run as standalone programs. Each script creates Python objects, making it possible to use their methods and attributes externally.

Input and configuration files

The gga_load_data scripts all use the same input file, in YAML format. This input file, which describes the organisms and their associated data, has to be provided to every script. You can find an example of this file here, along with explanations about the different describable items.

A configuration file is also required for some scripts (gga_init.py, gga_load_data.py and the gga_run_workflow*.py scripts). The configuration file contains information used for the creation and deployment of Docker containers, as well as sensitive data such as admin passwords. It also contains the information required to interact remotely with the Galaxy instances (user, password, email). A template config file is available here.
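
Both files are plain YAML, so the scripts can load them with a standard YAML parser. Below is a minimal, hypothetical sketch of that step: the input entry only uses fields documented on this page (genome_path, genome_version, ogs_version) plus illustrative genus/species/sex keys, and the config keys are placeholders, not the actual template's key names.

```python
import yaml

# Hypothetical input entry; see the example input file for the real schema.
example_input = """
- genus: Ectocarpus
  species: sp1
  sex: male
  genome_path: /files/genomes/example_genome.fasta
  genome_version: 1.0
  ogs_version: 1.0
"""

# Placeholder config keys; see the template config file for the real ones.
example_config = """
galaxy_admin_email: admin@example.org
galaxy_admin_password: password
"""

organisms = yaml.safe_load(example_input)
config = yaml.safe_load(example_config)
print(organisms[0]["genome_path"])  # /files/genomes/example_genome.fasta
```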

Constants

The pipeline uses constants defined in constants.py and constants_phaeo.py to handle pipeline-wide variables.

Defining all your constants in these files avoids redefining the same variables in different scripts, which could cause issues with typos and would require updating several files whenever a constant's value changes.
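
As an illustration, a constants module is simply a set of names defined once and imported everywhere; the names and values below are made up for the example, not the actual contents of constants.py.

```python
# constants.py (illustrative sketch)
ORG_PARAM_GENUS = "genus"
ORG_PARAM_SPECIES = "species"
GALAXY_LIBRARY_NAME = "Project Data"

# In any script of the pipeline, the constant is imported instead of
# re-typing the literal string:
#   from constants import ORG_PARAM_GENUS
#   genus = organism[ORG_PARAM_GENUS]
```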

Utilities

Two "utilities" Python files contain the various methods that are used in different scripts from the gga_load pipeline. The utilities methods are divided in two files to differentiate between methods using the Bioblend library and those that don't.

gga_init.py

The gga_init.py script is in charge of creating the directories for the organisms present in the input file, writing docker-compose files for these organisms, and deploying/updating the Docker stacks. It also creates the Traefik and Authelia directory and docker-compose file if required, or if explicitly requested with the --force-traefik option.

gga_init.py uses the information contained in the input and config files to generate the docker-compose files. Using Jinja2 templating, the script replaces template variables with the values specified in the input and config files. Templates exist for generating the docker-compose files of the organism stacks and the Traefik stack, and for the Galaxy NGINX proxy configuration. These Jinja2 templates are found in the templates directory of the repository.
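
A minimal sketch of the templating step is shown below, assuming an inline template; the actual templates live in the templates directory and the real variable names come from the input and config files.

```python
from jinja2 import Template

# Illustrative docker-compose fragment, not the actual template.
template = Template("""
services:
  chado:
    environment:
      - ORGANISM={{ genus }} {{ species }}
""")

print(template.render(genus="Ectocarpus", species="sp1"))
```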

To achieve this, gga_init.py will parse the YAML input file with a function from utilities.py to turn input organisms into Python dictionaries. The YAML config file is also turned into a Python dictionary.

The organisms dictionary parsed from the input file contains "unique" species (also in the form of dictionaries), i.e. organisms of the same species but with a different strain/sex are grouped together in a single dictionary. This makes it possible to create a single GGA environment containing all strains/sexes of a species. The Traefik directory and its Docker stack are created and deployed independently of the loop over the organism dictionaries.
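
A sketch of that grouping logic, under a simplified input structure (the real dictionaries contain many more fields):

```python
from collections import defaultdict

organisms = [
    {"genus": "Ectocarpus", "species": "sp1", "sex": "male"},
    {"genus": "Ectocarpus", "species": "sp1", "sex": "female"},
    {"genus": "Ectocarpus", "species": "sp2", "sex": "male"},
]

# Organisms sharing genus + species end up in the same group, so a single
# GGA environment hosts all their strains/sexes.
unique_species = defaultdict(list)
for org in organisms:
    unique_species[(org["genus"], org["species"])].append(org)

# -> two environments: ("Ectocarpus", "sp1") with 2 entries,
#    ("Ectocarpus", "sp2") with 1 entry
```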

gga_init.py also creates the mount points for the Docker container volumes before deploying. These volumes are detailed under the volumes section of each service in the docker-compose files.
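
Creating the mount points beforehand boils down to a few directory creations; the directory names below are illustrative, not the actual layout:

```python
import os

# Hypothetical mount points matching the volumes sections of a compose file.
for mount_point in ("./docker_data/galaxy", "./docker_data/postgres", "./src_data"):
    os.makedirs(mount_point, exist_ok=True)
```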

gga_get_data.py

gga_get_data.py takes the same input file as gga_init.py but doesn't require the config file, as it interacts with neither Docker nor the Galaxy instances.

This script is in charge of creating the src_data directory trees and copying the datasets present in the input file for every species into their respective src_data directory tree.

The ogs_version and genome_version values in the input file make it possible to differentiate between several versions of the genome or annotation files in the src_data directory tree.

For example, if a genome dataset is declared in the input file with genome_path: /files/genomes/example_genome.fasta and tagged with genome_version: 1.0, the file will be placed in ./src_data/genome/v1.0/example_genome.fasta
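
A sketch of how such a destination path can be built and the file copied, following the example above:

```python
import os
import shutil

genome_path = "/files/genomes/example_genome.fasta"  # from the input file
genome_version = "1.0"                               # from the input file

# ./src_data/genome/v1.0/
dest_dir = os.path.join("src_data", "genome", f"v{genome_version}")
os.makedirs(dest_dir, exist_ok=True)
shutil.copy(genome_path, os.path.join(dest_dir, os.path.basename(genome_path)))
```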

gga_load_data.py

gga_load_data.py takes care of loading the src_data directory tree as a library into the Galaxy instance of every input species. The script uses Bioblend (https://bioblend.readthedocs.io/en/latest/) to interact with the Galaxy instances. The script requires both the input file and the config file to be able to connect to the Galaxy instances.

First, the gga_load_data.py script will verify that the current Galaxy instance is running and available, using the method check_galaxy_state() from utilities_bioblend.py. The Galaxy instance for a species is accessible at the following address: http[s]://specified_domain_name:specified_http[s]_port/sp/genus_species/. Using the information contained in the config file, gga_load_data.py will try to connect to the Galaxy instance of every input species to create a Galaxy library, then create symlinks to the src_data folder architecture and files.
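
A minimal sketch of these Bioblend calls, with placeholder URL, credentials, library name and path (the real values come from the config file, and linking files from disk requires an admin account):

```python
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/",
                    email="admin@example.org", password="password")

library = gi.libraries.create_library(name="Project Data")  # placeholder name
# Symlink the file into the library instead of copying it into Galaxy:
gi.libraries.upload_from_galaxy_filesystem(
    library["id"],
    "/path/to/src_data/genome/v1.0/example_genome.fasta",
    link_data_only="link_to_files",
)
```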

Additionally, gga_load_data.py will run a Galaxy tool to remove the default organism "Homo sapiens" from the Chado database to obtain a clean Galaxy instance.
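
A sketch of that cleanup step; the tool id and inputs below are placeholders, not the exact identifiers used by the script:

```python
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/",
                    email="admin@example.org", password="password")

history = gi.histories.create_history(name="chado cleanup")
gi.tools.run_tool(
    history_id=history["id"],
    tool_id="chado_organism_delete_organisms",   # placeholder tool id
    tool_inputs={"organism": "Homo sapiens"},    # placeholder inputs
)
```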

gga_run_workflow*.py

The gga_run_workflow*.py scripts are, for the moment, highly specific to Phaeoexplorer data.

These scripts will run tools and workflows inside the Galaxy instances to load the data into the GMOD applications. gga_run_workflow*.py scripts make extensive use of the Bioblend library methods.

The script will first ensure that the Galaxy instances are running (just like gga_load_data.py). Then, it verifies that the tool versions installed in the Galaxy instances match the versions specified in the script (in constants_phaeo.py) and in the workflow. If the version and changeset of an installed tool don't match those in the script/workflow, the correct version is installed via a Bioblend method.
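
A sketch of the check-and-install logic, with placeholder repository details (the real name/owner/changeset values live in constants_phaeo.py):

```python
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/",
                    email="admin@example.org", password="password")

# Placeholder tool description, as it could appear in constants_phaeo.py.
expected = {"name": "chado_load_fasta", "owner": "gga",
            "version": "2.3.4", "changeset_revision": "abcdef123456"}

tool_id = (f"toolshed.g2.bx.psu.edu/repos/{expected['owner']}/{expected['name']}"
           f"/{expected['name']}/{expected['version']}")

# show_tool raises an error if the tool is absent; real code handles that too.
if gi.tools.show_tool(tool_id).get("version") != expected["version"]:
    gi.toolshed.install_repository_revision(
        tool_shed_url="https://toolshed.g2.bx.psu.edu/",
        name=expected["name"],
        owner=expected["owner"],
        changeset_revision=expected["changeset_revision"],
        install_tool_dependencies=True,
    )
```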

Once all tool versions are verified and installed, datasets from the instance's Project library (the Galaxy library mirroring the local src_data directory tree) are imported into a dedicated Galaxy history. This way, the datasets can more easily be used as inputs for Galaxy tools and workflows.
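
A sketch of that import step, assuming a library named "Project Data" (placeholder) holding the src_data mirror:

```python
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/",
                    email="admin@example.org", password="password")

history = gi.histories.create_history(name="run_workflow")
library = gi.libraries.get_libraries(name="Project Data")[0]  # placeholder name

# Copy every file of the library into the new history.
for item in gi.libraries.show_library(library["id"], contents=True):
    if item["type"] == "file":
        gi.histories.upload_dataset_from_library(history["id"], item["id"])
```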

After the required datasets are imported into the history, Galaxy tools are run to add the organisms and analyses into the Chado database.

Once this is done, the specified template workflow is imported into Galaxy. These templates can be found in the templates directory. The workflow parameters (most of which are set at runtime in the template workflows) are then populated using the imported datasets and the organisms/analyses. The workflow is then invoked in the species' Galaxy history.

A constants file (e.g. constants_phaeo.py) defines the Galaxy workflow .ga file and its steps, as well as the other Galaxy tools (name, version, id, changeset) used in the gga_run_workflow_phaeo*.py script.

To invoke a workflow using Bioblend, both the input files (datasets) and the parameters have to be provided separately, as two dictionaries. In the gga_run_workflow*.py scripts, these two dictionaries are respectively named datamap and workflow_parameters.
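
A sketch of an invocation with these two dictionaries; the workflow path, step indices, dataset ids and parameter names are placeholders:

```python
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/",
                    email="admin@example.org", password="password")

workflow = gi.workflows.import_workflow_from_local_path("templates/example.ga")

# Input datasets, keyed by workflow input step index.
datamap = {
    "0": {"src": "hda", "id": "some_dataset_id"},  # placeholder dataset id
}
# Runtime tool parameters, keyed by workflow step.
workflow_parameters = {
    "1": {"organism_id": "2"},  # placeholder parameter
}

gi.workflows.invoke_workflow(
    workflow["id"],
    inputs=datamap,
    params=workflow_parameters,
    history_id="target_history_id",  # placeholder history id
)
```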