Developer documentation for the GGA automated deployment tools
The gga_load_data Python command-line scripts aim to facilitate the deployment of GGA environments using Docker technology as well as the loading of genomic data into said environments.
All scripts in the gga_load pipeline are written in Python3 and can be executed as standalone scripts. Each script creates Python objects, making it possible to use their methods and attributes externally.
Input and configuration files
The gga_load_data scripts all use the same input file, in YAML format. This input file, describing the organism and associated data, has to be provided to every script. You can find an example of this file here, along with explanations about the different describable items.
A configuration file is also required for some scripts (gga_init.py, gga_load_data.py and gga_run_workflow scripts). The configuration file contains information used for Docker container creation and deployment, as well as sensitive data such as admin passwords. It also contains the information required to interact remotely with the Galaxy instances (user, password, email).
A template config file is available here.
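As a minimal sketch of how both files are typically consumed, the snippet below loads them with PyYAML; the file names and keys shown are placeholders, not the exact schema expected by the pipeline.

```python
import yaml

# Placeholder file names; the real input and config files are passed on the command line
with open("organisms.yml") as input_handle:
    organisms = yaml.safe_load(input_handle)   # organisms and their associated datasets

with open("config.yml") as config_handle:
    config = yaml.safe_load(config_handle)     # deployment settings and credentials

# Values from both files are then available to every script, e.g. (illustrative keys):
# organisms[0]["genus"], organisms[0]["species"], config["hostname"], config["admin_password"]
```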
Constants
The pipeline uses constants defined in constants.py and constants_phaeo.py to handle pipeline-wide variables.
Defining all your constants in these files prevents having to redefine the same variables in different scripts, which could potentially cause issues with typos and require updating several files if a constant value is changed.
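As a purely illustrative example (the constant names below are not necessarily the ones defined in the repository):

```python
# constants.py (illustrative content only)
ORG_PARAM_GENUS = "genus"
ORG_PARAM_SPECIES = "species"
GALAXY_LIBRARY_NAME = "Project library"

# Any script can then simply do:
# from constants import ORG_PARAM_GENUS, GALAXY_LIBRARY_NAME
```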
Utilities
Two "utilities" Python files contain the various methods that are used in different scripts from the gga_load pipeline. The utilities methods are divided in two files to differentiate between methods using the Bioblend library and those that don't.
gga_init.py
The gga_init.py script is in charge of creating the directories for the organisms present in the input file, writing docker-compose files for these organisms, and deploying/updating the docker stacks. It also creates the Traefik and Authelia directory and docker-compose file if required, or if explicitly specified using the option --force-traefik when calling the script.
gga_init.py uses the information contained in the input and config files to generate the docker-compose files. Using jinja2 templating, the script replaces the template variables with the values specified in the input and config files.
Templates exist for generating the docker-compose files of organism stacks and of the Traefik stack, as well as the Galaxy NGINX proxy configuration. These jinja2 templates are found in the "templates" directory of the repository.
To achieve this, gga_init.py parses the input YAML file with a function from utilities.py to turn the input organisms into Python dictionaries; the YAML config file is likewise turned into a Python dictionary.
To understand jinja2 templating, please read their docs: https://jinja2docs.readthedocs.io/en/stable/
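A minimal sketch of this rendering step, assuming a placeholder template name and placeholder variable names passed to render():

```python
import yaml
from jinja2 import Environment, FileSystemLoader

# Parsed input and config dictionaries (file names are placeholders)
organism = yaml.safe_load(open("organisms.yml"))[0]
config = yaml.safe_load(open("config.yml"))

env = Environment(loader=FileSystemLoader("templates"))
# The template file name below stands in for one of the repository's jinja2 templates
template = env.get_template("organism_compose_template.yml.j2")

with open("docker-compose.yml", "w") as compose_handle:
    compose_handle.write(template.render(organism=organism, config=config))
```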
The organisms dictionary parsed from the input file contains "unique" species (also in the form of dictionaries), i.e. organisms of the same species but with a different strain/sex are grouped together in a single dictionary. This makes it possible to create a single GGA environment containing all strains/sexes of a species. The Traefik directory and its docker stack are created and deployed independently of the loop over the organisms dictionaries.
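A sketch of how such a grouping could look, assuming each organism dictionary carries "genus" and "species" keys (key names are illustrative):

```python
from collections import defaultdict

def group_by_species(organisms):
    """Group organism dictionaries so all strains/sexes of a species share one GGA environment."""
    grouped = defaultdict(list)
    for org in organisms:
        # "genus" and "species" are illustrative key names
        grouped["{}_{}".format(org["genus"].lower(), org["species"].lower())].append(org)
    return grouped
```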
gga_init.py will also create the mount points for the docker container volumes before deploying. These volumes are detailed under the "volumes" section of services in the docker-compose files.
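A minimal sketch of the mount point creation (the directory names are placeholders for the paths listed under "volumes" in the generated docker-compose files):

```python
import os

# Placeholder mount points; the real paths come from the generated docker-compose files
for mount_point in ("./docker_data/galaxy", "./docker_data/tripal", "./src_data"):
    os.makedirs(mount_point, exist_ok=True)
```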
gga_get_data.py
gga_get_data.py takes the same input file as gga_init.py but doesn't require the config file, as it doesn't interact with Docker or the Galaxy instances.
This script is in charge of creating the "src_data" directory trees and copying the datasets present in the input file for every species into their respective "src_data" directory tree.
The "ogs_version" and "genome_version" values in the input files make it possible to differentiate between several versions of the genome or annotation files in the "src_data" directory tree.
For example, if in the input file a genome dataset located at genome_path: /files/genomes/example_genome.fasta is tagged with genome_version: 1.0, it will be copied to ./src_data/genome/v1.0/example_genome.fasta.
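A sketch of that copy step, assuming "genome_path" and "genome_version" keys in the parsed organism dictionary (key names are illustrative):

```python
import os
import shutil

def copy_genome(organism, src_data_dir="./src_data"):
    """Copy the genome dataset into a version-specific src_data subdirectory."""
    genome_path = organism["genome_path"]          # e.g. /files/genomes/example_genome.fasta
    version_dir = os.path.join(src_data_dir, "genome", "v{}".format(organism["genome_version"]))
    os.makedirs(version_dir, exist_ok=True)
    shutil.copy(genome_path, os.path.join(version_dir, os.path.basename(genome_path)))
```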
gga_load_data.py
gga_load_data.py takes care of loading the "src_data" directory tree as a library into the Galaxy instance of every input species. The script uses Bioblend (https://bioblend.readthedocs.io/en/latest/) to interact with Galaxy instances. The script requires both the input file and the config file to be able to connect to the Galaxy instances.
First, the gga_load_data.py script verifies that the corresponding Galaxy instance is running and available, using the method check_galaxy_state() from utilities_bioblend.py.
The Galaxy instance for a species is accessible at the following address:
https://specified_domain_name:specified_https_port/sp/genus_species/
or
http://specified_domain_name:specified_http_port/sp/genus_species/
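The snippet below is only an assumption about what such a check can look like (polling the instance URL until it answers), not the actual implementation of check_galaxy_state():

```python
import time
import requests

def wait_for_galaxy(url, retries=10, delay=30):
    """Poll a Galaxy URL until it responds with HTTP 200, or give up after `retries` attempts."""
    for _ in range(retries):
        try:
            if requests.get(url, timeout=10).status_code == 200:
                return True
        except requests.exceptions.RequestException:
            pass
        time.sleep(delay)
    return False
```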
Using the information contained in the config file, gga_load_data.py
will try
to connect to the Galaxy instance of every input species to create a Galaxy library, then
create an exact copy of the "src_data" folder architecture and files.
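A minimal Bioblend sketch of this library creation step; the connection details, library name and file paths are placeholders, and the actual script mirrors the whole "src_data" tree rather than a single file:

```python
from bioblend.galaxy import GalaxyInstance

# Placeholder connection details (taken from the config file in the real script)
gi = GalaxyInstance(url="https://example.org/sp/genus_species/galaxy/",
                    email="admin@example.org", password="password")

library = gi.libraries.create_library("Project library", description="src_data mirror")
genome_folder = gi.libraries.create_folder(library["id"], "genome")[0]
gi.libraries.upload_file_from_local_path(
    library["id"],
    "./src_data/genome/v1.0/example_genome.fasta",
    folder_id=genome_folder["id"],
    file_type="fasta")
```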
Additionally, gga_load_data.py will run a Galaxy tool to remove the organism "Homo sapiens" from the Chado database to obtain a clean Galaxy instance.
gga_run_workflow
"gga_run_workflow" scripts are for the moment highly specific to Phaeoexplorer data. These scripts will run tools and workflow inside the Galaxy instances to load the data into the GMOD applications. "gga_run_workflow" scripts make extensive use of the Bioblend library methods.
The script will first ensure that the Galaxy instances are running (just like gga_load_data.py). Then, it verifies that tool versions installed in the Galaxy instances match the ones specified in the script and the workflow. If the version and changeset of an installed tool doesn't match the one in the script/workflow, the correct version will be installed via a Bioblend method.
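A sketch of this verification/installation step using Bioblend's tool and toolshed clients; the tool identifier, owner and changeset below are placeholders:

```python
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/galaxy/",
                    email="admin@example.org", password="password")  # placeholders

# Placeholder description of one expected tool
expected = {
    "id": "toolshed.g2.bx.psu.edu/repos/owner/example_tool/example_tool/1.0",
    "name": "example_tool",
    "owner": "owner",
    "changeset": "0123456789ab",
}

installed_ids = {tool["id"] for tool in gi.tools.get_tools()}
if expected["id"] not in installed_ids:
    gi.toolshed.install_repository_revision(
        tool_shed_url="https://toolshed.g2.bx.psu.edu",
        name=expected["name"],
        owner=expected["owner"],
        changeset_revision=expected["changeset"],
        install_tool_dependencies=True,
        install_repository_dependencies=False,
        install_resolver_dependencies=True)
```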
Once all tool versions are verified and installed, datasets from the instance's "Project library" (the Galaxy library mirroring the local "src_data" directory tree) are imported into a dedicated Galaxy history. This way, the datasets can more easily be used as inputs for Galaxy tools and workflows.
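A sketch of that import, assuming a library like the one created earlier; the history and library names are placeholders:

```python
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/galaxy/",
                    email="admin@example.org", password="password")  # placeholders

history = gi.histories.create_history(name="genus_species_annotation")
library = gi.libraries.get_libraries(name="Project library")[0]

# Import every file of the library into the dedicated history
for item in gi.libraries.show_library(library["id"], contents=True):
    if item["type"] == "file":
        gi.histories.upload_dataset_from_library(history["id"], item["id"])
```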
After the required datasets are imported into the history, Galaxy tools are run to add the organisms and analyses into the Chado database.
Once this is done, the specified template workflow is imported into Galaxy. These template workflows can be found in the "templates" directory. The workflow parameters (most are set at runtime in the template workflows) are then populated using the imported datasets and the organisms/analyses. The workflow is then invoked in the species' Galaxy history.
To be able to populate the workflow parameters and call the workflow remotely, it is imperative that the workflow steps are ordered correctly. To achieve this, the steps are defined as constants in constants_phaeo. To invoke a workflow using Bioblend, it is also required to provide both the input files (datasets) and the parameters separately, as two dictionaries. In the gga_run_workflow scripts, these two are respectively named "datamap" and "workflow_parameters".
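A minimal sketch of such an invocation; the workflow file, step numbers, parameter names and dataset ids below are placeholders standing in for the constants defined in constants_phaeo and the datasets imported into the history:

```python
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/galaxy/",
                    email="admin@example.org", password="password")  # placeholders

workflow = gi.workflows.import_workflow_from_local_path("templates/example_workflow.ga")

# "datamap": workflow input steps -> datasets already present in the history
datamap = {
    "0": {"src": "hda", "id": "placeholder_genome_dataset_id"},
    "1": {"src": "hda", "id": "placeholder_annotation_dataset_id"},
}
# "workflow_parameters": runtime parameters of individual steps
workflow_parameters = {
    "2": {"organism": "2", "analysis_id": "3"},
}

gi.workflows.invoke_workflow(
    workflow["id"],
    inputs=datamap,
    params=workflow_parameters,
    history_id="placeholder_history_id",
    allow_tool_state_corrections=True)
```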