# Developer documentation for the GGA automated development tools

The gga_load_data Python command line scripts aim to facilitate the deployment of GGA environments using Docker technology, as well as the loading of genomic data into these environments. All scripts in the gga_load pipeline are written in Python3 and can be run standalone. Each script also creates Python objects, making it possible to use their methods and attributes externally.

## Input and configuration files

The gga_load_data scripts all use the same input file, in YAML format. This input file, describing the organism and its associated data, has to be provided to every script. You can find an example of this file [here](/examples/citrus_sinensis.yml), along with explanations about the different describable items.

A configuration file is also required for some scripts ([gga_init.py](/gga_init.py), [gga_load_data.py](/gga_load_data.py) and the `gga_run_workflow*.py` scripts). The configuration file contains information used for docker container creation and deployment, as well as sensitive data such as admin passwords. It also contains the information required to interact remotely with the Galaxy instances (user, password, email). A template config file is available [here](/examples/config.yml).

## Constants

The pipeline uses constants defined in [constants.py](/constants.py) and [constants_phaeo.py](/constants_phaeo.py) to handle pipeline-wide variables. Defining all your constants in these files prevents having to redefine the same variables in different scripts, which would be error-prone (typos) and would require updating several files whenever a constant value changes.

## Utilities

Two "utilities" Python files contain the various methods that are used across the scripts of the gga_load pipeline. The utility methods are divided into two files to differentiate between methods that use the Bioblend library and those that don't:

* [utilities.py](/utilities.py)
* [utilities_bioblend.py](/utilities_bioblend.py)

## `gga_init.py`

The [gga_init.py](/gga_init.py) script is in charge of creating the directories for the organisms present in the input file, writing docker-compose files for these organisms, and deploying/updating the docker stacks. It also creates the Traefik and Authelia directory and docker-compose file if required, or when explicitly requested with the `--force-traefik` option.

[gga_init.py](/gga_init.py) uses the information contained in the input and config files to generate the docker-compose files. Using [Jinja2 templating](https://jinja2docs.readthedocs.io), the script replaces the template variables with the values specified in the input and config files. Templates exist for the organism stack docker-compose files, the Traefik stack, and the Galaxy NGINX proxy configuration; these Jinja2 templates are found in the `templates` directory of the repository.

To achieve this, [gga_init.py](/gga_init.py) parses the YAML input file with a function from [utilities.py](/utilities.py) to turn the input organisms into Python dictionaries. The YAML config file is also turned into a Python dictionary. The `organisms` dictionary parsed from the input file contains "unique" species (also in the form of dictionaries), i.e. organisms of the same species but with a different strain/sex are grouped together in a single dictionary. This makes it possible to create a single GGA environment containing all strains/sexes of a species, as sketched below.
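Below is a minimal sketch of this parse-and-group logic followed by the Jinja2 rendering step. The function names, the flat organism structure (direct `genus`/`species` keys), and the template file name are illustrative assumptions, not the actual `gga_init.py` API; the real input file may nest these keys differently.

```python
# Illustrative sketch only: function names, the flat organism structure and the
# template name are assumptions, not the actual gga_init.py API.
import yaml
from jinja2 import Environment, FileSystemLoader

def group_organisms(input_file):
    """Group input organisms by genus_species, so that one GGA environment
    gathers all strains/sexes of a species."""
    with open(input_file) as f:
        organisms = yaml.safe_load(f)
    unique_species = {}
    for org in organisms.values():
        key = "{}_{}".format(org["genus"].lower(), org["species"].lower())
        unique_species.setdefault(key, []).append(org)
    return unique_species

def render_compose(template_name, variables):
    """Render a docker-compose file by substituting the template variables
    with values taken from the input and config files."""
    env = Environment(loader=FileSystemLoader("templates"))
    return env.get_template(template_name).render(variables)

if __name__ == "__main__":
    for gspecies, strains in group_organisms("examples/citrus_sinensis.yml").items():
        # "gspecies_compose.yml.j2" is a hypothetical template name.
        print(render_compose("gspecies_compose.yml.j2",
                             {"gspecies": gspecies, "strains": strains}))
```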
The `traefik` directory and its docker stack are created and deployed independently of the loop over the organism dictionaries. [gga_init.py](/gga_init.py) will also create the mount points for the docker container volumes before deploying. These volumes are detailed under the `volumes` section of `services` in the docker-compose files.

## `gga_get_data.py`

[gga_get_data.py](/gga_get_data.py) takes the same input file as [gga_init.py](/gga_init.py), but doesn't require the config file, as it doesn't interact with Docker or the Galaxy instances. This script is in charge of creating the `src_data` directory trees and copying the datasets present in the input file for every species into their respective `src_data` directory tree.

The `ogs_version` and `genome_version` values in the input files make it possible to differentiate between several versions of the genome or annotation files in the `src_data` directory tree. For example, if a genome dataset is declared with `genome_path: /files/genomes/example_genome.fasta` and tagged with `genome_version: 1.0`, the file will be copied to `./src_data/genome/v1.0/example_genome.fasta`.

## `gga_load_data.py`

[gga_load_data.py](/gga_load_data.py) takes care of loading the `src_data` directory tree as a library in the Galaxy instance of every input species. The script uses [Bioblend](https://bioblend.readthedocs.io/en/latest/) to interact with the Galaxy instances, and requires both the input file and the config file to be able to connect to them.

First, [gga_load_data.py](/gga_load_data.py) verifies that the current Galaxy instance is running and available, using the `check_galaxy_state()` method from [utilities_bioblend.py](/utilities_bioblend.py). The Galaxy instance for a species is accessible at the following address: *http[s]://specified_domain_name:specified_http[s]_port/sp/genus_species/*.

Using the information contained in the config file, [gga_load_data.py](/gga_load_data.py) will try to connect to the Galaxy instance of every input species to create a Galaxy library, then create symlinks to the `src_data` folder architecture and files. Additionally, [gga_load_data.py](/gga_load_data.py) will run a Galaxy tool to remove the default organism "Homo sapiens" from the Chado database, in order to obtain a clean Galaxy instance.

## `gga_run_workflow*.py`

The `gga_run_workflow*.py` scripts are for the moment highly specific to Phaeoexplorer data. These scripts run tools and workflows inside the Galaxy instances to load the data into the GMOD applications, and make extensive use of the Bioblend library methods.

Each script will first ensure that the Galaxy instances are running (just like [gga_load_data.py](/gga_load_data.py)). Then, it verifies that the tool versions installed in the Galaxy instances match the versions specified in the script (in [constants_phaeo.py](/constants_phaeo.py)) and in the workflow. If the version and changeset of an installed tool don't match the ones in the script/workflow, the correct version is installed via a Bioblend method.

Once all tool versions are verified and installed, datasets from the instance's `Project library` (the Galaxy library mirroring the local `src_data` directory tree) are imported into a dedicated Galaxy history. This way, the datasets can more easily be used as inputs for Galaxy tools and workflows. After the required datasets are imported into the history, Galaxy tools are run to add the organisms and analyses into the Chado database, as sketched below.
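The following is a minimal Bioblend sketch of these steps. The URL, credentials, expected version, tool id and tool inputs are placeholder assumptions: in the real pipeline they come from the config file and [constants_phaeo.py](/constants_phaeo.py), and the calls are wrapped by methods in [utilities_bioblend.py](/utilities_bioblend.py).

```python
# Minimal Bioblend sketch; not the actual script code. URL, credentials,
# expected version, tool id and tool inputs are placeholder assumptions.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/galaxy/",
                    email="gga@example.org", password="password")

# Verify that the installed tool version matches the expected one
# (hypothetical tool id and version).
tool_id = "toolshed.g2.bx.psu.edu/repos/gga/chado_organism_add_organism/organism_add_organism/2.3.4+galaxy0"
tool = gi.tools.show_tool(tool_id)
if tool["version"] != "2.3.4+galaxy0":
    raise RuntimeError("Unexpected tool version: %s" % tool["version"])

# Import every dataset of the "Project library" into a dedicated history.
history = gi.histories.create_history(name="gga_load_history")
library = gi.libraries.get_libraries(name="Project library")[0]
for item in gi.libraries.show_library(library["id"], contents=True):
    if item["type"] == "file":
        gi.histories.upload_dataset_from_library(history["id"], item["id"])

# Run a Chado tool to add an organism into the database
# (hypothetical tool inputs).
gi.tools.run_tool(history_id=history["id"], tool_id=tool_id,
                  tool_inputs={"genus": "Citrus", "species": "sinensis",
                               "common": "Sweet orange", "abbr": "C_sinensis"})
```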
Once this is done, the specified template workflow is imported into Galaxy. These templates can be found in the [templates](/templates) directory. The workflow parameters (most of which are set at runtime in the template workflows) are then populated using the imported datasets and the organisms/analyses. The workflow is then invoked in the species' Galaxy history.

A constants file (e.g. [constants_phaeo.py](/constants_phaeo.py)) defines the Galaxy workflow `.ga` file with its steps, as well as the other Galaxy tools (name, version, id, changeset) used in the `gga_run_workflow_phaeo*.py` scripts.

Invoking a workflow with Bioblend requires providing both the input files (datasets) and the parameters, separately, as two dictionaries. In the `gga_run_workflow*.py` scripts, these two dictionaries are respectively named `datamap` and `workflow_parameters`.
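A minimal sketch of such an invocation is shown below. The workflow path, dataset ids, step indices and parameter names are illustrative assumptions; in the real scripts, `datamap` and `workflow_parameters` are built from the datasets imported into the history and from the organisms/analyses added to Chado.

```python
# Illustrative sketch of a Bioblend workflow invocation; ids, paths, step
# indices and parameter names are placeholder assumptions.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://example.org/sp/genus_species/galaxy/",
                    email="gga@example.org", password="password")

# Import the .ga template workflow into the instance.
workflow = gi.workflows.import_workflow_from_local_path("templates/example_workflow.ga")

# Input datasets, keyed by workflow step index: {"step": {"id": ..., "src": "hda"}}.
datamap = {
    "0": {"id": "genome_dataset_id", "src": "hda"},
    "1": {"id": "annotation_dataset_id", "src": "hda"},
}

# Runtime parameters, keyed by step index: {"step": {"param_name": value}}.
workflow_parameters = {
    "2": {"organism_id": 2, "analysis_id": 1},
}

gi.workflows.invoke_workflow(workflow["id"],
                             inputs=datamap,
                             params=workflow_parameters,
                             history_id="species_history_id",
                             inputs_by="step_index")
```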