# Developer documentation for the GGA automated development tools

The gga_load_data Python command-line scripts aim to facilitate
the deployment of GGA environments using Docker, as well
as the loading of genomic data into these environments.

All scripts in the gga_load pipeline are written in Python 3 and are executable
as standalone programs. Each script also creates Python objects, making it possible to
use their methods and attributes from other code.

## Input and configuration files

The gga_load_data scripts all use the same input file, in YAML format.
This input file, describing the organism and its associated data, has to
be provided to every script. You can find an example of this file [here](/examples/citrus_sinensis.yml),
along with explanations of the different items it can describe.

A configuration file is also required for some scripts
([gga_init.py](/gga_init.py), [gga_load_data.py](/gga_load_data.py) and the `gga_run_workflow*.py`
scripts).
The configuration file contains information used for Docker container
creation and deployment, sensitive data such as admin passwords, and
the information required to interact remotely with the Galaxy instances
(user, password, email).
A template config file is available [here](/examples/config.yml).
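
Both files are plain YAML and are read into Python dictionaries. A minimal sketch of
that loading step, assuming PyYAML (the example files linked above define the actual keys):

```python
import yaml

# Parse the organism input file and the config file into plain
# Python dictionaries; every script receives the input file, and
# some additionally receive the config file.
with open("examples/citrus_sinensis.yml") as f:
    input_data = yaml.safe_load(f)
with open("examples/config.yml") as f:
    config = yaml.safe_load(f)
```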

## Constants

The pipeline uses constants defined in [constants.py](/constants.py)
and [constants_phaeo.py](/constants_phaeo.py) to handle 
pipeline-wide variables. 

Defining all constants in these files avoids redefining the same
variables in different scripts, which is prone to typos and would
require updating several files whenever a constant's value changes.
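
The pattern is plain module-level names shared across scripts; the constant below is a
hypothetical example rather than one actually defined in the repository:

```python
# constants.py -- hypothetical entry
SRC_DATA_DIR_NAME = "src_data"
```

```python
# any pipeline script -- imports the single shared definition
import constants

tree_root = "./" + constants.SRC_DATA_DIR_NAME
```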

## Utilities

Two "utilities" Python files contain the various methods that are used in different scripts from
the gga_load pipeline. The utilities methods are divided in two files to differentiate between
methods using the Bioblend library and those that don't.

* [utilities.py](/utilities.py)
* [utilities_bioblend.py](/utilities_bioblend.py)
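
A hypothetical usage sketch of the split; `parse_input()` is an illustrative name,
while `check_galaxy_state()` is the method mentioned later in this document (its
exact signature is assumed here):

```python
import utilities            # helpers with no Bioblend dependency
import utilities_bioblend   # helpers built on the Bioblend library

# Parsing the YAML input file doesn't touch Galaxy...
organisms = utilities.parse_input("examples/citrus_sinensis.yml")

# ...while checking that an instance is up goes through Bioblend.
utilities_bioblend.check_galaxy_state("http://localhost/sp/citrus_sinensis/")
```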

## `gga_init.py`

The [gga_init.py](/gga_init.py) script is in charge of creating the directories for
the organisms present in the input file,
writing the docker-compose files for these organisms,
and deploying/updating the Docker stacks.
It also creates the Traefik and Authelia directory and docker-compose file
if required, or if explicitly requested with the `--force-traefik` option
when calling the script.

[gga_init.py](/gga_init.py) uses the information contained in the input and config files
to generate the docker-compose files. Using [Jinja2 templating](https://jinja2docs.readthedocs.io), the script
replaces template variables with the values specified in the input and config files.
Templates exist for the organism stack docker-compose files, for the Traefik stack,
and for the Galaxy NGINX proxy configuration.
These Jinja2 templates are found in the `templates` directory of the repository.
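
A minimal sketch of this rendering step; the template filename and variable names
are illustrative, not the ones actually used by the script:

```python
import yaml
from jinja2 import Environment, FileSystemLoader

config = yaml.safe_load(open("examples/config.yml"))

env = Environment(loader=FileSystemLoader("templates"))
# Hypothetical template filename; the real templates live in /templates.
template = env.get_template("gspecies_compose.yml.j2")

# Template variables come from the parsed input and config files
# (the "hostname" key is an assumption).
rendered = template.render(genus="Citrus", species="sinensis",
                           hostname=config.get("hostname"))

with open("docker-compose.yml", "w") as f:
    f.write(rendered)
```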

To achieve this, [gga_init.py](/gga_init.py) will parse the YAML input file with a function
from [utilities.py](/utilities.py) to turn input organisms into Python dictionaries.
The YAML config file is also turned into a Python dictionary.

The `organisms` dictionary parsed from the input file contains "unique" species (also in the form of dictionaries):
organisms of the same species
but with a different strain/sex are grouped together in a single dictionary.
This makes it possible to
create a single GGA environment containing all strains/sexes of a species.
The `traefik` directory and its Docker stack are created and deployed
independently of the loop over
the organism dictionaries.
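
A sketch of that grouping logic; the organism dictionary keys are illustrative:

```python
from collections import defaultdict

# Two input organisms of the same species but different sex.
organisms_list = [
    {"genus": "Citrus", "species": "sinensis", "strain": "", "sex": "male"},
    {"genus": "Citrus", "species": "sinensis", "strain": "", "sex": "female"},
]

# Group all strains/sexes of a species under one key, so that a single
# GGA environment is deployed per species.
unique_species = defaultdict(list)
for org in organisms_list:
    unique_species[(org["genus"], org["species"])].append(org)
```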

[gga_init.py](/gga_init.py) will also create the mount points for the Docker container volumes before deploying.
These volumes are detailed under the `volumes` section of each service in the docker-compose files.
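
A possible sketch of that step, reading a rendered compose file back and pre-creating
host-side bind-mount directories (named volumes and the long mount syntax are ignored
here for brevity):

```python
import os
import yaml

compose = yaml.safe_load(open("docker-compose.yml"))

for service in compose.get("services", {}).values():
    for volume in service.get("volumes", []):
        # Short syntax "host_path:container_path[:mode]" -- only
        # host-side bind mounts need a directory created beforehand.
        host_path = str(volume).split(":")[0]
        if host_path.startswith(("/", "./")):
            os.makedirs(host_path, exist_ok=True)
```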

## `gga_get_data.py`

[gga_get_data.py](/gga_get_data.py) takes the same input file as [gga_init.py](/gga_init.py) but doesn't require
the config file, as it doesn't interact with Docker or the Galaxy instances.

This script is in charge of creating the `src_data` directory trees and
copying the datasets present in the input file for every species into
their respective `src_data` directory tree. 

The `ogs_version` and `genome_version` values in the input files make it possible 
to differentiate 
between several versions of the genome or annotation files in the `src_data`
directory tree.

For example, if a genome dataset is declared with `genome_path: /files/genomes/example_genome.fasta`
and tagged with `genome_version: 1.0`, the file will be copied to `./src_data/genome/v1.0/example_genome.fasta`.
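
A minimal sketch of that copy step, using the `genome_path` and `genome_version`
keys (the `src_data` layout is simplified here):

```python
import os
import shutil

# Values as they would appear in the parsed input file.
organism = {"genome_path": "/files/genomes/example_genome.fasta",
            "genome_version": "1.0"}

# Build the versioned destination directory, then copy the dataset into it.
dest_dir = os.path.join("src_data", "genome", f"v{organism['genome_version']}")
os.makedirs(dest_dir, exist_ok=True)
shutil.copy(organism["genome_path"], dest_dir)
```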

## `gga_load_data.py`

[gga_load_data.py](/gga_load_data.py) takes care of loading the `src_data` directory tree as a
library in the Galaxy instance of every input species. The script uses
[Bioblend](https://bioblend.readthedocs.io/en/latest/) to interact with the Galaxy instances,
and requires both the input file and the config file to be able to connect to them.

First, the [gga_load_data.py](/gga_load_data.py) script verifies that the Galaxy
instance is running and available, using the `check_galaxy_state()` method
from [utilities_bioblend.py](/utilities_bioblend.py).
The Galaxy instance for a species is accessible at the following address:
*http[s]://specified_domain_name:specified_http[s]_port/sp/genus_species/*.
Using the information contained in the config file, [gga_load_data.py](/gga_load_data.py) then
connects to the Galaxy instance of every input species to create a Galaxy library, and
populates it with symlinks to the `src_data` folder architecture and files.
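
Sketched with Bioblend below; the URL, credentials, library name and file path are
placeholders, and `link_data_only="link_to_files"` is what makes the library reference
files in place rather than copy them:

```python
from bioblend import galaxy

# Credentials come from the config file; the URL follows the address
# scheme described above.
gi = galaxy.GalaxyInstance(url="http://localhost/sp/citrus_sinensis/",
                           email="admin@example.org", password="password")

library = gi.libraries.create_library(name="Project Data",
                                      description="src_data tree")

# Link (rather than copy) a file already present on the Galaxy
# server's filesystem into the library.
gi.libraries.upload_from_galaxy_filesystem(
    library_id=library["id"],
    filesystem_paths="/project_data/genome/v1.0/example_genome.fasta",
    link_data_only="link_to_files")
```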

Additionally, [gga_load_data.py](/gga_load_data.py) will run a Galaxy tool to 
remove the default organism "Homo sapiens" from the Chado database to obtain a clean
Galaxy instance.
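
Running a Galaxy tool through Bioblend looks like the sketch below; the tool id and
inputs are placeholders illustrating the call, not the actual Chado tool used by the script:

```python
# Assumes the GalaxyInstance `gi` from the previous sketch.
history = gi.histories.create_history(name="setup")

gi.tools.run_tool(history_id=history["id"],
                  tool_id="chado_organism_delete_organisms",  # placeholder id
                  tool_inputs={"organism": "Homo sapiens"})
```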

## `gga_run_workflow*.py`

`gga_run_workflow*.py` scripts are for the moment highly specific to Phaeoexplorer data.

These scripts will run tools and workflows inside the Galaxy instances to load the 
data into the GMOD applications. `gga_run_workflow*.py` scripts make extensive use of the Bioblend
library methods.

The script will first ensure that the Galaxy instances are running
(just like [gga_load_data.py](/gga_load_data.py)). Then, it verifies that the tool versions
installed in the Galaxy instances match the versions specified in the script (in [constants_phaeo.py](/constants_phaeo.py))
and in the workflow.
If the version and changeset of an installed tool don't match those in the script/workflow,
the correct version is installed via a Bioblend method.
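
A sketch of that check with Bioblend; the expected values would come from
[constants_phaeo.py](/constants_phaeo.py), and the ones below are placeholders:

```python
# Assumes the GalaxyInstance `gi` from the earlier sketches.
EXPECTED = {"name": "jbrowse", "owner": "iuc",  # placeholder values
            "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.11",
            "version": "1.16.11",
            "changeset_revision": "0123456789ab"}

installed = gi.tools.show_tool(EXPECTED["tool_id"])
if installed.get("version") != EXPECTED["version"]:
    # Install the expected revision from the Tool Shed.
    gi.toolshed.install_repository_revision(
        tool_shed_url="https://toolshed.g2.bx.psu.edu",
        name=EXPECTED["name"], owner=EXPECTED["owner"],
        changeset_revision=EXPECTED["changeset_revision"],
        install_tool_dependencies=True,
        install_repository_dependencies=True)
```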

Once all tool versions are verified and installed, datasets from the instance's
`Project library`
(the Galaxy library mirroring the local `src_data` directory tree) are imported
into a dedicated Galaxy history. This way, the datasets can more easily
be used as inputs
for Galaxy tools and workflows.
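
A sketch of that import; the library and history names are illustrative:

```python
# Assumes the GalaxyInstance `gi` from the earlier sketches.
history = gi.histories.create_history(name="citrus_sinensis_run")

library = gi.libraries.get_libraries(name="Project Data")[0]
for item in gi.libraries.show_library(library["id"], contents=True):
    if item["type"] == "file":
        # Import each library dataset into the working history.
        gi.histories.upload_dataset_from_library(history["id"], item["id"])
```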

After the required datasets are imported in the history, Galaxy tools are run to
add organisms and analyses into the Chado database.

Once this is done, the specified template workflow is imported into Galaxy.
These templates can be found in the [templates](/templates) directory.
The workflow's parameters (most of which are set at runtime in the template workflows)
are then populated using the imported datasets and the organisms/analyses.
The workflow is then invoked in the species' Galaxy history.

A constants file (e.g. [constants_phaeo.py](/constants_phaeo.py)) defines the Galaxy workflow `.ga` file
and its steps, as well as the other Galaxy tools (name, version, id, changeset) used in the `gga_run_workflow_phaeo*.py` scripts.

Invoking a workflow with Bioblend requires providing both the input files (datasets)
and the parameters as two separate dictionaries.
In the `gga_run_workflow*.py` scripts, these are respectively named `datamap` and `workflow_parameters`.
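
A sketch of the import and invocation; the workflow path, step ids, dataset id and
parameter names are placeholders, and the `datamap` keys are workflow input step ids,
as Bioblend expects:

```python
# Assumes the GalaxyInstance `gi` and history from the earlier sketches.
workflow = gi.workflows.import_workflow_from_local_path(
    "templates/example_workflow.ga")  # placeholder path

# Input datasets, keyed by workflow input step id.
datamap = {"0": {"src": "hda", "id": "dataset_id_placeholder"}}

# Runtime parameters, keyed by workflow step id.
workflow_parameters = {"1": {"organism": "organism_id_placeholder"}}

gi.workflows.invoke_workflow(workflow_id=workflow["id"],
                             inputs=datamap,
                             params=workflow_parameters,
                             history_id=history["id"])
```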