# gga_load_data (WIP)

Automated integration of new organisms into GGA environments, deployed as docker stacks of services.

## Description:
Automatically generate functional GGA environments from a descriptive input file.
See the example datasets (example.json, example.yml or example.xlsx) for the information that can be described
and the expected formatting of these input files.

"gga_load_data" in its current version is divided into three (automated) parts:
- Create the stacks of services for the input organisms (orchestrated using docker swarm, with traefik used as a networking interface between the different stacks)
- Load the organisms datasets into the galaxy instance
- Remotely run a custom workflow in galaxy

## Metadata files (WIP):
A metadata file will be generated to summarize what actions have previously been taken inside a stack.

## Directory tree:
For every input organism, a dedicated directory is created. The script will create this directory and all subdirectories
required.

If the user adds new data to an existing species (for example, datasets for another strain or sex), the directory tree is updated accordingly.

Directory tree structure:
```
/main_directory
|
|---/genus1_species1
|   |
|   |---/blast
|   |   |---/links.yml
|   |   |---/banks.yml
|   |
|   |---/nginx
|   |   |---/conf
|   |       |---/default.conf
|   |
|   |---/docker_data  # Data used internally by docker (do not delete!)
|   |  
|   |---/src_data
|   |   |---/genome
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/vX.X
|   |   |           |---/genus_species_vX.X.fasta
|   |   |
|   |   |---/annotation
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/OGSX.X
|   |   |           |---/OGSX.X.gff
|   |   |           |---/OGSX.X_pep.fasta
|   |   |           |---/OGSX.X_transcripts.fasta
|   |   |
|   |   |---/tracks
|   |       |---/genus1_species1_strain_sex
|   |
|   |---/apollo
|   |   |---/annotation_groups.tsv
|   |
|   |---/docker-compose.yml
|   |
|   |---/metada_genus1_species1.yml (WIP)
|
|---/metadata.yml
|
|---/traefik
    |---/docker-compose.yml
    |---/authelia
        |---/users.yml
        |---/configuration.yml

```
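The "create only what is missing" behaviour described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `create_organism_tree` and the `SUBDIRS` list are assumptions, with the subdirectory names copied from the tree above.

```python
from pathlib import Path

# Subdirectories taken from the tree above (illustrative list, not read from the tool).
SUBDIRS = [
    "blast",
    "nginx/conf",
    "docker_data",
    "src_data/genome",
    "src_data/annotation",
    "src_data/tracks",
    "apollo",
]


def create_organism_tree(main_dir, genus_species):
    """Create the organism directory and any subdirectories that are missing."""
    organism_dir = Path(main_dir) / genus_species
    for subdir in SUBDIRS:
        # exist_ok=True leaves existing directories and their contents untouched,
        # so the function is safe to call again when new data is added.
        (organism_dir / subdir).mkdir(parents=True, exist_ok=True)
    return organism_dir
```

Calling `create_organism_tree("/main_directory", "genus1_species1")` a second time should be a no-op for directories that already exist.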

## Steps:
For each input organism, the tool works in three parts (1 part = 1 separate script).

**The first two parts are required to set up a functional GGA stack**

**Part 1)**

1) Create the directory tree structure (if it already exists, only create the required subdirectories)

2) Create the docker-compose file for the organism and deploy the stack of services.


**Warning: the Galaxy service takes up to 2 hours to be set up and cannot be interacted with during that time. Wait at least 2 hours
before calling the other scripts.**

**Part 2)**

3) Gather the source data files specified in the input; the directory can be searched recursively (fully automated for local phaeoexplorer data)

4) Link the source files to the organism's src_data folders and load all the data into the galaxy container as a galaxy library
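Steps 3 and 4 can be illustrated with a small sketch that searches a directory recursively for a file and symlinks it into a destination folder. It assumes that "link" means a symbolic link and that the first match (in sorted order) is the wanted one; the function name `link_source_file` and its parameters are hypothetical, and loading the data into galaxy is not covered.

```python
from pathlib import Path


def link_source_file(search_root, filename, dest_dir):
    """Find `filename` anywhere under `search_root` and symlink it into `dest_dir`."""
    # rglob walks the directory tree recursively; sorting makes the choice deterministic.
    matches = sorted(Path(search_root).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {search_root}")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    link = dest_dir / filename
    # Do not overwrite an existing link or file at the destination.
    if not link.is_symlink() and not link.exists():
        link.symlink_to(matches[0].resolve())
    return link
```

For example, `link_source_file("/data", "genus_species_vX.X.fasta", "genus1_species1/src_data/genome/genus1_species1_strain_sex/vX.X")` would place a link in the genome folder shown in the directory tree.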

*(Optional)* **Part 3)**

5) (*Optional*) Modify headers in the transcripts and protein fasta files

6) (*Optional*) TODO: Generate blast banks (no commit)

7) (*Optional*) Connect to the galaxy instance

8) (*Optional*) Run data integration galaxy steps (see http://gitlab.sb-roscoff.fr/abims/e-infra/gga)

9) (*Optional*) TODO: Generate and update metadata files
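As an illustration of step 5, the sketch below rewrites the header lines of a FASTA file while leaving sequence lines unchanged. The actual renaming rule applied by the tool is not described here, so the rule is passed in by the caller; `rewrite_fasta_headers` and `transform` are hypothetical names.

```python
def rewrite_fasta_headers(lines, transform):
    """Yield FASTA lines with `transform` applied to each header (the text after '>')."""
    for line in lines:
        if line.startswith(">"):
            # Strip the newline before transforming, then add it back.
            yield ">" + transform(line[1:].rstrip("\n")) + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line


# Example rule (an assumption): keep only the identifier before the first whitespace.
records = [">mRNA1 some description\n", "ATGC\n"]
cleaned = list(rewrite_fasta_headers(records, lambda header: header.split()[0]))
```

Because the function works on any iterable of lines, it can be applied to an open file handle without reading the whole file into memory.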

## Usage:
The scripts all take one mandatory input file that describes your species and their associated data 
(see yml_example_input.yml in the "examples" folder of the repository)

You must also fill in a "config" file containing sensitive variables (galaxy and tripal passwords, etc.) that
the scripts will read to create the different services and to access the galaxy container. By default, the config file
inside the repository root will be used if none is specified on the command line

- Deploy stacks part: ```$ python3 /path/to/repo/gga_init.py your_input_file.yml -c/--config your_config_file [-v/--verbose]```

- Copy source data file and load data into the galaxy container: ```$ python3 /path/to/repo/gga_load_data.py your_input_file.yml -c/--config your_config_file [-v/--verbose]```

- Run a workflow (currently for phaeoexplorer only): ```$ python3 /path/to/repo/run_workflow_phaeoexplorer.py your_input_file.yml -c/--config your_config_file [-v/--verbose] -w/--workflow your_workflow``` 

**Warning: the "input file" and "config file" have to be the same for the 3 steps!**
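One way to guarantee that the same input and config files are used for all three steps is to build the three command lines from a single set of arguments. The sketch below only assembles the commands; the script names and the `-c`, `-v` and `-w` flags are those listed above, while the helper name `build_commands` and its parameters are assumptions.

```python
import sys
from pathlib import Path


def build_commands(repo, input_file, config_file, workflow, verbose=False):
    """Return the three command lines, all sharing the same input and config files."""
    common = [input_file, "-c", config_file]
    if verbose:
        common.append("-v")
    repo = Path(repo)
    return [
        [sys.executable, str(repo / "gga_init.py"), *common],
        [sys.executable, str(repo / "gga_load_data.py"), *common],
        [sys.executable, str(repo / "run_workflow_phaeoexplorer.py"), *common, "-w", workflow],
    ]
```

Each command can then be passed to `subprocess.run(cmd, check=True)`, keeping in mind that the galaxy service must be ready before the second and third commands are run.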

## Current limitations
When deploying the stack of services, the galaxy service takes a long time to be ready (around 2 hours of wait time)

For the moment, the stacks deployment and the data loading into galaxy should be run separately (only once the galaxy service is ready)

To check the status of the galaxy service, run ```$ docker service logs -f genus_species_galaxy``` or 
```./serexec genus_species_galaxy supervisorctl status```
to verify directly from the container
*(The "gga_load_data.py" script will check on the galaxy container anyway and will exit if it's not ready)*
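Waiting for the galaxy service can be automated with a simple polling loop. The sketch below is an assumption about how one might do this, not a description of what "gga_load_data.py" does: `wait_until_ready` is a hypothetical helper, and `check` is a caller-supplied function (for example, one wrapping either of the commands above) that returns `True` once galaxy is ready.

```python
import time


def wait_until_ready(check, timeout=2 * 60 * 60, interval=60,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `check()` until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            # Give up: the service did not become ready in time.
            return False
        sleep(interval)
```

The default timeout of 2 hours mirrors the wait time mentioned above. The `sleep` and `clock` parameters exist only to make the loop easy to test.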

## Requirements (*temporary*):
Requires Python 3.7+

Packages required:
```
bioblend==0.14.0
boto==2.49.0
certifi==2019.11.28
cffi==1.14.0
chardet==3.0.4
cryptography==2.8
idna==2.9
numpy==1.18.1
pandas==1.0.3
pycparser==2.20
pyOpenSSL==19.1.0
PySocks==1.7.1
python-dateutil==2.8.1
pytz==2019.3
PyYAML==5.3.1
requests==2.23.0
requests-toolbelt==0.9.1
six==1.14.0
urllib3==1.25.7
xlrd==1.2.0
```