# gga_load_data (WIP)

Automated integration of new organisms into GGA environments in the form of a Docker stack of services.
## Description:
Automatically generate functional GGA environments from a descriptive input file.
See the example datasets (example.json, example.yml or example.xlsx) for the information that can be described
and the correct formatting of these input files.
"gga_load_data" in its current version is divided into three (automated) parts:
- Create the stacks of services for the input organisms (orchestrated using docker swarm, with traefik used as a networking interface between the different stacks)
- Load the organisms datasets into the galaxy instance
- Remotely run a custom workflow in galaxy

## Metadata files (WIP):
A metadata file will be generated to summarize what actions have previously been taken inside a stack.
## Directory tree:
For every input organism, a dedicated directory is created. The script will create this directory and all subdirectories
required.

If the user adds new data to an existing species (for example, datasets for another strain or sex of the same species), the directory tree is updated accordingly.

Directory tree structure:
```
/main_directory
|
|---/genus1_species1
|   |
|   |---/blast
|   |   |---/links.yml
|   |   |---/banks.yml
|   |
|   |---/nginx
|   |   |---/conf
|   |       |---/default.conf
|   |
|   |---/docker_data  # Data used internally by docker (do not delete!)
|   |
|   |---/src_data
|   |   |---/genome
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/vX.X
|   |   |           |---/genus_species_vX.X.fasta
|   |   |
|   |   |---/annotation
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/OGSX.X
|   |   |           |---/OGSX.X.gff
|   |   |           |---/OGSX.X_pep.fasta
|   |   |           |---/OGSX.X_transcripts.fasta
|   |   |
|   |   |---/tracks
|   |       |---/genus1_species1_strain_sex
|   |
|   |---/apollo
|   |   |---/annotation_groups.tsv
|   |
|   |---/docker-compose.yml
|   |
|   |---/metadata_genus1_species1.yml (WIP)
|
|---/metadata.yml
|
|---/traefik
    |---/docker-compose.yml
    |---/authelia
        |---/users.yml
        |---/configuration.yml
```
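
As an illustration only, the directory creation described above could be sketched as follows. The subdirectory names are taken from the tree; the `create_organism_tree` helper and the `ORGANISM_SUBDIRS` constant are hypothetical and are not part of gga_load_data.

```python
from pathlib import Path

# Subdirectory names come from the tree above; this helper itself is a
# hypothetical sketch, not the code used by gga_load_data.
ORGANISM_SUBDIRS = (
    "blast",
    "nginx/conf",
    "docker_data",
    "src_data/genome",
    "src_data/annotation",
    "src_data/tracks",
    "apollo",
)


def create_organism_tree(main_dir, genus_species):
    """Create the organism directory and any missing subdirectories."""
    organism_dir = Path(main_dir) / genus_species
    for subdir in ORGANISM_SUBDIRS:
        # exist_ok=True leaves existing directories and their content untouched,
        # so the call can be repeated when new data is added to a species.
        (organism_dir / subdir).mkdir(parents=True, exist_ok=True)
    return organism_dir
```

Calling `create_organism_tree("/main_directory", "genus1_species1")` a second time only creates whatever is missing, which matches the "update the existing tree" behaviour described above.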

## Steps:
For each input organism, the tool works in three parts (1 part = 1 separate script).

**The first two parts are required to set up a functional GGA stack**

**Part 1)**
1) Create the directory tree structure (if it already exists, only create the required subdirectories)
2) Create the docker-compose file for the organism and deploy the stack of services.


**Warning: the Galaxy service takes up to 2 hours to be set up and cannot be interacted with during that time. Wait at least 2 hours before calling the other scripts.**

**Part 2)**
3) Gather the source data files specified in the input file; the script can search the directory recursively (fully automated for local phaeoexplorer data)
4) Link the source files into the organism's src_data folders and load all the data into the Galaxy container as a Galaxy library
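
The gather-and-link idea in steps 3 and 4 can be sketched roughly as below. This is a minimal illustration, assuming files are selected by extension and linked with symbolic links; the `link_source_files` function and its `suffixes` default are hypothetical, and the actual script may select and link files differently.

```python
from pathlib import Path


def link_source_files(search_dir, dest_dir, suffixes=(".fasta", ".gff")):
    """Recursively find files by extension and symlink them into dest_dir.

    Hypothetical helper: the selection rule (file extension) is an assumption.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    linked = []
    for path in sorted(Path(search_dir).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        link = dest / path.name
        # Skip names that are already present so the function can be re-run.
        if not link.exists() and not link.is_symlink():
            link.symlink_to(path.resolve())
        linked.append(link)
    return linked
```

Linking rather than copying keeps large genome and annotation files in one place; `dest_dir` should sit outside `search_dir` so that the links are not picked up by a later search.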

*(Optional)* **Part 3)**
5) (*Optional*) Modify headers in the transcripts and protein fasta files
6) (*Optional*) TODO: Generate blast banks (no commit)
7) (*Optional*) Connect to the galaxy instance
8) (*Optional*) Run data integration galaxy steps (see http://gitlab.sb-roscoff.fr/abims/e-infra/gga)
9) (*Optional*) TODO: Generate and update metadata files

## Usage:
The scripts all take one mandatory input file that describes your species and their associated data
(see yml_example_input.yml in the "examples" folder of the repository).
You must also fill in a "config" file containing sensitive variables (Galaxy and Tripal passwords, etc.) that
the scripts read to create the different services and to access the Galaxy container. By default, the config file
in the repository root is used if none is specified on the command line.
- Deploy stacks part: ```$ python3 /path/to/repo/gga_init.py your_input_file.yml -c/--config your_config_file [-v/--verbose]```
- Copy source data file and load data into the galaxy container: ```$ python3 /path/to/repo/gga_load_data.py your_input_file.yml -c/--config your_config_file [-v/--verbose]```

- Run a workflow (currently for phaeoexplorer only): ```$ python3 /path/to/repo/run_workflow_phaeoexplorer.py your_input_file.yml -c/--config your_config_file [-v/--verbose] -w/--workflow your_workflow``` 

**Warning: the "input file" and "config file" have to be the same for the 3 steps!**
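
One way to guarantee that the three steps share the same input and config files is to build all three command lines from a single pair of arguments. The sketch below only uses the script names and flags listed above; the `build_commands` helper is hypothetical and is not shipped with the repository.

```python
from pathlib import Path

# Script names as listed in the usage section above.
SCRIPTS = ("gga_init.py", "gga_load_data.py", "run_workflow_phaeoexplorer.py")


def build_commands(repo, input_file, config_file, workflow, verbose=False):
    """Return the three invocations, all sharing one input and one config file."""
    commands = []
    for script in SCRIPTS:
        cmd = ["python3", str(Path(repo) / script), input_file, "-c", config_file]
        if verbose:
            cmd.append("-v")
        commands.append(cmd)
    # Only the workflow script takes the -w/--workflow option.
    commands[-1] += ["-w", workflow]
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. The commands should not be chained blindly: the Galaxy service must be ready before the second command is run.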

## Current limitations
When deploying the stack of services, the Galaxy service takes a long time to be ready (around 2 hours of wait time).

For the moment, stack deployment and data loading into Galaxy must be run separately (load the data only once the Galaxy service is ready).
To check the status of the Galaxy service, run ```$ docker service logs -f genus_species_galaxy```, or run
```./serexec genus_species_galaxy supervisorctl status```
to check directly from inside the container.
*(The "gga_load_data.py" script checks the Galaxy container anyway and exits if it is not ready.)*
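
If you prefer to wait programmatically rather than watch the logs, a polling loop along these lines may help. It is a sketch under assumptions: it supposes that the Galaxy instance is reachable over HTTP at a URL you know and that it answers on the `/api/version` path; `galaxy_is_up` and `wait_until_ready` are hypothetical names, and this is not the check performed by "gga_load_data.py".

```python
import time
import urllib.request


def galaxy_is_up(base_url, timeout=10):
    """Return True if an HTTP 200 comes back from <base_url>/api/version.

    Assumption: the Galaxy API is exposed at base_url.
    """
    url = f"{base_url.rstrip('/')}/api/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def wait_until_ready(probe, max_wait=2 * 60 * 60, interval=60, sleep=time.sleep):
    """Call probe() every `interval` seconds until it succeeds or max_wait elapses."""
    waited = 0
    while True:
        if probe():
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

For example, `wait_until_ready(lambda: galaxy_is_up("http://localhost/galaxy"))` would poll once a minute for up to 2 hours; the URL here is a placeholder and must be replaced with the address of your own instance.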

## Requirements (*temporary*):
Requires Python 3.7+

Packages required:
```
bioblend==0.14.0  
boto==2.49.0
certifi==2019.11.28
cffi==1.14.0
chardet==3.0.4
cryptography==2.8
idna==2.9
numpy==1.18.1
pandas==1.0.3
pycparser==2.20
pyOpenSSL==19.1.0
PySocks==1.7.1
python-dateutil==2.8.1
pytz==2019.3
PyYAML==5.3.1
requests==2.23.0
requests-toolbelt==0.9.1
six==1.14.0
urllib3==1.25.7
xlrd==1.2.0
```