Automated integration of new organisms into GGA environments, each organism being deployed as a stack of Docker services.
Automatically generate functional GGA environments from a descriptive input file.
See the example input files (example.json, example.yml or example.xlsx) for the kind of information that can be described
and the correct formatting of these input files.
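
Purely as an illustration of the kind of information such an input file holds (the key names below are hypothetical; the example files above are the authoritative reference), a YAML entry for one organism could look roughly like this:

```
# Hypothetical sketch only -- see example.yml for the real expected keys and layout
genus1_species1_strain1_male:
  description:
    genus: genus1
    species: species1
    strain: strain1
    sex: male
  data:
    genome_path: /path/to/genus1_species1_vX.X.fasta
    gff_path: /path/to/OGSX.X.gff
    proteins_path: /path/to/OGSX.X_pep.fasta
    transcripts_path: /path/to/OGSX.X_transcripts.fasta
```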

The current version of "gga_load_data" is divided into three (automated) parts:
- Create the stacks of services for the input organisms (orchestrated with Docker swarm, with Traefik used as the networking interface between the different stacks)
- Load the organisms' datasets into the Galaxy instance
- Remotely run a custom workflow in Galaxy

A metadata file will be generated to summarize the actions that have been taken inside a stack.
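
Note that Docker swarm orchestration requires the Docker host to already be running in swarm mode before the deployment script is called. If your host is not part of a swarm yet, swarm mode can be enabled with the standard Docker command below (shown only as a reminder; check whether your infrastructure already provides a swarm):

```
docker swarm init
```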

For every input organism, a dedicated directory is created. The script creates this directory and all the required subdirectories.
If the user adds new data to an existing species (for example another strain's or sex's datasets for the same species), the directory tree is updated accordingly:
```
/main_directory
|
|---/genus1_species1
| |
| |---/blast
| | |---/banks.yml
| | |---/links.yml
| |
| |---/docker_data # Data used internally by docker (do not delete!)
| |
| |---/src_data
| | |---/genome
| | | |---/genus1_species1_strain_sex
| | | | |---/vX.X
| | | | | |---/genus_species_vX.X.fasta
| | |
| | |---/annotation
| | | |---/genus1_species1_strain_sex
| | | | |---/OGSX.X
| | | | | |---/OGSX.X.gff
| | | | | |---/OGSX.X_pep.fasta
| | | | | |---/OGSX.X_transcripts.fasta
| | |
| | |---/tracks
| | | |---/genus1_species1_strain_sex
| |
| |---/apollo
| |---/metadata_genus1_species1.yml (WIP)
|
|---/docker-compose.yml
|
|---/authelia
| |---/users.yml
| |---/configuration.yml
```

For each input organism, the tool works in three parts (1 part = 1 separate script).
**The first two parts are required to set up a functional GGA stack**
**Part 1)**

1) Create the directory tree structure (if it already exists, only create the missing subdirectories)

2) Create the docker-compose file for the organism and deploy the stack of services.
**Warning: the Galaxy service takes up to 2 hours to be set up. During these 2 hours it cannot be interacted with; wait at least 2 hours
before calling the other scripts.**
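
Once the stack has been deployed, standard Docker swarm commands can be used to check that all of its services were created (assuming here that the stack is named after the organism's directory, e.g. genus1_species1):

```
docker stack ls                         # list the deployed stacks
docker stack services genus1_species1   # list the services of one stack and their replica state
```
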
**Part 2)**

3) Gather the source data files specified in the input; the script can search directories recursively (fully automated for local phaeoexplorer data)

4) Link the source files into the organism's src_data folders and load all the data into the Galaxy container as a Galaxy library

5) (*Optional*) Modify headers in the transcripts and protein fasta files

6) (*Optional*) TODO: Generate blast banks (no commit)

7) (*Optional*) Connect to the galaxy instance

8) (*Optional*) Run the Galaxy data integration steps (see http://gitlab.sb-roscoff.fr/abims/e-infra/gga)

9) (*Optional*) TODO: Generate and update metadata files
## Usage:

All scripts take one mandatory input file that describes your species and their associated data
(see yml_example_input.yml in the "examples" folder of the repository).

You must also fill in a "config" file containing sensitive variables (Galaxy and Tripal passwords, etc.) that
the scripts will read to create the different services and to access the Galaxy container. By default, the config file
at the repository root is used if none is specified on the command line.
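
The exact variable names expected in the config file are those used by the scripts and the default config file at the repository root; the sketch below only illustrates the kind of values involved, with hypothetical key names:

```
# Hypothetical sketch only -- copy and edit the default config file at the
# repository root rather than writing one from scratch
galaxy_admin_email: admin@example.org
galaxy_admin_password: CHANGEME
tripal_admin_password: CHANGEME
```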

- Deploy the stacks of services: ```$ python3 /path/to/repo/gga_init.py your_input_file.yml -c/--config your_config_file [-v/--verbose]```

- Copy the source data files and load the data into the Galaxy container: ```$ python3 /path/to/repo/gga_load_data.py your_input_file.yml -c/--config your_config_file [-v/--verbose]```
- Run a workflow (currently for phaeoexplorer only): ```$ python3 /path/to/repo/run_workflow_phaeoexplorer.py your_input_file.yml -c/--config your_config_file [-v/--verbose] -w/--workflow your_workflow```

**Warning: the "input file" and "config file" have to be the same for the 3 steps!**

When deploying the stack of services, the Galaxy service takes a long time to be ready (around 2 hours of wait time).

For the moment, the stack deployment and the data loading into Galaxy should be run separately, the latter only once the Galaxy service is ready.

To check the status of the Galaxy service, run ```$ docker service logs -f genus_species_galaxy```, or
```./serexec genus_species_galaxy supervisorctl status```
to verify directly from inside the container.
*(The "gga_load_data.py" script checks the Galaxy container anyway and will exit if it is not ready)*
## Requirements (*temporary*):
Requires Python 3.7+
Packages required:
```
bioblend==0.14.0
boto==2.49.0
certifi==2019.11.28
cffi==1.14.0
chardet==3.0.4
cryptography==2.8
idna==2.9
numpy==1.18.1
pandas==1.0.3
pycparser==2.20
pyOpenSSL==19.1.0
PySocks==1.7.1
python-dateutil==2.8.1
pytz==2019.3
PyYAML==5.3.1
requests==2.23.0
requests-toolbelt==0.9.1
six==1.14.0
urllib3==1.25.7
xlrd==1.2.0
```
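
One way to install these pinned versions without touching the system Python is a virtual environment, assuming the list above is saved as a requirements.txt file (if the repository does not already provide one):

```
python3 -m venv gga_venv
source gga_venv/bin/activate
pip install -r requirements.txt
```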