# gga_load_data (WIP)

Automated integration of new organisms into GGA environments, deployed as Docker stacks of services.
## Description:
Automatically generate functional GGA environments from a descriptive input YAML file.
See the example dataset file (example.yml) for the information that can be described
and the correct formatting of these input files.
The "gga_load_data" tool is divided into four separate scripts:
- gga_init: Create directory tree for organisms and deploy stacks for the input organisms as well as Traefik and optionally Authelia stacks
- gga_get_data: Create "src_data" directory tree for organisms and copy datasets for the input organisms into the organisms directory tree
- gga_load_data: Load the datasets of the input organisms into a library in their galaxy instance
- run_workflow_phaeoexplorer: Remotely run a custom workflow in galaxy, proposed as an "example script" to take inspiration from as workflow parameters are specific to Phaeoexplorer data
## Directory tree:
For every input organism, a dedicated directory is created. The script will create this directory and all subdirectories required.

If the user is adding new data to a species (for example, datasets for another strain or sex of the same species), the existing directory tree is updated.

Directory tree structure:
```
/main_directory
|
|---/genus1_species1
|   |
|   |---/blast
|   |   |---/links.yml
|   |   |---/banks.yml
|   |
|   |---/nginx
|   |   |---/conf
|   |       |---/default.conf
|   |
|   |---/docker_data  # Data used internally by docker (do not delete!)
|   |---/src_data
|   |   |---/genome
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/vX.X.fasta
|   |   |
|   |   |---/annotation
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/OGSX.X
|   |   |           |---/OGSX.X.gff
|   |   |           |---/OGSX.X_pep.fasta
|   |   |           |---/OGSX.X_transcripts.fasta
|   |   |
|   |   |---/tracks
|   |       |---/genus1_species1_strain_sex
|   |
|   |---/apollo
|   |   |---/annotation_groups.tsv
|   |
|   |---/docker-compose.yml
|   |
|   |---/metada_genus1_species1.yml (WIP)
|
|---/metadata.yml
|
|---/traefik
    |---/docker-compose.yml
    |---/authelia
        |---/users.yml
        |---/configuration.yml
```
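
As an illustration of the layout above, the following minimal sketch creates the per-organism directory skeleton with `pathlib`. It is not the code used by `gga_init`: the function name `make_organism_tree` and the `SUBDIRS` list are hypothetical, and only directories are created (files such as `docker-compose.yml` are left to the real scripts).

```python
from pathlib import Path

# Subdirectories copied from the tree above. This list is an assumption made
# for illustration; the real scripts may create more (or fewer) entries.
SUBDIRS = [
    "blast",
    "nginx/conf",
    "docker_data",
    "src_data/genome",
    "src_data/annotation",
    "src_data/tracks",
    "apollo",
]


def make_organism_tree(main_directory, genus, species):
    """Create the directory skeleton for one organism and return its root."""
    root = Path(main_directory) / f"{genus}_{species}"
    for sub in SUBDIRS:
        # exist_ok=True keeps reruns safe when data is added to an existing species
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Because existing directories are left untouched, calling the function again for the same species only adds what is missing, which mirrors the update behaviour described above.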

The scripts all take one mandatory input file that describes your species and their associated data
(see example.yml in the "examples" folder of the repository). Every dataset path in this input must be an absolute path.

You must also fill in a "config" file containing sensitive variables (galaxy and tripal passwords, etc.) that
the scripts read to create the different services and to access the galaxy container. If no config file is specified on the command line,
the one in the repository root is used by default. An example config file is available
in the "examples" folder of the repository.

**Warning: the config file is not required as an option for the "gga_init" and "gga_get_data" scripts**
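
Since every dataset path must be absolute, it can help to check the parsed input file before running the scripts. The sketch below walks a nested structure of dicts and lists and reports the offending entries. The helper name `find_relative_paths` and the assumption that path keys end in `_path` are hypothetical; check example.yml for the actual key names.

```python
import os


def find_relative_paths(node, key_suffix="_path", trail=""):
    """Return 'location: value' strings for every non-absolute path entry.

    `node` is the parsed input file (nested dicts and lists). The
    `key_suffix` convention is an assumption; adapt it to the real keys.
    """
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{trail}.{key}" if trail else str(key)
            if isinstance(value, str) and str(key).endswith(key_suffix):
                # Empty values are treated as "dataset not provided"
                if value and not os.path.isabs(value):
                    problems.append(f"{here}: {value}")
            else:
                problems.extend(find_relative_paths(value, key_suffix, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            problems.extend(
                find_relative_paths(item, key_suffix, f"{trail}[{index}]")
            )
    return problems
```

An empty result means every path entry found is absolute. The input would typically be loaded with a YAML parser first; plain Python structures are used here to keep the sketch free of dependencies.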

- Deploy stacks part: ```$ python3 /path/to/repo/gga_init.py your_input_file.yml -c/--config your_config_file [-v/--verbose] [OPTIONS]```
		--main-directory $PATH (Path where to create/update stacks; default=current directory)
		--force-traefik (If specified, will overwrite traefik and authelia files; default=False)

- Copy source data file: ```$ python3 /path/to/repo/gga_get_data.py your_input_file.yml [-v/--verbose] [OPTIONS]```
		--main-directory $PATH (Path where to access stacks; default=current directory)
- Load data in galaxy library and prepare galaxy instance: ```$ python3 /path/to/repo/gga_load_data.py your_input_file.yml -c/--config your_config_file [-v/--verbose]```
		--main-directory $PATH (Path where to access stacks; default=current directory)
- Run a workflow in galaxy: ```$ python3 /path/to/repo/gga_load_data.py your_input_file.yml -c/--config your_config_file --workflow /path/to/workflow.ga [-v/--verbose] [OPTIONS]```
		--workflow $WORKFLOW (Path to the workflow to run in galaxy. A couple of preset workflows are available in the "workflows" folder of the repository)
		--main-directory $PATH (Path where to access stacks; default=current directory)
**Warning: the "input file" and "config file" have to be the same for all scripts!**
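
To keep the input file and config file identical across steps, the command lines can be generated from one place. The sketch below builds the first three commands in order, using only the script names and options shown above. The helper `build_pipeline` is hypothetical, and it only prints the commands rather than running them, because data loading has to wait until the galaxy service is ready.

```python
import shlex


def build_pipeline(repo, input_file, config_file, main_directory=None, verbose=False):
    """Return the argv lists for the three data-preparation steps, in order."""
    steps = [
        ("gga_init.py", True),       # deploy stacks (config option shown above)
        ("gga_get_data.py", False),  # copy source data (no config option shown)
        ("gga_load_data.py", True),  # load data into the galaxy library
    ]
    commands = []
    for script, takes_config in steps:
        argv = ["python3", f"{repo}/{script}", input_file]
        if takes_config:
            argv += ["-c", config_file]
        if main_directory is not None:
            argv += ["--main-directory", main_directory]
        if verbose:
            argv.append("-v")
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in build_pipeline("/path/to/repo", "your_input_file.yml", "your_config_file"):
        # Quote each argument so the printed line can be pasted into a shell
        print(" ".join(shlex.quote(arg) for arg in argv))
```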

## Current limitations
When deploying the stack of services, the galaxy service can take a long time to be ready, because the galaxy container first prepares a persistent location for the container data. This can be bypassed by setting the variable "persist_galaxy_data" to "False" in the "config" YAML file.

Stack deployment and data loading into galaxy should therefore be run separately, with data loading started only once the galaxy service is ready.
To check the status of the galaxy service, you can run ```$ docker service logs -f genus_species_galaxy``` or
```./serexec genus_species_galaxy supervisorctl status``` to verify directly from the container.

\
*(The "gga_load_data.py" script will do this automatically anyway and will exit while notifying you it is not ready)*
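
If you would rather wait for galaxy than rerun the script by hand, a small polling loop is one option. This is a sketch under stated assumptions, not part of the tool: `wait_until_ready` and its `is_ready` callable are hypothetical, and `is_ready` is expected to wrap whichever check you use (for example, one of the commands above).

```python
import time


def wait_until_ready(is_ready, timeout=600.0, interval=10.0, sleep=time.sleep):
    """Poll `is_ready()` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the check succeeds, False on timeout.
    `sleep` is injectable so the loop can be exercised without real delays.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        # Give up if another wait would run past the deadline
        if time.monotonic() + interval > deadline:
            return False
        sleep(interval)
```

The timeout and interval defaults are arbitrary; pick values that match how long galaxy takes to start in your environment.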

## Requirements (*temporary*):
Requires Python 3.6 and the packages listed in [requirements.txt](./requirements.txt).