Loraine Guéguen authored 7fb2d08e

gga_load_data (WIP)

Automated integration of new organisms into GGA environments, in the form of a Docker stack of services.

Description:

Automatically generate functional GGA environments from a descriptive input YAML file. See the example file (example.yml) for the information that can be described and the correct formatting of these input files.
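
For orientation only, a hypothetical input file might look like the sketch below. Every field name here is an assumption for illustration; example.yml in the repository is the authoritative reference for the actual schema.

```yaml
# Hypothetical input sketch -- field names are illustrative, not the real schema
genus: Genus1
species: species1
strain: strain1
sex: male
genome:
  version: "1.0"
  path: /path/to/genome.fasta
annotation:
  version: "OGS1.0"
  gff: /path/to/annotation.gff
```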

"gga_load_data" in its current version is divided in 4 parts:

  • gga_init: Create the directory tree and deploy the stacks for the input organisms, as well as the Traefik and (optionally) Authelia stacks
  • gga_get_data: Copy the datasets for the input organisms into the organisms' directory tree
  • gga_load_data: Load the datasets of the input organisms into a library in their Galaxy instance
  • run_workflow_phaeoexplorer: Remotely run a custom workflow in Galaxy; provided as an example script to take inspiration from, since its workflow parameters are specific to Phaeoexplorer data

Metadata files (WIP):

A metadata file will be generated to summarize what actions have previously been taken inside a stack.

Directory tree:

For every input organism, a dedicated directory is created. The script creates this directory and all required subdirectories.

If the user adds new data to an existing species (for example, another strain's or sex's datasets for the same species), the directory tree is updated.

Directory tree structure:

/main_directory
|
|---/genus1_species1
|   |
|   |---/blast
|   |   |---/links.yml
|   |   |---/banks.yml
|   |
|   |---/nginx
|   |   |---/conf
|   |       |---/default.conf
|   |
|   |---/docker_data  # Data used internally by docker (do not delete!)
|   |
|   |---/src_data
|   |   |---/genome
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/vX.X
|   |   |           |---/genus_species_vX.X.fasta
|   |   |
|   |   |---/annotation
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/OGSX.X
|   |   |           |---/OGSX.X.gff
|   |   |           |---/OGSX.X_pep.fasta
|   |   |           |---/OGSX.X_transcripts.fasta
|   |   |
|   |   |---/tracks
|   |       |---/genus1_species1_strain_sex
|   |
|   |---/apollo
|   |   |---/annotation_groups.tsv
|   |
|   |---/docker-compose.yml
|   |
|   |---/metadata_genus1_species1.yml (WIP)
|
|---/metadata.yml
|
|---/traefik
    |---/docker-compose.yml
    |---/authelia
        |---/users.yml
        |---/configuration.yml
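
The idempotent directory creation described above can be sketched in Python with pathlib (the subfolder list is abridged for illustration; gga_init builds the full tree shown above):

```python
from pathlib import Path

# Subdirectories created per organism (abridged list for illustration)
SUBDIRS = [
    "blast",
    "nginx/conf",
    "docker_data",
    "src_data/genome",
    "src_data/annotation",
    "src_data/tracks",
    "apollo",
]

def make_species_dirs(main_directory: str, genus: str, species: str) -> Path:
    """Create (or update) the directory tree for one organism.

    mkdir(parents=True, exist_ok=True) makes this idempotent: rerunning
    it for an existing species only adds the missing subdirectories.
    """
    root = Path(main_directory) / f"{genus}_{species}"
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Because creation is idempotent, adding a new strain or sex to an existing species reuses the same organism root and only fills in what is missing.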

Steps:

For each input organism, the tool is used as a set of separate steps/scripts.

Part 1)

  1. Create the directory tree structure (if it already exists, only create the required subdirectories)
  2. Create the docker-compose file for the organism and deploy the stack of services

Warning: the Galaxy service takes up to 2 hours to set up (for backup purposes). During these 2 hours it cannot be interacted with, so wait at least 2 hours before calling the other scripts.

Part 2)

  1. Find the source data files specified in the input file and copy them into the organism's correct src_data subfolders
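
A minimal sketch of that copy step, assuming the destination layout shown in the directory tree (file discovery and naming conventions are simplified here):

```python
import shutil
from pathlib import Path

def copy_to_src_data(source_file: str, organism_dir: str, category: str,
                     strain_sex_dir: str) -> Path:
    """Copy one dataset into the organism's src_data subfolder.

    category is e.g. "genome", "annotation" or "tracks"; the destination
    subfolder is created if missing, and the file is copied with its
    metadata preserved (shutil.copy2).
    """
    dest_dir = Path(organism_dir) / "src_data" / category / strain_sex_dir
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source_file, dest_dir / Path(source_file).name))
```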

Part 3)

  1. Create a Galaxy library and load the datasets into this library; set up the Galaxy instance and its history(-ies)
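
The library layout mirrors src_data. A sketch of deriving the Galaxy library folder paths from the on-disk tree (the actual library creation and upload go through bioblend and are omitted here):

```python
from pathlib import Path

def library_folders(src_data_dir: str) -> dict:
    """Map each dataset file under src_data to the Galaxy library folder
    it should be loaded into (the folder path mirrors the on-disk layout).
    """
    src = Path(src_data_dir)
    mapping = {}
    for f in src.rglob("*"):
        if f.is_file():
            # Library folder = parent directory path, relative to src_data
            mapping[str(f)] = "/" + "/".join(f.parent.relative_to(src).parts)
    return mapping
```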

Part 4) (Script only available to Phaeoexplorer members)

  1. Modify headers in the transcripts and protein fasta files
  2. Transfer manual annotation descriptions and HECTAR descriptions to the organism's GFF file
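
Header rewriting of that kind can be sketched as a line filter. The exact Phaeoexplorer header transformation is project-specific; here a hypothetical prefix is added to each record ID as a placeholder:

```python
def rewrite_fasta_headers(lines, prefix):
    """Yield FASTA lines with each header ID prefixed.

    Only lines starting with '>' are touched; sequence lines pass
    through unchanged. Adding a prefix is a placeholder for the
    project-specific header rewriting.
    """
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + line[1:].lstrip()
        else:
            yield line
```

Being a generator over lines, this works the same whether the lines come from a list or are streamed from a large FASTA file.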

Part 5)

  1. Configure and invoke a workflow for the organism

Usage:

The scripts all take one mandatory input file that describes your species and their associated data (see yml_example_input.yml in the "examples" folder of the repository)

You must also fill in a "config" file containing sensitive variables (Galaxy and Tripal passwords, etc.) that the script reads to create the different services and to access the Galaxy container. By default, the config file at the repository root is used if none is specified on the command line.

Warning: the config file is not required as an option for the "gga_get_data" script

  • Deploy stacks:
    $ python3 /path/to/repo/gga_init.py your_input_file.yml -c/--config your_config_file [-v/--verbose] [OPTIONS]
    OPTIONS:
      --main-directory $PATH  Path where to create/update stacks (default: current directory)
      --traefik               If specified, will try to start or overwrite the Traefik+Authelia stack (default: False)
      --http                  Use an HTTP Traefik+Authelia configuration (default: False)
      --https                 Use an HTTPS Traefik+Authelia configuration; might require a certificate for the hostname (default: True)

  • Copy source data files:
    $ python3 /path/to/repo/gga_get_data.py your_input_file.yml [-v/--verbose] [OPTIONS]
    OPTIONS:
      --main-directory $PATH  Path where to access stacks (default: current directory)

  • Load data into a Galaxy library and prepare the Galaxy instance:
    $ python3 /path/to/repo/gga_load_data.py your_input_file.yml -c/--config your_config_file [-v/--verbose] [OPTIONS]
    OPTIONS:
      --main-directory $PATH  Path where to access stacks (default: current directory)

  • Run a workflow in Galaxy:
    $ python3 /path/to/repo/run_workflow_phaeoexplorer.py your_input_file.yml -c/--config your_config_file --workflow /path/to/workflow.ga [-v/--verbose] [OPTIONS]
      --workflow $WORKFLOW    Path to the workflow to run in Galaxy. A couple of preset workflows are available in the "workflows" folder of the repository
    OPTIONS:
      --main-directory $PATH  Path where to access stacks (default: current directory)
      --setup                 Set up the organism instance (create organism and analyses, get their IDs, etc.). This option is MANDATORY the first time the script is run in a new Galaxy instance; otherwise the script will not be able to set runtime parameters for the workflows

Warning: the "input file" and "config file" have to be the same for all scripts!

Current limitations

When deploying the stack of services, the galaxy service takes a long time to be ready (around 2 hours of wait time). This is due to the galaxy container preparing a persistent location for the container data.

The stack deployment and the data loading into Galaxy should hence be run separately, and only once the Galaxy service is ready.

To check the status of the Galaxy service, run $ docker service logs -f genus_species_galaxy, or ./serexec genus_species_galaxy supervisorctl status to verify directly from the container. (The "gga_load_data.py" script checks the Galaxy container anyway and will exit with a notification if it is not ready.)
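
The wait-for-Galaxy logic can be sketched as a generic polling loop. The `probe` callable below is a stand-in for whatever readiness check you use (e.g. parsing supervisorctl status output or a bioblend API call); the function itself is just the retry scaffolding:

```python
import time

def wait_until_ready(probe, timeout=7200, interval=60, sleep=time.sleep):
    """Poll `probe` until it returns True or `timeout` seconds elapse.

    probe: zero-argument callable returning True when the service is ready.
    Returns True on success, False on timeout. `sleep` is injectable so
    the loop can be tested without actually waiting.
    """
    waited = 0
    while waited <= timeout:
        if probe():
            return True
        sleep(interval)
        waited += interval
    return False
```

The default timeout of 7200 seconds matches the ~2 hour Galaxy setup time mentioned above.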

Requirements (temporary):

Requires Python 3.7+

Packages required:

bioblend==0.14.0
boto==2.49.0
certifi==2019.11.28
cffi==1.14.0
chardet==3.0.4
cryptography==2.8
idna==2.9
numpy==1.18.1
pandas==1.0.3
pycparser==2.20
pyOpenSSL==19.1.0
PySocks==1.7.1
python-dateutil==2.8.1
pytz==2019.3
PyYAML==5.3.1
requests==2.23.0
requests-toolbelt==0.9.1
six==1.14.0
urllib3==1.25.7
xlrd==1.2.0