# gga_load_data tools

The gga_load_data tools automate the deployment of GMOD visualisation tools (Chado, Tripal, JBrowse, Galaxy) for a collection of genomes and datasets.
They are based on the Galaxy Genome Annotation (GGA) project (https://galaxy-genome-annotation.github.io). 
A stack of Docker services is deployed for each organism, from an input YAML file describing the data.
See `examples/example.yml` for an example of what information can be described and the correct formatting of this input file.

Each GGA environment is deployed at [https://hostname/sp/genus_species/](https://hostname/sp/genus_species/).

## Reverse proxy and authentication

Traefik is a reverse proxy that routes HTTP traffic to the various Docker Swarm services.
The Traefik dashboard is deployed at [https://hostname/traefik/](https://hostname/traefik/).

Authelia is an authentication agent that can be plugged into an LDAP server and that Traefik can use to check permissions to access services.
The authentication layer is optional. If used, the config file needs the variables `https_port`, `auth_hostname`, `authelia_config_path`.

Traefik queries Authelia automatically to check permissions every time someone wants to access a page.
If the user is not logged in, they are redirected to the Authelia portal.
Note that Authelia needs a secured connection (no self-signed certificate) between the upstream proxy and Traefik (and HTTPS between the internet and the proxy).
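
As a minimal sketch of the requirement above, the following Python snippet checks that the three authentication variables are present in a config that has already been loaded into a dictionary. The function name `missing_auth_keys`, the flat dictionary layout, and the sample values are assumptions made for illustration; they are not taken from the scripts themselves.

```python
# Keys the README lists as required when the authentication layer is used.
AUTH_KEYS = ("https_port", "auth_hostname", "authelia_config_path")


def missing_auth_keys(config: dict) -> list:
    """Return the authentication keys that are absent or empty in `config`."""
    return [key for key in AUTH_KEYS if config.get(key) in (None, "")]


# Hypothetical config values, for illustration only.
config = {"https_port": 443, "auth_hostname": "auth.example.org"}
print(missing_auth_keys(config))  # ['authelia_config_path']
```

An empty result means all three variables are set; it says nothing about whether their values are valid.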

## Steps
The "gga_load_data" tools consist of four scripts:
- gga_init: Create directory tree for organisms and deploy stacks for the input organisms as well as Traefik and optionally Authelia stacks
- gga_get_data: Create `src_data` directory tree for organisms and copy datasets for the input organisms into the organisms directory tree
- gga_load_data: Load the datasets of the input organisms into their Galaxy library
- run_workflow_phaeoexplorer: Remotely run a custom workflow in Galaxy. It is provided as an example script to take inspiration from, since the workflow parameters are specific to Phaeoexplorer data
## Usage

All scripts require an input file that describes the species and their associated data
(see `examples/example.yml`). Every dataset path in this file must be an absolute path.
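
The absolute-path rule can be checked before running any script. The sketch below assumes the dataset paths have already been collected from the input file into a list of strings; how to extract them depends on the input file layout, which is shown in `examples/example.yml` and not reproduced here. The function name and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def relative_paths(paths):
    """Return the paths that are not absolute (POSIX semantics)."""
    return [p for p in paths if not PurePosixPath(p).is_absolute()]


# Hypothetical dataset paths, as they might be collected from the input file.
paths = ["/data/genus_species/genome.fasta", "annotation/OGS1.0.gff"]
print(relative_paths(paths))  # ['annotation/OGS1.0.gff']
```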

A second YAML file is required: the config file, which holds the configuration variables (Galaxy and Tripal passwords, etc.) that
the scripts need to create the different services and to access the Galaxy container. By default, the config file
inside the repository root is used if none is specified on the command line. An example of this config file is available
in the `examples` folder.

**The input file and config file have to be the same for all scripts!**

- Deploy the stacks:

```bash
$ python3 /path/to/repo/gga_init.py input_file.yml -c/--config config_file [-v/--verbose] [OPTIONS]
		--main-directory $PATH (Path where to create/update stacks; default=current directory)
		--force-traefik (If specified, will overwrite traefik and authelia files; default=False)
```

- Copy source data file: 

```bash
$ python3 /path/to/repo/gga_get_data.py input_file.yml [-v/--verbose] [OPTIONS]
		--main-directory $PATH (Path where to access stacks; default=current directory)
```

- Load data in Galaxy library and prepare Galaxy instance: 

```bash
$ python3 /path/to/repo/gga_load_data.py input_file.yml -c/--config config_file [-v/--verbose] [OPTIONS]
		--main-directory $PATH (Path where to access stacks; default=current directory)
```

- Run a workflow in Galaxy:
 
```bash
$ python3 /path/to/repo/run_workflow_phaeoexplorer.py input_file.yml -c/--config config_file --workflow /path/to/workflow.ga [-v/--verbose] [OPTIONS]
		--workflow $WORKFLOW (Path to the workflow to run in Galaxy. A couple of preset workflows are available in the "workflows" folder of the repository)
		--main-directory $PATH (Path where to access stacks; default=current directory)
```
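
Because the input file and config file must be identical across scripts, it can help to build all command lines from a single set of arguments. The following Python sketch does that for the first three scripts, using only the options documented above. The helper name `build_commands` and the sample paths are assumptions for illustration; the sketch only builds and prints the commands, it does not run them.

```python
import shlex


def build_commands(repo, input_file, config_file, main_directory, verbose=False):
    """Return the argv lists for gga_init, gga_get_data and gga_load_data, in order."""
    common = ["--main-directory", main_directory] + (["-v"] if verbose else [])
    return [
        ["python3", f"{repo}/gga_init.py", input_file, "-c", config_file] + common,
        # gga_get_data.py is documented above without a config option.
        ["python3", f"{repo}/gga_get_data.py", input_file] + common,
        ["python3", f"{repo}/gga_load_data.py", input_file, "-c", config_file] + common,
    ]


# Hypothetical paths, for illustration only.
for argv in build_commands("/path/to/repo", "input_file.yml", "config_file.yml", "/main_directory"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Each list can be passed to `subprocess.run(argv, check=True)`. The loading step should not be chained blindly after the deployment step, since the Galaxy service must be ready first.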

## Directory tree

For every input organism, a dedicated directory is created with `gga_get_data.py`. The script creates this directory and all subdirectories required.
If the user is adding new data to a species (for example adding another strain dataset to the same species), the directory tree will be updated.

Directory tree structure:
```
/main_directory
|
|---/genus1_species1
|   |
|   |---/blast
|   |   |---/links.yml
|   |   |---/banks.yml
|   |
|   |---/nginx
|   |   |---/conf
|   |       |---/default.conf
|   |
|   |---/docker_data  # Data used internally by docker (do not delete!)
|   |
|   |---/src_data
|   |   |---/genome
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/vX.X.fasta
|   |   |
|   |   |---/annotation
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/OGSX.X
|   |   |           |---/OGSX.X.gff
|   |   |           |---/OGSX.X_pep.fasta
|   |   |           |---/OGSX.X_transcripts.fasta
|   |   |
|   |   |---/tracks
|   |       |---/genus1_species1_strain_sex
|   |
|   |---/apollo
|   |   |---/annotation_groups.tsv
|   |
|   |---/docker-compose.yml
|
|---/traefik
    |---/docker-compose.yml
    |---/authelia
        |---/users.yml
        |---/configuration.yml
```
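
The `src_data` part of the tree follows a regular naming pattern, which the Python sketch below reproduces for one dataset. It is an illustration of the layout shown above, not the code used by `gga_get_data.py`: the function name is hypothetical, and how the scripts normalise names (case, missing strain or sex) is not covered.

```python
from pathlib import PurePosixPath


def src_data_dirs(main_directory, genus, species, strain, sex):
    """Map each src_data category to the dataset directory shown in the tree."""
    organism = f"{genus}_{species}"
    dataset = f"{organism}_{strain}_{sex}"
    src_data = PurePosixPath(main_directory) / organism / "src_data"
    return {kind: src_data / kind / dataset for kind in ("genome", "annotation", "tracks")}


# Placeholder names taken from the tree above.
for kind, path in src_data_dirs("/main_directory", "genus1", "species1", "strain", "sex").items():
    print(kind, path)
```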

## Current limitations

The stack deployment and the data loading into Galaxy should be run separately, and the data loading only once the Galaxy service is ready.
The `gga_load_data.py` script checks that the Galaxy service is ready before loading the data and exits with a notification if it is not.
The status of the Galaxy service can be checked manually with `docker service logs -f genus_species_galaxy` or
`./serexec genus_species_galaxy supervisorctl status`.
When deploying the stack of services, the Galaxy service can take a long time to be ready, because of the data persistence. 
In development mode only, this can be disabled by setting the variable `persist_galaxy_data` to `False` in the config file.
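
The manual `supervisorctl status` check can be turned into a small helper. The Python sketch below parses text in the usual `supervisorctl status` format (program name, then state) and lists the programs that are not in the `RUNNING` state. Treating "all programs RUNNING" as "Galaxy is ready" is an assumption; it is not necessarily the check that `gga_load_data.py` performs. The function name and the sample program names are hypothetical.

```python
def not_running(status_output: str) -> list:
    """Return the names of supervisor programs whose state is not RUNNING."""
    pending = []
    for line in status_output.splitlines():
        fields = line.split()
        # Each status line starts with the program name followed by its state.
        if len(fields) >= 2 and fields[1] != "RUNNING":
            pending.append(fields[0])
    return pending


# Hypothetical output; the program names are illustrative only.
sample = """\
galaxy:galaxy_web    RUNNING   pid 512, uptime 0:05:12
galaxy:handler0      STARTING
"""
print(not_running(sample))  # ['galaxy:handler0']
```

The input text could be captured by running the `./serexec genus_species_galaxy supervisorctl status` command through `subprocess.run` and reading its standard output.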

## Requirements
Requires Python 3.6 and the packages listed in [requirements.txt](./requirements.txt).

## License

[BSD 3-Clause](./LICENSE)

## Acknowledgments

[Anthony Bretaudeau](https://github.com/abretaud)
[Matéo Boudet](https://github.com/mboudet)