
gga_load_data (WIP)

Automated integration of new organisms into GGA instances.

Description:

This script automates the integration of new organisms into GGA instances as part of the phaeoexplorer project. As input, it takes either a tabulated file (xls, xlsx or csv) or a JSON file describing the organisms for which instances must be created or updated. For each organism to be integrated, the script needs at least its genus and species; strain, sex, genome version and annotation version are optional. When omitted, the genome and annotation versions default to 1.0, while strain and sex are left empty and are not taken into account during the integration process. See the example datasets (example.json and example.xlsx) for the information that can be described and the correct formatting of these input files. The script then takes care of everything (for phaeoexplorer organisms), from generating the directory tree to running workflows and tools in the Galaxy instance.
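For illustration only, here is a minimal sketch of how such an input could be parsed and the defaults described above applied. The field names (genus, species, strain, sex, genome_version, annotation_version) are assumptions for this example; refer to example.json and example.xlsx for the actual expected keys.

import json

import pandas as pd

def load_organisms(path):
    # Load either a JSON file or a tabulated file (xls, xlsx or csv) into a list of dicts.
    if path.endswith(".json"):
        with open(path) as handle:
            organisms = json.load(handle)
    elif path.endswith(".csv"):
        organisms = pd.read_csv(path).to_dict("records")
    else:
        organisms = pd.read_excel(path).to_dict("records")
    for organism in organisms:
        if not organism.get("genus") or not organism.get("species"):
            raise ValueError("genus and species are mandatory for every organism")
        # Optional fields: strain and sex stay empty, versions default to 1.0.
        for key, default in (("strain", ""), ("sex", ""),
                             ("genome_version", "1.0"), ("annotation_version", "1.0")):
            value = organism.get(key)
            organism[key] = default if value is None or pd.isna(value) else value
    return organisms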

TODO:

  • ready the script for production (prepare production arguments) + remove dev args for master
  • metadata
  • call the scripts for formatting data and generating blast banks

Metadata files (WIP):

The script also generates a metadata file in the directory of each newly integrated species, summarizing the actions taken for this organism (see meta_toy.yaml for the kind of information it can contain). It also creates another metadata file in the main directory (the one holding all integrated organisms), which aggregates the metadata files of all integrated organisms. These metadata files are also updated when an existing instance is updated.
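A minimal sketch of this aggregation step, assuming the file names shown in the directory tree below (metadata_genus_species.yml per organism, metadata.yml at the top level) and using PyYAML from the requirements; the real merge logic of the script may differ.

import glob
import os

import yaml

def aggregate_metadata(main_directory):
    # Collect every per-organism metadata file and merge them into the
    # top-level metadata.yml of the main directory.
    merged = {}
    for path in glob.glob(os.path.join(main_directory, "*", "metadata_*.yml")):
        with open(path) as handle:
            merged[os.path.basename(os.path.dirname(path))] = yaml.safe_load(handle) or {}
    with open(os.path.join(main_directory, "metadata.yml"), "w") as handle:
        yaml.dump(merged, handle, default_flow_style=False)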

nginx conf (WIP):

The default.conf is generated automatically (with automatic port assignment), and the APIs can bypass authentication (for bioblend, a master API key is set when the docker-compose.yml of the organism is created).
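A minimal sketch of automatic port assignment, assuming the port is picked by binding to port 0 and then injected into a server block template; the template below (including the galaxy service name it proxies to) is purely illustrative and not the actual default.conf generated by the script.

import socket

def find_free_port():
    # Ask the OS for an available TCP port by binding to port 0.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]

NGINX_TEMPLATE = """
server {{
    listen {port};
    location / {{
        # Forward requests to the galaxy service of this organism's docker-compose stack
        # (service name assumed here for illustration).
        proxy_pass http://galaxy:80/;
    }}
}}
"""

def write_default_conf(conf_path):
    port = find_free_port()
    with open(conf_path, "w") as handle:
        handle.write(NGINX_TEMPLATE.format(port=port))
    return port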

Directory tree:

For every input organism, the script creates the following directory structure, or updates it if it already exists. It also updates the files in the main directory to account for the new organisms being integrated. A creation sketch follows the tree below.

/main_directory
|
|---/genus1_species1
|   |
|   |---/blast
|   |   |---/links.yml
|   |   |---/banks.yml
|   |
|   |---/nginx
|   |   |---/conf
|   |       |---/default.conf
|   |
|   |---/src_data
|   |   |---/genome
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/vX.X
|   |   |           |---/genus_species_vX.X.fasta
|   |   |
|   |   |---/annotation
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/OGSX.X
|   |   |           |---/OGSX.X.gff
|   |   |           |---/OGSX.X_pep.fasta
|   |   |           |---/OGSX.X_transcripts.fasta
|   |   |
|   |   |---/tracks
|   |       |---/genus1_species1_strain_sex
|   |
|   |---/apollo
|   |   |---/annotation_groups.tsv
|   |
|   |---/docker-compose.yml
|   |
|   |---/metadata_genus1_species1.yml
|
|---/metadata.yml
|
|---/traefik
    |---/authelia
        |---/users.yml
        |---/configuration.yml
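A minimal sketch of how this tree can be created while leaving an existing one untouched, using os.makedirs with exist_ok; the helper name and the way the strain/sex suffix is built are assumptions for the example.

import os

def make_species_dirs(main_directory, genus, species, strain="", sex=""):
    # Create (or reuse) the per-organism directory tree described above.
    species_dir = os.path.join(main_directory, "{0}_{1}".format(genus, species))
    dataset_id = "_".join(part for part in (genus, species, strain, sex) if part)
    subdirs = [
        "blast",
        os.path.join("nginx", "conf"),
        os.path.join("src_data", "genome", dataset_id),
        os.path.join("src_data", "annotation", dataset_id),
        os.path.join("src_data", "tracks", dataset_id),
        "apollo",
    ]
    for subdir in subdirs:
        # exist_ok leaves directories that already exist untouched.
        os.makedirs(os.path.join(species_dir, subdir), exist_ok=True)
    return species_dir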

Steps:

For each input organism:

  1. parse the tabulated or JSON input
  2. create the docker-compose.yml for the organism (plus its default.conf, and edit the main_proxy nginx default.conf for the docker-compose docker configuration)
  3. create the directory tree structure (if it already exists, only create the missing directories)
  4. gather files in the "source data" directory tree; the directory can be searched recursively (by default, the source-data folder is fixed for phaeoexplorer data, and this default directory can be changed in the attributes of the Autoload class in autoload.py)
  5. link the source files to the organism's src_data folders
  6. modify the headers of the transcripts and proteins fasta files (see the header-rewriting sketch after this list)
  7. generate blast banks (no commit)
  8. start the containers
  9. connect to the galaxy instance (see the bioblend sketch after this list)
  10. run the data integration galaxy steps (see http://gitlab.sb-roscoff.fr/abims/e-infra/gga)
  11. generate and update the metadata files
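As an illustration of step 6, here is a minimal sketch of rewriting fasta headers. The prefixing scheme (prepending an organism identifier) is only an assumption for the example; the actual header convention applied by the script may differ.

def rewrite_fasta_headers(input_path, output_path, organism_prefix):
    # Prefix every fasta header with an organism identifier, leaving the sequences untouched.
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(">{0}_{1}".format(organism_prefix, line[1:].lstrip()))
            else:
                dst.write(line)

For steps 8 and 9, connecting to the organism's Galaxy instance can be sketched with bioblend (listed in the requirements below). The URL and the master API key are placeholders; the key corresponds to the one written into the organism's docker-compose.yml (see the nginx section above).

from bioblend.galaxy import GalaxyInstance

def connect_to_galaxy(url, master_api_key):
    # The master API key lets the script bypass regular authentication.
    galaxy = GalaxyInstance(url=url, key=master_api_key)
    # Lightweight call to check that the instance is ready before running the data loading steps.
    galaxy.histories.get_histories()
    return galaxy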

Usage (production):

In progress

Requirements:

bioblend==0.13.0
boto==2.49.0
certifi==2019.11.28
cffi==1.14.0
chardet==3.0.4
cryptography==2.8
idna==2.9
numpy==1.18.1
pandas==1.0.3
pycparser==2.20
pyOpenSSL==19.1.0
PySocks==1.7.1
python-dateutil==2.8.1
pytz==2019.3
PyYAML==5.3.1
requests==2.23.0
requests-toolbelt==0.9.1
six==1.14.0
urllib3==1.25.7
xlrd==1.2.0