# gga_load_data (WIP)

Automated integration of new organisms into GGA instances.
## Description:
This script automatically integrates new organisms into GGA instances as part of the phaeoexplorer project.
As input, the script takes either a tabulated file (xls, xlsx or csv) or a json file describing the organisms for which it has to create or update instances.
For each organism to be integrated, the script needs at least its genus and species. Strain, sex, and the genome and annotation file versions are optional: the two versions default to 1.0, while strain and sex are left empty and are not considered during the integration process.
See the example datasets (example.json and example.xlsx) for the information that can be described and the correct formatting of these input files. The script should then take care of everything (for phaeoexplorer organisms), from generating the directory tree to running workflows and tools in the galaxy instance.
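
The defaults for optional fields can be illustrated with a minimal sketch. This is not the code from autoload.py: the key names (`genus`, `species`, `strain`, `sex`, `genome_version`, `ogs_version`) and both helper functions are assumptions made for illustration, and the real field names are the ones used in example.json.

```python
import json

# Hypothetical key names; check example.json for the real ones.
REQUIRED_KEYS = ("genus", "species")
DEFAULTS = {
    "strain": "",
    "sex": "",
    "genome_version": "1.0",
    "ogs_version": "1.0",
}


def normalize_organism(entry):
    """Return a copy of entry with missing optional fields set to defaults."""
    missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    organism = dict(DEFAULTS)
    # Only values that were actually provided override the defaults
    organism.update({k: v for k, v in entry.items() if v not in (None, "")})
    return organism


def load_organisms(json_text):
    """Parse a JSON list of organism descriptions."""
    return [normalize_organism(entry) for entry in json.loads(json_text)]
```

An entry that only gives a genus and a species would therefore come back with empty strain and sex and both versions set to "1.0", while an entry without a species would be rejected.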
## TODO: 
- ready the script for production (prepare production arguments) + remove dev args for master
- metadata
- call the scripts for formatting data and generating blast banks

## Metadata files (WIP):
The script also generates a metadata file in the directory of the newly integrated species, summarizing the actions taken for this organism (see meta_toy.yaml for the kind of information it can contain). It also creates another metadata file in the main directory (where you put all the organisms you have integrated), which aggregates the metadata files of all integrated organisms. These metadata files are also updated when an existing instance is updated.
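
Aggregating per-organism metadata into the main metadata file could look like the sketch below. It works on plain dictionaries (as a YAML loader would return them). The `merge_metadata` helper and the keys in the usage lines are hypothetical; the real layout is the one shown in meta_toy.yaml.

```python
def merge_metadata(main_metadata, organism_name, organism_metadata):
    """Return a new main metadata dict with one organism added or replaced."""
    merged = dict(main_metadata)
    # Updating an existing instance overwrites its previous entry
    merged[organism_name] = dict(organism_metadata)
    return merged


main = {}
main = merge_metadata(main, "genus1_species1", {"genome_version": "1.0"})
main = merge_metadata(main, "genus2_species2", {"genome_version": "1.0"})
# Re-running for an organism that is already present replaces its entry
main = merge_metadata(main, "genus1_species1", {"genome_version": "2.0"})
```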

## nginx conf (WIP):
The default.conf will be generated automatically (with automatic port assignment). APIs will be able to bypass authentication (for bioblend, a master key is set when the docker-compose.yml of the organism is created).

## Directory tree:
For every input organism, the script creates the following directory structure, or tries to update it if it already exists.
It also updates the files in the main directory to account for the newly integrated organisms.

```
/main_directory
|
|---/genus1_species1
|   |
|   |---/blast
|   |   |---/links.yml
|   |   |---/banks.yml
|   |
|   |---/nginx
|   |   |---/conf
|   |       |---/default.conf
|   |
|   |---/src_data
|   |   |---/genome
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/vX.X
|   |   |           |---/genus_species_vX.X.fasta
|   |   |
|   |   |---/annotation
|   |   |   |---/genus1_species1_strain_sex
|   |   |       |---/OGSX.X
|   |   |           |---/OGSX.X.gff
|   |   |           |---/OGSX.X_pep.fasta
|   |   |           |---/OGSX.X_transcripts.fasta
|   |   |
|   |   |---/tracks
|   |       |---/genus1_species1_strain_sex
|   |
|   |---/apollo
|   |   |---/annotation_groups.tsv
|   |
|   |---/docker-compose.yml
|   |
|   |---/metada_genus1_species1.yml
|
|---/metadata.yml
|
|---/traefik
    |---/authelia
        |---/users.yml
        |---/configuration.yml
```
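
Creating the directory part of this tree can be sketched with `pathlib`, as below. This is only an illustration under stated assumptions: the `create_tree` helper is hypothetical, and the way the folder name is built when strain or sex is empty (empty parts are simply dropped) is a guess, not confirmed behaviour of the script.

```python
from pathlib import Path


def create_tree(main_dir, genus, species, strain="", sex=""):
    """Create the per-organism directories, keeping those that already exist."""
    organism_dir = Path(main_dir) / f"{genus}_{species}"
    # Assumption: empty strain/sex values are left out of the folder name
    dataset = "_".join(part for part in (genus, species, strain, sex) if part)
    subdirs = [
        Path("blast"),
        Path("nginx") / "conf",
        Path("src_data") / "genome" / dataset,
        Path("src_data") / "annotation" / dataset,
        Path("src_data") / "tracks" / dataset,
        Path("apollo"),
    ]
    for subdir in subdirs:
        # exist_ok=True makes a second run on the same organism a no-op
        (organism_dir / subdir).mkdir(parents=True, exist_ok=True)
    return organism_dir
```

Because `mkdir` is called with `exist_ok=True`, running the function again for an organism that is already integrated only adds the directories that are missing.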

## Steps:
For each input organism:
1) parse the tabulated input
2) create the docker-compose.yml for the organism (plus its default.conf, and edit the main_proxy nginx default.conf to match the docker-compose configuration)
3) create the directory tree structure (if it already exists, only create the directories that are missing)
4) gather files in the "source data" directory tree; the search can be recursive (by default, the source-data folder is fixed for phaeoexplorer data; this default directory can be set in the attributes of the Autoload class in autoload.py)
5) link the source files to the correct src_data folders of the organism
6) modify headers in the transcripts and protein fasta files
7) generate blast banks (no commit)
8) start the containers
9) connect to the galaxy instance
10) run the data integration galaxy steps (see http://gitlab.sb-roscoff.fr/abims/e-infra/gga)
11) generate and update metadata files
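
Steps 4 and 5 can be sketched as follows. Both helpers are hypothetical and are not taken from autoload.py. In particular, the use of symbolic links and of glob patterns to select files are assumptions about how "gathering" and "linking" are done.

```python
from pathlib import Path


def find_source_files(source_dir, patterns):
    """Recursively collect files under source_dir matching any glob pattern."""
    found = set()
    for pattern in patterns:
        found.update(p for p in Path(source_dir).rglob(pattern) if p.is_file())
    return sorted(found)


def link_files(files, target_dir):
    """Symlink each file into target_dir, leaving existing links untouched."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for source in files:
        link = target_dir / source.name
        # Skip names that are already present so that reruns do not fail
        if not link.exists() and not link.is_symlink():
            link.symlink_to(source.resolve())
        links.append(link)
    return links
```

A call such as `link_files(find_source_files(source_dir, ["*.fasta"]), genome_dir)` would then populate a genome folder, with `source_dir` and `genome_dir` standing in for the actual paths.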

## Usage (production):
In progress

## Requirements:
bioblend==0.13.0  
boto==2.49.0  
certifi==2019.11.28  
cffi==1.14.0  
chardet==3.0.4  
cryptography==2.8  
idna==2.9  
numpy==1.18.1  
pandas==1.0.3  
pycparser==2.20  
pyOpenSSL==19.1.0  
PySocks==1.7.1  
python-dateutil==2.8.1  
pytz==2019.3  
PyYAML==5.3.1  
requests==2.23.0  
requests-toolbelt==0.9.1  
six==1.14.0  
urllib3==1.25.7  
xlrd==1.2.0