diff --git a/README.md b/README.md index ede2f8e45720dd6e8f1d22eeaca8758d7ae7cc4f..1c2aa4ab8bac43dceeca593f9b91dafe2a82f664 100755 --- a/README.md +++ b/README.md @@ -3,21 +3,22 @@ Automated integration of new organisms into GGA environments as a form of a docker stack of services. ## Description: -Automatically generate functional GGA environments from a descriptive input file. -See example datasets (example.json, example.yml or example.xlsx) for an example of what information can be described +Automatically generate functional GGA environments from a descriptive input yaml file. +See example datasets (example.yml) for an example of what information can be described and the correct formatting of these input files. -"gga_load_data" in its current version is divided in three (automated) parts: -- Create the stacks of services for the input organisms (orchestrated using docker swarm, with traefik used as a networking interface between the different stacks) -- Load the organisms datasets into the galaxy instance -- Remotely run a custom workflow in galaxy +"gga_load_data" in its current version is divided in 4 parts: + +- gga_init: Create directory tree and deploy stacks for the input organisms as well as Traefik and optionally Authelia stacks +- gga_get_data: Copy datasets for the input organisms into the organisms directory tree +- gga_load_data: Load the datasets of the input organisms into a library in their galaxy instance +- run_workflow_phaeoexplorer: Remotely run a custom workflow in galaxy, proposed as an "example script" to take inspiration from as workflow parameters are specific to Phaeoexplorer data ## Metadata files (WIP): A metadata file will be generated to summarize what actions have previously been taken inside a stack. ## Directory tree: -For every input organism, a dedicated directory is created. The script will create this directory and all subdirectories -required. +For every input organism, a dedicated directory is created. 
The script will create this directory and all subdirectories required. If the user is adding new data to a species (for example adding another strain/sex's datasets to the same species), the directory tree will be updated @@ -40,10 +41,10 @@ Directory tree structure: | | |---/links.yml | | | |---/docker_data # Data used internally by docker (do not delete!) -| | +| | | |---/src_data | | |---/genome -| | | |---/genus1_species1_strain_sex +| | | |---/genus1_species1_strain_sex | | | |---/vX.X | | | |---/genus_species_vX.X.fasta | | | @@ -56,8 +57,8 @@ Directory tree structure: | | | | | |---/tracks | | |---/genus1_species1_strain_sex -| | -| |---/apollo +| | +| |---/apollo | | |---/annotation_groups.tsv | | | |---/docker-compose.yml @@ -75,37 +76,32 @@ Directory tree structure: ``` ## Steps: -For each input organism, the tool works in three parts (1 part = 1 separate script). - -**The first two parts are required to set up a functional GGA stack** +For each input organism, the tool is used as a set of separate steps/scripts. **Part 1)** 1) Create the directory tree structure (if it already exists, only create the required subdirectories) +2) Create the docker-compose file for the organism and deploy the stack of services -2) Create the docker-compose file for the organism and deploy the stack of services. - - -**Warning: the Galaxy service takes up to 2 hours to be set up. During these 2 hours it can't be interacted with, wait at least 2 hours -before calling the other scripts** +**Warning: the Galaxy service takes up to 2 hours to be set up (for backup purposes). 
During these 2 hours it can't be interacted with; wait at least 2 hours before calling the other scripts** **Part 2)** -3) Gather source data files as specified in the input, can recursively search the directory (fully automated for local phaeoexplorer data) +3) Find source data files as specified in the input file and copy these source files to the organism's correct src_data subfolders -4) Link the source files to the organism correct src_data folders and load all the data into the galaxy container as a galaxy library +**Part 3)** -*(Optional)* **Part 3)** +4) Create a galaxy library and load the datasets into it. Set up the galaxy instance and history(-ies) -5) (*Optional*) Modify headers in the transcripts and protein fasta files +**Part 4)** (Script only available to Phaeoexplorer members) -6) (*Optional*) TODO: Generate blast banks (no commit) +5) Modify headers in the transcripts and protein fasta files +6) Transfer manual annotation descriptions and hectar descriptions to the organism's GFF file -7) (*Optional*) Connect to the galaxy instance +**Part 5)** -8) (*Optional*) Run data integration galaxy steps (see http://gitlab.sb-roscoff.fr/abims/e-infra/gga) +7) Configure and invoke a workflow for the organism -9) (*Optional*) TODO: Generate and update metadata files ## Usage: The scripts all take one mandatory input file that describes your species and their associated data @@ -115,30 +111,48 @@ You must also fill in a "config" file containing sensible variables (galaxy and the script will read to create the different services and to access the galaxy container. 
By default, the config file inside the repository root will be used if none is precised in the command line -- Deploy stacks part: ```$ python3 /path/to/repo/gga_init.py your_input_file.yml -c/--config your_config_file [-v/--verbose]``` +**Warning: the config file is not required as an option for the "gga_get_data" script** + +- Deploy stacks part: ```$ python3 /path/to/repo/gga_init.py your_input_file.yml -c/--config your_config_file [-v/--verbose] [OPTIONS]``` + OPTIONS + --main-directory $PATH (Path where to create/update stacks; default=current directory) + --traefik (If specified, will try to start or overwrite traefik+authelia stack; default=False) + --http (Use a http traefik+authelia configuration; default=False) + --https (Use a https traefik+authelia configuration, might require a certificate for the hostname; default=True) + +- Copy source data file: ```$ python3 /path/to/repo/gga_get_data.py your_input_file.yml [-v/--verbose] [OPTIONS]``` + OPTIONS + --main-directory $PATH (Path where to access stacks; default=current directory) -- Copy source data file and load data into the galaxy container: ```$ python3 /path/to/repo/gga_load_data.py your_input_file.yml -c/--config your_config_file [-v/--verbose]``` +- Load data in galaxy library and prepare galaxy instance: ```$ python3 /path/to/repo/gga_load_data.py your_input_file.yml -c/--config your_config_file [-v/--verbose]``` + OPTIONS + --main-directory $PATH (Path where to access stacks; default=current directory) -- Run a workflow (currently for phaeoexplorer only): ```$ python3 /path/to/repo/run_workflow_phaeoexplorer.py your_input_file.yml -c/--config your_config_file [-v/--verbose] -w/--workflow your_workflow``` +- Run a workflow in galaxy: ```$ python3 /path/to/repo/gga_load_data.py your_input_file.yml -c/--config your_config_file --workflow /path/to/workflow.ga [-v/--verbose] [OPTIONS]``` + --workflow $WORKFLOW + Path to the workflow to run in galaxy. 
A couple of preset workflows are available in the "workflows" folder of the repository + OPTIONS + --main-directory $PATH (Path where to access stacks; default=current directory) + --setup (Set up the organism instance: create organism and analyses, get their IDs, etc.. This option is MANDATORY when first running the script in a new galaxy instance or else the script will not be able to set runtime parameters for the workflows) -**Warning: the "input file" and "config file" have to be the same for the 3 steps!** +**Warning: the "input file" and "config file" have to be the same for all scripts!** ## Current limitations -When deploying the stack of services, the galaxy service takes a long time to be ready (around 2 hours of wait time) +When deploying the stack of services, the galaxy service takes a long time to be ready (around 2 hours of wait time). This is due to the galaxy container preparing a persistent location for the container data. -For the moment, the stacks deployment and the data loading into galaxy should be run separately (only once the galaxy service is ready) +The stacks deployment and the data loading into galaxy should hence be run separately and only once the galaxy service is ready -To check the status of the galaxy service, run ```$ docker service logs -f genus_species_galaxy``` or +To check the status of the galaxy service, you can run ```$ docker service logs -f genus_species_galaxy``` or ```./serexec genus_species_galaxy supervisorctl status``` to verify directly from the container -*(The "gga_load_data.py" script will check on the galaxy container anyway and will exit if it's not ready)* +*(The "gga_load_data.py" script will check on the galaxy container anyway and will exit while notifying you it is not ready)* ## Requirements (*temporary*): Requires Python 3.7+ Packages required: ``` -bioblend==0.14.0 +bioblend==0.14.0 boto==2.49.0 certifi==2019.11.28 cffi==1.14.0 diff --git a/examples/config_example.yml b/examples/config_example.yml new 
file mode 100644 index 0000000000000000000000000000000000000000..13acc3763da48e6b92768244643fdd8fd1a50918 --- /dev/null +++ b/examples/config_example.yml @@ -0,0 +1,27 @@ +# This is the configuration file used by the gga_init.py, gga_load_data.py and run_workflow.py scripts +# It contains (sensitive) variables to set up different docker services and should not be committed when filled + +# "all" section contains variables used by several services at once or the paths to import sensitive files that cannot be procedurally generated/formatted using the scripts +all: + hostname: localhost # The hosting machine name + dashboard_port: 8001 + http_port: 8888 + https_port: 8889 + proxy_ip: XXXXXXXXXXXX # IP of the upstream proxy (for traefik) + auth_hostname: XXXXXXXXXXXX + authelia_config_path: /path/to/authelia_config.yml.j2 # Path to the custom authelia configuration file (j2 template) +# galaxy-specific environment variables +galaxy: + galaxy_default_admin_email: gga@galaxy.org + galaxy_defaut_admin_user: gga + galaxy_default_admin_password: password + galaxy_config_master_api_key: master + webapollo_user: admin_apollo@galaxy.org + webapollo_password: apollopass +# tripal-specific variables +tripal: + tripal_password: tripalpass # Tripal database password (also used by galaxy as an environment variable) + banner_path: /home/fr2424/sib/alebars/projects/gga_load_data/misc/banner.png # Custom banner path + tripal_theme_name: abims # Use this to use another theme + tripal_theme_git_clone: http://gitlab.sb-roscoff.fr/abims/e-infra/tripal_abims.git # Use this to install another theme (cannot be named custom_theme_git_clone currently) + diff --git a/examples/example.yml b/examples/example.yml new file mode 100644 index 0000000000000000000000000000000000000000..e718673e9de3e25f2412ed777d309121013b0bdd --- /dev/null +++ b/examples/example.yml @@ -0,0 +1,68 @@ +# Input file for the automated creation of GGA docker stacks +# The file consists of a "list" of species for which the 
script will have to create these stacks/load data into galaxy/run workflows +# This file is internally turned into a list of dictionaries by the scripts + +ectocarpus_sp2_male: # Dummy value the user gives to designate the species (isn't used by the script) + description: + # Species description, leave blank if unknown or you don't want it to be used + # These parameters are used to set up the various urls and addresses in different containers + # The script requires at least the genus to be specified + genus: "ectocarpus" # Mandatory! + species: "sp2" # Mandatory! + sex: "male" + strain: "" + common_name: "" + origin: "" + # the sex and strain, the script will look for files containing the genus, species, sex and strain of the species) + # If no file corresponding to the description is found, this path will be considered empty and the script will + # proceed to the next step (create the directory tree for the GGA docker stack) + data: + # Sequence of paths to the different datasets to copy and import into galaxy + # Paths must be absolute paths + genome_path: "/path/to/fasta" # Mandatory! + transcripts_path: "/path/to/fasta" # Mandatory! + proteins_path: "/path/to/fasta" # Mandatory! + gff_path: "/path/to/gff" # Mandatory! + interpro_path: "/path/to/interpro" + orthofinder_path: "/path/to/orthofinder" + blastp_path: "/path/to/blastp" + blastx_path: "/path/to/blastx" + # If the user has several datasets of the same 'nature' (gff, genomes, ...) 
to upload to galaxy, the next scalar is used by the script to differentiate + # between these different versions and name directories according to it and not overwrite the existing data + # If left empty, the genome will be considered version "1.0" + genome_version: "1.0" + # Same as genome version, but for the OGS analysis + ogs_version: "" + performed_by: "" + services: + # Describe what optional services to deploy for the stack + # By default, only tripal, tripaldb and galaxy services will be deployed + blast: "False" + wiki: "False" + apollo: "False" + +# Second example without the explanation +ectocarpus_sp2_female: + description: + genus: "ectocarpus" + species: "sp2" + sex: "female" + strain: "" + common_name: "" + origin: "" + data: + genome_path: "/path/to/fasta" + transcripts_path: "/path/to/fasta" + proteins_path: "/path/to/fasta" + gff_path: "/path/to/gff" + interpro_path: "/path/to/interpro" + orthofinder_path: "/path/to/orthofinder" + blastp_path: "/path/to/blastp" + blastx_path: "/path/to/blastx" + genome_version: "1.0" + ogs_version: "1.0" + performed_by: "" + services: + blast: "False" + wiki: "False" + apollo: "False" \ No newline at end of file diff --git a/examples/input_example.yml b/examples/input_example.yml index d41de54eed68090bb650952d4fa4ebcacbd34a53..e718673e9de3e25f2412ed777d309121013b0bdd 100644 --- a/examples/input_example.yml +++ b/examples/input_example.yml @@ -1,46 +1,47 @@ # Input file for the automated creation GGA docker stacks # The file consists in a "list" of species for which the script will have to create these stacks/load data into galaxy/run workflows +# This file is internally turned into a list of dictionaries by the scripts ectocarpus_sp2_male: # Dummy value the user gives to designate the species (isn't used by the script) + description: # Species description, leave blank if unknown or you don't want it to be used # These parameters are used to set up the various urls and adresses in different containers # The script 
requires at least the genus to be specified - description: genus: "ectocarpus" # Mandatory! - species: "sp2" + species: "sp2" # Mandatory! sex: "male" strain: "" common_name: "" origin: "" - # Data files scalars contain paths to the source files that have to be loaded into galaxy - # WARNING: The paths must be absolute paths! - # If any path is left blank and the "parent_directory" scalar is specified, this directory and ALL its subdirectories will be - # scanned for files corresponding to the description provided for the species (i.e if the user specified # the sex and strain, the script will look for files containing the genus, species, sex and strain of the species) # If no file corresponding to the description is found, this path will be considered empty and the script will # proceed to the next step (create the directory tree for the GGA docker stack) - # If a path is left blank and the "parent_directory" scalar is also blank, no file will be loaded for this "path" scalar - # If the files are not named using this nomenclature, please provide all the paths in the corresponding scalars below data: - # "parent_directory": (optional) directory from where to search files if a "***_path" scalar is empty - # NOTE: Try to set a parent directory "close" to the data files so as not to increase runtime - # If empty (""), the script will not search for files and no dataset will be loaded for the corresponding scalar - parent_directory: "/path/to/closest/parent/dir" - # "***_path": path to the file (optional if parent_directory is set and species "description" scalars are precised) - # TODO Not implemented yet - genome_path: "/path/to/fasta" - transcripts_path: "/path/to/fasta" - proteins_path: "/path/to/fasta" - gff_path: "/path/to/gff" - # If the user has several genomes to upload to galaxy, the next scalar is used by the script to differentiate - # between these different versions and name directories according to it. 
+ # Sequence of paths to the different datasets to copy and import into galaxy + # Paths must be absolute paths + genome_path: "/path/to/fasta" # Mandatory! + transcripts_path: "/path/to/fasta" # Mandatory! + proteins_path: "/path/to/fasta" # Mandatory! + gff_path: "/path/to/gff" # Mandatory! + interpro_path: "/path/to/interpro" + orthofinder_path: "/path/to/orthofinder" + blastp_path: "/path/to/blastp" + blastx_path: "/path/to/blastx" + # If the user has several datasets of the same 'nature' (gff, genomes, ...) to upload to galaxy, the next scalar is used by the script to differentiate + # between these different versions and name directories according to it and not overwrite the existing data # If left empty, the genome will be considered version "1.0" genome_version: "1.0" - # Same as genome version, but for the analysis + # Same as genome version, but for the OGS analysis ogs_version: "" performed_by: "" + services: + # Describe what optional services to deploy for the stack + # By default, only tripal, tripaldb and galaxy services will be deployed + blast: "False" + wiki: "False" + apollo: "False" -# Second example without the comments doc +# Second example without the explanation ectocarpus_sp2_female: description: genus: "ectocarpus" @@ -50,11 +51,18 @@ ectocarpus_sp2_female: common_name: "" origin: "" data: - parent_directory: "/path/to/closest/parent/dir" genome_path: "/path/to/fasta" transcripts_path: "/path/to/fasta" proteins_path: "/path/to/fasta" gff_path: "/path/to/gff" + interpro_path: "/path/to/interpro" + orthofinder_path: "/path/to/orthofinder" + blastp_path: "/path/to/blastp" + blastx_path: "/path/to/blastx" genome_version: "1.0" ogs_version: "1.0" performed_by: "" + services: + blast: "False" + wiki: "False" + apollo: "False" \ No newline at end of file
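
Note on the input format introduced by this diff: the example input files mark `genus`, `species`, `genome_path`, `transcripts_path`, `proteins_path` and `gff_path` as "Mandatory!". The following is a minimal sketch of how such a check could look, assuming the YAML input has already been parsed into a plain dict keyed by the organism label. The helper name `missing_mandatory_fields` and the `MANDATORY` table are hypothetical illustrations and are not part of the repository; the actual scripts may validate the input differently.

```python
# Hypothetical helper, not part of the repository: it only mirrors the keys
# that examples/example.yml flags with "# Mandatory!".
MANDATORY = {
    "description": ("genus", "species"),
    "data": ("genome_path", "transcripts_path", "proteins_path", "gff_path"),
}


def missing_mandatory_fields(organisms):
    """Return {organism_label: ["section.key", ...]} for incomplete entries."""
    problems = {}
    for label, entry in organisms.items():
        missing = []
        for section, keys in MANDATORY.items():
            # A missing or empty section is treated as an empty mapping.
            values = (entry or {}).get(section) or {}
            for key in keys:
                # Empty strings count as missing, like the "" placeholders
                # used for optional scalars in the example file.
                if not values.get(key):
                    missing.append(f"{section}.{key}")
        if missing:
            problems[label] = missing
    return problems


# Already-parsed stand-in for one organism entry of the input file.
example = {
    "ectocarpus_sp2_male": {
        "description": {"genus": "ectocarpus", "species": "sp2", "sex": "male"},
        "data": {
            "genome_path": "/path/to/fasta",
            "transcripts_path": "/path/to/fasta",
            "proteins_path": "/path/to/fasta",
            "gff_path": "",  # left blank on purpose to trigger the check
        },
    }
}

print(missing_mandatory_fields(example))
```

Working on the parsed dict keeps the sketch independent of any particular YAML loader; in practice the dict would come from whatever parser the scripts already use for the input file.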