Index of /datafiles/datasets/Aqua-Faang

Name                       Last modified      Size

Parent Directory                              -
README.html                2024-02-15 19:43   799K
blacklist/                 2024-02-06 11:22   -
chromatin_states/          2024-02-28 16:02   -
figures/                   2024-02-15 19:22   -
nfcore/                    2022-10-19 11:21   -
robust_ATAC_peaks/         2024-02-12 14:45   -
salmon-trout-orthologs/    2024-04-23 07:54   -
trackhub/                  2024-02-29 13:24   -

Salmobase - AQUA-FAANG dataset

RNA-seq, ATAC-seq and ChIP-seq (four histone marks) of tissues and developing embryos.

Relevant species (assemblies):

  • Rainbow trout (Assembly: USDA_OmykA_1.1)
  • Atlantic salmon (Assembly: Ssal_v3.1)

About AQUA-FAANG

AQUA-FAANG is a European research initiative that aims to improve understanding of genome function and the use of genotype-to-phenotype prediction in the six most important European farmed fish species: European seabass, gilthead seabream, rainbow trout, Atlantic salmon, common carp and turbot (https://www.aqua-faang.eu/).

The FAANG part of the name comes from “Functional Annotation of ANimal Genomes”, a larger project aimed at producing comprehensive maps of functional elements in the genomes of domesticated animal species.

A major part of this project is to provide functional genomics data, including RNA-seq, ChIP-seq and ATAC-seq from developing embryos (DevMap) and from multiple tissues from grown fish (BodyMap).

Data source

Raw data and sample metadata are available at ENA (https://www.ebi.ac.uk/ena) or via the FAANG data portal (https://data.faang.org/projects/AQUA-FAANG).

ENA Accession numbers:

              RNA          ATAC         ChIP
DevMap AS     PRJEB51855   PRJEB51854   PRJEB53399
DevMap RT     PRJEB51857   PRJEB51856   PRJEB55010
BodyMap AS    PRJEB47409   PRJEB47408   PRJEB55063
BodyMap RT    PRJEB57191   PRJEB57190   PRJEB57956

AS: Atlantic salmon, RT: rainbow trout.
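As a convenience, the accession table can be held in a small lookup structure. The sketch below is only an illustration: the function name `ena_url` is hypothetical, the reading of AS and RT as Atlantic salmon and rainbow trout is an assumption, and so is the ENA browser URL pattern used to build a link.

```python
# ENA study accessions copied from the table above.
# Assumption: "AS" = Atlantic salmon, "RT" = rainbow trout.
ACCESSIONS = {
    ("DevMap", "AS"): {"RNA": "PRJEB51855", "ATAC": "PRJEB51854", "ChIP": "PRJEB53399"},
    ("DevMap", "RT"): {"RNA": "PRJEB51857", "ATAC": "PRJEB51856", "ChIP": "PRJEB55010"},
    ("BodyMap", "AS"): {"RNA": "PRJEB47409", "ATAC": "PRJEB47408", "ChIP": "PRJEB55063"},
    ("BodyMap", "RT"): {"RNA": "PRJEB57191", "ATAC": "PRJEB57190", "ChIP": "PRJEB57956"},
}

# Assumption: the ENA browser shows a study page under this URL pattern.
ENA_VIEW = "https://www.ebi.ac.uk/ena/browser/view/{accession}"


def ena_url(dataset: str, species: str, assay: str) -> str:
    """Return a browser link for one dataset/species/assay combination."""
    accession = ACCESSIONS[(dataset, species)][assay]
    return ENA_VIEW.format(accession=accession)


if __name__ == "__main__":
    print(ena_url("BodyMap", "RT", "ChIP"))
```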

Data available on salmobase

Salmobase hosts processed data for the DevMap and BodyMap of rainbow trout (Assembly: USDA_OmykA_1.1) and Atlantic salmon (Assembly: Ssal_v3.1).

DevMap samples

Embryos were sampled at 14 different developmental stages. All 14 stages were subject to RNA sequencing, while a subset of 5 stages were also subject to ATAC- and ChIP-seq. Developmental stages in chronological order:

  1. LateCleavage
  2. EarlyBlastulation
  3. MidBlastulation
  4. LateBlastulation*
  5. EarlyGastrulation
  6. MidGastrulation*
  7. LateGastrulation
  8. EarlySomitogenesis*
  9. EarlyMidSomitogenesis
  10. MidSomitogenesis*
  11. MidLateSomitogenesis
  12. LateSomitogenesis*
  13. EarlyEyed
  14. LateEyed

*: stages that were subject to ATAC- and ChIP-seq

Samples are pools of multiple embryos. Each stage was sampled in three replicates, with two exceptions: there is only one ChIP input per stage, and one replicate is missing from rainbow trout ATAC-seq at LateSomitogenesis.

BodyMap samples

Tissues were sampled at two different stages of sexual maturity (immature and mature). For each maturity stage, three male and three female fish were sampled, giving 12 (2 x 2 x 3) individual fish per species. The following tissues were sampled from each fish:

  • Brain
  • Gill (no ATAC)
  • Gonad (no ChIP for rainbow trout)
  • HeadKidney (RNA only)
  • Liver
  • Muscle
  • SuppDistalIntestine (RNA only)

Note: ChIP-seq was only performed on two of the three replicates. Exceptions: salmon gonads have all three replicates. Rainbow trout has no gonad ChIP-seq. Rainbow trout brain has only a single replicate of the immature stages.

Note 2: Gill ATAC-seq turned out to be challenging and was discarded because of low quality.

ChIP-seq targets

ChIP-seq was performed with 4 different antibodies targeting histone modifications:

  • H3K27ac
  • H3K27me3
  • H3K4me1
  • H3K4me3

In addition there is the input (control) sample.

Note: CTCF ChIP-seq was performed but the data were discarded because of low quality.

Sample naming scheme

On Salmobase the samples have been named based on the metadata in the following format:

DevMap: [species]_[assay]_[devstage]_[repNr]

BodyMap: [species]_[assay]_[tissue]_[maturity]_[sex]_[repNr]

Examples:

  • RainbowTrout_ATAC_MidGastrulation_R1
  • RainbowTrout_ChIP-H3K4me3_MidSomitogenesis_R2
  • AtlanticSalmon_RNA_Liver_Mature_Female_R3
  • AtlanticSalmon_ChIP-H3K27ac_Gill_Immature_Male_R1
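The naming scheme above can be split back into its metadata fields. The following is a minimal sketch, not part of the Salmobase tooling: the `Sample` type and `parse_sample_name` function are hypothetical names, and the code assumes that no field contains an underscore and that the replicate is always written as `R` followed by a number, as in the examples.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    dataset: str              # "DevMap" or "BodyMap", inferred from the field count
    species: str
    assay: str                # e.g. "RNA", "ATAC", "ChIP-H3K27ac"
    condition: str            # developmental stage (DevMap) or tissue (BodyMap)
    maturity: Optional[str]   # BodyMap only
    sex: Optional[str]        # BodyMap only
    replicate: int


def parse_sample_name(name: str) -> Sample:
    """Split a sample name into fields (assumes no underscores inside a field)."""
    parts = name.split("_")
    rep = parts[-1]
    if not (rep.startswith("R") and rep[1:].isdigit()):
        raise ValueError(f"unexpected replicate field in {name!r}")
    replicate = int(rep[1:])
    if len(parts) == 4:   # [species]_[assay]_[devstage]_[repNr]
        species, assay, stage, _ = parts
        return Sample("DevMap", species, assay, stage, None, None, replicate)
    if len(parts) == 6:   # [species]_[assay]_[tissue]_[maturity]_[sex]_[repNr]
        species, assay, tissue, maturity, sex, _ = parts
        return Sample("BodyMap", species, assay, tissue, maturity, sex, replicate)
    raise ValueError(f"unexpected number of fields in {name!r}")


if __name__ == "__main__":
    print(parse_sample_name("AtlanticSalmon_RNA_Liver_Mature_Female_R3"))
```

Distinguishing DevMap from BodyMap by the number of fields works for the four examples listed; names that deviate from the scheme raise a `ValueError` rather than being guessed at.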

nf-core pipeline results

Processing was performed using nf-core pipelines (rnaseq, atacseq and chipseq). All results from the pipelines are available on Salmobase for browsing and download: https://salmobase.org/datafiles/datasets/Aqua-Faang/nfcore/

Trackhub with nf-core results

BigWig (read pileups), BED (peaks) and BAM (read mapping) files can be viewed in the Salmobase JBrowse by pressing “turn on the connection”. Alternatively, the tracks can be opened in any genome browser that supports trackhubs via these links:

Note that the genome browser must be set up with the correct genome assembly and must support the Ensembl chromosome naming.

Blacklist regions

Tracks > Aqua-Faang > blacklist

To remove signal-artifact regions from the sequencing data for downstream analysis, we generated blacklist files for the Atlantic salmon and rainbow trout genomes. First, uint8 mappability files were created with Umap (Karimzadeh et al., 2018) for k-mer lengths of 75, 100 and 150 across each chromosome in the genome, and the mappability results for all k-mer lengths were then combined. Finally, blacklist regions were defined with Blacklist (Amemiya et al., 2019) from the combined uint8 files and all available ChIP-seq input BAM files (32 inputs for Atlantic salmon, 29 for rainbow trout). Blacklists were first generated separately for classic ChIP and μChIPmentation inputs, and the resulting regions were then merged.
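The final merge of the two blacklists can be pictured as a union of intervals. The text does not say which tool or rule was used for this step, so the sketch below is only one plausible reading: it assumes BED-style half-open coordinates and also merges directly adjacent regions. The function name `merge_regions` and the example coordinates are made up for illustration.

```python
from collections import defaultdict


def merge_regions(*region_sets):
    """Union several lists of (chrom, start, end) regions.

    Assumes BED-style half-open coordinates. Overlapping or directly
    adjacent regions on the same chromosome are fused into one.
    """
    by_chrom = defaultdict(list)
    for regions in region_sets:
        for chrom, start, end in regions:
            by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_end is not None and start <= current_end:
                # Overlaps (or touches) the region being built: extend it.
                current_end = max(current_end, end)
            else:
                if current_end is not None:
                    merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    classic_chip = [("1", 100, 200), ("1", 500, 600)]
    mu_chipmentation = [("1", 150, 300), ("2", 0, 50)]
    print(merge_regions(classic_chip, mu_chipmentation))
```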

Chromatin state tracks

Tracks > Aqua-Faang > chromatin_states

Chromatin states were annotated with ChromHMM using the ATAC-seq and ChIP-seq alignments produced by the nf-core pipelines. The corresponding input samples were supplied as control background signal for the ChIP-seq data. Prior to ChromHMM, alignments were filtered to keep only reads that were on chromosomes and not within blacklist regions. The alignments were then binarised using the ChromHMM BinarizeBed function with the default bin size of 200 bp, before running the ChromHMM LearnModel function.

After testing different numbers of states to find models that capture the biologically expected states, the final settings were as follows. For Atlantic salmon, two 12-state models were used: one for developmental stages and one for tissues (brain, gonad, liver, muscle). For rainbow trout, one 12-state model was used for developmental stages and three 10-state models were used for tissues, one per tissue (brain, liver, muscle). Each rainbow trout tissue was modelled separately because data quality varied more between tissues for trout than for salmon.

The combination of ATAC and ChIP mark signals was used to annotate the states of each model with labels representing different biological states of the chromatin, including regulatory features. For each biological condition across the samples (developmental stages, and tissues at each maturity and sex), ChromHMM produced BED files annotating all genome regions with a chromatin state at 200 bp resolution, using the respective model and the blacklist-filtered assay data for that condition.
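To make the filtering and 200 bp binning concrete, here is a simplified sketch that drops reads outside chromosomes or inside blacklist regions and tallies the rest per bin. It does not reproduce ChromHMM itself, which goes on to turn such counts into present/absent calls using its own thresholding. Representing each read by a single position, and the names `count_reads_per_bin` and `in_blacklist`, are assumptions made for illustration.

```python
from collections import Counter

BIN_SIZE = 200  # bin size stated in the text above


def in_blacklist(chrom, pos, blacklist):
    """True if pos falls inside any half-open (chrom, start, end) blacklist region."""
    return any(c == chrom and s <= pos < e for c, s, e in blacklist)


def count_reads_per_bin(reads, chromosomes, blacklist, bin_size=BIN_SIZE):
    """Count reads per fixed-size bin after filtering.

    reads:       iterable of (chrom, position) pairs; one position per read
                 is a simplification made for this sketch
    chromosomes: set of sequence names that count as chromosomes
    blacklist:   list of (chrom, start, end) regions to exclude
    """
    counts = Counter()
    for chrom, pos in reads:
        if chrom not in chromosomes:
            continue  # e.g. unplaced scaffolds
        if in_blacklist(chrom, pos, blacklist):
            continue
        counts[(chrom, pos // bin_size)] += 1
    return counts


if __name__ == "__main__":
    # Made-up reads, for illustration only.
    reads = [("1", 10), ("1", 150), ("1", 250), ("1", 1050), ("scaffold_7", 5)]
    print(count_reads_per_bin(reads, {"1"}, [("1", 1000, 1200)]))
```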

Figure: Definition of Chromatin States. (This is from Atlantic salmon but rainbow trout is similar)

Robust ATAC peaks with state annotation tracks

Tracks > Aqua-Faang > robust_ATAC_peaks

A set of robust, reproducible open chromatin regions for each biological condition was generated by comparing ATAC-seq narrow peak results across biological replicates. Narrow peak files from the ATAC-seq pipeline results were first filtered to remove peaks that completely overlapped blacklist regions or did not map to a chromosome sequence.

The reproducibility of peaks between all possible pairs of biological replicates (2-3 replicates per condition) was determined using the Irreproducible Discovery Rate (IDR) software from the ENCODE project (doi:10.1214/11-AOAS466). IDR measures how reproducibly peaks are scored in each replicate in order to determine an optimal cutoff for significance. IDR was run using the options --rank='p.value' and --soft-idr-threshold=0.1. This gave a set of significantly reproducible peaks for each pair of replicates. When there were three replicates for a condition, the reproducible peaks from each pair were merged (>= 1 bp overlap), taking the maximum IDR score for the resulting peak.

With one set of reproducible peaks per condition, the summits of those peaks were recalculated using the MACS2 refinepeak function on all replicate ATAC-seq BAM alignments together. These reproducible peaks with summits were finally assigned a chromatin state based on the states that the peak regions overlapped, using the state annotations from the respective condition. When a peak overlapped multiple states, the state was assigned in order of importance (same order as in the figure above). A unified set of these reproducible peaks was generated per species by merging all overlapping peak regions (>= 1 bp overlap) across all conditions.
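Two of the steps above, merging peaks from replicate pairs while keeping the maximum score and picking one state for a peak by priority, can be sketched as follows. This is an illustration under stated assumptions rather than the code actually used: coordinates are taken to be BED-style half-open, the function names are hypothetical, and the state labels are placeholders because the real labels and their order are defined in the figure.

```python
def merge_peaks(peaks):
    """Merge (chrom, start, end, score) peaks that overlap by >= 1 bp.

    Assumes half-open coordinates. The merged peak keeps the maximum score.
    """
    merged = []
    for chrom, start, end, score in sorted(peaks):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            _, last_start, last_end, last_score = merged[-1]
            merged[-1] = (chrom, last_start, max(last_end, end), max(last_score, score))
        else:
            merged.append((chrom, start, end, score))
    return merged


def assign_state(peak, state_segments, priority):
    """Return the highest-priority state overlapping the peak, or None.

    state_segments: list of (chrom, start, end, state) annotation segments
    priority:       state labels, most important first
    """
    chrom, start, end = peak[:3]
    overlapping = {
        state
        for c, s, e, state in state_segments
        if c == chrom and s < end and start < e
    }
    for state in priority:
        if state in overlapping:
            return state
    return None


if __name__ == "__main__":
    # Placeholder labels and made-up coordinates, for illustration only.
    priority = ["ActivePromoter", "Enhancer", "Quiescent"]
    segments = [("1", 0, 200, "Quiescent"), ("1", 200, 600, "Enhancer")]
    pair_peaks = [("1", 100, 300, 0.8), ("1", 250, 400, 1.2), ("1", 400, 500, 0.5)]
    for peak in merge_peaks(pair_peaks):
        print(peak, assign_state(peak, segments, priority))
```

In this sketch, peaks that only touch end to start are not merged, which matches the stated requirement of at least 1 bp of overlap.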