ATACseq folder README file

Alicia Lledo Lara, January 2016


Project: Functional characterisation of genetic susceptibility to psoriasis


Folder structure

All ATAC-seq data generated for the psoriasis project using blood and skin from patients and controls is stored in the following directory: /well/jknight/ATACseq_all_projects/Psoriasis

Under this directory there are folders containing either the analysis of the data for each sample (named with the appropriate sample ID; so far there are none) or other relevant files used in the analysis.

The folders currently present are:

  • Annotation_files
  • Ensembl_TSS
  • GAT_workspace


Folder content description

1. Annotation files

The folder Annotation_data contains the annotation files that have been used in the ATAC-seq analysis for the different cell types isolated from blood and for the skin.

For each of those cell types there is a folder containing the relevant annotation files:

  • CD14_monocytes
  • skin

Chromatin segmentation files

The chromatin segmentation state files used to generate the heat maps, which illustrate enrichment for a particular chromatin state at different fragment sizes, come from the 18-state model generated by ChromHMM and were retrieved from the Epigenome Roadmap: http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/.

They are BED files containing the chromosome name, start position, end position and chromatin state (1 to 18).

The legend to interpret those labels can be found at the links below (note that these URLs point to the label map of the 25-state imputed model):

http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/imputed12marks/jointModel/final/labelmap_25_imputed.tab

http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/imputed12marks/jointModel/final/annotation_25_imputed12marks.txt

The IDs for the different cell types are as follows:

  • E029: Primary monocytes from peripheral blood
  • E057 and E058: Foreskin Keratinocyte Primary Cells skin02 and skin03. We have been using E058.

These files are named ID_18_core_K27ac_stateno.bed, where ID is the cell type ID.
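As a quick sanity check on a segmentation file, the four-column layout described above can be summarised by counting how many bases fall in each state. This is a minimal sketch, assuming the files are tab-separated with no extra columns; the function name `bases_per_state` and the example rows are hypothetical and not part of the project.

```python
import csv
from collections import Counter


def bases_per_state(lines):
    """Sum the number of bases assigned to each chromatin state.

    `lines` is any iterable of tab-separated lines with four columns:
    chromosome, start, end, state (assumed layout).
    """
    totals = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and optional BED header lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        chrom, start, end, state = row[:4]
        totals[state] += int(end) - int(start)
    return totals


# Hypothetical example rows, not real data.
example = [
    "chr1\t0\t1000\t18",
    "chr1\t1000\t1600\t1",
    "chr2\t500\t900\t1",
]
for state, n_bases in sorted(bases_per_state(example).items(), key=lambda kv: int(kv[0])):
    print(state, n_bases)
```

To run it on a real file, pass an open file handle instead of the example list, for instance `bases_per_state(open("E058_18_core_K27ac_stateno.bed"))`, where the file name follows the naming pattern above.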


ENCODE DNase-I hypersensitivity peak files

These files contain the peaks corresponding to open chromatin sites assayed using the DNase-I hypersensitivity technique.

1. University of Washington (UW) DHS

These files have been downloaded from the Epigenome Roadmap with the ID E057 and are believed to be the same data as the UW tracks in ENCODE: http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/

The narrowPeak file contains the combined peak calls for the different replicates (usually 2), which have been called using MACS2 and the default q-value (0.01), the same software and parameters that we are currently using.

BED files (containing alignment data of the kind usually stored in BAM files) and WIG files for each replicate can also be downloaded from http://genboree.org/EdaccData/Release-9/experiment-sample/Chromatin_Accessibility/CD14_Primary_Cells/.

The IDs for the different cell types are the same as those previously specified for the chromatin segmentation data.

These files are named ID_UW_DNase.macs2.narrowPeak, where ID is the cell type ID.

To use these files with the Genomic Association Tester (GAT) software, some reformatting of the narrowPeak file was necessary.

The modified files contain the first 3 columns of the narrowPeak file (chromosome name, peak start and peak end) and a 4th column describing which cells and experiment the peaks come from, e.g. CD14_UW. For GAT to perform the overlap and enrichment analysis correctly, the value in the 4th column must be the same for all the peaks in a file.

The modified files are named using GAT as a prefix: GAT_ID_UW_DNase.macs2.narrowPeak.
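The reformatting described above can be sketched as follows. This is an illustration rather than the script that was actually used: it assumes the narrowPeak file is tab-separated, and the function names `narrowpeak_to_gat` and `convert_file` are hypothetical.

```python
def narrowpeak_to_gat(lines, label):
    """Yield 4-column lines (chrom, start, end, label) from narrowPeak lines.

    Every output line carries the same `label` in the 4th column.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        yield "\t".join((chrom, start, end, label))


def convert_file(src_path, dst_path, label):
    """Write the GAT-ready version of the narrowPeak file at src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for out_line in narrowpeak_to_gat(src, label):
            dst.write(out_line + "\n")


# Hypothetical 10-column narrowPeak row, not real data.
example = ["chr1\t100\t200\tpeak_1\t50\t.\t5.0\t10.0\t8.0\t40\n"]
for out_line in narrowpeak_to_gat(example, "CD14_UW"):
    print(out_line)
```

A call such as `convert_file("E029_UW_DNase.macs2.narrowPeak", "GAT_E029_UW_DNase.macs2.narrowPeak", "CD14_UW")` would then produce a file following the naming convention above; the exact file names here are assumed from that convention.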


2. Duke University DHS

Data can be retrieved via the UCSC Table Browser as the Pk file (regions of enriched signal in DNaseI HS experiments). Peaks were called based on signals created using F-Seq, a software program developed at Duke (Boyle et al., 2008b).

Significant regions were determined by fitting the data to a gamma distribution to calculate p-values. Contiguous regions where p-values were below a 0.05/0.01 threshold were considered significant.

Peak files are the result of pooled replicates, although the way this is explained is quite confusing.

Raw data for individual replicates can be downloaded from https://genome.ucsc.edu/cgi-bin/hgFileUi?g=wgEncodeOpenChromDnase

The ID for each cell type corresponds to the name of the cell. These files are named ID_Duke_peaks.bed.

To use these files with GAT, modifications similar to the ones described for the UW DHS files were performed. The prefix for the modified files is GAT.


2. GAT workspace file

This file contains the ungapped human genome (hg19 build), that is, only the regions that have been properly assembled, excluding those that could not be unequivocally mapped.

It is a bed file with the chromosome names and two columns corresponding to the starting and end position of the mappable region.

The file is named contigs_ungapped.bed
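One way to check the workspace file is to sum the lengths of its intervals, which gives the total ungapped genome size that GAT will sample from. A minimal sketch, assuming a whitespace-separated file with three columns (chromosome, start, end); the function name `workspace_size` and the example intervals are hypothetical.

```python
def workspace_size(lines):
    """Return the total number of bases covered by the workspace intervals."""
    total = 0
    for line in lines:
        fields = line.split()
        # Skip blank, comment and header lines.
        if len(fields) < 3 or fields[0].startswith(("#", "track", "browser")):
            continue
        total += int(fields[2]) - int(fields[1])
    return total


# Hypothetical intervals, not real coordinates.
example = ["chr1\t100\t600\n", "chr1\t800\t1000\n"]
print(workspace_size(example))  # 500 + 200 = 700
```

For the real file, `workspace_size(open("contigs_ungapped.bed"))` would return the ungapped size in bases.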


3. Ensembl transcription start sites (TSS) coordinates

This file contains all the positions identified as transcription start sites by Ensembl using the GRCh37 (hg19) build. It was provided by Silvia Salatino (Core Genomics).

A similar file can be downloaded using the UCSC Table Browser (Genes and gene predictions, Ensembl genes). The output file contains different features of the genes, from which the TSS (labelled as txStart) can be retrieved. However, that file identifies genes by their Ensembl (ENS) IDs, which would need to be converted to gene names.

The file stored here is a bed file containing chromosome name, strand (-1 or 1), TSS coordinates and an associated gene name.

The name of the file is Ensembl_TSS_GRCh37.txt.gz
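The file can be read directly from its gzipped form. The sketch below assumes the columns are tab-separated and appear in the order given above (chromosome, strand, TSS coordinate, gene name); whether the file has a header row is not known, so rows that do not parse as numbers are skipped. The names `Tss`, `parse_tss` and `read_tss` are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class Tss(NamedTuple):
    chrom: str
    strand: int      # -1 or 1
    position: int    # TSS coordinate
    gene: str


def parse_tss(lines: Iterable[str]) -> Iterator[Tss]:
    """Parse tab-separated TSS lines, skipping rows that do not fit the layout."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        chrom, strand, position, gene = fields[:4]
        try:
            record = Tss(chrom, int(strand), int(position), gene)
        except ValueError:
            continue  # e.g. a header row, if the file has one
        yield record


def read_tss(path: str) -> Iterator[Tss]:
    """Read TSS records from a gzipped text file."""
    with gzip.open(path, "rt") as handle:
        yield from parse_tss(handle)


# Hypothetical rows, not real annotations.
example = ["chr\tstrand\ttss\tgene\n", "1\t-1\t5000\tGENE_A\n", "2\t1\t12000\tGENE_B\n"]
for record in parse_tss(example):
    print(record)
```

With the real file this would be used as `for record in read_tss("Ensembl_TSS_GRCh37.txt.gz"): ...`.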