Peak calling

This page gives an overview of the peak calling methods applied to our functional characterization data. It brings together results from several screen analysis tools so that calls can be compared across assays.

Unified processing pipeline for peak calling in high-throughput reporter assays

A unified processing pipeline was developed by Junke from the Yu Lab to standardize enhancer calling for high-throughput reporter assays.

deep_ATAC_STARR_seq.genomic_bin_100_sliding_10.tar.gz
lentiMPRA.tar.gz
tilingMPRA_MYC_GATA.tar.gz
tilingMPRA_OL13.tar.gz
tilingMPRA_OL45.tar.gz
WHG_STARR_TR.tar.gz
FCC Metadata (STARR/MPRA)

Label      Assay           Direction                      Count
ASTARR_A   ATAC-STARR-seq  Active (either direction)      35505
ASTARR_AB  ATAC-STARR-seq  Active (both direction)        11680
ASTARR_R   ATAC-STARR-seq  Repressive (either direction)  154337
ASTARR_RB  ATAC-STARR-seq  Repressive (both direction)    28775
eSTARR_A   eSTARR-seq      Active (either direction)      150
eSTARR_AB  eSTARR-seq      Active (both direction)        31
eSTARR_R   eSTARR-seq      Repressive (either direction)  341
eSTARR_RB  eSTARR-seq      Repressive (both direction)    65
LMPRA_A    Lenti-MPRA      Active (either direction)      25648
LMPRA_AB   Lenti-MPRA      Active (both direction)        16603
LMPRA_R    Lenti-MPRA      Repressive (either direction)  485
LMPRA_RB   Lenti-MPRA      Repressive (both direction)    128
TMPRA_A    Tiling-MPRA     Active (either direction)      6017
TMPRA_AB   Tiling-MPRA     Active (both direction)        57
TMPRA_R    Tiling-MPRA     Repressive (either direction)  254
TMPRA_RB   Tiling-MPRA     Repressive (both direction)    1
WSTARR_A   WHG-STARR-seq   Active (either direction)      79738
WSTARR_AB  WHG-STARR-seq   Active (both direction)        25505
WSTARR_R   WHG-STARR-seq   Repressive (either direction)  62201

This study uses only the “either direction” calls.

Column descriptions

  • Chrom: Name of the chromosome
  • ChromStart: The starting position of the feature in the chromosome
  • ChromEnd: The ending position of the feature in the chromosome
  • Name: Name
  • Score: Z-score based on the mean logFC across all bins
  • Strand: Strand
  • Group: Assay name
    • ASTARR = ATAC-STARR
    • WSTARR = Whole genome (WHG)-STARR
    • LMPRA = Lenti-MPRA
    • TMPRA = Tiling MPRA
  • Label: Assay name + direction (A/R)
    • A: enhancer calls (merged_enhancer_peaks_in_either_orientation.bed.gz)
    • R: repressive calls (merged_repressor_peaks_in_either_orientation.bed.gz)
  • Dataset: Assay dataset
    • TR = Reddy lab (Tim Reddy); ATAC-STARR and WHG-STARR
    • Nadav = Ahituv lab (Nadav Ahituv); Lenti-MPRA
    • OL = dataset label from Tewhey lab; Tiling MPRA

Summary counts

The counts below come from the merged in_either_orientation peak files produced by Junke's pipeline.

Assay           Active (A)  Repressive (R)
ATAC-STARR-seq  35,505      154,337
WHG-STARR-seq   79,738      62,201
Lenti-MPRA      25,648      485
Tiling-MPRA     6,017       254
eSTARR-seq      150         341

Applying ChIP-seq differential peak calling (csaw) on ATAC-STARR-seq assay

The csaw tool was used for differential peak calling on the ATAC-STARR assay data to identify cis-regulatory elements within chromatin-accessible regions. This analysis was conducted by Alex.

KS91 (6Dna4Rna) -> KSMerge (6Dna7Rna)
The number of significant regions increased. Negative regions increased, while positive regions decreased.

Total number of regions: 
352,944 -> 359,104

Significant regions (-log10Q >= 3):
87,695 -> 93,208

Fraction of positive and negative regions among significant regions (counts in parentheses):
- Positive: 0.61 (53110) -> 0.53 (49041)
- Negative: 0.39 (34585) -> 0.47 (44167)
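
The fractions above can be re-derived from the reported counts. The short sketch below does that arithmetic and also prints the change in each class between the two runs; the variable names are illustrative, and only the counts come from this page.

```python
# Counts of significant regions (-log10Q >= 3, i.e. Q <= 0.001) reported above.
ks91 = {"positive": 53110, "negative": 34585}
ksmerge = {"positive": 49041, "negative": 44167}

def fractions(counts):
    """Return each class as a fraction of all significant regions (2 decimals)."""
    total = sum(counts.values())
    return {k: round(v / total, 2) for k, v in counts.items()}

# Per-class change from KS91 to KSMerge.
delta = {k: ksmerge[k] - ks91[k] for k in ks91}

for name, counts in [("KS91", ks91), ("KSMerge", ksmerge)]:
    print(name, sum(counts.values()), fractions(counts))
print("change:", delta)
```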

CRISPR activity screen analysis (CASA) on CRISPRi-HCR Flow-FISH data

The CASA analysis pipeline, developed by the Sabeti Lab, has been applied to CRISPRi-HCR Flow-FISH data to identify regulatory elements. The results of significant regions can be downloaded from the ENCODE portal as follows:

The table was downloaded from ENCODE FCC CRISPRi HCR FlowFISH.

Calling DHS regions using DESeq analysis for CRISPRi-Growth

For the CRISPRi-Growth data, DHS (DNase I hypersensitive site) regions with a significant effect on cell fitness were identified using DESeq analysis, performed by Alex.

About 1M (1,092,166) guides were designed to screen about 111K (111,702) DHS regions in K562.

Method: DESeq2 analysis on all guides -> log2 fold change and p-values

Significant: guides with FDR <= 0.05 (fdr_0_05)

We obtained 6,242 DHS regions containing at least one significant guide.

#Guide  (Total):      1092166 
#Region (Total):      111702 
#Guide  (padj<=0.05): 8200 
#Region (padj<=0.05): 6242 

#Guide  (Signif): 6242 
#Region (Signif): 6242 
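
The region-level call described above (a DHS region counts as significant if it contains at least one guide with padj <= 0.05) can be sketched as follows. This is an illustrative reconstruction, not the script that produced the numbers above: the guide and region identifiers are hypothetical, and the None entry stands in for a guide with no adjusted p-value (DESeq2 can report NA for padj).

```python
from collections import defaultdict

PADJ_CUTOFF = 0.05

# Hypothetical (guide_id, region_id, padj) records for illustration only.
guides = [
    ("g1", "dhs_1", 0.001),
    ("g2", "dhs_1", 0.40),
    ("g3", "dhs_2", 0.20),
    ("g4", "dhs_3", 0.03),
    ("g5", "dhs_3", 0.01),
    ("g6", "dhs_4", None),   # no adjusted p-value available
]

def significant_regions(records, cutoff=PADJ_CUTOFF):
    """Map each region with at least one significant guide to those guides."""
    hits = defaultdict(list)
    for guide_id, region_id, padj in records:
        if padj is not None and padj <= cutoff:
            hits[region_id].append(guide_id)
    return dict(hits)

hits = significant_regions(guides)
n_guides = sum(len(v) for v in hits.values())
print(f"#Guide  (padj<={PADJ_CUTOFF}): {n_guides}")
print(f"#Region (padj<={PADJ_CUTOFF}): {len(hits)}")
```

Because several significant guides can fall in the same region, the guide count is generally at least as large as the region count.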

ENCODE E2G Benchmark data

Build ENCODE E2G model

Logistic regression

Train and test the E2G model using their collected “gold standard” dataset

  • 10,375 total element-gene pairs collected from previous studies.
    • 472 “positive” unique element-gene pairs
    • 9,903 “negative” element-gene pairs

To train and evaluate models, we aggregated a gold-standard dataset of 10,411 element-gene pairs tested with CRISPR in K562 erythroleukemia cells, an ENCODE Tier 1 cell line. We re-analyzed and harmonized data from previous studies that used genetic perturbations (mostly CRISPR interference (CRISPRi)) to inhibit candidate enhancers and measure effects on gene expression 9,19,23–25 (see Note S1). Importantly, we developed approaches to compute statistical power for every tested element-gene pair, identifying 472 “positive” unique element-gene pairs where CRISPR perturbation of the element led to a significant decrease in gene expression (–1 to –93% effects, Fig. S1.1f) and 9,938 “negative” element-gene pairs where no significant reduction in expression was observed despite the experiment having good power to detect >15-25% effects on gene expression (Note S1). We trained logistic regression classifiers to distinguish positives from negatives using hold-one-chromosome-out cross-validation. Then, we applied the trained model to all element-gene pairs across the genome and to new cell types.
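
The hold-one-chromosome-out scheme mentioned above can be illustrated with a small splitter: each fold holds out every element-gene pair on one chromosome and trains on the rest, so no chromosome contributes to both sides of a fold. This is a sketch with hypothetical pair identifiers, and the classifier fit itself is omitted; with scikit-learn, `LeaveOneGroupOut` combined with `LogisticRegression` would be one way to do the same thing, though the source does not say which implementation was used.

```python
from collections import defaultdict

# Hypothetical element-gene pairs: (chromosome, pair_id, is_positive).
pairs = [
    ("chr1", "e1|GENE_A", True),
    ("chr1", "e2|GENE_A", False),
    ("chr2", "e3|GENE_B", False),
    ("chr2", "e4|GENE_C", True),
    ("chrX", "e5|GENE_D", False),
]

def hold_one_chromosome_out(records):
    """Yield (held_out_chrom, train, test); test is every pair on that chromosome."""
    by_chrom = defaultdict(list)
    for rec in records:
        by_chrom[rec[0]].append(rec)
    for chrom in sorted(by_chrom):
        test = by_chrom[chrom]
        train = [r for c, recs in by_chrom.items() if c != chrom for r in recs]
        yield chrom, train, test

folds = list(hold_one_chromosome_out(pairs))
for chrom, train, test in folds:
    print(chrom, "train:", len(train), "test:", len(test))
```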

Biosample: K562
Reference:
  • Ulirsch2016
  • Gasperini et al., 2019
  • Wakabayashi2016
  • Schraivogel et al., 2020
  • Klann2017
  • Thakore2015
  • Xie2017
  • Fulco2019
  • Qi2018
  • Huang2018
  • Xu2015
  • Fulco2016

Source                    Count
Fulco2016                 103
Fulco2019                 3501
Gasperini et al., 2019    5318
Schraivogel et al., 2020  1306