Gene nucleotide composition accurately predicts expression and is linked to topological chromatin domains
C. Bessière, M. Taha, F. Petitprez, J. Vandel, J.-M. Marin, L. Bréhélin, S. Lèbre and C.-H. Lecellier.
- Freely available from bioRxiv: doi.org/10.1101/117499.
- Abstract:
Gene expression is orchestrated by distinct regulatory regions (e.g. promoters, enhancers, UTRs) to ensure a wide variety of cell types and functions. A challenge is to identify which regulatory regions are active, what their associated features are, and how they work together in each cell type. Several approaches have tackled this problem by modeling gene expression based on epigenetic marks (e.g. ChIP-seq, methylation, DNase hypersensitivity), with the ultimate goal of identifying driving genomic regions and mutations that are clinically relevant, in particular in precision medicine. However, these models rely on experimental data, which are limited to specific samples (often even to cell lines) and cannot be generated for all regulators and all patients. In addition, we show here that, although these approaches are accurate in predicting gene expression, their biological interpretation can be misleading. Finally, these methods are not designed to capture potential regulation instructions present at the sequence level, before the binding of regulators or the opening of the chromatin. We develop here a method for predicting mRNA levels based solely on sequence features collected from distinct regulatory regions, which is as accurate as methods based on experimental data. Our approach confirms the importance of nucleotide composition in predicting gene expression and ranks regulatory regions according to their contribution. It also unveils a strong influence of gene body sequence, in particular introns. We further provide evidence that the contribution of nucleotide content can be linked to co-regulations associated with genome 3D architecture and to associations of genes within topologically associated domains.
- Supplementary Materials:
- Matrices of predicted variables (log-transformed RNA-seq data from 241 TCGA samples) and predictive variables (nucleotide and dinucleotide percentages, motifs and DNA shape scores)
  - computed for all genes: genes_data_predicted_and_predictive_variables.tar
  - computed for all isoforms: isoforms_data_predicted_and_predictive_variables.tar
  - barcode list of the 241 TCGA samples: TCGA_barcodes_list.txt
- Barcode lists of the 2 additional validation datasets
  - 1270 TCGA RNA-seq samples: additional_1270_TCGAbarcodes_list_RNAseq.txt
  - 582 TCGA microarray samples: additional_582_TCGAbarcodes_list_microarray.txt
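
As a rough illustration of the sequence-derived predictive variables listed above, the following Python sketch computes nucleotide and dinucleotide percentages for a single DNA sequence. It is not the authors' pipeline: the function name `composition_percentages`, the use of overlapping dinucleotide windows on a single strand, and the skipping of ambiguous bases such as N are all assumptions made for illustration. The supplementary archives remain the reference for the values actually used.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def composition_percentages(sequence):
    """Return nucleotide and dinucleotide percentages for one DNA sequence.

    Assumptions (hypothetical, not taken from the paper): dinucleotides are
    counted in overlapping windows on the given strand only, and any window
    containing a character other than A, C, G or T (e.g. N) is skipped.
    """
    seq = sequence.upper()

    # Count unambiguous single nucleotides.
    mono = Counter(base for base in seq if base in NUCLEOTIDES)

    # Count overlapping dinucleotides made of unambiguous bases only.
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in NUCLEOTIDES and seq[i + 1] in NUCLEOTIDES
    )

    mono_total = sum(mono.values())
    di_total = sum(di.values())

    features = {}
    for base in NUCLEOTIDES:
        features[base] = 100.0 * mono[base] / mono_total if mono_total else 0.0
    for pair in product(NUCLEOTIDES, repeat=2):
        key = "".join(pair)
        features[key] = 100.0 * di[key] / di_total if di_total else 0.0
    return features


if __name__ == "__main__":
    # Toy sequence, not real data.
    for name, value in composition_percentages("ACGTTGCAAC").items():
        print(f"{name}\t{value:.2f}")
```

Applying such a function to each regulatory region of each gene (for instance promoter, UTRs, introns) would give one row of 20 composition features per region, which is the general shape of the predictive-variable matrices described above.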