Introduction
When the transcription of a precursor mRNA reaches its end, it usually undergoes two coupled maturation processes: 3′ end cleavage and polyadenylation (Richard & Manley,
2009; Tian & Manley,
2016). In eukaryotic organisms, transcripts derived from the same gene locus could end at different positions, resulting in distinct isoforms. The phenomenon, termed as alternative polyadenylation (APA), affects more than 70% of protein‐coding genes in mammals (Derti
et al,
2012; Hoque
et al,
2013). APA can lead to the production of alternative protein isoforms, if it involves the splicing of an alternative last exon or the usage of a polyadenylation site (PAS) upstream of the distal stop codon. More often, if alternative PASs are limited to the 3′UTR, it results in transcripts with identical protein‐coding sequences, but 3′UTRs of different lengths. These APA isoforms may contain different sets of
cis‐regulatory elements in their 3′UTRs and therefore may differ in mRNA stability (Chen & Bin,
1995; Barreau
et al,
2005; Bartel,
2009; Jonas & Izaurralde,
2015), translation efficiency (De Moor
et al,
2005; Lau
et al,
2010), intracellular localization (Ephrussi
et al,
1991; Bertrand
et al,
1998; An
et al,
2008; Niednery
et al,
2014), or even in the localization and function of the encoded proteins (Berkovits & Mayr,
2015). APA often shows tissue‐, cell type‐, or developmental stage‐specific patterns. In general, distal PAS usage appears to be favored in differentiated cells, whereas proximal PAS usage is more frequently observed in proliferating, undifferentiated, or cancerous cells (Ji & Tian,
2009; Mayr & Bartel,
2009).
Polyadenylation is regulated by various
cis‐ and
trans‐regulatory elements. In vertebrates, the canonical hexamer AAUAAA and its highly similar variants are the most prominent
cis‐regulatory elements and defined as polyadenylation signal (Proudfoot & Brownlee,
1976; Proudfoot,
2011). Located 15–30 nucleotides upstream of the cleavage site, they are directly recognized by the cleavage and polyadenylation specificity factor (CPSF) subunits CPSF30 and Wdr33 (Shi,
2014). The choice of alternative PASs is achieved by a variety of mechanisms. Global APA changes can be introduced by different expression levels of components of 3′ end complexes, e.g., higher expression of CFI would lead to the preference of distal PASs (Martin
et al,
2012). Among the
trans‐regulatory factors are also RNA‐binding proteins (RBPs) which can enhance or repress the binding of the polyadenylation machinery to their target sites (Erson‐Bensan,
2016).
Although the function of different APA isoforms has been demonstrated for many individual genes, the phenotypic relevance of most APA observed in modern high‐throughput sequencing experiments remains elusive. In a recent study, Xu and Zhang proposed the “error hypothesis”, stating that most genes have only “one optimal polyadenylation site and that APA is caused largely by deleterious polyadenylation errors” (Xu & Zhang,
2018). The authors support this hypothesis by showing that in general, lower gene expression levels correlate with higher APA diversity within and between tissues. In extension of Kimura's neutral theory of molecular evolution (Kimura,
1968,
1983), Zhang also suggests that differences between species in gene regulatory phenotypes like APA are largely non‐adaptive, as only a small fraction of these will affect higher‐level phenotypes and organismal fitness (Zhang,
2018).
If patterns of APA in mammals generally conform to the error hypothesis (concerning the molecular function of APA) as well as the neutral theory (concerning evolutionary patterns of APA), the following predictions can be made: (i) The large majority of genes should have only one optimal PAS and genes subject to the strongest purifying selection would show no APA. (ii) Among the genes with more than one PAS, genes under the lowest levels of selective constraints should exhibit more APA diversity within and across tissues as well as higher divergence between species. (iii) Compared to genes under stronger purifying selection, these genes should show weaker and less conserved APA regulation. Alternatively, the usage of alternative PASs could be functional within and/or between tissues, and changes in APA regulation between species could be adaptive and result from positive selection rather than random drift.
In this study, we test the predictions made by the error hypothesis and neutral theory, by globally characterizing APA patterns across various tissues in the F1 hybrid of the two mouse inbred strains C57BL/6J and SPRET/EiJ, belonging to the closely related species
Mus musculus (house mouse) and
Mus spretus (Algerian mouse), respectively. In F1 hybrids, the RNA transcripts from both parental alleles are subject to the same
trans‐regulatory environments; therefore, the observed differences in allele‐specific APA patterns reflect the effects of
cis‐regulatory divergence (Wittkopp
et al,
2004; Xiao
et al,
2016). As shown in a previous study by our group, evolutionary divergence of APA between closely related species can be attributed largely to
cis‐regulatory changes (Xiao
et al,
2016). But the extent and general importance of these changes and their interaction with tissue‐dependent
trans‐regulation remain undetermined. This dataset therefore enables us to simultaneously study the tissue dependence and potential function of PAS regulation as well as the evolutionary patterns of
cis‐regulatory APA divergence in mammals.
Discussion
APA is a common process in higher eukaryotes which has received increased attention in recent years with regard to its function, regulation, and evolution. Improved high‐throughput deep sequencing technology has enabled researchers to identify a growing number of genes undergoing APA, and it is often assumed that the detected variation in PAS usage is biologically relevant. However, cases where functions of different APA isoforms have been clearly demonstrated are still rare compared to the ubiquity of the phenomenon. It has therefore been proposed that most observed APA is noise (“error hypothesis”; Xu & Zhang,
2018) and that APA divergence between species is largely neutral or slightly deleterious (“neutral theory”; Zhang,
2018). Here, we address these issues in an F1 mouse model by investigating APA diversity within each tissue, APA differences between tissues, and
cis‐regulatory divergence between
Mus musculus (C57BL/6J) and its sister species
Mus spretus (SPRET/EiJ). We show that the error hypothesis does not predict the differences between single‐PAS and multi‐PAS genes and that it only applies to a specific subset of multi‐PAS genes. Furthermore, we demonstrate that higher APA diversity or divergence can only partially be attributed to a relaxation of selective constraint on lowly expressed genes. Our results indicate that functional coupling between transcriptional and APA regulation (with higher expression levels leading to increased distal PAS usage) and competition between different PASs constitute additional factors with a significant impact on APA variation. Finally, we provide a set of genes with likely functional APA.
Different selective constraints might be reflected in different major PAS locations. We found that genes with the major PAS located upstream of the stop codon have lower expression levels and higher dN/dS ratios than all the other genes with major PAS in the 3′UTR (Fig
2C and D). Whereas for the latter generally gene expression levels are positively correlated with major PAS usage, this is not the case for the former (Fig
EV4A–C), indicating that usage of many major PASs affecting the coding region might be deleterious. Indeed, the major PASs located upstream of the stop codon also have lower usage compared to the 3′UTR major PASs. Among genes with major PAS located in the 3′UTR, genes with major PAS located at a more distal position have lower dN/dS ratios than those proximal to the stop codon. These results together demonstrate that major PAS location is related with selective constraints both on protein‐coding sequence and PAS choice accuracy, with distal UTR PASs being selectively favored over proximal UTR PASs and these in turn over PASs upstream of the stop codon.
The error hypothesis predicts stronger noise in APA regulation and therefore higher APA diversity within and across tissues in genes with lower functional impact. In general, these would be genes under relaxed selective constraints. When the relationship between gene expression levels and APA diversity within a tissue or APA changes across tissues is considered, the multi‐PAS genes with major PAS located in 3′UTR can largely be separated into two distinct groups: 1. genes with dominant PAS usage above or equal to 90% within each tissue (F1 genes, Fig
EV5), which show strong positive correlations between gene expression level and dominant PAS usage and 2. the remaining genes with more diverse APA, for which the negative correlations between gene expression and APA diversity become much weaker and even negligible (Fig
3C). This suggests that for the former group, all the other alternative PASs can be safely regarded as noise, while within the latter group, some genes might contain functional minor PASs (see
Discussion below, Fig
EV5).
In contrast to gene expression levels, dN/dS ratios, as an alternative measure of selective constraints, however, do not show any correlations with APA diversity/change. At least two possible factors could contribute to the discordance between different measures of selective constraints. One possibility is that some genes with lower average expression levels experience relaxed selective constraints on APA regulation, but not on coding sequence evolution, i.e., purifying selection acts independently on the two. This would be the case, for example, when a different version of the protein has a toxic effect on the cell even in low quantities, whereas low levels of a minor mRNA isoform with a different 3′‐UTR length would have little impact on organismic fitness. Alternatively, the regulation of gene expression level and APA accuracy is mechanistically or functionally coupled. There could be three non‐mutually exclusive scenarios of such coupling: (i) General
trans‐factors are a limited resource and highly expressed genes compete more successfully for their binding (e.g., through increased recruitment and higher localized concentrations of
trans‐factors), resulting in increased polyadenylation accuracy compared to lowly expressed genes. (ii) Specific
trans‐factors increasing the cleavage and polyadenylation accuracy of their target genes are co‐regulated with these genes in a tissue‐dependent manner. For example, genes with the major PAS located in the 3′UTR proximal (distal) to the stop codon could be more highly expressed in tissues with higher abundance of
trans‐factors favoring general 3′UTR shortening (lengthening; MacDonald & McMahon,
2010; MacDonald,
2019), or genes containing specific
cis‐regulatory sequences might be co‐regulated with the
trans‐factors targeting these
cis‐elements. This might result in increased polyadenylation error of genes in the tissues where the gene and its regulators both are lowly expressed. (iii) Transcriptional regulation is mechanistically coupled with polyadenylation in a way that automatically leads to increased major PAS usage with increased mRNA expression levels. Consistent with such mechanistic coupling, we found that the usage of the most distal PAS (3′UTR(L) PAS) for each gene was strongly correlated with gene expression level across tissues, even for the minor distal PASs (Fig
5C and D), i.e., distal PAS usage of a given gene automatically increases with higher expression levels. Recently, a study in
Drosophila showed that faster elongation rates might lead to increased usage of distal PASs in the body, but not in brain (Liu
et al,
2017). Therefore, the mechanistic basis for this phenomenon and the relationships between transcriptional activation, elongation rate, PAS choice, and mRNA abundance warrant further experimental investigation in mammalian organisms.
In addition, the magnitude of PAS usage differences between tissues or between alleles might depend on the kinetics of competition between different PASs within the same gene. Indeed, we observe that major PASs with intermediate usage (and therefore higher APA diversity) more likely show larger magnitudes of allelic divergence, a scaling law similar to that previously described for alternative splicing (Baeza‐Centurion
et al,
2019). This indicates that the same
cis‐regulatory mutation could cause changes of different magnitude dependent on the relative strengths of other competing PASs.
Taken these two together, our results suggest that (i) the transcription and polyadenylation machineries have co‐evolved to ensure that energy and limited resources are preferentially used for the accurate cleavage and polyadenylation of highly transcribed genes; and (ii) the scaling law ensures that the subtle stochastic regulation, due to either trans‐ or cis‐perturbation, would not cause dramatic usage changes for functionally important PASs with high usage, particularly those with usage higher than 90%. On the other hand, stronger purifying selection would be required to maintain equal usage of two competing functionally important PASs in the same gene. In the future, it will be important to investigate the mechanism(s) which lead(s) to better accuracy of highly expressed genes as well as the role of competition between PASs in shaping the response to regulatory perturbations, e.g., mutations in cis‐elements and fluctuations in trans‐factor expression levels.
Finally, we showed that at least 3,778 multi‐PAS genes (F2 genes, Fig
EV5) express potentially functional minor PASs simultaneously in at least one tissue. Two lines of evidence support their functional relevance: (i) The conservation scores of the 300‐nt upstream regions of these minor PASs are similar to those of major PASs, and (ii) the regions between the minor PASs and the major PASs in these genes are more conserved than in other genes and these regions contain higher densities of microRNA binding sites. Functional minor PASs apparently are particularly important in the brain, where the two isoforms are often localized in different subcellular compartments. A significant proportion of genes with two potentially functional PASs in the same tissue also show tissue‐dependent regulation. Interestingly, switch‐like regulation of different isoforms is also found in 193 FU genes (Fig
EV5), and in these genes, the regions between the dominant and the rank 2 PAS have higher conservation scores and microRNA densities than in other FU genes (Fig
EV6C and D). Genes with switch‐like tissue regulation also are largely consistent in their relative isoform usage between alleles, even if the genes are classified as divergent. There are only 2 genes out of 105 divergent switched genes exhibiting different allelic PAS change directions between the two tissues (Fig
EV8). This implies that the influence of tissue‐dependent APA regulation exceeds the molecular noise reflected in the divergence within a given tissue. Within the union of switch genes and F2 genes (3,971 genes in total), we define a subset of 2,218 high‐confidence candidates for functional APA regulation with high gene expression levels in tissues with significant minor PAS usage. Among these, 1,571 (70.83%) contain two or more conserved PASs with
Homo sapiens and might therefore be of functional relevance also in humans. These results, together, provide a prioritized set of APA isoforms for future functional studies.
Materials and Methods
RNA extraction from tissues and cultured cells
Tissues including cortex, cerebellum, heart, lung, liver, spleen, kidney, and muscle were obtained from three adult F1 hybrid mice (C57BL/6J x SPRET/EiJ mouse strains). The mice were sacrificed by cervical dislocation, and tissues were dissected and stored in −80°C before RNA extraction. All experiments were carried out in accordance with guidelines for the use of laboratory animals and were approved by the local ethical committee (Pasteur Institute). The embryonic stem cells (ES cells) were cultured in Neurobasal‐DMEM F12 (Gibco) with N2B27 (Invitrogen), 2i (Selleck), and LIF (Millipore). Total RNA was extracted from all tissues and cells using TRIzol reagent according to the manufacturer's protocol (Life Technologies). The integrity of purified total RNA was estimated by Agilent Bioanalyzer using RNA Nano Kit (Agilent Technologies) before subsequent experiments. Total RNA with an RNA integrity number (RIN) above 9.0 was used for 3′ mRNA library preparation.
3′ mRNA library sequencing
QuantSeq 3′ mRNA‐Seq Library Prep Kit REV for Illumina (Lexogen) was used to sequence the 3′ end of RNA with polyA tail. In brief, 500 ng total RNA was taken as input. RNA 3′ end regions were reverse‐transcribed using an anchored oligo‐dT primer, and second strand synthesis was initiated by random priming. PCR amplification was then performed to obtain an Illumina compatible sequencing library. All libraries were sequenced in paired‐end 2 × 151 nt format on an Illumina HiSeq X Ten machine.
Alignment of sequencing reads
The C57BL/6J reference genome (mm10) was downloaded from Ensembl (
https://ensembl.org/info/data/ftp/index.html). The SPRET/EiJ reference genome was created as described previously (Gao
et al,
2015). The reference set of PASs was obtained from PolyA_DB3 (
http://exon.umdnj.edu/polya_db/), which provides precise cleavage positions determined by 3′ READS method (Zheng
et al,
2016). We converted the PAS reference in mm9 to mm10 coordinates using g2gtools (version 0.1.29). Genes with overlapping annotated PASs were removed from further analysis.
For the 3′ mRNA‐seq data, the 3′ adaptor sequence 5′‐AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC‐3′ was first removed from forward reads and 5′‐AAAAAAAAAAAAAAAAAAAGATCGGAAGAGCG TCGTGTAGGG‐3′ from reverse reads by Cutadapt (version 1.18). The paired reads with each read longer than 15 nt were mapped to the mm10 and SPRET/EIJ genomes separately by HISAT2 (version 2.0.1) with parameters—no‐softclip—no‐discordant ‐x. Reads mapped to at least one of the reference genomes were used. In these mapped reads, if a read could map uniquely to one genome with a shorter edit distance than to the other, it was assigned as unambiguously mapped to the specific allele with the shorter edit distance. If a uniquely mapped read had the same edit distance to both parental genomes, it was assigned as common read. To unify the coordinates, reads assigned to the SPRET/EIJ allele were converted to mm10 coordinates by g2gtools. We defined all the uniquely mapped forward reads with their 5′ end within 24 nt away from the annotated cleavage positions as PAS reads.
PAS usage quantification
Genes with a sum of < 20 PAS reads in each sample were defined as not expressed and were not considered for further analysis. A PAS was considered as used if at least five reads were assigned to the PAS cluster in at least one sample. The usage of an individual PAS was calculated as the number of its PAS reads divided by the sum of PAS reads in the gene. Genes that use only one identical PAS in all tissues sampled in this study were defined as single‐PAS genes.
For allelic PAS usage, only reads unambiguously assigned to specific alleles were used. The allelic usage of an individual PAS was calculated as the number of allele‐specific reads mapped to this PAS divided by the sum of all allele‐specific PAS reads in the gene. Known imprinted genes in mouse extracted from the Geneimprint database (
http://www.gene-imprint.com/site/genes-by-species) and genes on the sex chromosomes or mitochondria were excluded from allelic PAS analyses. In order to avoid inaccurate calculation of allelic PAS usage due to sampling errors for low coverage PAS regions, genes with the sum of allele‐specific PAS reads from either allele < 10 in one sample were removed. For the remaining PASs, we further estimated their relative PAS usages by combining allelic reads from both alleles and compared the result to the usage estimated using both allelic reads and common reads. If the difference between the two estimates was > 10%, we regarded the PAS cluster as having insufficient coverage of unambiguously assigned allelic PAS reads for accurate estimation of allelic PAS usage, and it was therefore not used in further allelic analyses (Xiao
et al,
2016). We compared the allelic PAS usage divergence between the two alleles using DEXSeq (Anders
et al,
2012). Only PASs with Benjamini–Hochberg‐adjusted
P value < 0.05 and average allelic difference in PAS usage > 10% were defined as divergent.
Estimation of gene expression level
Since PAS reads were derived from the 3′ end of polyA RNA, one read from each transcript, normalization for transcript length, as often performed for regular RNA‐seq data, is not needed for quantifying gene expression. Therefore, the expression level of a gene was calculated as the sum of total PAS reads in the gene divided by sum of PAS reads from all the genes in the sample and multiplied by 10
6 (referred to as reads per million mapped reads, “RPM”). Allelic gene expression was calculated as the sum of allele‐specific PAS reads from the gene divided by the sum of total PAS reads from all the Refseq genes in the sample and multiplied by 10
6. Differential gene expression between alleles was analyzed with DESeq2 (Love
et al,
2014). Genes were considered as divergently expressed when the fold change between alleles was > 2 and the FDR < 0.05.
Down‐sampling of sequencing data
In order to avoid potential sampling bias in PAS identification caused by differences in sequencing depth between genes of different expression level, we generated a down‐sampled dataset in which each gene was covered by the same sequencing depth. In brief, we first pooled the reads from all the sequenced samples after normalizing them to the same sequencing depth and then randomly picked 100 reads mapped to each gene, to determine the number of used PASs and their relative usage.
APA diversity within tissues, variability between replicates, and APA change across tissues
To quantify the APA diversity within a tissue, we used the Shannon index (Shannon,
1948) defined as
where
pi is the usage of
ith PAS usage in a gene. The Shannon index measures the entropy of the alternative PAS usage, which reflects both the number of different PASs and the evenness of the usage distribution across these sites.
To measure the PAS variability between biological replicates, we calculated for each gene in each replicate a vector containing the usage of all its PASs (vj,k = [PAS1,PAS2,…PASi,…PASn], where j and k indicate gene j in the kth replicate in a tissue and i indicates the ith PAS in a gene) and used the vectors’ maximum norm difference between three replicates as APA replicate variability (varj = max ([|vj,1‐vj,2|, |vj,2‐vj,3|, |vj,3‐vj,1|]). In order to remove the potential biases introduced by different PAS number and reads depth, we created a mock replicate dataset by pooling all reads from three biological replicates and randomly assigning the reads into three groups where a gene's total read number is equal to the original read number. The mock dataset was then used to calculate the mock APA replicate variability. We subtracted the average variability of 100 randomly simulated mock replicates from the original replicate variability for each gene j as adjusted variability
Between tissues, we calculated for each gene in each tissue a vector containing the usage of all its PASs and used the vectors’ maximum norm difference between a tissue pair as APA difference. The mock data were created by pooling the dataset from two tissues and randomly assigning them back to the two tissues and were used for calculating the adjusted APA difference similarly as above. We defined the gene's switch score as the maximum adjusted APA difference between any two tissues where the gene is expressed.
Genes with brain‐specific minor functional PASs and genes with exactly three PASs
Genes with brain‐specific minor functional PASs were selected from genes which were classified into the F2 group exclusively in cortex and/or cerebellum (their rank 2 PAS usages in the other seven tissues were not high enough to be considered as genes with potentially functional minor PASs in these tissues).
To clarify different features between first, middle, and last PAS in 3′UTR, we defined a gene with exactly three PASs in 3′UTR if (i) it has exactly three PASs located in 3′UTR, and (ii) the sum of its three 3′UTR PASs usages is above 50% in every tissue where the gene is expressed.
Estimation of sequence conservation, sequence variant density, and dN/dS ratios
PhastCons scores of the
Glires clade (rodents, rabbits, etc.) were used to estimate sequence conservation. The PhastCons score data were obtained by PHAST (Siepel & Haussler,
2005;
http://compgen.cshl.edu/phast/phastCons-tutorial.php). Sequence variant density between C57BL/6J and SPRET/EIJ was calculated in the chain file as performed in our previous F1 hybrid study (Gao
et al,
2015). dN/dS ratios (ratio of the number of non‐synonymous substitutions per non‐synonymous site to the number of synonymous substitutions per synonymous site) between the house mouse (
M. musculus) and rat (
Rattus norvegicus) were downloaded from the Ensembl database (ensembl.org), to estimate selective constraints on the amino acid sequences of proteins.
Estimation of C‐terminal peptide conservation
Among the genes in which the dominant and the 2
nd most used PASs are conserved between humans and mouse, those with alternative last exons were selected. The C‐terminal peptides of the two isoforms of the gene were obtained from the mm10 annotated genome dataset (
http://ensemblgenomes.org/). The conservation score of C‐terminal peptides between humans and mouse was calculated as the peptide alignment score divided by mean peptide length, using the Blosum62 scoring matrix (Pearson,
2013) and the default penalties for gap opening (−11) and extension (−1). The C‐terminal peptides were considered as conserved if their conservation scores are positive.
MicroRNA target site density
We used conserved microRNA target sites annotated in TargetScanMouse (Agarwal
et al,
2015;
http://www.targetscan.org/mmu_72) to calculate the density of microRNA target sites in different 3′UTR regions. For a gene with its dominant PAS and rank 2 PAS in 3′UTR, we selected 1,000‐nt upstream region of the distal PAS or the whole region between dominant PAS and rank 2 PAS if the region length is < 1,000 nt. The microRNA target site density in each group was calculated as the average microRNA target number at each position of the regions. The microRNA densities of the regions between the major PASs and 2
nd PASs located in the 3′UTR were calculated in the same way. For the 300‐nt upstream region of a PAS, if the region overlaps any coding regions annotated in Refseq, the region was shortened to its non‐overlapping 3′UTR. The microRNA target site density of this region was calculated similarly as above.
Clustering of allelic APA as well as gene expression pattern across multiple tissues
For APA pattern and mRNA abundance clustering, the 5,264 genes with at least one PAS having sufficient allelic PAS reads in all nine tissues were used. For each of these genes, the PAS with maximum average usage was used for APA pattern clustering. Hierarchical clustering was performed using Pearson correlations as a similarity measure.