Direct inference and control of genetic population structure from RNA sequencing data

Communications Biology volume 6, Article number: 804 (2023)

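RNAseq data can be used to infer genetic variants, yet its use for estimating genetic population structure remains underexplored. Here, we construct a freely available computational tool (RGStraP) to estimate RNAseq-based genetic principal components (RG-PCs) and assess whether RG-PCs can be used to control for population structure in gene expression analyses. Using whole blood samples from understudied Nepalese populations and the Geuvadis study, we show that RG-PCs gave results comparable to paired array-based genotypes, with high genotype concordance and high correlations of genetic principal components, and captured subpopulations within the dataset. In differential gene expression analysis, we found that including RG-PCs as covariates reduced test statistic inflation. Our paper demonstrates that genetic population structure can be directly inferred and controlled for using RNAseq data, thus facilitating improved retrospective and future analyses of transcriptomic data.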

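RNA sequencing (RNAseq) has revolutionized our understanding of the transcriptome, providing both an accurate method for quantifying gene expression and a means of identifying specific alternative splice sites and cell-type-specific transcripts1,2. Its applications extend to clinical settings, where it can further elucidate complex diseases and identify promising biomarkers for both infectious and non-infectious diseases3.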

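However, studies using RNAseq rarely consider the germline genetic variation that is also contained within RNAseq read sets. Studies that do not use this information may be vulnerable to biases and confounding, such as population stratification, which can affect transcription between groups4,5,6,7. To overcome this problem, researchers have typically relied on matched genome-wide array or whole-genome sequencing (WGS) data for the same individuals profiled with RNAseq. This allows researchers to deploy approaches that control for population stratification, such as calculating genetic principal components (PCs) and using them as covariates in subsequent statistical association models8,9,10. Genetic PCs are taken to represent latent genetic structure within and between populations, which can cause confounding through differences in social environment11 or, in the case of differential gene expression, through heterogeneity of quantitative trait loci between populations. However, requiring genome-wide array or WGS data to be matched with RNAseq data is potentially unnecessary, and may be infeasible in resource-limited settings such as low- and lower-middle-income countries (LMICs), which have highly diverse and understudied populations.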

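It has been demonstrated that genotypes can be called from RNAseq data using tools such as GATK12,13,14. Approaches that use RNAseq data to capture genetic structure have been applied for livestock and agricultural purposes15,16,17,18, for example to investigate the population structure, history and adaptation of domesticated barley (Hordeum vulgare)17. Although the proof of concept and subsequent utility of RNAseq-based genotypes have been demonstrated, for instance for tissue-specific variants19, their application to inferring human population structure is promising but remains relatively underexplored20.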

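The aims of this study are (i) to demonstrate that RNAseq-based genotypes can capture the genetic population structure of a diverse yet understudied human population, and (ii) to show that using RNAseq-based genetic principal components (RG-PCs) effectively controls for population structure in association analyses. Here, we collected and generated whole blood RNAseq data from 376 individuals from Nepal, a landlocked country in the Himalayas that is home to more than 125 ethnic groups21,22. We developed an RNAseq analysis pipeline (RGStraP) that calculates genetic principal components directly from RNAseq data, and validated the performance of RGStraP using genome-wide array genotype data from the same Nepalese individuals. We also tested the pipeline on samples from the Geuvadis consortium, which comprises 465 samples with paired genotype and RNAseq data from five of the 1000 Genomes populations23. Finally, we show the effectiveness of adjusting for RG-PCs in association analyses that identify sex-specific gene expression. Overall, our study demonstrates that human population structure, particularly that of understudied yet diverse populations, can be effectively captured and directly controlled for using RNAseq data.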

[…] 0.05 and a pairwise LD threshold of r² < 0.05 struck the optimal balance of offering the most variants for analysis and the highest correlation between RNAseq- and array-based genetic PCs (Supplementary Fig. 2). From the total of 4,921,472 genetic variants, 152,072 SNPs passed the MAF filter (MAF > 0.05), and 36,440 SNPs further passed the LD filter (LD < 0.05). Genetic variants from paired genomic data are available for 299 out of the initial 376 individuals; a total of 552,758 SNPs were identified and passed initial quality control filters (Methods), of which 315,615 SNPs and 29,943 SNPs then passed MAF > 0.05 and further LD < 0.05 filters, respectively. Out of the 299 samples with both RNAseq and paired array genotypes, 280 of them passed quality control and were used for further downstream analyses.

[…] >0.90 concordances. b Canonical correlation analysis between ten RG-PCs and ten array PCs showed significant (Wilks’ Lambda, p-value < 0.05) correlations for the first 7 canonical variates (CVs) between the two sets. The first 3 CVs from 10 RG-PCs strongly captured the genetic information from array PCs (Rc1 = 0.946, Rc2 = 0.864, Rc3 = 0.853), in which the cumulative proportion of shared variance between the two sets reached up to 0.956 from just the 3 CVs.

[…] > 0.05) variants, of which 4,887 passed the LD filter (LD < 0.05) and were used to calculate RG-PCs. We also calculated genetic PCs from the 29,943 paired genotype array SNPs as a measure of true genetic structure to be compared against RG-PCs. To assess the consistency of inferred population structure between the two approaches, we calculated Spearman correlation between genetic PCs from paired genotype array SNPs and the RG-PCs. PC1 of both RNAseq and array sets correlated strongly with each other (|ρ| = 0.93), followed by RG-PC3 and PC2 from array data (|ρ| = 0.61) and RG-PC2 and PC3 from array data (|ρ| = 0.6) (Supplementary Fig. 4).
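
The comparison of RG-PCs against array PCs by absolute Spearman correlation can be illustrated with a minimal standard-library sketch. This is not the RGStraP implementation: the function names and the toy scores are hypothetical, and the only idea taken from the text is that each RG-PC is correlated with each array PC and reported as |ρ|, since the sign of a principal component is arbitrary.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def abs_spearman(x, y):
    """Absolute Spearman correlation: Pearson correlation of the ranks, sign dropped."""
    return abs(pearson(ranks(x), ranks(y)))


def pc_correlation_matrix(rg_pcs, array_pcs):
    """Correlate every RG-PC with every array PC.

    Both arguments map a PC name to per-sample scores, with samples in the same order.
    """
    return {
        (rg_name, arr_name): abs_spearman(rg_scores, arr_scores)
        for rg_name, rg_scores in rg_pcs.items()
        for arr_name, arr_scores in array_pcs.items()
    }


# Toy scores for five samples (made-up numbers, not data from the study).
rg_pcs = {"RG-PC1": [0.9, 0.4, -0.1, -0.5, -0.8]}
array_pcs = {
    "PC1": [-1.2, -0.3, 0.2, 0.6, 1.1],
    "PC2": [0.1, -0.4, 0.5, -0.2, 0.3],
}
matrix = pc_correlation_matrix(rg_pcs, array_pcs)
for pair, rho in matrix.items():
    print(pair, round(rho, 2))
```

In the toy data, RG-PC1 orders the samples in exactly the reverse of array PC1, so the raw correlation is negative while the absolute value still shows that the two components describe the same axis.
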
As expected, the genetic PCs of one approach do not exclusively correspond to only one PC of the other approach, as can be seen with significant correlations of a single array PC with several RG-PCs. To investigate this further, we performed canonical correlation analysis between the top 10 array PCs and the RG-PCs and found that the RG-PCs fully explained the variance of the top 10 array PCs (Fig. 2b).

[…] > 0.05) to account for differences in sequencing depths. Only autosomal genes were included in the analyses.

[…] > 1) in the set without considering genetic PCs, and the number decreased to 3 when including either array or RG-PCs. This demonstrates how RG-PCs control for population stratification in downstream RNAseq analysis similar to the genetic PCs calculated from paired array genotypes, reducing significant associations that reflected variations in population structure instead of the biology of interest.

[…] >38.5 °C temperature or history of fever for >72 h. From the total blood sample volumes (≤16 mL for patients >16 years of age, ≤7 mL for ≤16 years), aliquots were subjected to (i) bacteriological culture to identify presence of Salmonella enterica serovars Typhi (S. Typhi); (ii) storage in PAXgene tubes for later RNA extraction; and (iii) DNA extraction and subsequent human genotyping. Blood was also collected from healthy participants in the serosurvey (≤8 mL for patients >16 years of age, ≤7 mL for ≤16 years), from which aliquots were also subjected to (i) serological analysis; (ii) PAXgene storage for RNA analysis; and (iii) DNA extraction.

[…] > 0.05 in at least 20% of the samples from the analyses. Differential gene expression (DGE) analysis was done contrasting males and females using edgeR43,44, taking into account age, disease group, and sequencing batches; we ran the analyses with and without population structure PCs as an additional covariate to then compare how genetic structure may stratify gene expression.
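
To make the "with and without PCs" comparison concrete, the sketch below lays out the two covariate sets as plain design-matrix rows. The study fitted its models with edgeR in R; this Python fragment only illustrates how the covariates named in the text (sex, age, disease group, sequencing batch, and optionally genetic PCs) could be arranged. The field names, sample records and PC scores are all hypothetical.

```python
def one_hot(value, levels):
    """Dummy-code a categorical value, dropping the first level as the reference."""
    return [1.0 if value == level else 0.0 for level in levels[1:]]


def design_matrix(samples, pcs=None):
    """Build one design row per sample.

    Columns: intercept, sex (male = 1), age, disease-group dummies, batch dummies,
    then genetic PC scores if `pcs` (sample id -> list of scores) is given.
    """
    groups = sorted({s["disease_group"] for s in samples})
    batches = sorted({s["batch"] for s in samples})
    rows = []
    for s in samples:
        row = [1.0, 1.0 if s["sex"] == "male" else 0.0, float(s["age"])]
        row += one_hot(s["disease_group"], groups)
        row += one_hot(s["batch"], batches)
        if pcs is not None:
            row += list(pcs[s["id"]])
        rows.append(row)
    return rows


# Hypothetical sample sheet and PC scores (not data from the study).
samples = [
    {"id": "S1", "sex": "male", "age": 25, "disease_group": "control", "batch": "B1"},
    {"id": "S2", "sex": "female", "age": 31, "disease_group": "case", "batch": "B2"},
    {"id": "S3", "sex": "female", "age": 19, "disease_group": "control", "batch": "B2"},
]
pcs = {"S1": [0.12, -0.03], "S2": [-0.20, 0.08], "S3": [0.05, 0.01]}

base = design_matrix(samples)           # model without population structure PCs
adjusted = design_matrix(samples, pcs)  # same model plus two PC covariates
print(len(base[0]), len(adjusted[0]))
```

The sex column is the coefficient of interest in both models; the only difference between the two designs is the trailing PC columns, so any change in the number of significant genes can be attributed to adjusting for population structure.
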
From both results, we also plotted the Q-Q plot and calculated the systematic inflation (m), which is the ratio of the median of the empirically observed chi-squared test statistics (in our case, results of DGE analysis with RG-PCs) to the expected median chi-squared test statistics (results of DGE analysis without RG-PCs), to quantify the stratification due to population structure in gene expression data.
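
The inflation ratio described above can be sketched in a few lines of standard-library Python. The sketch assumes that each gene's DGE result is available as a two-sided p-value and that it is converted to a chi-squared statistic with one degree of freedom; that conversion, the function names and the example p-values are assumptions for illustration, not details given in the text. The last function shows the widely used genomic inflation convention (observed median over the null median of about 0.455) purely for comparison.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def chi2_from_pvalue(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-squared statistic."""
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # lower-tail quantile; its sign is irrelevant
    return z * z


def median_chi2(pvalues):
    """Median chi-squared statistic across all tested genes."""
    return median(chi2_from_pvalue(p) for p in pvalues)


def inflation_ratio(p_with_pcs, p_without_pcs):
    """Ratio of median chi-squared statistics, following the definition in the text."""
    return median_chi2(p_with_pcs) / median_chi2(p_without_pcs)


def genomic_inflation(pvalues):
    """Conventional inflation factor: observed median over the null 1-df median."""
    null_median = chi2_from_pvalue(0.5)  # about 0.455
    return median_chi2(pvalues) / null_median


# Made-up p-values for three genes under each model (not results from the study).
p_without = [0.10, 0.20, 0.30]
p_with = [0.40, 0.50, 0.60]
print(round(inflation_ratio(p_with, p_without), 3))
print(round(genomic_inflation(p_without), 3))
```

A ratio below 1 means the test statistics are smaller on the whole once PCs are included, which is the direction expected if part of the original signal reflected population structure.
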