What is a reference panel, and what role does it play in SNP genotyping?

2023-10-09 by @chenglonglcl1-9PcS_ & @GPT-4. Model: gpt-4

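A reference panel is a dataset of individuals whose genotypes are already known, usually at high density. In SNP genotyping it serves as the basis for comparison that lets researchers determine unknown genotypes. Although each individual's genotype is unique, certain combinations of alleles are much more common than others within a large population. By comparing a sample against the known genotypes in the reference panel, the most likely genotype combination can be inferred.

More specifically, some of a sample's genotypes cannot be obtained directly from the assay, and these are inferred from the reference panel. Examples include variants that the genotyping platform does not measure directly, such as some insertions, deletions, and copy number variants. Low-frequency variants can also be hard to call accurately from sequencing alone, and the reference panel helps there as well.

A reference panel can also be used for quality control: comparing results against known genotypes is a way to check the accuracy of sequencing or genotyping.

In short, a reference panel provides the reference information that genotyping and genotype inference rely on.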

chenglonglcl1-9PcS_

Based on the study description below, please lay out a detailed technical workflow: The phenotype and genotype data of 21 traits were collected from 2,622 individuals over the past decade, including our newly collected data from the above 120 cattle. The summary of phenotype records is shown in Table S1. All 2,622 individuals were genotyped using the Illumina Bovine 770K Bead chip. The SNPs with minor allele frequencies < 0.05, genotype call rates < 90%, located in non-autosome and significant Hardy–Weinberg disequilibrium at $1 \times 10^{-6}$, as well as samples with call rates < 90% were removed from the downstream analysis using PLINK 1.90 [33]. After quality control, a total of 590,065 variants remained. We then imputed the SNPs to sequence variants level based on a multi-breed reference panel by Beagle 5.4 [34]. The reference panel consists of 1,847 individuals (including 113 Simmental cattle) and was downloaded from https://www.ebi.ac.uk/ena/browser/view/ERZ1738264 [35]. We removed variants with MAF < 0.05 or dosage R-squared (DR2) < 0.8. After quality control (same as above), 8,221,244 autosomal variants were obtained for GWAS and eQTL mapping. The average DR2 of the imputed variants was 0.92. Before performing GWAS, the phenotype was adjusted by year, sex, age, and the first two principal components (PCs) of the genotype generated by PLINK and normalized by rank-transformation using the transform function in GenABEL [36]. We performed an association test for each SNP based on a linear mixed model:

$$\mathbf{y} = \mathbf{X}\beta + \mathbf{u} + \boldsymbol{\varepsilon},$$

where $\mathbf{y}$ is the adjusted phenotype; $\mathbf{X}$ is a vector of genotypes of a variant at the locus tested; $\beta$ is the effect size of the variant; $\mathbf{u}$ is a vector of random polygenic effects, $\mathbf{u} \sim N(0, \mathbf{G}\sigma_g^2)$, where $\mathbf{G}$ is the genomic relationship matrix constructed from all variants; and $\boldsymbol{\varepsilon}$ is a vector of residual errors. Variance component estimation via restricted maximum likelihood (REML) analysis was implemented in GCTA software [37]. We used P-value < $5 \times 10^{-8}$ as the significance threshold of GWAS for all traits. We compared GWAS results with Cattle QTLdb (release 47, Apr 25, 2022) [1]. The QTLs that were within ± 100 kb of a QTL/association for the same trait(s) of the Cattle QTLdb were treated as the newly detected QTLs.

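Based on this study description, the detailed technical workflow is as follows:

Collect phenotype and genotype data for 21 traits from 2,622 individuals over the past decade, including the newly collected data from 120 cattle.
Genotype all 2,622 individuals with the Illumina Bovine 770K Bead chip.
Clean the data with PLINK 1.90: remove SNPs with minor allele frequency (MAF) < 0.05, SNPs with genotype call rate < 90%, SNPs not located on autosomes, and SNPs in significant Hardy–Weinberg disequilibrium at $1 \times 10^{-6}$, and remove samples with call rate < 90%.
After quality control, 590,065 variants remain.
Impute the SNPs to sequence-variant level with Beagle 5.4, using a multi-breed reference panel. The panel contains 1,847 individuals (including 113 Simmental cattle) and was downloaded from the European Nucleotide Archive (accession ERZ1738264).
Remove variants with MAF < 0.05 or dosage R-squared (DR2) < 0.8.
After quality control (same as above), 8,221,244 autosomal variants are retained for GWAS and eQTL mapping.
Before GWAS, adjust the phenotype for year, sex, age, and the first two principal components of the genotypes (generated by PLINK), then normalize it by rank transformation with the transform function in GenABEL.
Test each SNP for association using a linear mixed model.
Estimate variance components by restricted maximum likelihood (REML) in GCTA.
Use P-value < $5 \times 10^{-8}$ as the GWAS significance threshold for all traits.
Compare the GWAS results with Cattle QTLdb (release 47, April 25, 2022). QTLs within ±100 kb of a QTL/association for the same trait(s) in Cattle QTLdb are treated as newly detected QTLs.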
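The SNP and sample filters in the quality-control step can be illustrated with a minimal NumPy sketch. This is not what PLINK does internally: the function name `qc_filter`, the 0/1/2 coding with `np.nan` for missing calls, and the choice to filter samples before SNPs are all assumptions made for illustration, and the Hardy–Weinberg and autosome filters are left out.

```python
import numpy as np

def qc_filter(genotypes, maf_min=0.05, snp_call_min=0.90, sample_call_min=0.90):
    """Return boolean masks (keep_samples, keep_snps) for a samples x SNPs matrix.

    Genotypes are coded 0/1/2 (copies of one allele); missing calls are np.nan.
    The order of filters (samples first, then SNPs) is an assumption of this sketch.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)

    # Sample call rate: fraction of SNPs with a non-missing call per sample.
    keep_samples = called.mean(axis=1) >= sample_call_min

    g_kept = g[keep_samples]
    called_kept = called[keep_samples]

    # SNP call rate among the retained samples.
    snp_call = called_kept.mean(axis=0)

    # Allele frequency from non-missing calls, folded to the minor allele.
    n_called = called_kept.sum(axis=0)
    allele_sum = np.nansum(g_kept, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = allele_sum / (2.0 * n_called)
        maf = np.minimum(freq, 1.0 - freq)
        # SNPs with no calls at all have maf = nan and fail the comparison.
        keep_snps = (snp_call >= snp_call_min) & (maf >= maf_min)

    return keep_samples, keep_snps
```

The thresholds default to the values quoted in the study (MAF 0.05, call rate 90%); the filtered matrix would be `genotypes[keep_samples][:, keep_snps]`.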

chenglonglcl1-9PcS_

What is SNP imputation, and what is it used for?

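SNP imputation is a statistical method for inferring the missing parts of genotype data. In genomic studies, cost and technical constraints usually make it impractical to sequence or genotype every SNP directly, so genotypes at many sites are missing. Imputation infers the genotypes at these sites from the known genotypes of a reference population, exploiting the linkage disequilibrium (LD) between neighboring sites.

The main purpose of SNP imputation is to increase the density and completeness of genotype data, so that association studies such as GWAS can examine more SNPs and gain statistical power. Denser data can also bring the analysis closer to the causal variants, which improves the resolution of genotype-phenotype associations. In addition, imputing the data from different studies to a common set of variants makes them comparable and easier to combine, which supports large-sample and cross-cohort analyses.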
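The idea of borrowing alleles from a reference panel can be shown with a deliberately simplified sketch. It copies untyped alleles from the single reference haplotype that best matches the target at the typed sites. This nearest-haplotype rule is a stand-in chosen for illustration; it is not the hidden Markov model that tools such as Beagle use, and the function name and the -1 coding for untyped sites are assumptions.

```python
import numpy as np

def impute_nearest_haplotype(reference, target):
    """Fill missing alleles (-1) in `target` from the closest reference haplotype.

    reference: (n_haplotypes, n_sites) array of 0/1 alleles, no missing values.
    target:    (n_sites,) array of 0/1 alleles with -1 at untyped sites.
    Returns the imputed haplotype and the index of the reference haplotype used.
    """
    reference = np.asarray(reference)
    target = np.asarray(target)
    typed = target >= 0

    # Count mismatches against each reference haplotype at the typed sites only.
    mismatches = (reference[:, typed] != target[typed]).sum(axis=1)
    best = int(np.argmin(mismatches))  # ties resolve to the first haplotype

    imputed = target.copy()
    imputed[~typed] = reference[best, ~typed]
    return imputed, best

# Toy data: three reference haplotypes, five sites; the target is typed at sites 0, 2, 4.
reference = np.array([[0, 0, 1, 1, 0],
                      [1, 1, 0, 0, 1],
                      [0, 1, 1, 0, 0]])
target = np.array([1, -1, 0, -1, 1])
imputed, best = impute_nearest_haplotype(reference, target)
```

Real imputation methods model each target haplotype as a mosaic of several reference haplotypes and return genotype probabilities rather than a single hard call.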

chenglonglcl1-9PcS_

What is the average DR2, and what is it used for?

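DR2, short for dosage R-squared, is a measure of imputation quality. It is the estimated squared correlation between the imputed allele dosage and the true allele dosage at a variant. DR2 ranges from 0 to 1; the closer it is to 1, the higher the imputation quality, meaning the imputed genotypes are closer to the true genotypes.

The average DR2 is simply the mean DR2 across all imputed SNPs. It summarizes overall imputation quality: the higher the average, the more reliable the imputed data as a whole. Association analyses need strict control of imputation quality, so usually only SNPs whose DR2 reaches a chosen threshold (0.8 in this study) are kept for downstream analysis.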
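As a rough illustration of what the statistic measures, the sketch below computes a squared correlation per SNP between true and imputed dosages, then averages it and applies a 0.8 cutoff. Note the assumption: here the true dosages are known, which is only the case in a validation setting. Beagle instead estimates DR2 from the imputed genotype probabilities without access to the truth. The function name and toy values are hypothetical.

```python
import numpy as np

def dosage_r2(true_dosage, imputed_dosage):
    """Squared Pearson correlation per SNP (columns) between two dosage matrices."""
    t = np.asarray(true_dosage, dtype=float)
    d = np.asarray(imputed_dosage, dtype=float)
    t_c = t - t.mean(axis=0)
    d_c = d - d.mean(axis=0)
    cov = (t_c * d_c).sum(axis=0)
    denom = (t_c ** 2).sum(axis=0) * (d_c ** 2).sum(axis=0)
    # A SNP with zero variance in either matrix has no defined correlation;
    # report 0 for it instead of dividing by zero.
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, cov ** 2 / safe, 0.0)

# Toy data: 4 samples x 2 SNPs. SNP 0 is imputed well, SNP 1 is uninformative.
true = np.array([[0, 2], [1, 0], [2, 1], [1, 1]])
imputed = np.array([[0.1, 1.0], [1.0, 1.0], [1.9, 1.0], [1.0, 1.0]])

r2 = dosage_r2(true, imputed)
average_r2 = r2.mean()   # analogue of the "average DR2" summary
keep = r2 >= 0.8         # analogue of the DR2 >= 0.8 filter
```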

chenglonglcl1-9PcS_

Here is a passage from a methods section: Before performing GWAS, the phenotype was adjusted by year, sex, age, and the first two principal components (PCs) of the genotype generated by PLINK and normalized by rank-transformation using the transform function in GenABEL. What does "the first two principal components of the genotype" refer to?

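Principal component analysis (PCA) is a widely used statistical method that finds the directions in high-dimensional data that explain the most variance. Applied to genotype data, principal components are typically used to capture differences in genetic background between samples.

The first two principal components (PCs) of the genotype are the two components that explain the most variance after PCA is run on the genotype data. They reflect the main axes of genetic differentiation among the samples and can be used to describe the samples' genetic structure or population history.

In a genome-wide association study (GWAS), differences in genetic background between samples can produce false positives. Researchers therefore include the first two genotype PCs as covariates when adjusting the phenotype, to remove this potential confounder. For example, if the samples come from different populations, the PCs can be used to correct for population structure.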
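The two steps described here, extracting PCs and adjusting the phenotype for them, can be sketched with NumPy. This is an illustrative approximation, not PLINK's routine: the PCs come from an SVD of the standardized genotype matrix, the adjustment is plain least-squares residualization, and both function names are hypothetical.

```python
import numpy as np

def top_pcs(genotypes, n_pcs=2):
    """Return the first n_pcs principal-component scores for a samples x SNPs matrix."""
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)             # center each SNP
    sd = g.std(axis=0)
    g = g[:, sd > 0] / sd[sd > 0]      # drop monomorphic SNPs, scale the rest
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]    # sample scores on the leading components

def adjust_phenotype(y, covariates):
    """Residuals of y after ordinary least squares on an intercept plus covariates."""
    y = np.asarray(y, dtype=float)
    x = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    return y - x @ beta
```

In the study, year, sex, and age would be added as further columns of `covariates` alongside the two PCs, and the residuals would then be rank-transformed before the association test.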

chenglonglcl1-9PcS_

After the phenotype is adjusted in a GWAS, it is rank-transformed with the transform function in GenABEL. Why is this done, and what does it achieve?
