MMGS (multiple-environments multiple-methods genomic selection) is a comprehensive R package developed by Mingjia Zhu. It integrates the polygenic environmental interaction (PEI) and reaction norm (RE) methods with 15 prediction models that span parametric, semi-parametric and non-parametric approaches.
The RE model consists of four steps: (1) use the CERIS algorithm (Guo, 2021) to identify the environmental index that explains the largest proportion of phenotypic variation; (2) regress the observed phenotypes on that environmental index to obtain an intercept and a slope estimate for each tested genotype; (3) treat the intercept and slope as new "traits" and run genomic prediction by ridge regression to predict the intercept and slope of each untested genotype; (4) predict the phenotypes of the untested genotypes from the predicted intercept and slope and the environmental index value of each environment. Like the RE model, the PEI model starts by identifying the key environmental index that best captures the phenotypic variation.
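The four RE steps can be illustrated with a minimal base-R sketch on simulated data. Everything in it is an assumption made for illustration: the object names, the simulated markers, the fixed ridge penalty `lambda`, and the environmental index, which is taken as given rather than identified by CERIS. It is not the MMGS implementation.

```r
# Reaction-norm sketch on simulated data (hypothetical names, base R only).
set.seed(1)
n_line <- 60; n_env <- 6; n_snp <- 200

# Step 1 (assumed already done): one environmental index value per environment.
env_index <- sort(runif(n_env, 10, 30))

# Simulated markers (coded -1/0/1) and a true intercept and slope per line.
geno_mat   <- matrix(sample(c(-1, 0, 1), n_line * n_snp, replace = TRUE), n_line, n_snp)
true_int   <- 100 + as.vector(geno_mat %*% rnorm(n_snp, 0, 0.5))
true_slope <- 2 + as.vector(geno_mat %*% rnorm(n_snp, 0, 0.02))
pheno_mat  <- true_int + outer(true_slope, env_index) +
  matrix(rnorm(n_line * n_env), n_line, n_env)

tested <- 1:45; untested <- 46:60

# Step 2: regress each tested line's phenotypes on the environmental index.
coefs <- t(apply(pheno_mat[tested, ], 1, function(y) coef(lm(y ~ env_index))))
colnames(coefs) <- c("intercept", "slope")

# Step 3: ridge regression on markers, applied to intercept and slope in turn.
ridge_predict <- function(X_train, y_train, X_new, lambda = 10) {
  mu <- mean(y_train)
  beta <- solve(crossprod(X_train) + lambda * diag(ncol(X_train)),
                crossprod(X_train, y_train - mu))
  as.vector(mu + X_new %*% beta)
}
pred_int   <- ridge_predict(geno_mat[tested, ], coefs[, "intercept"], geno_mat[untested, ])
pred_slope <- ridge_predict(geno_mat[tested, ], coefs[, "slope"], geno_mat[untested, ])

# Step 4: combine predicted intercept and slope with each environment's index.
pred_pheno <- pred_int + outer(pred_slope, env_index)

# Correlation between predicted and simulated phenotypes of the untested lines.
accuracy <- cor(as.vector(pred_pheno), as.vector(pheno_mat[untested, ]))
```

In practice the penalty would be tuned or estimated rather than fixed, and the index would come from the environmental search step shown later in this README.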
These prediction models fall into three major categories: parametric, semi-parametric, and non-parametric (Admas et al., 2024). The parametric models include mixed linear models such as genomic best linear unbiased prediction (G-BLUP) (VanRaden, 2008), BayesA (BA) and BayesB (BB) (Meuwissen et al., 2001), BayesC (BC) (George and McCulloch, 1993), Bayesian ridge regression (BRR) (Erbe et al., 2012), Bayesian LASSO (BL) (Park and Casella, 2008), least absolute shrinkage and selection operator (LASSO) (Usai et al., 2009), ridge regression (RR) (Whittaker et al., 2000), ridge regression best linear unbiased prediction (RR-BLUP) (Meuwissen et al., 2001), and elastic net (EN) (Zou and Hastie, 2005). The semi-parametric models are the reproducing kernel Hilbert space (RKHS) model and multiple-kernel RKHS (MKRKHS) (Gianola et al., 2006). The non-parametric models are support vector machine (SVM) (Maenhout et al., 2007), random forest (RF) (Chen and Ishwaran, 2012), and gradient boosting machine (GBM) (Li et al., 2018).
Install the package from GitHub:
devtools::install_github("Ryougi-yukiro/MMGS")
The package ships with built-in data from a hybrid population: environmental data from multiple locations, filtered genotype data, and flowering-related phenotypic data. The dataset is small, which makes it easy for beginners to see how the package is used.
#Load the required packages
library("MMGS")
library("dplyr") # used for reshaping and melting data
data(trait)
data(geno)
data(env_info)
data("PTT_PTR")
This step explores the basic properties of the data and pre-processes them for the subsequent analysis.
env_trait<-env_trait_calculate(data=trait,trait="FTgdd",env="env_code")
LbyE<-LbyE_calculate(data=trait,trait="FTgdd",env="env_code",line="line_code")
LbyE_corrplot(LbyE=LbyE)
etl<-LbyE_Reshape(data=env_trait,env="env_code",LbyE=LbyE)
etl_plotter(data=etl,trait=env_trait)
Regression<-Reg(LbyE = LbyE, env_trait = env_trait)
#Reg_plotter(Reg = Regression)
result<-line_trait_mean(data=trait,trait="FTgdd",mean=env_trait,LbyE=LbyE,row=2)
MSE<-result[[1]]
ltm<-result[[2]]
#mse_plotter(MSE)
#Mean_trait_plot(Regression,MSE)
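To make the pre-processing concrete, the sketch below computes per-environment trait means and a line-by-environment matrix in base R. The column names `line_code`, `env_code` and `FTgdd` come from the calls above; the toy values and the object names are invented, and the MMGS functions may compute or store these quantities differently.

```r
# Toy long-format data; the values are made up for illustration.
toy <- data.frame(
  line_code = rep(c("L1", "L2", "L3"), times = 2),
  env_code  = rep(c("E1", "E2"), each = 3),
  FTgdd     = c(1500, 1520, 1480, 1600, 1650, 1590)
)

# Mean trait value per environment.
env_mean <- aggregate(FTgdd ~ env_code, data = toy, FUN = mean)

# Line-by-environment matrix: one row per line, one column per environment.
line_by_env <- tapply(toy$FTgdd, list(toy$line_code, toy$env_code), mean)

# Pairwise correlations between environments across lines,
# the kind of quantity a correlation plot would display.
env_cor <- cor(line_by_env, use = "pairwise.complete.obs")
```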
This step finds the most relevant environmental factors, which provide a solid basis for the subsequent predictions.
Paras <- colnames(PTT_PTR)[-c(1:4)]
# window search
pop_cor<-Exhaustive_search(data=env_trait, env_paras=PTT_PTR, searching_daps=122,
p=1, dap_x=122,dap_y=122,LOO=0,Paras=Paras)
#plot
#Exhaustive_plotter(Correlation=pop_cor,dap_x=122, dap_y=122,p=1,Paras=Paras)
#correlation
envMeanPara<-envMeanPara(data=env_trait, env_paras=PTT_PTR, maxR_dap1=18,
maxR_dap2=43, Paras=Paras)
#plot
#envMeanPara_plotter(data=envMeanPara,Paras=Paras)
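The idea behind an exhaustive window search can be sketched in base R: average one environmental parameter over every candidate window of days, correlate that average with the environment means, and keep the window with the strongest correlation. The simulated data, the object names and the minimum window length `min_len` are assumptions for illustration. The link between the best window bounds and the `maxR_dap1` and `maxR_dap2` arguments above is also an assumption, not something this README states.

```r
# Exhaustive window search on simulated data (hypothetical names, base R only).
set.seed(42)
n_env <- 8; n_days <- 40

# Daily values of one environmental parameter: rows = environments, cols = days.
daily <- matrix(rnorm(n_env * n_days, 20, 3), n_env, n_days)

# Simulated environment means, driven by the average over days 10 to 20.
env_mean <- 50 + 4 * rowMeans(daily[, 10:20]) + rnorm(n_env, 0, 0.5)

# Evaluate every window [d1, d2] that spans at least min_len days.
min_len <- 5
res <- list()
for (d1 in 1:(n_days - min_len + 1)) {
  for (d2 in (d1 + min_len - 1):n_days) {
    idx <- rowMeans(daily[, d1:d2, drop = FALSE])  # window average per environment
    res[[length(res) + 1]] <- data.frame(d1 = d1, d2 = d2, r = cor(idx, env_mean))
  }
}
res <- do.call(rbind, res)

# Window with the largest absolute correlation.
best <- res[which.max(abs(res$r)), ]
```

With only a handful of environments, many windows can reach a high correlation by chance, so the selected window is best read as a candidate rather than a definitive result.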
Users can choose the model they need; otherwise the function falls back to its default. The environment parameters can be taken from the previous results, `fold` is the number of cross-validation folds, and `reshuffle` is the number of repetitions.
#Check pheno
pheno<-LbyE[which(as.character(LbyE$line_code)%in%c("line_code",as.character(geno$line_code))),];
#CV
out<-MMGP(pheno=pheno, geno=geno, env=env_info,para=envMeanPara, Para_Name=Paras[1], depend="PEI",model="BB", kernel="linear", fold=2, reshuffle=5, methods="RM.G")
#result
#> mean(out[[3]])
#[1] 0.8728506
#> apply(out[[2]],2,mean)
# PR12 IA14 PR11 IA13 PR14S KS11 KS12
#0.5418663 0.3868576 0.5381628 0.4759335 0.4871427 0.6219213 0.6380658
#head(out[[1]])
# obs pre col para
#1 1595.988 1588.782 #FF0000 PR12
#2 1512.918 1576.437 #FF0000 PR12
The correlation reported here is between the predicted and the observed phenotypes for each environment, not between breeding values. If you need the latter, please compute it yourself (until the R package is updated).
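The roles of `fold` and `reshuffle` can be illustrated with a minimal base-R sketch of repeated k-fold cross-validation. It uses simulated markers and a plain ridge regression with a fixed penalty as a stand-in model. The names and the simulated data are hypothetical, and the internals of `MMGP` may differ.

```r
# Repeated k-fold cross-validation sketch (hypothetical names, base R only).
set.seed(7)
n <- 100; p <- 300
X <- matrix(sample(c(-1, 0, 1), n * p, replace = TRUE), n, p)
y <- as.vector(X %*% rnorm(p, 0, 0.3)) + rnorm(n)

# Ridge regression with a fixed penalty, used here as a stand-in model.
ridge_fit_predict <- function(X_tr, y_tr, X_te, lambda = 50) {
  mu <- mean(y_tr)
  beta <- solve(crossprod(X_tr) + lambda * diag(ncol(X_tr)),
                crossprod(X_tr, y_tr - mu))
  as.vector(mu + X_te %*% beta)
}

fold <- 2; reshuffle <- 5
acc <- numeric(reshuffle)
for (r in seq_len(reshuffle)) {
  # Draw a new random assignment of lines to folds for each repetition.
  fold_id <- sample(rep(seq_len(fold), length.out = n))
  pred <- numeric(n)
  for (k in seq_len(fold)) {
    held_out <- fold_id == k
    pred[held_out] <- ridge_fit_predict(X[!held_out, ], y[!held_out], X[held_out, ])
  }
  # Correlation between predicted and observed phenotypes for this repetition.
  acc[r] <- cor(pred, y)
}
mean_acc <- mean(acc)
```

Averaging over repetitions reduces the dependence of the accuracy estimate on any single random split.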
# Leave-one-environment-out prediction: mask each environment in turn
# kernel options: linear, polynomial, radial, sigmoid
#library(dplyr)
for (i in envMeanPara$env_code) {
  pheno <- LbyE
  pheno[[i]] <- NA  # hide the phenotypes of environment i
  out <- MMPrdM(pheno=pheno, geno=geno, env=env_info, para=envMeanPara,
                Para_Name=c("PTS"), depend="PEI",
                SVM_cost=1, gamma=10, kernel="linear", fixed=TRUE,
                model="SVM", reshuffle=1, methods="RM.G")
  r <- cor(out[,2], LbyE[[i]])  # predicted vs. observed in environment i
  print(paste(i, " : ", r))
}
For some non-parametric algorithms, the adjustable settings are listed below.
# SVM: four kernels are available
#   linear:       u'v
#   polynomial:   (gamma * u'v + coef0)^degree
#   radial basis: exp(-gamma * |u - v|^2)
#   sigmoid:      tanh(gamma * u'v + coef0)
#GBM function
if(is.null(GBM_params)){
params <- list(boosting="gbdt",objective = "regression",metric = "RMSE",min_data = 1L,
learning_rate = 0.01,num_iterations=1000,num_leaves=3,max_depth=-1,
early_stopping_round=50L,cat_l2=10,skip_drop=0.5,drop_rate=0.5,
cat_smooth=5)
}
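For reference, the four SVM kernels listed above can be written as plain R functions of two numeric vectors. This is only a sketch of the formulas: the function names and the default values of `gamma`, `coef0` and `degree` are arbitrary choices for illustration, not the package's defaults.

```r
# The four SVM kernels as plain R functions (hypothetical names).
k_linear  <- function(u, v) sum(u * v)
k_poly    <- function(u, v, gamma = 1, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}
k_radial  <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))
k_sigmoid <- function(u, v, gamma = 1, coef0 = 0) tanh(gamma * sum(u * v) + coef0)

u <- c(1, 2, 3)
v <- c(1, 0, -1)

k_linear(u, v)   # 1*1 + 2*0 + 3*(-1) = -2
k_poly(u, v)     # (-2)^3 = -8
k_radial(u, u)   # identical vectors give exp(0) = 1
```

The radial basis kernel depends only on the distance between the two vectors, so a larger `gamma` makes the similarity fall off faster.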
See the original repository for full documentation of the following functions:
- env_trait_calculate
- envMeanPara
- envMeanPara_plotter
- etl_calculate
- etl_plotter
- Exhaustive_plotter
- Exhaustive_search
- MMGS
- h2_rrBLUP
- LbyE_calculate
- LbyE_corrplot
- line_trait_mean
- ltm_plotter
- Mean_trait_plot
- mse_plotter
- prdM_plotter
- Reg
- Reg_plotter
- Slope_Intercept
MMGS is a collection of tools for cross-environment genomic selection that integrates 15 genomic prediction models across the parametric, semi-parametric and non-parametric categories. You can supply your own data in the same format as the sample data and obtain results directly through the built-in functions, with little additional statistical or programming background. It also spares users from having to find separate tools and adapt each of them to cross-environment prediction.