GWAS-ROCS Database: Field Documentation

Introduction to the GWAS-ROCS Database

GWAS-ROCS is a database of SNP-derived areas under the receiver operating characteristic curve (AUROCs). All of the data is taken directly from, or derived from, studies accessible through PubMed or GWAS Central, an open-access online repository of summary-level genome-wide association study (GWAS) data. Each study simulation record (GR-Card) contains study information such as SNP data (odds ratios, risk allele frequencies, and p-values) and simulated population data (e.g. ROC curves, AUROCs, SNP-heritability scores).

SNP: A single nucleotide polymorphism examined in a particular genome-wide association study (GWAS).
Simulated Population: A CSV file of computer-generated individuals, each marked as either a case or a control and assigned a value for the presence of the risk allele (1 = has the risk allele, 0 = does not have the risk allele) at each SNP previously identified as significant.
Study Simulation: A GWAS-ROCS Database record uniquely identified by a GWAS-ROCS ID and displayed in a GR-Card that details a simulated population modelled after a publicly available genome-wide association study.
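
To make the definition of a simulated population concrete, the following is a minimal sketch of how such a CSV file could be generated from a SNP's odds ratio and its risk allele frequency in controls. It is an illustration under stated assumptions, not the procedure used to build the database: the function names, the column layout, the example SNP names and values, and the choice to draw SNPs independently are all hypothetical. The case-group frequency is derived from the odds ratio definition (odds in cases = odds ratio × odds in controls).

```python
import csv
import io
import random


def case_frequency(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by the control frequency and odds ratio.

    Assumes 0 <= control_freq < 1 (a frequency of exactly 1 has infinite odds).
    """
    control_odds = control_freq / (1.0 - control_freq)
    case_odds = odds_ratio * control_odds
    return case_odds / (1.0 + case_odds)


def simulate_population(snps, n_cases, n_controls, seed=0):
    """Return rows of [status, snp_1, snp_2, ...]; 1 = case / risk allele present.

    snps: list of (name, odds_ratio, control_freq) tuples (hypothetical layout).
    SNPs are drawn independently of each other, a simplifying assumption.
    """
    rng = random.Random(seed)
    rows = []
    for status, count in ((1, n_cases), (0, n_controls)):
        for _ in range(count):
            row = [status]
            for _name, odds_ratio, control_freq in snps:
                if status == 1:
                    freq = case_frequency(control_freq, odds_ratio)
                else:
                    freq = control_freq
                row.append(1 if rng.random() < freq else 0)
            rows.append(row)
    return rows


def to_csv(snps, rows):
    """Serialise the simulated population as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["status"] + [name for name, _, _ in snps])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical SNPs: (name, odds ratio, risk-allele frequency in controls)
example_snps = [("rs0000001", 1.5, 0.30), ("rs0000002", 1.2, 0.10)]
population = simulate_population(example_snps, n_cases=1000, n_controls=1000)
csv_text = to_csv(example_snps, population)
```

Fixing the random seed keeps the sketch reproducible; a different seed gives a different but statistically equivalent population.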

Field	Description
Creation Date	The date/time the record was created.
Last Updated	The date/time the record was last updated.
GWAS-ROCS ID	A unique GWAS-ROCS accession number consisting of a two-letter prefix (GR) and a five-digit suffix. This ID is used to access the study simulation entry (i.e. GR-Card) via the URL. If an entry is deleted, its GWAS-ROCS ID will not be reused.
Condition/Phenotype	The disease, condition, or phenotype studied by the GWAS.
GWAS Central Data Set ID	The identifier assigned by GWAS Central to the data set on which the GWAS-ROCS record's simulated population was modelled. Clicking on the identifier link takes the user to the GWAS Central page containing the study and its corresponding data set(s).
PubMed ID (PMID)	The identifier assigned by PubMed to the citation of the original research publication on which the GWAS-ROCS record's simulated population was modelled. Clicking on the identifier link takes the user to the publication's PubMed citation page displayed in abstract format.
Meta-Analysis (Yes/No)	Indicates whether a given study simulation is based on a meta-analysis, i.e. a study that combined the results of multiple studies.
Control Subjects	The number of individuals in the control group of the GWAS (i.e. number of individuals without the condition of interest).
Case Subjects	The number of individuals in the case group of the GWAS (i.e. number of individuals with the condition of interest).
SNP Name	The reference SNP cluster ID for a given SNP.
SNP Odds Ratio	The ratio of the odds of a risk allele being present in the case group to the odds of the risk allele being present in the control group. Odds ratios range between zero and infinity. A risk allele with no correlation to a condition would have an odds ratio of 1, whereas a risk allele with perfect correlation would have an odds ratio of infinity or zero.
SNP Allele Frequency	The frequency of a risk allele of a particular SNP being present in the control population. The allele frequency ranges between 0 and 1, with 0 indicating that the risk allele is not present in the control population, and 1 indicating that every individual in the control population has the allele.
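SNP p-value	The p-value of a given SNP, i.e. the probability of observing an association at least as strong as the one reported if the SNP were not actually associated with the condition. A p-value ranges from 0 to 1; the smaller the p-value, the stronger the evidence that the SNP is associated with the condition.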
ROC Plot	Logistic regression, a common modelling method for multi-marker data, was applied to every simulated population in the database to generate ROC curves. Logistic regression was used to model multiple independent variables (i.e. SNPs) to explain two possible outcomes (i.e. healthy or diseased). Once constructed, the logistic regression model assigns every individual a risk score (i.e. disease likelihood) between 0 and 1. Any individual that has a risk score above a given cut-off value is classified as "diseased" and any individual below it is classified as "healthy". A plot of the sensitivity (true positive rate) against 1-specificity (false positive rate) for all possible cut-off values is known as a receiver-operating characteristic (ROC) curve. Ideally, the apex of the ROC curve is as high up and to the left as possible. This would indicate that the model in question is able to accurately distinguish between healthy and diseased individuals on the basis of SNPs alone.
AUROC	Logistic regression, a common modelling method for multi-marker data, was applied to every simulated population in the database to generate ROC curves and calculate the AUROC. The AUROC is the area under the receiver operating characteristic (ROC) curve and is a measure of the classification accuracy of the logistic regression model. A model with perfect classification accuracy (i.e. a model which can perfectly distinguish healthy and diseased individuals on the basis of SNPs alone) would have an AUROC of 1, while a model with no classification accuracy (i.e. one that performs no better than chance) would have an AUROC of 0.5.
Logistic Regression Equation	The logistic regression model applied to a simulated population to generate a ROC curve, expressed as an equation of the form: log(p/(1-p)) = β + β₁x₁ + β₂x₂ + ... + βₙxₙ, where p is the risk score (i.e. disease likelihood), β is the intercept, and βₙ is the coefficient (log odds ratio) of the nth binary variable xₙ. The equation represents the logit-transformed line of best fit. Note: All logistic regression equations were generated using OLS, regardless of which regression type was used to estimate the AUROC.
Heritability	The narrow-sense heritability (denoted by h²) was calculated for every simulated population in the database. This statistic captures the proportion of phenotypic variation (the condition status) due to additive genetic values (the SNPs). The heritability ranges from 0 (no phenotypic variance is explained by genetic factors) to 1 (all phenotypic variance is explained by genetic factors). In this database, the heritability was calculated as the square of Somers’ rank correlation D, where D = 2*AUROC-1. Thus, h² = D² = (2*AUROC-1)².
Regression Type	The type of regression applied to the simulated population to estimate the AUROC. Two types of regression were used: OLS and penalized. OLS stands for ordinary least squares logistic regression. Penalized refers to ridge logistic regression.
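
The ROC Plot, AUROC, Logistic Regression Equation, and Heritability fields describe a chain of calculations, which the following minimal sketch walks through on a toy data set. It assumes the regression coefficients are already known; model fitting is omitted. All function names, coefficient values, and individuals are hypothetical. The AUROC is computed here with the trapezoidal rule, which is one common choice and not necessarily the method used to populate the database.

```python
import math


def risk_score(intercept, coefficients, alleles):
    """Invert the logit: p = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    logit = intercept + sum(b * x for b, x in zip(coefficients, alleles))
    return 1.0 / (1.0 + math.exp(-logit))


def roc_points(labels, scores):
    """(false positive rate, true positive rate) for every distinct cut-off."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all individuals tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(points):
    """Trapezoidal area under the ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def heritability(auroc_value):
    """h^2 = D^2 = (2 * AUROC - 1)^2, as defined in the Heritability field."""
    return (2.0 * auroc_value - 1.0) ** 2


# Hypothetical model: intercept and one coefficient per SNP.
intercept = -1.0
coefficients = [0.8, 0.4]

# Toy individuals: (status, risk-allele presence at each SNP); 1 = case.
individuals = [
    (1, (1, 1)), (1, (1, 0)), (1, (0, 1)),
    (0, (1, 0)), (0, (0, 0)), (0, (0, 0)),
]

labels = [status for status, _ in individuals]
scores = [risk_score(intercept, coefficients, alleles) for _, alleles in individuals]

curve = roc_points(labels, scores)
auroc_value = auroc(curve)        # 5/6 (about 0.833) for this toy data
h2 = heritability(auroc_value)    # 4/9 (about 0.444) for this toy data
```

Individuals with identical risk scores are grouped into a single cut-off, so a tie between a case and a control contributes half a correctly ranked pair to the area, which matches the usual rank-based interpretation of the AUROC.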