Field Documentation
Introduction to the GWAS-ROCS Database
GWAS-ROCS is a database of SNP-derived areas under the receiver operating characteristic curve (AUROCs). All of the data are taken directly from, or derived from, studies accessible through PubMed or GWAS Central, an open-access online repository of summary-level genome-wide association study (GWAS) data. Each study simulation record (GR-Card) contains study information, including SNP data (odds ratios, risk allele frequencies, and p-values) and simulated population data (e.g. ROC curves, AUROCs, and SNP-heritability scores).
- SNP
- A single nucleotide polymorphism examined in a particular genome-wide association study (GWAS).
- Simulated Population
- A CSV file of computer-generated individuals, each marked as either a case or a control and assigned a value for every SNP previously identified as significant: 1 if the individual has the risk allele, or 0 if not.
- Study Simulation
- A GWAS-ROCS Database record uniquely identified by a GWAS-ROCS ID and displayed in a GR-Card that details a simulated population modelled after a publicly available genome-wide association study.
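
To make the layout of a simulated population concrete, the following is a minimal sketch of how such a CSV file could be generated from a SNP's control allele frequency and odds ratio. It is an illustration under stated assumptions, not the procedure used to build the database: SNPs are drawn independently, the risk-allele frequency in cases is derived from the odds ratio, and the column name `status`, the function names, and the SNP names are all hypothetical.

```python
import csv
import io
import random


def case_frequency(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by the control frequency and odds ratio.

    Requires 0 <= control_freq < 1 (a control frequency of 1 has undefined odds).
    """
    case_odds = odds_ratio * control_freq / (1.0 - control_freq)
    return case_odds / (1.0 + case_odds)


def simulate_population(snps, n_cases, n_controls, seed=0):
    """Return rows of [status, snp_1, ..., snp_n] with 0/1 risk-allele indicators.

    `snps` maps a SNP name to (control_freq, odds_ratio).
    Assumption: SNPs are independent of one another.
    """
    rng = random.Random(seed)
    rows = []
    for status, count in ((1, n_cases), (0, n_controls)):
        for _ in range(count):
            row = [status]  # 1 = case, 0 = control
            for control_freq, odds_ratio in snps.values():
                if status == 1:
                    freq = case_frequency(control_freq, odds_ratio)
                else:
                    freq = control_freq
                row.append(1 if rng.random() < freq else 0)
            rows.append(row)
    return rows


def to_csv(snps, rows):
    """Serialise the population as CSV text with one column per SNP."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["status"] + list(snps))
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical SNPs: name -> (control allele frequency, odds ratio)
example_snps = {"rs0000001": (0.30, 1.5), "rs0000002": (0.10, 2.0)}
population = simulate_population(example_snps, n_cases=1000, n_controls=1000)
csv_text = to_csv(example_snps, population)
```
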
Field | Description |
---|---|
Creation Date | The date/time the record was created. |
Last Updated | The date/time the record was last updated. |
GWAS-ROCS ID | A unique GWAS-ROCS accession number consisting of a 2-letter prefix (GR) and a 5-digit suffix. This ID is used to access the study simulation entry (i.e. GR-Card) via the URL. If an entry is deleted, its GWAS-ROCS ID will not be reused. |
Condition/Phenotype | The disease, condition, or phenotype studied by the GWAS. |
GWAS Central Data Set ID | The data set identifier assigned by GWAS Central to the data set on which the GWAS-ROCS record's simulated population was modelled. Clicking on the identifier link takes the user to the GWAS Central page containing the study and its corresponding data set(s). |
PubMed ID (PMID) | The identifier assigned by PubMed to the citation of the original research publication on which the GWAS-ROCS record's simulated population was modelled. Clicking on the identifier link takes the user to the publication's PubMed citation page, displayed in abstract format. |
Meta-Analysis (Yes/No) | Indicates whether the study simulation is based on a meta-analysis, i.e. a study that combined the results of multiple studies. |
Control Subjects | The number of individuals in the control group of the GWAS (i.e. number of individuals without the condition of interest). |
Case Subjects | The number of individuals in the case group of the GWAS (i.e. number of individuals with the condition of interest). |
SNP Name | The reference SNP cluster ID for a given SNP. |
SNP Odds Ratio | The ratio of the odds of a risk allele being present in the case group to the odds of the risk allele being present in the control group. Odds ratios range between zero and infinity. A risk allele with no correlation to a condition would have an odds ratio of 1, whereas a risk allele with perfect correlation would have an odds ratio of infinity or zero. |
SNP Allele Frequency | The frequency of a risk allele of a particular SNP being present in the control population. The allele frequency ranges between 0 and 1, with 0 indicating that the risk allele is not present in the control population, and 1 indicating that every individual in the control population has the allele. |
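SNP p-value | The p-value reported for a given SNP, i.e. the probability of observing an association at least as strong as the one measured if the SNP were in fact not associated with the condition. A p-value ranges from 0 to 1; the smaller the value, the stronger the evidence that the SNP is associated with the condition. |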
ROC Plot | Logistic regression, a common modelling method for multi-marker data, was applied to every simulated population in the database to generate ROC curves. Logistic regression was used to model multiple independent variables (i.e. SNPs) to explain two possible outcomes (i.e. healthy or diseased). Once constructed, the logistic regression model assigns every individual a risk score (i.e. disease likelihood) between 0 and 1. Any individual that has a risk score above a given cut-off value is classified as "diseased" and any individual below it is classified as "healthy". A plot of the sensitivity (true positive rate) against 1-specificity (false positive rate) for all possible cut-off values is known as a receiver-operating characteristic (ROC) curve. Ideally, the apex of the ROC curve is as high up and to the left as possible. This would indicate that the model in question is able to accurately distinguish between healthy and diseased individuals on the basis of SNPs alone. |
AUROC | Logistic regression, a common modelling method for multi-marker data, was applied to every simulated population in the database to generate ROC curves and calculate the AUROC. The AUROC, the area under the receiver operating characteristic curve, is a measure of the classification accuracy of the logistic regression model. A model with perfect classification accuracy (i.e. a model which can perfectly distinguish healthy and diseased individuals on the basis of SNPs alone) would have an AUROC of 1, while a model with no classification accuracy would have an AUROC of 0.5. |
Logistic Regression Equation | An expression of the logistic regression model, applied to a simulated population to generate a ROC curve, as an equation of the form $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$, where $p$ is the probability of being diseased, $\beta_0$ is the y-intercept, and $\beta_i$ is the log odds ratio associated with the $i$th binary variable $x_i$. The equation represents the logit-transformed line of best fit. Note: All logistic regression equations were generated using OLS regardless of which regression type was used to estimate the AUROC. |
Heritability | The narrow-sense heritability (denoted by $h^2$) was calculated for every simulated population in the database. This statistic captures the proportion of phenotypic variation (the condition status) due to additive genetic values (the SNPs). The heritability ranges from 0 (no phenotypic variance is explained by genetic factors) to 1 (all phenotypic variance is explained by genetic factors). In this database, the heritability was calculated as the square of Somers' rank correlation $D$. Thus, $h^2 = D^2 = (2 \cdot \mathrm{AUROC} - 1)^2$. |
Regression Type | The type of regression used on the simulated populations to estimate the AUROC. Two different types of regressions were used: OLS and penalized. OLS stands for ordinary least squares logistic regression. Penalized refers to ridge logistic regression. |
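
The ROC Plot, AUROC, Logistic Regression Equation, and Heritability fields can be tied together with a small worked example. The sketch below is illustrative only and is not the database's own code: the function names, the intercept, the coefficients, and the eight individuals are hypothetical. It converts a logistic regression equation into risk scores, computes the AUROC through its standard equivalent form (the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half), and then applies the heritability formula given in the table.

```python
import math


def risk_score(intercept, coefficients, alleles):
    """Invert log(p / (1 - p)) = b0 + sum(b_i * x_i) to obtain p in (0, 1)."""
    logit = intercept + sum(b * x for b, x in zip(coefficients, alleles))
    return 1.0 / (1.0 + math.exp(-logit))


def auroc(case_scores, control_scores):
    """Area under the ROC curve, computed pairwise (ties count as half a win)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def heritability(auroc_value):
    """h^2 = D^2, where Somers' D = 2 * AUROC - 1."""
    somers_d = 2.0 * auroc_value - 1.0
    return somers_d ** 2


# Hypothetical model: two SNPs with odds ratios 1.5 and 2.0, intercept -1.0.
intercept = -1.0
coefficients = [math.log(1.5), math.log(2.0)]

# Hypothetical individuals, each a tuple of 0/1 risk-allele indicators.
cases = [(1, 1), (1, 0), (0, 1), (0, 0)]
controls = [(0, 0), (0, 0), (1, 0), (0, 1)]

case_scores = [risk_score(intercept, coefficients, x) for x in cases]
control_scores = [risk_score(intercept, coefficients, x) for x in controls]

example_auroc = auroc(case_scores, control_scores)
example_h2 = heritability(example_auroc)
```

For these eight individuals the cases win 11 of the 16 case-control comparisons (counting ties as one half), so the AUROC is 0.6875 and the heritability is (2 × 0.6875 − 1)² = 0.140625.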