Site Weighted Gene Set Enrichment Analysis

Performs site-weighted gene set enrichment analysis. This reduces to standard GSEA when the likelihood/weight columns in input_df contain only 1 or 0, p = 1, q = 1, and thresh_type = "val".

Usage

swGsea(
  input_df,
  thresh_type = "percentile",
  thresh = 0.9,
  thresh_action = "exclude",
  min_set_size = 10,
  max_set_size = 500,
  max_score = "max",
  min_score = "min",
  psuedocount = 0.001,
  perms = 1000,
  p = 1,
  q = 1,
  nThreads = 1,
  rng_seed = 1,
  fork = FALSE
)

Arguments

input_df: A data frame in which the first column is the name of the item of interest (gene, protein, phosphosite, etc.), the second is the correlation of that item with the phenotype (typically the log ratio of expression for phenotype vs. normal), and the remaining columns are the scores for the likelihood that the item belongs to each set (one column per set).
thresh_type: The type of threshold given in thresh. Use 'percentile' to include all scores over the percentile given in thresh (e.g., 0.9 selects all items in the 90th percentile, or the top 10 percent); 'list' to supply a list of item lists, one per set, in the same order as the corresponding set columns in input_df; 'val' to apply a single threshold value to all sets; or 'values' to use a vector of per-set cutoffs (in the same order as the set columns in input_df).
thresh: Depends on thresh_type. A list of lists of the items in each set (with the same names as the colnames of the scores); a numeric vector of threshold scores, one per set (in the same order as the colnames of the scores in input_df); a single threshold value applied to all sets; or a single percentile value between 0 and 1 (e.g., if thresh = 0.9, the items above the 90th percentile of the score, i.e. the highest-scoring 10 percent, are included in the set for each scoring regimen). thresh = "all" is not supported at this time because it does not yield a Kolmogorov-Smirnov statistic; it may be added later as an alternate scoring method.
thresh_action: One of "include", "exclude" (default), or "adjust". Specifies how to treat a set that does not contain the minimum number of items or that contains all of the items. This option cannot be used with predefined lists of items in sets; in that case, a set whose number of items does not meet the requirements is skipped.
min_set_size, max_set_size: The minimum/maximum number of items each set needs for the analysis to proceed.
max_score, min_score: An optional numeric vector of maximum/minimum boundaries used to clip the scores for each set.
psuedocount: The pseudocount (pc) used for rescaling set scores: (score - min_score + pc)/(max_score - min_score + pc). It prevents division by 0 when max_score == min_score; in that case all scores for items in the set are 1, which is equivalent to standard GSEA. Unless min_score == max_score, it also lets users adjust the weights of scores close to the set minimum: as the pseudocount approaches 0, scaled minimum scores also approach 0; as it approaches infinity, scaled minimum scores approach the scaled maximum score (which equals 1). This value must be larger than 0.
perms: The number of permutations.
p: The exponential scaling factor of the phenotype score (second column in input_df).
q: The exponential scaling factor of the likelihood score (weights).
nThreads: The number of threads to use in calculating permutations.
rng_seed: Random seed.
fork: A boolean. Whether to pass "fork" to the type parameter of makeCluster on Unix-like machines.
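
The thresholding and rescaling arguments can be easier to follow on a toy example. The base R sketch below does not call swGsea; it builds a made-up input_df in the documented layout, applies a 'percentile' threshold with quantile(), and evaluates the rescaling formula from the psuedocount entry. The helper names (in_set, rescale), the strict "greater than" comparison, and the choice to take the minimum and maximum over set members are assumptions made for illustration, not a description of the package internals.

```r
# Toy input in the documented layout: item name, phenotype score, then one
# likelihood-score column per set. All names and values here are made up.
set.seed(1)
n_items <- 200
input_df <- data.frame(
  item  = paste0("site", seq_len(n_items)),
  score = rnorm(n_items),
  setA  = runif(n_items),
  setB  = runif(n_items),
  stringsAsFactors = FALSE
)

# thresh_type = "percentile", thresh = 0.9: keep items whose likelihood score
# lies above the 90th percentile of that set's column (roughly the top 10%).
# Assumption: strict ">" comparison; the package may treat ties differently.
in_set <- function(scores, thresh = 0.9) {
  scores > quantile(scores, probs = thresh, names = FALSE)
}
members_A <- in_set(input_df$setA)
n_members_A <- sum(members_A)

# Rescaling formula quoted in the psuedocount entry:
# (score - min_score + pc) / (max_score - min_score + pc)
# Assumption: min and max are taken over the scores passed in.
rescale <- function(scores, pc = 0.001,
                    min_score = min(scores), max_score = max(scores)) {
  (scores - min_score + pc) / (max_score - min_score + pc)
}

# A small pseudocount pushes the lowest weight toward 0; a very large one
# pushes every weight toward 1.
w_small <- rescale(input_df$setA[members_A], pc = 0.001)
w_large <- rescale(input_df$setA[members_A], pc = 1000)

# Degenerate case: identical scores all rescale to 1.
w_flat <- rescale(c(0.7, 0.7, 0.7))
```

Comparing range(w_small) with range(w_large) shows the effect described above: the maximum weight stays at 1 in both cases, while the minimum weight moves from near 0 to near 1 as the pseudocount grows.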

Value

A list of Enrichment_Results, Items_in_Set and Running_Sums.

Enrichment_Results: A data frame with gene set names as row names and the columns "ES", "NES", "p_val", and "fdr".
Items_in_Set: A list of one-column data frames. Describes genes and their ranks in each set.
Running_Sums: Running sum scores along genes sorted by ranked scores, with gene sets as columns.
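
To make the ES and running-sum terminology concrete, here is a minimal base R sketch of the running sum used in standard GSEA: items are ranked by phenotype score, each set member steps the sum up in proportion to |r|^p, and each non-member steps it down by a constant amount. It is meant only to illustrate what a running-sum trace and an enrichment score are; the function name running_sum and the toy data are hypothetical, and swGsea's own computation (which also uses likelihood weights) may differ in detail.

```r
# Standard GSEA running sum for one set, for illustration only.
running_sum <- function(r, in_set, p = 1) {
  ord <- order(r, decreasing = TRUE)           # rank items by phenotype score
  r <- r[ord]
  in_set <- in_set[ord]
  hit <- ifelse(in_set, abs(r)^p, 0)
  hit <- hit / sum(hit)                        # hits step up, weighted by |r|^p
  miss <- ifelse(in_set, 0, 1 / sum(!in_set))  # misses step down uniformly
  cumsum(hit - miss)
}

r      <- c(2.0, 1.5, 1.0, -0.5, -1.0, -2.0)   # made-up phenotype scores
in_set <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

rs <- running_sum(r, in_set)
es <- rs[which.max(abs(rs))]                   # ES: largest deviation from zero
```

Because the hit steps and the miss steps each sum to 1, the trace always returns to 0 at the last item; the enrichment score is the point of largest deviation along the way.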

Details

The weight for each gene is computed as $$\frac{s_{j}^{q}|r_{j}|^{p}}{\sum s^{q}|r|^{p}}$$ where r is the log ratio score, s is the likelihood score, and j is the index of the gene.
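
The weighting formula can be evaluated directly in base R. The sketch below is not taken from the package; the helper name sw_weights and the example vectors are hypothetical, and the sum in the denominator is taken over all items passed in (items with a likelihood score of 0 contribute nothing).

```r
# Weight for item j: s_j^q * |r_j|^p / sum(s^q * |r|^p)
sw_weights <- function(r, s, p = 1, q = 1) {
  w <- s^q * abs(r)^p
  w / sum(w)
}

r <- c(2.0, -1.5, 1.0, -0.5)   # phenotype scores (e.g. log ratios)
s <- c(1.0, 0.5, 0.25, 0)      # likelihood scores; 0 means "not in the set"

w <- sw_weights(r, s)

# With 0/1 likelihoods and p = q = 1, the weights reduce to |r_j| / sum(|r|)
# over the set members, which are the hit weights of standard GSEA.
s01 <- c(1, 1, 0, 1)
w01 <- sw_weights(r, s01)
```

Raising q gives high-likelihood items more influence relative to low-likelihood ones, while raising p does the same for items with large phenotype scores.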

Author

Eric Jaehnig