Performs site-weighted gene set enrichment analysis, or standard GSEA when the likelihood/weight columns in input_df are all 1 or 0, p = 1, q = 1, and thresh_type = "val".
Usage
swGsea(
input_df,
thresh_type = "percentile",
thresh = 0.9,
thresh_action = "exclude",
min_set_size = 10,
max_set_size = 500,
max_score = "max",
min_score = "min",
psuedocount = 0.001,
perms = 1000,
p = 1,
q = 1,
nThreads = 1,
rng_seed = 1,
fork = FALSE
)
Arguments
- input_df
A data frame in which the first column is the name of the item of interest (gene, protein, phosphosite, etc.), the second is the correlation of that item with the phenotype (typically the log ratio of expression for phenotype vs. normal), and the remaining columns are the scores for the likelihood that the item belongs in each set (one column per set).
- thresh_type
The type of thresh. Use 'percentile' to include all scores over the percentile given in thresh (e.g., 0.9 includes all items in the 90th percentile, i.e., the top 10 percent); 'list' to supply a list of set lists, given in the same order as the corresponding set columns in input_df; 'val' to apply a single threshold value to all sets; or 'values' to use a vector of cutoffs, one per set (in the same order as the set columns in input_df).
- thresh
Depends on thresh_type. A list of lists of the items in each set (with the same names as the column names of the scores); a numeric vector of threshold scores, one per set (in the same order as the score columns in input_df); or a single percentile value between 0 and 1 (e.g., if thresh = 0.9, the items in the 90th percentile of the score, i.e., the highest-scoring 10 percent of the items, are included in the set for each scoring regimen). thresh = "all" is not supported at this time, as it does not result in a Kolmogorov-Smirnov statistic; this may be added as an alternate scoring method later.
- thresh_action
Either "include", "exclude" (default), or "adjust". This specifies how to treat a set that does not contain the minimum number of items or that contains all of the items. This option cannot be used with predefined lists of items in sets (if the number of items in a given set does not meet the requirements, that set is skipped).
- min_set_size, max_set_size
The minimum/maximum number of items each set needs for the analysis to proceed.
- max_score, min_score
An optional numeric vector of maximum/minimum boundaries used to clip the scores for each set.
- psuedocount
The pseudocount (pc) is used for rescaling set scores: (score - min_score + pc)/(max_score - min_score + pc). It is needed to prevent division by 0 if max_score == min_score (in that case, all scores for items in the set will be 1, which is equivalent to standard GSEA). It also lets users adjust the weights of scores that are close to the set's minimum score (unless min_score == max_score): as psuedocount approaches 0, the scaled minimum scores also approach 0; as psuedocount approaches infinity, the scaled minimum scores approach the scaled maximum scores (which equal 1). This value must be larger than 0.
- perms
The number of permutations.
- p
The exponential scaling factor of the phenotype score (second column in input_df).
- q
The exponential scaling factor of the likelihood score (weights).
- nThreads
The number of threads to use in calculating permutations.
- rng_seed
Random seed.
- fork
A boolean. Whether to pass "fork" to the type parameter of makeCluster on Unix-like machines.
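The input layout, the percentile threshold, and the psuedocount rescaling described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the toy data, the column names, and the helper names percentile_members and rescale_scores are hypothetical, and the strict ">" used at the percentile boundary is an assumption.

```r
# Hypothetical toy input: column 1 = item name, column 2 = phenotype
# correlation (e.g., log ratio), remaining columns = one score column per set.
input_df <- data.frame(
  item      = paste0("site", 1:6),
  log_ratio = c(2.1, 1.4, 0.3, -0.2, -1.1, -1.8),
  setA      = c(0.95, 0.80, 0.10, 0.60, 0.05, 0.20),
  setB      = rep(0.5, 6)
)

# Percentile thresholding (thresh_type = "percentile"): flag items whose
# score lies over the given percentile. The strict ">" is an assumption.
percentile_members <- function(scores, thresh = 0.9) {
  scores > quantile(scores, probs = thresh, names = FALSE)
}

# Rescaling formula quoted in the psuedocount description:
# (score - min_score + pc) / (max_score - min_score + pc)
rescale_scores <- function(scores, pc = 0.001,
                           min_score = min(scores),
                           max_score = max(scores)) {
  stopifnot(pc > 0)  # the documentation requires a value larger than 0
  (scores - min_score + pc) / (max_score - min_score + pc)
}

members_A <- percentile_members(input_df$setA, thresh = 0.9)
scaled_A  <- rescale_scores(input_df$setA)            # minimum maps close to 0
scaled_B  <- rescale_scores(input_df$setB)            # constant scores map to 1
scaled_A_big_pc <- rescale_scores(input_df$setA, pc = 1e6)  # minimum maps close to 1
```

With a constant score column such as setB, every rescaled weight is 1, which is the case the documentation describes as equivalent to standard GSEA.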
Value
A list of Enrichment_Results, Items_in_Set, and Running_Sums.
- Enrichment_Results
A data frame with gene set names as row names and the columns "ES", "NES", "p_val", and "fdr".
- Items_in_Set
A list of one-column data frames. Describes genes and their ranks in each set.
- Running_Sums
Running sum scores along genes sorted by ranked scores, with gene sets as columns.
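To make the idea of a running sum and an enrichment score concrete, here is a rough sketch of a weighted Kolmogorov-Smirnov-style running sum for a single set. It assumes that each hit contributes weight^q * |score|^p and each miss contributes equally; this form is an assumption for illustration and is not taken from the swGsea source. The function name running_sum_sketch and its arguments are hypothetical.

```r
# Assumed weighted running sum for one set (illustration only).
# r: phenotype scores; w: set weights; in_set: logical set membership.
running_sum_sketch <- function(r, w, in_set, p = 1, q = 1) {
  stopifnot(any(in_set), !all(in_set))     # need at least one hit and one miss
  ord <- order(r, decreasing = TRUE)       # rank items by phenotype score
  r <- r[ord]; w <- w[ord]; in_set <- in_set[ord]

  hit  <- ifelse(in_set, (w^q) * (abs(r)^p), 0)  # weighted step up for hits
  miss <- ifelse(in_set, 0, 1)                   # uniform step down for misses
  stopifnot(sum(hit) > 0)

  rs <- cumsum(hit / sum(hit) - miss / sum(miss))
  list(running_sum = rs,
       ES = rs[which.max(abs(rs))])        # maximum deviation from zero
}

# Toy example: the two top-ranked items are in the set, all weights are 1.
res <- running_sum_sketch(
  r      = c(2.1, 1.4, 0.3, -0.2, -1.1, -1.8),
  w      = rep(1, 6),
  in_set = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)
```

With all weights equal to 1 and p = q = 1, this sketch reduces to the usual score-weighted GSEA running sum, which matches the documentation's note about when the function behaves like standard GSEA.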