Download a copy of the vignette to follow along here: settings_matrix.Rmd
This vignette outlines the main functionality of the generate_settings_matrix function.
The most minimal settings_matrix can be obtained by providing a data_list object.
library(metasnf)
# It's best to list out the individual elements with names, i.e. data = ...,
# name = ..., domain = ..., type = ..., but we'll skip that here for brevity.
data_list <- generate_data_list(
list(cort_t, "cortical_thickness", "neuroimaging", "continuous"),
list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
settings_matrix <- generate_settings_matrix(
data_list
)
head(settings_matrix)
## [1] row_id alpha
## [3] k t
## [5] snf_scheme clust_alg
## [7] cont_dist disc_dist
## [9] ord_dist cat_dist
## [11] mix_dist inc_cortical_thickness
## [13] inc_cortical_surface_area inc_subcortical_volume
## [15] inc_household_income inc_pubertal_status
## <0 rows> (or 0-length row.names)
The resulting columns are:
row_id: A label to keep track of which row is which
alpha: The alpha (also referred to as sigma or eta) hyperparameter in SNF
k: The K (nearest neighbours) hyperparameter in similarity matrix calculations and SNF
t: The T (number of iterations) hyperparameter used in SNF
snf_scheme: Which SNF “scheme” is being used to convert the initial provided dataframes into a final fused network (more on this in the appendix of the “Less Simple Example” vignette)
clust_alg: Which clustering algorithm will be applied to the final fused network. By default, this varies between the pre-provided options of (1) spectral clustering with the number of clusters determined by the eigen-gap heuristic and (2) the same thing but using the rotation cost heuristic. You can learn more about using this parameter in the clustering algorithms vignette.
dist (the columns ending in _dist): Which distance metric is being used for the various types of features (more on this in the distance metrics vignette)
inc (the columns starting with inc_): Whether or not the corresponding dataframe will be included (1) or excluded (0) from this row
By varying the values in these columns, we can define distinct SNF pipelines that should give rise to a broader space of cluster solutions. The following sections outline how to use generate_settings_matrix to build a wide range of settings that will hopefully help you find a subtyping solution useful for your purposes.
When not specifying any parameters beyond the number of rows that are created, the function will randomly (but sensibly) vary the values in the matrix.
# Random variation using the default ranges
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 100
)
head(settings_matrix)
## row_id alpha k t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1 1 0.3 11 20 2 1 1 1 1 1
## 2 2 0.3 20 20 3 2 1 1 1 1
## 3 3 0.4 65 20 3 2 1 1 1 1
## 4 4 0.7 79 20 2 2 1 1 1 1
## 5 5 0.3 53 20 2 2 1 1 1 1
## 6 6 0.7 57 20 1 2 1 1 1 1
## mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1 1 0 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
## 5 1 1 0
## 6 1 1 0
## inc_subcortical_volume inc_household_income inc_pubertal_status
## 1 1 1 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
## 5 1 1 1
## 6 1 1 1
The alpha and k hyperparameters are varied from 0.3 to 0.8 and from 10 to 100 respectively, based on suggestions from the authors of SNF.
The t hyperparameter, which controls how many iterations of updates occur to the fused network during SNF, stays fixed at 20 by default. This value (20) has been empirically demonstrated to be sufficient for achieving convergence of the matrix, and varying it doesn’t seem to have much relevance to what kinds of cluster solutions are produced.
The snf_scheme column will vary from 1 to 3, corresponding to the 3 different schemes that are available.
The clust_alg column will vary randomly between (1) spectral clustering using the eigen-gap heuristic and (2) spectral clustering using the rotation cost heuristic by default.
The distance columns will always be 1 by default, as they will just use the default distance metrics of simple Euclidean for anything numeric and Gower’s distance for anything mixed or categorical.
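The default ranges described above can be checked directly on any settings matrix. The following is a minimal sketch in base R. The data frame sm is a hand-copied subset of the first six rows of the example output shown earlier (distance columns omitted) so that the code runs without metasnf; the names sm and defaults_ok are introduced here purely for illustration.

```r
# First six rows of the example output above, re-entered by hand.
# In practice you would run these checks on your own settings_matrix.
sm <- data.frame(
    row_id = 1:6,
    alpha = c(0.3, 0.3, 0.4, 0.7, 0.3, 0.7),
    k = c(11, 20, 65, 79, 53, 57),
    t = rep(20, 6),
    snf_scheme = c(2, 3, 3, 2, 2, 1),
    clust_alg = c(1, 2, 2, 2, 2, 2),
    inc_cortical_thickness = c(0, 1, 1, 1, 1, 1),
    inc_cortical_surface_area = c(1, 1, 1, 1, 0, 0),
    inc_subcortical_volume = rep(1, 6),
    inc_household_income = rep(1, 6),
    inc_pubertal_status = rep(1, 6)
)

# One logical per setting: TRUE if every row falls inside the default range
defaults_ok <- c(
    alpha = all(sm$alpha >= 0.3 & sm$alpha <= 0.8),
    k = all(sm$k >= 10 & sm$k <= 100),
    t = all(sm$t == 20),
    snf_scheme = all(sm$snf_scheme %in% 1:3),
    clust_alg = all(sm$clust_alg %in% 1:2)
)
print(defaults_ok)
```

The same checks can be pointed at a full settings matrix by replacing sm with the object returned by generate_settings_matrix.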
Controlling the scheme, the clustering algorithms, and the distance metrics is discussed in more detail in the separate vignettes linked to above. Controlling the remaining options is shown below.
You can control any of these parameters either by providing a vector of values you’d like to randomly sample from or by specifying a minimum and maximum range.
# Through minimums and maximums
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 100,
min_k = 10,
max_k = 60,
min_alpha = 0.3,
max_alpha = 0.8,
min_t = 15,
max_t = 30
)
## Warning in add_settings_matrix_rows(settings_matrix = settings_matrix_base, :
## The original SNF paper recommends a t between 10 to 20. Empirically, setting t
## above 20 is always sufficient for SNF to converge. This warning is raised
## anytime a user tries to set a t value smaller than 10 or larger than 20.
# Through specific value sampling
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 20,
k_values = c(10, 25, 50),
alpha_values = c(0.4, 0.8),
t_values = c(20, 30)
)
## Warning in add_settings_matrix_rows(settings_matrix = settings_matrix_base, :
## The original SNF paper recommends a t between 10 to 20. Empirically, setting t
## above 20 is always sufficient for SNF to converge. This warning is raised
## anytime a user tries to set a t value smaller than 10 or larger than 20.
Bounds on the number of input dataframes removed as well as the way in which the number removed is chosen can be controlled.
By default, generate_settings_matrix will pick the number of dataframes to drop at random, between 0 and one less than the total number of available dataframes, based on an exponential probability distribution. The exponential distribution makes it very likely that a small number of dataframes will be dropped and much less likely that a large number of dataframes will be dropped.
You can control the distribution by changing the dropout_dist value to “uniform” (which will result in a much higher number of dataframes being dropped on average) or “none” (which will result in no dataframes being dropped).
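To see why the choice of distribution matters, the sketch below compares the two options for 5 dataframes, where between 0 and 4 can be removed. The exponential weights exp(-x) are an assumption made for illustration only; the exact rate used inside generate_settings_matrix is not stated here and may differ. All variable names are hypothetical.

```r
n_dataframes <- 5
n_removed <- 0:(n_dataframes - 1)  # possible numbers of dropped dataframes

# ASSUMPTION: weights proportional to exp(-x); metasnf may use another rate
p_exponential <- exp(-n_removed) / sum(exp(-n_removed))

# Uniform: every possible number of dropped dataframes is equally likely
p_uniform <- rep(1 / length(n_removed), length(n_removed))

# Expected number of dropped dataframes under each distribution
expected_exponential <- sum(n_removed * p_exponential)
expected_uniform <- sum(n_removed * p_uniform)

print(round(rbind(exponential = p_exponential, uniform = p_uniform), 3))
print(c(exponential = expected_exponential, uniform = expected_uniform))
```

Under these assumed weights, most of the probability mass sits on dropping 0 or 1 dataframes, while the uniform option drops 2 on average.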
# Exponential dropping
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 20,
dropout_dist = "exponential" # the default behaviour
)
head(settings_matrix)
## row_id alpha k t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1 1 0.5 39 20 2 2 1 1 1 1
## 2 2 0.7 13 20 2 1 1 1 1 1
## 3 3 0.5 75 20 3 2 1 1 1 1
## 4 4 0.5 20 20 3 2 1 1 1 1
## 5 5 0.5 25 20 2 2 1 1 1 1
## 6 6 0.6 49 20 3 1 1 1 1 1
## mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1 1 1 1
## 2 1 1 1
## 3 1 1 0
## 4 1 1 1
## 5 1 1 1
## 6 1 1 1
## inc_subcortical_volume inc_household_income inc_pubertal_status
## 1 1 1 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
## 5 1 1 1
## 6 1 1 1
# Uniform dropping
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 20,
dropout_dist = "uniform"
)
head(settings_matrix)
## row_id alpha k t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1 1 0.3 79 20 3 2 1 1 1 1
## 2 2 0.8 83 20 2 1 1 1 1 1
## 3 3 0.5 45 20 2 2 1 1 1 1
## 4 4 0.4 13 20 2 1 1 1 1 1
## 5 5 0.6 89 20 1 2 1 1 1 1
## 6 6 0.4 35 20 2 2 1 1 1 1
## mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1 1 1 1
## 2 1 0 0
## 3 1 1 1
## 4 1 1 1
## 5 1 1 1
## 6 1 1 1
## inc_subcortical_volume inc_household_income inc_pubertal_status
## 1 0 0 1
## 2 0 1 1
## 3 0 1 1
## 4 0 0 1
## 5 0 1 1
## 6 1 0 1
# No dropping
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 20,
dropout_dist = "none"
)
head(settings_matrix)
## row_id alpha k t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1 1 0.3 74 20 1 2 1 1 1 1
## 2 2 0.3 49 20 1 1 1 1 1 1
## 3 3 0.3 14 20 1 1 1 1 1 1
## 4 4 0.8 64 20 1 2 1 1 1 1
## 5 5 0.6 32 20 3 1 1 1 1 1
## 6 6 0.8 88 20 1 1 1 1 1 1
## mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1 1 1 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
## 5 1 1 1
## 6 1 1 1
## inc_subcortical_volume inc_household_income inc_pubertal_status
## 1 1 1 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
## 5 1 1 1
## 6 1 1 1
The bounds on the number of dataframes that can be dropped can be controlled using the min_removed_inputs and max_removed_inputs parameters:
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 20,
min_removed_inputs = 3
)
# No row will exclude fewer than 3 dataframes during SNF
head(settings_matrix)
## row_id alpha k t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1 1 0.3 31 20 3 1 1 1 1 1
## 2 2 0.4 33 20 2 1 1 1 1 1
## 3 3 0.8 79 20 2 2 1 1 1 1
## 4 4 0.6 40 20 3 2 1 1 1 1
## 5 5 0.6 57 20 2 1 1 1 1 1
## 6 6 0.7 75 20 3 1 1 1 1 1
## mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1 1 0 1
## 2 1 1 0
## 3 1 1 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 0
## inc_subcortical_volume inc_household_income inc_pubertal_status
## 1 0 0 1
## 2 1 0 0
## 3 0 0 1
## 4 0 0 1
## 5 0 1 0
## 6 1 1 0
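One way to confirm that the bound was respected is to count the zeros in the inc_ columns of each row. The sketch below re-enters the six rows printed above by hand so that it runs on its own; the names sm, inc_cols, and n_removed are introduced here for illustration.

```r
# The inc_ columns of the six rows printed above, re-entered by hand
sm <- data.frame(
    row_id = 1:6,
    inc_cortical_thickness = c(0, 1, 1, 1, 0, 0),
    inc_cortical_surface_area = c(1, 0, 0, 0, 1, 0),
    inc_subcortical_volume = c(0, 1, 0, 0, 0, 1),
    inc_household_income = c(0, 0, 0, 0, 1, 1),
    inc_pubertal_status = c(1, 0, 1, 1, 0, 0)
)

# Select the inclusion columns by their shared prefix
inc_cols <- startsWith(colnames(sm), "inc_")

# A 0 means the dataframe is excluded, so count the zeros in each row
n_removed <- rowSums(sm[, inc_cols] == 0)
print(n_removed)

# TRUE if no row excludes fewer than 3 dataframes
all(n_removed >= 3)
```

Each of these six rows drops exactly 3 of the 5 dataframes, which is consistent with min_removed_inputs = 3.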
If you are interested in grid searching over just a specific set of alpha and k values, you may want to vary those parameters and keep everything else fixed.
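As a rough sketch of what such a grid could look like, the base R code below enumerates every combination of a few alpha and k values with expand.grid and attaches a single fixed template row for all other settings. The column names follow the settings matrix output shown earlier, but the fixed values are arbitrary, and the names grid, template, and grid_matrix are hypothetical. Whether a hand-assembled data frame like this is accepted by downstream metasnf functions is not confirmed here.

```r
# All combinations of the alpha and k values of interest
grid <- expand.grid(
    alpha = c(0.3, 0.5, 0.8),
    k = c(20, 40, 60)
)

# One template row holding every other setting fixed (values are arbitrary)
template <- data.frame(
    t = 20, snf_scheme = 1, clust_alg = 1,
    cont_dist = 1, disc_dist = 1, ord_dist = 1, cat_dist = 1, mix_dist = 1,
    inc_cortical_thickness = 1, inc_cortical_surface_area = 1,
    inc_subcortical_volume = 1, inc_household_income = 1,
    inc_pubertal_status = 1
)

# Repeat the template once per grid point and add a row label
grid_matrix <- data.frame(
    row_id = seq_len(nrow(grid)),
    grid,
    template[rep(1, nrow(grid)), ],
    row.names = NULL
)

dim(grid_matrix)
head(grid_matrix[, c("row_id", "alpha", "k", "t", "snf_scheme", "clust_alg")])
```

A more conservative route is to generate a settings matrix with one row per grid point using generate_settings_matrix and then overwrite only its alpha and k columns, since the settings matrix is an ordinary dataframe.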
Rather than varying everything equally all at once, you may be interested in looking at “chunks” of solution spaces that are based on distinct settings matrices.
For example, you may want to look at 50 solutions generated with k = 50 and look at another 50 solutions generated with k = 80. You can build two separate settings matrices and concatenate them, but you can also build up a single matrix in parts using the add_settings_matrix_rows function:
set.seed(42)
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 25,
k_values = 50
)
settings_matrix <- add_settings_matrix_rows(
settings_matrix,
nrow = 25,
k_values = 80
)
dim(settings_matrix)
## [1] 50 16
settings_matrix$k
## [1] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
## [26] 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80
Don’t forget that the settings matrix is just a dataframe. You can always go in and modify things as you wish, but you do risk generating duplicate or invalid rows that the package functions would have prevented.
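If you do edit a settings matrix by hand, a quick duplicate check can catch rows that have become identical. The following is a minimal sketch in base R on a toy data frame; sm, dup, and sm_unique are names made up for this example, and a real settings matrix has more columns.

```r
# Toy settings matrix with only two varying settings
sm <- data.frame(
    row_id = 1:4,
    alpha = c(0.3, 0.5, 0.5, 0.8),
    k = c(20, 20, 40, 40)
)

# A manual edit: fixing k makes rows 2 and 3 identical apart from row_id
sm$k <- 20

# Flag rows that repeat an earlier row, ignoring the row_id label
dup <- duplicated(sm[, colnames(sm) != "row_id"])
print(dup)

# Keep only the first occurrence of each distinct row
sm_unique <- sm[!dup, ]
print(sm_unique)
```

Note that row_id has to be left out of the comparison, otherwise every row looks unique.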
generate_settings_matrix will never build duplicate rows. A consequence of this is that if you request a very large number of rows over a very small range of possible values to vary over, it will be impossible for the matrix to be built. For example, there’s no way to generate 10 unique rows when the only thing allowed to vary is which clustering algorithm (1 or 2) is used: only 2 rows could ever be created.
If you encounter the error “Matrix building failed”, try to generate fewer rows or to be a little less strict with what values are allowed.
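Before requesting rows, it can help to estimate how many distinct rows are possible at all. As a rough sketch (base R only; the names allowed, allowed_wider, n_possible, and n_possible_wider are hypothetical), expand.grid enumerates every combination of the values that are allowed to vary, and its row count gives a ceiling on the number of unique rows:

```r
# Only the clustering algorithm varies: at most 2 unique rows
allowed <- list(clust_alg = c(1, 2))
n_possible <- nrow(expand.grid(allowed))
print(n_possible)

# Letting alpha and k vary as well raises the ceiling quickly
allowed_wider <- list(
    alpha = c(0.4, 0.8),
    k = c(10, 25, 50),
    clust_alg = c(1, 2)
)
n_possible_wider <- nrow(expand.grid(allowed_wider))
print(n_possible_wider)
```

Here the ceilings are 2 and 12 rows respectively. Asking for more rows than the ceiling is the situation described above, in which the matrix cannot be built.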