This vignette is a short introduction to the CoNVaQ R package. For a more detailed description of the functions in the CoNVaQ package and their arguments, we refer to the reference manual included with the package.
CoNVaQ takes as input two sets of segmented CNV calls, one for each group of patients. Each set must be a data.frame
with the following five columns in in this order:
patient
: An identifier for the patient or sample the segment was identified in. All segments with equal patient identifiers will be considered from the same patient/sample.chr
: Chromosome the position is located in.start
: First genomic position of the segment in base pairs.end
: Last genomic position of the segment in base pairs.type
: The variant type of the segment. Must be one of “Gain”, “Loss” or “LOH”.You can load the included synthetic example data set like this:
data("example", package="convaq")
The example
object contains two data frames, disease
and healthy
:
head(example$disease)
#> patient chr start end type
#> 1 CASE1 11 12611660 12902008 LOH
#> 2 CASE1 Y 31623109 32225521 LOH
#> 3 CASE1 8 15199931 15238875 LOH
#> 4 CASE1 1 116507114 116697901 Loss
#> 5 CASE1 18 33288382 34071926 Loss
#> 6 CASE1 19 19483990 20284356 Loss
head(example$healthy)
#> patient chr start end type
#> 1 CONTROL1 5 75230547 75596850 LOH
#> 2 CONTROL1 5 33707135 34231788 LOH
#> 3 CONTROL1 19 2876001 3832189 Gain
#> 4 CONTROL1 1 13255535 13921869 Gain
#> 5 CONTROL1 16 76410256 76461754 LOH
#> 6 CONTROL1 2 224005750 224280031 LOH
The statistical models searches for CNV regions significantly associated to the patient stratification using Fisher’s exact test. To use the statical model, we set the model
-argument to “statistical”. We set the p.cutoff
-argument to 0.05 to only include regions with \(p < 0.05\).
library(convaq)
results <- convaq(
example$disease, example$healthy,
model = "statistical",
name1 = "Disease", name2="Healthy",
p.cutoff = 0.05
)
The convaq
-function returns a convaq
-object. We can inspect the object by printing it:
print(results)
#> CoNVaQ results object.
#>
#> Model: statistical
#> Group 1 name: Disease
#> Group 2 name: Healthy
#> No. regions found: 59
#> Compute q-values: FALSE
#> Q-value repetitions: 500
#> P-value cutoff: 0.05
We see that CoNVaQ reports it found 59 statistically significant regions. Use the regions
-method to obtain these regions.
head(regions(results))
chr | start | end | length | type | pvalue |
---|---|---|---|---|---|
17 | 41663728 | 41759865 | 96139 | Gain | 9.0e-07 |
4 | 3022867 | 3295724 | 272859 | Gain | 9.0e-07 |
17 | 41654723 | 41657734 | 3013 | Gain | 2.5e-06 |
17 | 41660383 | 41663727 | 3346 | Gain | 4.7e-06 |
4 | 3295725 | 3302695 | 6972 | Gain | 4.7e-06 |
4 | 3014887 | 3022866 | 7981 | Gain | 4.7e-06 |
If we want to also compute empirical q-values we set the qvalues
-argument to TRUE
. Q-values are computed by repeatedly permutating the patient distributing among the two groups and determining how often we observe a more significant result. To change the number of permutations used, we can set the qvalues.rep
-argument. A greater value will improve accuracy at the cost of higher computation time.
Let’s compute q-values using 2000 repetitions:
results2 <- convaq(
example$disease, example$healthy,
model = "statistical",
name1 = "Disease", name2="Healthy",
p.cutoff = 0.05,
qvalues = TRUE,
qvalues.rep = 2000
)
head(regions(results2))
chr | start | end | length | type | pvalue | qvalue |
---|---|---|---|---|---|---|
4 | 3022867 | 3295724 | 272859 | Gain | 9.3e-07 | 0 |
20 | 10064997 | 10272582 | 207587 | Gain | 0.0433 | 0.509 |
17 | 41663728 | 41759865 | 96139 | Gain | 9.3e-07 | 0.624 |
Y | 29216184 | 29279783 | 63601 | Gain | 0.0433 | 0.716 |
20 | 9971176 | 10064996 | 93822 | Gain | 0.0433 | 0.716 |
20 | 10284667 | 10346809 | 62144 | Gain | 0.0433 | 0.766 |
If q-values are computed, the output will be sorted on that instead of p-value.
The query model searches for CNV regions matching a user-defined query. A query is defined by two predicates, one for each group of samples.
A predicate takes the following form:
[COMP] [FREQ] [EQ] [TYPE]
where:
COMP
is a comparison operator. One of “<” (less than), “>” (greater than), “<=” (less than or equal to) or “>=” (greater than or equal to).FREQ
is a numerical value between 0 and 1.EQ
is either “==” (equal to) or “!=” (not equal to).TYPE
is a segment type. One of “Gain”, “Loss”, “LOH” or “Normal”.A predicate is given as a string, with each argument separated by a single space. For instance, if we want to find regions where at least 50% of patients in a group are reported as “Gain”, we use the predicate:
">= 0.5 == Gain"
Likewise if we are searching for regions where less than 25% of patients in a group have any kind of variation, we can use the predicate:
"< 0.25 != Normal"
To use the query model we need to set the model
-argument to “query” and specify the two predicates using the arguments pred1
and pred2
. Let us use the query model to search for regions where at least 50% of patients in the disease group have a gain in copy numbers, while at most 15% of patients in the healthy group do:
results3 <- convaq(
example$disease, example$healthy,
model = "query",
name1 = "Disease", name2 = "Healthy",
pred1 = ">= 0.5 == Gain",
pred2 = "<= 0.15 == Gain",
qvalues = TRUE, qvalues.rep = 2000
)
head(regions(results3))
chr | start | end | length | qvalue |
---|---|---|---|---|
4 | 3022867 | 3295724 | 272859 | 0.0000 |
17 | 41663728 | 41759865 | 96139 | 0.0215 |
17 | 41759866 | 41784968 | 25104 | 0.0335 |
17 | 41808670 | 41823875 | 15207 | 0.0335 |
4 | 3327671 | 3339107 | 11438 | 0.0465 |
4 | 3316544 | 3327670 | 11128 | 0.3575 |
We can access the frequencies of each type of variation in the reported regions using the frequencies
-method:
head(frequencies(results))
Disease: Gain | Disease: Loss | Disease: LOH | Healthy: Gain | Healthy: Loss | Healthy: LOH |
---|---|---|---|---|---|
85 % | 0 % | 0 % | 9.09 % | 0 % | 0 % |
85 % | 0 % | 0 % | 9.09 % | 0 % | 0 % |
75 % | 0 % | 0 % | 4.55 % | 0 % | 0 % |
80 % | 0 % | 0 % | 9.09 % | 0 % | 0 % |
80 % | 0 % | 0 % | 9.09 % | 0 % | 0 % |
80 % | 0 % | 0 % | 9.09 % | 0 % | 0 % |
Each row corresponds to the same row in the table returned by the regions
-method. To get a better overview, we can combine the two tables with cbind
:
cbind(
head(regions(results)[,c("chr","start","end","pvalue")]),
head(frequencies(results)[,c(1,4)])
)
chr | start | end | pvalue | Disease: Gain | Healthy: Gain |
---|---|---|---|---|---|
17 | 41663728 | 41759865 | 9.0e-07 | 85 % | 9.09 % |
4 | 3022867 | 3295724 | 9.0e-07 | 85 % | 9.09 % |
17 | 41654723 | 41657734 | 2.5e-06 | 75 % | 4.55 % |
17 | 41660383 | 41663727 | 4.7e-06 | 80 % | 9.09 % |
4 | 3295725 | 3302695 | 4.7e-06 | 80 % | 9.09 % |
4 | 3014887 | 3022866 | 4.7e-06 | 80 % | 9.09 % |
We can access the states of the individual samples in the reported regions using the states
-method.
head(states(results)[,c(1:3,21:23)])
CASE1 | CASE10 | CASE11 | CONTROL1 | CONTROL10 | CONTROL11 |
---|---|---|---|---|---|
Normal | Gain | Gain | Normal | Gain | Normal |
Normal | Gain | Gain | Normal | Normal | Normal |
Normal | Gain | Gain | Normal | Normal | Normal |
Normal | Gain | Gain | Normal | Gain | Normal |
Normal | Gain | Gain | Normal | Normal | Normal |
Normal | Gain | Gain | Normal | Normal | Normal |
Each row corresponds to the same row in the table returned by the regions
-method. To get a better overview, we can combine the two tables with cbind
: (only six samples shown here)
cbind(
head(regions(results)[,c("chr","start","end","pvalue")]),
head(states(results)[,c(1:3,21:23)])
)
chr | start | end | pvalue | CASE1 | CASE10 | CASE11 | CONTROL1 | CONTROL10 | CONTROL11 |
---|---|---|---|---|---|---|---|---|---|
17 | 41663728 | 41759865 | 9.0e-07 | Normal | Gain | Gain | Normal | Gain | Normal |
4 | 3022867 | 3295724 | 9.0e-07 | Normal | Gain | Gain | Normal | Normal | Normal |
17 | 41654723 | 41657734 | 2.5e-06 | Normal | Gain | Gain | Normal | Normal | Normal |
17 | 41660383 | 41663727 | 4.7e-06 | Normal | Gain | Gain | Normal | Gain | Normal |
4 | 3295725 | 3302695 | 4.7e-06 | Normal | Gain | Gain | Normal | Normal | Normal |
4 | 3014887 | 3022866 | 4.7e-06 | Normal | Gain | Gain | Normal | Normal | Normal |