CoNVaQ: A short introduction

Simon J. Larsen

2018-04-03

This vignette is a short introduction to the CoNVaQ R package. For a more detailed description of the functions in the CoNVaQ package and their arguments, we refer to the reference manual included with the package.

Input data format

CoNVaQ takes as input two sets of segmented CNV calls, one for each group of patients. Each set must be a data.frame with the following five columns in in this order:

  1. patient: An identifier for the patient or sample the segment was identified in. All segments with equal patient identifiers will be considered from the same patient/sample.
  2. chr: Chromosome the position is located in.
  3. start: First genomic position of the segment in base pairs.
  4. end: Last genomic position of the segment in base pairs.
  5. type: The variant type of the segment. Must be one of “Gain”, “Loss” or “LOH”.

You can load the included synthetic example data set like this:

data("example", package="convaq")

The example object contains two data frames, disease and healthy:

head(example$disease)
#>   patient chr     start       end type
#> 1   CASE1  11  12611660  12902008  LOH
#> 2   CASE1   Y  31623109  32225521  LOH
#> 3   CASE1   8  15199931  15238875  LOH
#> 4   CASE1   1 116507114 116697901 Loss
#> 5   CASE1  18  33288382  34071926 Loss
#> 6   CASE1  19  19483990  20284356 Loss
head(example$healthy)
#>    patient chr     start       end type
#> 1 CONTROL1   5  75230547  75596850  LOH
#> 2 CONTROL1   5  33707135  34231788  LOH
#> 3 CONTROL1  19   2876001   3832189 Gain
#> 4 CONTROL1   1  13255535  13921869 Gain
#> 5 CONTROL1  16  76410256  76461754  LOH
#> 6 CONTROL1   2 224005750 224280031  LOH

Statistical model

The statistical models searches for CNV regions significantly associated to the patient stratification using Fisher’s exact test. To use the statical model, we set the model-argument to “statistical”. We set the p.cutoff-argument to 0.05 to only include regions with \(p < 0.05\).

library(convaq)
results <- convaq(
  example$disease, example$healthy,
  model = "statistical",
  name1 = "Disease", name2="Healthy",
  p.cutoff = 0.05
)

The convaq-function returns a convaq-object. We can inspect the object by printing it:

print(results)
#> CoNVaQ results object.
#> 
#> Model:                statistical 
#> Group 1 name:         Disease 
#> Group 2 name:         Healthy 
#> No. regions found:    59 
#> Compute q-values:     FALSE 
#> Q-value repetitions:  500 
#> P-value cutoff:       0.05

We see that CoNVaQ reports it found 59 statistically significant regions. Use the regions-method to obtain these regions.

head(regions(results))
chr start end length type pvalue
17 41663728 41759865 96139 Gain 9.0e-07
4 3022867 3295724 272859 Gain 9.0e-07
17 41654723 41657734 3013 Gain 2.5e-06
17 41660383 41663727 3346 Gain 4.7e-06
4 3295725 3302695 6972 Gain 4.7e-06
4 3014887 3022866 7981 Gain 4.7e-06

If we want to also compute empirical q-values we set the qvalues-argument to TRUE. Q-values are computed by repeatedly permutating the patient distributing among the two groups and determining how often we observe a more significant result. To change the number of permutations used, we can set the qvalues.rep-argument. A greater value will improve accuracy at the cost of higher computation time.

Let’s compute q-values using 2000 repetitions:

results2 <- convaq(
  example$disease, example$healthy,
  model = "statistical",
  name1 = "Disease", name2="Healthy",
  p.cutoff = 0.05,
  qvalues = TRUE,
  qvalues.rep = 2000
)
head(regions(results2))
chr start end length type pvalue qvalue
4 3022867 3295724 272859 Gain 9.3e-07 0
20 10064997 10272582 207587 Gain 0.0433 0.509
17 41663728 41759865 96139 Gain 9.3e-07 0.624
Y 29216184 29279783 63601 Gain 0.0433 0.716
20 9971176 10064996 93822 Gain 0.0433 0.716
20 10284667 10346809 62144 Gain 0.0433 0.766

If q-values are computed, the output will be sorted on that instead of p-value.

Query model

The query model searches for CNV regions matching a user-defined query. A query is defined by two predicates, one for each group of samples.

A predicate takes the following form:

[COMP] [FREQ] [EQ] [TYPE]

where:

A predicate is given as a string, with each argument separated by a single space. For instance, if we want to find regions where at least 50% of patients in a group are reported as “Gain”, we use the predicate:

">= 0.5 == Gain"

Likewise if we are searching for regions where less than 25% of patients in a group have any kind of variation, we can use the predicate:

"< 0.25 != Normal"

To use the query model we need to set the model-argument to “query” and specify the two predicates using the arguments pred1 and pred2. Let us use the query model to search for regions where at least 50% of patients in the disease group have a gain in copy numbers, while at most 15% of patients in the healthy group do:

results3 <- convaq(
  example$disease, example$healthy,
  model = "query",
  name1 = "Disease", name2 = "Healthy",
  pred1 = ">= 0.5 == Gain",
  pred2 = "<= 0.15 == Gain",
  qvalues = TRUE, qvalues.rep = 2000
)
head(regions(results3))
chr start end length qvalue
4 3022867 3295724 272859 0.0000
17 41663728 41759865 96139 0.0215
17 41759866 41784968 25104 0.0335
17 41808670 41823875 15207 0.0335
4 3327671 3339107 11438 0.0465
4 3316544 3327670 11128 0.3575

Accessing variation frequencies

We can access the frequencies of each type of variation in the reported regions using the frequencies-method:

head(frequencies(results))
Disease: Gain Disease: Loss Disease: LOH Healthy: Gain Healthy: Loss Healthy: LOH
85 % 0 % 0 % 9.09 % 0 % 0 %
85 % 0 % 0 % 9.09 % 0 % 0 %
75 % 0 % 0 % 4.55 % 0 % 0 %
80 % 0 % 0 % 9.09 % 0 % 0 %
80 % 0 % 0 % 9.09 % 0 % 0 %
80 % 0 % 0 % 9.09 % 0 % 0 %

Each row corresponds to the same row in the table returned by the regions-method. To get a better overview, we can combine the two tables with cbind:

cbind(
  head(regions(results)[,c("chr","start","end","pvalue")]),
  head(frequencies(results)[,c(1,4)])
)
chr start end pvalue Disease: Gain Healthy: Gain
17 41663728 41759865 9.0e-07 85 % 9.09 %
4 3022867 3295724 9.0e-07 85 % 9.09 %
17 41654723 41657734 2.5e-06 75 % 4.55 %
17 41660383 41663727 4.7e-06 80 % 9.09 %
4 3295725 3302695 4.7e-06 80 % 9.09 %
4 3014887 3022866 4.7e-06 80 % 9.09 %

Accessing individual sample states

We can access the states of the individual samples in the reported regions using the states-method.

head(states(results)[,c(1:3,21:23)])
CASE1 CASE10 CASE11 CONTROL1 CONTROL10 CONTROL11
Normal Gain Gain Normal Gain Normal
Normal Gain Gain Normal Normal Normal
Normal Gain Gain Normal Normal Normal
Normal Gain Gain Normal Gain Normal
Normal Gain Gain Normal Normal Normal
Normal Gain Gain Normal Normal Normal

Each row corresponds to the same row in the table returned by the regions-method. To get a better overview, we can combine the two tables with cbind: (only six samples shown here)

cbind(
  head(regions(results)[,c("chr","start","end","pvalue")]),
  head(states(results)[,c(1:3,21:23)])
)
chr start end pvalue CASE1 CASE10 CASE11 CONTROL1 CONTROL10 CONTROL11
17 41663728 41759865 9.0e-07 Normal Gain Gain Normal Gain Normal
4 3022867 3295724 9.0e-07 Normal Gain Gain Normal Normal Normal
17 41654723 41657734 2.5e-06 Normal Gain Gain Normal Normal Normal
17 41660383 41663727 4.7e-06 Normal Gain Gain Normal Gain Normal
4 3295725 3302695 4.7e-06 Normal Gain Gain Normal Normal Normal
4 3014887 3022866 4.7e-06 Normal Gain Gain Normal Normal Normal