The GaussSuppression package uses a common interface shared by other statistical disclosure control (SDC) packages developed at Statistics Norway (see also SmallCountRounding and SSBcellKey). In the background, these packages use a model matrix representation, which connects the input data to the intended output. This functionality is provided by the R package SSBtools. In this vignette, we look at multiple ways of specifying output tables given different forms of input. Note that this vignette only scratches the surface of what is possible with the provided interface; it is intended to help users get started with the package.
We begin by importing the necessary dependencies as well as loading a test data set provided in the SSBtools package.
library(SSBtools)
#> Loading required package: Matrix
library(GaussSuppression)
dataset <- SSBtools::SSBtoolsData("d2")
microdata <- SSBtools::MakeMicro(dataset, "freq")
head(dataset)
#> region county k_group main_income freq
#> 1 A 1 300 other 11
#> 2 B 4 300 other 7
#> 3 C 5 300 other 5
#> 4 D 5 300 other 13
#> 5 E 6 300 other 9
#> 6 F 6 300 other 12
nrow(dataset)
#> [1] 44
head(microdata)
#> region county k_group main_income freq
#> 1 A 1 300 other 1
#> 2 A 1 300 other 1
#> 3 A 1 300 other 1
#> 4 A 1 300 other 1
#> 5 A 1 300 other 1
#> 6 A 1 300 other 1
nrow(microdata)
#> [1] 706
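The imported data set is a fictitious data set containing the variables region, county, k_group, main_income, and freq. Here region, county, and k_group are different regional groupings: region is nested in both county and k_group, while county and k_group are not nested in each other. The dataset object holds aggregated frequencies, whereas microdata holds one row per unit. GaussSuppression can take microdata as input as well, which we will demonstrate in the following sections.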
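The relationship between the two inputs can be illustrated in base R. The following is a minimal sketch on a small made-up data frame (toy and toy_micro are hypothetical names, not part of SSBtools). It only mimics the effect of MakeMicro seen in the output above and is not how that function is implemented.

```r
# Hypothetical aggregated data: one row per cell, with a count in `freq`
toy <- data.frame(region = c("A", "B", "C"),
                  freq   = c(2, 0, 3),
                  stringsAsFactors = FALSE)

# Repeat each row `freq` times (rows with a zero count disappear),
# then set the count of every remaining row to 1
toy_micro <- toy[rep(seq_len(nrow(toy)), times = toy$freq), ]
toy_micro$freq <- 1
rownames(toy_micro) <- NULL

# The number of microdata rows equals the total count
nrow(toy_micro) == sum(toy$freq)
```

Summing freq over the expanded rows gives back the original cell counts, which is why both forms of input can describe the same tables.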
Output tables are mainly specified using the following three parameters: dimVar, hierarchies, and formula.
dimVar
The most basic way of defining output tables is by using the dimVar parameter. By default, this generates all combinations of the variables provided, including marginals. For example, the following function call creates a one-dimensional frequency table over the variable region.
GaussSuppressionFromData(data = dataset,
dimVar = "region",
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region freq primary suppressed
#> 1 Total 706 FALSE FALSE
#> 2 A 113 FALSE FALSE
#> 3 B 55 FALSE FALSE
#> 4 C 73 FALSE FALSE
#> 5 D 45 FALSE FALSE
#> 6 E 138 FALSE FALSE
#> 7 F 67 FALSE FALSE
#> 8 G 40 FALSE FALSE
#> 9 H 65 FALSE FALSE
#> 10 I 14 FALSE FALSE
#> 11 J 61 FALSE FALSE
#> 12 K 35 FALSE FALSE
Note the use of the function GaussSuppressionFromData and the inclusion of the two parameters primary and protectZeros. The functions in GaussSuppression are designed to incorporate both table building and protection into a single function call. Thus, to illustrate the table building features on their own, we set primary = FALSE and protectZeros = FALSE so that nothing is protected.
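The model matrix idea behind such a table can be sketched in base R. This example uses a made-up three-row data frame and base R's model.matrix; the names toy, x, and cell_freq are illustrative assumptions, and SSBtools builds its (sparse) matrices differently.

```r
# Hypothetical input: three inner cells with counts
toy <- data.frame(region = c("A", "B", "C"),
                  freq   = c(4, 1, 7),
                  stringsAsFactors = FALSE)

# Dummy matrix: one column for the total and one column per region.
# Rows are input records, columns are output cells.
x <- cbind(Total = 1, model.matrix(~ region - 1, data = toy))

# Each output cell is the sum of the input rows that contribute to it
cell_freq <- drop(crossprod(x, toy$freq))
cell_freq
```

The first entry is the marginal (the sum of all rows) and the remaining entries are the inner cells, which mirrors the layout of the one-dimensional table above.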
In a similar fashion, we can include multiple variables in the dimVar parameter:
GaussSuppressionFromData(data = dataset,
dimVar = c("region", "main_income"),
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region main_income freq primary suppressed
#> 1 Total Total 706 FALSE FALSE
#> 2 Total assistance 342 FALSE FALSE
#> 3 Total other 88 FALSE FALSE
#> 4 Total pensions 222 FALSE FALSE
#> 5 Total wages 54 FALSE FALSE
#> 6 A Total 113 FALSE FALSE
#> 7 A assistance 55 FALSE FALSE
#> 8 A other 11 FALSE FALSE
#> 9 A pensions 36 FALSE FALSE
#> 10 A wages 11 FALSE FALSE
#> 11 B Total 55 FALSE FALSE
#> 12 B assistance 29 FALSE FALSE
#> 13 B other 7 FALSE FALSE
#> 14 B pensions 18 FALSE FALSE
#> 15 B wages 1 FALSE FALSE
#> 16 C Total 73 FALSE FALSE
#> 17 C assistance 35 FALSE FALSE
#> 18 C other 5 FALSE FALSE
#> 19 C pensions 25 FALSE FALSE
#> 20 C wages 8 FALSE FALSE
#> 21 D Total 45 FALSE FALSE
#> 22 D assistance 17 FALSE FALSE
#> 23 D other 13 FALSE FALSE
#> 24 D pensions 13 FALSE FALSE
#> 25 D wages 2 FALSE FALSE
#> 26 E Total 138 FALSE FALSE
#> 27 E assistance 63 FALSE FALSE
#> 28 E other 9 FALSE FALSE
#> 29 E pensions 52 FALSE FALSE
#> 30 E wages 14 FALSE FALSE
#> 31 F Total 67 FALSE FALSE
#> 32 F assistance 24 FALSE FALSE
#> 33 F other 12 FALSE FALSE
#> 34 F pensions 22 FALSE FALSE
#> 35 F wages 9 FALSE FALSE
#> 36 G Total 40 FALSE FALSE
#> 37 G assistance 22 FALSE FALSE
#> 38 G other 6 FALSE FALSE
#> 39 G pensions 8 FALSE FALSE
#> 40 G wages 4 FALSE FALSE
#> 41 H Total 65 FALSE FALSE
#> 42 H assistance 38 FALSE FALSE
#> 43 H other 9 FALSE FALSE
#> 44 H pensions 15 FALSE FALSE
#> 45 H wages 3 FALSE FALSE
#> 46 I Total 14 FALSE FALSE
#> 47 I assistance 9 FALSE FALSE
#> 48 I other 3 FALSE FALSE
#> 49 I pensions 2 FALSE FALSE
#> 50 I wages 0 FALSE FALSE
#> 51 J Total 61 FALSE FALSE
#> 52 J assistance 32 FALSE FALSE
#> 53 J other 9 FALSE FALSE
#> 54 J pensions 20 FALSE FALSE
#> 55 J wages 0 FALSE FALSE
#> 56 K Total 35 FALSE FALSE
#> 57 K assistance 18 FALSE FALSE
#> 58 K other 4 FALSE FALSE
#> 59 K pensions 11 FALSE FALSE
#> 60 K wages 2 FALSE FALSE
Note in particular what happens when we provide two regional variables:
GaussSuppressionFromData(data = dataset,
dimVar = c("region", "county"),
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region freq primary suppressed
#> 1 Total 706 FALSE FALSE
#> 2 1 127 FALSE FALSE
#> 3 10 96 FALSE FALSE
#> 4 4 55 FALSE FALSE
#> 5 5 118 FALSE FALSE
#> 6 6 205 FALSE FALSE
#> 7 8 105 FALSE FALSE
#> 8 A 113 FALSE FALSE
#> 9 B 55 FALSE FALSE
#> 10 C 73 FALSE FALSE
#> 11 D 45 FALSE FALSE
#> 12 E 138 FALSE FALSE
#> 13 F 67 FALSE FALSE
#> 14 G 40 FALSE FALSE
#> 15 H 65 FALSE FALSE
#> 16 I 14 FALSE FALSE
#> 17 J 61 FALSE FALSE
#> 18 K 35 FALSE FALSE
The function detects hierarchies encoded in the dimVar columns and collapses them into a single column (with the name of the most detailed variable). In this way, it is not necessary to specify hierarchies by hand and include them explicitly in the function call. This also works for non-nested hierarchies:
GaussSuppressionFromData(data = dataset,
dimVar = c("region", "county", "k_group"),
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region freq primary suppressed
#> 1 1 127 FALSE FALSE
#> 2 10 96 FALSE FALSE
#> 3 300 596 FALSE FALSE
#> 4 4 55 FALSE FALSE
#> 5 400 110 FALSE FALSE
#> 6 5 118 FALSE FALSE
#> 7 6 205 FALSE FALSE
#> 8 8 105 FALSE FALSE
#> 9 Total 706 FALSE FALSE
#> 10 A 113 FALSE FALSE
#> 11 B 55 FALSE FALSE
#> 12 C 73 FALSE FALSE
#> 13 D 45 FALSE FALSE
#> 14 E 138 FALSE FALSE
#> 15 F 67 FALSE FALSE
#> 16 G 40 FALSE FALSE
#> 17 H 65 FALSE FALSE
#> 18 I 14 FALSE FALSE
#> 19 J 61 FALSE FALSE
#> 20 K 35 FALSE FALSE
In the background, functions from SSBtools are used to find the hierarchies. There are multiple ways of inspecting which hierarchies can be found; users familiar with DimLists used in other SDC packages can for example use the following:
FindDimLists(dataset[c("region", "county")])
#> $region
#> levels codes
#> 1 @ Total
#> 2 @@ 1
#> 3 @@@ A
#> 4 @@@ I
#> 5 @@ 4
#> 6 @@@ B
#> 7 @@ 5
#> 8 @@@ C
#> 9 @@@ D
#> 10 @@ 6
#> 11 @@@ E
#> 12 @@@ F
#> 13 @@ 8
#> 14 @@@ G
#> 15 @@@ H
#> 16 @@ 10
#> 17 @@@ J
#> 18 @@@ K
FindDimLists(dataset[c("region", "county", "k_group")])
#> $region
#> levels codes
#> 1 @ Total
#> 2 @@ 1
#> 3 @@@ A
#> 4 @@@ I
#> 5 @@ 4
#> 6 @@@ B
#> 7 @@ 5
#> 8 @@@ C
#> 9 @@@ D
#> 10 @@ 6
#> 11 @@@ E
#> 12 @@@ F
#> 13 @@ 8
#> 14 @@@ G
#> 15 @@@ H
#> 16 @@ 10
#> 17 @@@ J
#> 18 @@@ K
#>
#> $region
#> levels codes
#> 1 @ Total
#> 2 @@ 300
#> 3 @@@ A
#> 4 @@@ B
#> 5 @@@ C
#> 6 @@@ D
#> 7 @@@ E
#> 8 @@@ F
#> 9 @@@ G
#> 10 @@@ H
#> 11 @@ 400
#> 12 @@@ I
#> 13 @@@ J
#> 14 @@@ K
Note the last example which contained non-nested hierarchies. Here, a unique DimList is created for each tree-shaped hierarchy in the data set. This avoids the need for specifying non-nested hierarchies as linked tables.
Finally, for illustration purposes, we see that the same function calls work with microdata as input:
GaussSuppressionFromData(data = microdata,
dimVar = c("region", "county", "k_group"),
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region freq primary suppressed
#> 1 1 127 FALSE FALSE
#> 2 10 96 FALSE FALSE
#> 3 300 596 FALSE FALSE
#> 4 4 55 FALSE FALSE
#> 5 400 110 FALSE FALSE
#> 6 5 118 FALSE FALSE
#> 7 6 205 FALSE FALSE
#> 8 8 105 FALSE FALSE
#> 9 Total 706 FALSE FALSE
#> 10 A 113 FALSE FALSE
#> 11 B 55 FALSE FALSE
#> 12 C 73 FALSE FALSE
#> 13 D 45 FALSE FALSE
#> 14 E 138 FALSE FALSE
#> 15 F 67 FALSE FALSE
#> 16 G 40 FALSE FALSE
#> 17 H 65 FALSE FALSE
#> 18 I 14 FALSE FALSE
#> 19 J 61 FALSE FALSE
#> 20 K 35 FALSE FALSE
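The automatic detection used in this section depends on whether one variable is nested in another. As a rough base R illustration (is_nested is a hypothetical helper, not an SSBtools function), a variable is nested in another when each of its values belongs to exactly one value of the coarser variable. The mapping below is copied from the FindDimLists output shown earlier.

```r
# Hypothetical helper: TRUE if every value of `fine` maps to exactly one value of `coarse`
is_nested <- function(fine, coarse) {
  all(tapply(coarse, fine, function(x) length(unique(x))) == 1)
}

# Region-to-group mapping taken from the FindDimLists output
regions <- data.frame(
  region  = LETTERS[1:11],
  county  = c(1, 4, 5, 5, 6, 6, 8, 8, 1, 10, 10),
  k_group = c(rep(300, 8), rep(400, 3)),
  stringsAsFactors = FALSE
)

region_in_county  <- is_nested(regions$region, regions$county)
region_in_k_group <- is_nested(regions$region, regions$k_group)
county_in_k_group <- is_nested(regions$county, regions$k_group)

c(region_in_county, region_in_k_group, county_in_k_group)
```

Region is nested in both groupings, whereas county 1 contains regions from both k_group 300 and k_group 400, so county and k_group form two separate tree-shaped hierarchies over region.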
hierarchies
The hierarchies parameter allows explicit specification of which hierarchies should be used when creating the output table. This is more fine-grained than simply using dimVar, as it allows applying hierarchies that are not already present in the data set. Hierarchies can be provided in many ways. In this vignette, we exemplify three forms: as a dimlist (as defined in sdcTable), in the hrc format from TauArgus, and with a more general hierarchy specification (internally simply called hierarchy). Any of these can be provided to the hierarchies parameter, as they are all translated to the internal hierarchy representation. For the purposes of this vignette we use dimlists, but the following example also shows how they can be translated to the other forms using functions from SSBtools. Let us begin by defining two hierarchies using dimlists:
region_dim <- data.frame(levels = c("@", "@@", rep("@@@", 3), rep("@@", 8)),
                         codes = c("Total", "ABC", LETTERS[1:11]))
region_dim
#> levels codes
#> 1 @ Total
#> 2 @@ ABC
#> 3 @@@ A
#> 4 @@@ B
#> 5 @@@ C
#> 6 @@ D
#> 7 @@ E
#> 8 @@ F
#> 9 @@ G
#> 10 @@ H
#> 11 @@ I
#> 12 @@ J
#> 13 @@ K
income_dim <- data.frame(levels = c("@", "@@", "@@", "@@@", "@@@", "@@@"),
                         codes = c("Total", "wages", "not_wages", "other", "assistance", "pensions"))
income_dim
#> levels codes
#> 1 @ Total
#> 2 @@ wages
#> 3 @@ not_wages
#> 4 @@@ other
#> 5 @@@ assistance
#> 6 @@@ pensions
SSBtools::DimList2Hrc(income_dim)
#> [1] "wages" "not_wages" "@other" "@assistance" "@pensions"
SSBtools::DimList2Hierarchy(income_dim)
#> mapsFrom mapsTo sign level
#> 1 wages Total 1 2
#> 2 not_wages Total 1 2
#> 3 other not_wages 1 1
#> 4 assistance not_wages 1 1
#> 5 pensions not_wages 1 1
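The dimlist notation encodes depth by the number of @ characters, so the parent of each code is the nearest preceding code that is one level higher. The following base R sketch recovers the mapsFrom and mapsTo pairs shown above from income_dim. The variable names depth, parent, and edges are illustrative, and the loop is an assumption about how the notation can be read, not the SSBtools implementation.

```r
income_dim <- data.frame(
  levels = c("@", "@@", "@@", "@@@", "@@@", "@@@"),
  codes  = c("Total", "wages", "not_wages", "other", "assistance", "pensions"),
  stringsAsFactors = FALSE
)

# Depth of each code is the number of "@" characters
depth  <- nchar(income_dim$levels)
parent <- rep(NA_character_, nrow(income_dim))

# For every code except the total, find the closest earlier code one level up
for (i in seq_len(nrow(income_dim))[-1]) {
  above <- which(depth[seq_len(i - 1)] == depth[i] - 1)
  parent[i] <- income_dim$codes[max(above)]
}

edges <- data.frame(mapsFrom = income_dim$codes[-1],
                    mapsTo   = parent[-1],
                    stringsAsFactors = FALSE)
edges
```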
We can use these hierarchies to specify our output table. We do this by supplying a named list to the hierarchies parameter, where the list names correspond to variables in the data, and the list elements correspond to the hierarchies we wish to include.
GaussSuppressionFromData(data = dataset,
hierarchies = list(region = region_dim, main_income = income_dim),
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region main_income freq primary suppressed
#> 1 Total Total 706 FALSE FALSE
#> 2 Total not_wages 652 FALSE FALSE
#> 3 Total assistance 342 FALSE FALSE
#> 4 Total other 88 FALSE FALSE
#> 5 Total pensions 222 FALSE FALSE
#> 6 Total wages 54 FALSE FALSE
#> 7 ABC Total 241 FALSE FALSE
#> 8 ABC not_wages 221 FALSE FALSE
#> 9 ABC assistance 119 FALSE FALSE
#> 10 ABC other 23 FALSE FALSE
#> 11 ABC pensions 79 FALSE FALSE
#> 12 ABC wages 20 FALSE FALSE
#> 13 A Total 113 FALSE FALSE
#> 14 A not_wages 102 FALSE FALSE
#> 15 A assistance 55 FALSE FALSE
#> 16 A other 11 FALSE FALSE
#> 17 A pensions 36 FALSE FALSE
#> 18 A wages 11 FALSE FALSE
#> 19 B Total 55 FALSE FALSE
#> 20 B not_wages 54 FALSE FALSE
#> 21 B assistance 29 FALSE FALSE
#> 22 B other 7 FALSE FALSE
#> 23 B pensions 18 FALSE FALSE
#> 24 B wages 1 FALSE FALSE
#> 25 C Total 73 FALSE FALSE
#> 26 C not_wages 65 FALSE FALSE
#> 27 C assistance 35 FALSE FALSE
#> 28 C other 5 FALSE FALSE
#> 29 C pensions 25 FALSE FALSE
#> 30 C wages 8 FALSE FALSE
#> 31 D Total 45 FALSE FALSE
#> 32 D not_wages 43 FALSE FALSE
#> 33 D assistance 17 FALSE FALSE
#> 34 D other 13 FALSE FALSE
#> 35 D pensions 13 FALSE FALSE
#> 36 D wages 2 FALSE FALSE
#> 37 E Total 138 FALSE FALSE
#> 38 E not_wages 124 FALSE FALSE
#> 39 E assistance 63 FALSE FALSE
#> 40 E other 9 FALSE FALSE
#> 41 E pensions 52 FALSE FALSE
#> 42 E wages 14 FALSE FALSE
#> 43 F Total 67 FALSE FALSE
#> 44 F not_wages 58 FALSE FALSE
#> 45 F assistance 24 FALSE FALSE
#> 46 F other 12 FALSE FALSE
#> 47 F pensions 22 FALSE FALSE
#> 48 F wages 9 FALSE FALSE
#> 49 G Total 40 FALSE FALSE
#> 50 G not_wages 36 FALSE FALSE
#> 51 G assistance 22 FALSE FALSE
#> 52 G other 6 FALSE FALSE
#> 53 G pensions 8 FALSE FALSE
#> 54 G wages 4 FALSE FALSE
#> 55 H Total 65 FALSE FALSE
#> 56 H not_wages 62 FALSE FALSE
#> 57 H assistance 38 FALSE FALSE
#> 58 H other 9 FALSE FALSE
#> 59 H pensions 15 FALSE FALSE
#> 60 H wages 3 FALSE FALSE
#> 61 I Total 14 FALSE FALSE
#> 62 I not_wages 14 FALSE FALSE
#> 63 I assistance 9 FALSE FALSE
#> 64 I other 3 FALSE FALSE
#> 65 I pensions 2 FALSE FALSE
#> 66 I wages 0 FALSE FALSE
#> 67 J Total 61 FALSE FALSE
#> 68 J not_wages 61 FALSE FALSE
#> 69 J assistance 32 FALSE FALSE
#> 70 J other 9 FALSE FALSE
#> 71 J pensions 20 FALSE FALSE
#> 72 J wages 0 FALSE FALSE
#> 73 K Total 35 FALSE FALSE
#> 74 K not_wages 33 FALSE FALSE
#> 75 K assistance 18 FALSE FALSE
#> 76 K other 4 FALSE FALSE
#> 77 K pensions 11 FALSE FALSE
#> 78 K wages 2 FALSE FALSE
As mentioned previously, the GaussSuppression package supports non-nested hierarchies natively. We achieve this by having multiple elements with the same name in the hierarchies list:
region2_dim <- data.frame(levels = c("@", rep(c("@@", rep("@@@", 3)), 2), rep("@@", 5)),
                          codes = c("Total", "ACE", "A", "C", "E",
                                    "BDF", "B", "D", "F",
                                    "G", "H", "I", "J", "K"))
region2_dim
#> levels codes
#> 1 @ Total
#> 2 @@ ACE
#> 3 @@@ A
#> 4 @@@ C
#> 5 @@@ E
#> 6 @@ BDF
#> 7 @@@ B
#> 8 @@@ D
#> 9 @@@ F
#> 10 @@ G
#> 11 @@ H
#> 12 @@ I
#> 13 @@ J
#> 14 @@ K
GaussSuppressionFromData(data = dataset,
hierarchies = list(region = region_dim, region = region2_dim),
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region freq primary suppressed
#> 1 ABC 241 FALSE FALSE
#> 2 ACE 324 FALSE FALSE
#> 3 BDF 167 FALSE FALSE
#> 4 Total 706 FALSE FALSE
#> 5 A 113 FALSE FALSE
#> 6 B 55 FALSE FALSE
#> 7 C 73 FALSE FALSE
#> 8 D 45 FALSE FALSE
#> 9 E 138 FALSE FALSE
#> 10 F 67 FALSE FALSE
#> 11 G 40 FALSE FALSE
#> 12 H 65 FALSE FALSE
#> 13 I 14 FALSE FALSE
#> 14 J 61 FALSE FALSE
#> 15 K 35 FALSE FALSE
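The aggregated rows in this output can be checked by hand. The sketch below uses the region totals from the table above and sums them according to the two groupings; region_freq, groups, and group_freq are illustrative names, and this is a plain arithmetic check rather than part of the package.

```r
# Region totals copied from the output above
region_freq <- c(A = 113, B = 55, C = 73, D = 45, E = 138, F = 67,
                 G = 40, H = 65, I = 14, J = 61, K = 35)

# Aggregates from two hierarchies that are not nested in each other:
# ABC comes from region_dim, ACE and BDF come from region2_dim
groups <- list(ABC = c("A", "B", "C"),
               ACE = c("A", "C", "E"),
               BDF = c("B", "D", "F"))

group_freq <- sapply(groups, function(members) sum(region_freq[members]))
group_freq
```

Regions A and C contribute to both ABC and ACE, which is exactly what makes the two hierarchies non-nested.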
Finally, as before, all of this functionality works with microdata as input as well.
GaussSuppressionFromData(data = microdata,
hierarchies = list(region = region_dim, region = region2_dim),
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region freq primary suppressed
#> 1 ABC 241 FALSE FALSE
#> 2 ACE 324 FALSE FALSE
#> 3 BDF 167 FALSE FALSE
#> 4 Total 706 FALSE FALSE
#> 5 A 113 FALSE FALSE
#> 6 B 55 FALSE FALSE
#> 7 C 73 FALSE FALSE
#> 8 D 45 FALSE FALSE
#> 9 E 138 FALSE FALSE
#> 10 F 67 FALSE FALSE
#> 11 G 40 FALSE FALSE
#> 12 H 65 FALSE FALSE
#> 13 I 14 FALSE FALSE
#> 14 J 61 FALSE FALSE
#> 15 K 35 FALSE FALSE
formula
The most flexible method for specifying the output of GaussSuppression is the formula interface. This makes use of model formulas in R, and provides a powerful way of specifying multiple different tables. Indeed, all of the above examples, and much more, can be replicated using the formula interface. The formula's predictor variables must be variable names occurring in the data set (the dependent variable is ignored, and thus we leave it empty). In the following, we create a table based on the region and county variables. As before, the hierarchical relationship between these variables is detected automatically:
GaussSuppressionFromData(data = microdata,
formula = ~ region + county,
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region freq primary suppressed
#> 1 Total 706 FALSE FALSE
#> 2 A 113 FALSE FALSE
#> 3 B 55 FALSE FALSE
#> 4 C 73 FALSE FALSE
#> 5 D 45 FALSE FALSE
#> 6 E 138 FALSE FALSE
#> 7 F 67 FALSE FALSE
#> 8 G 40 FALSE FALSE
#> 9 H 65 FALSE FALSE
#> 10 I 14 FALSE FALSE
#> 11 J 61 FALSE FALSE
#> 12 K 35 FALSE FALSE
#> 13 1 127 FALSE FALSE
#> 14 4 55 FALSE FALSE
#> 15 5 118 FALSE FALSE
#> 16 6 205 FALSE FALSE
#> 17 8 105 FALSE FALSE
#> 18 10 96 FALSE FALSE
If there is no hierarchical relationship between the variables, multiplication in the formula and specification in dimVar yield the same cells, although the rows may appear in a different order.
GaussSuppressionFromData(data = microdata,
formula = ~ county * main_income,
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> county main_income freq primary suppressed
#> 1 Total Total 706 FALSE FALSE
#> 2 1 Total 127 FALSE FALSE
#> 3 4 Total 55 FALSE FALSE
#> 4 5 Total 118 FALSE FALSE
#> 5 6 Total 205 FALSE FALSE
#> 6 8 Total 105 FALSE FALSE
#> 7 10 Total 96 FALSE FALSE
#> 8 Total assistance 342 FALSE FALSE
#> 9 Total other 88 FALSE FALSE
#> 10 Total pensions 222 FALSE FALSE
#> 11 Total wages 54 FALSE FALSE
#> 12 1 assistance 64 FALSE FALSE
#> 13 1 other 14 FALSE FALSE
#> 14 1 pensions 38 FALSE FALSE
#> 15 1 wages 11 FALSE FALSE
#> 16 4 assistance 29 FALSE FALSE
#> 17 4 other 7 FALSE FALSE
#> 18 4 pensions 18 FALSE FALSE
#> 19 4 wages 1 FALSE FALSE
#> 20 5 assistance 52 FALSE FALSE
#> 21 5 other 18 FALSE FALSE
#> 22 5 pensions 38 FALSE FALSE
#> 23 5 wages 10 FALSE FALSE
#> 24 6 assistance 87 FALSE FALSE
#> 25 6 other 21 FALSE FALSE
#> 26 6 pensions 74 FALSE FALSE
#> 27 6 wages 23 FALSE FALSE
#> 28 8 assistance 60 FALSE FALSE
#> 29 8 other 15 FALSE FALSE
#> 30 8 pensions 23 FALSE FALSE
#> 31 8 wages 7 FALSE FALSE
#> 32 10 assistance 50 FALSE FALSE
#> 33 10 other 13 FALSE FALSE
#> 34 10 pensions 31 FALSE FALSE
#> 35 10 wages 2 FALSE FALSE
GaussSuppressionFromData(data = microdata,
dimVar = c("county" , "main_income"),
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> county main_income freq primary suppressed
#> 1 Total Total 706 FALSE FALSE
#> 2 Total assistance 342 FALSE FALSE
#> 3 Total other 88 FALSE FALSE
#> 4 Total pensions 222 FALSE FALSE
#> 5 Total wages 54 FALSE FALSE
#> 6 1 Total 127 FALSE FALSE
#> 7 1 assistance 64 FALSE FALSE
#> 8 1 other 14 FALSE FALSE
#> 9 1 pensions 38 FALSE FALSE
#> 10 1 wages 11 FALSE FALSE
#> 11 10 Total 96 FALSE FALSE
#> 12 10 assistance 50 FALSE FALSE
#> 13 10 other 13 FALSE FALSE
#> 14 10 pensions 31 FALSE FALSE
#> 15 10 wages 2 FALSE FALSE
#> 16 4 Total 55 FALSE FALSE
#> 17 4 assistance 29 FALSE FALSE
#> 18 4 other 7 FALSE FALSE
#> 19 4 pensions 18 FALSE FALSE
#> 20 4 wages 1 FALSE FALSE
#> 21 5 Total 118 FALSE FALSE
#> 22 5 assistance 52 FALSE FALSE
#> 23 5 other 18 FALSE FALSE
#> 24 5 pensions 38 FALSE FALSE
#> 25 5 wages 10 FALSE FALSE
#> 26 6 Total 205 FALSE FALSE
#> 27 6 assistance 87 FALSE FALSE
#> 28 6 other 21 FALSE FALSE
#> 29 6 pensions 74 FALSE FALSE
#> 30 6 wages 23 FALSE FALSE
#> 31 8 Total 105 FALSE FALSE
#> 32 8 assistance 60 FALSE FALSE
#> 33 8 other 15 FALSE FALSE
#> 34 8 pensions 23 FALSE FALSE
#> 35 8 wages 7 FALSE FALSE
However, formula lets us specify different shapes for our tables. For example, if we are only interested in marginal values, we can specify this with the addition operator:
GaussSuppressionFromData(data = microdata,
formula = ~ county + main_income,
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> county main_income freq primary suppressed
#> 1 Total Total 706 FALSE FALSE
#> 2 1 Total 127 FALSE FALSE
#> 3 4 Total 55 FALSE FALSE
#> 4 5 Total 118 FALSE FALSE
#> 5 6 Total 205 FALSE FALSE
#> 6 8 Total 105 FALSE FALSE
#> 7 10 Total 96 FALSE FALSE
#> 8 Total assistance 342 FALSE FALSE
#> 9 Total other 88 FALSE FALSE
#> 10 Total pensions 222 FALSE FALSE
#> 11 Total wages 54 FALSE FALSE
This example demonstrates, in fact, the ability of specifying multiple linked tables: a one-dimensional table for county linked with a one-dimensional table for main_income. Similarly, we can use the colon (“:”) operator to omit row and column marginals:
GaussSuppressionFromData(data = microdata,
formula = ~ county : main_income,
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> county main_income freq primary suppressed
#> 1 Total Total 706 FALSE FALSE
#> 2 1 assistance 64 FALSE FALSE
#> 3 1 other 14 FALSE FALSE
#> 4 1 pensions 38 FALSE FALSE
#> 5 1 wages 11 FALSE FALSE
#> 6 4 assistance 29 FALSE FALSE
#> 7 4 other 7 FALSE FALSE
#> 8 4 pensions 18 FALSE FALSE
#> 9 4 wages 1 FALSE FALSE
#> 10 5 assistance 52 FALSE FALSE
#> 11 5 other 18 FALSE FALSE
#> 12 5 pensions 38 FALSE FALSE
#> 13 5 wages 10 FALSE FALSE
#> 14 6 assistance 87 FALSE FALSE
#> 15 6 other 21 FALSE FALSE
#> 16 6 pensions 74 FALSE FALSE
#> 17 6 wages 23 FALSE FALSE
#> 18 8 assistance 60 FALSE FALSE
#> 19 8 other 15 FALSE FALSE
#> 20 8 pensions 23 FALSE FALSE
#> 21 8 wages 7 FALSE FALSE
#> 22 10 assistance 50 FALSE FALSE
#> 23 10 other 13 FALSE FALSE
#> 24 10 pensions 31 FALSE FALSE
#> 25 10 wages 2 FALSE FALSE
Using subtraction, we can omit marginals and other cells from the output. For example, the intercept (the sum over all records) can be omitted by including - 1 in the formula, like this: formula = ~ county : main_income - 1.
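The operators discussed above follow R's usual formula algebra, which can be inspected with base R's terms function. The sketch below only shows how R itself expands each formula; how GaussSuppression turns the resulting terms into table cells is as described in the text. The variable names are illustrative.

```r
# "*" expands to both main effects (marginals) and the interaction (inner cells)
labels_star  <- attr(terms(~ county * main_income), "term.labels")

# "+" keeps only the main effects
labels_plus  <- attr(terms(~ county + main_income), "term.labels")

# ":" keeps only the interaction
labels_colon <- attr(terms(~ county : main_income), "term.labels")

# "- 1" removes the intercept: the flag is 1 when present and 0 when removed
intercept_flag <- attr(terms(~ county : main_income - 1), "intercept")

labels_star
labels_plus
labels_colon
intercept_flag
```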
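Using these features, we can define more complicated linked tables. To illustrate this, let us assume we wish to publish three linked tables: region by a two-category income variable (wages versus not_wages), county by main_income, and k_group by main_income.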
To do this, we begin by adding a column encoding whether the main source of income was “wages” or “not_wages”.
dataset$income2 <- ifelse(dataset$main_income == "wages", "wages", "not_wages")
microdata$income2 <- ifelse(microdata$main_income == "wages", "wages", "not_wages")
head(dataset)
#> region county k_group main_income freq income2
#> 1 A 1 300 other 11 not_wages
#> 2 B 4 300 other 7 not_wages
#> 3 C 5 300 other 5 not_wages
#> 4 D 5 300 other 13 not_wages
#> 5 E 6 300 other 9 not_wages
#> 6 F 6 300 other 12 not_wages
Then we can specify the desired output with the following formula:
GaussSuppressionFromData(data = dataset,
formula = ~ region * income2 + (county + k_group) * main_income,
freqVar = "freq",
primary = FALSE,
protectZeros = FALSE)
#> region main_income freq primary suppressed
#> 1 Total Total 706 FALSE FALSE
#> 2 A Total 113 FALSE FALSE
#> 3 B Total 55 FALSE FALSE
#> 4 C Total 73 FALSE FALSE
#> 5 D Total 45 FALSE FALSE
#> 6 E Total 138 FALSE FALSE
#> 7 F Total 67 FALSE FALSE
#> 8 G Total 40 FALSE FALSE
#> 9 H Total 65 FALSE FALSE
#> 10 I Total 14 FALSE FALSE
#> 11 J Total 61 FALSE FALSE
#> 12 K Total 35 FALSE FALSE
#> 13 Total not_wages 652 FALSE FALSE
#> 14 Total wages 54 FALSE FALSE
#> 15 1 Total 127 FALSE FALSE
#> 16 4 Total 55 FALSE FALSE
#> 17 5 Total 118 FALSE FALSE
#> 18 6 Total 205 FALSE FALSE
#> 19 8 Total 105 FALSE FALSE
#> 20 10 Total 96 FALSE FALSE
#> 21 300 Total 596 FALSE FALSE
#> 22 400 Total 110 FALSE FALSE
#> 23 Total assistance 342 FALSE FALSE
#> 24 Total other 88 FALSE FALSE
#> 25 Total pensions 222 FALSE FALSE
#> 26 Total wages 54 FALSE FALSE
#> 27 A not_wages 102 FALSE FALSE
#> 28 A wages 11 FALSE FALSE
#> 29 B not_wages 54 FALSE FALSE
#> 30 B wages 1 FALSE FALSE
#> 31 C not_wages 65 FALSE FALSE
#> 32 C wages 8 FALSE FALSE
#> 33 D not_wages 43 FALSE FALSE
#> 34 D wages 2 FALSE FALSE
#> 35 E not_wages 124 FALSE FALSE
#> 36 E wages 14 FALSE FALSE
#> 37 F not_wages 58 FALSE FALSE
#> 38 F wages 9 FALSE FALSE
#> 39 G not_wages 36 FALSE FALSE
#> 40 G wages 4 FALSE FALSE
#> 41 H not_wages 62 FALSE FALSE
#> 42 H wages 3 FALSE FALSE
#> 43 I not_wages 14 FALSE FALSE
#> 44 I wages 0 FALSE FALSE
#> 45 J not_wages 61 FALSE FALSE
#> 46 J wages 0 FALSE FALSE
#> 47 K not_wages 33 FALSE FALSE
#> 48 K wages 2 FALSE FALSE
#> 49 1 assistance 64 FALSE FALSE
#> 50 1 other 14 FALSE FALSE
#> 51 1 pensions 38 FALSE FALSE
#> 52 1 wages 11 FALSE FALSE
#> 53 4 assistance 29 FALSE FALSE
#> 54 4 other 7 FALSE FALSE
#> 55 4 pensions 18 FALSE FALSE
#> 56 4 wages 1 FALSE FALSE
#> 57 5 assistance 52 FALSE FALSE
#> 58 5 other 18 FALSE FALSE
#> 59 5 pensions 38 FALSE FALSE
#> 60 5 wages 10 FALSE FALSE
#> 61 6 assistance 87 FALSE FALSE
#> 62 6 other 21 FALSE FALSE
#> 63 6 pensions 74 FALSE FALSE
#> 64 6 wages 23 FALSE FALSE
#> 65 8 assistance 60 FALSE FALSE
#> 66 8 other 15 FALSE FALSE
#> 67 8 pensions 23 FALSE FALSE
#> 68 8 wages 7 FALSE FALSE
#> 69 10 assistance 50 FALSE FALSE
#> 70 10 other 13 FALSE FALSE
#> 71 10 pensions 31 FALSE FALSE
#> 72 10 wages 2 FALSE FALSE
#> 73 300 assistance 283 FALSE FALSE
#> 74 300 other 72 FALSE FALSE
#> 75 300 pensions 189 FALSE FALSE
#> 76 300 wages 52 FALSE FALSE
#> 77 400 assistance 59 FALSE FALSE
#> 78 400 other 16 FALSE FALSE
#> 79 400 pensions 33 FALSE FALSE
#> 80 400 wages 2 FALSE FALSE
In this manner, we can specify multiple linked tables, each of which can use different non-nested hierarchies. This allows the suppression algorithm to protect all of these tables simultaneously (indeed, they are treated as a single table internally), avoiding the need for a stratified protection paradigm. Furthermore, the fine-grained specification of which cells are to be published allows the secondary suppression algorithm to protect with respect to precisely those cells that will be published. If row and column marginals are not published, for example, the suppression algorithm does not need to secondary suppress with respect to these marginals.
In addition to defining the dimensions of the output tables, we need to decide whether they should be frequency tables (where we count contributing records) or magnitude tables (where we add contributing records' numerical values for a given variable). All of the above examples have been frequency tables. However, the process is exactly the same if one wishes to construct magnitude tables; the only difference is that one must specify the numerical variable with the help of the parameter numVar.
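The difference between the two table types can be shown on a small made-up microdata set using base R's aggregate. The data frame and the names freq_tab, num_tab, and both are hypothetical and serve only to illustrate counting versus summing.

```r
# Hypothetical microdata: one row per unit with a numerical value
toy_micro <- data.frame(region = c("A", "A", "B", "B", "B"),
                        num    = c(10, 20, 5, 0, 15),
                        stringsAsFactors = FALSE)

# Frequency table: count the contributing records in each cell
freq_tab <- aggregate(num ~ region, data = toy_micro, FUN = length)
names(freq_tab)[2] <- "freq"

# Magnitude table: add up the contributing records' values in each cell
num_tab <- aggregate(num ~ region, data = toy_micro, FUN = sum)

# Both views of the same cells side by side
both <- merge(freq_tab, num_tab, by = "region")
both
```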
Since most magnitude table suppression methods are based on comparing units' contributions, the input data will most likely be supplied as microdata. Therefore, let us add a fake numerical variable to our microdata:
set.seed(12345)
microdata$num <- sample(0:1000, nrow(microdata), replace = TRUE)
Then, in order to construct a magnitude table where records' contributions to num are aggregated, we supply this variable to GaussSuppressionFromData through the numVar parameter:
GaussSuppressionFromData(data = microdata,
formula = ~ region * income2 + (county + k_group) * main_income,
numVar = "num",
primary = FALSE,
protectZeros = FALSE)
#> [preAggregate 706*7->42*7]
#> region main_income freq.1 num primary suppressed
#> 1 Total Total 706 358773 FALSE FALSE
#> 2 A Total 113 56793 FALSE FALSE
#> 3 B Total 55 31867 FALSE FALSE
#> 4 C Total 73 33500 FALSE FALSE
#> 5 D Total 45 22829 FALSE FALSE
#> 6 E Total 138 66412 FALSE FALSE
#> 7 F Total 67 38823 FALSE FALSE
#> 8 G Total 40 18817 FALSE FALSE
#> 9 H Total 65 33314 FALSE FALSE
#> 10 I Total 14 6870 FALSE FALSE
#> 11 J Total 61 31353 FALSE FALSE
#> 12 K Total 35 18195 FALSE FALSE
#> 13 Total not_wages 652 330316 FALSE FALSE
#> 14 Total wages 54 28457 FALSE FALSE
#> 15 1 Total 127 63663 FALSE FALSE
#> 16 4 Total 55 31867 FALSE FALSE
#> 17 5 Total 118 56329 FALSE FALSE
#> 18 6 Total 205 105235 FALSE FALSE
#> 19 8 Total 105 52131 FALSE FALSE
#> 20 10 Total 96 49548 FALSE FALSE
#> 21 300 Total 596 302355 FALSE FALSE
#> 22 400 Total 110 56418 FALSE FALSE
#> 23 Total assistance 342 171392 FALSE FALSE
#> 24 Total other 88 45958 FALSE FALSE
#> 25 Total pensions 222 112966 FALSE FALSE
#> 26 Total wages 54 28457 FALSE FALSE
#> 27 A not_wages 102 51447 FALSE FALSE
#> 28 A wages 11 5346 FALSE FALSE
#> 29 B not_wages 54 31001 FALSE FALSE
#> 30 B wages 1 866 FALSE FALSE
#> 31 C not_wages 65 29678 FALSE FALSE
#> 32 C wages 8 3822 FALSE FALSE
#> 33 D not_wages 43 22041 FALSE FALSE
#> 34 D wages 2 788 FALSE FALSE
#> 35 E not_wages 124 57540 FALSE FALSE
#> 36 E wages 14 8872 FALSE FALSE
#> 37 F not_wages 58 34933 FALSE FALSE
#> 38 F wages 9 3890 FALSE FALSE
#> 39 G not_wages 36 16348 FALSE FALSE
#> 40 G wages 4 2469 FALSE FALSE
#> 41 H not_wages 62 31651 FALSE FALSE
#> 42 H wages 3 1663 FALSE FALSE
#> 43 I not_wages 14 6870 FALSE FALSE
#> 44 J not_wages 61 31353 FALSE FALSE
#> 45 K not_wages 33 17454 FALSE FALSE
#> 46 K wages 2 741 FALSE FALSE
#> 47 1 assistance 64 29577 FALSE FALSE
#> 48 1 other 14 6583 FALSE FALSE
#> 49 1 pensions 38 22157 FALSE FALSE
#> 50 1 wages 11 5346 FALSE FALSE
#> 51 4 assistance 29 16798 FALSE FALSE
#> 52 4 other 7 3217 FALSE FALSE
#> 53 4 pensions 18 10986 FALSE FALSE
#> 54 4 wages 1 866 FALSE FALSE
#> 55 5 assistance 52 24467 FALSE FALSE
#> 56 5 other 18 9436 FALSE FALSE
#> 57 5 pensions 38 17816 FALSE FALSE
#> 58 5 wages 10 4610 FALSE FALSE
#> 59 6 assistance 87 44849 FALSE FALSE
#> 60 6 other 21 12582 FALSE FALSE
#> 61 6 pensions 74 35042 FALSE FALSE
#> 62 6 wages 23 12762 FALSE FALSE
#> 63 8 assistance 60 31136 FALSE FALSE
#> 64 8 other 15 6462 FALSE FALSE
#> 65 8 pensions 23 10401 FALSE FALSE
#> 66 8 wages 7 4132 FALSE FALSE
#> 67 10 assistance 50 24565 FALSE FALSE
#> 68 10 other 13 7678 FALSE FALSE
#> 69 10 pensions 31 16564 FALSE FALSE
#> 70 10 wages 2 741 FALSE FALSE
#> 71 300 assistance 283 142670 FALSE FALSE
#> 72 300 other 72 36799 FALSE FALSE
#> 73 300 pensions 189 95170 FALSE FALSE
#> 74 300 wages 52 27716 FALSE FALSE
#> 75 400 assistance 59 28722 FALSE FALSE
#> 76 400 other 16 9159 FALSE FALSE
#> 77 400 pensions 33 17796 FALSE FALSE
#> 78 400 wages 2 741 FALSE FALSE
Note that a new frequency variable (freq.1 in the output above) is generated by the above call. If a frequency variable is already present in the input data, we can provide it via freqVar in addition to numVar, and the method will use that information instead:
GaussSuppressionFromData(data = microdata,
formula = ~ region * income2 + (county + k_group) * main_income,
freqVar = "freq",
numVar = "num",
primary = FALSE,
protectZeros = FALSE)
#> region main_income freq num primary suppressed
#> 1 Total Total 706 358773 FALSE FALSE
#> 2 A Total 113 56793 FALSE FALSE
#> 3 B Total 55 31867 FALSE FALSE
#> 4 C Total 73 33500 FALSE FALSE
#> 5 D Total 45 22829 FALSE FALSE
#> 6 E Total 138 66412 FALSE FALSE
#> 7 F Total 67 38823 FALSE FALSE
#> 8 G Total 40 18817 FALSE FALSE
#> 9 H Total 65 33314 FALSE FALSE
#> 10 I Total 14 6870 FALSE FALSE
#> 11 J Total 61 31353 FALSE FALSE
#> 12 K Total 35 18195 FALSE FALSE
#> 13 Total not_wages 652 330316 FALSE FALSE
#> 14 Total wages 54 28457 FALSE FALSE
#> 15 1 Total 127 63663 FALSE FALSE
#> 16 4 Total 55 31867 FALSE FALSE
#> 17 5 Total 118 56329 FALSE FALSE
#> 18 6 Total 205 105235 FALSE FALSE
#> 19 8 Total 105 52131 FALSE FALSE
#> 20 10 Total 96 49548 FALSE FALSE
#> 21 300 Total 596 302355 FALSE FALSE
#> 22 400 Total 110 56418 FALSE FALSE
#> 23 Total assistance 342 171392 FALSE FALSE
#> 24 Total other 88 45958 FALSE FALSE
#> 25 Total pensions 222 112966 FALSE FALSE
#> 26 Total wages 54 28457 FALSE FALSE
#> 27 A not_wages 102 51447 FALSE FALSE
#> 28 A wages 11 5346 FALSE FALSE
#> 29 B not_wages 54 31001 FALSE FALSE
#> 30 B wages 1 866 FALSE FALSE
#> 31 C not_wages 65 29678 FALSE FALSE
#> 32 C wages 8 3822 FALSE FALSE
#> 33 D not_wages 43 22041 FALSE FALSE
#> 34 D wages 2 788 FALSE FALSE
#> 35 E not_wages 124 57540 FALSE FALSE
#> 36 E wages 14 8872 FALSE FALSE
#> 37 F not_wages 58 34933 FALSE FALSE
#> 38 F wages 9 3890 FALSE FALSE
#> 39 G not_wages 36 16348 FALSE FALSE
#> 40 G wages 4 2469 FALSE FALSE
#> 41 H not_wages 62 31651 FALSE FALSE
#> 42 H wages 3 1663 FALSE FALSE
#> 43 I not_wages 14 6870 FALSE FALSE
#> 44 J not_wages 61 31353 FALSE FALSE
#> 45 K not_wages 33 17454 FALSE FALSE
#> 46 K wages 2 741 FALSE FALSE
#> 47 1 assistance 64 29577 FALSE FALSE
#> 48 1 other 14 6583 FALSE FALSE
#> 49 1 pensions 38 22157 FALSE FALSE
#> 50 1 wages 11 5346 FALSE FALSE
#> 51 4 assistance 29 16798 FALSE FALSE
#> 52 4 other 7 3217 FALSE FALSE
#> 53 4 pensions 18 10986 FALSE FALSE
#> 54 4 wages 1 866 FALSE FALSE
#> 55 5 assistance 52 24467 FALSE FALSE
#> 56 5 other 18 9436 FALSE FALSE
#> 57 5 pensions 38 17816 FALSE FALSE
#> 58 5 wages 10 4610 FALSE FALSE
#> 59 6 assistance 87 44849 FALSE FALSE
#> 60 6 other 21 12582 FALSE FALSE
#> 61 6 pensions 74 35042 FALSE FALSE
#> 62 6 wages 23 12762 FALSE FALSE
#> 63 8 assistance 60 31136 FALSE FALSE
#> 64 8 other 15 6462 FALSE FALSE
#> 65 8 pensions 23 10401 FALSE FALSE
#> 66 8 wages 7 4132 FALSE FALSE
#> 67 10 assistance 50 24565 FALSE FALSE
#> 68 10 other 13 7678 FALSE FALSE
#> 69 10 pensions 31 16564 FALSE FALSE
#> 70 10 wages 2 741 FALSE FALSE
#> 71 300 assistance 283 142670 FALSE FALSE
#> 72 300 other 72 36799 FALSE FALSE
#> 73 300 pensions 189 95170 FALSE FALSE
#> 74 300 wages 52 27716 FALSE FALSE
#> 75 400 assistance 59 28722 FALSE FALSE
#> 76 400 other 16 9159 FALSE FALSE
#> 77 400 pensions 33 17796 FALSE FALSE
#> 78 400 wages 2 741 FALSE FALSE