Defining Tables for GaussSuppression

Daniel P. Lupp

Introduction and setup

The GaussSuppression package uses a common interface shared by other SDC packages developed at Statistics Norway (see also SmallCountRounding and SSBcellKey). In the background, these packages use a model matrix representation, which connects the input data to the intended output. This functionality is provided by the R package SSBtools. In this vignette, we look at multiple ways of specifying output tables given different forms of input. The vignette only scratches the surface of what is possible with the provided interface; it is intended to help users get started with the package.

We begin by importing the necessary dependencies as well as loading a test data set provided in the SSBtools package.

library(SSBtools)
#> Loading required package: Matrix
library(GaussSuppression)

dataset <- SSBtools::SSBtoolsData("d2")
microdata <- SSBtools::MakeMicro(dataset, "freq")

head(dataset)
#>   region county k_group main_income freq
#> 1      A      1     300       other   11
#> 2      B      4     300       other    7
#> 3      C      5     300       other    5
#> 4      D      5     300       other   13
#> 5      E      6     300       other    9
#> 6      F      6     300       other   12
nrow(dataset)
#> [1] 44
head(microdata)
#>   region county k_group main_income freq
#> 1      A      1     300       other    1
#> 2      A      1     300       other    1
#> 3      A      1     300       other    1
#> 4      A      1     300       other    1
#> 5      A      1     300       other    1
#> 6      A      1     300       other    1
nrow(microdata)
#> [1] 706

The imported data set is fictitious and contains the variables region, county, k_group, main_income, and freq. Here county and k_group are two different, mutually non-nested groupings of region, and freq holds the cell counts. The microdata version contains one row per unit (with freq equal to 1). GaussSuppression accepts both forms as input, which we will demonstrate in the following sections.
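
As a quick sanity check on how the two inputs relate, the following sketch uses only base R and the two SSBtools calls shown above. The names total_aggregated and total_micro are introduced here for illustration only.

```r
library(SSBtools)

dataset <- SSBtoolsData("d2")
microdata <- MakeMicro(dataset, "freq")

# In the aggregated data, freq holds the cell counts
total_aggregated <- sum(dataset$freq)

# In the microdata, every row stands for one unit, so freq is 1 in each row
total_micro <- sum(microdata$freq)

total_aggregated == total_micro          # same grand total in both inputs
nrow(microdata) == total_aggregated      # one microdata row per counted unit
```

Both comparisons should be TRUE, which is why the same function calls give the same tables for either input.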

Defining Table Dimensions

Output tables are mainly specified using the following three parameters: dimVar, hierarchies, and formula.

Creating tables using dimVar

The most basic way of defining output tables is by using the dimVar parameter. By default, this generates all combinations of the variables provided, including marginals. For example, the following function call creates a one-dimensional frequency table over the variable region.

GaussSuppressionFromData(data = dataset,
                         dimVar = "region",
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region freq primary suppressed
#> 1   Total  706   FALSE      FALSE
#> 2       A  113   FALSE      FALSE
#> 3       B   55   FALSE      FALSE
#> 4       C   73   FALSE      FALSE
#> 5       D   45   FALSE      FALSE
#> 6       E  138   FALSE      FALSE
#> 7       F   67   FALSE      FALSE
#> 8       G   40   FALSE      FALSE
#> 9       H   65   FALSE      FALSE
#> 10      I   14   FALSE      FALSE
#> 11      J   61   FALSE      FALSE
#> 12      K   35   FALSE      FALSE

Note the use of the function GaussSuppressionFromData and the inclusion of the two parameters primary and protectZeros. The functions in GaussSuppression are designed to incorporate both table building and protection into a single function call. Thus, to illustrate the table building features on their own, we have set these parameters so that nothing is protected.
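
Because nothing is protected in this call, the output should agree with a plain aggregation. The sketch below compares the region table with sums computed by base R; the names out, by_region, and inner are chosen here for illustration.

```r
library(SSBtools)
library(GaussSuppression)

dataset <- SSBtoolsData("d2")

out <- GaussSuppressionFromData(data = dataset,
                                dimVar = "region",
                                freqVar = "freq",
                                primary = FALSE,
                                protectZeros = FALSE)

# Plain aggregation with base R: one sum per region code
by_region <- tapply(dataset$freq, dataset$region, sum)

# Compare the inner cells of the output with the base R sums
inner <- out[out$region != "Total", ]
all(inner$freq == by_region[as.character(inner$region)])

# With these settings, no cell is flagged as suppressed
any(out$suppressed)
```

The first check should return TRUE and the second FALSE, in line with the output printed above.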

In a similar fashion, we can include multiple variables in the dimVar parameter:

GaussSuppressionFromData(data = dataset,
                         dimVar = c("region", "main_income"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region main_income freq primary suppressed
#> 1   Total       Total  706   FALSE      FALSE
#> 2   Total  assistance  342   FALSE      FALSE
#> 3   Total       other   88   FALSE      FALSE
#> 4   Total    pensions  222   FALSE      FALSE
#> 5   Total       wages   54   FALSE      FALSE
#> 6       A       Total  113   FALSE      FALSE
#> 7       A  assistance   55   FALSE      FALSE
#> 8       A       other   11   FALSE      FALSE
#> 9       A    pensions   36   FALSE      FALSE
#> 10      A       wages   11   FALSE      FALSE
#> 11      B       Total   55   FALSE      FALSE
#> 12      B  assistance   29   FALSE      FALSE
#> 13      B       other    7   FALSE      FALSE
#> 14      B    pensions   18   FALSE      FALSE
#> 15      B       wages    1   FALSE      FALSE
#> 16      C       Total   73   FALSE      FALSE
#> 17      C  assistance   35   FALSE      FALSE
#> 18      C       other    5   FALSE      FALSE
#> 19      C    pensions   25   FALSE      FALSE
#> 20      C       wages    8   FALSE      FALSE
#> 21      D       Total   45   FALSE      FALSE
#> 22      D  assistance   17   FALSE      FALSE
#> 23      D       other   13   FALSE      FALSE
#> 24      D    pensions   13   FALSE      FALSE
#> 25      D       wages    2   FALSE      FALSE
#> 26      E       Total  138   FALSE      FALSE
#> 27      E  assistance   63   FALSE      FALSE
#> 28      E       other    9   FALSE      FALSE
#> 29      E    pensions   52   FALSE      FALSE
#> 30      E       wages   14   FALSE      FALSE
#> 31      F       Total   67   FALSE      FALSE
#> 32      F  assistance   24   FALSE      FALSE
#> 33      F       other   12   FALSE      FALSE
#> 34      F    pensions   22   FALSE      FALSE
#> 35      F       wages    9   FALSE      FALSE
#> 36      G       Total   40   FALSE      FALSE
#> 37      G  assistance   22   FALSE      FALSE
#> 38      G       other    6   FALSE      FALSE
#> 39      G    pensions    8   FALSE      FALSE
#> 40      G       wages    4   FALSE      FALSE
#> 41      H       Total   65   FALSE      FALSE
#> 42      H  assistance   38   FALSE      FALSE
#> 43      H       other    9   FALSE      FALSE
#> 44      H    pensions   15   FALSE      FALSE
#> 45      H       wages    3   FALSE      FALSE
#> 46      I       Total   14   FALSE      FALSE
#> 47      I  assistance    9   FALSE      FALSE
#> 48      I       other    3   FALSE      FALSE
#> 49      I    pensions    2   FALSE      FALSE
#> 50      I       wages    0   FALSE      FALSE
#> 51      J       Total   61   FALSE      FALSE
#> 52      J  assistance   32   FALSE      FALSE
#> 53      J       other    9   FALSE      FALSE
#> 54      J    pensions   20   FALSE      FALSE
#> 55      J       wages    0   FALSE      FALSE
#> 56      K       Total   35   FALSE      FALSE
#> 57      K  assistance   18   FALSE      FALSE
#> 58      K       other    4   FALSE      FALSE
#> 59      K    pensions   11   FALSE      FALSE
#> 60      K       wages    2   FALSE      FALSE

Note in particular what happens when we provide two regional variables:

GaussSuppressionFromData(data = dataset,
                         dimVar = c("region", "county"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region freq primary suppressed
#> 1   Total  706   FALSE      FALSE
#> 2       1  127   FALSE      FALSE
#> 3      10   96   FALSE      FALSE
#> 4       4   55   FALSE      FALSE
#> 5       5  118   FALSE      FALSE
#> 6       6  205   FALSE      FALSE
#> 7       8  105   FALSE      FALSE
#> 8       A  113   FALSE      FALSE
#> 9       B   55   FALSE      FALSE
#> 10      C   73   FALSE      FALSE
#> 11      D   45   FALSE      FALSE
#> 12      E  138   FALSE      FALSE
#> 13      F   67   FALSE      FALSE
#> 14      G   40   FALSE      FALSE
#> 15      H   65   FALSE      FALSE
#> 16      I   14   FALSE      FALSE
#> 17      J   61   FALSE      FALSE
#> 18      K   35   FALSE      FALSE

The function detects hierarchies encoded in dimVar columns, and collapses them into a single column (with the name of the most detailed variable). In this way, it is not necessary to specify hierarchies by hand and include them explicitly in the function call. This also works for non-nested hierarchies:

GaussSuppressionFromData(data = dataset,
                         dimVar = c("region", "county", "k_group"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region freq primary suppressed
#> 1       1  127   FALSE      FALSE
#> 2      10   96   FALSE      FALSE
#> 3     300  596   FALSE      FALSE
#> 4       4   55   FALSE      FALSE
#> 5     400  110   FALSE      FALSE
#> 6       5  118   FALSE      FALSE
#> 7       6  205   FALSE      FALSE
#> 8       8  105   FALSE      FALSE
#> 9   Total  706   FALSE      FALSE
#> 10      A  113   FALSE      FALSE
#> 11      B   55   FALSE      FALSE
#> 12      C   73   FALSE      FALSE
#> 13      D   45   FALSE      FALSE
#> 14      E  138   FALSE      FALSE
#> 15      F   67   FALSE      FALSE
#> 16      G   40   FALSE      FALSE
#> 17      H   65   FALSE      FALSE
#> 18      I   14   FALSE      FALSE
#> 19      J   61   FALSE      FALSE
#> 20      K   35   FALSE      FALSE
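
To see why the columns collapse, it can help to check the nesting directly. The following base R sketch tests whether each code of one variable occurs with exactly one code of another; is_nested_in is a hypothetical helper written for this illustration and is not part of SSBtools or GaussSuppression.

```r
library(SSBtools)

dataset <- SSBtoolsData("d2")

# `fine` is nested in `coarse` if every code of `fine`
# occurs together with exactly one code of `coarse`
is_nested_in <- function(fine, coarse) {
  all(rowSums(table(fine, coarse) > 0) == 1)
}

region_in_county  <- is_nested_in(dataset$region, dataset$county)
region_in_k_group <- is_nested_in(dataset$region, dataset$k_group)
county_in_k_group <- is_nested_in(dataset$county, dataset$k_group)

c(region_in_county, region_in_k_group, county_in_k_group)
```

In this data set, region is nested in both county and k_group, while county is not nested in k_group (county 1 contains region A from k_group 300 and region I from k_group 400). This is what makes the two groupings non-nested.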

In the background, functions from SSBtools are used to find the hierarchies. There are multiple ways of inspecting which hierarchies can be found; users familiar with DimLists used in other SDC packages can for example use the following:

FindDimLists(dataset[c("region", "county")])
#> $region
#>    levels codes
#> 1       @ Total
#> 2      @@     1
#> 3     @@@     A
#> 4     @@@     I
#> 5      @@     4
#> 6     @@@     B
#> 7      @@     5
#> 8     @@@     C
#> 9     @@@     D
#> 10     @@     6
#> 11    @@@     E
#> 12    @@@     F
#> 13     @@     8
#> 14    @@@     G
#> 15    @@@     H
#> 16     @@    10
#> 17    @@@     J
#> 18    @@@     K
FindDimLists(dataset[c("region", "county", "k_group")])
#> $region
#>    levels codes
#> 1       @ Total
#> 2      @@     1
#> 3     @@@     A
#> 4     @@@     I
#> 5      @@     4
#> 6     @@@     B
#> 7      @@     5
#> 8     @@@     C
#> 9     @@@     D
#> 10     @@     6
#> 11    @@@     E
#> 12    @@@     F
#> 13     @@     8
#> 14    @@@     G
#> 15    @@@     H
#> 16     @@    10
#> 17    @@@     J
#> 18    @@@     K
#> 
#> $region
#>    levels codes
#> 1       @ Total
#> 2      @@   300
#> 3     @@@     A
#> 4     @@@     B
#> 5     @@@     C
#> 6     @@@     D
#> 7     @@@     E
#> 8     @@@     F
#> 9     @@@     G
#> 10    @@@     H
#> 11     @@   400
#> 12    @@@     I
#> 13    @@@     J
#> 14    @@@     K

Note the last example, which contains non-nested hierarchies. Here, a separate DimList is created for each tree-shaped hierarchy in the data set. This avoids the need to specify non-nested hierarchies as linked tables.
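
The returned object is an ordinary list, so it can be inspected with base R. A small sketch (the name dim_lists is chosen here for illustration):

```r
library(SSBtools)

dataset <- SSBtoolsData("d2")

dim_lists <- FindDimLists(dataset[c("region", "county", "k_group")])

# One DimList per tree-shaped hierarchy
length(dim_lists)

# Both elements are named after the most detailed variable
names(dim_lists)

# Second-level codes of each tree: the county codes and the k_group codes
lapply(dim_lists, function(d) d$codes[d$levels == "@@"])
```

The duplicated name region is the same convention that the hierarchies parameter relies on for non-nested hierarchies.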

Finally, for illustration purposes, we see that the same function calls work with microdata as input:

GaussSuppressionFromData(data = microdata,
                         dimVar = c("region", "county", "k_group"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region freq primary suppressed
#> 1       1  127   FALSE      FALSE
#> 2      10   96   FALSE      FALSE
#> 3     300  596   FALSE      FALSE
#> 4       4   55   FALSE      FALSE
#> 5     400  110   FALSE      FALSE
#> 6       5  118   FALSE      FALSE
#> 7       6  205   FALSE      FALSE
#> 8       8  105   FALSE      FALSE
#> 9   Total  706   FALSE      FALSE
#> 10      A  113   FALSE      FALSE
#> 11      B   55   FALSE      FALSE
#> 12      C   73   FALSE      FALSE
#> 13      D   45   FALSE      FALSE
#> 14      E  138   FALSE      FALSE
#> 15      F   67   FALSE      FALSE
#> 16      G   40   FALSE      FALSE
#> 17      H   65   FALSE      FALSE
#> 18      I   14   FALSE      FALSE
#> 19      J   61   FALSE      FALSE
#> 20      K   35   FALSE      FALSE

Creating tables using hierarchies

The hierarchies parameter allows explicit specification of the hierarchies to be used when creating the output table. This is more fine-grained than using dimVar, since it allows applying hierarchies that are not already present in the data set. Hierarchies can be provided in many ways. In this vignette, we illustrate three forms: as a dimlist (as defined in sdcTable), in the hrc format from TauArgus, and as a more general hierarchy specification (internally, not surprisingly, simply called hierarchy). Any of these can be provided to the hierarchies parameter, as they are all translated to the internal hierarchy representation. For the purposes of this vignette we will use dimlists, but the following example also shows how they can be translated to the other forms using functions from SSBtools. Let us begin by defining two hierarchies as dimlists:

region_dim <- data.frame(levels = c("@", "@@", rep("@@@", 3), rep("@@", 8)),
                         codes = c("Total", "ABC", LETTERS[1:11]))
region_dim
#>    levels codes
#> 1       @ Total
#> 2      @@   ABC
#> 3     @@@     A
#> 4     @@@     B
#> 5     @@@     C
#> 6      @@     D
#> 7      @@     E
#> 8      @@     F
#> 9      @@     G
#> 10     @@     H
#> 11     @@     I
#> 12     @@     J
#> 13     @@     K

income_dim <- data.frame(levels = c("@", "@@", "@@", "@@@", "@@@", "@@@"),
                         codes = c("Total", "wages", "not_wages", "other", "assistance", "pensions"))
income_dim
#>   levels      codes
#> 1      @      Total
#> 2     @@      wages
#> 3     @@  not_wages
#> 4    @@@      other
#> 5    @@@ assistance
#> 6    @@@   pensions
SSBtools::DimList2Hrc(income_dim)
#> [1] "wages"       "not_wages"   "@other"      "@assistance" "@pensions"
SSBtools::DimList2Hierarchy(income_dim)
#>     mapsFrom    mapsTo sign level
#> 1      wages     Total    1     2
#> 2  not_wages     Total    1     2
#> 3      other not_wages    1     1
#> 4 assistance not_wages    1     1
#> 5   pensions not_wages    1     1
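
To make the relation between the dimlist and the hierarchy form concrete, the following base R sketch derives the parent of each code from the number of @ characters. It is only an illustration of the idea, not how SSBtools implements the conversion; parent_of and parents are hypothetical names.

```r
income_dim <- data.frame(
  levels = c("@", "@@", "@@", "@@@", "@@@", "@@@"),
  codes  = c("Total", "wages", "not_wages", "other", "assistance", "pensions"),
  stringsAsFactors = FALSE
)

# Depth of each code in the tree: "@" is 1, "@@" is 2, and so on
depth <- nchar(income_dim$levels)

# The parent of a code is the closest preceding code that is one level higher
parent_of <- function(i) {
  candidates <- which(depth[seq_len(i - 1)] == depth[i] - 1)
  income_dim$codes[max(candidates)]
}

child_rows <- which(depth > 1)
parents <- data.frame(
  mapsFrom = income_dim$codes[child_rows],
  mapsTo   = vapply(child_rows, parent_of, character(1)),
  stringsAsFactors = FALSE
)
parents
```

The resulting pairs agree with the mapsFrom and mapsTo columns printed by DimList2Hierarchy above.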

We can use these hierarchies to specify our output table. We do this by supplying a named list to the hierarchies parameter, where the list names correspond to variables in the data, and the list elements correspond to hierarchies we wish to include.

GaussSuppressionFromData(data = dataset,
                         hierarchies = list(region = region_dim, main_income = income_dim),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region main_income freq primary suppressed
#> 1   Total       Total  706   FALSE      FALSE
#> 2   Total   not_wages  652   FALSE      FALSE
#> 3   Total  assistance  342   FALSE      FALSE
#> 4   Total       other   88   FALSE      FALSE
#> 5   Total    pensions  222   FALSE      FALSE
#> 6   Total       wages   54   FALSE      FALSE
#> 7     ABC       Total  241   FALSE      FALSE
#> 8     ABC   not_wages  221   FALSE      FALSE
#> 9     ABC  assistance  119   FALSE      FALSE
#> 10    ABC       other   23   FALSE      FALSE
#> 11    ABC    pensions   79   FALSE      FALSE
#> 12    ABC       wages   20   FALSE      FALSE
#> 13      A       Total  113   FALSE      FALSE
#> 14      A   not_wages  102   FALSE      FALSE
#> 15      A  assistance   55   FALSE      FALSE
#> 16      A       other   11   FALSE      FALSE
#> 17      A    pensions   36   FALSE      FALSE
#> 18      A       wages   11   FALSE      FALSE
#> 19      B       Total   55   FALSE      FALSE
#> 20      B   not_wages   54   FALSE      FALSE
#> 21      B  assistance   29   FALSE      FALSE
#> 22      B       other    7   FALSE      FALSE
#> 23      B    pensions   18   FALSE      FALSE
#> 24      B       wages    1   FALSE      FALSE
#> 25      C       Total   73   FALSE      FALSE
#> 26      C   not_wages   65   FALSE      FALSE
#> 27      C  assistance   35   FALSE      FALSE
#> 28      C       other    5   FALSE      FALSE
#> 29      C    pensions   25   FALSE      FALSE
#> 30      C       wages    8   FALSE      FALSE
#> 31      D       Total   45   FALSE      FALSE
#> 32      D   not_wages   43   FALSE      FALSE
#> 33      D  assistance   17   FALSE      FALSE
#> 34      D       other   13   FALSE      FALSE
#> 35      D    pensions   13   FALSE      FALSE
#> 36      D       wages    2   FALSE      FALSE
#> 37      E       Total  138   FALSE      FALSE
#> 38      E   not_wages  124   FALSE      FALSE
#> 39      E  assistance   63   FALSE      FALSE
#> 40      E       other    9   FALSE      FALSE
#> 41      E    pensions   52   FALSE      FALSE
#> 42      E       wages   14   FALSE      FALSE
#> 43      F       Total   67   FALSE      FALSE
#> 44      F   not_wages   58   FALSE      FALSE
#> 45      F  assistance   24   FALSE      FALSE
#> 46      F       other   12   FALSE      FALSE
#> 47      F    pensions   22   FALSE      FALSE
#> 48      F       wages    9   FALSE      FALSE
#> 49      G       Total   40   FALSE      FALSE
#> 50      G   not_wages   36   FALSE      FALSE
#> 51      G  assistance   22   FALSE      FALSE
#> 52      G       other    6   FALSE      FALSE
#> 53      G    pensions    8   FALSE      FALSE
#> 54      G       wages    4   FALSE      FALSE
#> 55      H       Total   65   FALSE      FALSE
#> 56      H   not_wages   62   FALSE      FALSE
#> 57      H  assistance   38   FALSE      FALSE
#> 58      H       other    9   FALSE      FALSE
#> 59      H    pensions   15   FALSE      FALSE
#> 60      H       wages    3   FALSE      FALSE
#> 61      I       Total   14   FALSE      FALSE
#> 62      I   not_wages   14   FALSE      FALSE
#> 63      I  assistance    9   FALSE      FALSE
#> 64      I       other    3   FALSE      FALSE
#> 65      I    pensions    2   FALSE      FALSE
#> 66      I       wages    0   FALSE      FALSE
#> 67      J       Total   61   FALSE      FALSE
#> 68      J   not_wages   61   FALSE      FALSE
#> 69      J  assistance   32   FALSE      FALSE
#> 70      J       other    9   FALSE      FALSE
#> 71      J    pensions   20   FALSE      FALSE
#> 72      J       wages    0   FALSE      FALSE
#> 73      K       Total   35   FALSE      FALSE
#> 74      K   not_wages   33   FALSE      FALSE
#> 75      K  assistance   18   FALSE      FALSE
#> 76      K       other    4   FALSE      FALSE
#> 77      K    pensions   11   FALSE      FALSE
#> 78      K       wages    2   FALSE      FALSE

As mentioned previously, the GaussSuppression package supports non-nested hierarchies natively. We achieve this by having multiple elements with the same name in the hierarchies list:

region2_dim <- data.frame(levels = c("@", rep(c("@@", rep("@@@" ,3)),2), rep("@@", 5)),
                          codes = c("Total", "ACE", "A", "C", "E", 
                                    "BDF", "B", "D", "F", 
                                    "G", "H", "I", "J", "K"))
region2_dim
#>    levels codes
#> 1       @ Total
#> 2      @@   ACE
#> 3     @@@     A
#> 4     @@@     C
#> 5     @@@     E
#> 6      @@   BDF
#> 7     @@@     B
#> 8     @@@     D
#> 9     @@@     F
#> 10     @@     G
#> 11     @@     H
#> 12     @@     I
#> 13     @@     J
#> 14     @@     K

GaussSuppressionFromData(data = dataset,
                         hierarchies = list(region = region_dim, region = region2_dim),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region freq primary suppressed
#> 1     ABC  241   FALSE      FALSE
#> 2     ACE  324   FALSE      FALSE
#> 3     BDF  167   FALSE      FALSE
#> 4   Total  706   FALSE      FALSE
#> 5       A  113   FALSE      FALSE
#> 6       B   55   FALSE      FALSE
#> 7       C   73   FALSE      FALSE
#> 8       D   45   FALSE      FALSE
#> 9       E  138   FALSE      FALSE
#> 10      F   67   FALSE      FALSE
#> 11      G   40   FALSE      FALSE
#> 12      H   65   FALSE      FALSE
#> 13      I   14   FALSE      FALSE
#> 14      J   61   FALSE      FALSE
#> 15      K   35   FALSE      FALSE
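
The aggregated codes from the two hierarchies can be checked by hand. In the following base R sketch, the groups ABC, ACE, and BDF are summed from the region totals; by_region is a name chosen here for illustration.

```r
library(SSBtools)

dataset <- SSBtoolsData("d2")

# Region totals computed with base R
by_region <- tapply(dataset$freq, dataset$region, sum)

# Groups from region_dim and region2_dim overlap:
# A and C belong to both ABC and ACE, so the two trees are non-nested
abc <- sum(by_region[c("A", "B", "C")])  # 241
ace <- sum(by_region[c("A", "C", "E")])  # 324
bdf <- sum(by_region[c("B", "D", "F")])  # 167

c(ABC = abc, ACE = ace, BDF = bdf)
```

These sums match the ABC, ACE, and BDF rows in the output above.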

Finally, as before, all of this functionality works with microdata as input as well.

GaussSuppressionFromData(data = microdata,
                         hierarchies = list(region = region_dim, region = region2_dim),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region freq primary suppressed
#> 1     ABC  241   FALSE      FALSE
#> 2     ACE  324   FALSE      FALSE
#> 3     BDF  167   FALSE      FALSE
#> 4   Total  706   FALSE      FALSE
#> 5       A  113   FALSE      FALSE
#> 6       B   55   FALSE      FALSE
#> 7       C   73   FALSE      FALSE
#> 8       D   45   FALSE      FALSE
#> 9       E  138   FALSE      FALSE
#> 10      F   67   FALSE      FALSE
#> 11      G   40   FALSE      FALSE
#> 12      H   65   FALSE      FALSE
#> 13      I   14   FALSE      FALSE
#> 14      J   61   FALSE      FALSE
#> 15      K   35   FALSE      FALSE

Creating tables using formula

The most flexible method for specifying the output of GaussSuppression is the formula interface. It makes use of model formulas in R and provides a powerful way of specifying multiple different tables. Indeed, all of the above examples, and much more, can be replicated using the formula interface. The formula’s predictor variables must be variable names occurring in the data set (the dependent variable is ignored, and thus we leave it empty). In the following, we create a table based on the region and county variables. As before, the hierarchical relationship between these variables is detected automatically:

GaussSuppressionFromData(data = microdata,
                         formula = ~ region + county,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region freq primary suppressed
#> 1   Total  706   FALSE      FALSE
#> 2       A  113   FALSE      FALSE
#> 3       B   55   FALSE      FALSE
#> 4       C   73   FALSE      FALSE
#> 5       D   45   FALSE      FALSE
#> 6       E  138   FALSE      FALSE
#> 7       F   67   FALSE      FALSE
#> 8       G   40   FALSE      FALSE
#> 9       H   65   FALSE      FALSE
#> 10      I   14   FALSE      FALSE
#> 11      J   61   FALSE      FALSE
#> 12      K   35   FALSE      FALSE
#> 13      1  127   FALSE      FALSE
#> 14      4   55   FALSE      FALSE
#> 15      5  118   FALSE      FALSE
#> 16      6  205   FALSE      FALSE
#> 17      8  105   FALSE      FALSE
#> 18     10   96   FALSE      FALSE

If there is no hierarchical relationship between variables, multiplication in the formula and specification in dimVar yield the same cells, although the rows may be ordered differently:


GaussSuppressionFromData(data = microdata,
                         formula = ~ county * main_income,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    county main_income freq primary suppressed
#> 1   Total       Total  706   FALSE      FALSE
#> 2       1       Total  127   FALSE      FALSE
#> 3       4       Total   55   FALSE      FALSE
#> 4       5       Total  118   FALSE      FALSE
#> 5       6       Total  205   FALSE      FALSE
#> 6       8       Total  105   FALSE      FALSE
#> 7      10       Total   96   FALSE      FALSE
#> 8   Total  assistance  342   FALSE      FALSE
#> 9   Total       other   88   FALSE      FALSE
#> 10  Total    pensions  222   FALSE      FALSE
#> 11  Total       wages   54   FALSE      FALSE
#> 12      1  assistance   64   FALSE      FALSE
#> 13      1       other   14   FALSE      FALSE
#> 14      1    pensions   38   FALSE      FALSE
#> 15      1       wages   11   FALSE      FALSE
#> 16      4  assistance   29   FALSE      FALSE
#> 17      4       other    7   FALSE      FALSE
#> 18      4    pensions   18   FALSE      FALSE
#> 19      4       wages    1   FALSE      FALSE
#> 20      5  assistance   52   FALSE      FALSE
#> 21      5       other   18   FALSE      FALSE
#> 22      5    pensions   38   FALSE      FALSE
#> 23      5       wages   10   FALSE      FALSE
#> 24      6  assistance   87   FALSE      FALSE
#> 25      6       other   21   FALSE      FALSE
#> 26      6    pensions   74   FALSE      FALSE
#> 27      6       wages   23   FALSE      FALSE
#> 28      8  assistance   60   FALSE      FALSE
#> 29      8       other   15   FALSE      FALSE
#> 30      8    pensions   23   FALSE      FALSE
#> 31      8       wages    7   FALSE      FALSE
#> 32     10  assistance   50   FALSE      FALSE
#> 33     10       other   13   FALSE      FALSE
#> 34     10    pensions   31   FALSE      FALSE
#> 35     10       wages    2   FALSE      FALSE
GaussSuppressionFromData(data = microdata,
                         dimVar = c("county" , "main_income"),
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    county main_income freq primary suppressed
#> 1   Total       Total  706   FALSE      FALSE
#> 2   Total  assistance  342   FALSE      FALSE
#> 3   Total       other   88   FALSE      FALSE
#> 4   Total    pensions  222   FALSE      FALSE
#> 5   Total       wages   54   FALSE      FALSE
#> 6       1       Total  127   FALSE      FALSE
#> 7       1  assistance   64   FALSE      FALSE
#> 8       1       other   14   FALSE      FALSE
#> 9       1    pensions   38   FALSE      FALSE
#> 10      1       wages   11   FALSE      FALSE
#> 11     10       Total   96   FALSE      FALSE
#> 12     10  assistance   50   FALSE      FALSE
#> 13     10       other   13   FALSE      FALSE
#> 14     10    pensions   31   FALSE      FALSE
#> 15     10       wages    2   FALSE      FALSE
#> 16      4       Total   55   FALSE      FALSE
#> 17      4  assistance   29   FALSE      FALSE
#> 18      4       other    7   FALSE      FALSE
#> 19      4    pensions   18   FALSE      FALSE
#> 20      4       wages    1   FALSE      FALSE
#> 21      5       Total  118   FALSE      FALSE
#> 22      5  assistance   52   FALSE      FALSE
#> 23      5       other   18   FALSE      FALSE
#> 24      5    pensions   38   FALSE      FALSE
#> 25      5       wages   10   FALSE      FALSE
#> 26      6       Total  205   FALSE      FALSE
#> 27      6  assistance   87   FALSE      FALSE
#> 28      6       other   21   FALSE      FALSE
#> 29      6    pensions   74   FALSE      FALSE
#> 30      6       wages   23   FALSE      FALSE
#> 31      8       Total  105   FALSE      FALSE
#> 32      8  assistance   60   FALSE      FALSE
#> 33      8       other   15   FALSE      FALSE
#> 34      8    pensions   23   FALSE      FALSE
#> 35      8       wages    7   FALSE      FALSE
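
Since the two outputs are ordered differently, a direct comparison needs to match rows by cell. The following sketch does this with a combined key; key, by_formula, and by_dimvar are names introduced here for illustration.

```r
library(SSBtools)
library(GaussSuppression)

dataset <- SSBtoolsData("d2")
microdata <- MakeMicro(dataset, "freq")

by_formula <- GaussSuppressionFromData(data = microdata,
                                       formula = ~ county * main_income,
                                       freqVar = "freq",
                                       primary = FALSE,
                                       protectZeros = FALSE)
by_dimvar <- GaussSuppressionFromData(data = microdata,
                                      dimVar = c("county", "main_income"),
                                      freqVar = "freq",
                                      primary = FALSE,
                                      protectZeros = FALSE)

# Identify each cell by its county and main_income codes
key <- function(x) paste(x$county, x$main_income, sep = " | ")

# Same set of cells in both outputs
same_cells <- setequal(key(by_formula), key(by_dimvar))

# Same counts once the rows are matched by cell
idx <- match(key(by_formula), key(by_dimvar))
same_freq <- all(by_formula$freq == by_dimvar$freq[idx])

c(same_cells, same_freq)
```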

However, formula lets us specify different shapes for our tables. For example, if we are only interested in marginal values, we can request them with the addition operator:


GaussSuppressionFromData(data = microdata,
                         formula = ~ county + main_income,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    county main_income freq primary suppressed
#> 1   Total       Total  706   FALSE      FALSE
#> 2       1       Total  127   FALSE      FALSE
#> 3       4       Total   55   FALSE      FALSE
#> 4       5       Total  118   FALSE      FALSE
#> 5       6       Total  205   FALSE      FALSE
#> 6       8       Total  105   FALSE      FALSE
#> 7      10       Total   96   FALSE      FALSE
#> 8   Total  assistance  342   FALSE      FALSE
#> 9   Total       other   88   FALSE      FALSE
#> 10  Total    pensions  222   FALSE      FALSE
#> 11  Total       wages   54   FALSE      FALSE

This example in fact demonstrates the ability to specify multiple linked tables: a one-dimensional table for county linked with a one-dimensional table for main_income. Similarly, we can use the colon (“:”) operator to omit row and column marginals:

GaussSuppressionFromData(data = microdata,
                         formula = ~ county : main_income,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    county main_income freq primary suppressed
#> 1   Total       Total  706   FALSE      FALSE
#> 2       1  assistance   64   FALSE      FALSE
#> 3       1       other   14   FALSE      FALSE
#> 4       1    pensions   38   FALSE      FALSE
#> 5       1       wages   11   FALSE      FALSE
#> 6       4  assistance   29   FALSE      FALSE
#> 7       4       other    7   FALSE      FALSE
#> 8       4    pensions   18   FALSE      FALSE
#> 9       4       wages    1   FALSE      FALSE
#> 10      5  assistance   52   FALSE      FALSE
#> 11      5       other   18   FALSE      FALSE
#> 12      5    pensions   38   FALSE      FALSE
#> 13      5       wages   10   FALSE      FALSE
#> 14      6  assistance   87   FALSE      FALSE
#> 15      6       other   21   FALSE      FALSE
#> 16      6    pensions   74   FALSE      FALSE
#> 17      6       wages   23   FALSE      FALSE
#> 18      8  assistance   60   FALSE      FALSE
#> 19      8       other   15   FALSE      FALSE
#> 20      8    pensions   23   FALSE      FALSE
#> 21      8       wages    7   FALSE      FALSE
#> 22     10  assistance   50   FALSE      FALSE
#> 23     10       other   13   FALSE      FALSE
#> 24     10    pensions   31   FALSE      FALSE
#> 25     10       wages    2   FALSE      FALSE
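
The three operators are related in the usual way for R model formulas: a * b is shorthand for a + b + a:b. Base R can show this expansion without any data. The sketch below only inspects the formulas and does not describe how GaussSuppression processes them internally.

```r
# Term labels after R has expanded the operators
star_terms  <- attr(terms(~ county * main_income), "term.labels")
plus_terms  <- attr(terms(~ county + main_income), "term.labels")
colon_terms <- attr(terms(~ county : main_income), "term.labels")

star_terms   # "county" "main_income" "county:main_income"
plus_terms   # "county" "main_income"
colon_terms  # "county:main_income"

# The intercept is part of a formula unless it is removed with - 1
with_intercept    <- attr(terms(~ county : main_income), "intercept")
without_intercept <- attr(terms(~ county : main_income - 1), "intercept")
c(with_intercept, without_intercept)  # 1 0
```

This matches the tables above: the main-effect terms correspond to the marginals, the interaction term to the inner cells, and the intercept to the overall total.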

Using subtraction, we can omit marginals and other cells from the output. For example, the intercept (sum over all records) can be omitted by including - 1 in the formula, like this: formula = ~ county : main_income - 1.
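
One way to compare these table shapes is to count the cells each formula produces. The sketch below wraps the call used throughout this section in a small helper; count_cells is a hypothetical name introduced here.

```r
library(SSBtools)
library(GaussSuppression)

dataset <- SSBtoolsData("d2")
microdata <- MakeMicro(dataset, "freq")

# Number of output cells generated by a given formula
count_cells <- function(f) {
  out <- GaussSuppressionFromData(data = microdata,
                                  formula = f,
                                  freqVar = "freq",
                                  primary = FALSE,
                                  protectZeros = FALSE)
  nrow(out)
}

n_star  <- count_cells(~ county * main_income)   # 35 rows, as printed above
n_plus  <- count_cells(~ county + main_income)   # 11 rows, as printed above
n_colon <- count_cells(~ county : main_income)   # 25 rows, as printed above

# Dropping the intercept removes the overall total from the output
n_colon_no_total <- count_cells(~ county : main_income - 1)

c(n_star, n_plus, n_colon, n_colon_no_total)
```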

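Using these features, we can define more complicated linked tables. To illustrate this, let us assume we wish to publish the following linked tables: region crossed with a two-category income variable (wages versus not_wages), and both county and k_group crossed with the detailed main_income variable.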

To do this, we begin by adding a column encoding whether the main source of income was “wages” or “not_wages”.

dataset$income2 <- ifelse(dataset$main_income == "wages", "wages", "not_wages")
microdata$income2 <- ifelse(microdata$main_income == "wages", "wages", "not_wages")
head(dataset)
#>   region county k_group main_income freq   income2
#> 1      A      1     300       other   11 not_wages
#> 2      B      4     300       other    7 not_wages
#> 3      C      5     300       other    5 not_wages
#> 4      D      5     300       other   13 not_wages
#> 5      E      6     300       other    9 not_wages
#> 6      F      6     300       other   12 not_wages

Then we can specify the desired output with the following formula:

GaussSuppressionFromData(data = dataset,
                         formula = ~ region * income2 + (county + k_group) * main_income,
                         freqVar = "freq",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region main_income freq primary suppressed
#> 1   Total       Total  706   FALSE      FALSE
#> 2       A       Total  113   FALSE      FALSE
#> 3       B       Total   55   FALSE      FALSE
#> 4       C       Total   73   FALSE      FALSE
#> 5       D       Total   45   FALSE      FALSE
#> 6       E       Total  138   FALSE      FALSE
#> 7       F       Total   67   FALSE      FALSE
#> 8       G       Total   40   FALSE      FALSE
#> 9       H       Total   65   FALSE      FALSE
#> 10      I       Total   14   FALSE      FALSE
#> 11      J       Total   61   FALSE      FALSE
#> 12      K       Total   35   FALSE      FALSE
#> 13  Total   not_wages  652   FALSE      FALSE
#> 14  Total       wages   54   FALSE      FALSE
#> 15      1       Total  127   FALSE      FALSE
#> 16      4       Total   55   FALSE      FALSE
#> 17      5       Total  118   FALSE      FALSE
#> 18      6       Total  205   FALSE      FALSE
#> 19      8       Total  105   FALSE      FALSE
#> 20     10       Total   96   FALSE      FALSE
#> 21    300       Total  596   FALSE      FALSE
#> 22    400       Total  110   FALSE      FALSE
#> 23  Total  assistance  342   FALSE      FALSE
#> 24  Total       other   88   FALSE      FALSE
#> 25  Total    pensions  222   FALSE      FALSE
#> 26  Total       wages   54   FALSE      FALSE
#> 27      A   not_wages  102   FALSE      FALSE
#> 28      A       wages   11   FALSE      FALSE
#> 29      B   not_wages   54   FALSE      FALSE
#> 30      B       wages    1   FALSE      FALSE
#> 31      C   not_wages   65   FALSE      FALSE
#> 32      C       wages    8   FALSE      FALSE
#> 33      D   not_wages   43   FALSE      FALSE
#> 34      D       wages    2   FALSE      FALSE
#> 35      E   not_wages  124   FALSE      FALSE
#> 36      E       wages   14   FALSE      FALSE
#> 37      F   not_wages   58   FALSE      FALSE
#> 38      F       wages    9   FALSE      FALSE
#> 39      G   not_wages   36   FALSE      FALSE
#> 40      G       wages    4   FALSE      FALSE
#> 41      H   not_wages   62   FALSE      FALSE
#> 42      H       wages    3   FALSE      FALSE
#> 43      I   not_wages   14   FALSE      FALSE
#> 44      I       wages    0   FALSE      FALSE
#> 45      J   not_wages   61   FALSE      FALSE
#> 46      J       wages    0   FALSE      FALSE
#> 47      K   not_wages   33   FALSE      FALSE
#> 48      K       wages    2   FALSE      FALSE
#> 49      1  assistance   64   FALSE      FALSE
#> 50      1       other   14   FALSE      FALSE
#> 51      1    pensions   38   FALSE      FALSE
#> 52      1       wages   11   FALSE      FALSE
#> 53      4  assistance   29   FALSE      FALSE
#> 54      4       other    7   FALSE      FALSE
#> 55      4    pensions   18   FALSE      FALSE
#> 56      4       wages    1   FALSE      FALSE
#> 57      5  assistance   52   FALSE      FALSE
#> 58      5       other   18   FALSE      FALSE
#> 59      5    pensions   38   FALSE      FALSE
#> 60      5       wages   10   FALSE      FALSE
#> 61      6  assistance   87   FALSE      FALSE
#> 62      6       other   21   FALSE      FALSE
#> 63      6    pensions   74   FALSE      FALSE
#> 64      6       wages   23   FALSE      FALSE
#> 65      8  assistance   60   FALSE      FALSE
#> 66      8       other   15   FALSE      FALSE
#> 67      8    pensions   23   FALSE      FALSE
#> 68      8       wages    7   FALSE      FALSE
#> 69     10  assistance   50   FALSE      FALSE
#> 70     10       other   13   FALSE      FALSE
#> 71     10    pensions   31   FALSE      FALSE
#> 72     10       wages    2   FALSE      FALSE
#> 73    300  assistance  283   FALSE      FALSE
#> 74    300       other   72   FALSE      FALSE
#> 75    300    pensions  189   FALSE      FALSE
#> 76    300       wages   52   FALSE      FALSE
#> 77    400  assistance   59   FALSE      FALSE
#> 78    400       other   16   FALSE      FALSE
#> 79    400    pensions   33   FALSE      FALSE
#> 80    400       wages    2   FALSE      FALSE

In this manner, we can specify multiple linked tables, each of which can use different non-nested hierarchies. This allows the suppression algorithm to protect all of these tables simultaneously (indeed, they are treated as a single table internally), avoiding the need for a stratified protection paradigm. Furthermore, the fine-grained specification of which cells are to be published allows the secondary suppression algorithm to protect with respect to precisely those cells that will be published. If row and column marginals are not published, for example, the algorithm does not need to apply secondary suppression with respect to these marginals.
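To see which sub-tables the formula above asks for, it can help to list the terms it expands to. The following is a minimal sketch using only base R's `terms()`; it does not call GaussSuppression or SSBtools, whose internal handling of the formula may differ, and the name `term_labels` is purely illustrative.

```r
# Base R only: expand the formula used above into its individual terms.
f <- ~ region * income2 + (county + k_group) * main_income
term_labels <- attr(terms(f), "term.labels")

# Main effects correspond to one-way marginals,
# interaction terms (with ":") to two-way cross-classifications.
print(term_labels)
```

Read against the output above, each term roughly corresponds to one block of rows: the one-way marginals for region, income2, county, k_group and main_income, followed by the crossings region:income2, county:main_income and k_group:main_income. The overall total comes in addition to these.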

Tabulating continuous variables

In addition to defining the dimensions of the output tables, we need to decide whether they should be frequency tables (where we count contributing records) or magnitude tables (where we add contributing records’ numerical values for a given variable). All of the above examples have been frequency tables. However, the process is exactly the same if one wishes to construct magnitude tables; the only difference is that one must specify the numerical variable with the help of the parameter numVar.
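The distinction between counting records and summing their values can be illustrated with a small base R sketch. The data frame `toy` and its columns `n` and `value` are hypothetical and unrelated to the test data used in this vignette.

```r
# Hypothetical toy microdata: one row per contributing unit.
toy <- data.frame(
  region = c("A", "A", "B", "B", "B"),
  value  = c(10, 20, 5, 0, 15)
)
toy$n <- 1L  # each record counts once

# Frequency table: number of contributing records per region.
freq_tab <- aggregate(n ~ region, data = toy, FUN = sum)

# Magnitude table: sum of the numerical variable per region.
mag_tab <- aggregate(value ~ region, data = toy, FUN = sum)

print(freq_tab)
print(mag_tab)
```

Both tables share the same cells (here, the regions); only the quantity aggregated within each cell differs.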

Since most magnitude table suppression methods are based on comparing units’ contributions, the input data will most likely be supplied as microdata. Therefore, let us add a fake numerical variable to our microdata:

set.seed(12345)
microdata$num <- sample(0:1000, nrow(microdata), replace = TRUE)

Then, in order to construct a magnitude table where records’ contributions to num are aggregated, we supply the variable name via the numVar parameter of GaussSuppressionFromData:

GaussSuppressionFromData(data = microdata,
                         formula = ~ region * income2 + (county + k_group) * main_income,
                         numVar = "num",
                         primary = FALSE,
                         protectZeros = FALSE)
#> [preAggregate 706*7->42*7]
#>    region main_income freq.1    num primary suppressed
#> 1   Total       Total    706 358773   FALSE      FALSE
#> 2       A       Total    113  56793   FALSE      FALSE
#> 3       B       Total     55  31867   FALSE      FALSE
#> 4       C       Total     73  33500   FALSE      FALSE
#> 5       D       Total     45  22829   FALSE      FALSE
#> 6       E       Total    138  66412   FALSE      FALSE
#> 7       F       Total     67  38823   FALSE      FALSE
#> 8       G       Total     40  18817   FALSE      FALSE
#> 9       H       Total     65  33314   FALSE      FALSE
#> 10      I       Total     14   6870   FALSE      FALSE
#> 11      J       Total     61  31353   FALSE      FALSE
#> 12      K       Total     35  18195   FALSE      FALSE
#> 13  Total   not_wages    652 330316   FALSE      FALSE
#> 14  Total       wages     54  28457   FALSE      FALSE
#> 15      1       Total    127  63663   FALSE      FALSE
#> 16      4       Total     55  31867   FALSE      FALSE
#> 17      5       Total    118  56329   FALSE      FALSE
#> 18      6       Total    205 105235   FALSE      FALSE
#> 19      8       Total    105  52131   FALSE      FALSE
#> 20     10       Total     96  49548   FALSE      FALSE
#> 21    300       Total    596 302355   FALSE      FALSE
#> 22    400       Total    110  56418   FALSE      FALSE
#> 23  Total  assistance    342 171392   FALSE      FALSE
#> 24  Total       other     88  45958   FALSE      FALSE
#> 25  Total    pensions    222 112966   FALSE      FALSE
#> 26  Total       wages     54  28457   FALSE      FALSE
#> 27      A   not_wages    102  51447   FALSE      FALSE
#> 28      A       wages     11   5346   FALSE      FALSE
#> 29      B   not_wages     54  31001   FALSE      FALSE
#> 30      B       wages      1    866   FALSE      FALSE
#> 31      C   not_wages     65  29678   FALSE      FALSE
#> 32      C       wages      8   3822   FALSE      FALSE
#> 33      D   not_wages     43  22041   FALSE      FALSE
#> 34      D       wages      2    788   FALSE      FALSE
#> 35      E   not_wages    124  57540   FALSE      FALSE
#> 36      E       wages     14   8872   FALSE      FALSE
#> 37      F   not_wages     58  34933   FALSE      FALSE
#> 38      F       wages      9   3890   FALSE      FALSE
#> 39      G   not_wages     36  16348   FALSE      FALSE
#> 40      G       wages      4   2469   FALSE      FALSE
#> 41      H   not_wages     62  31651   FALSE      FALSE
#> 42      H       wages      3   1663   FALSE      FALSE
#> 43      I   not_wages     14   6870   FALSE      FALSE
#> 44      J   not_wages     61  31353   FALSE      FALSE
#> 45      K   not_wages     33  17454   FALSE      FALSE
#> 46      K       wages      2    741   FALSE      FALSE
#> 47      1  assistance     64  29577   FALSE      FALSE
#> 48      1       other     14   6583   FALSE      FALSE
#> 49      1    pensions     38  22157   FALSE      FALSE
#> 50      1       wages     11   5346   FALSE      FALSE
#> 51      4  assistance     29  16798   FALSE      FALSE
#> 52      4       other      7   3217   FALSE      FALSE
#> 53      4    pensions     18  10986   FALSE      FALSE
#> 54      4       wages      1    866   FALSE      FALSE
#> 55      5  assistance     52  24467   FALSE      FALSE
#> 56      5       other     18   9436   FALSE      FALSE
#> 57      5    pensions     38  17816   FALSE      FALSE
#> 58      5       wages     10   4610   FALSE      FALSE
#> 59      6  assistance     87  44849   FALSE      FALSE
#> 60      6       other     21  12582   FALSE      FALSE
#> 61      6    pensions     74  35042   FALSE      FALSE
#> 62      6       wages     23  12762   FALSE      FALSE
#> 63      8  assistance     60  31136   FALSE      FALSE
#> 64      8       other     15   6462   FALSE      FALSE
#> 65      8    pensions     23  10401   FALSE      FALSE
#> 66      8       wages      7   4132   FALSE      FALSE
#> 67     10  assistance     50  24565   FALSE      FALSE
#> 68     10       other     13   7678   FALSE      FALSE
#> 69     10    pensions     31  16564   FALSE      FALSE
#> 70     10       wages      2    741   FALSE      FALSE
#> 71    300  assistance    283 142670   FALSE      FALSE
#> 72    300       other     72  36799   FALSE      FALSE
#> 73    300    pensions    189  95170   FALSE      FALSE
#> 74    300       wages     52  27716   FALSE      FALSE
#> 75    400  assistance     59  28722   FALSE      FALSE
#> 76    400       other     16   9159   FALSE      FALSE
#> 77    400    pensions     33  17796   FALSE      FALSE
#> 78    400       wages      2    741   FALSE      FALSE

Note that the above call generates a new frequency variable (shown as freq.1 in the output). If a frequency variable is already present in the input data, we can provide it via freqVar in addition to numVar, and the method will use that variable instead:

GaussSuppressionFromData(data = microdata,
                         formula = ~ region * income2 + (county + k_group) * main_income,
                         freqVar = "freq",
                         numVar = "num",
                         primary = FALSE,
                         protectZeros = FALSE)
#>    region main_income freq    num primary suppressed
#> 1   Total       Total  706 358773   FALSE      FALSE
#> 2       A       Total  113  56793   FALSE      FALSE
#> 3       B       Total   55  31867   FALSE      FALSE
#> 4       C       Total   73  33500   FALSE      FALSE
#> 5       D       Total   45  22829   FALSE      FALSE
#> 6       E       Total  138  66412   FALSE      FALSE
#> 7       F       Total   67  38823   FALSE      FALSE
#> 8       G       Total   40  18817   FALSE      FALSE
#> 9       H       Total   65  33314   FALSE      FALSE
#> 10      I       Total   14   6870   FALSE      FALSE
#> 11      J       Total   61  31353   FALSE      FALSE
#> 12      K       Total   35  18195   FALSE      FALSE
#> 13  Total   not_wages  652 330316   FALSE      FALSE
#> 14  Total       wages   54  28457   FALSE      FALSE
#> 15      1       Total  127  63663   FALSE      FALSE
#> 16      4       Total   55  31867   FALSE      FALSE
#> 17      5       Total  118  56329   FALSE      FALSE
#> 18      6       Total  205 105235   FALSE      FALSE
#> 19      8       Total  105  52131   FALSE      FALSE
#> 20     10       Total   96  49548   FALSE      FALSE
#> 21    300       Total  596 302355   FALSE      FALSE
#> 22    400       Total  110  56418   FALSE      FALSE
#> 23  Total  assistance  342 171392   FALSE      FALSE
#> 24  Total       other   88  45958   FALSE      FALSE
#> 25  Total    pensions  222 112966   FALSE      FALSE
#> 26  Total       wages   54  28457   FALSE      FALSE
#> 27      A   not_wages  102  51447   FALSE      FALSE
#> 28      A       wages   11   5346   FALSE      FALSE
#> 29      B   not_wages   54  31001   FALSE      FALSE
#> 30      B       wages    1    866   FALSE      FALSE
#> 31      C   not_wages   65  29678   FALSE      FALSE
#> 32      C       wages    8   3822   FALSE      FALSE
#> 33      D   not_wages   43  22041   FALSE      FALSE
#> 34      D       wages    2    788   FALSE      FALSE
#> 35      E   not_wages  124  57540   FALSE      FALSE
#> 36      E       wages   14   8872   FALSE      FALSE
#> 37      F   not_wages   58  34933   FALSE      FALSE
#> 38      F       wages    9   3890   FALSE      FALSE
#> 39      G   not_wages   36  16348   FALSE      FALSE
#> 40      G       wages    4   2469   FALSE      FALSE
#> 41      H   not_wages   62  31651   FALSE      FALSE
#> 42      H       wages    3   1663   FALSE      FALSE
#> 43      I   not_wages   14   6870   FALSE      FALSE
#> 44      J   not_wages   61  31353   FALSE      FALSE
#> 45      K   not_wages   33  17454   FALSE      FALSE
#> 46      K       wages    2    741   FALSE      FALSE
#> 47      1  assistance   64  29577   FALSE      FALSE
#> 48      1       other   14   6583   FALSE      FALSE
#> 49      1    pensions   38  22157   FALSE      FALSE
#> 50      1       wages   11   5346   FALSE      FALSE
#> 51      4  assistance   29  16798   FALSE      FALSE
#> 52      4       other    7   3217   FALSE      FALSE
#> 53      4    pensions   18  10986   FALSE      FALSE
#> 54      4       wages    1    866   FALSE      FALSE
#> 55      5  assistance   52  24467   FALSE      FALSE
#> 56      5       other   18   9436   FALSE      FALSE
#> 57      5    pensions   38  17816   FALSE      FALSE
#> 58      5       wages   10   4610   FALSE      FALSE
#> 59      6  assistance   87  44849   FALSE      FALSE
#> 60      6       other   21  12582   FALSE      FALSE
#> 61      6    pensions   74  35042   FALSE      FALSE
#> 62      6       wages   23  12762   FALSE      FALSE
#> 63      8  assistance   60  31136   FALSE      FALSE
#> 64      8       other   15   6462   FALSE      FALSE
#> 65      8    pensions   23  10401   FALSE      FALSE
#> 66      8       wages    7   4132   FALSE      FALSE
#> 67     10  assistance   50  24565   FALSE      FALSE
#> 68     10       other   13   7678   FALSE      FALSE
#> 69     10    pensions   31  16564   FALSE      FALSE
#> 70     10       wages    2    741   FALSE      FALSE
#> 71    300  assistance  283 142670   FALSE      FALSE
#> 72    300       other   72  36799   FALSE      FALSE
#> 73    300    pensions  189  95170   FALSE      FALSE
#> 74    300       wages   52  27716   FALSE      FALSE
#> 75    400  assistance   59  28722   FALSE      FALSE
#> 76    400       other   16   9159   FALSE      FALSE
#> 77    400    pensions   33  17796   FALSE      FALSE
#> 78    400       wages    2    741   FALSE      FALSE
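The two calls above report the same counts, once as freq.1 and once as freq. One way to see why this is plausible is that every microdata row carries a frequency of 1, so counting rows and summing the frequency variable coincide. The sketch below illustrates this with base R only. The objects `agg`, `micro`, `by_rows` and `by_freq` are hypothetical, and the code is not the package's own implementation.

```r
# Hypothetical aggregated input with a frequency column.
agg <- data.frame(region = c("A", "B"), freq = c(3L, 2L))

# Expand to one row per unit, each with freq = 1
# (the same shape as `microdata` in this vignette).
micro <- agg[rep(seq_len(nrow(agg)), times = agg$freq), , drop = FALSE]
micro$freq <- 1L
rownames(micro) <- NULL

# Counting rows per cell ...
by_rows <- as.vector(table(micro$region))

# ... gives the same result as summing the per-row frequencies.
by_freq <- aggregate(freq ~ region, data = micro, FUN = sum)$freq

print(by_rows)
print(by_freq)
```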