Introduction to vtree

Exploring subsets of data using variable trees

Nick Barrowman  -  12-Dec-2022 -  vtree version 5.6.5

Welcome

vtree is a flexible tool for calculating and displaying variable trees — diagrams that show information about nested subsets of a data frame. vtree can be used to:

  1. explore a data set interactively

  2. produce customized figures for reports and publications.

Note, however, that vtree is not designed to build or display decision trees.

 

Notes

  • This presentation was created using the reveal.js framework in Quarto.

Mini tutorial

This section introduces some basic features of the vtree function.

vtree for a single variable

To display a variable tree for a single variable, say Severity, use the following command:

vtree(FakeData,"Severity")

By default, next to each layer of the tree, a variable name is shown. In the example above, “Severity” is shown below the corresponding nodes. (For a vertical tree, “Severity” would be shown to the left of the nodes.) If you specify showvarnames=FALSE, no variable names will be shown.

dplyr etc.

vtree can also be used with dplyr. For example, to rename the Severity variable as HowBad, we can pipe the data frame into the rename function in dplyr, and then pipe the result into vtree:
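A minimal sketch of the pipeline just described, assuming the dplyr package is installed alongside vtree. The rename call uses dplyr's new_name = old_name form; the intermediate object renamed is introduced here only for illustration.

```r
library(vtree)
library(dplyr)

# Rename Severity to HowBad (dplyr's rename takes new_name = old_name)
renamed <- FakeData %>%
  rename(HowBad = Severity)

# Pipe the renamed data frame into vtree
renamed %>%
  vtree("HowBad")
```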

Note that vtree also has a built-in way of renaming variables, which is an alternative to using dplyr.

Large variable trees can be difficult to display in a readable way. One approach that helps is to display the count and percentage on the same line in each node. For example, in the tree above, the label for the Moderate node is on two lines, like this:

                                                                 Moderate
                                                                 16 (40%)

Specifying sameline=TRUE results in single-line labels, like this:

                                                                 Moderate, 16 (40%)

Percentages

By default, vtree shows “valid percentages”, i.e. percentages calculated using the total number of non-missing values as denominator. In the case of Severity, there are 6 missing values, so the denominator is 46 - 6, or 40. There are 19 Mild cases, and 19/40 = 0.475 so the percentage shown is 48%. No percentage is shown in the NA node since missing values are not included in the denominator.

If you prefer the denominator to represent the complete set of observations (including any missing values), specify vp=FALSE. A percentage will be shown in each of the nodes, including any NA nodes.
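The arithmetic behind the two kinds of percentage can be checked by hand. The sketch below uses only the counts quoted in this section (46 observations, 6 missing, 19 Mild); the variable names are illustrative, and this is not vtree's internal code.

```r
n_total   <- 46   # all observations in FakeData
n_missing <- 6    # missing values of Severity
n_mild    <- 19   # Mild cases

# Valid percentage (vp=TRUE, the default): missing values are excluded
# from the denominator
valid_denominator <- n_total - n_missing
valid_pct <- round(100 * n_mild / valid_denominator)

# Percentage of all observations (vp=FALSE): missing values stay in
# the denominator
overall_pct <- round(100 * n_mild / n_total)

cat("valid:", valid_pct, "%  overall:", overall_pct, "%\n")
```

Here valid_pct comes out at 48, matching the percentage quoted for the Mild node, while the same count expressed over all 46 observations is noticeably smaller.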

If you don’t wish to see percentages, specify showpct=FALSE, and if you don’t wish to see counts, specify showcount=FALSE.

Displaying a legend and hiding node labels

To display a legend, specify showlegend=TRUE.

vtree(FakeData,
  "Severity Sex",
  showlegend=TRUE)

Displaying a legend and hiding node labels (continued)

When the legend is shown, labels in the nodes of the variable tree are redundant, since the colors of the nodes identify the values of the variables (although the labels may aid readability). If you prefer, you can hide the node labels by specifying shownodelabels=FALSE:

vtree(FakeData,
  "Severity Sex",
  showlegend=TRUE,
  shownodelabels=FALSE)

Displaying a legend and hiding node labels (continued)

Since Severity is the first variable in the tree, it is not nested within another variable. Therefore the marginal counts and percentages for Severity shown in the legend nodes are identical to those displayed in the nodes of the variable tree. In contrast, for Sex, the marginal counts and percentages shown in the legend differ from those in the Sex nodes of the variable tree, because the tree nodes for Sex are nested within levels of Severity.

Text wrapping

By default, vtree wraps text onto the next line whenever a space occurs after at least 20 characters. This can be adjusted, for example to 15 characters, by specifying splitwidth=15. To disable line splitting, specify splitwidth=Inf (Inf means infinity, i.e. “do not split”).

The vsplitwidth parameter is similarly used to control text wrapping in variable names. This is helpful with long variable names, which may be truncated unless wrapping is used. In this case text wrapping occurs not only at spaces, but also at any of the following characters:

. - + _ = / (

For example if vsplitwidth=5, a variable name like First_Emergency_Visit would be split into

                                                                 First_
                                                                 Emergency_
                                                                 Visit

 

This concludes the mini-tutorial. vtree has many more features, described in the following sections.

Features

This section presents detailed descriptions of vtree features.

Pruning

This section shows how to remove branches from a variable tree.

When a variable tree gets too big, or you are only interested in certain parts of the tree, it may be useful to remove some nodes along with their descendants. This is known as pruning. For convenience, there are several different ways to prune a tree, described below.

The prune parameter

Here’s a variable tree we’ve already seen in various forms:

vtree(FakeData,
  "Severity Sex")

The prune parameter (continued)

Suppose you don’t want the tree to show branches for individuals whose disease is Mild or Moderate. Specifying prune=list(Severity=c("Mild","Moderate")) removes those nodes, and all of their descendants:

vtree(FakeData,
  "Severity Sex",
  prune=list(Severity=c("Mild","Moderate")))

The prune parameter (continued)

In general, the argument of the prune parameter is a list with an element named for each variable you wish to prune. In the example above, the list has a single element, named Severity. In turn, that element is a vector c("Mild","Moderate") indicating the values of Severity to prune.

Caution: Once a variable tree has been pruned, it is no longer complete. This can sometimes be confusing since not all observations are represented at certain layers of the tree. For example in the tree above, only 11 observations are shown in the Severity nodes and their children.
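To see where the 11 comes from, the Severity counts can be tallied by hand. The sketch below hard-codes the counts rather than reading them from vtree: 19 Mild, 16 Moderate, and 6 missing are quoted earlier, and 5 Severe is the remainder out of 46.

```r
# Severity counts in FakeData, taken from figures quoted in the text
severity_counts <- c(Mild = 19, Moderate = 16, Severe = 5, Missing = 6)

# prune removes the Mild and Moderate nodes (and their descendants)
pruned    <- c("Mild", "Moderate")
remaining <- severity_counts[!(names(severity_counts) %in% pruned)]

sum(severity_counts)  # observations in the root node
sum(remaining)        # observations still shown at the Severity layer
```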

The keep parameter

Sometimes it is more convenient to specify which nodes should be retained rather than which ones should be discarded. The keep parameter is used for this purpose, and can thus be considered the complement of the prune parameter. For example, to retain the Moderate Severity node:

vtree(FakeData,
  "Severity Sex",
  keep=list(Severity="Moderate"))

Note: In addition to the Moderate node, the missing value node has also been retained. In general, whenever valid percentages are used (which is the default), missing value nodes are retained when keep is used. This is because valid percentages are difficult to interpret without knowing the denominator, which requires knowing the number of missing values.

The keep parameter (continued)

On the other hand, here’s what happens when vp=FALSE:

vtree(FakeData,
  "Severity Sex",
  keep=list(Severity="Moderate"),
  vp=FALSE)

The prunebelow parameter

As seen above, a disadvantage of pruning is that in the resulting tree, the counts shown in child nodes may not add up to the counts shown in their parent node.

An alternative is to prune below the specified nodes (i.e. to prune their descendants), so that the counts always add up. In the present example, this means that the Mild and Moderate nodes will be shown, but not their descendants. The prunebelow parameter is used to do this:

vtree(FakeData,
  "Severity Sex",
  prunebelow=list(Severity=c("Mild","Moderate")))

The follow parameter

The complement of prunebelow is follow. Instead of specifying which nodes should be pruned below, this allows you to specify which nodes should be “followed” (that is, not pruned below).
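As an illustrative sketch, assuming follow takes the same list form as prunebelow, the call below asks vtree to continue to the Sex layer only below the Severe node. How other nodes (for example the missing value node) are treated may depend on settings such as vp.

```r
library(vtree)

# Show the Severity nodes, but only follow (expand) the Severe branch
tree <- vtree(FakeData,
  "Severity Sex",
  follow=list(Severity="Severe"))
tree
```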

Targeted pruning

This section describes a more flexible way to prune variable trees. To explain this, first note that the prune, keep, prunebelow, and follow parameters specify pruning across all branches of the tree. For example, if you were pruning Severity nested within levels of Sex, the pruning would take place in both the M and F branches.

Sometimes, however, it is preferable to perform pruning only in specified branches of the tree. This is called targeted pruning, and the parameters tprune, tkeep, tprunebelow, and tfollow provide this functionality. However, their arguments have a more complex form than those of the corresponding prune, keep, prunebelow, and follow parameters because they specify the full path from the root of the tree all the way to the nodes to be pruned.

Targeted pruning (continued)

For example, to remove every Severity node except Moderate, but only for males, use the following command:

vtree(FakeData,
  "Sex Severity",
  tkeep=list(list(Sex="M",Severity="Moderate")))

Note that the argument of tkeep is a list of lists, one for each path through the tree. To keep both Moderate and Severe, specify tkeep=list(list(Sex="M",Severity=c("Moderate","Severe"))).

Targeted pruning (continued)

Now suppose that, in addition to this, within females, you want to keep just Mild. Use the following specification to do this:

tkeep=list(list(Sex="M",Severity=c("Moderate","Severe")),list(Sex="F",Severity="Mild"))
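Placed in a complete call, the specification might look like the following sketch. It reuses the "Sex Severity" tree from the previous example, and the comments mark the two paths.

```r
library(vtree)

tree <- vtree(FakeData,
  "Sex Severity",
  tkeep=list(
    list(Sex="M",Severity=c("Moderate","Severe")),  # path 1: within males
    list(Sex="F",Severity="Mild")))                 # path 2: within females
tree
```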

The prunesmaller parameter

As a variable tree grows, it can become difficult to see the forest for the tree. For example, the following tree is hard to read, even when sameline=TRUE has been specified:

vtree(FakeData,
  "Severity Sex Age Category",
  sameline=TRUE)

The prunesmaller parameter (continued)

One solution is to prune nodes that contain small numbers of observations. For example if you want to only see nodes with at least 3 observations, you can specify prunesmaller=3, as in this example:

vtree(FakeData,
  "Severity Sex Age Category",
  sameline=TRUE,
  prunesmaller=3)

As with the keep parameter, when valid percentages are used (vp=TRUE, which is the default), nodes representing missing values will not be pruned. (As noted previously, this is because percentages are confusing when missing values are not shown.) On the other hand, when vp=FALSE, missing nodes will be pruned (if they are small enough).

Labels for variables and nodes

This section shows how to relabel variables and nodes.

By default, vtree labels variables and nodes exactly as they appear in the data frame. But it is often useful to change these labels.

Changing variable labels with the labelvar parameter

Suppose Severity in fact represents initial severity. To label it that way in the variable tree, specify labelvar=c(Severity="Initial severity"):

vtree(FakeData,
  "Severity Sex",
  horiz=FALSE,
  labelvar=c(Severity="Initial severity"))

Changing node labels with the labelnode parameter

By default, vtree labels nodes (except for the root node) using the values of the variable in question. Sometimes it is convenient to instead specify custom labels for nodes. The labelnode argument can be used to relabel the values. For example, you might want to use “Male” and “Female” instead of “M” and “F”.

vtree(FakeData,
  "Group Sex",
  horiz=FALSE,
  labelnode=list(Sex=c(Male="M",Female="F")))

The argument of the labelnode parameter is specified as a list whose element names are variable names. To substitute a new label for an old label, the syntax is: "New label"="Old label". Thus the full specification, as used above, is: labelnode=list(Sex=c(Male="M",Female="F")).

Targeted node labels using the tlabelnode parameter

Suppose in the example above that Group A represents children and Group B represents adults. In Group A, we would like to use the labels “girl” and “boy”, while in Group B we would like to use “woman” and “man”. The labelnode parameter cannot handle this situation because the values of Sex need to be labeled differently in different branches of the tree. The tlabelnode parameter allows “targeted” node labels.

vtree(FakeData,
  "Group Sex",
  horiz=FALSE,
  labelnode=list(Group=c(Child="A",Adult="B")),
  tlabelnode=list(
    c(Group="A",Sex="F",label="girl"),
    c(Group="A",Sex="M",label="boy"),
    c(Group="B",Sex="F",label="woman"),
    c(Group="B",Sex="M",label="man")))

Text and text formatting

This section shows how to add bold, italics, and other text formatting.

Graphviz, the open source graph visualization software that vtree is built on, supports a variety of text formatting (including bold, colors, etc.). This is used in vtree to control formatting of text such as node labels.

Markdown-style codes for text formatting

By default, the vtree package uses markdown-style codes for text formatting. In the tables below, ... represents arbitrary text.

\n           insert a line break
\n*l         make the preceding line left-justified and insert a line break
*...*        display text in italics
**...**      display text in bold
^...^        display text in superscript (using 10 point font)
~...~        display text in subscript (using 10 point font)
%%red ...%%  display text in red (or whichever color is specified)

HTML-like codes for text formatting

As an alternative, if you specify HTMLtext=TRUE you can use “HTML-like labels” (implemented in Graphviz), including:

<BR/>                                  insert a line break
<BR ALIGN='LEFT'/>                     make the preceding line left-justified and insert a line break
<I> ... </I>                           display text in italics
<B> ... </B>                           display text in bold
<SUP> ... </SUP>                       display text in superscript (using 10 point font)
<SUB> ... </SUB>                       display text in subscript (using 10 point font)
<FONT POINT-SIZE='10'> ... </FONT>     set font to 10 point
<FONT FACE='Times-Roman'> ... </FONT>  set font to Times-Roman
<FONT COLOR='red'> ... </FONT>         set font to red

See https://www.graphviz.org/doc/info/shapes.html#html for more details.

Adding text to nodes using the text parameter

Suppose you wish to add the italicized text “Excluding new diagnoses” to any Mild nodes in the tree. The parameter text is used to add text to nodes. It is specified as a list with an element named for each variable. In the example below the list has one element, named Severity. That element in turn is a vector c(Mild="\n*Excluding\nnew diagnoses*") indicating that the Mild node should include additional text using Markdown-style formatting (i.e. \n indicates a linebreak and the asterisks around the text indicate that it should be displayed in italics):

vtree(FakeData,
  "Group Severity",
  horiz=FALSE,
  showvarnames=FALSE,
  text=list(
    Severity=c(
      Mild="\n*Excluding\nnew diagnoses*")))
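The same annotation can presumably be written with the HTML-like codes listed earlier, by setting HTMLtext=TRUE. This is a sketch based on the two tables of formatting codes, not a tested equivalence: <BR/> stands in for \n and <I> ... </I> for the asterisks.

```r
library(vtree)

tree <- vtree(FakeData,
  "Group Severity",
  horiz=FALSE,
  showvarnames=FALSE,
  HTMLtext=TRUE,   # interpret text as HTML-like labels, not markdown-style codes
  text=list(
    Severity=c(
      Mild="<BR/><I>Excluding<BR/>new diagnoses</I>")))
tree
```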

Targeted text using the ttext parameter

In the example above, suppose that new diagnoses are only excluded from Mild cases in Group B. But the text parameter adds text to all Mild nodes. Thus, in situations like this, the text parameter is not sufficient. Instead, you can use the ttext parameter to target exactly which nodes should have the specified text.

The ttext parameter requires that you specify the full path from the root of the tree to the node in question, along with the text in question. The ttext parameter is specified as a list so that multiple targeted text strings can be specified at once. For example:

vtree(FakeData,
  "Group Severity",
  horiz=FALSE,
  showvarnames=FALSE,
  ttext=list(
    c(Group="B",
      Severity="Mild",
      text="\n*Excluding\nnew diagnoses*"),
    c(Group="A",
      text="\nSweden"),
    c(Group="B",
      text="\nNorway")))

Specification of variables

This section shows how to control how variables appear in a variable tree.

Sometimes it is desirable to modify a variable for use in a variable tree. For example, suppose you wish to determine how many values of Score are missing. This is easy to do with dplyr:

library(dplyr)
FakeData %>% 
  mutate(missingScore=is.na(Score)) %>% 
  vtree("missingScore")

But vtree also offers built-in tools for variable specification. Although limited, they can be very convenient.

prefix is.na:

If an individual variable name is preceded by is.na:, that variable will be replaced by a missing value indicator in the variable tree. (This differs from the check.is.na parameter, which is used to replace all of the specified variables with missing value indicators.) For example:

vtree(FakeData,"is.na:Score")

wildcard #

Specifying Ind# matches all variable names that start with Ind and end with one or more numeric digits, namely Ind1, Ind2, Ind3, and Ind4. This wildcard can also be used within a variable name. For example, visit#duration would match visit1duration, visit2duration, etc.

wildcard *

Specifying Ind* matches all variable names that start with Ind and end with any other characters (or no other characters). In FakeData this matches Ind1, Ind2, Ind3, and Ind4 (just like Ind# does). But if FakeData contained variables named Ind and Index, they would also be matched by Ind*. As with the # wildcard, the * wildcard can be used within a variable name.
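The difference between the two wildcards can be illustrated with ordinary regular expressions. This is only an analogy in base R, applied to a made-up vector of names; it is not how vtree itself resolves wildcards.

```r
# Hypothetical variable names (Ind and Index are not in FakeData)
var_names <- c("Ind1", "Ind2", "Ind3", "Ind4", "Ind", "Index", "Score")

# Ind# : "Ind" followed by one or more digits
hash_matches <- grep("^Ind[0-9]+$", var_names, value=TRUE)

# Ind* : "Ind" followed by any characters, or none
star_matches <- grep("^Ind.*$", var_names, value=TRUE)

print(hash_matches)
print(star_matches)
```

Only the * pattern picks up Ind and Index, which is the distinction described in the text.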

prefix i:

“Intersections” between multiple variables can be generated using the prefix i:. For example, i:Ind* generates a variable representing the observed combinations of values of Ind1, Ind2, Ind3, and Ind4. (If at least one of the variables is missing, the combination will be missing.)

prefix r: (for REDCap)

vtree includes special support for REDCap data sets. The prefix r: is used to indicate REDCap checkbox variables, and can be combined with other prefixes. This is described in the section on REDCap checkboxes later in this vignette.

prefix any:

Sometimes a group of variables contains responses to a list of checkbox options (often with instructions to “check all that apply”). For example, suppose you have a data frame of shops, including whether they are open on Saturday (openSaturday) or Sunday (openSunday). Suppose no other variables start with open. Then open* will match both openSaturday and openSunday.

In general for a group of checkbox variables, it is often useful to know if any of the options were selected (i.e. checked). In the case above, we might want to know which shops are open at all on the weekend (either Saturday or Sunday).

A specification like any:open* is used to generate a variable that is

  • TRUE if any of the matching variables has a “checked” value

  • FALSE if none of the matching variables have “checked” values.

checked and unchecked

The parameters checked and unchecked specify which values are considered checked or unchecked respectively, and have the following defaults:

parameter  default value
checked    c("1","TRUE","Yes","yes")
unchecked  c("0","FALSE","No","no")

Values not listed in checked or unchecked are treated as missing values.

An alternative prefix, anyx:, is used to specify that missing values will be removed when performing the calculation. This matches the behavior of the R function any when na.rm=TRUE is specified.
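The role of missing values can be seen with base R's any function, which the text uses as the reference point for anyx:. The sketch below works on a hypothetical shops data frame (the values are invented), converts responses to logical values using the default checked and unchecked lists, and compares any with and without na.rm=TRUE. It illustrates the logic only and is not vtree's implementation.

```r
# Hypothetical shops data; the text does not supply actual values
shops <- data.frame(
  openSaturday = c("Yes", "No", "No", NA),
  openSunday   = c("No",  "No", "Yes", "No"))

checked   <- c("1", "TRUE", "Yes", "yes")
unchecked <- c("0", "FALSE", "No", "no")

# Checked -> TRUE, unchecked -> FALSE, anything else -> NA
to_logical <- function(x) {
  x <- as.character(x)
  ifelse(x %in% checked, TRUE, ifelse(x %in% unchecked, FALSE, NA))
}
flags <- sapply(shops, to_logical)

# Row-wise "any", keeping or removing missing values
any_open  <- apply(flags, 1, any)
anyx_open <- apply(flags, 1, any, na.rm=TRUE)

print(data.frame(shops, any_open, anyx_open))
```

In the last row, Saturday is missing and Sunday is unchecked, so plain any returns NA while any with na.rm=TRUE returns FALSE.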

prefix none:

The logical complement (negation) of the any: prefix. An alternative prefix, nonex:, is used to specify that missing values will be removed when performing the calculation.

prefix all:

A specification like all:open* generates a variable which is TRUE if all of the matching variables have a “checked” value.

An alternative prefix, allx:, is used to specify that missing values will be removed when performing the calculation. This matches the behavior of the R function all when na.rm=TRUE is specified.

prefix notall:

The logical complement (negation) of the all: prefix. An alternative prefix, notallx:, is used to specify that missing values will be removed when performing the calculation.

prefix tri:

The tri: prefix is useful for identifying values of a numeric variable that are extreme compared to the other values in a node. Note: Unlike other variable specifications, which take effect at the level of the entire data frame, the tri: prefix takes effect within each node.

The effect of this variable specification is to trichotomize the values of a numeric variable, i.e. to divide them into three groups:

  • “mid”: values within plus or minus 1.5×IQR of the median,

  • “high”: values more than 1.5×IQR above the median,

  • “low”: values more than 1.5×IQR below the median.
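The three groups can be sketched in base R as follows. The function name trichotomize and the sample values are hypothetical, and vtree's own implementation may differ in details such as how values exactly on a boundary are handled.

```r
# Hypothetical numeric values; 1 and 45 are far from the rest
x <- c(1, 4, 5, 5, 6, 7, 45)

trichotomize <- function(x) {
  med    <- median(x, na.rm=TRUE)
  spread <- 1.5 * IQR(x, na.rm=TRUE)
  ifelse(x > med + spread, "high",
    ifelse(x < med - spread, "low", "mid"))
}

groups <- trichotomize(x)
print(table(groups))
```

For these values the median is 5 and the IQR is 2, so anything above 8 is labeled high and anything below 2 is labeled low.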

specification variable=value

When a variable takes on a large number of different values, the resulting variable tree will be very large. One solution is to prune the tree, for example by keeping just the node corresponding to one value of a particular variable. An alternative is to specify the value of the variable that is of primary interest and vtree will dichotomize the variable at that value. For example if Severity=Mild is specified, the Severity variable will be dichotomized between Mild and Not Mild.

specifications variable<value, variable>value

These two specifications are used to dichotomize a numeric variable, splitting above and below a specified value. This can be useful for identifying subsets with extreme values.
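A short sketch combining both kinds of dichotomization, assuming the specifications can be mixed in one variable string in the usual space-separated way. The cutoff of 10 for Score is arbitrary and chosen only for illustration.

```r
library(vtree)

# Dichotomize Severity at Mild, then split on whether Score exceeds 10
tree <- vtree(FakeData,
  "Severity=Mild Score>10")
tree
```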

Displaying summary statistics in nodes

This section shows how to display information about other variables in the nodes.

It is often useful to display information about other variables (apart from those that define the tree) in the nodes of a variable tree. This is particularly useful for numeric variables, which usually would not be used to build the tree since they have too many distinct values. The summary parameter allows you to show information (for example, a mean) about a specified variable within a subset of the data frame.

Default summaries

Suppose you are interested in summary information for the Score variable for all of the observations in the data frame (i.e. in the root node). In that case you don’t need to specify any variables for the tree itself:

vtree(FakeData,
  summary="Score")

When the name of a numeric variable (in this case "Score") is specified as the argument of the summary parameter, a default set of summary statistics (as shown above) appears: the variable name, the number of missing values, the mean and standard deviation, the median and interquartile range (IQR), and the range.

(Note, however, that if there are three or fewer observations, instead of showing the above summary statistics, the observations are simply listed.)

Default summaries (continued)

Suppose we’re building a variable tree based on Severity. We can display these summaries for Score in each node:

vtree(FakeData,
  "Severity",
  summary="Score",
  horiz=FALSE)

Extracting summary information

Sometimes it is helpful to extract summary information as text. For example, we might wish to access the summary information contained in the Mild node. This is explained later on, but here’s a brief example:

vSeverity <- vtree(FakeData,
  "Severity",
  summary="Score",
  horiz=FALSE)

info <- attributes(vSeverity)$info
cat(info$Severity$Mild$.text)

Score
missing 1
mean 12.1 SD 14.6
med 5.5 IQR 3.2, 9.8
range 1.0, 45.0

Default summaries for factor variables and indicator variables

There are also default summaries for factor variables and for indicator variables. For example, Category is a factor variable:

vtree(FakeData,
  summary="Category")

Indicator variables

Indicator variables have two levels such as 0 / 1, or TRUE / FALSE. For example, Event is an indicator variable:

vtree(FakeData,
  summary="Event")

Specification of variables in the summary argument

Variables in the summary argument can also be specified in a way that is similar to the specification of variables for structuring a variable tree. For example, if we wish to know the proportion of patients in each node whose Category is single, we specify Category=single in the summary argument:

vtree(FakeData,
  "Severity",
  summary="Category=single",
  horiz=FALSE)

Summaries for a collection of variables

Summaries can be obtained for a collection of variables using pattern-matching, for example:

vtree(FakeData,
  "Severity",
  summary="Ind*",
  sameline=TRUE,
  horiz=FALSE,
  just="l")

Incidentally, note that just="l" specifies that all text should be left-justified, which conveniently lines up all of the rows of the summary.

The summary argument can also use the prefixes i:, any:, none:, all:, notall: (as well as anyx:, nonex:, allx:, and notallx:) and wildcards # and * (similar to variable specifications). Additionally, specifications for REDCap checkboxes can be used.

Control codes

By default, summary information is shown in all nodes. However, it may also be convenient to only show it in specific nodes. To control this, special codes that begin and end with % can be specified. The following control codes are available:

code        summary information restricted to:
%noroot%    all nodes except the root
%leafonly%  leaf nodes
%var=v%     nodes of variable v
%node=n%    nodes named n

More on control codes

The control codes can be specified by adding them to the end of the summary string, separated with a space. For example, to only show summary information for nodes of the Category variable with the value single:

vtree(FakeData,
  "Severity Category",
  summary="Score<10 %var=Category%%node=single%",
  sameline=TRUE,
  showlegend=TRUE,
  showlegendsum=TRUE)

Here showlegend=TRUE was specified, and additionally showlegendsum=TRUE, which indicates that summaries should also be shown in legend nodes.

Customized summaries

The summary parameter also allows for customized summaries. For example, we might wish to display only the mean Score in each node of the tree. The %mean% code is used to represent the mean of the specified variable (preceded here by a line break, \n).

vtree(FakeData,
  "Severity",
  summary="Score \nmean score\n%mean%",
  sameline=TRUE,
  horiz=FALSE)

Other summary codes

In addition to the %mean% code, numerous other summary codes are supported, as listed in the table below. When such a code is present, the default summary is not shown. Instead, any text that is provided (in this case \nmean score\n) is shown, together with the requested summary information. If there are any missing values in a node, the number of missing values is shown using the abbreviation mv. To see summaries without any decimals, specify cdigits=0.

summary code  result
%mean%        mean (variant: %meanx% does not report missing values*)
%SD%          standard deviation (variant: %SDx% does not report missing values*)
%sum%         sum (variant: %sumx% does not report missing values*)
%min%         minimum (variant: %minx% does not report missing values*)
%max%         maximum (variant: %maxx% does not report missing values*)
%range%       range (variant: %rangex% does not report missing values*)
%median%      median, i.e. p50 (variant: %medianx% does not report missing values*)
%IQR%         IQR, i.e. p25, p75 (variant: %IQRx% does not report missing values*)

(continued next slide)

Other summary codes (continued)

summary code  result
%freqpct%     frequency and % (variant: %freqpct_% shows each value on a separate line)
%freq%        frequency (variant: %freq_% shows each value on a separate line)
%pY%          Yth percentile (e.g. p50 means the 50th percentile)
%npct%        frequency and % of a logical variable. By default “valid percentages” are used. Any missing values are also reported.
%pct%         same as %npct% but percentage only (with no parentheses).
%list%        list of individual values, separated by commas (variant: %list_% shows each value on a separate line)
%mv%          the number of missing values
%nonmv%       the number of non-missing values
%v%           the name of the variable

*Caution is recommended when suppressing missing values.

The summary argument can include any number of these codes, mixed with text and formatting codes.

The %trunc% code

It is sometimes convenient to see individual values of a variable in each node. A good example is ID numbers. To do this, use the %list% code. When a value occurs more than once in the subset, it will be followed by a count of the number of repetitions in parentheses.

When there are many individual values, it is often convenient to truncate the output. If you specify %trunc=N%, summary information will be truncated after N characters, and followed by “…”.
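As a sketch, the call below lists the id values in each Severity node and truncates long lists. It assumes that %trunc=N% is written inside the summary string alongside %list%, and the limit of 40 characters is arbitrary.

```r
library(vtree)

# List id values in each node, truncating the list after 40 characters
tree <- vtree(FakeData,
  "Severity",
  summary="id \nid = %list% %trunc=40%",
  horiz=FALSE)
tree
```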

R expressions in the summary argument

Rather than starting the summary argument with a variable name, an R expression involving variables in the data frame can be given, as long as it does not contain any spaces.

vtree(FakeData,
  "Severity Category",
  summary="(Post-Pre)/Pre \nmean = %mean%",
  sameline=TRUE,
  horiz=FALSE,
  cdigits=1)

Expressions involving functions can also be used; for example sqrt(abs(Post/Pre)).

More than one variable

Sometimes it is useful to display summary information for more than one variable. To do this, specify summary as a vector of character strings. For example:

vtree(FakeData,
  "Severity",
  horiz=FALSE,
  showvarnames=FALSE,
  splitwidth=Inf,
  sameline=TRUE,
  summary=c(
    "Score \nScore: mean (SD) %meanx% (%SD%)",
    "Pre \nPre: range %range%"))

Targeted summaries

Sometimes you only want to show a summary in a particular node. Targeted summaries are specified with the tsummary parameter as a list of character-string vectors. The initial elements of each character string vector point to a specific node. The final element of each character string vector is a summary string, with the same structure as for summary.

vtree(FakeData,
  "Age Sex",
  tsummary=list(
    list(Age="5",Sex="M","id \n%list%")),
  horiz=FALSE)

Pattern trees and pattern tables

This section shows how to display all the combinations of values in a set of variables.

Each node in a variable tree provides the frequency of a particular combination of values of the variables. The leaf nodes represent the observed combinations of values of all of the variables. For example, in a variable tree for Severity and Sex, the leaf nodes correspond to Mild F, Mild M, Moderate F, Moderate M, etc. These combinations, or “patterns”, can be treated as an additional variable. And if this new pattern variable is used as the first variable in a tree, then the branches of the tree will be simplified: each branch will represent a unique pattern, with no sub-branches.

Pattern trees

A “pattern tree” can be easily produced by specifying pattern=TRUE.

vtree(FakeData,
  "Severity Sex",
  pattern=TRUE)

Pattern trees (continued)

Pattern trees are simpler to read than ordinary variable trees, but they involve a considerable loss of information, since they only represent the nth-degree subsets (where n is the number of variables).

Note that by default, when pattern=TRUE is specified, the root node is not shown (in order to simplify the display). A disadvantage of this is that the total sample size is not shown. You can override this behavior by specifying showroot=TRUE.

A pattern tree has two other special characteristics. First, note that after the first layer (representing pattern), counts and percentages are not shown, since they are not informative: by definition, all nodes within a branch have the same count. Second, note that in place of arrows, undirected line segments are shown. This is because, unlike in a regular variable tree, the order of variables is irrelevant in a pattern tree. Sometimes, however, the variables do have a natural ordering, as in the case of longitudinal variables. To show arrows, specify seq=TRUE instead of pattern=TRUE, and a “sequence” (i.e. an ordered pattern) will be shown.

Summaries can be shown in pattern trees (using the summary parameter), but they only appear in the pattern node (or the sequence node if seq=TRUE).
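The two options mentioned in this section, showroot=TRUE and seq=TRUE, might be used as in the following sketch. It reuses the Severity and Sex variables purely for illustration; they have no natural ordering, so the sequence version is shown only to demonstrate the syntax.

```r
library(vtree)

# Pattern tree with the root node (total sample size) displayed
pattern_tree <- vtree(FakeData,
  "Severity Sex",
  pattern=TRUE,
  showroot=TRUE)
pattern_tree

# Sequence: an ordered pattern, drawn with arrows
sequence_tree <- vtree(FakeData,
  "Severity Sex",
  seq=TRUE)
sequence_tree
```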

Pattern tables

A pattern tree has the same structure as a table. Indeed, it may be more convenient to produce a table rather than a pattern tree. A data frame containing the information from the pattern tree can be exported by specifying ptable=TRUE:

vtree(FakeData,"Severity Sex",ptable=TRUE)
   n pct Severity Sex
1  2   4   Severe   F
2  3   7     <NA>   F
3  3   7     <NA>   M
4  3   7   Severe   M
5  5  11 Moderate   M
6  8  17     Mild   M
7 11  24     Mild   F
8 11  24 Moderate   F

The pattern table includes a column for the counts from the pattern nodes, and a column for percentages. Compared to a variable tree, this table is much more compact, and may be more suitable for use in a manuscript.
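Because the result is an ordinary data frame, it can be stored and worked with like any other. A brief sketch, assuming the column names n and Severity shown in the printed output:

```r
library(vtree)

# Export the pattern table as a data frame
pt <- vtree(FakeData, "Severity Sex", ptable=TRUE)

# The counts across all patterns add up to the number of observations
sum(pt$n)

# Patterns in which Severity is missing
pt[is.na(pt$Severity), ]
```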

Indicator variables

Pattern trees can be very useful for indicator variables, i.e. variables that take values like 0/1, no/yes, FALSE/TRUE, etc. For convenience here, we’ll refer to 0 (or no, FALSE, etc.) as a negative and 1 (or yes, TRUE, etc.) as an affirmative.

The variables Ind1 through Ind4 in FakeData are 0/1 indicator variables. If these variables are interpreted as representing set membership (0 = non-member, 1 = member), then a pattern tree is an alternative representation of a Venn diagram. If you specify Venn=TRUE, the nodes (except for the pattern nodes) will be blank, with only their shade indicating their value (dark = 1, light = 0, white = missing).

vtree(FakeData,
  "Ind1 Ind2 Ind3 Ind4",
  Venn=TRUE,
  pattern=TRUE)

prunesmaller

Big pattern trees can be overwhelming, so it may be useful to prune patterns that occur fewer than, say, 3 times, by specifying prunesmaller=3.
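
For instance, the Venn-style pattern tree from the previous slide could be pruned as follows (a sketch that simply combines the parameters already introduced):

```r
library(vtree)

# Drop patterns that occur fewer than 3 times
vtree(FakeData, "Ind1 Ind2 Ind3 Ind4",
  Venn = TRUE, pattern = TRUE, prunesmaller = 3)
```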

A pattern tree for indicator variables provides all the information that a Venn diagram represents, but unlike a Venn diagram, missing values are also represented. This can also be shown as a pattern table. For example:

vtree(FakeData,
  "Ind1 Ind2",
  ptable=TRUE)
   n pct Ind1 Ind2
1  1   2 <NA>    0
2 10  22    1    0
3 11  24    0    1
4 12  26    0    0
5 12  26    1    1

The VennTable function

For indicator variables, there is an extra function, VennTable, which converts the pattern table to a matrix of character strings and adds some additional totals.

VennTable(
  vtree(FakeData,
    "Ind1 Ind2",ptable=TRUE))
      n    pct   Ind2 Ind1
      "1"  "2"   "0"  NA  
      "10" "22"  "0"  "10"
      "11" "24"  "11" "0" 
      "12" "26"  "0"  "0" 
      "12" "26"  "12" "12"
Total "46" "100" ""   ""  
N     ""   ""    "23" "22"
pct   ""   ""    "50" "48"

The VennTable function (continued)

By default in R, when a matrix of character strings is printed, quotation marks are displayed around each element. Unfortunately the result is unattractive. Instead it’s helpful to call the print function and specify quote=FALSE:

print(
  VennTable(
    vtree(FakeData,
      "Ind1 Ind2",
      ptable=TRUE)),
  quote=FALSE)
      n  pct Ind2 Ind1
      1  2   0    <NA>
      10 22  0    10  
      11 24  11   0   
      12 26  0    0   
      12 26  12   12  
Total 46 100          
N            23   22  
pct          50   48  

Without all those quotation marks, it’s easier to see what VennTable adds:

  • the total sample size (46) and percentage (100), and

  • the total number (N) of affirmatives for each variable, together with a percentage.
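
These added totals can be reproduced by hand, which may be useful when checking a report. The sketch below retypes the counts from the Ind1/Ind2 pattern table shown earlier and uses base R only; the object names are hypothetical and this is not how VennTable is implemented.

```r
# Counts copied from the pattern table for Ind1 and Ind2
pt <- data.frame(
  n    = c(1, 10, 11, 12, 12),
  Ind1 = c(NA, 1, 0, 0, 1),
  Ind2 = c(0, 0, 1, 0, 1))

# Total sample size
total <- sum(pt$n)

# Number of affirmatives for each variable (missing values are not counted)
N <- sapply(pt[c("Ind1", "Ind2")], function(x) sum(pt$n[x %in% 1]))

# Affirmatives as a percentage of the total sample size
pct <- round(100 * N / total)

total  # 46
N      # Ind1 = 22, Ind2 = 23
pct    # Ind1 = 48, Ind2 = 50
```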

The VennTable function (continued)

The VennTable function can also be used in an R Markdown document. Specifying markdown=TRUE generates a pandoc markdown pipe table, with several formatting tweaks:

  • the rows and columns of the table are transposed

  • affirmatives are represented by checkmarks

  • negatives are represented by spaces

  • missing values are represented by dashes (which can be changed with the NAcode parameter).

The VennTable function (continued)

To display the table in R Markdown, use this inline call:

VennTable(
  vtree(FakeData,
    "Ind1 Ind2",ptable=TRUE),
  markdown=TRUE)
[1] "&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|Total|N|%\n-|-|-|-|-|-|-|-|-\nn|1|10|11|12|12|46|&nbsp;|&nbsp;\n%|2|22|24|26|26|100|&nbsp;|&nbsp;\nInd2|&nbsp;|&nbsp;|&#10004;|&nbsp;|&#10004;|&nbsp;|23|50\nInd1|-|&#10004;|&nbsp;|&nbsp;|&#10004;|&nbsp;|22|48"

VennTable has some additional parameters. The checked parameter is used to specify values that should be interpreted as affirmative. By default, it is set to c("1","TRUE","Yes","yes","N/A"). Similarly, the unchecked parameter is used to specify values that should be interpreted as negative, with default c("0","FALSE","No","no","not N/A").

Using the summary parameter in pattern tables

The summary parameter can also be used in pattern tables. If a single summary is requested, it appears in the summary_1 variable in the data frame. Additional summaries appear as summary_2, summary_3, etc.

vtree(FakeData,
  "Severity Sex",
  summary=c("Score %mean%","Pre %mean%"),
  ptable=TRUE)
   n pct Severity Sex summary_1 summary_2
1  2   4   Severe   F      28.0      -0.4
2  3   7     <NA>   F       6.3      -0.1
3  3   7     <NA>   M      23.7      -0.9
4  3   7   Severe   M      44.0      -0.3
5  5  11 Moderate   M       8.2 -0.7 mv=1
6  8  17     Mild   M  6.3 mv=1       0.2
7 11  24     Mild   F      15.7 -0.4 mv=2
8 11  24 Moderate   F 21.5 mv=1       0.0

Checking for missing values with the check.is.na parameter

If check.is.na=TRUE is specified, each variable is replaced by an indicator of whether or not it is missing, and pattern=TRUE is automatically set. As when Venn=TRUE is specified, all nodes except for the pattern node are blank, and only their shade indicates missing (dark) or not (light). Whereas the variables used to build a variable tree are normally categorical, in this situation non-categorical variables can be used, because their missingness is represented instead of their actual values.

vtree(FakeData,
  "Severity Age Pre Post",
  check.is.na=TRUE)
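
The underlying idea can be sketched in base R: replace each variable by a missingness indicator, then count the patterns. The data frame `toy` is hypothetical, and the MISSING_ prefix is borrowed from the output vtree produces later in this section; the code is an illustration rather than vtree's implementation.

```r
# Hypothetical data with some missing values
toy <- data.frame(
  Age = c(34, NA, 51, NA),
  Pre = c(1.2, 0.4, NA, NA))

# Replace each variable by an indicator of whether it is missing
miss <- as.data.frame(lapply(toy, is.na))
names(miss) <- paste0("MISSING_", names(toy))

# Number of missing values per variable
colSums(miss)

# Count each pattern of missingness
patterns <- as.data.frame(table(miss), stringsAsFactors = FALSE)
patterns
```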

The ptable parameter

Specifying ptable=TRUE produces this information in a data frame, and calling VennTable shows additional information. To display the table in R Markdown, use this inline call:

VennTable(
  vtree(FakeData,
    "Severity Age Pre Post",
    check.is.na=TRUE,
    ptable=TRUE),
  markdown=TRUE)
[1] "&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|Total|N|%\n-|-|-|-|-|-|-|-|-|-|-|-\nn|1|1|1|1|2|4|4|32|46|&nbsp;|&nbsp;\n%|2|2|2|2|4|9|9|70|100|&nbsp;|&nbsp;\nMISSING_Age|&#10004;|&nbsp;|&nbsp;|&nbsp;|&#10004;|&nbsp;|&#10004;|&nbsp;|&nbsp;|7|15\nMISSING_Severity|&nbsp;|&nbsp;|&nbsp;|&nbsp;|&#10004;|&#10004;|&nbsp;|&nbsp;|&nbsp;|6|13\nMISSING_Pre|&#10004;|&#10004;|&#10004;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|3|7\nMISSING_Post|&nbsp;|&#10004;|&nbsp;|&#10004;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|&nbsp;|2|4"

Colors

This section explains how colors and color palettes can be used.

By default, vtree assigns colors to nodes of each successive variable using color palettes from RColorBrewer.

The sequence of palettes (identified by short names) is as follows:

 1 Reds       6 YlGn     11 BuPu     16 RdPu
 2 Blues      7 PuBu     12 YlOrRd   17 BuGn
 3 Greens     8 PuRd     13 RdYlGn   18 OrRd
 4 Oranges    9 YlOrBr   14 GnBu
 5 Purples   10 PuBuGn   15 YlGnBu
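
To inspect the colors behind a palette name, RColorBrewer can be queried directly. This is a small sketch that assumes the RColorBrewer package is installed; how many shades vtree requests for a given variable is not stated here, so the 3 below is only an illustrative choice.

```r
library(RColorBrewer)

# Hex codes for a 3-shade version of palette 1 (Reds)
reds <- brewer.pal(3, "Reds")
reds

# Display a 3-shade version of palette 3 (Greens) in the plotting window
display.brewer.pal(3, "Greens")
```

The three Reds codes are the same ones that appear as fill colors for the Severity nodes in the DOT script shown later in this document.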

The palette parameter

If you prefer to change the color assignments, you can use the palette parameter. For example, by default a variable tree for Sex and Severity will assign shades of red to nodes of Sex and shades of blue to nodes of Severity. To switch to shades of, say, green and orange instead, use:

vtree(FakeData,
  "Sex Severity",
  palette=c(3,4))

The revgradient parameter

Sometimes it may be useful to reverse the order of a gradient. To reverse the order of all gradients, specify revgradient=TRUE. The gradient for selected variables can be reversed as in the example below:

vtree(FakeData,
  "Sex Group Severity",
  revgradient=c(Sex=TRUE,Severity=TRUE))

Other color-related parameters include:

  • sortfill: Specifying sortfill=TRUE fills nodes with gradient colors in sorted order according to the node count.

  • NAfillcolor: By default, missing value nodes are colored white. For a different color (say gray), specify NAfillcolor="gray". To instead use a color from the current palette, specify NAfillcolor=NULL.

  • rootfillcolor: The color of the root node can be changed (say to yellow) by specifying rootfillcolor="yellow".

  • fillcolor: To set all nodes of the tree (except for missing value nodes and the root node) to be the same color (say palegreen), specify fillcolor="palegreen".

  • plain: A simple color scheme is produced by specifying plain=TRUE. (Additionally, this increases the spaces between nodes.)
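
A short sketch combining several of these parameters (the particular colors are arbitrary choices for illustration):

```r
library(vtree)

# Gradient ordered by node count, gray missing-value nodes, yellow root node
vtree(FakeData, "Severity Sex",
  sortfill = TRUE,
  NAfillcolor = "gray",
  rootfillcolor = "yellow")

# The simple color scheme
vtree(FakeData, "Severity Sex", plain = TRUE)
```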

REDCap checkboxes

This section details support for checkbox variables from REDCap.

In datasets exported from REDCap, checkboxes (i.e. select-all-that-apply boxes) are represented in a special way. For each item in a checklist, a separate variable is created. Suppose survey respondents were asked to select which flavors of ice cream (Chocolate, Vanilla, Strawberry) they like. Within REDCap, the variable name for this list of checkboxes is IceCream, but when the dataset is exported, individual variables IceCream___1 (representing Chocolate), IceCream___2 (Vanilla), and IceCream___3 (Strawberry) are created. When the dataset is read into R, the names of the flavors are embedded in the attributes of these variables.

REDCap checkboxes (continued)

For illustrative purposes, let’s build a data frame like this using the build.data.frame function (for an explanation of this function, see the section of this vignette on generating a data frame by specifying subset sizes):

dessert <- build.data.frame(
  c(   "group","IceCream___1","IceCream___2","IceCream___3"),
  list("A",     1,             0,             0,              7),
  list("A",     1,             0,             1,              2),
  list("A",     0,             0,             0,              1),
  list("A",     1,             1,             1,              1),
  list("B",     1,             0,             1,              1),
  list("B",     1,             0,             0,              2), 
  list("B",     0,             1,             1,              1),
  list("B",     0,             0,             0,              1))
attr(dessert$IceCream___1,"label") <- "Ice cream (choice=Chocolate)"
attr(dessert$IceCream___2,"label") <- "Ice cream (choice=Vanilla)"
attr(dessert$IceCream___3,"label") <- "Ice cream (choice=Strawberry)"

prefix r:

The prefix r: identifies a REDCap checklist variable, and extracts a label from the variable attribute. For example, the following call automatically displays “Chocolate”:

vtree(dessert,"r:IceCream___1")
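
To see where “Chocolate” comes from, the label attribute and the variable name can be taken apart with base R. This is only an illustration of the REDCap naming conventions described above, with hypothetical object names; it is not vtree's own parsing code.

```r
# A label of the form REDCap attaches to a checkbox variable
label <- "Ice cream (choice=Chocolate)"

# Extract the text after "choice=" inside the parentheses
choice <- sub("^.*\\(choice=(.*)\\)$", "\\1", label)
choice  # "Chocolate"

# Recover the checklist name by removing the "___<number>" suffix
stem <- sub("___[0-9]+$", "", "IceCream___1")
stem    # "IceCream"
```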

suffix @

The suffix @ matches REDCap checklist variables based on the naming scheme used by REDCap for checklist variables. For example, the following call automatically displays Chocolate, Vanilla, and Strawberry:

vtree(dessert,"r:IceCream@")

variable prefixes rany:, rnone:, rall:, and rnotall:

The variable prefixes any:, none:, all:, and notall: can be combined with the r: prefix to form rany:, rnone:, rall:, and rnotall:. For example, to determine whether anyone did not like any of the flavors (Chocolate, Vanilla, or Strawberry):

vtree(dessert,"rnone:IceCream@")

variable prefix ri:

“Intersections” of REDCap variables may be obtained by combining the r: prefix with the i: prefix:

vtree(dessert,"ri:IceCream@")

Deprecated: variable prefixes stem: and rc:

To examine the pattern of ice-cream flavor choices, the following can be used:

vtree(dessert,
  "IceCream___1 IceCream___2 IceCream___3",
  pattern=TRUE)

One problem is that this doesn’t assign the appropriate labels to IceCream___1 (Chocolate), IceCream___2 (Vanilla), and IceCream___3 (Strawberry).

Pattern tree stem

Instead, try the following more compact call, which also assigns labels automatically.

vtree(dessert,
  "stem:IceCream",
  pattern=TRUE)

Summary stem

The summary parameter also supports a stem: prefix:

vtree(dessert,
  summary="stem:IceCream",
  splitwidth=Inf,
  just="l")

If you wish to examine only specific REDCap checkbox items, the rc: prefix can be used. For example, to examine results for just Chocolate and Strawberry:

vtree(dessert,
  "rc:IceCream___1 rc:IceCream___3",
  pattern=TRUE)

The DOT script generated by vtree

This section shows how to obtain the DOT script that displays a variable tree.

Specifying getscript=TRUE lets you capture the DOT script representing a variable tree. (vtree uses DiagrammeR, which uses Graphviz, which uses the DOT graph description language.) Here is an example:

dotscript <- vtree(FakeData,
  "Severity",
  getscript=TRUE)
cat(dotscript)
digraph vtree {
graph [nodesep=0.1, ranksep=0.5, tooltip=" "]
node [fontname = "Arial", fontcolor = black,shape = rectangle, color = black, tooltip=" ",margin=0.1]
rankdir=LR;
Node_L0_0 [style=invisible]

Node_L1_0[label=<<FONT POINT-SIZE="24"><FONT COLOR="#DE2D26">Severity</FONT></FONT>> shape=none margin=0]
Node_L0_0 -> Node_L1_0 [style=invisible arrowhead=none]

edge[style=solid]
Node_1->Node_2 Node_1->Node_3 Node_1->Node_4 Node_1->Node_5

Node_1[label=<46>  fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_2[label=<Mild<BR/>19 (48%)>  fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#FEE0D2>  ]
Node_1[label=<46>  fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_3[label=<Moderate<BR/>16 (40%)>  fontcolor=<#ffffff> color=black style="rounded,filled" fillcolor=<#FC9272>  ]
Node_1[label=<46>  fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_4[label=<Severe<BR/>5 (12%)>  fontcolor=<#ffffff> color=black style="rounded,filled" fillcolor=<#DE2D26>  ]
Node_1[label=<46>  fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_5[label=<NA<BR/>6>  fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<white>  ]

}

If you wish to edit this code directly, it can be pasted into an online Graphviz editor, for example:

https://dreampuf.github.io/GraphvizOnline/ and http://magjac.com/graphviz-visual-editor/
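
The script can also be edited and rendered from within R. The sketch below assumes that the vtree and DiagrammeR packages are installed; the font substitution and the file name are arbitrary illustrations, and the pattern being replaced is taken from the script printed above.

```r
library(vtree)

dotscript <- vtree(FakeData, "Severity", getscript = TRUE)

# Replace the default font name in the node defaults
edited <- sub('fontname = "Arial"', 'fontname = "Helvetica"',
  dotscript, fixed = TRUE)

# Save the edited script so that it can be opened in a Graphviz editor
dotfile <- file.path(tempdir(), "severity_tree.dot")
writeLines(edited, dotfile)

# Render the edited script using DiagrammeR
DiagrammeR::grViz(edited)
```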

Ways to call vtree

vtree behaves differently depending on the context in which it is called.

Calling vtree interactively

  • If vtree is called interactively in RStudio, it displays the variable tree in the Viewer window.

  • If vtree is called interactively from the RGui console (i.e. from R outside of RStudio), it displays the variable tree in a browser window.

Calling vtree from knitr and R Markdown

When vtree is called from knitr, it generates

  • A PNG file if the output format is Markdown

  • A PDF file if the output format is LaTeX.

Here’s how it does that. vtree uses the DiagrammeR package, which automatically generates an htmlwidget object for display in HTML, using the htmlwidgets framework. Then vtree converts the htmlwidget object into an SVG image, and finally into a PNG or PDF file.

Generating PNG files

PNG files are useful because they allow you to display variable trees in Microsoft Word documents, and also because HTML files that use htmlwidgets can get large, and if they contain several widgets they can be slow to load.

If vtree is called while an R Markdown or Quarto file is being knitted, it generates a PNG file and automatically embeds it into the knitted document. The resolution of the PNG file in pixels is determined by parameters pxwidth and pxheight. If neither is specified, pxwidth is automatically set to 2000, which provides good resolution for a printed page. The height of the image in the output document can be specified using the imageheight parameter, for example imageheight="4in" for a 4-inch image. There is also an imagewidth parameter. If neither is specified, imageheight is automatically set to 3 inches.
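
For example, a code chunk in an R Markdown or Quarto document might contain the following call (the particular values are arbitrary):

```r
library(vtree)

# A PNG that is 1200 pixels wide, displayed 4 inches tall in the document
vtree(FakeData, "Severity Sex", pxwidth = 1200, imageheight = "4in")
```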

Note: You may notice a warning in the R Markdown rendering (in RStudio, the R Markdown pane) like this:

<unknown>:1919791: Invalid asm.js: Function definition doesn't match use

Although distracting, this message is irrelevant.

Generating an image file but not displaying it

Specifying imageFileOnly=TRUE instructs vtree to generate an image file but not display it.

Generating an htmlwidget in an HTML document

When knitting to an HTML document, htmlwidgets can be used rather than embedding a PNG file. To use htmlwidgets instead of a PNG file simply specify pngknit=FALSE.

Using vtree in Shiny

Thanks to Shiny and the svg-pan-zoom JavaScript library, interactive panning and zooming of a variable tree is possible with the svtree function. The syntax of svtree is the same as that of vtree, but instead of generating a static variable tree, it launches a Shiny app. The mousewheel allows you to zoom in or out. The variable tree can also be dragged to a different position.

Thanks to the panning and zooming functionality in svtree, it is possible to examine larger variable trees than with vtree. In large variable trees it is often useful to show the variable name in each node, since the variable labels (which are shown at the bottom or left-hand margin) may not be visible after zooming. To show the variable name in each node, specify showvarinnode=TRUE.
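
A minimal sketch of such a call, assuming that the packages svtree relies on (such as Shiny) are installed. The interactive() guard is an addition here so that the app is only launched from an interactive session:

```r
library(vtree)

if (interactive()) {
  # Launch a Shiny app with a pannable, zoomable variable tree,
  # showing the variable name in each node
  svtree(FakeData, "Group Sex Severity", showvarinnode = TRUE)
}
```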

Generating a data frame by specifying subset sizes

vtree is designed to generate a variable tree based on a data frame. However, sometimes the sizes of subsets are known but no data frame is available.

The build.data.frame function allows you to build a data frame by specifying the size of subsets. Here’s an example involving pets:

build.data.frame(
  c("pet","breed","size"),
  list("dog","golden retriever","large",5),
  list("cat","tabby","small",2))
  pet            breed  size
1 dog golden retriever large
2 dog golden retriever large
3 dog golden retriever large
4 dog golden retriever large
5 dog golden retriever large
6 cat            tabby small
7 cat            tabby small

In this case there are five large golden retrievers and two small tabby cats. Although a data frame like this could easily be created without using build.data.frame, consider this example:

build.data.frame(
  c("pet","breed","size"),
  list("dog","golden retriever","large",5),
  list("cat","tabby","small",2),
  list("dog","Dalmatian","various",101),
  list("cat","Abyssinian","small",5),
  list("cat","Abyssinian","large",22),
  list("cat","tabby","large",86))
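
For comparison, here is a base-R sketch of the same expansion, in which each specified combination is repeated according to its count. The object names `spec` and `cases` are hypothetical, and this is not necessarily how build.data.frame works internally.

```r
# One row per subset, with its size in column n
spec <- data.frame(
  pet   = c("dog", "cat", "dog", "cat", "cat", "cat"),
  breed = c("golden retriever", "tabby", "Dalmatian",
            "Abyssinian", "Abyssinian", "tabby"),
  size  = c("large", "small", "various", "small", "large", "large"),
  n     = c(5, 2, 101, 5, 22, 86))

# Repeat each row n times to obtain one row per animal
cases <- spec[rep(seq_len(nrow(spec)), times = spec$n),
  c("pet", "breed", "size")]
rownames(cases) <- NULL

nrow(cases)  # 221
```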

Examples

A collection of vtree examples follows.

Rudimentary CONSORT diagrams

Consider the following fictitious data about a randomized controlled trial (RCT):

     id   eligible     randomized group        followup analyzed
1   001   Eligible     Randomized     B     Followed up Analyzed
2   002   Eligible Not randomized  <NA>            <NA>     <NA>
3   003   Eligible     Randomized     A Not followed up     <NA>
4   004   Eligible     Randomized     B     Followed up Analyzed
5   005   Eligible     Randomized     A     Followed up Analyzed
6   006 Ineligible           <NA>  <NA>            <NA>     <NA>
7   007   Eligible     Randomized     A     Followed up Analyzed
8   008 Ineligible           <NA>  <NA>            <NA>     <NA>
9   009   Eligible     Randomized     A     Followed up Analyzed
10 0010 Ineligible           <NA>  <NA>            <NA>     <NA>
11 0011   Eligible     Randomized     B     Followed up Analyzed
12 0012 Ineligible           <NA>  <NA>            <NA>     <NA>

The CONSORT diagram (http://www.consort-statement.org/) shows the flow of patients through the study, starting with those who meet eligibility criteria, then those who are randomized, etc.

Rudimentary CONSORT diagrams (continued)

It is easy to produce a rudimentary version of a CONSORT diagram in vtree. The key step is to prune branches for those who are not eligible, not randomized, etc. This can be done using the keep parameter:

vtree(FakeRCT,
  "eligible randomized group followup analyzed",
  plain=TRUE,
  keep=list(
    eligible="Eligible",
    randomized="Randomized",
    followup="Followed up"),
  horiz=FALSE,
  showvarnames=FALSE,
  title="Assessed for eligibility")

Rudimentary CONSORT diagrams (continued)

Note that this does not include all of the additional information for a full CONSORT diagram (exclusion reasons and counts, as well as numbers of patients who received their allocated interventions, who discontinued intervention, and who were excluded from analysis). It does, however, provide the main flow information.

Rudimentary CONSORT diagrams (continued)

Additional information can be obtained by viewing the nodes for patients in the pruned branches (but not their descendants). The follow parameter makes that easy:

vtree(FakeRCT,
  "eligible randomized group followup analyzed",
  plain=TRUE,
  follow=list(
    eligible="Eligible",
    randomized="Randomized",
    followup="Followed up"),
  horiz=FALSE,
  showvarnames=FALSE,
  title="Assessed for eligibility")

Rudimentary CONSORT diagrams (continued)

Finally, it may be useful to see the ID numbers in each node. This can be done using the summary parameter with the %list% code. Since IDs are less useful in the root node, the %noroot% code is also specified here:

vtree(FakeRCT,
  "eligible randomized group followup analyzed",
  plain=TRUE,
  follow=list(
    eligible="Eligible",
    randomized="Randomized",
    followup="Followed up"),
  horiz=FALSE,
  showvarnames=FALSE,
  title="Assessed for eligibility",
  summary="id \nid: %list% %noroot%")

Examples using R datasets

The datasets package is loaded in R by default. In the following section, vtree is applied to several of these data sets for illustrative purposes. Note that the variable trees generated by the commands below are not shown. The reader can try these commands to see what the variable trees look like, and experiment with many other possibilities.

Hair and eye color

The HairEyeColor data set is an array representing a contingency table (also called a crosstab or crosstabulation). Before vtree can be applied to this data set, it is necessary to convert the table of crosstabulated frequencies to a data frame of cases. For convenience, the vtree package includes a helper function to do this, called crosstabToCases. It is adapted from a function listed on the Cookbook for R website.

hec <- crosstabToCases(HairEyeColor)

# There are lots of combinations 
# but let's say we are especially interested
# in green eyes (as compared to non-green eyes).
# We can use the variable specification Eye=Green to do this:

vtree(hec,"Hair Eye=Green Sex",sameline=TRUE)
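
The conversion from a crosstab to cases can be sketched in base R as follows. This is an illustration of the idea with hypothetical object names, not the source of crosstabToCases.

```r
# One row per cell of the crosstab, with the cell count in column Freq
tab <- as.data.frame(HairEyeColor)

# Repeat each row Freq times to obtain one row per person
cases <- tab[rep(seq_len(nrow(tab)), times = tab$Freq),
  c("Hair", "Eye", "Sex")]
rownames(cases) <- NULL

nrow(cases)                 # 592 people in total
sum(cases$Eye == "Green")   # 64 people with green eyes
```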

Titanic

The Titanic dataset is a 4-dimensional array of counts. First, let’s convert it to a dataframe of individuals:

titanic <- crosstabToCases(Titanic)

# We'll specify `sameline=TRUE` so that the 
# variable tree is a bit more compact:
 
vtree(titanic,"Class Sex Age",summary="Survived=Yes \n%pct% survived",sameline=TRUE)

mtcars

The mtcars data set was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

# The rownames of the data set contain the names of the cars.
# Let's move that information into a column.
# To do that, we'll make a slightly altered version of the data frame

mt <- mtcars
mt$name <- rownames(mt)
rownames(mt) <- NULL

# Now let's look at the mean and standard deviation of horsepower (HP)
# by number of carburetors, nested within number of gears,
# and in turn nested within number of cylinders:

vtree(mt,"cyl gear carb",summary="hp \nmean (SD) HP %mean% (%SD%)")

The above shows the mean and SD of horsepower by (1) number of cylinders; (2) number of gears (within number of cylinders); and (3) number of carburetors (within number of gears nested within number of cylinders). That’s a lot of information.

mtcars (continued)

Suppose instead that we are only interested in number 3 above, i.e. all combinations of number of cylinders, number of gears, and number of carburetors.

In that case, we can specify ptable=TRUE. To make the table a little easier to read, set the number of digits for the mean and SD to zero, and relabel the variables.

vtree(mt,
  "cyl gear carb",
  summary="hp mean (SD) HP %mean% (%SD%)",
  cdigits=0,
  labelvar=c(cyl="# cylinders",gear="# gears",carb="# carburetors"),
  ptable=TRUE)

mtcars (continued)

We might also like to list the names of cars by number of carburetors nested within number of gears:

vtree(mt,
  "gear carb",
  summary="name \n%list%%noroot%",
  splitwidth=50,
  sameline=TRUE,
  labelvar=c(gear="# gears",carb="# carburetors"))

UCBAdmissions

The UCBAdmissions data consists of aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.

According to the data set Details, “This data set is frequently used for illustrating Simpson’s paradox, see Bickel et al. (1975). At issue is whether the data show evidence of sex bias in admission practices. There were 2691 male applicants, of whom 1198 (44.5%) were admitted, compared with 1835 female applicants of whom 557 (30.4%) were admitted.” Furthermore, “the apparent association between admission and sex stems from differences in the tendency of males and females to apply to the individual departments (females used to apply more to departments with higher rejection rates).”

# convert the crosstab data to a data frame of cases
ucb <- crosstabToCases(UCBAdmissions) 

# look at admission rates by Gender, nested within department
vtree(ucb,"Dept Gender",summary="Admit=Admitted \n%pct% admitted",sameline=TRUE)
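
The figures quoted from the data set Details can be checked directly from the crosstab with base R (a sketch; the object names are hypothetical):

```r
# Collapse over department: a 2 x 2 table of Admit by Gender
by_sex <- apply(UCBAdmissions, c(1, 2), sum)

# Overall percentage admitted for each gender
overall <- round(100 * by_sex["Admitted", ] / colSums(by_sex), 1)
overall   # Male 44.5, Female 30.4

# Percentage admitted by gender within each department
by_dept <- round(100 * UCBAdmissions["Admitted", , ] /
  apply(UCBAdmissions, c(2, 3), sum), 1)
by_dept
```

Comparing overall with by_dept shows the pattern the variable tree makes visible: the overall gap between the sexes is not mirrored within the individual departments.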

ChickWeight

The ChickWeight data set is from an experiment on the effect of diet on early growth of chicks. Let’s look at the mean weight of chicks at birth (0 days of age) and 4 days of age, nested within type of diet. A simple variable tree can be produced like this:

vtree(ChickWeight,"Diet Time",
  keep=list(Time=c("0","4")),
  summary="weight \nmean weight %mean%g")
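
The means displayed in the tree can be cross-checked with base R. This sketch uses aggregate and hypothetical object names; rounding to one decimal place is an arbitrary choice.

```r
# Keep only the measurements at 0 and 4 days
cw <- subset(ChickWeight, Time %in% c(0, 4))

# Mean weight for each combination of diet and time
means <- aggregate(weight ~ Diet + Time, data = cw, FUN = mean)
means$weight <- round(means$weight, 1)
means
```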

To make the display a little easier to read, relabel the nodes and the Time variable:

vtree(ChickWeight,"Diet Time",
  keep=list(Time=c("0","4")),
  labelnode=list(
    Diet=c("Diet 1"="1","Diet 2"="2","Diet 3"="3","Diet 4"="4"),
    Time=c("0 days"="0","4 days"="4")),
  labelvar=c(Time="Days since birth"),
  summary="weight \nmean weight %mean%g")

InsectSprays

The InsectSprays data set contains counts of insects in agricultural experimental units treated with different insecticides. Let’s look at those counts by insecticide.

vtree(InsectSprays,
  "spray",
  splitwidth=80,
  sameline=TRUE,
  summary="count \ncounts: %list%%noroot%",
  cdigits=0)

ToothGrowth

The ToothGrowth data set contains the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

Let’s examine the percentage with length > 20 by dose nested within delivery method:

vtree(ToothGrowth,
  "supp dose",
  summary="len>20 \n%pct% length > 20")

To make the display a little easier to read, relabel the variables and the supplement-type nodes:

vtree(ToothGrowth,
  "supp dose",
  summary="len>20 \n%pct% length > 20",
  labelvar=c(
    supp="Supplement type",
    dose="Dose (mg/day)"),
  labelnode=list(supp=c("Vitamin C"="VC","Orange Juice"="OJ")))

For more information

https://nbarrowman.github.io/vtree

https://cran.r-project.org/web/packages/vtree/index.html

https://github.com/nbarrowman/vtree