April, 2018

Frequency of a univariate variable

Packages

  • plotrix
  • vioplot
  • vcd

Introduction

* Examples of categorical variables: gender, supported candidate, blood type, etc.
    
    + Gender is a categorical variable with two levels, 
    male and female.
    + Blood type has four levels: A, B, AB, and O.

* Summarization of categorical data: frequency, relative frequency 
(cf. What tools can we use to summarize continuous data?)

* Visualization of categorical data:

    + Barplot
    + Pie chart
    + Other useful tools?

barplot

  • How many states are in the U.S.?
    • Print state.region.
counts = table(state.region)
counts
barplot(counts, main = "simple bar chart", 
xlab = "region", ylab = "freq")
  • table() function: computes the frequency of each level in state.region.
  • barplot() function: draws a bar plot from the counts variable.
  • xlab and ylab options set the axis labels.
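
The introduction also lists relative frequency as a summary. As a minimal sketch using only base R, prop.table() turns the counts into proportions that sum to 1. The variable name rel.freq is introduced here for illustration and does not appear in the original slides.

```r
# Counts of states per region (state.region ships with base R)
counts <- table(state.region)

# Relative frequencies: each count divided by the total number of states
rel.freq <- prop.table(counts)
round(rel.freq, 2)

# Same kind of bar chart, but with proportions on the y-axis
barplot(rel.freq, main = "relative frequency by region",
        xlab = "region", ylab = "relative freq")
```

The shape of the chart is unchanged; only the scale of the y-axis differs.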

barplot

## state.region
##     Northeast         South North Central          West 
##             9            16            12            13

barplot

  • mtcars has both continuous and categorical variables.

  • Check the names of the cars.
  • Draw a bar plot of the cyl variable in the mtcars dataset.

barplot

freq.cyl = table(mtcars$cyl)
barplot(freq.cyl, main = "simple bar chart", col = "orange")

barplot

  • The names.arg option takes a vector of level names and prints them under the bars.
  • Check the structure of freq.cyl.
  • Since the length of freq.cyl is 3, we supply a name vector of length 3.
cyl.name =  c("4 cyl", "6 cyl", "8 cyl")
barplot(freq.cyl, main = "simple bar chart", col ="orange",
names.arg = cyl.name)

Pie chart

  • pie() function draws a pie chart from categorical data.
  • The function takes a frequency vector, usually produced by the table() function.
  • By default pie() labels each slice with the level name only, so custom labels are often supplied to show percentages as well, as in Excel.
pct = round(100 * freq.cyl / sum(freq.cyl))
cyl.name2 = paste0(cyl.name, " (", pct, "%)")
pie(freq.cyl, labels = cyl.name2, 
    col = rainbow(length(freq.cyl)), main = "pie chart")
  • paste0() function returns a character vector formed by concatenating its inputs element by element. Check the output of paste0() and compare it with the related function paste().
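
To make the comparison between paste0() and paste() concrete, here is a small base R sketch. The strings and the number 34 are illustrative values only, and the variable names a, b, c0 and lab are introduced just for this example.

```r
# paste0() joins its arguments with no separator
a <- paste0("4 cyl", "(", 34, "%)")
a

# paste() inserts a separator between arguments, a single space by default
b <- paste("4 cyl", "(", 34, "%)")
b

# sep = "" makes paste() behave like paste0()
c0 <- paste("4 cyl", "(", 34, "%)", sep = "")
c0

# Both functions are vectorized: one label per element
lab <- paste0(c("A", "B"), "-", 1:2)
lab
```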

3D-pie chart

  • pie3D() in the plotrix package draws a three-dimensional pie chart.
library(plotrix)
pie3D(freq.cyl, labels = cyl.name2, explode = 0.1, main = "3d pie plot")

  • The explode option controls the gap between the slices of the pie.

fan plot

fan.plot() draws a fan plot, an alternative to the pie chart. It displays the values as overlapping sectors, which makes the relative frequencies of the levels easier to compare than in a pie chart.

fan.plot(freq.cyl, labels = cyl.name2, main = "Fan plot")

Frequency of multivariate variables

Introduction

  • Analysis of data with two or more categorical variables.
    • Example: (Blood type, gender), (Treatment, Improvement)
  • Visualization of a frequency table
    • Frequency table: xtabs()
    • Visualization: barplot(), spine()

Frequency table

library(vcd)
head(Arthritis, n = 3)
##   ID Treatment  Sex Age Improved
## 1 57   Treated Male  27     Some
## 2 46   Treated Male  29     None
## 3 77   Treated Male  30     None
  • Data:
    • ID, Treatment, Sex, Age, Improved

Frequency table

my.table <- xtabs( ~ Treatment + Improved, data = Arthritis)
my.table
##          Improved
## Treatment None Some Marked
##   Placebo   29    7      7
##   Treated   13    7     21
  • Frequency analysis: cross table of treatment and improvement
    • formula: ~ Treatment + Improved, where Treatment is the row variable and Improved is the column variable.
  • xtabs() produces a cross table from categorical variables.
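
Raw counts are hard to compare when the groups differ in size. The following sketch rebuilds the printed cross table by hand, so that it runs without the vcd package, and then adds margins and row proportions with base R. The names tab and row.prop are assumptions made for this example.

```r
# Rebuild the printed cross table (matrix() fills column by column)
tab <- matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
              dimnames = list(Treatment = c("Placebo", "Treated"),
                              Improved  = c("None", "Some", "Marked")))
tab <- as.table(tab)

# Row and column totals
addmargins(tab)

# Row proportions: distribution of Improved within each Treatment group
row.prop <- prop.table(tab, margin = 1)
round(row.prop, 2)
```

With margin = 1 each row sums to 1, so the two treatment groups can be compared directly.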

barplot

barplot( my.table,
         xlab = "Improved", ylab = "Frequency", legend.text = TRUE,
         col = c("green", "red"))

* Better display? Swap the rows and columns of the cross table.

barplot

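barplot( t(my.table),
         xlab = "Treatment", ylab = "Frequency", legend.text = TRUE,
         col = c("green", "red", "orange"))
  • Better display? Swap the rows and columns of the cross table.
    • t(my.table)
    • After transposing, the bars correspond to Treatment and the stacked segments to the three levels of Improved, so three colors are required in col=.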
t(my.table)
##         Treatment
## Improved Placebo Treated
##   None        29      13
##   Some         7       7
##   Marked       7      21

barplot

  • Seat-belt example
tmp = c("buckled", "unbuckled")
belt <- matrix( c(58, 2, 8, 16), ncol = 2, 
                dimnames = list(parent = tmp, child = tmp))
belt
##            child
## parent      buckled unbuckled
##   buckled        58         8
##   unbuckled       2        16

barplot

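  • spine(): this function draws a spine plot, a special case of the mosaic plot, in which the width of each bar denotes the marginal probability of the first variable and the height of each segment denotes the conditional probability of the second variable.
  • Thus, the area of each rectangle indicates the joint probability. Moreover, we can easily check the independence of the two variables: under independence, the segment heights are the same in every bar.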
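
The independence check can also be done numerically. As a rough sketch in base R, the code below compares the observed joint proportions of the seat-belt table with the products of the marginal proportions, which is what the joint proportions would be under independence. The names joint, p.parent, p.child and indep are introduced here for illustration.

```r
tmp <- c("buckled", "unbuckled")
belt <- matrix(c(58, 2, 8, 16), ncol = 2,
               dimnames = list(parent = tmp, child = tmp))

n <- sum(belt)
joint <- belt / n               # observed joint proportions

p.parent <- rowSums(joint)      # marginal proportions for parents
p.child  <- colSums(joint)      # marginal proportions for children

# Joint proportions implied by independence: product of the marginals
indep <- outer(p.parent, p.child)

round(joint, 3)
round(indep, 3)
```

A large gap between the two tables suggests that the two variables are not independent.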

barplot

library(vcd)
spine(belt, main="spine plot for child seat-belt usage",
      gp = gpar(fill = c("green", "red")))

continuous variables and visualization

introduction

  • EDA for continuous variables
    • examples: height, weight, stock index…
    • summarization: mean (average), median, quantiles…
    • Visualization tools: boxplot, histogram, violin plot, etc.

boxplot

x = rnorm(100)
boxplot(x, main = "boxplot", col ='lightblue')

boxplot

  • middle black line: median
    • location measure
  • box: lower quartile (25%: Q1), upper quartile (75%: Q3)
    • the size of the box is a dispersion measure of the data (IQR: interquartile range, Q3 - Q1)
  • whisker: the upper and lower whiskers denote

\[ \max \{ x_i: x_i \leq Q3 + 1.5\times(Q3-Q1) \}\] and

\[ \min \{ x_i: x_i \geq Q1 - 1.5\times (Q3-Q1) \},\] respectively.
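
The whisker positions can be reproduced by hand. In this sketch the upper whisker is the largest observation not above Q3 + 1.5 IQR and the lower whisker is the smallest observation not below Q1 - 1.5 IQR. Note that R's boxplot() uses the hinges returned by fivenum() as Q1 and Q3, which can differ slightly from quantile(). The seed and the variable names are assumptions made for the example.

```r
set.seed(1)                    # arbitrary seed, only for reproducibility
x <- rnorm(100)

h <- fivenum(x)                # min, lower hinge, median, upper hinge, max
iqr <- h[4] - h[2]             # box height

# Whiskers: most extreme observations still inside the 1.5 * IQR fences
upper <- max(x[x <= h[4] + 1.5 * iqr])
lower <- min(x[x >= h[2] - 1.5 * iqr])
c(lower, upper)

# The same two values as computed by R for the boxplot
boxplot.stats(x)$stats[c(1, 5)]
```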

boxplot

  1. If the data follow the standard normal distribution, what are the values of the upper and lower whiskers in the boxplot?

histogram

x = faithful$waiting
hist(x, nclass = 8)

histogram

  • hist() produces a histogram.
    • The histogram depends on how the bins are chosen.
    • The nclass option is usually used to suggest the number of bins.
    • We can set the bins freely with the breaks option.
    • We can put the relative frequency (density) on the y-axis with the probability = TRUE option.
  • Note that a histogram is a visualization tool for the probability distribution of continuous data.
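
To see what the bin options do, it can help to inspect the object that hist() returns. The sketch below uses plot = FALSE to skip drawing and looks at the counts and densities for nine equal-width bins. The names b and h are chosen only for this example.

```r
x <- faithful$waiting

# Ten break points give nine equal-width bins over the data range
b <- seq(min(x), max(x), length.out = 10)
h <- hist(x, breaks = b, plot = FALSE)

h$counts     # number of observations in each bin
h$density    # bar heights used when probability = TRUE

# On the density scale the bar areas add up to 1
sum(h$density * diff(h$breaks))
```

This is why a histogram drawn with probability = TRUE can be compared directly with a density curve.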

histogram

x = faithful$waiting
hist(x, breaks = seq(min(x), max(x), length.out = 10),
     probability = TRUE)

histogram

  • density() function returns a kernel density estimate.
x = faithful$waiting
hist(x, nclass = 10, probability = TRUE)
lines(density(x), col = "red", lwd = 2)
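
A kernel density estimate depends on its bandwidth in much the same way that a histogram depends on its bins. As a hedged sketch, the code below compares the default bandwidth with a wider and a narrower one through the adjust argument of density(). The factors 2 and 0.5 are arbitrary choices for illustration.

```r
x <- faithful$waiting

d.default <- density(x)               # Gaussian kernel, default bandwidth
d.smooth  <- density(x, adjust = 2)   # twice the default bandwidth
d.rough   <- density(x, adjust = 0.5) # half the default bandwidth

d.default$bw                          # bandwidth actually used

hist(x, nclass = 10, probability = TRUE)
lines(d.default, col = "red", lwd = 2)
lines(d.smooth, col = "blue", lty = 2)
lines(d.rough, col = "darkgreen", lty = 3)
```

A larger bandwidth gives a smoother curve that may blur the two modes, while a smaller one follows the data more closely.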

violin plot

  • A violin plot combines the advantages of the boxplot (quantiles) and the histogram (distribution shape).
library(vioplot)
x = rpois(1000, lambda = 3)
vioplot(x, col = "lightblue")

visualization for multivariate variables

multiple boxplot

  • With boxplot() we can compare summaries of a continuous variable across the levels of a categorical variable.
  • mpg~cyl means that mpg is placed on the y-axis and cyl on the x-axis.
attach(mtcars)
boxplot(mpg~cyl, data = mtcars, names = c('4 cyl','6 cyl', '8 cyl'),
        main = "MPG dist by cylinder")
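
The numbers behind the grouped boxplot can be obtained with tapply(). This sketch computes the median and the five-number summary of mpg for each level of cyl, and draws the same plot without attach(), since the data argument is enough for the formula interface. The name med is introduced here for illustration.

```r
# Median mpg for each number of cylinders
med <- tapply(mtcars$mpg, mtcars$cyl, median)
med

# Five-number summary (min, hinges, median, max) per group
tapply(mtcars$mpg, mtcars$cyl, fivenum)

# Same boxplot, with labels built from the level names
boxplot(mpg ~ cyl, data = mtcars,
        names = paste(names(med), "cyl"),
        main = "MPG dist by cylinder")
```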

multiple histogram

  • Poor visualization: histograms overlaid with add = TRUE hide one another.
hist(mpg[cyl==4], xlab="MPG", main = "MPG dist by cylinder",
     xlim = c(5, 40),  ylim = c(0,10), col = 'lightblue',
     nclass = trunc(sqrt(length(mpg[cyl==4]))))
hist(mpg[cyl==6], xlab="MPG", main = "MPG dist by cylinder",
     xlim = c(5, 40),  ylim = c(0,10), col = 'orange',
     nclass = trunc(sqrt(length(mpg[cyl==6]))), add= TRUE)
hist(mpg[cyl==8], xlab="MPG", main = "MPG dist by cylinder",
     xlim = c(5, 40),  ylim = c(0,10), col = 'red',
     nclass = trunc(sqrt(length(mpg[cyl==8]))), add= TRUE)

multiple histogram

Example of poor visualization

multiple histogram

  • Use a vertical layout with the mfrow option of par().

  • Use the same xlim in every panel for a fair comparison of locations.

par(mfrow = c(3,1))
hist(mpg[cyl==4], xlab="MPG", main = "MPG dist by cylinder",
     xlim = c(5, 40),  ylim = c(0,10), col = 'lightblue',
     nclass = trunc(sqrt(length(mpg[cyl==4]))))
hist(mpg[cyl==6], xlab="MPG", main = "MPG dist by cylinder",
     xlim = c(5, 40),  ylim = c(0,10), col = 'orange',
     nclass = trunc(sqrt(length(mpg[cyl==6]))))
hist(mpg[cyl==8], xlab="MPG", main = "MPG dist by cylinder",
     xlim = c(5, 40),  ylim = c(0,10), col = 'red',
     nclass = trunc(sqrt(length(mpg[cyl==8]))))
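
The three nearly identical hist() calls can be written as a loop. The following is one possible sketch, not the original author's code: it uses common break points for all panels (an assumed choice, brk), avoids attach() by indexing mtcars directly, and restores the previous graphics layout at the end.

```r
old.par <- par(mfrow = c(3, 1))   # three stacked panels; keep old settings
brk <- seq(10, 35, by = 2.5)      # common bins (assumed choice) covering mpg

h.list <- list()
for (k in sort(unique(mtcars$cyl))) {
  y <- mtcars$mpg[mtcars$cyl == k]
  # Same bins and axis limits in every panel for a fair comparison
  h.list[[as.character(k)]] <- hist(y, breaks = brk, xlab = "MPG",
                                    main = paste(k, "cylinders"),
                                    xlim = c(5, 40), ylim = c(0, 10),
                                    col = "lightblue")
}

par(old.par)                      # restore the previous layout
```

Sharing the break points, and not only xlim, keeps the bar widths comparable across panels.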

multiple densities

  • Display multiple density functions simultaneously.
plot(density(mpg[cyl==4]), xlab="MPG", main = "MPG dist by cylinder",
     xlim = c(5, 40), ylim = c(0.,0.26))
lines(density(mpg[cyl==6]), col = "red", lty = 2)
lines(density(mpg[cyl==8]), col = "blue", lty = 3)      
legend("topright", paste(c(4,6,8), "Cylinder"),
       col = c("black","red", "blue"),
       lty = c(1,2,3), lwd = 3, bty ="n")

Exercise

  • Seat-belt example
tmp = c("buckled", "unbuckled")
belt <- matrix( c(58, 2, 8, 16), ncol = 2, 
                dimnames = list(parent = tmp, child = tmp))
belt
##            child
## parent      buckled unbuckled
##   buckled        58         8
##   unbuckled       2        16

Exercise

  • Use a visualization to support the argument that parents' seat-belt behavior can affect that of their children.

Exercise

barplot( t(belt), main = "Stacked Bar chart for child seat-belt usage",
         xlab = "parent", ylab = "Frequency", legend.text = TRUE,
         col = c("green", "red"))
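
Because far more parents are buckled than unbuckled, the stacked counts are dominated by the first bar. One possible refinement, sketched here with base R only, is to plot the conditional proportions of the child's behavior given the parent's behavior. The name cond is introduced for this example.

```r
tmp <- c("buckled", "unbuckled")
belt <- matrix(c(58, 2, 8, 16), ncol = 2,
               dimnames = list(parent = tmp, child = tmp))

# Proportion of children buckled or unbuckled within each parent group
cond <- prop.table(belt, margin = 1)
round(cond, 3)

# Each bar now has height 1, so the two parent groups compare directly
barplot(t(cond), main = "Child seat-belt usage given parent usage",
        xlab = "parent", ylab = "Proportion of children",
        legend.text = TRUE, col = c("green", "red"))
```

If the parents' behavior had no effect, the two bars would be split in roughly the same proportions.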