---
title: "A quick tour of mclustAddons"
author: "Luca Scrucca"
date: "`r format(Sys.time(), '%d %b %Y')`"
output:
rmarkdown::html_vignette:
toc: true
number_sections: false
css: "vignette.css"
vignette: >
%\VignetteIndexEntry{A quick tour of mclustAddons}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(fig.align = "center",
out.width = "80%",
fig.width = 7,
fig.height = 6,
dev.args = list(pointsize=12),
par = TRUE, # needed for setting hook
collapse = TRUE, # collapse input & output code in chunks
warning = FALSE)
knit_hooks$set(par = function(before, options, envir)
{ if(before && options$fig.show != "none")
par(family = "sans", mar=c(4.1,4.1,1.1,1.1), mgp=c(3,1,0), tcl=-0.5)
})
set.seed(1) # for exact reproducibility
```
# Introduction
**mclustAddons** is a contributed R package that extends the functionality available in the **mclust** package (Scrucca et al. 2016, Scrucca et al. 2023).
In particular, the following methods are included:
- density estimation for data with bounded support (Scrucca, 2019);
- modal clustering using modal EM algorithm for Gaussian mixtures (Scrucca, 2021);
- entropy estimation via Gaussian mixture modeling (Robin & Scrucca, 2023).
This document gives a quick tour of **mclustAddons** (version `r packageVersion("mclustAddons")`). It was written in [R Markdown](https://rmarkdown.rstudio.com), using the [knitr](https://cran.r-project.org/package=knitr) package for production.
References on the methodologies implemented are provided at the end of this document.
```{r}
library(mclustAddons)
```
# Density estimation for data with bounded support
## Univariate case with lower bound
```{r}
x <- rchisq(200, 3)
xgrid <- seq(-2, max(x), length=1000)
f <- dchisq(xgrid, 3) # true density
dens <- densityMclustBounded(x, lbound = 0)
summary(dens, parameters = TRUE)
plot(dens, what = "density")
lines(xgrid, f, lty = 2)
plot(dens, what = "density", data = x, breaks = 15)
```
## Univariate case with lower & upper bounds
```{r}
x <- rbeta(200, 5, 1.5)
xgrid <- seq(-0.1, 1.1, length=1000)
f <- dbeta(xgrid, 5, 1.5) # true density
dens <- densityMclustBounded(x, lbound = 0, ubound = 1)
summary(dens, parameters = TRUE)
plot(dens, what = "density")
plot(dens, what = "density", data = x, breaks = 11)
```
## Bivariate case with lower bounds
```{r}
x1 <- rchisq(200, 3)
x2 <- 0.5*x1 + sqrt(1-0.5^2)*rchisq(200, 5)
x <- cbind(x1, x2)
dens <- densityMclustBounded(x, lbound = c(0,0))
summary(dens, parameters = TRUE)
plot(dens, what = "BIC")
plot(dens, what = "density")
points(x, cex = 0.3)
abline(h = 0, v = 0, lty = 3)
plot(dens, what = "density", type = "hdr")
abline(h = 0, v = 0, lty = 3)
plot(dens, what = "density", type = "persp")
```
## Suicide data
The data consist in the lengths of 86 spells of psychiatric treatment undergone by control patients in a suicide study (Silverman, 1986).
```{r}
data("suicide")
dens <- densityMclustBounded(suicide, lbound = 0)
summary(dens, parameters = TRUE)
plot(dens, what = "density",
lwd = 2, col = "dodgerblue2",
data = suicide, breaks = 15,
xlab = "Length of psychiatric treatment")
rug(suicide)
```
## Racial data
This dataset provides the proportion of white student enrollment in 56 school districts in Nassau County (Long Island, New York), for the 1992-1993 school year (Simonoff 1996, Sec. 3.2).
```{r}
data("racial")
x <- racial$PropWhite
dens <- densityMclustBounded(x, lbound = 0, ubound = 1)
summary(dens, parameters = TRUE)
plot(dens, what = "density",
lwd = 2, col = "dodgerblue2",
data = x, breaks = 15,
xlab = "Proportion of white student enrolled in schools")
rug(x)
```
# Modal clustering using MEM algorithm for Gaussian mixtures
## Simulated datasets
```{r}
data(Baudry_etal_2010_JCGS_examples, package = "mclust")
GMM <- Mclust(ex4.1)
plot(GMM, what = "classification")
MEM <- MclustMEM(GMM)
summary(MEM)
plot(MEM)
plot(MEM, addPoints = FALSE)
```
```{r}
GMM <- Mclust(ex4.4.2)
plot(GMM, what = "classification")
MEM <- MclustMEM(GMM)
summary(MEM)
plot(MEM)
plot(MEM, addDensity = FALSE)
```
# Entropy estimation
## Simulated data
**Univariate Gaussian**
```{r, out.width="50%"}
EntropyGauss(1) # population entropy
x = rnorm(1000) # generate sample
EntropyGauss(var(x)) # sample entropy assuming Gaussian distribution
mod = densityMclust(x, plot = FALSE)
EntropyGMM(mod) # GMM-based entropy estimate
plot(mod, what = "density", data = x, breaks = 31); rug(x)
```
**Univariate Mixed-Gaussian**\
Consider the mixed-Gaussian distribution $f(x) = 0.5 \times N(-2,1) + 0.5 \times N(2,1)$, whose entropy is 2.051939 in the population.
```{r, out.width="50%"}
cl = rbinom(1000, size = 1, prob = 0.5)
x = ifelse(cl == 1, rnorm(1000, 2, 1), rnorm(1000, -2, 1)) # generate sample
mod = densityMclust(x, plot = FALSE)
EntropyGMM(mod) # GMM-based entropy estimate
plot(mod, what = "density", data = x, breaks = 31); rug(x)
```
**Multivariate Chi-squared**\
Consider a 10-dimensional independent $\chi^2$ distribution, whose entropy is 24.23095 in the population.
```{r}
x = matrix(rchisq(1000*10, df = 5), nrow = 1000, ncol = 10)
mod1 = densityMclust(x, plot = FALSE)
EntropyGMM(mod1) # GMM-based entropy estimate, not too bad but...
mod2 = densityMclustBounded(x, lbound = rep(0,10))
EntropyGMM(mod2) # much more accurate
```
## Faithful data
```{r}
data(faithful)
mod = densityMclust(faithful, plot = FALSE)
EntropyGMM(mod) # GMM-based entropy estimate
# or provide the data and fit GMM implicitly
EntropyGMM(faithful)
```
## Iris data
```{r}
data(iris)
mod = densityMclust(iris[,1:4], plot = FALSE)
EntropyGMM(mod) # GMM-based entropy estimate
```
# Volatility analysis of financial log-returns
## Gold price 2023 data
```{r}
data(gold)
head(gold)
# GMM modeling
mod = GMMlogreturn(gold$log.returns)
summary(mod)
plot(mod, what = "BIC")
plot(mod, what = "density", data = gold$log.returns)
plot(mod, what = "diagnostic")
# compare to single Gaussian model
mod1 = GMMlogreturn(gold$log.returns, G = 1)
y0 = extendrange(mod$data, f = 0.1)
y0 = seq(min(y0), max(y0), length = 1000)
plot(mod, what = "density", data = gold$log.returns, col = "steelblue",
xlab = "Gold price log-returns", ylab = "Density")
lines(y0, predict(mod1, what = "dens", newdata = y0), col = "red3")
legend("topright", legend = c("Gaussian", "GMM"), lty = c(1,1),
col = c("red3", "steelblue"), inset = 0.02)
```
# References
Scrucca L., Fraley C., Murphy T. B. and Raftery A. E. (2023) *Model-Based Clustering, Classification, and Density Estimation Using mclust in R*. Chapman & Hall/CRC, ISBN: 978-1032234953, https://mclust-org.github.io/book/
Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. *The R Journal* 8/1, pp. 289-317.
https://doi.org/10.32614/RJ-2016-021
Scrucca L. (2019) A transformation-based approach to Gaussian mixture density estimation for bounded data, *Biometrical Journal*, 61:4, 873–888. https://doi.org/10.1002/bimj.201800174
Scrucca L. (2021) A fast and efficient Modal EM algorithm for Gaussian mixtures. *Statistical Analysis and Data Mining*, 14:4, 305–314. https://doi.org/10.1002/sam.11527
Robin S. and Scrucca L. (2023) Mixture-based estimation of entropy. *Computational Statistics & Data Analysis*, 177, 107582. https://doi.org/10.1016/j.csda.2022.107582
----
```{r}
sessionInfo()
```