---
title: "The phytoclass package"
output:
   rmarkdown::html_vignette:
     tabset: true
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{The phytoclass package}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(phytoclass)
```


# Introduction

The *phytoclass* package uses non-negative matrix factorization and simulated annealing to determine the biomass of different phytoplankton groups from pigment concentrations. The methodology is discussed in [@Hayward2023] and is similar to the CHEMTAX method of @Mackey1996.

For ease of use, most functions have default options. However the user also has the option to set their own parameters within the program, and instructions on how to do this are listed for each function. 

It is important to highlight that naming conventions for phytoplankton groups and their pigments should be adhered to when using the default samples. For example, the user should ensure that pigment names in their sample matrix (S) match the same pigment names in the pigment – Chl a ratio matrix (F).

At present the use of DV Chl a with Prochlorococcus is not supported, however it will be in a future release.

The main function of the package is **simulated_annealing()** with an associated helper function **Cluster()**  

Other helper functions covered in this document are:

1.  Matrix_checks
2.  Bounded_weights
3.  Steepest_Desc
4.  NNLS_MF


## Main function: simulated_annealing() {.tabset}

This is the main function for the *phytoclass* package. 

It takes in the inputs (listed below) and returns the updated pigment to Chl a ratios, the Chl a biomass of each phytoplankton group, error associated with each group, and a graph displaying the Chl a concentration for each group.  

It is important that samples are clustered appropriately before using the function (see the Cluster function).

Arguments:

**S** = Sample matrix – a matrix of pigment samples. Ensure that Chl a is the final column

**F** = Pigment to Chl a matrix. If left blank default values will be used. Ensure that pigment columns are in the same order as **S** and column naming conventions match. 

**user_defined_min_max** = If blank default values are used. To create different min_max values, follow the same structure as the phytoclass::min_max file. See the example below. 

**do_matrix_checks** = this should only be set to true when using the default values. This will remove pigment columns that have column sums of 0. Set to FALSE if using customised names for pigments and phytoplankton groups.

**niter** = number of iterations. Default value is 500.

**step** = step ratio used. Default value is 0.009. 

**weight.upper.bound** = the upper limit of the weights applied. Default value is 30. 


When using the default values, the only argument required is the sample matrix. However, make sure that the pigment names match those in the built-in pigment to Chl matrix Fm.

For the examples that follow the argument *niter* equals one for processing speed, but should be set much higher to obtain convergence. 


### Cluster function

Prior to analysis using simulated annealing, pigment samples require clustering.

The *Cluster* function divides all pigment concentrations by the total Chl a concentration. Following this the data undergoes BoxCox transformation, and the data is hierarchically clustered using the Ward method based on the Manhattan distances between pigment samples. The DynamicTreeCut method of [@Langfelder2008] is then used to prune the dendogram into reasonable clusters of specified size(s). 

The function returns a list of the clusters and the cluster dendrogram. 

An example, using the built-in sample data set Sm:

```{r}
Cluster.result <- Cluster(Sm, 14)
```

```{r fig.width=7}
# list of clusters
Cluster.result$cluster.list
# plot of clusters
plot(Cluster.result$cluster.plot)
```


### Example without clustering

The example here uses the built-in sample matrix Sm. 

```{r message=FALSE}
set.seed("7683")
Results <- simulated_annealing(Sm, niter = 1)
```

```{r results}
Results$`condition number`
Results$RMSE
Results$MAE
Results$Error

Results$`F matrix`
Results$`Class abundances`
```

```{r figure-results, fig.width=7}
Results$Figure
```

### Example with clustering

```{r message=FALSE}
Clust1 <- Cluster(Sm, min_cluster_size = 14)$cluster.list[[1]]
# Remove the cluster column/label
Clust1$Clust <- NULL

set.seed("7683")
Results <- simulated_annealing(Clust1, niter = 1)
```


```{r results-clustering}
Results$`condition number`
Results$RMSE
Results$MAE
Results$Error

Results$`F matrix`
Results$`Class abundances`
```

```{r figure-results-clustering, fig.width=7}
Results$Figure
```


### Example using non-default values

```{r}
#Create Fm (F matrix). Alternatively, a .csv file can be uploaded.
#Create Fm (F matrix). Alternatively, a .csv file can be uploaded.
Fu <- data.frame(
  Per = c(0, 0, 0, 0, 1, 0, 0, 0),
  X19but = c(0, 0, 0, 0, 0, 1, 1, 0),
  Fuco = c(0, 0, 0, 1, 0, 1, 1, 0),
  Pra = c(1, 0, 0, 0, 0, 0, 0, 0),
  X19hex = c(0, 0, 0, 0, 0, 1, 0, 0),
  Allo = c(0, 0, 1, 0, 0, 0, 0, 0),
  Zea = c(1, 1, 0, 0, 0, 0, 0, 1),
  Chl_b = c(1, 1, 0, 0, 0, 0, 0, 0),
  Tchla = c(1, 1, 1, 1, 1, 1, 1, 1)
)

rownames(Fu) <- c(
  "Prasinophytes", "Chlorophytes", "Cryptophytes"
  , "Diatoms-2", "Dinoflagellates-1",
  "Haptophytes", "Pelagophytes", "Syn"
)

Min_max <- data.frame(
  Class = c(
    "Syn", "Chlorophytes", "Chlorophytes", "Prasinophytes", "Prasinophytes",
    "Prasinophytes", "Cryptophytes", "Diatoms-2", "Diatoms-2", "Pelagophytes",
    "Pelagophytes", "Pelagophytes", "Dinoflagellates-1", "Haptophytes",
    "Haptophytes", "Haptophytes", "Haptophytes", "Diatoms-2", "Cryptophytes",
    "Prasinophytes", "Chlorophytes", "Syn", "Dinoflagellates-1", "Pelagophytes"
  ),
  Pig_Abbrev = c(
    "Zea", "Zea", "Chl_b", "Pra", "Zea", "Chl_b", "Allo", "Chl_c3",
    "Fuco", "Chl_c3", "X19but", "Fuco", "Per", "X19but", "X19hex",
    "Fuco", "Tchla", "Tchla", "Tchla", "Tchla", "Tchla", "Tchla", "Tchla",
    "Tchla"
  ),
  min = as.numeric(c(
    0.0800, 0.0063, 0.1666, 0.0642, 0.0151, 0.4993, 0.2118, 0.0189,
    0.3315, 0.1471, 0.2457, 0.3092, 0.3421, 0.0819, 0.2107, 0.0090,
    1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000
  )),
  max = as.numeric(c(
    1.2123, 0.0722, 0.9254, 0.4369, 0.1396, 0.9072, 0.5479, 0.1840,
    0.9332, 0.2967, 1.0339, 1.2366, 0.8650, 0.2872, 1.3766, 0.4689,
    1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000
  ))
)
```

```{r message=FALSE}
set.seed("7683")
Results <- simulated_annealing(
  S = Sm, 
  F = Fu,
  user_defined_min_max = Min_max,
  do_matrix_checks = TRUE,
  niter = 1,
  step = 0.01,
  weight.upper.bound = 30
)
```

```{r message=FALSE}
set.seed("7683")
Results <- simulated_annealing(
  S = Sm, 
  F = Fu,
  user_defined_min_max = Min_max,
  do_matrix_checks = TRUE,
  niter = 1,
  step = 0.01,
  weight.upper.bound = 30
)
```
 
 
```{r results-not-default}
Results$`condition number`
Results$RMSE
Results$MAE
Results$Error

Results$`F matrix`
Results$`Class abundances`
```

```{r figure-results-not-default, fig.width=7}
Results$Figure
```


# Helper functions {.tabset}

## Matrix_checks 

This function will perform checks on the **F** and **S** matrices. If the columns sum is 0 or very small it will be eliminated from both the S and F matrix. To use this function, naming for both pigments and phytoplankton  classes must follow the same conventions as the default values. The output of this function will be new **S** and **F** matrices.

Arguments:

S = S matrix

F = F matrix

```{r}
MC <- Matrix_checks(Sm, Fm)  
Snew <- MC$Snew
Fnew <- MC$Fnew  
```


## Steep_Desc 

This algorithm will perform the steepest descent algorithm without being bound by any limits. The results may show unrealistic pig:Chl *a* ratios for certain groups, although the error can often be very low. 

Arguments:

F = F matrix

S = S matrix

num.loops = the number of loops/iterations to perform

```{r message=FALSE}
MC <- Matrix_checks(Sm, Fm)  
Snew <- MC$Snew
Fnew <- MC$Fnew
SDRes <- Steepest_Desc(Fnew, Snew, num.loops = 10)
```


## Bounded_weights 

This function determines the weights to use for the S matrix, for which the default upper bound is 30 in the simulated_annealing() function call. 

Arguments:

S = sample matrix

weight.upper.bound = maximum weights upper bound

```{r}
Bounded_weights(Sm, weight.upper.bound = 30)
```


## NNLS_MF 

This performs non-negative least least squares matrix factorization between the **S** and **F** matrices. Ensure that Chl *a* is the final column and that the columns in **S** and **F** are in the same order. If the weights are left blank then no weights will be used.

Arguments:

Fn = F matrix

S = sample matrix

cm = weights for each column

```{r}
MC <- Matrix_checks(Sm, Fm)  
Snew <- MC$Snew
Fnew <- MC$Fnew
cm <- Bounded_weights(Snew, weight.upper.bound = 30)

Results <- NNLS_MF(Fnew, Snew, cm)
Results$`F matrix`
Results$RMSE
Results$`C matrix`
```

# References