---
title: "ntdr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ntdr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `ntdr` package is an easy way to access [National Transit Database](https://www.transit.dot.gov/ntd/ntd-data) data from R. The package is available on CRAN, or you can install the latest development version from GitHub with `remotes::install_github("vgXhc/ntdr")`.

```{r eval=FALSE}
install.packages("ntdr")
```

In addition to loading the `ntdr` package, we also load `dplyr` and `ggplot2`.

```{r setup}
library(ntdr)
library(dplyr)
library(ggplot2)
```

# `get_ntd()`
`get_ntd()` is the main function of the package. It doesn't have any required parameters:

```{r bare-get-ntd-call}
ntd_data <- get_ntd()
ntd_data
colnames(ntd_data)
```

By default, the package downloads what the NTD calls "Complete Monthly Ridership (with adjustments and estimates)." Alternatively, you can request `raw` data ("Raw Monthly Ridership (No Adjustments or Estimates)"). The complete data include the most recent data available. The downside is that, depending on the agency, a value may be based on a growth estimate, and there is currently no way to tell whether a value comes from an agency's report or from the growth estimate. Very recent values and values for smaller agencies should therefore be interpreted with caution, as they may be revised later. The `raw` data are the most reliable, but that dataset is only released once a year.

The package downloads a fairly large `xlsx` file from the web and returns a tibble with `r nrow(ntd_data)` rows and `r ncol(ntd_data)` columns. The first two columns are identifiers for the transit agency, followed by a human-readable agency name. Note that the agency name may not be what you expect. For example, the name of our local agency in Madison (Wisconsin) is ["Metro Transit"](https://www.cityofmadison.com/metro), but in the NTD data it is listed as "City of Madison". If you cannot find your agency by name, use the `uza_name` variable described below.
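
Looking up an agency by its urbanized area can be sketched on a small invented table, which avoids a second download here. Only the column names `agency` and `uza_name` come from the data described above; the rows themselves, including the exact spelling of the area names, are made up for illustration. With real data, the same pipeline would run on `ntd_data`.

```{r find-agency-sketch}
library(dplyr)

# Toy stand-in for the get_ntd() result; all values are invented for illustration
toy_ntd <- tibble(
  agency   = c("City of Madison", "City of Madison",
               "Capital Area Transportation Authority"),
  uza_name = c("Madison, WI", "Madison, WI", "Lansing, MI")
)

# Match on the urbanized area name, then list the distinct agency names found there
madison_agencies <- toy_ntd |>
  filter(grepl("Madison", uza_name, fixed = TRUE)) |>
  distinct(agency, uza_name)

madison_agencies
```

Matching on a substring rather than the full `uza_name` is a deliberate choice in this sketch, since the exact formatting of area names may differ from what you expect.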

NTD data go back as far as 2002, and some agencies no longer actively report data, report them under a different ID, or no longer exist. This is reflected in the `active` column. Until April 2024, the `active` value applied to the agency as a whole. More recently, the data distinguish between different types of service (see below) within an agency, which limits the usefulness of the variable. For example, an agency may still actively report data on its directly operated bus service but no longer run (and therefore no longer report on) light rail service.
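
Because `active` can now differ between services of the same agency, it may help to look at it per agency, mode, and type of service rather than per agency. The following is a minimal sketch on invented rows; the `"Active"` and `"Inactive"` labels are an assumption, so check the values in your own download before filtering on them.

```{r active-by-service-sketch}
library(dplyr)

# Toy rows; the "Active"/"Inactive" labels are assumed, not taken from the NTD file
toy_active <- tibble(
  agency = c("Agency A", "Agency A", "Agency A", "Agency B"),
  modes  = c("MB", "MB", "LR", "MB"),
  tos    = c("DO", "DO", "DO", "PT"),
  active = c("Active", "Active", "Inactive", "Active")
)

# One row per agency, mode, and type of service, with its reporting status
service_status <- toy_active |>
  distinct(agency, modes, tos, active) |>
  arrange(agency, modes, tos)

service_status
```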

The `reporter_type` variable most commonly takes on the `Full reporter` value, but smaller or rural systems in particular may have a different value. For agencies that aren't full reporters, the NTD data may include projections rather than actually reported data. Even for full reporters, the most recent data may be based on estimates and may be corrected in future data releases.

`uza` is an identifier for [urbanized areas](https://www.fhwa.dot.gov/planning/census_issues/urbanized_areas_and_mpo_tma/faq/page01.cfm#q2) and `uza_name` has the name of that area (this is usually the easiest way to find your local agency).

Since July 2023, an additional identifier, `uace`, is included. In future versions of the dataset, this identifier will likely replace the `uza` variable.

`modes` denotes the type of transit reported on. 

```{r}
ntd_data |>
  count(modes)
```

There are a lot of different modes, including rather obscure ones like "Inclined Plane" (`IP`) or "Alaska Railroad" (`AR`). You can find documentation of the different modes [here](https://ftis.org/iNTD-Urban/modes.pdf). Since July 2023, a simplified variable for mode is included, `modes_simplified`, which only distinguishes between bus, rail, ferry, and other.

The `tos` variable represents the "type of service":

```{r}
ntd_data |>
  count(tos, sort = TRUE)
```

The most common values are `DO`, which is directly operated service, i.e. a transit agency running its own service, and `PT` for "purchased transportation", i.e. a transit agency contracting out services. Often agencies will have an entry for both of these, with `DO` being the regular, fixed route service and `PT` being paratransit or other more specialized forms of transit.
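
When an agency reports both `DO` and `PT` rows, you may want one combined figure per month instead of two. A minimal sketch of that aggregation, using invented numbers and assuming `month` is stored as a `Date`:

```{r combine-tos-sketch}
library(dplyr)

# Toy data: one agency, two months, both types of service (values are invented)
toy_tos <- tibble(
  agency = rep("Agency A", 4),
  month  = as.Date(c("2023-01-01", "2023-01-01", "2023-02-01", "2023-02-01")),
  tos    = c("DO", "PT", "DO", "PT"),
  value  = c(1000, 50, 1100, 60)
)

# Sum across types of service so each agency has one value per month
total_by_month <- toy_tos |>
  group_by(agency, month) |>
  summarise(value = sum(value), .groups = "drop")

total_by_month
```

With real data you would probably add `modes` to `group_by()` unless you also want to combine different modes into one total.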

Finally, the `month` and `value` variables provide the actual transit data for a given month. The `ntd_variable` column indicates which metric `value` represents. If you call `get_ntd()` without any additional parameters, it will return the "unlinked passenger trips" (UPT) metric for all agencies, modes, and types of service.


# Plot the data
The data are returned in a long format, which makes it easy to create plots:

```{r ridership-chart}
get_ntd(agency = c("City of Madison", "Capital Area Transportation Authority"), modes = "MB") |>
  dplyr::filter(tos == "DO") |>
  ggplot(aes(month, value, color = agency)) +
  geom_line() +
  labs(title = "Monthly unlinked passenger trips in Madison and Lansing") +
  theme_minimal()
```
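
Monthly series can be noisy, so you may prefer annual totals before plotting. The sketch below uses invented values and assumes `month` is a `Date` column; it also counts how many months went into each total, because the most recent year is usually incomplete.

```{r annual-totals-sketch}
library(dplyr)

# Toy monthly data spanning two calendar years (values are invented)
toy_monthly <- tibble(
  agency = "Agency A",
  month  = as.Date(c("2022-11-01", "2022-12-01", "2023-01-01", "2023-02-01")),
  value  = c(900, 950, 1000, 1100)
)

# Extract the calendar year, then total the values and count contributing months
annual_totals <- toy_monthly |>
  mutate(year = as.integer(format(month, "%Y"))) |>
  group_by(agency, year) |>
  summarise(value = sum(value), n_months = n(), .groups = "drop")

annual_totals
```

Filtering on `n_months == 12` before plotting is one way to drop partial years.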