# Data Visualization

Each of the following examples plots one million data points, but the same approach scales to arbitrarily many observations, since only a summary (an `OnlineStat`) of the data is plotted.
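To illustrate why the size of the data does not matter, here is a minimal sketch of an `OnlineStat` being updated one batch at a time. The batch size and the choice of `Mean` are illustrative assumptions; other stats are updated the same way through `fit!`.

```julia
using OnlineStats

o = Mean()                 # the summary that is kept between batches
for _ in 1:10
    batch = randn(1_000)   # stand-in for a chunk read from a larger stream
    fit!(o, batch)         # update the summary; the batch is not needed afterwards
end

nobs(o)    # number of observations seen so far
value(o)   # current estimate of the mean
```

Only `o` has to be retained between iterations, which is what allows the plots below to be built from streams that do not fit in memory.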

## Partitions

The `Partition` type summarizes sections of a data stream using any `OnlineStat`. Because summaries are plotted rather than every single observation, it is well suited to visualizing huge datasets.

#### Continuous Data

```
using OnlineStats, Plots

y = cumsum(randn(10^6)) + 100randn(10^6)
o = Partition(KHist(10))  # summarize each section of the stream with a 10-bin histogram
fit!(o, y)
plot(o)
```

```
o = Partition(Series(Mean(), Extrema()))
fit!(o, y)
plot(o)
```

#### Categorical Data

```
y = rand(["a", "a", "b", "c"], 10^6)
o = Partition(CountMap(String), 75)
fit!(o, y)
plot(o)
```

## Indexed Partitions

The `Partition` type can only track the number of observations on the x-axis. To plot one variable against another, use an `IndexedPartition`.

```
x = randn(10^6)
y = x + randn(10^6)
o = fit!(IndexedPartition(Float64, KHist(40), 40), zip(x, y))
plot(o)
```

```
x = rand(10^6)
y = rand(1:5, 10^6)
o = fit!(IndexedPartition(Float64, CountMap(Int)), zip(x,y))
plot(o, xlab = "X", ylab = "Y")
```

Due to a limitation with Plots.jl, `Date` and `DateTime` will sometimes be converted to their `Dates.value` when plotted. To get human-readable tick labels, you can use the `xformatter` keyword argument to `plot`.

```
using Dates
x = rand(Date(2019):Day(1):Date(2020), 10^6)
y = Dates.value.(x) .+ 30randn(10^6)
o = fit!(IndexedPartition(Date, KHist(20)), zip(x,y))
plot(o, xformatter = x -> string(Date(Dates.UTInstant(Day(x)))))
```
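The formatter above inverts `Dates.value`, which for a `Date` is a count of days. A small sketch of that round trip follows. The helper name `date_label` and the rounding step are assumptions added here, the latter to guard against tick positions that are not whole numbers.

```julia
using Dates

d = Date(2019, 6, 1)
v = Dates.value(d)   # integer day count backing the Date

# Hypothetical helper: turn a (possibly Float64) tick position back into a label.
date_label(x) = string(Date(Dates.UTInstant(Day(round(Int, x)))))

date_label(v)        # "2019-06-01"
date_label(v + 0.4)  # rounds to the same day
```

A helper like this could then be passed as `xformatter = date_label`.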

## K-Indexed Partitions

A `KIndexedPartition` is similar to an `IndexedPartition`, but uses a different method of binning the x variable (centroids vs. intervals), similar to that of `KHist`.

For the sake of performance, you must provide a **function** that creates the OnlineStat you wish to calculate for the y variable.

```
x = randn(10^6)
y = x + randn(10^6)
o = fit!(KIndexedPartition(Float64, () -> KHist(20)), zip(x, y))
plot(o)
```
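What passing a function provides can be sketched as follows: each call returns a fresh, independent `OnlineStat`, so a new one can be created whenever it is needed. The name `make_stat` is hypothetical, and how `KIndexedPartition` uses the function internally is not shown here.

```julia
using OnlineStats

make_stat = () -> KHist(20)   # hypothetical name for the stat-creating function

a = make_stat()
b = make_stat()
fit!(a, randn(100))

nobs(a)   # 100
nobs(b)   # 0, since b is a separate object that was never fitted
```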

## Histograms

```
s = fit!(Series(KHist(25), Hist(-5:.2:5), ExpandingHist(100)), randn(10^6))
plot(s, link = :x, label = ["KHist" "Hist" "ExpandingHist"])
```
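Because a fitted histogram is itself a summary of the data, approximate statistics can be read from it without going back to the observations. The following is a sketch that assumes the `Statistics` methods OnlineStats defines for its histogram types; the results are approximations derived from the binned counts.

```julia
using OnlineStats, Statistics

o = fit!(Hist(-5:.2:5), randn(10^6))

# Approximate statistics of the standard normal sample, computed from the histogram
mean(o)
std(o)
median(o)
```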

## Average Shifted Histograms (ASH)

ASH is a semi-parametric density estimation method that is similar to kernel density estimation, but it smooths a finely partitioned histogram instead of individual observations.

```
o = fit!(Ash(ExpandingHist(1000), 5), randn(10^6))
plot(o)
```

## Approximate CDF

```
o = fit!(OrderStats(1000), randn(10^6))
plot(o)
```

## Mosaic Plots

The `Mosaic` type allows you to plot the relationship between two categorical variables. It is typically more useful than a bar plot, as class probabilities are given by the horizontal widths.

```
using RDatasets
t = dataset("ggplot2", "diamonds")
o = Mosaic(eltype(t.Cut), eltype(t.Color))
fit!(o, zip(t.Cut, t.Color))
plot(o, legendtitle="Color", xlabel="Cut")
```

## HeatMap

```
o = HeatMap(-5:.1:5, 0:.1:10)
x, y = randn(10^6), 5 .+ randn(10^6)
fit!(o, zip(x, y))
plot(o)
```

The marginal plots can be turned off and the legend turned on with keyword arguments: `plot(o, marginals=false, legend=true)`

## Traces

A `Trace` takes snapshots of an OnlineStat as it is fitted, allowing you to view how the value changed as observations were added. This can be useful for identifying concept drift or finding optimal hyperparameters for stochastic approximation methods like `StatLearn`.

```
y = range(1, 20, length=10^6) .* randn(10^6)
o = Trace(Extrema())
fit!(o, y)
plot(o)
```

## Naive Bayes Classifier

The `NBClassifier` type stores conditional histograms of the predictor variables, allowing you to plot approximate "group by" distributions:

```
# make data
x = randn(10^6, 5)
y = x * [1,3,5,7,9] .> 0
o = NBClassifier(5, Bool) # 5 predictors with Boolean categories
fit!(o, zip(eachrow(x), y))
plot(o)
```