Data Visualization

Note

Each of the following examples plots one million data points, but the same approach scales to arbitrarily many observations, since only a summary (an OnlineStat) of the data is plotted.
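
As a minimal sketch of why the number of observations does not matter, the loop below feeds a summary in batches, as if the data arrived as a stream. The batch size and batch count are arbitrary choices for illustration.

```julia
using OnlineStats

o = Series(Mean(), Extrema())

# Simulate a stream arriving in ten batches of 100,000 observations each.
# Only the running summary is kept; each batch can be discarded after fitting.
for _ in 1:10
    fit!(o, randn(10^5))
end

nobs(o)   # total number of observations seen so far
```

The summary stays the same size no matter how many batches are processed, which is what makes plotting it cheap.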

Partitions

The Partition type summarizes sections of a data stream using any OnlineStat, and is therefore extremely useful in visualizing huge datasets, as summaries are plotted rather than every single observation.

Continuous Data

using OnlineStats, Plots

y = cumsum(randn(10^6)) + 100randn(10^6)

o = Partition(KHist(10))

fit!(o, y)

plot(o)

# Summarize each section by its mean and extrema instead.
o = Partition(Series(Mean(), Extrema()))

fit!(o, y)

plot(o)

Categorical Data

y = rand(["a", "a", "b", "c"], 10^6)

o = Partition(CountMap(String), 75)

fit!(o, y)

plot(o)

Indexed Partitions

The Partition type can only track the number of observations on the x-axis. If you wish to plot one variable against another, you can use an IndexedPartition.

x = randn(10^6)
y = x + randn(10^6)

o = fit!(IndexedPartition(Float64, KHist(40), 40), zip(x, y))

plot(o)

# A continuous x against an integer-valued (categorical) y.
x = rand(10^6)
y = rand(1:5, 10^6)

o = fit!(IndexedPartition(Float64, CountMap(Int)), zip(x,y))

plot(o, xlab = "X", ylab = "Y")
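
To make the idea of summarizing y within sections of x concrete, here is a conceptual sketch that uses fixed, hand-picked bin edges. It is not how IndexedPartition is implemented, since IndexedPartition manages its bins itself. The names `edges`, `stats`, and `bin_means` are hypothetical.

```julia
using OnlineStats

x = randn(10^5)
y = x + randn(10^5)

# Hypothetical fixed bin edges on x, chosen by hand for this sketch.
edges = -4:1:4
stats = [Mean() for _ in 1:(length(edges) - 1)]

for (xi, yi) in zip(x, y)
    first(edges) <= xi < last(edges) || continue   # skip x values outside the edges
    bin = searchsortedlast(edges, xi)              # index of the interval containing xi
    fit!(stats[bin], yi)                           # update that interval's summary of y
end

bin_means = value.(stats)   # one mean of y per interval of x
```

Plotting `bin_means` against the interval midpoints would give a coarse version of the pictures above, using eight numbers instead of 100,000 points.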

Due to a limitation with Plots.jl, Date and DateTime will sometimes be converted to their Dates.value when plotted. To get human-readable tick labels, you can use the xformatter keyword argument to plot.

using Dates

x = rand(Date(2019):Day(1):Date(2020), 10^6)
y = Dates.value.(x) .+ 30randn(10^6)

o = fit!(IndexedPartition(Date, KHist(20)), zip(x,y))

plot(o, xformatter = x -> string(Date(Dates.UTInstant(Day(x)))))
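
The formatter above relies on the round trip between a Date and its integer day count. The sketch below shows that round trip in isolation. The helper `day_label` is hypothetical, and the rounding step assumes that tick positions may be passed as floating-point numbers rather than integers.

```julia
using Dates

d = Date(2019, 6, 1)
v = Dates.value(d)   # integer day count that stands in for the date

# Hypothetical helper: round first, so that a non-integer tick position
# does not raise an error when converted to a whole number of days.
day_label(x) = string(Date(Dates.UTInstant(Day(round(Int, x)))))

day_label(v)         # "2019-06-01"
day_label(v + 0.4)   # "2019-06-01"
```

A function like this could be passed as the xformatter keyword argument in place of the anonymous function.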

K-Indexed Partitions

A KIndexedPartition is similar to an IndexedPartition, but uses a different method of binning the x variable (centroids vs. intervals), similar to that of KHist.

For the sake of performance, you must provide a function that creates the OnlineStat you wish to calculate for the y variable.

x = randn(10^6)
y = x + randn(10^6)

o = fit!(KIndexedPartition(Float64, () -> KHist(20)), zip(x, y))

plot(o)

Histograms

s = fit!(Series(KHist(25), Hist(-5:.2:5), ExpandingHist(100)), randn(10^6))
plot(s, link = :x, label = ["KHist" "Hist" "ExpandingHist"])

Average Shifted Histograms (ASH)

  • ASH is a semi-parametric density estimation method that is similar to Kernel Density Estimation, but uses a fine partition histogram instead of individual observations to perform the smoothing.
o = fit!(Ash(ExpandingHist(1000), 5), randn(10^6))
plot(o)
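
The smoothing step can be sketched in plain Julia. This is an illustration of the general ASH idea under stated assumptions, not the code behind Ash: it uses triangular weights and a fixed range, and the helpers `fine_counts` and `ash_density` are hypothetical.

```julia
# Fine-partition histogram: counts of `y` over `nbins` equal-width bins on [lo, hi).
function fine_counts(y, lo, hi, nbins)
    binwidth = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for yi in y
        lo <= yi < hi || continue   # observations outside the range are dropped
        k = min(nbins, floor(Int, (yi - lo) / binwidth) + 1)
        counts[k] += 1
    end
    return counts, binwidth
end

# Smooth the fine counts: each bin's density is a weighted sum of the counts in
# the 2m - 1 surrounding bins, with triangular weights 1 - |i|/m (an assumption).
function ash_density(counts, binwidth, m)
    n = sum(counts)
    dens = zeros(length(counts))
    for k in eachindex(counts), i in (1 - m):(m - 1)
        j = k + i
        1 <= j <= length(counts) || continue
        dens[k] += (1 - abs(i) / m) * counts[j]
    end
    return dens ./ (n * m * binwidth)
end

y = randn(10^5)
counts, binwidth = fine_counts(y, -6.0, 6.0, 600)
dens = ash_density(counts, binwidth, 5)
```

The smoothing only touches the 600 bin counts, never the individual observations, so its cost does not depend on how much data was seen.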

Approximate CDF

o = fit!(OrderStats(1000), randn(10^6))

plot(o)

Mosaic Plots

The Mosaic type allows you to plot the relationship between two categorical variables. It is typically more useful than a bar plot, as class probabilities are given by the horizontal widths.

using RDatasets
t = dataset("ggplot2", "diamonds")

o = Mosaic(eltype(t.Cut), eltype(t.Color))

fit!(o, zip(t.Cut, t.Color))

plot(o, legendtitle="Color", xlabel="Cut")

HeatMap

o = HeatMap(-5:.1:5, 0:.1:10)

x, y = randn(10^6), 5 .+ randn(10^6)

fit!(o, zip(x, y))

plot(o)

# Same heat map with marginals disabled and the legend shown.
plot(o, marginals=false, legend=true)

Traces

A Trace takes snapshots of an OnlineStat as it is fitted, allowing you to view how the value changed as observations were added. This can be useful for identifying concept drift or finding optimal hyperparameters for stochastic approximation methods like StatLearn.

y = range(1, 20, length=10^6) .* randn(10^6)

o = Trace(Extrema())

fit!(o, y)

plot(o)
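
The snapshot idea can also be sketched by hand. The loop below is a conceptual illustration, not how Trace is implemented: it records the running maximum at a fixed interval, and both the interval of 1,000 observations and the name `snaps` are arbitrary choices for this sketch.

```julia
using OnlineStats

y = range(1, 20, length=10^4) .* randn(10^4)

o = Extrema()
snaps = Tuple{Int, Float64}[]   # (observations seen, running maximum)

for (i, yi) in enumerate(y)
    fit!(o, yi)
    # Record a snapshot every 1,000 observations.
    i % 1000 == 0 && push!(snaps, (nobs(o), maximum(o)))
end
```

Because the spread of `y` grows over time, the recorded maxima tend to keep increasing, which is the kind of drift a trace plot makes visible.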

Naive Bayes Classifier

The NBClassifier type stores conditional histograms of the predictor variables, allowing you to plot approximate "group by" distributions:

# make data
x = randn(10^6, 5)
y = x * [1,3,5,7,9] .> 0

o = NBClassifier(5, Bool)  # 5 predictors with Boolean categories
fit!(o, zip(eachrow(x), y))
plot(o)