Data Visualization

Note

Each of the following examples plots one million data points, but can scale to infinitely many observations, since only a summary (OnlineStat) of the data is plotted.

Partitions

The Partition type summarizes sections of a data stream using any OnlineStat, and is therefore extremely useful in visualizing huge datasets, as summaries are plotted rather than every single observation.

Continuous Data

y = cumsum(randn(10^6)) + 100randn(10^6)

o = Partition(KHist(10))

fit!(o, y)

plot(o)
o = Partition(Series(Mean(), Extrema()))

fit!(o, y)

plot(o)

Categorical Data

y = rand(["a", "a", "b", "c"], 10^6)

o = Partition(CountMap(String), 75)

fit!(o, y)

plot(o)

Indexed Partitions

The Partition type can only track the number of observations in the x-axis. If you wish to plot one variable against another, you can use an IndexedPartition.

x = randn(10^6)
y = x + randn(10^6)

o = fit!(IndexedPartition(Float64, KHist(40), 40), zip(x, y))

plot(o)
x = rand(10^6)
y = rand(1:5, 10^6)

o = fit!(IndexedPartition(Float64, CountMap(Int)), zip(x,y))

plot(o, xlab = "X", ylab = "Y")

Due to a limitation with Plots.jl, Date and DateTime will sometimes be converted to their Dates.value when plotted. To get human-readable tick labels, you can use the xformatter keyword argument to plot.

using Dates

x = rand(Date(2019):Day(1):Date(2020), 10^6)
y = Dates.value.(x) .+ 30randn(10^6)

o = fit!(IndexedPartition(Date, KHist(20)), zip(x,y))

plot(o, xformatter = x -> string(Date(Dates.UTInstant(Day(x)))))

K-Indexed Partitions

A KIndexedPartition is simlar to an IndexedPartition, but uses a different method of binning the x variable (centroids vs. intervals), similar to that of KHist.

For the sake of performance, you must provide a function that creates the OnlineStat you wish to calculate for the y variable.

x = randn(10^6)
y = x + randn(10^6)

o = fit!(KIndexedPartition(Float64, () -> KHist(20)), zip(x, y))

plot(o)

Histograms

s = fit!(Series(KHist(25), Hist(-5:.2:5), ExpandingHist(100)), randn(10^6))
plot(s, link = :x, label = ["KHist" "Hist" "ExpandingHist"])

Average Shifted Histograms (ASH)

  • ASH is a semi-parametric density estimation method that is similar to Kernel Density Estimation, but uses a fine partition histogram instead of individual observations to perform the smoothing.
o = fit!(Ash(ExpandingHist(1000), 5), randn(10^6))
plot(o)

Approximate CDF

o = fit!(OrderStats(1000), randn(10^6))

plot(o)

Mosaic Plots

The Mosaic type allows you to plot the relationship between two categorical variables. It is typically more useful than a bar plot, as class probabilities are given by the horizontal widths.

using RDatasets
t = dataset("ggplot2", "diamonds")

o = Mosaic(eltype(t.Cut), eltype(t.Color))

fit!(o, zip(t.Cut, t.Color))

plot(o, legendtitle="Color", xlabel="Cut")

HeatMap

o = HeatMap(-5:.1:5, -0:.1:10)

x, y = randn(10^6), 5 .+ randn(10^6)

fit!(o, zip(x, y))

plot(o)
plot(o, marginals=false, legend=true)

Traces

A Trace will take snapshots of an OnlineStat as it is fitted, allowing you view how the value changed as observations were added. This can be useful for identifying concept drift or finding optimal hyperparameters for stochastic approximation methods like StatLearn.

y = range(1, 20, length=10^6) .* randn(10^6)

o = Trace(Extrema())

fit!(o, y)

plot(o)

Naive Bayes Classifier

The NBClassifier type stores conditional histograms of the predictor variables, allowing you to plot approximate "group by" distributions:

# make data
x = randn(10^6, 5)
y = x * [1,3,5,7,9] .> 0

o = NBClassifier(5, Bool)  # 5 predictors with Boolean categories
fit!(o, zip(eachrow(x), y))
plot(o)