API

ADADELTA(ρ = .95)

An extension of ADAGRAD.

source
ADAGRAD()

A variation of SGD with element-wise weights generated by the average of the squared gradients.
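
As a rough illustration of the idea (not OnlineStats's internal code), the sketch below applies an ADAGRAD-style step in plain Julia. The helper name `adagrad_step!`, the base step size `η`, and the small constant `ϵ` are assumptions made for this example.

```julia
# Hypothetical helper: one ADAGRAD-style update.
# β: parameters, h: running average of squared gradients, g: current gradient,
# n: number of updates so far (including this one), η: assumed base step size.
function adagrad_step!(β, h, g, n; η = 0.1, ϵ = 1e-8)
    for j in eachindex(β)
        h[j] += (g[j]^2 - h[j]) / n          # running mean of squared gradients
        β[j] -= η * g[j] / (sqrt(h[j]) + ϵ)  # element-wise scaled step
    end
    return β
end

# Minimize f(β) = sum((β .- target).^2) / 2, whose gradient is β .- target.
target = [1.0, -2.0]
β = zeros(2)
h = zeros(2)
for n in 1:500
    adagrad_step!(β, h, β .- target, n)
end
```

Each coordinate gets its own effective step size, so a coordinate with large past gradients moves more cautiously than one with small past gradients.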

source
ADAM(β1 = .99, β2 = .999)

A variant of SGD with element-wise learning rates generated by exponentially weighted first and second moments of the gradient.

source
ADAMAX(η, β1 = .9, β2 = .999)

ADAMAX with momentum parameters β1, β2. ADAMAX is an extension of ADAM.

source
AutoCov(b, T = Float64; weight=EqualWeight())

Calculate the auto-covariance/correlation for lags 0 to b for a data stream of type T.

Example

y = cumsum(randn(100))
o = AutoCov(5)
fit!(o, y)
autocov(o)
autocor(o)
source
BiasVec(x)

Lightweight wrapper of a vector which adds a "bias" term at the end.

Example

BiasVec(rand(5))
source
Bootstrap(o::OnlineStat, nreps = 100, d = [0, 2])

Calculate an online statistical bootstrap of nreps replicates of o. For each call to fit!, any given replicate will be updated rand(d) times (default is double or nothing).

Example

o = Bootstrap(Variance())
fit!(o, randn(1000))
confint(o, .95)
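
To make the "double or nothing" scheme concrete, here is a minimal sketch for the special case of bootstrapping a mean. It is written in plain Julia and is not how Bootstrap is implemented; the helper name `bootstrap_means` and its keyword names are assumptions for illustration.

```julia
using Random

# Hypothetical helper: online "double or nothing" bootstrap of a running mean.
# For every observation, each replicate is updated rand(d) times (0 or 2 by default).
function bootstrap_means(y; nreps = 100, d = [0, 2], rng = Random.default_rng())
    sums = zeros(nreps)         # per-replicate sum of the (repeated) observations
    counts = zeros(Int, nreps)  # per-replicate number of updates
    for yi in y
        for r in 1:nreps
            k = rand(rng, d)    # how many times this replicate sees yi
            sums[r] += k * yi
            counts[r] += k
        end
    end
    return sums ./ counts       # one bootstrap estimate of the mean per replicate
end

means = bootstrap_means(fill(3.0, 50))
```

The spread of the replicate estimates (for example, their empirical quantiles) is what a bootstrap confidence interval is built from.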
source
CStat(stat)

Track a univariate OnlineStat for complex numbers. A copy of stat is made to separately track the real and imaginary parts.

Example

y = randn(100) + randn(100)im
fit!(CStat(Mean()), y)
source
CallFun(o::OnlineStat, f::Function)

Call f(o) every time the OnlineStat o gets updated.

Example

o = CallFun(Mean(), println)
fit!(o, [0,0,1,1])
source
CountMap(T::Type)
CountMap(dict::AbstractDict{T, Int})

Track a dictionary that maps each unique value to its number of occurrences. Similar to StatsBase.countmap.

Example

o = fit!(CountMap(Int), rand(1:10, 1000))
value(o)
probs(o)
OnlineStats.pdf(o, 1)
collect(keys(o))
source
CovMatrix(p=0; weight=EqualWeight())
CovMatrix(::Type{T}, p=0; weight=EqualWeight())

Calculate a covariance/correlation matrix of p variables. If the number of variables is unknown, leave the default p=0.

Example

o = fit!(CovMatrix(), randn(100, 4))
cor(o)
cov(o)
mean(o)
var(o)
source
Diff(T::Type = Float64)

Track the difference and the last value.

Example

o = Diff()
fit!(o, [1.0, 2.0])
last(o)
diff(o)
source
Extrema(T::Type = Float64)

Maximum and minimum.

Example

o = fit!(Extrema(), rand(10^5))
extrema(o)
maximum(o)
minimum(o)
source
FTSeries(stats...; filter=x->true, transform=identity)

Track multiple stats for one data stream that is filtered and transformed before being fitted.

FTSeries(T, stats...; filter, transform)

Create an FTSeries and specify the type T of the transformed values.

Example

o = FTSeries(Mean(), Variance(); transform=abs)
fit!(o, -rand(1000))

# Remove missing values represented as DataValues
using DataValues
y = DataValueArray(randn(100), rand(Bool, 100))
o = FTSeries(DataValue, Mean(); transform=get, filter=!isna)
fit!(o, y)
source
FastForest(p, nkeys=2; stat=FitNormal(), kw...)

Calculate a random forest where each variable is summarized by stat.

Keyword Arguments

  • nt=100: Number of trees in the forest
  • b=floor(Int, sqrt(p)): Number of random features for each tree to receive
  • maxsize=1000: Maximum size for any tree in the forest
  • splitsize=5000: Number of observations in any given node before splitting
  • λ = .05: Probability that each tree is updated on a new observation

Example

x, y = randn(10^5, 10), rand(1:2, 10^5)

o = fit!(FastForest(10), (x,y))

classify(o, x[1,:])
source
FastTree(p::Int, nclasses=2; stat=FitNormal(), maxsize=5000, splitsize=1000)

Calculate a decision tree of p predictor variables and classes 1, 2, …, nclasses. Nodes split when they reach splitsize observations until maxsize nodes are in the tree. Each variable is summarized by stat, which can be FitNormal() or Hist(nbins).

Example

x = randn(10^5, 10)
y = rand([1,2], 10^5)

o = fit!(FastTree(10), (x,y))

xi = randn(10)
classify(o, xi)
source
FitBeta(; weight)

Online parameter estimate of a Beta distribution (Method of Moments).

Example

o = fit!(FitBeta(), rand(1000))
source
FitCauchy(; alg, rate)

Approximate parameter estimation of a Cauchy distribution. Estimates are based on quantiles, so alg is passed to Quantile.

Example

o = fit!(FitCauchy(), randn(1000))
source
FitGamma(; weight)

Online parameter estimate of a Gamma distribution (Method of Moments).

Example

using Random
o = fit!(FitGamma(), randexp(10^5))
source
FitLogNormal()

Online parameter estimate of a LogNormal distribution (MLE).

Example

o = fit!(FitLogNormal(), exp.(randn(10^5)))
source
FitMultinomial(p)

Online parameter estimate of a Multinomial distribution. The sum of counts does not need to be consistent across observations. Therefore, the n parameter of the Multinomial distribution is returned as 1.

Example

x = [1 2 3; 4 8 12]
fit!(FitMultinomial(3), x)
source
FitMvNormal(d)

Online parameter estimate of a d-dimensional MvNormal distribution (MLE).

Example

y = randn(100, 2)
o = fit!(FitMvNormal(2), y)
source
FitNormal()

Calculate the parameters of a normal distribution via maximum likelihood.

Example

o = fit!(FitNormal(), randn(1000))
source
Group(stats::OnlineStat...)
Group(; stats...)
Group(collection)

Create a vector-input stat from several scalar-input stats. For a new observation y, y[i] is sent to stats[i].

Examples

x = randn(100, 2)

fit!(Group(Mean(), Mean()), x)
fit!(Group(Mean(), Variance()), x)

o = fit!(Group(m1 = Mean(), m2 = Mean()), x)
o.stats.m1
o.stats.m2
source
GroupBy{T}(stat)

Update stat for each group (of type T).

Example

x = rand(1:10, 10^5)
y = x .+ randn(10^5)
fit!(GroupBy{Int}(Extrema()), zip(x,y))
source
HeatMap(xedges, yedges; left = true, closed = true)

Create a two-dimensional histogram with the bin partition created by xedges and yedges. When fitting a new observation, the first value will be associated with X, the second with Y.

  • If left, the bins will be left-closed.
  • If closed, the bins on the ends will be closed. See Hist.

Example

o = fit!(HeatMap(-5:.1:5, -5:.1:5), eachrow(randn(10^5, 2)))

using Plots
plot(o)
source
Hist(edges; left = true, closed = true)

Create a histogram with bin partition defined by edges.

  • If left, the bins will be left-closed.
  • If closed, the bin on the end will be closed.
    • E.g. for a two bin histogram $[a, b), [b, c)$ vs. $[a, b), [b, c]$

Example

o = fit!(Hist(-5:.1:5), randn(10^6))

# approximate statistics 
using Statistics

mean(o)
var(o)
std(o)
quantile(o)
median(o)
extrema(o)
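
The effect of left and closed on where a value lands can be sketched in plain Julia. This is one plausible reading of the options described above, not the package's own code; the helper name `bin_index` is hypothetical, and treating closed with left = false as closing the first bin is an assumption.

```julia
# Hypothetical helper: index of the bin that x falls into, or 0 if x lies
# outside the edges.
function bin_index(edges, x; left = true, closed = true)
    nbins = length(edges) - 1
    a, b = first(edges), last(edges)
    if left
        a <= x < b && return searchsortedlast(edges, x)      # bins are [lo, hi)
        return (closed && x == b) ? nbins : 0                 # last bin becomes [lo, hi]
    else
        a < x <= b && return searchsortedfirst(edges, x) - 1  # bins are (lo, hi]
        return (closed && x == a) ? 1 : 0                     # first bin becomes [lo, hi]
    end
end

edges = [0.0, 1.0, 2.0]      # two bins
bin_index(edges, 1.0)        # an interior edge belongs to the bin on its right
bin_index(edges, 2.0)        # the last edge is included only because closed = true
```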
source
HyperLogLog(b, T::Type = Number)  # 4 ≤ b ≤ 16

Approximate count of distinct elements.

Example

fit!(HyperLogLog(12), rand(1:10,10^5))
source
IndexedPartition(T, stat, b=100)

Summarize data with stat over a partition of size b where the data is indexed by a variable of type T.

Example

o = IndexedPartition(Float64, Hist(10))
fit!(o, randn(10^4, 2))

using Plots 
plot(o)
source
KHist(k::Int)

Estimate the probability density of a univariate distribution at k approximately equally-spaced points.

Ref: http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf

Example

o = fit!(KHist(25), randn(10^6))

# Approximate statistics
using Statistics
mean(o)
var(o)
std(o)
quantile(o)
median(o)

using Plots
plot(o)
source
KMeans(p, k; rate=LearningRate(.6))

Approximate K-Means clustering of k clusters and p variables.

Example

clusters = rand(Bool, 10^5)

x = [clusters[i] ? randn() : 5 + randn() for i in 1:10^5, j in 1:2]

o = fit!(KMeans(2, 2), x)
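
For intuition about how clusters can be updated one observation at a time, the following is a minimal sketch of a generic online k-means scheme. It is not the KMeans implementation: the helper name `online_kmeans`, the initialization from the first k observations, and the 1 / count step size (instead of a LearningRate) are all assumptions.

```julia
# Hypothetical helper: online k-means. Centers start at the first k observations;
# each later observation pulls its nearest center toward itself.
function online_kmeans(xs, k)
    centers = [float.(x) for x in xs[1:k]]
    counts = ones(Int, k)
    for x in xs[k+1:end]
        j = argmin([sum(abs2, x .- c) for c in centers])  # nearest center
        counts[j] += 1
        centers[j] .+= (x .- centers[j]) ./ counts[j]     # step size 1 / count
    end
    return centers
end

xs = [[0.0, 0.0], [10.0, 10.0], [0.2, 0.0], [10.0, 10.2], [-0.2, 0.0], [10.0, 9.8]]
centers = online_kmeans(xs, 2)
```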
source
KahanMean(; T=Float64, weight=EqualWeight())

Track a univariate mean. Uses a compensation term for the update.

Note

This should be more accurate than Mean in most cases, but the guarantees of KahanSum do not apply. merge! can have some accuracy issues.

Update

$μ = (1 - γ) * μ + γ * x$

Example

@time fit!(KahanMean(), randn(10^6))
source
KahanSum(T::Type = Float64)

Track the overall sum. Includes a compensation term that effectively doubles precision, see Wikipedia for details.

Example

fit!(KahanSum(Float64), fill(1, 100))
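
The compensation idea can be sketched in a few lines of plain Julia. This is the textbook Kahan summation loop, shown for illustration rather than as the package's code; the helper name `kahan_sum` is hypothetical.

```julia
# Hypothetical helper: compensated (Kahan) summation.
function kahan_sum(xs)
    s = 0.0   # running sum
    c = 0.0   # running compensation for lost low-order bits
    for x in xs
        y = x - c        # apply the correction carried over from the last step
        t = s + y        # low-order bits of y may be lost here
        c = (t - s) - y  # recover what was lost (algebraically zero)
        s = t
    end
    return s
end

xs = fill(0.1, 10^6)
naive = foldl(+, xs)        # plain left-to-right accumulation
compensated = kahan_sum(xs)
```

Comparing `naive` and `compensated` against 1e5 shows how much rounding error the plain loop accumulates.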
source
KahanVariance(; T=Float64, weight=EqualWeight())

Track the univariate variance. Uses compensation terms for higher accuracy.

Note

This should be more accurate than Variance in most cases, but the guarantees of KahanSum do not apply. merge! can have accuracy issues.

Example

o = fit!(KahanVariance(), randn(10^6))
mean(o)
var(o)
std(o)
source
OnlineStats.Lag (Type)
Lag{T}(b::Integer)

Store the last b values for a data stream of type T. Values are stored as

$v(t), v(t-1), v(t-2), …, v(t-b+1)$

Example

o = fit!(Lag{Int}(10), 1:12)
o[1]
o[end]
source
LinReg()

Linear regression, optionally with element-wise ridge regularization.

Example

x = randn(100, 5)
y = x * (1:5) + randn(100)
o = fit!(LinReg(), (x,y))
coef(o)
coef(o, .1)
coef(o, [0,0,0,0,Inf])
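
One way to see why regression with per-coefficient ridge penalties fits an online setting is that the coefficients depend on the data only through x'x and x'y, which can be accumulated one row at a time. The sketch below shows that idea in plain Julia. The helper name `ridge_coef` and the exact scaling of the penalty are assumptions and may differ from what LinReg does internally.

```julia
using LinearAlgebra

# Hypothetical helper: ridge coefficients from the averages A = x'x / n and
# b = x'y / n, with one penalty λ[j] per coefficient.
ridge_coef(A, b, λ) = (A + Diagonal(λ)) \ b

x = [1.0 0.0; 0.0 1.0; 1.0 1.0; 2.0 1.0]
y = x * [2.0, -1.0]          # noiseless response with known coefficients
n = size(x, 1)

# Accumulate the sufficient statistics one observation at a time.
A = zeros(2, 2)
b = zeros(2)
for i in 1:n
    xi = x[i, :]
    A .+= xi * xi'
    b .+= xi * y[i]
end
A ./= n
b ./= n

β_ols   = ridge_coef(A, b, zeros(2))     # no penalty: ordinary least squares
β_ridge = ridge_coef(A, b, [0.0, 10.0])  # heavy penalty on the second coefficient only
```

Because the penalty is applied only at solve time, the same accumulated statistics can be reused for any number of penalty choices.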
source
LinRegBuilder(p)

Create an object from which any variable can be regressed on any other set of variables, optionally with element-wise ridge regularization. The main function to use with LinRegBuilder is coef:

coef(o::LinRegBuilder, λ = 0; y=1, x=[2,3,...], bias=true, verbose=false)

Return the coefficients from regressing column y on columns x with ridge (L2Penalty) parameter λ. An intercept (bias) term is added by default.

Examples

x = randn(1000, 10)
o = fit!(LinRegBuilder(), x)

coef(o; y=3, verbose=true)

coef(o; y=7, x=[2,5,4])
source
MSPI()

Majorized Stochastic Proximal Iteration.

source
Mean(T = Float64; weight=EqualWeight())

Track a univariate mean, stored as type T.

Example

@time fit!(Mean(), randn(10^6))
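
The role of the weight can be illustrated with a small plain-Julia sketch of a weighted running mean of the form μ = (1 - γ) * μ + γ * x. The helper name `running_mean` and the convention that the weight is a function of the observation count n are assumptions made for this example.

```julia
# Hypothetical helper: running mean where weight(n) gives the weight γ placed on
# the n-th observation. weight = n -> 1 / n reproduces the ordinary sample mean.
function running_mean(xs; weight = n -> 1 / n)
    μ = 0.0
    for (n, x) in enumerate(xs)
        γ = weight(n)
        μ = (1 - γ) * μ + γ * x
    end
    return μ
end

running_mean(1:10)                          # equal weighting
running_mean([1, 2, 3]; weight = n -> 0.5)  # constant weight favors recent values
```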
source
Moments(; weight=EqualWeight())

First four non-central moments.

Example

o = fit!(Moments(), randn(1000))
mean(o)
var(o)
std(o)
skewness(o)
kurtosis(o)
source
Mosaic(T::Type, S::Type)

Data structure for generating a mosaic plot, a comparison between two categorical variables.

Example

using OnlineStats, Plots 
x = [rand() > .8 for i in 1:10^5]
y = rand([1,2,2,3,3,3], 10^5)
o = fit!(Mosaic(Bool, Int), zip(x, y))
plot(o)
source
MovingTimeWindow{T<:TimeType, S}(window::DatePeriod)
MovingTimeWindow(window::DatePeriod; valtype=Float64, timetype=Date)

Fit a moving window of data based on time stamps. Each observation must be a Tuple, NamedTuple, or Pair where the first item is <: Dates.TimeType. Only observations with time stamps in the range

$most_recent_datetime - window <= time_stamp <= most_recent_datetime$

are kept track of.

Example

using Dates
dts = Date(2010):Day(1):Date(2011)
y = rand(length(dts))

o = MovingTimeWindow(Day(4); timetype=Date, valtype=Float64)
fit!(o, zip(dts, y))
source
MovingWindow(b, T)
MovingWindow(T, b)

Track a moving window of b items of type T.

Example

o = MovingWindow(10, Int)
fit!(o, 1:14)
source
NBClassifier(p::Int, T::Type; stat = Hist(15))

Calculate a naive Bayes classifier for classes of type T and p predictors. For each class K, predictor variables are summarized by the stat.

Example

x, y = randn(10^4, 10), rand(Bool, 10^4)

o = fit!(NBClassifier(10, Bool), (x,y))
collect(keys(o))
probs(o)

xi = randn(10)
predict(o, xi)
classify(o, xi)
source
OMAP()

Online MM via Averaged Parameter.

source
OMAS()

Online MM via Averaged Surrogate.

source
OrderStats(b::Int, T::Type = Float64; weight=EqualWeight())

Average order statistics with batches of size b.

Example

o = fit!(OrderStats(100), randn(10^5))
quantile(o, [.25, .5, .75])
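
A minimal sketch of what "average order statistics" means, assuming equal weighting and that an incomplete trailing batch is ignored (both assumptions; the helper name `average_order_stats` is hypothetical):

```julia
# Hypothetical helper: average the sorted values of consecutive batches of size b.
function average_order_stats(y, b)
    avg = zeros(b)
    nbatches = 0
    for start in 1:b:(length(y) - b + 1)
        batch = sort(y[start:start + b - 1])
        nbatches += 1
        avg .+= (batch .- avg) ./ nbatches   # running mean of each order statistic
    end
    return avg
end

avg = average_order_stats([3, 1, 2, 6, 4, 5], 3)
```

Here the first entry of `avg` is the average batch minimum and the last is the average batch maximum.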
source
P2Quantile(τ = 0.5)

Calculate the approximate quantile via the P^2 algorithm. It is more computationally expensive than the algorithms used by Quantile, but also more accurate.

Ref: https://www.cse.wustl.edu/~jain/papers/ftp/psqr.pdf

Example

fit!(P2Quantile(.5), rand(10^5))
source
Partition(stat, nparts=100)

Split a data stream into nparts where each part is summarized by stat.

Example

o = Partition(Extrema())
fit!(o, cumsum(randn(10^5)))

using Plots
plot(o)
source
PlotNN(b=300)

Approximate scatterplot of b centers. This implementation is too slow to be useful.

Example

using Plots
x = randn(10^4)
y = x + randn(10^4)
plot(fit!(PlotNN(), zip(x, y)))
source
ProbMap(T::Type; weight=EqualWeight())
ProbMap(A::AbstractDict{T, Float64}; weight=EqualWeight())

Track a dictionary that maps each unique value to its probability. Similar to CountMap, but uses a weighting mechanism.

Example

o = ProbMap(Int)
fit!(o, rand(1:10, 1000))
probs(o)
source
Quantile(q = [.25, .5, .75]; alg=OMAS(), rate=LearningRate(.6))

Calculate quantiles via a stochastic approximation algorithm (OMAS, SGD, ADAGRAD, or MSPI). For better (although slower) approximations, see P2Quantile and Hist.

Example

fit!(Quantile(), randn(10^5))
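
To give a feel for stochastic approximation of a quantile, here is a minimal plain-Julia sketch of the simplest variant, a stochastic gradient step on the quantile (check) loss. It is not the package's code; the helper name `sgd_quantile`, the zero starting value, and the step size n^(-r) are assumptions.

```julia
using Random

# Hypothetical helper: stochastic-gradient estimate of the τ-quantile.
# The step size γ = n^(-r) decays with the number of observations n.
function sgd_quantile(xs, τ; r = 0.6)
    q = 0.0
    for (n, x) in enumerate(xs)
        γ = n^(-r)
        q += γ * (τ - (x < q))   # step up by γτ if x >= q, down by γ(1 - τ) if x < q
    end
    return q
end

rng = MersenneTwister(42)
q50 = sgd_quantile(rand(rng, 10^5), 0.5)   # should land near the true median of 0.5
```

Each update touches only the current estimate and one observation, which is why this family of methods is cheap but only approximate.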
source
RMSPROP(α = .9)

A variation of ADAGRAD that uses element-wise weights generated by an exponentially weighted mean of the squared gradients.

source
ReservoirSample(k::Int, T::Type = Float64)

Create a sample without replacement of size k. After running through n ≥ k observations, the probability of any given observation being in the sample is k / n.

Example

fit!(ReservoirSample(100, Int), 1:1000)
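
A common way to draw such a sample from a stream of unknown length is the classic reservoir sampling scheme ("Algorithm R"). The sketch below shows that scheme in plain Julia. It is offered as an illustration and is not necessarily how ReservoirSample is implemented; the helper name `reservoir_sample` is hypothetical.

```julia
using Random

# Hypothetical helper: reservoir sampling of k items from a stream.
function reservoir_sample(stream, k; rng = Random.default_rng())
    sample = Vector{eltype(stream)}()
    for (n, x) in enumerate(stream)
        if n <= k
            push!(sample, x)           # fill the reservoir first
        else
            j = rand(rng, 1:n)         # uniform index over everything seen so far
            j <= k && (sample[j] = x)  # keep x only if the index falls in the reservoir
        end
    end
    return sample
end

s = reservoir_sample(1:1000, 100)
```

The stream is traversed once and only k values are held in memory at any time.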
source
OnlineStats.SGD (Type)
SGD()

Stochastic Gradient Descent.

source
Series(stats)
Series(stats...)
Series(; stats...)

Track a collection of stats for one data stream.

Example

s = Series(Mean(), Variance())
fit!(s, randn(1000))
source
StatHistory(stat, b)

Track a moving window (previous b copies) of stat.

Example

fit!(StatHistory(Mean(), 10), 1:20)
source
StatLearn(p, args...; rate=LearningRate())

Fit a model that is linear in the parameters.

The (offline) objective function that StatLearn approximately minimizes is

$(1/n) ∑ᵢ f(yᵢ, xᵢ'β) + ∑ⱼ λⱼ g(βⱼ),$

where $f$ is a loss function of a single response and linear predictor, the $λⱼ$ are nonnegative regularization parameters, and $g$ is a penalty function.

Arguments

  • loss = .5 * L2DistLoss()
  • penalty = NoPenalty()
  • algorithm = SGD()
  • rate = LearningRate(.6) (keyword arg)

Example

x = randn(1000, 5)
y = x * range(-1, stop=1, length=5) + randn(1000)

o = fit!(StatLearn(5, MSPI()), (x, y))
coef(o)
source
OnlineStats.Sum (Type)
Sum(T::Type = Float64)

Track the overall sum.

Example

fit!(Sum(Int), fill(1, 100))
source
Variance(T = Float64; weight=EqualWeight())

Univariate variance, tracked as type T.

Example

o = fit!(Variance(), randn(10^6))
mean(o)
var(o)
std(o)
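
For readers unfamiliar with one-pass variance, the sketch below uses Welford's algorithm, a standard technique swapped in here for illustration. It assumes equal weighting, is not claimed to be the update Variance uses, and the helper name `running_variance` is hypothetical.

```julia
# Hypothetical helper: Welford's one-pass algorithm for mean and variance.
function running_variance(xs)
    n, μ, m2 = 0, 0.0, 0.0
    for x in xs
        n += 1
        δ = x - μ
        μ += δ / n
        m2 += δ * (x - μ)     # uses both the old and the new mean
    end
    return μ, m2 / (n - 1)    # mean and unbiased sample variance
end

μ, σ2 = running_variance([2, 4, 4, 4, 5, 5, 7, 9])
```

Only three numbers are stored regardless of how many observations have been seen.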
source
StatsBase.confint (Function)
confint(b::Bootstrap, coverageprob = .95)

Return a confidence interval for a Bootstrap b.

source
Part(stat, a, b)

stat summarizes a Y variable over an X variable's range a to b.

source