Online algorithms for statistics

OnlineStats is a Julia package that provides online algorithms for statistical models. Online algorithms are well suited to streaming data and to data that is too large to hold in memory. Observations are processed one at a time, and all algorithms use O(1) memory.

Basics

Every OnlineStat is a type

m = Mean()
v = Variance()

OnlineStats are grouped by Series

s = Series(m, v)

Updating a Series updates the OnlineStats

y = randn(100)

for yi in y
    fit!(s, yi)
end

# or more simply:
fit!(s, y)
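
To make the one-observation-at-a-time idea concrete, here is a minimal sketch of an online mean and variance in plain Julia, using Welford's update. The names RunningMoments, update!, and samplevar are hypothetical and chosen for illustration; this is not how OnlineStats implements Mean or Variance, only an example of an update that keeps a fixed-size state no matter how many observations arrive.

```julia
# Hypothetical illustration; not the OnlineStats implementation.
mutable struct RunningMoments
    n::Int         # number of observations seen so far
    mean::Float64  # running mean
    m2::Float64    # running sum of squared deviations from the mean
end

RunningMoments() = RunningMoments(0, 0.0, 0.0)

# Welford's update: fold one observation into the state in O(1) time and memory.
function update!(s::RunningMoments, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Unbiased sample variance; NaN until at least two observations have arrived.
samplevar(s::RunningMoments) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

s = RunningMoments()
for x in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    update!(s, x)
end
println((s.mean, samplevar(s)))
```

The state is three numbers regardless of the stream length, which is the property the O(1) memory claim refers to.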

Weighting

Series are parameterized by a Weight type that controls the influence the next observation has on the OnlineStats contained in the Series.

s = Series(EqualWeight(), Mean())

Consider how weights affect the influence the next observation has on an online mean. Many OnlineStats have an update which takes this form:

\[\theta^{(t)} = (1-\gamma_t)\theta^{(t-1)} + \gamma_t x_t\]

Constructor               Weight at Update t
EqualWeight()             γ(t) = 1 / t
ExponentialWeight(λ)      γ(t) = λ
BoundedEqualWeight(λ)     γ(t) = max(1 / t, λ)
LearningRate(r, λ)        γ(t) = max(1 / t ^ r, λ)
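
The effect of each weight on the update rule can be sketched in plain Julia. The helper names below (equal_weight, exponential_weight, bounded_equal_weight, weighted_mean) are hypothetical, and starting θ at zero is an assumption made for the illustration; the package's own Weight types may behave differently at the first observation.

```julia
# Hypothetical helper names; a sketch of the update rule, not the package's code.
equal_weight(t) = 1 / t                        # γ(t) = 1 / t
exponential_weight(λ) = t -> λ                 # γ(t) = λ
bounded_equal_weight(λ) = t -> max(1 / t, λ)   # γ(t) = max(1 / t, λ)

# Apply θ(t) = (1 - γ(t)) θ(t-1) + γ(t) x(t) over a whole stream.
# Assumption: θ starts at 0.0 before the first observation.
function weighted_mean(γ, xs)
    θ = 0.0
    for (t, x) in enumerate(xs)
        θ = (1 - γ(t)) * θ + γ(t) * x
    end
    return θ
end

xs = [1.0, 2.0, 3.0, 4.0]
m_equal = weighted_mean(equal_weight, xs)                  # matches the sample mean
m_exp = weighted_mean(exponential_weight(0.5), xs)         # leans toward recent values
m_bounded = weighted_mean(bounded_equal_weight(0.5), xs)   # equal weights until 1 / t drops below λ
```

With γ(t) = 1 / t every observation ends up with the same influence, so the result is the ordinary sample mean. A constant γ keeps giving new observations a fixed share, so older values fade geometrically.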

Series

The Series type is the workhorse of OnlineStats. A Series tracks

  1. The Weight

  2. An OnlineStat or tuple of OnlineStats.

Creating a Series

Series(Mean())
Series(Mean(), Variance())

Series(ExponentialWeight(), Mean())
Series(ExponentialWeight(), Mean(), Variance())

y = randn(100)

Series(y, Mean())
Series(y, Mean(), Variance())

Series(y, ExponentialWeight(.01), Mean())
Series(y, ExponentialWeight(.01), Mean(), Variance())

Updating a Series

There are multiple ways to update the OnlineStats in a Series

# Single observation
s = Series(Mean())
fit!(s, randn())

s = Series(CovMatrix(4))
fit!(s, randn(4))
fit!(s, randn(4))

# Single observation, overriding the weight
s = Series(Mean())
fit!(s, randn(), rand())

# Multiple observations
s = Series(Mean())
fit!(s, randn(100))

s = Series(CovMatrix(4))
fit!(s, randn(100, 4))                 # Observations in rows
fit!(s, randn(4, 100), ObsDim.Last())  # Observations in columns

# Multiple observations, using the same weight for all
s = Series(Mean())
fit!(s, randn(100), .01)

# Multiple observations, providing a vector of weights
s = Series(Mean())
fit!(s, randn(100), rand(100))

Merging Series

Two Series can be merged if they track the same OnlineStats and those OnlineStats are mergeable. The syntax for in-place merging is

merge!(series1, series2, arg)

Here series1 and series2 are Series that contain the same OnlineStats, and arg determines how series2 is merged into series1.

using OnlineStats

y1 = randn(100)
y2 = randn(100)

s1 = Series(y1, Mean(), Variance())
s2 = Series(y2, Mean(), Variance())

# Treat s2 as a new batch of data.  Essentially:
# s1 = Series(Mean(), Variance()); fit!(s1, y1); fit!(s1, y2)
merge!(s1, s2, :append)

# Use weighted average based on nobs of each Series
merge!(s1, s2, :mean)

# Treat s2 as a single observation.
merge!(s1, s2, :singleton)

# Provide the ratio of influence s2 should have.
merge!(s1, s2, .5)
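
As an illustration of why a statistic like the mean is mergeable, the sketch below combines two summaries in plain Julia. MeanSummary, mean_summary, pool, and blend are hypothetical names, and keeping the combined observation count in blend is an assumption; this is not the package's merge! implementation. It shows only a count-weighted average and a fixed-ratio average, which are the ideas behind the last two kinds of arguments above.

```julia
# Hypothetical summary type; a sketch, not the package's merge! implementation.
struct MeanSummary
    n::Int          # number of observations summarized
    mean::Float64   # their mean
end

mean_summary(y) = MeanSummary(length(y), sum(y) / length(y))

# Pool two summaries by weighting each mean with its observation count.
function pool(a::MeanSummary, b::MeanSummary)
    n = a.n + b.n
    return MeanSummary(n, (a.n * a.mean + b.n * b.mean) / n)
end

# Blend with an explicit ratio w in [0, 1] giving the influence of b.
# Assumption: the blended summary keeps the combined observation count.
blend(a::MeanSummary, b::MeanSummary, w::Real) =
    MeanSummary(a.n + b.n, (1 - w) * a.mean + w * b.mean)

y1 = [1.0, 2.0, 3.0]
y2 = [10.0]
pooled = pool(mean_summary(y1), mean_summary(y2))         # same mean as summarizing [y1; y2]
blended = blend(mean_summary(y1), mean_summary(y2), 0.5)  # halfway between the two means
```

Pooling by count reproduces the mean of the concatenated data without revisiting either batch, while a fixed ratio ignores how many observations each side has seen.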

Callbacks

While an OnlineStat is being updated, you may wish to perform an action such as printing intermediate results to a log file or updating a plot. For this purpose, OnlineStats exports a maprows function.

maprows(f::Function, b::Integer, data...)

maprows works similarly to Base.mapslices, but maps b rows at a time. It is best used with Julia's do block syntax.

Example 1

y = randn(100)
s = Series(Mean())
maprows(20, y) do yi
    fit!(s, yi)
    info("value of mean is $(value(s))")
end
INFO: value of mean is 0.06340121912925167
INFO: value of mean is -0.06576995293439102
INFO: value of mean is 0.05374292238752276
INFO: value of mean is 0.008857939006120167
INFO: value of mean is 0.016199508928045905
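
The batching pattern in Example 1 can be sketched for a vector in plain Julia. The function each_batch is a hypothetical stand-in written for this illustration, not the exported maprows, and passing a shorter final batch when the length is not a multiple of b is an assumption about the desired behavior.

```julia
# Hypothetical stand-in for the batching pattern; not the exported maprows.
function each_batch(f, b::Integer, y::AbstractVector)
    n = length(y)
    for start in 1:b:n
        stop = min(start + b - 1, n)   # the last batch may be shorter than b
        f(view(y, start:stop))         # pass a view to avoid copying the batch
    end
end

y = collect(1.0:10.0)
total = Ref(0.0)        # running sum, updated inside the callback
batch_sizes = Int[]     # records how many observations each call received

each_batch(4, y) do yi
    total[] += sum(yi)
    push!(batch_sizes, length(yi))
end
```

Because the function argument comes first, the do block becomes the callback, which is why this style pairs well with logging or plotting between batches.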

Low Level Details

OnlineStat{I, O}

fit! and value