Statistical Learning

The StatLearn (short for statistical learning) OnlineStat uses stochastic approximation methods to estimate models that take the form:

\[\hat\beta = \argmin_\beta \frac{1}{n} \sum_i f(y_i, x_i'\beta) + \sum_j \lambda_j g(\beta_j),\]

where

$f$ is a loss function of a response variable and linear preditor.
$\lambda_j$'s are nonnegative regularization parameters.
$g$ is a penalty function.

For example, LASSO Regression fits this form with:

$f(y_i, x_i'\beta) = \frac{1}{2}(y_i - x_i'\beta) ^ 2$
$g(\beta_j) = |\beta_j|$

OnlineStats implements interchangeable loss and penalty functions to use for both regression and classification problems. See the StatLearn docstring for details.

Stochastic Approximation

An important note is that StatLearn is unable to estimate coefficients exactly (For precision in regression problems, see LinReg). The upside is that it makes estimating certain models possible in an online fashion.

For example, it is not possible to get the same coefficients for logistic regression from an O(1)-memory online algorithm as you would get from its offline counterpart. This is because the logistic regression likelihood's sufficient statistics scale with the number of observations.

All this to say: StatLearn lets you do things that would otherwise not be possible at the cost of returning noisy estimates.

Algorithms

Besides the loss and penalty functions, you can also plug in a variety of fitting algorithms to StatLearn. Some of these methods are based on the stochastic gradient (gradient of loss evaluated on a single observation). Other methods are based on the majorization-minimization (MM) principle^[1], which offers some guarantees on numerical stability (sometimes at the cost of slower learning).

It is a good idea to test out different algorithms on a sample of your dataset. Plotting the coefficients over time can give you an idea of the stability of the estimates. Use Trace, a wrapper around an OnlineStat, to automatically take equally-spaced snapshots of an OnlineStat's state. Keep in mind the early observations will cause bigger jumps in the cofficients than later observations (based on the learning rate; see Weights. To add further complexity, learning rates (supplied by the rate keyword argument) do not affect each algorithm's learning uniformly. You may need to test different combinations of algorithm/learning rate to find an "optimal" pairing.

using OnlineStats, Plots

# fake data
x = rand(Bool, 1000, 10)
y = x * (1:10) + 10randn(1000)

rate = LearningRate(.8)

o = Trace(StatLearn(SGD(), OnlineStats.l2regloss; rate))
o2 = Trace(StatLearn(MSPI(), OnlineStats.l2regloss; rate))

itr = zip(eachrow(x), y)

fit!(o, itr)
fit!(o2, itr)

plot(
    plot(o, xlab="Nobs", title="SGD Coefficients", lab=nothing),
    plot(o2, xlab="Nobs", title="MSPI Coefficients", lab=nothing),
    link=:y
)

1At the moment, the only place to read about the stochastic MM algorithms in detail is Josh Day's dissertation. Josh is working on an easier-to-digest introduction to these methods and is also happy to discuss them through GitHub issue/email.