21 Jan 2017, 19:30

Logistic Classifier in C#

In this post we’ll walk through implementing a logistic classifier in C#. A couple of features of the language allow for an implementation that is particularly nice - LINQ and expression bodied function members. Users of other languages take note!

For those who are just looking for an off-the-shelf library, the the Apache licensed source code is available on github and there is a package (’MathML’) on nuget. I went to the effort of implementing this because the main ML libraries for the .NET framework have yet to be ported over to dotnet core. It’s turned out to be worth the effort however - the exercise has proved to be a great way to obtain a deeper understanding of the behavior of the classifier.

The core idea is relatively simple - take a linear combination of the feature values $x$ (where the weights are specified by model parameters $\theta$) and apply a sigmoid function $g$ to the result, bounding them to the range $(0, 1)$:

$$ h(x, \theta) = g(\theta^Tx) = \frac{1}{1 + e^{(-\theta^Tx)}} $$

$h(x, \theta)$ - the ‘hypothesis function’ - is interpreted as the probability that the features $x$ correspond to an input of class ‘1’ (and $1-h(x, \theta)$ is interpreted as the probability the features correspond to an input of class ‘0’).

Some notes on features:

  • If you have a categorical feature with k levels, you would typically split this into into (k-1) binary 0/1 features.
  • Typically the first ‘feature’ will simply be a constant value of 1 to give a constant term $\theta_0$ in the hypothesis function.
  • In some scenarios, you may want to introduce features that are simply some function (e.g. $x^2$) of other features to allow the classifier to differentiate more complex regions in the feature space.

Here’s the C# code for the hypothesis function:

    public static double Sigmoid(double v)
        => 1.0 / (1 + Math.Exp(-v));

    public static double DotProduct(double[] xs, double[] ys)
        => xs.Zip(ys, (x, y) => x * y).Sum();

    public static double Hypothesis(double[] x, double[] theta)
        => Sigmoid(DotProduct(x, theta));

    public static double Probability(double[] x, double[] theta)
        => Hypothesis(x, theta);


Of course the most difficult part is finding the vector $\theta$ such that the model is a good fit to our training set and hopefully generalizes well to new data.

How do we measure how good the model is? An appropriate cost function for a given input $x^{(i)}$ with known output (class, 0 or 1) $y^{(i)}$ is:

$$ J^{(i)}(\theta) = -y^{(i)} log(h_\theta(x^{(i)})) - (1 - y^{(i)})log(1-h_\theta(x^{(i)})) $$

An intuitive justification of this is given here (note: unfortunately, I didn’t find a good formal justification of this online… but there are plenty of good books available). For a collection of feature vectors / classes, the total cost $J(\theta)$ is simply the sum of the individual cost values.

Here’s the C# code:

    public static double Cost(double[] x, double y, double[] theta)
        => -y * Math.Log(Hypothesis(x, theta)) - (1 - y) * Math.Log(1 - Hypothesis(x, theta));

    public static double Cost(double[][] xs, double[] ys, double[] theta)
        => xs.Zip(ys, (x, y) => Cost(x, y, theta)).Sum() / xs.Length;

Our goal is to find the model parameters $\theta$ such that the total cost associated with our training set is minimum. Gradient decent can be used to achieve this.

The gradient of the total cost function with respect to $\theta$ is:

$$ \frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} $$

(again, refer to a good book..). To find the optimal $\theta$, we can start with a naive guess and iteratively jump by a small amount proportional to the magnitude of the gradient in the direction of the gradient.

I’ve used a slightly atypical strategy for choosing the proportionality constant $\alpha$: start out with a relatively high value and when an iteration step results in a worse cost value (implying the algorithm has overstepped the local minimum), halve the current $\alpha$ and keep iterating. Repeat up to a maximum number of iterations.

    public static double[] ScalarProduct(double[] xs, double y)
        => xs.Select(x => x * y).ToArray();

    public static double[] Gradient(double[] x, double y, double[] theta)
        => ScalarProduct(x, (Hypothesis(x, theta) - y));

    public static double[] VectorAdd(double[][] xs1)
        => xs1.Aggregate(new double[xs1[0].Length], (accum, xs) => accum.Zip(xs, (a, x) => a + x).ToArray());

    public static double[] Gradient(double[][] xs, double[] ys, double[] theta)
        => VectorAdd(xs.Zip(ys, (x, y) => Gradient(x, y, theta)).ToArray());

    public static double[] VectorAdd(double[] xs, double[] ys)
        => xs.Zip(ys, (x, y) => x + y).ToArray();

    public static double[] Train(double[][] xs, double[] ys, double[] theta, double minAlpha, double maxAlpha, ref int iterations)
        var alpha = maxAlpha;
        double prevCost = Classifier.Cost(xs, ys, theta);
        while (iterations-- > 0)
            theta = VectorAdd(theta, ScalarProduct(Gradient(xs, ys, theta), -alpha));
            var cost = Classifier.Cost(xs, ys, theta);
            if (cost > prevCost)
                alpha /= 2;
                if (alpha < minAlpha)
            prevCost = cost;
        return theta;

In practice, the above algorithm can be susceptible to overfitting. For this reason, it is common to include a regularization term which acts to reduce the variance of the $\theta$ (and hence the reliance of the model on any small subset of the features). The regularized cost function is:

$$ J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big[-y^{(i)} log(h_\theta(x^{(i)})) - (1 - y^{(i)})log(1-h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$

where $\lambda$ controls the degree of regularization. Note that the final sum excludes the constant model parameter $\theta_0$.

The regularized gradient function is:

$$ \frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\Big(\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} + \lambda\theta_j\Big) $$

for all components of $\theta$ except the constant term $\theta_0$, which should be calculated using the non regularized gradient function.

The implementation is a straightforward extension of the non-regualarized version.

Ok, that’s it! I’m off to actually use this thing for something real.