05 Feb 2017, 17:00

Improved Logistic Classifier in C#

Last weekend I implemented a logistic classifier for use in a side-project I’ve been working on for the last few months (stay tuned on that). For anyone reading along (probably no one), you may recall I was quite enamoured with the beauty with which C# allowed the mathematics to be expressed in code.

Unfortunately, the implementation is annoyingly slow for the quantity of data I want to use it on, so this weekend I’ve built an optimised version. It’s approximately 20 times faster. Here’s what makes it faster:

  1. Unnecessary heap allocations and copying of data are avoided. I would guess this is the primary optimisation, accounting for about a 10x improvement in performance. LINQ is the culprit here and the functional niceties it provides have been replaced by iterative loops (a sketch of this change follows the list).

  2. The cost function is never calculated (only its gradient w.r.t. $\theta$). I would guess this accounts for an additional 2x improvement. I was using the cost function to adaptively decrease the alpha step size and also to exit the training loop early when a good enough solution was found. I still wanted to decrease alpha as the current solution nears a local minimum, as I’ve found this results in faster training times. However, this is now done by a fixed proportion each cycle (specified using the halfLives parameter). Also, the training now always executes a fixed number of iterations.

  3. Array indexing is avoided in favour of unsafe code blocks and pointer arithmetic. I would guess this accounts for only a small improvement, as I suspect standard loop constructs are already heavily optimised by the C# compiler.
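
To give a flavour of point 1, the dot product is the sort of thing that changed. The LINQ version (from the original post) allocates intermediate iterator objects on every call; the loop version allocates nothing on the heap (DotProductFast is an illustrative name, not necessarily what’s in the library):

    // Before: concise, but allocates intermediate iterator objects on every call.
    public static double DotProduct(double[] xs, double[] ys)
        => xs.Zip(ys, (x, y) => x * y).Sum();

    // After: a plain loop with no heap allocations.
    public static double DotProductFast(double[] xs, double[] ys)
    {
        double sum = 0.0;
        for (int i = 0; i < xs.Length; ++i)
        {
            sum += xs[i] * ys[i];
        }
        return sum;
    }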

Finally, I’ve added a new feature. Some of the classification tasks I have are pretty unbalanced (relative class frequency is 10:1 in the worst case), so I wanted to play with allowing some of the training data to have more impact on the optimisation than others. There is a new parameter weightClass0 which specifies the weight of class 0 training data relative to class 1.

Parameters that work well for my data set are:

  • iterations ~ 100k
  • regularisation constant (lambda) ~ 0 (I’ve not found overfitting to be a problem)
  • theta - zero vector
  • alpha ~ 0.5
  • halfLives ~ 3

Code is available on github and I’ve packaged up the library and put it on nuget. Most of the calculation is done in the Gradient method, the new implementation of which is shown below:

    unsafe public static void FastGradient(double[,] xs, double[] ys, double[] theta, double lambda, double weightClass0, double[] thetaGradientOut)
    {
        int thetaLen = theta.Length;

        fixed (double* theta_fixed = theta)
        fixed (double* ys_fixed = ys)
        fixed (double* xs_fixed = xs)
        fixed (double* thetaGradient_fixed = thetaGradientOut)
        {
            double* ysPtr = ys_fixed;
            double* xsPtr = xs_fixed;
            double* resultPtr = thetaGradient_fixed;
            double* thetaPtr = theta_fixed;

            // Zero the output gradient.
            for (int j=0; j<thetaLen; ++j)
            {
                *(resultPtr++) = 0.0;
            }

            double weightTotal = 0.0;
            for (int i=0; i<ys.Length; ++i)
            {
                // Class 1 examples get weight 1; class 0 examples are scaled by weightClass0.
                double weight = ys[i] > 0.5 ? 1.0 : weightClass0;
                weightTotal += weight;

                // h = theta . x for training example i.
                double* xsPtr_save = xsPtr;
                double h = 0;
                thetaPtr = theta_fixed;
                for (int j=0; j<thetaLen; ++j)
                {
                    h += *(thetaPtr++) * *(xsPtr++);
                }
                double delta = Sigmoid(h) - *(ysPtr++);

                // Accumulate this example's weighted contribution to the gradient.
                resultPtr = thetaGradient_fixed;
                for (int j=0; j<thetaLen; ++j)
                {
                    *(resultPtr++) += weight * delta * *(xsPtr_save++);
                }
            }

            // Normalise by the total weight, adding the regularisation term
            // lambda * theta_j to every component except theta_0.
            resultPtr = thetaGradient_fixed;
            *(resultPtr++) /= weightTotal;
            thetaPtr = theta_fixed + 1;
            for (int j=1; j<thetaLen; ++j)
            {
                *resultPtr += lambda * *(thetaPtr++);
                *(resultPtr++) /= weightTotal;
            }
        }
    }
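
For context, the driving training loop looks roughly like the sketch below (illustrative only - the Train signature here is hypothetical, not necessarily the library’s actual API). Alpha decays by a fixed proportion each iteration, chosen so that it halves halfLives times over the whole run:

    // Hypothetical driver for FastGradient - a sketch of the fixed-iteration
    // training loop described above; the parameters correspond to those listed earlier.
    public static double[] Train(double[,] xs, double[] ys, double[] theta,
        double lambda, double weightClass0, double alpha, double halfLives, int iterations)
    {
        // Per-iteration decay such that alpha halves 'halfLives' times in total.
        double decay = Math.Pow(0.5, halfLives / iterations);
        var gradient = new double[theta.Length];
        for (int i = 0; i < iterations; ++i)
        {
            FastGradient(xs, ys, theta, lambda, weightClass0, gradient);
            for (int j = 0; j < theta.Length; ++j)
            {
                theta[j] -= alpha * gradient[j];
            }
            alpha *= decay;
        }
        return theta;
    }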

That’s it!

Again, just wanted to put out a quick writeup of this - time to get back to using it in my super-exciting side project.

21 Jan 2017, 19:30

Logistic Classifier in C#

In this post we’ll walk through implementing a logistic classifier in C#. A couple of features of the language allow for an implementation that is particularly nice - LINQ and expression-bodied function members. Users of other languages take note!

For those who are just looking for an off-the-shelf library, the Apache-licensed source code is available on github and there is a package (’MathML’) on nuget. I went to the effort of implementing this because the main ML libraries for the .NET framework have yet to be ported over to dotnet core. It’s turned out to be worth the effort however - the exercise has proved to be a great way to obtain a deeper understanding of the behavior of the classifier.

The core idea is relatively simple - take a linear combination of the feature values $x$ (where the weights are specified by model parameters $\theta$) and apply a sigmoid function $g$ to the result, bounding it to the range $(0, 1)$:

$$ h(x, \theta) = g(\theta^Tx) = \frac{1}{1 + e^{(-\theta^Tx)}} $$

$h(x, \theta)$ - the ‘hypothesis function’ - is interpreted as the probability that the features $x$ correspond to an input of class ‘1’ (and $1-h(x, \theta)$ is interpreted as the probability the features correspond to an input of class ‘0’).

Some notes on features (a small worked example follows the list):

  • If you have a categorical feature with k levels, you would typically split this into (k-1) binary 0/1 features.
  • Typically the first ‘feature’ will simply be a constant value of 1 to give a constant term $\theta_0$ in the hypothesis function.
  • In some scenarios, you may want to introduce features that are simply some function (e.g. $x^2$) of other features to allow the classifier to differentiate more complex regions in the feature space.
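
As a concrete example (hypothetical data, purely to illustrate the encoding), a record with one numeric feature and a three-level categorical feature might become:

    // Hypothetical encoding: constant term, one numeric feature, and a
    // 3-level categorical feature split into 2 binary dummies
    // (level 0 is the baseline, encoded as both dummies = 0).
    public static double[] Encode(double age, int color /* 0, 1 or 2 */)
        => new double[]
        {
            1.0,                     // constant 'feature' -> theta_0
            age,                     // numeric feature
            color == 1 ? 1.0 : 0.0,  // dummy for level 1
            color == 2 ? 1.0 : 0.0,  // dummy for level 2
        };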

Here’s the C# code for the hypothesis function:

    // (These snippets assume 'using System;' and 'using System.Linq;'.)
    public static double Sigmoid(double v)
        => 1.0 / (1 + Math.Exp(-v));

    public static double DotProduct(double[] xs, double[] ys)
        => xs.Zip(ys, (x, y) => x * y).Sum();

    public static double Hypothesis(double[] x, double[] theta)
        => Sigmoid(DotProduct(x, theta));

    public static double Probability(double[] x, double[] theta)
        => Hypothesis(x, theta);

(nice!)

Of course the most difficult part is finding the vector $\theta$ such that the model is a good fit to our training set and hopefully generalizes well to new data.

How do we measure how good the model is? An appropriate cost function for a given input $x^{(i)}$ with known output (class, 0 or 1) $y^{(i)}$ is:

$$ J^{(i)}(\theta) = -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)})\log(1-h_\theta(x^{(i)})) $$

An intuitive justification of this is given here (note: unfortunately, I didn’t find a good formal justification of this online… but there are plenty of good books available). For a collection of feature vectors / classes, the total cost $J(\theta)$ is the mean of the individual cost values.
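
In fact, one formal way to see it: $h_\theta(x)$ defines a Bernoulli distribution over the class label, and the cost above is exactly its negative log-likelihood, so minimizing the total cost amounts to maximum-likelihood estimation:

$$ p(y^{(i)} \mid x^{(i)}; \theta) = h_\theta(x^{(i)})^{y^{(i)}}\big(1-h_\theta(x^{(i)})\big)^{1-y^{(i)}}, \qquad J^{(i)}(\theta) = -\log p(y^{(i)} \mid x^{(i)}; \theta) $$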

Here’s the C# code:

    public static double Cost(double[] x, double y, double[] theta)
        => -y * Math.Log(Hypothesis(x, theta)) - (1 - y) * Math.Log(1 - Hypothesis(x, theta));

    public static double Cost(double[][] xs, double[] ys, double[] theta)
        => xs.Zip(ys, (x, y) => Cost(x, y, theta)).Sum() / xs.Length;

Our goal is to find the model parameters $\theta$ that minimize the total cost associated with our training set. Gradient descent can be used to achieve this.

The gradient of the total cost function with respect to $\theta$ is:

$$ \frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} $$

(again, refer to a good book..). To find the optimal $\theta$, we can start with a naive guess and iteratively step in the direction opposite the gradient, by a small amount proportional to the gradient’s magnitude.
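
Concretely, each iteration simultaneously updates every component of $\theta$, where $\alpha$ is the proportionality constant (step size):

$$ \theta_j \leftarrow \theta_j - \alpha\frac{\partial J(\theta)}{\partial\theta_j} $$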

I’ve used a slightly atypical strategy for choosing the proportionality constant $\alpha$: start out with a relatively high value and when an iteration step results in a worse cost value (implying the algorithm has overstepped the local minimum), halve the current $\alpha$ and keep iterating. Repeat up to a maximum number of iterations.

    public static double[] ScalarProduct(double[] xs, double y)
        => xs.Select(x => x * y).ToArray();

    // Gradient contribution of a single training example.
    public static double[] Gradient(double[] x, double y, double[] theta)
        => ScalarProduct(x, (Hypothesis(x, theta) - y));

    // Component-wise sum of a collection of vectors.
    public static double[] VectorAdd(double[][] xs1)
        => xs1.Aggregate(new double[xs1[0].Length], (accum, xs) => accum.Zip(xs, (a, x) => a + x).ToArray());

    // Gradient over the whole training set. Note the 1/m factor in the
    // formula above is omitted here - it is effectively absorbed into the
    // step size alpha.
    public static double[] Gradient(double[][] xs, double[] ys, double[] theta)
        => VectorAdd(xs.Zip(ys, (x, y) => Gradient(x, y, theta)).ToArray());

    public static double[] VectorAdd(double[] xs, double[] ys)
        => xs.Zip(ys, (x, y) => x + y).ToArray();

    public static double[] Train(double[][] xs, double[] ys, double[] theta, double minAlpha, double maxAlpha, ref int iterations)
    {
        var alpha = maxAlpha;
        double prevCost = Classifier.Cost(xs, ys, theta);
        while (iterations-- > 0)
        {
            // Step against the gradient.
            theta = VectorAdd(theta, ScalarProduct(Gradient(xs, ys, theta), -alpha));
            var cost = Classifier.Cost(xs, ys, theta);
            if (cost > prevCost)
            {
                // Overstepped the local minimum - halve the step size.
                alpha /= 2;
                if (alpha < minAlpha)
                {
                    break;
                }
            }
            prevCost = cost;
        }
        return theta;
    }
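
Putting it together, usage looks something like this (the data is made up purely for illustration, and I’m assuming the methods above are in scope):

    // Toy training set: feature vectors (leading constant 1) and known classes.
    double[][] xs =
    {
        new[] { 1.0, 0.2, 1.3 },
        new[] { 1.0, 2.1, 0.4 },
        new[] { 1.0, 1.7, 2.2 },
    };
    double[] ys = { 0.0, 1.0, 1.0 };

    var theta = new double[3];  // initial guess: the zero vector
    int iterations = 10000;     // Train decrements this (it's a ref parameter)
    theta = Train(xs, ys, theta, 0.001, 0.5, ref iterations);

    // Probability that a new observation is of class 1.
    double p = Probability(new[] { 1.0, 1.5, 1.0 }, theta);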

In practice, the above algorithm can be susceptible to overfitting. For this reason, it is common to include a regularization term which acts to reduce the variance of $\theta$ (and hence the reliance of the model on any small subset of the features). The regularized cost function is:

$$ J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big[-y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)})\log(1-h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$

where $\lambda$ controls the degree of regularization. Note that the final sum excludes the constant model parameter $\theta_0$.

The regularized gradient function is:

$$ \frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\Big(\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} + \lambda\theta_j\Big) $$

for all components of $\theta$ except the constant term $\theta_0$, which should be calculated using the non-regularized gradient function.

The implementation is a straightforward extension of the non-regularized version.
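
For completeness, here’s a minimal sketch of what that extension might look like in the LINQ style used above (RegularizedGradient is an illustrative name; as with Gradient, the overall 1/m factor is absorbed into the step size):

    public static double[] RegularizedGradient(double[][] xs, double[] ys, double[] theta, double lambda)
    {
        var gradient = Gradient(xs, ys, theta);
        // Add lambda * theta_j to every component except the constant term theta_0.
        return gradient.Select((g, j) => j == 0 ? g : g + lambda * theta[j]).ToArray();
    }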

Ok, that’s it! I’m off to actually use this thing for something real.