matt's dotnet blog
https://dotnet.blog/index.xml
Recent content on matt's dotnet blog (Hugo -- gohugo.io, en-us)

Improved Logistic Classifier in C#
https://dotnet.blog/posts/improved-logistic-classifier-csharp/
Sun, 05 Feb 2017 17:00:00 +0000
<p><a href="https://dotnet.blog/posts/logistic-classifier-csharp/">Last weekend</a> I implemented a logistic classifier for use in
a side-project I’ve been working on for the last few months
(stay tuned on that). For anyone reading along (probably no one), you may recall I was quite enamoured with the beauty
with which C# allowed the mathematics to be expressed in code.</p>
<p>Unfortunately, the implementation is annoyingly slow for the quantity of data I want to use it on, so this weekend
I’ve built an optimised version. It’s approximately <strong>20 times faster</strong>. Here’s what makes it faster:</p>
<ol>
<li><p>Unnecessary heap allocations and copying of data are avoided. I would guess this is the primary optimisation and
accounts for about 10x improvement in performance. LINQ is the culprit here and the functional niceties it
provides have been replaced by iterative loops.</p></li>
<li><p>The cost function is never calculated (only its gradient w.r.t. <code>$\theta$</code>). I would guess this accounts for an
additional 2x improvement. I was using the cost function to adaptively decrease the alpha step size and also to
exit the training loop early when a good enough solution was found. I still wanted to decrease alpha as the
current solution approaches a local minimum, as I’ve found this results in faster training times. However, this is now done
by a fixed proportion each cycle (specified using the <code>halfLives</code> parameter). Also, the training now always
executes a fixed number of iterations.</p></li>
<li><p>Array indexing is avoided in favour of unsafe code blocks and pointer arithmetic. I would guess this probably only
accounts for a small improvement, as I suspect standard loop constructs are heavily optimised by the C# compiler.</p></li>
</ol>
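<p>To give a flavour of the change in point 1, here's a before/after sketch of a dot product (illustrative only, not the exact code from the library). The LINQ version allocates an enumerator and invokes a delegate per element; the loop version does neither:</p>
<pre><code>// LINQ: concise, but allocates an enumerator and invokes a delegate per element.
public static double DotProductLinq(double[] xs, double[] ys)
    => xs.Zip(ys, (x, y) => x * y).Sum();

// Iterative: no allocations, just index arithmetic.
public static double DotProductLoop(double[] xs, double[] ys)
{
    double sum = 0.0;
    for (int i = 0; i < xs.Length; ++i)
    {
        sum += xs[i] * ys[i];
    }
    return sum;
}
</code></pre>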
<p>Finally, I’ve added a new feature. Some of the classification tasks I have are pretty unbalanced (relative class
frequency is 10:1 in the worst case), so I wanted to play with allowing some of the training data to have more
impact on the optimisation than others. There is a new parameter <code>weightClass0</code> which specifies the weight of
class 0 training data relative to class 1.</p>
<p>Parameters that work well for my data set are:</p>
<ul>
<li>iterations ~ 100k</li>
<li>regularisation constant (lambda) ~ 0 (I’ve not found overfitting to be a problem)</li>
<li>theta - zero vector</li>
<li>alpha ~ 0.5</li>
<li>halfLives ~ 3</li>
</ul>
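<p>To make the <code>halfLives</code> mechanics concrete, here's a sketch of the kind of training loop that drives <code>FastGradient</code> (the method name and signature here are illustrative, not the library's actual API). Alpha decays by a fixed proportion each iteration so that it halves <code>halfLives</code> times over the full run:</p>
<pre><code>public static double[] TrainSketch(double[,] xs, double[] ys, double[] theta,
    double lambda, double weightClass0, double alpha, double halfLives, int iterations)
{
    // per-iteration decay factor, chosen so that decay^iterations == 0.5^halfLives.
    double decay = Math.Pow(0.5, halfLives / iterations);
    var gradient = new double[theta.Length];
    for (int i = 0; i < iterations; ++i)
    {
        FastGradient(xs, ys, theta, lambda, weightClass0, gradient);
        for (int j = 0; j < theta.Length; ++j)
        {
            theta[j] -= alpha * gradient[j];  // step against the gradient
        }
        alpha *= decay;
    }
    return theta;
}
</code></pre>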
<p>Code is available on <a href="https://github.com/mhowlett/MathML">github</a> and I’ve packaged up the library and put it on
<a href="https://www.nuget.org/packages/MathML/">nuget</a>.
Most of the calculation is done in the Gradient method, the new implementation of which is shown below:</p>
<pre><code>unsafe public static void FastGradient(double[,] xs, double[] ys, double[] theta, double lambda, double weightClass0, double[] thetaGradientOut)
{
    int thetaLen = theta.Length;
    fixed (double* theta_fixed = theta)
    fixed (double* ys_fixed = ys)
    fixed (double* xs_fixed = xs)
    fixed (double* thetaGradient_fixed = thetaGradientOut)
    {
        double* ysPtr = ys_fixed;
        double* xsPtr = xs_fixed;
        double* resultPtr = thetaGradient_fixed;
        double* thetaPtr = theta_fixed;

        // zero the output gradient.
        for (int j = 0; j < thetaLen; ++j)
        {
            *(resultPtr++) = 0.0;
        }

        // accumulate the (weighted) gradient contribution of each training example.
        double weightTotal = 0.0;
        for (int i = 0; i < ys.Length; ++i)
        {
            double weight = ys[i] > 0.5 ? 1.0 : weightClass0;
            weightTotal += weight;
            double* xsPtr_save = xsPtr;

            // h = theta . x, the linear combination for this example.
            double h = 0;
            thetaPtr = theta_fixed;
            for (int j = 0; j < thetaLen; ++j)
            {
                h += *(thetaPtr++) * *(xsPtr++);
            }

            double delta = Sigmoid(h) - *(ysPtr++);
            resultPtr = thetaGradient_fixed;
            for (int j = 0; j < thetaLen; ++j)
            {
                *(resultPtr++) += weight * delta * *(xsPtr_save++);
            }
        }

        // normalize by the total weight; add the regularization term for all
        // components except the constant term theta_0.
        resultPtr = thetaGradient_fixed;
        *(resultPtr++) /= weightTotal;
        thetaPtr = theta_fixed + 1;
        for (int j = 1; j < thetaLen; ++j)
        {
            *resultPtr += lambda * *(thetaPtr++);
            *(resultPtr++) /= weightTotal;
        }
    }
}
</code></pre>
<p>That’s it!</p>
<p>Again, just wanted to put out a quick writeup of this - time to get back to using it in my super-exciting
side project.</p>
Logistic Classifier in C#
https://dotnet.blog/posts/logistic-classifier-csharp/
Sat, 21 Jan 2017 19:30:00 +0000
<p>In this post we’ll walk through implementing a logistic classifier in C#. A couple of features of the
language allow for an implementation that is particularly nice -
<a href="https://msdn.microsoft.com/en-us/library/bb397933.aspx">LINQ</a> and
<a href="https://visualstudiomagazine.com/articles/2015/06/03/c-sharp-6-expression-bodied-properties-dictionary-initializer.aspx">expression bodied function members</a>. Users of other languages take note!</p>
<p>For those who are just looking for an off-the-shelf library, the Apache-licensed
<a href="https://github.com/mhowlett/MathML">source code</a> is available on github and there is a
package (‘<a href="https://www.nuget.org/packages/MathML/">MathML</a>’) on nuget. I went to the effort of
implementing this because the main ML libraries for the .NET framework have yet to be ported over to
<a href="https://www.microsoft.com/net/core">dotnet core</a>. It’s turned out to be worth the effort however -
the exercise has proved to be a great way to obtain a deeper understanding of the
behavior of the classifier.</p>
<p>The core idea is relatively simple - take a linear combination of the feature values
<code>$x$</code> (where the weights are specified by
model parameters <code>$\theta$</code>) and apply a sigmoid function <code>$g$</code> to the result, bounding it to the range
<code>$(0, 1)$</code>:</p>
<div>$$
h(x, \theta) = g(\theta^Tx) = \frac{1}{1 + e^{(-\theta^Tx)}}
$$</div>
<p><code>$h(x, \theta)$</code> - the ‘hypothesis function’ - is interpreted as the probability that the features
<code>$x$</code> correspond to an input of class ‘1’ (and <code>$1-h(x, \theta)$</code> is interpreted as the
probability the features correspond to an input of class ‘0’).</p>
<p>Some notes on features:</p>
<ul>
<li>If you have a categorical feature with k levels, you would typically split this into (k-1)
binary 0/1 features.</li>
<li>Typically the first ‘feature’ will simply be a constant value of 1 to give a constant term
<code>$\theta_0$</code> in the hypothesis function.</li>
<li>In some scenarios, you may want to introduce features that are simply some function (e.g. <code>$x^2$</code>)
of other features to allow the classifier to differentiate more complex regions in the
feature space.</li>
</ul>
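<p>For example, a three-level colour feature might be encoded as two 0/1 indicators (with ‘blue’ as the reference level) plus the leading constant feature (a hand-rolled sketch, not part of the library):</p>
<pre><code>public static double[] EncodeColour(string colour)
    => new double[]
    {
        1.0,                           // constant term (gives theta_0)
        colour == "red" ? 1.0 : 0.0,   // indicator for 'red'
        colour == "green" ? 1.0 : 0.0, // indicator for 'green'; 'blue' leaves both zero
    };
</code></pre>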
<p>Here’s the C# code for the hypothesis function:</p>
<pre><code>public static double Sigmoid(double v)
    => 1.0 / (1 + Math.Exp(-v));

public static double DotProduct(double[] xs, double[] ys)
    => xs.Zip(ys, (x, y) => x * y).Sum();

public static double Hypothesis(double[] x, double[] theta)
    => Sigmoid(DotProduct(x, theta));

public static double Probability(double[] x, double[] theta)
    => Hypothesis(x, theta);
</code></pre>
<p>(nice!)</p>
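<p>As a quick sanity check (assuming these methods live on the static <code>Classifier</code> class, as in the library): with <code>$\theta$</code> equal to the zero vector, every input maps to probability 0.5, since <code>$g(0) = 0.5$</code>:</p>
<pre><code>var theta = new double[] { 0.0, 0.0 };        // zero vector: no information yet
var x = new double[] { 1.0, 2.5 };            // constant term, one feature value
double p = Classifier.Probability(x, theta);  // sigmoid(0) = 0.5
</code></pre>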
<p>Of course the most difficult part is finding the vector <code>$\theta$</code> such that the model is a good
fit to our training set and hopefully generalizes well to new data.</p>
<p>How do we measure how good the model is? An appropriate cost function for a given input
<code>$x^{(i)}$</code> with known output (class, 0 or 1) <code>$y^{(i)}$</code> is:</p>
<div>$$
J^{(i)}(\theta) = -y^{(i)} log(h_\theta(x^{(i)})) - (1 - y^{(i)})log(1-h_\theta(x^{(i)}))
$$</div>
<p>An intuitive justification of this is given
<a href="https://www.scribd.com/document/330990131/TMI05-2-logistic-regression-pdf">here</a> (note: unfortunately,
I didn’t find a good formal justification of this online… but there are plenty of good books available). For a collection of feature vectors / classes, the
total cost <code>$J(\theta)$</code> is simply the sum of the individual cost values.</p>
<p>Here’s the C# code:</p>
<pre><code>public static double Cost(double[] x, double y, double[] theta)
    => -y * Math.Log(Hypothesis(x, theta)) - (1 - y) * Math.Log(1 - Hypothesis(x, theta));

public static double Cost(double[][] xs, double[] ys, double[] theta)
    => xs.Zip(ys, (x, y) => Cost(x, y, theta)).Sum() / xs.Length;
</code></pre>
<p>Our goal is to find the model parameters <code>$\theta$</code> that minimize the total cost associated with
our training set. Gradient descent can be used to achieve this.</p>
<p>The gradient of the total cost function with respect to <code>$\theta$</code> is:</p>
<div>$$
\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}
$$</div>
<p>(again, refer to a good book..). To find the optimal <code>$\theta$</code>, we
can start with a naive guess and iteratively step in the direction of the negative gradient, by an amount
proportional to its magnitude.</p>
<p>I’ve used a slightly atypical strategy for choosing the proportionality constant <code>$\alpha$</code>: start
out with a relatively high value and when an iteration step results in a worse cost value (implying the algorithm
has overstepped the local minimum), halve the current <code>$\alpha$</code> and keep iterating. Repeat up to a maximum number
of iterations.</p>
<pre><code>public static double[] ScalarProduct(double[] xs, double y)
    => xs.Select(x => x * y).ToArray();

public static double[] Gradient(double[] x, double y, double[] theta)
    => ScalarProduct(x, (Hypothesis(x, theta) - y));

public static double[] VectorAdd(double[][] xs1)
    => xs1.Aggregate(new double[xs1[0].Length], (accum, xs) => accum.Zip(xs, (a, x) => a + x).ToArray());

// sum the per-example gradients and scale by 1/m, as in the formula above.
public static double[] Gradient(double[][] xs, double[] ys, double[] theta)
    => ScalarProduct(VectorAdd(xs.Zip(ys, (x, y) => Gradient(x, y, theta)).ToArray()), 1.0 / xs.Length);

public static double[] VectorAdd(double[] xs, double[] ys)
    => xs.Zip(ys, (x, y) => x + y).ToArray();

public static double[] Train(double[][] xs, double[] ys, double[] theta, double minAlpha, double maxAlpha, ref int iterations)
{
    var alpha = maxAlpha;
    double prevCost = Classifier.Cost(xs, ys, theta);
    while (iterations-- > 0)
    {
        // step in the direction of the negative gradient.
        theta = VectorAdd(theta, ScalarProduct(Gradient(xs, ys, theta), -alpha));
        var cost = Classifier.Cost(xs, ys, theta);
        if (cost > prevCost)
        {
            // overstepped the local minimum: halve alpha and keep iterating.
            alpha /= 2;
            if (alpha < minAlpha)
            {
                break;
            }
        }
        prevCost = cost;
    }
    return theta;
}
</code></pre>
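<p>Usage looks like this (the data here is a made-up toy example):</p>
<pre><code>// toy data: class 0 near x=0.1, class 1 near x=0.9; the first feature is the constant 1.
var xs = new double[][]
{
    new double[] { 1.0, 0.1 },
    new double[] { 1.0, 0.9 },
};
var ys = new double[] { 0.0, 1.0 };
var theta = new double[] { 0.0, 0.0 };  // start from the zero vector

int iterations = 10000;
theta = Classifier.Train(xs, ys, theta, 0.001, 1.0, ref iterations);
</code></pre>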
<p>In practice, the above algorithm can be susceptible to overfitting. For this reason, it is common to include a
regularization term which acts to shrink the components of <code>$\theta$</code> towards zero (and hence reduce the reliance of the
model on any small subset of the features). The regularized cost function is:</p>
<div>$$
J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big[-y^{(i)} log(h_\theta(x^{(i)})) - (1 - y^{(i)})log(1-h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
$$</div>
<p>where <code>$\lambda$</code> controls the degree of regularization. Note that the final sum excludes the constant
model parameter <code>$\theta_0$</code>.</p>
<p>The regularized gradient function is:</p>
<div>$$
\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\Big(\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} + \lambda\theta_j\Big)
$$</div>
<p>for all components of <code>$\theta$</code> except the constant term <code>$\theta_0$</code>, which should be calculated using
the non-regularized gradient function.</p>
<p>The implementation is a straightforward extension of the non-regularized version.</p>
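<p>For the record, it might look something like this (a sketch building on the <code>Cost</code> and <code>Gradient</code> methods above; the library's actual method names may differ). Note that the regularization terms skip the constant term <code>$\theta_0$</code>, and the sketch assumes <code>Gradient</code> returns the <code>$1/m$</code>-scaled gradient of the formula:</p>
<pre><code>public static double RegularizedCost(double[][] xs, double[] ys, double[] theta, double lambda)
    // Skip(1) excludes theta_0 from the penalty, as in the formula.
    => Cost(xs, ys, theta) + lambda / (2.0 * xs.Length) * theta.Skip(1).Sum(t => t * t);

public static double[] RegularizedGradient(double[][] xs, double[] ys, double[] theta, double lambda)
{
    var grad = Gradient(xs, ys, theta);
    for (int j = 1; j < theta.Length; ++j)  // skip theta_0
    {
        grad[j] += lambda / xs.Length * theta[j];
    }
    return grad;
}
</code></pre>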
<p>Ok, that’s it! I’m off to actually use this thing for something real.</p>