Training a Neural Network - A Numerical Example by ralampay

<html>
<h1>Abstract</h1>
<p>Neural networks are models used to approximate a discriminative function for classification in a supervised learning fashion. The input is a set of <code>n</code>-dimensional numerical vectors representing features of the entities you would like to teach the network to classify (for example, 1000 emails labelled as spam and 1000 emails labelled as non-spam). This paper takes a quantitative approach to training a neural network using actual numerical examples. This allows the student/practitioner to understand the fundamentals of the training process, namely feed forward and backpropagation. The paper is intended to be light in concept, with specific examples for people getting into machine learning with neural networks.</p>
<h1>Introduction</h1>
<p>Neural networks are biologically inspired computational models that attempt to solve classification problems. They are composed of <strong>neurons</strong>, which hold and process values in the network. You can think of these values as signal strengths that aim to mimic how chemical reactions occur in the brain: the higher the value, the stronger the signal. In biology, neurons transmit and receive signals to and from other neurons by means of dendrites. These propagations are modelled in the neural network by weight values. Since a neuron may receive values from more than one neuron, it accounts for all the weights connected to it before attempting to fire a signal, thus simulating how we "react" to certain stimuli. Training the neural network roughly means looking for the optimal values for these weights, based on what we already know, so that the model properly "reacts" to a given input.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Neuron.svg/1200px-Neuron.svg.png" width="1200" height="645"/><br>
(Image taken from <a href="https://en.wikipedia.org/wiki/Neuron">Wikipedia</a>)</p>
<p>There are typically three kinds of layers in a classical neural network: the <strong>input layer</strong>, <strong>hidden layer/s</strong> and <strong>output layer</strong>. These layers are connected to one another through the weights of their neurons, and only adjacent layers are connected, so a layer can be connected to at most two other layers. The <strong>input layer</strong> is always the leftmost layer, the <strong>output layer</strong> is always the rightmost layer, and everything in between is a <strong>hidden layer</strong>. The following illustration shows a classical neural network:</p>
<p><img src="https://i.imgsafe.org/50/50383b5d31.png" width="541" height="381"/></p>
<p><br></p>
<p>The image above will be the basis for the numeric computations in the following sections. But first, let's talk about the different components of the neural network.</p>
<h2>Input Layer</h2>
<p>The input layer contains neurons that represent the input values, or what the network initially receives from the real world. These values are features that represent a classification/label that we'd like to recognize. In the mathematical model <code>f(x) = y</code>, this would be the <code>x</code> as an <code>n</code>-dimensional vector. For example, suppose we'd like to show the network an image of a face represented by a matrix of pixel values, and the size of the image is 32 x 32 pixels. The input layer would then have a total of <code>1024</code> neurons, each one corresponding to a pixel value. We refer to this vector as a <strong>feature vector</strong>.</p>
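<p>As a rough companion in code (Python with numpy), here is a minimal sketch of how such a feature vector might be built. The pixel values below are random placeholders, not real data:</p>
<pre><code>
import numpy as np

# Hypothetical 32 x 32 grayscale image; pixel values are random placeholders.
image = np.random.rand(32, 32)

# Flatten into a 1 x 1024 feature vector: one input neuron per pixel.
x = image.reshape(1, -1)
print(x.shape)  # (1, 1024)
</code></pre>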
<h2>Hidden Layer/s</h2>
<p>From the input layer, information is passed to a hidden layer, which contains neurons that process the signals received from its adjacent layer/s (either from the left or from the right). A neural network can have more than one hidden layer. Neurons in the hidden layer are often denoted as <code>z_i</code>, and we refer to them as <strong>latent variables</strong>.</p>
<h2>Output Layer</h2>
<p>The output layer contains neurons that represent the output of the network, or the result/reaction of the network after receiving and processing the input (from the input layer to the hidden layer and finally to the output layer). The easiest way to model neurons in the output layer is to treat each one as a classification/label that the network is trying to recognize, with a value ranging from <code>0</code> to <code>1</code>. The closer an output neuron's value is to <code>1</code>, the more the network thinks the given input <code>x</code> belongs to that classification or label. For example, let's say we're trying to learn how to differentiate cats from dogs and from any other animal. The set of possible outputs (cat, dog or others) can be represented by:</p>
<p><img src="https://i.imgsafe.org/50/50658f2836.png" width="229" height="68"/></p>
<p>where <code>y_1</code> represents the label cat, <code>y_2</code> dog and <code>y_3</code> others. To the neural network, a cat would then look like <code>f(x) = [1, 0, 0]</code>, a dog would be <code>f(x) = [0, 1, 0]</code> and any other animal <code>f(x) = [0, 0, 1]</code>.</p>
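<p>To make the encoding concrete, a minimal numpy sketch of these one-hot target vectors could look like this:</p>
<pre><code>
import numpy as np

# One-hot target vectors for the three labels in the text.
targets = {
    "cat":    np.array([1, 0, 0]),
    "dog":    np.array([0, 1, 0]),
    "others": np.array([0, 0, 1]),
}

print(targets["cat"])   # [1 0 0]
print(targets["dog"])   # [0 1 0]
</code></pre>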
<h1>Training</h1>
<p>To train a neural network means to optimize the set of weights (the connections between neurons) in such a way that when we give it a feature vector representing a cat, the output is close to <code>f(x) = [1, 0, 0]</code>. At the beginning of training, the weights are randomly initialized. We then feed the network a feature vector corresponding to a cat and see what the <code>Y</code> value is --- how close did the network get to the actual answer? This closeness is measured by a <strong>loss function</strong>. A higher value for this function means that the network yielded a larger error, while a smaller value means that the network is getting close to recognizing what the input is. We use this error value to adjust the weights accordingly. The first step in this entire training process is called <strong>feed forward</strong>.</p>
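<p>For readers following along in code, the random initialization step can be sketched as follows. The layer sizes here are illustrative assumptions; the actual small network and its initial weights are shown in the next figure:</p>
<pre><code>
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes; the figure below uses its own small network.
n_input, n_hidden, n_output = 2, 2, 3

# Randomly initialized weight matrices; training will adjust these values.
W1 = rng.random((n_input, n_hidden))    # weights from X to Z
W2 = rng.random((n_hidden, n_output))   # weights from Z to Y
</code></pre>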
<p>For the rest of the discussion, we will be referring to the following figure for the initial weights:</p>
<p><img src="https://i.imgsafe.org/50/50a45a90c8.png" width="308" height="214"/></p>
<h2>Feed Forward</h2>
<p>The feed forward process can be thought of as the movement of information from the input layer, in the form of the feature vector's values, to the first hidden layer, finally producing the guess (classification) in the output layer. Given two consecutive layers, information is passed from the left layer to the right layer by performing a matrix multiplication between the left layer's neurons and the weight matrix between the two layers. The resulting matrix has the same size as the right layer (1 row with <code>n</code> columns, where <code>n</code> is the number of neurons in the right layer). Let's take the following illustration:</p>
<p>We're trying to solve for the values of <code>Z</code> by matrix multiplying <code>X</code> with <code>W_1</code>. The result will then be passed to an activation function such as the <strong>sigmoid function</strong> given by:</p>
<p><img src="https://i.imgsafe.org/50/50b06635b4.png" width="215" height="79"/></p>
<p>Mathematically, we can then represent information flow from <code>X</code> to <code>Z</code>:</p>
<p><img src="https://i.imgsafe.org/50/50b3500ecf.png" width="191" height="66"/></p>
<p>Numerically we have the following:</p>
<p><img src="https://i.imgsafe.org/50/50b5ee85ea.png" width="411" height="263"/></p>
<p>Using the same approach, we feed forward <code>Z</code>'s values towards <code>Y</code> by matrix multiplying it with <code>W_2</code>.</p>
<p>Mathematically:</p>
<p><img src="https://i.imgsafe.org/50/50eeceb3a6.png" width="196" height="58"/></p>
<p>Numerically:</p>
<p><img src="https://i.imgsafe.org/50/50f057598f.png" width="388" height="286"/></p>
<h2>Loss Function</h2>
<p>The loss function determines how far (or close) the guess of the network (<code>Y</code>) is to the actual classification value (<code>Y^</code>). Remember that we want to teach the network how to recognize a certain classification by adjusting its weights. The amount of adjustment we make depends largely on the value given by the <strong>loss function</strong>. For this paper, we will be using a very simple loss function:</p>
<p><img src="https://i.imgsafe.org/50/50f49de25c.png" width="264" height="92"/></p>
<p>In the case of our example, <code>Y^=[1, 0, 0]</code>, the value of our loss function will be computed as:</p>
<p><img src="https://i.imgsafe.org/51/5118c99ccb.png" width="539" height="425"/></p>
<h2>Back Propagation</h2>
<p>Once we get the error value from the loss function, we can use it to determine how much change to apply to the weights in order to minimize that error. This is done by a process called back propagation, which takes the individual error values and cascades them from the output layer back to the input layer, measuring how much each weight contributed to the overall error. Integral to this process is solving for <strong>gradients</strong>. These gradients are approximations of derivatives. As with derivatives, gradient values indicate the direction in which the error of the network changes. If we know these values, we can roughly determine how to adjust our weights to minimize the error. To simplify things, since the weight values drive the error value, we'd like to determine the gradient values for each neuron (since weights are attached to neurons) starting from the output layer (hence backward propagation) to give us the <strong>delta weights</strong>, or by how much we should adjust the original weights to lower the error. Numerically, this means that the size of our gradient vector will be the same as the number of neurons in a given layer.</p>
<p>For a more specific example, we'll break down this process into two major operations. The first part performs back propagation from the output layer to the last hidden layer. The second part goes from the last hidden layer down to the input layer. Similar to feed forward, we will be computing values with 3 inputs -- 2 layers and the weight matrix in between them. For each pair of layers, the gradient values are computed for the right layer.</p>
<h3>BP from Output to Last Hidden Layer</h3>
<p>To start off, we compute the gradients for the right layer in this part of BP which in this case is the output layer (<code>Y</code>). Gradients are computed by taking the product of the first derivatives of equations used in the model. Take the <strong>loss function</strong> for each <code>y_i</code> for instance:</p>
<p><img src="https://i.imgsafe.org/51/512a3579fc.png" width="182" height="72"/></p>
<p>The first derivative <code>e'_i</code> for the <strong>loss function</strong> would be:</p>
<p><img src="https://i.imgsafe.org/51/512c6ab6d3.png" width="156" height="77"/></p>
<p>We also take the derivative for the neurons in the right layer (output layer in this case) which we will refer to as <code>Y'</code>. The activation function we used was:</p>
<p><img src="https://i.imgsafe.org/51/5138c406bc.png" width="195" height="77"/></p>
<p>The derivative of a given output <code>y'_i</code> would approximately be:</p>
<p><img src="https://i.imgsafe.org/51/513b3532b9.png" width="202" height="124"/></p>
<p>Given these derivatives, we can then compute for our gradients <code>G_h</code> (with a one to one correspondence to the right layer / output layer in this case). We can do this by using the following equation:</p>
<p><img src="https://i.imgsafe.org/51/513d7998b6.png" width="356" height="57"/></p>
<p>Plugging in the necessary values we have:</p>
<p><img src="https://i.imgsafe.org/51/513ff902a6.png" width="614" height="114"/></p>
<p>Once we have the gradient values, we can use them to compute the change in weight <code>dW_2</code>, which we will subtract from the original weights to get the updated weights <code>W^_2</code> (we're at index <code>2</code> because, again, we're moving from the last layer down to the input layer). At this part of the process, the delta weights can be computed by multiplying the transpose of our gradients <code>G</code> with the output of the left layer -- in this case <code>Z</code>. We then transpose the result to have the same shape as <code>W_2</code>.</p>
<p><img src="https://i.imgsafe.org/51/51424704de.png" width="182" height="58"/></p>
<p>Plugging in the values we have:</p>
<p><img src="https://i.imgsafe.org/51/5144d7af14.png" width="401" height="266"/></p>
<p>Finally we can update the weights from <code>Z</code> to <code>Y</code>:</p>
<p><img src="https://i.imgsafe.org/51/5146ce7294.png" width="629" height="230"/></p>
<h3>BP from Last Hidden Layer</h3>
<p>Computing the gradients and updated weights from the last hidden layer <code>Z</code> down to the input layer <code>X</code> is a bit different but generally follows the same process. The major difference in this step of back propagation is that the gradients computed in the previous operation (the gradients corresponding to <code>Y</code>) become part of the computation. We shall refer to these as <code>G_p</code> where:</p>
<p><img src="https://i.imgsafe.org/51/51ad04dddc.png" width="117" height="46"/></p>
<p>We then solve for a new <code>G</code> which corresponds to the gradients of the right layer in this operation (<code>Z</code>). Aside from the previous operation's gradients, we also have to account for the old weights from the previous operation (<code>W_2</code>). We matrix multiply <code>G_p</code> with the transpose of <code>W_2</code> to get a result with the same shape as the right layer <code>Z</code>. This result is then multiplied element-wise with the derivatives of <code>Z</code>, denoted <code>Z'</code>. The operation is as follows:</p>
<p><img src="https://i.imgsafe.org/51/51b18d7b70.png" width="201" height="60"/></p>
<p>Let's solve for the derivatives <code>Z'</code> first, using the same derivative equation as <code>Y'</code>:</p>
<p><img src="https://i.imgsafe.org/51/51b773ded5.png" width="665" height="50"/></p>
<p>Next we solve for <code>G_p W_p^T</code> (where <code>W_p</code> denotes the previous operation's weights, <code>W_2</code>):</p>
<p><img src="https://i.imgsafe.org/51/51bae6c547.png" width="646" height="299"/></p>
<p>Finally, putting them all together, we solve for <code>G = (G_p W_p^T) × Z'</code>, where <code>×</code> denotes element-wise multiplication:</p>
<p><img src="https://i.imgsafe.org/51/51be7f0d35.png" width="721" height="51"/></p>
<p>We can then extract the delta weights <code>dW_1</code> using the following:</p>
<p><img src="https://i.imgsafe.org/51/51c2f406c6.png" width="162" height="62"/></p>
<p>Plugging in the values we have:</p>
<p><img src="https://i.imgsafe.org/51/51c780035a.png" width="836" height="147"/></p>
<p>If you have a deeper neural network, the same equations can be applied to each successive pair of layers.</p>
<p>Now we can update <code>W_1</code> to <code>W^_1</code> accordingly by subtracting <code>dW_1</code> from it, similar to what we did with <code>W^_2</code>:</p>
<p><img src="https://i.imgsafe.org/51/51cb6c4bc0.png" width="202" height="69"/></p>
<p><img src="https://i.imgsafe.org/51/51cd8ee143.png" width="991" height="131"/></p>
<p>We now have the following updated weights for our network:</p>
<p><img src="https://i.imgsafe.org/51/51d06d7b13.png" width="495" height="193"/></p>
<p>To test whether we have indeed trained the network, we'll use the same input and perform a feed forward pass using the updated weights. This can be expressed mathematically as:</p>
<p><img src="https://i.imgsafe.org/51/51d5f5c7d0.png" width="165" height="96"/></p>
<p>Plugging in the values we have:</p>
<p><img src="https://i.imgsafe.org/51/51dafc1888.png" width="661" height="203"/></p>
<p><img src="https://i.imgsafe.org/51/51dd360c11.png" width="903" height="84"/></p>
<p><img src="https://i.imgsafe.org/51/51df70c20c.png" width="691" height="90"/></p>
<p>Finally, we use the <strong>loss function</strong> to see if the network has improved (a lower total error indicates improvement):</p>
<p><img src="https://i.imgsafe.org/51/51e27cac8c.png" width="660" height="348"/></p>
<p>We now have <code>E(Y, Y^) = 0.558976</code>, which is less than the initial error of <code>E(Y, Y^) = 0.56945</code>.</p>
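<p>Putting the pieces together, a quick check like the one below re-runs the feed forward with the updated weights and compares the loss before and after the update. It uses the same placeholder numbers and assumed squared loss as the sketches above, so its two error values will differ from the 0.56945 and 0.558976 computed in the figures, but the second value should likewise be smaller:</p>
<pre><code>
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def loss(Y, Y_hat):
    return float(np.sum(0.5 * (Y - Y_hat) ** 2))

# Placeholder input, target and initial weights (illustrative only).
X     = np.array([[1.0, 0.5]])
Y_hat = np.array([[1.0, 0.0, 0.0]])
W1    = np.array([[0.2, 0.8], [0.4, 0.6]])
W2    = np.array([[0.3, 0.7, 0.5], [0.6, 0.1, 0.9]])

# Feed forward with the initial weights.
Z = sigmoid(X @ W1)
Y = sigmoid(Z @ W2)
error_before = loss(Y, Y_hat)

# One round of back propagation, as described above.
G2  = (Y - Y_hat) * Y * (1.0 - Y)   # output-layer gradients
dW2 = (G2.T @ Z).T
G1  = (G2 @ W2.T) * Z * (1.0 - Z)   # hidden-layer gradients
dW1 = (G1.T @ X).T
W1, W2 = W1 - dW1, W2 - dW2

# Feed forward again with the updated weights.
Z = sigmoid(X @ W1)
Y = sigmoid(Z @ W2)
error_after = loss(Y, Y_hat)

print(error_before, error_after)   # the second value should be smaller
</code></pre>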
<h1>Summary</h1>
<p>This paper shows a numeric example of how we can train a classical fully connected neural network. "Training" a neural network refers to optimizing its weights to reduce the value of the <strong>loss function</strong> given a target <code>Y^</code>. In this case, we used an optimization process called back propagation, which takes the derivatives of the error values and propagates them backwards to compute gradients that indicate in which direction the weights should move in order to minimize the error.</p>
<p><br></p>
</html>
@neurallearner ·
Come and learn [how AI processes images](https://steemit.com/ai/@neurallearner/ai-introduction-part-2-images-and-and-convolutional-neural-network) :)