Страница 2 из 5

Machine learning in practice – from PyTorch model to Kubeflow in the cloud for BigData

Shtoltc Eugeny

If the image is not only in the derived place, but there are other images, then to determine and it will take several layers of the neural network to perform the determination, the result of which will also be a map of the location of the digit, but making a decision about its location needs to be identified. Thus, the first layer will have the number of neurons displaying maps, which horizontally and vertically will correspond to the width and height of the minute leaf, corresponding to the width and height of the analyzing screen, divided by the step of shifting the analyzing window. The dimension of the second layer in neurons is equal to the dimension of the analyzed window in order to be able to identify the digit. If we make co

At the output, we will receive the activation of that output neuron that corresponds to a certain number. It does this on the basis of data received from neurons from the previous layer, which are responsible for the digit sectors, namely from which neurons the signals came and which ones did not. Let's take that the incoming signals from neurons through the co

When a neuron is trained with a teacher, we send training signals to it and get results at the output. For each input and output signal, we receive a result about the degree of error in prediction. When we ran all the training signals, we got a set (vector) of errors that can be represented as a function of errors. This error function depends on the input parameters (weights) and we need to find the weights at which this error function becomes minimal. To determine these weights, the Gradient Descent algorithm is used, the essence of which is to gradually move to the local target, and the direction of movement is determined by the derivative of this function and the activation function. The activation function is usually sigmoid for regular networks or truncated ReLU for deep networks. Sigmoid outputs a range from zero to one at all times. The truncated ReLU still allows for very large numbers (information is very important) at the input to transfer more than one to the output, there they themselves affect the layers that follow immediately after. For example, the dot above the dash separates the letter L from the letter i, and the information of one pixel affects the decision making at the output, so it is important not to lose this feature and transfer it to the last level. There are not so many varieties of activation functions – they are limited by the requirements for ease of training when it is required to take a derivative. So the sigmoid f after arbitrarily turns into f (1-f), which is effective. With Leaky ReLu (truncated ReLu with leakage) it is even simpler, since it takes the value 0 at x <0, then its wired in this section is also 0, and at x> = 0 it takes 0.01 * x, which with the derivative will be will be 0.01, and for x> 1 it takes 1 + 0.01 * x, which gives 0.01 for the derivative. Calculation is not required here at all, so learning is much faster.

Since we send the sum of the products of signals by their weights to the input of the activation function, then conceived, we need a different threshold level than from 0.5. We can shift it by a constant, adding it to the sum at the input to the activation function using the bias neuron to remember it. It has no inputs and always outputs one, and the offset itself is set by the weight of the co

When training a neuron, we know the error of the network itself, that is, on the input neurons. Based on them, you can calculate the error in the previous layer and so on up to the input – which is called the Backpropagation method.

The learning process itself can be divided into stages: initialization, learning itself, and prediction.

If our figure can be of different sizes, then pooling layers are applied, which scale the image down. Which algorithm will calculate what will be written when merging depends on the algorithm, usually this is the “max” function for the “max pooling” or “avg” algorithm (mean-square value of neighboring matrix cells) – average pooling.