I always believed that to understand something it’s best to start from the basics.
From my experience, the deeper into the basics I go, the easier it is for me to understand higher-level problems in the subject.
Therefore, to understand the general concept of neural networks and artificial intelligence I started from one point. One data point. Usually we have a lot of data, but all of it is subjected to the same operations.

A data point can represent any information: an image, a sound, a price. This is our input, something we can measure and process further.
In computer science, an input data point is described as a vector. This is just a series of numbers. Mathematically speaking, each number represents a dimension. Practically, each number describes a feature, some significant characteristic of the data point. For example, the features of an image could be the colours of its pixels. An image can therefore be described as a long vector, a series of numbers, each number assigned to a colour on a palette. We can also decrease the number of features (decrease the dimensionality). In the case of an image, we can decrease the dimensionality by describing not individual pixels but groups of pixels with one number.
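As a minimal sketch of this idea (the tiny 4x4 "image" and the 2x2 block size below are made-up values, not real data): an image can be flattened into one long vector of pixel values, and its dimensionality can be reduced by describing each block of pixels with a single number.

```python
import numpy as np

# A tiny 4x4 grayscale "image": each number is one pixel's brightness (a feature).
image = np.array([
    [0.0, 0.1, 0.9, 1.0],
    [0.2, 0.1, 0.8, 0.9],
    [0.0, 0.0, 0.1, 0.2],
    [0.1, 0.0, 0.0, 0.1],
])

# As a data point, the image becomes one long vector of 16 features.
vector = image.flatten()
print(vector.shape)  # (16,)

# Reduce dimensionality: describe each 2x2 group of pixels by its average,
# so 16 features become 4.
reduced = image.reshape(2, 2, 2, 2).mean(axis=(1, 3))
print(reduced.flatten().shape)  # (4,)
```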
One of the big challenges in machine learning is how to describe data accurately but with the minimal necessary number of features. If there are too many features, we have long (high-dimensional) vectors, and consequently we need more data points and more computational power to work with them. If we have too few features, we may not describe our data point accurately enough to be able to use it for predictions.
Based on the input, we want to be able to predict the output. The output is another data point, for example an image label (cat, face, landscape), the meaning of a sound, or a price increase/decrease.
How can we find the relationship between the input and the output?

We need to find a function which transforms the input vector into the output vector. It’s like y = ax + b, which transforms x into y; x and y in the equation can be single numbers, but they can also be vectors built out of many numbers.
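Here is a minimal sketch of the same idea in code (all numbers are made up): for single numbers it is y = a*x + b, and for vectors the parameter a becomes a matrix and b becomes a vector.

```python
import numpy as np

# Scalar case: y = a*x + b
a, b = 2.0, 1.0
x = 3.0
y = a * x + b  # 7.0

# Vector case: the input has several features and the output has several values.
# "a" becomes a matrix W and "b" becomes a vector: y = W @ x + b
W = np.array([[2.0, -1.0],
              [0.5,  3.0]])
b_vec = np.array([1.0, 0.0])
x_vec = np.array([3.0, 2.0])
y_vec = W @ x_vec + b_vec  # array([5. , 7.5])
```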
Finding such a function is not that difficult if you have a set of inputs and a set of outputs (for example images and descriptions). The real difficulty arises when we want to find a function which will also work for new data. It must therefore be neither too specific to our current data set (overfitting the data) nor too general (underfitting the data).
The real challenge of machine learning is to avoid either extreme and find the optimal middle way, for example in the number of features and in how closely the function fits the data.
But where are the neural networks in all this?
The data are not transformed by just one function. To find a complex relationship between input and output we need many functions, which form a chain: the output of the preceding function is the input of the next one.
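A minimal sketch of this chaining (the three functions here are arbitrary toy examples, not a real network): the output of each function becomes the input of the next.

```python
def f1(x):
    return 2 * x + 1   # first step

def f2(x):
    return x ** 2      # second step

def f3(x):
    return x - 5       # third step

# A chain of functions: the output of each one is the input of the next.
x = 3
result = f3(f2(f1(x)))  # f1(3)=7, f2(7)=49, f3(49)=44
print(result)           # 44
```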

Each function is like a step on the way from the question to the solution. And here we have not only many steps but also many parallel ways. This forms a network, which in mathematics is called a graph.
A graph is built from nodes and the edges which connect them. Another way to describe the same thing is to say that the nodes are neurons and the edges are neuronal connections. This shows the analogy to the way information is processed by real, living neurons. Artificial neural networks are, however, a huge simplification compared to the neurons of a living brain, which form much more complex connections of different lengths and directions.
Therefore, we can say that artificial neural networks are a simplified in silico version of living neurons. But very much simplified. And they work in a different way, because they represent chains of mathematical operations, while living neurons transmit information through electrical and chemical signals.
Every function has parameters, which decide how important each element of the function (in this case, each feature) is to its output. The higher the parameter, the more the value of that feature contributes to the output of the function.
The parameters are all combined in a matrix. For each layer of the neural network, the matrix of parameters is multiplied by the feature vector of each example data point.

This is just calculating many functions in parallel. As we said before, there are not only many steps in neural network computations but also many parallel ways, as reflected by the topology (the way the nodes and edges are organised).
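A minimal sketch of one layer, with made-up numbers: multiplying the parameter matrix by the feature vector computes several linear functions at once, one per row of the matrix (one per node in the layer).

```python
import numpy as np

# Feature vector of one example data point (3 features).
x = np.array([0.5, -1.0, 2.0])

# Parameter matrix of one layer: each row holds the parameters of one node,
# so the multiplication computes 4 linear functions in parallel.
W = np.array([
    [0.2,  0.8, -0.5],
    [1.0, -0.3,  0.7],
    [0.0,  0.5,  0.5],
    [-0.4, 0.1,  0.9],
])

layer_output = W @ x  # 4 numbers: one output per node of this layer
print(layer_output)
```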
The nonlinear part of the neural network
Another important aspect of neural nets is that not all functions in the chain are linear. In two dimensions, a linear function is one which describes a straight line. A nonlinear function can have any other shape, for example exponential or sinusoidal.
In many dimensions this is a little more difficult to imagine, but we can say that for a linear function the change of output with respect to input is constant, while for a nonlinear function it depends on the initial value of the input.
Nonlinear functions are more difficult to compute, but they allow us to describe much more complex transformations from input to output.
In artificial neural networks, the nonlinear function works a little like a threshold. Here again is an analogy to a living brain: if the output of the preceding node (or neuron, or simply a function) is big enough, it passes the threshold and goes as an input to the next node on its way through the network. If not, it becomes zero and is not computed further.

A similar thing happens in the brain, where neurons are stimulated by preceding neurons only if the electrical signal passes a threshold. If it is not strong enough, it dies on the way and is not propagated further in the network.
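This threshold-like behaviour is, for example, what the ReLU function implements. A minimal sketch (the layer outputs below are made-up numbers):

```python
import numpy as np

def relu(x):
    # Threshold-like nonlinearity: negative values become zero,
    # positive values pass through unchanged.
    return np.maximum(0, x)

layer_output = np.array([-1.7, 2.2, 0.5, -0.3])  # made-up outputs of one layer
activated = relu(layer_output)
print(activated)  # [0.  2.2 0.5 0. ]
```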
At the end of the network, we have the final output, our result! But it is never right the first time. The first result of the transformation by the neural network is usually very far from the truth.

Therefore, to train our network we need to compare the output of our transformation with some real data. How different is it? This difference is the error.
The relationship between:
- the parameters we used in our network and
- the error (the difference between the real data and what we calculated)
is called the cost function.
We want the cost function to be at the minimum, as close to zero as possible.
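A minimal sketch of one such error measure (the mean squared error is one common choice; the numbers are made up): we compare the network's outputs with the real values and get a single number that we want to push towards zero.

```python
import numpy as np

real = np.array([1.0, 0.0, 1.0])       # the true outputs (what we wanted)
predicted = np.array([0.8, 0.3, 0.4])  # what the network actually calculated

# Mean squared error: one common way to turn the differences into a single cost.
cost = np.mean((real - predicted) ** 2)
print(cost)  # ~0.163
```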
How can we minimise the cost function? For example, by going backwards through the whole network and all of its computations.
We can use gradient descent or some other algorithm to do it.
The cost function is usually very complicated, and we cannot simply compute it everywhere and see where the minimum is.
Instead, we take small steps by changing the parameter values and check whether the cost function output went up or down. If it went down, we are moving in a good direction.
Here again, we do it stepwise, by calculating the derivative of the cost function with respect to every parameter in the network. In this way we can check whether each parameter contributes to the cost function going down. A derivative, in simple terms, is a mathematical tool for checking whether a function goes up or down as we take a step. And here we come back to our chain of functions: the derivatives are also computed according to the chain rule, all the way back through the network.
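A minimal sketch of one such step (one parameter, one data point, made-up values): for a tiny "network" y = a*x, the chain rule tells us how the cost changes when the parameter a changes, and we move the parameter in the direction that makes the cost go down.

```python
# Tiny "network": prediction = a * x, cost = (prediction - target)^2
x, target = 2.0, 10.0
a = 1.0                      # initial parameter value

prediction = a * x           # forward pass through our one-step chain
cost = (prediction - target) ** 2

# Chain rule: d(cost)/da = d(cost)/d(prediction) * d(prediction)/da
dcost_dpred = 2 * (prediction - target)
dpred_da = x
dcost_da = dcost_dpred * dpred_da  # the gradient

# One small step against the gradient: the cost should go down.
step_size = 0.01
a = a - step_size * dcost_da
new_cost = (a * x - target) ** 2
print(cost, new_cost)        # 64.0 -> smaller
```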

We repeat this process many times (over many epochs) with many data points. The learning speed of our network is defined by the magnitude of the parameter changes at each step, which should be neither too small (slow learning, i.e. many operations with little result) nor too big (changes that are too big may overshoot the optimal values at the minimum).
This leads to the optimisation of the parameters of all the functions in the neural network. In this way the neural network learns to predict the correct output for the input it receives.
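Putting the pieces together in a minimal sketch (one parameter, made-up data, an arbitrary learning rate): repeating the gradient step over many epochs gradually optimises the parameter.

```python
# Made-up training data: inputs and the outputs we want to predict (here y = 3*x).
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [3.0, 6.0, 9.0, 12.0]

a = 0.0               # parameter to learn
learning_rate = 0.01  # magnitude of the parameter change at each step

for epoch in range(100):                           # repeat the process many times
    for x, target in zip(inputs, targets):
        prediction = a * x
        gradient = 2 * (prediction - target) * x   # chain rule, as above
        a -= learning_rate * gradient              # small step downhill

print(a)  # close to 3.0: the "network" has learned the relationship
```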
I hope this explanation is helpful to some of you, whether to understand the topic or simply to look at a familiar subject from a different perspective.
