Milania's Blog
https://www.milania.de/
jansellner.net | milania.de

Blog: Introduction to neural network optimizers [part 3] – Adam optimizer
https://www.milania.de/blog/Introduction_to_neural_network_optimizers_%5Bpart_3%5D_%E2%80%93_Adam_optimizer
Tue, 26 Mar 2019 | Jan Sellner

This is the third part of a three-part series introducing some general concepts and concrete algorithms in the field of neural network optimizers. As a reminder, here is the table of contents:

Blog: Introduction to neural network optimizers [part 2] – adaptive learning rates (RMSProp, AdaGrad)
https://www.milania.de/blog/Introduction_to_neural_network_optimizers_%5Bpart_2%5D_%E2%80%93_adaptive_learning_rates_%28RMSProp%2C_AdaGrad%29
Sat, 23 Mar 2019 | Jan Sellner

This is the second part of a three-part series introducing some general concepts and concrete algorithms in the field of neural network optimizers. As a reminder, here is the table of contents:

Blog: Introduction to neural network optimizers [part 1] – momentum optimization
https://www.milania.de/blog/Introduction_to_neural_network_optimizers_%5Bpart_1%5D_%E2%80%93_momentum_optimization
Tue, 19 Mar 2019 | Jan Sellner

A neural network is a model with many parameters which are used to derive an output from an input. In the learning process, we show the network a series of example inputs with associated outputs so that it can adapt its parameters (weights) according to a defined error function. Since these error functions are generally too complex, we cannot simply determine an explicit formula for the optimal parameters. The usual approach to tackle this problem is to start with a random initialization of the weights and use gradient descent to iteratively find a local minimum in the error landscape. That is, we adapt the weights over multiple iterations according to the gradients until the value of the error function is sufficiently low.
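The iterative weight adaptation just described can be sketched in a few lines of Python. The quadratic error function, learning rate, and momentum value below are illustrative choices, not taken from the article:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, momentum=0.0, steps=100):
    """Iteratively adapt the weights w according to the gradients.

    With momentum > 0, an exponentially decaying average of past
    gradients (the velocity v) is added to each step.
    """
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = momentum * v - eta * grad(w)  # accumulate velocity
        w = w + v                         # adapt the weights
    return w

# Illustrative error function E(w) = w1^2 + 10*w2^2 with gradient:
grad_E = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])

w_plain = gradient_descent(grad_E, [3.0, 2.0], eta=0.02, steps=300)
w_mom = gradient_descent(grad_E, [3.0, 2.0], eta=0.02, momentum=0.9, steps=300)
```

Both runs end close to the minimum at the origin; the momentum variant is the subject of this first part of the series.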

Showcase: Hyperparameters of an SVM with an RBF kernel
https://www.milania.de/showcase/Hyperparameters_of_an_SVM_with_an_RBF_kernel
Sun, 25 Nov 2018 | Jan Sellner

A support vector machine (SVM) is a popular choice of classifier, and radial basis functions (RBFs) are commonly used kernels which make SVMs applicable to non-linearly separable problems as well. There are two hyperparameters in this case. First, the margin is maximized by minimizing the function
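The second hyperparameter enters through the kernel itself. Below is a minimal sketch of the RBF kernel in the common \(\gamma\) parametrization (the showcase may parametrize the kernel width differently, e.g. via \(\sigma\)):

```python
import numpy as np

def rbf_kernel(x1, x2, gamma):
    """RBF kernel k(x1, x2) = exp(-gamma * ||x1 - x2||^2).

    gamma controls the width of the kernel: large gamma means the
    similarity decays quickly with distance.
    """
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

a, b = [0.0, 0.0], [1.0, 1.0]
k_wide = rbf_kernel(a, b, gamma=0.1)    # broad kernel: a and b still look similar
k_narrow = rbf_kernel(a, b, gamma=10.0) # narrow kernel: similarity almost zero
```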

Showcase: Distribution of activations and gradients for different activation functions in a neural network
https://www.milania.de/showcase/Distribution_of_activations_and_gradients_for_different_activation_functions_in_a_neural_network
Sat, 16 Jun 2018 | Jan Sellner

This showcase presents simulation results for a deep neural network consisting of 21 layers. Based on randomly generated data, the distribution of network activations and gradients is analysed for different activation functions. This reveals how the flow of activations from the first to the last layer and the flow of gradients from the last to the first layer behave for different activation functions.
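The forward half of such a simulation can be approximated with a short sketch. The random data, the Xavier-like weight scaling, and the layer width of 100 are assumptions made for illustration, not the showcase's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, width, n_samples = 21, 100, 500

def activation_stds(activation):
    """Propagate random data through 21 randomly initialized layers and
    record the standard deviation of the activations after each layer."""
    x = rng.standard_normal((n_samples, width))
    stds = []
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) / np.sqrt(width)  # Xavier-like scale
        x = activation(x @ W)
        stds.append(x.std())
    return stds

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
stds_sigmoid = activation_stds(sigmoid)
stds_tanh = activation_stds(np.tanh)
```

With the sigmoid, the spread of the activations shrinks from layer to layer, which already hints at the vanishing-gradient behaviour such an analysis reveals.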

Showcase: t-Distributed Stochastic Neighbor Embedding
https://www.milania.de/showcase/t-Distributed_Stochastic_Neighbor_Embedding
Mon, 11 Jun 2018 | Jan Sellner

Visualizing high-dimensional data is a demanding task since we are restricted to our three-dimensional world. A common approach to tackle this problem is to first apply a dimensionality reduction algorithm. This maps \(n\) data points \(\fvec{x}_i \in \mathbb{R}^d\) in the feature space to \(n\) projection points \(\fvec{y}_i \in \mathbb{R}^r\) in the projection space. If we choose \(r \in \{1,2,3\}\), we reach a point where we can successfully visualize the data. However, this mapping does not come for free since it is simply not possible to visualize high-dimensional data in a low-dimensional space without losing at least some information. Hence, different algorithms focus on different aspects. \(t\)-Distributed Stochastic Neighbor Embedding (\(t\)-SNE) [video introduction] is such an algorithm which tries to preserve local neighbour relationships at the cost of distance or density information.
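The core of the algorithm can be sketched as follows. This is a strongly simplified version: a single fixed Gaussian bandwidth replaces the usual per-point perplexity calibration, and plain gradient descent is used without momentum or early exaggeration:

```python
import numpy as np

def tsne_sketch(X, dim=2, sigma=4.0, eta=1.0, steps=500, seed=0):
    """Minimal t-SNE sketch: Gaussian affinities P in the feature space,
    Student-t affinities Q in the projection space, gradient descent on
    the Kullback-Leibler divergence KL(P || Q)."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Pairwise affinities in the high-dimensional feature space
    D = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    P = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)  # conditional p_{j|i}
    P = (P + P.T) / (2.0 * n)          # symmetrized joint p_{ij}

    # Gradient descent on the projection points y_i
    Y = 1e-2 * rng.standard_normal((n, dim))
    for _ in range(steps):
        Dy = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)
        W = 1.0 / (1.0 + Dy)           # heavy-tailed Student-t kernel
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum()                # joint q_{ij}
        # dKL/dy_i = 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
        F = 4.0 * ((P - Q) * W)[:, :, None] * (Y[:, None] - Y[None, :])
        Y -= eta * F.sum(axis=1)
    return Y

# Two well-separated clusters in 4-D should stay separated in 2-D
X = np.vstack([np.zeros((5, 4)), np.full((5, 4), 10.0)])
Y = tsne_sketch(X)
```

The heavy-tailed Student-t kernel in the projection space is what lets well-separated groups drift apart while local neighbours stay close.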

Showcase: Crisp vs. fuzzy k-means clustering
https://www.milania.de/showcase/Crisp_vs._fuzzy_k-means_clustering
Tue, 29 May 2018 | Jan Sellner

In \(k\)-means clustering, the number of desired clusters \(k\) is set in advance and algorithms then try to find \(k\) groups in the data. In the crisp version, each data point is assigned to its nearest cluster centre (hard membership). In fuzzy clustering, on the other hand (the corresponding algorithm is sometimes also called \(c\)-means clustering), the memberships are soft: every data point belongs to some degree to every cluster centre. The membership is usually related to the distance between the data point and the cluster centre. Here, both methods, crisp and fuzzy clustering, are analysed on an artificially generated example data set.
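The difference between the two membership schemes can be sketched directly. The code uses fuzzy c-means style memberships with fuzzifier \(m = 2\); the data points and centres are made up for illustration:

```python
import numpy as np

def crisp_memberships(X, centres):
    """Hard memberships: each point belongs fully to its nearest centre."""
    d = np.linalg.norm(X[:, None] - centres[None, :], axis=-1)
    U = np.zeros_like(d)
    U[np.arange(len(X)), d.argmin(axis=1)] = 1.0
    return U

def fuzzy_memberships(X, centres, m=2.0):
    """Soft memberships in the fuzzy c-means style: u_ij is inversely
    related to the distance between point i and centre j (the fuzzifier
    m > 1 controls how soft the memberships are)."""
    d = np.linalg.norm(X[:, None] - centres[None, :], axis=-1) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

centres = np.array([[0.0, 0.0], [4.0, 0.0]])
X = np.array([[1.0, 0.0], [2.0, 0.0], [3.9, 0.0]])
U_crisp = crisp_memberships(X, centres)
U_fuzzy = fuzzy_memberships(X, centres)
```

The point at \((2, 0)\) is equidistant to both centres: the crisp scheme must commit to one cluster, while the fuzzy scheme assigns it a membership of 0.5 to each.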

Showcase: The softmax function in the output layer of neural networks
https://www.milania.de/showcase/The_softmax_function_in_the_output_layer_of_neural_networks
Thu, 24 May 2018 | Jan Sellner

Suppose you use a neural network for a classification problem and the neurons in the output layer should return a valid discrete probability distribution. If you set the number of output neurons \(n\) equal to the number of classes in your classification problem, you get the nice interpretation that the result of each neuron \(y_i\) gives you the probability that the corresponding input belongs to the class \(\omega_i\). If the network is confident in its classification, you will see a strong peak in the probability distribution. On the other hand, for a noisy input where the network has no real clue what it means (or hasn't learned it yet), the resulting distribution will be broader.
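A minimal, numerically stable implementation of the softmax function illustrates both cases (the input logits are invented examples):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting the maximum before
    exponentiating avoids overflow without changing the result."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

confident = softmax([8.0, 1.0, 0.5])  # strong peak: the network is sure
unsure = softmax([1.1, 1.0, 0.9])     # broad distribution: the network has no clue
```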

Showcase: Hartigan's method for k-means clustering (exchange clustering algorithm)
https://www.milania.de/showcase/Hartigan%27s_method_for_k-means_clustering_%28exchange_clustering_algorithm%29
Wed, 09 May 2018 | Jan Sellner

When we are confronted with a new dataset, it is often of interest to analyse the structure of the data and search for patterns. Are the data points organized into groups? How close are the groups to each other? Are there any other interesting structures? These questions are addressed in the field of cluster analysis. It comprises a collection of unsupervised algorithms (meaning that they don't rely on class labels) which try to find such patterns. As a result, each data point is assigned to a cluster.
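The exchange idea behind Hartigan's method can be sketched as follows. This is an illustrative brute-force variant that recomputes the objective for every candidate move; the actual algorithm uses an incremental update formula instead:

```python
import numpy as np

def sse(X, labels, k):
    """Total within-cluster sum of squared errors."""
    total = 0.0
    for j in range(k):
        pts = X[labels == j]
        if len(pts) > 0:
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def hartigan_kmeans(X, k, seed=0, max_sweeps=100):
    """Exchange clustering: repeatedly move single points between
    clusters whenever the move lowers the total within-cluster SSE."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(X)):
            old = labels[i]
            costs = []
            for j in range(k):      # try every cluster for point i
                labels[i] = j
                costs.append(sse(X, labels, k))
            labels[i] = int(np.argmin(costs))
            if labels[i] != old:
                changed = True
        if not changed:             # no point wants to move any more
            break
    return labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = hartigan_kmeans(X, k=2)
```

On two well-separated blobs like these, the exchange procedure recovers the natural grouping from a random initial assignment.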

Showcase: Impact of the learning rate in a simple neural network
https://www.milania.de/showcase/Impact_of_the_learning_rate_in_a_simple_neural_network
Wed, 25 Apr 2018 | Jan Sellner

The learning rate \(\eta\) is one of the hyperparameters we need to optimize when training neural networks. It controls how fast we reach the minimum of our error function using gradient descent. If \(\eta\) is too small, the learning process takes too long, which is a particular problem in deep networks, which already suffer from long training times. But learning rates that are too high cause problems as well: we might overstep the minimum and oscillate around in the error landscape. Here, we want to analyse the effect of the learning rate on a simple example. For this, we use the following network, which consists of only one input and one sigmoid neuron.
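The effect can be reproduced with a tiny experiment on such a one-input, one-sigmoid-neuron network. The target value, input, and learning rates below are invented for illustration and are not the showcase's settings:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train(eta, w0=4.0, x=1.0, t=0.8, steps=100):
    """Gradient descent on E(w) = 0.5 * (sigmoid(w*x) - t)^2 for a
    network with one input and one sigmoid neuron; returns the error
    after each step."""
    w, errors = w0, []
    for _ in range(steps):
        y = sigmoid(w * x)
        grad = (y - t) * y * (1.0 - y) * x  # chain rule through the sigmoid
        w -= eta * grad
        errors.append(0.5 * (sigmoid(w * x) - t) ** 2)
    return errors

errors_small = train(eta=0.5)    # converges, but very slowly
errors_large = train(eta=5.0)    # converges much faster here
errors_huge = train(eta=200.0)   # oversteps the minimum into a flat region
```

With \(\eta = 200\), the weight overshoots the minimum and lands in the saturated region of the sigmoid, where the gradient is nearly zero, so the error ends up higher than where it started.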