Milania's Blog
https://www.milania.de/
My little place on the web...
<![CDATA[Showcase: Hyperparameters of an SVM with an RBF kernel]]>
https://www.milania.de/showcase/Hyperparameters_of_an_SVM_with_an_RBF_kernel
Sun, 25 Nov 2018 00:00:00 +0100
Jan Sellner
A support vector machine (SVM) is a popular choice for a classifier, and radial basis functions (RBFs) are commonly used kernels that allow SVMs to be applied to non-linearly separable problems as well. There are two hyperparameters in this case. First, the margin is maximized by minimizing the function

]]>
<![CDATA[Showcase: Distribution of activations and gradients for different activation functions in a neural network]]>
https://www.milania.de/showcase/Distribution_of_activations_and_gradients_for_different_activation_functions_in_a_neural_network
Sat, 16 Jun 2018 00:00:00 +0200
Jan Sellner
This showcase presents some simulation results for a deep neural network consisting of 21 layers. Based on randomly generated data, the distribution of network activations and gradients is analysed for different activation functions. This reveals how the flow of activations from the first to the last layer and the flow of gradients from the last to the first layer behave for different activation functions.
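A minimal sketch of such a simulation, covering the forward flow of activations only. The layer width, the number of samples, the weight initialisation and the helper name `simulate` are assumptions chosen for illustration and are not taken from the showcase itself.

```python
import math
import random
import statistics

def simulate(activation, layers=21, width=30, samples=20, seed=0):
    """Push random inputs through a random fully connected network and
    return the (mean, standard deviation) of the activations per layer."""
    rng = random.Random(seed)
    batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(samples)]
    stats = []
    for _ in range(layers):
        # Weights drawn with standard deviation 1/sqrt(width); an assumed initialisation.
        weights = [[rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
                   for _ in range(width)]
        batch = [[activation(sum(w * v for w, v in zip(row, vec))) for row in weights]
                 for vec in batch]
        flat = [a for vec in batch for a in vec]
        stats.append((statistics.fmean(flat), statistics.pstdev(flat)))
    return stats

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

tanh_stats = simulate(math.tanh)
sigmoid_stats = simulate(sigmoid)

for layer in (0, 10, 20):
    print(layer + 1, tanh_stats[layer], sigmoid_stats[layer])
```

Comparing the per-layer mean and standard deviation for the two functions gives a rough impression of how the activation distribution changes with depth; the gradient flow would need an additional backward pass.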

]]>
<![CDATA[Showcase: t-Distributed Stochastic Neighbor Embedding]]>
https://www.milania.de/showcase/t-Distributed_Stochastic_Neighbor_Embedding
Mon, 11 Jun 2018 00:00:00 +0200
Jan Sellner
Visualizing high-dimensional data is a demanding task since we are restricted to our three-dimensional world. A common approach to tackle this problem is to apply some dimensionality reduction algorithm first. This maps \(n\) data points \(\fvec{x}_i \in \mathbb{R}^d\) in the feature space to \(n\) projection points \(\fvec{y}_i \in \mathbb{R}^r\) in the projection space. If we choose \(r \in \{1,2,3\}\), we reach a point where we can successfully visualize the data. However, this mapping comes at a cost since it is not possible to visualize high-dimensional data in a low-dimensional space without losing at least some information. Hence, different algorithms focus on different aspects. \(t\)-Distributed Stochastic Neighbor Embedding (\(t\)-SNE) [video introduction] is one such algorithm; it tries to preserve local neighbour relationships at the cost of distance or density information.

]]>
<![CDATA[Showcase: Crisp vs. fuzzy k-means clustering]]>
https://www.milania.de/showcase/Crisp_vs._fuzzy_k-means_clustering
Tue, 29 May 2018 00:00:00 +0200
Jan Sellner
In \(k\)-means clustering, the number of desired clusters \(k\) is set in advance and algorithms then try to find \(k\) groups in the data. In the crisp version, each data point is assigned to its nearest cluster centre (hard membership). In fuzzy clustering, on the other hand (the corresponding algorithm is sometimes also called c-means clustering), the memberships are soft: every data point belongs to every cluster centre to some degree. The membership is usually related to the distance between the data point and the cluster centre. Here, both methods, crisp and fuzzy clustering, are analysed on an artificially generated example data set.
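The difference between hard and soft memberships can be sketched for a single data point as follows. The soft rule used here is the commonly cited fuzzy c-means membership formula with a fuzzifier `m`; this formula, the value `m=2.0` and the function names are assumptions for illustration and may differ from what the showcase uses.

```python
import math

def crisp_memberships(point, centres):
    """Hard assignment: 1 for the nearest centre, 0 for all others."""
    distances = [math.dist(point, c) for c in centres]
    nearest = distances.index(min(distances))
    return [1.0 if j == nearest else 0.0 for j in range(len(centres))]

def fuzzy_memberships(point, centres, m=2.0):
    """Soft assignment with the fuzzy c-means rule (fuzzifier m > 1, assumed)."""
    distances = [math.dist(point, c) for c in centres]
    # A point lying exactly on a centre belongs fully to that centre.
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
            for d_j in distances]

centres = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)
crisp = crisp_memberships(point, centres)
fuzzy = fuzzy_memberships(point, centres)
print(crisp)
print(fuzzy)
```

In both cases the memberships of one point sum to 1; the fuzzy version merely spreads that mass over all centres according to distance.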

]]>
<![CDATA[Showcase: The softmax function in the output layer of neural networks]]>
https://www.milania.de/showcase/The_softmax_function_in_the_output_layer_of_neural_networks
Thu, 24 May 2018 00:00:00 +0200
Jan Sellner
Suppose you use a neural network for a classification problem and the neurons in the output layer should return a valid discrete probability distribution. If you set the number of output neurons \(n\) equal to the number of classes of your classification problem, you have the nice interpretation that the result for each neuron \(y_i\) gives you the probability that the corresponding input belongs to the class \(\omega_i\). If the network is confident in its classification, you will see a strong peak in the probability distribution. On the other hand, for a noisy input where the network has no real clue what it means (or has not learned it yet), the resulting distribution will be broader.
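As a small illustration of the peaked versus broad case, the softmax function named in the title can be sketched in plain Python. The input values are made up for this example, and subtracting the maximum is a common numerical-stability trick rather than something taken from the showcase.

```python
import math

def softmax(logits):
    """Map raw output-layer values to a discrete probability distribution."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

confident = softmax([8.0, 1.0, 0.5])  # one value dominates: strong peak
uncertain = softmax([1.1, 1.0, 0.9])  # values nearly equal: broad distribution
print(confident)
print(uncertain)
```

Both results sum to 1, so each entry can be read as the probability of the corresponding class \(\omega_i\).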

]]>
<![CDATA[Showcase: Hartigan's method for k-means clustering (exchange clustering algorithm)]]>
https://www.milania.de/showcase/Hartigan%27s_method_for_k-means_clustering_%28exchange_clustering_algorithm%29
Wed, 09 May 2018 00:00:00 +0200
Jan Sellner
When we are confronted with a new dataset, it is often of interest to analyse the structure of the data and search for patterns. Are the data points organized into groups? How close together are the groups? Are there any other interesting structures? These questions are addressed in the field of cluster analysis. It contains a collection of unsupervised algorithms (meaning that they don't rely on class labels) which try to find these patterns. As a result, each data point is assigned to a cluster.

]]>
<![CDATA[Showcase: Impact of the learning rate in a simple neural network]]>
https://www.milania.de/showcase/Impact_of_the_learning_rate_in_a_simple_neural_network
Wed, 25 Apr 2018 00:00:00 +0200
Jan Sellner
The learning rate \(\eta\) is one of the hyperparameters we need to optimize when training neural networks. It controls how fast we reach the minimum in our error function using gradient descent. If \(\eta\) is too small, the learning process takes too long, which is especially a problem in deep networks that already carry the burden of long training times. But learning rates that are too high can cause problems as well. We might overstep the minimum and oscillate around in the error landscape. Here, we want to analyse the effect of the learning rate on a simple example. For this, we use the following network which consists only of one input and one sigmoid neuron.
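A rough sketch of such an experiment, assuming a single training sample, a squared-error function and a neuron with one weight and one bias. These choices, the concrete numbers and the function name `train` are assumptions for illustration, not the setup of the showcase.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(eta, steps=100, x=1.0, target=0.2):
    """Gradient descent on E = 0.5 * (y - target)^2 with y = sigmoid(w * x + b)."""
    w, b = 0.0, 0.0
    errors = []
    for _ in range(steps):
        y = sigmoid(w * x + b)
        errors.append(0.5 * (y - target) ** 2)
        delta = (y - target) * y * (1.0 - y)  # dE/dz via the chain rule
        w -= eta * delta * x
        b -= eta * delta
    return errors

slow = train(eta=0.01)
fast = train(eta=1.0)
print(f"eta=0.01: final error {slow[-1]:.5f}")
print(f"eta=1.00: final error {fast[-1]:.5f}")
```

With the small learning rate the error has barely moved after 100 steps, while the larger one gets much closer to the minimum. Trying other values of `eta` with the same function shows how the error history changes.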

]]>
<![CDATA[Showcase: The correlation coefficient subject to noise]]>
https://www.milania.de/showcase/The_correlation_coefficient_subject_to_noise
Thu, 19 Apr 2018 00:00:00 +0200
Jan Sellner
The correlation coefficient is an important metric to measure the linear dependency between two variables \(X\) and \(Y\). It is defined as

]]>
<![CDATA[Blog: CSS lightbox without JavaScript realized with a hidden input element]]>
https://www.milania.de/blog/CSS_lightbox_without_JavaScript_realized_with_a_hidden_input_element
Thu, 25 Jan 2018 00:00:00 +0100
Jan Sellner
If you place images in a layout with a fixed width (like this webpage here), you may encounter the problem that some images are too large to display. Hence, the image is only shown in a lower resolution. But when we want the user to still be able to view the image in its full glory, we need an additional way of interaction. One option is to provide a link to the image in its full size, but then the user has to leave the current page, which breaks the attentional flow. A lightbox is a very nice way to overcome this issue: it allows viewing images in higher resolutions without leaving the current site. The image is shown enlarged on the same page as before and the rest of the site is hidden in the background (but still visible) as seen in the following example.

]]>
<![CDATA[Showcase: Nearest neighbour density estimation]]>
https://www.milania.de/showcase/Nearest_neighbour_density_estimation
Tue, 05 Dec 2017 00:00:00 +0100
Jan Sellner
Density estimation based on the nearest neighbours is another technique to estimate the unknown PDF \(\hat{p}(x)\) from observed data. It implements roughly the opposite idea of the Parzen window estimator, where we place kernels with a certain side length \(h\) at each data point; \(h\) determines the local influence of the kernel. Using a large \(h\) results in wide kernels which collect more points on the way. In the nearest neighbour density estimation, we approach the problem from a different perspective. Instead of fixing the side length \(h\) and collecting a varying number of \(k\) neighbours for each kernel, we now fix \(k\) and adjust the influence area accordingly.
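The idea of fixing \(k\) and letting the influence area adapt can be sketched in one dimension. The estimator form \(k / (n \cdot V)\) with \(V\) as the width of the interval reaching the \(k\)-th nearest neighbour is the usual textbook form and is assumed here; the sample values and the function name `knn_density` are made up for illustration.

```python
def knn_density(x, samples, k):
    """Estimate p(x) as k / (n * V), where V is the width of the smallest
    interval centred at x that contains the k nearest samples."""
    distances = sorted(abs(x - s) for s in samples)
    radius = distances[k - 1]  # distance to the k-th nearest neighbour
    volume = 2.0 * radius      # interval [x - radius, x + radius]
    # Note: volume is zero if x coincides with its k-th nearest sample.
    return k / (len(samples) * volume)

samples = [0.1, 0.2, 0.25, 0.3, 0.9, 1.5, 2.0, 2.6]
dense = knn_density(0.22, samples, k=3)   # many samples close by
sparse = knn_density(1.8, samples, k=3)   # neighbours are far apart
print(dense, sparse)
```

Where the samples lie close together the interval is narrow and the estimate is high; in sparse regions the interval has to grow to reach \(k\) points, which lowers the estimate.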