Stochastic Cluster Embedding - A New Way to Visualize Large Datasets

A remarkable feature of the human brain is its ability to find differences even in enormous amounts of visual information. This ability is very useful when studying large datasets, because the content of the data must first be compressed into a form that human intelligence can understand. For visual analytics, dimensionality reduction therefore remains a major challenge.

Scientists from Aalto University and the University of Helsinki, part of the Finnish Center for Artificial Intelligence (FCAI), tested the capabilities of the best-known visual analytics methods and found that none of them worked properly when the amount of data increased significantly. For example, t-SNE, LargeVis, and UMAP can no longer distinguish even extremely strong signals of groups of observations in the data once the number of observations reaches the hundreds of thousands.

The researchers developed a new nonlinear dimensionality reduction method called Stochastic Cluster Embedding (SCE) for better cluster visualization. It is designed to show data clusters and other macroscopic features in a way that is as clear, easy to observe, and understandable to humans as possible. SCE uses graphics acceleration, similar to state-of-the-art artificial intelligence methods for neural network computing.

The discovery of the Higgs boson provided the motivation for this algorithm. The dataset from the related experiment contains over 11 million feature vectors, and this data needed convenient, clear visualization. This inspired the scientists to develop a new method.

The researchers generalized SNE using a family of I-divergences, parameterized by a scale factor s, between the unnormalized similarities in the input and output spaces. SNE is the special case of this family in which s is chosen as the normalization factor for the output similarities. During testing, however, the researchers found that the best value of s for visualizing clusters often differed from the value chosen by SNE. To overcome this shortcoming of t-SNE, the new SCE method therefore takes a different approach and mixes in the input similarities when calculating s. While the new learning objective is being optimized, the coefficient is adjusted adaptively so that data points are better clustered. The researchers also developed an efficient optimization algorithm based on asynchronous stochastic block coordinate descent. The new algorithm can use parallel computing devices and is suitable for large-scale tasks with large amounts of data.
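The relationship between the I-divergence family and SNE can be checked numerically. The following is only a minimal sketch under stated assumptions, not the researchers' implementation: it assumes the input similarities p are scaled to sum to one, uses the Student-t kernel familiar from t-SNE for the unnormalized output similarities, and all function names are hypothetical. It shows that choosing s as the reciprocal of the summed output similarities turns the I-divergence into the ordinary Kullback-Leibler divergence that SNE minimizes.

```python
import math
import random


def student_t_similarities(layout):
    """Unnormalized output similarities q_ij = 1 / (1 + ||y_i - y_j||^2), i != j.

    The Student-t kernel is an assumption borrowed from t-SNE.
    """
    q = {}
    for i, yi in enumerate(layout):
        for j, yj in enumerate(layout):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(yi, yj))
                q[(i, j)] = 1.0 / (1.0 + d2)
    return q


def i_divergence(p, q_tilde, s):
    """I-divergence between p and s * q_tilde: sum of p*log(p/(s*q)) - p + s*q."""
    total = 0.0
    for pair, p_ij in p.items():
        sq = s * q_tilde[pair]
        if p_ij > 0.0:
            total += p_ij * math.log(p_ij / sq)
        total += sq - p_ij
    return total


def kl_divergence(p, q_tilde):
    """KL divergence between p and the normalized output similarities."""
    z = sum(q_tilde.values())
    return sum(
        p_ij * math.log(p_ij * z / q_tilde[pair])
        for pair, p_ij in p.items()
        if p_ij > 0.0
    )


def random_problem(n=6, seed=0):
    """Hypothetical toy problem: a random 2-D layout and random input similarities."""
    rng = random.Random(seed)
    layout = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]
    q_tilde = student_t_similarities(layout)
    raw = {pair: rng.random() for pair in q_tilde}
    total = sum(raw.values())
    p = {pair: value / total for pair, value in raw.items()}  # sums to one
    return p, q_tilde


p, q_tilde = random_problem()
s_sne = 1.0 / sum(q_tilde.values())  # the scale factor that recovers SNE

print("I-divergence at SNE's s:", i_divergence(p, q_tilde, s_sne))
print("KL divergence:          ", kl_divergence(p, q_tilde))
print("I-divergence at s / 2:  ", i_divergence(p, q_tilde, 0.5 * s_sne))
```

The first two printed values agree up to rounding, while any other scale factor gives a different objective for the same layout. Which scale factor produces the clearest clusters is a separate question, and that is what the user study addresses.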

While developing the project, the scientists tested the method on a variety of real-world datasets and compared it to other state-of-the-art nonlinear dimensionality reduction (NLDR) methods. In a user study, participants chose the s value whose visualization showed the clusters best for them. The researchers then compared the s values resulting from SCE and t-SNE to see which was closer to the human choices. The four smallest datasets were used for this test: IJCNN, TOMORADAR, SHUTTLE, and MNIST. For each dataset, participants were given a slider for setting s, and each slider position displayed the corresponding precomputed visualization. The participants then selected their preferred s value for cluster visualization.

The test results clearly show that the s selected by SNE lies to the right of the human median (solid green line) for all datasets. This means that, for human viewers, generalized SNE (GSNE) with a smaller s is often better than t-SNE for cluster visualization. In contrast, the s selected by SCE (red dashed line) is closer to the human median in all four datasets.

When the Stochastic Cluster Embedding method was applied to the Higgs boson data, the most important physical properties were clearly identified. According to the researchers, SCE works several times faster than previous methods and is much more stable, even in complex applications. It modifies t-SNE with an adaptive and efficient trade-off between attraction and repulsion. The scientists also provided a simple and fast optimization algorithm that can be easily implemented on modern parallel computing platforms, and developed efficient software that uses asynchronous stochastic block gradient descent to optimize the new family of objective functions. Experimental results show that the method consistently identifies internal clusters and significantly improves the visualization of data clusters compared to state-of-the-art stochastic neighbor embedding approaches.
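The trade-off between attraction and repulsion can be illustrated with a single pairwise update. The sketch below is an assumption-laden illustration, not the published SCE optimizer: it assumes a Student-t output kernel and a fixed scale factor s, it is single-threaded rather than asynchronous, and the names pair_step and distance are hypothetical. For one pair of points it takes a gradient step on the term -p*log(q) + s*q, where the first part pulls the pair together and the second pushes it apart.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_step(y, i, j, p_ij, s, lr=0.1):
    """One gradient step on the block (y[i], y[j]) for the pairwise term
    -p_ij * log(q) + s * q, with q = 1 / (1 + ||y_i - y_j||^2).

    Assumptions: Student-t kernel, fixed s, plain gradient descent.
    """
    diff = [a - b for a, b in zip(y[i], y[j])]
    q = 1.0 / (1.0 + sum(d * d for d in diff))
    attraction = 2.0 * p_ij * q   # from -p_ij * log(q): pulls the pair together
    repulsion = 2.0 * s * q * q   # from s * q: pushes the pair apart
    coeff = attraction - repulsion  # gradient w.r.t. y[i] is coeff * diff
    y[i] = [a - lr * coeff * d for a, d in zip(y[i], diff)]
    y[j] = [b + lr * coeff * d for b, d in zip(y[j], diff)]


# A pair with input similarity and no repulsion moves closer together.
layout = [[0.0, 0.0], [1.0, 0.0]]
pair_step(layout, 0, 1, p_ij=0.5, s=0.0)
print("after attraction only:", distance(layout[0], layout[1]))

# A pair with no input similarity is only pushed apart.
layout = [[0.0, 0.0], [1.0, 0.0]]
pair_step(layout, 0, 1, p_ij=0.0, s=1.0)
print("after repulsion only: ", distance(layout[0], layout[1]))
```

Because each step touches only two points, many such updates could in principle run on different pairs at once, which is the general idea behind block-wise stochastic optimization on parallel hardware. A larger s strengthens repulsion relative to attraction, so how s is chosen shapes how tightly clusters form.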

The code for the SCE method is publicly available on GitHub.
