PUMA: Parallel subspace clustering of categorical data using multi-attribute weights (2019)
Traditional clustering schemes are ill-suited to high-dimensional categorical data for two main reasons. First, traditional methods usually represent each cluster using all dimensions without distinction. Second, they rely only on a single dimension of projection as an attribute’s weight, ignoring relevance among attributes. We address these two problems with a MapReduce-based subspace clustering algorithm, called PUMA, that uses multi-attribute weights. PUMA constructs attribute subspaces by calculating an attribute-value weight based on the co-occurrence probability of attribute values among different dimensions. It obtains the sub-clusters corresponding to the respective attribute subspaces from each computing node in parallel. Lastly, PUMA measures clusters at various scales by applying hierarchical clustering to iteratively merge sub-clusters. We implement PUMA on a 24-node Hadoop cluster. Experimental results reveal that using multi-attribute weights with subspace clustering achieves better clustering accuracy on both synthetic and real-world high-dimensional datasets. They also show that PUMA performs well in terms of extensibility and scalability, with nearly linear speedup with respect to the number of nodes. Additionally, the results demonstrate that PUMA is reasonable, effective, and practical for expert-system applications such as knowledge acquisition, word sense disambiguation, automatic abstracting and recommender systems.
Keywords: Parallel subspace clustering | Multi-attribute weights | High dimension | Categorical data | MapReduce
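The abstract describes an attribute-value weight derived from the co-occurrence probability of attribute values across dimensions, but does not give the formula. The sketch below is only an illustration of that general idea, not PUMA's actual weighting. It assumes a hypothetical weight for value v of attribute j equal to the average, over the other attributes k, of the largest conditional probability P(x_k = u | x_j = v). The function names (map_pairs, reduce_counts, attribute_value_weights) are invented for this example, and the map and reduce steps run in a single process rather than on Hadoop.

```python
from collections import Counter
from itertools import permutations


def map_pairs(record):
    """Map step: emit one key per ordered pair of attribute values in a record."""
    indexed = list(enumerate(record))  # [(attribute index, value), ...]
    for (j, v), (k, u) in permutations(indexed, 2):
        yield (j, v, k, u)


def reduce_counts(records):
    """Reduce step: sum pair counts and single-value counts over all records."""
    pair_counts = Counter()
    value_counts = Counter()
    for record in records:
        value_counts.update(enumerate(record))
        pair_counts.update(map_pairs(record))
    return pair_counts, value_counts


def attribute_value_weights(records):
    """Assumed weight: mean over other attributes k of max_u P(x_k = u | x_j = v)."""
    pair_counts, value_counts = reduce_counts(records)
    n_attrs = len(records[0])

    # best[(j, v, k)] holds the largest conditional probability of any value
    # on attribute k given value v on attribute j.
    best = {}
    for (j, v, k, u), count in pair_counts.items():
        prob = count / value_counts[(j, v)]
        key = (j, v, k)
        best[key] = max(best.get(key, 0.0), prob)

    weights = {}
    for (j, v) in value_counts:
        others = [best[(j, v, k)] for k in range(n_attrs) if k != j]
        weights[(j, v)] = sum(others) / (n_attrs - 1)
    return weights
```

Under this assumed definition, a value that always appears with the same values on the other attributes receives a weight of 1, while a value that co-occurs evenly with many different values receives a lower weight. In a MapReduce setting, map_pairs would run per record on each node and the counting in reduce_counts would be performed by reducers keyed on the emitted tuples.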