Stratified linear systematic sampling based clustering approach for detection of financial risk group by mining of big data

Risk analysis is one of the most essential business activities because it discovers unknown risks such as financial risk, recovery risk, investment risk, operational risk, credit risk, debit risk, and so on. Clustering is a data mining technique that uses the behavior and nature of data to discover unexpected risks in business data. In a big data setting, clustering algorithms face execution time and cluster quality challenges due to the primary attributes of big data. This study proposes a Stratified Systematic Sampling Extension (SSE) approach for risk analysis in big data mining using a clustering methodology executed on a single machine. Sampling is a data reduction technique that saves computation time and improves the cluster quality, scalability, and speed of the clustering algorithm. The proposed sampling plan first formulates the strata by selecting the minimum-variance dimension and then selects samples from each stratum using random linear systematic sampling. The clustering algorithm produces robust clusters of risk and non-risk groups from the sample data and extends the sample-based clustering results to the final clustering results using Euclidean distance. The performance of the SSE-based clustering algorithm has been compared to the existing K-means and K-means++ algorithms using the Davies-Bouldin score, Silhouette coefficient, Scattering Density between clusters Validity, Scattering Distance Validity and CPU time validation metrics on financial risk datasets. The experimental results demonstrate that the SSE-based clustering algorithm achieves better clustering objectives in terms of cluster compaction, separation, density, and variance while minimizing iterations, distance computations, data comparisons, and computational time. Statistical analysis using the Friedman test shows that the improvements of the proposed sampling plan are statistically significant.


1 Introduction

The recent rise of internet-based technologies such as cloud computing, the internet of things, social networks, web applications and communication technologies has led to the generation of huge amounts of data every day (Iyapparaja and Deva Arul 2020). Big data is defined by three primary characteristics: volume (a large amount of data), variety (a wide range of data categories), and velocity (the speed of data, i.e., data in motion) (Deva Arul and Iyapparaja 2020). The discovery of knowledge, unknown relationships, actionable information, and hidden patterns in massive data sources is known as big data mining, and these mining results are useful for organizational decision-making (Hariri et al. 2019; Lozada et al. 2019). The goal of data mining is to forecast unknown insights and to generate a concise description of the predicted values. The data relation approach is another form of big data mining that identifies relationships between the attributes of a dataset (Maheshwari et al. 2021).

Transparency is required in big data mining research because the enormous volume of data contains significant knowledge, relationships, and hidden patterns. The variety of data types and sources leads to a wide range of mining results, and the velocity of data defines real-time mining (Kacfah Emani et al. 2015). Big data mining offers stability, low computational cost, and better risk management capability (Moharm 2019; Xian et al. 2021). The combination of statistics and data mining techniques is known as intelligent big data mining, and it addresses the process and management challenges in the mining framework (Hariri et al. 2019).

Clustering is an unsupervised predictive data mining technique that uses homogeneity, similarity, and other properties to identify risk. Each risk cluster has a high degree of internal resemblance as well as a significant degree of separation from other clusters. High within-cluster similarity means that the distances between data points inside a cluster are minimal, while high separation means that the distances between clusters are maximal. The clustering objective is therefore to achieve high compaction and high separation (Khondoker 2018). The primary goals of data clustering are to reveal the underlying structure of the data, to classify it, and to compress it (Jain 2010).
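As a concrete illustration of these two objectives, the following sketch (not taken from the paper; the toy data and NumPy implementation are ours) measures compaction as the mean distance of points to their own centroid and separation as the distance between the two centroids on a synthetic risk/non-risk dataset.

```python
# Minimal sketch: quantifying cluster compaction and separation on toy data.
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic groups ("risk" and "non-risk") around different centres.
risk = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
non_risk = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))
X = np.vstack([risk, non_risk])
labels = np.array([0] * 50 + [1] * 50)

centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])

# Compaction: mean distance of each point to its own centroid (smaller is better).
compaction = np.mean([np.linalg.norm(x - centroids[k]) for x, k in zip(X, labels)])
# Separation: distance between the two cluster centroids (larger is better).
separation = np.linalg.norm(centroids[0] - centroids[1])
print(f"compaction={compaction:.3f}, separation={separation:.3f}")
```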

Clustering is used in the disciplines of pattern recognition, image segmentation, artificial intelligence, wireless sensor networks, text analysis, bioinformatics, financial analysis, document clustering, text clustering, vector quantization, and so on (Pandove et al. 2018; Abualigah 2019; Xie et al. 2019). Clustering is also applied to risk analysis in probabilistic risk assessment (Mandelli et al. 2013), project interdependent risk (Marle et al. 2013), financial risk analysis (Kou et al. 2014), supplier risk assessment (Kara 2018), insurance risk analysis, dynamic rockfall risk analysis (Wang et al. 2020), fall risk assessment (Caicedo et al. 2020), disease risk (Gopalakrishnan and Iyapparaja 2021), etc.

Credit and debit risk concentrations are managed by banks and financial departments. Financial risks are associated with any form of financing and are intertwined with business, investment, liquidity, credit, legal and operational costs. Liquidity risk is caused by insufficient fund liquidity. Credit risk stems from information asymmetry, whereas operational and legal risks are caused by imperfect laws and regulations. Risk analysis improves organizational safety and performance by identifying factors that contribute to risk (Mandelli et al. 2013). Mandelli et al. (2013) applied a clustering algorithm to probabilistic risk assessment and identified similar behavioral risk events through principal component analysis and mean-shift methodology. They determined accident transients and critical risks for aircraft crash recovery using an ADAPT-generated dataset. Marle et al. (2013) used an interaction-based clustering algorithm to group risks. The interaction-based clustering algorithm uses clustering objectives for prioritization and resource allocation during risk categorization.

Kou et al. (2014) evaluated clustering algorithms for financial risk analysis as a multiple criteria decision making (MCDM) problem using real-life credit and bankruptcy risk datasets. Financial data analysis is referred to as business intelligence; it detects financial risks ahead of time, supports actions to reduce defaults, and enables better decision-making. Kara (2018) used a clustering approach to categorize supplier risk into quantitative and qualitative categories. Quantitative risk is concerned with costs, quality, and logistics, whereas qualitative risk is associated with economic variables and rating risk. Kara (2018) used the K-means clustering algorithm to assess 17 qualitative and quantitative supplier risks. The data points in each cluster reflect a specific risk, and their interpretation aids in risk management and supplier risk mitigation. The study used a supplier risk-related dataset and identified reliable suppliers through minimization of risk.

The authors of (Jain 2010; Wang and He 2016; Pandove et al. 2018; Zhao et al. 2019) described big data risk clustering techniques such as incremental, divide and conquer, data summarization, sampling, efficient nearest neighbor, dimension reduction, parallel computing, condensation, and granular computing. The arithmetic optimization algorithm (Abualigah et al. 2021a) and the Aquila optimizer (Abualigah et al. 2021b) are other risk analysis approaches that search for optimum solutions to real-world population-based problems on large datasets. Population-based optimization methods search for multiple solutions at each iteration; for this reason, a variety of risk clusters can be obtained (Abualigah and Diabat 2021).

Sampling is a data reduction strategy that tackles a variety of problems in data mining (Umarani and Punithavalli 2011) and is useful for improving efficiency and performance (Chen et al. 2002; Furht and Villanustre 2016). Sampling-based data mining techniques reduce the amount of data used for mining, and the reduced data is referred to as sample data. The sample data is representative of the entire dataset (Hu et al. 2014). The sampling approach minimizes data size, computation time, and memory consumption, and it establishes a balance between the computational cost of high-volume data and the quality of approximate results (Ramasubramanian and Singh 2016; Mahmud et al. 2020). The size reduction of the dataset saves memory capacity and mining time and improves management efficiency and resource allocation (Tchagna Kouanou et al. 2018; Mani and Iyapparaja 2020). Sampling can also be regarded as a tool for providing approximate results (Zhao et al. 2019). It achieves approximate results within a specific time with query optimization for decision support systems. It is used wherever data volume is very high, such as risk analysis, database sampling, stream sampling, online aggregation, discovery of correlations and data relationships, sampling-enhanced query engines, data aggregation, and so on (Haas 2016; Kim and Wang 2019; Jabłoński and Jabłoński 2020).

Conventional data mining approaches are not scalable for big data risk mining and achieve inadequate computational efficiency and effectiveness due to the large scale of the datasets (Hariri et al. 2019; Kumar and Mohbey 2019). The objective of this study is to improve the scalability, computational cost, efficiency, effectiveness, and speed-up of big data clustering by using a stratified random linear systematic sampling approach for financial risk analysis on a single machine. This study is divided into five sections. The second section reviews related work on sampling approaches in clustering methods and lists the sampling-based clustering algorithms. The third section presents the stratified systematic sampling approach and provides a sampling plan for big data clustering. Section four covers the implementation of the proposed work using the K-means and K-means++ algorithms and validates them using internal measures on risk-related datasets. The last section concludes the work and explores further possibilities.

2 Related works

This section reviews sampling-based clustering algorithms from the perspective of existing research and thereafter examines work on stratified and systematic sampling. Sampling-based clustering algorithms can be categorized by their use of uniform random sampling, systematic sampling, progressive sampling, reservoir sampling and stratified sampling. Uniform random sampling uses a random number generator and unbiased selection to select data from large datasets (Brus 2019). In systematic sampling, the initial data point of the sample is picked at random, whereas the remaining data points are chosen in a predetermined order over the dataset (Aune-Lundberg and Strand 2014). Progressive sampling begins with a fixed sample size and gradually increases it until a satisfactory performance measure, such as accuracy, speed-up ratio, or efficiency, is obtained (Umarani and Punithavalli 2011). Reservoir sampling is used for data stream mining of homogeneous and heterogeneous data sources (Satyanarayana 2014). Stratified sampling separates a diverse dataset into homogeneous subsamples of data known as strata, and then collects samples from the strata using random sampling (Zhao et al. 2019).
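For illustration, the sketch below (an assumed implementation, not code from the cited works) contrasts uniform random sampling with linear systematic sampling of row indices; the interval k = n_rows // sample_size and the random starting point follow the usual convention described above.

```python
# Illustrative sketch: uniform random vs. linear systematic sampling of row indices.
import numpy as np

def uniform_random_sample(n_rows, sample_size, seed=0):
    """Unbiased random selection of sample_size distinct row indices."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_rows, size=sample_size, replace=False))

def linear_systematic_sample(n_rows, sample_size, seed=0):
    """Random start, then every k-th row, where k = n_rows // sample_size."""
    rng = np.random.default_rng(seed)
    k = max(n_rows // sample_size, 1)   # sampling interval
    start = rng.integers(0, k)          # random starting point
    return np.arange(start, n_rows, k)[:sample_size]

print(uniform_random_sample(1000, 10))
print(linear_systematic_sample(1000, 10))
```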

Ben-David (2007) designed a sample-based clustering framework for K-median, K-means and other K-center-based clustering algorithms. The sample-based clustering framework achieved near-optimal approximation clustering results based on the sample sizes and other accuracy parameters. The key advantage of this framework is that it is unaffected by data dimensionality, data size and the data-generating distribution. Wang et al. (2008) discussed a selective sampling extension of non-Euclidean relational fuzzy c-means (eNERF) for approximate clustering of very large datasets. The extension replaces the progressive over-sampling stage of eNERF with selective sampling to obtain an n × n sample matrix from the full N × N relational matrix.

da Silva et al. (2012) proposed the CLUSMASTER method for data streams in sensor networks by combining partitional risk clustering with a random sampling approach. The CLUSMASTER method achieves an efficient sampling rate while minimizing the sensors' sum of squared error in a network. Rajasekaran and Saha (2013) proposed a Deterministic Sampling-based Clustering (DSC) approach to improve scalability and speed-up on large, high-dimensional datasets using a generic technique. The DSC approach is implemented using hierarchical, K-means, K-means++ and K-medians clustering.

Parker et al. (Rajasekaran and Saha 2013) contributed two sampling-based fuzzy clustering algorithms. The first is a geometric progressive fuzzy c-means (GOFCM) clustering algorithm based on progressive sampling. The second is a minimum sample estimate random fuzzy c-means (MSERFCM) clustering algorithm based on random sampling and partition simplification. Both algorithms increase clustering speed by determining the optimal sample size and stopping criterion. Jaiswal et al. (2014) proposed a D2-sampling method for initial centroid selection that minimizes the Euclidean distance between data points and cluster centers more effectively than random sampling. Zhang et al. (2015) proposed the URLSamp approach for URL clustering and web accessibility utilizing a page sampling method. The URLSamp approach reduces the huge I/O and computation costs of web page clustering. Li et al. (2016) recommended distributed stratified sampling (DSS) for big data processing. DSS minimizes the computational cost and obtains higher efficiency and lower data-transmission costs. In that work, stratified sampling was used to extract the strata, and a modified reservoir method was then used for sample extraction.

Ros and Guillaume (2017) introduced the DIstance and DEnsity-based Sampling (DIDES) approach for distance-based clustering algorithms. The DIDES approach avoids the large number of distance calculations that cause high time complexity. The granularity parameter of the DIDES approach controls the sample size, and the approach is insensitive to the initialization, noise and data size properties of clustering. Jia et al. (2017) introduced a Nyström spectral clustering approach based on incremental sampling. The Nyström spectral method uses a probability-based incremental sampling strategy to reduce the computing cost of big data spectral clustering. The combination of random and incremental sampling decreases sampling errors and enhances sampling accuracy. Zhan (2017) used the Nyström method for the eigenfunction solution in the spectral clustering algorithm. The Nyström sampling method extracts a small number of samples to infer the overall clustering results on a large-scale dataset. Nyström sampling reduces time and space complexity while achieving a feasible, high-quality, and effective risk clustering solution.

Aloise and Contardo (2018) discussed an iterative sampling-based exact algorithm for the minimax diameter clustering problem (MDCP). The MDCP is NP-hard and is related to partitional risk clustering algorithms. The sampling-based exact algorithm minimizes the maximum intra-cluster dissimilarity on large datasets by using a small sample size and a heuristic procedure; in addition, the heuristic procedure ensures that the optimal solution will be found. Ben et al. (2019) contributed STiMR K-means for big data, built on reservoir random sampling, the triangle inequality, and the MapReduce framework. Reservoir random sampling decreases the number of data points required for cluster prototype formulation, the triangle inequality reduces the number of data and distance comparisons, and MapReduce is employed for the parallel execution of STiMR K-means. The STiMR K-means algorithm enhances scalability, cluster quality and running time.

Luchi et al. (2019) proposed Rough-DBSCAN and I-DBSCAN as improvements to the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm. Both sampling-based clustering techniques shorten execution time by reducing the number of data patterns. Kim et al. (Chen et al. 2019) suggested a robust inverse sampling method for big data integration via propensity score (PS) weighting. The PS weighting approach reduces selection bias and increases sampling performance with robust estimation; inverse sampling can be used by any data mining algorithm to obtain approximate results. Zhang and Wang (2021) developed an optimal probability subsampling approach and optimal sample allocation procedures for big data analysis using airline datasets. The proposed sampling strategy minimizes the asymptotic mean squared error and ensures an optimal sampling criterion.

Some other sampling-based clustering algorithms applicable to risk evaluation are SDBSCAN (Ji-hong et al. 2001), DBRS (Wang and Hamilton 2003), EM clustering (Wang et al. 2008), K-means Sampler (Bejarano et al. 2011), CURE (Fahad et al. 2014), RSEFCM (Fahad et al. 2014), CLARANS (Fahad et al. 2014), PS-SC (Jia et al. 2017), TEP (Härtel et al. 2017), INS (Zhan 2017), BIRCH (Pandove et al. 2018), Single Pass FCM (Zhao et al. 2019), SSEFCM (Zhao et al. 2019) etc.

Ngu et al. (Databases et al. 2004) derived the systematic sampling-based SYSSMP approach for estimating query result sizes. The SYSSMP approach attains a more representative sample than random sampling. Judez et al. (2006) evaluated the efficiency of random sampling and stratified random sampling in terms of accurate sample size and better sampling results; they employed the Neyman approach for sample extraction in stratified random sampling, which achieved precise variance within the sample pool. Keskintürk (2007) adopted a genetic algorithm for stratified sampling. The genetic algorithm determines the minimum sample variance and identifies the sample data from each stratum.

Étoré and Jourdain (2010) introduced an adaptive stratified sampling strategy that minimizes the sample variance while selecting the optimal sample data from each stratum. Kao et al. (2011) enhanced the sampling plan of systematic sampling through a Markov process; Markov systematic sampling improved the effectiveness of the sample size and sample means. The advantages of systematic sampling over random sampling were established by Aune-Lundberg and Strand (2014): systematic sampling obtains the lowest sample variance and decent autocorrelation, and it produces a more accurate sample under stratification than random sampling does. Larson et al. (2019) evaluated systematic and random sample designs and found that systematic sampling beats random sampling in terms of the variance estimator, sample size, and number of data points.

3 Proposed work

The clustering process detects risk factors without any extraneous information. Conventional clustering algorithms are not scalable to big data processing due to huge computation costs and memory requirements. The literature survey discussed various sample-based clustering approaches, which face challenges in terms of randomization, homogeneous samples, and algorithm modification. Sample data obtained from random sampling does not ensure that the sample is homogeneous or that it covers all regions of the dataset. Another issue with existing algorithms is algorithm modification, because most algorithms in the literature are internally modified for specific applications; therefore, computation costs increase as application behavior changes.

The proposed sampling plan combines stratification with systematic sampling. The stratification process extracts homogeneous risk data from large datasets based on the minimum-variance dimension; therefore, the fraction of risk data to be processed is minimized and similar risk groups are formed before the clustering process. Systematic sampling then extracts the samples for clustering because it covers the data points of the dataset in a consistent manner, so the extracted sample strongly represents the entire dataset. Thus, stratification reduces computation costs, and systematic sampling minimizes memory consumption for financial risk cluster identification.
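A minimal sketch of this sampling plan is given below. It is illustrative only: the number of strata, the equal-sized splitting of the sorted rows, and the sampling fraction are assumptions for demonstration, not the paper's exact parameterization.

```python
# Sketch of the described plan: stratify on the minimum-variance dimension,
# then draw a linear systematic sample from each stratum (illustrative only).
import numpy as np

def stratified_systematic_sample(X, n_strata=5, fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Stratify on the dimension with minimum variance.
    dim = np.argmin(X.var(axis=0))
    order = np.argsort(X[:, dim])
    strata = np.array_split(order, n_strata)   # equal-sized strata over sorted rows
    # 2. Linear systematic sampling within each stratum: random start, fixed step.
    sample_idx = []
    for stratum in strata:
        m = max(int(len(stratum) * fraction), 1)
        k = max(len(stratum) // m, 1)
        start = rng.integers(0, k)
        sample_idx.append(stratum[start::k][:m])
    return np.concatenate(sample_idx)

# Example: draw a 10% sample from a synthetic 100000 x 8 financial-attribute matrix.
X = np.random.default_rng(1).normal(size=(100_000, 8))
sample = X[stratified_systematic_sample(X)]
print(sample.shape)
```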

Existing research (Pandey and Shukla 2019, 2020; Zhao et al. 2019) and the literature reviewed in the second section establish the practicality of sample-based plans for clustering across several domains. This section describes the clustering objective and sampling constraints and presents the stratified systematic sampling extension (SSE) based clustering model for financial risk analysis in big data mining using single machine execution. The proposed method enhances computational efficiency and reduces computation cost and computing time without affecting cluster quality or the clustering objective.

3.1 Objective function for risk-based dataset

Let the risk-based dataset \(X=\left\{x_{1},x_{2},\dots ,x_{N}\right\}\) of N data points, described in a d-dimensional risk attribute space, be partitioned into K clusters \(C=\left\{C_{1},C_{2},\dots ,C_{K}\right\}\) on the basis of a predefined similarity function. A successful clustering always satisfies the conditions \(C_{1}\cup C_{2}\cup \dots \cup C_{K}=X\) and \(C_{1}\cap C_{2}\cap \dots \cap C_{K}= \theta\) [17]. The considered clustering approach minimizes the Sum of Squared Error (SSE) function. The objective criterion optimizes the within-cluster SSE, which is described in Eq. 1 (Jain 2010).

$$WSSE\left( C \right) = \mathop \sum \limits_{k = 1}^{K} \sum\nolimits_{x_{i} \in C_{k} } {\left\| {x_{i} - \mu_{k} } \right\|^{2} }$$

where \(x_{i}\) is a data point and \(\mu_{k}\) is the centroid of cluster \(C_{k}\). The construction of the clusters \(C_{k}\) is described in Eqs. 2 and 3 for minimization of the SSE problem (Xiao and Yu 2012).
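As a worked illustration of Eq. 1, the sketch below computes the within-cluster SSE for a given partition; the K-means fit via scikit-learn is only an illustrative stand-in for the clustering step, not the proposed SSE-based algorithm.

```python
# Sketch: computing the within-cluster SSE of Eq. 1 for a given partition.
import numpy as np
from sklearn.cluster import KMeans

def within_cluster_sse(X, labels, centroids):
    """WSSE(C) = sum over clusters k of sum over x_i in C_k of ||x_i - mu_k||^2."""
    return sum(
        np.sum((X[labels == k] - mu) ** 2)
        for k, mu in enumerate(centroids)
    )

X = np.random.default_rng(2).normal(size=(1_000, 4))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(within_cluster_sse(X, km.labels_, km.cluster_centers_))
# km.inertia_ reports the same quantity (up to floating-point error).
```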