Stateoftheart on clustering data streams
 Mohammed Ghesmoune^{1}Email author,
 Mustapha Lebbah^{1} and
 Hanene Azzag^{1}
Received: 20 April 2016
Accepted: 14 September 2016
Published: 1 December 2016
Abstract
Clustering is a key data mining task. This is the problem of partitioning a set of observations into clusters such that the intracluster observations are similar and the intercluster observations are dissimilar. The traditional setup where a static dataset is available in its entirety for random access is not applicable as we do not have the entire dataset at the launch of the learning, the data continue to arrive at a rapid rate, we can not access the data randomly, and we can make only one or at most a small number of passes on the data in order to generate the clustering results. These types of data are referred to as data streams. The data stream clustering problem requires a process capable of partitioning observations continuously while taking into account restrictions of memory and time. In the literature of data stream clustering methods, a large number of algorithms use a twophase scheme which consists of an online component that processes data stream points and produces summary statistics, and an offline component that uses the summary data to generate the clusters. An alternative class is capable of generating the final clusters without the need of an offline phase. This paper presents a comprehensive survey of the data stream clustering methods and an overview of the most wellknown streaming platforms which implement clustering.
Keywords
Data stream clustering Streaming platforms StateoftheartBackground
In today’s applications, evolving data streams are ubiquitous. Indeed, examples of applications relevant to streaming data are becoming more numerous and more important, including network intrusion detection, transaction streams, phone records, web clickstreams, social streams, weather monitoring, etc. There is active research on how to store, query, analyze, extract and predict relevant information from data streams. Clustering is a key data mining task. This is the problem of partitioning a set of observations into clusters such that the intracluster observations are similar (or close) and the intercluster observations are dissimilar (or distant). The other objective of clustering is to reduce the complexity of the data by replacing a group of observations (cluster) with a representative observation (prototype).
In this paper, we consider the problem of clustering data in the form of a stream, i.e. a sequence of potentially infinite, nonstationary data (the probability distribution of the unknown data generation process may change over time) arriving continuously (which requires a single pass through the data) where random access to the data is not feasible and storing all the arriving data is impractical. When applying data mining techniques, and specifically clustering algorithms, to data streams, restrictions in execution time and memory have to be considered carefully. To deal with time and memory restrictions, many of the existing data stream clustering algorithms modify traditional nonstreaming methods to use the twophase framework proposed in [1] to deal with streaming data, e.g., DenStream [2] is an extension of DBSCAN algorithm, StreamKM++ [3] of kmeans++, StrAP [4] of AP, etc.
Realtime processing means that the ongoing data processing requires a very low response delay. The velocity, which refers to that Big Data are generated at high speed (speed of data in and out), is an important concept in the Big Data domain [5]. With the increasing importance of data stream mining applications, many streaming platforms have been proposed. These can be classified in two categories: traditional or nondistributed such as MOA [6] and distributed streaming platforms such as Spark Streaming [7] and Flink [8]. The latter two, may be considered as the most widely used streaming platforms. These distributed streaming systems are based on two processing models, recordatatime and microbatching. On a recordatatime processing model, longrunning stateful operators process records as they arrive, update the internal state, and send out new records. On the other hand, the microbatching processing model runs each streaming computation as a series of deterministic batch computations on small time intervals, which is implemented in Spark Streaming [7].
General surveys have been recently published in the literature for mining data streams [9–13]. The authors of [14] introduced a taxonomy to classify data stream clustering algorithms. The work presented in [15] is a thorough survey of stateoftheart densitybased clustering algorithms over data streams. This paper presents a thorough survey of the stateoftheart for a wide range of data stream clustering algorithms and an overview of the most wellknown streaming platforms. The remainder of this paper is organized as follows. Section “Data stream clustering methods” presents in a comprehensive manner the most relevant works on data stream clustering algorithms. These algorithms are categorized according to the nature of their underlying clustering approach. Section “Streaming platforms” overviews the most wellknown streaming platforms with a focus on the streaming clustering task. Section “Conclusion” concludes this paper.
Data stream clustering methods
GNG based algorithms
Growing Neural Gas
Growing Neural Gas (GNG) [17] is an incremental selforganizing approach which belongs to the family of topological maps such as SelfOrganizing Maps (SOM) [18] or Neural Gas (NG) [19]. It is an unsupervised clustering algorithm capable of representing a high dimensional input space in a low dimensional feature map. Typically, it is used for finding topological structures that closely reflect the structure of the input distribution. Therefore, it is used for visualization tasks in a number of domains [19, 20] as neurons (nodes), which represent prototypes, are easy to understand and interpret.
The GNG algorithm constructs a graph of nodes in which each node has its associated prototype. Prototypes can be regarded as positions in the input space of their corresponding nodes. Pairs of nodes are connected by edges (links), which are not weighted. The purpose of these links is to define the topological structure. These links are temporal in the sense that they are subject to aging during the iteration steps of the algorithm and are removed when they become “too old” [20].
Starting with two nodes, and as a new data point is available, the nearest and the secondnearest nodes are identified, linked by an edge, and the nearest node and its topological neighbors are moved toward the data point. Each node has an accumulated error variable. Periodically, a node is inserted into the graph between the nodes with the largest error values. Nodes can also be removed if they are identified as being superfluous. This is an advantage compared to SOM and NG, as there is no need to fix the graph size in advance. Algorithm 1 outlines an online version of the GNG approach. In this version, unlike the standard approach of GNG, the data is seen only once.
A number of authors have proposed variations on the Growing Neural Gas (GNG) approach. The GNG algorithm creates a new node every λ iterations (λ is fixed by the user as an input parameter). Hence, it is not adapted for data streams, or nonstationary datasets, or to novelty detection. In order to deal with nonstationary datasets, the author of [21] has investigated modifying the network by proposing an online criterion for identifying “useless” nodes. The algorithm proposed is known as the Growing Neural Gas with Utility (GNGU). Slow changes of the distribution are handled by adaptation of existing nodes, whereas rapid changes are handled by removal of “useless” neurons and subsequent insertions of new nodes in other places.
The authors of [22] modified GNG to detect incrementally emerging cluster structures. The proposed GNGC algorithm is able to match the temporal distribution of the original dataset by creating a new node whenever the received new data point is too far from its nearest node. It is noted that the algorithm is computationally demanding.
The clustering method proposed in [23] consists of two steps. In the first step, the data are prepared by generating the Voronoi partition using a modified GNG algorithm (which does not exceed linear complexity). The result is that the number of intermediate clusters is much smaller than the number of original objects. In the second step the intermediate clusters are clustered using conventional algorithms that have a much higher computational complexity (for this reason they should not be used for clustering the full volume of initial data). The approach examines hierarchical clustering of GNG units using single linkage and Ward’s method as linkage criteria. Although the clustering results look promising, the approach has the drawback that they have to manually identify the “right” level in the cluster hierarchy to obtain an adequate clustering of the input space.
GWR
The ‘Grow When Required’ (GWR) network [24] may add a new node at any time, whose position is dependent on the input and the current winning node. The GWR deals with the problem of novelty detection by adding new nodes into the network structure whenever the activity of the current bestmatching node is below some threshold, which implies that the bestmatching node is not trained to deal with that particular input. This means that the network grows very quickly when new data is presented, but stops growing once the network has matched the data to a given accuracy. This has benefits in that there is no need to decide in advance how large the network should be, as nodes will be added until the network is saturated. This means that for small datasets the complexity of the network is significantly reduced. In addition, if the dataset changes at some time in the future, further nodes can be added to represent the new data without disturbing the network that has already been created [24, 25]. Considering one iteration of the GWR algorithm, GWR has approximatively the same time complexity as one iteration of GNG. Hence, the complexity of GWR is O(k n m) where k is the number of iterations, n is the number of data points of the data stream m is the number of nodes in the graph. For more details on the complexity of GNG the reader is referred to [26].
IGNG
Still in the same idea of relaxing the constraint of periodical evolution of the network, the IGNG [27] algorithm has been proposed. In this algorithm a new neuron is created each time the distance of the current input data to the existing neuron is greater than a predefined fixed threshold σ, which is dependent on the global datasets. However, one disadvantage of this algorithm is the global character of the parameter σ and also that it must be computed prior to the learning. In order to resolve this weakness, I2GNG [28] associates a threshold variable σ to each neuron. However, its major drawback is the initialization of the σ values at the creation of each node. The authors of [29] address the problem of choosing the final winner neuron among the many input equidistant neurons. They proposed some adaptations of the IGNG and I2GNG algorithms. Notably, the use of a labeling maximization approach as a clustering similarity measure (IGNGF) to replace the distance in the winner selection process.
The ability of selforganizing neural network models to manage realtime applications, using a modified learning algorithm for a growing neural gas network is addressed in [30]. The proposed modification aims to satisfy realtime temporal constraints in the adaptation of the network. The proposed learning algorithm can add a dynamic number of neurons per iteration. Indeed, a detailed study has been conducted to estimate the optimal parameters that keep a good quality of representation in the available time. The authors concluded that the use of a large number of neurons made it difficult to obtain a representation of the distribution of training data with good accuracy in realtime [30, 31].
AING [32] is an incremental GNG that learns automatically the distance thresholds of nodes based on its neighbors and data points assigned to the node of interest. It merges nodes when their number reaches a given upperbound.
GStream
More recently, GStream [33, 34] was proposed as a data stream clustering approach based on the Growing Neural Gas algorithm. GStream uses a stochastic approach to update the prototypes, and it was implemented on a “centralized” platform, which can be summarized as follows: starting with two nodes, and as a new data point is reached, the nearest and the secondnearest nodes are identified, linked by an edge, and the nearest node with its topological neighbors are moved toward the data point. Each node has an accumulated error variable and a weight, which varies over time using a fading function. Using an edge management procedure, one, two or three nodes are inserted into the graph between the nodes with the largest error values. Nodes can also be removed if they are identified as being superfluous.
 (i)
the topological structure is represented by a graph wherein each node represents a cluster, which is a set of “close” data points, and neighboring nodes (clusters) are connected by edges. The graph size is not fixed but may evolve;
 (ii)to reduce the impact of old data whose relevance diminishes over time, GStream uses an exponential fading functionwhere λ _{1}>0, defines the rate of decay of the weight over time, t denotes the current time and t _{0} is the timestamp of the data point. The weight of a node is based on data points associated with it:$$f(t) = 2^{\lambda_{1} (t  t_{0})} $$where m is the number of points assigned to the node c at the current time t. If the weight of a node is less than a threshold value then this node is considered as outdated and then deleted (with its links). For the same reason, links between nodes are also weighted by an exponential function;$$weight(c) = \sum_{i=1}^{m} 2^{\lambda_{1} (t  t_{i_{0}})} $$
 (iii)
unlike many other data stream algorithms that start by taking a significant number of data points for initializing the model (these data points can be seen several times), GStream starts with only two nodes. Several nodes (clusters) are created in each iteration, unlike the traditional Growing Neural Gas (GNG) [17] algorithm;
 (iv)
all aspects of GStream (including creation, deletion and fading of nodes, edges management, and reservoir management) are performed online;
 (v)
a reservoir is used to hold, temporarily, the very distant data points, compared to the current prototypes.
where \(\mathbf {w}_{c}^{(t)}\) is the previous center for the cluster, \(n_{c}^{(t)}\) is the number of points assigned to the cluster thus far, \(\mathbf {z}_{c}^{(t)}\) is the new cluster center from the current batch, and \(m_{c}^{(t)}\) is the number of points added to the cluster c in the current batch.
Hierarchical stream methods
A hierarchical clustering method groups the given data into a tree of clusters which is useful for data summarization and visualization. This is a binarytree based data structure called the dendrogram. Once the dendrogram is constructed, one can automatically choose the right number of clusters by splitting the tree at different levels to obtain different clustering solutions for the same dataset without rerunning the clustering algorithm again. Hierarchical clustering can be achieved in two different ways, namely, bottomup and topdown clustering. Though both of these approaches utilize the concept of dendrogram while clustering the data, they might yield entirely different sets of results depending on the criterion used during the clustering process [36]. In hierarchical clustering once a step (merge or split) is done, it can never be undone. Methods for improving the quality of hierarchical clustering have been proposed such as integrating hierarchical clustering with other clustering techniques, resulting in multiplephase clustering such as BIRCH [37].
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) incrementally and dynamically clusters multidimensional data points to try to produce the best quality clustering with the available resources (i. e., memory and time constraints) by making a single scan of the data, and to improve the quality further with a few additional scans. It should be noted that the BIRCH method is not designed for clustering data streams and cannot address the concept drift problem. The key characteristic of the BIRCH is to introduce a new data structure called a clustering feature (CF) as well as a CFtree. The CF can be regarded as a concise summary of each cluster. This is motivated by the fact that not every data point is equally important for clustering and we cannot afford to keep every data point in the main memory given that the overall memory is limited. On the other hand, for the purpose of clustering, it is often enough to keep up to the second order of data moment. In other words, CF is not only efficient, but also sufficient to cluster the entire data set [36, 37].

Incrementality If a point x is added to the cluster, the sufficient statistics are updated as follows:$$ \begin{aligned} &N_{i} \leftarrow N_{i} + 1;\\[3pt] &LS_{i} \leftarrow LS_{i} + x;\\[3pt] &SS_{i} \leftarrow SS_{i} + x^{2}; \end{aligned} $$

Additivity If C F _{1}=(N _{1},L S _{1},S S _{1}) and C F _{2}=(N _{2},L S _{2},S S _{2}) are the CF vectors of two disjoint clusters, merging them is equal to the sum of their parts. The additive property allows us to merge subclusters incrementally without accessing the original data set.$$ CF_{1} + CF_{2}= (N_{1}+N_{2}, LS_{1}+LS_{2}, SS_{1}+SS_{2}). $$
BIRCH requires two user defined parameters: B the branch factor or the maximum number of entries in each nonleaf node; and T the maximum diameter (or radius) of any CF in a leaf node. The maximum diameter T defines the examples that can be absorbed by a CF. Increasing T, more examples can be absorbed by a CF node and smaller CFTrees are generated. Each node in the CFtree represents a cluster which is in turn made up of at most B subclusters. All the leaf nodes are chained together for the purpose of efficient scanning. When a data point is available, it traverses down the current tree from the root, until it finds the appropriate leaf following the closestCF path, with respect to the L _{1} or L _{2} norms. The insertion on the CFtree can be performed in a similar way as the insertion in the classic Btree. If the closestCF in the leaf cannot absorb the data point, a new CF entry is created. If there is no room for new leaf, the parent node is split. A leaf node might be expanded due to the constraints imposed by B and T. The process consists of taking the two farthest CFs and creates two new leaf nodes. BIRCH operates in two main steps: the first step builds an initial CFtree in memory using the given amount of memory and recycling space on disk; the second step tries to cluster all the subclusters in the leaf nodes, called also the “global clustering”. There are two optional steps: the“tree condensing” step which aims to refine the initial CFtree by reinserting its leaf entries; and the “clustering refinement” step which reassigns all the data points based on the cluster centroid produced by the global clustering step.
EStream
EStream starts empty, and every new point either is mapped onto one of the existing clusters (based on a radius threshold) or a new cluster is created around it. Any cluster not meeting a predefined density level is considered inactive and remains isolated until achieving a desired weight. The weight of a cluster is the number of data elements assigned to this cluster. The algorithm employs an exponential decay function to weigh down the influence of older data, thus focuses on keeping an uptodate view of the data distribution. Clusters which are not active for a certain time period may be deleted from the data space.
HUEStream
HUEStream [39] which extends EStream in order to support uncertainty in heterogeneous data, i.e., including numerical and categorical attributes simultaneously. Uncertain data streams pose a special challenge because of the dual complexity of high volume and data uncertainty. This uncertainty is due to errors in the reading of sensors or other hardware collection technology. In many of these cases, the data errors can be approximated either in terms of statistical parameters, such as the standard deviation, or the probability density functions [9]. The Uncertain MICROclustering (UMicro) algorithm is proposed as a method for clustering uncertain data streams, which enhances the microclusters with additional information about the uncertainty of the data points in the clusters [40]. This information is used to improve the quality of the distance functions for the cluster assignments. HCluStream [41] extends the definition of the cluster feature vector to include categorical features, replaces the modified kmeans clustering with the corresponding kprototypes clustering which is able to handle heterogeneous attributes. The centroid of continuous attributes and the histogram of the discrete attributes are used to represent the microclusters, and the kprototype clustering algorithm is used to create the microclusters and macroclusters.
The distance function, cluster representation and histogram management are introduced for the different types of clustering structure evolution. A distance function between the probability distributions of two objects is introduced to support uncertainty in categorical attributes. To detect changes in the clustering structure, the proposed distance function is used to merge clusters and find the closest cluster of a given incoming data and the proposed histogram management to split clusters for categorical data. To decrease the weight of old data over time, a fading function is used. Experimental results show that HUEStream gives better cluster quality, in terms of purity and the Fmeasure, compared to UMicro for the KDDCUP’99 dataset [39].
ClusTree

an inner node contains between m and M entries. Leaf nodes contain between l and L entries. The root has at least one entry (m, M, l and L are input parameters).

an entry in an inner node stores: (i) a cluster feature of the objects that it represents. (ii) a cluster feature of the objects in the buffer. (iii) a pointer to its child node.

an entry in a leaf stores a cluster feature of the data point(s) it represents.

a path from the root to any leaf node has always the same length (balanced).
So, it uses also the microcluster structure as a compact representation of the data distribution. Anytime algorithms denote approaches that are capable of delivering a result at any given point in time, and of using more time if it is available to refine the result. The basic idea is to maintain measures for incremental computation of the mean and variance of microclusters so that the infeasible access to all past stream objects is no longer necessary. We recall that a microcluster is a cluster feature tuple (or a variant of it) C F=(n,L S,S S) of the number n of represented data points, their linear sum LS, and their squared sum SS. In the proposed method, CFs are created and updated by extending index structures from the Rtree family [42]. Such hierarchical indexing structures provide the means for efficiently locating the right place to insert any object from the stream into a microcluster. The idea is to build a hierarchy of microclusters at different levels of granularity. Given enough time, the algorithm descends the hierarchy in the index to reach the leaf entry that contains the microcluster that is most similar to the current object. If this microcluster is similar enough, it is updated incrementally by this object’s values. Otherwise, a new microcluster may be formed [16]. However, in anytime clustering of streaming data, there might not always be enough time to reach leaf level to insert the object. To deal with this, the authors provide some strategies for anytime inserts. By incorporating local aggregates, i.e., temporary buffers for “hitchhikers”, a solution is provided for the easy interruption of the insertion process so that it can be simply resumed at any later point in time. For very fast streams, aggregates of similar objects allow insertion of groups instead of single objects for even faster processing. For slower stream settings, alternative insertion strategies that exploit possible idle times of the algorithm to improve the quality of the resulting clustering are proposed [16].
Taking the means of the CFs as representatives, we can apply a kcenter clustering or density based clustering (e.g. kmeans or DBSCAN) to detect clusters of arbitrary shape.
Partitioning stream methods
A partitioningbased clustering algorithm organizes the objects into some number of partitions, where each partition represents a cluster. The clusters are formed based on a distance function like the kmeans algorithm which leads to finding only spherical clusters and the clustering results are usually influenced by noise.
CluStream

Microclusters: Statistical information about the data locality in terms microclusters are maintained. The microcluster structure is a temporal extension of the cluster feature vector [37]. The additivity property of the microclusters makes them a natural choice for the data stream problem. More precisely, a microcluster is tuple (N,L S,S S,L S T,S S T) where (N,L S,S S) are the three components of the CF vector (namely, the number of data points in the cluster, N; the linear sum of the N data points, LS; and the squared sum of the N data points, SS). The two other components are LST and SST (the sum and the sum of the squares of the time stamps of the N data points).

Pyramidal time frame: The microclusters are stored at time snapshots which follow a pyramidal pattern. This pattern provides an effective tradeoff between the storage requirements and the ability to recall summary statistics from different time horizons.
The data stream clustering algorithm proposed in [1] can generate approximate clusters in any userspecified length of history from the current moment. The online phase stores q microclusters in (secondary) memory, where q is an input parameter. Each microcluster has a maximum boundary, which is computed as the standard deviation of the mean distance of the data points to their centroids multiplied by a factor f. Each new point is assigned to its closest microcluster (according to the Euclidean distance) if the distance between the new point and the closest centroid falls within the maximum boundary. If so, the point is absorbed by the cluster and its summary statistics are updated. If none of the microclusters can absorb the point, a new microcluster is created. This is accomplished by either deleting the oldest microcluster or by merging two microclusters. The oldest microcluster is deleted if its timestamp is below a given threshold δ (input parameter). The q microclusters are stored in a secondary storage device in particular time intervals that decrease exponentially, which are referred to as snapshots. These snapshots allow the user to search for clusters in different time horizons through a pyramidal time window concept. This summary information in the microclusters is used by an offline component which is dependent upon a wide variety of user inputs such as the time horizon or the granularity of clustering. When the user specifies a particular time horizon of length h over which to find the clusters, then we need to find microclusters which are specific to that timehorizon. For this purpose, we find the additive property of the cluster feature vector very useful. The final clusters are determined by using a modification of a kmeans algorithm. In this technique, the microclusters are treated as pseudopoints which are reclustered in order to determine higher level clusters.
StreamKM++
StreamKM++ [3] is a twophase (onlineoffline) algorithm which maintains a small outline of the input data using the mergeandreduce technique. The merge step is performed by via a data structure, named the bucket set, which is a set of L buckets (also named buffers), where L is an input parameter. The reduce step is performed by a significantly different summary data structure that is suitable for highdimensional data, the coreset tree, when we consider that it reduces 2m data points to m data points (m is an input parameter). The advantage of such a coreset is that we can apply any fast approximation algorithm (for the weighted problem) on the usually much smaller coreset to compute an approximate solution for the original dataset more efficiently.
Consequently, the summary statistics of the child nodes v _{1} and v _{2} are updated. This is the expansion step of the tree, which creates two child nodes for a given inner node. When the tree has many leaf nodes, we have to decide which one should be expanded first. In this case, we start from the root node of the coreset tree and descend it by iteratively selecting a child node with probability proportional to \(\frac {SSE_{child}}{SSE_{parent}}\), until a leaf node is reached for the expansion step to be restarted. The coreset tree expansion stops when the number of leaf nodes is m.
When a new data point arrives, it is stored in the first bucket. If the first bucket is full, all of its data are moved to the second bucket. If the second bucket is full, the two buckets are merged resulting in 2m data points, which are then reduced to m data points, by the construction of a coreset tree, as previously detailed. The resulting m data points are stored in the third bucket, unless it is also full, and then again a new mergeandreduce step is needed [3, 14]. In its offline phase, the kmeans++ [43], which is executed on an input set of size m, is used for finding the final clusters. The kmeans++ method is a seeding procedure for the kmeans algorithm that guarantees a solution with a certain quality and gives good practical results.
StrAP

The first batch of data is used by AP to identify the first clusters and initialize the stream model.

As the stream flows in, each data point x _{ t } is compared to the prototypes; if too far from the nearest exemplar, x _{ t } is put in the reservoir, otherwise the stream model is updated accordingly.

The data distribution is checked for change point detection, using the PageHinkley significance test.

Upon triggering the change detection test, or if the number of outliers exceeds the reservoir size, the stream model is rebuilt based on the current model and reservoir, using a weighted version of AP (WAP).
The model of the data stream used in StrAP is inspired by DenStream [2]. It consists of a set of 4tuples (c _{ i },n _{ i },Σ _{ i },t _{ i }), where c _{ i } ranges over the clusters, n _{ i } is the number of items associated to cluster c _{ i },Σ _{ i } is the distortion of c _{ i } (sum of d(x,c _{ i })^{2}, where x ranges over all data points associated to c _{ i }), and t _{ i } is the last time stamp when a data point was associated to c _{ i }.
At time t, the data point x _{ t } is considered and its nearest cluster c _{ i } (w.r.t. distance d) in the current model is selected; if d(x _{ t },c _{ i }) is less than some threshold δ, heuristically set to the average distance between points and clusters in the initial model, x _{ t } is assigned to the ith cluster and the model is updated accordingly; otherwise, x _{ t } is considered to be an outlier, and put into the reservoir [4].
Approximations of the kmeans algorithm in the onepass streaming setting have been proposed in [45–47]. The streaming kmeans algorithm proposed in [45] is based on a divide and conquer approach. It uses the result of [43] as a subroutine, finding 3k logk centers of each block. Their experiment showed that this algorithm is an improvement over an online version of kmeans algorithm and was comparable to the batch version of kmeans.
The Highdimensional Projected Stream clustering method (HPStream) [48] introduces the concept of projected clustering to data streams. This algorithm is a projected clustering for highdimensional streaming data with higher clustering quality compared to CluStream [1].
SWClustering uses an EHCF (Exponential Histogram of Cluster Features) structure by combining Exponential Histogram with Cluster Feature to record the evolution of each cluster and to capture the distribution of recent data points [49]. It tracks the clusters in evolving data streams over sliding windows.
Densitybased stream methods
Densitybased algorithms are based on the connection between regions and density functions. In these types of algorithms, dense areas of objects in the data space are considered as clusters, which are segregated by low density area (noise). These algorithms find clusters of arbitrary shapes and generally they require two parameters: the radius and the minimum number of data points within a cluster.
The main challenge in the streaming scenario is to construct densitybased algorithms which can be efficiently executed in a single pass of the data, since the process of density estimation may be computationally intensive [9]. Amini [15] gives a survey on recent densitybased data streams clustering algorithms.
DenStream
DenStream [2] is a densitybased data stream clustering algorithm that also uses a feature vector based on the CF vector. By creating two kinds of microclusters (potential and outlier microclusters), in its online phase, DenStream overcomes one of the drawbacks of CluStream, its sensitivity to noise. Potential and outlier microclusters are kept in separate memories since they require different processing. Each potentialmicrocluster structure has an associated weight w that indicates its importance based on temporality. The weight of each data point decreases exponentially with time t via a fading function f(t)=2^{−λ t }, where λ>0. If the weight \( w = \sum _{j=1}^{n} f(t  T_{ij})\) is above a threshold input parameter μ then the corresponding cluster is considered as a coremicrocluster, where T _{ i1},…,T _{ in } are timestamps of data points p _{ i1},…,p _{ in }. At the time t, if w≥β μ then the microcluster is considered as potentialmicrocluster, else it is an outliermicrocluster, where β is the threshold of the outlier relative to coremicroclusters (0<β<1). Microclusters with no recent points tend to lose importance, i.e. their respective weights continuously decrease over time in outdatedmicroclusters. However, the latter could grow into a potential microcluster when, by adding new points, its weight exceeds the threshold. Weights of microclusters are periodically calculated and decision about removing or keeping them is made based on the weight threshold.
When a new data point arrives, the algorithm tries to insert it into its nearest potentialmicrocluster based on its updated radius. If the insertion is not successful, the algorithm tries to insert the data point into its closest outlier microcluster. If the insertion is successful, the cluster summary statistics will be updated accordingly. Otherwise, a new outlier microcluster is created to absorb this point. The Euclidean distance between the new data point and the center of the nearest potential or outlier microcluster is measured. A microcluster is chosen with the distance less than or equal to the radius threshold. DenStream has a pruning method in which it frequently checks the weights of the outliermicroclusters in the outlier buffer to guarantee the recognition of the real outliers. However, the nonrelease of the allocated memory when either deleting a microcluster or merging two old microclusters is considered as a limitation of the DenStream algorithm as well as the timeconsuming pruning phase for removing outliers [15]. In the offline phase, the potentialmicroclusters found during the online phase are considered as pseudopoints and will be passed to a variant of the DBSCAN algorithm in order to determine the final clusters.
SOStream
SOStream [50] is a densitybased clustering algorithm inspired by both the principle of the DBSCAN algorithm and selforganizing maps (SOM) [18], in the sense that a winner influences its immediate neighborhood. Generally speaking, densitybased clustering algorithms need setting a threshold manually (similarity threshold, grid size, etc.) for which is difficult to choose the most suitable value and if it is set to an unsuitable value, then the algorithm will suffer from overfitting, or from unstable clustering. SOStream addresses this problem by using a dynamically learned threshold value for each cluster based on the idea of building neighborhoods with a minimum number of points.
SOStream is also represented by a set of microclusters where for each cluster a cluster feature (CF) vector is stored, which is a tuple with three elements N _{ i }=(n _{ i },r _{ i },C _{ i }),n _{ i } is the number of data points assigned to N _{ i },r _{ i } is the cluster’s radius and C _{ i } is the centroid.
When a new point arrives, the nearest cluster is selected, based on the Euclidean distance to existing microclusters, and then absorbs this point if the calculated distance is less than a dynamically defined threshold. It also assigns the microclusters’ neighbors to the nearest cluster, i.e., the centroids of clusters sufficiently close to the winning cluster have their centroids modified to be closer to the winning cluster’s centroid. This approach is used to assist in merging similar clusters and increasing separation between different clusters. The neighborhood of the winner is defined based on the idea of a MinPts distance given by a minimum number of neighboring objects [2]. This distance is found by computing the Euclidean distance from any existing clusters to the winning cluster. If the new point is not absorbed by any microcluster, a new microcluster is created for it. In the SOStream algorithm, merging, updating and adapting dynamically the threshold value for each cluster are performed in an online manner. Clusters are merged if they overlap with a distance that is less than the mergethreshold, i.e., the spheres in ddimensional space defined by the radius of each cluster overlap. Hence, the threshold value is a determining factor for the number of clusters. When two clusters are merged, the largest radius of these two clusters is chosen to be the radius of the cluster to avoid losing any data points within the clusters. However, no split feature is proposed in the algorithm. SOStream also uses an exponential fading function to reduce the impact of old data whose relevance diminishes over time.
SVStream
SVStream (Support Vector based Stream clustering) [51] is a data stream clustering algorithm based on support vector clustering (SVC) and support vector domain description (SVDD).
In the Support Vector Clustering (SVC) [52] algorithm data points are mapped from the data space to a high dimensional feature space using a Gaussian kernel. In the feature space we look for the smallest sphere that encloses the image of the data. This sphere is mapped back to data space, where it forms a set of contours which enclose the data points. These contours are interpreted as cluster boundaries. Points enclosed by each separate contour are associated with the same cluster. Support vectors are used to construct cluster boundaries of arbitrary shape in SVC.
Support vector domain description (SVDD) [53] is a oneclass classifier inspired by the support vector classifier. The idea is to use kernels to project data into a feature space and then to find the sphere enclosing almost all data, namely not including outliers. SVDD has the possibility to reject a fraction of the training data points, when this sufficiently decreases the volume of the hypersphere. One inherent drawback of SVDD, which affects not only its outlier detection performance but also its general properties significantly, is that the resulting description is highly sensitive to the selection of the tradeoff parameter, which is difficult to estimate in practice.

SV is a support vector set.

BSV is a bounded support vector set.

∥μ∥^{2} is the squared length of the sphere center μ.

R _{ SV } is the radius of the sphere.

R _{ BSV } is the maximum Euclidean distance of the bounded support vectors from the sphere center μ.
The multisphere set SS is defined as a set consisting of multiple spheres, that is, S S={S ^{1},…,S ^{S S}}, where the superscript denotes the index of a sphere. In SVStream, the elements of a data stream are mapped to a kernel space, and the support vectors are used as the summary information of the historical elements to construct the cluster boundaries of arbitrary shape. To adapt both dramatic and gradual changes, multiple spheres are dynamically maintained, each describing the corresponding data domain presented in the data stream. When a new data batch arrives, if a dramatic change occurs, a new sphere is created; otherwise, only the existing spheres are updated to take into account the new batch. The data elements of this new batch are assigned with cluster labels according to the cluster boundaries constructed by the sphere set. Bounded support vector (BSVs) and a newly designed BSV decaying mechanism are introduced so as to respectively identify overlapping clusters and automatically detect outliers (noise) [51]. In the clustering process, if two spheres are too close to each other, they should be merged. In addition, eliminating old BSVs by the BSV decaying mechanism would help detect the tendency of a cluster to shrink or split.
OPTICSStream is an extension of the OPTICS algorithm [54] to the streaming data model. OPTICS uses a density identification method to create a onedimensional clusterordering of the data. OPTICSStream is an online visualization algorithm producing a map representing the clustering structure where each valley represents a cluster [55].
PreDeConStream [56] is based on the two phase process of mining data streams, which builds a microclusterbased structure to store an online summary of the streaming data. The technique is based on subspace clustering, targeting applications with high data dimensionality.
Gridbased stream methods
Gridbased clustering is another group of the clustering methods for data streams where the data space is quantized into finite number of cells which form the grid structure and perform clustering on the grids. Gridbased clustering maps the infinite number of data records in data streams to a finite number of grids. Then, the grids are clustered based on their density.
DStream
DStream [57] is also a twophase scheme which consists of an online component that processes input data stream and produces summary statistics and an offline component that uses the summary data to generate clusters. In the online component, the algorithm maps each input data point into a grid whereas in the offline component, it computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream and it can find clusters of arbitrary shapes. Unlike other algorithms such as CluStream [1], DStream automatically and dynamically adjusts the clusters without requiring user specification of target time horizon and number of clusters. Algorithm 5 outlines the overall architecture of DStream.
DSteam uses the characteristic vector concept associated to each grid. This is a tuple (t _{ g },t _{ m },D,l a b e l,s t a t u s), where t _{ g } is the last time when g is updated, t _{ m } is the last time when g is removed from grid list as a sporadic grid (if ever), D is the grid density at the last update, label is the class label of the grid, and s t a t u s={S P O R A D I C,N O R M A L} is a label used for removing sporadic grids.
The procedures initial_clustering (used in Line 8 of Algorithm 5) and adjust_clustering (used in Line 11 of 5) update the density of all active grids to the current time, first. Once the density of grids are determined at the given time, the clustering procedure is similar to the standard method used by densitybased clustering.
The offline component dynamically adjusts the clusters every gap time steps, where gap is an integer parameter. After the first gap, the algorithm generates the initial cluster (Lines 78). Then, the algorithm periodically removes sporadic grids and adjusts the clusters (Lines 911) [57].
One weakness of the approach is that a significant number of nonempty grid cells need to be discarded in order to keep the memory requirements in check. In many cases, such gridcells occur at the borders of the clusters. The discarding of such cells may lead to a degradation in cluster quality [9].
Analogously to DStream, MRStream [58] facilitates the discovery of clusters at multiple resolutions by using a grid of cells that can dynamically be subdivided into more cells using a tree data structure. In the online phase, it assigns new incoming data to the appropriate cell and updates the summary information. The offline component obtains a portion of the tree at a fixed hight h and performs clustering at the resolution level determined by h.
Summary
Comparison between algorithms (WL: weighted links, 2 phases : online+offline)
Algorithms  Based on  Topology  WL  Phases  Remove  Merge  Split  Fade 

SVStream  SVC, SVDD  ✗  ✗  online  ✓  ✓  ✓  ✓ 
StreamKM++  kmeans++  ✗  ✗  2 phases  ✓  ✓  ✓  ✓ 
StrAP  AP  ✗  ✗  2 phases  ✓  ✗  ✗  ✓ 
SOStream  DBSCAN, SOM  ✗  ✗  online  ✓  ✓  ✗  ✓ 
OPTICSStream  OPTICS  ✗  ✗  2 phases  ✓  ✓  ✗  ✓ 
IGNG  NG  ✓  ✗  online  ✗  ✗  ✗  ✗ 
HCluStream  kprototypes  ✗  ✗  2 phases  ✓  ✓  ✓  ✓ 
GWR  NG  ✓  ✗  online  ✗  ✗  ✗  ✗ 
GStream  NG  ✓  ✓  online  ✓  ✗  ✗  ✓ 
EStream  kmeans  ✗  ✗  2 phases  ✓  ✓  ✓  ✓ 
DStream    ✗  ✗  2 phases  ✓  ✓  ✓  ✓ 
DenStream  DBSCAN  ✗  ✗  2 phases  ✓  offline  ✗  ✓ 
ClusTree  kmeans or DBSCAN  ✗  ✗  2 phases  ✓  offline  ✓  ✓ 
CluStream  kmeans  ✗  ✗  2 phases  ✓  offline  ✗  ✗ 
AING  NG  ✓  ✗  online  ✗  ✓  ✗  ✗ 
Streaming platforms
In today’s applications, evolving data streams are ubiquitous. As the need by industry for real time analysis has emerged, an increasing number of systems to support realtime data integration and analytics in the recent years. Generally speaking, there exists two types of streaming processing systems. Traditional streaming platforms, on which we can implement a streaming algorithm using a traditional programming language in a sequential manner. Distributed streaming platforms, where the data is distributed across a cluster of machine and the processing model is implemented using the MapReduce framework. This section gives a survey on the most wellknown streaming platforms with a focus on the streaming clustering task. Liu [59] gives a general survey on realtime processing systems for big data.
Spark streaming
Spark Streaming [7] is an extension of the Apache Spark [60] project by adding the ability to perform online processing through a similar functional interface to Spark, such as map, filter, reduce, etc. Spark is a cluster computing system originally developed by UC Berkeley AMPLab. Now it is an umbrellaed project of Apache foundation. The execution model of Spark is based on an abstraction called Resilient Distributed Dataset (RDD), which is a distributed memory abstraction of data. Spark performs inmemory computations on large clusters in a faulttolerant manner through RDDs [61]. Spark Streaming runs streaming computations as a series of short batch jobs on RDDs withing a programming model called discretized streams (DStreams). The key idea behind DStreams is to treat a streaming computation as a series of deterministic batch computations on small time intervals. For example, we might place the data received each second into a new interval, and run a MapReduce operation on each interval to compute a count. Similarly, we can perform a running count over several intervals by adding the new counts from each interval to the old result. Spark Streaming can automatically parallelize the jobs across the nodes in a cluster. It also supports fault recovery for a wide array of operations.
Spark Streaming comes with a new approach for fault recovery, while classical streaming systems update the mutable state on a perrecord basis and use either replication or upstream backup for fault recovery. The replication approach creates two or more copies of each process in the data flow graph [62]. This can double the hardware cost, and if two nodes in the same replica fail, the system is not recoverable. In upstream backup [63], upstream nodes act as backups for their downstream neighbors by preserving tuples in their output queues while their downstream neighbors process them. If a server fails, its upstream nodes replay the logged tuples on a recovery node. The disadvantage of this approach is long recovery times, as the system must wait for the standby node to catch up.
To address these issues, DStreams employ another approach: parallel recovery. The system periodically checkpoints some of the state RDDs, by asynchronously replicating them to other nodes. For example, in a view count program computing hourly windows, the system could checkpoint results every minute. When a node fails, the system detects the missing RDD partitions and launches tasks to recover them from the latest checkpoint [7].
In the streaming clustering point of view, Spartakus^{2} is an opensource project on top of Sparknotebook^{3} which provides frontend packages for some clustering algorithms implemented using the MapReduce framework. This includes the MBGStream^{4} algorithm [35] (detailed in “Background” section) with an integrated interface for execution and visualization checks. MLlib [64] gives implementations of some clustering algorithms, especially a Streaming kmeans^{5} opensource code. streamDM^{6} is another open source software for mining big data streams using Spark Streaming, developed at Huawei Noah’s Ark Lab. For streaming clustering, it includes Clustream [1] and StreamKM++ [3].
Flink
Flink^{7} is an open source platform for distributed stream and batch data processing. The core of Flink is a streaming iterative data flow engine. On top of the engine, Flink exposes two languageembedded fluent APIs, the DataSet API for consuming and processing batch data sources and the DataStream API for consuming and processing event streams. The key idea behind Flink is the optimistic recovery mechanism that does not checkpoint every state [8]. Therefore, it provides optimal failurefree performance and simultaneously uses less resources in the cluster than traditional approaches. Instead of restoring such a state from a previously written checkpoint and restarting the execution, a userdefined, algorithmspecific compensation function is applied. In case of a failure, this function restores a consistent algorithm state and allows the system to continue the execution.
MOA
MOA^{8} (Massive Online Analysis) is a framework for data stream mining [6]. It includes tools for evaluation and a collection of machine learning algorithms. Related to the WEKA project^{9} (Waikato Environment for Knowledge Analysis), it is also written in Java, while scaling to more demanding problems. The goal of MOA is a benchmark framework for running experiments in the data stream mining context by proving storable settings for data streams (real and synthetic) for repeatable experiments, a set of existing algorithms and measures from the literature for comparison, and an easily extendable framework for new streams, algorithms and evaluation methods. MOA currently supports stream classification, stream clustering, outlier detection, change detection and concept drift and recommender systems. Currently MOA contains several stream clustering methods including: StreamKM++ [3], CluStream [1], ClusTree [16], DenStream [2], DStream [57].
SAMOA
SAMOA^{10} (Scalable Advanced Massive Online Analysis) is distributed streaming machine learning (ML) framework that contains a programing abstraction for distributed streaming ML algorithms. It is a project started at Yahoo Labs Barcelona. SAMOA is both a framework and a library [65]. As a framework, it allows algorithm developers to abstract from the underlying execution engine, and therefore reuse their code on different engines. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm^{11}, S4^{12}, and Samza^{13}. As a library, SAMOA contains implementations of stateoftheart algorithms for distributed machine learning on streams. For streaming clustering, it includes an algorithm based on CluStream [1].
Open challenges in data stream clustering
In today’s applications, evolving data streams are ubiquitous. Mining, knowledge discovery, or more specifically clustering streaming data is a recent domain compared to the offline (or batch) model. Thus, many of the challenges, issues and problems remain to be addressed in the streaming model. This section is devoted to discuss some challenging issues and further directions from the viewpoints of both academic research and industrial applications [11, 14, 66–68].
Protecting privacy and confidentiality. Data streams present new challenges and opportunities with respect to protecting privacy and confidentiality in data mining. The main objective is to develop such data mining techniques that would not uncover information or patterns which compromise confidentiality and privacy obligations. Privacybydesign seems to be an interesting paradigm to use.
Handling incomplete information. The problem of missing values, which corresponds to incompleteness of features, has been discussed extensively for the offline, static settings.
Uncertain data. In most applications we don’t have sufficient data for statistical operations so new methods are needed to manage uncertain data stream in an accurate and fast manner.
Variety of data. Data type diversity in a given stream (text, video, audio, static image, etc.) as well as differences in data processability (structured, semistructured, unstructured data). Clustering these diverse types of data together, coming in a streaming form, is very challenging. Another interesting future application of data stream clustering is social network analysis. The activities of social network members can be regarded as a data stream, and a clustering algorithm can be used to show similarities among members, and how these similar profiles (clusters) evolve over time.
Synopsis, sketches and summaries. A synopsis is compact data structures that summarize data for further querying. Samples, Histograms, Wavelets, Sketches describe basic principles and recent developments in building approximate synopses (that is, lossy, compressed representations) of massive data [69]. Data sketching via random projections is a tool for dimensionality reduction. Although this technique is extremely efficient, its main drawback is that it may ignore relevant features.
Distributed streams. Data streams are distributed in nature. For learning from distributed data, we need efficient methods in minimizing the communication overheads between nodes. Most importantly, in applications like monitoring, centralized solutions introduce delays in event detection and reaction, that can make mining systems inefficient. Many data clustering techniques are not trivial to parallelize. To develop distributed versions of some methods, a lot of research is needed with practical and theoretical analysis to provide new methods.
Evaluation of data stream algorithms. Although in the field of static classification such tools exist, they are insufficient in data stream environments due to such problems as: concept drift, limited processing time, verification latency, multiple stream structures, evolving class skew, censored data, and changing misclassification costs. Indeed, in the streaming context, we are more interested in how the evaluation metric evolves over time [66].
Autonomous and selfdiagnosis. Knowledge discovery from data streams requires the ability for predictive selfdiagnosis. A significant and useful intelligence characteristic is diagnostics, not only after failure has occurred, but also predictive (before failure) and advisory (providing maintenance instructions). The development of such selfconfiguring, selfoptimizing, and selfrepairing systems is a major scientific and engineering challenge. All these aspects require monitoring the evolution of the learning process, taking into account the available resources, and the ability to reason and learn about it [67, 68].

New Data: All data entering the system is dispatched to both the batch layer and the speed layer for processing.

Batch layer: This layer has two functions: (i) managing the master dataset, an immutable, appendonly set of raw data, and (ii) to precompute arbitrary query functions from scratch, called batch views.

Serving layer: This layer indexes the batch views so that they can be queried in ad hoc with low latency.

Speed layer: This layer compensates for the high latency of updates to the serving layer, due to the batch layer. Using fast and incremental algorithms, the speed layer deals with recent data only.

Queries: Any incoming query can be answered by merging results from both batch views and realtime views.
Designing data stream clustering methods in a Lambda Architecture where we can benefit from the high accuracy of the batch model is very interesting and challenging.
Conclusion
Recently, examples of applications relevant to streaming data have become more numerous and more important, including network intrusion detection, transaction streams, phone records, web clickstreams, social streams, weather monitoring, etc. Indeed, the data stream clustering problem has become an active research in recent years. This problem requires a process capable of partitioning observations continuously while taking into account restrictions of memory and time.
In this paper, we surveyed, in a detailed and comprehensive manner, a number of the representative stateoftheart algorithms for the clustering over data streams. These algorithms are categorized according to the nature of their underlying clustering approach, including GNG, hierarchical, partitioning, density, and gridbased stream methods. Motivated by the need by industry for real time analysis, an increasing number of systems to support realtime data integration and analytics has emerged in recent years. We have made an overview of the most wellknown opensource streaming systems, including Spark Streaming, Flink, MOA, and SAMOA, with a focus on the streaming clustering task.
Endnotes
^{1} See http://spark.apache.org/streaming/
^{2} See https://hub.docker.com/r/spartakus/coliseum/
^{3} See http://sparknotebook.io/
^{4} See https://github.com/mghesmoune/sparkstreamingclustering
^{5} See http://spark.apache.org/docs/latest/mllibclustering.html%23streamingkmeans
^{6} See http://huaweinoah.github.io/streamDM/
^{7} See https://flink.apache.org/
^{8} See http://moa.cms.waikato.ac.nz/
^{9} See http://weka.wikispaces.com/
^{10} See http://samoaproject.net/
^{11} See http://storm.apache.org
^{12} See http://incubator.apache.org/s4
^{13} See http://samza.incubator.apache.org
Declarations
Acknowledgements
This work has been supported by the French Foundation FSN, PIA Grant Big dataInvestissements d’Avenir. The project is titled “Square Predict” (http://ns209168.ovh.net/squarepredict/). We thank anonymous reviewers for their insightful remarks.
Authors’ contributions
All authors contributed equally. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Aggarwal CC, Watson TJ, Ctr R, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: In VLDB. Berlin: VLDB Endowment: 2003. p. 81–92.Google Scholar
 Cao F, Ester M, Qian W, Zhou A. Densitybased clustering over an evolving data stream with noise. In: SDM. SIAM: 2006. p. 328–39.Google Scholar
 Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C. StreamKM++: A clustering algorithm for data streams. ACM J Exp Algorithmics. 2012; 17(1):173–187.MathSciNetMATHGoogle Scholar
 Zhang X, Furtlehner C, Sebag M. Data streaming with affinity propagation. In: ECML/PKDD (2). Berlin: Springer Berlin Heidelberg: 2008. p. 628–43.Google Scholar
 Demchenko Y, Grosso P, De Laat C, Membrey P. Addressing big data issues in scientific data infrastructure. In: Collaboration Technologies and Systems (CTS), 2013 International Conference On. IEEE: 2013. p. 48–55.Google Scholar
 Bifet A, Holmes G, Pfahringer B, Kranen P, Kremer H, Jansen T, Seidl T. MOA: massive online analysis, a framework for stream classification and clustering. In: Proceedings of the First Workshop on Applications of Pattern Analysis, WAPA 2010, Cumberland Lodge, Windsor, UK September 1–3, 2010: 2010. p. 44–50.Google Scholar
 Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I. Discretized streams: faulttolerant streaming computation at scale. In: ACM SIGOPS 24th Symposium on Operating Systems Principles, SOSP ’13, Farmington, PA, USA, November 3–6, 2013: 2013. p. 423–38.Google Scholar
 Schelter S, Ewen S, Tzoumas K, Markl V. “all roads lead to rome”: optimistic recovery for distributed iterative data processing. In: 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27  November 1, 2013: 2013. p. 1919–28.Google Scholar
 Aggarwal CC. A survey of stream clustering algorithms. In: Data Clustering: Algorithms and Applications. Chapman and Hall/CRC: 2013. p. 231–58.Google Scholar
 Nguyen H, Woon Y, Ng WK. A survey on data stream clustering and classification. Knowl Inf Syst. 2015; 45(3):535–69.View ArticleGoogle Scholar
 Khalilian M, Mustapha N. Data stream clustering: Challenges and issues. CoRR. 2010;abs/1006.5261.Google Scholar
 Yogita, Toshniwal D. Clustering techniques for streaming dataa survey. In: Advance Computing Conference (IACC), 2013 IEEE 3rd International. IEEE: 2013. p. 951–6.Google Scholar
 Mousavi M, Bakar AA, Vakilian M. Data stream clustering algorithms: A review. Int J Adv Soft Comput Appl. 2015; 7(3):13:1–13:31.Google Scholar
 de Andrade Silva J, Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J. Data stream clustering: A survey. ACM Comput Surv. 2013; 46(1):13.MATHGoogle Scholar
 Amini A, Teh YW, Saboohi H. On densitybased data streams clustering algorithms: A survey. J Comput Sci Technol. 2014; 29(1):116–41.View ArticleGoogle Scholar
 Kranen P, Assent I, Baldauf C, Seidl T. The ClusTree: indexing microclusters for anytime stream mining. Knowl Inf Syst. 2011; 29(2):249–72.View ArticleGoogle Scholar
 Fritzke B. A growing neural gas network learns topologies. In: NIPS. MIT Press: 1994. p. 625–32.Google Scholar
 Kohonen T, Schroeder MR, Huang TS, (eds).SelfOrganizing Maps, 3rd edn. Secaucus, NJ, USA: Springer; 2001.MATHGoogle Scholar
 Martinetz T, Schulten K. A “NeuralGas” Network Learns Topologies. Artif Neural Netw. 1991; I:397–402.Google Scholar
 Beyer O, Cimiano P. Online semisupervised growing neural gas. Int J Neural Syst. 2012; 22(5):21–23.View ArticleGoogle Scholar
 Fritzke B. A selforganizing network that can follow nonstationary distributions. In: Artificial Neural Networks  ICANN ’97, 7th International Conference, Lausanne, Switzerland, October 8–10, 1997, Proceedings. Berlin: Springer Berlin Heidelberg: 1997. p. 613–8.Google Scholar
 Sledge IJ, Keller JM. Growing neural gas for temporal clustering. In: 19th International Conference on Pattern Recognition (ICPR 2008), December 8–11, 2008. Tampa. IEEE: 2008. p. 1–4.Google Scholar
 Mitsyn S, Ososkov G. The growing neural gas and clustering of large amounts of data. Opt Mem Neural Netw. 2011; 20(4):260–70.View ArticleGoogle Scholar
 Marsland S, Shapiro J, Nehmzow U. A selforganising network that grows when required. Neural Netw. 2002; 15(8–9):1041–58.View ArticleGoogle Scholar
 Marsland S, Nehmzow U, Shapiro J. Online novelty detection for autonomous mobile robots. Robot Auton Syst. 2005; 51(2):191–206.View ArticleGoogle Scholar
 Mendes CAT, Gattass M, Lopes H. Fgng: A fast multidimensional growing neural gas implementation. Neurocomputing. 2014; 128:328–40.View ArticleGoogle Scholar
 Prudent Y, Ennaji A. An incremental growing neural gas learns topologies. In: Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference On. MIT Press: 2005. p. 1211–1216.Google Scholar
 Hamza H, Belaïd Y, Belaïd A, Chaudhuri BB. Incremental classification of invoice documents. In: 19th International Conference on Pattern Recognition (ICPR 2008), December 8–11, 2008, Tampa, Florida, USA. IEEE Computer Society: 2008. p. 1–4.Google Scholar
 Lamirel JC, Boulila Z, Ghribi M, Cuxac P. A new incremental growing neural gas algorithm based on clusters labeling maximization: application to clustering of heterogeneous textual data. In: Proceedings of the 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems  Volume Part III. Berlin: SpringerVerlag: 2010. p. 139–48.Google Scholar
 GarcíARodríGuez J, Angelopoulou A, GarcíaChamizo JM, Psarrou A, Escolano SO, GiméNez VM. Autonomous growing neural gas for applications with time constraint: optimal parameter estimation. Neural Netw. 2012; 32:196–208.View ArticleGoogle Scholar
 Pimentel MA, Clifton DA, Clifton L, Tarassenko L. A review of novelty detection. Signal Process. 2014; 99:215–49.View ArticleGoogle Scholar
 Bouguelia MR, Belaïd Y, Belaïd A. An adaptive incremental clustering method based on the growing neural gas algorithm. In: ICPRAM. SciTePress: 2013. p. 42–9.Google Scholar
 Ghesmoune M, Azzag H, Lebbah M. Gstream: Growing neural gas over data stream. In: Neural Information Processing  21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3–6, 2014. Proceedings, Part I: 2014. p. 207–14.Google Scholar
 Ghesmoune M, Lebbah M, Azzag H. Clustering over data streams based on growing neural gas. In: Advances in Knowledge Discovery and Data Mining  19th PacificAsia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19–22, 2015, Proceedings, Part II: 2015. p. 134–45.Google Scholar
 Ghesmoune M, Lebbah M, Azzag H. Microbatching growing neural gas for clustering data streams using spark streaming. In: INNS Conference on Big Data 2015, San Francisco, CA, USA, 8–10 August 2015. Elsevier: 2015. p. 158–66.Google Scholar
 Aggarwal CC, Reddy CK. Data Clustering: Algorithms and Applications, 1st: Chapman & Hall/CRC; 2013.Google Scholar
 Zhang T, Ramakrishnan R, Livny M. Birch: An efficient data clustering method for very large databases. In: SIGMOD Conference. New York: ACM: 1996. p. 103–14.Google Scholar
 Udommanetanakit K, Rakthanmanon T, Waiyamai K. Estream: Evolutionbased technique for stream clustering. In: ADMA: 2007. p. 605–15.Google Scholar
 Meesuksabai W, Kangkachit T, Waiyamai K. Huestream: Evolutionbased clustering technique for heterogeneous data streams with uncertainty. In: Advanced Data Mining and Applications  7th International Conference, ADMA 2011, Beijing, China, December 17–19, 2011, Proceedings, Part II: 2011. p. 27–40.Google Scholar
 Aggarwal CC, Yu PS. A framework for clustering uncertain data streams. In: Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7–12, 2008, Cancún, México: 2008. p. 150–9.Google Scholar
 Yang C, Zhou J. Hclustream: A novel approach for clustering evolving heterogeneous data stream. In: Workshops Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18–22 December 2006, Hong Kong, China: 2006. p. 682–8.Google Scholar
 Guttman A. Rtrees: A dynamic index structure for spatial searching. In: SIGMOD’84, Proceedings of Annual Meeting, Boston, Massachusetts, June 18–21, 1984: 1984. p. 47–57.Google Scholar
 Arthur D, Vassilvitskii S. kmeans++: the advantages of careful seeding. In: SODA. Philadelphia: Society for Industrial and Applied Mathematics: 2007. p. 1027–1035.Google Scholar
 Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007; 315:2007.MathSciNetView ArticleMATHGoogle Scholar
 Ailon N, Jaiswal R, Monteleoni C. Streaming kmeans approximation. In: NIPS. USA: Curran Associates Inc.: 2009. p. 10–18.Google Scholar
 Braverman V, Meyerson A, Ostrovsky R, Roytman A, Shindler M, Tagiku B. Streaming Kmeans on Wellclusterable Data. In: Proceedings of the Twentysecond Annual ACMSIAM Symposium on Discrete Algorithms. Philadelphia: Society for Industrial and Applied Mathematics: 2011. p. 26–40.Google Scholar
 Shindler M, Wong A, Meyerson A. Fast and accurate kmeans for large datasets. In: NIPS. USA: Curran Associates Inc.: 2011. p. 2375–383.Google Scholar
 Aggarwal CC, Han J, Wang J, Yu PS. A framework for projected clustering of high dimensional data streams. In: (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31  September 3 2004. VLDB Endowment: 2004. p. 852–63.Google Scholar
 Zhou A, Cao F, Qian W, Jin C. Tracking clusters in evolving data streams over sliding windows. Knowl Inf Syst. 2008; 15(2):181–214.View ArticleGoogle Scholar
 Isaksson C, Dunham MH, Hahsler M. SOStream: Self organizing densitybased clustering over data stream. In: MLDM. Berlin: Springer Berlin Heidelberg: 2012. p. 264–78.Google Scholar
 Wang C, Lai J, Huang D, Zheng W. SVStream: A support vectorbased algorithm for clustering data streams. IEEE Trans Knowl Data Eng. 2013; 25(6):1410–24.View ArticleGoogle Scholar
 BenHur A, Horn D, Siegelmann HT, Vapnik V. Support vector clustering. J Mach Learn Res. 2001; 2:125–37.MATHGoogle Scholar
 Tax DMJ, Duin RPW. Support vector domain description. Pattern Recogn Lett. 1999; 20(11–13):1191–9.View ArticleGoogle Scholar
 Ankerst M, Breunig MM, Kriegel H, Sander J. OPTICS: ordering points to identify the clustering structure. In: SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1–3, 1999, Philadelphia, Pennsylvania, USA. New York: ACM: 1999. p. 49–60.Google Scholar
 Tasoulis DK, Ross GJ, Adams NM. Visualising the cluster structure of data streams. In: Advances in Intelligent Data Analysis VII, 7th International Symposium on Intelligent Data Analysis, IDA 2007, Ljubljana, Slovenia, September 6–8, 2007, Proceedings: 2007. p. 81–92.Google Scholar
 Hassani M, Spaus P, Gaber MM, Seidl T. Densitybased projected clustering of data streams. In: Scalable Uncertainty Management  6th International Conference, SUM 2012, Marburg, Germany, September 17–19, 2012. Proceedings: 2012. p. 311–24.Google Scholar
 Chen Y, Tu L. Densitybased clustering for realtime stream data. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12–15, 2007. New York: ACM: 2007. p. 133–42.Google Scholar
 Wan L, Ng WK, Dang XH, Yu PS, Zhang K. Densitybased clustering of data streams at multiple resolutions. TKDD. 2009; 3(3):14:1–14:28.View ArticleGoogle Scholar
 Liu X, Iftikhar N, Xie X. Survey of realtime processing systems for big data. In: 18th International Database Engineering & Applications Symposium, IDEAS 2014, Porto, Portugal, July 7–9, 2014. New York: ACM: 2014. p. 356–61.Google Scholar
 Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: Cluster computing with working sets. In: Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10. Berkeley, CA, USA: USENIX Association: 2010. p. 10–10.Google Scholar
 Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: A faulttolerant abstraction for inmemory cluster computing. In: Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25–27, 2012. Berkeley: USENIX Association: 2012. p. 15–28.Google Scholar
 Balazinska M, Balakrishnan H, Madden S, Stonebraker M. Faultc distributed stream processing system. ACM Trans Database Syst. 2008; 33(1):3:1–3:44.View ArticleGoogle Scholar
 Hwang J, Balazinska M, Rasin A, Çetintemel U, Stonebraker M, Zdonik SB. Highavailability algorithms for distributed stream processing. In: Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, 5–8 April 2005, Tokyo, Japan. Washington: IEEE Computer Society: 2005. p. 779–90.Google Scholar
 Meng X, Bradley JK, Yavuz B, Sparks ER, Venkataraman S, Liu D, Freeman J, Tsai DB, Amde M, Owen S, Xin D, Xin R, Franklin MJ, Zadeh R, Zaharia M, Talwalkar A. Mllib: Machine learning in apache spark. CoRR. 2015;abs/1505.06807.Google Scholar
 Morales GDF, Bifet A. SAMOA: scalable advanced massive online analysis. J Mach Learn Res. 2015; 16:149–53.Google Scholar
 Krempl G, žliobaite I, Brzeziński D, Hüllermeier E, Last M, Lemaire V, Noack T, Shaker A, Sievi S, Spiliopoulou M, et al. Open challenges for data stream mining research. ACM SIGKDD explorations newsletter. 2014; 16(1):1–10.View ArticleGoogle Scholar
 Gama J. A survey on learning from data streams: current and future trends. Prog AI. 2012; 1(1):45–55.Google Scholar
 Gama J. Knowledge Discovery from Data Streams, 1st: Chapman & Hall/CRC; 2010.Google Scholar
 Cormode G, Garofalakis MN, Haas PJ, Jermaine C. Synopses for massive data: Samples, histograms, wavelets, sketches. Found Trends Databases. 2012; 4(1–3):1–294.MATHGoogle Scholar
 Marz N, Warren J. Big Data: Principles and Best Practices of Scalable Realtime Data Systems: Manning Publications Co.; 2015.Google Scholar