 Review
 Open access
State-of-the-art on clustering data streams
Big Data Analytics volume 1, Article number: 13 (2016)
Abstract
Clustering is a key data mining task. This is the problem of partitioning a set of observations into clusters such that the intra-cluster observations are similar and the inter-cluster observations are dissimilar. The traditional setup, where a static dataset is available in its entirety for random access, is not applicable: we do not have the entire dataset when learning starts, the data continue to arrive at a rapid rate, we cannot access the data randomly, and we can make only one or at most a small number of passes over the data in order to generate the clustering results. These types of data are referred to as data streams. The data stream clustering problem requires a process capable of partitioning observations continuously while taking into account restrictions of memory and time. In the literature on data stream clustering methods, a large number of algorithms use a two-phase scheme, which consists of an online component that processes data stream points and produces summary statistics, and an offline component that uses the summary data to generate the clusters. An alternative class is capable of generating the final clusters without the need for an offline phase. This paper presents a comprehensive survey of data stream clustering methods and an overview of the most well-known streaming platforms which implement clustering.
Background
In today’s applications, evolving data streams are ubiquitous. Indeed, examples of applications relevant to streaming data are becoming more numerous and more important, including network intrusion detection, transaction streams, phone records, web click-streams, social streams, weather monitoring, etc. There is active research on how to store, query, analyze, extract and predict relevant information from data streams. Clustering is a key data mining task. This is the problem of partitioning a set of observations into clusters such that the intra-cluster observations are similar (or close) and the inter-cluster observations are dissimilar (or distant). The other objective of clustering is to reduce the complexity of the data by replacing a group of observations (cluster) with a representative observation (prototype).
In this paper, we consider the problem of clustering data in the form of a stream, i.e. a potentially infinite sequence of non-stationary data (the probability distribution of the unknown data generation process may change over time) arriving continuously (which requires a single pass through the data), where random access to the data is not feasible and storing all the arriving data is impractical. When applying data mining techniques, and specifically clustering algorithms, to data streams, restrictions on execution time and memory have to be considered carefully. To deal with these restrictions, many of the existing data stream clustering algorithms adapt traditional non-streaming methods to the two-phase framework proposed in [1]: for example, DenStream [2] is an extension of the DBSCAN algorithm, StreamKM++ [3] of k-means++, StrAP [4] of AP, etc.
Real-time processing means that the ongoing data processing requires a very low response delay. Velocity, i.e. the high speed at which Big Data are generated (speed of data in and out), is an important concept in the Big Data domain [5]. With the increasing importance of data stream mining applications, many streaming platforms have been proposed. These can be classified in two categories: traditional or non-distributed platforms such as MOA [6], and distributed streaming platforms such as Spark Streaming [7] and Flink [8]. The latter two may be considered the most widely used streaming platforms. These distributed streaming systems are based on two processing models: record-at-a-time and micro-batching. In the record-at-a-time processing model, long-running stateful operators process records as they arrive, update their internal state, and send out new records. The micro-batching processing model, on the other hand, runs each streaming computation as a series of deterministic batch computations on small time intervals; this is the model implemented in Spark Streaming [7].
General surveys have recently been published in the literature on mining data streams [9–13]. The authors of [14] introduced a taxonomy to classify data stream clustering algorithms. The work presented in [15] is a thorough survey of state-of-the-art density-based clustering algorithms over data streams. This paper presents a thorough survey of the state-of-the-art for a wide range of data stream clustering algorithms and an overview of the most well-known streaming platforms. The remainder of this paper is organized as follows. Section “Data stream clustering methods” presents in a comprehensive manner the most relevant works on data stream clustering algorithms. These algorithms are categorized according to the nature of their underlying clustering approach. Section “Streaming platforms” overviews the most well-known streaming platforms with a focus on the streaming clustering task. Section “Conclusion” concludes this paper.
Data stream clustering methods
This section discusses previous works on data stream clustering problems, and highlights the most relevant algorithms proposed in the literature to deal with this problem. Most of the existing algorithms (e.g. CluStream [1], DenStream [2], StreamKM++ [3], or ClusTree [16]) divide the clustering process into two phases: (a) online, in which the data are summarized; (b) offline, in which the final clusters are generated. Figure 1 is a flowchart of the data stream clustering algorithms presented in this paper. These algorithms are categorized according to the nature of their underlying clustering approach.
GNG based algorithms
Growing Neural Gas
Growing Neural Gas (GNG) [17] is an incremental self-organizing approach which belongs to the family of topological maps such as Self-Organizing Maps (SOM) [18] or Neural Gas (NG) [19]. It is an unsupervised clustering algorithm capable of representing a high-dimensional input space in a low-dimensional feature map. Typically, it is used for finding topological structures that closely reflect the structure of the input distribution. Therefore, it is used for visualization tasks in a number of domains [19, 20], as neurons (nodes), which represent prototypes, are easy to understand and interpret.
The GNG algorithm constructs a graph of nodes in which each node has its associated prototype. Prototypes can be regarded as positions in the input space of their corresponding nodes. Pairs of nodes are connected by edges (links), which are not weighted. The purpose of these links is to define the topological structure. These links are temporal in the sense that they are subject to aging during the iteration steps of the algorithm and are removed when they become “too old” [20].
The algorithm starts with two nodes. As each new data point becomes available, the nearest and the second-nearest nodes are identified and linked by an edge, and the nearest node and its topological neighbors are moved toward the data point. Each node has an accumulated error variable. Periodically, a node is inserted into the graph between the nodes with the largest error values. Nodes can also be removed if they are identified as being superfluous. This is an advantage compared to SOM and NG, as there is no need to fix the graph size in advance. Algorithm 1 outlines an online version of the GNG approach. In this version, unlike the standard approach of GNG, the data is seen only once.
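The steps described above can be sketched in a few lines of Python. This is a minimal illustration, not Algorithm 1 itself: the parameter names (`eps_b`, `eps_n`, `max_age`, `lam`, `alpha`, `beta`) and their default values are assumptions chosen for readability, and nodes are treated as superfluous simply when they lose all their edges.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class OnlineGNG:
    """Single-pass GNG sketch; all parameter names and defaults are illustrative."""

    def __init__(self, w0, w1, eps_b=0.2, eps_n=0.006,
                 max_age=50, lam=20, alpha=0.5, beta=0.995):
        self.nodes = {0: list(w0), 1: list(w1)}  # node id -> prototype
        self.error = {0: 0.0, 1: 0.0}            # node id -> accumulated error
        self.edges = {}                          # frozenset({i, j}) -> age
        self.next_id, self.seen = 2, 0
        self.eps_b, self.eps_n = eps_b, eps_n
        self.max_age, self.lam = max_age, lam
        self.alpha, self.beta = alpha, beta

    def neighbors(self, i):
        return [j for e in self.edges if i in e for j in e if j != i]

    def _move(self, i, x, eps):
        self.nodes[i] = [w + eps * (v - w) for w, v in zip(self.nodes[i], x)]

    def update(self, x):
        """Process one data point; each point is seen exactly once."""
        self.seen += 1
        s1, s2 = sorted(self.nodes, key=lambda i: dist2(self.nodes[i], x))[:2]
        for e in self.edges:                     # age the winner's edges
            if s1 in e:
                self.edges[e] += 1
        self.error[s1] += dist2(self.nodes[s1], x)
        self._move(s1, x, self.eps_b)            # move winner toward x
        for j in self.neighbors(s1):             # and its topological neighbors
            self._move(j, x, self.eps_n)
        self.edges[frozenset((s1, s2))] = 0      # create or refresh the edge
        self.edges = {e: a for e, a in self.edges.items() if a <= self.max_age}
        linked = set().union(*self.edges)
        for i in [i for i in self.nodes if i not in linked]:
            del self.nodes[i], self.error[i]     # drop isolated nodes
        if self.seen % self.lam == 0:            # periodic insertion
            self._insert()
        for i in self.error:                     # decay all errors
            self.error[i] *= self.beta

    def _insert(self):
        q = max(self.error, key=self.error.get)          # largest error
        f = max(self.neighbors(q), key=self.error.get)   # worst neighbor of q
        r, self.next_id = self.next_id, self.next_id + 1
        self.nodes[r] = [(a + b) / 2 for a, b in zip(self.nodes[q], self.nodes[f])]
        del self.edges[frozenset((q, f))]
        self.edges[frozenset((q, r))] = 0
        self.edges[frozenset((r, f))] = 0
        self.error[q] *= self.alpha
        self.error[f] *= self.alpha
        self.error[r] = self.error[q]

# Usage: a synthetic stream drawn from two Gaussian blobs.
rng = random.Random(0)
g = OnlineGNG([0.0, 0.0], [1.0, 1.0], lam=20)
for _ in range(200):
    cx = rng.choice([0.0, 5.0])
    g.update([cx + rng.gauss(0, 0.3), rng.gauss(0, 0.3)])
```

Because a node is inserted only every `lam` points, the graph grows at a fixed rate regardless of how quickly the distribution changes, which is the limitation the variants discussed next try to relax.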
A number of authors have proposed variations on the Growing Neural Gas (GNG) approach. The GNG algorithm creates a new node every λ iterations (λ is fixed by the user as an input parameter). Hence, it is not suited to data streams, non-stationary datasets, or novelty detection. In order to deal with non-stationary datasets, the author of [21] investigated modifying the network by proposing an online criterion for identifying “useless” nodes. The proposed algorithm is known as Growing Neural Gas with Utility (GNG-U). Slow changes of the distribution are handled by adaptation of existing nodes, whereas rapid changes are handled by removal of “useless” neurons and subsequent insertion of new nodes in other places.
The authors of [22] modified GNG to detect incrementally emerging cluster structures. The proposed GNGC algorithm is able to match the temporal distribution of the original dataset by creating a new node whenever the received new data point is too far from its nearest node. It is noted that the algorithm is computationally demanding.
The clustering method proposed in [23] consists of two steps. In the first step, the data are prepared by generating the Voronoi partition using a modified GNG algorithm (which does not exceed linear complexity). As a result, the number of intermediate clusters is much smaller than the number of original objects. In the second step, the intermediate clusters are clustered using conventional algorithms that have a much higher computational complexity (for this reason they should not be used for clustering the full volume of initial data). The approach examines hierarchical clustering of GNG units using single linkage and Ward’s method as linkage criteria. Although the clustering results look promising, the approach has the drawback that the “right” level in the cluster hierarchy has to be identified manually to obtain an adequate clustering of the input space.
GWR
The ‘Grow When Required’ (GWR) network [24] may add a new node at any time, whose position is dependent on the input and the current winning node. GWR deals with the problem of novelty detection by adding new nodes into the network structure whenever the activity of the current best-matching node is below some threshold, which implies that the best-matching node is not trained to deal with that particular input. This means that the network grows very quickly when new data is presented, but stops growing once the network has matched the data to a given accuracy. The benefit is that there is no need to decide in advance how large the network should be, as nodes will be added until the network is saturated. This means that for small datasets the complexity of the network is significantly reduced. In addition, if the dataset changes at some time in the future, further nodes can be added to represent the new data without disturbing the network that has already been created [24, 25]. One iteration of the GWR algorithm has approximately the same time complexity as one iteration of GNG. Hence, the complexity of GWR is O(k n m), where k is the number of iterations, n is the number of data points of the data stream, and m is the number of nodes in the graph. For more details on the complexity of GNG, the reader is referred to [26].
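The growth criterion can be illustrated with a short sketch. It assumes that the activity of a node is exp(-d), with d the Euclidean distance between the input and the node's prototype, and that a new node is placed halfway between the winner and the input; both choices follow common descriptions of GWR but are assumptions here, and the function name, the threshold value and the learning rate `lr` are hypothetical. Edge handling and the firing counters of the full algorithm are omitted.

```python
import math

def gwr_step(nodes, x, activity_threshold=0.8, lr=0.1):
    """One simplified GWR step on a list of prototypes (lists of floats).

    Returns True if a node was added, False if the winner was adapted.
    Assumption: activity = exp(-distance to the best-matching node).
    """
    best = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], x))
    activity = math.exp(-math.dist(nodes[best], x))
    if activity < activity_threshold:
        # The winner matches x poorly: grow the network between winner and input.
        nodes.append([(w + v) / 2 for w, v in zip(nodes[best], x)])
        return True
    # The winner matches x well enough: only adapt it toward the input.
    nodes[best] = [w + lr * (v - w) for w, v in zip(nodes[best], x)]
    return False
```

With this rule the network stops growing as soon as every incoming point finds a node whose activity exceeds the threshold, which is the saturation behaviour described above.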
IGNG
The IGNG [27] algorithm follows the same idea of relaxing the constraint of periodic evolution of the network. In this algorithm, a new neuron is created each time the distance from the current input data to the nearest existing neuron is greater than a predefined fixed threshold σ, which depends on the global dataset. One disadvantage of this algorithm is the global character of the parameter σ, which moreover must be computed prior to learning. To address this weakness, I2GNG [28] associates a threshold variable σ with each neuron. Its major drawback, however, is the initialization of the σ values at the creation of each node. The authors of [29] address the problem of choosing the final winner neuron among several neurons that are equidistant from the input. They proposed some adaptations of the IGNG and I2GNG algorithms, notably the use of a labeling maximization approach as a clustering similarity measure (IGNG-F) to replace the distance in the winner selection process.
The ability of self-organizing neural network models to manage real-time applications, using a modified learning algorithm for a growing neural gas network, is addressed in [30]. The proposed modification aims to satisfy real-time temporal constraints in the adaptation of the network. The proposed learning algorithm can add a dynamic number of neurons per iteration. A detailed study was conducted to estimate the optimal parameters that keep a good quality of representation in the available time. The authors concluded that the use of a large number of neurons made it difficult to obtain a representation of the distribution of training data with good accuracy in real time [30, 31].
AING [32] is an incremental GNG that automatically learns the distance threshold of each node based on its neighbors and on the data points assigned to it. It merges nodes when their number reaches a given upper bound.
G-Stream
More recently, G-Stream [33, 34] was proposed as a data stream clustering approach based on the Growing Neural Gas algorithm. G-Stream uses a stochastic approach to update the prototypes, and it was implemented on a “centralized” platform. It can be summarized as follows: starting with two nodes, as each new data point arrives, the nearest and the second-nearest nodes are identified and linked by an edge, and the nearest node and its topological neighbors are moved toward the data point. Each node has an accumulated error variable and a weight, which varies over time using a fading function. Using an edge management procedure, one, two or three nodes are inserted into the graph between the nodes with the largest error values. Nodes can also be removed if they are identified as being superfluous.
G-Stream can discover clusters of arbitrary shape in an evolving data stream. Its main features and advantages are:

(i) the topological structure is represented by a graph wherein each node represents a cluster, which is a set of “close” data points, and neighboring nodes (clusters) are connected by edges. The graph size is not fixed but may evolve;

(ii) to reduce the impact of old data whose relevance diminishes over time, G-Stream uses an exponential fading function
$$f(t) = 2^{-\lambda_{1} (t - t_{0})}$$
where \(\lambda_{1} > 0\) defines the rate of decay of the weight over time, t denotes the current time and \(t_{0}\) is the timestamp of the data point. The weight of a node is based on the data points associated with it:
$$weight(c) = \sum_{i=1}^{m} 2^{-\lambda_{1} (t - t_{i_{0}})}$$
where m is the number of points assigned to the node c at the current time t. If the weight of a node is less than a threshold value, then this node is considered outdated and is deleted (with its links). For the same reason, links between nodes are also weighted by an exponential function;

(iii) unlike many other data stream algorithms that start by taking a significant number of data points for initializing the model (these data points can be seen several times), G-Stream starts with only two nodes. Several nodes (clusters) are created in each iteration, unlike the traditional Growing Neural Gas (GNG) [17] algorithm;

(iv) all aspects of G-Stream (including creation, deletion and fading of nodes, edge management, and reservoir management) are performed online;

(v) a reservoir is used to hold, temporarily, the data points that are very distant from the current prototypes.
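As a small worked illustration of the fading function and the node weight in item (ii), the sketch below evaluates both directly from the point timestamps. The function names, the dictionary layout and the threshold value are hypothetical; a real implementation would update the weights incrementally rather than store every timestamp.

```python
def fading(t, t0, lam=0.1):
    """Exponential fading f(t) = 2^(-lam * (t - t0)) for a point stamped t0."""
    return 2.0 ** (-lam * (t - t0))

def node_weight(timestamps, t, lam=0.1):
    """Weight of a node: sum of the faded contributions of its points."""
    return sum(fading(t, ti, lam) for ti in timestamps)

def prune_outdated(nodes, t, lam=0.1, threshold=0.6):
    """Keep only nodes whose weight is at least `threshold` (hypothetical value).

    `nodes` maps a node id to the timestamps of the points assigned to it.
    """
    return {c: ts for c, ts in nodes.items()
            if node_weight(ts, t, lam) >= threshold}
```

With `lam = 0.1`, a point contributes 1 at its arrival time and 0.5 ten time units later, so a node that stops receiving points loses half of its weight every `1 / lam` time units.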
However, the design of a “distributed” version of G-Stream raises difficulties, which are resolved by MBG-Stream [35]. The latter operates with parameters to control the decay (or “forgetfulness”) of the estimates. The MBG-Stream algorithm is implemented on a distributed streaming platform based on the micro-batching processing model, i.e., the Spark Streaming API^{1}. In the proposed algorithm, the topological structure is represented by a graph wherein each node represents a cluster, which is a set of “close” data points, and neighboring nodes (clusters) are connected by edges. Starting with only two nodes, the graph size is not fixed but may evolve, as several nodes (clusters) are created in each iteration. An exponential fading function is used to reduce the impact of old data whose relevance diminishes over time. For the same reason, links between nodes are also weighted by an exponential function. The data received in each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval is completed, this dataset is processed via deterministic parallel operations, such as Map and Reduce, to produce new datasets representing either program outputs or intermediate states [7]. The input data is split and the master assigns the splits to the Map workers. Each worker processes the corresponding input split, generates key/value pairs and writes them to intermediate files (on disk or in memory). The Reduce function is responsible for aggregating information received from the Map functions. The algorithm uses a generalization of the mini-batch GNG update rule, where the nearest node and all of its neighbors are moved in the direction of the data point. However, in MBG-Stream, for each batch of data \(X_{p}\), all points \(x_{i}\) are assigned to their best-matching unit, new cluster centers are computed, and each cluster is then updated.
The update rule (the adaptation step) in a mini-batch version, without taking into account the neighbors of the referent, is described in Eq. 1 as:
whereas Eq. 2 updates the number of points assigned to the cluster,
where \(\mathbf {w}_{c}^{(t)}\) is the previous center for the cluster, \(n_{c}^{(t)}\) is the number of points assigned to the cluster thus far, \(\mathbf {z}_{c}^{(t)}\) is the new cluster center from the current batch, and \(m_{c}^{(t)}\) is the number of points added to the cluster c in the current batch.
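A minimal sketch of an update consistent with these definitions is given below. It assumes that the new center is the count-weighted mean of the previous center and the batch center, \(\mathbf{w}_{c}^{(t+1)} = (\mathbf{w}_{c}^{(t)} n_{c}^{(t)} \alpha + \mathbf{z}_{c}^{(t)} m_{c}^{(t)}) / (n_{c}^{(t)} \alpha + m_{c}^{(t)})\), and that the count becomes \(n_{c}^{(t+1)} = n_{c}^{(t)} + m_{c}^{(t)}\). The decay factor α (argument `decay`, with 1.0 meaning no forgetting) and the function name are assumptions introduced for illustration.

```python
def minibatch_update(w, n, batch_points, decay=1.0):
    """Update one cluster center from the points of the current batch.

    w            -- previous center (list of floats)
    n            -- number of points assigned to the cluster so far
    batch_points -- points of the current batch assigned to this cluster
    decay        -- assumed forgetting factor in [0, 1]; 1.0 keeps all history
    Returns the new center and the new count.
    """
    m = len(batch_points)
    if m == 0:
        return list(w), n              # nothing assigned in this batch
    dim = len(w)
    # z: center of the points added to the cluster in the current batch
    z = [sum(p[d] for p in batch_points) / m for d in range(dim)]
    denom = n * decay + m
    new_w = [(w[d] * n * decay + z[d] * m) / denom for d in range(dim)]
    return new_w, n + m
```

For example, a cluster at the origin that has seen two points and receives two new points at (3, 3) moves to (1.5, 1.5); with `decay=0.0` it would jump directly to the batch center.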
Hierarchical stream methods
A hierarchical clustering method groups the given data into a tree of clusters, which is useful for data summarization and visualization. This tree is a binary-tree-based data structure called the dendrogram. Once the dendrogram is constructed, one can automatically choose the right number of clusters by splitting the tree at different levels to obtain different clustering solutions for the same dataset without rerunning the clustering algorithm. Hierarchical clustering can be achieved in two different ways, namely, bottom-up and top-down clustering. Though both of these approaches utilize the concept of a dendrogram while clustering the data, they might yield entirely different sets of results depending on the criterion used during the clustering process [36]. In hierarchical clustering, once a step (merge or split) is done, it can never be undone. Methods for improving the quality of hierarchical clustering have been proposed, such as integrating hierarchical clustering with other clustering techniques, resulting in multiple-phase clustering such as BIRCH [37].
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) incrementally and dynamically clusters multi-dimensional data points to try to produce the best quality clustering with the available resources (i.e., memory and time constraints) by making a single scan of the data, and to improve the quality further with a few additional scans. It should be noted that the BIRCH method is not designed for clustering data streams and cannot address the concept drift problem. The key characteristic of BIRCH is the introduction of a new data structure called a clustering feature (CF), as well as a CF-tree. The CF can be regarded as a concise summary of each cluster. This is motivated by the fact that not every data point is equally important for clustering and that we cannot afford to keep every data point in main memory, given that the overall memory is limited. On the other hand, for the purpose of clustering, it is often enough to keep the moments of the data up to the second order. In other words, the CF is not only efficient, but also sufficient to cluster the entire data set [36, 37].
More precisely, a CF structure is a triple (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the N data points, and SS is the squared sum of the N data points. The CF vector has two main properties which, in an intuitive way, give an incremental aspect to any algorithm that uses this structure:

Incrementality: if a point x is added to the cluster, the sufficient statistics are updated as follows:
$$ \begin{aligned} &N_{i} \leftarrow N_{i} + 1;\\[3pt] &LS_{i} \leftarrow LS_{i} + x;\\[3pt] &SS_{i} \leftarrow SS_{i} + x^{2}; \end{aligned} $$
Additivity: if \(CF_{1} = (N_{1}, LS_{1}, SS_{1})\) and \(CF_{2} = (N_{2}, LS_{2}, SS_{2})\) are the CF vectors of two disjoint clusters, the CF vector of their union is the sum of the two. The additive property allows us to merge sub-clusters incrementally without accessing the original data set.
$$ CF_{1} + CF_{2}= (N_{1}+N_{2}, LS_{1}+LS_{2}, SS_{1}+SS_{2}). $$
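The two properties translate directly into code. The sketch below is illustrative: the class and method names are hypothetical, and SS is stored as a single scalar (the sum of squared norms), which is one common convention; per-dimension variants exist. The centroid and radius are included only to show that both can be derived from (N, LS, SS) without the original points.

```python
import math
from dataclasses import dataclass

@dataclass
class CF:
    """Clustering feature (N, LS, SS); SS is kept as a scalar here (assumption)."""
    n: int
    ls: list
    ss: float

    @classmethod
    def from_point(cls, x):
        return cls(1, list(x), sum(v * v for v in x))

    def add_point(self, x):
        """Incrementality: absorb one point by updating the three statistics."""
        self.n += 1
        self.ls = [a + v for a, v in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def __add__(self, other):
        """Additivity: the CF of the union of two disjoint clusters."""
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root-mean-square distance of the points to the centroid."""
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))
```

For instance, absorbing (0, 0) and (2, 0) gives the centroid (1, 0) and radius 1, and adding the CF of a third point (4, 0) yields (3, [6, 0], 20) without revisiting the first two points.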
Figure 2 presents the CF-tree structure in BIRCH. The CF-tree is a height-balanced tree which keeps track of the hierarchical clustering structure for the entire data set.
BIRCH requires two user-defined parameters: B, the branching factor, i.e. the maximum number of entries in each non-leaf node; and T, the maximum diameter (or radius) of any CF in a leaf node. The maximum diameter T determines which examples can be absorbed by a CF: as T increases, more examples can be absorbed by a CF node and smaller CF-trees are generated. Each node in the CF-tree represents a cluster which is in turn made up of at most B sub-clusters. All the leaf nodes are chained together for the purpose of efficient scanning. When a data point arrives, it traverses down the current tree from the root until it finds the appropriate leaf, following the closest-CF path with respect to the \(L_{1}\) or \(L_{2}\) norm. Insertion into the CF-tree is performed in a similar way to insertion into the classic B-tree. If the closest CF in the leaf cannot absorb the data point, a new CF entry is created. If there is no room for a new leaf, the parent node is split. A leaf node might also have to be expanded because of the constraints imposed by B and T; this process takes the two farthest CFs and creates two new leaf nodes. BIRCH operates in two main steps: the first step builds an initial CF-tree in memory using the given amount of memory and recycling space on disk; the second step, also called the “global clustering”, clusters all the sub-clusters in the leaf nodes. There are two optional steps: the “tree condensing” step, which aims to refine the initial CF-tree by reinserting its leaf entries; and the “clustering refinement” step, which reassigns all the data points based on the cluster centroids produced by the global clustering step.
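The absorb-or-create decision at a leaf can be sketched as follows. This is a simplification under stated assumptions: a leaf is a plain list of (N, LS, SS) tuples with scalar SS, the closest entry is chosen by Euclidean distance to the centroid, T is interpreted as a bound on the radius, and node splitting driven by B is left out. All function names are hypothetical.

```python
import math

def cf_add(cf, x):
    """Return the CF obtained by adding point x to cf = (n, ls, ss)."""
    n, ls, ss = cf
    return (n + 1, [a + v for a, v in zip(ls, x)], ss + sum(v * v for v in x))

def cf_radius(cf):
    """Root-mean-square distance of the summarized points to their centroid."""
    n, ls, ss = cf
    c2 = sum((v / n) ** 2 for v in ls)
    return math.sqrt(max(ss / n - c2, 0.0))

def insert_into_leaf(leaf, x, T):
    """Absorb x into the closest CF if its radius stays within T, else add an entry.

    Returns the index of the entry that now contains x.
    """
    if leaf:
        closest = min(
            range(len(leaf)),
            key=lambda j: math.dist([v / leaf[j][0] for v in leaf[j][1]], x),
        )
        candidate = cf_add(leaf[closest], x)
        if cf_radius(candidate) <= T:
            leaf[closest] = candidate      # the closest CF absorbs the point
            return closest
    leaf.append((1, list(x), sum(v * v for v in x)))  # start a new CF entry
    return len(leaf) - 1
```

A larger T lets each entry absorb more points, so fewer entries are created, which matches the remark above that increasing T produces smaller CF-trees.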
E-Stream
E-Stream [38] classifies the evolution of data into five categories: appearance, disappearance, self-evolution, merge, and split. This algorithm is an evolution-based stream clustering method, i.e., a stream clustering method that supports the monitoring and the change detection of clustering structures. It uses another data structure for saving summary statistics, named the α-bin histogram. Each cluster is represented as a Fading Cluster Structure (FCS) utilizing an α-bin histogram for each feature of the dataset. A histogram of the cluster data values is utilized to identify cluster splits. The range of each bin is calculated as the difference between the maximum and minimum feature values divided by α. When the maximum or minimum value changes, a new range is calculated and the values in each range are updated from the intersection between the new and old ranges. Each cluster has a histogram of feature values, but the histogram is utilized only for the split of active clusters. Only an active cluster can assemble an incoming data point. If a statistically significant valley is found between two peaks in any of the marginal histograms, the cluster is split. Figure 3 illustrates the histogram management in a split.
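The histogram-based split test can be illustrated with a small sketch. The bin width follows the definition above, (max − min) / α. The valley criterion, however, is only a stand-in: the text does not specify the statistical test, so the sketch assumes that a bin is a valley when its count is below a fraction `ratio` of the smaller of the highest counts on either side. Function names and the `ratio` default are hypothetical.

```python
def alpha_bin_histogram(values, alpha):
    """Build an alpha-bin histogram of one feature; bin width = (max - min) / alpha."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / alpha or 1.0       # avoid a zero width when all values match
    counts = [0] * alpha
    for v in values:
        idx = min(int((v - lo) / width), alpha - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts, lo, width

def find_split_bin(counts, ratio=0.5):
    """Return the index of the deepest valley between two peaks, or None.

    Assumed criterion: counts[v] < ratio * min(highest count on the left,
    highest count on the right). This replaces the unspecified significance test.
    """
    best = None
    for v in range(1, len(counts) - 1):
        left, right = max(counts[:v]), max(counts[v + 1:])
        if counts[v] < ratio * min(left, right):
            if best is None or counts[v] < counts[best]:
                best = v
    return best
```

With counts `[5, 6, 1, 7, 4]` the third bin separates two peaks and is reported as the split position, whereas a unimodal histogram such as `[3, 4, 5, 4, 3]` yields no split.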
E-Stream starts empty, and every new point is either mapped onto one of the existing clusters (based on a radius threshold) or used as the seed of a new cluster. Any cluster not meeting a predefined density level is considered inactive and remains isolated until it reaches a desired weight. The weight of a cluster is the number of data elements assigned to this cluster. The algorithm employs an exponential decay function to weigh down the influence of older data, and thus focuses on keeping an up-to-date view of the data distribution. Clusters which are not active for a certain time period may be deleted from the data space.
HUE-Stream
HUE-Stream [39] extends E-Stream in order to support uncertainty in heterogeneous data, i.e., data including numerical and categorical attributes simultaneously. Uncertain data streams pose a special challenge because of the dual complexity of high volume and data uncertainty. This uncertainty is due to errors in the reading of sensors or other hardware collection technology. In many of these cases, the data errors can be approximated either in terms of statistical parameters, such as the standard deviation, or in terms of probability density functions [9]. The Uncertain MICROclustering (UMicro) algorithm is proposed as a method for clustering uncertain data streams, which enhances the micro-clusters with additional information about the uncertainty of the data points in the clusters [40]. This information is used to improve the quality of the distance functions for the cluster assignments. HCluStream [41] extends the definition of the cluster feature vector to include categorical features, and replaces the modified k-means clustering with the corresponding k-prototypes clustering, which is able to handle heterogeneous attributes. The centroid of the continuous attributes and the histogram of the discrete attributes are used to represent the micro-clusters, and the k-prototypes clustering algorithm is used to create the micro-clusters and macro-clusters.
In HUE-Stream, the distance function, cluster representation and histogram management are introduced for the different types of clustering structure evolution. A distance function between the probability distributions of two objects is introduced to support uncertainty in categorical attributes. To detect changes in the clustering structure, the proposed distance function is used to merge clusters and to find the closest cluster for a given incoming data point, and the proposed histogram management is used to split clusters for categorical data. To decrease the weight of old data over time, a fading function is used. Experimental results show that HUE-Stream gives better cluster quality, in terms of purity and the F-measure, compared to UMicro on the KDD-CUP’99 dataset [39].
ClusTree
ClusTree [16] is a parameter-free stream clustering algorithm that is capable of processing the stream in a single pass, with limited memory usage. It always maintains an up-to-date cluster model and reports concept drift, novelty, and outliers. This is ensured by weighing data points with an exponential time-dependent decay function. Moreover, this approach makes no a priori assumptions on the size of the clustering model, but dynamically self-adapts. ClusTree is an anytime algorithm that organizes micro-clusters in a tree structure for faster access and automatically adapts micro-cluster sizes based on the variance of the assigned data points. The tree used in ClusTree is a balanced multi-dimensional indexing structure with the following properties:

an inner node contains between m and M entries. Leaf nodes contain between l and L entries. The root has at least one entry (m, M, l and L are input parameters).

an entry in an inner node stores: (i) a cluster feature of the objects that it represents. (ii) a cluster feature of the objects in the buffer. (iii) a pointer to its child node.

an entry in a leaf stores a cluster feature of the data point(s) it represents.

a path from the root to any leaf node has always the same length (balanced).
ClusTree thus also uses the micro-cluster structure as a compact representation of the data distribution. Anytime algorithms denote approaches that are capable of delivering a result at any given point in time, and of using more time, if it is available, to refine the result. The basic idea is to maintain measures for incremental computation of the mean and variance of micro-clusters, so that access to all past stream objects, which is infeasible, is no longer necessary. We recall that a micro-cluster is a cluster feature tuple (or a variant of it) CF = (n, LS, SS) of the number n of represented data points, their linear sum LS, and their squared sum SS. In the proposed method, CFs are created and updated by extending index structures from the R-tree family [42]. Such hierarchical indexing structures provide the means for efficiently locating the right place to insert any object from the stream into a micro-cluster. The idea is to build a hierarchy of micro-clusters at different levels of granularity. Given enough time, the algorithm descends the hierarchy in the index to reach the leaf entry that contains the micro-cluster that is most similar to the current object. If this micro-cluster is similar enough, it is updated incrementally by this object’s values. Otherwise, a new micro-cluster may be formed [16]. However, in anytime clustering of streaming data, there might not always be enough time to reach the leaf level to insert the object. To deal with this, the authors provide some strategies for anytime inserts. By incorporating local aggregates, i.e., temporary buffers for “hitchhikers”, a solution is provided for the easy interruption of the insertion process so that it can be simply resumed at any later point in time. For very fast streams, aggregates of similar objects allow insertion of groups instead of single objects for even faster processing. For slower stream settings, alternative insertion strategies are proposed that exploit possible idle times of the algorithm to improve the quality of the resulting clustering [16].
Taking the means of the CFs as representatives, we can apply a k-center clustering or density-based clustering (e.g. k-means or DBSCAN) to detect clusters of arbitrary shape.
Partitioning stream methods
A partitioning-based clustering algorithm organizes the objects into some number of partitions, where each partition represents a cluster. The clusters are formed based on a distance function, as in the k-means algorithm, which leads to finding only spherical clusters, and the clustering results are usually influenced by noise.
CluStream
The idea behind the CluStream [1] method is to divide the clustering process into an online component, which periodically stores detailed summary statistics, and an offline component, which uses only these summary statistics. The offline component is utilized by the analyst, who can use a wide variety of inputs (such as the time horizon or the number of clusters) in order to provide a quick understanding of the broad clusters in the data stream. The summary information is defined by the following structures:

Micro-clusters: statistical information about the data locality is maintained in terms of micro-clusters. The micro-cluster structure is a temporal extension of the cluster feature vector [37]. The additivity property of the micro-clusters makes them a natural choice for the data stream problem. More precisely, a micro-cluster is a tuple (N, LS, SS, LST, SST), where (N, LS, SS) are the three components of the CF vector (namely, the number of data points in the cluster, N; the linear sum of the N data points, LS; and the squared sum of the N data points, SS). The two other components are LST and SST (the sum and the sum of the squares of the time stamps of the N data points).

Pyramidal time frame: the micro-clusters are stored at time snapshots which follow a pyramidal pattern. This pattern provides an effective trade-off between the storage requirements and the ability to recall summary statistics from different time horizons.
The data stream clustering algorithm proposed in [1] can generate approximate clusters over any user-specified length of history from the current moment. The online phase maintains q microclusters, where q is an input parameter. Each microcluster has a maximum boundary, which is computed as the standard deviation of the mean distance of the data points to their centroids multiplied by a factor f. Each new point is assigned to its closest microcluster (according to the Euclidean distance) if the distance between the new point and the closest centroid falls within the maximum boundary. If so, the point is absorbed by the cluster and its summary statistics are updated. If none of the microclusters can absorb the point, a new microcluster is created. To keep the number of microclusters at q, room is made either by deleting the oldest microcluster or by merging two microclusters. The oldest microcluster is deleted if its timestamp is below a given threshold δ (input parameter); otherwise, two microclusters are merged. The q microclusters are written to a secondary storage device at particular time intervals that decrease exponentially, which are referred to as snapshots. These snapshots allow the user to search for clusters in different time horizons through a pyramidal time window concept. The summary information in the microclusters is used by an offline component which depends on a wide variety of user inputs, such as the time horizon or the granularity of clustering. When the user specifies a particular time horizon of length h over which to find the clusters, we need to find the microclusters which are specific to that time horizon. The additive property of the cluster feature vector makes this possible: the statistics of the snapshot stored just before the start of the horizon are subtracted from those of the current snapshot. The final clusters are determined by using a modification of the k-means algorithm. In this technique, the microclusters are treated as pseudo-points which are re-clustered in order to determine higher-level clusters.
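The additivity of the microcluster tuple (N, LS, SS, LST, SST) can be illustrated with a minimal sketch. The class and method names below (`MicroCluster`, `absorb`, `merge`, `subtract`, `rms_radius`) are hypothetical and chosen for illustration only; this is not the reference implementation of [1], and the radius shown is the plain root-mean-square deviation rather than the exact boundary rule of the original algorithm.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Temporal cluster feature (N, LS, SS, LST, SST); names are illustrative."""
    n: int        # N: number of points
    ls: list      # LS: per-dimension linear sum
    ss: list      # SS: per-dimension squared sum
    lst: float    # LST: sum of time stamps
    sst: float    # SST: sum of squared time stamps

    @classmethod
    def from_point(cls, x, t):
        return cls(1, [float(v) for v in x], [float(v) ** 2 for v in x],
                   float(t), float(t) ** 2)

    def absorb(self, x, t):
        """Update the statistics in place when a point is absorbed."""
        self.n += 1
        self.ls = [a + v for a, v in zip(self.ls, x)]
        self.ss = [a + v * v for a, v in zip(self.ss, x)]
        self.lst += t
        self.sst += t * t

    def merge(self, other):
        """Additivity: the summary of a union is the component-wise sum."""
        return MicroCluster(self.n + other.n,
                            [a + b for a, b in zip(self.ls, other.ls)],
                            [a + b for a, b in zip(self.ss, other.ss)],
                            self.lst + other.lst, self.sst + other.sst)

    def subtract(self, other):
        """Subtractivity: remove the contribution of an older snapshot."""
        return MicroCluster(self.n - other.n,
                            [a - b for a, b in zip(self.ls, other.ls)],
                            [a - b for a, b in zip(self.ss, other.ss)],
                            self.lst - other.lst, self.sst - other.sst)

    def centroid(self):
        return [a / self.n for a in self.ls]

    def rms_radius(self):
        """Root-mean-square deviation of the points from the centroid."""
        var = sum(s / self.n - (l / self.n) ** 2
                  for s, l in zip(self.ss, self.ls))
        return math.sqrt(max(var, 0.0))
```

Because every component is a plain sum, merging two microclusters and subtracting an older snapshot from a newer one are both component-wise operations, which is what allows clusters to be recovered for a chosen time horizon without revisiting the raw points.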
StreamKM++
StreamKM++ [3] is a two-phase (online-offline) algorithm which maintains a small summary of the input data using the merge-and-reduce technique. The merge step is performed via a data structure, named the bucket set, which is a set of L buckets (also named buffers), where L is an input parameter. The reduce step is performed by a significantly different summary data structure that is suitable for high-dimensional data, the coreset tree, which reduces 2m data points to m data points (m is an input parameter). The advantage of such a coreset is that we can apply any fast approximation algorithm (for the weighted problem) on the usually much smaller coreset to compute an approximate solution for the original dataset more efficiently.
The coreset tree is constructed as follows. First, the tree has only the root node v, which contains all the 2m data points in the set of data points \(E_{v}\). The prototype of the root node, \(w_{v}\), is chosen randomly from \(E_{v}\), and \(N_{v} = |E_{v}| = 2m\). The computation of the sum of squared distances of the data points in \(E_{v}\) to \(w_{v}\) (\(SSE_{v}\)) follows from the definition of \(w_{v}\). Afterwards, two child nodes for v are created: \(v_{1}\) and \(v_{2}\). To create these nodes, it is necessary to choose a data point from \(E_{v}\) with probability proportional to \( \frac{Dist(x^{j}, w_{v})^{2}}{SSE_{v}}, \forall x^{j}\in E_{v} \), i.e., the data points that are farthest away from \(w_{v}\) have the highest probability of being selected. We call the selected data point \(w_{v'}\). The next step is to allocate the data points in \(E_{v}\) to \(E_{v_1}\) and \(E_{v_2}\), such that each point goes to the child whose prototype is nearest:
\( E_{v_1} = \{x^{j} \in E_{v} \mid Dist(x^{j}, w_{v}) < Dist(x^{j}, w_{v'})\}, \quad E_{v_2} = E_{v} \setminus E_{v_1} \)
Consequently, the summary statistics of the child nodes \(v_{1}\) and \(v_{2}\) are updated. This is the expansion step of the tree, which creates two child nodes for a given inner node. When the tree has many leaf nodes, we have to decide which one should be expanded first. In this case, we start from the root node of the coreset tree and descend it by iteratively selecting a child node with probability proportional to \(\frac{SSE_{child}}{SSE_{parent}}\), until a leaf node is reached, at which point the expansion step is applied to that leaf. The coreset tree expansion stops when the number of leaf nodes is m.
When a new data point arrives, it is stored in the first bucket. If the first bucket is full, all of its data are moved to the second bucket. If the second bucket is full, the two buckets are merged, resulting in 2m data points, which are then reduced to m data points by the construction of a coreset tree, as previously detailed. The resulting m data points are stored in the third bucket, unless it is also full, in which case a new merge-and-reduce step is needed [3, 14]. In the offline phase, k-means++ [43] is executed on an input set of size m to find the final clusters. The k-means++ method is a seeding procedure for the k-means algorithm that guarantees a solution of a certain quality and gives good practical results.
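The cascading behaviour of the bucket set can be sketched as follows. This is a simplified illustration under stated assumptions: the `reduce_stub` function is a hypothetical stand-in that keeps m weighted points by uniform random sampling, not the coreset tree construction of [3], and the bucket list grows on demand rather than being fixed at L buckets. The names `BucketSet`, `insert` and `summary` are illustrative.

```python
import random


def reduce_stub(points, m, rng):
    """Stand-in for the coreset-tree reduction (an assumption, not StreamKM++):
    keep m of the weighted points uniformly at random and rescale the weights
    so that the total weight is preserved."""
    total = sum(w for _, w in points)
    kept = rng.sample(points, m)
    scale = total / sum(w for _, w in kept)
    return [(x, w * scale) for x, w in kept]


class BucketSet:
    """Merge-and-reduce over buckets; buckets[0] collects raw points."""

    def __init__(self, m, seed=0):
        self.m = m
        self.buckets = [[]]
        self.rng = random.Random(seed)

    def insert(self, x):
        self.buckets[0].append((x, 1.0))
        if len(self.buckets[0]) < self.m:
            return
        # The first bucket is full: push its content up the chain of buckets.
        carry, self.buckets[0] = self.buckets[0], []
        i = 1
        while True:
            if i == len(self.buckets):
                self.buckets.append([])
            if not self.buckets[i]:
                self.buckets[i] = carry      # free slot found
                return
            # Bucket i is full: merge (2m points) and reduce back to m points.
            carry = reduce_stub(self.buckets[i] + carry, self.m, self.rng)
            self.buckets[i] = []
            i += 1

    def summary(self):
        """Union of all buckets: the weighted summary handed to the offline phase."""
        return [p for bucket in self.buckets for p in bucket]
```

The pattern of occupied buckets behaves like a binary counter over the number of full batches seen so far, so a stream of n points is summarized with a number of buckets that grows only logarithmically in n/m.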
StrAP
StrAP [4] is an extension of the Affinity Propagation (AP) [44] algorithm for data streams, which uses a reservoir for saving potential outliers. The Affinity Propagation approach proposes an equivalent formulation of the kmedoids problem in the sense that a prototype is an effective data point, with the difference that the number of clusters to be found is not fixed. Formulating the clustering problem in terms of energy minimization, AP outputs a set of clusters, each of which is characterized by an actual data point, referred to as an exemplar or a prototype; the penalty value parameter controls the cost of adding another prototype. AP provides some asymptotic guarantees of the optimality of the solution. The tradeoff for these properties is the AP’s quadratic computational complexity, excluding its use on large scale datasets. The StrAP algorithm, as an online version of AP, proceeds by incrementally updating the current model if the current data point fits the model, and putting it in a reservoir otherwise. A change point detection test enables StrAP to catch drifting exemplars that significantly deviate away. StrAP involves four main steps as illustrated in Algorithm 4 with a diagram in Fig. 4 [4]:

The first batch of data is used by AP to identify the first clusters and initialize the stream model.

As the stream flows in, each data point x _{ t } is compared to the prototypes; if too far from the nearest exemplar, x _{ t } is put in the reservoir, otherwise the stream model is updated accordingly.

The data distribution is checked for change point detection, using the Page-Hinkley significance test.

Upon triggering the change detection test, or if the number of outliers exceeds the reservoir size, the stream model is rebuilt based on the current model and reservoir, using a weighted version of AP (WAP).
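The change detection in the third step can be illustrated with the standard textbook form of the Page-Hinkley test for an upward shift in the mean of a monitored quantity. This is a minimal sketch and not the exact variant or parameterization used in [4]; the class name `PageHinkley` and the parameters `delta` (tolerance) and `threshold` (alarm level) are assumptions made for the example.

```python
class PageHinkley:
    """Textbook Page-Hinkley test for an increase in the mean of a signal."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level (often written lambda)
        self.n = 0
        self.mean = 0.0             # running mean of the observations
        self.cum = 0.0              # cumulative deviation m_t
        self.cum_min = 0.0          # running minimum M_t of m_t

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold
```

In a StrAP-like setting, the monitored signal could be, for instance, an indicator of whether each incoming point was sent to the reservoir; a sustained rise in that rate then triggers a model rebuild.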
The model of the data stream used in StrAP is inspired by DenStream [2]. It consists of a set of 4-tuples \((c_{i}, n_{i}, \Sigma_{i}, t_{i})\), where \(c_{i}\) ranges over the clusters, \(n_{i}\) is the number of items associated to cluster \(c_{i}\), \(\Sigma_{i}\) is the distortion of \(c_{i}\) (the sum of \(d(x, c_{i})^{2}\), where x ranges over all data points associated to \(c_{i}\)), and \(t_{i}\) is the last time stamp when a data point was associated to \(c_{i}\).
At time t, the data point \(x_{t}\) is considered and its nearest cluster \(c_{i}\) (w.r.t. distance d) in the current model is selected. If \(d(x_{t}, c_{i})\) is less than some threshold δ, heuristically set to the average distance between points and clusters in the initial model, \(x_{t}\) is assigned to the ith cluster and the model is updated accordingly; otherwise, \(x_{t}\) is considered to be an outlier and put into the reservoir [4].
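The assignment rule above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the update adds the new point to the counters without any time decay, whereas the actual update rule of [4] may weight past items differently, and the function name `strap_step` and the dictionary keys are hypothetical.

```python
import math


def strap_step(model, reservoir, x, t, delta):
    """Assign x to its nearest exemplar if it is closer than delta,
    otherwise store it in the reservoir.

    model is a list of dicts with keys "c" (exemplar), "n" (count),
    "sigma" (distortion) and "t" (last update time)."""
    i = min(range(len(model)), key=lambda j: math.dist(x, model[j]["c"]))
    d = math.dist(x, model[i]["c"])
    if d < delta:
        model[i]["n"] += 1           # one more item for exemplar i
        model[i]["sigma"] += d * d   # accumulate squared distance
        model[i]["t"] = t            # remember the last assignment time
        return i
    reservoir.append(x)              # treated as a potential outlier
    return None
```

A caller would typically track how often `None` is returned, since a full reservoir or a detected change point is what triggers the rebuild of the model.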
Approximations of the k-means algorithm in the one-pass streaming setting have been proposed in [45–47]. The streaming k-means algorithm proposed in [45] is based on a divide-and-conquer approach. It uses the result of [43] as a subroutine, finding 3k log k centers for each block. The authors' experiments showed that this algorithm improves on an online version of the k-means algorithm and is comparable to the batch version of k-means.
The Highdimensional Projected Stream clustering method (HPStream) [48] introduces the concept of projected clustering to data streams. This algorithm is a projected clustering for highdimensional streaming data with higher clustering quality compared to CluStream [1].
SWClustering uses an EHCF (Exponential Histogram of Cluster Features) structure by combining Exponential Histogram with Cluster Feature to record the evolution of each cluster and to capture the distribution of recent data points [49]. It tracks the clusters in evolving data streams over sliding windows.
Densitybased stream methods
Density-based algorithms are based on the connection between regions and density functions. In these types of algorithms, dense areas of objects in the data space are considered as clusters, which are separated by low-density areas (noise). These algorithms find clusters of arbitrary shapes and generally require two parameters: the radius and the minimum number of data points within a cluster.
The main challenge in the streaming scenario is to construct densitybased algorithms which can be efficiently executed in a single pass of the data, since the process of density estimation may be computationally intensive [9]. Amini [15] gives a survey on recent densitybased data streams clustering algorithms.
DenStream
DenStream [2] is a density-based data stream clustering algorithm that also uses a feature vector based on the CF vector. By creating two kinds of microclusters (potential and outlier microclusters) in its online phase, DenStream overcomes one of the drawbacks of CluStream, namely its sensitivity to noise. Potential and outlier microclusters are kept in separate memories since they require different processing. Each potential microcluster structure has an associated weight w that indicates its importance based on temporality. The weight of each data point decreases exponentially with time t via a fading function \(f(t) = 2^{-\lambda t}\), where λ>0. If the weight \( w = \sum_{j=1}^{n} f(t - T_{ij}) \) is above a threshold input parameter μ, then the corresponding cluster is considered as a core microcluster, where \(T_{i1}, \ldots, T_{in}\) are the timestamps of data points \(p_{i1}, \ldots, p_{in}\). At time t, if w ≥ βμ then the microcluster is considered as a potential microcluster; otherwise it is an outlier microcluster, where β is the threshold of the outlier relative to core microclusters (0<β<1). Microclusters with no recent points tend to lose importance, i.e. the weights of outdated microclusters continuously decrease over time. Conversely, an outlier microcluster can grow into a potential microcluster when, by adding new points, its weight exceeds the threshold. The weights of the microclusters are periodically recalculated, and the decision to remove or keep them is made based on the weight threshold.
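The fading function and the weight thresholds can be made concrete with a small numerical sketch. The function names (`fading`, `weight`, `decay`, `kind`) are hypothetical and the parameter values in the usage note are arbitrary; the formulas are the ones stated above.

```python
def fading(dt, lam):
    """Fading function f(dt) = 2^(-lam * dt) for an age dt >= 0."""
    return 2.0 ** (-lam * dt)


def weight(timestamps, t, lam):
    """Weight of a microcluster at time t: sum of the faded point weights."""
    return sum(fading(t - ts, lam) for ts in timestamps)


def decay(w, dt, lam):
    """Age an existing weight by dt without revisiting the points;
    valid because the fading function is multiplicative in time."""
    return w * fading(dt, lam)


def kind(w, mu, beta):
    """Potential microcluster if w >= beta * mu, outlier microcluster otherwise."""
    return "potential" if w >= beta * mu else "outlier"
```

For example, with λ = 1 and points at times 0, 1 and 2, the weight at t = 2 is 0.25 + 0.5 + 1 = 1.75; with μ = 2 and β = 0.5 this is a potential microcluster. If no point arrives until t = 4, the weight decays to 1.75 × 2^{-2} = 0.4375 and the microcluster falls below βμ = 1.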
When a new data point arrives, the algorithm tries to insert it into its nearest potentialmicrocluster based on its updated radius. If the insertion is not successful, the algorithm tries to insert the data point into its closest outlier microcluster. If the insertion is successful, the cluster summary statistics will be updated accordingly. Otherwise, a new outlier microcluster is created to absorb this point. The Euclidean distance between the new data point and the center of the nearest potential or outlier microcluster is measured. A microcluster is chosen with the distance less than or equal to the radius threshold. DenStream has a pruning method in which it frequently checks the weights of the outliermicroclusters in the outlier buffer to guarantee the recognition of the real outliers. However, the nonrelease of the allocated memory when either deleting a microcluster or merging two old microclusters is considered as a limitation of the DenStream algorithm as well as the timeconsuming pruning phase for removing outliers [15]. In the offline phase, the potentialmicroclusters found during the online phase are considered as pseudopoints and will be passed to a variant of the DBSCAN algorithm in order to determine the final clusters.
SOStream
SOStream [50] is a density-based clustering algorithm inspired by both the principle of the DBSCAN algorithm and self-organizing maps (SOM) [18], in the sense that a winner influences its immediate neighborhood. Generally speaking, density-based clustering algorithms require a threshold to be set manually (similarity threshold, grid size, etc.), for which it is difficult to choose the most suitable value; if it is set to an unsuitable value, the algorithm will suffer from overfitting or from unstable clustering. SOStream addresses this problem by using a dynamically learned threshold value for each cluster, based on the idea of building neighborhoods with a minimum number of points.
SOStream is also represented by a set of microclusters where for each cluster a cluster feature (CF) vector is stored, which is a tuple with three elements N _{ i }=(n _{ i },r _{ i },C _{ i }),n _{ i } is the number of data points assigned to N _{ i },r _{ i } is the cluster’s radius and C _{ i } is the centroid.
When a new point arrives, the nearest cluster is selected, based on the Euclidean distance to existing microclusters, and then absorbs this point if the calculated distance is less than a dynamically defined threshold. It also assigns the microclusters’ neighbors to the nearest cluster, i.e., the centroids of clusters sufficiently close to the winning cluster have their centroids modified to be closer to the winning cluster’s centroid. This approach is used to assist in merging similar clusters and increasing separation between different clusters. The neighborhood of the winner is defined based on the idea of a MinPts distance given by a minimum number of neighboring objects [2]. This distance is found by computing the Euclidean distance from any existing clusters to the winning cluster. If the new point is not absorbed by any microcluster, a new microcluster is created for it. In the SOStream algorithm, merging, updating and adapting dynamically the threshold value for each cluster are performed in an online manner. Clusters are merged if they overlap with a distance that is less than the mergethreshold, i.e., the spheres in ddimensional space defined by the radius of each cluster overlap. Hence, the threshold value is a determining factor for the number of clusters. When two clusters are merged, the largest radius of these two clusters is chosen to be the radius of the cluster to avoid losing any data points within the clusters. However, no split feature is proposed in the algorithm. SOStream also uses an exponential fading function to reduce the impact of old data whose relevance diminishes over time.
SVStream
SVStream (Support Vector based Stream clustering) [51] is a data stream clustering algorithm based on support vector clustering (SVC) and support vector domain description (SVDD).
In the Support Vector Clustering (SVC) [52] algorithm data points are mapped from the data space to a high dimensional feature space using a Gaussian kernel. In the feature space we look for the smallest sphere that encloses the image of the data. This sphere is mapped back to data space, where it forms a set of contours which enclose the data points. These contours are interpreted as cluster boundaries. Points enclosed by each separate contour are associated with the same cluster. Support vectors are used to construct cluster boundaries of arbitrary shape in SVC.
Support vector domain description (SVDD) [53] is a oneclass classifier inspired by the support vector classifier. The idea is to use kernels to project data into a feature space and then to find the sphere enclosing almost all data, namely not including outliers. SVDD has the possibility to reject a fraction of the training data points, when this sufficiently decreases the volume of the hypersphere. One inherent drawback of SVDD, which affects not only its outlier detection performance but also its general properties significantly, is that the resulting description is highly sensitive to the selection of the tradeoff parameter, which is difficult to estimate in practice.
Given a set of M data elements, the Gaussian kernel parameter q and the trade-off parameter C, the sphere structure S is defined as
\( S = \{SV, BSV, \|\mu\|^{2}, R_{SV}, R_{BSV}\} \)
where

SV is a support vector set.

BSV is a bounded support vector set.

∥μ∥^{2} is the squared length of the sphere center μ.

R _{ SV } is the radius of the sphere.

R _{ BSV } is the maximum Euclidean distance of the bounded support vectors from the sphere center μ.
The multi-sphere set SS is defined as a set consisting of multiple spheres, that is, \( SS = \{S^{1}, \ldots, S^{|SS|}\} \), where the superscript denotes the index of a sphere. In SVStream, the elements of a data stream are mapped to a kernel space, and the support vectors are used as the summary information of the historical elements to construct cluster boundaries of arbitrary shape. To adapt to both dramatic and gradual changes, multiple spheres are dynamically maintained, each describing the corresponding data domain presented in the data stream. When a new data batch arrives, if a dramatic change occurs, a new sphere is created; otherwise, only the existing spheres are updated to take the new batch into account. The data elements of this new batch are assigned cluster labels according to the cluster boundaries constructed by the sphere set. Bounded support vectors (BSVs) and a newly designed BSV decaying mechanism are introduced so as to, respectively, identify overlapping clusters and automatically detect outliers (noise) [51]. In the clustering process, if two spheres are too close to each other, they are merged. In addition, eliminating old BSVs through the BSV decaying mechanism helps detect the tendency of a cluster to shrink or split.
OPTICSStream is an extension of the OPTICS algorithm [54] to the streaming data model. OPTICS uses a density identification method to create a onedimensional clusterordering of the data. OPTICSStream is an online visualization algorithm producing a map representing the clustering structure where each valley represents a cluster [55].
PreDeConStream [56] is based on the two phase process of mining data streams, which builds a microclusterbased structure to store an online summary of the streaming data. The technique is based on subspace clustering, targeting applications with high data dimensionality.
Gridbased stream methods
Grid-based clustering is another group of clustering methods for data streams, in which the data space is quantized into a finite number of cells that form the grid structure, and clustering is performed on the grids. Grid-based clustering maps the infinite number of data records in data streams to a finite number of grids. Then, the grids are clustered based on their density.
DStream
DStream [57] is also a twophase scheme which consists of an online component that processes input data stream and produces summary statistics and an offline component that uses the summary data to generate clusters. In the online component, the algorithm maps each input data point into a grid whereas in the offline component, it computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream and it can find clusters of arbitrary shapes. Unlike other algorithms such as CluStream [1], DStream automatically and dynamically adjusts the clusters without requiring user specification of target time horizon and number of clusters. Algorithm 5 outlines the overall architecture of DStream.
For a data stream, at each time step, the online component of DStream continuously reads a new data point, places the multidimensional data into a corresponding discretized density grid in the multidimensional space, and updates the characteristic vector of the density grid (Lines 4–6 of Algorithm 5). The density of a grid g at a given time t, D(g,t), is defined as the sum of the density coefficients D(x,t) of all data records x that are mapped to g. That is, the density of g at t is:
\( D(g,t) = \sum_{x \in E(g,t)} D(x,t) \)
where E(g,t) is the set of data points that are mapped to g at or before time t. The density of any grid is constantly changing. However, the updating operation is executed only when a new data record is mapped to that grid.
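The lazy update of the grid density can be sketched as follows. The sketch assumes an exponentially decaying density coefficient of the form λ^{t − T(x)}, with 0 < λ < 1 and T(x) the arrival time of x, which the text above does not spell out; under that assumption, a grid last updated at time t_g with density D only needs the update D ← D · λ^{t − t_g} + 1 when a new record arrives at time t. The names `DensityGrid`, `width` and `decay` are hypothetical.

```python
import math


class DensityGrid:
    """Uniform grid with lazily decayed densities (illustrative sketch)."""

    def __init__(self, width, decay):
        self.width = width   # side length of a grid cell (assumed uniform)
        self.decay = decay   # decay factor lambda, 0 < lambda < 1
        self.cells = {}      # cell key -> [t_g, density at time t_g]

    def key(self, x):
        """Map a point to the integer coordinates of its cell."""
        return tuple(int(math.floor(v / self.width)) for v in x)

    def insert(self, x, t):
        """Decay the stored density up to time t, then add the new record."""
        cell = self.cells.setdefault(self.key(x), [t, 0.0])
        t_g, d = cell
        cell[0] = t
        cell[1] = d * self.decay ** (t - t_g) + 1.0

    def density(self, key, t):
        """Density of a cell brought up to the query time t."""
        t_g, d = self.cells[key]
        return d * self.decay ** (t - t_g)
```

With λ = 0.5, two records falling into the same cell at times 0 and 1 give a stored density of 1.5 at time 1, and a density of 0.375 when queried at time 3, which matches the direct sum 0.5^3 + 0.5^2.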
DStream associates a characteristic vector with each grid. This is a tuple \((t_{g}, t_{m}, D, label, status)\), where \(t_{g}\) is the last time when g was updated, \(t_{m}\) is the last time when g was removed from the grid list as a sporadic grid (if ever), D is the grid density at the last update, label is the class label of the grid, and status ∈ {SPORADIC, NORMAL} is a label used for removing sporadic grids.
The procedures initial_clustering (used in Line 8 of Algorithm 5) and adjust_clustering (used in Line 11 of Algorithm 5) first update the density of all active grids to the current time. Once the densities of the grids are determined at the given time, the clustering procedure is similar to the standard method used by density-based clustering.
The offline component dynamically adjusts the clusters every gap time steps, where gap is an integer parameter. After the first gap, the algorithm generates the initial clusters (Lines 7–8). Then, the algorithm periodically removes sporadic grids and adjusts the clusters (Lines 9–11) [57].
One weakness of the approach is that a significant number of nonempty grid cells need to be discarded in order to keep the memory requirements in check. In many cases, such gridcells occur at the borders of the clusters. The discarding of such cells may lead to a degradation in cluster quality [9].
Analogously to DStream, MRStream [58] facilitates the discovery of clusters at multiple resolutions by using a grid of cells that can be dynamically subdivided into more cells using a tree data structure. In the online phase, it assigns new incoming data to the appropriate cell and updates the summary information. The offline component obtains a portion of the tree at a fixed height h and performs clustering at the resolution level determined by h.
Summary
Table 1 summarizes the main features offered by each algorithm considered in terms of: the basic clustering algorithm, whether the algorithm identifies a topological structure, whether the links (if they exist) between clusters (nodes) are weighted, how many phases it adopts (online and offline), the types of operations for updating clusters (remove, merge, and split cluster), and whether a fading function is used.
Streaming platforms
In today’s applications, evolving data streams are ubiquitous. As the need of industry for real-time analysis has emerged, an increasing number of systems supporting real-time data integration and analytics have been developed in recent years. Generally speaking, there exist two types of stream processing systems: traditional streaming platforms, on which a streaming algorithm is implemented sequentially in a conventional programming language, and distributed streaming platforms, where the data is distributed across a cluster of machines and the processing model is implemented using the MapReduce framework. This section gives a survey of the most well-known streaming platforms, with a focus on the streaming clustering task. Liu [59] gives a general survey of real-time processing systems for big data.
Spark streaming
Spark Streaming [7] is an extension of the Apache Spark [60] project that adds the ability to perform online processing through a functional interface similar to that of Spark (map, filter, reduce, etc.). Spark is a cluster computing system originally developed at the UC Berkeley AMPLab and is now a project of the Apache Software Foundation. The execution model of Spark is based on an abstraction called the Resilient Distributed Dataset (RDD), which is a distributed memory abstraction of data. Spark performs in-memory computations on large clusters in a fault-tolerant manner through RDDs [61]. Spark Streaming runs streaming computations as a series of short batch jobs on RDDs within a programming model called discretized streams (DStreams). The key idea behind DStreams is to treat a streaming computation as a series of deterministic batch computations on small time intervals. For example, we might place the data received each second into a new interval, and run a MapReduce operation on each interval to compute a count. Similarly, we can perform a running count over several intervals by adding the new counts from each interval to the old result. Spark Streaming can automatically parallelize the jobs across the nodes in a cluster. It also supports fault recovery for a wide array of operations.
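The per-interval count and the running count described above can be mimicked in a few lines of plain Python. This sketch deliberately does not use the Spark Streaming API; it only illustrates the discretized-stream idea on a single machine, and the function names (`discretize`, `batch_count`, `running_counts`) are hypothetical. Intervals in which no event arrives are simply skipped in this simplified version.

```python
from collections import Counter


def discretize(events, interval):
    """Group (timestamp, key) events into consecutive time intervals."""
    batches = {}
    for ts, key in events:
        batches.setdefault(int(ts // interval), []).append(key)
    return [batches[i] for i in sorted(batches)]


def batch_count(batch):
    """Deterministic batch job run on one interval: count occurrences per key."""
    return Counter(batch)


def running_counts(batches):
    """Running count: add the counts of each new interval to the old state."""
    state = Counter()
    history = []
    for batch in batches:
        state = state + batch_count(batch)
        history.append(dict(state))
    return history
```

Each interval is processed by the same deterministic function, and the only state carried between intervals is the previous result, which is the property that makes recomputation after a failure possible.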
Spark Streaming comes with a new approach for fault recovery, while classical streaming systems update the mutable state on a perrecord basis and use either replication or upstream backup for fault recovery. The replication approach creates two or more copies of each process in the data flow graph [62]. This can double the hardware cost, and if two nodes in the same replica fail, the system is not recoverable. In upstream backup [63], upstream nodes act as backups for their downstream neighbors by preserving tuples in their output queues while their downstream neighbors process them. If a server fails, its upstream nodes replay the logged tuples on a recovery node. The disadvantage of this approach is long recovery times, as the system must wait for the standby node to catch up.
To address these issues, DStreams employ another approach: parallel recovery. The system periodically checkpoints some of the state RDDs, by asynchronously replicating them to other nodes. For example, in a view count program computing hourly windows, the system could checkpoint results every minute. When a node fails, the system detects the missing RDD partitions and launches tasks to recover them from the latest checkpoint [7].
From the streaming clustering point of view, Spartakus^{2} is an open-source project on top of Sparknotebook^{3} which provides front-end packages for some clustering algorithms implemented using the MapReduce framework. This includes the MBGStream^{4} algorithm [35] (detailed in the “Background” section) with an integrated interface for execution and visualization checks. MLlib [64] provides implementations of some clustering algorithms, in particular an open-source streaming k-means^{5} implementation. streamDM^{6} is another open-source software for mining big data streams using Spark Streaming, developed at Huawei Noah’s Ark Lab. For streaming clustering, it includes CluStream [1] and StreamKM++ [3].
Flink
Flink^{7} is an open-source platform for distributed stream and batch data processing. The core of Flink is a streaming iterative data flow engine. On top of the engine, Flink exposes two language-embedded fluent APIs: the DataSet API for consuming and processing batch data sources and the DataStream API for consuming and processing event streams. The key idea behind Flink is the optimistic recovery mechanism, which does not checkpoint every state [8]. It therefore provides optimal failure-free performance and simultaneously uses fewer resources in the cluster than traditional approaches. Instead of restoring a state from a previously written checkpoint and restarting the execution, a user-defined, algorithm-specific compensation function is applied. In case of a failure, this function restores a consistent algorithm state and allows the system to continue the execution.
MOA
MOA^{8} (Massive Online Analysis) is a framework for data stream mining [6]. It includes tools for evaluation and a collection of machine learning algorithms. Related to the WEKA project^{9} (Waikato Environment for Knowledge Analysis), it is also written in Java, while scaling to more demanding problems. The goal of MOA is to serve as a benchmark framework for running experiments in the data stream mining context by providing storable settings for data streams (real and synthetic) for repeatable experiments, a set of existing algorithms and measures from the literature for comparison, and an easily extendable framework for new streams, algorithms and evaluation methods. MOA currently supports stream classification, stream clustering, outlier detection, change detection and concept drift, and recommender systems. MOA currently contains several stream clustering methods, including StreamKM++ [3], CluStream [1], ClusTree [16], DenStream [2] and DStream [57].
SAMOA
SAMOA^{10} (Scalable Advanced Massive Online Analysis) is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. It is a project started at Yahoo Labs Barcelona. SAMOA is both a framework and a library [65]. As a framework, it allows algorithm developers to abstract from the underlying execution engine, and therefore reuse their code on different engines. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm^{11}, S4^{12}, and Samza^{13}. As a library, SAMOA contains implementations of state-of-the-art algorithms for distributed machine learning on streams. For streaming clustering, it includes an algorithm based on CluStream [1].
Open challenges in data stream clustering
In today’s applications, evolving data streams are ubiquitous. Mining, knowledge discovery, or more specifically clustering of streaming data is a recent domain compared to the offline (or batch) model. Thus, many challenges, issues and problems remain to be addressed in the streaming model. This section is devoted to discussing some challenging issues and further directions from the viewpoints of both academic research and industrial applications [11, 14, 66–68].
Protecting privacy and confidentiality. Data streams present new challenges and opportunities with respect to protecting privacy and confidentiality in data mining. The main objective is to develop such data mining techniques that would not uncover information or patterns which compromise confidentiality and privacy obligations. Privacybydesign seems to be an interesting paradigm to use.
Handling incomplete information. The problem of missing values, which corresponds to incompleteness of features, has been discussed extensively for the offline, static settings.
Uncertain data. In most applications we do not have sufficient data for statistical operations, so new methods are needed to manage uncertain data streams in an accurate and fast manner.
Variety of data. Data type diversity in a given stream (text, video, audio, static image, etc.) as well as differences in data processability (structured, semistructured, unstructured data). Clustering these diverse types of data together, coming in a streaming form, is very challenging. Another interesting future application of data stream clustering is social network analysis. The activities of social network members can be regarded as a data stream, and a clustering algorithm can be used to show similarities among members, and how these similar profiles (clusters) evolve over time.
Synopsis, sketches and summaries. Synopses are compact data structures that summarize data for further querying. The survey in [69] covers samples, histograms, wavelets and sketches, and describes basic principles and recent developments in building approximate synopses (that is, lossy, compressed representations) of massive data. Data sketching via random projections is a tool for dimensionality reduction. Although this technique is extremely efficient, its main drawback is that it may ignore relevant features.
Distributed streams. Data streams are distributed in nature. For learning from distributed data, we need efficient methods in minimizing the communication overheads between nodes. Most importantly, in applications like monitoring, centralized solutions introduce delays in event detection and reaction, that can make mining systems inefficient. Many data clustering techniques are not trivial to parallelize. To develop distributed versions of some methods, a lot of research is needed with practical and theoretical analysis to provide new methods.
Evaluation of data stream algorithms. Although evaluation tools exist in the field of static classification, they are insufficient in data stream environments because of problems such as concept drift, limited processing time, verification latency, multiple stream structures, evolving class skew, censored data, and changing misclassification costs. Indeed, in the streaming context, we are more interested in how the evaluation metric evolves over time [66].
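One simple way to follow a metric as it evolves is to keep an exponentially faded average of a per-observation score. The sketch below is an illustration under stated assumptions: the score is the squared distance from each arriving point to its nearest centroid, the centroids are held fixed for simplicity, and the names `FadingAverage`, `squared_distance_to_nearest`, and `track_quality` as well as the fading factor `alpha` are hypothetical choices for this example.

```python
class FadingAverage:
    """Exponentially faded running average of a per-observation score."""

    def __init__(self, alpha=0.99):
        self.alpha = alpha          # fading factor in (0, 1]; 1 means no fading
        self.weighted_sum = 0.0
        self.weight = 0.0

    def update(self, value):
        """Fold in a new score and return the current faded average."""
        self.weighted_sum = self.alpha * self.weighted_sum + value
        self.weight = self.alpha * self.weight + 1.0
        return self.weighted_sum / self.weight


def squared_distance_to_nearest(point, centroids):
    """Squared Euclidean distance from point to its closest centroid."""
    return min(sum((x - c) ** 2 for x, c in zip(point, centroid))
               for centroid in centroids)


def track_quality(stream, centroids, alpha=0.99):
    """Return the faded clustering cost after each point of the stream."""
    average = FadingAverage(alpha)
    return [average.update(squared_distance_to_nearest(p, centroids))
            for p in stream]


if __name__ == "__main__":
    centroids = [(0.0, 0.0)]
    # Toy stream with an abrupt drift: points move from (0, 0) to (1, 0).
    stream = [(0.0, 0.0)] * 50 + [(1.0, 0.0)] * 50
    history = track_quality(stream, centroids, alpha=0.9)
    print("cost before drift:", history[49])
    print("cost after drift: ", history[-1])
```

Because old scores are progressively discounted, the curve reacts to a drift within a few dozen points instead of being dominated by the long history before it.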
Autonomy and self-diagnosis. Knowledge discovery from data streams requires the ability for predictive self-diagnosis. A significant and useful intelligence characteristic is diagnostics, not only after failure has occurred, but also predictive (before failure) and advisory (providing maintenance instructions). The development of such self-configuring, self-optimizing, and self-repairing systems is a major scientific and engineering challenge. All these aspects require monitoring the evolution of the learning process, taking into account the available resources, and the ability to reason and learn about it [67, 68].
Combining offline and online models. Online (or real-time) and offline (or batch) learning are mostly considered mutually exclusive, but it is their combination that might enhance the value of data the most. The Lambda Architecture [70] is a useful framework for designing big data applications in which these two models are combined on the same platform. Figure 5 is a diagram of the Lambda Architecture.
Essentially, the Lambda Architecture comprises the following components, processes, and responsibilities:

New Data: All data entering the system is dispatched to both the batch layer and the speed layer for processing.

Batch layer: This layer has two functions: (i) managing the master dataset, an immutable, append-only set of raw data, and (ii) precomputing arbitrary query functions from scratch, called batch views.

Serving layer: This layer indexes the batch views so that they can be queried ad hoc with low latency.

Speed layer: This layer compensates for the high latency of updates to the serving layer, due to the batch layer. Using fast and incremental algorithms, the speed layer deals with recent data only.

Queries: Any incoming query can be answered by merging results from both batch views and real-time views.
Designing data stream clustering methods in a Lambda Architecture where we can benefit from the high accuracy of the batch model is very interesting and challenging.
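The division of responsibilities listed above can be sketched in a few lines of Python. This is a toy, single-process illustration under several assumptions: points are assigned to fixed reference centroids, each view stores a (count, sum vector) summary per cluster, and the batch recomputation runs synchronously. The class `LambdaClustering` and its method names are hypothetical and do not correspond to any existing platform API.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((x - c) ** 2
                                 for x, c in zip(point, centroids[i])))


class LambdaClustering:
    def __init__(self, centroids):
        self.centroids = centroids   # fixed reference centroids (assumption)
        self.master = []             # batch layer: append-only raw data
        self.batch_view = {}         # serving layer: cluster id -> (count, sum)
        self.realtime_view = {}      # speed layer: summaries of recent data only

    def _add(self, view, point):
        """Incrementally fold one point into a view."""
        k = nearest(point, self.centroids)
        count, total = view.get(k, (0, [0.0] * len(point)))
        view[k] = (count + 1, [t + x for t, x in zip(total, point)])

    def new_data(self, point):
        """Dispatch a point to both the batch layer and the speed layer."""
        self.master.append(tuple(point))
        self._add(self.realtime_view, point)

    def run_batch(self):
        """Recompute the batch view from scratch, then reset the speed layer."""
        view = {}
        for point in self.master:
            self._add(view, point)
        self.batch_view = view
        self.realtime_view = {}

    def query(self):
        """Merge both views and return cluster id -> (count, mean vector)."""
        merged = {}
        for view in (self.batch_view, self.realtime_view):
            for k, (count, total) in view.items():
                old_count, old_total = merged.get(k, (0, [0.0] * len(total)))
                merged[k] = (old_count + count,
                             [a + b for a, b in zip(old_total, total)])
        return {k: (count, [s / count for s in total])
                for k, (count, total) in merged.items()}


if __name__ == "__main__":
    system = LambdaClustering([(0.0, 0.0), (10.0, 10.0)])
    system.new_data((0.0, 0.0))
    system.new_data((2.0, 0.0))
    system.run_batch()               # batch view now covers the first two points
    system.new_data((10.0, 10.0))    # only the speed layer has seen this point
    print(system.query())
```

Additive summaries are used here because they make the merge step trivial: the counts and sums of the batch view and the real-time view can simply be added. A real deployment would run the batch recomputation asynchronously and could use a more accurate batch clustering algorithm than the incremental one used in the speed layer.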
Conclusion
Recently, examples of applications relevant to streaming data have become more numerous and more important, including network intrusion detection, transaction streams, phone records, web click-streams, social streams, weather monitoring, etc. Accordingly, data stream clustering has become an active research area in recent years. This problem requires a process capable of partitioning observations continuously while taking into account restrictions of memory and time.
In this paper, we surveyed, in a detailed and comprehensive manner, a number of representative state-of-the-art algorithms for clustering over data streams. These algorithms are categorized according to the nature of their underlying clustering approach, including GNG-based, hierarchical, partitioning, density-based, and grid-based stream methods. Motivated by industry's need for real-time analysis, an increasing number of systems supporting real-time data integration and analytics have emerged in recent years. We have given an overview of the most well-known open-source streaming systems, including Spark Streaming, Flink, MOA, and SAMOA, with a focus on the streaming clustering task.
Endnotes
^{1} See http://spark.apache.org/streaming/
^{2} See https://hub.docker.com/r/spartakus/coliseum/
^{3} See http://sparknotebook.io/
^{4} See https://github.com/mghesmoune/sparkstreamingclustering
^{5} See http://spark.apache.org/docs/latest/mllibclustering.html%23streamingkmeans
^{6} See http://huaweinoah.github.io/streamDM/
^{7} See https://flink.apache.org/
^{8} See http://moa.cms.waikato.ac.nz/
^{9} See http://weka.wikispaces.com/
^{10} See http://samoaproject.net/
^{11} See http://storm.apache.org
^{12} See http://incubator.apache.org/s4
^{13} See http://samza.incubator.apache.org
References
Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: VLDB. Berlin: VLDB Endowment: 2003. p. 81–92.
Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In: SDM. SIAM: 2006. p. 328–39.
Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C. StreamKM++: A clustering algorithm for data streams. ACM J Exp Algorithmics. 2012; 17(1):173–187.
Zhang X, Furtlehner C, Sebag M. Data streaming with affinity propagation. In: ECML/PKDD (2). Berlin: Springer Berlin Heidelberg: 2008. p. 628–43.
Demchenko Y, Grosso P, De Laat C, Membrey P. Addressing big data issues in scientific data infrastructure. In: Collaboration Technologies and Systems (CTS), 2013 International Conference On. IEEE: 2013. p. 48–55.
Bifet A, Holmes G, Pfahringer B, Kranen P, Kremer H, Jansen T, Seidl T. MOA: massive online analysis, a framework for stream classification and clustering. In: Proceedings of the First Workshop on Applications of Pattern Analysis, WAPA 2010, Cumberland Lodge, Windsor, UK September 1–3, 2010: 2010. p. 44–50.
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I. Discretized streams: fault-tolerant streaming computation at scale. In: ACM SIGOPS 24th Symposium on Operating Systems Principles, SOSP ’13, Farmington, PA, USA, November 3–6, 2013: 2013. p. 423–38.
Schelter S, Ewen S, Tzoumas K, Markl V. “All roads lead to Rome”: optimistic recovery for distributed iterative data processing. In: 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013: 2013. p. 1919–28.
Aggarwal CC. A survey of stream clustering algorithms. In: Data Clustering: Algorithms and Applications. Chapman and Hall/CRC: 2013. p. 231–58.
Nguyen H, Woon Y, Ng WK. A survey on data stream clustering and classification. Knowl Inf Syst. 2015; 45(3):535–69.
Khalilian M, Mustapha N. Data stream clustering: Challenges and issues. CoRR. 2010;abs/1006.5261.
Yogita, Toshniwal D. Clustering techniques for streaming data - a survey. In: Advance Computing Conference (IACC), 2013 IEEE 3rd International. IEEE: 2013. p. 951–6.
Mousavi M, Bakar AA, Vakilian M. Data stream clustering algorithms: A review. Int J Adv Soft Comput Appl. 2015; 7(3):13:1–13:31.
de Andrade Silva J, Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J. Data stream clustering: A survey. ACM Comput Surv. 2013; 46(1):13.
Amini A, Teh YW, Saboohi H. On density-based data streams clustering algorithms: A survey. J Comput Sci Technol. 2014; 29(1):116–41.
Kranen P, Assent I, Baldauf C, Seidl T. The ClusTree: indexing micro-clusters for anytime stream mining. Knowl Inf Syst. 2011; 29(2):249–72.
Fritzke B. A growing neural gas network learns topologies. In: NIPS. MIT Press: 1994. p. 625–32.
Kohonen T, Schroeder MR, Huang TS, (eds). Self-Organizing Maps, 3rd edn. Secaucus, NJ, USA: Springer; 2001.
Martinetz T, Schulten K. A “Neural-Gas” Network Learns Topologies. Artif Neural Netw. 1991; I:397–402.
Beyer O, Cimiano P. Online semi-supervised growing neural gas. Int J Neural Syst. 2012; 22(5):21–23.
Fritzke B. A self-organizing network that can follow non-stationary distributions. In: Artificial Neural Networks - ICANN ’97, 7th International Conference, Lausanne, Switzerland, October 8–10, 1997, Proceedings. Berlin: Springer Berlin Heidelberg: 1997. p. 613–8.
Sledge IJ, Keller JM. Growing neural gas for temporal clustering. In: 19th International Conference on Pattern Recognition (ICPR 2008), December 8–11, 2008. Tampa. IEEE: 2008. p. 1–4.
Mitsyn S, Ososkov G. The growing neural gas and clustering of large amounts of data. Opt Mem Neural Netw. 2011; 20(4):260–70.
Marsland S, Shapiro J, Nehmzow U. A self-organising network that grows when required. Neural Netw. 2002; 15(8–9):1041–58.
Marsland S, Nehmzow U, Shapiro J. Online novelty detection for autonomous mobile robots. Robot Auton Syst. 2005; 51(2):191–206.
Mendes CAT, Gattass M, Lopes H. Fgng: A fast multidimensional growing neural gas implementation. Neurocomputing. 2014; 128:328–40.
Prudent Y, Ennaji A. An incremental growing neural gas learns topologies. In: Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference On. MIT Press: 2005. p. 1211–1216.
Hamza H, Belaïd Y, Belaïd A, Chaudhuri BB. Incremental classification of invoice documents. In: 19th International Conference on Pattern Recognition (ICPR 2008), December 8–11, 2008, Tampa, Florida, USA. IEEE Computer Society: 2008. p. 1–4.
Lamirel JC, Boulila Z, Ghribi M, Cuxac P. A new incremental growing neural gas algorithm based on clusters labeling maximization: application to clustering of heterogeneous textual data. In: Proceedings of the 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems - Volume Part III. Berlin: Springer-Verlag: 2010. p. 139–48.
García-Rodríguez J, Angelopoulou A, García-Chamizo JM, Psarrou A, Escolano SO, Giménez VM. Autonomous growing neural gas for applications with time constraint: optimal parameter estimation. Neural Netw. 2012; 32:196–208.
Pimentel MA, Clifton DA, Clifton L, Tarassenko L. A review of novelty detection. Signal Process. 2014; 99:215–49.
Bouguelia MR, Belaïd Y, Belaïd A. An adaptive incremental clustering method based on the growing neural gas algorithm. In: ICPRAM. SciTePress: 2013. p. 42–9.
Ghesmoune M, Azzag H, Lebbah M. G-stream: Growing neural gas over data stream. In: Neural Information Processing - 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3–6, 2014. Proceedings, Part I: 2014. p. 207–14.
Ghesmoune M, Lebbah M, Azzag H. Clustering over data streams based on growing neural gas. In: Advances in Knowledge Discovery and Data Mining - 19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19–22, 2015, Proceedings, Part II: 2015. p. 134–45.
Ghesmoune M, Lebbah M, Azzag H. Micro-batching growing neural gas for clustering data streams using spark streaming. In: INNS Conference on Big Data 2015, San Francisco, CA, USA, 8–10 August 2015. Elsevier: 2015. p. 158–66.
Aggarwal CC, Reddy CK. Data Clustering: Algorithms and Applications, 1st: Chapman & Hall/CRC; 2013.
Zhang T, Ramakrishnan R, Livny M. Birch: An efficient data clustering method for very large databases. In: SIGMOD Conference. New York: ACM: 1996. p. 103–14.
Udommanetanakit K, Rakthanmanon T, Waiyamai K. E-stream: Evolution-based technique for stream clustering. In: ADMA: 2007. p. 605–15.
Meesuksabai W, Kangkachit T, Waiyamai K. Hue-stream: Evolution-based clustering technique for heterogeneous data streams with uncertainty. In: Advanced Data Mining and Applications - 7th International Conference, ADMA 2011, Beijing, China, December 17–19, 2011, Proceedings, Part II: 2011. p. 27–40.
Aggarwal CC, Yu PS. A framework for clustering uncertain data streams. In: Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7–12, 2008, Cancún, México: 2008. p. 150–9.
Yang C, Zhou J. Hclustream: A novel approach for clustering evolving heterogeneous data stream. In: Workshops Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18–22 December 2006, Hong Kong, China: 2006. p. 682–8.
Guttman A. R-trees: A dynamic index structure for spatial searching. In: SIGMOD’84, Proceedings of Annual Meeting, Boston, Massachusetts, June 18–21, 1984: 1984. p. 47–57.
Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding. In: SODA. Philadelphia: Society for Industrial and Applied Mathematics: 2007. p. 1027–1035.
Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007; 315(5814):972–6.
Ailon N, Jaiswal R, Monteleoni C. Streaming k-means approximation. In: NIPS. USA: Curran Associates Inc.: 2009. p. 10–18.
Braverman V, Meyerson A, Ostrovsky R, Roytman A, Shindler M, Tagiku B. Streaming K-means on Well-clusterable Data. In: Proceedings of the Twenty-second Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia: Society for Industrial and Applied Mathematics: 2011. p. 26–40.
Shindler M, Wong A, Meyerson A. Fast and accurate k-means for large datasets. In: NIPS. USA: Curran Associates Inc.: 2011. p. 2375–383.
Aggarwal CC, Han J, Wang J, Yu PS. A framework for projected clustering of high dimensional data streams. In: (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004. VLDB Endowment: 2004. p. 852–63.
Zhou A, Cao F, Qian W, Jin C. Tracking clusters in evolving data streams over sliding windows. Knowl Inf Syst. 2008; 15(2):181–214.
Isaksson C, Dunham MH, Hahsler M. SOStream: Self organizing density-based clustering over data stream. In: MLDM. Berlin: Springer Berlin Heidelberg: 2012. p. 264–78.
Wang C, Lai J, Huang D, Zheng W. SVStream: A support vector-based algorithm for clustering data streams. IEEE Trans Knowl Data Eng. 2013; 25(6):1410–24.
Ben-Hur A, Horn D, Siegelmann HT, Vapnik V. Support vector clustering. J Mach Learn Res. 2001; 2:125–37.
Tax DMJ, Duin RPW. Support vector domain description. Pattern Recogn Lett. 1999; 20(11–13):1191–9.
Ankerst M, Breunig MM, Kriegel H, Sander J. OPTICS: ordering points to identify the clustering structure. In: SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1–3, 1999, Philadelphia, Pennsylvania, USA. New York: ACM: 1999. p. 49–60.
Tasoulis DK, Ross GJ, Adams NM. Visualising the cluster structure of data streams. In: Advances in Intelligent Data Analysis VII, 7th International Symposium on Intelligent Data Analysis, IDA 2007, Ljubljana, Slovenia, September 6–8, 2007, Proceedings: 2007. p. 81–92.
Hassani M, Spaus P, Gaber MM, Seidl T. Density-based projected clustering of data streams. In: Scalable Uncertainty Management - 6th International Conference, SUM 2012, Marburg, Germany, September 17–19, 2012. Proceedings: 2012. p. 311–24.
Chen Y, Tu L. Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12–15, 2007. New York: ACM: 2007. p. 133–42.
Wan L, Ng WK, Dang XH, Yu PS, Zhang K. Density-based clustering of data streams at multiple resolutions. TKDD. 2009; 3(3):14:1–14:28.
Liu X, Iftikhar N, Xie X. Survey of real-time processing systems for big data. In: 18th International Database Engineering & Applications Symposium, IDEAS 2014, Porto, Portugal, July 7–9, 2014. New York: ACM: 2014. p. 356–61.
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: Cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10. Berkeley, CA, USA: USENIX Association: 2010. p. 10–10.
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25–27, 2012. Berkeley: USENIX Association: 2012. p. 15–28.
Balazinska M, Balakrishnan H, Madden S, Stonebraker M. Fault-tolerance in the Borealis distributed stream processing system. ACM Trans Database Syst. 2008; 33(1):3:1–3:44.
Hwang J, Balazinska M, Rasin A, Çetintemel U, Stonebraker M, Zdonik SB. High-availability algorithms for distributed stream processing. In: Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, 5–8 April 2005, Tokyo, Japan. Washington: IEEE Computer Society: 2005. p. 779–90.
Meng X, Bradley JK, Yavuz B, Sparks ER, Venkataraman S, Liu D, Freeman J, Tsai DB, Amde M, Owen S, Xin D, Xin R, Franklin MJ, Zadeh R, Zaharia M, Talwalkar A. Mllib: Machine learning in apache spark. CoRR. 2015;abs/1505.06807.
Morales GDF, Bifet A. SAMOA: scalable advanced massive online analysis. J Mach Learn Res. 2015; 16:149–53.
Krempl G, Žliobaitė I, Brzeziński D, Hüllermeier E, Last M, Lemaire V, Noack T, Shaker A, Sievi S, Spiliopoulou M, et al. Open challenges for data stream mining research. ACM SIGKDD Explorations Newsletter. 2014; 16(1):1–10.
Gama J. A survey on learning from data streams: current and future trends. Prog AI. 2012; 1(1):45–55.
Gama J. Knowledge Discovery from Data Streams, 1st: Chapman & Hall/CRC; 2010.
Cormode G, Garofalakis MN, Haas PJ, Jermaine C. Synopses for massive data: Samples, histograms, wavelets, sketches. Found Trends Databases. 2012; 4(1–3):1–294.
Marz N, Warren J. Big Data: Principles and Best Practices of Scalable Realtime Data Systems: Manning Publications Co.; 2015.
Acknowledgements
This work has been supported by the French Foundation FSN, PIA Grant Big data-Investissements d’Avenir. The project is titled “Square Predict” (http://ns209168.ovh.net/squarepredict/). We thank the anonymous reviewers for their insightful remarks.
Authors’ contributions
All authors contributed equally. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ghesmoune, M., Lebbah, M. & Azzag, H. State-of-the-art on clustering data streams. Big Data Anal 1, 13 (2016). https://doi.org/10.1186/s4104401600113
DOI: https://doi.org/10.1186/s4104401600113