- Open Access
A comparison on scalability for batch big data processing on Apache Spark and Apache Flink
© The Author(s) 2016
- Received: 13 August 2016
- Accepted: 21 October 2016
- Published: 1 March 2017
The ever-growing amount of data has created a need for new processing frameworks. The MapReduce model is a framework for processing and generating large-scale datasets with parallel and distributed algorithms. Apache Spark is a fast and general engine for large-scale data processing based on the MapReduce model, whose main feature is in-memory computation. Recently, a novel framework called Apache Flink has emerged, focused on distributed stream and batch data processing. In this paper we perform a comparative study on the scalability of these two frameworks using their corresponding Machine Learning libraries for batch data processing. Additionally, we analyze the performance of the two Machine Learning libraries that Spark currently provides, MLlib and ML. The same algorithms and the same dataset are used in all experiments. Experimental results show that Spark MLlib has better performance and overall lower runtimes than Flink.
- Big data
- Machine learning
With the ever-growing amount of data, the need for frameworks to store and process this data is increasing. In 2014, IDC predicted that by 2020 the digital universe will be 10 times as big as it was in 2013, totaling an astonishing 44 zettabytes. Big Data is not only a huge amount of data, but also a new paradigm and a set of technologies that can store and process this data. In this context, a set of new frameworks focused on storing and processing huge volumes of data have emerged.
MapReduce and its open-source implementation Apache Hadoop [3, 4] were the first distributed programming techniques to face Big Data storage and processing. Since then, several distributed tools have emerged as a consequence of the spread of Big Data. Apache Spark [5, 6] is one of these new frameworks, designed as a fast and general engine for large-scale data processing based on in-memory computation. Apache Flink is a novel and recent framework for distributed stream and batch data processing that is attracting much attention because of its streaming orientation.
Most of these frameworks include their own Machine Learning (ML) library for Big Data processing. The first one was Mahout (as part of Apache Hadoop), followed by MLlib, which is part of the Spark project. Flink also has its own ML library which, while not as powerful or complete as Spark's MLlib, is starting to include some classic ML algorithms.
In this paper, we present a comparative study between the ML libraries of these two powerful and promising frameworks, Apache Spark and Apache Flink. Our main goal is to show the differences and similarities in performance between these two frameworks for batch data processing. For the experiments, we use two algorithms present in both ML libraries, Support Vector Machines (SVM) and Linear Regression (LR), on the same dataset. Additionally, we have implemented a feature selection algorithm in both frameworks in order to compare how each one operates.
In this section, we describe the MapReduce framework and two extensions of it, Apache Spark and Apache Flink.
MapReduce is a framework that has represented a revolution since Google introduced it in 2003. This framework processes and generates large datasets in a parallel and distributed way. It is based on the divide-and-conquer principle: briefly, the framework splits the input data and distributes it across the cluster, then the same operation is performed on each split in parallel. Finally, the results are aggregated and returned to the master node. The framework manages all the task scheduling, monitoring, and re-execution of failed tasks.
The MapReduce model is composed of two phases: Map and Reduce. Before the Map operation, the master node splits the dataset and distributes it across the computing nodes. Then the Map operation is applied to every key-value pair of each node's local data, producing a set of intermediate key-value pairs. Once all Map tasks have finished, the results are grouped by key and redistributed so that all pairs belonging to one key reside on the same node. Finally, the groups are processed in parallel by the Reduce operation.
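The two phases can be sketched in a few lines of plain Python. This is a toy, single-process illustration of the model on a hypothetical word-count task, not a distributed implementation; the list of splits stands in for the per-node data:

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Map: emit one (word, 1) pair per word in the local split.
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as MapReduce does between the phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values of each key (sequentially here,
    # in parallel across nodes in the real framework).
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big", "data processing"]  # one split per computing node
intermediate = chain.from_iterable(map_phase(s) for s in splits)
counts = reduce_phase(shuffle(intermediate))  # {'big': 2, 'data': 2, 'processing': 1}
```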
However, the MapReduce model has some important limitations:
- Low inter-communication capability
- Inadequacy for in-memory computation
- Poor performance for online and iterative computing
Apache Spark [5, 6] is a framework aimed at performing fast distributed computing on Big Data by using in-memory primitives. This platform allows user programs to load data into memory and query it repeatedly, making it a well-suited tool for online and iterative processing (especially for ML algorithms). Its development was motivated by the limitations of the MapReduce/Hadoop paradigm [4, 10], which forces programs to follow a linear dataflow with intensive disk usage.
Spark is based on distributed data structures called Resilient Distributed Datasets (RDDs). Operations on RDDs automatically place tasks into partitions, maintaining the locality of persisted data. Beyond this, RDDs are an immutable and versatile tool that lets programmers persist intermediate results in memory or on disk for reusability purposes, and customize the partitioning to optimize data placement. RDDs are also fault-tolerant by nature: the lazy operations performed on each RDD are tracked using a "lineage", so that each RDD can be reconstructed at any moment in case of data loss.
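The lazy evaluation and lineage-based recovery of RDDs can be illustrated with a minimal pure-Python sketch. The `ToyRDD` class and its methods below are hypothetical illustrations of the idea, not Spark's actual API:

```python
class ToyRDD:
    """Toy immutable dataset that records its lineage instead of computing eagerly."""

    def __init__(self, source, lineage=()):
        self._source = source    # base data, always kept as the recovery point
        self._lineage = lineage  # recorded transformations, applied lazily

    def map(self, fn):
        # Transformations only return a new ToyRDD; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # An action replays the lineage over the base data. After a partition
        # loss, the same replay reconstructs the result from the source.
        data = list(self._source)
        for op, fn in self._lineage:
            data = list(map(fn, data)) if op == "map" else [x for x in data if fn(x)]
        return data

even_squares = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()  # the lineage is only evaluated here: [0, 4, 16]
```

Because the lineage is just a recorded sequence of transformations over an immutable source, it can be replayed as many times as needed, which is the essence of RDD fault tolerance.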
Spark SQL: introduces DataFrames, a new data structure for structured (and semi-structured) data. DataFrames offer the possibility of introducing SQL queries into Spark programs. Spark SQL provides SQL language support, with command-line interfaces and ODBC/JDBC drivers.
Spark Streaming: allows us to use Spark's API in streaming environments by using mini-batches of data that are quickly processed. This design enables the same batch code (formed by RDD transformations) to be reused in streaming analytics with almost no change. Spark Streaming can work with several data sources, such as HDFS, Flume or Kafka.
Machine Learning library (MLlib): is formed by common learning algorithms and statistical utilities. Its main functionalities include classification, regression, clustering, collaborative filtering, optimization, and dimensionality reduction. This library has been especially designed to simplify ML pipelines in large-scale environments. In the latest versions of Spark, the ML library has been divided into two packages: MLlib, built on top of RDDs, and ML, built on top of DataFrames for constructing pipelines.
Spark GraphX: is the graph processing system in Spark. Thanks to this engine, users can view, transform and join graphs and collections interchangeably. It also allows expressing graph computations using the Pregel abstraction.
Apache Flink is a recent open-source framework for distributed stream and batch data processing. It is focused on processing large amounts of data with very low latency and high fault tolerance on distributed systems. Flink's core feature is its ability to process data streams in real time.
Apache Flink offers a fault tolerance mechanism to consistently recover the state of data streaming applications. This mechanism generates consistent snapshots of the distributed data stream and operator state; in case of failure, the system can fall back to these snapshots.
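The snapshot-and-restore idea can be sketched with a toy stateful operator in plain Python. The class, method and event names are illustrative assumptions; Flink's actual mechanism takes coordinated, asynchronous snapshots across a distributed pipeline:

```python
import copy

class CountingOperator:
    """Toy streaming operator whose state can be snapshotted and restored."""

    def __init__(self):
        self.state = {}  # per-key running counts

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def snapshot(self):
        # A consistent copy of the operator state, as a checkpoint would store it.
        return copy.deepcopy(self.state)

    def restore(self, snap):
        self.state = copy.deepcopy(snap)

op = CountingOperator()
for event in ["a", "b", "a"]:
    op.process(event)
checkpoint = op.snapshot()     # taken at a consistent point in the stream

op.process("corrupted-event")  # work lost to a simulated failure
op.restore(checkpoint)         # fall back to the last snapshot
for event in ["b"]:            # replay the events after the checkpoint
    op.process(event)
# op.state is now {"a": 2, "b": 2}, as if the failure had never happened
```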
It also supports both stream and batch data processing through its two main APIs: DataStream and DataSet. Both APIs are built on top of the underlying stream processing engine.
Gelly: is the graph processing system in Flink. It contains methods and utilities for the development of graph analysis applications.
FlinkML: this library aims to provide a set of scalable ML algorithms and an intuitive API. It contains algorithms for supervised learning, unsupervised learning, data preprocessing, recommendation and other utilities.
Table API and SQL: is a SQL-like expression language for relational stream and batch processing that can be embedded in Flink’s data APIs.
FlinkCEP: is the complex event processing library. It allows detecting complex event patterns in streams.
Although Flink is a new platform, it is constantly evolving with new additions, and it has already been adopted as a real-time processing framework by several big companies, such as ResearchGate, Bouygues Telecom, Zalando and Otto Group.
This section describes the experiments carried out to show the performance of Spark and Flink using three ML algorithms on the same huge dataset: SVM, LR and the DITFS feature selection algorithm.
The dataset used for the experiments is the ECBDL14 dataset. This dataset was used at the ML competition of the Evolutionary Computation for Big Data and Big Learning workshop held on July 14, 2014, under the international conference GECCO-2014. It consists of 631 features (including both numerical and categorical attributes) and 32 million instances. It is a binary classification problem with a highly imbalanced class distribution: only 2% of the instances are positive. Two pre-processing steps were applied to this problem. First, the Random OverSampling (ROS) algorithm was applied in order to replicate the minority-class instances of the original dataset until the number of instances in both classes was equal, yielding a total of 65 million instances. Then, for the DITFS algorithm, the dataset was discretized using the Minimum Description Length Principle (MDLP) discretizer.
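As an illustration of the ROS step, here is a minimal in-memory sketch in plain Python. The function name and the toy data are assumptions for the example; the actual pre-processing was run in a distributed fashion over the full dataset:

```python
import random

def random_oversampling(data, seed=0):
    """data: list of (features, label) pairs with labels 0/1.
    Returns a class-balanced list by replicating minority instances."""
    rng = random.Random(seed)
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Replicate randomly chosen minority instances until both classes
    # have the same number of instances.
    replicas = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + replicas

# Imbalanced toy set: 1 positive vs 4 negatives (ECBDL14 has ~2% positives).
sample = [([0.1], 1), ([0.2], 0), ([0.3], 0), ([0.4], 0), ([0.5], 0)]
balanced = random_oversampling(sample)  # 8 instances, 4 of each class
```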
The original dataset has been sampled randomly at five different rates in order to measure the scalability of both frameworks: 10, 30, 50, 75 and 100% of the pre-processed dataset. Due to a current Flink limitation, we have employed a subset of 150 features of each ECBDL14 sample for the SVM learning algorithm.
Summary description for the ECBDL14 dataset:

| Sample (%) | Instances | Total values (instances × 631 features) |
|------------|-----------|-----------------------------------------|
| 10 | 6,500,391 | 4,101,746,721 |
| 30 | 19,501,174 | 12,305,240,794 |
| 50 | 32,501,957 | 20,508,734,867 |
| 75 | 48,752,935 | 30,763,101,985 |
| 100 | 65,003,913 | 41,017,469,103 |
We have established 100 iterations, a step size of 0.01 and a regularization parameter of 0.01 for SVM. For LR, 100 iterations and a step size of 0.00001 are used. Finally, for DITFS, 10 features are selected using the minimum Redundancy Maximum Relevance (mRMR) algorithm.
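For illustration, a linear SVM trained by full-batch subgradient descent with these same hyperparameter values can be sketched in plain Python. This is a toy single-machine version on hypothetical data, not the distributed MLlib/FlinkML implementations used in the experiments:

```python
def train_linear_svm(X, y, iterations=100, step_size=0.01, reg=0.01):
    """Subgradient descent on L2-regularized hinge loss.
    X: list of feature vectors, y: labels in {-1, +1}. Returns the weights."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iterations):
        grad = [reg * wj for wj in w]  # gradient of the L2 regularization term
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:             # hinge-loss subgradient for violating points
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - step_size * gj for wj, gj in zip(w, grad)]
    return w

# Linearly separable toy data
X = [[1.0, 2.0], [2.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)  # 100 iterations, step 0.01, regularization 0.01
predictions = [1 if sum(wj * xj for wj, xj in zip(w, xi)) >= 0 else -1 for xi in X]
```

The distributed implementations optimize the same objective, but compute the per-partition gradient contributions in parallel and aggregate them on the driver at each iteration.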
As evaluation criteria, we have employed the overall learning runtime (in seconds) for SVM and LR, as well as the overall runtime for DITFS.
For all experiments we have used a cluster composed of 9 computing nodes and one master node. The computing nodes have the following characteristics: 2 processors x Intel Xeon CPU E5-2630 v3, 8 cores per processor, 2.40 GHz, 20 MB cache, 2 x 2TB HDD, 128 GB RAM. Regarding the software, we have used the following configuration: Hadoop 2.6.0-cdh5.5.1 from Cloudera's open-source Apache Hadoop distribution; Apache Spark and MLlib 1.6.0 with 279 cores (31 cores/node) and 900 GB RAM (100 GB/node); and Apache Flink 1.0.3 with 270 TaskManagers (30 TaskManagers/node) and 100 GB RAM/node.
SVM learning time in seconds
LR learning time in seconds
DITFS runtime in seconds
In this paper, we have performed a comparative study on the scalability for batch data processing of two popular frameworks for processing and storing Big Data, Apache Spark and Apache Flink. We have tested these two frameworks using SVM and LR as learning algorithms, present in their respective ML libraries. We have also implemented and tested a feature selection algorithm on both platforms. Apache Spark has shown to be the framework with better scalability and overall faster runtimes. Although the differences between Spark's MLlib and Spark ML are minimal, MLlib performs slightly better than Spark ML. These differences can be explained by the internal transformations from DataFrame to RDD needed to reuse the algorithm implementations present in MLlib.
Flink is a novel framework, while Spark is becoming the reference tool in the Big Data environment. Spark has seen several performance improvements over its successive releases, while Flink has just reached its first stable version. Although some of the Apache Spark improvements are already present by design in Apache Flink, Spark is much more refined than Flink, as the results show.
Apache Flink has great potential and a long way still to go. With the necessary improvements, it can become a reference tool for distributed data streaming analytics. A study on data streaming, the theoretical strength of Apache Flink, remains pending.
This work is supported by the Spanish National Research Project TIN2014-57251-P, and the Andalusian Research Plan P11-TIC-7765. S. Ramirez-Gallego holds a FPU scholarship from the Spanish Ministry of Education and Science (FPU13/00047).
Availability of data and materials
The ECBDL14 dataset is freely available from the Evolutionary Computation for Big Data and Big Learning Workshop website.
DG and SR carried out the comparative study and drafted the manuscript. SG and FH conceived of the study, participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- IDC. The Digital Universe of Opportunities. http://www.emc.com/infographics/digital-universe-2014.htm. Accessed 14 July 2016.
- Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6. OSDI'04. Berkeley: USENIX Association; 2004. p. 10.
- Apache Hadoop Project. Apache Hadoop. http://hadoop.apache.org. Accessed 14 July 2016.
- White T. Hadoop: The Definitive Guide. Sebastopol: O'Reilly Media, Inc; 2012.
- Hamstra M, Karau H, Zaharia M, Konwinski A, Wendell P. Learning Spark: lightning-fast big data analytics. Sebastopol: O'Reilly Media; 2015.
- Spark A. Apache Spark: lightning-fast cluster computing. http://spark.apache.org. Accessed 14 July 2016.
- Flink A. Apache Flink. http://flink.apache.org. Accessed 14 July 2016.
- Apache Mahout Project. Apache Mahout. http://mahout.apache.org. Accessed 14 July 2016.
- MLlib. Machine Learning Library (MLlib) for Spark. http://spark.apache.org/docs/latest/mllib-guide.html. Accessed 14 July 2016.
- Lin JJ. MapReduce is good enough? If all you have is a hammer, throw away everything that's not a nail! Big Data. 2012; 1(1):28–37.
- Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. NSDI'12. Berkeley: USENIX Association; 2012. p. 2.
- Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S, Xin D, Xin R, Franklin MJ, Zadeh R, Zaharia M, Talwalkar A. MLlib: Machine learning in Apache Spark. J Mach Learn Res. 2016; 17(34):1–7.
- Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD ’10. New York: ACM: 2010. p. 135–46. doi:http://dx.doi.org/10.1145/1807167.1807184.
- Apache Spark Project. Project Tungsten (Apache Spark). https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html. Accessed 14 July 2016.
- Apache Flink Project. Peeking Into Apache Flink’s Engine Room. https://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html. Accessed 14 July 2016.
- Brown G, Pocock A, Zhao MJ, Luján M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J Mach Learn Res. 2012; 13:27–66.
- Jaggi M, Smith V, Takác M, Terhorst J, Krishnan S, Hofmann T, Jordan MI. Communication-efficient distributed dual coordinate ascent. CoRR. 2014:3068–76. abs/1409.1458.
- del Río S, López V, Benítez JM, Herrera F. On the use of MapReduce for imbalanced big data using random forest. Inf Sci. 2014; 285:112–37.
- Ramírez-Gallego S, García S, Mouriño-Talín H, Martínez-Rego D. Distributed entropy minimization discretizer for big data analysis under apache spark. In: Trustcom/BigDataSE/ISPA, 2015 IEEE: 2015. p. 33–40. doi:http://dx.doi.org/10.1109/Trustcom.2015.559.
- Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. J Bioinforma Comput Biol. 2005; 3(02):185–205.
- Evolutionary Computation for Big Data and Big Learning Workshop. http://cruncher.ncl.ac.uk/bdcomp/. Accessed 14 July 2016.