Table 13 Difference and importance of the primary studies (a)

From: Software architectures for big data: a systematic literature review

Columns: Study citation #, Difference, Importance

Study 1
Difference: Data fusion model with five processing levels.
Importance: Based on a partitioning and aggregation technique for big data; focuses on improving computational efficiency.

Study 2
Difference: Native data storage and indexing, as well as querying of datasets in HDFS or local files. An open data model that handles complex nested data as well as flat data, and use cases ranging from "schema first" to "schema never".
Importance: Open source. A full-function BDMS that is best characterized as a cross between a big data analytics platform, a parallel RDBMS, and a NoSQL store, yet different from each.

Study 3
Difference: Not applicable.
Importance: The first big data based architecture for construction waste analytics.

Study 4
Difference: There are still no available frameworks or middleware solutions dedicated to supporting scientific applications in such a way that (1) users can easily upload their program to the cloud, and (2) a user-friendly interface is automatically generated for them to run it.

Study 5
Difference: Aims to equip an academic campus with sensors and supports the definition of innovative applications exploiting these data.
Importance: It triggers interesting challenges regarding the scalability of community-driven usage of such an open data platform, the evolution capabilities of the Data as a Service API, as well as privacy and security issues.

Study 6
Difference: (1) Integrates a web-based workflow interface with Spark to support big data analytics; (2) utilizes Docker to create a lightweight virtualization environment that supports a variety of program development environments and facilitates user program/widget management; (3) demonstrates the capabilities of the workflow-based data analytics platform with real-world electric power industry data.
Importance: Having a productive data analytics cloud platform by integrating a variety of data analytics tools and packages with a high-level workflow interface.

Study 7
Difference: Cloud computing primarily focuses on the system resource architecture of IT applications, that is, infrastructures, platforms, and software (developing and scheduling abilities). For the large-scale converging of intelligent IT applications, it is necessary to develop an open and interoperable intelligence service architecture for the contents of IT applications: the data, information, knowledge, and wisdom (DIKW).
Importance: Supports the challenges of investigating human BI via research on holistic intelligence, and of collecting, managing, and mining BI big data to gain a systematic investigation and understanding of human intelligence.

Study 8
Difference: Agile big data analytics for web-based systems: an architecture-centric approach.
Importance: The first of its kind; AABA fills a methodological void by adopting an architecture-centric approach, advancing and integrating software architecture analysis and design, big data modeling, and agile practices.

Study 9
Difference: A citywide testbed, with regard to wireless network topology, reliable data transmission, battery lifetime, and programmability of deployed sensor nodes.
Importance: There is still a gap between what a big data platform for smart cities looks like at a high level and how it should be properly realized. To fill this gap, the paper presents a concrete and valuable example by introducing the city data and analytics platform CiDAP.

Study 10
Difference: Existing catalogs do not contain tactics specific to big data systems.
Importance: Expanding the collection of architecture tactics presented in the paper and encoding these in an environment that supports navigation between quality attributes and tactics, making crosscutting concerns for design choices explicit.

Study 11
Difference: There are three major differences. First, the types of images are much more diverse in this environment, including classify-image, pan image, DEM, etc. Second, the number of bands may exceed three; for instance, the multispectral data of the ZY-3 satellite has four bands. Third, many coordinate systems coexist in the system; for example, the ZY-3 satellite has its own RPC parameters.
Importance: A new computing model, the Remote Sensing On-Demand Computing (RS-Demand) model, that overcomes these challenges. The key idea behind RS-Demand is to treat remote sensing image processing as chained computing procedures in memory: image tiles go through the algorithm node and reach the end-user screen on the fly. Its properties of no software installation, transparent processing and storage, and low bandwidth requirements are critical in emergency applications.

Study 12
Difference: Excellent economic efficiency.
Importance: Excellent economic efficiency.

Study 13
Difference: No research has examined the production scheduling problems of a distributed manufacturing company from a holistic perspective.
Importance: Make-to-order, labor-intensive manufacturing to improve information visibility and transparency.

Study 14
Difference: Firstly, the data storages of the previous approaches are based on a relational database, which may cause performance issues when huge datasets ranging from a few terabytes to multiple petabytes need to be handled. Secondly, they do not support distributed processing, which may slow down processing time. Lastly, they collect data from only a single source channel, such as Twitter.
Importance: Own sentiment analysis model from previous research, which guarantees higher accuracy. Previous approaches mainly used a relational database as the main data storage.

Study 15
Difference: The key novelties in the system are: (a) enabling iterative, rapid domain scoping that takes advantage of several advanced text analytics tools, and (b) the development of a data-centric approach to support the overall lifecycle of flexible, iterative analytics exploration in the social media domain.
Importance: Alexandria advances the state of the art of social media analytics in two fundamental ways (see also Section VIII). First, the system brings together several text analytics tools to provide a broad-based environment to rapidly create domain models; this contrasts with research that has focused on perfecting such tools in isolation. Second, Alexandria applies data-centric and other design principles to provide a working platform that supports ad hoc, iterative, and collaborative exploration of social media data.

Study 16
Difference: Working on dynamic methods of integrating streams. One approach involves monitoring the overall message rate of a given set of streams (i.e., posts per minute) and using fluctuations in stream volumes as an early indicator for combining streams undergoing similar changes.
Importance: From a social machines researcher's perspective, the ability to access unified, and in certain circumstances integrated, real-time streams of activity is an essential resource to understand, analyse, and possibly make predictions about the current state of a social machine's health.

Study 17
Difference: For big data workflows in the cloud, a more generic, implementation-independent solution.
Importance: For big data workflows in the cloud, a more generic, implementation-independent solution.

Study 18

Study 19
Difference: The database is potentially available for event-based predictions of its manufacturing processes.
Importance: Outlined the potential of event-based predictions to plan and eventually control business processes. Besides outlining these potentials, a general concept for event-based predictions was conceived and the current state of the art was discussed.

Study 20
Difference: A three-step system architecture for a consortium of universities.
Importance: Efficient: the entire solution described is efficient because activities are separated across levels and resources, the traffic is managed by Hadoop in clouds, and the analysis is able to add graphic representations to other types of results.

Study 21
Difference: Prometheus has proven to be a practical agent-oriented methodology.
Importance: Prometheus has proven to be a practical agent-oriented methodology.

Study 22
Difference: The existing research literature lacks a generic data collection and dissemination system. Existing approaches are application-specific, hindering their scalability and reuse.
Importance: Gathering real-time information produced by such disparate existing systems can improve the management of city resources.

Study 23
Importance: Illustrating the challenges of real-time data processing.

Study 24
Difference: Inadequate research exists not only on the quality measurement of the image in terms of width and resolution, but also on the limited mobility of camera sensors. To address this problem, we propose an energy-efficient barrier construction algorithm where all camera sensors have limited mobility. We also provide a better solution for intruder detection with the help of this barrier line.
Importance: No barrier construction protocol proposed so far considers all three functionalities, namely node mobility, rotation of the camera sensors, and quality of measurement of the WSN, to detect the intruder efficiently. Moreover, camera sensors are normally expensive, and efficient detection of an intruder with a minimum number of camera sensors is a challenging research issue.

Study 25
Difference: A new distributed computing paradigm based on the highly scalable and fault-tolerant map-reduce model, running on commodity-class servers, opens a new opportunity.
Importance: An architectural and design pattern for the adoption of these new technologies in the solution of massive data processing and analytics tasks of investment and financial institutions, adapted to the strict requirements imposed by the banking technological model: rich and complex workflows, massive volumes, an enormous variety of data structures that must be combined together, and stringent requirements of reliability, consistency (every single record counts), data back-up, and persistency.

Study 26
Importance: Fills a gap in the electronic healthcare register literature by providing an overview of cloud computing middleware services and standardized interfaces for the integration with medical devices.

Study 27

Study 28

Difference: Conceptual work integrating the approaches into one coherent reference architecture has been limited; other works did not focus specifically on architectural issues or on an explicit classification of technologies and commercial products/services.
Importance: Concentrated on a reference architecture.

Study 29
Difference: No proposal exists that:
(i) collects in real time the large volume of information generated during a course;
(ii) represents and stores this information following a standard specification to facilitate its interoperability with learning analytics services;
(iii) enables these services to effectively access the information generated in the learners' activities; and
(iv) offers a set of intelligent learning analytics services that provide new and valuable information to teachers in order to make better decisions for improving the quality of the learning and teaching processes.

Importance: Teachers are otherwise not aware of what learners are doing, making it difficult to improve or correct any deficiencies.

Study 30
Difference: The majority of work has been done in various fields of remote sensing satellite image data, such as change detection [6], gradient-based edge detection [7], region similarity-based edge detection [5], and intensity gradient techniques for efficient intra-prediction [31].
Importance: Efficiently analyzing real-time remote sensing big data using an earth observatory system.

Study 31
Difference: Data intensive; availability.
Importance: Introduces a mobile system to monitor patients while they receive professional healthcare.

Study 32
Difference: A faster operating speed, strong reliability, and a faster convergence rate; especially as the amount of data increases, the advantage in convergence speed becomes more obvious.
Importance: Uses a double cloud architecture to make full use of cloud resources and network bandwidth.

Study 33
Difference: Existing tools are mostly provided as part of IaaS or PaaS cloud services. These monitoring systems are provisioned by the cloud service providers. They often have limitations in adding analysis methods beyond simple aggregations and threshold-based settings.
Importance: The paper presents a cloud architecture leveraging SolrCloud, the open source search-based cluster that supports large monitoring data storage, query, and processing. This architecture is integrated with Semantic MediaWiki, which allows documenting, structuring, and sharing the source of cloud monitoring data as well as any analysis results.

Study 34
Difference: The terms are increasingly used interchangeably and the corresponding solutions follow similar principles.
Importance: Addresses the lack of an analytical framework that pulls all these components together such that services for urban decision makers can easily be developed.

Study 35
Difference: In contrast to the cloud, the fog not only performs latency-sensitive applications at the edge of the network, but also performs latency-tolerant tasks efficiently at powerful computing nodes in the intermediate part of the network.
Importance: To secure future communities, it is necessary to build large-scale geospatial sensing networks, perform big data analysis, identify anomalous and hazardous events, and offer optimal responses in real time.

Study 36
Importance: Shown how autonomous agents can enhance the architecture and provide capabilities for robust processing of data in real time.

Study 37
Difference: Although seismology data repositories exist, they usually mandate data access methods and processing tools, and have very limited search options that are mostly geared towards seismology research. In contrast, we purposefully avoid prescribing and limiting what the data shall be used for and how they are used.
Importance: It answers the urgent data management needs of the growing number of researchers who do not fit in the big science/small science dichotomy.

Study 38
Difference: Other big data batch processing frameworks and their machine-learning layers are usually not designed to be easily invokable as well-designed model training services by different users. It is also out of the scope of these frameworks to enable the resulting learned models to be used by different external products, not to mention dealing with the real-time requirements of these products.
Importance: The main contribution made in the paper is a real-time data analytics service architecture design that allows a machine learning model to be continually updated by real-time data and wraps the big data processing framework as reusable services.

Study 39
Difference: Banian overcomes the storage structure limitation of relational databases and effectively integrates interactive query with large-scale storage management.
Importance: By combining HDFS with the splitting and scheduling model, Banian effectively integrates large-scale storage management with interactive query and analysis.

Study 40
Difference: The proposed architecture combines Spark with the distributed computation of Hadoop YARN to enhance performance.
Importance: Because the collected real-time data is huge and comes from different attributes, the thesis utilizes a novel cloud big data architecture to store, process, and analyze a huge amount of real-time data and thus provide useful information.

Study 41
Difference: The research above mainly focuses on how to apply IoT-related techniques to one stage of PLM (such as the manufacturing process of BOL), and an overall solution for the whole lifecycle is seldom investigated. There is a lack of a systematic solution for the automatic identification and capture of lifecycle data.
Importance: A novel concept of integrating big data analytics with product service.

Study 42
Difference: Other research works have paid less attention to the overall penetration of the cloud computing service model into the whole process of power system operation.
Importance: A new operational model of the power system. It can support the cost-efficient and environmentally friendly operation and development of China's power industry.

Study 43
Difference: Traditional CEP systems do not consider data variety and only support online queries.
Importance: A semantically enriched event and query model to address data variety.

(a) The table is presented respecting the content of the primary studies.