Table 12 Benefits and limitations of the primary studiesa

From: Software architectures for big data: a systematic literature review

Study Citation #

Benefit

Limitation

1

Complex magnitudes can be altered into smaller data subsets using a 5-level fusion model

Not compatible with social media applications

2

Efficiency for small as well as large queries. A scalable new runtime engine; all-LSM-based data storage with B+ tree, R tree, and keyword and n-gram indexes; a rich set of primitive types, including spatial, temporal, and textual types to handle Web and social media data; support for fuzzy selections and joins; a built-in notion of data feeds for continuous ingestion.

Continuous queries are not supported.

3

Minimization of construction waste. The intended tool will equip designers with well-informed, data-driven insights to optimize designs for designing out waste

This paper limits discussions to horizontally scaling Big Data platforms, particularly Apache Hadoop and the Berkeley Big Data Analytics Stack (BDAS). This selection is mainly influenced by the data and computational requirements of construction waste analytics, which include iterative algorithms, compute-intensive tasks, and near real-time visualisation.

4

Manage the cloud infrastructure, including an interface to create modified input descriptions, job scheduling, plotting of output data, and file management

No support for more complex plotting capabilities, such as contour plots, no workflow management system, no command-line installer

5

Covers the pipeline from sensors to data management, and supports a user who wants to set up a research or production infrastructure to collect very large datasets in the context of the IoT

The project is still at its beginning; as a consequence, the work done on this architecture has focused on data collection and storage

6

A scalable and productive platform to facilitate data scientists' work. The lightweight Docker visualization container supports multiple programming environments embedded within the workflow interface. The Spark infrastructure scales to big data sets and is transparent to the end user. The web interface provides a user-friendly data analytics environment accessible anywhere, anytime, and from any device. The cloud platform is also able to scale up and down based on the requirements of the user's data analytics work.

The workflow interface is to be enhanced to make it more open to data scientists, who will be able to revise and add widgets more conveniently

7

Web intelligence (WI) may be viewed as an enhancement or extension of artificial intelligence (AI) and IT on the Web. A prototype of a portable brain and mental health-monitoring system (brain-monitoring system, for short) to support the monitoring of brain and mental disorders.

The technological architecture for security and privacy protection should fit different application environments, including the Internet, IoT, and MI.

8

Wisdom Web of Things (W2T), where the "wisdom" means that each of the "things" in IoT and WoT is aware of both itself and others, so as to provide the right service for the right object at the right time and context.

A design method or development methodology, no matter how thorough, can never guarantee success. The application of an architecture-centric methodology like AABA requires discipline and creativity, which may be a tall order for organizations that do not have the required discipline and innovation mindset.

9

A valuable example for future Smart City platform designers, so that they can foresee some practical issues and refer to this solution when building their own smart city data platforms.

It is not possible for us to identify the concrete reasons why those sensors become abnormal. But this observation indicates that, for a smart city with a large number of deployed sensors, detecting anomalies in collected data must be seriously considered. That is why we implemented some anomaly detection algorithms as external processing tasks

10

Systematic design using tactics

The tactics could be characterized further, e.g., tactics that depend on each other, complex tactics, etc.

11

A processing chain model is proposed for satellite images on a private cloud computing platform

Currently our fault tolerance mechanism depends fully on the structure of ZooKeeper; all nodes are identical and there is no centralized control. When the route service fails, the work services in ZooKeeper will automatically recommend an alternative as the route service.

12

Architecture named SHMR (Semantic-based Heterogeneous Multimedia Retrieval) to support heterogeneous multimedia big data retrieval. Solves type heterogeneity.

The experimental dataset is acquired from specific websites such as Flickr, Wikipedia, and YouTube; the semantic provision by social users is still a simulation. Future experiments will be conducted in a real Internet environment, and the retrieval speed will be increased.

13

Cloud and RFID technologies are integrated for remote and real-time production data capture and tracking while intelligent techniques are used to generate effective production scheduling solutions

Better supply chain coordination and better production scheduling decisions remain to be achieved

14

Generating meaningful information from text-based social data

The efficiency of multi-processing is to be improved. The dynamic process controller will work as a load balancer in our system to mitigate gaps depending on system resources such as memory and CPU usage. We expect that the controller will dynamically control the number of processes according to the available hardware resources

15

The system provides tools to help with constructing "domain models" (i.e., families of keywords and extractors to enable focus on tweets and other social media documents relevant to a project), to rapidly extract and segment the relevant social media and its authors, to apply further analytics (such as finding trends and anomalous terms), and to visualize the results

Optimizations are underway, including a shift to Spark for management and pre-processing of the background corpora that support rapid domain scoping. Tools to enable comparisons between term generation strategies and other scoping tools are under development. A framework to enable "crowd-sourced" evaluation and feedback about the accuracy of extractors is planned. The team is working to support multiple kinds of documents (e.g., forums, customer reviews, and marketing content) for both background and foreground analytics. The team is also developing a persistent catalog for managing sets of topics and extractors; this will be structured using a family of industry-specific ontologies.

16

Web Observatories with rich, timely resources for observation and analysis. Individually, these feeds provide a resource to measure the current state, or health, of a social machine; combined, they have the potential to provide a collective pulse of the Web

A wider analysis of the current and proposed metrics for measuring social machine activity, and of how they contribute to understanding different classes of social machines, is required.

17

Provides scientific workflows to help remove technical burdens from researchers, allowing them to focus on solving their domain-specific problems.

Workflow scheduling techniques are to be explored

18

N/A

N/A

19

Proposes and examines the concept of event-based process predictions and outlines its potentials for planning, forecasting and eventually controlling business processes

With current techniques and systems, the available data cannot be analyzed in a reasonable time frame to derive sufficient business value from it

20

Helps education in the near future by changing the way we approach the e-learning process, by encouraging interaction between students and teachers, and by allowing the fulfilment of the individual requirements and goals of learners.

Further performance analysis to be done against high workload

21

The use of multi-agent systems in software development has two major benefits, given by the reusability and composability of the agents and by the higher level of abstraction introduced by the agent-oriented programming paradigm

Backed by the Prometheus Design Tool (PDT), an Eclipse plug-in

22

Enable the integration of disparate urban sensing systems, including individually owned data through participatory sensing.

The framework is to include more features, such as opportunistic task assignment by dynamically finding the most suitable group of sensing participants to gather information about a specific issue, sensor stream quality validation, and improved privacy and security.

23

Provides a better understanding of how fundamental assumptions in Hadoop's design make it a poor fit for real-time applications

24

To analyze the data generated during the construction of the barrier and detection of the intruder using camera sensors

An intruder is detected if it intersects with the sensor's path within the sensing range of the sensor. However, an intruder may go undetected if it is not within the sensing range of a sensor; and even if the intruder is detected, the sensor cannot communicate instantly with other sensors to pass on the information

25

Architecture system design, based on open distributed computing paradigms like Hadoop MapReduce, offering horizontal scalability and NoSQL flexibility while at the same time meeting the stringent quality and resilience requirements of banking software standards. Benefits: 1) the double-layer orchestration architecture allows for an effective decoupling of the external from the internal processing workflows; changes in the workflows due to external business requirements could be easily implemented without affecting the data processing structure. 2) The segmentation of map-reduce jobs into the triad barrier/map-reduce job/barrier, together with the orchestration database, provided an effective mechanism to orchestrate non-trivial data processing logic. 3) The orchestration database, containing data processing status at a configurable granularity level (on both data entities and processing steps), provides a reliable tool for the implementation of error monitoring, backup, and disaster recovery procedures. 4) Following this pattern, the introduction of new processing steps, such as new XSLT transformations on already defined data, requires minimal implementation effort, yielding an already parallelized process.

Among the observed pitfalls, the introduction of an external orchestration engine with advanced capabilities and a reliable database increases the cost, both in terms of platform infrastructure and development. At present, Oozie workflows are represented as simple directed acyclic graphs, which imposes limitations on the complexity of the workflow data processing that can be implemented.

26

An overview of cloud middleware services for interconnection of healthcare platforms

27

Separates the concerns of social CRM using architectural perspectives and aims at building a better understanding. The research method is a literature review in which artefacts are gathered and assigned to five layers: business, process, integration, software, and technology. The conclusion states that social CRM is an emergent research field and comprises a call for more artefacts that concretise abstracted components of the business layer.

28

Technology independent reference architecture for big data systems, which is based on analysis of published implementation architectures of big data use cases. An additional contribution is classification of related implementation technologies and products/services, which is based on analysis of the published use cases and survey of related work.

A limitation of the proposed classification is concentration on selected technologies in the survey. However, other authors have covered other technological topics in earlier surveys: batch processing, machine learning, data mining, storage systems, statistical tools, and document-oriented databases.

Another limitation of this work is that the reference architecture should be evaluated with a real big data use case, which would complete step 6 of the research method.

29

A big data software architecture that uses an ontology, based on the Experience API specification, to semantically represent the data streams generated by learners when they undertake the learning activities of a course.

To be improved with an Enterprise Service Bus able to integrate different data stream sources and a big data-oriented message queue to increase the activity stream performance.

30

Real-time Big Data analytical architecture for remote sensing satellite applications, with the capability of dividing, load balancing, and parallel processing of only useful data

Not compatible with Big Data analysis for all applications, e.g., sensors and social networking.

31

Mobile-based monitoring and visualization architecture for life-long diseases

To be expanded to detect and analyze other life-long disorders, such as Alzheimer's and Parkinson's disease

32

A model that uses a double cloud architecture and an optimized clustering algorithm to monitor massive network information in real time.

33

Cloud service architecture that explores a search cluster for data indexing and query

More analysis methods are required for the architecture extension to make the architecture generic. The architecture is to be expanded with support of running MapReduce-based analysis methods.

34

Contributions of this chapter are threefold: (1) we provide an overview of Big Data and Internet of Things technologies, including a summary of their relationships, (2) we present a case study in the smart grid domain that illustrates the high-level requirements towards such an analytical Big Data framework, and (3) we present an initial version of such a framework, mainly addressing the volume and velocity challenge.

Extend the analytical framework with the necessary mechanisms to achieve such uniform processing

35

To support the integration of massive number of infrastructure components and services in future smart cities.

To secure future communities, it is necessary to build large-scale, geospatial sensing networks, perform big data analysis, identify anomalous and hazardous events, and offer optimal responses in real time.

36

Processing of Big Data in real-time based on multi-agent system paradigms.

In the presented approach, it is strongly recommended to use the same event representation in both batch and online processing. It is argued that the approach is general purpose.

37

A use- and reuse-driven big data management approach that fuses data repository and data processing capabilities in a co-located public cloud.

Although much still needs to be done to fully realize the vision of use- and reuse-driven data management, the evaluations presented in Section 7 have clearly demonstrated the technical feasibility of managing big data in the cloud.

38

A real time data-analytics-as-service architecture that uses RESTful web services to wrap and integrate data services, dynamic model training services (supported by big data processing framework), prediction services and the product that uses the models.

The machine learning algorithms supported are limited to the applied machine learning library and big data frameworks.

39

An efficient system for managing and analyzing PB-level structured data, called Banian

To achieve higher processing performance and scalability, Banian does not support partial update and deletion of table data, and its support for transaction consistency is not very strong. Therefore, it is not yet a full-fledged replacement for a parallel database. In the future, the above-mentioned weaknesses of Banian will be addressed with further research efforts.

40

Real-time bus location and real-time traffic situation, especially the real-time traffic situation nearby, through open data, GPS, GPRS, and cloud technologies. With the highly scalable cloud technologies Hadoop and Spark, the proposed system architecture was first implemented successfully and efficiently.

In the future, we expect to apply this system to all roads in Taichung and to improve the accuracy of estimates.

41

To make better Product Lifecycle Management and Cleaner Production decisions based on the data collected from smart sensing devices

Without proper data preparation and an accurate model, data mining is apt to generate useless information.

42

Service-oriented operation model of China's power system, which integrates the concepts and techniques of cloud computing, big data analytics, the Internet of Things (IoT), high-performance computing, smart grid, and other advanced information and communication technologies (ICTs).

There is huge space for the future development of CG in China. We should also note that CG is a complex system engineering effort. The implementation of CG needs more mature techniques, as well as the collaborative support of government, industry, and academia. The smart grid is in its infancy, and the implementation of CG is also a gradual process. The application of CG needs more comprehensive theoretical support and more extensive experimental demonstrations.

43

A stateful complex event processing framework to support hybrid online and on-demand queries over real-time data.

The scalability of the in-memory state persistence model: when the stateful CEP system processes a large number of long-running online queries, the system memory may be drained.

a The table is presented respecting the content of the primary studies