A Big Data Analysis Framework Using Apache Spark and Deep Learning

Abstract

With the growing prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decade and have become massively popular, especially in industry. It is becoming increasingly evident that effective big data analysis is key to solving artificial intelligence problems. To this end, a multi-algorithm library called MLlib was implemented in the Spark framework. While this library supports multiple machine learning algorithms, there is still scope to use the Spark setup efficiently for highly time-intensive and computationally expensive procedures like deep learning. In this paper, we propose a novel framework that combines the distributed computing abilities of Apache Spark with the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning. We conduct an empirical analysis of our framework on two real-world datasets. The results are encouraging and corroborate our proposed framework, in turn indicating that it is an improvement over traditional big data analysis methods that use either Spark or deep learning as individual elements.
