Is Apache Spark the Next Big Thing in Big Data?
Almost any article or blog post that mentions Big Data also says something about Hadoop. When it comes to Big Data, Apache Hadoop has been the elephant in the room, and the release of Hadoop 2.0 in 2013 made the environment easier to use and more stable. But even with the addition of Impala for querying stored data in real time, Hadoop is still a batch-based system that processes data in, well, batch mode.
Big Data processing is said to have three main characteristics, a.k.a. the 3Vs: volume, velocity and variety. Many data scientists and Big Data engineers will argue that Apache Hadoop processes data fast enough to meet the criterion of velocity. Perhaps so. However, coming from a real-time and high-performance computing background, I find that this argument about Hadoop and velocity reminds me of engineers and computer scientists in the ’90s who would ask, “Why do you need 64 bits when 32 bits work perfectly fine?”
The need to process Big Data really, really fast, at higher velocities than Hadoop, has left the door open for a myriad of proprietary and open source solutions. It has also led the Apache Software Foundation (the “grey beards” sponsoring Hadoop and others) to fast-track viable projects from incubator status to top-level projects.
Enter Apache Spark, a “lightning-fast cluster computing” solution for Big Data processing. For those of you who run Apache Hadoop 2.0, how would you like to run programs up to 100 times faster than Hadoop MapReduce in memory, or up to 10 times faster on disk? How would you like to write applications quickly in Java, Python or Scala, and build parallel applications that take advantage of your distributed environment? How would you like to combine SQL, streaming and complex analytics in the same application? And finally, how would you like to still access and process all the data in your current Hadoop environment? This is what Apache Spark can do.
So what is Apache Spark? Developed in Scala, Spark is an open source distributed computing framework for advanced analytics that can leverage much of the Hadoop storage environment (like HDFS). It was originally developed as a research project at UC Berkeley's AMPLab. In June 2013, Spark achieved incubator project status at the Apache Software Foundation. In February 2014, it was promoted to a top-level project.
As of this writing, Apache Spark 0.9.0 is available as open source.