Apache Spark 1.5.0 released

Apache Spark 1.5.0, recently released, is the sixth release on the 1.x line. It represents 1400+ patches from 230+ contributors and 80+ institutions. Apache Spark is a fast, general engine for large-scale data processing. It has an advanced DAG execution engine that supports acyclic data flow and in-memory computing, and it runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.


Spark 1.5.0 changelog overview

  • RDD in-memory storage UI becomes unresponsive when the number of RDD partitions is large
  • Web UI stage page becomes unresponsive when the number of tasks is large
  • Sort-based Aggregation
  • Scala style: disallow trailing spaces
  • Dimension table broadcast shouldn’t be eager
  • Support Scala/Java UDAF
  • Create python bindings for Streaming KMeans
  • Streaming Linear Regression: Python bindings
  • Simplify the Aggregation Function implementation
  • NPE with new Parquet Filters
  • Partial aggregation support for DISTINCT aggregation
  • Paginate stage page to avoid OOM with > 100,000 tasks
  • UDF clean up
  • Stabilize Spark SQL data type API followup
  • Stabilize data types
  • AND predicates are not properly pushed down
  • Add VectorSlicer
  • Model import/export for LDAModel

Visit the release notes to read about the new features

Download Spark

 