Marcin Szymaniuk

TantusData

Apache Spark? If it only worked.

Spark has a very nice API and promises high performance when crunching large datasets. It's really easy to write an app in Spark; unfortunately, it's also easy to write one that doesn't perform the way you would expect, or that simply fails for no obvious reason. The talk will introduce a practical framework for fixing the most common problems with Spark applications.

Full abstract

Do you have plans to start working with Apache Spark? Are you already working with Spark but not happy with it: you don't get the expected performance and stability, and you are not sure where to look for a fix? Spark has a very nice API and promises high performance when crunching large datasets. It's really easy to write an app in Spark; unfortunately, it's also easy to write one that doesn't perform the way you would expect, or that simply fails for no obvious reason.

This talk covers a number of common problems you might face when running Spark at scale, along with solutions for them, of course. Each problem comes with a well-described background and examples, so it will be understandable to people with no Spark experience, although people who already work with Spark are the main audience. By the end, the audience should have a practical framework for optimizing Spark and fixing the most common problems with Spark applications.

Classes of problems covered in the presentation:

- Dealing with skewed data (see the sketch below for one example)
- Spark on YARN and its memory model
- Caching
- Sizing executors
- Locality
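
As a rough illustration of the first class of problems, and not part of the original abstract, below is a minimal sketch of one common mitigation for a skewed join key in Spark: "salting" the key so that rows sharing a hot value spread across several partitions instead of landing on a single overloaded task. The dataset contents, column names, and the number of salt buckets are invented for the example; the talk itself may present a different approach.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SkewedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skewed-join-sketch")
      .master("local[*]")   // assumption: run locally just for the illustration
      .getOrCreate()
    import spark.implicits._

    // Assumption: 8 salt buckets; in practice this is tuned to the degree of skew.
    val saltBuckets = 8

    // Large, skewed side: many events share the same userId ("hot-user").
    val events = Seq.tabulate(1000)(i =>
      (if (i % 10 == 0) "hot-user" else s"user-$i", i)
    ).toDF("userId", "eventId")

    // Small side: replicated once per salt value so every salted key finds a match.
    val users = Seq(("hot-user", "PL"), ("user-1", "SE")).toDF("userId", "country")

    // Add a random salt to the skewed side, spreading the hot key over saltBuckets partitions.
    val saltedEvents = events
      .withColumn("salt", (rand() * saltBuckets).cast("int"))

    // Explode the small side across all salt values.
    val saltedUsers = users
      .withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

    // Join on (userId, salt) so no single task has to process all rows for the hot key.
    val joined = saltedEvents
      .join(saltedUsers, Seq("userId", "salt"))
      .drop("salt")

    joined.groupBy("country").count().show()

    spark.stop()
  }
}
```

The trade-off is extra shuffle volume (the small side is replicated saltBuckets times) in exchange for evenly sized tasks on the skewed side.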