Matija Gobec

SmartCat.io

Real-time analytics with fast data stack

Data analytics was long considered a time- and resource-consuming job, executed offline and usually on previously acquired data. Then came the Lambda architecture, which introduced new concepts for storing and analyzing data in near real time. Today's standards require us to analyze data in real time so that we can react accordingly and gain insights as the data streams into our system.

Full abstract

Traditional data warehousing and analytics was an offline job that usually required a lot of time to complete and produce results. These operations gave us some insight into our business, but left us with limited ability to improve or react in a timely manner. One effort to move this approach closer to real time was described by the Lambda architecture, where raw data is stored in real time while also being analyzed in parallel, producing results with a certain delay. This approach eliminated long-running jobs by preparing analytic results on the incoming data, but in today's business the right information at the right time can improve our business and lead directly to more profit. Our latest technologies make it possible to execute analytics on data streams and react almost instantly. Faster data requires faster reactions, especially in fraud detection or mission-critical systems. The NoETL philosophy explains why the traditional approach is no longer valid and why we need the new technologies we have today.

In this presentation we are going to talk about the evolution from monolithic to distributed systems, the pros and cons of both approaches, and what we are able to do with today's technologies. Our main focus will be the fast data stack (Spark, Mesos, Akka, Cassandra, Kafka) and how we can use these technologies to create a scalable and fast data pipeline while running real-time data analytics.
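
To make the pipeline idea concrete, below is a minimal sketch (not from the talk itself) of one common way to wire two pieces of this stack together: Spark Structured Streaming consuming events from Kafka, aggregating them in short windows, and persisting results to Cassandra via the DataStax Spark-Cassandra connector. The topic, keyspace, table, column, and server names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Illustrative sketch: Kafka -> Spark Structured Streaming -> Cassandra.
// All names (topic "events", keyspace "analytics", table "user_counts")
// are assumptions for the example, not part of the presentation.
object FastDataPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fast-data-pipeline")
      .getOrCreate()
    import spark.implicits._

    // Read the raw event stream from a Kafka topic.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events") // hypothetical topic
      .load()
      .selectExpr("CAST(key AS STRING) AS userId", "timestamp")

    // Count events per user in one-minute windows; the watermark
    // bounds how long late data is accepted.
    val counts = events
      .withWatermark("timestamp", "2 minutes")
      .groupBy(window($"timestamp", "1 minute"), $"userId")
      .count()

    // Write each micro-batch to Cassandra. Cassandra writes are
    // upserts by primary key, so updated window counts simply
    // overwrite earlier values for the same (minute, userId).
    counts.writeStream
      .outputMode("update")
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch
          .select($"window.start".as("minute"), $"userId", $"count")
          .write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "analytics", "table" -> "user_counts"))
          .mode("append")
          .save()
      }
      .start()
      .awaitTermination()
  }
}
```

In this shape of pipeline, Kafka decouples producers from consumers, Spark does the streaming computation, and Cassandra serves the continuously updated results; Mesos (as a cluster manager) and Akka (for service-level plumbing) would sit around this core in a full deployment.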