Overview

In several recent client and personal projects, I’ve needed to quickly build robust real-time anomaly detection for time series data. In the typical use case, the goal is to fit a model to as much historical data as possible, then use that model to predict future time series values. New observations are then compared against those predictions, and any observation that falls too far from its predicted value is flagged for follow-on analysis.
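As a rough illustration of that fit, predict, compare loop, here is a minimal sketch in Python with NumPy. It is not the approach from the full post: the straight-line trend model, the function names `fit_trend` and `flag_anomalies`, the 3-sigma threshold, and the synthetic data are all assumptions chosen to keep the example short.

```python
import numpy as np

def fit_trend(history):
    """Fit a straight-line trend to historical values (a stand-in for a real model)."""
    history = np.asarray(history, dtype=float)
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, deg=1)
    # Spread of the in-sample residuals, used later to decide what "too far" means.
    residual_std = np.std(history - (slope * t + intercept))
    return slope, intercept, residual_std

def flag_anomalies(history, observed, threshold=3.0):
    """Return a boolean mask marking new observations far from their predictions."""
    observed = np.asarray(observed, dtype=float)
    slope, intercept, residual_std = fit_trend(history)
    # Predict the time steps that immediately follow the historical window.
    t_new = np.arange(len(history), len(history) + len(observed))
    predicted = slope * t_new + intercept
    return np.abs(observed - predicted) > threshold * residual_std

# Synthetic data (hypothetical): a linear trend plus unit-variance noise.
rng = np.random.default_rng(0)
history = 0.5 * np.arange(200) + rng.normal(0, 1.0, 200)
observed = 0.5 * np.arange(200, 210) + rng.normal(0, 1.0, 10)
observed[4] += 15.0  # inject an obvious spike into the new data

flags = flag_anomalies(history, observed)
print(np.flatnonzero(flags))  # indices of the flagged observations
```

In practice the trend line would be replaced by whatever forecasting model suits the data, and the threshold would be tuned to the tolerable false-alarm rate.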

Read more →

Through promotional materials from Databricks earlier this year, I had heard about the upcoming Spark 2.3 release. I sat in a demo and was really excited to try out new features like Continuous Processing and MLlib in Structured Streaming. But my customer work is all done in Spark 2.2 at the moment, and none of my side projects really required distributed processing at scale. Then I went to Spark Summit earlier this summer, and after seeing more Spark 2.

Read more →