Automatic Feature Engineering on Large Scale Time Series Data using tsfresh & Dask

[I delivered a talk on this topic in January 2021. You can find it here.]

The Internet of Things, digitized health care systems, financial markets, and smart cities all continuously generate time series data of different types, sizes, and complexities. As continuous monitoring and data collection become more feasible and popular, the demand for time series analysis using both statistical and machine learning methods is increasing.

Why is Feature Engineering Important for Time Series Data?

Time series data differs from non-temporal data. In a time series, the observation at any instant depends on past observations, in a way determined by the underlying process. The data often contains noise and redundant information. To complicate matters further, most traditional machine learning algorithms were developed for non-temporal data. Extracting meaningful features from raw time series therefore plays a major role in the data science process.
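As a rough illustration of what "extracting features" means here, the sketch below collapses a raw series into a fixed-length vector by hand. The function name `summarize` and the particular features chosen (mean, standard deviation, number of peaks, lag-1 autocorrelation) are assumptions made for this example only.

```python
import statistics


def summarize(series):
    """Collapse a raw series into a small, fixed-length feature dictionary."""
    mean = statistics.fmean(series)
    # A peak is a point strictly greater than both of its neighbours.
    n_peaks = sum(
        1
        for prev, cur, nxt in zip(series, series[1:], series[2:])
        if cur > prev and cur > nxt
    )
    # Lag-1 autocorrelation: how strongly each value follows the previous one.
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    lag1 = numer / denom if denom else 0.0
    return {
        "mean": mean,
        "std": statistics.pstdev(series),
        "n_peaks": n_peaks,
        "lag1_autocorr": lag1,
    }


# Two series of different lengths map to feature vectors of the same shape,
# which is the input format most non-temporal ML algorithms expect.
features_a = summarize([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
features_b = summarize([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
```

Writing such functions by hand does not scale to hundreds of candidate features, which is where automation becomes attractive.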

How Can Automated Feature Engineering Help?

While some features are generic across different kinds of time series, others are specific to particular domains. As a result, feature engineering often demands familiarity with domain-specific and signal processing algorithms, which complicates the process. Automated feature engineering lowers this barrier by generating a broad set of candidate features without requiring that expertise up front.

Automated Feature Engineering using tsfresh

tsfresh is a Python-based open source library. It accelerates the feature engineering process by automatically generating hundreds of features. These range from very basic ones, such as the number of peaks, to more complex ones, such as the time reversal symmetry statistic. Once the features are generated, irrelevant ones can be dropped using tsfresh’s built-in feature selection mechanism or any other popular feature selection method.

What happens if the data is big? Parallelization & Distributed Computing

The problem becomes harder when the data is large: the time needed for feature extraction scales linearly with the number of time series. In this session, I discuss different ways to speed up the computation:

  1. If the data fits on a single machine, utilize multiple cores using multiprocessing.
  2. Even when the data fits on a single machine, the computation can still be distributed over multiple machines using tsfresh’s distributor framework.
  3. If the data doesn’t fit on a single machine, distribute both the data and the computation across multiple machines using libraries like Dask.

Why Dask?

Popular Python-based data science libraries (such as pandas and NumPy) are not designed to:

  1. scale beyond a single core on a single machine.
  2. handle large datasets that don’t fit in main memory (RAM).

Dask is a framework designed to enable parallel computation across multiple cores on a single machine or distributed across multiple machines. It supports out-of-core computation by streaming larger-than-memory data from disk. Dask can scale to thousand-machine clusters to handle hundreds of terabytes of data. At the same time, it works efficiently on a single machine, enabling analysis of moderately large datasets (100GB+) on relatively low-powered laptops.