As part of the Bangalore Python Users Group meetup (Dec 2020), I delivered a talk on parallelizing Data Science workflows with large data using Dask.
Brief Summary
Over time, two things are happening in parallel:
- Computers are becoming more powerful, with an increasing number of processors/cores.
- The Internet of Things, digitized healthcare systems, financial markets, smart cities, etc. are continuously generating terabytes of data.
However, popular Python-based Data Science libraries (like Pandas, NumPy, etc.) are not designed to scale beyond a single machine or to handle larger-than-memory datasets. Dask is an open-source framework designed to scale these libraries both on a single computer and on distributed clusters. Dask enables parallel computation by leveraging multi-core CPUs, and out-of-core computation by streaming larger-than-memory data from disk. Dask can scale to thousand-machine clusters handling hundreds of terabytes of data. At the same time, it works efficiently on a single machine, enabling analysis of moderately large datasets (100GB+) on relatively low-powered laptops.
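As a minimal sketch of that idea (the file pattern and column names here are hypothetical), a Dask DataFrame exposes the familiar Pandas API while splitting the data into partitions that are processed lazily, in parallel, and streamed through memory only when a result is requested:

```python
import dask.dataframe as dd

# Lazily point at a (hypothetical) set of CSV files that together exceed RAM;
# nothing is read yet -- Dask only builds a task graph over partitions.
df = dd.read_csv("data/transactions-*.csv")

# Familiar Pandas-style operations, recorded lazily per partition.
daily_totals = df.groupby("date")["amount"].sum()

# .compute() runs the task graph in parallel across cores, streaming
# partitions through memory rather than loading the full dataset at once.
print(daily_totals.compute().head())
```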
Here, I describe how Dask can be used to overcome processor and memory limitations and parallelize Data Science workflows built on common libraries like Pandas, NumPy, and Scikit-Learn.
Outline of the talk:
- Parallelization & Challenges in utilizing Multi-Core Architectures
- Challenges with Larger than Memory Data
- Introduction to the Architecture of Dask
- How Dask DataFrame & Dask Array can be used for large data
- How dask-ml can be used to scale up Machine Learning training (see the sketch after this list)
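As a minimal sketch of that last point (using a synthetic dataset generated by dask-ml itself), dask-ml mirrors the Scikit-Learn estimator API while operating block-wise on chunked Dask arrays, so training is parallelized across cores without materializing the whole dataset at once:

```python
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

# Generate a synthetic classification dataset as chunked Dask arrays;
# each chunk of 100,000 rows can be processed independently and in parallel.
X, y = make_classification(n_samples=1_000_000, n_features=20,
                           chunks=100_000, random_state=0)

# Scikit-Learn-style estimator that trains block-wise on Dask arrays.
clf = LogisticRegression()
clf.fit(X, y)

# Predictions come back as a lazy Dask array; compute a small slice to inspect.
print(clf.predict(X)[:5].compute())
```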
Resources