Parallelization of Data Science Workflow with Large Data using Dask

As part of the Bangalore Python Users Group's meetup (Dec 2020), I delivered a talk on parallelizing the Data Science workflow with large data using Dask.

Brief Summary

Two things have been happening in parallel over time:

  • Computers are becoming more powerful, with an increasing number of processors/cores.
  • The Internet of Things, digitized health-care systems, financial markets, smart cities, etc. are continuously generating terabytes of data.

But popular Python-based Data Science libraries (like Pandas, NumPy, etc.) are not designed to scale beyond a single machine or to handle larger-than-memory datasets. Dask is an open-source framework designed to scale these libraries on a single computer as well as on distributed clusters. Dask enables parallel computation by leveraging multi-core CPUs, and out-of-core computation by streaming larger-than-memory data from disk. Dask can scale to thousand-machine clusters to handle hundreds of terabytes of data. At the same time, it works efficiently on a single machine, enabling analysis of moderately large datasets (100 GB+) on relatively low-power laptops.
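
To make this concrete, here is a minimal sketch (not taken from the talk) of the Dask DataFrame pattern. The file path and column names are placeholder assumptions of mine; the point is that you build a familiar Pandas-style computation lazily and then execute it in parallel, partition by partition, without ever loading the full dataset into memory.

    import dask.dataframe as dd

    # Lazily read many CSV files as one logical DataFrame; nothing is loaded yet.
    # "data/trips-*.csv" and the column names are placeholders, not from the talk.
    df = dd.read_csv("data/trips-*.csv")

    # Build a familiar Pandas-style computation; Dask only records a task graph.
    avg_fare = df.groupby("pickup_zone")["fare_amount"].mean()

    # compute() triggers execution: partitions are streamed from disk and
    # processed in parallel across cores, never holding all data in memory at once.
    print(avg_fare.compute())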

Here, I describe how Dask can be used to overcome processor and memory limitations and parallelize Data Science workflows built on common libraries like Pandas, NumPy, and Scikit-Learn.
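
The same idea carries over to NumPy-style work via Dask Array. The sketch below uses array sizes and chunk shapes of my own choosing, purely for illustration: the array as a whole would not fit in memory, but each chunk does, and the code still reads like ordinary NumPy.

    import dask.array as da

    # A 100,000 x 100,000 random array (~80 GB as float64), split into
    # 10,000 x 10,000 chunks that each fit comfortably in memory.
    x = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))

    # Express the computation with the usual NumPy idioms; it stays lazy.
    y = (x + x.T).mean(axis=0)

    # compute() runs the task graph chunk by chunk, in parallel.
    result = y.compute()
    print(result.shape)  # (100000,)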

Outline of the talk:

  • Parallelization & challenges in utilizing multi-core architectures
  • Challenges with larger-than-memory data
  • Introduction to Architecture of Dask
  • How Dask DataFrame & Dask Array can be used for large data
  • How dask-ml can be used to scale up Machine Learning training (a short sketch follows this list)
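
For the last item, as a hedged illustration (not the exact example from the talk), the sketch below uses one widely documented pattern: routing scikit-learn's joblib-based parallelism through a Dask cluster so that cross-validation fits run as distributed tasks. dask-ml also provides its own drop-in tools (e.g. dask_ml.model_selection.GridSearchCV) for the same purpose; the synthetic dataset, parameter grid, and local cluster here are my own assumptions.

    import joblib
    from dask.distributed import Client, LocalCluster
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Start a local Dask cluster; the same code can point at a multi-node scheduler.
    cluster = LocalCluster()
    client = Client(cluster)

    # A small synthetic dataset and parameter grid, purely for illustration.
    X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
    param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "linear"]}
    search = GridSearchCV(SVC(), param_grid, cv=3)

    # Route scikit-learn's internal joblib parallelism through the Dask cluster,
    # so the cross-validation fits run as Dask tasks across the workers.
    with joblib.parallel_backend("dask"):
        search.fit(X, y)

    print(search.best_params_)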

Resources
