I have been following the Kaggle community for the last several years. In 2021, I actively participated in multiple Tabular Playground Series competitions. I didn’t perform well (I struggled to stay around the top 10%), but I used the opportunity to build a robust Kaggle pipeline for my personal use. After the September 2021 Tabular Playground Series competition, I decided to stop participating in Kaggle competitions for several reasons (Kaggle-induced anxiety being the foremost).
Today, I am open sourcing my Kaggle pipeline (github). It’s not a library, but a Python project that can be customized for any data science competition involving tabular data. The project structure includes source code for most of the end-to-end tasks in an ML competition (data processing, visualization, feature engineering, training, experiment tracking, and submitting to Kaggle).
Is it completely my code? No. I borrowed the initial project structure from code open sourced by Kaggle Grandmaster Rob Mulla. A lot of the utility functions are from “Approaching (Almost) Any Machine Learning Problem” by Abhishek Thakur. I remember that I also used some feature selection code from SRK’s GitHub repository. Since this pipeline was built for my personal use, I have not documented the sources everywhere. My sincere apologies for that.
This pipeline is the result of the hundreds of hours I spent working through various tabular data competitions. The code I wrote is mostly “hacky” in nature. My goal was not to design and write well-tested, documented, maintainable Python code, but to make yet another submission on Kaggle. I used to time-box my “Kaggle work” to two hours on every workday (and to several hours over weekends and holidays). So most of the code is not documented (though the names of the functions and packages are helpful). There are no unit tests. Sometimes several functions repeat the same boilerplate code. Honestly, I never bothered. With this code, I was able to achieve what I wanted (within my limited commitment): run another experiment to verify a hypothesis, or capture some more insights from the data, within a short burst of time. I know, it’s a shame, especially coming from a developer with two decades of experience. But, honestly, I didn’t have much of an option. In the grand scheme of things, Kaggling was another experiment I was doing with my life.
Since October 2021, I had been planning to convert this pipeline into a standard library (following all the development best practices), but it never happened (technical debt, after all!). So, after one year, I have finally decided to open source it. I hope this project will shorten a Kaggle newbie’s journey by several months.
GitHub repo: https://github.com/arnabbiswas1/kaggle_pipeline_tps_aug_22
Kaggle discussion: https://www.kaggle.com/competitions/tabular-playground-series-aug-2022/discussion/341120