
Kaggle Tips

September 23, 2021
3 min read

Competition Platforms

  • Kaggle
  • Kakao Arena: reportedly limited to Kakao's subsidiaries.
  • Dacon: public competitions. Gradually adopting Kaggle-style practices.

Ranking

  • Ranking system: rankings determined by competition points
    • If you compete as a team of N members, points are divided by $\sqrt{N}$.
  • Tier system: determined by competition medal count
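The team rule above can be sketched in a few lines. This only illustrates the $\sqrt{N}$ division mentioned here, not Kaggle's full points formula; the function name `points_per_member` and the example figure of 1000 points are assumptions made for illustration.

```python
import math


def points_per_member(team_points: float, n_members: int) -> float:
    """Scale a result's points by 1/sqrt(N) for a team of N members.

    `team_points` is a hypothetical figure: the points the same result
    would earn for a solo competitor.
    """
    if n_members < 1:
        raise ValueError("a team needs at least one member")
    return team_points / math.sqrt(n_members)


solo = points_per_member(1000.0, 1)
team_of_four = points_per_member(1000.0, 4)
```

Because the divisor is $\sqrt{N}$ rather than N, each member of a four-person team in this sketch keeps half of the solo points instead of a quarter.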

Competition

Purpose

  • Featured
    • Commercial competitions
    • Winning models sometimes get used by companies.
  • Research
    • Research-oriented competitions
    • Fun topics but lower prize money, apparently.
  • Getting Started & Playground
    • Beginner competitions like Titanic survivor prediction
    • Not for points or medals
  • Analytics
    • Data analysis competitions
    • Submit data exploration and visualization notebooks
  • Recruitment
    • Held for hiring purposes

Submission

  • General competition
    • No resource constraints
    • Just submit submission.csv
  • Code competition
    • Must run a Kaggle notebook to generate submission.csv
    • Resource limits apply
    • Designed to encourage building practical models
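In both submission types the final artifact is a `submission.csv` file. A minimal sketch of writing one with the standard library follows. The column names `id` and `target` are assumptions; each competition defines its own format, usually in a sample submission file.

```python
import csv
from pathlib import Path


def write_submission(ids, predictions, path="submission.csv"):
    """Write one prediction per row under a header line.

    The header ("id", "target") is a placeholder; use the columns the
    competition's sample submission specifies.
    """
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "target"])
        writer.writerows(zip(ids, predictions))
    return path
```

In a code competition the same function would run inside the Kaggle notebook, within the resource limits, instead of on your own machine.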

Processing Competition

![](/assets/images/Kaggle tip/a6770f33-28bd-4a3e-a95a-30f9e76912e4-image.png)

The workflow in the diagram should look familiar. What differs on Kaggle:

  • Uses Kaggle notebooks
  • Can browse other people’s Kaggle notebooks
    • Each notebook serves a different purpose: e.g., train, inference, preprocessing…

For Winning

Fast and Efficient Pipeline Iteration

  • Invest in GPU hardware
    • A Korean Kaggle grandmaster uses Ryzen 3700, 64GB RAM, and 2x RTX 2080 Ti.
    • With 2+ GPUs, they recommended blower-type GPU coolers.
    • I expected researchers would need multi-GPU setups, but surprisingly a single RTX 3090 or 3080 works well too, although 2x 3090 is better.
    • Still grateful to CDPR’s Poland for letting me buy an RTX 3070 for 720,000 won.
  • Invest your own time
    • They reportedly spend 4+ hours per weekday and 8+ hours per weekend day over 1-2 months.
  • Build your own baseline code that works like a template
    • Speeds up development and reduces mistakes.
    • They won 3 gold medals in 3 months using this setup.

Score Improvement

  • Look for good ideas in the Notebook tab and Discussion within the competition.
    • Augmentation, deep learning architecture
    • Relevant papers
  • They strongly emphasized not letting your guard down until the very end.

Validation Strategy

A methodology to narrow the gap between training set and test set scores.

Essential for preventing final ranking drops. Public LB (Leaderboard) and Private LB differ, so avoid overfitting to the Public LB.

  • Recently, the trend is to not reveal the test set.
  • Extract validation set from training set
    • K-fold validation
    • Stratified k-fold
      • Splits the data so that each validation fold keeps roughly the same class proportions as the full training set
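To make the stratified split concrete, here is a minimal standard-library sketch. It is not how any particular library implements it (in practice scikit-learn's `StratifiedKFold` is the usual choice); the function name `stratified_kfold` and its parameters are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs whose class ratios mirror `labels`."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled samples round-robin across the folds,
    # so every fold receives its share of every class.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    # Each fold serves once as the validation set; the rest is training data.
    for k in range(n_splits):
        valid = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, valid
```

With 8 samples of one class and 4 of another split into 4 folds, every validation fold in this sketch contains 2 samples of the first class and 1 of the second, which keeps the validation score closer to what an unseen test set with the same class balance would give.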

Ensemble

In most cases, ensembles outperform single models. A recent Kaggle trend is that ensembling different architectures (e.g., LSTM + BERT) tends to work better than ensembling similar ones.

  • Stratified k-fold ensemble
    • Don’t use the fold models only to check validation scores; ensemble them as well
  • Tabular data
    • LightGBM, CatBoost, XGBoost, NNs
  • Image data
    • ResNet, EfficientNet, ResNeXt
  • Text data
    • LSTM, BERT, GPT-2, RoBERTa
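The simplest way to combine models like these is to average their predictions. The sketch below assumes each model outputs one probability per test row. The function name `average_ensemble`, the optional `weights` argument, and the example numbers are all illustrative, not taken from any specific competition.

```python
def average_ensemble(predictions, weights=None):
    """Blend per-model prediction lists into one list by weighted averaging."""
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    if weights is None:
        weights = [1.0] * len(predictions)  # plain average by default
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per model")

    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Made-up probabilities from two different architectures.
lstm_probs = [0.2, 0.9, 0.6]
bert_probs = [0.4, 0.7, 0.8]
blended = average_ensemble([lstm_probs, bert_probs])
```

The same function covers a k-fold ensemble: pass the test-set predictions of each fold's model as the list entries.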

Single Model Improvement

You can’t ensemble from the start. You need to improve single models to some extent first, then attempt ensembling. It is important to set a threshold for when to stop tuning the single model and move on to ensembling. Possible reference points:

  • Single model scores mentioned by top rankers in discussion
  • Being within top 50 with a single model 1-2 weeks before competition end

Miscellaneous Tips

  • Competing in a team is better. Two or more months solo is too long.
  • Teams can’t be disbanded, so choose carefully.
  • Check potential teammates’ current competition rankings. Apparently some people are surprisingly lazy.
  • Manage versions in folders such as v1, v2.
    • This keeps the option open to ensemble across the versions stored in those folders.
    • They reportedly use a VCS only for final uploads and no version control at all otherwise. Interesting approach.
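One way to use such version folders for ensembling is sketched below. It assumes each `v*` folder holds a two-column `submission.csv` (id, prediction) covering the same ids; that layout and the function name `blend_versions` are assumptions, not the grandmaster's actual setup.

```python
import csv
from pathlib import Path


def blend_versions(root, filename="submission.csv"):
    """Average the prediction column of `filename` across v1, v2, ... folders."""
    sums = {}
    n_versions = 0
    for folder in sorted(Path(root).glob("v*")):
        csv_path = folder / filename
        if not csv_path.is_file():
            continue  # this version produced no submission file
        with csv_path.open(newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            for row_id, value in reader:
                sums[row_id] = sums.get(row_id, 0.0) + float(value)
        n_versions += 1
    if n_versions == 0:
        raise FileNotFoundError(f"no {filename} found under {root}")
    # Assumes every version predicts the same set of ids.
    return {row_id: total / n_versions for row_id, total in sums.items()}
```

Keeping each version's outputs on disk like this means an older model can still join a late ensemble without being retrained.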