Kaggle competitions

I simply started Kaggle competitions to improve my data analysis skills, but now being Kaggle Master in competition is my goal. I would like to introduce to you about Kaggle competitions I have completed so far.

Libraries used: Tensorflow, Pytorch, Pandas, Numpy, Ray (for multiprocessing)

My Kaggle profile: here

Jane Street Market Prediction

Duration: 2020.11.24 ~ 2021.08.24

Topic: Test your model against future real market data

Tags: finance, time series, custom metric, generalization, binary classification

Difficulties

Noises in real market data
Correlations between variables
Skewed data distribution
Selection of better prediction model

Task

Build quantitative trading model to maximize returns using market data from a major global stock exchange. Next, test the predictiveness of built models against future market returns.

Approach

Cleaned data to remove noises
Performed feature engineering
Trained DL models for large dataset
Applied Ensemble to improve predictiveness

Result

Ranked 31st/4245 (top 1%) - silver medal

Google Universal Image Embedding

Duration: 2022.07.12 ~ 2022.10.11

Topic: Create image representations that work across many visual domains.

Tags: image, multiclass classification

Difficulties

Dataset is not provided by host
Large-scale model training and inference
Class imbalances in distribution of evaluation dataset
Insufficient GPU resources

Task

In this competition, the developed models are expected to retrieve relevant database images to a given query image (ie, the model should retrieve database images containing the same object as the query). The images in our dataset comprise a variety of object types, such as apparel, artwork, landmarks, furniture, packaged goods, among others.

Approach

Proper dataset collection and processing
CLIP model finetuning
Model architecture and loss function customization

Result

Ranked 107th/1022 (Top 11%)

Kaggle - LLM Science Exam

Duration: 2023.07.12 ~ 2023.10.11

*Topic: Use LLMs to answer difficult science questions. *

Tags: physics, NLP

Difficulties

Dataset is not provided by host
Hard science questions
Limited resources to implement large scale AI model

Task

This competition challenges participants to answer difficult science-based questions written by a Large Language Model.

Approach

Science-topic text dataset collection via Wikipedia
Data pre-processing for better quality
Implementation of open-source large language model
Improved context generation through Retrieval Augmented Generation (RAG)

Result

Ranked 354th/2664 (Top 14%)

Google Smartphone Decimeter Challenge

Duration: 2021.05.13 ~ 2021.08.05

Topic: Improve high precision GNSS positioning and navigation accuracy on smartphones.

Tags: time series data, geospatial analysis, mobile and wireless, signal processing, custom metric

Difficulties

Noises and outliers in signal data
Bias in measurements due to many factors
Effects of signal interference and surroundings
Sensor fusion

Task

Train a prediction model to compute location down to decimeter or even centimeter resolution based on ground truth, raw GPS, and IMU datasets. Next, test your results.

Approach

Smoothing for noise and outlier removal
Kalman-filter based sensor fusion

Result

Ranked 293rd /810 (Top 37%)

Twitter Facebook LinkedIn