Kaggle competitions
I started entering Kaggle competitions simply to improve my data analysis skills, but my goal now is to become a Kaggle Competitions Master. Below are the Kaggle competitions I have completed so far.
Libraries used: TensorFlow, PyTorch, pandas, NumPy, Ray (for multiprocessing)
My Kaggle profile: here
Jane Street Market Prediction
Duration: 2020.11.24 ~ 2021.08.24
Topic: Test your model against future real market data
Tags: finance, time series, custom metric, generalization, binary classification

Difficulties
- Noise in real market data
- Correlations between variables
- Skewed data distribution
- Selecting a better prediction model
Task
Build a quantitative trading model to maximize returns using market data from a major global stock exchange. Next, test the predictiveness of the built models against future market returns.
Approach
- Cleaned the data to remove noise
- Performed feature engineering
- Trained deep learning models on the large dataset
- Applied ensembling to improve predictiveness
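The ensembling step can be illustrated with a minimal sketch. The source does not say which ensembling method was used, so a (weighted) average of per-model probabilities is assumed here; the function name `average_ensemble`, the sample probabilities, and the 0.5 threshold are all hypothetical.

```python
import numpy as np

def average_ensemble(prob_list, weights=None):
    """Blend per-model probabilities with a (weighted) mean.

    prob_list: sequence of shape (n_models, n_samples).
    weights:   optional per-model weights; uniform if omitted.
    """
    probs = np.asarray(prob_list, dtype=float)
    if weights is None:
        weights = np.ones(probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return weights @ probs             # shape: (n_samples,)

# Hypothetical outputs of three models on four samples (not real results)
model_probs = [
    [0.60, 0.40, 0.70, 0.20],
    [0.55, 0.45, 0.80, 0.30],
    [0.65, 0.35, 0.60, 0.10],
]
blended = average_ensemble(model_probs)
# Assumed decision rule for the binary target: act when probability > 0.5
actions = (blended > 0.5).astype(int)
```

Averaging tends to reduce the variance of individual models, which is one common reason to ensemble on noisy data; other schemes (stacking, rank averaging) would fit the same bullet equally well.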
Result
Ranked 31st/4245 (Top 1%) - silver medal
Google Universal Image Embedding
Duration: 2022.07.12 ~ 2022.10.11
Topic: Create image representations that work across many visual domains.
Tags: image, multiclass classification

Difficulties
- No dataset provided by the host
- Large-scale model training and inference
- Class imbalance in the evaluation dataset
- Insufficient GPU resources
Task
In this competition, the developed models are expected to retrieve database images relevant to a given query image (i.e., the model should retrieve database images containing the same object as the query). The images in the dataset cover a variety of object types, such as apparel, artwork, landmarks, furniture, and packaged goods, among others.
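As a rough illustration of the retrieval task described above, the sketch below ranks database images by cosine similarity between embedding vectors. The similarity measure, the function names, and the toy 2-D embeddings are assumptions for illustration; a real model would produce much higher-dimensional embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def retrieve_top_k(query_emb, index_emb, k=2):
    """Return, for each query, the indices of the k most similar database rows."""
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    d = l2_normalize(np.asarray(index_emb, dtype=float))
    sims = q @ d.T                       # shape: (n_queries, n_database)
    return np.argsort(-sims, axis=1)[:, :k]

# Toy embeddings (hypothetical): rows 0 and 1 point in a similar direction
index_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query_emb = [[2.0, 0.0]]
top = retrieve_top_k(query_emb, index_emb, k=2)
```

Normalizing first means the magnitude of an embedding does not affect the ranking, only its direction.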
Approach
- Proper dataset collection and processing
- CLIP model finetuning
- Model architecture and loss function customization
Result
Ranked 107th/1022 (Top 11%)
Kaggle - LLM Science Exam
Duration: 2023.07.12 ~ 2023.10.11
Topic: Use LLMs to answer difficult science questions.
Tags: physics, NLP

Difficulties
- No dataset provided by the host
- Hard science questions
- Limited resources for running large-scale AI models
Task
This competition challenges participants to answer difficult science-based questions written by a Large Language Model.
Approach
- Science-topic text dataset collection via Wikipedia
- Data pre-processing for better quality
- Implementation of an open-source large language model
- Improved context generation through Retrieval Augmented Generation (RAG)
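The retrieval-augmented step can be sketched in a few lines. This is only an illustration of the idea, not the pipeline used in the competition: a bag-of-words cosine score stands in for whatever retriever was actually used, and the passages, question, and prompt layout are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(passages,
                    key=lambda p: cosine(q, Counter(tokenize(p))),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, options, passages, k=1):
    """Prepend retrieved context to a multiple-choice question."""
    context = "\n".join(retrieve(question, passages, k))
    choices = "\n".join(f"{label}. {text}"
                        for label, text in zip("ABCDE", options))
    return f"Context:\n{context}\n\nQuestion: {question}\n{choices}\nAnswer:"

# Hypothetical mini-corpus standing in for the collected Wikipedia text
passages = [
    "The speed of light in vacuum is about 299,792 kilometres per second.",
    "Mitochondria produce most of the chemical energy used by a cell.",
]
question = "What is the speed of light in vacuum?"
prompt = build_prompt(question, ["About 300,000 km/s", "About 340 m/s"], passages)
```

The resulting prompt would then be passed to the language model; swapping the bag-of-words score for dense sentence embeddings changes only the `retrieve` function.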
Result
Ranked 354th/2664 (Top 14%)
Google Smartphone Decimeter Challenge
Duration: 2021.05.13 ~ 2021.08.05
Topic: Improve high precision GNSS positioning and navigation accuracy on smartphones.
Tags: time series data, geospatial analysis, mobile and wireless, signal processing, custom metric

Difficulties
- Noise and outliers in signal data
- Bias in measurements due to many factors
- Effects of signal interference and surroundings
- Sensor fusion
Task
Train a prediction model to compute location down to decimeter or even centimeter resolution based on ground truth, raw GPS, and IMU datasets. Next, test your results.
Approach
- Smoothing for noise and outlier removal
- Kalman-filter based sensor fusion
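To make the Kalman-filter step concrete, here is a minimal one-dimensional sketch, assuming a constant-velocity state model in which noisy GNSS positions are the measurements and IMU acceleration enters as a control input. The function name `kalman_fuse_1d`, the noise variances, and the simulated data are hypothetical; the actual competition pipeline worked on real multi-dimensional position data.

```python
import numpy as np

def kalman_fuse_1d(gnss_pos, accel=None, dt=1.0, meas_var=4.0, accel_var=0.01):
    """Filter noisy positions with a constant-velocity Kalman filter.

    State is [position, velocity]. GNSS positions are measurements;
    IMU acceleration (optional) is used as a control input in the predict step.
    """
    gnss_pos = np.asarray(gnss_pos, dtype=float)
    accel = np.zeros_like(gnss_pos) if accel is None else np.asarray(accel, dtype=float)

    F = np.array([[1.0, dt], [0.0, 1.0]])         # state transition
    B = np.array([0.5 * dt**2, dt])               # control (acceleration) input
    H = np.array([[1.0, 0.0]])                    # only position is measured
    Q = accel_var * np.array([[dt**4 / 4, dt**3 / 2],
                              [dt**3 / 2, dt**2]])  # process noise
    R = np.array([[meas_var]])                    # measurement noise

    x = np.array([gnss_pos[0], 0.0])              # initial state guess
    P = np.eye(2) * 10.0                          # initial uncertainty
    estimates = []
    for z, a in zip(gnss_pos, accel):
        # Predict: propagate state and covariance using the motion model
        x = F @ x + B * a
        P = F @ P @ F.T + Q
        # Update: correct the prediction with the GNSS measurement
        residual = z - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ residual
        P = (np.eye(2) - K @ H) @ P
        estimates.append(x[0])
    return np.array(estimates)

# Simulated example (not competition data): 1 m/s motion, 2 m GNSS noise
rng = np.random.default_rng(0)
t = np.arange(200.0)
truth = 1.0 * t
gnss = truth + rng.normal(0.0, 2.0, size=t.size)
est = kalman_fuse_1d(gnss, accel=np.zeros_like(t))
```

Comparing `est` and `gnss` against `truth` (for example by root-mean-square error) shows how much of the measurement noise the filter removes on this toy trajectory.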
Result
Ranked 293rd/810 (Top 37%)