Voon Yan Kho
Open to roles from June 2026

Voon Yan Kho, data scientist.

Final-year Data Science & Computer Science double major at the University of Western Australia. I work at the intersection of machine learning, medical research, and social impact — usually with a Jupyter notebook open. Based in Perth, WA.

Currently focused on: dbt & Snowflake pipelines · diffusion MRI tractography · deep learning in PyTorch
About

I build with data — from clinical MRI scans to charity registries to health-insurance analytics. I like problems where the answer changes how a person, a policy, or a pipeline actually behaves.

I'm a double major in Data Science and Computer Science at UWA (graduating mid-2026). I've held research and engineering roles at HBF, KEMH's Neonatal Health team, the UWA Centre for Social Impact, and the Harry Perkins Institute of Medical Research — spanning dbt & Snowflake, diffusion MRI, web scraping, and single-cell RNA sequencing.

I hold 17 IBM certifications, am a certified IBM Business Intelligence Analyst, and am currently completing the IBM AI Engineering Professional Certificate. I speak English, Chinese, and Malay. Outside work: volunteering, travel, road trips, and good coffee.

Kaggle Projects
🚢
Titanic — Machine Learning from Disaster
End-to-end pipeline on the Kaggle classic: imputed Embarked / Age / Cabin, handled outliers, engineered features (log-Fare, Sex × Pclass, Age × Pclass, FamilySize, IsAlone), and benchmarked Logistic Regression, Decision Tree, Random Forest & XGBoost. XGBoost won on ROC-AUC.
Classification · XGBoost · Feature Engineering
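The engineered features listed above could be derived along these lines. This is a minimal sketch, not the notebook's actual code: it assumes the standard Kaggle Titanic column names (Fare, Sex, Pclass, Age, SibSp, Parch), a 0/1 encoding of Sex for the interaction term, and log1p for the Fare transform. The function name and the two sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def add_titanic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the engineered columns added."""
    out = df.copy()
    # log1p compresses the long right tail of Fare and is defined at Fare == 0.
    out["LogFare"] = np.log1p(out["Fare"])
    # Assumed encoding: female = 1, male = 0, so Sex can enter an interaction.
    is_female = (out["Sex"] == "female").astype(int)
    out["Sex_x_Pclass"] = is_female * out["Pclass"]
    out["Age_x_Pclass"] = out["Age"] * out["Pclass"]
    # Family size = siblings/spouses + parents/children + the passenger.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    return out


# Two illustrative rows, not taken from the real dataset.
sample = pd.DataFrame({
    "Sex": ["male", "female"], "Pclass": [3, 1], "Age": [22.0, 38.0],
    "SibSp": [1, 0], "Parch": [0, 0], "Fare": [7.25, 0.0],
})
features = add_titanic_features(sample)
```

Working on a copy keeps the raw frame untouched, so the same function can be applied to the train and test splits separately.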
🎗️
Breast Cancer Classification — Deep Neural Network
PyTorch feed-forward network (30 → 64 → 2) on the Breast Cancer Wisconsin Diagnostic dataset. Balanced the classes to 200 / 200, standardised 30 features, and trained with Adam + cross-entropy over 10 epochs; final test loss ≈ 0.09. Ran ablations on SGD vs Adam and on hidden-unit counts, then generalised the pipeline to the Iris dataset.
Deep Learning · PyTorch · Medical Classification
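A network of this shape can be sketched in a few lines of PyTorch. The 30 → 64 → 2 layout, Adam, cross-entropy, and the 10 epochs come from the description above; the ReLU activation, the learning rate, full-batch updates, and the random placeholder tensors standing in for the standardised data are all assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# 30 standardised features -> 64 hidden units -> 2 class logits.
model = nn.Sequential(
    nn.Linear(30, 64),
    nn.ReLU(),  # assumed activation; the write-up does not name one
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption
loss_fn = nn.CrossEntropyLoss()  # expects raw logits and integer class labels

# Placeholder data with the balanced 200 / 200 shape, NOT the real dataset.
X = torch.randn(400, 30)
y = torch.cat([torch.zeros(200), torch.ones(200)]).long()

losses = []
for epoch in range(10):
    optimizer.zero_grad()
    logits = model(X)          # full-batch forward pass (assumed, for brevity)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Swapping `torch.optim.Adam` for `torch.optim.SGD`, or changing the 64 in both `nn.Linear` calls, is all an optimiser or hidden-unit ablation needs in a setup like this.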
🍷
Red Wine Quality Prediction
Binary classification on the UCI Vinho Verde dataset (quality ≥ 7 = good). Cleaned 240 duplicates, log-transformed skewed physicochemical features, and benchmarked Logistic Regression, SVM, Decision Tree, Random Forest, and XGBoost across full / balanced / minimal feature sets with class-balanced weights. Random Forest on the minimal set won: leaner features, same accuracy, less overfitting.
Classification · XGBoost · Model Comparison
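The cleaning steps and the binary target described above might look like the following in pandas. This is a sketch under assumptions: the column names follow the UCI red wine CSV, but which columns were actually log-transformed is not stated, so the `skewed_cols` argument, the function name, and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd


def prepare_wine(df: pd.DataFrame, skewed_cols: list[str]) -> pd.DataFrame:
    """Drop duplicate rows, log-transform skewed columns, add a binary target."""
    out = df.drop_duplicates().reset_index(drop=True)
    for col in skewed_cols:
        # log1p keeps zero-valued measurements finite.
        out[col] = np.log1p(out[col])
    # quality >= 7 is labelled "good" (1); everything else is 0.
    out["good"] = (out["quality"] >= 7).astype(int)
    return out


# Illustrative rows only; the first two are exact duplicates.
sample = pd.DataFrame({
    "residual sugar": [1.9, 1.9, 2.6, 0.0],
    "chlorides": [0.076, 0.076, 0.098, 0.05],
    "quality": [5, 5, 7, 8],
})
cleaned = prepare_wine(sample, skewed_cols=["residual sugar", "chlorides"])
```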
🧠
Red Wine Quality — Deep Learning Extension
Follow-up to the classical ML study, now with TensorFlow / Keras. Built a Dense 64 → Dropout → 32 → Dropout → Softmax MLP with Adam + EarlyStopping (patience 12) and class-balanced weights. Trained on both the full 6-class quality target and a regrouped 3-class (low / medium / high) target across full / balanced / minimal feature sets — the 3-class full model reached ≈ 0.66 accuracy, and misclassifications clustered around adjacent quality levels, confirming the target's ordinal structure.
Deep Learning · TensorFlow · Multiclass
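One way to quantify how strongly errors cluster around adjacent levels is to measure the share of misclassifications that are exactly one level off. The sketch below is an illustration of that idea, not the analysis from the notebook: the function name is hypothetical and the labels are made up (0 = low, 1 = medium, 2 = high).

```python
import numpy as np


def adjacent_error_share(y_true, y_pred) -> float:
    """Fraction of misclassified samples whose prediction is one level off."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    wrong = y_true != y_pred
    if not wrong.any():
        return 0.0  # no errors, so nothing to attribute
    off_by_one = np.abs(y_true - y_pred) == 1
    return float((wrong & off_by_one).sum() / wrong.sum())


# Made-up ordinal labels, for illustration only.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 1, 2, 1, 0]
share = adjacent_error_share(y_true, y_pred)  # 3 of the 4 errors are adjacent
```

A share close to 1 means almost every mistake lands on a neighbouring class, which is the pattern an ordinal target tends to produce.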
🏘️
Apartment Prices Prediction — Poland
Trained and compared multiple regression models to predict apartment prices across Polish cities. Full EDA, feature engineering, and model evaluation — my first end-to-end ML project.
Regression · scikit-learn · EDA
📈
Income by Education — Australia & Canada
Cross-country comparative analysis of education ROI. Australia offers a higher return on education investment than Canada, especially at higher qualification levels.
Data Analytics · Pandas · Seaborn
🎓
Student Performance Prediction
Predicted students' final grades from prior performance and behavioural factors. Linear regression quietly outperformed more complex models, a useful reminder that a simple baseline can win.
Regression · Feature Engineering
♻️
Waste Classification — IBM Deep Learning Capstone
Automated waste classifier separating recyclable from organic waste using transfer learning and fine-tuning in Keras & TensorFlow. Final project for the IBM AI Capstone.
Deep Learning · Keras · Transfer Learning
🎨
Anime Image Classification — CNN
Built and trained a convolutional neural network from scratch in PyTorch for multi-class anime image classification. Practice project from the IBM PyTorch course.
Deep Learning · PyTorch · CNN
Education
The University of Western Australia
Bachelor of Science — Data Science & Computer Science (double major)
Jul 2024 – Jun 2026
WAM: 81.4 · GPA: 6.67 / 7 · IBM Certs: 17 · Languages: 3
Experience four roles · 2024 → 2026
Mar 2026 — Apr 2026
Perth, WA
🏥 Data Engineering Intern
HBF Health · health insurance
Built and maintained analytical datasets with dbt and Snowflake. Diagnosed and fixed failing dbt test cases, improving pipeline reliability for the data engineering team.
dbt · Snowflake · SQL · Data Modelling
Nov 2025 — Feb 2026
Subiaco, WA
🧠 Research Assistant — Data Science
KEMH Neonatal Health Team · women & newborn hospital
Diffusion MRI tractography of the preterm infant brain in Python & DIPY. Ran the full preprocessing stack — denoising, Gibbs ringing & eddy correction, bias-field correction, brain masking, response-function estimation, and CSD for fiber orientation.
DIPY · MRI · Python · Neuroscience
Feb 2025 — May 2025
Crawley, WA
🌱 Research Intern — Data Science
UWA Centre for Social Impact (CSI) · research institute
Cleaned survey datasets, built Power BI dashboards for social-impact metrics, implemented Whisper / WhisperX for speech-to-text, and scraped a dataset of 65,000+ Australian charities and not-for-profits.
Power BI · Whisper · Web Scraping · EDA
Nov 2024 — Jan 2025
Remote
🧬 Research Assistant — Data Science
Harry Perkins Institute of Medical Research · biomedical
Single-cell RNA sequencing (scRNA-seq) analysis in Seurat / R. Performed Hallmark GSEA and GO pathway analysis to study differential expression in endothelial cells, and produced ggplot2 reports for collaborators.
R · Seurat · scRNA-seq · ggplot2
Skills
🤖 Machine Learning
scikit-learn · Bayesian statistics · regression · classification · cross-validation · hyperparameter tuning
🧠 Deep Learning
PyTorch · Keras · TensorFlow · CNNs · transfer learning · fine-tuning · neural networks
📊 Data Analytics & EDA
Pandas · NumPy · Seaborn · Matplotlib · ggplot2 · data cleaning · feature engineering
⚙️ Data Engineering
dbt · Snowflake · ETL · PySpark · Databricks · IBM Cloud · Apache HBase · Hive
🗄️ Databases
SQL · PostgreSQL · MySQL · query optimisation · data modelling
📈 Business Intelligence
Tableau · Power BI · IBM Cognos Analytics · dashboard design · data storytelling
Certifications IBM · Coursera
Generative AI & LLMs: Architecture & Data Prep · Feb 2026
AI Capstone Project with Deep Learning · Jul 2025
Deep Learning with PyTorch · Jul 2025
Introduction to Neural Networks & PyTorch · Jun 2025
Deep Learning with Keras & TensorFlow · Dec 2024
Introduction to Deep Learning & Neural Networks (Keras) · Nov 2024
Practera Study Australia Industry Experience · Oct 2024
IBM Business Intelligence (BI) Analyst Specialization · Jul 2024
Oracle Java Foundation · Jul 2023
Salesforce Lightning Reports & Dashboards Specialist · Jun 2023
Contact

Open to graduate roles, research collaborations, and conversations about data. Say hi.