Kaggle Competition Consultation: CSIRO Biomass Prediction
Dear Professor Hadjila Fethallah,
I hope this message finds you well. I am preparing for an upcoming Kaggle competition and would greatly appreciate your guidance on my approach and some technical questions I have.
Competition Link: https://www.kaggle.com/competitions/csiro-biomass
1. Project Goal
Objective:
Predict multiple biomass targets using a combination of images and tabular features.
Targets to Predict:
Dry_Green_g, Dry_Dead_g, Dry_Clover_g, GDM_g, Dry_Total_g
Input Data:
- Images – Visual features of the samples
- Tabular Features – Numerical or categorical metadata associated with each sample
2. Dataset Description
File Structure
train/ - Directory containing training images (JPEG), referenced by image_path
test/ - Directory reserved for test images (hidden at scoring time); paths in test.csv point here
test.csv
- sample_id — Unique identifier for each prediction row (one row per image–target pair)
- image_path — Relative path to the image (e.g., test/ID1001187975.jpg)
- target_name — Name of the biomass component to predict for this row. One of: Dry_Green_g, Dry_Dead_g, Dry_Clover_g, GDM_g, Dry_Total_g
The test set contains over 800 images.
train.csv
- Number of rows: 1,785
- Number of columns: 9
Columns:
| Column Name | Description |
|---|---|
| sample_id | Unique identifier for each training sample (image) |
| image_path | Relative path to the training image (e.g., images/ID1098771283.jpg) |
| Sampling_Date | Date of sample collection |
| State | Australian state where the sample was collected |
| Species | Pasture species present, ordered by biomass (underscore-separated) |
| Pre_GSHH_NDVI | Normalized Difference Vegetation Index (GreenSeeker) reading |
| Height_Ave_cm | Average pasture height measured by falling plate (cm) |
| target_name | Biomass component name for this row (Dry_Green_g, Dry_Dead_g, Dry_Clover_g, GDM_g, or Dry_Total_g) |
| target | Ground-truth biomass value (grams) corresponding to target_name for this image |
sample_submission.csv
- sample_id — Copied from test.csv; one row per requested (image, target_name) pair
- target — Your predicted biomass value (grams) for that sample_id
What You Must Predict
For each sample_id in test.csv, output a single numeric target value in sample_submission.csv. Each row corresponds to one (image_path, target_name) pair; you must provide the predicted biomass (grams) for that component. The actual test images are made available to your notebook at scoring time.
3. Current Data Format Analysis
Observed Data Structure (Long Format):
image_id target_name target
img1 Dry_Green_g 0.1
img1 Dry_Dead_g 0.1
img1 Dry_Clover_g 0.1
img1 GDM_g 0.2
img1 Dry_Total_g 0.5
Feature Characteristics:
- Species: 15 unique plant species
- State: 4 unique Australian states
- Sampling_Date: Format "2015-09-04"
- target & target_name: A value–name pair; target holds the measurement and target_name identifies which component it belongs to
Observation: After pivoting to wide format, target_name becomes the column headers, so it can be dropped as a feature during training.
4. My Proposed Approach
🔸 Data Transformation: Long to Wide Format
I'm considering converting the dataset to wide format:
Proposed Wide Format:
image_id Dry_Green_g Dry_Dead_g Dry_Clover_g GDM_g Dry_Total_g
img1 0.5 0.4 0.98 0.32 0.22
Rationale:
- Most ML libraries (sklearn.MultiOutputRegressor, XGBoost, LightGBM) handle wide datasets more efficiently
- Metrics calculation is easier (e.g., R² for each output or a weighted R²)
- Enables vectorized training
- Better suited to multi-output learning
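The long-to-wide transformation described above can be sketched with a pandas pivot. The toy data below only mimics the train.csv columns (sample_id, target_name, target); the real frame would be read from train.csv:

```python
import pandas as pd

# Toy long-format frame mirroring train.csv's (sample_id, target_name, target) layout.
long_df = pd.DataFrame({
    "sample_id": ["img1"] * 5 + ["img2"] * 5,
    "target_name": ["Dry_Green_g", "Dry_Dead_g", "Dry_Clover_g", "GDM_g", "Dry_Total_g"] * 2,
    "target": [0.1, 0.1, 0.1, 0.2, 0.5, 0.3, 0.2, 0.1, 0.5, 1.0],
})

# Pivot to wide: one row per image, one column per biomass component.
wide_df = long_df.pivot(index="sample_id", columns="target_name", values="target").reset_index()
```

After the pivot, the five target_name values become column headers, which is exactly why target_name can be dropped as a training feature.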
🔸 Modeling Strategy: Single Multi-Output Model
My preferred approach: All outputs in different columns (wide format)
Advantages:
- The model can learn correlations between targets
- For example, the sum of the individual components should approximately equal Dry_Total_g
Potential Issue:
- If outputs have very different scales, normalization per target column may be necessary
🔸 Alternative Approaches I'm Considering:
Option 1: Model Per Output (Univariate model per target)
- Advantage: If targets are completely different in terms of scale or distribution, might yield better performance
- Disadvantage: Cannot leverage relationships between outputs
- My assessment: Not suitable for my case since targets are related
Option 2: Chained Model / Regressor Chain
- Idea: sklearn.multioutput.RegressorChain — the first model predicts target1, then its prediction is used as a feature for target2, and so on
- Useful if: There's a clear dependency between outputs (e.g., Total = sum(others))
- Question: Do you think this approach is worth exploring for this problem?
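A minimal sketch of Option 2, using synthetic data whose last target is the sum of the others (mimicking the Dry_Total_g relationship); the base estimator and feature dimensions are placeholders, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # stand-in tabular features
y_parts = rng.uniform(0, 1, size=(200, 4))       # stand-ins for the four components
y = np.column_stack([y_parts, y_parts.sum(axis=1)])  # last column mimics Dry_Total_g

# The order places the component targets first, so the chain can feed their
# predictions into the model for the total.
chain = RegressorChain(Ridge(), order=[0, 1, 2, 3, 4]).fit(X, y)
preds = chain.predict(X)
```

Here the chain's final model sees the predicted components as extra features, which is what makes the Total = sum(others) dependency learnable.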
5. Feature Engineering Questions
I have several questions about how to handle the features:
5.1 State Feature
Current state: We have 4 unique Australian states
My questions:
- How should we handle the State feature?
- Should we use simple categorical encoding (one-hot)?
- Or would embeddings be more appropriate?
- Is there a better approach you'd recommend?
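For reference, the one-hot option is a one-liner with pandas; the state abbreviations below are hypothetical examples, not values taken from the dataset:

```python
import pandas as pd

# Hypothetical State values; with only 4 categories, one-hot is a compact default.
states = pd.DataFrame({"State": ["NSW", "VIC", "QLD", "TAS", "NSW"]})

# One indicator column per state (4 columns for 4 unique states).
state_ohe = pd.get_dummies(states["State"], prefix="State")
```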
5.2 Sampling_Date Feature
Current format: "2015-09-04"
My questions:
- Should I use cyclical encoding and drop the year component? Or is there a temporal trend I should preserve?
- I'm thinking of applying cyclical encoding to handle the circular nature (December 31 is close to January 1):
  - Day of month: sin_day = sin(2π × day/31), cos_day = cos(2π × day/31)
  - Month: sin_month = sin(2π × month/12), cos_month = cos(2π × month/12)
- Should I add a season feature?
- Would adding month cyclical features also be beneficial?
What do you think about this approach?
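A sketch of the cyclical encoding above, plus a day-of-year variant that sidesteps the uneven month lengths implicit in a day/31 encoding (the sample dates are illustrative only):

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(pd.Series(["2015-09-04", "2015-12-31", "2016-01-01"]))

# Month-of-year cyclical encoding: December (12) lands next to January (1).
month = dates.dt.month
sin_month = np.sin(2 * np.pi * month / 12)
cos_month = np.cos(2 * np.pi * month / 12)

# Day-of-year variant: a single pair of features capturing position in the year.
doy = dates.dt.dayofyear
sin_doy = np.sin(2 * np.pi * doy / 365.25)
cos_doy = np.cos(2 * np.pi * doy / 365.25)
```

With this encoding, Dec 31 and Jan 1 sit almost on top of each other in (sin, cos) space, which is the property the raw month or day number lacks.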
5.3 Species Feature
Current state: 15 unique plant species, underscore-separated, ordered by biomass
My questions:
- How should I handle this feature optimally?
- Should I treat it as categorical?
- Or split it into multiple features?
- What encoding would you recommend?
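One option worth raising: a multi-hot encoding that splits the underscore-separated string into per-species indicator columns, keeping the biomass ordering as a separate "dominant species" feature. The species names below are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical Species strings: underscore-separated, ordered by biomass share.
species = pd.Series(["Ryegrass_Clover", "Clover", "Ryegrass_Phalaris_Clover"])

# Multi-hot: one indicator column per species, ignoring the ordering.
multi_hot = species.str.get_dummies(sep="_")

# The ordering can be preserved separately, e.g. the dominant (first-listed) species.
dominant = species.str.split("_").str[0]
```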
6. Evaluation Metric
The competition uses weighted R² score:
Weighted Coefficient of Determination
$$
R^2_w = 1 - \frac{\sum_j w_j (y_j - \hat{y}_j)^2}{\sum_j w_j (y_j - \bar{y}_w)^2}
$$
where $\bar{y}_w = \frac{\sum_j w_j y_j}{\sum_j w_j}$
Residual Sum of Squares (SS_res)
Measures the total error of the model's predictions:
$$
SS_{\text{res}} = \sum_j w_j (y_j - \hat{y}_j)^2
$$
Total Sum of Squares (SS_tot)
Measures the total weighted variance in the data:
$$
SS_{\text{tot}} = \sum_j w_j (y_j - \bar{y}_w)^2
$$
Terms:
- $y_j$: Ground-truth value for data point $j$
- $\hat{y}_j$: Model prediction for data point $j$
- $w_j$: Per-row weight based on target type
- $\bar{y}_w$: Global weighted mean of all ground-truth values
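The metric above translates directly into a few lines of NumPy (the per-row weights $w_j$ would come from the competition's target-type weighting, which is not specified here):

```python
import numpy as np

def weighted_r2(y_true, y_pred, weights):
    """Weighted coefficient of determination, following the formulas above."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    y_bar_w = np.sum(weights * y_true) / np.sum(weights)   # weighted mean of ground truth
    ss_res = np.sum(weights * (y_true - y_pred) ** 2)      # weighted residual sum of squares
    ss_tot = np.sum(weights * (y_true - y_bar_w) ** 2)     # weighted total sum of squares
    return 1.0 - ss_res / ss_tot
```

As with plain R², perfect predictions score 1, and always predicting the weighted mean scores 0.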
7. Dataset Information
Files: 361 files
Total Size: 1.11 GB
File Types: JPG (images), CSV (metadata)
License: CC BY-SA 4.0
Download Command:
kaggle competitions download -c csiro-biomass
Citation:
@misc{liao2025estimatingpasturebiomasstopview,
title={Estimating Pasture Biomass from Top-View Images: A Dataset for Precision Agriculture},
author={Qiyu Liao and Dadong Wang and Rebecca Haling and Jiajun Liu and Xun Li and Martyna Plomecka and Andrew Robson and Matthew Pringle and Rhys Pirie and Megan Walker and Joshua Whelan},
year={2025},
eprint={2510.22916},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.22916}
}
8. Summary of My Questions
I would greatly appreciate your guidance on the following:
1. Data transformation: Is converting from long to wide format the right approach for this multi-output regression problem?
2. Model architecture: Do you recommend a single multi-output model, or should I explore regressor chains given the relationship between targets?
3. Feature engineering:
   - What's the best approach for encoding the State feature (categorical vs embeddings)?
   - For the date feature: should I use cyclical encoding? Should I preserve the year information or drop it?
   - How should I handle the Species feature optimally?
4. Image processing: What CNN architecture would you recommend for this task?
5. Multi-modal fusion: What's the best way to combine image features with tabular data?
Thank you very much for taking the time to review my approach. I truly value your expertise and look forward to your feedback and suggestions.
Best regards,