Kaggle Competition Consultation: CSIRO Biomass Prediction

Dear Professor Hadjila Fethallah,

I hope this message finds you well. I am preparing for an upcoming Kaggle competition and would greatly appreciate your guidance on my approach and some technical questions I have.

Competition Link: https://www.kaggle.com/competitions/csiro-biomass


1. Project Goal

Objective:
Predict multiple biomass targets using a combination of images and tabular features.

Targets to Predict:

  • Dry_Green_g
  • Dry_Dead_g
  • Dry_Clover_g
  • GDM_g
  • Dry_Total_g

Input Data:

  1. Images – Visual features of the samples
  2. Tabular Features – Numerical or categorical metadata associated with each sample

2. Dataset Description

File Structure

train/ - Directory containing training images (JPEG), referenced by image_path

test/ - Directory reserved for test images (hidden at scoring time); paths in test.csv point here

test.csv

  • sample_id — Unique identifier for each prediction row (one row per image–target pair)
  • image_path — Relative path to the image (e.g., test/ID1001187975.jpg)
  • target_name — Name of the biomass component to predict for this row. One of: Dry_Green_g, Dry_Dead_g, Dry_Clover_g, GDM_g, Dry_Total_g

The test set contains over 800 images.

train.csv

  • Number of rows: 1,785
  • Number of columns: 9

Columns:

  • sample_id — Unique identifier for each training sample (image)
  • image_path — Relative path to the training image (e.g., images/ID1098771283.jpg)
  • Sampling_Date — Date of sample collection
  • State — Australian state where the sample was collected
  • Species — Pasture species present, ordered by biomass (underscore-separated)
  • Pre_GSHH_NDVI — Normalized Difference Vegetation Index (GreenSeeker) reading
  • Height_Ave_cm — Average pasture height measured by falling plate (cm)
  • target_name — Biomass component name for this row (Dry_Green_g, Dry_Dead_g, Dry_Clover_g, GDM_g, or Dry_Total_g)
  • target — Ground-truth biomass value (grams) corresponding to target_name for this image

sample_submission.csv

  • sample_id — Copy from test.csv; one row per requested (image, target_name) pair
  • target — Your predicted biomass value (grams) for that sample_id

What You Must Predict

For each sample_id in test.csv, output a single numeric target value in sample_submission.csv. Each row corresponds to one (image_path, target_name) pair; you must provide the predicted biomass (grams) for that component. The actual test images are made available to your notebook at scoring time.


3. Current Data Format Analysis

Observed Data Structure (Long Format):

image_id      target_name       target
img1          Dry_Green_g       0.1
img1          Dry_Dead_g        0.1
img1          Dry_Clover_g      0.1
img1          GDM_g             0.2
img1          Dry_Total_g       0.5

Feature Characteristics:

  • Species: 15 unique plant species
  • State: 4 unique Australian states
  • Sampling_Date: Format "2015-09-04"
  • target & target_name: A name–value pair per row — target_name labels the biomass component, target holds its measured value

Observation: I believe we can drop target_name during training after data transformation.


4. My Proposed Approach

🔸 Data Transformation: Long to Wide Format

I'm considering converting the dataset to wide format:

Proposed Wide Format:

image_id    Dry_Green_g    Dry_Dead_g    Dry_Clover_g    GDM_g    Dry_Total_g
img1        0.1            0.1           0.1             0.2      0.5
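For concreteness, the long-to-wide transformation above can be sketched with a pandas pivot (toy values only, mirroring the example rows):

```python
import pandas as pd

# Toy long-format table matching the example above.
df_long = pd.DataFrame({
    "image_id": ["img1"] * 5,
    "target_name": ["Dry_Green_g", "Dry_Dead_g", "Dry_Clover_g",
                    "GDM_g", "Dry_Total_g"],
    "target": [0.1, 0.1, 0.1, 0.2, 0.5],
})

# One row per image, one column per target.
df_wide = (df_long
           .pivot(index="image_id", columns="target_name", values="target")
           .reset_index())
print(df_wide)
```

After the pivot, the tabular features (Species, State, NDVI, height) can be merged back on image_id, since they are identical across the five long-format rows of each image.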

Rationale:

  • Most ML libraries (scikit-learn's MultiOutputRegressor, XGBoost, LightGBM) handle wide target matrices more conveniently
  • Metrics calculation is easier (e.g., R² for each output or weighted R²)
  • Enables vectorized training
  • Better for multi-output learning

🔸 Modeling Strategy: Single Multi-Output Model

My preferred approach: All outputs in different columns (wide format)

Advantages:

  • The model can learn correlations between targets
  • For example, the sum of different grass types should approximately equal Dry_Total_g

Potential Issue:

  • If outputs have very different scales, normalization per target column may be necessary
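A minimal sketch of this strategy on placeholder data (the random arrays stand in for the real tabular features and the five wide-format target columns):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Placeholder data: X_tab stands in for tabular features,
# Y for the five wide-format targets.
rng = np.random.default_rng(0)
X_tab = rng.normal(size=(100, 4))
Y = rng.normal(size=(100, 5))  # Dry_Green_g, ..., Dry_Total_g

# One wrapper, five underlying regressors (one per target column).
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=50))
model.fit(X_tab, Y)
preds = model.predict(X_tab)
print(preds.shape)  # (100, 5)
```

One caveat worth noting: MultiOutputRegressor fits one independent estimator per target, so it does not by itself learn cross-target correlations; a model with a shared representation (e.g., a neural network with five output heads) would be needed for that.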

🔸 Alternative Approaches I'm Considering:

Option 1: Model Per Output (Univariate model per target)

  • Advantage: If targets are completely different in terms of scale or distribution, might yield better performance
  • Disadvantage: Cannot leverage relationships between outputs
  • My assessment: Not suitable for my case since targets are related

Option 2: Chained Model / Regressor Chain

  • Idea: sklearn.multioutput.RegressorChain — the first model predicts target 1, its prediction is then fed as an extra feature to the model for target 2, and so on
  • Useful if: There's clear dependency between outputs (e.g., Total = sum(others))
  • Question: Do you think this approach is worth exploring for this problem?
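To make the chained idea concrete, here is a hedged sketch with RegressorChain on synthetic data where the last target is the sum of the others (the order argument predicts the components first and the total last; Ridge is a placeholder base estimator):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import RegressorChain

# Synthetic data: four component targets plus a fifth "total" target
# constructed as their sum, loosely mimicking Dry_Total_g.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
parts = np.abs(rng.normal(size=(200, 4)))
Y = np.column_stack([parts, parts.sum(axis=1)])

# Chain order: components (0..3) first, total (4) last, so the model
# for the total sees the component predictions as features.
chain = RegressorChain(Ridge(), order=[0, 1, 2, 3, 4])
chain.fit(X, Y)
preds = chain.predict(X)
print(preds.shape)  # (200, 5)
```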

5. Feature Engineering Questions

I have several questions about how to handle the features:

5.1 State Feature

Current state: We have 4 unique Australian states

My questions:

  • How should we handle the State feature?
  • Should we use simple categorical encoding (one-hot)?
  • Or would embeddings be more appropriate?
  • Is there a better approach you'd recommend?
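With only four categories, one-hot encoding is the cheap baseline; embeddings rarely pay off at this cardinality. A minimal sketch (the state abbreviations below are placeholders, not the dataset's actual values):

```python
import pandas as pd

# Placeholder state values; the real dataset has 4 unique states.
df = pd.DataFrame({"State": ["NSW", "VIC", "QLD", "TAS", "NSW"]})

# One binary column per state.
state_ohe = pd.get_dummies(df["State"], prefix="State")
print(state_ohe.shape)  # (5, 4)
```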

5.2 Sampling_Date Feature

Current format: "2015-09-04"

My questions:

  • Should I use cyclical encoding and drop the year component? Or is there a temporal trend I should preserve?
  • I'm thinking of applying cyclical encoding to handle the circular nature (December 31 is close to January 1):
    • Extract day: sin_day = sin(2π × day/31), cos_day = cos(2π × day/31)
    • Extract month: sin_month = sin(2π × month/12), cos_month = cos(2π × month/12)
  • Should I add a season feature?
  • Would adding month cyclical features also be beneficial?

What do you think about this approach?
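A sketch of the cyclical encoding described above, plus a day-of-year variant that captures within-year seasonality in a single sin/cos pair (arguably a better fit than day-of-month, since the denominator 31 is only exact for 31-day months):

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(pd.Series(["2015-09-04", "2015-12-31", "2016-01-01"]))

# Month on a 12-step circle.
month = dates.dt.month
sin_month = np.sin(2 * np.pi * month / 12)
cos_month = np.cos(2 * np.pi * month / 12)

# Day-of-year on a 365.25-step circle (single seasonal signal).
doy = dates.dt.dayofyear
sin_doy = np.sin(2 * np.pi * doy / 365.25)
cos_doy = np.cos(2 * np.pi * doy / 365.25)

# Dec 31 and Jan 1 land next to each other on the circle:
print(abs(sin_doy[1] - sin_doy[2]) < 0.05)  # True
```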

5.3 Species Feature

Current state: 15 unique plant species, underscore-separated, ordered by biomass

My questions:

  • How should I handle this feature optimally?
  • Should I treat it as categorical?
  • Or split it into multiple features?
  • What encoding would you recommend?
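One option worth sketching: split the underscore-separated string into a multi-hot encoding, and keep the ordering information as separate features (the species names below are made up for illustration):

```python
import pandas as pd

# Placeholder species strings, underscore-separated, dominant species first.
df = pd.DataFrame({"Species": ["Ryegrass_Clover", "Clover",
                               "Phalaris_Ryegrass_Clover"]})

# One binary column per species present in the mix.
multi_hot = df["Species"].str.get_dummies(sep="_")

# The ordering encodes biomass rank, so keep it as extra features.
species_lists = df["Species"].str.split("_")
df["dominant_species"] = species_lists.str[0]
df["n_species"] = species_lists.str.len()
print(multi_hot.columns.tolist())
```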

6. Evaluation Metric

The competition uses weighted R² score:

Weighted Coefficient of Determination

$$
R^2_w = 1 - \frac{\sum_j w_j (y_j - \hat{y}_j)^2}{\sum_j w_j (y_j - \bar{y}_w)^2}
$$

where $\bar{y}_w = \frac{\sum_j w_j y_j}{\sum_j w_j}$

Residual Sum of Squares (SS_res)

Measures the total error of the model's predictions:

$$
SS_{\text{res}} = \sum_j w_j (y_j - \hat{y}_j)^2
$$

Total Sum of Squares (SS_tot)

Measures the total weighted variance in the data:

$$
SS_{\text{tot}} = \sum_j w_j (y_j - \bar{y}_w)^2
$$

Terms:

  • $y_j$: Ground-truth value for data point $j$
  • $\hat{y}_j$: Model prediction for data point $j$
  • $w_j$: Per-row weight based on target type
  • $\bar{y}_w$: Global weighted mean of all ground-truth values
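The metric above translates directly into a few lines of NumPy; useful as a local validation metric (the weights here are placeholders — the real per-row weights come from the competition's target-type weighting):

```python
import numpy as np

def weighted_r2(y_true, y_pred, w):
    """Weighted R^2 as defined above: 1 - SS_res / SS_tot."""
    y_true, y_pred, w = map(np.asarray, (y_true, y_pred, w))
    y_bar_w = np.sum(w * y_true) / np.sum(w)          # weighted mean
    ss_res = np.sum(w * (y_true - y_pred) ** 2)       # weighted residual SS
    ss_tot = np.sum(w * (y_true - y_bar_w) ** 2)      # weighted total SS
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 1.0, 2.0, 2.0])  # placeholder weights
print(weighted_r2(y, y, w))  # perfect predictions -> 1.0
```

Note that predicting the global weighted mean for every row yields exactly 0, which makes it a sensible baseline to beat.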

7. Dataset Information

Files: 361 files
Total Size: 1.11 GB
File Types: JPG (images), CSV (metadata)
License: CC BY-SA 4.0

Download Command:

kaggle competitions download -c csiro-biomass

Citation:

@misc{liao2025estimatingpasturebiomasstopview,
    title={Estimating Pasture Biomass from Top-View Images: A Dataset for Precision Agriculture},
    author={Qiyu Liao and Dadong Wang and Rebecca Haling and Jiajun Liu and Xun Li and Martyna Plomecka and Andrew Robson and Matthew Pringle and Rhys Pirie and Megan Walker and Joshua Whelan},
    year={2025},
    eprint={2510.22916},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2510.22916}
}

8. Summary of My Questions

I would greatly appreciate your guidance on the following:

  1. Data transformation: Is converting from long to wide format the right approach for this multi-output regression problem?

  2. Model architecture: Do you recommend using a single multi-output model, or should I explore regressor chains given the relationship between targets?

  3. Feature engineering:

    • What's the best approach for encoding the State feature (categorical vs embeddings)?
    • For the date feature: should I use cyclical encoding? Should I preserve year information or drop it?
    • How should I handle the Species feature optimally?
  4. Image processing: What CNN architecture would you recommend for this task?

  5. Multi-modal fusion: What's the best way to combine image features with tabular data?


Thank you very much for taking the time to review my approach. I truly value your expertise and look forward to your feedback and suggestions.

Best regards,