Fine-Grained Furniture Classification

From a self-rendered 3D dataset and CNN baseline to CLIP-based few-shot fine-grained recognition with context conditioning.

Individual → Undergraduate researcher · Mar 2023–Jan 2024

Two-phase project: first building a controlled 3D rendered dataset and CNN baseline, then extending to CLIP-based few-shot fine-grained classification with context conditioning.


Phase 1 — Dataset & CNN Baseline

Role: Individual · Dates: Mar–Jun 2023 · Stack: TensorFlow, SketchUp / Rhino, Python

The Dataset Problem

Most furniture datasets use web-scraped images with inconsistent lighting, backgrounds, and angles. Training on this noise makes it hard to know whether the model struggles with the category or the context.

I built the dataset using 3D modeling and rendering (SketchUp / Enscape), controlling for the variables that matter:

  • Consistent lighting and background across all categories
  • Multiple angles per object to improve generalization
  • Rigorous splits to prevent leakage between visually similar categories
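The leakage-prevention rule above can be sketched as a grouped split: every render of a given 3D object lands in exactly one split, so multiple angles of the same object never straddle train and validation. This is a minimal sketch assuming a hypothetical filename scheme like "chair_03_view2"; the source doesn't specify how renders are named.

```python
import random
from collections import defaultdict

def split_by_object(render_names, val_frac=0.2, seed=0):
    """Assign all renders of one 3D object to a single split, so different
    angles of the same object never leak between train and validation.
    Assumes names like 'chair_03_view2' (hypothetical naming scheme)."""
    by_object = defaultdict(list)
    for name in render_names:
        obj_id = name.rsplit("_view", 1)[0]  # strip the view suffix
        by_object[obj_id].append(name)
    objects = sorted(by_object)
    random.Random(seed).shuffle(objects)  # seeded for reproducible splits
    n_val = max(1, int(len(objects) * val_frac))
    val_objs = set(objects[:n_val])
    train = [n for o in objects if o not in val_objs for n in by_object[o]]
    val = [n for o in val_objs for n in by_object[o]]
    return train, val

# 10 objects x 4 rendered angles each
names = [f"chair_{i:02d}_view{v}" for i in range(10) for v in range(4)]
train, val = split_by_object(names)
```

Splitting by object rather than by image is what keeps visually near-identical renders out of both sides of the split.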

Model Pipeline

Started with a TensorFlow CNN baseline, then refined iteratively:

Stage          Change                                       Effect
Baseline       CNN trained from scratch                     (reference)
Refinement 1   Data augmentation + learning rate tuning     +accuracy
Refinement 2   Transfer learning (pretrained backbone)      +accuracy
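The refinement stages above can be sketched in Keras: augmentation layers in front of a frozen pretrained backbone, topped with a small classification head. MobileNetV2, the class count, and the exact augmentations are assumptions, since the source names none of them; `weights=None` keeps the sketch offline-friendly, whereas `weights="imagenet"` would load the pretrained features that the transfer-learning refinement relies on.

```python
import tensorflow as tf

NUM_CLASSES = 12  # hypothetical number of furniture categories

# Backbone choice is an assumption; the source only says "pretrained backbone".
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights=None, pooling="avg")
backbone.trainable = False  # freeze pretrained features for initial training

model = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # augmentation (Refinement 1)
    tf.keras.layers.RandomRotation(0.05),
    backbone,                                   # pretrained features (Refinement 2)
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # tuned learning rate
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```

Freezing the backbone first and unfreezing later for fine-tuning is the standard two-stage recipe for small datasets like this one.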

Interior architecture training means thinking carefully about what makes two chairs different — proportion, material, silhouette. That domain knowledge directly shaped how categories were defined and how the dataset was constructed.


Phase 2 — CLIP-Based Few-Shot Extension

Role: Undergraduate researcher · Dates: Oct 2023–Jan 2024 · Stack: PyTorch, HuggingFace, Pandas

Motivation

The CNN baseline required substantial labeled data per category. Fine-grained furniture recognition is a natural few-shot problem — categories are visually similar and labels are expensive. CLIP’s vision-language alignment offered a better prior.

Approach

Added lightweight context conditioning to a CLIP-style baseline:

  • Engineered geographic and language priors as auxiliary inputs
  • Built fusion heads to blend visual and contextual signals
  • Ablated conditioning strength and prompt variants systematically
  • Built stratified few-shot splits with seeded runs for full reproducibility
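The fusion-head idea above can be sketched in PyTorch as a gated blend of a precomputed CLIP image embedding and a projected context vector (e.g., geographic or language priors). The gated-sum design, dimensions, and class count are assumptions; the source only says "fusion heads to blend visual and contextual signals".

```python
import torch
import torch.nn as nn

class ContextFusionHead(nn.Module):
    """Blend a frozen CLIP image embedding with a context embedding via a
    learned gate. Hypothetical design sketch, not the project's exact head."""
    def __init__(self, img_dim=512, ctx_dim=16, n_classes=10):
        super().__init__()
        self.ctx_proj = nn.Linear(ctx_dim, img_dim)   # lift context to image dim
        self.gate = nn.Sequential(nn.Linear(img_dim * 2, img_dim), nn.Sigmoid())
        self.classifier = nn.Linear(img_dim, n_classes)

    def forward(self, img_emb, ctx):
        ctx_emb = self.ctx_proj(ctx)
        g = self.gate(torch.cat([img_emb, ctx_emb], dim=-1))
        fused = img_emb + g * ctx_emb  # gate controls conditioning strength
        return self.classifier(fused)

head = ContextFusionHead()
logits = head(torch.randn(4, 512), torch.randn(4, 16))  # logits: batch x classes
```

A per-dimension gate makes conditioning strength learnable, which pairs naturally with the ablations over conditioning strength described above.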
