Akshay Parkhi's Weblog


Smartphone Photos to Synthetic Training Data: A 3D Reconstruction Pipeline

26th February 2026

This pipeline turns smartphone photos of your home into synthetic training data — RGB images, depth maps, and camera parameters from viewpoints that never existed. You capture photos, reconstruct a 3D model, then render unlimited novel views. Here’s how the five-stage pipeline works.

Stage 1: Validate Photos

The CaptureValidator scores every image on three metrics using OpenCV before anything touches COLMAP:

import cv2

def score(path):                    # per-image check, as CaptureValidator does it
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    blur = cv2.Laplacian(gray, cv2.CV_64F).var()  # higher = sharper
    bright = gray.mean()                          # 0-255
    contrast = gray.std()                         # pixel spread

    if blur < 30:      return "REJECTED"          # too blurry
    if blur < 60:      return "POOR"              # marginal
    if bright < 40:    return "too dark"
    if bright > 220:   return "overexposed"
    if contrast < 20:  return "flat scene"
    return "ACCEPTED"

iPhone photos come as HEIC — the pipeline only reads JPG/PNG/TIFF/BMP/WebP. Convert first with macOS built-in sips:

for f in *.HEIC; do sips -s format jpeg "$f" --out "${f%.HEIC}.jpg"; done

You need at least 30 accepted images. The recommendation is 50-100 per room with 60-80% overlap between adjacent shots.

python home_3d_reconstruct.py validate --images ./photos/ --auto-reject

The --auto-reject flag moves rejected images to a rejected/ subfolder so COLMAP never sees them.
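
The move itself is a few lines of stdlib Python. A minimal sketch of what --auto-reject does (auto_reject and the verdicts dict are illustrative names, not the pipeline's actual API):

```python
import shutil
from pathlib import Path

def auto_reject(image_dir, verdicts):
    """Move images whose verdict is REJECTED into a rejected/ subfolder."""
    rejected = Path(image_dir) / "rejected"
    rejected.mkdir(exist_ok=True)
    for name, verdict in verdicts.items():
        if verdict == "REJECTED":
            shutil.move(str(Path(image_dir) / name), str(rejected / name))
```

Rejected files are moved, not deleted, so you can review (or re-shoot) them later.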

Stage 2: COLMAP Structure-from-Motion

COLMAP extracts camera poses and a sparse 3D point cloud. Three sequential steps:

┌──────────────────────┐     ┌──────────────────────┐     ┌──────────────────────┐
│  Feature Extraction  │────▶│  Feature Matching    │────▶│  Sparse Mapper       │
│  SIFT + affine shape │     │  Exhaustive all-pairs│     │  Bundle adjustment   │
│  max 2000px per side │     │  32768 max matches   │     │  → camera poses      │
└──────────────────────┘     └──────────────────────┘     └──────────────────────┘

The default config assumes all images come from the same camera (the iPhone) with a PINHOLE model. The output lands in sparse/0/ — camera intrinsics, extrinsics, and 3D points in COLMAP’s binary format.

python home_3d_reconstruct.py colmap --images ./photos/ --workspace ./workspace/
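
To sanity-check the result, sparse/0/cameras.bin can be parsed with nothing but struct. A sketch following the binary layout COLMAP documents — a uint64 camera count, then per camera an int32 id, int32 model id, uint64 width/height, and model-dependent doubles (read_cameras_bin is a hypothetical helper, not part of the pipeline):

```python
import struct

# Parameter counts for the camera models this post mentions:
# 0 = SIMPLE_PINHOLE (f, cx, cy), 1 = PINHOLE (fx, fy, cx, cy)
NUM_PARAMS = {0: 3, 1: 4}

def read_cameras_bin(path):
    """Parse COLMAP's binary cameras.bin into {camera_id: camera dict}."""
    cameras = {}
    with open(path, "rb") as f:
        (num_cameras,) = struct.unpack("<Q", f.read(8))
        for _ in range(num_cameras):
            cam_id, model_id, width, height = struct.unpack("<iiQQ", f.read(24))
            n = NUM_PARAMS[model_id]
            params = struct.unpack(f"<{n}d", f.read(8 * n))
            cameras[cam_id] = {"model_id": model_id, "width": width,
                               "height": height, "params": params}
    return cameras
```

With a single-camera PINHOLE setup you should see exactly one entry, with fx, fy, cx, cy in params.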

Stage 3: 3D Reconstruction

Two backends, four methods:

Method        Backend        VRAM      Best For
splatfacto    Nerfstudio     8-12 GB   Fast iterations, real-time rendering
nerfacto      Nerfstudio     6-8 GB    High-quality mesh export
nerfacto-big  Nerfstudio     12+ GB    Maximum quality NeRF
3dgut-mcmc    NVIDIA 3DGUT   12-16 GB  Isaac Sim integration

Nerfstudio runs ns-train for 30,000 iterations (4-hour timeout). If you provide COLMAP data it uses those camera poses directly; otherwise it runs its own ns-process-data preprocessing first. 3DGUT requires pre-computed COLMAP output.

python home_3d_reconstruct.py reconstruct --data ./photos/ --method splatfacto --colmap-path ./workspace/colmap/sparse/0

Stage 4: Export

The trained neural representation gets exported to standard 3D formats via ns-export:

pointcloud     → PLY point cloud (1M points sampled from NeRF)
poisson        → Watertight mesh via Poisson surface reconstruction
tsdf           → Volumetric mesh (256³ voxel grid)
marching-cubes → Isosurface extraction
gaussian-splat → Raw 3DGS format

The PLY writer uses binary little-endian format — 3 floats for position, 3 floats for normals, 3 unsigned bytes for RGB per vertex. An optional decimation step uses vertex clustering to reduce face count.
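
That vertex layout is compact enough to sketch directly with struct (write_ply is an illustrative stand-in, not the pipeline's writer):

```python
import struct

def write_ply(path, positions, normals, colors):
    """Binary little-endian PLY: 3 position floats, 3 normal floats,
    3 RGB bytes per vertex — 27 bytes per vertex after the header."""
    header = "\n".join([
        "ply",
        "format binary_little_endian 1.0",
        f"element vertex {len(positions)}",
        "property float x", "property float y", "property float z",
        "property float nx", "property float ny", "property float nz",
        "property uchar red", "property uchar green", "property uchar blue",
        "end_header",
    ]) + "\n"
    with open(path, "wb") as f:
        f.write(header.encode("ascii"))
        for pos, nrm, rgb in zip(positions, normals, colors):
            f.write(struct.pack("<ffffffBBB", *pos, *nrm, *rgb))
```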

python home_3d_reconstruct.py export --config outputs/home_3d/splatfacto/config.yml --format pointcloud

Stage 5: Generate Training Data

This is where the pipeline pays off. Given a point cloud, it generates novel-view synthetic data that never existed in the original photos.

Camera poses are distributed using a Fibonacci sphere — uniform angular spacing without clustering at the poles:

import numpy as np

GOLDEN_RATIO = (1 + np.sqrt(5)) / 2
rng = np.random.default_rng()

def camera_position(i, num_views):
    theta = 2 * np.pi * i / GOLDEN_RATIO             # azimuth
    phi = np.arccos(1 - 2 * (i + 0.5) / num_views)   # polar angle
    r = rng.uniform(1.0, 5.0)                        # orbital radius
    y = rng.uniform(0.5, 2.5)                        # camera height
    return np.array([r * np.sin(phi) * np.cos(theta), y,
                     r * np.sin(phi) * np.sin(theta)])

# each camera is then oriented with look_at(scene_center + jitter)

Each view is rendered by z-buffer point splatting — project every 3D point through the camera intrinsics, depth-sort, and splat with cv2.circle(). Holes are then filled with morphological closing (3x3 ellipse kernel).
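
A numpy-only sketch of that render loop — project, depth-sort, nearest point wins — with the cv2.circle() splatting and morphological closing left out for brevity (render_zbuffer is a hypothetical name; here each point covers exactly one pixel):

```python
import numpy as np

def render_zbuffer(points, colors, K, R, t, h, w):
    """Project world points into a camera; the nearest point wins each pixel."""
    cam = points @ R.T + t                  # world -> camera coordinates
    z = cam[:, 2]
    front = z > 1e-6                        # drop points behind the camera
    cam, z, cols = cam[front], z[front], colors[front]
    proj = cam @ K.T                        # apply intrinsics
    u = (proj[:, 0] / z).astype(int)
    v = (proj[:, 1] / z).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[ok], v[ok], z[ok], cols[ok]
    order = np.argsort(-z)                  # far to near, so near overwrites
    image = np.zeros((h, w, 3), np.uint8)
    depth = np.zeros((h, w), np.float32)
    image[v[order], u[order]] = cols[order]
    depth[v[order], u[order]] = z[order]
    return image, depth
```

Sorting far-to-near before the scatter write is what makes plain array assignment behave like a z-buffer: later (nearer) points overwrite earlier (farther) ones.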

Three augmentations are applied per frame:

  • Lighting — brightness α∈[0.7,1.3], contrast β∈[-20,20], per-channel color shift ±10
  • Noise — Gaussian σ∈[0,10] on RGB, σ=0.02 on depth
  • Viewpoint — small 2D rotation ±2° (simulates camera shake)
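
The first two augmentations can be sketched with numpy draws over the ranges above (the ±2° viewpoint rotation is omitted; augment and the fixed seed are illustrative, not the pipeline's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(rgb, depth):
    """Apply one lighting + noise draw to an RGB frame and its depth map."""
    alpha = rng.uniform(0.7, 1.3)                 # brightness gain
    beta = rng.uniform(-20, 20)                   # contrast offset
    shift = rng.uniform(-10, 10, size=3)          # per-channel color shift
    out = rgb.astype(np.float32) * alpha + beta + shift
    out += rng.normal(0, rng.uniform(0, 10), rgb.shape)   # Gaussian RGB noise
    out = np.clip(out, 0, 255).astype(np.uint8)
    noisy_depth = (depth + rng.normal(0, 0.02, depth.shape)).astype(np.float32)
    return out, noisy_depth
```
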

python home_3d_reconstruct.py generate-data --pointcloud exports/point_cloud.ply --num-views 500

Output structure:

training_data/
├── images/          view_00000.png ... view_00499.png    (augmented RGB)
├── depth/           depth_00000.npy + depth_viz_00000.png (per-pixel depth + Viridis colormap)
├── annotations/
│   ├── camera_00000.json ... camera_00499.json           (4x4 extrinsics + 3x3 intrinsics)
│   └── instances.json                                     (COCO format)
└── metadata.json    (scene bounds, point count, config)

200 views is the default. For training, 500+ gives better coverage. Each view includes the full camera matrix so you can project between depth and world coordinates.
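
As a sketch of that projection, assuming the saved 4x4 is a camera-to-world transform (depth_to_world is a hypothetical helper, not part of the pipeline):

```python
import numpy as np

def depth_to_world(depth, K, c2w):
    """Back-project a depth map to world-space XYZ using saved camera matrices."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # camera-space directions (z = 1)
    cam = rays * depth.reshape(-1, 1)        # scale each ray by its depth
    ones = np.ones((cam.shape[0], 1))
    world = np.concatenate([cam, ones], axis=1) @ c2w.T
    return world[:, :3].reshape(h, w, 3)
```

If the JSON instead stores world-to-camera, invert the 4x4 first.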

Full Pipeline

One command runs all five stages sequentially:

python home_3d_reconstruct.py full-pipeline \
    --images ./photos/ \
    --output ./home_3d_output/ \
    --method splatfacto \
    --num-training-views 500

Summary

Stage        What It Does                                          Command
Validate     Score blur/brightness/contrast, reject bad images     python home_3d_reconstruct.py validate -i ./photos/
COLMAP       SIFT features → exhaustive matching → sparse SfM      python home_3d_reconstruct.py colmap -i ./photos/ -w ./work/
Reconstruct  Train NeRF or 3D Gaussian Splatting (30K iterations)  python home_3d_reconstruct.py reconstruct -d ./photos/ -m splatfacto
Export       Sample 1M points → PLY/mesh/splat                     python home_3d_reconstruct.py export -c config.yml -f pointcloud
Generate     Fibonacci camera orbits → z-buffer render → augment   python home_3d_reconstruct.py generate-data -p scene.ply -n 500

Validate ensures COLMAP gets clean input. COLMAP gives you camera poses. The reconstruction learns the full 3D scene. Export converts it to geometry. Generate creates unlimited training views from angles you never photographed.

Links

This is Smartphone Photos to Synthetic Training Data: A 3D Reconstruction Pipeline by Akshay Parkhi, posted on 26th February 2026.
