Akshay Parkhi's Weblog


Smartphone Photos to Synthetic Training Data: A 3D Reconstruction Pipeline

26th February 2026

This pipeline turns smartphone photos of your home into synthetic training data — RGB images, depth maps, and camera parameters from viewpoints that never existed. You capture photos, reconstruct a 3D model, then render unlimited novel views. Here’s how the five-stage pipeline works.

Stage 1: Validate Photos

The CaptureValidator scores every image on three metrics using OpenCV before anything touches COLMAP:

import cv2

def score(path):                    # per-image check, as CaptureValidator does it
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    blur = cv2.Laplacian(gray, cv2.CV_64F).var()  # higher = sharper
    bright = gray.mean()                          # 0-255
    contrast = gray.std()                         # pixel spread

    if blur < 30:      return "REJECTED"          # too blurry
    if blur < 60:      return "POOR"              # marginal
    if bright < 40:    return "too dark"
    if bright > 220:   return "overexposed"
    if contrast < 20:  return "flat scene"
    return "ACCEPTED"

iPhone photos come as HEIC — the pipeline only reads JPG/PNG/TIFF/BMP/WebP. Convert first with macOS built-in sips:

for f in *.HEIC; do sips -s format jpeg "$f" --out "${f%.HEIC}.jpg"; done

You need at least 30 accepted images. The recommendation is 50-100 per room with 60-80% overlap between adjacent shots.

python home_3d_reconstruct.py validate --images ./photos/ --auto-reject

The --auto-reject flag moves rejected images to a rejected/ subfolder so COLMAP never sees them.
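
The move itself is a few lines of stdlib Python. A minimal sketch of what --auto-reject does (auto_reject and the verdicts dict are illustrative names, not the pipeline's actual API):

```python
import shutil
from pathlib import Path

def auto_reject(image_dir, verdicts):
    """Move images whose verdict is REJECTED into a rejected/ subfolder."""
    rejected = Path(image_dir) / "rejected"
    rejected.mkdir(exist_ok=True)
    for name, verdict in verdicts.items():
        if verdict == "REJECTED":
            shutil.move(str(Path(image_dir) / name), str(rejected / name))
```

Rejected files are moved, not deleted, so you can review (or re-shoot) them later.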

Stage 2: COLMAP Structure-from-Motion

COLMAP extracts camera poses and a sparse 3D point cloud. Three sequential steps:

┌──────────────────────┐     ┌──────────────────────┐     ┌──────────────────────┐
│  Feature Extraction  │────▶│  Feature Matching    │────▶│  Sparse Mapper       │
│  SIFT + affine shape │     │  Exhaustive all-pairs│     │  Bundle adjustment   │
│  max 2000px per side │     │  32768 max matches   │     │  → camera poses      │
└──────────────────────┘     └──────────────────────┘     └──────────────────────┘

The default config assumes all images come from the same camera (the iPhone) with a PINHOLE model. The output lands in sparse/0/ — camera intrinsics, extrinsics, and 3D points in COLMAP’s binary format.

python home_3d_reconstruct.py colmap --images ./photos/ --workspace ./workspace/
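
To sanity-check the result, sparse/0/cameras.bin can be parsed with nothing but struct. A sketch following the binary layout COLMAP documents — a uint64 camera count, then per camera an int32 id, int32 model id, uint64 width/height, and model-dependent doubles (read_cameras_bin is a hypothetical helper, not part of the pipeline):

```python
import struct

# Parameter counts for the camera models this post mentions:
# 0 = SIMPLE_PINHOLE (f, cx, cy), 1 = PINHOLE (fx, fy, cx, cy)
NUM_PARAMS = {0: 3, 1: 4}

def read_cameras_bin(path):
    """Parse COLMAP's binary cameras.bin into {camera_id: camera dict}."""
    cameras = {}
    with open(path, "rb") as f:
        (num_cameras,) = struct.unpack("<Q", f.read(8))
        for _ in range(num_cameras):
            cam_id, model_id, width, height = struct.unpack("<iiQQ", f.read(24))
            n = NUM_PARAMS[model_id]
            params = struct.unpack(f"<{n}d", f.read(8 * n))
            cameras[cam_id] = {"model_id": model_id, "width": width,
                               "height": height, "params": params}
    return cameras
```

With a single-camera PINHOLE setup you should see exactly one entry, with fx, fy, cx, cy in params.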

Stage 3: 3D Reconstruction

Two backends, four methods:

Method        Backend        VRAM      Best For
splatfacto    Nerfstudio     8-12 GB   Fast iterations, real-time rendering
nerfacto      Nerfstudio     6-8 GB    High-quality mesh export
nerfacto-big  Nerfstudio     12+ GB    Maximum quality NeRF
3dgut-mcmc    NVIDIA 3DGUT   12-16 GB  Isaac Sim integration

Nerfstudio runs ns-train for 30,000 iterations (4-hour timeout). If you provide COLMAP data it uses those camera poses directly; otherwise it runs its own ns-process-data preprocessing first. 3DGUT requires pre-computed COLMAP output.

python home_3d_reconstruct.py reconstruct --data ./photos/ --method splatfacto --colmap-path ./workspace/colmap/sparse/0

Stage 4: Export

The trained neural representation gets exported to standard 3D formats via ns-export:

pointcloud     → PLY point cloud (1M points sampled from NeRF)
poisson        → Watertight mesh via Poisson surface reconstruction
tsdf           → Volumetric mesh (256³ voxel grid)
marching-cubes → Isosurface extraction
gaussian-splat → Raw 3DGS format

The PLY writer uses binary little-endian format — 3 floats for position, 3 floats for normals, 3 unsigned bytes for RGB per vertex. An optional decimation step uses vertex clustering to reduce face count.
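
That vertex layout is compact enough to sketch directly with struct (write_ply is an illustrative stand-in, not the pipeline's writer):

```python
import struct

def write_ply(path, positions, normals, colors):
    """Binary little-endian PLY: 3 position floats, 3 normal floats,
    3 RGB bytes per vertex — 27 bytes per vertex after the header."""
    header = "\n".join([
        "ply",
        "format binary_little_endian 1.0",
        f"element vertex {len(positions)}",
        "property float x", "property float y", "property float z",
        "property float nx", "property float ny", "property float nz",
        "property uchar red", "property uchar green", "property uchar blue",
        "end_header",
    ]) + "\n"
    with open(path, "wb") as f:
        f.write(header.encode("ascii"))
        for pos, nrm, rgb in zip(positions, normals, colors):
            f.write(struct.pack("<ffffffBBB", *pos, *nrm, *rgb))
```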

python home_3d_reconstruct.py export --config outputs/home_3d/splatfacto/config.yml --format pointcloud

Stage 5: Generate Training Data

This is where the pipeline pays off. Given a point cloud, it generates novel-view synthetic data that never existed in the original photos.

Camera poses are distributed using a Fibonacci sphere — uniform angular spacing without clustering at the poles:

import numpy as np

GOLDEN_RATIO = (1 + np.sqrt(5)) / 2
rng = np.random.default_rng()

def camera_position(i, num_views):
    theta = 2 * np.pi * i / GOLDEN_RATIO             # azimuth
    phi = np.arccos(1 - 2 * (i + 0.5) / num_views)   # polar angle
    r = rng.uniform(1.0, 5.0)                        # orbital radius
    y = rng.uniform(0.5, 2.5)                        # camera height
    return np.array([r * np.sin(phi) * np.cos(theta), y,
                     r * np.sin(phi) * np.sin(theta)])

# each camera is then oriented with look_at(scene_center + jitter)

Each view is rendered by z-buffer point splatting — project every 3D point through the camera intrinsics, depth-sort, and splat with cv2.circle(). Holes are then filled with morphological closing (3x3 ellipse kernel).
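
A numpy-only sketch of that render loop — project, depth-sort, nearest point wins — with the cv2.circle() splatting and morphological closing left out for brevity (render_zbuffer is a hypothetical name; here each point covers exactly one pixel):

```python
import numpy as np

def render_zbuffer(points, colors, K, R, t, h, w):
    """Project world points into a camera; the nearest point wins each pixel."""
    cam = points @ R.T + t                  # world -> camera coordinates
    z = cam[:, 2]
    front = z > 1e-6                        # drop points behind the camera
    cam, z, cols = cam[front], z[front], colors[front]
    proj = cam @ K.T                        # apply intrinsics
    u = (proj[:, 0] / z).astype(int)
    v = (proj[:, 1] / z).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[ok], v[ok], z[ok], cols[ok]
    order = np.argsort(-z)                  # far to near, so near overwrites
    image = np.zeros((h, w, 3), np.uint8)
    depth = np.zeros((h, w), np.float32)
    image[v[order], u[order]] = cols[order]
    depth[v[order], u[order]] = z[order]
    return image, depth
```

Sorting far-to-near before the scatter write is what makes plain array assignment behave like a z-buffer: later (nearer) points overwrite earlier (farther) ones.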

Three augmentations are applied per frame:

  • Lighting — brightness α∈[0.7,1.3], contrast β∈[-20,20], per-channel color shift ±10
  • Noise — Gaussian σ∈[0,10] on RGB, σ=0.02 on depth
  • Viewpoint — small 2D rotation ±2° (simulates camera shake)
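
The first two augmentations can be sketched with numpy draws over the ranges above (the ±2° viewpoint rotation is omitted; augment and the fixed seed are illustrative, not the pipeline's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(rgb, depth):
    """Apply one lighting + noise draw to an RGB frame and its depth map."""
    alpha = rng.uniform(0.7, 1.3)                 # brightness gain
    beta = rng.uniform(-20, 20)                   # contrast offset
    shift = rng.uniform(-10, 10, size=3)          # per-channel color shift
    out = rgb.astype(np.float32) * alpha + beta + shift
    out += rng.normal(0, rng.uniform(0, 10), rgb.shape)   # Gaussian RGB noise
    out = np.clip(out, 0, 255).astype(np.uint8)
    noisy_depth = (depth + rng.normal(0, 0.02, depth.shape)).astype(np.float32)
    return out, noisy_depth
```
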

python home_3d_reconstruct.py generate-data --pointcloud exports/point_cloud.ply --num-views 500

Output structure:

training_data/
├── images/          view_00000.png ... view_00499.png    (augmented RGB)
├── depth/           depth_00000.npy + depth_viz_00000.png (per-pixel depth + Viridis colormap)
├── annotations/
│   ├── camera_00000.json ... camera_00499.json           (4x4 extrinsics + 3x3 intrinsics)
│   └── instances.json                                     (COCO format)
└── metadata.json    (scene bounds, point count, config)

200 views is the default. For training, 500+ gives better coverage. Each view includes the full camera matrix so you can project between depth and world coordinates.
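
As a sketch of that projection, assuming the saved 4x4 is a camera-to-world transform (depth_to_world is a hypothetical helper, not part of the pipeline):

```python
import numpy as np

def depth_to_world(depth, K, c2w):
    """Back-project a depth map to world-space XYZ using saved camera matrices."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # camera-space directions (z = 1)
    cam = rays * depth.reshape(-1, 1)        # scale each ray by its depth
    ones = np.ones((cam.shape[0], 1))
    world = np.concatenate([cam, ones], axis=1) @ c2w.T
    return world[:, :3].reshape(h, w, 3)
```

If the JSON instead stores world-to-camera, invert the 4x4 first.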

Full Pipeline

One command runs all five stages sequentially:

python home_3d_reconstruct.py full-pipeline \
    --images ./photos/ \
    --output ./home_3d_output/ \
    --method splatfacto \
    --num-training-views 500

Summary

Stage        What It Does                                          Command
Validate     Score blur/brightness/contrast, reject bad images     python home_3d_reconstruct.py validate -i ./photos/
COLMAP       SIFT features → exhaustive matching → sparse SfM      python home_3d_reconstruct.py colmap -i ./photos/ -w ./work/
Reconstruct  Train NeRF or 3D Gaussian Splatting (30K iterations)  python home_3d_reconstruct.py reconstruct -d ./photos/ -m splatfacto
Export       Sample 1M points → PLY/mesh/splat                     python home_3d_reconstruct.py export -c config.yml -f pointcloud
Generate     Fibonacci camera orbits → z-buffer render → augment   python home_3d_reconstruct.py generate-data -p scene.ply -n 500

Validate ensures COLMAP gets clean input. COLMAP gives you camera poses. The reconstruction learns the full 3D scene. Export converts it to geometry. Generate creates unlimited training views from angles you never photographed.

Links

This is Smartphone Photos to Synthetic Training Data: A 3D Reconstruction Pipeline by Akshay Parkhi, posted on 26th February 2026.
