A Hybrid Structure-from-Motion and Vision
Transformer Pipeline for Fast and Accurate Chronic
Wound Measurement
Keywords: 3D Reconstruction, Chronic Wounds, Structure from Motion (SfM),
Deep Neural Networks, Smartphones, Computer Vision, Photogrammetry,
Biomedical Imaging, Accessible Healthcare Technologies.
Accurate measurement of chronic wounds is essential for clinical monitoring;
however, manual methods lack precision, and conventional 3D photogrammetry,
particularly Multi-View Stereo (MVS), is computationally expensive for
mobile applications and insufficiently robust on low-texture cutaneous surfaces.
To address these limitations, this work proposes a hybrid three-dimensional (3D)
reconstruction pipeline that combines Structure-from-Motion (SfM) with a deep
neural network based on Vision Transformers (ViT), namely DepthAnythingV2.
Monocular depth inference replaces MVS-based densification; metric scale is
recovered from fiducial markers, and the depth maps are fused volumetrically
with a Truncated Signed Distance Function (TSDF), ensuring geometric
consistency and reconstruction stability under varying acquisition conditions.
The proposed approach was compared against an SfM+MVS pipeline optimized
through Bayesian Optimization. Experimental results using synthetic
phantoms and clinical images demonstrate that the hybrid method achieves a
mean absolute error (MAE) of 1.85 cm² (mean relative error of 10%), whereas
the SfM+MVS baseline attains an MAE of 1.50 cm² and a relative error of 8.2%,
indicating comparable performance. The Wilcoxon test revealed no statistically
significant difference at the 5% significance level (p = 0.0923). Furthermore, the
proposed method substantially reduces processing time, supporting its adoption
in longitudinal wound monitoring scenarios and enhancing its feasibility for
continuous use on conventional mobile devices.
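
As an illustration of the metric scale recovery step mentioned above, the following minimal sketch shows one way a fiducial marker of known size could convert an up-to-scale reconstruction into centimetres. It is not the implementation used in this work: the function names, the square-marker assumption, the ordered-corner input, and the numeric values are all hypothetical.

```python
import math


def scale_from_marker(corners, side_cm):
    """Return a cm-per-reconstruction-unit scale factor.

    `corners` holds the four 3D corners of a square marker, ordered
    around its perimeter (an assumption of this sketch).
    """
    # Average the four reconstructed edge lengths to damp per-corner noise.
    edges = [math.dist(corners[i], corners[(i + 1) % 4]) for i in range(4)]
    mean_edge = sum(edges) / 4.0
    if mean_edge <= 0.0:
        raise ValueError("degenerate marker: zero edge length")
    return side_cm / mean_edge


def to_metric_area(area_units, scale):
    """Convert an area in squared reconstruction units to cm^2."""
    # Lengths scale linearly, so areas scale with the square of the factor.
    return area_units * scale ** 2


# Hypothetical example: a 2 cm marker reconstructed with 0.5-unit edges.
corners = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.5, 0.0), (0.0, 0.5, 0.0)]
scale = scale_from_marker(corners, side_cm=2.0)
area_cm2 = to_metric_area(1.25, scale)
```

Because area grows with the square of the scale factor, a small error in the recovered marker size is amplified in the wound area, which is one reason to average over all four marker edges rather than rely on a single one.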