Structural support vector machines (SSVMs) are among the
best-performing models for many structured computer vision tasks,
such as semantic image segmentation or human pose estimation.
Training SSVMs, however, is computationally costly, because it
requires repeated calls to a structured prediction subroutine
(the max-oracle), which itself requires solving an optimization
problem, e.g. a graph cut.
In this work, we introduce a new technique for SSVM training that
is more efficient than earlier techniques when the max-oracle is
computationally expensive, as is frequently the case in computer
vision tasks. The main idea is to combine the recent stochastic
Block-Coordinate Frank-Wolfe method with efficient hyperplane
caching and to use an automatic selection rule for deciding whether
to call the max-oracle or to rely on one of the cached hyperplanes.
We show experimentally that this strategy leads to faster convergence
to the optimum with respect to the number of required oracle calls,
and that this also translates into faster convergence with respect
to the total runtime for cases where the max-oracle is slow compared
to the other steps of the algorithm.
A publicly available C++ implementation is provided.