For a self-driving car problem, classification with localization might have the following classes,
For localization, we assume 4 more classes, $b_x,b_y,b_h,h_w$
Where $b_x,b_y$ are the midpoints and $b_h,b_w$ are the height and width of the bounding box.
$y=\begin{bmatrix} p_c\\ b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 \end{bmatrix}$
<aside> 💡 Here, we assume that the CNN is predicting only one object.
</aside>
When detecting faces, we map the face we set of landmarks (say 64).
$(l_{1x}, l_{1y}), (l_{2x},l_{2y}), (l_{3x},l_{3y}), (l_{4x}, l_{4y}), . . . . (l_{64x}, l_{64y}),$
These then go through a ConvNet and then output those landmarks in the softmax layer
The same goes with pose detection. You enter the key landmarks of a body.