Object Localization

For a self-driving car problem, classification with localization might have the following classes,

For localization, we assume 4 more classes, $b_x,b_y,b_h,h_w$

Where $b_x,b_y$ are the midpoints and $b_h,b_w$ are the height and width of the bounding box.

$y=\begin{bmatrix} p_c\\ b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 \end{bmatrix}$

<aside> 💡 Here, we assume that the CNN is predicting only one object.

</aside>

Landmark Detection

When detecting faces, we map the face we set of landmarks (say 64).

$(l_{1x}, l_{1y}), (l_{2x},l_{2y}), (l_{3x},l_{3y}), (l_{4x}, l_{4y}), . . . . (l_{64x}, l_{64y}),$

These then go through a ConvNet and then output those landmarks in the softmax layer

The same goes with pose detection. You enter the key landmarks of a body.