Object Localization

For a self-driving car problem, classification with localization might have three object classes, $c_1, c_2, c_3$, one per kind of object the car needs to detect.

For localization, we add 4 more outputs, $b_x, b_y, b_h, b_w$,

where $(b_x, b_y)$ is the midpoint of the bounding box and $b_h, b_w$ are its height and width.
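
As a minimal sketch of this parameterization, the snippet below converts a midpoint/size box into corner coordinates. It assumes coordinates are normalized to the image size with the origin at the top-left corner, which these notes do not state; the function name `box_to_corners` is hypothetical.

```python
def box_to_corners(b_x, b_y, b_h, b_w):
    """Convert a (midpoint, height, width) box to (x_min, y_min, x_max, y_max).

    Assumption: all values are fractions of the image size, origin at top-left.
    """
    half_w = b_w / 2.0
    half_h = b_h / 2.0
    return (b_x - half_w, b_y - half_h, b_x + half_w, b_y + half_h)


# A box centered at (0.5, 0.7) that is 0.3 tall and 0.4 wide.
corners = box_to_corners(0.5, 0.7, 0.3, 0.4)
print(corners)
```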

Target Label

$y=\begin{bmatrix} p_c\\ b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 \end{bmatrix}$
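
The following sketch builds this 8-element target as a plain Python list. It assumes $p_c$ is 1 when an object is present and 0 otherwise, and that $c_1, c_2, c_3$ are a one-hot class encoding; the helper `make_target` and the zero placeholders in the no-object case are illustrative choices, not something these notes specify.

```python
def make_target(box=None, class_index=None, num_classes=3):
    """Build y = [p_c, b_x, b_y, b_h, b_w, c_1, ..., c_n] as a list of floats.

    box: (b_x, b_y, b_h, b_w), or None when the image contains no object.
    class_index: 0-based index of the object's class (ignored if box is None).
    """
    if box is None:
        # No object: p_c = 0. The remaining entries are "don't care" values;
        # zeros are used here purely as placeholders (an assumption).
        return [0.0] * (5 + num_classes)
    b_x, b_y, b_h, b_w = box
    one_hot = [1.0 if i == class_index else 0.0 for i in range(num_classes)]
    return [1.0, b_x, b_y, b_h, b_w] + one_hot


y_object = make_target(box=(0.5, 0.7, 0.3, 0.4), class_index=1)
y_empty = make_target()
print(y_object)
print(y_empty)
```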

Loss function

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/9836d875-abaf-47e1-bb3c-d9d39c088cf3/Untitled.png
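
One common way to define a loss for this target, assumed here rather than taken from the figure, is squared error over all components when an object is present ($y_1 = 1$) and squared error on $p_c$ alone when it is not, since the other entries are then "don't care" values. A minimal sketch, with the hypothetical name `localization_loss`:

```python
def localization_loss(y_hat, y):
    """Squared-error loss for a target y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3].

    Assumption: if an object is present (y[0] == 1), every component counts;
    otherwise only the p_c prediction is penalized.
    """
    if y[0] == 1:
        return sum((p - t) ** 2 for p, t in zip(y_hat, y))
    return (y_hat[0] - y[0]) ** 2


# Object present: only p_c is slightly off, so the loss is (0.9 - 1)^2.
loss_object = localization_loss(
    [0.9, 0.5, 0.7, 0.3, 0.4, 0.0, 1.0, 0.0],
    [1.0, 0.5, 0.7, 0.3, 0.4, 0.0, 1.0, 0.0],
)

# No object: the box and class predictions are ignored, leaving (0.2 - 0)^2.
loss_empty = localization_loss(
    [0.2, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0],
    [0.0] * 8,
)
print(loss_object, loss_empty)
```

In practice the class entries are often scored with a log-likelihood loss and $p_c$ with a logistic loss instead; squared error everywhere is used here only to keep the sketch short.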

<aside> 💡 Here, we assume that the CNN is predicting only one object.

</aside>

Landmark Detection

When detecting faces, we map the face to a set of landmarks (say 64).

$(l_{1x}, l_{1y}), (l_{2x}, l_{2y}), (l_{3x}, l_{3y}), (l_{4x}, l_{4y}), \dots, (l_{64x}, l_{64y})$

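The image goes through a ConvNet whose final layer outputs those landmark coordinates, 128 numbers for 64 landmarks. Because the coordinates are real-valued, this is a regression output rather than a softmax over classes.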
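
To make the output shape concrete, here is a small sketch that flattens 64 $(x, y)$ landmarks into a single target vector. The leading face-present flag and the name `landmarks_to_vector` are assumptions added for illustration; the landmark values are dummy data.

```python
def landmarks_to_vector(landmarks, face_present=True):
    """Flatten [(x, y), ...] into [face_flag, l_1x, l_1y, l_2x, l_2y, ...].

    Assumption: a leading 1.0/0.0 entry marks whether a face is present.
    """
    vec = [1.0 if face_present else 0.0]
    for x, y in landmarks:
        vec.extend([x, y])
    return vec


# 64 dummy landmarks spread along a horizontal line.
landmarks = [(i / 64.0, 0.5) for i in range(64)]
target = landmarks_to_vector(landmarks)
print(len(target))  # 129 = 1 flag + 2 * 64 coordinates
```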

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/783c15d1-10d5-4ce6-96f8-86bba3322a65/Untitled.png

The same goes for pose detection: the training labels are the key landmarks of a body (for example, joints), and the network learns to output them.