Suppose you have a 6x6x1 input matrix (a grayscale image, i.e. a single channel rather than RGB). We construct a second, 3x3 matrix called a filter (also called a kernel) and convolve ("$*$") it with the input matrix.
The first element of the output matrix is the sum of the element-wise product of the filter and the top-left 3x3 region of the image.
We compute the remaining elements the same way, shifting the filter one position at a time (a stride of 1), first to the right and then down. A 6x6 input matrix convolved with a 3x3 filter therefore gives a 4x4 output matrix.
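A minimal pure-Python sketch of this sliding-window computation, assuming stride 1 and no padding. The function name `convolve2d_valid`, the 6x6 example image and the all-ones 3x3 kernel are illustrative choices, not taken from any library. Like most deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding (hypothetical helper).

    The kernel is not flipped, so this is technically a cross-correlation,
    which is what CNN libraries usually call "convolution".
    """
    n_rows, n_cols = len(image), len(image[0])
    f_rows, f_cols = len(kernel), len(kernel[0])
    out_rows, out_cols = n_rows - f_rows + 1, n_cols - f_cols + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Element-wise product of the kernel and the patch under it, then sum.
            total = 0
            for a in range(f_rows):
                for b in range(f_cols):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Illustrative 6x6 single-channel image (pixel values 0..35) and an all-ones 3x3 kernel.
image = [[6 * r + c for c in range(6)] for r in range(6)]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

result = convolve2d_valid(image, kernel)
print(len(result), len(result[0]))  # 4 4
print(result[0][0])  # 63
```

Because every kernel weight is 1, the first output element is just the sum of the top-left 3x3 patch: 0+1+2+6+7+8+12+13+14 = 63.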
If we flip the input image horizontally, reversing the direction of its intensity transition, the same filter produces negative values, which show up as a very dark band in the output.
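To see the sign flip concretely, here is a sketch that assumes a classic vertical edge detector, [[1, 0, -1], [1, 0, -1], [1, 0, -1]], and a toy image whose left half is bright (10) and right half dark (0). Both the filter and the image are assumptions for illustration. The helper `convolve2d_valid` is hypothetical and assumes square inputs, stride 1 and no padding.

```python
def convolve2d_valid(image, kernel):
    # Stride-1, no-padding convolution (cross-correlation); square inputs assumed.
    f = len(kernel)
    m = len(image) - f + 1
    return [
        [sum(image[i + a][j + b] * kernel[a][b] for a in range(f) for b in range(f))
         for j in range(m)]
        for i in range(m)
    ]


# Assumed toy image: bright left half (10), dark right half (0).
bright_to_dark = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
# The same image flipped horizontally: dark left half, bright right half.
dark_to_bright = [row[::-1] for row in bright_to_dark]

# Assumed vertical edge detector (not specified in the notes above).
vertical_edge = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

print(convolve2d_valid(bright_to_dark, vertical_edge)[0])  # [0, 30, 30, 0]
print(convolve2d_valid(dark_to_bright, vertical_edge)[0])  # [0, -30, -30, 0]
```

All four output rows are identical, so only the first is printed. The magnitude (30) marks where the edge is, and the sign indicates which way the intensity changes.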
The elements in the kernel are parameters that are learned by the model using backpropagation.
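A toy sketch of how kernel weights can be learned from data. It generates a target by convolving a random image with an assumed "true" vertical edge filter, starts from a small random kernel, and updates it by plain gradient descent on a half mean squared error loss. This trains a single convolution directly, not a full CNN. The function names, loss, learning rate and step count are all assumptions chosen for illustration. The gradient follows from the chain rule: output element $Y_{ij}$ depends on kernel weight $W_{ab}$ only through the input pixel $X_{i+a,\,j+b}$ that it multiplies.

```python
import random


def convolve2d_valid(image, kernel):
    # Stride-1, no-padding convolution (cross-correlation); square inputs assumed.
    f = len(kernel)
    m = len(image) - f + 1
    return [
        [sum(image[i + a][j + b] * kernel[a][b] for a in range(f) for b in range(f))
         for j in range(m)]
        for i in range(m)
    ]


def half_mse(pred, target):
    # Half of the mean squared error over all output positions.
    diffs = [p - t for p_row, t_row in zip(pred, target) for p, t in zip(p_row, t_row)]
    return sum(d * d for d in diffs) / (2 * len(diffs))


def train_kernel(image, target, f=3, lr=0.2, steps=500, seed=0):
    """Learn an f x f kernel by plain gradient descent (hypothetical helper)."""
    rng = random.Random(seed)
    kernel = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(f)]
    m = len(target)
    losses = []
    for _ in range(steps):
        pred = convolve2d_valid(image, kernel)
        losses.append(half_mse(pred, target))
        # dL/dY for the half-MSE loss, one entry per output position.
        d_out = [[(pred[i][j] - target[i][j]) / (m * m) for j in range(m)] for i in range(m)]
        # Chain rule: dL/dW[a][b] = sum over (i, j) of dL/dY[i][j] * X[i + a][j + b].
        grad = [
            [sum(d_out[i][j] * image[i + a][j + b] for i in range(m) for j in range(m))
             for b in range(f)]
            for a in range(f)
        ]
        kernel = [[kernel[a][b] - lr * grad[a][b] for b in range(f)] for a in range(f)]
    losses.append(half_mse(convolve2d_valid(image, kernel), target))
    return kernel, losses


rng = random.Random(42)
image = [[rng.random() for _ in range(10)] for _ in range(10)]
true_kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]  # assumed target filter
target = convolve2d_valid(image, true_kernel)

learned, losses = train_kernel(image, target)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

In a real CNN the same gradient would be computed automatically by backpropagation through many layers. Here it is written out by hand for a single convolution so the update rule is visible.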
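General formula for determining the dimension of the output matrix:
$(n-f+1) \times (n-f+1)$, where the input is $n \times n$ and the filter is $f \times f$. For the example above, $6 - 3 + 1 = 4$.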
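The formula can be wrapped in a small helper and checked against the 6x6 example. The name `valid_output_size` is hypothetical, and the helper assumes stride 1 and no padding, as in the example above. Applying it repeatedly shows how quickly the output shrinks when such convolutions are stacked.

```python
def valid_output_size(n, f):
    """Side length of the output for an n x n input and an f x f filter,
    assuming stride 1 and no padding (hypothetical helper)."""
    if f > n:
        raise ValueError("filter must not be larger than the input")
    return n - f + 1


print(valid_output_size(6, 3))  # 4

# Stacking 3x3 convolutions shrinks the image each time: 6 -> 4 -> 2.
size = 6
sizes = [size]
for _ in range(2):
    size = valid_output_size(size, 3)
    sizes.append(size)
print(sizes)  # [6, 4, 2]
```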
There are certain downsides to this method: