Team cv2
Abhinav Raundhal - 2022101089
Gaurav Behera - 2022111004
As digital photography has advanced, numerous techniques have emerged for high-level image and video manipulation. For example, image retargeting algorithms adjust an image’s aspect ratio while preserving its visual content, and image completion algorithms seamlessly fill in removed regions by synthesising textures from the surrounding image. These techniques often rely on the observation that images can be divided into small patches, such as 7x7 pixel regions, which can then be rearranged, similar to assembling a jigsaw puzzle. However, patch-based manipulation traditionally incurs significant computational and memory costs due to the difficulty of efficiently searching for similar patches.
This project reimplements and explores the Nearest Neighbour Field (NNF) algorithm from scratch in Python, using the original MATLAB and C++ reference code as guides. Additionally, we have implemented a simplified image in-painting approach in Python that combines an Expectation-Maximisation (EM) step with multi-scale coarse-to-fine refinement. This in-painting approach is inspired by another GitHub repository, whose implementation is in C++. We have also created a simple Python GUI that allows users to load an image, draw mask regions on it, and save the mask for use by the algorithm.
The original paper introduces PatchMatch, an algorithm that significantly accelerates the search for matching image patches, achieving speeds 20-100x faster than prior methods while using much less memory. By leveraging the natural statistical coherence of adjacent patches in images, the algorithm quickly computes nearest-neighbour matches for densely overlapping patches, one per pixel. This method has applications in image in-painting, retargeting, and reshuffling, of which in-painting and reshuffling are demonstrated in this project.
PatchMatch has been widely adopted and integrated into Adobe's image and video editing tools, particularly in Photoshop and After Effects. One prominent application is the Content-Aware Fill feature, which leverages PatchMatch for efficient and high-quality image and video in-painting. This feature allows users to seamlessly remove unwanted objects or regions from images or videos by synthesising plausible content based on surrounding textures.
Natural images exhibit two key statistical properties that facilitate efficient search algorithms: coherence and a peaked distribution. Coherence refers to the tendency of adjacent patches in an image to have similar nearest neighbours, meaning that good matches for one patch are likely to inform the matches for nearby patches. The peaked distribution implies that the best matches for a given patch are often located spatially close to its position in the source or target image. By leveraging these properties, search algorithms can propagate good matches across neighbouring patches and concentrate computational efforts on nearby regions, significantly improving both the speed and accuracy of finding nearest-neighbour patches between images.
High level motivation behind the algorithm. We match descriptors computed on two manifolds, visualised as colour circles. Each descriptor independently finds its most similar match in the other manifold. Coherence here could be indicated by the red descriptor being similar to the purple descriptor, and that their nearest neighbours are spatially close to each other.
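The core of the system revolves around an efficient algorithm for computing patch correspondences between two images. The main challenge is to identify, for every patch in one image, the best matching patch in the other image. To solve this, we define a Nearest-Neighbour Field (NNF), which is a function $f: A \to \mathbb{R}^2$ of offsets, defined over all possible patch coordinates in image $A$.
Formally, for a given patch at coordinate $a$ in image $A$ and its nearest neighbour at coordinate $b$ in image $B$, the offset is $f(a) = b - a$. (The pseudocode in this document stores the matched coordinate $b$ directly rather than the offset; the two representations are equivalent.)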
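As a concrete illustration of a nearest-neighbour field, the following minimal sketch computes one by exhaustive search on two tiny greyscale images stored as nested lists. The helper names (`patch_distance`, `brute_force_nnf`), the top-left patch anchoring, and the sum-of-squared-differences distance are assumptions made for illustration, not the project's code. This is the slow baseline that PatchMatch is designed to avoid.

```python
def patch_distance(a, b, ay, ax, by, bx, p):
    """Sum of squared differences between two p x p patches (top-left anchored)."""
    total = 0
    for dy in range(p):
        for dx in range(p):
            diff = a[ay + dy][ax + dx] - b[by + dy][bx + dx]
            total += diff * diff
    return total


def brute_force_nnf(a, b, p):
    """Return {(ay, ax): (offset_y, offset_x)} for every full patch of a."""
    ah, aw = len(a), len(a[0])
    bh, bw = len(b), len(b[0])
    nnf = {}
    for ay in range(ah - p + 1):
        for ax in range(aw - p + 1):
            best = None
            # Exhaustively test every patch position in b.
            for by in range(bh - p + 1):
                for bx in range(bw - p + 1):
                    d = patch_distance(a, b, ay, ax, by, bx, p)
                    if best is None or d < best[0]:
                        best = (d, by - ay, bx - ax)
            nnf[(ay, ax)] = (best[1], best[2])
    return nnf


# Image B is image A shifted one pixel to the right (first column is filler).
A = [[0, 1, 2, 3],
     [4, 5, 6, 7],
     [8, 9, 10, 11]]
B = [[9, 0, 1, 2, 3],
     [9, 4, 5, 6, 7],
     [9, 8, 9, 10, 11]]
nnf = brute_force_nnf(A, B, p=2)
```

Because every patch of `A` reappears in `B` one pixel to the right, each entry of `nnf` is the offset `(0, 1)`. All neighbouring patches share the same offset, which is exactly the coherence that PatchMatch exploits.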
Phases of the randomised nearest neighbour algorithm: (a) patches initially have random assignments; (b) the blue patch checks above/green and left/red neighbours to see if they will improve the blue mapping, propagating good matches; (c) the patch searches randomly for improvements in concentric neighbourhoods.
The initialisation phase of the algorithm involves filling the nearest-neighbour field with an initial set of correspondences. This can be done in two ways:
- Random Initialisation: We begin by assigning random coordinates within the possible range of image B. This randomness is uniformly distributed across the full image, ensuring that the search space is broadly covered.
- Prior Information: In some applications, we use prior information to initialise the nearest-neighbour field. For instance, in our in-painting implementation, we start with an initialisation from a coarser level, progressively refining it as the resolution increases.
Here is the pseudocode for a few initialisation variants:

    Function initialize_nnf():
        For each pixel (ay, ax) in the source image:
            If pixel is in border:
                Set NNF[ay, ax] to (ay, ax)
                Set NNF distance[ay, ax] to 0
            Else:
                Randomly select bx, by from the target image
                Set NNF[ay, ax] to (by, bx)
                Set NNF distance[ay, ax] to patch_distance(ax, ay, bx, by)

    Function initialize_nnf_with_other_nnf(other_nnf):
        Upsample the other NNF by a factor of 2
        For each pixel (ay, ax) in the source image:
            If pixel is in border:
                Set NNF[ay, ax] to (ay, ax)
                Set NNF distance[ay, ax] to 0
            Else:
                Get corresponding bx, by from the upsampled other NNF
                Set NNF[ay, ax] to (by, bx)
                Set NNF distance[ay, ax] to patch_distance(ax, ay, bx, by)

    Function initialize_nnf_with_mask(mask):
        For each pixel (ay, ax) in the source image:
            If pixel needs inpainting (mask[ay, ax] == 0) and is not in border:
                Randomly select bx, by from the target image
                Set NNF[ay, ax] to (by, bx)
                Set NNF distance[ay, ax] to patch_distance(ax, ay, bx, by)
            Else:
                Set NNF[ay, ax] to (ay, ax)
                Set NNF distance[ay, ax] to 0

After initialisation, we iteratively improve the nearest-neighbour field by refining the offsets. Each iteration consists of two major steps: Propagation and Random Search. These steps are interleaved at the patch level, meaning that we alternate between propagating good offsets and performing random search to explore nearby offsets for better matches.
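Propagation is a key step where good offset estimates are transferred to neighbouring patches. Specifically, for a given patch at coordinate $(x, y)$, we attempt to improve its offset $f(x, y)$ using the known offsets of its left and upper neighbours, $f(x-1, y)$ and $f(x, y-1)$.
Mathematically, this process can be described as:
$$ f(x, y) \leftarrow \arg\min_{v \in \{ f(x, y),\ f(x-1, y),\ f(x, y-1) \}} D(v) $$
Here, $D(v)$ denotes the patch distance between the patch at $(x, y)$ in $A$ and the patch at $(x, y) + v$ in $B$.
Additionally, to ensure that the propagation is not limited to one direction, the algorithm performs a reverse scan during even iterations. In this reverse pass, offsets from the neighbouring patches $f(x+1, y)$ and $f(x, y+1)$ are considered instead, so that good matches also propagate up and to the left.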
Here is the pseudocode for the propagation step:

    Function propagate(iter_num, ax, ay, x_change, y_change):
        d_best = nnf_dist[ay, ax]  # Current best distance
        If ax - x_change is within bounds:
            (y_prop, x_prop) = nnf[ay, ax - x_change]
            x_prop += x_change
            If x_prop is within bounds:
                improve_guess(ax, ay, d_best, x_prop, y_prop)
        If ay - y_change is within bounds:
            (y_prop, x_prop) = nnf[ay - y_change, ax]
            y_prop += y_change
            If y_prop is within bounds:
                improve_guess(ax, ay, d_best, x_prop, y_prop)

After propagation, the next step is random search. In this phase, we attempt to improve the current offset by exploring nearby offsets. We begin by considering the current offset as a starting point, and then generate candidate offsets by adding random variations. The candidate offsets are computed as:
$$ u_i = v_0 + w \alpha^i R_i $$
where $v_0$ is the current offset, $R_i$ is a uniform random vector in $[-1, 1] \times [-1, 1]$, $w$ is the maximum search radius, and $\alpha$ is a fixed ratio between successive search-window sizes (the paper uses $\alpha = 1/2$ and sets $w$ to the maximum image dimension).
The candidates are examined for $i = 0, 1, 2, \ldots$, so the search radius $w \alpha^i$ decreases exponentially, and the search halts once the radius falls below 1 pixel.
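To make the exponentially shrinking search radius concrete, here is a small sketch that lists the radii visited and draws one candidate offset per radius. The function names `search_radii` and `sample_candidate` are hypothetical, and the halving ratio of 0.5 follows the PatchMatch paper rather than any setting confirmed for this project.

```python
import random


def search_radii(w, alpha=0.5):
    """Radii w * alpha**i for i = 0, 1, 2, ... while the radius is at least 1 pixel."""
    radii = []
    radius = float(w)
    while radius >= 1:
        radii.append(radius)
        radius *= alpha
    return radii


def sample_candidate(v0, radius, rng):
    """Perturb offset v0 = (dy, dx) by a uniform random vector scaled by radius."""
    dy = v0[0] + round(radius * rng.uniform(-1, 1))
    dx = v0[1] + round(radius * rng.uniform(-1, 1))
    return dy, dx


radii = search_radii(64)  # [64.0, 32.0, 16.0, 8.0, 4.0, 2.0, 1.0]
rng = random.Random(0)
candidates = [sample_candidate((5, -3), r, rng) for r in radii]
```

With an initial radius of 64 pixels only seven candidates are tested per patch, which is why random search adds little cost compared with an exhaustive scan.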
Here is the pseudocode for the random search step:

    Function random_search(ax, ay):
        Initialize rs_start as the smaller of rs_max and max(image dimensions)
        Set mag = rs_start
        While mag >= 1:
            (y_best, x_best) = nnf[ay, ax]
            d_best = nnf_dist[ay, ax]
            x_min = max(x_best - mag, 0)
            x_max = min(x_best + mag + 1, bw)
            y_min = max(y_best - mag, 0)
            y_max = min(y_best + mag + 1, bh)
            If x_min == x_max or y_min == y_max:
                Reduce mag by half and continue
            Randomly select x_rand in [x_min, x_max)
            Randomly select y_rand in [y_min, y_max)
            improve_guess(ax, ay, d_best, x_rand, y_rand)
            Reduce mag by half

- Evaluation Criteria: We evaluate these candidates and update the nearest neighbour if any of them provides a smaller patch distance, defined as the $\ell_2$-norm of the difference of RGB values. When a patch overlaps the masked region, the distance is penalised by adding a large value.
- Halting Criteria: We run the above process for a fixed number of iterations (typically 4-5), by which point the NNF has usually converged, meaning that further iterations no longer significantly improve the nearest neighbour assignments.
- Patch Width: The patch width is set to 7 pixels, as this size strikes a balance between capturing meaningful texture patterns and maintaining computational efficiency. Smaller patches might miss broader textures, while larger ones increase computation and risk over-smoothing.
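
The initialisation, propagation, and random search phases described above can be combined into a compact, dependency-free sketch. It is a simplified illustration under stated assumptions, not the project's implementation: it works on greyscale nested lists, anchors patches at their top-left corner, uses a sum-of-squared-differences distance, ignores masks, and stores absolute match coordinates as the pseudocode does. All function names are hypothetical.

```python
import random


def patch_distance(a, b, ay, ax, by, bx, p):
    """Sum of squared differences between the p x p patches at (ay, ax) and (by, bx)."""
    return sum((a[ay + i][ax + j] - b[by + i][bx + j]) ** 2
               for i in range(p) for j in range(p))


def patchmatch(a, b, p=3, iters=4, seed=0):
    """Approximate NNF from a to b; returns (nnf, dist) indexed by patch top-left corner."""
    rng = random.Random(seed)
    ah, aw = len(a) - p + 1, len(a[0]) - p + 1  # valid patch positions in a
    bh, bw = len(b) - p + 1, len(b[0]) - p + 1  # valid patch positions in b

    # Random initialisation: every patch of a points at a random patch of b.
    nnf = [[(rng.randrange(bh), rng.randrange(bw)) for _ in range(aw)] for _ in range(ah)]
    dist = [[patch_distance(a, b, y, x, *nnf[y][x], p) for x in range(aw)] for y in range(ah)]

    def improve(y, x, by, bx):
        """Accept candidate (by, bx) only if it is in bounds and strictly better."""
        if 0 <= by < bh and 0 <= bx < bw:
            d = patch_distance(a, b, y, x, by, bx, p)
            if d < dist[y][x]:
                nnf[y][x], dist[y][x] = (by, bx), d

    for it in range(iters):
        # Passes are 0-based here: pass 0 scans forward, pass 1 scans in reverse, etc.
        step = 1 if it % 2 == 0 else -1
        ys = range(ah) if step == 1 else range(ah - 1, -1, -1)
        xs = range(aw) if step == 1 else range(aw - 1, -1, -1)
        for y in ys:
            for x in xs:
                # Propagation: reuse the already-visited neighbours' matches, shifted by one.
                if 0 <= x - step < aw:
                    ny, nx = nnf[y][x - step]
                    improve(y, x, ny, nx + step)
                if 0 <= y - step < ah:
                    ny, nx = nnf[y - step][x]
                    improve(y, x, ny + step, nx)
                # Random search: sample around the current best with a halving radius.
                mag = max(bh, bw)
                while mag >= 1:
                    cy, cx = nnf[y][x]
                    improve(y, x,
                            rng.randint(cy - mag, cy + mag),
                            rng.randint(cx - mag, cx + mag))
                    mag //= 2
    return nnf, dist


# Usage: A is an 8 x 8 crop of a random 12 x 12 image B.
rng = random.Random(1)
B = [[rng.randrange(256) for _ in range(12)] for _ in range(12)]
A = [row[2:10] for row in B[3:11]]
nnf0, dist0 = patchmatch(A, B, iters=0)  # random initialisation only
nnf4, dist4 = patchmatch(A, B, iters=4)  # same initialisation, then 4 passes
total0 = sum(map(sum, dist0))
total4 = sum(map(sum, dist4))
```

Since a candidate is accepted only when it lowers the stored distance, `total4` can never exceed `total0`; how close it gets to zero depends on the image content and the number of passes.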
We have implemented image in-painting using PatchMatch's Nearest Neighbour Field algorithm. It operates through a multi-resolution coarse-to-fine refinement strategy, incorporating Expectation-Maximisation (EM) to iteratively update the masked regions.
The algorithm begins with the initialisation of inputs: the image (a 3D array representing pixel colours), the mask (a binary 2D array marking known and missing regions), and hyper-parameters like the patch size (patch_w) and maximum PatchMatch iterations (max_pm_iters).
The mask ensures that in-painting is confined to the specified missing regions. A key step in initialisation is constructing an image pyramid, which stores progressively smaller versions of the image and mask for coarse-to-fine processing. This multi-scale approach enables structural and texture consistency by addressing global details at lower resolutions and refining finer details at higher resolutions.
The image and mask are repeatedly downsampled by a factor (e.g., 2) until the dimensions of the smallest image are close to the patch size. During downsampling, the _downsample_img_and_mask method applies Gaussian smoothing using a custom kernel.
This reduces noise and ensures smooth transitions between resolutions. Missing regions in the mask are proportionally propagated by averaging weighted contributions from neighbouring pixels. The pyramid serves as a foundation for coarse-to-fine in-painting, starting from the smallest (coarsest) resolution.
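The pyramid construction can be sketched as follows. This is a simplified illustration, not the `_downsample_img_and_mask` method itself: it swaps the Gaussian kernel for a plain 2x2 box filter, works on greyscale nested lists, assumes a mask value of 1 marks a known pixel and 0 a missing one, and stops when halving again would make the image smaller than the patch width. The names `downsample` and `build_pyramid` are hypothetical.

```python
def downsample(img, mask):
    """Halve a greyscale image and its mask with a mask-aware 2x2 box filter."""
    h, w = len(img) // 2, len(img[0]) // 2
    small_img = [[0.0] * w for _ in range(h)]
    small_mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            known = [img[2 * y + dy][2 * x + dx]
                     for dy in (0, 1) for dx in (0, 1)
                     if mask[2 * y + dy][2 * x + dx] == 1]
            if known:  # average only the known pixels of the 2x2 block
                small_img[y][x] = sum(known) / len(known)
                small_mask[y][x] = 1
    return small_img, small_mask


def build_pyramid(img, mask, patch_w):
    """Return a list of (image, mask) levels ordered from finest to coarsest."""
    levels = [(img, mask)]
    while min(len(levels[-1][0]), len(levels[-1][0][0])) // 2 >= patch_w:
        levels.append(downsample(*levels[-1]))
    return levels


# Usage: a 16 x 16 gradient image with a 4 x 4 hole.
img = [[float(x + y) for x in range(16)] for y in range(16)]
mask = [[0 if 4 <= y < 8 and 4 <= x < 8 else 1 for x in range(16)] for y in range(16)]
pyramid = build_pyramid(img, mask, patch_w=3)
sizes = [len(level_img) for level_img, _ in pyramid]  # [16, 8, 4]
```

A coarse pixel stays missing only when all four of its source pixels are missing, so the hole shrinks in step with the image and is still present at the coarsest level.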
The Expectation-Maximisation (EM) process alternates between estimating pixel values (E-step) and updating them (M-step):
- E-Step: Weighted contributions from nearby patches are aggregated to form "votes" for the in-painted region. For each pixel, the weight $w$ of a patch at coordinates $(a_x, a_y)$ in the source and $(b_x, b_y)$ in the target represents the similarity and is calculated as:
  $$ w = 1 - \frac{d(P_A, P_B)}{\text{MAX\_PATCH\_DIFF}} $$
  These weights are normalised, and pixel values are updated based on the contributions of overlapping patches.
- M-Step: Pixel values in missing regions are updated by averaging the weighted votes. This step ensures a gradual refinement of missing areas, balancing contributions from the surrounding context.
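
As a small worked example of the weighting and averaging, the sketch below turns patch distances into vote weights and combines two votes for a single pixel. The value used for the maximum patch difference is an arbitrary number chosen for illustration, and the function names are hypothetical.

```python
def vote_weight(distance, max_patch_diff):
    """Similarity weight w = 1 - d / MAX_PATCH_DIFF (1 for a perfect match)."""
    return 1.0 - distance / max_patch_diff


def weighted_average(votes):
    """Combine (pixel_value, weight) votes into one normalised estimate."""
    total_weight = sum(w for _, w in votes)
    if total_weight <= 0:
        return None  # no usable votes for this pixel
    return sum(v * w for v, w in votes) / total_weight


MAX_DIFF = 100.0  # illustrative value only
w_good = vote_weight(0.0, MAX_DIFF)   # perfect match  -> weight 1.0
w_poor = vote_weight(50.0, MAX_DIFF)  # mediocre match -> weight 0.5
pixel = weighted_average([(90, w_good), (30, w_poor)])  # (90*1.0 + 30*0.5) / 1.5
```

The better-matching patch pulls the result towards its own value: the estimate is 70 rather than the unweighted mean of 60.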
Here is the pseudocode for the EM steps:

    function E_STEP(nnf, src_image, mask, upscale=False):
        initialize votes as zeros array of shape (height, width, 4)
        for each pixel (ay, ax) in source image:
            (by, bx) = nnf[ay, ax]  // Nearest neighbor coordinates
            weight = 1 - (patch_distance(ax, ay, bx, by) / MAX_PATCH_DIFF)
            if upscale:
                // Handle upscaled patches (2x2 neighborhood)
                votes += weighted_copy(2*ax, 2*ay, 2*bx, 2*by)
                votes += weighted_copy(2*ax+1, 2*ay, 2*bx+1, 2*by)
                votes += weighted_copy(2*ax, 2*ay+1, 2*bx, 2*by+1)
                votes += weighted_copy(2*ax+1, 2*ay+1, 2*bx+1, 2*by+1)
            else:
                votes += weighted_copy(ax, ay, bx, by)
        return votes

    function M_STEP(target_image, votes):
        // votes[..., 0:3] hold weighted colour sums; votes[..., 3] holds the summed weights
        initialize new_target as a copy of target_image
        for each pixel (by, bx) in target image:
            if votes[by, bx, 3] > 0:  // Normalize only if votes exist
                new_target[by, bx, 0] = votes[by, bx, 0] / votes[by, bx, 3]  // Blue
                new_target[by, bx, 1] = votes[by, bx, 1] / votes[by, bx, 3]  // Green
                new_target[by, bx, 2] = votes[by, bx, 2] / votes[by, bx, 3]  // Red
        return new_target

The coarse-to-fine refinement process begins at the coarsest pyramid level, where global structures are addressed. In-painting starts by initialising missing regions using PatchMatch and iteratively refining them using the EM steps. Once the coarsest level is complete, the in-painted result is upsampled to the next resolution using bilinear interpolation, ensuring continuity. The NNF from the coarser level is propagated to initialise PatchMatch at the finer level, reducing computational overhead and preserving structural consistency. This process repeats for all levels until the highest resolution is reached, where fine details are reconstructed.
Here is the pseudocode for this coarse-to-fine loop:

    function inpaint(image_pyramid, mask_pyramid, patch_size, max_pm_iters):
        initialize inpainted_image = None
        for level from coarsest to finest:
            src_image, mask = image_pyramid[level], mask_pyramid[level]
            if level == coarsest:
                initialize inpainted_image and nearest neighbor fields (NNFs)
            else:
                update NNFs using results from the previous level
            for em_step in range(max_em_steps(level)):
                votes = compute_votes(src_image, mask, NNFs, upscale)
                inpainted_image = update_image_from_votes(inpainted_image, votes)
                update_NNFs_with_inpainted_image(NNFs, inpainted_image)
            if level is not the finest:
                upscale inpainted_image for the next level
        return inpainted_image

At the finest resolution, the in-painted image contains reconstructed regions that are visually consistent with the surrounding context. PatchMatch supplies the texture synthesis and EM supplies the iterative refinement, which together help the output retain structural fidelity and realistic textures. In effect, the process works towards a fill that keeps the patch distance to known image content low at every scale, filling the missing regions while preserving global coherence and local detail.
These are original input images
Computing the NNF from two images A and B
Here is the detailed NNF output of the algorithm:
- $NNF(y)$ represents the offset, in pixels, along the $y$ axis
- $NNF(x)$ represents the offset, in pixels, along the $x$ axis
- $NNF(Dist)$ represents the magnitude of the offset, combining $NNF(x)$ and $NNF(y)$
- $NNF(RGB\ dist)$ represents the pixel intensity difference between each patch in $A$ and its match in $B$
Similarly grouped regions in the above plots indicate the coherence of the images.
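The plotted quantities can be derived from an NNF that stores absolute match coordinates, as in the sketch below. It assumes that $NNF(y)$ and $NNF(x)$ are offsets relative to each pixel's own position and that $NNF(Dist)$ is their Euclidean magnitude; the function name `nnf_components` is hypothetical.

```python
import math


def nnf_components(nnf):
    """nnf[ay][ax] = (by, bx) -> (offset_y, offset_x, magnitude) as 2-D lists."""
    off_y = [[by - ay for (by, bx) in row] for ay, row in enumerate(nnf)]
    off_x = [[bx - ax for ax, (by, bx) in enumerate(row)] for row in nnf]
    mag = [[math.hypot(dy, dx) for dy, dx in zip(row_y, row_x)]
           for row_y, row_x in zip(off_y, off_x)]
    return off_y, off_x, mag


# Three of the four pixels share the offset (3, 4); the last one is an outlier.
example_nnf = [[(3, 4), (3, 5)],
               [(4, 4), (0, 1)]]
off_y, off_x, mag = nnf_components(example_nnf)
```

Here three neighbouring pixels map to the same offset and therefore the same magnitude of 5, which is what a uniformly coloured region in the plots corresponds to.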
The image $A$ can be reconstructed from $B$ using the NNF in two ways:
- Directly replacing each pixel value with the centre of its NNF patch in $B$
- Replacing the patch around each pixel centre and averaging the pixel intensities over all overlapping patches
Reconstruction using direct pixel transfer method
Reconstruction using averaged patch method
The direct pixel replacement reconstruction method (1) results in some inconsistent spots compared to the averaging method (2). The averaging method provides much better results for the reconstruction task with smoother blending of patches.
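The two reconstruction methods can be sketched on nested-list greyscale images as follows. The sketch assumes an NNF that stores, for each patch top-left corner in $A$, the top-left corner of its match in $B$; the function names are hypothetical and the code is an illustration rather than the project's implementation.

```python
def reconstruct_direct(b, nnf, p):
    """Method 1: copy the centre pixel of each matched patch (interior pixels only)."""
    c = p // 2
    return [[b[by + c][bx + c] for (by, bx) in row] for row in nnf]


def reconstruct_average(b, nnf, p):
    """Method 2: paste every matched patch and average the overlapping contributions."""
    ah, aw = len(nnf) + p - 1, len(nnf[0]) + p - 1
    total = [[0.0] * aw for _ in range(ah)]
    count = [[0] * aw for _ in range(ah)]
    for ay, row in enumerate(nnf):
        for ax, (by, bx) in enumerate(row):
            for i in range(p):
                for j in range(p):
                    total[ay + i][ax + j] += b[by + i][bx + j]
                    count[ay + i][ax + j] += 1
    return [[t / n for t, n in zip(t_row, n_row)] for t_row, n_row in zip(total, count)]


# Usage: A is the 4 x 4 crop of B starting at (1, 1), so the exact NNF is known.
B = [[10 * y + x for x in range(5)] for y in range(5)]
A = [row[1:5] for row in B[1:5]]
exact_nnf = [[(ay + 1, ax + 1) for ax in range(2)] for ay in range(2)]
direct = reconstruct_direct(B, exact_nnf, p=3)
averaged = reconstruct_average(B, exact_nnf, p=3)
```

With an exact NNF both methods reproduce $A$. With an approximate NNF, a single bad match corrupts one pixel in the direct method, whereas averaging dilutes it across the overlapping patches, which is consistent with the smoother result reported above.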
The mask for an input image can be created with the mask creator tool by running get_mask.py.
Mask Creation
The in-painting algorithm starts by downscaling the image and mask multiple times. At each level, the Expectation-Maximisation step is run multiple times to construct the target image. The reconstructed target image is passed as the initialisation to the next finer level, and the process is repeated until the final-resolution image is obtained.
Algorithm overview for in-painting (watermark removal)
Here are the results for some more images:
In-painting for Object Removal
In-painting for Tattoo Removal
In-painting using both propagation and random search.
We compare the effectiveness of the propagation and random search steps in the NNF computation for the task of image in-painting.
The three cases we test are:
- Both Propagate and Random Search
- Only Random Search (No propagate)
- Only Propagate (No random search)
We evaluate the three cases on one image:
Ablation Results
There is almost no visual difference between the three images, implying that each step on its own produces results good enough that combining them does not improve the final output. The only difference is in the time taken: propagation alone takes 35 seconds, compared with 54 seconds for random search alone and 63 seconds for both. Thus, for this image, we can achieve comparable results in a little over half the time by keeping only the Propagation step and removing the Random Search step.
We confirm this hypothesis by evaluating another image:
Ablation Results
We notice similar results here. Only Propagate (no Random Search) is the best choice among the three options in terms of time taken.
Image reshuffling algorithms make it possible to grab portions of the image and move them around – the computer automatically synthesises the remainder of the image so as to resemble the original while respecting the moved regions.
To perform image reshuffling using image in-painting, we first copy the portion of the image we want to move and paste it onto the original image to obtain the edited image. We then define a mask that covers the original object, so that it is removed, as well as the edges between the newly placed object and the background, so that they are smoothed. This mask is applied to the edited image, and in-painting the masked regions yields the final reshuffled image.
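The preparation of the edited image and mask can be sketched as below. This is an illustrative sketch under several assumptions: the moved region is an axis-aligned box that stays inside the image after shifting, the mask uses 0 for pixels to in-paint and 1 for known pixels, and the seam is a thin band just outside the pasted box. The function name `reshuffle_inputs` and its parameters are hypothetical.

```python
def reshuffle_inputs(img, box, shift, seam=1):
    """Move the box (y0, x0, y1, x1) by shift (dy, dx); return (edited, mask).

    mask is 0 where in-painting is required and 1 elsewhere (assumed convention).
    """
    y0, x0, y1, x1 = box
    dy, dx = shift
    h, w = len(img), len(img[0])
    edited = [row[:] for row in img]
    mask = [[1] * w for _ in range(h)]
    # Mark a seam band around the pasted box so that its edges get blended.
    for y in range(max(y0 + dy - seam, 0), min(y1 + dy + seam, h)):
        for x in range(max(x0 + dx - seam, 0), min(x1 + dx + seam, w)):
            mask[y][x] = 0
    # Mark the original location of the object as a hole.
    for y in range(y0, y1):
        for x in range(x0, x1):
            mask[y][x] = 0
    # Paste the object and mark its pixels as known (done last so that it wins).
    for y in range(y0, y1):
        for x in range(x0, x1):
            edited[y + dy][x + dx] = img[y][x]
            mask[y + dy][x + dx] = 1
    return edited, mask


# Usage: move a 2 x 2 object from (1, 1) to (3, 5) in a 6 x 8 image.
img = [[10 * y + x for x in range(8)] for y in range(6)]
edited, mask = reshuffle_inputs(img, box=(1, 1, 3, 3), shift=(2, 4))
holes = sum(row.count(0) for row in mask)
```

The resulting `edited` image and `mask` would then be handed to the in-painting routine, which fills both the vacated area and the seam around the pasted object.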
Defining the edited image and the masked image.
Here is the level-wise construction of the in-painted image:
Image reshuffling
Final Reshuffled image
GitHub - ZQPei/patchmatch_inpainting: Implementation of PatchMatch for image inpainting in cpp
PatchMatch - A Randomized Correspondence Algorithm for Structural Image Editing















