In this work, we developed an explorable Super-Resolution model that generates a high-resolution image which is both consistent with the original low-resolution image and faithful to the semantic content desired by the user.
Control over the image exploration is obtained through a text prompt, which is processed for its semantic information by the CLIP network.
We investigated several methods of performing this task. We first attempted to extend an existing explorable Super-Resolution network by optimizing over the semantic information extracted from the text. Next, we trained a dedicated network for the task of CLIP-encoding-consistent explorable Super-Resolution. To gain insight into the nature of the CLIP encodings used by both methods, we explored the underlying information captured by the CLIP network.
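As a minimal sketch of the first approach (not the exact implementation used in this work), the following optimizes a control input of a pretrained explorable Super-Resolution network so that the CLIP embedding of its output approaches the embedding of the text prompt, while a downsampling-consistency term keeps the output faithful to the low-resolution input. The `sr_model` interface, its control input `z`, and the `z_channels` attribute are hypothetical placeholders; the CLIP calls follow the open-source OpenAI `clip` package.

```python
# Sketch of CLIP-guided exploration: optimize a control signal z of a
# pretrained explorable SR network so that the SR output matches a text
# prompt under CLIP, subject to low-resolution consistency.
# NOTE: sr_model, its z input, and z_channels are assumed placeholders.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # full precision for stable gradients

# Normalization constants used by CLIP's image preprocessing
# (assumes SR outputs lie in [0, 1]).
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073],
                         device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711],
                        device=device).view(1, 3, 1, 1)

def clip_guided_exploration(sr_model, lr_image, prompt, steps=200, lam=10.0):
    # Encode the prompt once; its embedding is the semantic target.
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        text_emb = F.normalize(clip_model.encode_text(tokens), dim=-1)

    # z is the exploration control signal (shape is model-specific; assumed here).
    z = torch.zeros(1, sr_model.z_channels, *lr_image.shape[-2:],
                    device=device, requires_grad=True)
    opt = torch.optim.Adam([z], lr=1e-2)

    for _ in range(steps):
        hr = sr_model(lr_image, z)  # explorable SR forward pass (assumed API)

        # Semantic loss: cosine distance between CLIP image and text embeddings.
        hr_224 = F.interpolate(hr, size=224, mode="bicubic", align_corners=False)
        img_emb = F.normalize(
            clip_model.encode_image((hr_224 - CLIP_MEAN) / CLIP_STD), dim=-1)
        semantic_loss = 1.0 - (img_emb * text_emb).sum()

        # Consistency loss: the SR output must downsample back to the LR input.
        scale = lr_image.shape[-1] / hr.shape[-1]
        consistency_loss = F.mse_loss(
            F.interpolate(hr, scale_factor=scale, mode="bicubic",
                          align_corners=False),
            lr_image)

        loss = semantic_loss + lam * consistency_loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    return sr_model(lr_image, z).detach()
```

In such a setup, the weight `lam` trades off fidelity to the low-resolution input against agreement with the prompt; the second approach described above instead amortizes this optimization by training a network that accepts the CLIP encoding directly as a conditioning input.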