Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image–feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.
Joint Image-Representation Generation: During training, given an image latent $\mathbf{x}_0$ (the VAE encoding of the image) and its DINOv2 features $\mathbf{z}_0$, we define a joint forward diffusion process:
\[
\begin{aligned}
\textcolor{teal}{\mathbf{x}_t} = \sqrt{\bar{\alpha}_t}\,\textcolor{teal}{\mathbf{x}_0} + \sqrt{1-\bar{\alpha}_t}\,\textcolor{teal}{\boldsymbol{\epsilon}_x}, \qquad
\textcolor{purple}{\mathbf{z}_t} = \sqrt{\bar{\alpha}_t}\,\textcolor{purple}{\mathbf{z}_0} + \sqrt{1-\bar{\alpha}_t}\,\textcolor{purple}{\boldsymbol{\epsilon}_z},
\end{aligned}
\]
where $\textcolor{teal}{\boldsymbol{\epsilon}_x}, \textcolor{purple}{\boldsymbol{\epsilon}_z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ are sampled independently. The diffusion model $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{z}_t, t)$ takes as input $\textcolor{teal}{\mathbf{x}_t}$ and $\textcolor{purple}{\mathbf{z}_t}$, along with the timestep $t$, and jointly predicts the noise for both inputs. Specifically, it produces two separate predictions: $\boldsymbol{\epsilon}^x_\theta(\textcolor{teal}{\mathbf{x}_t}, \textcolor{purple}{\mathbf{z}_t}, t)$ for the image-latent noise $\textcolor{teal}{\boldsymbol{\epsilon}_x}$, and $\boldsymbol{\epsilon}^z_\theta(\textcolor{teal}{\mathbf{x}_t}, \textcolor{purple}{\mathbf{z}_t}, t)$ for the visual-representation noise $\textcolor{purple}{\boldsymbol{\epsilon}_z}$. The training objective combines both predictions:
\[
\mathcal{L}_{\text{joint}} = \underset{\textcolor{teal}{\mathbf{x}_0}, \textcolor{purple}{\mathbf{z}_0}, t}{\mathbb{E}} \Big[ \Vert \boldsymbol{\epsilon}^x_\theta(\textcolor{teal}{\mathbf{x}_t}, \textcolor{purple}{\mathbf{z}_t}, t) - \textcolor{teal}{\boldsymbol{\epsilon}_x} \Vert_2^2 + \lambda_z \Vert \boldsymbol{\epsilon}^z_\theta(\textcolor{teal}{\mathbf{x}_t}, \textcolor{purple}{\mathbf{z}_t}, t) - \textcolor{purple}{\boldsymbol{\epsilon}_z} \Vert_2^2 \Big],
\]
where $\lambda_z$ balances the image and representation denoising terms.
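The joint objective can be sketched compactly in code. The PyTorch snippet below is a minimal illustration rather than the authors' implementation: the two-output model interface `eps_model`, the precomputed cumulative schedule `alpha_bar`, and the default `lambda_z` value are assumed names and settings.

```python
import torch

def joint_diffusion_loss(eps_model, x0, z0, alpha_bar, lambda_z=0.5):
    """x0: VAE image latents (B, ...), z0: DINOv2 features (B, ...),
    alpha_bar: (T,) cumulative noise schedule on the same device."""
    B = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x0.device)
    a_x = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))  # broadcast over latent dims
    a_z = alpha_bar[t].view(B, *([1] * (z0.dim() - 1)))  # broadcast over feature dims

    eps_x, eps_z = torch.randn_like(x0), torch.randn_like(z0)  # independent Gaussian noise
    x_t = a_x.sqrt() * x0 + (1 - a_x).sqrt() * eps_x           # noisy image latent
    z_t = a_z.sqrt() * z0 + (1 - a_z).sqrt() * eps_z           # noisy representation

    pred_x, pred_z = eps_model(x_t, z_t, t)                    # joint noise prediction
    loss_x = (pred_x - eps_x).pow(2).mean()
    loss_z = (pred_z - eps_z).pow(2).mean()
    return loss_x + lambda_z * loss_z
```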
Representation Guidance: Joint modeling allows us to treat the generated noisy representation as a condition. During inference, we modify the posterior distribution to $\hat{p}_\theta(\mathbf{x}_t, \mathbf{z}_t) \propto p_\theta(\mathbf{x}_t)\, p_\theta(\mathbf{z}_t \vert \mathbf{x}_t)^{w_r}$, whose score with respect to $\mathbf{x}_t$ is
\begin{align}
\nabla_{\!\mathbf{x}_t} \log \hat{p}_\theta(\mathbf{x}_t, \mathbf{z}_t) =& \; \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{x}_t) + w_r\, \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{z}_t \vert \mathbf{x}_t) \\
=& \; \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{x}_t) + w_r \big( \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{x}_t, \mathbf{z}_t) - \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{x}_t) \big).
\end{align}
We implement this representation-guided prediction $\boldsymbol{\hat{\epsilon}}_\theta(\mathbf{x}_t, \mathbf{z}_t, t)$ at each denoising step as follows:
\begin{equation}
\boldsymbol{\hat{\epsilon}}_\theta(\mathbf{x}_t, \mathbf{z}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + w_r\left(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{z}_t, t) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right).
\end{equation}
Both the image-feature model and the image-only model are trained together: with probability $p_{\text{drop}}$, we zero out $\mathbf{z}_t$ (setting $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{0}, t)$) and disable the visual-representation denoising loss.
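At inference time, the guidance rule above reduces to an interpolation between two forward passes, with the zeroed representation playing the role of the unconditional branch. A minimal sketch, assuming the same two-output `eps_model` interface as before; the default `w_r` and `p_drop` values are placeholders, not the paper's settings.

```python
import torch

def guided_eps(eps_model, x_t, z_t, t, w_r=1.5):
    """Representation-guided noise prediction for one denoising step."""
    eps_cond, _ = eps_model(x_t, z_t, t)                      # eps_theta(x_t, z_t, t)
    eps_uncond, _ = eps_model(x_t, torch.zeros_like(z_t), t)  # eps_theta(x_t, 0, t): image-only branch
    return eps_uncond + w_r * (eps_cond - eps_uncond)

def maybe_drop_representation(z_t, p_drop=0.1):
    """Training-time dropout of the representation stream."""
    if torch.rand(()) < p_drop:
        # Zero the representation and signal that the z-denoising loss is disabled,
        # so the model also learns the image-only prediction used above.
        return torch.zeros_like(z_t), False
    return z_t, True
```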
We compare FID between vanilla DiT or SiT models and those trained with ReDi. Without classifier-free guidance (CFG), ReDi reaches FID = 7.5 at 400K iterations, outperforming the vanilla model's performance at 7M iterations. Importantly, ReDi provides a greater performance boost than REPA, reaching an FID of 5.7 at 700K iterations, whereas REPA reaches an FID of 5.9 at 4M iterations. Moreover, with classifier-free guidance, SiT-XL/2 with ReDi outperforms recent diffusion models in fewer epochs, as well as REPA with SiT-XL/2.
Dimensionality reduction ablation. We observe that increasing the number of components improves performance up to $r=8$, beyond which further components begin to degrade generation quality. This suggests an optimal intermediate subspace in which the compressed visual features retain sufficient expressivity to guide generation without dominating model capacity.
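For concreteness, one standard way to obtain such an $r$-dimensional subspace of DINOv2 features is a PCA fit on a sample of training tokens; the sketch below is an illustrative assumption rather than the paper's exact recipe, with `fit_pca` and `project` as hypothetical helpers.

```python
import torch

def fit_pca(features, r=8):
    """features: (N, D) DINOv2 tokens gathered from a training subset."""
    mean = features.mean(dim=0, keepdim=True)
    centered = features - mean
    # Top-r right singular vectors span the r-dimensional subspace.
    _, _, Vh = torch.linalg.svd(centered, full_matrices=False)
    return mean, Vh[:r]                        # (1, D), (r, D)

def project(features, mean, basis):
    """Compress features to r components before joint diffusion."""
    return (features - mean) @ basis.T         # (N, r)
```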
Merged Tokens vs. Separate Tokens. While both approaches achieve comparable performance gains, keeping the DINOv2 features as separate tokens (SP) yields slightly better results. This advantage comes at a significant computational cost: SP doubles the transformer's input sequence length by introducing $256$ additional DINOv2 tokens, resulting in approximately $2\times$ higher compute during both training and inference. The merged-token (MR) strategy, by contrast, maintains the original sequence length while delivering similar performance improvements, thereby preserving computational efficiency as measured by throughput. The shape-level sketch after this paragraph illustrates the two layouts.
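In the sketch below, the dimensions, the linear projections, and the concatenation-based merge operator are illustrative assumptions, not the paper's exact architecture; it only makes the sequence-length difference between SP and MR concrete.

```python
import torch
import torch.nn as nn

B, N, d_model, r = 2, 256, 1152, 8
latent_tokens = torch.randn(B, N, d_model)   # noisy image-latent tokens
feat_tokens = torch.randn(B, N, r)           # compressed noisy DINOv2 tokens

# Separate tokens (SP): project the features to the model width and append
# them along the sequence axis -> 2N tokens, roughly doubling attention cost.
sp_proj = nn.Linear(r, d_model)
sp_input = torch.cat([latent_tokens, sp_proj(feat_tokens)], dim=1)    # (B, 512, d_model)

# Merged tokens (MR): fuse the two streams per position (here via channel
# concatenation + linear), keeping the sequence length at N tokens.
mr_merge = nn.Linear(d_model + r, d_model)
mr_input = mr_merge(torch.cat([latent_tokens, feat_tokens], dim=-1))  # (B, 256, d_model)
```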
@article{kouzelis2025boosting,
title={Boosting Generative Image Modeling via Joint Image-Feature Synthesis},
author={Kouzelis, Theodoros and Karypidis, Efstathios and Kakogeorgiou, Ioannis and Gidaris, Spyros and Komodakis, Nikos},
journal={arXiv preprint arXiv:2504.16064},
year={2025}
}