ReDi: Boosting Generative Image Modeling via Joint Image-Feature Synthesis


1. Archimedes/Athena RC | 2. valeo.ai | 3. National Technical University of Athens | 4. University of Crete
5. IACM-Forth | 6. IIT, NCSR "Demokritos"

Overview of ReDi. Our generative image modeling framework bridges the gap between generative modeling and representation learning by leveraging a diffusion model that jointly captures low-level image details (via VAE latents) and high-level semantic features (via DINOv2). Trained to generate coherent image–feature pairs from pure noise, this unified latent-semantic dual-space diffusion approach significantly boosts both generative quality and training convergence speed.

Abstract

Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image–feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.

Method

Overview of the ReDi pipeline. Given an input image, the VAE latent and the principal components of DINOv2 are extracted. Both modalities are noised and fused into a joint token sequence, which is given as input to a DiT or SiT backbone.
Motivation: Recent work by Yu et al. (2025) (REPA) demonstrates that improving the semantic quality of diffusion features through distillation of pretrained self-supervised representations leads to better generation quality and faster convergence. Their results establish a clear connection between representation learning and generative performance. Motivated by these insights, we investigate whether a more effective approach to leveraging representation learning can further enhance image generation performance. Rather than aligning diffusion features with external representations via distillation, we propose to jointly model:

  • Precise low-level features via the VAE latents
  • Semantic high-level features from DINOv2

Joint Image-Representation Generation: During training, given an image $\mathbf{x}_0$ and its DINOv2 features $\mathbf{z}_0$, we define a joint forward diffusion process:

\[ \textcolor{teal}{\mathbf{x}_t} = \sqrt{\bar{\alpha}_t}\textcolor{teal}{\mathbf{x}_0} + \sqrt{1-\bar{\alpha}_t} \textcolor{teal}{\boldsymbol{\epsilon}_x}, \quad \textcolor{purple}{\mathbf{z}_t} = \sqrt{\bar{\alpha}_t}\textcolor{purple}{\mathbf{z}_0} + \sqrt{1-\bar{\alpha}_t} \textcolor{purple}{\boldsymbol{\epsilon}_z}. \]

The diffusion model $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{z}_t, t)$ takes as input $\mathbf{x}_t$ and $\mathbf{z}_t$, along with the timestep $t$, and jointly predicts the noise for both inputs. Specifically, it produces two separate predictions: $\boldsymbol{\epsilon}^x_\theta(\mathbf{x}_t, \mathbf{z}_t, t)$ for the image-latent noise $\boldsymbol{\epsilon}_x$, and $\boldsymbol{\epsilon}^z_\theta(\mathbf{x}_t, \mathbf{z}_t, t)$ for the visual-representation noise $\boldsymbol{\epsilon}_z$. The training objective combines both predictions:

\[ \mathcal{L}_{\text{joint}} = \underset{\textcolor{teal}{\mathbf{x}_0}, \textcolor{purple}{\mathbf{z}_0}, t}{\mathbb{E}} \Big[ \Vert \boldsymbol{\epsilon}^x_\theta(\textcolor{teal}{\mathbf{x}_t}, \textcolor{purple}{\mathbf{z}_t}, t) - \textcolor{teal}{\boldsymbol{\epsilon}_x} \Vert_2^2 + \lambda_z \Vert \boldsymbol{\epsilon}^z_\theta(\textcolor{teal}{\mathbf{x}_t},\textcolor{purple}{\mathbf{z}_t}, t) - \textcolor{purple}{\boldsymbol{\epsilon}_z} \Vert_2^2 \Big]. \]
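The following is a minimal PyTorch-style sketch of one training step for this joint objective. The model interface, the `alpha_bar` schedule tensor, and the `lambda_z` default are illustrative assumptions, not the reference implementation.

```python
# Minimal sketch of a ReDi-style joint diffusion training step (illustrative, not official code).
import torch

def joint_diffusion_loss(model, x0, z0, alpha_bar, lambda_z=0.5):
    """x0: VAE image latents, z0: (PCA-reduced) DINOv2 features, alpha_bar: (T,) noise schedule."""
    B = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x0.device)
    a_x = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))          # broadcast over spatial dims
    a_z = alpha_bar[t].view(B, *([1] * (z0.dim() - 1)))

    eps_x, eps_z = torch.randn_like(x0), torch.randn_like(z0)    # independent noise per modality
    x_t = a_x.sqrt() * x0 + (1 - a_x).sqrt() * eps_x             # noised image latents
    z_t = a_z.sqrt() * z0 + (1 - a_z).sqrt() * eps_z             # noised semantic features

    pred_x, pred_z = model(x_t, z_t, t)                          # two noise predictions
    loss_x = (pred_x - eps_x).pow(2).mean()
    loss_z = (pred_z - eps_z).pow(2).mean()
    return loss_x + lambda_z * loss_z                            # L_joint
```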


Fusion of Image and Representation Tokens
We explore two approaches:
Merged Tokens: The tokens are summed channel-wise:

  • $\mathbf{h}_t = \mathbf{x}_t \mathbf{W}_{\text{emb}}^x + \mathbf{z}_t \mathbf{W}_{\text{emb}}^z \in \mathbb{R}^{L \times C_d}$

Separate Tokens: Tokens are concatenated along the sequence dimension:
  • $\mathbf{h}_t = [\mathbf{x}_t \mathbf{W}_{\text{emb}}^x \,, \, \mathbf{z}_t \mathbf{W}_{\text{emb}}^z] \in \mathbb{R}^{2L \times C_d},$
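A minimal sketch of both fusion strategies, assuming the embedding matrices $\mathbf{W}_{\text{emb}}^x$ and $\mathbf{W}_{\text{emb}}^z$ are realized as linear layers; the class and argument names are ours, not from the official code.

```python
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """Embed noisy image latents x_t and features z_t into transformer tokens.

    Assumed shapes: x_t is (B, L, C_x), z_t is (B, L, C_z_pca).
    """
    def __init__(self, c_x, c_z, c_d, mode="merged"):
        super().__init__()
        self.emb_x = nn.Linear(c_x, c_d)   # W_emb^x
        self.emb_z = nn.Linear(c_z, c_d)   # W_emb^z
        self.mode = mode

    def forward(self, x_t, z_t):
        hx, hz = self.emb_x(x_t), self.emb_z(z_t)
        if self.mode == "merged":          # Merged Tokens: channel-wise sum, sequence length stays L
            return hx + hz
        return torch.cat([hx, hz], dim=1)  # Separate Tokens: concatenate along sequence, length 2L
```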


Dimensionality-Reduced DINOv2
In practice, the channel dimension of visual representations ($C_z$) significantly exceeds that of image latents ($C_x$), i.e., $C_z \gg C_x$. We empirically observe that this imbalance degrades performance, as the model disproportionately allocates capacity to visual representations at the expense of image latents. To address this, we apply Principal Component Analysis (PCA) to reduce the dimensionality of $\mathbf{z}_0$ from $C_z$ to $C^{\text{pca}}_z$.
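A sketch of how such a PCA projection could be computed with `torch.pca_lowrank`; the helper names, and the assumption that the components are fit offline over a matrix of DINOv2 patch features, are ours.

```python
import torch

def fit_pca(features, k):
    """Fit PCA on an (N, C_z) matrix of DINOv2 patch features; keep the top-k components."""
    mean = features.mean(dim=0, keepdim=True)
    # torch.pca_lowrank centers internally; the columns of V are the principal directions.
    _, _, V = torch.pca_lowrank(features, q=k)
    return mean, V[:, :k]

def project(features, mean, components):
    """Project (..., C_z) features onto the k principal components -> (..., k)."""
    return (features - mean) @ components
```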


Representation Guidance: Joint modeling allows us to treat the generated noisy representation as a condition. During inference, we modify the posterior distribution to
\[ \hat{p}_\theta(\mathbf{x}_t, \mathbf{z}_t) \propto p_\theta(\mathbf{x}_t)\, p_\theta(\mathbf{z}_t \vert \mathbf{x}_t)^{w_r}, \]
which yields the guided score
\begin{align} \nabla_{\!\mathbf{x}_t} \log \hat{p}_\theta(\mathbf{x}_t, \mathbf{z}_t) &= \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{x}_t) + w_r \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{z}_t \vert \mathbf{x}_t) \\ &= \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{x}_t) + w_r \big( \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{x}_t, \mathbf{z}_t) - \nabla_{\!\mathbf{x}_t} \log p_\theta(\mathbf{x}_t) \big). \end{align}
We implement this representation-guided prediction $\boldsymbol{\hat{\epsilon}}_\theta(\mathbf{x}_t, \mathbf{z}_t, t)$ at each denoising step as follows:
\begin{equation} \boldsymbol{\hat{\epsilon}}_\theta(\mathbf{x}_t, \mathbf{z}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + w_r\left(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{z}_t, t) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right). \end{equation}
The image-feature model and the image-only model are trained together: with probability $p_{\text{drop}}$, we zero out $\mathbf{z}_t$ (setting $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{0}, t)$) and disable the visual-representation denoising loss.
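A sketch of this guidance rule at inference time, assuming the same two-output model interface as in the training sketch above and that the image-only prediction is obtained by zeroing $\mathbf{z}_t$, as in the $p_{\text{drop}}$ training scheme; the guidance weight shown is illustrative.

```python
import torch

def guided_eps(model, x_t, z_t, t, w_r=1.5):
    """Representation-guided noise prediction for the image latents at one denoising step."""
    eps_joint, _ = model(x_t, z_t, t)                  # eps_theta(x_t, z_t, t)
    eps_img, _ = model(x_t, torch.zeros_like(z_t), t)  # eps_theta(x_t, 0, t) = eps_theta(x_t, t)
    return eps_img + w_r * (eps_joint - eps_img)       # hat{eps}_theta(x_t, z_t, t)
```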

Results

We compare the FID values between vanilla DiT or SiT models and those trained with ReDi. Without classifier-free guidance (CFG), ReDi achieves an FID of 7.5 at 400K iterations, outperforming the vanilla model's performance at 7M iterations. Importantly, ReDi yields a greater performance boost than REPA, reaching an FID of 5.7 at 700K iterations, whereas REPA reaches an FID of 5.9 at 4M iterations. Moreover, with CFG, SiT-XL/2 with ReDi outperforms recent diffusion models with fewer epochs, as well as REPA with SiT-XL/2.



Selected samples from our SiT-XL/2 w/ ReDi model trained on ImageNet 256 × 256. Images and visual representations are jointly generated by our model.

Analysis

Dimensionality reduction ablation. We observe that increasing the component count improves performance up to $r=8$, beyond which additional components begin to degrade generation quality. This suggests an optimal intermediate subspace where compressed visual features retain sufficient expressivity to guide generation without dominating model capacity.



Merged Tokens (MR) vs. Separate Tokens (SP). While both approaches achieve comparable performance gains, SP demonstrates slightly better results. This advantage comes at a significant computational cost: SP doubles the transformer's input sequence length by introducing $256$ additional DINOv2 tokens, resulting in approximately $2\times$ greater compute demands during both training and inference. The MR strategy, by contrast, maintains the original sequence length while delivering similar performance improvements, thereby preserving computational efficiency as measured by throughput.

Cite Us


@article{kouzelis2025boosting,
  title={Boosting Generative Image Modeling via Joint Image-Feature Synthesis},
  author={Kouzelis, Theodoros and Karypidis, Efstathios and Kakogeorgiou, Ioannis and Gidaris, Spyros and Komodakis, Nikos},
  journal={arXiv preprint arXiv:2504.16064},
  year={2025}
}
        