Abstract
Virtual try-on systems have long struggled with rigid dependencies on human body masks, limited fine-grained control over garment attributes, and poor generalization to in-the-wild scenarios. In this paper, we propose JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On), a novel framework that simultaneously addresses these challenges by unifying diffusion-based generation with multi-modal condition fusion. Our architecture leverages a Multi-Modal Diffusion Transformer (MM-DiT) backbone to integrate diverse control signals, including the reference image and the garment image, directly into the denoising process. Our key architectural innovation is to condition the MM-DiT by fusing reference and garment information into its self-attention layers through dedicated conditional pathways. This integration is augmented by critical refinements to the positional encodings and attention masks, which specialize the multi-conditional MM-DiT for the nuanced requirements of the virtual try-on task and yield superior results. On the data curation front, we introduce a bi-directional generation strategy to construct our training set. This strategy leverages two complementary pathways: first, a mask-based model generates a substantial volume of reference images; second, a Try-Off model, sharing an identical architecture and trained via self-supervision, synthesizes the corresponding garment data.
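As a rough illustration of what refined positional encodings and attention masks can look like for a concatenated noise/reference/garment token sequence, the sketch below builds shared 2D positional indices and a boolean attention mask. The shared-grid layout, the rule that condition tokens do not attend back to the noisy latent, and the helper name build_cond_layout are illustrative assumptions, not the paper's published design.

```python
# Hypothetical helper: positional ids and an attention mask for a concatenated
# [noise | reference | garment] token sequence. Offsets and masking rules are
# illustrative assumptions, not JCo-MVTON's exact design.
import torch

def build_cond_layout(height, width, block_cond_to_noise=True):
    n = height * width                                   # tokens per stream (same latent size)
    yy, xx = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    grid = torch.stack([yy, xx], dim=-1).reshape(n, 2)   # (n, 2) row/col indices

    # One possible choice: all three streams reuse the same spatial grid so that
    # garment and reference tokens align with the latent positions they influence.
    pos_ids = torch.cat([grid, grid, grid], dim=0)       # (3n, 2)

    # Boolean mask, True = allowed to attend. Optionally stop the condition
    # streams from reading the noisy latent tokens.
    allowed = torch.ones(3 * n, 3 * n, dtype=torch.bool)
    if block_cond_to_noise:
        allowed[n:, :n] = False
    return pos_ids, allowed

pos_ids, allowed = build_cond_layout(4, 4)
print(pos_ids.shape, allowed.shape)   # torch.Size([48, 2]) torch.Size([48, 48])
```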
Method Overview

The overall framework of JCo-MVTON. Given a fixed prompt, reference image, and garment image as inputs, the Joint MM-DiT architecture fuses multi-conditional features to synthesize the try-on image. Within each Joint MM-DiT block, noise and conditional features are processed through three parallel branches and fused via self-attention.
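To make the three-branch fusion concrete, the following PyTorch sketch implements a simplified joint block: each stream (noise, reference, garment) keeps its own normalization, QKV, and MLP parameters, while attention runs once over the concatenated sequence so information flows between streams. The class name JointMMDiTBlockSketch and all hyperparameters are placeholders; timestep modulation, positional encodings, and the refined attention masks are deliberately omitted.

```python
# Minimal sketch of a joint block with three parallel branches; an assumption-level
# simplification, not the published JCo-MVTON implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointMMDiTBlockSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.norm1 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))
        self.qkv   = nn.ModuleList(nn.Linear(dim, 3 * dim) for _ in range(3))
        self.proj  = nn.ModuleList(nn.Linear(dim, dim) for _ in range(3))
        self.norm2 = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))
        self.mlp   = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(3)
        )

    def _heads(self, t):
        b, n, d = t.shape
        return t.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)

    def forward(self, noise, ref, garm):
        streams = [noise, ref, garm]
        lens = [s.shape[1] for s in streams]

        # Branch-specific pre-norm and QKV projections, then one joint attention pass.
        qs, ks, vs = [], [], []
        for i, s in enumerate(streams):
            q, k, v = self.qkv[i](self.norm1[i](s)).chunk(3, dim=-1)
            qs.append(q); ks.append(k); vs.append(v)
        q, k, v = (self._heads(torch.cat(t, dim=1)) for t in (qs, ks, vs))
        fused = F.scaled_dot_product_attention(q, k, v)
        fused = fused.transpose(1, 2).reshape(noise.shape[0], sum(lens), -1)

        # Split back into streams; each branch applies its own projection and MLP.
        outs = []
        for i, s in enumerate(fused.split(lens, dim=1)):
            h = streams[i] + self.proj[i](s)
            outs.append(h + self.mlp[i](self.norm2[i](h)))
        return outs  # [noise_out, ref_out, garm_out]

block = JointMMDiTBlockSketch(dim=64)
x = torch.randn(1, 16, 64)
noise_out, ref_out, garm_out = block(x, x.clone(), x.clone())
print(noise_out.shape)  # torch.Size([1, 16, 64])
```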
Data Preparation

Overview of the two-stage data pipeline used to build our mask-free virtual try-on dataset. Stage I bootstraps a raw pool by alternately running a Try-Off model, which generates garment images, and a mask-based try-on model, which produces initial reference images. Stage II iteratively refines and enlarges the dataset: human-in-the-loop filtering cleans the seed pool, JCo-MVTON regenerates sharper triplets, and ICLoRA injects new styles to widen the domain. The cycle repeats until the refined pool reaches the desired quality and diversity.
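The sketch below restates this two-stage loop as code. Every callable passed in (try_off, mask_tryon, keep, regenerate, stylize) is a placeholder for the corresponding model or filtering step, not an API from the paper's codebase, and the fixed round count stands in for the quality/diversity stopping criterion.

```python
# Pseudocode-style sketch of the two-stage data pipeline; all callables are
# placeholders for the models and filters described in the caption above.
from typing import Callable, Iterable, List, Tuple

Triplet = Tuple[object, object, object]  # (reference image, garment image, try-on target)

def build_dataset(
    persons: Iterable[object],
    try_off: Callable,     # person image -> garment image
    mask_tryon: Callable,  # (person, garment) -> initial reference image
    keep: Callable,        # triplet -> bool, human-in-the-loop filter
    regenerate: Callable,  # triplet -> sharper triplet (JCo-MVTON)
    stylize: Callable,     # triplet -> new-style triplet (ICLoRA)
    rounds: int = 3,
) -> List[Triplet]:
    # Stage I: bootstrap a raw pool by alternating Try-Off and mask-based try-on.
    pool: List[Triplet] = []
    for person in persons:
        garment = try_off(person)
        reference = mask_tryon(person, garment)
        pool.append((reference, garment, person))

    # Stage II: iterative filtering, regeneration, and style expansion.
    for _ in range(rounds):
        seed = [t for t in pool if keep(t)]           # clean the seed pool
        regen = [regenerate(t) for t in seed]         # regenerate sharper triplets
        pool = regen + [stylize(t) for t in regen]    # inject new styles, enlarge pool
    return pool
```

In practice each callable would wrap the corresponding trained model, and the loop would terminate once the refined pool meets the target quality and diversity rather than after a fixed number of rounds.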
Results
