Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

1 Westlake University
2 The Chinese University of Hong Kong, Shenzhen
*Indicates Co-corresponding Author.
Teaser figure

While foundation models like FLUX and SDXL frequently struggle with complex human anatomy, especially fingers, ASAP significantly improves anatomical plausibility.

Abstract

Large-scale text-to-image foundation models have achieved remarkable visual realism, yet generating human images with correct anatomical structures remains challenging. Existing approaches enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and often provide ambiguous optimization signals due to confounding factors such as lighting, pose, and background. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose Alignment via Synthetic Anatomical Preference (ASAP), a framework that constructs controlled preference pairs through a localized degradation mechanism applied to high-fidelity human images. This mechanism acts as a controlled experiment on each image: it introduces explicit anatomical errors in targeted regions while preserving the remaining content. With this mechanism, we create the Human Anatomical Preference (HAP) dataset, which contains over 10K curated pairs for anatomical alignment of text-to-image models that generate humans. To better leverage the locality of these controlled preference pairs, we introduce a localized and margin-bounded variant of DPO that prioritizes optimization in targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments demonstrate that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.

ASAP Framework

Experiments results

We introduce a two-phase approach to rectify anatomical artifacts without compromising global quality. In Phase I, we construct the HAP dataset using a localized degradation mechanism to synthesize negative samples that differ from positives exclusively in anatomical correctness. In Phase II, we propose a Localized and Margin-Bounded Alignment objective. Unlike standard DPO, which pushes the preference gap toward infinity and risks semantic collapse, our method utilizes a spatial weight map and enforces a target margin to penalize over-optimization, ensuring precise local corrections.
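The contrast between an unbounded preference gap and a margin-bounded one can be illustrated with a small sketch. It is not the paper's exact objective: the squared penalty around a target margin, the function names, and the `beta` and `margin` parameters are all assumptions chosen for illustration. The spatial weight map is modeled as per-pixel weights in a weighted mean squared error.

```python
import math

def weighted_error(pred, target, weight):
    """Spatially weighted mean squared error over flattened pixels.

    `weight` plays the role of a spatial weight map: larger values on
    targeted anatomical regions make errors there count more.
    """
    total_weight = sum(weight)
    return sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weight)) / total_weight

def preference_gap(err_w_model, err_w_ref, err_l_model, err_l_ref, beta):
    """Implicit preference gap in the style of DPO for diffusion models.

    Positive when the model improves over the reference on the preferred
    image more than it does on the rejected image.
    """
    return -beta * ((err_w_model - err_w_ref) - (err_l_model - err_l_ref))

def standard_dpo_loss(gap):
    """-log(sigmoid(gap)), written as a numerically stable softplus(-gap)."""
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))

def margin_bounded_loss(gap, margin):
    """Assumed bounded variant: penalize any deviation from a finite target margin."""
    return (gap - margin) ** 2

# Hypothetical numbers: the first two pixels stand in for a hand region.
weights = [3.0, 3.0, 1.0, 1.0]
err = weighted_error([0.2, 0.1, 0.5, 0.5], [0.0, 0.0, 0.5, 0.5], weights)
gap = preference_gap(0.2, 0.5, 0.6, 0.5, beta=10.0)
print(err, gap, standard_dpo_loss(gap), margin_bounded_loss(gap, margin=4.0))
```

Under these assumptions, the standard loss keeps decreasing as the gap grows, so nothing stops the optimizer from widening it indefinitely, while the bounded loss is minimized at the target margin and rises again once the gap overshoots it.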

Synthetic Localized Preferences

Framework pipeline

We validate our synthetic preference pairs visually and quantitatively. As shown above, our synthetic negatives (bottom row) closely match typical real-world T2I failure modes (top row). Quantitatively, region-wise metrics show that targeted anatomical regions exhibit significant structural divergence (average SSIM: 0.5315, LPIPS: 0.3446), whereas non-target areas remain exceptionally consistent (average SSIM: 0.9827, LPIPS: 0.0109). These statistics confirm that our synthetic pipeline provides a controlled and meaningful preference signal.
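A region-wise comparison of this kind can be sketched as follows. This is a simplified illustration, not the evaluation code behind the numbers above: it computes a single-window SSIM over flattened grayscale pixels instead of the usual sliding-window SSIM, and it omits LPIPS because that metric requires a pretrained network. The function names and the toy data are hypothetical.

```python
def masked_ssim(img_a, img_b, mask, data_range=1.0):
    """Single-window SSIM over the pixels where `mask` is truthy.

    Simplification: one global window over the selected pixels, with the
    conventional constants K1 = 0.01 and K2 = 0.03.
    """
    a = [x for x, m in zip(img_a, mask) if m]
    b = [y for y, m in zip(img_b, mask) if m]
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

def region_wise_ssim(positive, negative, target_mask):
    """Return (SSIM inside the target region, SSIM outside it)."""
    inside = masked_ssim(positive, negative, target_mask)
    outside = masked_ssim(positive, negative, [not m for m in target_mask])
    return inside, outside

# Toy pair: the first four pixels are the degraded region, the rest are untouched.
positive = [0.1, 0.9, 0.2, 0.8, 0.5, 0.4, 0.3, 0.6]
negative = [0.9, 0.1, 0.8, 0.2, 0.5, 0.4, 0.3, 0.6]
target_mask = [1, 1, 1, 1, 0, 0, 0, 0]
inside, outside = region_wise_ssim(positive, negative, target_mask)
print(inside, outside)
```

A well-controlled pair should show the same pattern as the reported statistics: low similarity inside the targeted region and near-perfect similarity everywhere else.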

Experiments

Main performance results

BibTeX

@article{li2026asap,
    title={Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences},
    author={Li, Bao and Xiu, Yuliang and Liu, Zhen},
    journal={arXiv preprint arXiv:2605.25759},
    year={2026},
}