Drawing upon StyleGAN's expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require attribute-specific classifiers and learn separate model weights for each attribute, and therefore do not scale to novel attributes. In this work, we propose an efficient, plug-and-play, 3D-aware face editing framework based on attribute-specific prompt learning, enabling the generation of facial images with controllable attributes across various target poses. To this end, we introduce a latent attribute editor (LAE) built on text-driven learnable style tokens. The LAE harnesses a pre-trained vision-language model to find a text-guided, attribute-specific editing direction in the latent space of any pre-trained 3D-aware GAN, and uses learnable style tokens and style mappers to learn this editing direction and transform it into the 3D-aware latent space. To train the LAE with multiple attributes, we use a directional contrastive loss and a style token loss. Furthermore, to ensure view consistency and identity preservation across different poses and attributes, we employ several 3D-aware identity and pose preservation losses. Experiments show that the proposed framework generates high-quality images with 3D awareness and view consistency while preserving attribute-specific features. We demonstrate the effectiveness of our method on a range of facial attributes, including hair color, hairstyle, and expression, among others.
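To make the text-guided training signal concrete, the following is a minimal PyTorch sketch of a CLIP-style directional loss of the kind our directional contrastive loss builds on. The `encode_image` / `encode_text` callables (e.g., a frozen CLIP encoder) are assumed interfaces, and the exact contrastive formulation used in our method may differ in detail.

```python
import torch
import torch.nn.functional as F


def directional_loss(img_src, img_edit, txt_src, txt_edit,
                     encode_image, encode_text):
    """CLIP-style directional loss: align the change in image embeddings
    with the change in text embeddings. `encode_image` / `encode_text`
    are assumed callables returning embeddings from a frozen
    vision-language model (e.g., CLIP); the contrastive variant used in
    the paper may differ in detail.
    """
    # Edit direction in the image embedding space (edited minus source).
    d_img = encode_image(img_edit) - encode_image(img_src)
    # Edit direction in the text embedding space
    # (e.g., "a face with blond hair" minus "a face").
    d_txt = encode_text(txt_edit) - encode_text(txt_src)
    # Penalize misalignment between the two edit directions.
    return 1.0 - F.cosine_similarity(d_img, d_txt, dim=-1).mean()
```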
The following figure presents our overall framework, comprising a mapping network $f_{map}$, a latent attribute editor $LAE$, an RGB image ($C$) and $\alpha$-map generator $f_G$, and a differentiable renderer $R$. Given a noise code $z \sim \mathcal{N}(0, 1)$, the mapping network maps it to a latent code $w \in \mathcal{W}$. This latent code $w$ is then edited within the $LAE$ module by the input prompt $P_A^i$ (a combination of the attribute prompt $A^i$, a system prompt $t$, and learnable style tokens $V_i$). In particular, within the $LAE$, the attribute-specific tokens in $P_A^i$ are learned and mapped into a textual embedding by a text encoder $f_T(\cdot)$. The resulting embedding $f_T(P_A^i)$ and the latent code $w$ are then used to obtain the edited latent code $\hat{w}$ via the style mappers $M_c$, $M_m$, and $M_f$. The edited $\hat{w}$ output by the $LAE$ is subsequently fed into the RGB-$\alpha$ generator $f_G$, which produces the RGB image and $\alpha$-maps; these are then passed to the renderer $R$, which synthesizes images with the desired attributes at the specified target pose $p_t$.
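For illustration, below is a minimal PyTorch sketch of the $LAE$ data flow described above. The class and argument names, the dimensions, the residual update $\hat{w} = w + \Delta w$, and the way the three mappers are combined are simplifying assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn


class LatentAttributeEditor(nn.Module):
    """Illustrative sketch of the LAE data flow (not the exact architecture).

    Learnable style tokens V_i are prepended to the embedded prompt
    (system prompt t + attribute prompt A^i), encoded by a text encoder
    f_T(.), and the resulting embedding conditions three style mappers
    (M_c, M_m, M_f) that predict an offset to the latent code w.
    """

    def __init__(self, n_attrs, n_tokens=4, tok_dim=512, w_dim=512, txt_dim=512):
        super().__init__()
        # One set of learnable style tokens V_i per attribute prompt A^i.
        self.style_tokens = nn.Parameter(torch.randn(n_attrs, n_tokens, tok_dim))
        # Coarse / medium / fine style mappers acting on [w, f_T(P_A^i)].
        self.mappers = nn.ModuleList([
            nn.Sequential(nn.Linear(w_dim + txt_dim, w_dim),
                          nn.ReLU(),
                          nn.Linear(w_dim, w_dim))
            for _ in range(3)  # stands in for M_c, M_m, M_f
        ])

    def forward(self, w, attr_idx, prompt_embed, text_encoder):
        """
        w:            (B, w_dim) latent codes from the mapping network.
        attr_idx:     (B,) long tensor selecting the attribute A^i.
        prompt_embed: (B, T, tok_dim) embedded system + attribute prompt tokens.
        text_encoder: callable mapping (B, T', tok_dim) token embeddings to a
                      (B, txt_dim) text embedding (assumed interface, e.g.,
                      a frozen CLIP text transformer).
        """
        # Build P_A^i by prepending the learnable style tokens V_i.
        tokens = torch.cat([self.style_tokens[attr_idx], prompt_embed], dim=1)
        f_t = text_encoder(tokens)                 # text embedding of P_A^i
        cond = torch.cat([w, f_t], dim=-1)
        # Sum the offsets predicted by the style mappers and edit w.
        delta = sum(mapper(cond) for mapper in self.mappers)
        return w + delta                           # edited latent code \hat{w}
```

In the full pipeline, the returned $\hat{w}$ would then be passed to the RGB-$\alpha$ generator $f_G$ and the renderer $R$ to synthesize the edited face at the target pose $p_t$.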