Stable Diffusion in other languages
Stable Diffusion was developed by CompVis, Stability AI, and LAION. It was trained mainly on LAION2B-en, the English subset of the LAION-5B dataset, and as a result it requires English text prompts to produce images.
This means that the tagging and correlation of images and text are based on English-tagged data sets, which naturally tend to come from English-speaking sources and regions. Users of other languages must first translate their prompts from their native language into English, which often loses nuance or even the core meaning. On top of that, it also means the imagery the latent diffusion model has learned is largely limited to English-speaking sources.
For example, one of the more common Japanese terms reinterpreted from the English word "businessman" is "salary man", which we most often imagine as a man wearing a suit. You would get results that look like this, which might not be very useful if you're trying to generate images for a Japanese audience.
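To make the problem concrete, this is roughly what the English-only workflow looks like with the Hugging Face diffusers library. It is a minimal sketch: the checkpoint id and prompt are only illustrative, and any Stable Diffusion v1 checkpoint would behave the same way.

```python
# Minimal sketch of the usual English-only workflow with the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # original English-trained checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# A Japanese user first has to translate their idea into English,
# e.g. サラリーマン -> "salary man" / "businessman".
prompt = "a portrait of a businessman wearing a suit"
image = pipe(prompt).images[0]
image.save("businessman.png")
```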
rinna Co., Ltd. has developed a Japanese-specific text-to-image model named "Japanese Stable Diffusion". It accepts native Japanese text prompts and generates images that reflect the concepts and imagery of the Japanese-speaking world, which may be difficult to express through translation and may simply not be present in the Western world. The new text-to-image model was trained on source material that comes directly from Japanese culture, identity, and unique expressions, including slang.
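As a rough sketch of how this looks to a user, the snippet below assumes the JapaneseStableDiffusionPipeline helper distributed with the GitHub repository linked at the end of this section; the exact class name, arguments, and output access may differ between releases, and the prompt is only illustrative.

```python
# Hedged sketch: assumes the JapaneseStableDiffusionPipeline helper shipped with
# the rinnakk/japanese-stable-diffusion GitHub repo (install it from that repo first).
import torch
from japanese_stable_diffusion import JapaneseStableDiffusionPipeline

pipe = JapaneseStableDiffusionPipeline.from_pretrained(
    "rinna/japanese-stable-diffusion",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The prompt is written directly in Japanese, with no translation step.
prompt = "サラリーマン 油絵"  # "salary man, oil painting"
image = pipe(prompt).images[0]
image.save("salaryman.png")
```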
They did this using a two-step approach that is also instructive for understanding how Stable Diffusion works.
First, they left the latent diffusion model alone and replaced the English text encoder with a Japanese-specific text encoder. This allowed the model to understand Japanese natively, but it still generated Western-style imagery because the latent diffusion model remained intact. Even so, this was better than simply translating the prompt into English.
Now Stable Diffusion could understand the concept of a "businessman", but it still generated images of decidedly Western-looking businessmen because the underlying latent diffusion model had not been changed:
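Conceptually, step one looks something like the sketch below. This is not rinna's actual code: the Japanese encoder id is a placeholder, and the new encoder still has to be trained so that its embeddings match what the frozen UNet expects (its hidden size must equal the UNet's cross-attention dimension, 768 for Stable Diffusion v1).

```python
# Conceptual sketch of step one: keep the original UNet and VAE, swap only the
# English CLIP text encoder and tokenizer for Japanese ones.
# "japanese-text-encoder" is a placeholder id, not a real model.
from transformers import AutoModel, AutoTokenizer
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# Replace only the text side; the latent diffusion model (pipe.unet) and the
# image autoencoder (pipe.vae) stay exactly as they were.
pipe.tokenizer = AutoTokenizer.from_pretrained("japanese-text-encoder")   # placeholder
pipe.text_encoder = AutoModel.from_pretrained("japanese-text-encoder")    # placeholder

# The new encoder is then trained so its embeddings line up with what the
# frozen UNet expects, after which prompts can be written directly in Japanese.
```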
The second step was to retrain the latent diffusion model on Japanese-tagged data sources together with the new text encoder. This stage was essential to make the model more language-specific. After this, the model could finally generate businessmen with the Japanese faces a Japanese audience would expect:
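A highly simplified sketch of a single retraining step with the diffusers API follows. Again, this is not rinna's actual training code: noise_scheduler, pixel_values, japanese_captions, and the optimizer are assumed to be set up elsewhere, and pipe is the pipeline from the previous sketch with the Japanese text encoder already in place.

```python
# Conceptual single training step for step two: fine-tune the UNet on Japanese
# image-caption pairs, conditioned on the new Japanese text encoder.
import torch
import torch.nn.functional as F

def training_step(pipe, noise_scheduler, pixel_values, japanese_captions, optimizer):
    # 1. Compress the images into the latent space with the frozen VAE.
    latents = pipe.vae.encode(pixel_values).latent_dist.sample() * 0.18215

    # 2. Add noise at a random timestep (the forward diffusion process).
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # 3. Embed the native Japanese captions with the swapped-in text encoder.
    tokens = pipe.tokenizer(japanese_captions, padding="max_length",
                            truncation=True, return_tensors="pt").to(latents.device)
    text_embeddings = pipe.text_encoder(**tokens)[0]

    # 4. The UNet learns to predict the added noise given the Japanese conditioning.
    noise_pred = pipe.unet(noisy_latents, timesteps,
                           encoder_hidden_states=text_embeddings).sample
    loss = F.mse_loss(noise_pred, noise)

    # The optimizer is assumed to hold the UNet (and optionally text encoder) parameters.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```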
Read more about it at the links below.
Links:
- https://huggingface.co/blog/japanese-stable-diffusion
- https://rinna.co.jp/
- Japanese Stable Diffusion:
  - Hugging Face presentation/usage info: https://huggingface.co/rinna/japanese-stable-diffusion
  - GitHub: https://github.com/rinnakk/japanese-stable-diffusion