I’ve already written about NVIDIA’s GET3D code that can generate a wide variety of 3D objects using trained generative networks. These models, however, are tuned to generate specific categories of objects (chairs, cars, etc.), which requires a large labeled 3D dataset. NVIDIA provides simple ones, but if you want to generate specific styles or objects from particular eras (only 1950s-era cars, only 1800s-style furniture), you’ll need to collect, label, and train the model for that yourself.
There’s another player in town called DreamFusion that goes in a slightly different direction. Researchers from Google and UC Berkeley are using a similar method to generate 3D models from text. This gets around the need for a large labeled 3D dataset by leaning on 2D text-to-image diffusion models (like Stable Diffusion, DALL-E, and Midjourney). They developed a loss, which they call Score Distillation Sampling, that measures how well 2D images rendered from a candidate 3D model match what a pretrained diffusion model expects for the text prompt, and they optimize the 3D representation against that loss. They come up with some astounding results.
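To make the idea concrete, here is a minimal toy sketch of that optimization loop. The actual paper optimizes a NeRF against a large pretrained diffusion model; here, purely for illustration, the “renderer” is the identity function, the scene is a tiny grayscale image, and the “diffusion model” is a hand-written stand-in whose noise prediction pulls toward a fixed target image. The gradient has the Score Distillation Sampling shape, w(t)·(ε̂ − ε), but everything else is an assumption of this sketch, not the authors’ code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "scene": theta is a 4x4 grayscale image; the "renderer" is identity.
theta = rng.normal(size=(4, 4))
target = np.zeros((4, 4))  # what our toy "diffusion model" believes images look like

def toy_denoiser(x_t, alpha, sigma):
    """Stand-in for a pretrained diffusion model's noise prediction.
    For a point-mass image distribution at `target`, this is the exact answer."""
    return (x_t - alpha * target) / sigma

def sds_grad(theta, rng):
    """One Score Distillation Sampling gradient: w(t) * (eps_hat - eps) * dx/dtheta."""
    t = rng.uniform(0.02, 0.98)               # random diffusion timestep
    alpha, sigma = np.cos(t * np.pi / 2), np.sin(t * np.pi / 2)
    x = theta                                 # "render" the scene (identity here)
    eps = rng.normal(size=x.shape)            # forward-process noise
    x_t = alpha * x + sigma * eps             # noised render
    eps_hat = toy_denoiser(x_t, alpha, sigma)
    w = sigma ** 2                            # one common weighting choice
    return w * (eps_hat - eps)                # dx/dtheta is the identity here

for _ in range(500):
    theta -= 0.1 * sds_grad(theta, rng)

print(float(np.abs(theta - target).mean()))  # small: theta was pulled toward target
```

In the real system, `theta` parameterizes a 3D volume, the render step traces rays from a random camera, and the denoiser is a frozen text-conditioned diffusion model, so the 3D scene is pulled toward looking like the prompt from every viewpoint.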
There is also a paper by Nikolay Jetchev called ClipMatrix that attempts the same text-to-2D-to-3D generation. He also seems to be experimenting with animations and something called VolumeCLIP that does ray-casting.
This kind of end-to-end pipeline is exactly what content makers want. Unfortunately, it also means it could likely decimate an art department. This kind of technology could easily be used to fill the non-critical areas of ever-expanding 3D worlds in games and VR with minimal effort or cost. In theory, it could even be done in pseudo-realtime. Imagine worlds in which you can walk in any direction – forever – and see constantly new locations and objects.