Developing algorithms capable of generating realistic, high-quality 3D content at scale has been a long-standing problem in computer vision and graphics. We anticipate that generative models able to reliably synthesize meaningful 3D content will revolutionize the workflow of artists and content creators, and will also enable new levels of creativity through ``generative art''. Although there has recently been considerable success in generating photorealistic images, the quality and generality of 3D generative models have lagged behind their 2D counterparts. Moreover, efficiently controlling what is generated and scaling these approaches to complex scenes with multiple static and dynamic objects remain open challenges.
In this workshop, we seek to bring together researchers working on generative models for 3D humans, objects, and scenes to discuss the latest advances and the next steps toward generative pipelines capable of producing fully controllable 3D environments, with multiple humans interacting with each other or with objects in the scene. In summary, the workshop will cover the following topics:
- Representations: What is the most appropriate representation for generating high-quality 3D assets? Which representation best enables intuitive control over the generated objects? How can we effectively represent interactions between humans and objects?
- Modelling: How can we build a foundation model capable of generating diverse, high-quality, and photo-realistic humans/objects/scenes? How can we ensure expressiveness that faithfully captures the subtle details and nuances implied by the semantics of diverse input conditions? How can we maintain robustness under varied real-world conditions (\emph{e.g.,} lighting changes, challenging poses)?
- Interaction: How can we construct powerful models that reliably generate humans/objects performing plausible real-life motion with complex interactions? How far are we from a world model that allows us to manipulate both the appearance of scene elements and their spatial composition? How can we incorporate common-sense knowledge about 3D objects and scenes, such as part structures and object arrangements, from large foundation models or physics simulators to enable training with less data?
- Applications: Are there new fields that could benefit from generated 3D content, such as embodied AI, construction, and agriculture? How can we leverage 2D priors to enable photo-realistic 3D content creation? How can we build tools that meet designers' real needs and make their workflows more efficient (\emph{e.g.,} interactive editing, asset rigging)?