A better method for planning complex visual tasks | MIT News

Source Domain: news.mit.edu

MIT researchers have created a generative AI framework for long-term visual task planning, like robot navigation, that is about twice as effective as some existing techniques.
The framework uses a specialized vision-language model to perceive the scenario and simulate the actions needed, then translates that simulation into a planning problem expressed in a formal language.
The approach generates actionable plans with a 70 percent success rate, outperforming baseline models.
The system can solve new, unseen problems, making it well-suited for dynamic environments.
By combining the strengths of vision-language models and formal planners, the researchers built a system that generalizes to new problem instances.
The system, called VLM-guided formal planning (VLMFP), includes two models: SimVLM, which simulates actions, and GenVLM, which generates the formal planning files.
The framework achieved significant success rates in multiple planning tasks, including 2D and 3D scenarios such as multirobot collaboration and robotic assembly.
In future work, the researchers aim to enable the system to handle more complex scenarios and to reduce errors, or “hallucinations,” from the vision-language models.
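
The two-stage design described above (a model that perceives the scene, a model that writes a formal planning problem, and a formal planner that solves it) can be illustrated with a small Python sketch. This is not the researchers' implementation: the article does not describe the models' interfaces or name the formal language, so every name here (`sim_vlm`, `gen_vlm`, `PlanningProblem`, `formal_planner`, `plan_from_scene`) is hypothetical. The two vision-language models are replaced with hard-coded stubs that read a text grid, and a breadth-first search stands in for a real formal planner.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple

Cell = Tuple[int, int]

# Hypothetical action set for a 2D grid world: name -> (dx, dy).
MOVES: Dict[str, Cell] = {
    "up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0),
}


@dataclass(frozen=True)
class PlanningProblem:
    """Assumed stand-in for the 'formal planning files' the article mentions."""
    width: int
    height: int
    walls: FrozenSet[Cell]
    start: Cell
    goal: Cell


def sim_vlm(scene: List[str]) -> dict:
    """Stub for the perception step: extracts facts from a text 'image'.

    A real system would call a vision-language model here (assumption).
    """
    facts = {"width": len(scene[0]), "height": len(scene), "walls": set()}
    for y, row in enumerate(scene):
        for x, ch in enumerate(row):
            if ch == "#":
                facts["walls"].add((x, y))
            elif ch == "S":
                facts["start"] = (x, y)
            elif ch == "G":
                facts["goal"] = (x, y)
    return facts


def gen_vlm(facts: dict) -> PlanningProblem:
    """Stub for the generation step: turns perceived facts into a formal problem."""
    return PlanningProblem(
        width=facts["width"], height=facts["height"],
        walls=frozenset(facts["walls"]),
        start=facts["start"], goal=facts["goal"],
    )


def formal_planner(problem: PlanningProblem) -> Optional[List[str]]:
    """Breadth-first search: returns a shortest action sequence, or None."""
    queue = deque([(problem.start, [])])
    visited = {problem.start}
    while queue:
        (x, y), plan = queue.popleft()
        if (x, y) == problem.goal:
            return plan
        for name, (dx, dy) in MOVES.items():
            nxt = (x + dx, y + dy)
            in_bounds = 0 <= nxt[0] < problem.width and 0 <= nxt[1] < problem.height
            if in_bounds and nxt not in problem.walls and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, plan + [name]))
    return None  # No sequence of actions reaches the goal.


def plan_from_scene(scene: List[str]) -> Optional[List[str]]:
    """Full pipeline: perceive, formalize, then solve."""
    return formal_planner(gen_vlm(sim_vlm(scene)))


if __name__ == "__main__":
    # 'S' = start, 'G' = goal, '#' = wall, '.' = free cell.
    print(plan_from_scene(["S.#", ".##", "..G"]))
```

The point of the split is that the search step is exact once the problem is written down correctly, so in a design like this any errors would come from the perception and formalization stages rather than from the planner itself.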
