Abstract
Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital space, with poor generalization to the physical world. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the \textbf{Boundless Large Model (BLM$_1$)}, a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities -- \textit{cross-space transfer, cross-task learning, and cross-embodiment generalization} -- via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families -- MLLMs, ELLMs, VLAs, and GMLMs -- achieving $\sim\!\textbf{6\%}$ gains in digital tasks and $\sim\!\textbf{3\%}$ in physical tasks.