We introduce an on-ground Pedestrian World Model, a computational model that can predict how pedestrians move around an observer in a crowd on the ground plane, using only the observer's egocentric views. Our model, InCrowdFormer, fully leverages the Transformer architecture by modeling both pedestrian interaction and the egocentric-to-top-down view transformation with attention, and autoregressively predicts the on-ground positions of a variable number of people with an encoder-decoder architecture. We encode the uncertainties arising from unknown pedestrian heights with latent codes to predict the posterior distributions of pedestrian positions. We validate the effectiveness of InCrowdFormer on a novel prediction benchmark of real movements. The results show that InCrowdFormer accurately predicts the future coordination of pedestrians. To the best of our knowledge, InCrowdFormer is the first pedestrian world model of its kind, which we believe will benefit a wide range of egocentric-view applications including crowd navigation, tracking, and synthesis.
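
To illustrate why unknown pedestrian heights make on-ground positions uncertain, the following is a minimal sketch under a simple pinhole-camera assumption, in which an upright person of height H at depth Z spans roughly f·H/Z pixels. It is not InCrowdFormer's method, which encodes this uncertainty with learned latent codes; the function names, focal length, and Gaussian height distribution are all hypothetical choices made for illustration.

```python
import random
import statistics


def depth_from_pixel_height(focal_px, person_height_m, pixel_height):
    """Pinhole model: pixel_height = focal_px * person_height_m / depth."""
    return focal_px * person_height_m / pixel_height


def sample_depths(focal_px, pixel_height, n=1000,
                  mean_h=1.7, std_h=0.1, seed=0):
    """Sample plausible depths for one observed pixel height.

    The true height is unknown, so we draw it from an assumed
    (hypothetical) Gaussian and convert each draw to a depth.
    """
    rng = random.Random(seed)
    return [
        depth_from_pixel_height(focal_px, rng.gauss(mean_h, std_h), pixel_height)
        for _ in range(n)
    ]


# One pedestrian observed 200 px tall by a camera with f = 1000 px (assumed).
depths = sample_depths(focal_px=1000.0, pixel_height=200.0)
print("mean depth (m):", statistics.mean(depths))
print("depth spread (m):", statistics.stdev(depths))
```

Even in this toy setting, a single observed pixel height maps to a spread of possible ground-plane distances rather than one value, which is the kind of ambiguity a distribution over positions can capture.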