Large language models (LLMs) are increasingly adopted for a variety of taskswith implicit graphical structures, such as planning in robotics, multi-hopquestion answering or knowledge probing, structured commonsense reasoning, andmore. While LLMs have advanced the state-of-the-art on these tasks withstructure implications, whether LLMs could explicitly process textualdescriptions of graphs and structures, map them to grounded conceptual spaces,and perform structured operations remains underexplored. To this end, wepropose NLGraph (Natural Language Graph), a comprehensive benchmark ofgraph-based problem solving designed in natural language. NLGraph contains29,370 problems, covering eight graph reasoning tasks with varying complexityfrom simple tasks such as connectivity and shortest path up to complex problemssuch as maximum flow and simulating graph neural networks. We evaluate LLMs(GPT-3/4) with various prompting approaches on the NLGraph benchmark and findthat 1) language models do demonstrate preliminary graph reasoning abilities,2) the benefit of advanced prompting and in-context learning diminishes on morecomplex graph problems, while 3) LLMs are also (un)surprisingly brittle in theface of spurious correlations in graph and problem settings. We then proposeBuild-a-Graph Prompting and Algorithmic Prompting, two instruction-basedapproaches to enhance LLMs in solving natural language graph problems.Build-a-Graph and Algorithmic prompting improve the performance of LLMs onNLGraph by 3.07% to 16.85% across multiple tasks and settings, while how tosolve the most complicated graph reasoning tasks in our setup with languagemodels remains an open research question. The NLGraph benchmark and evaluationcode are available at https://github.com/Arthur-Heng/NLGraph.