While recurrent models have been effective in NLP tasks, their performance oncontext-free languages (CFLs) has been found to be quite weak. Given that CFLsare believed to capture important phenomena such as hierarchical structure innatural languages, this discrepancy in performance calls for an explanation. Westudy the performance of recurrent models on Dyck-n languages, a particularlyimportant and well-studied class of CFLs. We find that while recurrent modelsgeneralize nearly perfectly if the lengths of the training and test strings arefrom the same range, they perform poorly if the test strings are longer. At thesame time, we observe that recurrent models are expressive enough to recognizeDyck words of arbitrary lengths in finite precision if their depths arebounded. Hence, we evaluate our models on samples generated from Dyck languageswith bounded depth and find that they are indeed able to generalize to muchhigher lengths. Since natural language datasets have nested dependencies ofbounded depth, this may help explain why they perform well in modelinghierarchical dependencies in natural language data despite prior worksindicating poor generalization performance on Dyck languages. We performprobing studies to support our results and provide comparisons withTransformers.