Within the last few years, there has been a move towards using statistical models in conjunction with neural networks, with the end goal of better answering the question, "what do our models know?". From this trend, classical metrics such as Prediction Interval Coverage Probability (PICP) and new metrics such as calibration error have entered the general repertoire of model evaluation in order to gain better insight into how the uncertainty of our model compares to reality. One important component of uncertainty modeling is model uncertainty (epistemic uncertainty), a measurement of what the model does and does not know. However, current evaluation techniques tend to conflate model uncertainty with aleatoric uncertainty (irreducible error), leading to incorrect conclusions. In this paper, using posterior predictive checks, we show that calibration error and its variants are almost always incorrect to use given model uncertainty, and further show how this mistake can lead to trust in bad models and mistrust in good models. Though posterior predictive checks have often been used for in-sample evaluation of Bayesian models, we show they still have an important place in the modern deep learning world.