Conversational bilingual speech encompasses three types of utterances: two purely monolingual types and one intra-sententially code-switched type. In this work, we propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition. By defining the monolingual sub-tasks with label-to-frame synchronization, our joint modeling framework can be conditionally factorized such that the final bilingual output, which may or may not be code-switched, is obtained given only monolingual information. We show that this conditionally factorized joint framework can be modeled by an end-to-end differentiable neural network. We demonstrate the efficacy of our proposed model on bilingual Mandarin-English speech recognition across both monolingual and code-switched corpora.
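
One way to read the conditional factorization described above is sketched below. The notation is an assumption introduced here for illustration and is not taken from the text: $X$ denotes the input speech frames, $Z^{M}$ and $Z^{E}$ denote frame-synchronous Mandarin and English label sequences, and $Y$ denotes the final bilingual output. The first line is the chain rule; the second line states the assumed conditional independence, under which $Y$ depends only on the monolingual label sequences.

```latex
\begin{align}
  % Chain rule over the joint of the bilingual output and both monolingual label sequences
  p(Y, Z^{M}, Z^{E} \mid X) &= p(Y \mid Z^{M}, Z^{E}, X)\, p(Z^{M}, Z^{E} \mid X) \\
  % Assumed conditional independence: Y is obtained given only monolingual information
  &\approx p(Y \mid Z^{M}, Z^{E})\, p(Z^{M}, Z^{E} \mid X) \\
  % Marginalizing over the monolingual label sequences gives the bilingual likelihood
  p(Y \mid X) &\approx \sum_{Z^{M},\, Z^{E}} p(Y \mid Z^{M}, Z^{E})\, p(Z^{M}, Z^{E} \mid X)
\end{align}
```

Under this reading, each factor could be parameterized by a neural module and the whole product trained end to end; the exact form used by the proposed model may differ from this sketch.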