The goal of this work is to provide a viable reinforcement learning solution for traffic signal control problems. Although state-of-the-art reinforcement learning approaches have yielded great success in a variety of domains, directly applying them to alleviate traffic congestion can be challenging, given the requirement of high sample efficiency and the way training data is gathered. In this work, we address several challenges that we encountered when attempting to mitigate serious traffic congestion in a metropolitan area. Specifically, we are required to provide a solution that is able to (1) handle traffic signal control when some of the surveillance cameras that supply information for reinforcement learning are down, (2) learn from batch data without a traffic simulator, and (3) make control decisions without shared information across intersections. We present a two-stage framework to deal with these situations. The framework can be decomposed into an Evolution Strategies approach that produces a fixed-time traffic signal control schedule and a multi-agent off-policy reinforcement learning method that is capable of learning from batch data with the aid of three proposed components: bounded action, batch augmentation, and surrogate reward clipping. Our experiments show that the proposed framework reduces traffic congestion by 36% in terms of waiting time compared with the currently used fixed-time traffic signal plan. Furthermore, the framework requires only 600 queries to a simulator to achieve this result.