Initial demand response (DR) studies mainly adopt model predictive control and thus require accurate models of the control problem (e.g., a customer behavior model), which are to a large extent uncertain in the EV scenario. Hence, model-free approaches, especially those based on reinforcement learning (RL), are an attractive alternative. In this paper, we propose a new Markov decision process (MDP) formulation in the RL framework to jointly coordinate a set of EV charging stations. State-of-the-art algorithms either focus on a single EV or control an aggregate of EVs in multiple steps (e.g., an aggregate load decision in one step, followed by a step that translates the aggregate decision to the individual connected EVs). In contrast, we propose an RL approach that jointly controls the whole set of EVs at once. We contribute a new MDP formulation with a scalable state representation that is independent of the number of EV charging stations. Further, we use a batch reinforcement learning algorithm, i.e., an instance of fitted Q-iteration, to learn the optimal charging policy. We analyze its performance using simulation experiments based on real-world EV charging data. More specifically, we (i) explore the various settings in training the RL policy (e.g., the duration of the period covered by the training data), (ii) compare its performance to an all-knowing oracle benchmark (which provides an upper bound on performance, relying on information that is not available or at least imperfect in practice), (iii) analyze performance over the course of a full year to evaluate possible performance fluctuations (e.g., across different seasons), and (iv) demonstrate the generalization capacity of a learned control policy to larger sets of charging stations.
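To make the batch RL step concrete, the following is a minimal sketch of a generic fitted Q-iteration loop. It is not the formulation proposed in this paper. The toy single-EV dynamics, the price vector `PRICES`, the `PENALTY` for unmet demand, the cost-minimizing convention, and the per-pair averaging regressor `fit_mean` are all illustrative assumptions, chosen only to show how Q-function estimates are refitted repeatedly on a fixed batch of transitions.

```python
import random
from collections import defaultdict

PRICES = [3.0, 1.0, 2.0]   # hypothetical price per time slot
HORIZON = len(PRICES)
ACTIONS = (0, 1)           # 0 = idle, 1 = charge one energy unit
PENALTY = 10.0             # hypothetical cost per unit still missing at departure


def step(state, action):
    """Toy single-EV dynamics; state = (time slot, energy units still needed)."""
    t, need = state
    charged = action if need > 0 else 0
    cost = PRICES[t] * charged
    next_state = (t + 1, need - charged)
    done = t + 1 == HORIZON
    if done:
        cost += PENALTY * next_state[1]
    return cost, next_state, done


def collect_batch(n_episodes, seed=0):
    """Gather (s, a, cost, s', done) tuples with a uniformly random policy."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n_episodes):
        state, done = (0, rng.randint(0, 2)), False
        while not done:
            action = rng.choice(ACTIONS)
            cost, next_state, done = step(state, action)
            batch.append((state, action, cost, next_state, done))
            state = next_state
    return batch


def fit_mean(samples):
    """Stand-in regressor: average the targets seen for each (s, a) pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, target in samples:
        sums[key] += target
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def fitted_q_iteration(batch, n_iterations):
    """Refit Q on the fixed batch, bootstrapping from the previous estimate."""
    q = {}  # Q_0 is zero everywhere (missing keys read as 0.0)
    for _ in range(n_iterations):
        samples = []
        for state, action, cost, next_state, done in batch:
            future = 0.0 if done else min(
                q.get((next_state, a), 0.0) for a in ACTIONS
            )
            samples.append(((state, action), cost + future))
        q = fit_mean(samples)
    return q


def greedy_action(q, state):
    """Pick the action with the lowest estimated cost-to-go."""
    return min(ACTIONS, key=lambda a: q.get((state, a), 0.0))


if __name__ == "__main__":
    q = fitted_q_iteration(collect_batch(500), n_iterations=HORIZON)
    print(greedy_action(q, (0, 1)))  # idle in the expensive first slot
```

In this toy setting the learned policy defers charging from the expensive first slot to the cheaper later ones. In a realistic setting the averaging regressor would be replaced by a function approximator that generalizes across states, which is what allows a state representation to stay compact as the number of charging stations grows.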