This work examines the use of reinforcement learning (RL) to optimize cyclic lockdowns, one of the methods available for controlling the COVID-19 pandemic. The problem is structured as an optimal control system for tracking a reference value, corresponding to the maximum usage level of a critical resource, such as ICU beds. However, instead of using conventional optimal control methods, RL is used to find optimal control policies. A framework was developed to calculate optimal cyclic lockdown timings using an RL-based on-off controller. The RL-based controller is implemented as an RL agent that interacts with an epidemic simulator, implemented as an extended SEIR epidemic model. The RL agent learns a policy function that produces an optimal sequence of open/lockdown decisions such that the goals specified in the RL reward function are optimized. Two concurrent goals were used: the first is a public health goal that minimizes overshoots of ICU bed usage above an ICU bed threshold, and the second is a socio-economic goal that minimizes the time spent under lockdown. Cyclic lockdowns are assumed to be a temporary alternative to extended lockdowns when a region faces imminent danger of exceeding resource capacity limits and when an extended lockdown would cause severe social and economic consequences because the region lacks the economic resources needed to support its affected population.
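To make the setup concrete, the following is a minimal sketch of the kind of simulator-plus-reward loop described above. It is not the framework developed in this work. It assumes a plain SEIR model (not the extended one), and the class `SEIRParams`, the functions `step`, `icu_usage`, `reward`, `rollout`, and `threshold_policy`, all parameter values, and the reward weights are hypothetical choices made for illustration. A fixed threshold rule stands in for the learned RL policy.

```python
from dataclasses import dataclass


@dataclass
class SEIRParams:
    # All values are illustrative assumptions, not taken from this work.
    population: float = 1_000_000.0
    beta_open: float = 0.40       # transmission rate per day when open
    beta_lockdown: float = 0.10   # transmission rate per day under lockdown
    sigma: float = 1 / 5.0        # E -> I rate (1 / incubation period)
    gamma: float = 1 / 7.0        # I -> R rate (1 / infectious period)
    icu_fraction: float = 0.01    # assumed share of infectious needing ICU
    icu_threshold: float = 300.0  # ICU bed reference level


def step(state, lockdown, p, dt=1.0):
    """Advance the (S, E, I, R) state by one forward-Euler step."""
    s, e, i, r = state
    beta = p.beta_lockdown if lockdown else p.beta_open
    new_exposed = beta * s * i / p.population * dt
    new_infectious = p.sigma * e * dt
    new_recovered = p.gamma * i * dt
    return (
        s - new_exposed,
        e + new_exposed - new_infectious,
        i + new_infectious - new_recovered,
        r + new_recovered,
    )


def icu_usage(state, p):
    """Crude ICU proxy: a fixed fraction of the infectious compartment."""
    return p.icu_fraction * state[2]


def reward(state, lockdown, p, w_health=1.0, w_economy=0.1):
    """Penalize ICU overshoot (health goal) and lockdown days (economic goal)."""
    overshoot = max(0.0, icu_usage(state, p) - p.icu_threshold) / p.icu_threshold
    return -w_health * overshoot - w_economy * (1.0 if lockdown else 0.0)


def threshold_policy(state, p):
    """Stand-in on-off rule (not learned): lock down near the ICU threshold."""
    return icu_usage(state, p) > 0.8 * p.icu_threshold


def rollout(policy, p, days=365):
    """Run one episode; return total reward, lockdown days, and final state."""
    state = (p.population - 100.0, 0.0, 100.0, 0.0)
    total, lockdown_days = 0.0, 0
    for _ in range(days):
        lockdown = policy(state, p)
        state = step(state, lockdown, p)
        total += reward(state, lockdown, p)
        lockdown_days += int(lockdown)
    return total, lockdown_days, state


if __name__ == "__main__":
    total, lockdown_days, _ = rollout(threshold_policy, SEIRParams())
    print(f"total reward: {total:.2f}, lockdown days: {lockdown_days}")
```

In an RL setting, `threshold_policy` would be replaced by the agent's learned policy, and the relative size of `w_health` and `w_economy` would set the trade-off between the two goals.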