Federated Learning (FL) seeks to distribute model training across localclients without collecting data in a centralized data-center, hence removingdata-privacy concerns. A major challenge for FL is data heterogeneity (whereeach client's data distribution can differ) as it can lead to weight divergenceamong local clients and slow global convergence. The current SOTA FL methodsdesigned for data heterogeneity typically impose regularization to limit theimpact of non-IID data and are stateful algorithms, i.e., they maintain localstatistics over time. While effective, these approaches can only be used for aspecial case of FL involving only a small number of reliable clients. For themore typical applications of FL where the number of clients is large (e.g.,edge-device and mobile applications), these methods cannot be applied,motivating the need for a stateless approach to heterogeneous FL which can beused for any number of clients. We derive a first-order gradient regularizationto penalize inconsistent local updates due to local data heterogeneity.Specifically, to mitigate weight divergence, we introduce a first-orderapproximation of the global data distribution into local objectives, whichintuitively penalizes updates in the opposite direction of the global update.The end result is a stateless FL algorithm that achieves 1) significantlyfaster convergence (i.e., fewer communication rounds) and 2) higher overallconverged performance than SOTA methods under non-IID data distribution.Importantly, our approach does not impose unrealistic limits on the clientsize, enabling learning from a large number of clients as is typical in most FLapplications.