Landmark matching via geodesic shooting is a prerequisite task for numerous registration-based applications in biomedicine. Geodesic shooting has been developed as one solution approach and formulates diffeomorphic registration as an optimal control problem under the Hamiltonian framework. In this framework, with the initial landmark positions q0 fixed, the problem depends solely on the initial momentum p0 and evolves through time steps according to a set of constraint equations. Given an initial p0, the algorithm flows q and p forward through the time steps, calculates a loss based on point-set mismatch and kinetic energy, back-propagates through time to compute the gradient with respect to p0, and updates p0. In both the forward and the backward pass, a pairwise kernel K on the landmark points and additional intermediate terms have to be calculated and marginalized, leading to O(N^2) computational complexity, where N is the number of points to be registered. For medical image applications, N may be in the range of thousands, rendering this operation computationally expensive. In this work we propose a CUDA implementation based on shared memory reduction. Our implementation achieves a speedup of nearly two orders of magnitude over a naive CPU-based implementation, in addition to improved numerical accuracy and better registration results.
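To make the forward pass and its O(N^2) cost concrete, the following is a minimal CPU sketch in plain Python, not the CUDA implementation proposed here. It assumes a Gaussian kernel of width sigma, the standard landmark Hamiltonian H(q, p) = 1/2 sum_ij (p_i . p_j) K(q_i, q_j), and a forward Euler integrator; none of these choices is specified above. All function names (gaussian_kernel, hamiltonian_rhs, shoot, loss) and the weight parameter are hypothetical.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Assumed kernel: K(a, b) = exp(-|a - b|^2 / (2 sigma^2))."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def hamiltonian_rhs(q, p, sigma):
    """Time derivatives of q and p; the double loop is the O(N^2) part."""
    n, dim = len(q), len(q[0])
    dq = [[0.0] * dim for _ in range(n)]
    dp = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            k = gaussian_kernel(q[i], q[j], sigma)
            pp = sum(a * b for a, b in zip(p[i], p[j]))
            for d in range(dim):
                # dq_i/dt = sum_j K_ij p_j
                dq[i][d] += k * p[j][d]
                # dp_i/dt = sum_j (p_i . p_j) K_ij (q_i - q_j) / sigma^2
                dp[i][d] += pp * k * (q[i][d] - q[j][d]) / sigma ** 2
    return dq, dp


def shoot(q0, p0, sigma=1.0, steps=10):
    """Flow (q, p) from t = 0 to t = 1 with forward Euler (an assumption)."""
    dt = 1.0 / steps
    q = [list(x) for x in q0]
    p = [list(x) for x in p0]
    for _ in range(steps):
        dq, dp = hamiltonian_rhs(q, p, sigma)
        q = [[qi + dt * di for qi, di in zip(qr, dr)] for qr, dr in zip(q, dq)]
        p = [[pi + dt * di for pi, di in zip(pr, dr)] for pr, dr in zip(p, dp)]
    return q, p


def loss(q1, target, q0, p0, sigma=1.0, weight=1.0):
    """Point-set mismatch plus weighted kinetic energy at t = 0."""
    mismatch = sum(
        (a - b) ** 2 for qr, tr in zip(q1, target) for a, b in zip(qr, tr)
    )
    kinetic = 0.5 * sum(
        gaussian_kernel(q0[i], q0[j], sigma)
        * sum(a * b for a, b in zip(p0[i], p0[j]))
        for i in range(len(q0))
        for j in range(len(q0))
    )
    return mismatch + weight * kinetic


if __name__ == "__main__":
    q0 = [[0.0, 0.0], [1.0, 0.0]]
    p0 = [[0.0, 0.5], [0.0, -0.5]]
    q1, p1 = shoot(q0, p0, sigma=1.0, steps=20)
    print(loss(q1, [[0.0, 0.5], [1.0, -0.5]], q0, p0))
```

For each landmark i, the inner loop over j is an independent sum over all other landmarks, which is the kind of per-row accumulation that a shared memory reduction on the GPU is meant to parallelize.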