Neural style transfer, which applies the artistic style of one image to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms. On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. In this work, we are specifically interested in the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it is able to outperform selected baselines.