We construct a general unified framework for learning representation ofstructured data, i.e. data which cannot be represented as the fixed-lengthvectors (e.g. sets, graphs, texts or images of varying sizes). The key factoris played by an intermediate network called SAN (Set Aggregating Network),which maps a structured object to a fixed length vector in a high dimensionallatent space. Our main theoretical result shows that for sufficiently largedimension of the latent space, SAN is capable of learning a uniquerepresentation for every input example. Experiments demonstrate that replacingpooling operation by SAN in convolutional networks leads to better results inclassifying images with different sizes. Moreover, its direct application totext and graph data allows to obtain results close to SOTA, by simpler networkswith smaller number of parameters than competitive models.