To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.