Intelligent Arxiv: Sort daily papers by learning users topics preference

Abstract

Daily paper releases are becoming increasingly large, and areas of research are growing in diversity. This makes it harder for scientists to keep up to date with the current state of the art and to identify relevant work within their lines of interest. The goal of this article is to address this problem using Machine Learning techniques. We model a scientific paper as a combination of scientific knowledge from diverse topics, brought together into a new problem. In light of this, we apply the unsupervised Machine Learning technique of Latent Dirichlet Allocation (LDA) to the corpus of papers in a given field to: i) define and extract the underlying topics in the corpus; ii) obtain the topic weight vector for each paper in the corpus; and iii) obtain the topic weight vector for new papers. By registering the papers preferred by a user, we build a user vector of weights from the vectors of the selected papers. Then, by taking the inner product between the user vector and the vector of each paper in the daily Arxiv release, we can sort the papers according to the user's preference for the underlying topics. We have created the website IArxiv.org, where users can read sorted daily Arxiv releases (and more) while the algorithm learns each user's preferences, yielding a more accurate sorting every day. The current version of IArxiv.org runs on the Arxiv categories astro-ph, gr-qc, hep-ph and hep-th, and we plan to extend it to others. We propose several useful and relevant features to be developed in addition, as well as new Machine Learning techniques beyond LDA to further improve the accuracy of this new tool.
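The ranking step described above can be illustrated with a minimal sketch. It assumes the topic weight vectors have already been produced by an LDA model, and it assumes the user vector is the plain average of the selected papers' vectors; the abstract does not state the exact combination rule, so this choice is illustrative only. The function names, paper identifiers, and the three-topic toy data are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def build_user_vector(selected: List[Vector]) -> List[float]:
    """Combine the topic vectors of a user's selected papers.

    Assumption: a simple mean over the selected papers. The abstract only
    says the user vector is built from these vectors, not how.
    """
    if not selected:
        raise ValueError("need at least one selected paper")
    n_topics = len(selected[0])
    return [sum(v[k] for v in selected) / len(selected) for k in range(n_topics)]


def inner_product(u: Vector, v: Vector) -> float:
    """Inner product between two topic weight vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def rank_papers(user: Vector, daily: Dict[str, Vector]) -> List[str]:
    """Return paper ids sorted by decreasing inner product with the user vector."""
    return sorted(daily, key=lambda pid: inner_product(user, daily[pid]), reverse=True)


# Toy example with three hypothetical topics; in practice these vectors
# would come from an LDA model fitted on the corpus.
selected_papers = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
user_vector = build_user_vector(selected_papers)

daily_release = {
    "paper_a": [0.1, 0.1, 0.8],
    "paper_b": [0.7, 0.2, 0.1],
    "paper_c": [0.2, 0.7, 0.1],
}
ranking = rank_papers(user_vector, daily_release)
print(ranking)
```

In this toy release, the paper whose weight is concentrated on the same topic as the user's selected papers is ranked first. As the user selects more papers, the averaged vector changes and so does the ordering, which matches the qualitative behaviour the abstract describes.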
