Abstract
Recently, large language models (LLMs) have demonstrated outstanding reasoning capabilities on mathematical and coding tasks. However, their application to financial tasks, especially the most fundamental task of stock movement prediction, remains underexplored. We study a three-class classification problem (up, hold, down) and, by analyzing existing reasoning responses, observe that: (1) LLMs follow analysts' opinions rather than exhibiting systematic, independent analytical logic in their chains of thought (CoTs); (2) LLMs list summaries from different sources without weighing adversarial evidence, yet such counterevidence is crucial for reliable prediction. These observations indicate that the models do not make good use of their reasoning ability to complete the task. To address this, we propose Reflective Evidence Tuning (RETuning), a cold-start method applied prior to reinforcement learning, to enhance prediction ability. While generating a CoT, RETuning encourages the model to dynamically construct an analytical framework from diverse information sources, to organize and score evidence for a price increase or decrease based on that framework rather than on contextual viewpoints, and finally to reflect in order to derive the prediction. This approach maximally aligns the model with its learned analytical framework, ensuring independent logical reasoning and reducing undue influence from context. We also build a large-scale dataset spanning all of 2024 for 5,123 A-share stocks, with long contexts (32K tokens) and over 200K samples. In addition to price and news, it incorporates analysts' opinions, quantitative reports, fundamental data, macroeconomic indicators, and similar stocks. Experiments show that RETuning successfully unlocks the model's reasoning ability in the financial domain. Inference-time scaling remains effective even after six months and on out-of-distribution stocks, since the models gain valuable insights about stock movement prediction.