Medical systematic reviews are crucial for informing clinical decision making and healthcare policy. But producing such reviews is onerous and time-consuming. Thus, high-quality evidence synopses are not available for many questions and may be outdated even when they are available. Large language models (LLMs) are now capable of generating long-form texts, suggesting the tantalizing possibility of automatically generating literature reviews on demand. However, LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucinating or omitting important information. In the healthcare context, this may render LLMs unusable at best and dangerous at worst. Most discussion surrounding the benefits and risks of LLMs has been divorced from specific applications. In this work, we seek to qualitatively characterize the potential utility and risks of LLMs for assisting in the production of medical evidence reviews. We conducted 16 semi-structured interviews with international experts in systematic reviews, grounding discussion in the context of generating evidence reviews. Domain experts indicated that LLMs could aid in writing reviews, as a tool for drafting or creating plain language summaries, generating templates or suggestions, distilling information, crosschecking, and synthesizing or interpreting text inputs. But they also identified issues with model outputs and expressed concerns about potential downstream harms of confidently composed but inaccurate LLM outputs which might mislead. Other anticipated potential downstream harms included lessened accountability and proliferation of automatically generated reviews that might be of low quality. Informed by this qualitative analysis, we identify criteria for rigorous evaluation of biomedical LLMs aligned with domain expert views.