Modern astronomy is a treasure hunt on a cosmic scale. Every night, telescopes around the globe scan the skies, searching for fleeting events like exploding stars (supernovae) that give us crucial insights into the workings of the universe. These surveys generate millions of alerts about potential discoveries, but there’s a catch: the vast majority are not real cosmic events but “bogus” signals from satellite trails, cosmic ray hits, or other instrumental artefacts.
For years, astronomers have used specialized machine learning models, like convolutional neural networks (CNNs), to sift through this data. While effective, these models often act as “black boxes,” providing a simple “real” or “bogus” label with no explanation. This forces scientists to either blindly trust the output or spend countless hours manually verifying candidates — a bottleneck that will soon become insurmountable with next-generation telescopes like the Vera C. Rubin Observatory, expected to generate 10 million alerts per night.
This challenge led us to ask a fundamental question: could a general-purpose multimodal model, designed to understand text and images together, not only match the accuracy of these specialized models but also explain what it sees? In our paper, “Textual interpretation of transient image classifications from large language models”, published in Nature Astronomy, we demonstrate that the answer is a resounding yes. We show how Google’s Gemini model can be transformed into an expert astronomy assistant that classifies cosmic events with high accuracy and, crucially, explains its reasoning in plain language. We accomplished this with few-shot learning: rather than retraining the model, we gave Gemini just 15 annotated examples per survey, along with concise instructions, and it learned to both classify candidates and explain each decision.
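To make the idea concrete, here is a minimal sketch of what few-shot, multimodal prompting of this kind can look like using the public google-generativeai Python SDK. The model name, file paths, example annotations, and prompt wording below are illustrative assumptions, not the exact setup or prompt used in the paper, and only two of the 15 per-survey examples are shown.

```python
import google.generativeai as genai
from PIL import Image

# Assumption: model name and API key handling are placeholders.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

INSTRUCTIONS = (
    "You are an expert in transient astronomy. For each candidate you are shown "
    "an image cutout from a sky survey. Decide whether the detection is a real "
    "astrophysical transient or a bogus artefact (e.g. cosmic ray hit, satellite "
    "trail, bad subtraction), and briefly explain your reasoning."
)

# A handful of annotated examples (the paper uses 15 per survey), each pairing
# an image cutout with a short expert-written label and explanation.
FEW_SHOT_EXAMPLES = [
    ("examples/real_01.png",
     "real: point-like source at the expected position, consistent with a supernova."),
    ("examples/bogus_01.png",
     "bogus: elongated streak across the cutout, typical of a satellite trail."),
]

def classify(candidate_path: str) -> str:
    """Build a few-shot prompt from the annotated examples and query Gemini."""
    parts = [INSTRUCTIONS]
    for image_path, annotation in FEW_SHOT_EXAMPLES:
        parts.append(Image.open(image_path))          # example cutout
        parts.append(f"Expert annotation: {annotation}")
    parts.append(Image.open(candidate_path))           # the new, unlabelled candidate
    parts.append("Classify this candidate and explain your reasoning.")
    return model.generate_content(parts).text

print(classify("candidates/new_alert.png"))
```

The key point is that no weights are updated: the annotated examples and instructions are simply interleaved with the candidate image in a single prompt, and the model returns both a label and a plain-language justification.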