Cosmic images from the world's largest digital camera are so big they require a 'data butler'
The Rubin Observatory's enormous datasets call for cloud computing, seven different "brokers" and, indeed, a butler of sorts.
The amount of data that will be collected by the Vera C. Rubin Observatory, which released its fabulous first-light images this week, will far outweigh what any telescope before it managed to deliver. This has led astronomers to take a step into cloud computing — as well as enlist the help of seven brokers and a data butler.
Once it is fully up and running, the Rubin Observatory (funded jointly by the U.S. National Science Foundation and the Department of Energy) will collect 20 terabytes of data each night. Analyzing this data, it will issue about 10 million alerts to astronomers every night, all of which will be managed by what are known as "brokers," which filter the huge number of alerts into something more manageable.
"In terms of data, we're at least an order of magnitude bigger than previous telescopes," University of Edinburgh computer scientist George Beckett, who is the U.K. Data Facility Coordinator for Rubin, told Space.com.
Over the next 10 years, Rubin's Legacy Survey of Space and Time will collect about 500 petabytes of data, equivalent to roughly 5 million 100-gigabyte 4K-UHD Blu-ray discs. Once collected by the telescope, the data will be transmitted along a dedicated network link between Rubin, which is located in Chile, and a data center at the SLAC National Accelerator Laboratory in California. From SLAC, a copy of all the raw data will be sent to the IN2P3 computing facility in Lyon, France, and some of the data will also be sent to a U.K.-based distributed computing network.
The processing of the data will be shared between these three data centers, with SLAC contributing 35%, IN2P3 taking on 40% and the U.K. 25%. (There's also a modest data center in Chile, which hosts the Rubin Observatory, to support Chilean astronomers.) Not only do the multiple data centers provide redundancy so data can't be lost in an accident, but they can also support each other if one data center is falling behind on the processing. That's because what really counts for astronomers is getting the important data out quickly, so they can follow up on interesting alerts as soon as possible.
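As a rough illustration of that split, the short Python sketch below divides a batch of work in the proportions quoted above. The 20-terabyte figure is the nightly data volume mentioned earlier. Treating it as the unit of work to be shared is an assumption made for this example, as is the function name split_workload. None of it describes how Rubin actually schedules its processing.

```python
# Processing shares quoted in the article, as whole percentages.
SHARES_PERCENT = {"SLAC": 35, "IN2P3": 40, "U.K.": 25}


def split_workload(total_tb, shares_percent=SHARES_PERCENT):
    """Divide a data volume (in terabytes) in proportion to each center's share.

    Hypothetical helper for illustration only.
    """
    if sum(shares_percent.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    return {site: total_tb * pct / 100 for site, pct in shares_percent.items()}


# Assumption: one night's 20 TB is shared out in the same proportions.
nightly = split_workload(20.0)
for site, terabytes in nightly.items():
    print(f"{site}: {terabytes:.1f} TB")
```

Keeping the shares as whole percentages makes the "adds up to 100" check exact, which a version using decimal fractions would not guarantee.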
"My biggest challenge is having astronomers constantly demanding their data!" joked Beckett.
This vast amount of data will be a precious resource for astronomers not only in the here and now, but also decades into the future.
So, how does one go about searching through it all?
Beckett draws an analogy with searching for a photograph taken on your smartphone. "Your phone is probably full of pictures you've taken over the past five or 10 years, and finding that one picture from two years ago usually involves flicking through and it is a bit of a piecemeal approach," he said. "Now imagine that your phone has 1.5 million photos and they're all 10,000 pixels wide, you haven't got a chance of just flicking through them."
Bringing this analogy back to the Rubin dataset, the solution, Beckett says, is to provide accessible descriptions of all those images so that astronomers can find what they are searching for with relative ease. That's one of the reasons why Rubin's data handling differs from that of previous telescopes, from which astronomers could download the pockets of data they needed without too much complexity. The Rubin dataset is simply too big to download, so it's all kept in the "cloud."
The Rubin dataset is managed by a service called the Data Butler. It records all the metadata, which is the data about the data — time, date, sky coordinates, what's in the image and so on.
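To make the idea of searching by metadata concrete, here is a minimal Python sketch of a metadata index. It is not the Data Butler's actual interface: the ImageRecord fields, the find_images function and the sample values are all hypothetical, chosen only to mirror the kinds of metadata listed above (time, date and sky coordinates).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical metadata entry describing one image, not the image itself."""
    image_id: str
    taken_at: datetime
    ra_deg: float   # right ascension of the pointing, in degrees
    dec_deg: float  # declination of the pointing, in degrees
    band: str       # filter used for the exposure


def find_images(records, ra_range, dec_range, start, end):
    """Return records inside a sky box and taken in the interval [start, end).

    Simplification: the box is assumed not to wrap around RA = 0/360 degrees.
    """
    ra_lo, ra_hi = ra_range
    dec_lo, dec_hi = dec_range
    return [
        r for r in records
        if ra_lo <= r.ra_deg <= ra_hi
        and dec_lo <= r.dec_deg <= dec_hi
        and start <= r.taken_at < end
    ]


# Made-up sample catalog for the example.
catalog = [
    ImageRecord("img-001", datetime(2025, 6, 20, 3, 15), 150.2, -30.1, "r"),
    ImageRecord("img-002", datetime(2025, 6, 21, 4, 0), 150.4, -29.8, "g"),
    ImageRecord("img-003", datetime(2025, 6, 21, 5, 30), 210.0, -45.0, "r"),
    ImageRecord("img-004", datetime(2023, 6, 20, 3, 0), 150.3, -30.0, "i"),
]

matches = find_images(
    catalog,
    ra_range=(150.0, 151.0),
    dec_range=(-31.0, -29.0),
    start=datetime(2025, 6, 1),
    end=datetime(2025, 7, 1),
)
print([r.image_id for r in matches])
```

The point of the design is that the search touches only the small metadata records. The images themselves stay where they are until a query has narrowed down which ones are wanted.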
"An astronomer can come up with pretty much any query they want written in astronomy terms talking about astronomical objects, timescales or coordinate systems, and the Data Butler fetches what they need," said Beckett.
That's for longer-term research, but there are also the transients, the moving objects, the things that go bump in the night that set off alerts to prompt astronomers to chase them up before the transients fade away. These include supernovas, kilonovas that produce gravitational waves, novas, flare stars, eclipsing binaries, magnetar outbursts, asteroids and comets moving across the sky, quasars, and much more besides, possibly even new types of object never seen before. Rubin will produce an estimated 10 million alerts each night, releasing each alert within two minutes of it being detected by the telescope. Even with the help of the Data Butler, how can astronomers possibly sift through all those to find the most important ones to follow up on?
There are seven brokers, operated by scientists in different countries, which will process the full 10 million alerts (and two more brokers with specific science goals that will only work on a subset of the 10 million nightly alerts). For example, there's a Chilean broker called ALeRCE, standing for Automatic Learning for the Rapid Classification of Events, and ANTARES, the Arizona–NOIRLab Temporal Analysis and Response to Events System. The U.K. broker is called Lasair (pronounced LAH-suhr, meaning 'flame' or 'flash' in Scottish and Irish Gaelic) and focuses on transients.
Think of the brokers as a set of filters that astronomers can choose to help sift through the alerts and pick out the ones that they're most interested in. Some of the brokers use machine learning and artificial intelligence algorithms, but more traditional modeling methods are also used for quickly processing the data.
"Astronomers can sign up to a broker, describe the kind of things they're interested in, and hope that with appropriate descriptions the 10 million alerts each night will be filtered down to maybe two or three," said Beckett.
It's not that the other 9,999,998 alerts are not of value — maybe they're just not the thing the astronomer is interested in, or perhaps they're not unique enough to demand dedicated follow-ups, but they do add to the statistics for each type of object.
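The filtering idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the alert fields (kind, magnitude), the make_filter and run_broker helpers and the sample alerts are all hypothetical, and real brokers rely on far richer data and classification methods.

```python
from collections import Counter

# Made-up alerts for illustration. A lower magnitude means a brighter object.
alerts = [
    {"id": 1, "kind": "supernova", "magnitude": 18.2},
    {"id": 2, "kind": "asteroid", "magnitude": 20.1},
    {"id": 3, "kind": "supernova", "magnitude": 22.5},
    {"id": 4, "kind": "flare star", "magnitude": 17.0},
    {"id": 5, "kind": "asteroid", "magnitude": 19.4},
    {"id": 6, "kind": "kilonova", "magnitude": 21.0},
]


def make_filter(kinds, max_magnitude):
    """Build a predicate from an astronomer's stated interests (hypothetical)."""
    def accept(alert):
        return alert["kind"] in kinds and alert["magnitude"] <= max_magnitude
    return accept


def run_broker(alerts, accept):
    """Return the alerts that pass the filter, plus counts for every alert seen."""
    selected = [a for a in alerts if accept(a)]
    stats = Counter(a["kind"] for a in alerts)  # unselected alerts still count
    return selected, stats


# An astronomer interested only in reasonably bright explosive transients.
wanted = make_filter(kinds={"supernova", "kilonova"}, max_magnitude=21.0)
selected, stats = run_broker(alerts, wanted)

print([a["id"] for a in selected])
print(dict(stats))
```

Note that the counts are taken over every alert, not just the selected ones, which mirrors the point that filtered-out alerts still feed the statistics for each type of object.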
Rubin will survey a quarter of the Southern Hemisphere sky every night, seeing everything and missing nothing. One might think that it is the survey to end all surveys, that there will never be a bigger survey that will produce more data. However, Beckett also works on the data management team for the Square Kilometre Array (SKA), which is a huge array of radio telescopes in South Africa and Australia, and the techniques developed for Rubin and the lessons learned are going into making the data handling for the SKA run a lot smoother.
"The size of Rubin's dataset will be swamped by the SKA, which will be an order of magnitude again larger than Rubin," said Beckett.
There's always a bigger fish!

Keith Cooper is a freelance science journalist and editor in the United Kingdom, and has a degree in physics and astrophysics from the University of Manchester. He's the author of "The Contact Paradox: Challenging Our Assumptions in the Search for Extraterrestrial Intelligence" (Bloomsbury Sigma, 2020) and has written articles on astronomy, space, physics and astrobiology for a multitude of magazines and websites.