The ATLAS experiment at CERN is one of the largest scientific projects in history, with thousands of scientists from around the world working together to analyze the torrents of data flowing from its detectors. From CERN’s Large Hadron Collider in Switzerland, data travels to 38 countries, where physicists from 174 institutions apply their own analysis techniques to make new discoveries about the laws and structure of the universe. It’s an ambitious effort, but as in any large, geographically distributed collaboration, organizational challenges grow in tandem with the amount of data collected, processed and shared.
Now, scientists with the Computation Institute and the ATLAS Midwest Tier 2 Center have created a new platform to help unify and monitor this busy global hive of research. Working with open source software, the CloudLab cloud computing testbed, and a team of experts from CERN, the Computation Institute’s Ilija Vukotic designed a new data analytics platform for ATLAS. From anywhere in the world, team members can access a website, or query the platform programmatically through an API, to analyze and visualize the processing of petabytes of data circulating the globe as the project unfolds.
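As a rough illustration of what such programmatic access could look like, the Python sketch below queries a hypothetical REST endpoint and prints per-site transfer volumes. The endpoint URL, query parameters and field names are invented for this example; the article does not document the platform’s actual API.

```python
import requests

# Hypothetical endpoint and field names -- this only illustrates the general
# pattern of pulling summary metrics from an analytics service over HTTP.
ANALYTICS_API = "https://analytics.example.org/api/v1/transfers"

# Ask for a summary of data-transfer activity over the last 24 hours,
# grouped by computing site.
resp = requests.get(
    ANALYTICS_API,
    params={"interval": "24h", "group_by": "site"},
    timeout=30,
)
resp.raise_for_status()

for row in resp.json():
    # Each row might describe one site's share of the global traffic.
    print(f"{row['site']:<20} {row['bytes_transferred'] / 1e12:8.2f} TB")
```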
In aggregate, the ATLAS experiment and collaboration can be thought of as a single project, organized around the ATLAS detector at the Large Hadron Collider. But in practice, it’s really a well-aligned constellation of smaller projects and teams working in concert, each using different methods and approaches to pursue a wide range of physics goals, including detailed measurements of the Higgs boson, the search for supersymmetry, and dozens of other questions at the forefront of particle physics research. Each analysis group requires its own collection of data sets, typically filtered using bespoke analysis algorithms developed by members of each physics team.
Over the past decade, a set of workload management, data management, data transfer and grid middleware services has been developed for ATLAS, providing the project with a comprehensive system for the reconstruction, simulation and analysis of proton collisions in the detector. These services produce prodigious amounts of metadata marking progress, operational performance and group activity. These metadata can be mined to ensure physics teams get the data they need in a timely fashion and can efficiently analyze them once in place.
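To give a flavor of what mining such operational metadata might involve, here is a minimal Python sketch that aggregates transfer records to show where datasets wait longest before they are ready for analysis. The record fields and values are invented for illustration, standing in for the kind of metadata the workload and data-transfer services emit.

```python
import pandas as pd

# Illustrative records only: the field names and numbers are hypothetical.
records = pd.DataFrame([
    {"dataset": "data.A", "site": "MWT2",      "queued_s": 420,  "transfer_s": 3600},
    {"dataset": "data.A", "site": "CERN-PROD", "queued_s": 60,   "transfer_s": 1800},
    {"dataset": "mc.B",   "site": "MWT2",      "queued_s": 5400, "transfer_s": 7200},
])

# Mine the metadata: average end-to-end delay per site, worst first, so
# operators can see where physics groups are kept waiting for their data.
delays = (
    records
    .assign(total_s=lambda d: d.queued_s + d.transfer_s)
    .groupby("site")["total_s"]
    .mean()
    .sort_values(ascending=False)
)
print(delays)
```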
However, these metadata were difficult to share between computing subsystems and were not in a form suitable for processing by modern data analytics frameworks. Determining who possessed the needed data and learning the custom tools developed around that data often took months, Vukotic said, a “Tower of Babel” situation that slowed collaboration within the ATLAS ecosystem.
But today, these kinds of distributed “big data” problems are no longer unique to data-intensive sciences such as physics and astronomy. As similar issues appeared in commercial and government sectors over the past decade, open source developers created suites of new analytics tools to address them. Vukotic set out to apply some of these tools to untangle the growing data knots facing ATLAS researchers.
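For illustration, a common pattern with such open source analytics stacks is to index each operational record into a search-and-aggregation store and build queries and dashboards on top of it. The sketch below assumes an Elasticsearch-style index and invented field names; the article does not specify which particular tools were adopted.

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch  # pip install elasticsearch (8.x client assumed)

# Assumption: an Elasticsearch-style store is used here purely as an example
# of the open source analytics tooling described above.
es = Elasticsearch("http://localhost:9200")

# One metadata record per completed transfer, ready for ad hoc aggregation
# and visualization by whatever front end sits on top of the index.
doc = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "source_site": "CERN-PROD",
    "dest_site": "MWT2",
    "bytes": 2_500_000_000,
    "status": "done",
}
es.index(index="transfer-metadata", document=doc)
```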