CERN operates seven large particle accelerators, including the Large Hadron Collider (LHC). With a circumference of 27 kilometers, the LHC is by far the largest machine ever constructed; it contains more than 10,000 superconducting magnets that steer and focus its beams of subatomic particles. To keep the LHC and the other accelerators working reliably for the scientific community, CERN gathers, stores, and analyzes more than 2.5 million signals (3.5 TB of data daily) from the operational devices that make up the Accelerator Control System.
Vito Baggiolini, leader of the CPA Section in the CERN Accelerator Controls Group, says: “We mainly collect information on how the hundreds of sub-systems and tens of thousands of devices in the accelerators are working, and on the characteristics of the particle beam, such as its shape and intensity. The data we store helps accelerator experts to ensure that the whole complex of accelerators is functioning reliably and to the expected standards at all times.”
In the early 2000s, CERN built its own highly available and resilient data acquisition and storage system, CALS (the CERN Accelerator Logging Service). Based on a high-performance relational database, the system scaled well from its first deployment in 2003, but as the need for more complex, long-running analyses grew, it became clear that a new approach was required.
In 2017, CERN set out to create its next-generation replacement: NXCALS. This would use Spark on a Hadoop cluster for distributed execution of advanced analytical algorithms.
The key technical requirements for the data acquisition and ingestion layer of NXCALS were: exceptional fault tolerance and resilience, so that process failures cause no data loss or other service disruption; even load balancing of data subscriptions across processes, to avoid overloads; and the ability to scale rapidly and easily by adding new system resources as new sources of data appear.
Marcin Sobieszek, senior computing engineer within the CERN Accelerator Controls Group, says: “We aimed to implement a highly distributed application in which workload would be spread horizontally among nodes. We wanted the ability to dynamically add resources to the process cluster, along with failure detection and notification capabilities. For this, the ability to persist the state of this distributed system on external storage was vital.”
For its NXCALS system, the CERN Accelerator Controls Group opted for a Master-Worker data processing model, in which a single master instance assigns subscriptions to multiple workers, and chose Akka from Lightbend to implement it.
Akka is a software framework designed for the development of distributed, concurrent, fault-tolerant, and scalable applications. It uses the Actor Model to provide a high-level abstraction for concurrent programming: actors in Akka encapsulate state, ensure thread safety, and make concurrency much simpler.
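To make the pattern concrete, here is a minimal sketch of a Master-Worker setup using Akka’s classic Java actor API. The class and message names (SubscriptionMaster, SubscriptionWorker, Subscribe), the fixed pool of three workers, and the round-robin assignment are illustrative assumptions, not CERN’s actual implementation.

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

import java.util.ArrayList;
import java.util.List;

public class MasterWorkerSketch {

  // Request to start acquiring data from one device (illustrative message).
  static final class Subscribe {
    final String deviceId;
    Subscribe(String deviceId) { this.deviceId = deviceId; }
  }

  // Worker: owns the subscriptions assigned to it. State lives inside the
  // actor and is only touched from its own message handler, so no explicit
  // locking is needed.
  public static class SubscriptionWorker extends AbstractActor {
    private final List<String> subscriptions = new ArrayList<>();

    @Override
    public Receive createReceive() {
      return receiveBuilder()
          .match(Subscribe.class, s -> subscriptions.add(s.deviceId))
          .build();
    }
  }

  // Master: spawns a pool of workers and assigns each new subscription to
  // one of them (round-robin here; any placement policy would do).
  public static class SubscriptionMaster extends AbstractActor {
    private final List<ActorRef> workers = new ArrayList<>();
    private int next = 0;

    @Override
    public void preStart() {
      for (int i = 0; i < 3; i++) {
        workers.add(getContext().actorOf(
            Props.create(SubscriptionWorker.class), "worker-" + i));
      }
    }

    @Override
    public Receive createReceive() {
      return receiveBuilder()
          .match(Subscribe.class, s -> {
            workers.get(next).forward(s, getContext());
            next = (next + 1) % workers.size();
          })
          .build();
    }
  }

  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("acquisition-sketch");
    ActorRef master = system.actorOf(
        Props.create(SubscriptionMaster.class), "master");
    master.tell(new Subscribe("device-001"), ActorRef.noSender());
  }
}
```

In a real cluster the master would typically create workers on remote nodes and use a smarter placement policy, but the message flow stays the same: the master receives a subscription and hands it to a worker, and each worker touches only its own state.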
Marcin Sobieszek says, “Akka fulfilled our requirements for workload distribution, dynamic scalability, and resilience, especially given all its clustering extensions. As a developer, you don’t have to think about concurrency or about where an actor is physically located; Akka takes care of that and gives you a simple abstraction to work with. And although it’s written in Scala, it’s great that it also offers an API for Java, which is our preferred language.”
“The actor model in Akka made it easy for us to implement a scaling system of workers that is still relatively lightweight, especially compared to other clustering solutions,” adds Vito Baggiolini. “When our monitoring system alerts us that the number of subscriptions handled by a single worker has exceeded a given threshold, we can adjust the deployment parameter to rebalance the subscriptions or spawn more nodes.”
Written in Java using the Akka framework, CERN’s Data Sources time-series acquisition software is fully horizontally scalable and fault-tolerant. It currently handles about 90,000 subscriptions, amounting to roughly 100,000 messages per second, and stores the acquired data in Hadoop.
Actors in Akka communicate through asynchronous messages, and the default delivery guarantee is “at-most-once”, meaning that each message is delivered either once or not at all. CERN layered an “at-least-once” delivery protocol on top of this default, to ensure that messages between actors would always be received.
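The sketch below shows one common way to build such an acknowledgement protocol in Java: the sender tracks each message by sequence number until it is acknowledged and periodically resends unacknowledged ones, while the receiver acknowledges and deduplicates. All names and the five-second resend interval are illustrative assumptions; the source does not describe CERN’s exact protocol, and Akka also ships a ready-made AtLeastOnceDelivery facility in akka-persistence.

```java
import akka.actor.AbstractActor;
import akka.actor.AbstractActorWithTimers;
import akka.actor.ActorRef;

import java.time.Duration;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AtLeastOnceSketch {

  // Payload wrapped with a sequence number so the receiver can acknowledge it.
  static final class Envelope {
    final long seqNr;
    final Object payload;
    Envelope(long seqNr, Object payload) { this.seqNr = seqNr; this.payload = payload; }
  }

  // Acknowledgement returned by the receiver once seqNr has been handled.
  static final class Ack {
    final long seqNr;
    Ack(long seqNr) { this.seqNr = seqNr; }
  }

  // Sender side: keeps every message until it is acknowledged and resends
  // unacknowledged ones on a timer, upgrading Akka's at-most-once default
  // to at-least-once semantics.
  public static class Sender extends AbstractActorWithTimers {
    private static final Object RESEND_TICK = "resend";
    private final ActorRef target;
    private final Map<Long, Envelope> unacked = new HashMap<>();
    private long nextSeqNr = 0;

    public Sender(ActorRef target) {
      this.target = target;
      // Periodically retry anything not yet acknowledged (interval assumed).
      getTimers().startTimerAtFixedRate("resend", RESEND_TICK, Duration.ofSeconds(5));
    }

    @Override
    public Receive createReceive() {
      return receiveBuilder()
          .match(Ack.class, ack -> unacked.remove(ack.seqNr))
          .matchEquals(RESEND_TICK, t ->
              unacked.values().forEach(e -> target.tell(e, getSelf())))
          .matchAny(payload -> {
            // Any other message is wrapped, tracked, and forwarded downstream.
            Envelope e = new Envelope(nextSeqNr++, payload);
            unacked.put(e.seqNr, e);
            target.tell(e, getSelf());
          })
          .build();
    }
  }

  // Receiver side: acknowledges every envelope and deduplicates by sequence
  // number, since at-least-once delivery may deliver the same message twice.
  public static class Receiver extends AbstractActor {
    private final Set<Long> seen = new HashSet<>();

    @Override
    public Receive createReceive() {
      return receiveBuilder()
          .match(Envelope.class, e -> {
            if (seen.add(e.seqNr)) {
              // Process e.payload exactly once here.
            }
            getSender().tell(new Ack(e.seqNr), getSelf());
          })
          .build();
    }
  }
}
```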
Marcin Sobieszek says: “It was easy to adapt the paradigm within Akka – we just implemented an additional protocol for acknowledging messages. Another advantage is the ability to assign roles to cluster nodes, so that they can be easily managed as required by the business logic – for example, so that you can isolate nodes. In our system, we have a validation process with the role validation-executor to check new subscriptions before we allow them to send data to the NXCALS system. Once a device is subscribed, we monitor the volume of data it sends. If this suddenly becomes excessively large, we isolate the subscription to its own worker node, assigned the isolated-worker role, so that it cannot destabilize the wider NXCALS acquisition layer. This was really useful functionality, almost out of the box.”
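Node roles are standard Akka Cluster configuration. Below is a brief sketch of how role-specific behavior might be wired up at startup; the role names come from the quote, while the system name and what runs on each node are illustrative assumptions.

```java
import akka.actor.ActorSystem;
import akka.cluster.Cluster;

public class RoleAwareStartup {
  public static void main(String[] args) {
    // Each node declares its roles in its configuration before joining, e.g.:
    //   akka.cluster.roles = ["validation-executor"]
    // (this also requires akka.actor.provider = "cluster")
    ActorSystem system = ActorSystem.create("acquisition-sketch");
    Cluster cluster = Cluster.get(system);

    if (cluster.selfMember().hasRole("validation-executor")) {
      // Start only the actors that validate new subscriptions on this node.
    } else if (cluster.selfMember().hasRole("isolated-worker")) {
      // Start workers reserved for quarantined, high-volume subscriptions.
    } else {
      // Default case: start the regular acquisition workers.
    }
  }
}
```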
The mission-critical NXCALS acquisition system built on Akka is the go-to service for monitoring the all-important beam quality and for troubleshooting problems with accelerator equipment or unexpected beam behavior.
“CERN operations teams and equipment experts must be able to look at all the accumulated data to check for any operational anomalies”, says Vito Baggiolini. “This is why it is so important to have virtually no data loss. The Akka framework helps make NXCALS data acquisition extremely resilient to failure.”
Marcin Sobieszek adds: “The self-healing capabilities of Akka are really impressive: during broader infrastructure maintenance, the cluster shuts itself down and then fully recreates itself when the network is restored, without loss of data.”
The flexibility of Akka comes into play whenever a new system is commissioned or a new mode of operation is established: the team can add new data subscriptions non-disruptively, without stopping data acquisition from other devices. This is a critical benefit in an environment that generates data constantly and typically runs for many months without a break.
“NXCALS has shown remarkable robustness and high availability, and we are confident that it will continue to scale in line with fast-growing volumes of operational data when the High-Luminosity LHC (HL-LHC) comes online in the years to come,” says Vito Baggiolini. “This robustness and scalability contribute to the stable environment needed to support some of the world’s leading research into fundamental physics.”
CERN is an inter-governmental, non-profit organization responsible for running the world’s largest particle physics laboratory. Straddling the border between France and Switzerland, CERN employs more than 2,500 people and helps a wider population of more than 12,000 scientific researchers unlock the secrets of the physical universe. If you’re reading this text on the web, you can thank CERN for that too: this is where the World Wide Web was invented in 1989.
Akka, from Lightbend, is the basis of CERN’s mission-critical NXCALS acquisition system, which enables the organization to monitor the quality of its particle beam and to troubleshoot issues with accelerator equipment or unexpected beam behavior.