The thermal management of data centers has reached a boiling point. For decades, air cooling was the undisputed standard, capable of handling the steady heat dissipation requirements of traditional enterprise servers. In modern AI data centers, however, where generative AI workloads push rack densities toward 100 kW or more, air cooling has hit its thermodynamic ceiling. Air simply cannot carry away that much heat as efficiently as liquid can[1].
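To put rough numbers on that ceiling, the sketch below applies the standard heat balance Q = m_dot * cp * delta_T to a 100 kW rack and compares the volume of air and of water needed to carry the heat away. The fluid properties are approximate room-temperature values, and the 10 K coolant temperature rise is an assumption chosen for illustration only, not a design figure.

```python
# Approximate fluid properties near room temperature (illustrative values).
AIR = {"density_kg_m3": 1.2, "cp_j_kg_k": 1005.0}
WATER = {"density_kg_m3": 998.0, "cp_j_kg_k": 4186.0}


def volumetric_flow_m3_s(heat_w, delta_t_k, fluid):
    """Volume flow needed to carry heat_w with a coolant temperature rise of delta_t_k.

    Derived from Q = m_dot * cp * delta_T, so m_dot = Q / (cp * delta_T),
    and volume flow = m_dot / density.
    """
    mass_flow_kg_s = heat_w / (fluid["cp_j_kg_k"] * delta_t_k)
    return mass_flow_kg_s / fluid["density_kg_m3"]


RACK_HEAT_W = 100_000.0  # 100 kW rack, the density cited in the text
DELTA_T_K = 10.0         # assumed coolant temperature rise (hypothetical)

air_flow = volumetric_flow_m3_s(RACK_HEAT_W, DELTA_T_K, AIR)
water_flow = volumetric_flow_m3_s(RACK_HEAT_W, DELTA_T_K, WATER)

print(f"air:   {air_flow:.2f} m^3/s")
print(f"water: {water_flow * 1000:.2f} L/s")
print(f"ratio: {air_flow / water_flow:.0f}x")
```

Under these assumptions, air needs roughly 8 cubic meters per second while water needs a little over 2 liters per second, a difference of more than three orders of magnitude in volume flow.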
In 2026, liquid cooling, specifically direct-to-chip (DTC) and immersion cooling, has shifted from a niche experiment to a mainstream cooling solution for AI factories[2]. But bringing liquid into the data hall brings a new set of challenges and risks. In a liquid-cooled environment, the margin for error is practically non-existent. A single leak in a 2-megawatt server row or a failure in pressure regulation can lead to a potentially catastrophic outcome.
We understand that transitioning to liquid cooling requires rethinking traditional approaches to facility asset monitoring. The focus for operators is now shifting from passive monitoring to holistic, real-time visibility of operational technology (OT) cooling systems and the implementation of robust redundancy measures to mitigate the risks of liquid cooling.
Different Cooling Techniques, Different Monitoring Needs
In a traditional air-cooled data center, monitoring is relatively straightforward. Temperature probes and humidity sensors provide sufficient data for a building management system (BMS) to adjust fan speeds or chiller setpoints over the span of minutes.
For liquid cooling, the physics and consequences of a cooling failure are fundamentally different[3]. Because liquids are so much better than air at carrying heat away, liquid-cooled systems are designed with much tighter tolerances. In a direct-to-chip setup, if the coolant distribution unit (CDU) loses pressure or a pump fails, the temperature of a GPU cluster can spike to critical levels in seconds.
Offsetting potential risks of liquid cooling requires a major expansion in both the volume and complexity of system monitoring compared to traditional air-cooled environments. Where previously ambient temperature readings were sufficient, operators must now maintain continuous visibility over a much more diverse array of critical metrics, ranging from fluid dynamics and system pressure to coolant chemistry and moisture leak detection.
Closing the Sensor Gap for Monitoring the Cooling Flow
The key obstacle for operators transitioning to liquid cooling is bringing these specialized sensors into a unified monitoring architecture[4]. The sensors often communicate using industrial protocols that traditional IT networks are not designed to handle reliably. Furthermore, the environmental conditions inside liquid-cooled racks are particularly challenging. High-density racks generate localized heat pockets, and proximity to liquid manifolds exposes equipment to moisture that standard enterprise-grade hardware cannot withstand.
Industrial-grade remote I/O solutions help bridge this integration gap. Designed to sit at the edge or directly at the rack or CDU, these devices aggregate various sensor signals, including analog, digital, and thermocouple inputs. They then convert these signals into a unified data stream via Modbus TCP, EtherNet/IP, or SNMP, giving the BMS accurate, holistic oversight of system health across the facility.
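The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the firmware of any particular remote I/O device: it assumes analog sensors use a 4-20 mA current loop (a common industrial convention, not stated in this article), that thermocouple channels already arrive as degrees Celsius, and that the tag names, field names, and fault limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    tag: str      # hypothetical point name, e.g. "cdu1.supply_pressure"
    value: float  # value in engineering units
    unit: str


def scale_current_loop(milliamps, lo, hi):
    """Map a 4-20 mA signal linearly onto the engineering range [lo, hi]."""
    # Assumed fault limits: a signal far outside 4-20 mA suggests a wiring fault.
    if not 3.8 <= milliamps <= 20.2:
        raise ValueError(f"current loop out of range: {milliamps} mA")
    clamped = min(max(milliamps, 4.0), 20.0)
    return lo + (clamped - 4.0) * (hi - lo) / 16.0


def normalize(raw):
    """Turn one raw channel sample (a dict) into a Reading in engineering units."""
    kind = raw["kind"]
    if kind == "analog_4_20":
        value = scale_current_loop(raw["ma"], raw["lo"], raw["hi"])
        return Reading(raw["tag"], round(value, 3), raw["unit"])
    if kind == "digital":
        return Reading(raw["tag"], 1.0 if raw["state"] else 0.0, "bool")
    if kind == "thermocouple":
        # Assumes the I/O module has already converted the signal to Celsius.
        return Reading(raw["tag"], float(raw["celsius"]), "degC")
    raise ValueError(f"unknown channel kind: {kind}")


samples = [
    {"kind": "analog_4_20", "tag": "cdu1.supply_pressure",
     "ma": 12.0, "lo": 0.0, "hi": 10.0, "unit": "bar"},
    {"kind": "digital", "tag": "rack7.leak_rope", "state": False},
    {"kind": "thermocouple", "tag": "rack7.manifold_temp", "celsius": 41.5},
]

unified = [normalize(s) for s in samples]
for reading in unified:
    print(reading)
```

Once every channel is expressed as the same kind of record, publishing it over Modbus TCP, EtherNet/IP, or SNMP becomes a question of mapping tags to registers or object identifiers, which is left out of this sketch.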
From Passive Monitoring to Active Protection
In high-stakes data centers, simply knowing an issue has occurred is no longer enough. If a leak is detected in a 2-megawatt row, the system must be able to respond immediately. Real-time, precise monitoring capabilities are the cornerstone of a centralized response platform to address potential issues before they have a chance to cause any damage. For such a platform to work, it needs a brain capable of local, deterministic logic.
Industrial computers lie at the heart of this architecture. More than just gateways, they are powerful edge computing platforms capable of running autonomous emergency logic. For instance, they can be programmed to monitor pressure and leak sensors simultaneously. If the sensor data matches a specific pattern, such as a sudden drop in pressure combined with a moisture trigger, the computer can issue a sub-second shutdown command to the CDU valves, bypassing the latency of the wider network. By deploying industrial computers, facility operators can create an active response system that uses the aggregated, real-time sensor data to contain liquid cooling failures before they escalate.
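A minimal sketch of that kind of local interlock follows. It is an illustration under stated assumptions rather than a production safety function: the 0.5 bar drop threshold, the five-sample window, the class name, and the `close_cdu_valves` placeholder are all hypothetical, and a real deployment would write to an actual valve output and follow the site's safety engineering practices.

```python
from collections import deque


class LeakInterlock:
    """Trips when pressure drops sharply within a short window AND moisture is seen."""

    def __init__(self, drop_threshold_bar=0.5, window=5):
        self.drop_threshold_bar = drop_threshold_bar  # assumed threshold
        self.history = deque(maxlen=window)           # recent pressure samples
        self.tripped = False

    def update(self, pressure_bar, moisture_detected):
        """Feed one sample; return True once the shutdown condition has fired."""
        self.history.append(pressure_bar)
        # Drop relative to the highest pressure seen in the recent window.
        drop = max(self.history) - pressure_bar
        if drop >= self.drop_threshold_bar and moisture_detected:
            self.tripped = True  # latched until an operator resets it
        return self.tripped


def close_cdu_valves():
    # Placeholder: a real system would drive the valve actuator output here.
    print("CDU valves commanded closed")


interlock = LeakInterlock()
samples = [(3.0, False), (3.0, False), (2.9, False), (2.3, True)]
for pressure, moisture in samples:
    if interlock.update(pressure, moisture):
        close_cdu_valves()
        break
```

Requiring both conditions keeps a single noisy sensor from shutting down a row, and latching the trip means the valves stay closed until someone has inspected the leak. Because the decision is made on the edge computer, it does not depend on a round trip to the BMS.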
Redundancy in the Cooling Nervous System
In an AI factory, the network that carries cooling telemetry is just as critical as the cooling liquid itself. If the network fails, the liquid cooling management platform is blind. With stakes this high, relying on basic redundancy implementations, such as daisy chaining, becomes a liability.
By utilizing industrial Ethernet switches with ring redundancy capable of sub-20 ms recovery, such as Moxa’s Turbo Ring technology[5], operators ensure that a single faulty cable does not lead to a blind spot in thermal management and the potential loss of multimillion-dollar GPU investments.
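The topological reason a ring tolerates a faulty cable while a daisy chain does not can be checked with a small connectivity test. The sketch below models only the wiring, with hypothetical switch names; it does not model Turbo Ring or any other recovery protocol, and it says nothing about recovery time.

```python
def reachable(nodes, links, start):
    """Return the set of nodes reachable from start over undirected links."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def survives_any_single_link_failure(nodes, links):
    """True if every node stays reachable no matter which one link is cut."""
    for failed in links:
        remaining = [link for link in links if link != failed]
        if reachable(nodes, remaining, nodes[0]) != set(nodes):
            return False
    return True


switches = ["sw1", "sw2", "sw3", "sw4"]  # hypothetical switch names
daisy_chain = [("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "sw4")]
ring = daisy_chain + [("sw4", "sw1")]  # close the loop with one extra cable

print("daisy chain survives:", survives_any_single_link_failure(switches, daisy_chain))
print("ring survives:", survives_any_single_link_failure(switches, ring))
```

In a daisy chain, any cut cable isolates everything downstream of it. Closing the loop gives every switch a second path, and the ring protocol's job is then to activate that path quickly enough that telemetry is not interrupted for long.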
Reclaiming Peace of Mind in a Liquid-cooled World
The extreme thermal loads common in generative AI facilities make the transition to liquid cooling inevitable. However, adopting liquid cooling without a corresponding upgrade in OT data visibility is a risk that operators cannot afford to take.
By overhauling traditional monitoring approaches and complementing them with robust, millisecond-level network redundancy, data center operators can build a solid foundation on which to scale their AI infrastructure. Industrial-grade I/Os, deterministic edge logic, and redundancy mechanisms are the essential building blocks for designing resilient data center infrastructure.
For more information on how Moxa can help secure your liquid-cooled infrastructure, check out our brochure.