Dynamic Multi-Hotspot Cooling: A Machine Learning Approach

Nov 10, 2024

With the rapid evolution of heterogeneous architectures such as System-on-Chip (SoC), chiplets, CoWoS, and 3D ICs, managing on-chip hotspots has become a complex and pressing challenge. These hotspots are no longer static; they can vary spatially and temporally based on user operations, environmental conditions, device defects, and aging. Traditional cooling solutions, with their limited degrees of freedom, are increasingly inadequate for such dynamic scenarios. To address this, the development of intelligent and flexible cooling technologies is essential.

Temperature (hotspot) maps of different user operations on a 4-core AMD chip.

Thermoelectric cooling (TEC) is an active cooling technique that uses electricity to transfer heat. A TEC system typically includes a cooler to absorb heat, a heater to dissipate it, and thermoelectric materials that carry heat between the two. The process is nonlinear: it combines the thermoelectric (Peltier) effect with heat conduction and Joule heating.

Over the past few years, I have been developing an in-chip TEC based on holey silicon (10.1109/TED.2024.3358788). This TEC integrates seamlessly with electronic devices and acts as a built-in module in a silicon die. In our conceptual design, the cooler encircles the central hotspot, while the heater is positioned at the outer edge. Once activated, the TEC pumps heat laterally from the central hotspot to the heater, effectively cooling the device.

FEM simulation (comparing TEC device with passive cooling method)

Conceptual designs of (left) single TEC and (right) TEC array

While the TEC array offers significant flexibility, optimizing it is challenging. For a single TEC with nine intensity levels, testing every option in simulation is straightforward. However, the number of possible configurations grows exponentially with array size, which makes finite element method (FEM) simulations too slow for real-time applications. For instance, a 2×2 array has 9⁴ = 6,561 configurations, which can take days to simulate, and a 3×3 array has 9⁹ ≈ 3.9×10⁸ configurations, which can take years to complete.
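The scaling can be checked with a few lines of arithmetic. The sketch below assumes nine intensity levels per TEC and borrows the 45-second FEM run time quoted later in this post; applying that same run time to every array size is an assumption, and the function names are only illustrative.

```python
LEVELS = 9             # intensity levels per TEC, as stated above
SECONDS_PER_FEM = 45   # FEM run time quoted later in the post (assumed constant)


def configurations(rows, cols, levels=LEVELS):
    """Number of distinct intensity maps for a rows x cols TEC array."""
    return levels ** (rows * cols)


def exhaustive_days(rows, cols, seconds_per_run=SECONDS_PER_FEM):
    """Days needed to simulate every configuration one after another."""
    return configurations(rows, cols) * seconds_per_run / 86_400


for rows, cols in [(1, 1), (2, 2), (3, 3)]:
    count = configurations(rows, cols)
    days = exhaustive_days(rows, cols)
    print(f"{rows}x{cols}: {count:,} configurations, about {days:,.1f} days")
```

Under these assumptions the 2×2 case lands at a few days, while the 3×3 case runs to hundreds of years, which is why brute-force FEM search is not an option.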

Other researchers have tackled similar challenges with approaches such as uniform control (T. Bucher, IEEE Access, 2022) and linear control (J. Zhang, Integration, 2023). However, these solutions can be either too simplified or impractical. Given the many degrees of freedom and the strong nonlinearity of a TEC array, we believe machine learning-based control is a more promising solution to this problem.

To overcome these challenges, we developed a machine learning-based approach to TEC optimization. First, we automated 100,000 FEM simulations with random inputs to generate training data. Each sample pairs a hotspot map and a TEC intensity map (the inputs) with temperature and power matrices (the outputs). We then used this dataset to train a convolutional neural network (CNN) with over 120 million parameters.
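As a rough illustration of what one training sample looks like, here is a minimal sketch of the data-generation loop. Everything in it is an assumption made for illustration: the names `random_inputs`, `build_dataset`, and `fake_simulate`, the 0 to 8 intensity range, and the uniform sampling of hotspot power. The real pipeline calls an FEM solver where this sketch uses a zero-filled placeholder.

```python
import random

MAX_LEVEL = 8  # assumption: the nine intensity levels are 0 (off) through 8


def random_inputs(rows, cols, rng):
    """Draw one random hotspot map and one random TEC intensity map."""
    hotspot = [[rng.random() for _ in range(cols)] for _ in range(rows)]
    tec = [[rng.randint(0, MAX_LEVEL) for _ in range(cols)] for _ in range(rows)]
    return hotspot, tec


def build_dataset(n_samples, rows, cols, simulate, seed=0):
    """Pair random inputs with the outputs returned by `simulate`."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_samples):
        hotspot, tec = random_inputs(rows, cols, rng)
        temperature, power = simulate(hotspot, tec)
        dataset.append({
            "hotspot": hotspot,          # model input
            "tec": tec,                  # model input
            "temperature": temperature,  # model output
            "power": power,              # model output
        })
    return dataset


def fake_simulate(hotspot, tec):
    """Placeholder for an FEM run: returns zero-filled maps of the right shape."""
    zeros = [[0.0 for _ in row] for row in hotspot]
    return zeros, [row[:] for row in zeros]


samples = build_dataset(5, 3, 4, fake_simulate)
print(len(samples), "samples, keys:", sorted(samples[0]))
```

Passing the simulator in as a callable keeps the sampling logic separate from the solver, so the same loop could drive a batch of real FEM jobs.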

The model's architecture uses inception modules to capture spatial features at multiple scales and multi-task learning to predict several outputs at once. The result is a robust system that predicts temperature and power consumption for arbitrary m×n TEC arrays with high accuracy. For instance, our ML model achieves an average temperature error of only 1.6 K against the FEM ground truth, while cutting computation time from 45 seconds to just 42 milliseconds, roughly a 1,000-fold speedup.
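It is worth checking what this speedup buys. The sketch below uses only the two timings quoted above and the nine intensity levels mentioned earlier, and it assumes the per-prediction time does not depend on array size. It suggests that even a fast surrogate cannot enumerate every configuration once the array grows beyond a few cells.

```python
FEM_SECONDS = 45.0   # FEM time per prediction, quoted above
ML_SECONDS = 0.042   # ML time per prediction, quoted above
LEVELS = 9           # intensity levels per TEC, from earlier in the post


def speedup(slow=FEM_SECONDS, fast=ML_SECONDS):
    """How many times faster the ML prediction is than one FEM run."""
    return slow / fast


def exhaustive_seconds(rows, cols, seconds_per_prediction=ML_SECONDS):
    """Time to evaluate every configuration of a rows x cols array."""
    return LEVELS ** (rows * cols) * seconds_per_prediction


print(f"speedup: about {speedup():.0f}x")
for rows, cols in [(2, 2), (3, 3)]:
    secs = exhaustive_seconds(rows, cols)
    print(f"{rows}x{cols}: {secs:,.0f} s (about {secs / 86_400:.2f} days)")
```

With these numbers a 2×2 array can be searched exhaustively in minutes, but a 3×3 array would still take months, so the search itself has to be pruned.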

With the rapid predictions enabled by machine learning, we implemented a backtracking algorithm to further enhance optimization. This approach focuses only on configurations likely to reduce the peak temperature, significantly cutting down unnecessary iterations. For example, in a 5×5 array, level-1 optimization reduced the peak temperature from 257°C to 117°C in just one second. Higher levels of optimization further improved performance, albeit with slightly increased computation time.

Here is a simple example: a 5×5 array with a central hotspot, where the original temperature map has a peak temperature of 257°C. Level-1 optimization allows the TECs to use only one nonzero intensity level. The algorithm advises placing a TEC at the center with an intensity of 4, which lowers the peak temperature to 117°C. Level-2 optimization allows two different nonzero intensity levels; the algorithm then suggests an arrangement of intensities 3 and 6 that gives a peak temperature of 89°C. Finally, level-3 optimization suggests intensities 2, 3, and 6, and the peak temperature drops further to 83°C. Note that the higher the optimization level, the longer the algorithm takes to complete. The shortest run, level-1 optimization, takes only 1 second. Overall, the time required is much shorter than running FEM simulations or searching without the backtracking algorithm.
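To make the level-k idea concrete, here is a minimal backtracking sketch on a toy 3×3 grid. It is not the algorithm or the model used in this work. `predict_peak` is a made-up stand-in for the CNN, the constants `PUMP` and `JOULE` are arbitrary, and the pruning rule (drop any assignment that raises the predicted peak) is just one assumed reading of "configurations likely to reduce the peak temperature".

```python
from itertools import combinations

PUMP = 30.0   # hypothetical cooling per intensity level
JOULE = 1.0   # hypothetical heating of a cell and its neighbors


def predict_peak(hotspot, tec):
    """Toy stand-in for the ML model: returns the peak of a made-up temperature map."""
    rows, cols = len(hotspot), len(hotspot[0])
    peak = float("-inf")
    for r in range(rows):
        for c in range(cols):
            t = hotspot[r][c] - PUMP * tec[r][c] + JOULE * tec[r][c] ** 2
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    t += JOULE * tec[nr][nc] ** 2  # heat leaking in from a neighbor
            peak = max(peak, t)
    return peak


def search(hotspot, levels):
    """Depth-first search over cells, hottest first, pruning harmful assignments."""
    rows, cols = len(hotspot), len(hotspot[0])
    order = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: -hotspot[rc[0]][rc[1]])
    tec = [[0] * cols for _ in range(rows)]
    best = {"peak": predict_peak(hotspot, tec), "tec": [row[:] for row in tec]}

    def visit(i, current_peak):
        if i == len(order):
            if current_peak < best["peak"]:
                best["peak"] = current_peak
                best["tec"] = [row[:] for row in tec]
            return
        r, c = order[i]
        visit(i + 1, current_peak)        # option 1: leave this TEC off
        for level in levels:              # option 2: try each allowed level
            tec[r][c] = level
            peak = predict_peak(hotspot, tec)
            if peak <= current_peak:      # prune branches that raise the peak
                visit(i + 1, peak)
            tec[r][c] = 0                 # backtrack: undo the assignment

    visit(0, best["peak"])
    return best["peak"], best["tec"]


def optimize(hotspot, k, max_level=8):
    """Level-k search: try every set of k distinct nonzero intensity levels."""
    best_peak, best_tec = None, None
    for levels in combinations(range(1, max_level + 1), k):
        peak, tec = search(hotspot, levels)
        if best_peak is None or peak < best_peak:
            best_peak, best_tec = peak, tec
    return best_peak, best_tec


hotspot = [[40, 40, 40], [40, 200, 40], [40, 40, 40]]
peak, tec = optimize(hotspot, 1)
print("level-1 peak:", peak)
for row in tec:
    print(row)
```

On this toy grid the level-1 search settles on a single TEC at the center. The pruning rule is a heuristic: it can discard an assignment that would only pay off in combination with a later one, so the result is not guaranteed to be optimal. The toy numbers are not meant to match the temperatures reported in this post.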

We tested our approach on various scenarios, including random and clustered hotspot maps in a 9×9 TEC array. For a random hotspot map with a peak temperature of 362°C, level-1 optimization reduced the peak to 190°C, achieving a 52% cooling efficiency. Similarly, for clustered hotspots—a common issue in heterogeneous systems—the checkerboard TEC assignment pattern suggested by the algorithm reduced the peak temperature by 55%.

Across 1,800 random samples, the algorithm demonstrated consistent performance, with peak temperature reductions exceeding 50% and power consumption as low as 21.7 mW. These results highlight the practicality and efficiency of our machine learning and backtracking algorithm approach in addressing dynamic multi-hotspot cooling challenges.

As the complexity of electronic systems continues to grow, traditional cooling methods will struggle to keep pace. Our work demonstrates that combining advanced thermoelectric materials, machine learning, and optimization algorithms offers a promising solution to manage dynamic, multi-hotspot scenarios. This innovative approach not only enhances thermal management in current architectures but also lays the groundwork for future advancements in heterogeneous systems. By bridging the gap between cutting-edge technology and practical applications, we hope to inspire further research and development in this critical area.

Written by JJ on Nov 10, 2024

Jiajian Luo (He/Him)

Incoming NVIDIA Intern, PhD in Electronic Thermal Management, Semiconductor Packaging & Process Engineering
