AI-Driven Dynamic TEC Cooling for SoC Systems

Nov 10, 2024

With the rapid evolution of heterogeneous architectures such as System-on-Chip (SoC), chiplets, CoWoS, and 3D ICs, managing on-chip hotspots has become a complex and pressing challenge. These hotspots are no longer static; they can vary spatially and temporally based on user operations, environmental conditions, device defects, and aging. Traditional cooling solutions, with their limited degrees of freedom, are increasingly inadequate for such dynamic scenarios. To address this, the development of intelligent and flexible cooling technologies is essential.

Temperature (hotspot) maps of different user operations on a 4-core AMD chip.

Thermoelectric cooling (TEC) is an active cooling technique that uses electricity to transfer heat. A TEC system typically includes a cooler that absorbs heat, a heater that dissipates it, and thermoelectric materials that carry heat between the two. The process is nonlinear, combining the thermoelectric (Peltier) effect with heat conduction and Joule heating.
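To see where the nonlinearity comes from, here is a minimal sketch of the common textbook lumped model of a thermoelectric cooler. It is not the FEM model used in this work, and the parameter values are placeholders chosen only for illustration. The heat absorbed at the cold side is the Peltier pumping term (linear in current) minus half of the Joule heat (quadratic in current) minus the heat conducted back from the hot side.

```python
def cooling_power(current, seebeck, resistance, conductance, t_cold, t_hot):
    """Net heat absorbed at the cold side of a lumped TEC, in watts."""
    peltier = seebeck * current * t_cold          # heat pumped by the Peltier effect
    joule = 0.5 * resistance * current ** 2       # half of the Joule heat returns to the cold side
    conduction = conductance * (t_hot - t_cold)   # heat leaking back from hot to cold
    return peltier - joule - conduction


def optimal_current(seebeck, resistance, t_cold):
    """Current that maximises cooling_power (from d(Q_c)/dI = 0)."""
    return seebeck * t_cold / resistance


# Placeholder parameters (assumptions, not values from this work).
params = dict(seebeck=2e-4, resistance=0.01, conductance=0.02,
              t_cold=350.0, t_hot=360.0)
i_opt = optimal_current(params["seebeck"], params["resistance"], params["t_cold"])

for current in (0.5 * i_opt, i_opt, 1.5 * i_opt):
    q_c = cooling_power(current, **params)
    print(f"I = {current:.1f} A -> Q_c = {q_c:.3f} W")
```

Because the Joule term grows with the square of the current, driving a TEC harder than its optimum cools less, not more. In this simplified picture, the best intensity level is not necessarily the highest one.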

Over the past few years, I have been developing an in-chip TEC based on holey silicon (DOI: 10.1109/TED.2024.3358788). This TEC integrates seamlessly with electronic devices and acts as a built-in module in a silicon die. In our conceptual design, the cooler encircles the central hotspot, while the heater is positioned at the outer edge. Once activated, the TEC pumps heat laterally from the central hotspot to the heater, effectively cooling the device.

Conceptual designs of (left) single TEC and (right) TEC array

While the TEC array offers significant flexibility, optimizing it is challenging. For a single TEC with nine intensity levels, testing every option in simulation is straightforward. However, the number of possible configurations grows exponentially with array size, making traditional finite element method (FEM) simulations too slow for real-time applications. For instance, a 2×2 array has 9⁴ = 6,561 configurations, which can take days to simulate, while a 3×3 array has 9⁹ ≈ 3.9×10⁸ configurations, which would take years.
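The growth can be checked with a few lines of arithmetic. This sketch assumes nine intensity levels per TEC and roughly 45 seconds per FEM run (the per-simulation time quoted later in this post); the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def configurations(rows, cols, levels=9):
    """Number of distinct intensity assignments for a rows x cols TEC array."""
    return levels ** (rows * cols)


def brute_force_seconds(rows, cols, seconds_per_run=45.0, levels=9):
    """Time to simulate every configuration one after another."""
    return configurations(rows, cols, levels) * seconds_per_run


for rows, cols in [(1, 1), (2, 2), (3, 3)]:
    total = brute_force_seconds(rows, cols)
    print(f"{rows}x{cols}: {configurations(rows, cols):,} configurations, "
          f"{total / SECONDS_PER_DAY:,.1f} days "
          f"({total / SECONDS_PER_YEAR:,.1f} years)")
```

Under these assumptions the 2×2 case lands in the range of days and the 3×3 case in the range of centuries, which is why exhaustive FEM search is not an option for larger arrays.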

Many researchers have attempted to address this assignment challenge, for example with uniform control (T. Bucher, IEEE Access, 2022) and linear control (J. Zhang, Integration, 2023), but these solutions can be either too simplified or impractical. Given the many degrees of freedom and the strong nonlinearity of a TEC array, we developed a machine learning-based control scheme, which we consider a promising solution to this problem.

Optimization challenge due to dynamic input and multiple degrees of freedom in control

To overcome these challenges, we developed a machine learning-based approach for TEC optimization. The first step is to extract the essential input parameters: the local hotspot conditions and the corresponding TEC intensities. For every input parameter set, the output consists of the temperature map and the resulting TEC power consumption. The goal is to train a model that captures this input-output relationship so that a control algorithm can quickly iterate over candidate decisions.

Extracting essential parameters for machine learning training process

With this goal in mind, we built a research pipeline that includes large-scale FEM automation and data postprocessing. First, we run 100,000 FEM simulations with randomly generated inputs and collect their inputs and outputs as data. We then split the data set into training, development, and test sets. Data augmentation increases the number of training samples from 70,000 to 560,000. Finally, we train a convolutional neural network (CNN) with over 120 million parameters, which is then used by the control algorithm.
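The post does not say which augmentation was used, but the factor of eight (70,000 to 560,000) matches the eight symmetries of a rectangular grid: four rotations, each with and without a mirror flip. The sketch below assumes that scheme. The array layout (channels, height, width) and the function name are illustrative, and the key point is that the input maps and the target temperature map must be transformed together.

```python
import numpy as np


def dihedral_augment(inputs, target):
    """Return the 8 rotated/mirrored copies of one training sample.

    inputs: array of shape (channels, H, W), e.g. hotspot map and TEC intensities.
    target: array of shape (H, W), the temperature map from FEM.
    Scalar outputs such as total power consumption are unchanged by these
    transforms and can simply be copied.
    """
    samples = []
    for mirrored in (False, True):
        x = inputs[..., ::-1] if mirrored else inputs
        y = target[..., ::-1] if mirrored else target
        for quarter_turns in range(4):
            samples.append((
                np.rot90(x, quarter_turns, axes=(-2, -1)),
                np.rot90(y, quarter_turns, axes=(-2, -1)),
            ))
    return samples


# Toy sample: a 2-channel input and a matching 3x3 target.
inputs = np.stack([np.arange(9.0).reshape(3, 3), np.ones((3, 3))])
target = np.arange(9.0).reshape(3, 3)
augmented = dihedral_augment(inputs, target)
print(len(augmented), "samples from one original")
```

If this is the scheme, augmentation should be applied after the split into training, development, and test sets, so that rotated copies of a training sample never end up in the evaluation data.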

The model’s architecture incorporates inception modules to capture spatial hierarchy and multi-task learning to produce multiple outputs. The result is a robust system capable of predicting temperature and power consumption for arbitrary m×n TEC arrays with high accuracy. For instance, our ML model achieves an average temperature error of only 1.6 K compared to the FEM ground truth, while reducing computation time from 45 seconds to just 42 milliseconds, a speed-up of roughly 1,000×.

Developing the ML architecture with inception modules and multi-task learning

An example of a 6×6 array prediction from the trained ML model

With the rapid predictions enabled by machine learning, we implemented a backtracking algorithm to further enhance optimization. This approach focuses only on configurations likely to reduce the peak temperature, significantly cutting down unnecessary iterations. For example, in a 5×5 array, level-1 optimization reduced the peak temperature from 257°C to 117°C in just one second. Higher levels of optimization further improved performance, albeit with slightly increased computation time.

Here is a simple example of a 5×5 array with a central hotspot, where the original temperature map has a peak temperature of 257°C. Level-1 optimization allows the TECs to use only one intensity level other than zero. The algorithm advises placing a single TEC at the center with an intensity of 4, which lowers the peak temperature to 117°C. Level-2 optimization allows two different nonzero intensity levels; the algorithm then suggests levels 3 and 6, arranged as shown, for a peak temperature of 89°C. Finally, level-3 optimization suggests intensities 2, 3, and 6, and the peak temperature drops further to 83°C. The higher the optimization level, the longer the algorithm takes to complete; level 1 is the fastest at only 1 second. At every level, however, the search is much faster than FEM simulation or a search without backtracking.
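The level-k search can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work. `toy_surrogate` is a hypothetical stand-in for the trained CNN with invented coefficients, so its temperatures will not match the numbers above. The pruning rule (only follow a move if it lowers the predicted peak) is one plausible reading of "configurations likely to reduce the peak temperature".

```python
from itertools import combinations

import numpy as np


def toy_surrogate(heat, tec):
    """Hypothetical stand-in for the trained CNN; returns a temperature map in °C."""
    # Net local cooling: linear pumping minus quadratic Joule heating (invented numbers).
    cooling = 30.0 * tec - 3.0 * tec ** 2
    # Heat rejected by each TEC slightly warms its four neighbours.
    padded = np.pad(tec, 1)
    neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                  + padded[1:-1, :-2] + padded[1:-1, 2:])
    return 25.0 + heat - cooling + 2.0 * neighbours


def _backtrack(heat, model, tec, peak, allowed, depth, best):
    """Try each allowed level on the current hottest cell; undo moves that do not help."""
    if depth == 0:
        return
    temp = model(heat, tec)
    r, c = np.unravel_index(np.argmax(temp), temp.shape)
    previous = tec[r, c]
    for level in allowed:
        if level == previous:
            continue
        tec[r, c] = level
        new_peak = model(heat, tec).max()
        if new_peak < peak:  # prune: only follow moves that lower the peak
            if new_peak < best["peak"]:
                best["peak"], best["tec"] = new_peak, tec.copy()
            _backtrack(heat, model, tec, new_peak, allowed, depth - 1, best)
        tec[r, c] = previous  # backtrack


def level_k_search(heat, model, k, n_levels=9, max_depth=6):
    """Search assignments that use at most k distinct nonzero intensity levels."""
    off = np.zeros(heat.shape, dtype=int)
    baseline = model(heat, off).max()
    best = {"peak": baseline, "tec": off.copy()}
    for allowed in combinations(range(1, n_levels), k):
        _backtrack(heat, model, off.copy(), baseline, allowed, max_depth, best)
    return best["tec"], best["peak"]


heat = np.zeros((5, 5))
heat[2, 2] = 200.0  # single central hotspot
for k in (1, 2):
    tec, peak = level_k_search(heat, toy_surrogate, k)
    print(f"level-{k}: peak {peak:.1f} °C, nonzero levels {sorted(set(tec[tec > 0]))}")
```

Each step costs two surrogate evaluations per candidate level, so a search like this only becomes practical once a single prediction takes milliseconds instead of the tens of seconds an FEM run needs.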

We tested our approach on various scenarios, including random and clustered hotspot maps in a 9×9 TEC array. For a random hotspot map with a peak temperature of 362°C, level-1 optimization reduced the peak to 190°C, achieving a 52% cooling efficiency. Similarly, for clustered hotspots—a common issue in heterogeneous systems—the checkerboard TEC assignment pattern suggested by the algorithm reduced the peak temperature by 55%.

Across 1,800 random samples, the algorithm demonstrated consistent performance, with peak temperature reductions exceeding 50% and power consumption as low as 21.7 mW. These results highlight the practicality and efficiency of our machine learning and backtracking algorithm approach in addressing dynamic multi-hotspot cooling challenges.

As the complexity of electronic systems continues to grow, traditional cooling methods will struggle to keep pace. Our work demonstrates that combining advanced thermoelectric materials, machine learning, and optimization algorithms offers a promising solution to manage dynamic, multi-hotspot scenarios. This innovative approach not only enhances thermal management in current architectures but also lays the groundwork for future advancements in heterogeneous systems. By bridging the gap between cutting-edge technology and practical applications, we hope to inspire further research and development in this critical area.

Written by JJ on Nov 10, 2024

Jiajian Luo (He/Him)

PhD in Mechanical Engineering | ex-NVIDIA intern | Electronic Packaging • Data Center Cooling • Machine Learning for Thermal Management
