Student Work

Improving Computing Efficiency and Reducing Carbon Footprint for Turing Cluster

Publicly Deposited

With the growing popularity of large language models (LLMs) and their potential to transform the way we work, high performance computing (HPC) is more important than ever for training accurate machine learning models and enabling scientific breakthroughs across many fields. However, given rising concerns about the carbon footprint and ecological impact of HPC, educating users on proper cluster usage is vital for improving energy efficiency, and administrators need a tool to understand utilization status and identify problems so that they can better manage the cluster and guide its users. To achieve this, it can be helpful to increase the observability of computing resource utilization, which has been limited because workload managers such as SLURM isolate compute nodes from login nodes and make it harder for users to interact with the compute nodes. This increased observability in turn reveals problems and can raise awareness of underutilization. In this project, we implement an automated tool that collects statistics and GPU driver readings for processes running under job allocations and generates actionable reports with user-friendly explanations of potential problems in job submissions and their possible solutions. The tool can either be started by regular users alongside their job allocations, without administrator involvement, or be deployed by administrators as daemons that sample the cluster at a tunable frequency; both the data collection and problem identification components are highly extensible. Running on the Turing HPC cluster at WPI for roughly half a year, the tool has helped users better understand the running status of their jobs and has aided administrators in identifying job problems across multiple dimensions, such as user, resource, and time period.
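The collect-then-report idea described in the abstract can be illustrated with a minimal sketch. This is not the project's tool: the names `GpuSample`, `parse_samples`, `find_underutilized`, and `monitor`, and the 10% utilization threshold, are hypothetical and chosen only for illustration. The sketch assumes an NVIDIA driver is installed on the compute node so that `nvidia-smi` and its `--query-gpu` CSV output are available.

```python
import csv
import io
import subprocess
import time
from dataclasses import dataclass

# Standard nvidia-smi --query-gpu properties requested for each GPU.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical record type for this sketch)."""
    index: int
    util_percent: float
    mem_used_mib: float
    mem_total_mib: float


def read_nvidia_smi() -> str:
    """Return raw CSV text from nvidia-smi; requires an NVIDIA driver on the node."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
           "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_samples(csv_text: str) -> list[GpuSample]:
    """Parse CSV rows into samples, skipping rows that are malformed or non-numeric."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue
        try:
            idx, util, used, total = (field.strip() for field in row)
            samples.append(GpuSample(int(idx), float(util), float(used), float(total)))
        except ValueError:
            continue  # e.g. a field reported as "[N/A]"
    return samples


def find_underutilized(samples: list[GpuSample], util_threshold: float = 10.0) -> list[str]:
    """Return one plain-language message per GPU whose utilization is below the threshold."""
    return [
        f"GPU {s.index}: utilization {s.util_percent:.0f}% is below "
        f"{util_threshold:.0f}%; the job may have requested more GPUs than it uses."
        for s in samples
        if s.util_percent < util_threshold
    ]


def monitor(interval_seconds: float = 60.0, rounds: int = 3) -> None:
    """Sample at a tunable interval and print any findings (assumed reporting style)."""
    for _ in range(rounds):
        for message in find_underutilized(parse_samples(read_nvidia_smi())):
            print(message)
        time.sleep(interval_seconds)
```

Keeping parsing and problem detection separate from the `nvidia-smi` call means the checks can be exercised on recorded output without a GPU, which loosely mirrors the abstract's split between data collection and problem identification.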

  • This report represents the work of one or more WPI undergraduate students submitted to the faculty as evidence of completion of a degree requirement. WPI routinely publishes these reports on its website without editorial or peer review.
Creator
Subject
Publisher
Identifier
  • E-project-031924-094652
  • 118928
Keyword
Advisor
Year
  • 2024
UN Sustainable Development Goals
Date created
  • 2024-03-19
Resource type
Source
  • E-project-031924-094652
Rights statement

Permanent link to this page: https://digital.wpi.edu/show/z890rz418