HKUST SuperPOD

HKUST SuperPOD is a state-of-the-art AI supercomputing facility. As a University Central Research Facility (CRF), it is now available to all HKUST researchers to enhance their AI-related research capabilities, and it serves as a platform to foster an "AI for Science" environment at HKUST.


    Highlights

    HKUST SuperPOD is a next-generation scalable infrastructure for AI leadership. It provides the levels of computing performance required to solve advanced computational challenges in AI, high-performance computing (HPC), and hybrid applications that combine the two to improve prediction quality and time-to-solution. It is a turnkey AI data center solution that delivers world-class computing, software tools, expertise, and continuous innovation seamlessly. The compute foundation of HKUST SuperPOD is built on NVIDIA DGX SuperPOD H800 systems, which provide unprecedented compute density, performance, and flexibility. The system is expected to complete massive AI tasks, such as GPT-3 training, within an hour.

    DGX SuperPOD is powered by several key NVIDIA technologies, including NVIDIA NDR (400 Gbps) InfiniBand and NVIDIA NVLink, which connect GPUs at the NVLink layer to provide unprecedented performance for the most demanding communication patterns. The DGX SuperPOD architecture is managed by NVIDIA solutions including NVIDIA Base Command, NVIDIA AI Enterprise, CUDA, and NVIDIA Magnum IO.

     

    Photos and Diagrams

    Click here to view enlarged photos and diagrams

     

    Hardware Specification

    For detailed hardware specification of the HKUST SuperPOD, please refer to this webpage.

     

    Applicable Research Areas

    The HKUST SuperPOD is specifically engineered to optimize performance for cutting-edge model training, with the capability to scale up to exaflops of computing power, and it is designed to deliver leading storage performance as well. The HKUST SuperPOD serves a variety of use cases including AI research, LLM training, Transformer model development, and more. For comprehensive information on AI frameworks tailored to different research requirements, please refer to the AI Enterprise website.

    Software List

    Modules

    Lmod is used to manage installations for most application software. With the modules system, users can set up their shell environment to access applications, making it easier to run and compile software. It also allows multiple versions of the same software to co-exist on the system, abstracting away version details and OS dependencies.

    Click here for details on the module system.
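    As a sketch, a typical Lmod session looks like the following (the module name and version are placeholders; run `module avail` to see what is actually installed on the cluster):

```shell
# List the software made available through the modules system
module avail

# Load a module into the current shell environment
# (cuda/12.1 is a hypothetical name/version on this cluster)
module load cuda/12.1

# Show what is currently loaded, then unload everything
module list
module purge
```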

    NVIDIA AI Enterprise

    NVIDIA AI Enterprise is an end-to-end, cloud-native software platform that accelerates data science pipelines and streamlines the development and deployment of production-grade, GPU-optimized AI applications, including generative AI. Enterprises that run their businesses on AI rely on the security, support, and stability provided by NVIDIA AI Enterprise to ensure a smooth transition from pilot to production.

    Click here for details on NVIDIA AI Enterprise

     

    Use of Apptainer (Singularity)

    An Apptainer (formerly known as Singularity) container lets users run applications in a Linux environment of their choice. It encapsulates the operating system and the application stack into a single image file. One can modify, copy, and transfer this file to any system that has Apptainer installed and run it as a user application, with native system resources such as the InfiniBand network, GPUs/accelerators, and the resource manager integrated with the container. Apptainer thus enables BYOE (Bring-Your-Own-Environment) computing on a multi-tenant, shared HPC cluster.

    Click here to view details of Apptainer (Singularity)
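    As a minimal sketch of the BYOE workflow described above (the image URI is an illustrative public image, not a recommended environment for this cluster):

```shell
# Pull a container image from a registry into a single .sif file
apptainer pull ubuntu.sif docker://ubuntu:22.04

# Run a command inside the container; the --nv flag exposes the
# host's NVIDIA GPUs and driver libraries to the container
apptainer exec --nv ubuntu.sif nvidia-smi
```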

    Workflow Examples

    Some workflow examples for getting started with running jobs on the HKUST SuperPOD can be found here.

     

    Account Application

    Policy for Account Creation

    The HKUST SuperPOD system is established as a platform to support GPU computational needs for activities fostering “AI for Science”. During the initial pilot period until Jun 2024, it is open to the following users:

     

    Getting Started

    How to login to the cluster

    Click here to view the instructions on how to get access to the HKUST SuperPOD cluster

    Use of SLURM Job Scheduling System

    The Simple Linux Utility for Resource Management (SLURM) is the resource management and job scheduling system of the cluster. All jobs in the cluster must be run through SLURM.

    Click here to learn how to submit your first SLURM job

    Click here to view details of using SLURM
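    A minimal batch script might look like the following sketch (the partition name, GPU request, and time limit are placeholders; consult the cluster documentation for the actual values):

```shell
#!/bin/bash
#SBATCH --job-name=demo         # name shown in the queue
#SBATCH --partition=normal      # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1       # request one GPU on the node
#SBATCH --time=00:10:00         # wall-clock time limit

# Launch the workload through SLURM's task launcher
srun nvidia-smi
```

    Submit the script with `sbatch job.sh`, monitor it with `squeue --me`, and cancel it with `scancel <jobid>`.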

    Partition and Resource Quota

    Click here to view more information on partition and resource quota.

    File and Storage

    Click here to view more information on file and storage.

    NVIDIA GPU Cloud

    Working with the NGC catalog
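    As an example of combining the NGC catalog with Apptainer (the container tag is illustrative; pick a current tag from the NGC catalog):

```shell
# Pull a GPU-optimized PyTorch image from the NGC registry
apptainer pull pytorch.sif docker://nvcr.io/nvidia/pytorch:24.01-py3

# Run a quick GPU check inside the container
apptainer exec --nv pytorch.sif \
    python -c "import torch; print(torch.cuda.is_available())"
```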

    Usage Tips

    Job Priority and Accounting

    Details available soon.

    Best Practice
     
    FAQ

    Click here to view the FAQ for HKUST SuperPOD


    Learn More
    Cluster Usage Status

    Please refer to this webpage for the usage status of the HKUST SuperPOD (available soon).