The Case for On-Prem AI Data Centers

AI has become, and will continue to be, a dominant technology for enterprises worldwide. Its ability to change business practices and improve decision-making across a wide range of industries has created unprecedented demand for servers that can run the training or inference phases of an AI workflow. The infrastructure needed for training can be costly, and a high-end system (multiple CPUs and GPUs) may not always be the best choice. By running AI training within their own data centers, organizations can reduce costs while becoming more productive and flexible.

Cloud Benefits and Drawbacks

Many organizations are moving their workloads to a public cloud infrastructure, which, by definition, is shared by many clients. While a public cloud can scale to enormous sizes, very few training jobs require thousands of GPUs working concurrently. One benefit of a public, shared cloud is that a large number of high-end (read: expensive) servers may be available; conversely, those same servers may not be available when needed. In addition, the data ingress and egress costs for large training runs can be significant, especially if the training data must be imported from another public cloud provider.
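The egress-cost point can be made concrete with back-of-the-envelope arithmetic. The per-GB rate and dataset size below are illustrative assumptions for the sketch, not quoted prices; actual cloud pricing varies by provider, region, and tier.

```python
# Back-of-the-envelope data-transfer cost for a large training dataset.
# The $/GB egress rate is an assumed, illustrative figure, not a real quote.

def egress_cost_usd(dataset_tb, rate_per_gb=0.09):
    """Cost to move a dataset of dataset_tb terabytes out of a cloud region."""
    return dataset_tb * 1024 * rate_per_gb

# Moving an assumed 500 TB training corpus between providers at ~$0.09/GB:
print(f"${egress_cost_usd(500):,.0f}")  # prints "$46,080"
```

At these assumed rates, a single cross-provider transfer of the corpus costs tens of thousands of dollars, and the fee recurs every time the data moves.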

On-Prem for AI Training

There are several reasons to consider implementing AI training within an on-prem data center.

  • Cost – While the upfront cost of acquiring servers with GPUs may be high, the long-term cost can be lower than using a public, shared cloud. Cloud fees add up over time, especially for data movement. Note, however, that the acquisition cost of a high-end GPU server is fixed whether or not its CPUs and GPUs are utilized 100% of the available time, which is unlikely.
  • Performance – A range of CPU and GPU combinations is available, in terms of both quantity and performance. With a clear understanding of the enterprise's AI requirements, choosing the number (1, 2, 4, or 8) and performance of the CPUs is essential. The latest generation of CPUs offers 16 to 128 cores, with base clock rates approaching 4 GHz. GPUs range from older generations to the latest releases with thousands of cores. Multiple, optimally sized configurations can be deployed in a data center, depending on each project's CPU and GPU requirements.
  • Retraining – While there are various methods to estimate the cost of training a model of a given size with a given number of GPUs, many models must be continuously retrained with new parameters. To maintain inference accuracy, a model must be retrained with updated, more recent data, which can take as long as the original training depending on how much new data is used. In an on-prem data center, the same systems can be reused for each run, whereas in the public cloud, expenses pile up with every retraining iteration.
  • Software – There are many software choices to consider when creating an efficient and effective AI training solution. A public, shared cloud provider may not offer all the needed components, which can require additional setup and testing for each instance acquired in the public cloud.
  • Data Location and Sovereignty – For many industries and geographies, there may be restrictions and requirements for where the data used for AI training must reside. An on-prem data center allows organizations to adhere to these regulations, where using a remote, public cloud data center may not be permitted.
  • Security – For many organizations, the security of both data and results is critical. In an on-prem data center, security teams can implement more stringent security policies regarding access to the systems or storage devices. When creating and using AI that needs access to internal processes and data, implementing AI in an on-prem data center is an obvious choice.
  • Compliance – When the data is subject to various regulations, creating a conformant on-prem data center may be ideal, compared to identifying a public cloud that adheres to these regulations.
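The cost and retraining points above can be sketched as a simple break-even calculation: once the hours of GPU use exceed a threshold, a purchased server becomes cheaper than renting equivalent cloud capacity. All prices below (server cost, cloud hourly rate, power and operations overhead) are hypothetical assumptions for illustration, not vendor or cloud pricing.

```python
# Illustrative break-even: buying a GPU server vs. renting cloud instances.
# Every dollar figure here is an assumed placeholder, not a real quote.

def cloud_cost(hours, rate_per_hour):
    """Total cloud spend for a given number of instance-hours."""
    return hours * rate_per_hour

def onprem_cost(purchase_price, hours, power_kw=6.0, kwh_price=0.12, ops_per_hour=1.0):
    """On-prem spend: fixed purchase plus per-hour power and operations."""
    return purchase_price + hours * (power_kw * kwh_price + ops_per_hour)

def breakeven_hours(purchase_price, cloud_rate, power_kw=6.0, kwh_price=0.12, ops_per_hour=1.0):
    """Hours of use after which the on-prem server becomes cheaper."""
    hourly_saving = cloud_rate - (power_kw * kwh_price + ops_per_hour)
    return purchase_price / hourly_saving

# Assumed: a $250k 8-GPU server vs. a $40/hr cloud instance of similar capability.
print(round(breakeven_hours(250_000, 40.0)))  # hours of use until on-prem wins
```

Under these assumptions the crossover lands well within a server's useful life for a heavily used system, and every retraining run after that point widens the gap, which is the economic core of the retraining argument above.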
[Image: Trio of Supermicro AI GPU systems: 8U, 4U, and 5U systems]

Summary

Implementing an effective and efficient on-prem AI data center requires understanding the performance requirements of the workloads that best suit the enterprise. When properly designed, an on-prem data center can shorten the time to results for AI training and deliver low-latency inference tuned to the type of model, and it can be configured at low cost to match the needs of the enterprise. Understanding the workloads, the amount of data, the fine-tuning of the AI workflow, and in-house expertise with various software layers will help determine the best option for the organization.