When an organization moves Artificial Intelligence (AI) and Deep Learning (DL) projects from the test and design phase to production, the responsibility for maintaining the project often lands in IT’s lap. IT then has to put together an architecture that supports it. AI and DL architectures often look like science projects, as IT is forced to cobble together commodity servers, GPU servers, storage, and open source file systems. The result is an architecture that is hard to design, implement, upgrade, and operate. Organizations where AI and DL are a part of the business, rather than the entire business, need a better option.
Storage performance is critical to an AI or DL project: the faster storage can respond to application requests, the more intelligent those applications appear. Most AI environments run the application on a scale-out computing architecture. The computing tier interfaces with GPUs, and both tiers interact with the storage tier.
AI and DL typically deal with unstructured data, often millions of tiny files, and sequential read access to those files is critical. The file system that stores all this unstructured data is typically a parallel file system, so that multiple compute and GPU nodes can directly access the storage nodes that hold the data the AI/DL process needs. The physical storage media is often flash based, and because of that direct access capability, these systems are quickly adopting NVMe flash.
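To see why per-file latency matters so much, consider a minimal sketch of the access pattern a training job repeats every epoch: a sequential read of each file in a large directory of small files. The snippet below is illustrative Python, not DDN-specific, and the dataset path is hypothetical; multiply the measured per-file latency by millions of files, and storage response time becomes the gate on GPU utilization.

```python
import os
import time

# Illustrative sketch (hypothetical path, not DDN-specific): time the
# pattern a DL input pipeline repeats every epoch -- one full sequential
# read of each file in a large directory of small files.
DATASET_DIR = "/mnt/training_data"

names = os.listdir(DATASET_DIR)
start = time.time()
total_bytes = 0
for name in names:
    with open(os.path.join(DATASET_DIR, name), "rb") as f:
        total_bytes += len(f.read())    # one sequential read per file
elapsed = time.time() - start

print(f"Read {len(names)} files ({total_bytes / 1e6:.1f} MB) in {elapsed:.2f}s")
if names:
    print(f"Average per-file latency: {elapsed / len(names) * 1000:.2f} ms")
```

A parallel file system attacks exactly this bottleneck by letting many compute and GPU nodes issue those small reads concurrently against the storage nodes, rather than funneling them through a single server.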
IT faces a dilemma when putting together a storage architecture to meet these demands. The requirements are unlike anything IT has dealt with in the past. One of the primary challenges is that all of these components arrive as parts: IT has to assemble them, make sure they work together, and diagnose any problems that arise. The parallel file system is of particular concern, since many of these solutions are open source projects that lack a formal support process.
The result is that once an AI project makes it through the test and development portion of its lifecycle, there is a long delay before it reaches the next stage: production. It can take an organization months to stand up a complete AI architecture to run the AI/DL application. Even once the environment is running, IT may find itself dedicating unbudgeted time to supporting and maintaining it.
Turnkey AI Storage Architectures
Recently, DataDirect Networks (DDN) and NVIDIA announced a turnkey storage architecture designed for AI and DL workloads. DDN A³I with DGX-1 is a scalable, end-to-end solution whose systems come pre-configured to ease deployment and reduce ongoing management time. The flexible architecture seamlessly scales GPU capacity, storage performance, and storage capacity to keep pace with evolving AI/DL workloads.
The NVIDIA DGX-1 is a line of NVIDIA-built servers that specialize in GPU acceleration of AI and DL applications. Each server features eight Pascal- or Volta-based Tesla GPUs with HBM2 memory, interconnected in an NVLink hybrid cube-mesh topology.
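As a rough illustration of how an application consumes that GPU density, the sketch below enumerates the available GPUs and wraps a model for data-parallel execution across all of them. PyTorch is an assumption here (any of the common frameworks the DGX-1 supports would work similarly), and the model is a stand-in.

```python
import torch
import torch.nn as nn

# Minimal sketch, assuming PyTorch on a multi-GPU server such as the DGX-1.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
batch = torch.randn(256, 1024)          # dummy input batch

gpu_count = torch.cuda.device_count()   # eight on a DGX-1
print(f"Visible GPUs: {gpu_count}")

if gpu_count > 1:
    # DataParallel splits each batch across the GPUs; NVLink carries the
    # inter-GPU traffic that replication and gradient reduction generate.
    model = nn.DataParallel(model)
if gpu_count > 0:
    model, batch = model.cuda(), batch.cuda()

output = model(batch)                   # scatter, compute, gather
print(output.shape)                     # torch.Size([256, 10])
```

The point is simply that the framework, not the application author, spreads the work across the GPUs; the storage tier still has to keep all eight of them fed.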
DDN’s AI200 and AI7990 make up the storage component of the solution. The AI200 is an all-NVMe-flash parallel file system storage appliance optimized for the most intensive AI/DL workloads. It is available in 30TB, 60TB, and 120TB capacities in two rack units (2U).
The AI7990 is a hybrid parallel file storage appliance that leverages hard disk drives for capacity and flash for performance. The AI7990 can scale to 1PB of capacity in a single four rack unit (4U) node.
An organization can start with a single AI appliance and scale out as needed. IT can mix AI200 and AI7990 nodes into the same storage cluster to meet the demands of almost any AI/DL workload.
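As a back-of-the-envelope illustration of that mix-and-match scaling (the node counts below are hypothetical; the per-appliance figures are the published maximums):

```python
# Hypothetical cluster sizing from the published per-appliance maximums:
# AI200 up to 120TB in 2U (NVMe flash), AI7990 up to 1PB (1,000TB) in 4U.
AI200_TB, AI200_U = 120, 2
AI7990_TB, AI7990_U = 1000, 4

ai200_nodes = 2     # hypothetical: flash tier for hot training data
ai7990_nodes = 3    # hypothetical: hybrid tier for bulk capacity

total_tb = ai200_nodes * AI200_TB + ai7990_nodes * AI7990_TB
total_u = ai200_nodes * AI200_U + ai7990_nodes * AI7990_U
print(f"Cluster capacity: {total_tb}TB in {total_u} rack units")
# -> Cluster capacity: 3240TB in 16 rack units
```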
StorageSwiss Take
While these systems are all available separately, DDN’s integrated turnkey approach enables organizations new to AI and DL to shorten their time to value. Even experienced AI/DL organizations may find that the time and operational savings the bundle delivers outweigh the effort of piecing a system together. Overall, it should broaden and accelerate AI and DL adoption.