Sunday, August 11, 2019

Accelerate Commercial HPC workloads with SAGA

A persistent problem with HPC workloads is that as the number of concurrent jobs increases, storage reaches a critical point where NFS latency spikes, and beyond that point every workload slows to a crawl. Integrating Dell EMC Isilon scale-out storage with Altair Accelerator enables Storage-Aware Grid Acceleration (SAGA), an elegant and innovative solution that can address the next wave of design challenges.

Consider a scenario in which you have 10,000 cores in your compute grid and each job runs for 30 minutes. If you submit 10,000 jobs to the job scheduler, the set should complete in 30 minutes with no jobs waiting in the queue. Over time, your test cases grow to 20,000 jobs; with 10,000 cores, that set finishes in an hour. The business requirement is that those 20,000 jobs complete in 30 minutes, so you add 10,000 more cores. However, the set does not finish even in 2 hours, because storage latency has spiked from 3 ms to 10 ms. Latency has an x^2 effect on run time, so doubling latency quadruples your average run time.
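The arithmetic in this scenario can be sketched in a few lines of Python. This is a back-of-the-envelope model rather than a measurement: the function name `batch_runtime_minutes`, the idea that jobs run in whole "waves" of one job per core, and the choice of 3 ms as the baseline latency are assumptions made for illustration. The quadratic slowdown is the rule of thumb stated above.

```python
import math


def batch_runtime_minutes(jobs, cores, latency_ms=3.0,
                          base_minutes=30.0, baseline_latency_ms=3.0):
    """Estimate wall-clock minutes for a batch of equal-length jobs.

    Assumptions (illustrative only): one job per core, jobs run in
    whole waves, and run time scales with the square of the latency
    ratio, as in the rule of thumb above.
    """
    waves = math.ceil(jobs / cores)
    slowdown = (latency_ms / baseline_latency_ms) ** 2
    return waves * base_minutes * slowdown


# 10,000 jobs on 10,000 cores at baseline latency: one wave.
print(batch_runtime_minutes(10_000, 10_000))   # 30.0

# 20,000 jobs on 10,000 cores: two waves.
print(batch_runtime_minutes(20_000, 10_000))   # 60.0

# 20,000 jobs on 20,000 cores, but latency has risen to 10 ms.
print(round(batch_runtime_minutes(20_000, 20_000, latency_ms=10.0)))
```

Under these assumptions the last case comes out at several hours rather than the hoped-for 30 minutes, which is the point of the scenario: the extra cores are wasted once storage latency climbs.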

Now consider another scenario with more I/O-intensive jobs, in which just 5,000 concurrent jobs push NFS latency to that critical point. By adding only 50 more jobs, you would spike the latency to 2x the normal value. That latency spike affects not just the additional 50 jobs but all 5,050 jobs on the compute grid. Beyond that critical point, there is no value in running I/O-intensive jobs on the grid.
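To see why those 50 extra jobs are so expensive, here is a rough calculation of the total job-minutes consumed on the grid before and after the spike. It reuses the quadratic latency rule of thumb from earlier in this post. The 30-minute baseline job length and the function name `grid_job_minutes` are assumptions carried over for illustration, not figures for any particular workload.

```python
def grid_job_minutes(concurrent_jobs, latency_multiplier, base_minutes=30.0):
    """Total job-minutes consumed by all concurrent jobs.

    Assumes every job slows down by the square of the latency
    multiplier (illustrative rule of thumb, not a measured model).
    """
    per_job_minutes = base_minutes * latency_multiplier ** 2
    return concurrent_jobs * per_job_minutes


at_critical_point = grid_job_minutes(5_000, 1.0)   # latency at its normal value
past_critical_point = grid_job_minutes(5_050, 2.0)  # 50 more jobs, 2x latency

print(at_critical_point)
print(past_critical_point)
print(past_critical_point / at_critical_point)
```

With these inputs, a 1% increase in job count roughly quadruples the compute time consumed across the whole grid, because every running job pays the latency penalty.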



In a scale-out Dell EMC Isilon network-attached storage (NAS) architecture, you can add more storage nodes and push the critical point to the right, so that you can run more concurrent jobs on the compute grid. Keep in mind that workloads are unpredictable, and their I/O profiles can change with little notice.

One of the key pieces of Electronic Design Automation (EDA) infrastructure, or any HPC infrastructure, is a job scheduler that dispatches workloads to the compute grid. Historically, the workload requirements passed to the job scheduler have been cores, memory, tools, licenses and CPU affinity. What if we add storage as a workload requirement: NFS latency, IOPS and disk usage? The job scheduler managing the compute grid is then aware of the underlying storage system and can schedule jobs according to each job's storage needs, accelerating grid throughput by distributing jobs appropriately. Storage becomes a resource just like cores, memory and tools, consumed by the workload according to its priority, fair share and limits.
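As a minimal sketch of the idea, the snippet below treats NFS latency as one more admission check next to free cores. It is not Altair Accelerator's API or configuration syntax: the `Job` and `GridState` classes, the `can_dispatch` and `dispatch` functions, and the 5 ms latency limit are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    io_intensive: bool  # hypothetical flag: does this job stress storage?


@dataclass
class GridState:
    free_cores: int
    nfs_latency_ms: float  # latest latency reading from the storage system


def can_dispatch(job, grid, latency_limit_ms=5.0):
    """Return True if the job fits both the compute and storage limits."""
    if job.cores > grid.free_cores:
        return False
    # Storage as a requirement: hold I/O-intensive jobs while latency is high.
    if job.io_intensive and grid.nfs_latency_ms >= latency_limit_ms:
        return False
    return True


def dispatch(queue, grid, latency_limit_ms=5.0):
    """Start every job that fits; leave the rest waiting, in queue order."""
    started, waiting = [], []
    for job in queue:
        if can_dispatch(job, grid, latency_limit_ms):
            grid.free_cores -= job.cores
            started.append(job)
        else:
            waiting.append(job)
    return started, waiting


grid = GridState(free_cores=4, nfs_latency_ms=8.0)
queue = [Job("simulation", 1, io_intensive=True),
         Job("lint", 1, io_intensive=False)]
started, waiting = dispatch(queue, grid)
print([j.name for j in started])  # ['lint']
print([j.name for j in waiting])  # ['simulation']
```

In this sketch the I/O-intensive job waits while latency is above the limit, but the CPU-bound job still runs, so cores are not left idle. A real scheduler would also weigh priority, fair share and limits, which are omitted here.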

This simple idea has huge implications for job throughput in the EDA world. As you know, job throughput affects design quality and reliability, which affects tape-outs and ultimately time to market. EDA workloads are massively parallel, and as you increase the number of parallel jobs you put more pressure on the underlying storage system, as expected. However, the impact on storage is far more drastic on legacy scale-up storage architectures than on Isilon, a scale-out storage system. More on the advantages of an Isilon scale-out NAS architecture can be found in this white paper.

Storage-Aware Grid Acceleration with Isilon and Altair Accelerator


With SAGA, I/O-intensive jobs are throttled and/or distributed when latency spikes beyond a configured value, so you are no longer running 20,000 concurrent jobs but just enough that your jobs finish in 30 to 45 minutes instead of four hours. In addition to 100% throughput gains, you also get substantial indirect cost savings because you are using 50% fewer licenses and cores. In this example the numbers are skewed to simplify the calculations, but the impact and benefits are similar in the real world.
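One way to picture throttling against a configured latency value is a simple feedback rule: cut the cap on concurrent I/O-intensive jobs when measured latency is over the limit, and raise it slowly when latency is healthy. The sketch below is only an illustration of that general pattern and is not how SAGA is implemented. The function name `next_io_job_limit`, the halve-then-step-up policy, and every numeric default are assumptions.

```python
def next_io_job_limit(current_limit, measured_latency_ms,
                      latency_limit_ms=5.0, step=100,
                      floor=100, ceiling=20_000):
    """Return the new cap on concurrent I/O-intensive jobs.

    Hypothetical policy: halve the cap when latency exceeds the
    configured limit, otherwise raise it by a fixed step. The result
    is clamped to the range [floor, ceiling].
    """
    if measured_latency_ms > latency_limit_ms:
        proposed = current_limit // 2
    else:
        proposed = current_limit + step
    return max(floor, min(ceiling, proposed))


# Replay a few made-up latency readings, starting from an unthrottled grid.
limit = 20_000
for reading_ms in [10.0, 8.0, 4.0, 3.0]:
    limit = next_io_job_limit(limit, reading_ms)
    print(f"latency {reading_ms} ms -> allow {limit} I/O-intensive jobs")
```

The cap drops quickly while latency is above the limit and then creeps back up, so the grid settles near the number of concurrent jobs the storage can sustain instead of running all 20,000 at once.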
