# DelftBlue Policies
This page explains the policies implemented on DelftBlue.
## Disk quota and scratch space
Each user may use up to 30 GB of disk space in their `/home` directory, and at most 5 TB on `/scratch`. Data in the home directory is backed up; data in scratch should be cleaned up regularly. If the scratch disk becomes too full, ICT may announce an automatic clean-up a few days in advance, so that you have time to save important data to your project drive.
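To see how close you are to these limits, standard tools suffice. A minimal check, assuming DelftBlue's default paths (the cluster may also provide a dedicated quota command):

```shell
# Current size of your home directory (30 GB quota).
du -sh "$HOME"

# Free space on the shared scratch filesystem (5 TB per-user quota);
# the path is DelftBlue-specific, so this may fail on other machines.
df -h /scratch 2>/dev/null || echo "/scratch not mounted on this machine"
```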
Rationale: data-intensive workflows benefit greatly from fast scratch storage. All nodes have access to `/scratch` (and `/home`), so if you explicitly transfer data to DelftBlue (using the file transfer nodes), you can speed up the overall execution of your workflow significantly (by at least a factor of 10) compared to having every process access the central storage. Adjusting your workflow takes a one-time effort, but afterwards you will reap the benefits. Furthermore, running the data transfer to DelftBlue as a separate step allows you to run multiple applications on the same data (one after another, or in parallel).
## Jobs are limited to 120 hours (compute) / 48 hours (GPU)
The Slurm queue currently allows jobs a maximum run time of 120 hours on the compute partition, and 48 hours on the GPU partition. In the future, we aim to reduce this limit to 24 hours for all jobs. Rationale:
- We want to encourage users to optimize and parallelize their applications: that is the intended use of a supercomputer, after all.
- Regularly checkpointing an application that runs for more than a few hours is in your best interest anyway, because there is always a risk of hardware or software failure. Once you have checkpointing implemented or enabled, restarting a job that hit the time limit before finishing is no longer a problem.
- Long-running jobs prevent shorter but more parallel jobs from being scheduled, and as mentioned above, those are the kind of applications a supercomputer is designed and optimized for.
- Fair-share scheduling works much more smoothly with jobs of similar length, hence our intention to move to 24 hours. This is the only feasible way to reduce the average waiting time for everyone, regardless of whether they are frequent heavy users or newcomers, and thus to ensure fair access for all.
## Limited number of GPU nodes
DelftBlue serves the TU Delft community as a whole, which means it has to accommodate a large variety of workloads. For many of these workloads, GPUs are either not suitable or too difficult to program. Furthermore, GPUs are very expensive compared to high-end multi-core CPUs. That said, DHPC aims to provide significantly more GPUs in the future to support the growing demand from machine learning and other data analytics applications, while also making sure that a sizeable and growing CPU partition remains available for classical simulation workloads.