Declare Resource Needs
The goal of this exercise is to demonstrate how to test and tune the request_X statements in a submit file when you don't know what resources your job needs.
There are three special resource request statements that you can (optionally) use in an HTCondor submit file:

- request_cpus: the number of CPUs your job will use (most software takes an argument to control this number, and it is otherwise usually "1")
- request_memory: the maximum amount of run-time memory your job may use
- request_disk: the maximum amount of disk space your job may use (including the executable and all other data that may show up during the job)
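For illustration, all three statements might appear in a submit file like this. This is a hypothetical fragment; the executable name and the values are placeholders, not recommendations:

```
# hypothetical submit file fragment -- values are placeholders
executable     = my_program
request_cpus   = 1
request_memory = 1GB
request_disk   = 500MB
queue
```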
HTCondor defaults to reasonable values for these request settings, so you do not need them to get small jobs to run. However, on some HTCondor pools, if your job exceeds its request values, it may be removed from the execute machine and either held (awaiting action on your part) or rerun later. So it can hurt you to omit your resource needs or to underestimate them. If you overestimate them, your jobs will match fewer slots (with a longer average wait time), and you will tie up resources that you do not need but that could run other users' jobs. In the long run, the pool works better for all users if you declare what you really need.
But how do you know what to request? Here we are concerned mainly with memory and disk; requesting and using multiple CPUs is covered briefly in later school materials, but true HTC splits work into jobs that each use as few CPU cores as possible (one core per job is best, because it gets the most jobs running soonest).
Determining Resource Needs Before Running Any Jobs
It can be very difficult to determine the memory needs of your running program. Typically, the memory size of a job changes over time, making the task even trickier. If you have knowledge ahead of time about your job’s maximum memory needs, use that, or maybe a number that is just a bit higher, to be safe. If not, then it is best to run your program in a single test job first and let HTCondor tell you in the log file (or in the condor_q -nobatch output, if you are able to watch it), which is covered in the next section, "Determining Resource Needs by Running Test Jobs".
You can try to figure out the resource requirements of a job by running it locally and seeing what it uses. In this lab, the machine you are on is the same as your execute machine, so the difference is artificial. Still, here are a couple of tools you could use.

Using the ps command:

```
$ ps -ux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bejones  22210  0.0  0.0 161316  1592 pts/2    R+   22:17   0:00 ps -ux
bejones  31695  0.0  0.0 129852  3460 pts/2    Ss   21:39   0:00 -bash
```

The RSS (resident set size) column shown above gives a rough indication of the memory usage (in KB) of each running process. If your program runs long enough, you can run this command several times and note the greatest value.

Using the top command:

```
$ top -u bejones
top - 22:18:45 up 53 days, 15:18, 29 users,  load average: 0.02, 0.16, 0.48
Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  1.5 sy,  0.0 ni, 94.8 id,  2.7 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  3531688 total,   161792 free,  2588424 used,   781472 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   533636 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
22544 bejones   20   0  167872   2200   1580 R   0.3  0.1   0:00.03 top
31695 bejones   20   0  129852   3464    956 S   0.0  0.1   0:00.17 bash
```

The top command (shown here with an option to limit the output to a single user ID) also shows information about running processes, but updates periodically by itself. Type the letter q to quit the interactive display. Again, the RES column shows an approximation of memory usage (in KB).
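The "run ps several times and note the greatest value" approach can be automated. Here is a minimal sketch that polls a child process's RSS via ps and records the peak; the child command ("sleep 2") is just a stand-in for your real program:

```python
#!/usr/bin/env python3
# Sketch: poll a child process's RSS (in KB) via ps and record the peak.
# "sleep 2" below is a placeholder for the real program you want to measure.
import subprocess
import time

proc = subprocess.Popen(["sleep", "2"])
peak_rss_kb = 0
while proc.poll() is None:                      # loop until the child exits
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(proc.pid)],
        capture_output=True, text=True,
    ).stdout.strip()
    if out.isdigit():                           # ps prints RSS in kilobytes
        peak_rss_kb = max(peak_rss_kb, int(out))
    time.sleep(0.5)

print("peak RSS: %d KB" % peak_rss_kb)
```

Polling like this can miss short-lived spikes between samples, so treat the result as a lower bound and pad your request accordingly.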
For Disk: Determining disk needs may be a bit simpler, because you can check on the size of files that a program is using while it runs. However, it is important to count all files that HTCondor counts to get an accurate size. HTCondor counts everything in your job sandbox toward your job’s disk usage:
- The executable itself
- All "input" files (anything else that gets transferred TO the job, even if you don't think of it as "input")
- All files created during the job (broadly defined as "output"), including the captured standard output and error files that you list in the submit file.
- All temporary files created in the sandbox, even if they get deleted by the executable before it's done.
If you can run your program within a single directory on a local computer (not on the submit server), you should be able to view files and their sizes with standard tools such as ls -l and du.
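Summing everything in a directory, the way HTCondor counts everything in the sandbox, can be sketched as follows. The demo directory and its 2 KB "input file" are invented here purely for illustration:

```python
#!/usr/bin/env python3
# Sketch: total up the size of every file under a directory, the way you
# would estimate a job sandbox's disk usage. The demo directory below is
# created just for this example.
import os
import tempfile

def sandbox_size_kb(path):
    """Sum the sizes of all regular files under `path`, in KB (rounded up)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return -(-total // 1024)  # ceiling division: bytes -> KB

demo = tempfile.mkdtemp()
with open(os.path.join(demo, "input.dat"), "wb") as f:
    f.write(b"x" * 2048)          # a 2 KB stand-in for an input file

print(sandbox_size_kb(demo))      # -> 2
```

Remember that temporary files deleted before the job ends still count toward peak usage, so a snapshot taken after the run can underestimate the true need.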
Determining Resource Needs By Running Test Jobs (BEST)
Despite the techniques mentioned above, by far the easiest approach to measuring your job’s resource needs is to run one or a small number of sample jobs and have HTCondor itself tell you about the resources used during the runs.

For example, here is a strange Python script that does not do anything useful but consumes some real resources while running:
```python
#!/usr/bin/env python
import time
import os

size = 1000000
numbers = []
for i in range(size):
    numbers.append(str(i))
tempfile = open('temp', 'w')
tempfile.write(' '.join(numbers))
tempfile.close()
time.sleep(60)
os.remove('temp')
```
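To run the script as a single test job, a submit file along these lines would work; the filenames here are assumptions for illustration, not taken from the original:

```
# hypothetical submit file for the test script above
executable = test_script.py
log        = test.log
output     = test.out
error      = test.err
queue
```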
When the job finishes, the log file reports the resources it used alongside what was requested and allocated:

```
Partitionable Resources :    Usage  Request Allocated
   Cpus                 :                 1         1
   Disk (KB)            :     6739  1048576   8022934
   Memory (MB)          :        3     1024      1024
```
Setting Resource Requirements

Once you know your job’s resource requirements, it is easy to declare them in your submit file. For example, taking our results above, we might slightly increase our requests beyond what was used, just to be safe:
```
# rounded up from 3 MB
request_memory = 4MB
# rounded up from 6.5 MB
request_disk = 7MB
```
- Without explicit units, request_memory is in MB (megabytes)
- Without explicit units, request_disk is in KB (kilobytes)
- Allowable units are KB, MB, GB, and TB
HTCondor translates these requirements into expressions that become part of the job's requirements expression. However, do not put your CPU, memory, and disk requirements directly into the requirements expression; use the request_XXX statements instead.
Add these requirements to your submit file for the Python script, rerun the job, and confirm in the log file that your requests were used.