Explore condor_q

The goal of this exercise is to try out some of the most common options to the condor_q command, so that you can view jobs effectively.

The main part of this exercise should take just a few minutes, but if you have more time later, come back and work on the extension ideas at the end to become a condor_q expert!

Selecting Jobs

The condor_q program has many options for selecting which jobs are listed. You have already seen that the default mode (as of version 8.5) is to show only your jobs in "batch" mode:
$ condor_q
You've seen that you can view all jobs (all users) in the submit node's queue by using the -all argument:
$ condor_q -all
And you've seen that you can view more details about queued jobs, with each job on its own line, by using the -nobatch option:
$ condor_q -nobatch
$ condor_q -all -nobatch
Did you know you can also name one or more user IDs on the command line, in which case jobs for all of the named users are listed at once?
$ condor_q username1 username2 username3
There are two other simple selection criteria that you can use. To list just the jobs associated with a single cluster number:
$ condor_q <cluster>
For example, if you want to see the jobs in cluster 5678 (i.e., 5678.0, 5678.1, etc.), you use condor_q 5678. To list a specific job (i.e., cluster.process, as in 5678.0):
$ condor_q <job id>
For example, to see job ID 5678.1, you use condor_q 5678.1.

Note: You can name more than one cluster, job ID, or combination thereof on the command line, in which case jobs for all of the named clusters and/or job IDs are listed.
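To make the selection rules concrete, here is a minimal Python sketch that mimics them on a hypothetical list of job IDs: a bare cluster number selects every process in that cluster, a cluster.process ID selects exactly one job, and naming several selectors lists the union. The helpers `parse_selector` and `select_jobs` and the sample queue are illustrative only; condor_q does this filtering for you against the real queue.

```python
def parse_selector(arg: str):
    """Split a condor_q-style selector into (cluster, proc); proc is None for a bare cluster."""
    if "." in arg:
        cluster, proc = arg.split(".", 1)
        return int(cluster), int(proc)
    return int(arg), None


def select_jobs(job_ids, selectors):
    """Return the job IDs (in queue order) matched by any of the selectors."""
    parsed = [parse_selector(s) for s in selectors]
    selected = []
    for job_id in job_ids:
        cluster, proc = parse_selector(job_id)
        for sel_cluster, sel_proc in parsed:
            # A bare cluster matches every proc; a full job ID matches one proc.
            if cluster == sel_cluster and (sel_proc is None or proc == sel_proc):
                selected.append(job_id)
                break
    return selected


# Hypothetical queue contents.
queue = ["5678.0", "5678.1", "5678.2", "5679.0", "5680.0"]
print(select_jobs(queue, ["5678"]))            # ['5678.0', '5678.1', '5678.2']
print(select_jobs(queue, ["5678.1"]))          # ['5678.1']
print(select_jobs(queue, ["5678.1", "5680"]))  # ['5678.1', '5680.0']
```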

Let’s get some practice using condor_q selections!

  1. Using a previous exercise, submit several sleep jobs
  2. List all jobs in the queue — are there others besides your own?
  3. Practice using all forms of condor_q that you have learned:
    • List just your jobs, with and without batching
    • List a specific cluster
    • List a specific job ID
    • Try listing several users at once
    • Try listing several clusters and job IDs at once
  4. When there are a variety of jobs in the queue, try combining a user ID and a different user's cluster or job ID in the same command — what happens?

Viewing a Job ClassAd

You may have wondered why it is useful to be able to list a single job ID using condor_q. By itself, it may not be that useful. But, in combination with another option, it is very useful!

If you add the -long option to condor_q (or its short form, -l), it will show the complete ClassAd for each selected job, instead of the one-line summary that you have seen so far. Because job ClassAds may have 80–90 attributes (or more!), it probably makes the most sense to show the ClassAd for a single job at a time. And you know how to show just one job! Here is what the command looks like:
$ condor_q -long <job id>
The output from this command is long and complex. Most of the attributes that HTCondor adds to a job are arcane and uninteresting for us now. But here are some examples of common, interesting attributes taken directly from condor_q output (except with some line breaks added to the Requirements attribute):
MyType = "Job"
Err = "sleep.err"
UserLog = "/home/cat/1-monday-2.1-queue/sleep.log"
JobUniverse = 5
Requirements = ( IsOSGSchoolSlot =?= true ) &&
        ( TARGET.Arch == "X86_64" ) &&
        ( TARGET.OpSys == "LINUX" ) &&
        ( TARGET.Disk >= RequestDisk ) &&
        ( TARGET.Memory >= RequestMemory ) &&
        ( TARGET.HasFileTransfer )
ClusterId = 2420
WhenToTransferOutput = "ON_EXIT"
Owner = "cat"
CondorVersion = "$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $"
Out = "sleep.out"
Cmd = "/bin/sleep"
Arguments = "120"

Note: Attributes are listed in no particular order and may change from time to time. Do not assume anything about the order of attributes in condor_q output.
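Because the order is not guaranteed, a script that reads saved condor_q -long output should look attributes up by name rather than by position. Here is a minimal Python sketch, assuming each attribute sits on a single `Name = value` line (the line breaks in the Requirements example above were added by hand) and that the text holds one job's ClassAd. `parse_classad` is a hypothetical helper, not part of HTCondor.

```python
def parse_classad(text: str) -> dict:
    """Parse 'Name = value' lines from condor_q -long output into a dict.

    Values are kept as raw ClassAd text, except that the surrounding
    double quotes are stripped from simple string values.
    """
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # skip blank or unrecognized lines
        value = value.strip()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        ad[name.strip()] = value
    return ad


sample = '''MyType = "Job"
ClusterId = 2420
Owner = "cat"
Cmd = "/bin/sleep"
Arguments = "120"
JobUniverse = 5
'''
ad = parse_classad(sample)
print(ad["Owner"], ad["ClusterId"], ad["Cmd"])  # cat 2420 /bin/sleep
```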

See what you can find in a job ClassAd from your own job.

  1. Using a previous exercise, submit a sleep job
  2. Before the job executes, capture its ClassAd and save it to a file:
    $ condor_q -l job-id > classad-1.txt
  3. After the job starts execution but before it finishes, capture its ClassAd again and save it to a file:
    $ condor_q -l job-id > classad-2.txt
Now examine each saved ClassAd file. Here are a few things to look for:
  • Can you find attributes that came from your submit file? (E.g., JobUniverse, Cmd, Arguments, Out, Err, UserLog, and so forth)
  • Can you find attributes that could have come from your submit file, but that HTCondor added for you? (E.g., Requirements)
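Comparing the two saved files by eye is tedious. Here is a minimal Python sketch that reports which attributes were added, removed, or changed between them, assuming the same one-attribute-per-line `Name = value` layout. The script name diff_classads.py and its helper functions are hypothetical.

```python
import sys


def parse_classad(text):
    """Map attribute name -> raw value text for 'Name = value' lines."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def diff_classads(before_text, after_text):
    """Return (added, removed, changed) attributes between two ClassAd dumps."""
    before, after = parse_classad(before_text), parse_classad(after_text)
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python diff_classads.py classad-1.txt classad-2.txt
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        added, removed, changed = diff_classads(f1.read(), f2.read())
    for name in sorted(added):
        print(f"+ {name} = {added[name]}")
    for name in sorted(removed):
        print(f"- {name} = {removed[name]}")
    for name in sorted(changed):
        old, new = changed[name]
        print(f"~ {name}: {old} -> {new}")
```

Attributes prefixed with `+` exist only in the second file, which is a quick way to spot what HTCondor adds once a job starts running.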

How many of the following attributes can you guess the meaning of?

  • DiskUsage
  • ImageSize
  • BytesSent
  • JobStartDate — what format is the value in? (Hint: The HTCondor developers are primarily trained in the Unix way of doing things.)
  • JobStatus
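Once you have made your guesses, you can check two of them with a short Python sketch. It assumes, as the hint suggests, that JobStartDate holds a Unix timestamp (seconds since 1970-01-01 UTC), and it uses HTCondor's numeric JobStatus codes. The timestamp in the example is hypothetical; substitute the value from your own classad-2.txt.

```python
from datetime import datetime, timezone

# Numeric JobStatus codes used in HTCondor job ClassAds.
JOB_STATUS = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def describe_status(code: int) -> str:
    """Translate a JobStatus code into its name."""
    return JOB_STATUS.get(code, f"Unknown ({code})")


def start_date_utc(epoch_seconds: int) -> str:
    """Render a Unix timestamp such as JobStartDate as an ISO-8601 UTC string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


print(describe_status(2))          # Running
print(start_date_utc(1500000000))  # 2017-07-14T02:40:00+00:00
```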

Why Is My Job Not Running?

Sometimes, you submit a job and it just sits in the queue in Idle state, never running. It can be difficult to figure out why a job never matches and runs. Fortunately, HTCondor can give you some help.

To ask HTCondor why your job is not running, add the -better-analyze option to condor_q for the specific job. For example, for job ID 2423.0, the command is:
$ condor_q -better-analyze 2423.0
Of course, replace the job ID with your own. Let’s submit a job that will never run and see what happens. Here is the submit file to use:
universe = vanilla
executable = /bin/hostname
output = norun.out
error = norun.err
log = norun.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
request_memory = 2TB
queue
(Do you see what I did?)
  1. Save and submit this file
  2. Run condor_q -better-analyze on the job ID
There is a lot of output, but a few items are worth highlighting. Here is a sample from my own job (with many lines left out):
-- Schedd: htcondor-0.os-internal : <10.10.0.120:9618?...
The Requirements expression for job 15.000 is

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )

Job 15.000 defines the following attributes:

    DiskUsage = 17
    RequestDisk = DiskUsage
    RequestMemory = 2097152

The Requirements expression for job 15.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          16  TARGET.Arch == "X86_64"
[1]          16  TARGET.OpSys == "LINUX"
[3]          16  TARGET.Disk >= RequestDisk
[5]           0  TARGET.Memory >= RequestMemory

No successful match recorded.
Last failed match: Tue Aug 29 07:28:09 2017

Reason for last match failure: no match found 

015.000:  Run analysis summary ignoring user priority.  Of 16 machines,
      0 are rejected by your job's requirements 
      0 reject your job because of their own requirements 
     16 are exhausted partitionable slots 
      0 match and are already running your jobs 
      0 match but are serving other users 
      0 are available to run your job

WARNING:  Be advised:
   No machines matched the jobs's constraints
Toward the end of the output, condor_q says that it considered 16 “machines” (really, slots), that all 16 are exhausted partitionable slots, and that none are available to run the job. In other words, I am asking for something that is not available. But what? If you look at the list of conditions, you will see that the only one matched by zero slots is TARGET.Memory >= RequestMemory. The TARGET. prefix refers to attributes of the target machine (slot) being considered during matchmaking, so none of the slots offers memory greater than or equal to the requested 2TB (which appears in the job ClassAd as RequestMemory = 2097152, in megabytes).
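The "Slots Matched" column can be read as a count of how many slots satisfy each condition on its own. Here is a minimal Python sketch of that bookkeeping, using a hypothetical pool of 16 identical slots (the Arch, OpSys, Disk, and Memory values are invented for illustration) and the job attributes from the output above. It also shows where 2097152 comes from: RequestMemory is expressed in megabytes, and 2 TB is 2 × 1024 × 1024 MB. This mirrors the arithmetic only; it is not how condor_q is implemented.

```python
# Hypothetical pool of 16 identical slots. Memory is in MB and Disk in KB,
# the units HTCondor uses for these slot attributes.
slots = [
    {"Arch": "X86_64", "OpSys": "LINUX", "Disk": 10_000_000, "Memory": 64_000}
    for _ in range(16)
]


def tb_to_mb(tb: int) -> int:
    """Convert terabytes to megabytes using binary (1024) multiples."""
    return tb * 1024 * 1024


# Job attributes reported by -better-analyze above.
job = {"DiskUsage": 17, "RequestMemory": tb_to_mb(2)}
job["RequestDisk"] = job["DiskUsage"]

conditions = [
    ('TARGET.Arch == "X86_64"', lambda s: s["Arch"] == "X86_64"),
    ('TARGET.OpSys == "LINUX"', lambda s: s["OpSys"] == "LINUX"),
    ("TARGET.Disk >= RequestDisk", lambda s: s["Disk"] >= job["RequestDisk"]),
    ("TARGET.Memory >= RequestMemory", lambda s: s["Memory"] >= job["RequestMemory"]),
]


def matched_counts(slots, conditions):
    """Count, for each condition on its own, how many slots satisfy it."""
    return [(text, sum(1 for s in slots if test(s))) for text, test in conditions]


for text, count in matched_counts(slots, conditions):
    print(f"{count:>8}  {text}")
```

With these hypothetical slots the memory condition is the only one that matches zero slots, just as in the real output.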

Automatically Formatting Output (Optional)

You can format output from condor_q with the -autoformat (or -af) option. In this case, HTCondor decides for you how to format the attributes you ask for from the job ClassAds. (To control the formatting yourself, you could use the -format option, which we are not covering here.)

To use autoformatting, use the -af option followed by the attribute name, for each attribute that you want to output:
$ condor_q -af Owner -af ClusterId -af Cmd
bejones 16 /share/test.sh
cat 17 /bin/sleep
cat 18 /bin/sleep
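For simple values like these, "deciding the format for you" amounts to printing one line per job with the requested values separated by spaces, and string values shown without their quotes. Here is a minimal Python sketch of that behavior on hypothetical, already-parsed ClassAds; printing "undefined" for a missing attribute is an assumption about condor_q's behavior, and real -af output may differ in other edge cases.

```python
def autoformat(ads, attributes):
    """Return one line per ad with the requested attribute values separated by spaces."""
    lines = []
    for ad in ads:
        # Missing attributes are shown as "undefined" (assumed behavior).
        values = [str(ad.get(name, "undefined")) for name in attributes]
        lines.append(" ".join(values))
    return "\n".join(lines)


# Hypothetical parsed job ClassAds, mirroring the sample output above.
ads = [
    {"Owner": "bejones", "ClusterId": 16, "Cmd": "/share/test.sh"},
    {"Owner": "cat", "ClusterId": 17, "Cmd": "/bin/sleep"},
    {"Owner": "cat", "ClusterId": 18, "Cmd": "/bin/sleep"},
]
print(autoformat(ads, ["Owner", "ClusterId", "Cmd"]))
```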
Bonus Question: If you wanted to print out the Requirements expression of a job, how would you do that with -af? Is the output what you expected? (Hint: for ClassAd attributes like Requirements whose values are long expressions rather than simple values, you can use -af:r to view the raw expression instead of its current evaluation.)