A Simple Job Script

In htcondor_workshop/01_simple_submit there is a bash script, job_setup.sh. This script should build a program on the batch system: it should run some checks and compile the generate_number.cpp file. The compiled program should then be copied to a shared filesystem. At EKP this could be your home directory on the portal machines. After you have edited the script, submit it with grid-control or JDL-Creator.

JDL-Creator

Have a look at the Python script create_jdl.py. This script creates a directory setup_jobs that contains the subdirectories error, out and log, the JDL file, and a copy of the bash script. The warnings from the Python script can be ignored here; we discuss them in the next exercise. You can submit your job with condor_submit job_setup.jdl.
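For orientation, a submit description (JDL) of the kind described above could be assembled as in the following Python sketch. It is not the code or the output of create_jdl.py: the helper render_jdl is hypothetical, and the file naming inside error, out and log is an assumption. Only standard HTCondor submit commands (universe, executable, output, error, log, queue) are used.

```python
def render_jdl(executable):
    """Return a minimal HTCondor submit description as text.

    Hypothetical helper for illustration only; it does not reproduce
    the JDL-Creator. $(Cluster) and $(Process) are HTCondor submit
    macros that expand to the cluster id and process id of a job.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # Assumed naming scheme: one stdout/stderr file per job.
        "output = out/$(Cluster)_$(Process).out",
        "error = error/$(Cluster)_$(Process).err",
        # One HTCondor event log per cluster.
        "log = log/$(Cluster).log",
        # Submit exactly one job.
        "queue",
    ]
    return "\n".join(lines) + "\n"


jdl_text = render_jdl("job_setup.sh")
print(jdl_text)
```

Comparing this sketch with the JDL file that create_jdl.py actually writes shows which additional attributes the JDL-Creator sets.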

Grid-Control

Have a look at the grid-control submit config file gc_submit.cfg. This is a minimal config file for a user-defined job that should run on an HTCondor batch system. You can submit this job with <path to grid-control>/go.py gc_submit.cfg. In this setup that would be ../../grid-control/go.py. Grid-control creates a directory work.gc_submit. This directory contains, among others, the directories sandbox and output. The sandbox contains all the files that the jobs need. The output directory contains the stdout and stderr of each job. Look at this stdout file: you will see that grid-control runs some checks and prints information such as the directory in which the job starts and how much space is left there.

Job Monitoring in HTCondor

You can show the status of your jobs with condor_q. Once a job has finished, it is removed from that list automatically. With condor_q -analyze <job-id> or condor_q -better-analyze <job-id> you can check how many resources are able to run your job. With the option

Special Job Submit

Now we want to run our compiled program with different parameters. First, change the job_setup.sh script so that it runs your job with an argument.

Program Arguments in Jobs

For the JDL-Creator you have to create a list of strings. Each element of the list corresponds to one job and contains the arguments for that job. The argument list is written to the file arguments.txt in the submit folder.
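The idea of one list element per job can be sketched in plain Python as follows. This does not use the JDL-Creator itself: the helper write_arguments and the parameter values are hypothetical, and a temporary directory stands in for the submit folder.

```python
import os
import tempfile


def write_arguments(arguments, submit_dir):
    """Write one line of arguments per job and return the file path."""
    path = os.path.join(submit_dir, "arguments.txt")
    with open(path, "w") as handle:
        for job_args in arguments:
            handle.write(job_args + "\n")
    return path


# Hypothetical parameter values: one string per job, so three jobs.
arguments = [str(n) for n in (10, 100, 1000)]

# A temporary directory stands in for the real submit folder.
with tempfile.TemporaryDirectory() as submit_dir:
    path = write_arguments(arguments, submit_dir)
    with open(path) as handle:
        written = handle.read().splitlines()

print(written)
```

HTCondor can read such a file in a submit description with queue arguments from arguments.txt, which creates one job per line. Whether the JDL-Creator uses exactly this statement is an assumption; check the generated JDL file.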

For grid-control you have to add the [parameter] section and define the set of your variables there. These variables are then used in the [UserTask] section to set the arguments of each job.

You will see that some of these jobs go into the hold state. The reason can be found with condor_q -hold. The message should say that the job needs more memory than requested. The default value at EKP is about 2100 MB. For the jobs to run to completion, we have to request more memory: 4500 MB should be requested for these jobs. Have a look at the batch system log file. It lists the maximum disk and memory usage per job, so check whether you can reduce the requested memory.

In addition, we should add some other attributes to the submit file: accounting_group and walltime. The accounting group depends on your working group. At the EKP we have the following accounting groups:

  • ams
  • belle
  • cms
    • cms.top
    • cms.higgs
    • cms.jet

You can get the walltime of your job from the condor_history command or from the EKP monitoring website HappyFace [1].
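The additional submit file lines discussed in this section could look like the output of the sketch below. request_memory (in MB by default) and accounting_group are standard HTCondor submit commands. The walltime attribute name +RequestWalltime and the value of 3600 seconds are assumptions for illustration, so check the local documentation for the name your site expects. The helper resource_lines is hypothetical.

```python
def resource_lines(memory_mb, accounting_group, walltime_s,
                   walltime_attr="+RequestWalltime"):
    """Return submit file lines for memory, accounting group and walltime.

    walltime_attr is an ASSUMED site-specific attribute name; pass a
    different one if your site uses another convention.
    """
    return [
        f"request_memory = {memory_mb}",
        f"accounting_group = {accounting_group}",
        f"{walltime_attr} = {walltime_s}",
    ]


# 4500 MB as suggested above; group and walltime are example values.
lines = resource_lines(4500, "cms.top", 3600)
print("\n".join(lines))
```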

Job Information

You can look at all attributes of a job with condor_q -l <job-id>. For jobs that are done, use the condor_history command. Both condor_q and condor_history can also print selected values with the autoformat argument (-af). To see the walltime of the last 10 finished jobs of the user mschnepf, run condor_history -limit 10 -af RemoteWallClockTime mschnepf.
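With -af and a single attribute, each job is printed as one value per line, so the output is easy to post-process. The following sketch turns such output into a list of walltimes in seconds. The sample text is made up and only stands in for real command output, and the helper parse_walltimes is hypothetical.

```python
def parse_walltimes(af_output):
    """Convert one-value-per-line autoformat output into seconds."""
    values = []
    for line in af_output.splitlines():
        line = line.strip()
        # Autoformat prints "undefined" when a job lacks the attribute.
        if not line or line == "undefined":
            continue
        values.append(float(line))
    return values


# Made-up sample standing in for the output of
# condor_history -limit 10 -af RemoteWallClockTime <user>
sample = "120.0\n95.0\nundefined\n310.0\n"

walltimes = parse_walltimes(sample)
longest = max(walltimes)
print(longest)
```

To use real data, the sample string can be replaced by the captured standard output of the condor_history command, for example via subprocess.run with capture_output=True and text=True. The longest observed walltime plus a safety margin is a reasonable starting point for the walltime request.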

Other useful job attributes (ClassAds) are:

  • LastRemoteHost
  • ExitCode
  • RequestDisk

Job Resubmission

With the JDL-Creator you can simply resubmit your jobs with the new JDL file. The old jobs will remain in the queue for a while.

Resubmission with grid-control is more complex. Grid-control knows the old submit file and checks the state of these jobs when it runs. You can remove the old jobs from the batch system with go.py -d all. Now you can submit again. Grid-control does not allow some variables to be changed after submission. If this is the case, you can remove the work.* directory and resubmit your jobs.

Select Resources

At the EKP we have different resources. These resources can be categorised as CPU or I/O. You can set requirements that a resource has to provide. These requirements are ProvidesCPU and ProvidesIO. Please set the requirements so that your jobs run on resources (Target) that are specialised for that kind of job.

The example program generate_numbers is a CPU-intensive program, so set the requirement for the job to Target.ProvidesCPU. Submit the jobs with that requirement and check with condor_q -better-analyze how many resources can run your jobs.
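As a sketch, the requirement can be written as a single submit file line. The snippet below assumes that ProvidesCPU and ProvidesIO are boolean machine attributes, so that naming the attribute alone is a valid requirements expression. The helper requirements_line is hypothetical.

```python
def requirements_line(kind):
    """Return the submit file line that selects a resource category.

    kind is "cpu" or "io". TARGET refers to the machine ClassAd that
    the job is matched against.
    """
    attributes = {"cpu": "ProvidesCPU", "io": "ProvidesIO"}
    return f"requirements = TARGET.{attributes[kind]}"


# generate_numbers is CPU intensive, so select CPU resources.
line = requirements_line("cpu")
print(line)
```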

HTDA Resources