Read and Interpret log files
The goal of this exercise is quite simple: Learn to understand the contents of a job log file. When things go wrong with your job, it is usually the first place you should look for important messages. Plus, there is other useful information there.
This exercise is short. If you do not have time for it now, come back and visit it later.
Reading a Log file
A job log file is updated throughout the life of a job, usually at key events. Each event starts with a heading that indicates what happened and when. Here are all of the event headings from the sleep job log (detailed output in between headings has been omitted here):
000 (459934.000.000) 08/28 13:07:40 Job submitted from host: <188.8.131.52:9618?addrs=184.108.40.206-9618+[2001-1458-301-e2--100-20]-9618&noUDP&sock=1412634_012e_3>
001 (459934.000.000) 08/28 13:13:18 Job executing on host: <220.127.116.11:9618?addrs=18.104.22.168-9618+[--1]-9618&noUDP&sock=7965_3855_3>
006 (459934.000.000) 08/28 13:13:22 Image size of job updated: 120812
005 (459934.000.000) 08/28 13:13:22 Job terminated.
Each heading includes:
- The job ID: cluster 459934, process 0 (written as 000)
- The date and local time of each event
- A brief description of the event: submission, execution, some information updates, and termination
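The fields listed above can also be pulled out of a heading programmatically. The following is a minimal sketch, assuming headings follow exactly the layout shown in this exercise (three-digit event code, job ID in parentheses, MM/DD date, local time, description); other HTCondor configurations may format dates differently. The names `HEADING` and `parse_heading` are made up for this illustration.

```python
import re

# Layout assumed from the headings shown above, for example:
#   005 (459934.000.000) 08/28 13:13:22 Job terminated.
HEADING = re.compile(
    r"^(?P<event>\d{3}) "
    r"\((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<subproc>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<text>.*)$"
)

def parse_heading(line):
    """Return the fields of one event heading, or None if the line is not a heading."""
    m = HEADING.match(line.strip())
    if m is None:
        return None
    return {
        "event": m.group("event"),          # three-digit event code, kept as text
        "cluster": int(m.group("cluster")),
        "proc": int(m.group("proc")),       # "000" becomes 0
        "date": m.group("date"),
        "time": m.group("time"),
        "text": m.group("text"),            # brief description of the event
    }

heading = parse_heading("005 (459934.000.000) 08/28 13:13:22 Job terminated.")
print(heading)
```

Detail lines and the `...` separators between events do not match the pattern, so `parse_heading` returns `None` for them.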
Some events contain nothing beyond the heading:
000 (459934.000.000) 08/28 13:07:40 Job submitted from host: <22.214.171.124:9618?addrs=126.96.36.199-9618+[2001-1458-301-e2--100-20]-9618&noUDP&sock=1412634_012e_3>
...
001 (459934.000.000) 08/28 13:13:18 Job executing on host: <188.8.131.52:9618?addrs=184.108.40.206-9618+[--1]-9618&noUDP&sock=7965_3855_3>
...
But the periodic information update event contains some additional information:
006 (459934.000.000) 08/28 13:13:22 Image size of job updated: 120812
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (459934.000.000) 08/28 13:13:22 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    57  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    57  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :       17        1   2013116
       Memory (MB)          :        0     2000      2000
...
The most useful parts of the termination event are:
- The return value (0 here, which is success; non-zero usually means failure)
- The total number of bytes transferred each way, which could be useful if your network is slow
- The Partitionable Resources table, especially disk and memory usage
There are many other kinds of events, but the ones above will occur in almost every job log.
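If you need to check many jobs, the return value and the Partitionable Resources table of a termination event can be extracted with a short script. This is only a sketch: it assumes the event text looks like the termination event shown above (columns Usage, Request, Allocated, with a possibly blank Usage column), and the function name `summarize_termination` is hypothetical.

```python
import re

def summarize_termination(event_text):
    """Extract the return value and resource table from one termination event."""
    summary = {"return_value": None, "resources": {}}

    m = re.search(r"\(return value (-?\d+)\)", event_text)
    if m:
        summary["return_value"] = int(m.group(1))

    in_table = False
    for line in event_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Partitionable Resources"):
            in_table = True          # rows of the table follow this header
            continue
        if not in_table or ":" not in stripped:
            continue
        name, _, values = stripped.partition(":")
        fields = values.split()
        if not fields or not all(f.isdigit() for f in fields):
            continue                 # not a numeric row in the assumed layout
        numbers = [int(f) for f in fields]
        if len(numbers) == 2:
            numbers.insert(0, None)  # Usage column was blank (as for Cpus above)
        if len(numbers) != 3:
            continue
        usage, request, allocated = numbers
        summary["resources"][name.strip()] = {
            "usage": usage, "request": request, "allocated": allocated,
        }
    return summary

# Sample text copied from the termination event in this exercise.
sample = """005 (459934.000.000) 08/28 13:13:22 Job terminated.
    (1) Normal termination (return value 0)
    0  -  Run Bytes Sent By Job
    57  -  Run Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :       17        1   2013116
       Memory (MB)          :        0     2000      2000
..."""

summary = summarize_termination(sample)
print(summary["return_value"])
print(summary["resources"]["Memory (MB)"])
```

Comparing `usage` with `request` for disk and memory is a quick way to see whether a job asked for far more (or less) than it needed.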
Understanding when job log events are written
When are events written to the job log file? Let’s find out. Read through the entire procedure below before starting, because some parts of the process are time sensitive.
- Change the sleep job submit file so that the job sleeps for 2 minutes (120 seconds)
- Submit the updated sleep job
- As soon as the condor_submit command finishes, press the Return key a few times to create some blank lines
- Right away, run a command to show the log file and keep showing updates as they occur. Be sure to use the correct filename for your log file, as named in your submit file.
$ tail -f sleep.log
- Watch the output carefully. When do events appear in the log file?
- After the termination event appears, press Control-C to end the tail command and return to the shell prompt.
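If you would like each new log line stamped with the wall-clock time at which it showed up, the procedure above can be approximated in Python. This is a rough sketch rather than a replacement for tail: it polls the file instead of being notified of changes, and the names `read_new_lines` and `follow`, along with the polling interval, are choices made for this illustration.

```python
import os
import time

def read_new_lines(path, offset):
    """Return (complete_lines, new_offset) for data appended to path since offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only consume up to the last newline so a half-written line is read later.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end

def follow(path, polls, interval=1.0):
    """Poll path a fixed number of times, printing new lines with the time seen."""
    offset = 0
    for _ in range(polls):
        if os.path.exists(path):
            lines, offset = read_new_lines(path, offset)
            for line in lines:
                print(time.strftime("%H:%M:%S"), "|", line)
        time.sleep(interval)

# Example (not run here): watch the log once per second for about five minutes.
# Use the log filename from your own submit file.
# follow("sleep.log", polls=300)
```

The printed time is when the script noticed the line, so it can lag the event's own timestamp by up to one polling interval.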
Understanding How HTCondor Writes Files
When HTCondor writes the output, error, and log files, does it erase the previous contents of the file or does it add new lines onto the end? Let’s find out!
For this exercise, we can use the hostname job from earlier.
- Edit the hostname submit file so that it uses new and unique filenames for output, error, and log files. Alternatively, delete any existing output, error, and log files from previous runs of the hostname job.
- Submit the job three separate times in a row (there are better ways to do this, which we will cover in the next lecture)
- Wait for all three jobs to finish
- Examine the output file: How many hostnames are there? Did HTCondor erase the previous contents for each job, or add new lines?
- Examine the log file… carefully: What happened there? Pay close attention to the times and job IDs of the events.
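When examining the log file, it can help to group the event headings by job ID so that the times for each job sit side by side. The sketch below assumes the heading layout shown earlier in this exercise; `events_by_job` is a name invented here, and the sample text reuses the sleep job headings from above only to show the shape of the result.

```python
import re

# Event code, (cluster.proc.subproc), date, time, description.
HEADING = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+) (\S+) (.*)$")

def events_by_job(log_text):
    """Map 'cluster.proc' to a list of (time, event code, description) tuples."""
    jobs = {}
    for line in log_text.splitlines():
        m = HEADING.match(line)
        if m is None:
            continue  # detail line or "..." separator between events
        code, cluster, proc, _date, clock, text = m.groups()
        job_id = f"{int(cluster)}.{int(proc)}"
        jobs.setdefault(job_id, []).append((clock, code, text))
    return jobs

sample = """000 (459934.000.000) 08/28 13:07:40 Job submitted from host: <...>
...
001 (459934.000.000) 08/28 13:13:18 Job executing on host: <...>
...
006 (459934.000.000) 08/28 13:13:22 Image size of job updated: 120812
    0  -  MemoryUsage of job (MB)
...
005 (459934.000.000) 08/28 13:13:22 Job terminated.
..."""

for job_id, events in events_by_job(sample).items():
    print(job_id, [code for _clock, code, _text in events])

# For your own log (the filename is whatever your submit file names):
# with open("hostname.log") as f:
#     print(events_by_job(f.read()).keys())
```

Run against your own log file, the number of distinct job IDs and the order of their events should match what you found by reading the file yourself.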