Revision $Revision: 1.5 $ of file $RCSfile: VDSUG_Kickstart.xml,v $.
This document shows how to interpret the results commonly found in the kickstart result record. It assumes that you are familiar with the basic terminology of XML documents: you need to know what an element is and how to find attributes. You also need an elementary grasp of UNIX programming concepts, such as the relation of errno to failed system calls, or how to invoke man pages to find out about system calls like the wait family.
To facilitate an example, please invoke kickstart on the command-line as follows:
kickstart -n unix::date -N isodate -R test /bin/date -Isec > date.out
The above will capture the output from kickstart into the file date.out. If your environment is correctly set up, you should not have seen any errors.
When you view the example, you will find several of the options passed to kickstart in the resulting provenance tracking record (PTR), as the output is frequently called. It helps to widen the viewing terminal or editor beyond 80 columns.
The PTR is XML pretty-printed, with proper indentations to make it more legible to humans. The different sections in the PTR are somewhat repetitive and depend on how kickstart was invoked, and with what options.
The current description of the schema can be found at the GriPhyN web site:
http://www.griphyn.org/workspace/VDS/
This document describes the provenance tracking record XML schema in version 1.5:
http://www.griphyn.org/workspace/VDS/iv-1.5/iv-1.5.html
A local copy is shipped with the VDS distribution. It can be found in the $VDS_HOME/doc/schemas directory.
The file starts with the XML preamble that designates the document as XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
This preamble is common to most properly formatted XML documents.
The root element is invocation. Its initial set of attributes deals with stating that the present document is of a certain XML schema:
<invocation
xmlns="http://www.griphyn.org/chimera/Invocation"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.griphyn.org/chimera/Invocation
http://www.griphyn.org/chimera/iv-1.5.xsd"
version="1.5" start="2005-08-30T10:40:30.160-05:00"
duration="0.021"
transformation="unix::date" derivation="isodate"
resource="test"
hostaddr="128.135.152.241"
hostname="griodine.uchicago.edu"
pid="18554" uid="500" user="voeckler"
gid="500" group="voeckler">
· xmlns associates an XML namespace identifier with all un-prefixed elements and attributes in the present document.
· xmlns:xsi associates an XML namespace identifier with all elements and attributes in the present document which are prefixed with xsi.
· xsi:schemaLocation associates the XML namespace identifier with a location of an XML schema document that describes elements falling into the XML namespace. For your convenience, the location pointed to is at the GriPhyN web site.[1]
The next set of attributes belongs to the PTR.
· version defines the version of this document instance.
· start is the timestamp when kickstart was started. Note that XML uses an ISO 8601 compliant timestamp format. Also, a time stamp is only as accurate as the clock of the worker node it is derived from.
· duration is the execution wall clock time of kickstart in seconds, with millisecond resolution. To determine the finish time of kickstart, add the duration to the start. Since many UNIX systems still have a clock resolution of 10 ms, only two digits after the decimal point may be significant.
· transformation defines the VDL transformation name for the current job. The value for this attribute is directly copied from the -n command-line argument. If the -n option is not present when kickstart is invoked, this attribute will not be generated.
In our example case, the value is "unix::date", as passed on the command-line. The concrete planner is expected to set this option.
· derivation defines the VDL derivation name for the current job. The value for this attribute is directly copied from the value of the -N command-line option. If the -N option is not present when kickstart was invoked, this attribute will not be generated.
In our example case, the value is "isodate", as passed on the command-line. The concrete planner is expected to set this option.
· resource marks the site handle where the present job was executed. The value for this attribute derives from the -R command-line option. If the -R option is not present when kickstart is invoked, this attribute will not be generated.
In our example, the value is "test", as passed on the command-line. The concrete planner is expected to set this option.
· hostaddr is the numeric address of the primary interface of the host where kickstart was executed. Its format is the dotted quad.
· hostname is the symbolic version of hostaddr. However, should the name resolution fail for some reason, this attribute will not be generated.
· pid is the UNIX process id of the kickstart process itself.
· uid is the effective user id under which kickstart is executed.
· user is the symbolic translation of uid. Should the translation fail for some reason, this attribute will not be generated.
· gid is the effective group id under which kickstart runs.
· group is the symbolic translation of gid. Should the translation fail for some reason, this attribute will not be generated.
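The attribute handling described above can be sketched in Python. This is a minimal illustration, not part of kickstart or VDS: it parses an abbreviated invocation element whose values are copied from the example, treats optional attributes as possibly absent, and adds duration to start to obtain the finish time. The variable names are chosen for illustration only.

```python
from datetime import datetime, timedelta
import xml.etree.ElementTree as ET

# Abbreviated invocation element; attribute values are copied from the example.
PTR = """<invocation xmlns="http://www.griphyn.org/chimera/Invocation"
  version="1.5" start="2005-08-30T10:40:30.160-05:00" duration="0.021"
  transformation="unix::date" resource="test"
  pid="18554" uid="500" user="voeckler"/>"""

root = ET.fromstring(PTR)

# Optional attributes (hostname, user, group, ...) are not generated when the
# lookup fails, so read them with .get() and be prepared for None.
hostname = root.get("hostname")

# start is an ISO 8601 timestamp, duration is in seconds.
start = datetime.fromisoformat(root.get("start"))
finish = start + timedelta(seconds=float(root.get("duration")))

print("transformation:", root.get("transformation"))
print("hostname:", hostname)
print("finish:", finish.isoformat(timespec="milliseconds"))
```

Keep in mind that the computed finish time is only as accurate as the worker node's clock.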
Kickstart knows about five different jobs:
· setup describes the independent set-up helper job.
· prejob describes any helper job which is executed prior to the main job.
· mainjob describes the main compute job.
· postjob describes any helper job which is executed after the main job.
· cleanup describes the independent clean-up helper job.
The jobs have the following relationship and dependencies in shell pseudo notation:
setup ; prejob && mainjob && postjob; cleanup
This means that, if a setup job is present, it is executed first. Regardless of its success, the chain of pre, main and post job is executed next. If a pre job is present, it is executed; if it fails, the chain ends there. If the pre job succeeds, or is not present, the main job is executed next. If the main job fails, the chain ends. If a post job is present, it is executed upon successful execution of the main job. Independently of the chain, if a cleanup job is present, it is executed after the chain.
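The dependency rules can be expressed as a small Python sketch. It is a model of the shell pseudo notation only, not kickstart's implementation. The function name jobs_executed and the convention that a job missing from the mapping is "not present" are assumptions made for this illustration.

```python
def jobs_executed(results):
    """Return the jobs that run, in order.

    `results` maps a job name to True (succeeds) or False (fails).
    A job that is missing from the mapping is treated as not present.
    """
    ran = []
    # setup runs first; its outcome does not affect the chain.
    if "setup" in results:
        ran.append("setup")
    # prejob && mainjob && postjob: the chain stops at the first failure.
    for name in ("prejob", "mainjob", "postjob"):
        if name not in results:
            continue  # an absent job does not break the chain
        ran.append(name)
        if not results[name]:
            break
    # cleanup runs after the chain, however the chain ended.
    if "cleanup" in results:
        ran.append("cleanup")
    return ran


# A failing main job skips the post job, but cleanup still runs.
print(jobs_executed({"setup": False, "prejob": True, "mainjob": False,
                     "postjob": True, "cleanup": True}))
```

With only a main job configured, as in the running example, the function returns just that job.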
Due to the complexity of the execution path, you will usually see at least one resulting job element. However, there is no guarantee that you will always see a main job element in the PTR. In our current example, we will see just the main job:
<mainjob
start="2005-08-30T10:40:30.161-05:00" duration="0.019"
pid="18555">
<usage utime="0.000" stime="0.000"
minflt="14" majflt="133"
nswap="0" nsignals="0" nvcsw="0"
nivcsw="0"/>
<status raw="0"><regular exitcode="0"/></status>
<statcall error="0">
<file name="/bin/date">7F454C46010101000000000000000000</file>
<statinfo mode="0100755"
size="38588" inode="4251559"
nlink="1" blksize="4096"
mtime="2003-10-29T08:44:15-06:00"
atime="2005-08-30T10:40:00-05:00"
ctime="2004-06-28T16:26:44-05:00"
uid="0" user="root" gid="0"
group="root"/>
</statcall>
<arguments>-Isec</arguments>
</mainjob>
All the above jobs have an identical layout. Each job has the following set of attributes, similar to kickstart’s own record:
· start is the wall time when the specific job was started. Note that XML uses an ISO 8601 compliant timestamp format.
· duration is the execution wall clock time of the job in seconds, with millisecond resolution. To determine the finish time of the job, add the duration to start.
· pid is the UNIX process id under which the job was started.
Each job is comprised of the following four child elements:
· usage is an excerpt of the UNIX struct rusage record, which records the resource usage of the job. Because Linux fills in only some of the fields, not all fields are passed back.
· status records the job exit results. It is most pertinent when trying to figure out the success of any job operation.
· statcall records the struct stat record about the application that was invoked.
· arguments records the argument vector of the job.
The usage element captures the resource consumption of the specific job. Unfortunately, Linux is less than forthcoming with numbers, and thus only a subset of the values is propagated into the PTR:
<usage utime="0.000" stime="0.000"
minflt="14" majflt="133"
nswap="0" nsignals="0" nvcsw="0"
nivcsw="0"/>
Please refer to the UNIX manual page for the getrusage function to obtain what the attributes mean on your system. Generally the following interpretation is acceptable:
· utime is the time in seconds with millisecond resolution spent in user-land.
· stime is the time in seconds with millisecond resolution spent in kernel space.
· minflt is the number of page faults not requiring physical I/O.
· majflt is the number of page faults resulting in physical I/O.
· nswap is the number of swap operations, always 0 for Linux.
· nsignals is the number of signals received, always 0 for Linux.
· nvcsw is the number of voluntary context switches.
· nivcsw is the number of involuntary context switches (pre-emption).
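To see how these attributes relate to the struct rusage fields on your own system, the following Python sketch queries getrusage for the current process through the standard resource module (UNIX only). The attribute names on the left follow the PTR; the mapping itself is an illustration and does not show how kickstart formats its output.

```python
import resource

# Resource usage of the current process. kickstart reports the analogous
# record for each job it ran.
ru = resource.getrusage(resource.RUSAGE_SELF)

usage = {
    "utime": f"{ru.ru_utime:.3f}",   # seconds spent in user-land
    "stime": f"{ru.ru_stime:.3f}",   # seconds spent in kernel space
    "minflt": ru.ru_minflt,          # page faults without physical I/O
    "majflt": ru.ru_majflt,          # page faults with physical I/O
    "nswap": ru.ru_nswap,            # swap operations
    "nsignals": ru.ru_nsignals,      # signals received
    "nvcsw": ru.ru_nvcsw,            # voluntary context switches
    "nivcsw": ru.ru_nivcsw,          # involuntary context switches
}

for name, value in usage.items():
    print(f"{name}={value}")
```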
The status element of any job exhibits the clues as to what happened to a job, e.g. if it started at all, how it finished, and whether it received a signal.
<status raw="0"><regular exitcode="0"/></status>
The raw attribute of the status element captures the raw exit code as obtained from the UNIX wait call family.
· If it is negative, some failure to start the application was perceived. Kickstart itself customarily generates a negative value, if it fails to start the application. The application itself is usually not started in this case.
· If it is 0, the application terminated successfully.
· If any of the lower 7 bits are set, the application terminated on the signal whose number is given by that bit pattern.
· If bit 7 is set, the application produced a UNIX core file upon exit.
· The upper eight bits constitute the exit code with which the application exited.
This distinction leads to one of the four possible child elements inside the status element:
· failure signals the inability to even start the application, which relates to the first case above. Kickstart itself detects this family of errors. The only attribute error contains the most recent value of the UNIX libc errno variable when the error was detected. The element's content pertains to the error condition.
· regular denotes an application that ran until it finished by itself. Its exitcode attribute captures the exit code value from the upper 8 bits of the raw exit code. A value of 0 means that the application returned an exit code of 0, etc.
· signalled marks the abnormal termination of an application by a signal. The content (value) of the XML element pertains to the symbolic cause. It features the following attributes:
o signal records the signal number that caused the abnormal termination.
o corefile is a Boolean value to indicate that a core file was generated. This attribute is optional, as not all systems may produce core files.
· suspended is a state that should never be visible here. Please contact vds-support, if you ever notice it in your kickstart output.
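The decoding of the raw attribute into one of the child elements can be sketched as follows. This Python function follows the bit layout described above. It is an illustration rather than kickstart's own code, the name classify is hypothetical, and the suspended state is deliberately not modelled because its encoding is not described here.

```python
def classify(raw):
    """Map a raw status value to (child element name, attributes)."""
    if raw < 0:
        # kickstart could not start the application; the errno value that
        # goes into the error attribute cannot be derived from raw.
        return "failure", {}
    signal = raw & 0x7F                  # lower 7 bits: signal number
    if signal:
        return "signalled", {
            "signal": signal,
            "corefile": bool(raw & 0x80),  # bit 7: core file produced
        }
    # upper eight bits: exit code of the application
    return "regular", {"exitcode": (raw >> 8) & 0xFF}


for raw in (0, 256, 9, 139, -1):
    print(raw, classify(raw))
```

For the example record, raw="0" yields the regular element with exitcode="0". In your own scripts, the standard os.WIFEXITED, os.WEXITSTATUS, os.WIFSIGNALED and os.WTERMSIG functions are the portable way to decode a wait status.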
The statcall element of a job records i-node information about the application that was started. Its only attribute error records the result of the UNIX stat call on the executable's file name.
The error attribute is useful for debugging. If not 0, it provides UNIX’s reason why it failed to start the application, e.g. the executable was not found where it was expected.
<statcall error="0">
<file name="/bin/date">7F454C46010101000000000000000000</file>
<statinfo mode="0100755" size="38588"
inode="4251559"
nlink="1" blksize="4096"
mtime="2003-10-29T08:44:15-06:00"
atime="2005-08-30T10:40:00-05:00"
ctime="2004-06-28T16:26:44-05:00"
uid="0" user="root" gid="0"
group="root"/>
</statcall>
The statcall element as sub-element of a job is different from the root element’s record, see section 12.6. The inner statcall record usually consists of two sub-elements:
· The file element records the name of the application in its name attribute. The content contains the hexadecimal-coded first 16 (or fewer) bytes of the application, if it was successfully run.
· The statinfo record is only present, if the stat call succeeded. Its attributes reflect a subset of the UNIX struct stat record:
o size records the file size in bytes. It is the only mandatory attribute.
o mode records the file type and permissions in the UNIX-typical octal notation. This attribute helps to distinguish whether a file was executable, or not a regular file at all.
o inode records the i-node number of the application. Please note that the i-node number is specific to the file system the file resides upon.
o nlink records the hard link count of the file.
o blksize records the block size of the underlying file system in bytes.
o atime records the access time of the file as ISO 8601 timestamp. This is effectively the moment in time the stat call was executed.
o mtime records the last time the file's data was modified as ISO 8601 timestamp.
o ctime records the last time the file's status was modified as ISO 8601 timestamp.
o uid records the file's owner as UNIX user id.
o user records the symbolic translation of uid according to the executing host. If the translation fails, the attribute is not generated.
o gid records the file's group as UNIX group id.
o group records the symbolic translation of gid according to the executing host. If the translation fails, the attribute is not generated.
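As an illustration of how to read the mode attribute and the content of the file element, the following Python sketch uses the standard stat module. The helper name describe is hypothetical. The input values are taken from examples in this document.

```python
import stat


def describe(mode_attr):
    """Interpret an octal mode string as found in a statinfo element."""
    mode = int(mode_attr, 8)
    if stat.S_ISREG(mode):
        kind = "regular file"
    elif stat.S_ISCHR(mode):
        kind = "character device"
    elif stat.S_ISFIFO(mode):
        kind = "fifo"
    else:
        kind = "other"
    # stat.filemode renders the mode the way "ls -l" does.
    return kind, stat.filemode(mode)


print(describe("0100755"))   # /bin/date
print(describe("020666"))    # /dev/null
print(describe("010640"))    # feedback FIFO

# The file element's content holds the first bytes of the file in hex.
magic = bytes.fromhex("7F454C46010101000000000000000000")
is_elf = magic[:4] == b"\x7fELF"   # ELF executables start with 0x7F 'E' 'L' 'F'
print("ELF executable:", is_elf)
```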
The arguments element records the argument vector as it is passed to the application. The content of the element is a space-separated list of the arguments passed to the application.
<arguments>-Isec</arguments>
Note to self: For some reason, the XML schema claims that the element name is command-line. This is wrong and requires fixing. For some reason, the XML parser never produced the necessary warnings.
The invocation element features a few more sub-elements to describe the remote execution. As you will notice, some elements are recycled from the job elements.
cwd records the current working directory in which kickstart was launched. This working directory is the same for all jobs.
<cwd>/home/voeckler</cwd>
uname tries to capture information about the architecture and OS where kickstart was launched. This information is useful when debugging. It features the following attributes:
· system records the down-cased operating system name, see "uname -s".
· archmode is an artificial flag to record the architecture. Its values are currently hard-coded into kickstart, and derive from the compiler settings. Future versions may want to substitute the hardware platform or processor information.
· nodename is the name of the execution host as set by its system administrator. This may be just a hostname, or a fully qualified domain name, see "uname -n".
· release is the version of the operating system, see "uname -r".
· machine is the machine's hardware name, see "uname -m".
· domainname is optional for certain GNU systems. It records the domain-name of the executing host. Its presence depends on how kickstart was compiled.
The content of the uname element depends on the operating system, see "uname -v".
<uname
system="linux"
archmode="IA32"
nodename="griodine.uchicago.edu"
release="2.4.31"
machine="i686">#1 SMP Thu Jun 9 12:55:56 CDT 2005</uname>
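To compare the uname record with what your own host reports, you can query the same system information from Python (POSIX only). This is a sketch for orientation. The archmode and domainname attributes are specific to kickstart and have no counterpart in os.uname, so they are omitted.

```python
import os

u = os.uname()

attributes = {
    "system": u.sysname.lower(),   # the PTR records the down-cased name
    "nodename": u.nodename,        # uname -n
    "release": u.release,          # uname -r
    "machine": u.machine,          # uname -m
}
content = u.version                # uname -v, the element's content

print(attributes)
print(content)
```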
usage is an excerpt of the UNIX struct rusage record, which records the resource usage of kickstart itself, but not its children. Because Linux fills in only some of the fields, not all fields are passed back. The attributes of the usage record are discussed in section 12.4.1.
<usage utime="0.000" stime="0.000"
minflt="35" majflt="115"
nswap="0" nsignals="0" nvcsw="0"
nivcsw="0"/>
statcall records may appear multiple times as direct children of the invocation element. They record struct stat information about various kickstart-related files, as well as about files kickstart was explicitly asked about with the -s and -S options.
The format is slightly richer than the statcall records described in section 12.4.3. The attributes of the statcall element at this position need to record which entity it pertains to. Thus, the set of attributes includes:
· error records the result of the UNIX stat call on the executable's file name. The error attribute is useful for debugging. If not 0, it provides UNIX's reason why it failed to start the application, e.g. the executable was not found where it was expected.
· id records the entity the stat was invoked upon. This is a small set of identifiers with the following possible values:
o gridstart pertains to kickstart itself. Kickstart tries to obtain its own stat information by attempting to stat argv[0].
o stdin, stdout and stderr refer to the system default file descriptors pre-connected to standard input, standard output and standard error respectively. A remote scheduling system may choose to connect these to unexpected things.
o logfile refers to the stat information about the logging file, if the -l option was used with kickstart. Otherwise, it pertains to kickstart's own stdout file descriptor.
o channel refers to the feedback FIFO kickstart creates to funnel data from kickstart-aware applications to the submit host.
o initial and final refer to the stat records specifically asked for by the -S and -s options respectively. These stat records will also set the lfn attribute.
· lfn is only used in conjunction with the initial and final ids. The attribute records the logical filename for the physical file of which the stat information is recorded.
The first and mandatory child of each of kickstart's own statcall elements is a sub-element describing a file or a file-like entity:
· The file element records the name of the application in its name attribute. The content contains the hexadecimal-coded first 16 (or fewer) bytes of the application, if it was successfully run.
· The descriptor element records a pre-opened file descriptor. Since UNIX does not provide a user-space and portable way to map a file descriptor to a filename, there is no file name. The attribute number records the descriptor number.
· The temporary element is used for temporary files which kickstart creates and destroys. Attributes record both, the file's name and descriptor.
· The fifo element pertains to named pipes. It records the file's name and descriptor, but also the number and extent of messages that passed through the FIFO.
Please refer to section 12.4.3 for details on the statinfo record.
If the file was a temporary file and data was captured in it, the first 64 pages of content are copied into the data element. A page on Linux is typically 4 kB, thus the first 256 kB are usually reported. The example below will illustrate the details.
There are a number of trailing statcall records towards the end of our example.
<statcall error="2" id="gridstart">
<file name="kickstart"/>
</statcall>
This first record from the example refers by the id attribute to kickstart itself. It shows that kickstart was unable to record stat information for itself. The error number 2 means ENOENT in UNIX, and refers to the error message "file not found". This is correct, as a file called "kickstart" is unlikely in your current working directory. However, the shell which invoked kickstart found it in its PATH.
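If you do not have the errno values memorized, you can translate the error attribute into its symbolic name and message text, for example with Python's standard errno and os modules. The message text is system dependent.

```python
import errno
import os

code = 2  # value of the error attribute in the gridstart record

name = errno.errorcode[code]   # symbolic name of the errno value
message = os.strerror(code)    # the system's message text for it

print(f"error={code}: {name} ({message})")
```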
<statcall error="0" id="stdin">
<file name="/dev/null"/>
<statinfo mode="020666" size="0"
inode="66938"
nlink="1" blksize="4096"
mtime="2003-01-30T04:24:37-06:00"
atime="2003-01-30T04:24:37-06:00"
ctime="2004-06-21T13:10:43-05:00"
uid="0" user="root" gid="0"
group="root"/>
</statcall>
Unless the concrete planner tells kickstart otherwise, the stdin file descriptor is typically connected to the /dev/null device. Please note how the mode shows that the file is a device and not a regular file.
<statcall error="0" id="stdout">
<temporary name="/tmp/gs.out.G9acZ2" descriptor="3"/>
<statinfo mode="0100600" size="25"
inode="229165"
nlink="1" blksize="4096"
mtime="2005-08-30T10:40:30-05:00"
atime="2005-08-30T10:40:30-05:00"
ctime="2005-08-30T10:40:30-05:00"
uid="500" user="voeckler" gid="500"
group="voeckler"/>
<data>2005-08-30T10:40:30-0500
</data>
</statcall>
Unless the concrete planner tells kickstart otherwise, the standard file descriptors stdout and stderr are connected to temporary files by kickstart. These temporary files capture the output an application may produce, but are removed when kickstart finishes. Note that with very verbose applications, you may flood a system's /tmp directory.
In the above case for stdout, kickstart captured contents produced on stdout into a file /tmp/gs.out.G9acZ2. The temporary file was also accessible with file-descriptor 3. Please note the new element temporary.
Additionally, for debugging purposes only, the first 64 pages (a page is 4 kB on Linux) of content from the file are returned as content of the data element. In our case, we see the output from the date application that we ran. The result includes the trailing line-feed that date produces.[2]
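When debugging, you may want to pull the captured output out of a PTR programmatically. The following Python sketch does this for an abbreviated record modelled on the example. The helper name captured is hypothetical, and the embedded XML is shortened for illustration. Note that all elements live in the namespace declared by xmlns, so lookups must be namespace-qualified.

```python
import xml.etree.ElementTree as ET

NS = {"iv": "http://www.griphyn.org/chimera/Invocation"}

# Abbreviated PTR; the stdout record is modelled on the example.
PTR = """<invocation xmlns="http://www.griphyn.org/chimera/Invocation" version="1.5">
  <statcall error="2" id="gridstart">
    <file name="kickstart"/>
  </statcall>
  <statcall error="0" id="stdout">
    <temporary name="/tmp/gs.out.G9acZ2" descriptor="3"/>
    <statinfo mode="0100600" size="25"/>
    <data>2005-08-30T10:40:30-0500
</data>
  </statcall>
</invocation>"""


def captured(root, stream_id):
    """Return the data content of the statcall with the given id, or None."""
    for statcall in root.findall("iv:statcall", NS):
        if statcall.get("id") == stream_id:
            data = statcall.find("iv:data", NS)
            return None if data is None else data.text
    return None


root = ET.fromstring(PTR)
out = captured(root, "stdout")
print(repr(out))
print(len(out))   # matches the size attribute, trailing line-feed included
```

A statcall without a data element, or an id that does not occur, yields None.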
<statcall error="0" id="stderr">
<temporary name="/tmp/gs.err.gQtdyL" descriptor="4"/>
<statinfo mode="0100600" size="0"
inode="229446"
nlink="1" blksize="4096"
mtime="2005-08-30T10:40:30-05:00"
atime="2005-08-30T10:40:30-05:00"
ctime="2005-08-30T10:40:30-05:00"
uid="500" user="voeckler" gid="500"
group="voeckler"/>
</statcall>
The data section is only available with kickstart's temporary files, and only if data was actually produced. It is useful for debugging. For instance, if an application fails unexpectedly, it is often helpful to see what it printed on stderr. Since our date example did not fail, it did not print anything onto its stderr.
Both Globus and remote scheduling systems typically connect stdio to /dev/null. Thus, kickstart facilitates grid debugging by providing access to otherwise lost data. However, the data element was not, and never will be, meant to pass production data from the remote site to the submit host. Please use the application feedback FIFO for such a purpose.
<statcall error="0" id="logfile">
<descriptor number="1"/>
<statinfo mode="0100644" size="0"
inode="3434245"
nlink="1" blksize="4096"
mtime="2005-08-30T10:40:30-05:00"
atime="2005-08-30T10:40:30-05:00"
ctime="2005-08-30T10:40:30-05:00"
uid="500" user="voeckler" gid="500"
group="voeckler"/>
</statcall>
The log file pertains to kickstart's own stdout file descriptor, which we captured into the file date.out with shell redirection. Kickstart itself only knows the file descriptor, hence the descriptor element. Its only attribute is the file descriptor number.
<statcall error="0" id="channel">
<fifo name="/tmp/gs.app.Mwci7t"
descriptor="5"
count="0" rsize="0" wsize="0"/>
<statinfo mode="010640" size="0"
inode="229531"
nlink="1" blksize="4096"
mtime="2005-08-30T10:40:30-05:00"
atime="2005-08-30T10:40:30-05:00"
ctime="2005-08-30T10:40:30-05:00"
uid="500" user="voeckler" gid="500"
group="voeckler"/>
</statcall>
The final statcall element in our example refers to the kickstart-aware application feedback channel. The channel is implemented as a UNIX FIFO (named pipe). The name of the FIFO is passed into the environment of all jobs. Whenever a job writes anything to the FIFO, it will be passed to the submit host on Globus's stderr channel. This way, a kickstart-aware application can communicate with the submit host. However, please note that the data is transported on a best-effort basis, and may not arrive on the submit host until the job has finished.
The fifo element contains several additional attributes besides the file's name and descriptor number. It records how many messages were received, how many bytes were read, and how many bytes in XML-encoded messages were sent.