Using Dtrace to get system call Oracle Solaris


This post focuses on dtrace and truss command usages and benefits for Oracle Solaris 10 Operating System.Dtrace is a debugging tool which used in Oracle Solaris.

One of customer getting an unexpected error from java process . This process getting a sigterm signal.Application vendor couldn’t find which process sending sigterm signal to their application.

I tried to using truss command to get all attempts for the application process. I monitored application process with truss command and save all truss output to a file and also print every syscall and analyze it to find out why application is going down.

Take a  look this link to  check how to  monitor process with  truss  command.

 

</p>

#truss -o /tmp/syslog.truss.out -sall -p <PID>



/35: lwp_cond_wait(0x0026E948, 0x0026E930, 0xF5CFFA00, 0) Err#62 ETIME



/38:    pollsys(0x00000000, 0, 0xF54EF4D0, 0x00000000)  = 0



/41:    pollsys(0x00000000, 0, 0xF533F750, 0x00000000)  = 0



/39:    pollsys(0x00000000, 0, 0xF545F650, 0x00000000)  = 0



/37:    pollsys(0x00000000, 0, 0xF557F518, 0x00000000)  = 0



/40:    pollsys(0x00000000, 0, 0xF53CF5D0, 0x00000000)  = 0



/1:         Stopped by signal #24, SIGTSTP, in lwp_wait()



/4:         Received signal #15, SIGTERM, in lwp_cond_wait() [caught]



/49:        Received signal #25, SIGCONT, in pollsys() [default]



/49:          siginfo: SIGCONT pid=12351 uid=121



/16:    lwp_cond_wait(0x000BB248, 0x000BB230, 0x00000000, 0) = 0



/38:    pollsys(0x00000000, 0, 0xF54EF4D0, 0x00000000)  = 0



/41:    pollsys(0x00000000, 0, 0xF533F750, 0x00000000)  = 0



/52:    lwp_cond_wait(0x00D46948, 0x00D46930, 0x00000000, 0) = 0



/32:    lwp_cond_wait(0x00261448, 0x00261430, 0x00000000, 0) = 0



/53:    lwp_cond_wait(0x00D47248, 0x00D47230, 0x00000000, 0) = 0



/37:    pollsys(0x00000000, 0, 0xF557F518, 0x00000000)  = 0



/25:    lwp_cond_wait(0x000D2648, 0x000D2630, 0x00000000, 0) = 0



/21:    lwp_cond_wait(0x000C9248, 0x000C9230, 0x00000000, 0) = 0



/18:    lwp_cond_wait(0x000C1C48, 0x000C1C30, 0x00000000, 0) = 0



/8:     lwp_cond_wait(0x000A8A48, 0x000A8A30, 0x00000000, 0) = 0



/10:    lwp_cond_wait(0x000AEC48, 0x000AEC30, 0x00000000, 0) = 0



/34:    lwp_cond_wait(0x00265148, 0x00265130, 0x00000000, 0) = 0



/17:    lwp_cond_wait(0x000BFF48, 0x000BFF30, 0x00000000, 0) = 0



/22:    lwp_cond_wait(0x000CB048, 0x000CB030, 0x00000000, 0) = 0



/15:    lwp_cond_wait(0x000B9D48, 0x000B9D30, 0x00000000, 0) = 0



/3:     lwp_cond_wait(0x0009E248, 0x0009E230, 0x00000000, 0) = 0



/11:    lwp_cond_wait(0x000B0A48, 0x000B0A30, 0x00000000, 0) = 0



/33:    lwp_cond_wait(0x00262C48, 0x00262C30, 0x00000000, 0) = 0



/28:    lwp_cond_wait(0x000CFD48, 0x000CFD30, 0xF612F770, 0) = 0



/12:    lwp_cond_wait(0x000B1E48, 0x000B1E30, 0x00000000, 0) = 0



/27:        Received signal #1, SIGHUP, in lwp_cond_wait() [ignored]



/27:          siginfo: SIGHUP pid=12351 uid=121



/20:    lwp_cond_wait(0x000C7E48, 0x000C7E30, 0x00000000, 0) = 0



/27:    lwp_cond_wait(0x000D5848, 0x000D5830, 0x00000000, 0) = 0



/41:    pollsys(0x00000000, 0, 0xF533F750, 0x00000000)  = 0



/24:    lwp_cond_wait(0x000D1248, 0x000D1230, 0x00000000, 0) = 0



/13:    lwp_cond_wait(0x000B6C48, 0x000B6C30, 0x00000000, 0) = 0



/37:    pollsys(0x00000000, 0, 0xF557F518, 0x00000000)  = 0



/35:    lwp_cond_wait(0x0026E948, 0x0026E930, 0xF5CFFA00, 0) = 0



/6:     lwp_cond_wait(0x000A5948, 0x000A5930, 0x00000000, 0) = 0



/5:     lwp_cond_wait(0x000A4548, 0x000A4530, 0x00000000, 0) = 0



/9:     lwp_cond_wait(0x000AD848, 0x000AD830, 0x00000000, 0) = 0

 

As you see  at  one of the line  there  is  sigterm  signal. Application is  going down beacuse  of this  sigterm  signal.

/4:         Received signal #15, SIGTERM, in lwp_cond_wait() [caught]

The  next step  is  find  out  which process  sends  sigterm  signal to application.Fortunately, Oracle  Solaris has a powerful debugging tool which named  DTrace.It could  help us  to dig kernel system call and find  out what  is  doing  in background.

It’s time to  debugging kernel  system call  with  dtrace.First  I create a dtrace  probe  which shows system call.I listed  all syscall  with  “kill” command.I’ll try to get SIGTERM signal so  defined  signal  value “15” on dtrace  probe.

 

# kill -l



1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL



5) SIGTRAP      6) SIGABRT      7) SIGEMT       8) SIGFPE



9) SIGKILL     10) SIGBUS      11) SIGSEGV     12) SIGSYS



13) SIGPIPE     14) SIGALRM     15) SIGTERM     16) SIGUSR1



17) SIGUSR2     18) SIGCHLD     19) SIGPWR      20) SIGWINCH



21) SIGURG      22) SIGIO       23) SIGSTOP     24) SIGTSTP



25) SIGCONT     26) SIGTTIN     27) SIGTTOU     28) SIGVTALRM



29) SIGPROF     30) SIGXCPU     31) SIGXFSZ     32) SIGWAITING



33) SIGLWP      34) SIGFREEZE   35) SIGTHAW     36) SIGCANCEL



37) SIGLOST     38) SIGXRES     41) SIGRTMIN    42) SIGRTMIN+1



43) SIGRTMIN+2  44) SIGRTMIN+3  45) SIGRTMAX-3  46) SIGRTMAX-2



47) SIGRTMAX-1  48) SIGRTMAX

Dtrace  probe Script:

 

#!/usr/sbin/dtrace -qs

proc:::signal-send

/ args[2] == 15 /

{

printf("Process %d (%s) killing %d (%s)\n",

pid, execname, args[1]->pr_pid, args[1]->pr_fname);

}

Now let’s  find which  process sends sigterm signal with a  simple test.

Step 1:Start  dtrace  probe

root@TEST# vi /tmp/xx

#!/usr/sbin/dtrace -qs

proc:::signal-send

/ args[2] == 15 /

{

printf("Process %d (%s) killing %d (%s)\n",

pid, execname, args[1]->pr_pid, args[1]->pr_fname);

}

 

chmod  700  /tmp/xx

Run script and wait  for  SIGTERM signal.


 

root@TEST# .xx

Step 2:Create  a simple sleep process

root@TEST# sleep 1000&

[1] 16113

Step 3:Check process ID  for dumy sleep process

root@TEST# ps -ef|grep sleep|grep -v grep

root 16113 16108 0 10:15:12 pts/12 0:00 sleep 1000

Step 4:Send SIGTERM signal to dumy process

root@TEST# kill  -15  16113

Step 5:Check dtrace  prob  logs

We  find that the  process which send sigterm to our  dumy sleep  process  is “sh”  process related  to my  SSH session.

root@TEST# ./xx

Process 16249 (ludo) killing 16250 (lupi_zones)

Process 16249 (ludo) killing 16252 (lupi_bebasic)

Process 15703 (sh) killing 16113 (sleep)



root@TEST# ps -ef|grep 15703|grep -v grep

root 15703 15686 0 10:11:52 pts/16 0:00 -sh

root@TEST# ptree 15703

11306 /usr/lib/ssh/sshd

8980 /usr/lib/ssh/sshd

8981 /usr/lib/ssh/sshd

15684 /etc/omonitor/bin/logger /sbin/sh -sh

15686 -sh

15703 -sh

Also  Please watch this video.I show  my test results.

Tagged In:

I'm a IT Infrastructure and Operations Architect with extensive experience and administration skills and works for Turk Telekom. I provide hardware and software support for the IT Infrastructure and Operations tasks.

205 Total Posts
Follow Me