This post focuses on dtrace and truss command usages and benefits for Oracle Solaris 10 Operating System.Dtrace is a debugging tool which used in Oracle Solaris.
One of customer getting an unexpected error from java process . This process getting a sigterm signal.Application vendor couldn’t find which process sending sigterm signal to their application.
I tried to using truss command to get all attempts for the application process. I monitored application process with truss command and save all truss output to a file and also print every syscall and analyze it to find out why application is going down.
Take a look this link to check how to monitor process with truss command.
</p> #truss -o /tmp/syslog.truss.out -sall -p <PID> /35: lwp_cond_wait(0x0026E948, 0x0026E930, 0xF5CFFA00, 0) Err#62 ETIME /38: pollsys(0x00000000, 0, 0xF54EF4D0, 0x00000000) = 0 /41: pollsys(0x00000000, 0, 0xF533F750, 0x00000000) = 0 /39: pollsys(0x00000000, 0, 0xF545F650, 0x00000000) = 0 /37: pollsys(0x00000000, 0, 0xF557F518, 0x00000000) = 0 /40: pollsys(0x00000000, 0, 0xF53CF5D0, 0x00000000) = 0 /1: Stopped by signal #24, SIGTSTP, in lwp_wait() /4: Received signal #15, SIGTERM, in lwp_cond_wait() [caught] /49: Received signal #25, SIGCONT, in pollsys() [default] /49: siginfo: SIGCONT pid=12351 uid=121 /16: lwp_cond_wait(0x000BB248, 0x000BB230, 0x00000000, 0) = 0 /38: pollsys(0x00000000, 0, 0xF54EF4D0, 0x00000000) = 0 /41: pollsys(0x00000000, 0, 0xF533F750, 0x00000000) = 0 /52: lwp_cond_wait(0x00D46948, 0x00D46930, 0x00000000, 0) = 0 /32: lwp_cond_wait(0x00261448, 0x00261430, 0x00000000, 0) = 0 /53: lwp_cond_wait(0x00D47248, 0x00D47230, 0x00000000, 0) = 0 /37: pollsys(0x00000000, 0, 0xF557F518, 0x00000000) = 0 /25: lwp_cond_wait(0x000D2648, 0x000D2630, 0x00000000, 0) = 0 /21: lwp_cond_wait(0x000C9248, 0x000C9230, 0x00000000, 0) = 0 /18: lwp_cond_wait(0x000C1C48, 0x000C1C30, 0x00000000, 0) = 0 /8: lwp_cond_wait(0x000A8A48, 0x000A8A30, 0x00000000, 0) = 0 /10: lwp_cond_wait(0x000AEC48, 0x000AEC30, 0x00000000, 0) = 0 /34: lwp_cond_wait(0x00265148, 0x00265130, 0x00000000, 0) = 0 /17: lwp_cond_wait(0x000BFF48, 0x000BFF30, 0x00000000, 0) = 0 /22: lwp_cond_wait(0x000CB048, 0x000CB030, 0x00000000, 0) = 0 /15: lwp_cond_wait(0x000B9D48, 0x000B9D30, 0x00000000, 0) = 0 /3: lwp_cond_wait(0x0009E248, 0x0009E230, 0x00000000, 0) = 0 /11: lwp_cond_wait(0x000B0A48, 0x000B0A30, 0x00000000, 0) = 0 /33: lwp_cond_wait(0x00262C48, 0x00262C30, 0x00000000, 0) = 0 /28: lwp_cond_wait(0x000CFD48, 0x000CFD30, 0xF612F770, 0) = 0 /12: lwp_cond_wait(0x000B1E48, 0x000B1E30, 0x00000000, 0) = 0 /27: Received signal #1, SIGHUP, in lwp_cond_wait() [ignored] /27: siginfo: SIGHUP pid=12351 uid=121 /20: lwp_cond_wait(0x000C7E48, 0x000C7E30, 0x00000000, 0) = 0 /27: lwp_cond_wait(0x000D5848, 0x000D5830, 0x00000000, 0) = 0 /41: pollsys(0x00000000, 0, 0xF533F750, 0x00000000) = 0 /24: lwp_cond_wait(0x000D1248, 0x000D1230, 0x00000000, 0) = 0 /13: lwp_cond_wait(0x000B6C48, 0x000B6C30, 0x00000000, 0) = 0 /37: pollsys(0x00000000, 0, 0xF557F518, 0x00000000) = 0 /35: lwp_cond_wait(0x0026E948, 0x0026E930, 0xF5CFFA00, 0) = 0 /6: lwp_cond_wait(0x000A5948, 0x000A5930, 0x00000000, 0) = 0 /5: lwp_cond_wait(0x000A4548, 0x000A4530, 0x00000000, 0) = 0 /9: lwp_cond_wait(0x000AD848, 0x000AD830, 0x00000000, 0) = 0
As you see at one of the line there is sigterm signal. Application is going down beacuse of this sigterm signal.
/4: Received signal #15, SIGTERM, in lwp_cond_wait() [caught]
The next step is find out which process sends sigterm signal to application.Fortunately, Oracle Solaris has a powerful debugging tool which named DTrace.It could help us to dig kernel system call and find out what is doing in background.
It’s time to debugging kernel system call with dtrace.First I create a dtrace probe which shows system call.I listed all syscall with “kill” command.I’ll try to get SIGTERM signal so defined signal value “15” on dtrace probe.
# kill -l 1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP 6) SIGABRT 7) SIGEMT 8) SIGFPE 9) SIGKILL 10) SIGBUS 11) SIGSEGV 12) SIGSYS 13) SIGPIPE 14) SIGALRM 15) SIGTERM 16) SIGUSR1 17) SIGUSR2 18) SIGCHLD 19) SIGPWR 20) SIGWINCH 21) SIGURG 22) SIGIO 23) SIGSTOP 24) SIGTSTP 25) SIGCONT 26) SIGTTIN 27) SIGTTOU 28) SIGVTALRM 29) SIGPROF 30) SIGXCPU 31) SIGXFSZ 32) SIGWAITING 33) SIGLWP 34) SIGFREEZE 35) SIGTHAW 36) SIGCANCEL 37) SIGLOST 38) SIGXRES 41) SIGRTMIN 42) SIGRTMIN+1 43) SIGRTMIN+2 44) SIGRTMIN+3 45) SIGRTMAX-3 46) SIGRTMAX-2 47) SIGRTMAX-1 48) SIGRTMAX
Dtrace probe Script:
#!/usr/sbin/dtrace -qs proc:::signal-send / args[2] == 15 / { printf("Process %d (%s) killing %d (%s)\n", pid, execname, args[1]->pr_pid, args[1]->pr_fname); }
Now let’s find which process sends sigterm signal with a simple test.
Step 1:Start dtrace probe
root@TEST# vi /tmp/xx #!/usr/sbin/dtrace -qs proc:::signal-send / args[2] == 15 / { printf("Process %d (%s) killing %d (%s)\n", pid, execname, args[1]->pr_pid, args[1]->pr_fname); }
chmod 700 /tmp/xx
Run script and wait for SIGTERM signal.
root@TEST# .xx
Step 2:Create a simple sleep process
root@TEST# sleep 1000& [1] 16113
Step 3:Check process ID for dumy sleep process
root@TEST# ps -ef|grep sleep|grep -v grep root 16113 16108 0 10:15:12 pts/12 0:00 sleep 1000
Step 4:Send SIGTERM signal to dumy process
root@TEST# kill -15 16113
Step 5:Check dtrace prob logs
We find that the process which send sigterm to our dumy sleep process is “sh” process related to my SSH session.
root@TEST# ./xx Process 16249 (ludo) killing 16250 (lupi_zones) Process 16249 (ludo) killing 16252 (lupi_bebasic) Process 15703 (sh) killing 16113 (sleep) root@TEST# ps -ef|grep 15703|grep -v grep root 15703 15686 0 10:11:52 pts/16 0:00 -sh root@TEST# ptree 15703 11306 /usr/lib/ssh/sshd 8980 /usr/lib/ssh/sshd 8981 /usr/lib/ssh/sshd 15684 /etc/omonitor/bin/logger /sbin/sh -sh 15686 -sh 15703 -sh
Also Please watch this video.I show my test results.