Wise people learn when they can; fools learn when they must - Arthur Wellesley

Thursday, 9 October 2014

Solaris System Performance Monitoring -1 [Memory - vmstat]

                     
                      
SYSTEM PERFORMANCE - 1
[Memory: vmstat / sar / prstat]

First, I want to thank all the bloggers and the Oracle documentation for the great resources they provide. I would like to thank each of them individually, but I honestly no longer remember where I learned each part. While I was learning and preparing my notes I got help from blogs, books and YouTube, and I did not record which part came from where. All I have now are my notes. So thanks again to everyone who helped me gain this knowledge and confidence, and my sincere apologies for not naming you.

Performance monitoring is one of the most crucial parts of a system admin’s life, and I am pretty sure it is only really learned once we start sweating over a real problem. Think of facing a lion in the zoo versus facing a lion in the jungle… imagine the difference.

“IT CANNOT BE LEARNED, IT SHOULD BE EARNED”

Troubleshooting is an art rather than a technology…


So, what do we need to check for performance?

· Memory
· Disk
· Network
· CPU
· File system
· Bugs that cause a memory leak
· NFS cache

MEMORY

For memory we have vmstat [a system-wide summary of key activity].

As long as a system has enough physical memory, memory will not be the bottleneck.

When physical memory runs low, the kernel starts using swap space. It does this in two ways: paging (moving individual pages out to disk) and swapping (moving whole processes out).

Let’s start

root@sol-test-1:>/# vmstat i 5
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr -- -- -- --   in   sy   cs us sy id
 0 0 0 1314100 357412 8  18 14  0  1  0 20  0  0  0  0  502  179  246  0  1 99
 0 0 0 1306284 347812 0   8  0  0  0  0  0  0  0  0  0  501  121  194  0  1 99
 0 0 0 1306196 347724 0   2  0  0  0  0  0  0  0  0  0  505  120  195  0  1 99
^C
root@sol-test-1:>/# vmstat i 5
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr -- -- -- --   in   sy   cs us sy id
 0 0 0 1314092 357400 8  18 14  0  1  0 20  0  0  0  0  502  179  246  0  1 99
 0 0 0 1306284 347812 0   8  0  0  0  0  0  0  0  0  0  489  120  195  0  0 100
 0 0 0 1306196 347724 0   2  0  0  0  0  0  0  0  0  0  506  121  193  0  2 98
^C
root@sol-test-1:>/# vmstat i 2
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr -- -- -- --   in   sy   cs us sy id
 0 0 0 1314084 357388 8  18 14  0  1  0 20  0  0  0  0  502  179  246  0  1 99
 0 0 0 1306284 347812 0  21  0  0  0  0  0  0  0  0  0  474  229  195  0  2 98
 0 0 0 1306196 347724 0   4  0  0  0  0  0  0  0  0  0  459  233  193  0  1 99


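The number after vmstat is the interval in seconds: 5 sec in the first two runs and 2 sec in the third. The “i” only stands for “interval” and need not be typed; plain “vmstat 5” samples at the same rate.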


When you run this command you will notice that the first line of output appears immediately; it does not wait for the interval to pass.

Why is that?

I learned that the first line is a summary of the averages since the last reboot, not a live sample.

That is why some people say to ignore the first line when you care about what is happening right now.

Still, the first line gives a good idea of the system's overall health since boot.
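If only the live samples are wanted, one option is to filter that first data line out. This is a minimal sketch, assuming the layout shown above (two header lines, then the since-boot line as line 3); the function name skip_boot_summary is my own and is not part of vmstat.

```shell
# skip_boot_summary: hypothetical helper, not a vmstat option.
# Drops line 3 of the stream, i.e. the since-boot averages that
# follow the two header lines.
skip_boot_summary() {
    awk 'NR != 3'
}

# Demo on the first run captured above.
sample=$(cat <<'EOF'
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr -- -- -- --   in   sy   cs us sy id
 0 0 0 1314100 357412 8  18 14  0  1  0 20  0  0  0  0  502  179  246  0  1 99
 0 0 0 1306284 347812 0   8  0  0  0  0  0  0  0  0  0  501  121  194  0  1 99
 0 0 0 1306196 347724 0   2  0  0  0  0  0  0  0  0  0  505  120  195  0  1 99
EOF
)
printf '%s\n' "$sample" | skip_boot_summary
```

On a live system the same function would sit at the end of a pipe, for example “vmstat 5 | skip_boot_summary”.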

Let’s start with the first field

kthr       Number of kernel threads in each of the three states below

r          Number of kernel threads in the run queue
b          Number of kernel threads blocked waiting for resources (I/O, paging)
w          Number of swapped-out LWPs (lightweight processes) waiting for resources

memory

swap       Current free swap (KB).
free       Current free memory (KB).

To convert the swap / free values…

1314084    (swap) is already in KB…
1314084/1024 = 1283 MB
1283/1024    = about 1.25 GB

357388     (free) is already in KB…
357388/1024  = 349 MB
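The same arithmetic can be scripted so it does not have to be done by hand each time. This is just a sketch; kb_to_human is a name I made up for illustration.

```shell
# kb_to_human: hypothetical helper that converts a KB figure, as shown
# in the swap and free columns of vmstat, to MB and GB.
kb_to_human() {
    awk -v kb="$1" 'BEGIN {
        printf "%d KB = %.0f MB = %.2f GB\n", kb, kb / 1024, kb / 1024 / 1024
    }'
}

kb_to_human 1314084   # swap value from the output above
kb_to_human 357388    # free value from the output above
```

For the two values above it prints 1283 MB (1.25 GB) for swap and 349 MB (0.34 GB) for free.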

page       Report information about page faults and paging  activity

re         Page reclaims /sec.
mf         Minor faults /sec.
pi         KB paged in /sec.
po         KB paged out /sec.
fr         KB freed /sec.
de         Anticipated short-term memory shortfall (KB).
sr         Pages scanned /sec by the page scanner (clock algorithm, an approximation of LRU [least recently used])

disk

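s0         Number of disk operations /sec for each disk. There are slots for up to four disks; the letter is the disk type (s = SCSI) and the number is the logical unit number, so s0 is SCSI disk 0. A slot with no disk to report shows “--”, as in the output above.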

faults     Report the trap/interrupt rates /sec.
                  
in         Device interrupts  /sec
sy         System  calls  /sec
cs         CPU context  switches /sec

cpu        Percentage usage of CPU time [on a multiprocessor system this is the average across all processors]

us         % CPU time in user mode.
sy         % CPU time in system mode.
id         % CPU time idle.


So… what do we really look at?

We are analyzing memory… right?

“sr” from the page columns and “w” from the kthr columns.

OK… what do they mean?

“sr” is the scan rate. A high value should put us on alert, but we need to watch it for some time before acting: some process may simply be reading uncached data, in which case sr drops back after a while. If it stays high, we should start our hunt.

“w” is the number of threads swapped out. If it is, say, 20 and still climbing after each interval, then we are in trouble.
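The “watch it for some time” idea can be sketched as a small filter that only complains when sr stays above a limit for several samples in a row. Everything here is an assumption for illustration: the name check_scan_rate, the limit of 200 and the run length of 3 are mine, not values from vmstat or from Oracle, so pick limits that suit your own system. It relies on the column order shown above, where w is field 3 and sr is field 12.

```shell
# check_scan_rate: hypothetical helper; reads vmstat output on stdin.
# $1 = sr limit (assumed default 200), $2 = consecutive samples (assumed default 3)
check_scan_rate() {
    awk -v limit="${1:-200}" -v run="${2:-3}" '
        $1 !~ /^[0-9]+$/ { next }   # skip header lines
        !seen++          { next }   # skip the since-boot summary line
        {
            # count how many samples in a row are above the limit
            streak = ($12 > limit) ? streak + 1 : 0
            if (streak == run) {
                print "sustained scanning: sr=" $12 ", w=" $3
                hit = 1
            }
        }
        END { if (!hit) print "no sustained scanning" }
    '
}

# Demo on the third run captured above (sr is 0 in the live samples).
check_scan_rate 200 3 <<'EOF'
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr -- -- -- --   in   sy   cs us sy id
 0 0 0 1314084 357388 8  18 14  0  1  0 20  0  0  0  0  502  179  246  0  1 99
 0 0 0 1306284 347812 0  21  0  0  0  0  0  0  0  0  0  474  229  195  0  2 98
 0 0 0 1306196 347724 0   4  0  0  0  0  0  0  0  0  0  459  233  193  0  1 99
EOF
```

On the healthy test box above this reports no sustained scanning. A short spike resets the counter, which is the point: only a scan rate that refuses to come down gets flagged.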

OK… Now what?

Where to find the culprit?

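root@sol-test-1:>/#ps -e -o pid,rss,args | sort -n -k 2

           [Lists every process sorted by RSS in KB, smallest first, so the biggest consumers are at the bottom]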

root@sol-test-1:>/#prstat -t -s size -c 1 2

           [Sort on the basis of virtual memory size
                  {SIZE = total virtual memory size of the process}]

           [-t  reports a total usage summary for each user]

           [-s  sorts output lines by key in descending order; -S sorts ascending]

           [    cpu / pri / rss / size / time are the keys]

           [-c  do not overwrite, just print each new report below the old one]

           [1   is the interval of 1 sec]

           [2   is the count, i.e. repeat this 2 times]

root@sol-test-1:>/#prstat -t -s rss -c 1 2
           [Sort on the basis of Physical mem size {rss=resident set size}]

root@sol-test-1:>/#prstat -s size -Z
           [-Z  adds a per-zone summary so we can see the status of each zone]

root@sol-test-1:>/#prstat -t -s size -c 1 2
           [Run this inside the zone once a zone turns out to be the culprit]

root@sol-test-1:>/#sar -r 1 5

SunOS sol-test-1 5.10 Generic_147441-01 i86pc    10/09/2014

01:58:35 freemem freeswap
01:58:36   86545  2610712
01:58:37   86545  2610712
01:58:38   86545  2610712
01:58:39   86545  2610712
01:58:40   86544  2610696

Average    86545  2610709

freemem    Average pages available to user processes
           [It is counted in pages, so we need the “pagesize”]

freeswap   Disk blocks available for page swapping
           [It is counted in 512-byte disk blocks]

Calculate freemem [86545]

root@sol-test-1:>/#pagesize
4096
[on SPARC it is 8192]

86545*4096/1024/1024
= 338 MB

Calculate freeswap [2610712]

2610712*512/1024/1024
= 1274 MB
1274/1024
= about 1.24 GB
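Both sar conversions can be wrapped in one small function. This is a sketch only; sar_to_mb is a made-up name, and the page size defaults to the 4096 bytes reported on this x86 box, so pass 8192 on SPARC.

```shell
# sar_to_mb: hypothetical helper.
# $1 = freemem in pages, $2 = freeswap in 512-byte blocks,
# $3 = page size in bytes (assumed default 4096)
sar_to_mb() {
    awk -v pages="$1" -v blocks="$2" -v pagesize="${3:-4096}" 'BEGIN {
        printf "freemem  = %.1f MB\n", pages * pagesize / 1024 / 1024
        printf "freeswap = %.1f MB\n", blocks * 512 / 1024 / 1024
    }'
}

sar_to_mb 86545 2610712 4096   # values from the sar -r output above
```

On a live system the third argument can come straight from the pagesize command, for example sar_to_mb 86545 2610712 "$(pagesize)".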
root@sol-test-1:>/#vmstat i 2 2
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr -- -- -- --   in   sy   cs us sy id
 0 0 0 1310076 352304 4  10  6  0  0  0 10  0  0  0  0  497  117  219  0  1 99
 0 0 0 1305832 346672 0  21  0  0  0  0  0  0  0  0  0  513  261  201  0  0 100

root@sol-test-1:>/#bc
1305832/1024/1024
1          [swap: about 1.2 GB; bc drops the fraction]
352304/1024
344        [free: 344 MB]



Well… these are some of the tools that give us an idea of the memory status.
Troubleshooting is really about dealing with real situations, but these tools
will at least give a clear picture of memory usage and problems.



To be continued…


  
