ViJay's Blog: December 2009

Thursday, December 10, 2009

Analyzing Java Thread Dump - weblogic.kernel.Default

Sample Thread Dump JDK 1.5/WL9.2

Below is snippet from JDK1.5/Weblogic 9.2 thread dump

"[ACTIVE] ExecuteThread: '53' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x008c78c8 nid=0x38f4d in Object.wait() [0xd45ff000..0xd45ff9f0]
at java.lang.Object.wait(Native Method)
- waiting on <0xeade2668> (a weblogic.work.ExecuteThread)
at java.lang.Object.wait(Object.java:474)
at weblogic.work.ExecuteThread.waitForRequest(ExecuteThread.java:165)
- locked <0xeade2668> (a weblogic.work.ExecuteThread)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:186)

"[ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00efa208 nid=0x15 runnable [0xd93ff000..0xd93ffa70]
at java.net.PlainDatagramSocketImpl.receive0(Native Method)
- locked <0xe9606518> (a java.net.PlainDatagramSocketImpl)
at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
- locked <0xe9606518> (a java.net.PlainDatagramSocketImpl)
at java.net.DatagramSocket.receive(DatagramSocket.java:712)
- locked <0xf31a4af0> (a java.net.DatagramPacket)
- locked <0xe92f9968> (a java.net.MulticastSocket)
at weblogic.cluster.FragmentSocket.receive(FragmentSocket.java:202)
at weblogic.cluster.MulticastManager.run(MulticastManager.java:400)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)

Wednesday, December 9, 2009

Find User unix Command - fuser

=== Find User ===

fuser displays the PIDs of processes using the specified files or file systems.
fuser command is very useful in getting rid of the .nfs files created when processes are killed.
The .nfs* names refer to files that have been deleted but are still being held open by a running process.

You can find .nfs files using ls -la

rm does not work in removing .nfs files - use fuser to find the process that's locking the file and then kill the process

 /usr/sbin/fuser {filename}

kill -9 {pid}

If you stop the processes that have them open, the files will be tidied-up and removed by the file server.

General background Info:

Most operating systems, including UNIX, operate a policy of not actually removing a deleted file (and freeing up it's data blocks) until the last process that has the file open closes it. So, if a running process has a file open and you use the rm(1) command to delete the file the data blocks on the disk will not be freed up by the OS until the process that has the file open closes it.

On a UNIX host using local disk store this behaviour can manifest itself it some seemingly confusing situations. For example, you may wish to free up some space on a file system that's used 900 MB of it's 1GB quota. You have a large file, 200MB say, named myjava.jar that you believe is no longer required, but is actually currently open in your WebLogic server. Not knowing this you delete myjava.jar and do an ls(1) command to see that the file is no longer listed. However, when you use the df(1) command it still reports that 900 MB of it's 1GB quota is used because your WebLogic server still has the file open. When you shutdown the WebLogic server or the server closes the file the disk space will be released and df(1) will report 700MB of it's 1GB is used.

If the file that is removed is on NFS mounted store then it is possible for a file to be deleted on one client whilst still being open on another client. In this situation the same rule of not actually deleting the file until the last process with it open closes it still applies. However, in order for the NFS file handle used by the client that still has the file open not to be broken a filename reference must be maintained. In order to achieve this and remove the files name from the directory ( so it doesn't show up in an ls command output) the NFS file server renames the deleted file to a name beginning '.nfs'. These are the files you are seeing. When the last process with these files open dies or closes them they will be tidied up and removed. Trying to delete them before they are closed will only result in the file being renamed again.

Useful Unix Commands

=== Find Class within Jar file ===

To find a class within a binary jar file

 for i in `ls *.jar`; do (jar tf $i  grep '{classname}' ) && echo $i;done

If you want to search within all subdirectories use this

 for i in `find . -name ‘*.jar’`; do (jar tf $i  grep '{CLASSNAME}' ) && echo $i;done

=== Find string within file of name ===

find . -name *.xml -exec grep  {} \;

example

find . -name web.xml -exec grep -i servlet {} \;

=== Grep within zip file without unzipping it ===

gzip -c -d {file}.gz | grep {string}

Example

gzip -c -d admin_access.log0001_4Dec08_0147.gz | grep -i adq | wc -l

=== Find process using the port ===

The easy way to do this is using netstat passing the port number

netstat -a | grep 61014

If that does not help getting a pid, run this line below

for i in `ls /proc`; do pfiles $i | grep AF_INET | grep 61014 ; done

Output:

pfiles: permission denied: 12363
sockname: AF_INET 0.0.0.0 port: 61014
pfiles: permission denied: 12384

this shows the process appearing in between pids 12363 and 12384 uses that port.

Weblogic Related Issues/Solutions - Case 1

Problem:
Platform running Weblogic 8.1 on Sun V880 servers. Total RAM of 32 Gb on the machine.
2 Gb assigned to the managed server JVM heap. JDK 1.4

Initial settings:

-XX:+AggressiveHeap -Xms2048m -Xmx2048m  -XX:SurvivorRatio=32 -XX:MaxPermSize=128m

But still there are 20 Full GCs per hour in peak times, before the server crashes.

Analysis

1. It was decided to reduce the SurvivorRatio to 4 and restart with some more flags.
The size of ONE Survivor Space is calculated as
SurvivorSpace = NewSize / (SurvivorRatio + 2)
Keeping SurvivorRatio as 32 means the Survivor spaces are too small for promoting stuff from Eden. Hence we reduce this to 4 which allows for larger Survivor spaces.
2. As per Sun Bug ID: 6218833, setting AggressiveHeap set before Heapsize (Xmx and Xms) can confuse the JVM. Revert the order to have -Xms and -Xmx to come before -XX:+AggressiveHeap or not use it
3. The application has 180+ EJBs with pools of beans. Hence set the -Dsun.rmi.dgc.client.gcInterval=3600000 (1 hour) instead of the default 60000 (1 min). More on this here: http://docs.sun.com/source/817-2180-10/pt_chap5.html
4. The site is restarted once a week at 4:30AM. The patterns stays normal for 2 days – and then degrades into full GC.
5. The Old space is pretty much full – at every minor collection – the Old space must be cleared up for promotion from Young to Old to take place.
6. Permanent space is pretty much full – keeps loading classes and classes ( could that be a problem – the difference between the number of JSP’s per Release?)
Hence we increased the PermSpace from 128M to 256M
7. Ensure we are running the server JVM by using the -server flag
8. Use OptimizeIt or similar profiling tool to see the memory usage and find code bottlenecks.

The settings now were

-server -Xms2048m -Xmx2048m  -XX:MaxNewSize=512m -XX:NewSize=512m -XX:SurvivorRatio=4 -XX:MaxPermSize=256m -Xincgc -XX:+DisableExplicitGC -XX:+AggressiveHeap -XX:-OmitStackTraceInFastThrow

This reduced the Full GCs to one a day.

Error Logs

At the time of the server going out of memory prior to a crash, the logs are filled with repeated errors (up to 100 repetitions) of this sort

java.lang.NullPointerException
<>

Adding the -XX:-OmitStackTraceInFastThrow flag resolves this problem, the root cause of the NPE it self has to be tracked down but we do not have any longer the issue of huge recursive exception strings.

We could now see the stack trace as

java.lang.NullPointerException
at java.util.StringTokenizer.(StringTokenizer.java:117)
at java.util.StringTokenizer.(StringTokenizer.java:133)
at jsp_servlet._framework._security.__login._jspService(login.jsp:294)
at weblogic.servlet.jsp.JspBase.service(JspBase.java:27)
at weblogic.servlet.internal.ServletStubImpl$ServletInvocation

Java Object Life Cycle and Garbage Collection

Java object life cycle and garbage collection occurs like this:

Objects when created are always first allocated to Eden.
When Eden fills up, a fast but not comprehensive GC (minor collection) is run over the young generation only.
All surviving objects are moved from Eden into one Survivor Space.
In consequent minor collections, new objects move from Eden into the other Survivor Space, plus everything from the first Survivor Space (survivors from the previous minor collection) is also moved into the second Survivor Space. Thus one survivor should be empty at that time.
When objects in Survivor Space are old enough (or survivor fills up), they are moved to Tenured. By default the long-lived objects may be copied up to 31 times between the Survivor Spaces before they are finally promoted to the Old generation.
When tenured fills up, a Full GC collection is run that is comprehensive: the entire heap is analyzed, all objects that can be destroyed are killed and memory is reclaimed.
Note: the above lifecycle changes slightly when advanced options such as ConcurrentMarkSweep etc are enabled.

ViJay's Blog