HELP! Server Crashing

HELP! My server is constantly crashing!

A lot of things can cause a server to crash. This guide looks primarily at the hardware side. On the software side, the usual culprit is a process that runs out of control or uses too many resources. On the hardware side, the component that normally fails is the hard drive, simply because it is used so much and has moving parts. The RAM will occasionally go bad, but this is more common after the server has been moved or the RAM has been handled, because that gives it a chance of being shocked by static. Less commonly, the CPU, power supply, ethernet card, or motherboard may be going out.


A few important things first:

Does the server crash on a regular basis at the same time? If so, look at the cron jobs on the server. If you are running RHEL and cPanel, you should disable auditd because it may be causing the crashes.
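One rough way to see whether crashes cluster at the same time of day is to count boot messages per hour in the syslog. The sketch below assumes the traditional log format ("Mar  5 03:12:45 host kernel: Linux version ...") and treats the kernel's "Linux version" line as a boot marker. The function name boots_per_hour is made up for this example.

```shell
# boots_per_hour FILE
# Prints "HOUR COUNT" for every hour of the day in which a boot was logged.
# Assumption: FILE uses the traditional syslog format, where field 3 is
# HH:MM:SS and the kernel logs "Linux version ..." once per boot.
boots_per_hour() {
    grep 'kernel:.*Linux version' "$1" |
        awk '{ split($3, t, ":"); count[t[1]]++ }
             END { for (h in count) print h, count[h] }' |
        sort
}
```

Run it as boots_per_hour /var/log/messages. If most of the boots land in the same hour, compare that hour against the schedules in your crontabs.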

Check your /var/log/messages file for the time just before the crash and for when the server boots back up, because you will sometimes be able to see an error. If something like the ethernet card goes down you should also see it there, because the log files will still be written; you just will not be able to reach the server over the network.
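To pull out what was logged right before each boot, grep can print the lines leading up to a match. This is only a sketch: it again assumes that the kernel's "Linux version" line marks a boot, and context_before_boot is a name invented here.

```shell
# context_before_boot FILE [N]
# Prints the N lines (default 20) logged before each boot marker, followed
# by the boot marker itself. grep separates non-adjacent groups with "--".
context_before_boot() {
    grep -B "${2:-20}" 'kernel:.*Linux version' "$1"
}
```

For example, context_before_boot /var/log/messages 50 shows the last 50 log lines before every boot recorded in the file.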

Make sure your kernel and other software are at the latest version. A few kernel versions have shipped with bad drivers that can cause stability problems.

The easiest thing to look at is smartctl, a command line tool that reads status information directly from the drive. Almost every modern drive has SMART capability, which can be probed:

# smartctl -H /dev/sda

If that shows nothing bad take a look at

# smartctl -a /dev/sda

which will show any errors that the disk has encountered. A few errors are OK and normal, but if you have more than just a few then it is definitely something to check out. smartctl only reports what the drive says about itself, so there is a chance that the diagnostics are not correct. Because of this, the next best thing to do is to run a disk check. badblocks is a common tool used to check the disk for errors. Run it against the drive:

# badblocks -v -v /dev/sda

This disk check is going to take a while (at least a few minutes, and much longer on large drives) and will be constantly using your drive during this time. If your server is very busy, other services running in the background that need to access the disks may have trouble. If you still think that the disk is bad, I would suggest running badblocks 3-5 times. It is possible that the drive is just starting to go bad, and a problem might not be detected on the first run. If smartctl and badblocks both show nothing, chances are the disk drive is just fine.
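The two disk checks can be wrapped in a small script. The sketch below makes some assumptions: smartctl -H reports health as "test result: PASSED" on ATA drives or "SMART Health Status: OK" on SCSI/SAS drives, and badblocks prints one bad block number per line on stdout while its progress messages go to stderr. The function names smart_health_ok and repeat_badblocks are invented for this example.

```shell
# smart_health_ok: reads `smartctl -H` output on stdin and succeeds only
# if the drive reports itself healthy.
# Assumption: ATA drives print "... test result: PASSED", SCSI/SAS drives
# print "SMART Health Status: OK".
smart_health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# repeat_badblocks DEVICE [RUNS]
# Runs a read-only badblocks scan RUNS times (default 3) and returns
# non-zero as soon as one pass reports a bad block.
repeat_badblocks() {
    dev=$1
    runs=${2:-3}
    i=1
    while [ "$i" -le "$runs" ]; do
        # Count the bad block numbers on stdout; discard progress on stderr.
        bad=$(badblocks -v "$dev" 2>/dev/null | wc -l | tr -d ' ')
        echo "pass $i: $bad bad block(s)"
        [ "$bad" -eq 0 ] || return 1
        i=$((i + 1))
    done
}
```

As root this could be used as smartctl -H /dev/sda | smart_health_ok && repeat_badblocks /dev/sda 5, which only starts the long scans when the quick health check passes.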



The next thing to check on the server is the RAM. To do this we will use memtester, a very simple program that uses the RAM continuously and runs it through a series of tests in order to detect any problems with it. First download and compile memtester:

cd /usr/local/src
wget http://pyropus.ca/software/memtester/old-versions/memtester-4.0.6.tar.gz
tar -zxf memtester-4.0.6.tar.gz
cd memtester-4.0.6
make


Now that it has been compiled, we can run it. NOTE: memtester is going to use all of the RAM you give it, and your server load will probably jump pretty high. First check how much RAM you have in the system via

free -m

If for some reason this amount is not what you expect, the RAM may in fact be bad. It is in your best interest to find out why the correct amount is not present; it might simply be because of an error at the datacenter. Below is what you would run to test 1024 MB (1 GB) of RAM 5 separate times to hopefully catch any problems:

./memtester 1024 5

Since the server itself is using some of the RAM, you are going to get a message like the one below:
got 460MB (483188736 bytes), trying mlock ...too many pages, reducing...

Though it is best to close as much as possible before the test, it is usually not required. Chances are that if the server is actively using the RAM then it is probably good. Of course, if nothing else can be found, it might be worth closing everything and running memtester again. If all 5 tests come back with no problems, chances are that the RAM is OK.
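Instead of asking memtester for all of the RAM and letting it reduce the amount, you could size the test from what the kernel reports as available. This is only a sketch: the 80% figure is an arbitrary choice for this example, memtest_size_mb is an invented name, and the MemAvailable field only exists on newer kernels, so MemFree is used as a fallback.

```shell
# memtest_size_mb [MEMINFO_FILE] [PERCENT]
# Prints PERCENT (default 80) of the available memory, in whole MB.
# Assumption: values in the meminfo file are in kB, as in /proc/meminfo.
memtest_size_mb() {
    awk -v pct="${2:-80}" '
        /^MemAvailable:/ { avail = $2 }
        /^MemFree:/      { free = $2 }
        END {
            # Prefer MemAvailable; fall back to MemFree on older kernels.
            kb = (avail > 0) ? avail : free
            printf "%d\n", kb * pct / 100 / 1024
        }' "${1:-/proc/meminfo}"
}
```

From the memtester build directory this would be run as ./memtester "$(memtest_size_mb)" 5. It leaves part of the RAM untested, so a clean result says less than a run with everything else shut down.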


OK, that covers most of the stuff that normally goes wrong. The next thing we can try is simply to overload the server and see what happens. The program that we will use is simply called stress. It attempts to overload the CPU, memory, and hard drive all at once. First go ahead and download and compile the program to run the tests:

cd /usr/local/src
wget http://weather.ou.edu/~apw/projects/stress/stress-0.18.4.tar.gz
tar -zxf stress-0.18.4.tar.gz
cd stress-0.18.4
./configure
make && make install

Now we are going to actually run it. This is more or less a last-ditch effort to get the server to crash by simulating it being very busy. Since this test is relatively short, you may try running it a few times to see if you can get it to crash. If the server does crash, it does not necessarily mean that there is a hardware problem, but it does say that something is wrong. Assuming, as mentioned above, that you are running the latest drivers and kernel, it is probably best to switch out the hardware if the server crashes while running the test. Here is the command:

stress --cpu 8 --io 4 --vm 2 --vm-bytes 256M --timeout 60s
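If you do run the test several times, it helps to know afterwards which pass the server died on. The sketch below is a hypothetical wrapper (run_logged is not part of stress) that writes a line to a log file before and after every pass and calls sync, so the "start" line should already be on disk if the box goes down in the middle of a pass.

```shell
# run_logged RUNS LOGFILE COMMAND [ARGS...]
# Runs COMMAND RUNS times and records the start and exit status of each
# pass in LOGFILE.
run_logged() {
    runs=$1
    log=$2
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        echo "$(date '+%F %T') pass $i start: $*" >> "$log"
        sync    # flush the log to disk before putting load on the server
        status=0
        "$@" || status=$?
        echo "$(date '+%F %T') pass $i exit status $status" >> "$log"
        i=$((i + 1))
    done
}
```

For example: run_logged 3 /root/stress.log stress --cpu 8 --io 4 --vm 2 --vm-bytes 256M --timeout 60s (the log path is only an example). After a crash, a "start" line with no matching "exit status" line shows which pass was running.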



If none of the above turns anything up, the next thing to do is to just run "top" on the server and see if you can find anything weird. Maybe you will be lucky and there will be some message on the console, or you will see the server become very busy and crash simply because there is too much running. Hopefully this guide will help you identify hardware problems on your server, or at a minimum eliminate some of the doubt that you have in your server's hardware. This guide may be posted on a few forums, but the latest version can always be found at http://www.eth0.us/?q=crash .