HELP! My server is constantly crashing!
Many things can cause a server to crash. This guide looks primarily at the hardware side. On the software side, the usual culprit is a process that runs out of control or uses too many resources. On the hardware side, the component that fails most often is the hard drive, simply because it is used so heavily and has moving parts. RAM occasionally fails too, but this is more common after the server has been moved or the RAM reseated, because handling exposes it to static shock. Less commonly, the CPU, power supply, Ethernet card, or motherboard may be going out.
A few important things first:
Does the server crash on a regular basis at the same time? If so, look at
the cron jobs on the server. If you are running RHEL and cPanel you should
disable auditd because it may be causing the crashes.
Check your /var/log/messages file for the period just before the crash and for
when the server boots back up, because you will sometimes see an error there.
If something like the Ethernet card goes down you should also see it, because
the log files will still be written; you just will not be able to reach the
server over the network.
Make sure your kernel and other software are at the latest version. There have
been a few kernel versions with bad drivers that can cause stability problems.
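As a rough sketch of the log check described above, the function below pulls
out lines that often accompany hardware or kernel trouble. The function name
scan_log and the pattern list are assumptions made for illustration; adjust
the patterns to whatever your own logs actually contain.

```shell
#!/bin/sh
# Sketch: list log lines that commonly accompany hardware or kernel trouble.
# The pattern list is an assumption for illustration, not an exhaustive set.
scan_log() {
    # $1 = log file to search (for example /var/log/messages)
    # -n prefixes each match with its line number, -i ignores case
    grep -n -i -E 'panic|oops|i/o error|machine check|out of memory|link is down' "$1"
}

# Usage:
#   scan_log /var/log/messages | tail -n 50
```

The line numbers in the output make it easier to go back and read what was
logged just before and just after each match.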
The easiest thing to look at is smartctl, which is a command line tool that
reads the status information from the drive directly. Almost every modern
drive has the SMART capability, which can be probed.
# smartctl -H /dev/sda
If that shows nothing bad take a look at
# smartctl -a /dev/sda
which will show any errors that the disk has encountered. A few errors are OK
and normal, but if you have more than just a few then it is definitely something
to check out. Since smartctl only reports what the drive itself has recorded,
there is a chance that the diagnostics are not correct. Because of this the
next best thing to do is to run a disk check. Badblocks is a common tool used
to check the disk for errors. Run it like this:
# badblocks -v -v /dev/sda
This disk check can take anywhere from a few minutes to several hours depending
on the size of the disk, and it will be reading from your drive constantly
during that time. If your server is very busy you may have trouble with other
services running in the background that need to access the disks. If you still
think that the disk is bad I would suggest running badblocks 3-5 times. It is
possible that the disk drive is just starting to go bad and the first run might
not detect any problems. If smartctl and badblocks both show nothing, chances
are the disk drive is just fine.
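The two disk checks above can be scripted roughly as follows. This is only a
sketch: the function names smart_health_ok and repeat_badblocks and the
/tmp/badblocks.N.txt output files are assumptions made for illustration, and
the health strings matched are the ones smartctl -H normally prints for ATA
and SCSI drives.

```shell
#!/bin/sh
# Sketch: interpret smartctl -H output and repeat a read-only badblocks pass.
# Function names and the /tmp output files are hypothetical.

# Reads smartctl -H output on stdin; succeeds if the drive reports healthy.
# ATA drives print "... test result: PASSED", SCSI drives print
# "SMART Health Status: OK".
smart_health_ok() {
    grep -q -E 'test result: PASSED|SMART Health Status: OK'
}

# Runs badblocks (read-only by default) $2 times against device $1 and
# stops at the first pass that fails or reports bad blocks.
repeat_badblocks() {
    dev=$1
    passes=$2
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "badblocks pass $i of $passes on $dev"
        # -o writes the list of bad blocks to a file; an empty file means none
        badblocks -v -o "/tmp/badblocks.$i.txt" "$dev" || return 1
        if [ -s "/tmp/badblocks.$i.txt" ]; then
            echo "bad blocks found on pass $i"
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}

# Usage (as root):
#   smartctl -H /dev/sda | smart_health_ok || echo "SMART reports a problem"
#   repeat_badblocks /dev/sda 3
```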
The next thing to check out on the server is the RAM. To do this we will use
memtester, which is a very simple program that uses the RAM continuously and
runs it through a series of tests in order to detect any problems with it.
First download and compile memtester:
cd /usr/local/src
wget http://pyropus.ca/software/memtester/old-versions/memtester-4.0.6.tar.gz
tar -zxf memtester-4.0.6.tar.gz
cd memtester-4.0.6
make
Now that it has been compiled we can run it. NOTE: when you run memtester it
is going to use all of the RAM you give it, and your server load will probably
jump pretty high. First check how much RAM the system has with
free -m
If for some reason this amount is not what you expect, the RAM may in fact be
bad. It would be in your best interest to have it checked out to find out why
the correct amount is not present; it might simply be because of an error at
the datacenter. Below is what you would run if you have 1024 MB (1 GB) of RAM
and want it tested 5 separate times to hopefully catch any problems.
./memtester 1024 5
Since the server is using some of the RAM you are going to get a message like
the one below:
got 460MB (483188736 bytes), trying mlock ...too many pages, reducing...
Though it is best to close as much as possible before the test, it is usually
not required. Chances are that if the server is using the RAM then it is
probably good. Of course, if nothing else can be found, it might be worth
closing everything and running memtester again. If all 5 runs come back with
no problems, chances are that the RAM is OK.
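Instead of guessing at the size to hand to memtester, you could work it out
from /proc/meminfo. The following is a minimal sketch; the function names
meminfo_mb and memtester_size_mb and the 64 MB safety margin are assumptions
made for illustration, not part of memtester itself.

```shell
#!/bin/sh
# Sketch: pick a memtester size from /proc/meminfo instead of guessing.
# Function names and the margin used below are hypothetical.

# Reads meminfo-formatted text on stdin and prints the named field in MB.
# $1 = field name, e.g. MemTotal or MemFree (values in the file are in kB).
meminfo_mb() {
    awk -v key="$1:" '$1 == key { print int($2 / 1024) }'
}

# Prints a test size in MB: free memory ($1) minus a safety margin ($2),
# never going below zero.
memtester_size_mb() {
    free_mb=$1
    margin_mb=$2
    size=$((free_mb - margin_mb))
    [ "$size" -gt 0 ] || size=0
    echo "$size"
}

# Usage:
#   free_mb=$(meminfo_mb MemFree < /proc/meminfo)
#   ./memtester "$(memtester_size_mb "$free_mb" 64)" 5
```

Asking for a little less than the free amount should reduce how often
memtester has to shrink the request on its own, at the cost of leaving some
RAM untested while the server is running.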
OK, that covers most of the stuff that normally goes wrong. The next thing we
can try is simply to overload the server and see what happens. The program
that we will use is simply called stress. It attempts to overload the CPU and
hard drive all at once. First go ahead and download and compile the program to
run the tests:
cd /usr/local/src
wget http://weather.ou.edu/~apw/projects/stress/stress-0.18.4.tar.gz
tar -zxf stress-0.18.4.tar.gz
cd stress-0.18.4
./configure
make && make install
Now we are going to actually run it. This is more or less a last-ditch effort
to get the server to crash by simulating it being very busy. Since this test
is relatively short you may try running it a few times to see if you can get
it to crash. If the server does crash it does not necessarily mean that there
is a hardware problem, but it does say that something is wrong. If it crashes
while running the test, and you are running the latest drivers and kernel as
suggested above, it is probably best to switch out the hardware if possible.
Here is the command:
stress --cpu 8 --io 4 --vm 2 --vm-bytes 256M --timeout 60s
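If you would rather scale the load to the machine than hard-code the worker
counts, something like the sketch below could build the command for you. The
function name build_stress_cmd and the ratio of two CPU workers per core are
assumptions made for illustration; with 4 cores it produces the same command
as the one above.

```shell
#!/bin/sh
# Sketch: scale the stress workers to the number of CPU cores.
# The two-workers-per-core ratio is an assumption, not a rule from stress.
build_stress_cmd() {
    cores=$1      # number of CPU cores
    timeout=$2    # how long to run, e.g. 60s
    echo "stress --cpu $((cores * 2)) --io $cores --vm 2 --vm-bytes 256M --timeout $timeout"
}

# Usage: run three rounds sized to this machine.
#   cores=$(getconf _NPROCESSORS_ONLN)
#   for round in 1 2 3; do
#       $(build_stress_cmd "$cores" 60s)
#   done
```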
If none of the above turns anything up, the next thing to do is just run "top"
on the server and see if you can find anything weird. Maybe you will be lucky
and there will be some message on the console, or you will see the server
become very busy and end up crashing simply because there is too much running.
Hopefully this guide will help you identify hardware problems on your server,
or at the minimum eliminate some of the doubt that you have in your server's
hardware.
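Since you are unlikely to be watching top at the moment of a crash, one option
is to save a snapshot whenever the load gets high. The sketch below is only an
illustration: the function name load_at_least, the threshold of 8, and the log
path are all assumptions.

```shell
#!/bin/sh
# Sketch: detect a high load average so a top snapshot can be saved.
# Reads /proc/loadavg-formatted text on stdin; succeeds if the 1-minute
# load average (first field) is at or above the threshold given in $1.
load_at_least() {
    awk -v limit="$1" '
        BEGIN   { rc = 1 }
        NR == 1 { if ($1 + 0 >= limit + 0) rc = 0 }
        END     { exit rc }'
}

# Usage (threshold and log path are hypothetical):
#   while sleep 60; do
#       if load_at_least 8 < /proc/loadavg; then
#           top -b -n 1 | head -n 30 >> /root/top-snapshots.log
#       fi
#   done
```

After the next crash, the last few snapshots in the log show what was running
while the load climbed.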
This guide may be posted on a few forums but the latest version can always
be found at http://www.eth0.us/?q=crash .