Is this server stuffed?

Is this server stuffed

  • Not at all - you're just paranoid.

    Votes: 0 0.0%
  • Maybe - but with Windows how can you tell?

    Votes: 0 0.0%
  • Looks suspicious - you should worry.

    Votes: 0 0.0%
  • It's about to die spectacularly - look for another job.

    Votes: 0 0.0%
  • It's already dead. When are you going to bury it?

    Votes: 0 0.0%

  • Total voters
    0

time

Storage? I am Storage!
Joined
Jan 18, 2002
Messages
4,891
Location
Brisbane, Oz
We have a co-located server that has had a few issues:

Because of software problems (memory leak), it is set to reboot every night. Two to three weeks ago, it crashed about 9:00am (after customers had been using it for one to two hours). I think this was because a password change meant that the scheduled reboot hadn't happened for 4 - 5 days. The big switch fixed it.

About 10 days ago, it failed to complete the reboot and just sat there in limbo. The big switch fixed it. :(

Four days ago, around 2:20am (three hours after the reboot), it had a heart attack and shut down unexpectedly. Because it's co-located, it has a wall-sized true UPS and 24hr air-conditioning. The big switch fixed it. :cry:

It has a sister server at a different location that has demonstrated the reboot failure at least twice in the last three months (and a few other issues that are hard to be certain about). Both have Asus motherboards and mirrored disks with a Promise Controller card. We estimate they're 3 - 4 years old. Oh yeah, and they run Win2k (and a bunch of other stuff).

I'm a little sensitive about uptime because customers pay to access it. :bow:
 

time

Oops, forgot: while looking at the scripts, we checked the syntax of the Windows Shutdown command with 'shutdown /?'.

It displayed the help for the Call command! :eek:

As a test, we tried to use the command to shut down a remote PC. It displayed a lot of crap relating to the Java server, but that was it. :erm:

I rebooted it later, but to no avail - Shutdown help was still from a completely different program. :blue:
 

Mercutio

Fatwah on Western Digital
Joined
Jan 17, 2002
Messages
20,329
Location
I am omnipresent
Website
s-laker.org
Have you done an SFC on it? I've seen the "displays help for the wrong command" thing before, and seen SFC fix it.

For the rest, who can say for sure? Colo'd equipment might be making horrible grinding fan noises and you'd never know.
 

time

Thanks Merc, that's worth a shot.

I have a somewhat feeble theory for the unexpected power down:

When rebooted, some PCs slow or stop their fans momentarily (don't they?). And Asus likes to apply speed control through their motherboards.

What if the CPU fan stalled at the 11:20pm reboot? The P800 might well have made it to 2:20am before a thermal shutdown (assuming the board has one). Any takers?

Failing that, I'm thinking about replacing the power supply and waiting to see what happens - if I can stop worrying long enough.
 

Buck

Storage? I am Storage!
Joined
Feb 22, 2002
Messages
4,514
Location
Blurry.
Website
www.hlmcompany.com
When I first saw this topic, I thought you were referring to the server that our StorageForum host uses. :D

Anyway, you mentioned something about a sister server. Do you have any other servers that have their hardware identically configured to this one? How about at least using the same hardware model numbers? Any problem with those? Any possibility that the power option settings for this system changed in Windows or in the BIOS?

Although this could easily be a hardware issue, I had a similar situation with a customer last week. His desktop system began to have shutdown issues. Very quickly, these problems advanced to startup issues. Suddenly, his system would black screen at startup and freeze. He removed his drives from the system and brought the computer to me. We plugged in another drive with Windows 2000 (the same OS he runs) set up for the same VIA chipset (266A), and everything worked beautifully. There were no hardware issues, even though he was dead sure that the motherboard I sold him roughly two years ago was bad. The problem was software related. Fortunately, he's smart enough to fix his software woes; I just had to point him in that direction.

PS: Time, it is nice to have you posting on this forum again.
 

blakerwry

Storage? I am Storage!
Joined
Oct 12, 2002
Messages
4,203
Location
Kansas City, USA
Website
justblake.com
I agree, this sounds like the software got stuffed. That may be related to an underlying hardware problem though.

If it's been working for 4 years, I would think about reinstalling the OS and giving the inside of the case a once-over to make sure it's clean, there are no failed fans, and there are no leaking capacitors or loose connections between components.


If the server is only used from 7:00am to 5:00pm or whatever, you might want to think about running Memtest86+ on the board for a few hours during the off time, just to help confirm that the problem isn't hardware related.


btw, any power saving or fan throttling should be disabled if possible.
 

time

Nothing of interest, Fushigi. :(

Actually, we were wondering if there was some way to increase the verbosity of the logging, or add a third party logging tool that records more events?
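
Failing that, a crude heartbeat might at least narrow down when the box actually dies. What follows is only a sketch under assumptions: it assumes Python is available on the server, and the file name `heartbeat.log`, the 60-second interval and the function names are all made up for illustration. The idea is to append a timestamp to a file at a fixed interval, then look for gaps afterwards; the last timestamp before a gap is roughly when the machine stopped.

```python
import time
from datetime import datetime, timedelta

HEARTBEAT_FILE = "heartbeat.log"   # hypothetical file name
INTERVAL_SECONDS = 60              # hypothetical interval between beats


def write_heartbeat(path=HEARTBEAT_FILE, now=None):
    """Append one ISO-format timestamp line to the heartbeat file."""
    now = now or datetime.now()
    with open(path, "a") as f:
        f.write(now.isoformat() + "\n")


def read_heartbeats(path=HEARTBEAT_FILE):
    """Read all timestamps back from the heartbeat file, oldest first."""
    with open(path) as f:
        return [datetime.fromisoformat(line.strip()) for line in f if line.strip()]


def find_gaps(timestamps, interval_seconds=INTERVAL_SECONDS, tolerance=2.0):
    """Return (last_seen, next_seen) pairs where beats went missing.

    A gap is any pair of consecutive timestamps further apart than
    interval_seconds * tolerance.
    """
    limit = timedelta(seconds=interval_seconds * tolerance)
    gaps = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        if later - earlier > limit:
            gaps.append((earlier, later))
    return gaps


def run_forever(path=HEARTBEAT_FILE, interval_seconds=INTERVAL_SECONDS):
    """Write a heartbeat every interval_seconds until the process is killed."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_seconds)
```

Something would have to call `run_forever()` at startup, and `find_gaps(read_heartbeats())` after the next incident. The nightly reboot will show up as a short gap too, so only gaps that don't line up with the schedule are interesting. It won't say why the server went down, only when.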

Blakerwry, it's only been deployed for about two years (I think). And I've just realized I may have exaggerated the age. A couple of months ago, I had reason to believe that the motherboard might still be just inside the Asus 3-year warranty.

We had its sister apart at that point looking for any physical problems, but couldn't find any. What do you think about my fan theory?

The server may actually have multiple issues. The power-down could be totally separate from the 'sticking reboot'. Because that has happened on both servers, I suspect it may be a design fault (exacerbated by age) with the Asus boards. That is, they seem to be getting past Windows shutdown - at least visually.

Buck, thanks for the welcome back. This month is the first opportunity I've had to sit down and knock out a few posts.
 

blakerwry

I've never used the Asus boards that implement this, but most boards put the fan at 100% for the first few moments of boot-up to make sure the fans start, then drop the speed as temperatures permit. Anything else would have to be component specific and tested, or would be a design flaw if not.

Besides, on a warm boot the fan should never stop spinning (at least on the boards I've used). I don't think the fan throttling is the problem unless the fan is dying (I had a problem on my Antec PP303x where the dying fan wouldn't always start... this was a temperature-controlled fan driven by a temperature-sensitive resistor).


So I don't really think the fan throttling would cause a problem unless an underlying problem already existed. But it's something I would probably disable on any server, especially a co-located server where noise doesn't matter.
 

Tannin

Storage? I am Storage!
Joined
Jan 15, 2002
Messages
4,432
Location
Huon Valley, Tasmania
Website
www.redhill.net.au
If that was my machine, Time, I think I'd be resorting to one of my favorite brute-force diagnostic tricks right now. Replace either the hardware or the software. That is, either take a spare box and swap the existing hard drive into it, or take a spare hard drive and perform a 100% fresh install of the software onto it. Sit back and observe results.

A crude approach, to be sure, but so long as you have the spare parts around to implement it with, it takes a surprisingly short time to do (once you grasp the nettle and get on with it), and it can save you a lot of head-scratching and stuffing-about time as you ponder "what if" possibilities.

But the best thing about it is that, at a single stroke, you can wipe out a full 50% of your possible causes. Either it is hardware or it is software. Once you know which, you have a great load of confusion taken off your mind, and can often get down to the exact cause quickly and easily.
 

time

Thanks Tony. May I take this opportunity to ask you to comment on either the Epox capacitor or replacement thread? I'm at the point of considering Asus ...

There are 'technical difficulties' with replacing either all the hardware or all the software. Firstly, there are three hard drives (two mirrored through a Promise (!) controller). Secondly, reinstalling all the software would take about two days - if all goes well. :cry:
 

time

Hey, don't look at me - it's not my server!

I had been pushing for that, but doubts have been raised over what would happen in a restore given the software is split over two drives. However, it's still something that I'm determined to make happen after this is cleared up. Don't forget it's co-located.
 

blakerwry

Simply image one drive onto a set of optical media, tape, or an HDD and label it drive 1, then do the same for the other and label it drive 2...

what questions?
 

blakerwry

With compression it's usually not that bad... you could also make two Ghost images onto a single HDD (something I'd prefer over the CD route for that much data).
 

MaxBurn

Storage Is My Life
Joined
Jan 20, 2004
Messages
3,243
Location
SC
I have had three ASUS P4S533's in several flavors (-x -nothing etc) and with their CPU fan control what they do is set the fan to maximum and then step it down. On a reboot the fan instantly goes to maximum once the board is reset and then starts stepping down again in a few seconds once the board finishes POST. On my boards you can choose the minimum speed it will reduce to in the BIOS and you can turn the feature completely off. I would think this would be the same on other boards ASUS has made. Point is I don't think there is an instant that the fan doesn't receive power as the 12v rail doesn't get interrupted. You could also hard wire the fan to an extra HD plug to test with.

I agree with the idea of cutting the problem in half. You have to do something to determine if it's hardware or software. A fresh install of the software or pulling the disk and putting it in another server will answer some big questions.
 