Interruptions - rtj70
It makes sense to copy any text before posting to avoid losing it.

Edited by rtj70 on 15/06/2009 at 12:31

Interruptions - Stephen
More server problems mean that once again there will be several 10-minute interruptions today.
Please be warned not to compose long postings online that you might infuriatingly lose.


Well, it takes about a minute to switch the site to a different server (about 15 seconds to power down and 50 or so to power up). We moved the site several times last Thursday, and today, when HP go in to replace the motherboard on one of the machines, we will have to move it in a similar fashion. That applies to the site, the video server, and the ad server.
If all goes well you shouldn't notice anything unless you happen to be unlucky and hit the post button within that minute or so. If you are doing a long essay or are worried about losing anything, then follow rtj70's good advice below.
Sorry for the inconvenience.
We changed servers in an attempt to achieve greater reliability but it isn't yet happening.


It is unfortunate that we seem to have more failures than most, and even our support company has some sympathy here. However, without the redundancy that we have put in place we wouldn't be looking at the site now. We had a total failure of one server at 3am on Friday and of another at 8:30pm on Sunday. So to have the site still going with no data loss and only 20 minutes' downtime on Sunday isn't too bad.
For those interested, we believe there is a bad batch of RAM in the servers (which incidentally came direct from HP), and an older batch of RAM added on 22/5 seems to have problems too. One of the HP servers has problems with the HP monitoring software and is due to have its board replaced. We are currently one mirror server down, and the immediate backup server is OK but reporting RAM errors.
Hopefully HP can sort out all three servers today.

----------------------------------
Stephen Khoo
www.khoosys.net
Interruptions - mike hannon
Good luck with the alterations.
I hope HP don't charge you the same sort of rate as they charge for the ink in their printers...
Interruptions - Stephen
Thanks. We have next-day on-site support with them, so there shouldn't be a charge.
Interruptions - teabelly
Not good for HP! If there's a bad batch then it's probably best to clear all of it, get complete replacements, and insist they are tested first so you don't get another bad batch.

What did you have before that was worse?!
Interruptions - pmh2
Time to negotiate a good sized refund or discount from the server operators!




p
Interruptions - bell boy
why not try daddies sauce if hp is coagulating?
Interruptions - Pugugly
Well said BB
Interruptions - martint123
Just for info - if peeking at a user's profile, I get:-

Fatal error: Call to a member function style() on a non-object in /home/www/honestjohn.co.uk/html/scripts/_head.php on line 47

Interruptions - rtj70
So do I, now that you've highlighted it. I will let someone in support know about the error.
Interruptions - Dynamic Dave
Had the same error come up while doing something else earlier.

Have also reported a couple of other bugs that have crept into the system as well, i.e. borders missing when previewing, editing, or moving a post.

Edited by Dynamic Dave on 15/06/2009 at 22:28

Interruptions - theterranaut
Reminds me of the time that the company I worked for bought in a batch of several hundred Dell workstations. (They were big into large-scale PC deployments, still are, AFAIK. I was the network bod.)
Mysteriously, users were complaining of systems hanging. Suspicion was on the OS, but the OS guys could find nothing wrong.
I happened to be on site one day and was chatting to a user when blam! her workstation locked up. "There, it's done it again!" she exclaimed. The graphics output on-screen looked very messed up... and familiar. It looked very like when older OSes crashed and took the graphics output with them by overwriting the area of memory used for the display.
The Dell utility supplied for testing the systems came up with nothing, but running an old utility I had around (MemTest86) showed a very similar crash at a specific cycle in the test. This was replicable across the whole of the issued machines.

Long story, but does show that bad/incompatible memory does get supplied.

Dell replaced all the sticks FoC btw.

tt
Interruptions - Rattle
It's amazing how you can get away with running a system with faulty memory, though. I had to replace the RAM in a client's machine because it was doing that overlap thing, as you say. This was also a Dell, but a very basic base unit that didn't even have an AGP slot (AGP was modern at the time). Anyway, later on I built a system to play with Windows Server 2003 on, and as I had no money I built it using faulty parts. Amazingly, it actually worked with the same faulty RAM I had replaced in the client's machine.

I run memtest86 all the time but like to leave it running for several hours. I think the problem with RAM is that the cheaper stuff is just too mass-produced, and I have doubts as to what testing is actually done on it. Server RAM, thankfully, is a lot more expensive, so this issue with this site is worrying. If server RAM is supplied faulty, what hope do we have with the cheap £10 stuff we have in our desktops?

At least things are much more reliable now. I remember around ten years ago, when building a system, there was usually a 1 in 3 chance of something being DOA. Now, touch wood, I cannot remember the last time I had to return any component. I built two computers yesterday (one for my parents and a motherboard replacement for myself) and everything just worked flawlessly.

When we think about just how much memory is packed into such small DRAM chips, it's no wonder faulty RAM is still common. When you think that back in 1986 1MB of RAM was considered good, it's amazing that you can now buy 1024MB sticks for less than £10.
Interruptions - J Bonington Jagworth
"I run memtest86 all the time"

Me too. I used to have to remind myself to do it whenever I had a faulty PC to deal with (so easy to blame Windows) but it's amazing what it uncovers. By coincidence, I just had an HP laptop to look at that was unable to boot into safe mode, but mem86 reported loads of errors. Took out the extra RAM and all is sweetness and light!
Interruptions - Stephen
I think we just have a knack for getting a stroke of bad luck on all our servers.
On the previous lot we had 5 hard drive failures, 16 fans gone, 2 built-in network cards dead in 4 years. My hunch now is that the older stuff was just being cooked in the rack.
The new HP boxes monitor temperature and have provided evidence to support this hunch.
The main two at the bottom of the rack run at 19 degrees, whereas the backup one 3 feet higher is at 30 degrees. ... the old ones were at the very top of the rack. A month ago the backup server shut itself down due to overheating. Since then the ISP has installed a ventilated back door and is about to install a ventilated front door too. I would prefer all to be below 23 degrees so we shall see what they can achieve first. The problem is that HP state their operating specs are from 10 to 35 - so they are within range.
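To make the temperature comparison above concrete, here is a minimal sketch that checks the quoted readings against both limits. The readings, HP's 10 to 35 range, and the preferred ceiling of 23 are taken from the post; the function and label names are hypothetical and only for illustration.

```python
# Limits quoted in the post (same degree units as the readings).
HP_MIN, HP_MAX = 10, 35   # HP's stated operating range
PREFERRED_MAX = 23        # the poster would prefer everything below this


def classify(temp):
    """Classify a reading against HP's range and the preferred ceiling."""
    if temp < HP_MIN or temp > HP_MAX:
        return "out of spec"
    if temp >= PREFERRED_MAX:
        return "in spec, warmer than preferred"
    return "ok"


if __name__ == "__main__":
    # Readings quoted in the post; the labels are illustrative only.
    readings = {"main servers (bottom of rack)": 19, "backup (3 feet higher)": 30}
    for name, temp in readings.items():
        print(name, temp, classify(temp))
```

This shows the awkward position described: the backup server is warmer than preferred yet still inside the range HP specifies.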
The new boxes have also had issues. No 3 reported bad memory after a month, which then got replaced. This year the USB drive it booted off failed and couldn't be replaced easily. We upgraded all of them by 8GB 3 weeks ago and, with the problems reported on 1, memory-tested that machine for 3 days. No problems were reported in that time. We added another 8GB to all 3 on Thursday - a batch shipped direct from HP - and server 3 failed after 3 hours, 2 the next day at 3am, and 1 this last Sunday at 8:30pm. Removing these DIMMs solved the issue somewhat, with 3 now reporting errors on a previous 4GB stick and 2 doing similar ... when they didn't before.
Server 1 has an issue of its own and will get a new motherboard this afternoon. HP will resolve the issue with the others with 8 or so 4GB DIMMs.
This stuff is pukka HP of course - so is around £270 per 8GB. Not cheap then.
I am sure when they settle down they will be just fine. Linux is nice and stable, but the KVM virtualisation currently seems to have a memory leak where we are losing just under 10MB per hour on HJ and similar on the ad server. I expect a patch will fix this though.
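As a rough illustration of what a leak of that size adds up to, here is a minimal sketch. The 10MB per hour rate is the figure from the post, rounded up; the 8GB of headroom is purely an assumed example value, not a measurement from these servers, and the function names are hypothetical.

```python
# Rough projection of a steady memory leak.
LEAK_MB_PER_HOUR = 10  # "just under 10MB per hour" per the post, rounded up


def leaked_mb(hours, rate_mb_per_hour=LEAK_MB_PER_HOUR):
    """Memory lost after `hours` at a constant leak rate."""
    return hours * rate_mb_per_hour


def hours_until_exhausted(headroom_mb, rate_mb_per_hour=LEAK_MB_PER_HOUR):
    """Hours before `headroom_mb` of spare memory is used up."""
    return headroom_mb / rate_mb_per_hour


if __name__ == "__main__":
    print(leaked_mb(24))       # MB lost per day
    print(leaked_mb(24 * 7))   # MB lost per week
    headroom_mb = 8 * 1024     # assumed 8GB of spare RAM (hypothetical)
    print(hours_until_exhausted(headroom_mb) / 24)  # days of headroom
```

At the quoted rate that is roughly 240MB a day, or about 1.7GB a week, per affected virtual machine.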

Interruptions - teabelly
You considered Sun kit? You can run Linux on a lot of their stuff now. Used to buy their gear. Had ancient stuff running and would probably have one failure every other year. Worst year, I think, we had a CPU failure and a disk failure. Mostly it just ran.

If HP's own ram is this unreliable I'd be looking at third party stuff!

I'd also check the quality of the electricity supply. Is there a UPS in the same room? Ours used to turn the bottom row of monitors a funny colour as the magnetic field was so large! I'd suspect there is an issue with the room somehow, as I've never known kit to be as unreliable as you are experiencing. Heating, ventilation and possibly excess dust seem to be the most likely causes. Is there a dopey cleaner that goes in there and bangs into all the racks?!
Interruptions - theterranaut
It's getting like Slashdot in here now... :)


tt
Interruptions - J Bonington Jagworth
"Is there a dopey cleaner that goes in there and bangs into all the racks?"

LOL! Probably unplugs the whole lot so they can use the cleaning equipment...

Not original, I'm afraid. Based on this old story, which also explains the origin:
thoselegends.blogspot.com/2007/09/cleaner-polishes...l)
Interruptions - J Bonington Jagworth
"This stuff is pukka HP of course"

But where do they get it from? Carly Fiorina wrecked that company, IMO.

Edited by J Bonington Jagworth on 16/06/2009 at 20:15

Interruptions - theterranaut
You should do a backroom thread some time, Stephen, with the underpinnings of the HJ set up described; hardware, OS's, networking kit, etc, etc. Not for any other reason than that some of us would find it very interesting.

That said, might be a bit risky re known vulnerabilities/exploits. Still would be a good read though.

tt
Interruptions - J Bonington Jagworth
I second that, tt.
Interruptions - pmh2
I third that !


p
Interruptions - Falkirk Bairn
There are lots of benefits in SUN+Unix
Interruptions - Lud
"some of us would find it very interesting."


Others would find it as incomprehensible as a water vole would find a learned discussion on camshaft bearing cap bolts and why they apparently undo themselves for no discernible reason. But that's no reason not to have it.

Edited by Lud on 16/06/2009 at 20:58

Interruptions - Martin Devon
"Others would find it as incomprehensible as a water vole would find a learned discussion on camshaft bearing cap bolts and why they apparently undo themselves for no discernible reason."


Plastic Camshaft perchance?

MD
Interruptions - Martin Devon
"This stuff is pukka HP of course"

All pukka stuff is subcontracted. Isn't everything badge engineering?

MD