Site outage overnight

News items for the DSLR Users website. Also covers important upcoming events. Locked at present, although recognised members may post replies to news items.


Site outage overnight

Post by gstark on Sun Jan 06, 2013 10:12 am

For those who didn't notice, there was a site outage overnight.

It started around 10:30 last night. I noticed an issue within the hour but couldn't connect to the server, so I power-cycled it. Everything basically seemed to come back, but I wasn't comfy.

My discomfort was well founded: at about 4 am it went down again, with indications of a failing HDD.

After chatting with Corenetworks support, we were able to clone the dying HDD and replace it, and everything was back up as normal a little after 8 this morning.

Please let me know if you see anything new and unusual.
g.
Gary Stark
Nikon, Canon, Bronica .... stuff
The people who want English to be the official language of the United States are uncomfortable with their leaders being fluent in it - US Pres. Bartlet
gstark
Site Admin
 
Posts: 22919
Joined: Thu Aug 05, 2004 11:41 pm
Location: Bondi, NSW

Re: Site outage overnight

Post by Matt. K on Sun Jan 06, 2013 10:25 am

We, the dedicated followers of the forum, are rarely aware of the sweat, tears and dedicated skill it takes to keep this forum functioning. Bravo, Gary. Now get some well-earned sleep, and thank you from all of us. :cheers: :cheers: :cheers:
Regards

Matt. K
Matt. K
Former Outstanding Member Of The Year and KM
 
Posts: 9981
Joined: Mon Sep 06, 2004 7:12 pm
Location: North Nowra

Re: Site outage overnight

Post by Mj on Sun Jan 06, 2013 5:40 pm

Good job, Gary... I did notice that it went down, and I was receiving database connection errors.
I was more than confident that you would have it in hand... these things happen, of course. The trick is always to isolate the cause ASAP.

cheers,

Michael.
Photography is not a crime, but perhaps my abuse of artistic license is?
Mj
Senior Member
 
Posts: 1048
Joined: Fri Aug 20, 2004 3:37 pm
Location: Breakfast Point, Sydney {Australia}

Re: Site outage overnight

Post by gstark on Sun Jan 06, 2013 6:39 pm

Mj wrote: The trick is always to isolate the cause ASAP.


And therein lay the real challenge: there was no indication of why the site went down. There was nothing obvious in the logs, and I couldn't access the server at all after it had died: SSH was out, and Webmin and phpadmin were both out. Nothing.

I needed to power-cycle the box (which I can do from my control panel) and then see what was happening. There was nothing too major after the first outage, but I wasn't really comfortable; it certainly felt like the HDD was cactus.

Then, after restarting from the second outage, I had a couple of failures while trying to rsync the web sources, and then I could see a SMART error on /dev/sda in the log. By this time it was also falling over after perhaps five minutes of uptime, so by the time I'd logged the fault with a description of the symptoms, it had died several more times.

The last failure, though, confused me a little: Webmin wasn't responding correctly and the websites weren't responding at all, but I had fairly good access to the filesystem while I remained logged in through SSH. My thoughts then moved towards the PSU as the possible problem, and I relayed that to Corenetworks.

They listened to what I was saying, agreed that the SMART messages pointed to HDD failure, but checked the PSU as well. They then tried (successfully) to clone the old drive onto a new one, and away we went. In terms of support, I'd say it was about three hours from fault lodgement to being back online, with prompt and meaningful responses from people who were helpful and knowledgeable.

Yes, I know that this is what the standard should be, but the reality is that this is so rare that it becomes notable when it's actually achieved.

Had the cloning failed, I would have needed an IP-KVM to install a fresh OS onto the new HDD: probably another hour or two of work for me in managing that process.
g.

Re: Site outage overnight

Post by Geoff M on Sun Jan 06, 2013 6:41 pm

Well done, Gary. I did notice the outage, and I also noticed that at one point after the fix, a post from Rooz was missing from a thread I started, but it is now back.
Fuji X-Pro1 | X-E1 | X-T1 | XF14 | XF23 | XF27 | XF35 | XF56 | XF60 | XF10-24 | XF18-55 | XF55-200 | MCEX-11

http://gmarshall.zenfolio.com

http://xtographer.weebly.com
Geoff M
Senior Member
 
Posts: 1225
Joined: Sat Jun 18, 2005 10:54 pm
Location: Tamborine Mountain QLD.

