Bare Metal Blog

System Administrator. Lover of craft beer. Washed up metalhead. FLAC enthusiast.

What Happened?

February 21, 2018 in #Homelab

tl;dr: a tragic RAID failure that lost all of my data

The Long Version

Anyone who knows anything about running servers knows the feeling when one of the blinking blue lights turns into a flashing red light...hard drive failure. With the advent of SSDs this is becoming less common (yes, they can still fail, but the failure rate is lower than that of HDDs). However, I am running 10 disks in my SuperMicro X8DTN. I had originally purchased 10 used 500GB HDDs in an eBay auction for around $300 USD. I knew that I wanted to run at least a RAID 5 or RAID 6 configuration, and 10 drives seemed like enough for either, with the added benefit of letting me set some drives aside as standby replacements.

As I've learned in the past year, used drives =/= new drives. I made it through four HDD failures over the past year with minimal downtime and no loss of data. Unfortunately, toward the end of 2017, two of the older drives failed at nearly the same time. Having two go almost back-to-back put me in a bit of a panic. I ordered two replacement drives. What I didn't account for was that I would be failing out two drives at once, which I had never done before.
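For a rough sense of the RAID 5 versus RAID 6 trade-off with ten 500GB drives, here is a minimal Python sketch. The function name `raid_summary` and the two-hot-spare split are assumptions made purely for illustration, not necessarily the layout this server ran, and the arithmetic ignores controller and filesystem overhead.

```python
# Rough capacity and fault-tolerance math for single-parity (RAID 5)
# and dual-parity (RAID 6) arrays. Illustrative only: raid_summary is a
# hypothetical helper, and real arrays lose some space to overhead.

PARITY_DRIVES = {5: 1, 6: 2}  # drives' worth of capacity spent on parity


def raid_summary(level, total_drives, drive_gb, hot_spares=0):
    """Return active drive count, usable GB, and tolerated failures."""
    if level not in PARITY_DRIVES:
        raise ValueError("only RAID 5 and RAID 6 are modeled here")
    parity = PARITY_DRIVES[level]
    active = total_drives - hot_spares
    # RAID 5 needs at least 3 active drives, RAID 6 at least 4.
    if active < parity + 2:
        raise ValueError("not enough active drives for this RAID level")
    return {
        "active_drives": active,
        "usable_gb": (active - parity) * drive_gb,
        # Failures the array can absorb before a rebuild completes.
        "tolerated_failures": parity,
    }


if __name__ == "__main__":
    # Assumed layout: 10 drives of 500 GB each, 2 held back as hot spares.
    for level in (5, 6):
        print(level, raid_summary(level, total_drives=10, drive_gb=500, hot_spares=2))
```

Under those assumptions, RAID 5 yields 3500 GB usable and RAID 6 yields 3000 GB. The key difference is that RAID 5 can only absorb one failed drive at a time while RAID 6 can absorb two, and a hot spare only restores that margin once the rebuild onto it has finished.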

The RAID Card

In my SuperMicro server, there is a dedicated RAID card that holds the configuration of the RAID array. I had tinkered with its software a handful of times and assumed that by now I had it figured out. I was wrong. I failed out the two broken drives and started running into issues when I tried to adopt the new drives into the array. The RAID card was quite unappreciative that I had failed out both drives at the same time. In fact, it was so upset that it decided to hold my data hostage and refuse to adopt the new drives. After some time messing with the RAID software, I started to worry that my data had vanished and the RAID was destroyed.

Coming To Grips

My worst fears were realized when I tried to boot the server to check whether it was still in a working state.

No boot devices found

Message received. I had ruined everything. Hundreds upon hundreds of long nights of work on VMs, along with thousands of hours' worth of my personal music collection, were gone. I took some time away. I gave in for a while. I didn't want to have to start over. But after taking some time away, I realized that I actually missed it. I missed having a constant project going. I missed spinning up VMs and wondering whether or not there was a way for me to "host that at home".

Moving Forward

I've quite obviously decided to get back on the horse and give it another go: start from scratch and use the knowledge that I've gained over the past two years as a System Administrator to build things better, smarter, and with more documentation. I'm still trying to devise a way to handle backups and things of that nature. I realize that this is just a homelab and it's all very volatile, but the idea is to hone my skills both at work and outside of it.

If you're still managing to read this, I'll consider you interested in joining me in my journey to becoming the best System Administrator I can be. Hopefully we'll learn new things together, and be better off for it.

I'm still going to do my album reviews. I'm still going to make posts about things I'm working on in my lab. I'm still going to try and provide a source of knowledge to those less knowledgeable than myself, and maybe be a source of inspiration to those who know significantly more than me. I'll still be posting Nagios custom check scripts that I develop. I'll try and be consistent with my posting. Hold me to it.

Thanks,

  • Buck