22

My Sunday Morning until afternoon. FML. So I was experiencing nightly reboots of my home server for three days now. Always at 3:12am strange thing. Sunday morning (10am ca) I thought I'd investigate because the reboots affected my backups as well. All the logs and the security mails said was that some processes received signal 11. Strange. Checked the periodics tasks and executed every task manually. Nothing special. Strange. Checked smart status for all disks. Two disks where having CRC errors. Not many but a couple. Oh well. Changing sata cables again 🙄. But those CRC errors cannot be the reason for the reboots at precisely the same time each night. I noticed that all my zpools got scrubbed except my root-pool which hasn't been scrubbed since the error first occured. Well, let's do it by hand: zpool scrub zroot....Freeze. dafuq. Walked over to the server and resetted. Waited 10 minutes. System not up yet. Fuuu...that was when I first guessed that Sunday won't be that sunny after all. Connected monitor. Reset. Black screen?!?! Disconnected all disks aso. Reset. Black screen. Oh c'moooon! CMOS reset. Black screen. Sigh. CMOS reset with a 5 minute battery removal. And new sata cable just in cable. Yes, boots again. Mood lightened... Now the system segfaults when importing zroot. Good damnit. Pulled out the FreeBSD bootstick. zpool import -R /tmp zroot...segfault. reboot. Read-only zroot import. Manually triggering checksum test with the zdb command. "Invalid blckptr type". Deep breath now. Destroyed pool, recreated it. Zfs send/recv from backup. Some more config. Reboot. Boots yeah ... Doesn't find files??? Reboot. Other error? Undefined symbols???? Now I need another coffee. Maybe I did something wrong during recovery? Not very likely but let's do it again...recover-recover. different but same horrible errors. What in the name...? Pulled out a really old disk. Put it in, boots fine. So it must be the disks. Walked around the house and searched for some new disks for a new 2 disk zfs root mirror to replace the obviously broken disks. Found some new ones even. Recovery boot, minimal FreeBSD Install for bootloader aso. Deleted and recreated zroot, zfs send/recv from backup. Set bootfs attribute, reboot........
It works again. Fuckit, now it is 6pm, I still haven't showered. Put both disks through extensive tests and checked every single block. These disks aren't faulty. But for some reason they froze my system in a way so that I had to reset my BIOS and they had really low level data errors....? I Wonder if those disks have a firmware problem? So that was most of my Sunday. Nice, isn't it? But hey: calm sea won't make a good sailor, right?

Comments
  • 5
    Too long, and Yes I read it. I feel your pain.
  • 0
    @Torbuntu sorry for that. I didn't plan it ;o) at least there is a happy end.
  • 0
    @Torbuntu system is working again, backups work again as well, I won't use Intel 535 SSDs anymore, and overal I have got some more XP in the field of investigating hardware failures
Add Comment