Inaccurate Design

This is why I love RAID-Z on ZFS

Tuesday, 24 November 2009

Well, today I had to move all of my servers to a different room in the house. After the move, I powered them all back on and, out of curiosity, checked the status of my ZFS storage pool:

   yyyyy@xxxxx:~# zpool status storage
      pool: storage
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
     scrub: resilver completed after 0h0m with 0 errors on Sat Nov 21 17:15:26 2009

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0
            c9t1d0  ONLINE       0     0     4  164K resilvered
            c8t0d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0

    errors: No known data errors

An error? Yet it was picked up and corrected thanks to the built-in checksumming done by ZFS. Note that the count is in the CKSUM column, not READ or WRITE: the disk handed back data without reporting any I/O error, but that data did not match its checksum. This kind of silent corruption may have slipped through unnoticed, depending on your hardware RAID configuration.
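If you'd rather not eyeball the table every time, here is a rough sketch of pulling out just the devices with a non-zero CKSUM count. The function name `cksum_errors` is my own invention, and it assumes the column layout shown in the output above (NAME, STATE, READ, WRITE, CKSUM) with plain integer counts; other ZFS releases may format the table differently.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and prints
# "<device> <cksum-count>" for every row whose CKSUM column is non-zero.
# Assumes the five-column layout shown in the post and integer counts.
cksum_errors() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $NF == "CKSUM" { in_table = 1; next }
        # A blank line marks the end of the device table.
        in_table && NF == 0 { in_table = 0 }
        # Inside the table, field 5 is the CKSUM count.
        in_table && NF >= 5 && $5 ~ /^[0-9]+$/ && $5 > 0 { print $1, $5 }
    '
}

# Usage (on a machine that has the pool):
#   zpool status storage | cksum_errors
```

Fed the status output above, this should print only the `c9t1d0` row with its count of 4.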

To double check, I manually started a scrub:

yyyyy@xxxxx:~# zpool scrub storage
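The scrub runs in the background, so the command returns straight away. Below is a rough sketch of how you might tell when it has finished, by reading the `scrub:` line of `zpool status`. The function name `scrub_state` is made up, and the "in progress" wording is an assumption on my part; only the "completed" wording appears in the output earlier in this post, and other ZFS releases phrase this line differently, so the patterns may need adjusting.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and prints
# "in-progress", "completed", or "unknown" based on the "scrub:" line.
# The matched phrases are assumptions; adjust them to your ZFS release.
scrub_state() {
    awk '
        $1 == "scrub:" {
            if ($0 ~ /in progress/)     state = "in-progress"
            else if ($0 ~ /completed/)  state = "completed"
        }
        END { print (state ? state : "unknown") }
    '
}

# Usage (on a machine that has the pool): poll once a minute until done.
#   while [ "$(zpool status storage | scrub_state)" = "in-progress" ]; do
#       sleep 60
#   done
```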

And when that had completed (successfully), I cleared the error:

yyyyy@xxxxx:~# zpool clear storage

As it’s a one-off, I’m hoping it was just that. However, if it happens again on the same device, I’ll run the manufacturer’s tools on it and maybe replace it under warranty.
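Since clearing the errors resets the counters, the only way to spot a repeat is to keep a note of each incident somewhere. A minimal sketch of that idea follows, assuming a hand-kept log with one `<date> <device> <cksum-count>` line per incident. Both the log format and the `repeat_offenders` name are hypothetical, and the dates in the usage comment are made-up sample data.

```shell
#!/bin/sh
# Hypothetical helper: reads an incident log on stdin, one line per
# incident in the form "<date> <device> <cksum-count>", and prints
# "<device> <incident-count>" for devices logged more than once.
repeat_offenders() {
    awk '
        NF >= 3 { seen[$2]++ }
        END { for (d in seen) if (seen[d] > 1) print d, seen[d] }
    ' | sort
}

# Usage with made-up sample data:
#   printf '%s\n' '2009-11-21 c9t1d0 4' '2009-12-15 c9t1d0 2' | repeat_offenders
```

Any device this prints would be the one to point the manufacturer’s tools at first.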