Good day everyone, I need some help with a zpool I created a long time ago. I have 8 drives in a RAIDZ1; each is a 3 TB Seagate 7200 rpm SAS drive. A couple of weeks ago one drive started throwing errors after going strong for almost 4 years, so I quickly replaced it with a spare I had just ordered. I wasn’t totally sure what commands to run, so I looked around on a few forums and on the ZFS wiki and found it would take just a few simple commands:
sudo zpool offline TheMass 9173635512214770897
sudo zpool labelclear /dev/sdc
sudo zpool replace TheMass 9173635512214770897 /dev/sdc
For context, here is my lsblk output:
sda         8:0     0  2.7T  0 disk
└─md124     9:124   0  2.7T  0 raid0
  ├─md124p1 259:11  0    2G  0 part
  └─md124p2 259:12  0  2.7T  0 part
sdb         8:16    0  2.7T  0 disk
└─md121     9:121   0  2.7T  0 raid0
  ├─md121p1 259:17  0    2G  0 part
  │ └─md116 9:116   0    2G  0 raid1
  └─md121p2 259:18  0  2.7T  0 part
sdc         8:32    0  2.7T  0 disk
├─sdc1      8:33    0  2.7T  0 part
└─sdc9      8:41    0    8M  0 part
sdd         8:48    0  2.7T  0 disk
└─md125     9:125   0  2.7T  0 raid0
  ├─md125p1 259:9   0    2G  0 part
  └─md125p2 259:10  0  2.7T  0 part
sde         8:64    0  2.7T  0 disk
└─md120     9:120   0  2.7T  0 raid0
  ├─md120p1 259:19  0    2G  0 part
  └─md120p2 259:20  0  2.7T  0 part
sdf         8:80    0  2.7T  0 disk
└─md123     9:123   0  2.7T  0 raid0
  ├─md123p1 259:13  0    2G  0 part
  │ └─md117 9:117   0    2G  0 raid1
  └─md123p2 259:14  0  2.7T  0 part
sdg         8:96    0  2.7T  0 disk
└─md122     9:122   0  2.7T  0 raid0
  ├─md122p1 259:15  0    2G  0 part
  │ └─md116 9:116   0    2G  0 raid1
  └─md122p2 259:16  0  2.7T  0 part
sdh         8:112   0  2.7T  0 disk
└─md119     9:119   0  2.7T  0 raid0
  ├─md119p1 259:21  0    2G  0 part
  │ └─md117 9:117   0    2G  0 raid1
  └─md119p2 259:22  0  2.7T  0 part
I removed the old sdc drive, installed the new one, and then ran those commands. The pool began to resilver and I thought everything was all right until I noticed that the new sdc drive doesn’t have the same layout as the other drives, and my performance isn’t what it used to be. The pool is up and running according to zpool status:
`  pool: TheMass
 state: ONLINE
  scan: scrub repaired 0B in 03:47:04 with 0 errors on Fri Oct 6 20:24:26 2023
checkpoint: created Fri Oct 6 22:14:02 2023, consumes 1.41M
config:
        NAME                        STATE     READ WRITE CKSUM
        TheMass                     ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            md124p2                 ONLINE       0     0     0
            scsi-35000c500562bfc4b  ONLINE       0     0     0
            md119p2                 ONLINE       0     0     0
            md121p2                 ONLINE       0     0     0
            md122p2                 ONLINE       0     0     0
            md123p2                 ONLINE       0     0     0
            md125p2                 ONLINE       0     0     0
            md120p2                 ONLINE       0     0     0`
So my question is: did I do this correctly? If not, what did I do wrong, and where, so I can fix it? Also, if you could give me the commands I would need, that would be amazing!
If there are any other commands you need me to run for information, just let me know!
I’m not familiar with ZFS on Linux, but what is 9173635512214770897 referencing? The command is usually zpool replace pool device [new_device]. So if you physically swapped out the old disk and put in a new one, you only need to specify the new disk. If you leave the old one plugged in, you list both (old one first).
I don’t know what best practice is for specifying disks to ZFS on Linux, but the Arch wiki suggests using the disk ID rather than /dev/sdc: https://wiki.archlinux.org/title/ZFS#Identify_disks
Also, you don’t need to offline the pool to replace a disk; you can keep using it while it resilvers.
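Following up on the disk ID suggestion, here is a minimal shell sketch for finding the persistent names that point at a given device node. The helper name by_id_for is made up for illustration, and it assumes a Linux system where udev populates /dev/disk/by-id and where readlink -f is available. It only prints names and does not touch the pool.

```shell
# Hypothetical helper: print every entry in a by-id directory that
# resolves to the given device node.
#   $1 = device node, e.g. /dev/sdc
#   $2 = directory to search (assumed default: /dev/disk/by-id)
by_id_for() {
    dev=$(readlink -f "$1") || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Example call (not run here):
#   by_id_for /dev/sdc
```

A disk can have more than one name in that directory, so the function may print several lines; any of them can be given to zpool in place of /dev/sdc.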
The number you’re talking about refers to the old disk. I had swapped out the old drive, and when I tried to run the command
zpool replace pool device [new_device]
it yelled at me that I needed to reference the old drive as well. So I’m not sure why it didn’t work the way you said it should.
Huh, annoying. You can run zdb -C TheMass to get more info about the pool and the disks in it. It might list enough disk detail to give you confidence it’s using the layout you want.
For me, identifying disks usually ends up being a matter of unplugging them one by one and checking which one shows OFFLINE. It could be worth the trouble to know for sure which disks the pool is specifying and using.
In any case, this is a good time to set up a backup for anything you can’t replace.
Yeah, that’s what I’m thinking of doing now: back up everything important just in case and continue to work on it.
Thankfully I can identify any disk rather easily with the command you mentioned (I’ve used it before to grab a drive’s serial number, which is printed on the drive itself).
Unless the lsblk output is wrong, you have a single-drive RAID0 configured on each of the other drives. I’m not sure why anyone would do that; I’d expect all of them to be set up in a similar fashion to sdc. It would explain the different device name.
The speed difference might or might not be the same issue. It might be a completely separate thing, like a different drive record size or something like that, so it might be a good idea to troubleshoot each problem separately.
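To double-check the single-drive RAID0 observation, a small sketch like the following lists each md array next to the device it sits on. The function name md_wrapped is hypothetical, and it assumes lsblk from util-linux with the NAME, TYPE and PKNAME columns. It only reads lsblk output and changes nothing.

```shell
# Hypothetical helper: read "NAME TYPE PKNAME" rows on stdin and print
# one line per device whose TYPE starts with "raid", together with its parent.
md_wrapped() {
    awk '$2 ~ /^raid/ && NF >= 3 { print $3 " -> " $1 " (" $2 ")" }'
}

# Example call (not run here):
#   lsblk -rn -o NAME,TYPE,PKNAME | md_wrapped
```

Going by the lsblk output posted earlier, every disk except sdc would be expected to show up with a raid0 array on top of it.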
I used to have it on an LSI RAID card a long time ago, before switching to an HBA card. I had each drive passed through the LSI card as a single-disk RAID, and then I used ZFS to create the pool. Now that I think about it, I’m guessing this is what caused it.
I have about 9 TB on this pool, so moving everything off it and redoing the pool is currently impossible. So how would I fix this? Replace one drive at a time with some of my spares and swap them around?
Yeah, one at a time would work, but it would be quite a bit of writing to rotate them all.
As for the performance, did you replace the failed drive with the same model or with a different one?
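The one-at-a-time rotation could look roughly like the sketch below, which only prints the commands for a single step rather than running them. plan_replace is a made-up helper, the spare’s device name is a placeholder, and zpool wait assumes OpenZFS 2.0 or newer; on older versions one would have to watch zpool status by hand instead.

```shell
# Hypothetical dry-run helper: print (do not execute) the commands for
# swapping one pool member for another and waiting for the resilver.
#   $1 = pool name, $2 = current member, $3 = replacement device
plan_replace() {
    pool=$1
    old=$2
    new=$3
    printf 'zpool replace %s %s %s\n' "$pool" "$old" "$new"
    printf 'zpool wait -t resilver %s\n' "$pool"
    printf 'zpool status %s\n' "$pool"
}

# Example call (not run here; the spare's name is a placeholder):
#   plan_replace TheMass md124p2 /dev/disk/by-id/<spare-id>
```

Each step would have to finish resilvering before the next one starts, since a RAIDZ1 vdev can only survive the loss of one disk.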
Is it possible to replace a disk with the same disk? That is, effectively wipe a disk and replace it with itself, so I don’t have to use my spare drives as rotation drives and add needless wear to them?
I made sure to replace the dead drive with the exact same model as the rest. All of them are Seagate 7200 rpm 3 TB SAS drives.