The recommendation to use ECC memory in enterprise systems is well known. At home, almost everyone gets by with non-parity memory. But what actually happens to a ZFS pool sitting on a system with defective memory and no parity checking? Will errors start pouring in, will problems silently accumulate, will everything grind to a halt? Especially since there is the well-known paper DRAM Errors in the Wild: A Large-Scale Field Study, in which the authors write: "Roughly a third of the machines, or roughly 8% of the DIMMs in our study, saw at least one correctable error per year." Frightening numbers. And then a link to exactly such a practical case was dropped into our chat.
As often happens, there is more noise than substance in the thread, so here is a short summary. Someone built a sort of half-baked pre-production test system using ZFS on Linux and non-ECC memory, part of which turned out to be faulty. According to him, the same hardware had previously run without visible problems, but after a sharp increase in I/O load strange issues started to appear.
The result after a scrub of the problem pool:
root@kvm2:~# zpool status
pool: zroot
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 5h25m with 0 errors on Tue Mar 10 04:05:24 2015
config:
NAME STATE READ WRITE CKSUM
zroot ONLINE 0 0 9
mirror-0 ONLINE 0 0 19
ata-WDC_WD20EARS-00S8B1_WD-WCAVY4199811 ONLINE 0 0 19
ata-WDC_WD20EARS-00S8B1_WD-WCAVY4452032 ONLINE 0 0 20
errors: 3 data errors, use '-v' for a list
That the problem was in the memory he only established after the fact; we, meanwhile, can take a look at what actually happened.
You can see that one disk of the mirror accumulated 20 checksum errors and the other 19, while at the pool level 9 remained uncorrected. I would explain the drop by roughly half of the errors having been corrected by the mirror. Further, only 3 uncorrectable errors actually landed on data. I cannot say for certain, but part of the errors may have been corrected thanks to ZFS keeping more than one copy of metadata.
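For reference, a sketch of how one could look at the relevant knobs on such a pool (the pool name zroot is taken from the output above; which properties exist depends on the ZFS version in use):
# data blocks are stored "copies" times (default 1); most metadata additionally gets extra ditto copies
zfs get copies zroot
# per-device and pool-level READ/WRITE/CKSUM counters plus the list of damaged files
zpool status -v zroot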
The report on which files are damaged:
root@kvm2:~# zpool status -v
...
errors: Permanent errors have been detected in the following files:
/zroot/kvm1/copy500/copy500.qcow2
/zroot/kvm2/prd2/prd2.qcow2
In any case the picture is fairly optimistic: out of 39 errors only three made it through to the data, damaging two files.
After this link was posted in our chat, several comrades shared their personal experience with bad memory on ZFS systems (starting from that post and over the next few). I quote them with some abridgement.
Power User: "A memory stick of mine died once: thousands (if not tens of thousands) of errors, I don't even remember whether READ or CKSUM. But I noticed it quickly and didn't lose a single file…
And how did I notice?
I have a habit going back to pre-ZFS times: after downloading something useful I checksum every file in the folder, and only then put it on the NAS. This time I needed to install something, the installer kept falling over, I checked the archive's checksum and it was bad; then I checked another archive (written a good couple of years earlier) and it was bad too, so now I'm panicking… After a couple of experiments it turned out that the 'killed' files were different (small) ones each time…
Then came zpool status and a new panic: errors on all the disks. I yanked the box straight out of the UPS while it was running… And only then did it occur to me to check the memory…"
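The habit Power User describes is easy to reproduce with stock tools; a minimal sketch, assuming GNU coreutils (file names are made up):
# before copying to the NAS: record checksums of everything in the folder
sha256sum * > SHA256SUMS
# later, against the copy on the NAS: any corrupted file is reported as FAILED
sha256sum -c SHA256SUMS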
jenci: "I had a problem with memory. Just 1 bit flipped (from time to time)… Every now and then a scrub would repair a couple of blocks on ZFS. I thought one of my disks in the raidz2 was the problem and was dying. It turned out to be 'simpler': one DIMM was bad."
OverLocker: "I also had trouble where a few bytes of errors would occasionally appear. Scrubs fixed them. The cause again turned out to be bad memory."
In summary, ZFS is clearly no panacea against bad memory, and errors caused by it do happen. But a substantial share of those errors (around 90% in the example above) is still fixed by ZFS. Under the low I/O load typical of a home setup (write once, keep it, read occasionally), running regular scrubs and watching the resulting error counters makes it quite realistic to catch failing memory before the data suffers seriously. And isolated sporadic faults, like those caused by cosmic rays and natural background radiation, ZFS corrects with a high probability.
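For completeness, the "regular scrub plus watch the counters" routine boils down to something like this (the pool name tank is a placeholder and the schedule is arbitrary):
# root's crontab: scrub on the first day of every month at 03:00
0 3 1 * * /sbin/zpool scrub tank
# afterwards, check the accumulated READ/WRITE/CKSUM counters and any damaged files
zpool status -v tank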
So personally I will probably stop worrying about ECC memory once and for all. It is definitely needed in production, under serious load. At home, ordinary memory will do.
This is exactly the same thing, down to the pool name, the GPT label names and minor differences in the number of filesystems. But that is not the point at all!
The point is that a known-working installation procedure (verified by me on a virtual machine, and in general one of the first sites you would consult) does not work. Even extremely simplified ways of creating ZFS fail, never mind the installation itself.
For example:
Creating a pool, creating a filesystem in it, dumping a heap of small files into it (src), then trying to copy them from that filesystem onto UFS ends, with varying success, in a large number of errors.
Creating a pool in a file on a (different) drive and running the same test (see above) gives the same result.
The conclusion: we are still a long way from problems with the installation procedure itself. This is not a drive problem; it looks more like a problem with ZFS. Maybe some kind of limits?
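The file-backed test described above can be reproduced roughly like this (paths and sizes are invented for illustration):
# backing file on a UFS-formatted drive
truncate -s 4g /ufs/zfs-test.img
zpool create testpool /ufs/zfs-test.img
zfs create testpool/src
# dump a heap of small files into the ZFS filesystem, then copy them back onto UFS
cp -R /usr/src/ /testpool/src/
cp -R /testpool/src/ /ufs/src-copy/
# any checksum errors hit during the copy show up here
zpool status -v testpool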
Source
[SOLVED] One or more devices has experienced an error resulting in data corruption (freenas-boot)
tomasi
Cadet
I have a new HP Gen8 MicroServer with the following disk configuration:
1. slot WD SSD (boot)
2. slot 8 TB HDD
3. slot 8 TB HDD
4. slot 8 TB HDD
The storage controller is in AHCI mode, boot is from slot 1, and the disks in slots 2-3 are a zpool in raidz1.
I moved the system dataset to freenas-boot (because of the SSD).
After boot, the server shows this alert:
CRITICAL: Nov. 19, 2017, 8:19 a.m. — The boot volume state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
I need help identifying what is wrong. I have already tried reinstalling FreeNAS several times. Is the system SSD a lemon?
Or point me in some direction.
In detail it looks like this:
After running smartctl, the test completed without error.
Here is the hardware configuration:
Build FreeNAS-11.0-U4 (54848d13b)
Platform Intel(R) Xeon(R) CPU E3-1220L V2 @ 2.30GHz
Memory 16312MB (ECC)
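The smartctl check referred to above (its output is not reproduced here) would typically be run like this; the device name ada0 is a guess for the boot SSD:
smartctl -t short /dev/ada0      # start a short self-test
smartctl -l selftest /dev/ada0   # read the self-test log once it has finished
smartctl -a /dev/ada0            # full identity, attribute and health dump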
Ericloewe
Not-very-passive-but-aggressive
Your boot SSD is clearly crapping out for some reason. Replace it and reinstall to new media.
Might as well take a look at the SMART data before doing so, though.
tomasi
Cadet
It must be the SSD.
SMART data below, but nothing I can point to.
Green750one
Dabbler
You could try replacing the cable on the SSD first. I’ve had numerous issues because of crappy cables!
tomasi
Cadet
It's not a cable (but good idea); I assumed a bad connection at first, which is why I bought an adapter (2.5" to 3.5") to mount it properly in the server's HDD bay.
http://www.raidsonic.de/products/internal_cases/mobile_racks/index_en.php?we_objectID=3570
I just need to wait to do an RMA for the SSD.
tomasi
Cadet
I got the WD Green SSD back from RMA. Got a new one in a new box.
Guess what?
The same error appeared after a reboot:
pool: freenas-boot
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: none requested
config:
NAME STATE READ WRITE CKSUM
freenas-boot DEGRADED 0 0 658
ada0p2 DEGRADED 0 0 2.57K too many errors
errors: Permanent errors have been detected in the following files:
//usr/local/www/freenasUI/system/ixselftests/__pycache__
//usr/local/www/freenasUI/support/__pycache__
//usr/local/www/freenasUI/system/alertmods/__pycache__
/var/db/system/rrd-76c11d7f8a944b3d8e42fe35420dbaa3/freenas.local/geom_stat/geom_ops_rwd-ada0p1.rrd
So.
The SSD was good, but something was still going badly wrong. Somehow restarting the server was corrupting files.
I ordered a new SSD of a different type, a Kingston 120GB SSDNow UV400, and after a fresh install it now looks fine. (I have made about 10 installs of FreeNAS.)
My idea is that the controller of the WD Green (Silicon Motion SM2256S) is incompatible with FreeBSD, or with the HP MicroServer Gen8, or with the B120i controller.
The Kingston has a Marvell 88SS1074 controller.
Green750one
Dabbler
The other thing to check is the PSU and how you have everything cabled. I've had issues with running too many devices on a single channel.
And I know you've re-cabled, but I'd do it again. I've also had issues with rubbish SATA cables.
Source
[SOLVED] ZFS I/O error
Drag_and_Drop
Member
I currently have an annoying issue on my home server.
Every couple of days my SSD ZFS pool fails with I/O errors.
But the drive is online, and when I delete the pool, format the drive and recreate a pool on it, it's healthy for another couple of days.
I can't see any I/O errors in the kernel log, the SMART data is also OK, and when I stress-test the drive before creating a ZFS pool on it, it's healthy as well (I did a read/write test with over 1 TB of data).
Also, rm /etc/zfs/zpool.cache didn't help.
So:
1. Where else can I look to find out why ZFS is failing?
2. Is there a way to force ZFS to bring the array online?
3. Something else I could try?
My final workaround will be to format the drive as btrfs or ext4, but if ZFS can do the job, I would prefer it.
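On question 2, the usual knobs for nudging a faulted pool back are zpool clear and a re-import; a sketch with a placeholder pool name:
zpool clear ssdpool                        # reset error counters / resume a suspended pool
zpool export ssdpool
zpool import -d /dev/disk/by-id ssdpool    # re-import using stable device names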
Nemesiz
Well-Known Member
You are using a single disk for the ZFS pool. Almost no protection.
Maybe your SSD is worn out? What does SMART say?
Try setting copies from 1 to 2 or 3.
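In command form, Nemesiz's suggestion looks like this (the dataset name is a placeholder); note that copies only affects data written after the property is changed:
zfs set copies=2 ssdpool/data
zfs get copies ssdpool/data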
Drag_and_Drop
Member
# smartctl -a /dev/sde
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.44-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: SandForce Driven SSDs
Device Model: KINGSTON SV300S37A120G
Serial Number: 50026B723A03FA88
LU WWN Device Id: 5 0026b7 23a03fa88
Firmware Version: 505ABBF1
User Capacity: 120,034,123,776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Mar 19 14:06:16 2017 CET
SMART support is: Available — device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0021) SCT Status supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 118 050 Pre-fail Always — 0/210880202
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always — 0
9 Power_On_Hours_and_Msec 0x0032 090 090 000 Old_age Always — 9596h+56m+36.500s
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always — 671
171 Program_Fail_Count 0x0032 000 000 000 Old_age Always — 0
172 Erase_Fail_Count 0x0032 000 000 000 Old_age Always — 0
174 Unexpect_Power_Loss_Ct 0x0030 000 000 000 Old_age Offline — 112
177 Wear_Range_Delta 0x0000 000 000 000 Old_age Offline — 9
181 Program_Fail_Count 0x0032 000 000 000 Old_age Always — 0
182 Erase_Fail_Count 0x0032 000 000 000 Old_age Always — 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always — 0
189 Airflow_Temperature_Cel 0x0000 024 065 000 Old_age Offline — 24 (Min/Max 10/65)
194 Temperature_Celsius 0x0022 024 065 000 Old_age Always — 24 (Min/Max 10/65)
195 ECC_Uncorr_Error_Count 0x001c 120 120 000 Old_age Offline — 0/210880202
196 Reallocated_Event_Count 0x0033 100 100 003 Pre-fail Always — 0
201 Unc_Soft_Read_Err_Rate 0x001c 120 120 000 Old_age Offline — 0/210880202
204 Soft_ECC_Correct_Rate 0x001c 120 120 000 Old_age Offline — 0/210880202
230 Life_Curve_Status 0x0013 100 100 000 Pre-fail Always — 100
231 SSD_Life_Left 0x0013 100 100 010 Pre-fail Always — 0
233 SandForce_Internal 0x0000 000 000 000 Old_age Offline — 25566
234 SandForce_Internal 0x0032 000 000 000 Old_age Always — 10216
241 Lifetime_Writes_GiB 0x0032 000 000 000 Old_age Always — 10216
242 Lifetime_Reads_GiB 0x0032 000 000 000 Old_age Always — 9862
SMART Error Log not supported
SMART Self-test Log not supported
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Source
For the third day I have been struggling, without success, to install FreeBSD 8.1 on ZFS. The machine is an Athlon X2, 2 GB of RAM, 750+750+500 GB drives. Yes, the drives are crappy WD Green, but I figure they will do for a home file dump / torrent box. SMART is fine.
How I install:
I boot from the 8.1-RELEASE-zfsv15-amd64 special edition CD from
http://mfsbsd.vx.sk/
(because that live CD comes with an SSH server)
I mount the freebsd8.1_disk1 ISO from a USB stick.
I install following the guide:
http://wiki.freebsd.org/RootOnZFS/GPTZFSBoot/Mirror
Everything goes fine up to the step
Quote:
chroot /zroot
at which point I get the message
Quote:
/bin/csh: Input/output error
Or an error about /zroot/libexec/ld-elf.so.1; it varies from run to run.
The command
Quote:
zpool status -v
prints the following
Quote:
zpool status -v
pool: zroot
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see:
http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zroot ONLINE 0 0 3
gpt/disk1 ONLINE 0 0 9
errors: Permanent errors have been detected in the following files:
/zroot/bin/tcsh
/zroot/libexec/ld-elf.so.1
An attempt to copy anything out of /zroot anywhere else results in the following
Quote:
mfsbsd# cp -r boot/ tmp/
cp: tmp/kernel/mfip.ko.symbols: Bad address
cp: tmp/kernel/arcmsr.ko.symbols: Bad address
cp: tmp/kernel/asmc.ko: Bad address
cp: tmp/kernel/geom_part_bsd.ko: Bad address
cp: tmp/kernel/cam.ko.symbols: Bad address
cp: tmp/kernel/ahc.ko.symbols: Bad address
cp: tmp/kernel/if_ae.ko.symbols: Bad address
cp: tmp/kernel/if_iwn.ko.symbols: Bad address
cp: tmp/kernel/iwn5150fw.ko: Bad address
cp: tmp/kernel/if_bwn.ko.symbols: Bad address
cp: tmp/kernel/cbb.ko.symbols: Bad address
cp: boot/kernel/kernel.symbols: Input/output error
cp: tmp/kernel/if_igb.ko: Bad address
cp: tmp/kernel/geo
… lots of output …
cp: tmp/zfsloader: Bad address
cp: tmp/loader: Bad address
cp: tmp/pxeboot: Bad address
Not all files turn out corrupted; some are read successfully.
Then, if I run
Quote:
zpool status -v
I get
Quote:
zpool status -v
pool: zroot
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see:
http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zroot ONLINE 0 0 436
gpt/disk1 ONLINE 0 0 889
errors: Permanent errors have been detected in the following files:
zroot:<0x301>
zroot:<0x402>
zroot:<0x307>
zroot:<0x40b>
zroot:<0x60b>
zroot:<0x40c>
… lots of output …
zroot:<0x2fb>
zroot:<0x3fe>
zroot:<0x2ff>
Here, too, the picture changes from run to run: sometimes only the object addresses, sometimes the names of corrupted files, sometimes both. It is also suspicious that, compared with the previous output, the readable entries about the corrupted /zroot/bin/tcsh and /zroot/libexec/ld-elf.so.1 have disappeared.
What I have tried:
A pool with and without a mirror: the same thing.
Forcing pool versions 14 and 15 (yes, I know 8.1 will not boot from v15 by default): the same thing.
Creating the pool in a file, with the file on UFS: the same thing.
Installing the system onto a dedicated (experimental) UFS partition, chrooting into it, unloading the old zfs.ko and opensolaris.ko and loading them from the new /, then installing using the utilities and modules of that new /: the same effect.
Zeroing the drives with dd, twice. It does not help.
The most interesting part: installing into a virtual machine (VirtualBox) with the same manual and the same installation sources finishes fine and the system boots successfully (although during boot it spews a few ATA-related errors, but that is apparently down to the default kernel; the installation disc complains the same way). On the VM I installed both with and without RAID, and it works either way.
What would you advise?
Several permanent errors were reported on my zpool today.
pool: seagate3tb
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: none requested
config:
NAME STATE READ WRITE CKSUM
seagate3tb ONLINE 0 0 28
sda ONLINE 0 0 56
errors: Permanent errors have been detected in the following files:
/mnt/seagate3tb/Install.iso
/mnt/seagate3tb/some-other-file1.txt
/mnt/seagate3tb/some-other-file2.txt
Edit: I'm not sure if those CKSUM values are accurate. I was redacting data and may have mangled them by mistake. They may have been 0. Unfortunately, I can't find a conclusive answer in my notes and the errors are resolved now, so I'm not sure, but everything else is accurate and reflects what zpool was reporting.
/mnt/seagate3tb/Install.iso is one example file reported as having a permanent error.
Here's where I get confused. If I compare my "permanently errored" Install.iso against a backup of that exact same file on another filesystem, they look identical.
shasum "/mnt/seagate3tb/Install.iso"
1ade72fe65902b2a978e5504aaebf9a3a08bc328 /mnt/seagate3tb/Install.iso
shasum "/mnt/backup/Install.iso"
1ade72fe65902b2a978e5504aaebf9a3a08bc328 /mnt/backup/Install.iso
cmp /mnt/seagate3tb/Install.iso /mnt/backup/Install.iso
diff /mnt/seagate3tb/Install.iso /mnt/backup/Install.iso
The files seem to be identical. What’s more, the file works perfectly fine. If I use it in an application, it behaves like I’d expect it to.
As the docs state:
Data corruption errors are always fatal.
But based on my rudimentary file verifications, I'm not sure I understand the definition of fatal.
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
Maybe I'm missing something, but the file seems perfectly fine as far as I can tell; it doesn't need any restoration, nor does it show any corruption, despite the recommendation from ZFS.
I’ve seen other articles with the same error, but I have yet to find an answer to my question.
What is the permanent error with the file? Is there some lower-level issue with the file that's just not readily apparent to me? If so, why would that not be detected by a shasum as a difference in the file?
From a layperson’s perspective, I see nothing to indicate any error with this file.
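A common follow-up once the affected files have been verified or restored is to clear the counters and scrub again so ZFS re-checks everything; a sketch using this pool's name (whether the permanent-error list empties depends on the results of the subsequent scrub):
zpool clear seagate3tb
zpool scrub seagate3tb
zpool status -v seagate3tb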
-
#1
Code:
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 1.67T in 2 days 02:43:35 with 1 errors on Fri Sep 4 17:13:53 2020
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 1
raidz1-0 DEGRADED 0 0 2
label/zdisk1 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk2 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk3 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk4 ONLINE 0 0 0 block size: 512B configured, 4096B native
replacing-4 UNAVAIL 0 0 0
13239389112982662359 UNAVAIL 0 0 0 was /dev/label/zdisk5/old
label/zdisk5 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk6 ONLINE 0 0 0 block size: 512B configured, 4096B native
errors: Permanent errors have been detected in the following files:
tank/video@autosnap-weekly.2020-08-09.00:00:00:/tank/Blah/Blah/Blah/FooBar.mp4
What is the correct way forward here? I have read the illumos link, but it leaves me scratching my head. Questions:
- Can I try deleting without breaking things?
- Should I try to rm the file, or destroy the snapshot?
- Or am I looking at restoring from backup either way?
Last edited: Sep 9, 2020
-
#2
Snapshots, MySQL data files and at times images have been the culprits. Destroying the snapshots with zfs destroy and restarting the server should bring it back online.
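Spelled out against the status output earlier in the thread, that advice is roughly:
# list snapshots of the affected dataset
zfs list -t snapshot -r tank/video
# destroy the snapshot that zpool status named as holding the damaged block
zfs destroy tank/video@autosnap-weekly.2020-08-09.00:00:00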
-
Thread Starter
-
#3
I destroyed the snapshot and rebooted. Now I have this:
Code:
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Fri Sep 4 18:44:08 2020
42.1G scanned at 1.00G/s, 19.4M issued at 472K/s, 10.0T total
0 resilvered, 0.00% done, no estimated completion time
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
label/zdisk1 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk2 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk3 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk4 ONLINE 0 0 0 block size: 512B configured, 4096B native
replacing-4 DEGRADED 0 0 0
13239389112982662359 UNAVAIL 0 0 0 was /dev/label/zdisk5/old
label/zdisk5 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk6 ONLINE 0 0 0 block size: 512B configured, 4096B native
errors: Permanent errors have been detected in the following files:
<0x159>:<0x11bb9>
So I guess I will see how the resilver goes…
Thanks for the advice!
-
#4
The partitions will come back online, though the errors may remain. If you moved the server, you may need to unplug the drive and plug it back in.
-
Thread Starter
-
#5
I assume I should wait for the resilver to finish…?
FYI, I still have the drive that is being replaced. I don’t know if that could help or not.
-
#6
Wait and see.
-
Thread Starter
-
#7
sudo zpool status -v tank
Code:
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 1.66T in 2 days 07:35:28 with 1 errors on Mon Sep 7 02:19:36 2020
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 1
raidz1-0 DEGRADED 0 0 2
label/zdisk1 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk2 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk3 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk4 ONLINE 0 0 0 block size: 512B configured, 4096B native
replacing-4 DEGRADED 0 0 0
13239389112982662359 UNAVAIL 0 0 0 was /dev/label/zdisk5/old
label/zdisk5 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk6 ONLINE 0 0 0 block size: 512B configured, 4096B native
errors: Permanent errors have been detected in the following files:
tank/video@autosnap-weekly.2020-08-16.00:00:00:/tank/Blah/Blah/Blah/FooBar.mp4
I.e. same file, different snapshot. So…
sudo zfs destroy tank/video@autosnap-weekly.2020-08-16.00:00:00
sudo zpool status -v tank
Code:
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 1.66T in 2 days 07:35:28 with 1 errors on Mon Sep 7 02:19:36 2020
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 1
raidz1-0 DEGRADED 0 0 2
label/zdisk1 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk2 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk3 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk4 ONLINE 0 0 0 block size: 512B configured, 4096B native
replacing-4 DEGRADED 0 0 0
13239389112982662359 UNAVAIL 0 0 0 was /dev/label/zdisk5/old
label/zdisk5 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk6 ONLINE 0 0 0 block size: 512B configured, 4096B native
errors: Permanent errors have been detected in the following files:
<0x73b>:<0x11bb9>
Finally,
sudo reboot
And…
sudo zpool status -v tank
Code:
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Sep 7 06:21:34 2020
27.5G scanned at 783M/s, 1000K issued at 27.8K/s, 10.0T total
0 resilvered, 0.00% done, no estimated completion time
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
label/zdisk1 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk2 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk3 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk4 ONLINE 0 0 0 block size: 512B configured, 4096B native
replacing-4 DEGRADED 0 0 0
13239389112982662359 UNAVAIL 0 0 0 was /dev/label/zdisk5/old
label/zdisk5 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk6 ONLINE 0 0 0 block size: 512B configured, 4096B native
errors: Permanent errors have been detected in the following files:
<0x73b>:<0x11bb9>
So….
Should I go ahead and remove remaining snapshots on that dataset, or just go one by one?
Last edited: Sep 9, 2020
-
#8
Code:
zfs list -H -rt snapshot -s creation -o name tank | xargs -n 1 zfs destroy -r
Try it without the zfs destroy command first.
-
#9
How come a file got corrupted on RAID5 when only 1 disk failed? Are you sure the pool was alright before the disk failed? Why did resilvering start after the reboot?
After the resilver completes, deleting the broken files and running zpool scrub should fix the errors.
However, I played a little with my test VM and managed to actually create an error like this:
Code:
root@freebsd12:/tank# zpool status -v
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 0 days 00:00:04 with 1 errors on Mon Sep 7 20:16:18 2020
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 1
raidz1-0 ONLINE 0 0 2
vtbd1 ONLINE 0 0 0
vtbd2 ONLINE 0 0 0
vtbd4 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
<metadata>:<0x48>
Removing the file linked to that metadata fixed the problem, but there is somehow less space available (the same files don’t fit again).
-
#10
Removing the file linked to that metadata fixed the problem,
If you don't mind me asking, how did you locate the file linked to the metadata? No filename is shown in the above 'zpool status -v' output.
-
Thread Starter
-
#11
The filename was shown originally in both cases. In both cases, the affected file was in a snapshot. Same file path, two different snapshots. I have destroyed the snapshots. Once the snapshot is destroyed, the filename is no longer shown. Also, destroying the snapshot and rebooting seems to trigger a resilver after the reboot. The pool was scrubbing without error once a month before I started the drive replacement. That was the last of 6 replacements from 2TB to 4TB in order to expand the pool. I started those replacements a couple of years ago. It's an old pool, and as you can see it is not on a 4K alignment, so I will have to migrate the data and start again anyway. Meanwhile, I am waiting for the current resilvering, which will take another day or so…
-
#12
If you don’t mind me asking how did you locate the file linked to the metadata.
I kept removing files until the error disappeared. Doesn't zdb retrieve such information, though?
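If I understand the tooling right (not verified on this pool), the two hex numbers in an entry like <0x73b>:<0x11bb9> are a dataset id and an object id, and zdb can dump that object, including its path, once you know which dataset it lives in; a hypothetical example assuming the object belongs to tank/video:
zdb -dddd tank/video 72633   # 0x11bb9 = 72633 decimal; look for the "path" field in the output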
-
#13
Thanks; scrubbing did not help remove it here. And zpool status now shows the disk as removed, even though the hard disk is still in the machine and has been untouched for a long time.
-
Thread Starter
-
#14
Code:
pool: tank
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 1.64T in 2 days 05:56:39 with 1 errors on Wed Sep 9 12:18:13 2020
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 1
raidz1-0 DEGRADED 0 0 2
label/zdisk1 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk2 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk3 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk4 ONLINE 0 0 0 block size: 512B configured, 4096B native
replacing-4 DEGRADED 0 0 0
13239389112982662359 UNAVAIL 0 0 0 was /dev/label/zdisk5/old
label/zdisk5 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk6 ONLINE 0 0 0 block size: 512B configured, 4096B native
errors: Permanent errors have been detected in the following files:
/tank/Blah/Blah/Blah/FooBar.mp4
Notice that the problem is now in the file itself and not a snapshot.
Will attempting to delete this file cause the zpool or dataset to become unavailable?
(Sure would’ve been nice to get a list of all these issues instead of having to do a 2.5 day resilver for each one.)
-
#15
zpool detach tank 13239389112982662359
If you need this file, you have to restore the entire pool from backup; otherwise you can delete the file and scrub the pool, but before that the pool must be healthy and without cksum errors.
-
#16
Scrubbing indeed cleared it long ago.
-
Thread Starter
-
#17
sudo zpool detach tank 13239389112982662359
Code:
sudo zpool status -v tank
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: resilvered 1.64T in 2 days 05:56:39 with 1 errors on Wed Sep 9 12:18:13 2020
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 1
raidz1-0 ONLINE 0 0 2
label/zdisk1 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk2 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk3 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk4 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk5 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk6 ONLINE 0 0 0 block size: 512B configured, 4096B native
errors: Permanent errors have been detected in the following files:
/tank/video/Blah/Blah/Blah.FooBar.mp4
sudo rm /tank/video/Blah/Blah/Blah/FooBar.mp4
sudo zpool scrub tank
Code:
pool: tank
state: ONLINE
status: One or more devices are configured to use a non-native block size.
Expect reduced performance.
action: Replace affected devices with devices that support the
configured block size, or migrate data to a properly configured
pool.
scan: scrub repaired 0 in 0 days 10:11:03 with 0 errors on Fri Sep 11 21:50:04 2020
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 1
raidz1-0 ONLINE 0 0 2
label/zdisk1 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk2 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk3 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk4 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk5 ONLINE 0 0 0 block size: 512B configured, 4096B native
label/zdisk6 ONLINE 0 0 0 block size: 512B configured, 4096B native
errors: No known data errors
So that worked! Thanks, everyone, for your input. Now I'm going to destroy it and restore the data anyway to fix the block size.
-
#18
No, it's NOT: you still have cksum errors. Can you check your log using zpool history -i tank and see what is logged there? Look for read/write errors on some hard disk (maybe an old one already removed), and if you can, run a memory test to verify there are no bad RAM modules (I hope you are using ECC RAM).
Do NOT clear the cksum errors with zpool clear tank raidz1-0 until you have figured out where they come from.
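For the memory-test part, one option besides rebooting into memtest86+ is a userland tester run from the live system; a sketch assuming the memtester package is installed (size and pass count are arbitrary):
memtester 2048M 3   # lock and test 2 GiB of RAM for 3 passes; any FAILURE line points at bad RAM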
-
Thread Starter
-
#19
Ah, yes, I see, you’re right. Thanks for the heads up. Anyway, unfortunately, I didn’t have any more time to mess with it. I destroyed the pool and am restoring the data from backup now.