[SASAG] file corruption.

M. Kim hekk at scn.org
Tue May 22 09:15:03 PDT 2007


What caused the file corruption?
What would you do when that happens?
Is there any way to prevent it, other than backups?
Is there any method to check filesystem health other than e2fsck?
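For background, here is one non-destructive way I know of to peek at filesystem health without repairing anything: read the superblock with dumpe2fs and do a read-only surface scan with badblocks. This is only a sketch against a throwaway image file (the /tmp path and sizes are just for illustration; on a real system you would point at the device, e.g. /dev/vg01/u02, while it is unmounted):

```shell
# Build a throwaway ext2 image in a plain file so nothing real is touched.
dd if=/dev/zero of=/tmp/demo.img bs=1M count=8 2>/dev/null
mke2fs -q -F /tmp/demo.img

# Superblock summary: state (clean / not clean), mount count, last check time.
dumpe2fs -h /tmp/demo.img | grep -Ei 'state|mount count|last checked'

# Read-only scan of all 8192 1K blocks (last block index 8191);
# prints the number of any block it cannot read.
badblocks -sv /tmp/demo.img 8191
```

Neither command writes to the filesystem, so they are safe to run before deciding whether a full e2fsck is needed.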



Scenario>

The database got very slow. The DBA bounced the DB, and it was still 
unusually slow.
The DBA asked the SA to restart the server, hoping that would fix the 
slowness.
This thoughtless SA simply restarted it with 'init 6' after stopping most 
of the applications.

At boot, a filesystem check was forced, because this server had not been 
bounced for almost a year.
By default, RH wants to run a filesystem check every 180 days.
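For reference, that interval lives in the ext2/ext3 superblock and can be read or changed with tune2fs. A sketch against a scratch image file (the /tmp path is illustrative; on a real filesystem you would use the device path, e.g. /dev/vg01/u02):

```shell
# Scratch ext3 image so the commands can be tried without touching a real disk.
dd if=/dev/zero of=/tmp/fsck-demo.img bs=1M count=8 2>/dev/null
mke2fs -q -F -j /tmp/fsck-demo.img

# Show the current mount-count limit and time-based check interval.
tune2fs -l /tmp/fsck-demo.img | grep -Ei 'maximum mount count|check interval'

# Ignore the mount count, but force a check every 180 days --
# better than letting a year's worth of damage pile up unchecked.
tune2fs -c -1 -i 180d /tmp/fsck-demo.img
```

Scheduling the check at a planned reboot beats discovering a forced one during an emergency bounce.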

After a long wait, this filesystem check failed and asked for a manual 
check with 'e2fsck'.

------------------------------------------------------------------------------------ 

The following was typed in by hand, so it may contain typos.
------------------------------------------------------------------------------------ 

Checking root filesystem
/: clean, 173778/2146304 files, 3342624/4287346 blocks
                                                           [   OK   ]
Remounting root filesystem in read-write mode:             [   OK   ]
Setting up Logical Volume Management:                      [   OK   ]
vgchange  -- volume group "vg01" successfully activated    [   OK   ]

Activating swap partitions:                                [   OK   ]
Finding module dependencies:
mdadm: /dev/md0 has been started with 2 drives.
mdadm: /dev/md1 has been started with 2 drives.
Checking filesystems
/boot: clean, 40/26104 files, 18333/104391 blocks.
/dev/vg01/u02 has gone 321 days without being checked:
/dev/vg01/u02: Inode 65546, i_blocks is 1804040, should be 112, FIXED
/dev/vg01/u02: Inode 65548, i_blocks is 524849, should be 112, FIXED
/dev/vg01/u02: Inode 65552,   i_blocks is 2099240, should be 112, FIXED
/dev/vg01/u02: Inode 6986387, i_blocks is 6984552, should be 1243X12, FIXED
/dev/vg01/u02: Duplicate or bad block in use!
/dev/vg01/u02: Duplicate blocks found..... Invoking duplicate block pass..
Pass 1B: Rescan for duplicate/bad blocks
/dev/vg01/u02: Duplicate/bad block(s) in inode 11:/dev/vg01/u02:   
524:/dev/vg01/u02:  525:/dev/vg01/u02:   526:/dev/vg01/u02:   
527:/dev/vg01/u02:  /dev/vg01/u02: Duplicate/bad block(s) in inode 
6506388:/dev/vg01/u02:   524:/dev/vg01/u02:  525:/dev/vg01/u02:   
526:/dev/vg01/u02:   527:/dev/vg01/u02: 
(MJ Note: long pause, at least 10 minutes)

/dev/vg01/u02: Pass 1C: Scan directories for inodes with dup blocks
/dev/vg01/u02: Pass 1D: Reconciling duplicate blocks
/dev/vg01/u02: (There are 2 inodes containing duplicate/bad blocks.)
/dev/vg01/u02: File /lost+found (inode #11, mod time Fri Jun 23 21:49:26 
2006)
  has 4 duplicate block(s), shared with 1 file(s):
/dev/vg01/u02: /???/rman/rman_fullbackup_PNBDB_622178203.130.1.1.bus(
inode #6586388, mod time Thu May 10 03:18:23 2007)
/dev/vg01/u02:

/dev/vg01/u02: UNEXPECTED INCONSISTENCY: RUN fsck MANUALLY.
   (i.e., without -a or -p options)
/dev/vg01/u03 has gone 321 days without being checked, check forced
/dev/vg01/u03: |======================================                  / 47.5%

(MJ note: after that file check)
/dev/vg01/u03 has gone 321 days without being checked, check forced
/dev/vg01/u03: 421/16646144 files (42.3% non-contiguous), 
13558542/33292288 blocks

                                                       [FAILED]


------------------------------------------------------------------------------------ 

(got into rescue mode with RH CD 1)
ran 'e2fsck -c -y -f /dev/vg01/u02'

I had to run e2fsck multiple times, because the server still could not 
pass the filesystem check.
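For anyone repeating this: -f forces the check even if the superblock says clean, -y answers "yes" to every repair prompt, and -c runs badblocks first and records unreadable sectors in the bad-block inode. Note that -y will happily discard or relink data wholesale, which is presumably how everything ended up in lost+found. A sketch of the same invocation against a scratch image (illustrative path only):

```shell
# Scratch image to exercise the flags without risking a real filesystem.
dd if=/dev/zero of=/tmp/repair-demo.img bs=1M count=8 2>/dev/null
mke2fs -q -F /tmp/repair-demo.img

# -f: check even if marked clean; -y: assume "yes" to all repair prompts;
# -c: run badblocks and add any bad sectors to the bad-block inode.
e2fsck -f -y -c /tmp/repair-demo.img
```

On a filesystem you still hope to recover data from, running a read-only pass first (e2fsck -n) to see the scale of the damage before committing to -y may be the safer order.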

It finally passed the check with this:

/dev/vg01/u02: ****** FILE SYSTEM WAS MODIFIED *******
/dev/vg01/u02: 11/16646144 files (0.0% non-contiguous), 530566/33292288 
blocks
(Repair filesystem)



After the server finally came up, all data on these two partitions, 
/dev/vg01/u02 and /dev/vg01/u03, was gone.
There was only 'lost+found'.

After that, the SA, the DBA, and others worked on restoring from backups.

These two partitions use software RAID with LVM. I have used LVM for a 
long time, and added software RAID under LVM about a year ago.
This file corruption has happened twice since adding software RAID. I 
don't want to blame software RAID, though.

Is there anyone who has implemented software RAID in a production 
environment?

Setup: RHEL 3 (2.4.21-27.EL) with Oracle 10g.


Any advice will be appreciated.

Best,

Myung-Jwa,





