[SASAG] file corruption.

M. Kim hekk at scn.org
Tue May 22 13:24:05 PDT 2007


Dan Wilder wrote:
> On Tue, May 22, 2007 at 09:15:03AM -0700, M. Kim wrote:
>   
>> What caused file corruption?
>>     
>
> You did not mention what level of RAID you are using.  I hope it isn't
> RAID-0!
>
> Why are you using lvm?  Personally I love it but I have specific reasons
> to use it which may not apply to your situation.  The downside is
> it costs bandwidth and it adds another layer of complexity to
> potentially fail.
>
>   
RAID-1. Four hard disks are set up as two software RAID-1 mirrors, giving the 
capacity of two disks, with LVM implemented on top of the software RAID-1.
The reason for using LVM is that I need more space than a single hard disk can 
provide. Yes, I could get a bigger hard disk, but LVM also gives me the 
flexibility to change sizes later.
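
Roughly speaking, the layout was built along these lines (the partition names 
and sizes below are only illustrative, but md0/md1 and vg01/u02/u03 match what 
shows up at boot):

    # two RAID-1 mirrors out of the four disks
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

    # LVM on top of the mirrors: one volume group, two logical volumes
    pvcreate /dev/md0 /dev/md1
    vgcreate vg01 /dev/md0 /dev/md1
    lvcreate -L 120G -n u02 vg01      # sizes here are made up
    lvcreate -L 120G -n u03 vg01
    mke2fs -j /dev/vg01/u02           # ext3
    mke2fs -j /dev/vg01/u03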

> Any number of things can cause file corruption.  Some of the most
> common:
>
> * Bad RAM
> * Bad capacitor on motherboard
> * Disk failure
>
> Less common:
>
> * Kernel bug
> * Overheating
>
>   
How could I check for bad RAM or a bad capacitor on the motherboard? These 
systems have been up for about a year.
Is it possible for hardware to go bad after a certain amount of time?
I want to rule out disk failure because the disks passed 'smartctl'.
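
(By 'passed smartctl' I mean checks along these lines, where /dev/sda stands in 
for each member disk; nothing in the output looked bad:

    smartctl -H /dev/sda     # overall health self-assessment: PASSED
    smartctl -a /dev/sda     # full attributes, error log, self-test log
)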

Kernel bug: is it possible that an application could trigger this problem?

I have been getting the following in /var/log/messages; I will probably need to 
take this up with Oracle.

May 22 07:50:25 xxxxx kernel: application bug: sqlplus(13511) has 
SIGCHLD set to SIG_IGN but calls wait().
May 22 07:50:25 xxxxx kernel: (see the NOTES section of 'man 2 wait'). 
Workaround activated.


smartctl also does temperature monitoring, so I would like to rule out overheating.
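
(The disk temperature shows up in the SMART attributes, for example:

    smartctl -A /dev/sda | grep -i temp    # usually Temperature_Celsius

though the exact attribute name varies by drive vendor, and SMART only sees the 
disks, not the CPU or the case.)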

>> What would you do when that happens?
>>     
>
> Restore from backups.  Don't count on e2fsck or any other filesystem
> check.  All it guarantees is that your filesystem is good.  There is
> _no_ guarantee of data integrity.
>   
Yes, that's what we did. I still want to find out what caused it, and I don't 
want it to happen again.
The application (Oracle) was doing fine and logged nothing unusual in the Oracle 
logs or in the system logs. This happened once on the test system, and this time 
it happened on the production system. These systems were bought at the same time 
and have the same filesystems as well as the same application.


>   
>> Any action to prevent that other than backup.
>>     
>
> * Use redundant RAID: RAID-1, RAID-5 etc.  
> * Don't get caught in The Raid Trap: one way or another, do frequent
> 100% read check of all disk surface.
> * Once in a while do a visual inspection of your motherboard looking for
> domed-out electrolytic capacitors.
> * Use a UPS and a tested installation of a powerfail daemon.
> * Make sure all fans are turning.
> * Consider database replication.
> * Keep an eye out for reports of kernel bugs.
>
>   

Does RAID protect against file corruption? I thought RAID only protects against 
hard disk failure.
I'm using 'smartctl' to check the hard disks. Is there any other way?
Unfortunately, I don't think I'm capable of doing a visual inspection of my 
motherboard.
What do domed-out electrolytic capacitors look like?
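
For the "frequent 100% read check of all disk surface" suggestion, is something 
like this what you mean? (disk and array names below are just examples)

    # long SMART self-test reads the whole surface in the background
    smartctl -t long /dev/sda        # repeat for each member disk
    smartctl -l selftest /dev/sda    # look at the result afterwards

    # or simply read every block back through the md device
    dd if=/dev/md0 of=/dev/null bs=1M

    # and keep an eye on the array state
    cat /proc/mdstat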




>> Any method to check file health other than e2fsck?
>>     
>
> e2fsck does not check file health, only filesystem health.
>
>   
Then what exactly does it check?
So even if the system passes e2fsck, it might still have file corruption?
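
My (possibly wrong) understanding is that the difference is something like this; 
the paths below are only examples:

    # read-only check of an unmounted filesystem's metadata: inodes, block
    # bitmaps, directory structure.  Says nothing about file contents.
    e2fsck -n -f /dev/vg01/u02

    # checking file contents would need checksums kept somewhere safe,
    # e.g. alongside the backups
    md5sum /u02/oradata/*.dbf > /backup/u02.md5
    md5sum -c /backup/u02.md5

Is that the right way to think about it?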

>> Scenario>
>>
>> The database got very slow. The DBA bounced the database and it was still 
>> unusually slow.
>> The DBA asked the SA to restart the server, hoping that would fix the slowness.
>> This thoughtless SA just restarted it with 'init 6' after stopping most 
>> applications.
>>
>> At boot, a filesystem check was forced because this server had not been 
>> rebooted for almost a year; by default, RH forces a filesystem check after 
>> 180 days.
>>
>> After a long wait, this check failed and asked for a manual check with 
>> 'e2fsck'.
>>
>> ------------------------------------------------------------------------------------ 
>>
>> These lines were typed in by hand, so they may contain typos.
>> ------------------------------------------------------------------------------------ 
>>
>> Checking root filesystem
>> /: clean, 173778/2146304 files, 3342624/4287346 blocks
>>                                                            [   OK   ]
>> Remounting root filesystem in read-write mode:             [   OK   ]
>> Setting up Logical Volume Management:                      [   OK   ]
>> vgchange  -- volume group "vg01" successfully activated    [   OK   ]
>>
>> Activating swap partitions:                                [   OK   ]
>> Finding module dependencies:
>> mdadm: /dev/md0 has been started with 2 drives.
>> mdadm: /dev/md1 has been started with 2 drives.
>> Checking filesystems
>> /boot: clean, 40/26104 files, 18333/104391 blocks.
>> /dev/vg01/u02 has gone 321 days without being checked:
>> /dev/vg01/u02: Inode 65546, i_blocks is 1804040, should be 112, FIXED
>> /dev/vg01/u02: Inode 65548, i_blocks is 524849, should be 112, FIXED
>> /dev/vg01/u02: Inode 65552,   i_blocks is 2099240, should be 112, FIXED
>> /dev/vg01/u02: Inode 6986387, i_blocks is 6984552, should be 1243X12, FIXED
>> /dev/vg01/u02: Duplicate or bad block in use!
>> /dev/vg01/u02: Duplicate blocks found..... Invoking duplicate block pass..
>> Pass 1B: rescan for duplicate/bad blocks
>> /dev/vg01/u02: Duplicate/bad block(s) in inode 11:/dev/vg01/u02:
>> 524:/dev/vg01/u02:  525:/dev/vg01/u02:   526:/dev/vg01/u02:
>> 527:/dev/vg01/u02:  /dev/vg01/u02: Duplicate/bad block(s) in inode
>> 6506388:/dev/vg01/u02:   524:/dev/vg01/u02:  525:/dev/vg01/u02:
>> 526:/dev/vg01/u02:   527:/dev/vg01/u02:
>> (MJ Note: long pause, at least 10 minutes)
>>
>> /dev/vg01/u02: Pass 1C: Scan directories for inodes with dup blocks
>> /dev/vg01/u02: Pass 1D: Reconciling duplicate blocks
>> /dev/vg01/u02: (There are 2 inodes containing duplicate/bad blocks.)
>> /dev/vg01/u02: File /lost+found (inode #11, mod time Fri Jun 23 21:49:26 
>> 2006)
>>   has 4 duplicate block(s), shared with 1 file(s):
>> /dev/vg01/u02: /???/rman/rman_fullbackup_PNBDB_622178203.130.1.1.bus(
>> inode #6586388, mod time Thu May 10 03:18:23 2007)
>> /dev/vg01/u02:
>>
>> /dev/vg01/u02: UNEXPECTED INCONSISTENCY: RUN fsck MANUALLY.
>>    (i.e., without -a or -p options)
>> /dev/vg01/u03 has gone 321 days without being checked, check forced
>> /dev/vg01/u03: |======================================                  
>> /47.5%
>>
>> (MJ note: after that filesystem check finished)
>> /dev/vg01/u03 has gone 321 days without being checked, check forced
>> /dev/vg01/u03: 421/16646144 files (42.3% non-contiguous), 
>> 13558542/33292288 blocks
>>
>>                                                        [FAILED]
>>
>>
>> ------------------------------------------------------------------------------------ 
>>
>> (got into rescue mode with RH CD 1)
>> ran 'e2fsck -c -y -f /dev/vg01/u02'
>>
>> I had to run e2fsck multiple times because the server still couldn't pass the 
>> filesystem check.
>>
>> It finally passed with this:
>>
>> /dev/vg01/u02: ****** FILE SYSTEM WAS MODIFIED *******
>> /dev/vg01/u02: 11/16646144 files (0.0% non-contiguous), 530566/33292288 
>> blocks
>> (Repair filesystem)
>>
>>
>>
>> After the server finally came up, all data on these two partitions, 
>> /dev/vg01/u02 and /dev/vg01/u03, was gone.
>> There is only 'lost+found'.
>>
>> After that, the SA, the DBA, and others worked on restoring from backups.
>>
>> These two partitions use software RAID with LVM. I have used LVM for a long 
>> time and added software RAID under the LVM about a year ago.
>> This file corruption has happened twice since adding the software RAID. I 
>> don't want to blame the software RAID.
>>
>> Is there anyone here who has implemented software RAID in a production 
>> environment?
>>
>> Setup: RHEL 3 (2.4.21-27.EL) with Oracle 10g.
>>
>>
>> Any advice will be appreciated.
>>
>> Best,
>>
>> Myung-Jwa,
>>
>>
>>
>>     
Thank you for your advice. It is very much appreciated.


>> _______________________________________________
>> Members mailing list
>> Members at lists.sasag.org
>> http://lists.sasag.org/mailman/listinfo/members
>>
>>     
>
>   



