The redo log is corrupted. If the problem persists, discard the redo log.
Yesterday about an hour before the end of my work day one of our critical servers fell over and was displaying the following message in the vSphere client.
The redo log of VisualSVNServer_1-000001.vmdk is corrupted. If the problem persists, discard the redo log.
The error message refers to a redo log, but this is legacy VMware terminology. VMware have from ESXi 3.1 started to use the term snapshot to mean the same thing but for some reason the error messages still use the old term.
The server was named Subversion and was a VisualSVN Server.
There was a snapshot dated from 15th December 2013 in the Snapshot manager for the Subversion VM so returning to this snapshot would have meant returning to a point several weeks ago and then trying to import the backup of the repository that was made the night of 29th January.
The underlying cause of the corruption cannot be definitively determined but I think was due to the amount of disk activity on the physical disk that constitutes datastore 3_2 on the host server S003-ESXi. This caused the system to fail to write to the log and to create updated delta disks which contain all the changes to the disks since the point of the snapshot.
I believe that if there had not been a snapshot the data corruption probably wouldn’t have happened. I have since educated staff that taking snapshots in vSphere is really not the same as backing up the server and they shouldn’t be doing it on the Subversion server at all.
I resolved the issue with Subversion by carrying out the following steps.
I clicked OK to the error message in the slim hope that the VM could overcome the glitch itself upon a simple reboot.
This didn’t work. So I started the process of backing up the VM by forcing a shutdown of the machine by virtually cutting off the power and then making a copy of the virtual machine folder on the datastore.
Whilst the copy process was going I checked Virtual Machine Logs, vmware-3.log was completely corrupt and the vmware.log was showing some corruption.
The copy process took over an hour as it was 150GB in total size. Mostly due to the two virtual disks the first VisualSVNServer.vmdk which constitutes the C: drive of the server is 40GB and the second VisualSVNServer_1.vmdk which is the E: drive is 100GB.
Having made a copy of everything I attempted to fix the snapshots. I made sure that there was sufficient space on the datastore and then using Snapshot Manager in vSphere created a new snapshot of the Subversion VM.
This operation was successful, so I then tried to commit the changes and to consolidate the disks. This worked for VisualSVNServer.vmdk merging all the changes, but not entirely for VisualSVNServer_1.vmdk, however it did reduce the size of the delta disks significantly meaning that there was likely to be only minimal data lost.
Nothing more could be done through the vSphere client so I then started a process of trying to manually consolidate the following disks into a single disk.
Enabled SSH on the host server s003-esxi.
Using PuTTY I logged into the command line of the host and changed the directory to the relevant directory that contained the virtual machine files for Subversion /vmfs/volumes/Datastore3_2/VisualSVNServer
Then ran the command ls *.vmdk –lrt to display all virtual disk components.
Then starting with the highest number snapshot ran the following command to clone the disk in a way that would merge the delta disks into a copy of the main disk.
vmkfstools –i VisualSVNServer_1-000002.vmdk VisualSVNServer-Recovered_1.vmdk
This process took another hour or so as it was trying to create a 100GB file.
This failed with the following error message displayed:
Failed to clone disk: Bad File descriptor (589833)
Then starting with the next highest number snapshot I ran command to clone the disk without the most recent changes.
vmkfstools –i VisualSVNServer_1-000001.vmdk VisualSVNServer-Recovered_1.vmdk
This process again took about hour as again it was trying to create a 100GB file.
Again this failed with the following error message displayed:
Failed to clone disk: Bad File descriptor (589833)
Abandoned the idea of merging the disks I removed the VM from the inventory in vSphere and then moved all but the following files into a separate folder.
I could then recreate the VM from these files. I downloaded the file VisualSVNServer.vmx which is the virtual machine’s configuration file and stores the settings regarding the virtual devices that make up a virtual machine. I edited the file to change all references to VisualSVNServer_1-000002.vmdk to VisualSVNServer_1.vmdk so that the machine could be booted up ignoring the delta disks and any data they might contain.
Added the VM back into the inventory and then booted up the machine. It booted up fine, checked the E: drive and there appeared to be data written to the disk all the way up to the time that the server fell over so it appeared that there was minimal if any data lost.
Thanks to XtraVirt for the necessary steps.