Disaster, Disaster, Disaster!!! Planning for Murphy’s Law—Part two

In the second article in this series, following on from Disaster, Disaster, Disaster!!! Planning for Murphy’s Law, “Murphy” begins to take a closer look at the list of problems he has experienced …

Disasters: Does this sound familiar?

  • I just want my file back OK. No other info given—not even the file name.

  • I just want it back OK and I don’t know at what time in the past that it got corrupted and I’m not sure about the file name exactly.

  • I just want you to retrieve all versions of the file so I can inspect them. I know what I do when I save them, but they don’t appear to be there any more.

  • I cannot access my files—not a single one.

These problems present particular issues to a system administrator.
How do they handle these sorts of questions?

You can ask a few questions, but which questions generally get the best results?

I usually ask a few simple questions at first.

  • How important are the file(s)?
  • What is the time-frame for your solution? That is, when do you have to have the file recovered?
  • Do you have backups of previous versions?
  • Have you saved previous versions with a system like X_v1, X_v2, X_v3?

You might ask yourself if:

  1. They are working with some form of version control without their knowing.
  2. Microsoft Office Word will have autosave copies so that if it crashes before it writes the final version, then you can go back to that copy.
  3. There are some people who write programs for a living and they may have been set up with a specific Subversion version control system. You may have to check around the system or even another networked system to find the backup.
  4. There are some editing programs like Coda or TextWrangler that arrange for a download from another system. They may have been set up by another systems administrator long before the new guy arrived to try to sort things out. They may simply not be able to reach the other system where their file resides. This may apply if they have lost their NFS link too.

Hey! Here is a thought …
When did you create the file? Was there enough time for the system to do a decent backup that we can go get and recover from? Some people lose a file before the system has had a chance to back it up.

Some people have lost files because they had Word open and they saved a copy to a floppy. Then they copied the file from the floppy to their hard disk. Of course the copy on the floppy was corrupted—floppies are not great at maintaining data integrity. It is always better to save to a hard disk first and then save directly to the floppy later on. Then you have two chances of correctly writing the same data (from memory) to two different media. Of course if Word is corrupt then all bets are off.

Some people have lost files because they save to a floppy and always use the same name. It is good practice to change the name to reflect the version on the floppy. I mean, if you have ‘mybestessay’ saved, then the next time you work with the file to update it, change the name slightly to ‘mybestessay_v2’. That way you may be able to recreate something like your best essay from a version that needs a little work but is mostly all there.

  • We have them backed up on those round tape thingys. We no longer have a reel-to-reel system. Oops.

The time to fix this problem was when the last systems admin discovered that the management was decommissioning the only machine that did have the old reel-to-reel tape drive. They should have migrated the contents of all their old tapes to new media. There is no solution unless you can find something compatible to read the old tapes. Student records or Tax file records may be subject to some lengthy “keep forever” or “keep as historical” requirements.

  • We received a tape but it’s from Germany and it’s from some place called EBCDIC.

Most of today’s Windows machines use either the ASCII or the UTF-8 character set. However, there are other character sets, such as EBCDIC, and you need to convert from one set to another.

  • The tape needs to be converted from EBCDIC to ASCII cartridge.

On a Linux system you can convert files using ‘dd’. From ASCII to EBCDIC:

dd if=/path-tofile/inputfile of=/path-to/outputfile conv=ebcdic

From EBCDIC to ASCII:

dd if=/path-tofile/inputfile of=/path-to/outputfile conv=ascii

Compare the file sizes because sometimes things do go wrong. The files should be approximately the same size. If you are worried keep the original file for a long time so you can do other conversions.
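The size comparison can also be scripted. The following is a minimal Python sketch, not a replacement for ‘dd’. It assumes the data uses the cp037 EBCDIC code page (an assumption: your tape may use a different one, and ‘dd’ has its own built-in table), and the function name is made up for illustration.

```python
def ebcdic_to_ascii(raw: bytes, codepage: str = "cp037") -> bytes:
    """Convert EBCDIC bytes to ASCII, one output byte per input byte.

    The cp037 default is an assumption; pick the code page that matches
    the system that wrote the tape.
    """
    text = raw.decode(codepage)
    # Characters with no ASCII equivalent become '?', so the length is kept.
    return text.encode("ascii", errors="replace")


if __name__ == "__main__":
    sample = "HELLO".encode("cp037")
    converted = ebcdic_to_ascii(sample)
    # The converted data should be exactly as long as the original.
    print(converted, len(sample) == len(converted))
```

If the two lengths differ, something other than a plain character-for-character conversion has happened and the original should be kept for another attempt.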

NOTE: Some versions of ‘dd’ do not require or support the use of an equals sign. I once had a tape in EBCDIC format that I converted and, due to a device driver bug with records of a certain PRIME number length, it needed a little fixing. I noticed a large difference in file size. I added a few bytes of ‘SPACE’ data to the end of the file so that the total file size was not a multiple of the PRIME number.

  • We have 7 floppies and the engineer over-wrote the first one—we want as many files as possible brought back from the other 6 floppies.

Actually the engineer asked for a blank floppy to test the new floppy drive with. What application wrote the backup? It was written by a ‘tar’ command, so I was lucky: I was able to write a program in ‘C’ that checked every 512-byte block for a ‘tar’ header. Then I was able to restart the tar extraction from a point where I found the tail end of one file and the beginning of a tar header at the start of the next file in the archive. We were able to recover from that point on.
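The original program was written in ‘C’ and is not reproduced here. As a rough illustration of the same idea, here is a Python sketch that walks a damaged archive in 512-byte blocks and reports which blocks look like tar headers. Recognising a header by its checksum field is my assumption about a reasonable test, not necessarily what the original program did, and the function names are invented for this example.

```python
import io
import tarfile

BLOCK = 512  # tar archives are built from 512-byte blocks


def looks_like_tar_header(block: bytes) -> bool:
    """Return True if the block carries a valid tar header checksum."""
    if len(block) != BLOCK or block.count(0) == BLOCK:
        return False  # short block or all-zero padding
    field = block[148:156].strip(b" \0")  # checksum is stored as octal text
    try:
        stored = int(field, 8)
    except ValueError:
        return False
    # The checksum is computed with its own 8-byte field read as spaces.
    computed = sum(block[:148]) + 8 * ord(" ") + sum(block[156:])
    return stored == computed


def find_tar_headers(data: bytes) -> list[int]:
    """Return the byte offsets of every block that looks like a header."""
    return [
        offset
        for offset in range(0, len(data) - BLOCK + 1, BLOCK)
        if looks_like_tar_header(data[offset:offset + BLOCK])
    ]


def extract_names_from(data: bytes, offset: int) -> list[str]:
    """List the members that tar can read when started at a header offset."""
    with tarfile.open(fileobj=io.BytesIO(data[offset:])) as archive:
        return archive.getnames()
```

Passing the first offset returned by `find_tar_headers` to `extract_names_from` mimics restarting the extraction at the first intact header after the overwritten section.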

  • We have three tapes and there is a problem with the third tape. No we don’t have the changes to the source code nor do we have the database and the application runtime executables recovered yet.

This problem required two solutions. The first was to get past the corrupted block on the tape—something part software and part physical was required. The second solution was to find a way to patch around the corruption.

Mike the mighty engineer had to tweak the tape drive head while it was reading the tape, to get the best possible signal from the tape. Remember that a tape drive usually uses the same head to write and read the tape, so the only problem between writing a file and reading it back off the tape is that the head may have drifted a bit in the meantime.

This generally means that one tape drive cannot always read what another tape drive has written.

Use the following Linux command to get something off the tape:

dd if=/dev/ntape0 of=/path-to/outputfile conv=noerror,sync

after your ‘Mighty Mike’ does his tape tweaking.

After that you need to find a working binary editor. I found one and compiled it. I was then able to find a point in the file /path-to/outputfile that needed a little bit of fixing. Basically I sacrificed the file just before the corruption by extending the length of that file to include the bad section of the tape. I even changed the name to make it clear that it was the damaged file.
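To make clear what ‘conv=noerror,sync’ is doing in the command above, here is a small Python sketch of the same behaviour: keep going after a read error, and pad every short or failed block out to full size with NUL bytes. The `read_block` callable is hypothetical and stands in for whatever actually reads block number N from your device. The list of bad block numbers is my own addition, since it tells you where to point the binary editor afterwards.

```python
from typing import Callable


def salvage_blocks(
    read_block: Callable[[int], bytes],
    block_count: int,
    block_size: int = 512,
) -> tuple[bytes, list[int]]:
    """Read every block, padding failures with NULs instead of stopping."""
    output = bytearray()
    bad_blocks = []
    for number in range(block_count):
        try:
            chunk = read_block(number)
        except OSError:
            # Like 'noerror': note the failure and carry on.
            chunk = b""
            bad_blocks.append(number)
        # Like 'sync': pad short or missing data to the full block size.
        output += chunk.ljust(block_size, b"\0")
    return bytes(output), bad_blocks
```

Because every block keeps its full size, the data after a bad block stays at the offset where later tools expect to find it.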

  • We used a mainframe system to archive files from Sun Solaris systems and we have not been able to get Oracle backed up from the remote systems.

This problem was not really a technical issue. It was an installation issue by the vendor. They forgot to install an activation key on the mainframe so that it could act as an Oracle Backup Service. The Oracle client was configured correctly and should have been able to work as per the manual. However, the poor Unix Admin was asked to help the mainframe admin when it came to crunch time. Luckily for the admin, he was able to sniff the network, and from what he gathered there he was able to prove that the problem was coming from the mainframe end.

  • Friday 11am: We have a Human Resources training coming up on Monday and we need to install the latest Oracle to support the HR software—but Oracle will not install from CDROM.

So the time-frame is critical for the Unix admin and the Oracle admin. People were expecting to see the latest HR software on the latest Oracle. I was able to find enough space on the Unix system to copy down the entire contents of the CDROM. I call this a staging area. I was then able to track down the problem. The Oracle install program was referring to a smallish file (a check file) that basically contained information about the files that needed to be copied to the Unix system. The information comprised a filename and file length and some other stuff. This file should have been the last thing built before going to the CDROM burning process. Somehow one of the files had a last-minute change for whatever reason and the size of the file changed. This stopped the installation in its tracks.

I was able to modify the check file and amend the recorded size (the length) of that file, so that when we ran the installation from the staging area we were able to do a nice, clean, trouble-free install.
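Finding the offending entry by hand is tedious, so here is a Python sketch of how such a check could be automated. The real layout of Oracle’s check file is not described in this article, so the format used here (one line per file, a name followed by a length, separated by whitespace) is purely an assumption, as is the function name.

```python
import os


def find_size_mismatches(check_lines, staging_dir):
    """Compare recorded file lengths against the files in a staging area.

    Each line is assumed to start with '<name> <length>'; anything after
    those two fields is ignored. Returns (name, expected, actual) tuples,
    where actual is None if the file is missing.
    """
    mismatches = []
    for line in check_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or unrecognised line
        name, expected = fields[0], int(fields[1])
        path = os.path.join(staging_dir, name)
        actual = os.path.getsize(path) if os.path.exists(path) else None
        if actual != expected:
            mismatches.append((name, expected, actual))
    return mismatches
```

Any tuple it returns points at a file whose recorded length no longer matches what is actually sitting in the staging area.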

The only other trick was getting the installation process to work from the staging area. A loopback filesystem is always nice to have when that sort of thing happens.

  • We exported the data from Oracle after doing an Oracle upgrade and now it does not import—it complains about CONSTRAINT.

Oracle sent out a CDROM upgrade to the University where I worked. The Oracle SysAdmin came to me after discovering that he could not import a plain old text file. Since I had asked the Oracle SysAdmin to export his databases as a part of my backup strategy, and then test-restore the database every two weeks, we were the very first people to discover this bug.

I like to export to flat files from databases. I export for two reasons. One, I can see clearly how the database grows over time: I can do a trend analysis on the size of the database. Two, I can reserve the right amount of space on the filesystem I would be using to do any emergency restores from. If I found that the space available on the backup area was getting tight, I would consider getting bigger disks or adding disks to the RAID for more space. We always need space to restore onto when we import. It makes sense to have enough space for at least one current backup and for a backup restored from tape.
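The trend analysis does not need anything fancy. As a sketch only, assuming you have a list of export file sizes taken at a regular interval, a straight-line fit gives a rough idea of how long the backup area will last. The function name, the 14-day default and the use of megabytes are illustrative choices, not part of the original setup.

```python
def days_until_full(sizes_mb, capacity_mb, days_between=14):
    """Estimate the days left before exports outgrow the backup area.

    sizes_mb holds the size of each successive export, oldest first.
    Returns None when there is too little data or no growth.
    """
    count = len(sizes_mb)
    if count < 2:
        return None
    days = [index * days_between for index in range(count)]
    mean_day = sum(days) / count
    mean_size = sum(sizes_mb) / count
    # Least-squares slope: growth in MB per day.
    slope = sum(
        (d - mean_day) * (s - mean_size) for d, s in zip(days, sizes_mb)
    ) / sum((d - mean_day) ** 2 for d in days)
    if slope <= 0:
        return None
    return (capacity_mb - sizes_mb[-1]) / slope
```

Remember to leave room for the restored copy as well as the current export when you choose the capacity figure.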

The problem I have to face with Oracle is that it can be configured for several disks and it is sometimes not possible to restore to another machine with as much space as the original. So I asked the mighty Dominic to reconfigure the configuration files on a nightly basis. I wanted a working configuration and a backup/restore configuration area. I was lucky to have the mighty Dominic able to do this sort of thing. Every night he reconfigured the backup/restore configuration files based on information gleaned by a script. I now know that during the End-of-Year (EOY) period you need a lot more roll-back space available for doing the group certificate processing so I don’t have the EOY processes die from lack of space.

So what Oracle had done was create a new keyword command called ‘CONSTRAINT’. They had modified their ‘export code’ to write out that command as a part of the exported file. What they somehow forgot was to modify the ‘import code’ to recognise the new keyword. Once we identified the new condition we referred to previous export files to see if they had the word CONSTRAINT in them. Funnily enough, none of them did. So we reported the problem to Oracle and we were the first ones to get a patch or upgrade to fix the problem. They were very quick; it only took about two weeks.

  • We had a ‘C’ programmer working on this for a month and we need Ace Ventura files converted to ICL Publisher files.

This happened to me one afternoon at about 3 PM. The two office power girls were great. They did the majority of the hard work. They each typed in half of the escape sequences (building a translation table).

I jammed the two parts together and integrated their work with my code. I worked only on the parts of the problem that involved opening a file and reading it into memory (the whole lot). Then I referred to the girls’ translation table and found and replaced any escape sequence in memory. Then I simply wrote the result out to another file. I arranged for my code to translate in both directions so that we could change a file to or from one format or the other. With this done I was able to convert a file from one format to our format and then back again. I was able to confirm that my program did indeed create the exact same file when we converted back to the original format. This lent confidence to the conversion process, so we then tested the new format file contents with the ICL Publisher and it worked.
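The approach is easy to sketch in Python. The table below is entirely made up for illustration, since the real Ace Ventura and ICL Publisher sequences are not listed in this article; only the shape of the solution (a two-way table plus a round-trip check) comes from the story. The sketch assumes the table is non-ambiguous, meaning each sequence maps to exactly one sequence in the other format.

```python
import re

# Hypothetical escape sequences, NOT the real ones from either product.
HYPOTHETICAL_TABLE = {
    b"\x1b[1": b"\x1bE",  # e.g. "bold on" in format A -> format B
    b"\x1b[0": b"\x1bF",  # e.g. "bold off" in format A -> format B
}


def build_translator(table):
    """Return a function that replaces every key in table with its value."""
    if not table:
        return lambda data: data
    # Longest sequences first, so a short one never wins over a longer one.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(b"|".join(re.escape(key) for key in keys))
    return lambda data: pattern.sub(lambda m: table[m.group(0)], data)


def reverse_table(table):
    """Swap keys and values to translate in the other direction."""
    return {value: key for key, value in table.items()}


to_b = build_translator(HYPOTHETICAL_TABLE)
to_a = build_translator(reverse_table(HYPOTHETICAL_TABLE))
```

Running `to_a(to_b(data)) == data` over a real document is the same round-trip check described above.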

When did they need to have a result? They needed to demonstrate that they could do it on the following Monday. Talk about tight deadlines, but with their help it was all done by about 6:30 PM.

  • We have to compile 1.5 million lines of code—yeah! I know we were told about it a month ago, but we just got permission to do it today. We need it by Monday.

They wanted to start work on the Wednesday and expected it to be compiled by Friday noon as they were arranging for a representative to come out and configure the printers. The application was a financial accounting system.

I needed to start by asking the developers of the system where to start compiling. Start with ‘this library’ and then compile ‘that library’ etc. etc. and then compile the executables.

I found about 25 NULL pointer problems and had the executables ready by Friday noon. Later on I applied for a job at the development company and was told that the reason they hired me was that several other hardware vendors had had a month to compile things, and their people had rung up about problems and chewed up a lot of the developers’ time.

  • We have had this problem with printers printing through a spooling system and it keeps stopping. We have worked on this problem for about a year with no success. Can you fix it before we lose the customer?

Ah! This is a story I love to tell. I was hired as a Unix Systems Admin and was asked during interview about my knowledge of printers. I was introduced to the customer who had the printer problem. I saw the printers and spool devices (basically a lot of memory to store the data that would produce a printed copy if sent down to the printer and a button to increase the number of copies to print). I was introduced to the engineers who were constantly out fixing the printers.

The customer soon got me back to look at the current printer problem. I examined the printers and the spool devices carefully for some sort of problem and could not find any real issue in the whole setup. YES the serial port doing the printing was set up for Xon/Xoff flow control and YES the printer was set up for Xon/Xoff flow control. Hard to say about the spool device in between. YET the printer had hiccuped. It printed some of the report and then stopped dead. The only way to revive the printer was to power it off. After the restart it printed some of the document again and then died again. The problem was that it was not consistent. Slightly different amounts of the document would be printed before the printer died.

The customer staff would retype sections or pages and send that off to the printer, just to keep the reports flowing.

On my third visit I installed a few changes in the way documents were processed during printing. I made sure that a copy of the file was squirrelled away before it went off to the serial port for printing.

My fourth visit was in response to yet another dead-stop printer. I went out and collected the file that I had squirrelled away onto a tape.

I went back to work and did a HEX dump of the entire file to paper. I took the printer manual and checked every escape sequence. Bingo!!! There was an escape sequence inside the print file that should not be there. I very carefully went through the printer manual.

“When the printer receives an invalid escape sequence, it will:

  1. finish printing what is inside its buffer.
  2. stop printing and wait for a reset sequence from the computer”.

Flow control explains why the failure was not consistent: slight changes in the timing of flow control and printing meant the buffer never held quite the same amount of data. Whatever was left in the buffer got printed and THEN the printer would stop.

Apparently this particular printer was expecting a much smarter program to control its printing. The reset sequence was a little more complex than Unix device drivers or printer drivers were capable of. So the end result was that someone had to turn off the printer to recover the situation.

On my fifth and final visit I typed in a ‘C’ program that basically checked every escape sequence that was heading down to the printer. If it was invalid it was removed. Thus no invalid escape sequences ever made it to the spooler, let alone the printer. I checked with the application providers and it seems that their code was not bulletproof. If a typist accidentally hit the ESCAPE key while hitting the F1 function key close by, then it was possible for random escape sequences to be captured by their application and dumped into a database. Later on when reports were due the application grabbed things out of the database and put them into the print file contents.
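The original filter was a ‘C’ program and the printer’s real command set is not listed here. The Python sketch below only shows the general shape of such a filter. The set of valid sequences is passed in by the caller and is an assumption for illustration, and dropping just the stray ESC byte (leaving the characters after it as ordinary text) is one possible policy, not necessarily the one the original program used.

```python
ESC = 0x1B


def strip_invalid_escapes(data: bytes, valid: set) -> bytes:
    """Pass valid escape sequences through and drop any other ESC byte.

    'valid' holds the complete byte sequences (each starting with ESC)
    taken from the printer manual.
    """
    longest = max((len(sequence) for sequence in valid), default=0)
    output = bytearray()
    position = 0
    while position < len(data):
        if data[position] != ESC:
            output.append(data[position])
            position += 1
            continue
        # Try the longest possible match first, down to two bytes.
        for length in range(longest, 1, -1):
            candidate = data[position:position + length]
            if candidate in valid:
                output += candidate
                position += len(candidate)
                break
        else:
            position += 1  # no valid sequence starts here: drop the ESC
    return bytes(output)
```

A filter like this sits between the application and the spooler, so a stray ESCAPE keystroke stored in the database never reaches the printer.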

I was about to leave when the customer’s boss said “You see all my staff hanging around the coffee machine. I have three big customers of my own who need 15 reports each done and on their desks today by 5 PM. We have to print them by 3 PM and have them checked for accuracy. I want you to make sure that those reports are on my desk by 3 PM.” I rang my boss and told him the story and asked if I could hang around at the customer’s site for the rest of the day. This I did and the customer and I talked all afternoon until 3 PM. No problems were reported by the printers as they kept on printing all day long.

A week went by and my boss called me into her office. I had received a commendation from the local customer engineer. “I have not been out to the site all week this week. Generally I spend about 50% of my time at the site trying to fix the printer problems. I have spent a lot of time out at this customer site in the last 12 months. I think that the problem has finally been fixed”.

My real reward though was the knowledge that several people kept their jobs because of my luck in finding the real problem.

  • We have a tape that writes 60 MB, but when we wish to recover all the files it does not recover them all. Most of them are lost and we get an I/O error.

I bumped into the Mighty Mike, another engineer and my supervisor ARGUS, all gathered around a single machine which seemed to be stripped down to its bare frame so that the engineers could examine every aspect of the 60 MB cartridge tape drive. My supervisor had written a bit of code to try to help the customer overcome a problem. His backup script never wrote more than about 10 MB of data to a tape that was able to hold 60 MB. I watched as the tape did its thing and sure enough an I/O error popped up and complained. This particular tape was created with a simple command and should have contained about 55 MB worth of data.

An I/O error usually means that some hardware has detected an error. I noticed, though, that the I/O error appeared once the recovered data exceeded about 16 MB. That sparked a memory in my mind. “Hummm”. I checked the ulimit (a limit in the csh that stops a runaway process from writing a file larger than a set size, and so from eating up the disk). It seems that the default shell for the ‘root’ user was /bin/sh, what is affectionately called the Bourne Shell. However, you can override the default and use the C Shell /bin/csh instead. This meant that the C Shell came with its own problems, including ulimit.

I changed the ulimit from 16MB to 160MB. We ran the recovery command again and it worked fine. Problem solved and so simple. All that trouble for something that was doing its job. When the data coming off the tape and being written to the disk exceeded the 16 MB ulimit, the C Shell reported the error.

NOTE: The C Shell reported the error. It was not a hardware error message at all. However it is so easy to have this sort of problem.
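A quick check before a big restore can save a lot of head-scratching. The sketch below uses Python’s standard `resource` module on a modern Unix system and assumes that its `RLIMIT_FSIZE` value (measured in bytes) plays the same role as the ulimit in this story. The system in the story was much older, so treat this as an analogy, and the function names as invented.

```python
import resource  # available on Unix-like systems only

MB = 1024 * 1024


def file_size_limit():
    """Return the current per-process file size limit in bytes, or None."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return None if soft == resource.RLIM_INFINITY else soft


def restore_fits(expected_bytes, limit_bytes):
    """True if a file of expected_bytes can be written under the limit."""
    return limit_bytes is None or expected_bytes <= limit_bytes


if __name__ == "__main__":
    limit = file_size_limit()
    print("limit:", "unlimited" if limit is None else limit)
    print("55 MB restore fits:", restore_fits(55 * MB, limit))
```

With the figures from the story, a 55 MB restore does not fit under a 16 MB limit but does fit under a 160 MB one.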

The names of all people mentioned have been changed. However I would like to mention that the character ‘Mighty Mike’ has sadly passed away and has been missed by many engineers and more than a few of his friends. I wish his family could have known the man that he was to his co-workers. I always liked the man and he always had a funny story to tell or a whopper to try and get past you.

He once travelled to Bracknell (England) for work and was told, about the company building he had to report to, “You cannot miss it. It’s the big black building you see when you come over the rise at such and such”. So Mighty Mike drove over the rise only to be presented with FOG in the valley as thick as black smoke. Needless to say he had a tough time finding that building.

This series follows on with Disaster, Disaster, Disaster!!! Planning for Murphy’s Law—Part three

