anniemal's iBook started having problems several days
ago. It got slow. Then various apps didn't want to behave. It could
not open a web page in Safari or Opera. A lengthy poking-at session
eventually revealed error messages (once I discovered the Console
tool in Applications:Utilities) indicating a problem with one of the
files in the Library folder with a web-related-sounding name.
syntonic_comma copied the file from his iBook and handed
it to me on a USB thumb drive, and all seemed to be well.
Then the machine got slower and slower again. The afternoon before
last, it wouldn't let me switch login sessions -- I got the beach
ball spinning endlessly. I couldn't launch Activity Monitor either
-- same result. Thinking that perhaps we were running into problems
with virtual memory (not that such a thing ought to happen, but I did
have an awful lot of Opera windows open under my login session, which
I couldn't get to because
anniemal's login session was
the one visible), I tried closing a few open apps. No improvement.
But having quit Terminal, I couldn't re-launch it to issue 'kill'
commands. Nor could I launch Console ... or, as it turned out,
anything else.
So I shut the machine down.
Then it wouldn't boot. Whoops. It did the grey screen, then it did the blue screen, then it put that box in the middle of the blue screen where it reported that it was checking disks, starting the network, etc., and finally starting a login session ... and with a few pixels left on the progress bar, it wedged. Bleah.
It took me a while, without access to a browser and Google or
any installed help files, to find out how to boot an iBook into
single user mode. (I was actually looking for verbose-boot mode,
which I didn't find out how to do until later. But that's okay,
because what I would've seen there would have made single-user the
obvious next step anyhow.) Knowing that OS X is Unix underneath,
I knew there had to be a way to get it to show me all those reassuring
progress messages I'd gotten used to from the time I administered
my first Xenix box for the U.S. Army Corps of Engineers way back
when. (For the record, in case anyone who needs this info doesn't
already know it, holding down command-S while powering up will
start the user in single-user mode with a white-on-black text
console. Holding down command-V will cause an otherwise normal
boot but with the string of initialization messages displayed
until the blue screen takes over, and will also show a few
messages at system shutdown. Booting with the 'C' key pressed
attempts to boot from the CD drive. Maybe all iBook owners know
this, but I've got an excuse: I don't own or have sysadmin
responsibility for one, I just pick up
anniemal's
to check my mail via telnet or look at the web when I'm visiting.)
Okay, first off, having access to the console in single-user mode let me go peek at /var/log/system.log (I was expecting it to be /var/log/messages, as on my Linux machines, but that wasn't hard to figure out). There I saw complaints about a couple more files in /Library, including one that looked (based on a casual parsing of the name) as though it might be needed to open any window on the screen, or something. Urk! Second, I figured here was my chance to run fsck on /, which I did.
Heh. "You don't need to fsck this file system, Citizen; this is a journalled file system. Move along." Well, that's not the phrasing that fsck used, but that was the gist. I replied with "fsck -f dammit" (okay, okay, I only typed "fsck -f"; the "dammit" was merely spoken), and suddenly things looked a lot more a) informative, b) interesting, and c) depressing. "Overlapping allocation extent: file 57126d", and a Whole Lot more lines (one and a half to two screens full) just like that with different numbers. Oddly, fsck reported that it had Fixed Everything, cheerfully, and that the file system was A-OK, but running it again showed the same list of overlapping file allocations. Oh ew.
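For anyone faced with a similar wall of complaints, here is a minimal sketch of pulling just the file numbers out of them. It assumes the fsck text arrives on standard input and that each complaint looks like the line quoted above (the real wording may differ); `overlap_ids` is a made-up helper name, not part of fsck.

```shell
# Hypothetical helper: read fsck output on stdin and print the unique
# file numbers named on lines that mention an overlap.
# Assumes lines shaped like "Overlapping allocation extent: file 57126d".
overlap_ids() {
    # Keep only overlap lines, capture the digits after "file ",
    # then sort numerically and drop duplicates.
    sed -n '/[Oo]verlapp/s/.*file \([0-9][0-9]*\).*/\1/p' | sort -un
}
```

Lines that do not mention an overlap are ignored, so the cheerful "everything is fine" summary does not pollute the list.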
syntonic_comma came home, with ideas. I'd
been trying to figure out how to mount an external (Firewire)
drive, with no luck. I wanted to back up the user directories.
There's no entry in /dev that could point to the external
drive! As near as I can figure, some sort of magic occurs
in the Finder to create the devices on the fly in order to
mount them when a drive is detected being plugged in, but
since Finder wasn't running ... Argh! (Something similar
seems to be the case with network devices -- I thought if I
could bring up Airport, I could use FTP or NFS or Appletalk
to move files. But the only device I could start was 'lo'.
Trying to ifconfig 'en0' or 'en1' got me a "no such interface"
error. And yes, those are the names of the 10baseT and
Airport interfaces when the rest of OS X is running.) Cute
trick for a look-how-automagic-we-are system! Rather a PITA
when whatever performs that magic isn't running and all I've
got is the command line! Well,
syntonic_comma had
found instructions for putting a Mac into "Firewire slave"
mode (uh, hold down the 'T' key while powering up, IIRC),
which made
anniemal's iBook look like a simple
external drive to
syntonic_comma's iBook. Choosing
to experiment to find out why the instructions said
not to have any other Firewire devices attached at the same
time (instead of taking whoever's word for it, since no
explanation was given), we plugged both machines into the
drive designated for system backups, and he copied out the
stuff that we'd be really annoyed to lose (/Users, /Applications,
and /usr/local). At that point the plan was to scrub the
disk, reinstall the OS, and then put the user files back and
spend however long it took to fix up ownerships and permissions.
Fortunately I found the bit in The Missing Manual that talked about doing a non-clean reinstall, and we figured that might save us time if it worked, and not cost us too terribly much time if it turned out to be a waste. In the meantime we poked around with "find / -inum" to see which files were affected (yup, a bunch of /Library files, plus a bunch of Opera and Safari cache files and some Palm stuff), though that wouldn't tell us which files were overlapping which. Installing OS 10.3.5 over 10.3.9 had a side effect of moving all the old system files (including all the ones with improper disk allocations) to /Previous\ Systems, so I did þe olde "rm -rf" on that (which took two passes because some stuff didn't want to go away the first time but went silently the second), and going back to single-user gave us a clean fsck (after some directory and inode fixups on the first pass).
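The "find / -inum" step can be wrapped in a loop so that a whole list of file numbers gets looked up in one go. This is only a sketch: `paths_for_ids` is a made-up name, and it assumes (as we did) that the numbers fsck reports are the same ones find treats as inode numbers.

```shell
# Hypothetical helper: paths_for_ids ROOT ID [ID ...]
# For each file number given, print the paths under ROOT whose inode
# number matches. -xdev keeps find from wandering onto other volumes.
paths_for_ids() {
    root=$1
    shift
    for id in "$@"; do
        # Errors (unreadable directories and the like) are discarded.
        find "$root" -xdev -inum "$id" -print 2>/dev/null
    done
}
```

Calling it as `paths_for_ids / 57126` would search the boot volume for the one file number quoted earlier. Each number means another full walk of the disk, so a long list is slow, and as noted it still says nothing about which file overlaps which.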
At that point we still had the stuff in place that we had backed up just in case, saving us some work. But the system was still terribly sluggish after a normal (multi-user) boot. Blowing away /Applications/Palm sped things up a lot. (I do not know whether the Palm software was the cause or another victim; I do know that some of its files were among the overlapped ones.) But Privoxy was missing, so nothing that accessed the outside world via HTTP worked until we disabled the system-wide proxy setting. Reinstalling Privoxy later and re-enabling the proxy preference worked. Currently her machine appears to be working properly, but I've not yet reinstalled the Palm software. I'm considering buying a copy of The Missing Sync instead, having been told that in addition to its extra features, it's also more reliable. (Previously observed strangeness with the Palm stuff is the subject of another (future) entry...)
Part of what took so long was backing up, reinstalling, checking things, letting Software Update do its thing to bring the system current to 10.3.9 again (two more reboots there, and long downloads even over broadband), etc. Another big chunk of time was spent just figuring out what the heck had happened that we needed to fix. And a rather huge amount of time was spent figuring out or looking up on the web how to do things that, well, I guess nobody is ever "supposed" to need to do on an OS X box and that therefore aren't well documented anywhere really convenient. (Yah, hiding "the hard stuff" is all well and good until something goes wrong. Then I want the damned tools to let me diagnose and repair things in full-on geek-mode thankyewverymuch. And I never did find out how to bring up a network interface or mount an external disk in single-user mode.)
I must say that the "be an external drive for another computer" mode is pretty cool though. And a comment I saw somewhere gave me the impression that older Macs could do this using SCSI instead of Firewire and I just never knew about it ...?
Anyhow, I'm considering this evidence of a bug in MacOS 10.3.9,
because no matter how ill-behaved an application or how fugly a
force-quit, the OS and its filesystem routines should be in control
of what goes in which disk sectors. And the OS somehow wound up
allocating the same sectors to two different files ... thirty or
forty times. I'm not sure why this is even possible. (I don't
suppose there's an equivalent of the Devil Book for OS X, or for
Mach, is there? Reading up on internals of the filesystem code
would probably let me sleep easier at this point. I like
understanding stuff, the more so if it's got flaws in it that
might bite me.) What
syntonic_comma found on the
web regarding similar problems pointed to force-quitting
browsers under OS X, but last night
justgus37
said he'd seen the same allocation error show up under other
flavours of Unix, so the problem is not new, nor unique to
MacOS ... but inherited or not, it's a bug. And if it's that
old and still with us, I'm guessing that it's probably not been
figured out and thus probably still exists in 10.4, though I'd be
quite happy to hear that I'm wrong.
So there's the tale. A frustrating, educational couple of days. And a realization that I haven't properly kept up with how things work under the hood. Though I'm glad I had all that Unix-geek knowledge to use in figuring out where to start.
(no subject)
Yes, the older SCSI PowerBooks had an HDI-30 SCSI plug, and there was some way the cable signaled to the machine to go into disk mode (shorting pin 30 to GND? shrug).
SCSI thing
OTOH, I have no idea if there are electronic caveats (like that weird nonsense about the bus termination, and "termpwr" and funny details like that), nor do I know what software-level stuff is required, or whether it is even actually possible to do something useful with that technique.
Sorry if I made no sense there, am dragging brain along on a string ATM
(no subject)
Quit, eject the disk, restart. The Disk Utility in 10.4 is somewhat better than the one in 10.3.x. If it gives you a clean bill of health after a couple of go-rounds, I'd trust it.
If for some reason the overlapping extents are still problematic, the BEST utility is Disk Warrior, which you pay about $75 for. The only problem is, you have to boot the machine off either the Disk Warrior CD or use the entire computer as a FireWire Hard Drive using "Target Disk Mode." (You can read up on that online. Basically, boot the unhappy computer holding T until you see an orange logo bouncing, then plug the computer into another Mac using a FireWire cable. It will appear as a disk (or two, if you've partitioned the internal drive) on the other computer's desktop. Newer Macs like iBooks do not do SCSI.)
Anyway, Disk Warrior will almost certainly fix it. If it can't, nobody can.
Once you've got the errors fixed, then go back to Disk Utility and repair disk and repair permissions.
If your hard disk has less than 1 GB free, try to free up some space. There's a dead simple way to do that: download Cocktail or Onyx. (Pretty much all Mac software that can be downloaded is available through http://www.versiontracker.com )
Use either Onyx or Cocktail to do all the routine maintenance like deleting caches, running the daily, weekly, and monthly maintenance tasks, rotating log files, etc. Most of this is nice but hardly necessary, but it all has the net effect of cleaning out crap.
Once you do that, if you still don't have a gig free, you're going to have to move some big files to CD or DVD. You need some free space.
Anyway, the short answer is, yeah, there were a lot of problems with disk extent overlaps. There are probably still a few, but there are a lot fewer in 10.4.latest than in 10.3.5 or even 10.3.9. Sometimes it can be caused by freak failures, but sometimes it's just buggy code, unfortunately.
Sorry I can't be more help, but I'm traveling and online via modem. Email me if you have more questions. (Ask meredith for the address.)