anniemal's iBook started having problems several days
ago. It got slow. Then various apps didn't want to behave. It could
not open a web page in Safari or Opera. A lengthy poking-at session
eventually revealed error messages (once I discovered the Console
tool in Applications:Utilities) indicating a problem with one of the
files in the Library folder with a web-related-sounding name.
syntonic_comma copied the file from his iBook and handed
it to me on a USB thumb drive, and all seemed to be well.
Then the machine got slower and slower again. The afternoon before
last, it wouldn't let me switch login sessions -- I got the beach
ball spinning endlessly. I couldn't launch Activity Monitor either
-- same result. Thinking that perhaps we were running into problems
with virtual memory (not that such a thing ought to happen, but I did
have an awful lot of Opera windows open under my login session, which
I couldn't get to because anniemal's login session was the one
visible), I tried closing a few open apps. No improvement.
But having quit Terminal, I couldn't re-launch it to issue 'kill'
commands. Nor could I launch Console ... or, as it turned out,
anything else.
So I shut the machine down.
Then it wouldn't boot. Whoops. It did the grey screen, then it
did the blue screen, then it put that box in the middle of the blue
screen where it reported that it was checking disks, starting the
network, etc., and finally starting a login session ... and with a
few pixels left on the progress bar, it wedged. Bleah.
It took me a while, without access to a browser and Google or
any installed help files, to find out how to boot an iBook into
single user mode. (I was actually looking for verbose-boot mode,
which I didn't find out how to do until later. But that's okay,
because what I would've seen there would have made single-user the
obvious next step anyhow.) Knowing that OS X is Unix underneath,
I knew there had to be a way to get it to show me all those reassuring
progress messages I'd gotten used to from the time I administered
my first Xenix box for the U.S. Army Corps of Engineers way back
when. (For the record, in case anyone who needs this info doesn't
already know it, holding down command-S while powering up will
start the machine in single-user mode with a white-on-black text
console. Holding down command-V will cause an otherwise normal
boot but with the string of initialization messages displayed
until the blue screen takes over, and will also show a few
messages at system shutdown. Booting with the 'C' key pressed
attempts to boot from the CD drive. Maybe all iBook owners know
this, but I've got an excuse: I don't own or have sysadmin
responsibility for one, I just pick up anniemal's to check my
mail via telnet or look at the web when I'm visiting.)
Okay, first off, having access to the console in single-user
mode let me go peek at /var/log/system.log (I was expecting it
to be /var/log/messages, as on my Linux machines, but that wasn't
hard to figure out). There I saw complaints about a couple more
files in /Library, including one that looked (based on a casual
parsing of the name) as though it might be needed to open any
window on the screen, or something. Urk! Second, I figured
here was my chance to run fsck on /, which I did.
Heh. "You don't need to fsck this file system, Citizen;
this is a journalled file system. Move along." Well, that's
not the phrasing that fsck used, but that was the gist. I
replied with "fsck -f dammit" (okay, okay, I only typed
"fsck -f"; the "dammit" was merely spoken), and suddenly
things looked a lot more a) informative, b) interesting, and
c) depressing. "Overlapping allocation extent: file 57126d",
and a Whole Lot more lines (one and a half to two screens full)
just like that with different numbers. Oddly, fsck reported
that it had Fixed Everything, cheerfully, and that the file
system was A-OK, but running it again showed the same list of
overlapping file allocations. Oh ew.
syntonic_comma came home, with ideas. I'd been trying to figure
out how to mount an external (Firewire) drive, with no luck. I
wanted to back up the user directories.
There's no entry in /dev that could point to the external
drive! As near as I can figure, some sort of magic occurs
in the Finder to create the devices on the fly in order to
mount them when a drive is detected being plugged in, but
since Finder wasn't running ... Argh! (Something similar
seems to be the case with network devices -- I thought if I
could bring up Airport, I could use FTP or NFS or Appletalk
to move files. But the only device I could start was 'lo'.
Trying to ifconfig 'en0' or 'en1' got me a "no such interface"
error. And yes, those are the names of the 10baseT and
Airport interfaces when the rest of OS X is running.) Cute
trick for a look-how-automagic-we-are system! Rather a PITA
when whatever performs that magic isn't running and all I've
got is the command line! Well, syntonic_comma had found
instructions for putting a Mac into "Firewire slave" mode (uh,
hold down the 'T' key while powering up, IIRC), which made
anniemal's iBook look like a simple external drive to
syntonic_comma's iBook. Choosing to experiment to find out why
the instructions said not to have any other Firewire devices
attached at the same time (instead of taking whoever's word for
it, since no explanation was given), we plugged both machines
into the drive designated for system backups, and he copied out
the stuff that we'd be really annoyed to lose (/Users,
/Applications, and /usr/local). At that point the plan was to
scrub the disk, reinstall the OS, and then put the user files
back and spend however long it took to fix up ownerships and
permissions.
Fortunately I found the bit in The Missing Manual that
talked about doing a non-clean reinstall, and we figured that
might save us time if it worked, and not cost us too terribly
much time if it turned out to be a waste. In the meantime we
poked around with "find / -inum" to see which files were affected
(yup, a bunch of /Library files, plus a bunch of Opera and Safari
cache files and some Palm stuff), though that wouldn't tell us
which files were overlapping which. Installing OS 10.3.5 over
10.3.9 had a side effect of moving all the old system files (including
all the ones with improper disk allocations) to /Previous\ Systems,
so I did þe olde "rm -rf" on that
(which took two passes because some stuff didn't want to go
away the first time but went silently the second), and going
back to single-user gave us a clean fsck (after some directory
and inode fixups on the first pass).
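For anyone retracing the "find / -inum" step, here is a minimal
sketch of how one might wrap it up as a shell function. The name
find_by_inode is made up for illustration, and the sketch assumes
that the number fsck prints (minus the trailing 'd') is the same
number find sees as the inode number, which is what appeared to be
the case on this machine. It only names the affected files; it
cannot say which file overlaps which.

```shell
#!/bin/sh
# Hypothetical helper (not a standard tool): print the path of each
# file whose inode number is given, searching under ROOT.
# Usage: find_by_inode ROOT INODE [INODE ...]
find_by_inode() {
    root=$1
    shift
    for inum in "$@"; do
        # -xdev keeps find on the volume that ROOT lives on;
        # -inum matches files by inode number.
        # Errors (unreadable directories and the like) are discarded.
        find "$root" -xdev -inum "$inum" -print 2>/dev/null
    done
}

# Example, using the one number quoted from the fsck output:
# find_by_inode / 57126
```

Each number costs a full walk of the volume, so a list of thirty or
forty of them takes a while.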
At that point we still had the stuff in place that we had
backed up just in case, saving us some work. But the system
was still terribly sluggish after a normal (multi-user) boot.
Blowing away /Applications/Palm helped a lot there. (I do not
know whether the Palm software was the cause or another victim;
I do know that some of its files were among the overlapped
ones.) But Privoxy was missing, so nothing that accessed the
outside world via HTTP worked until we disabled the system-wide
proxy setting. Reinstalling Privoxy
later and re-enabling the proxy preference worked. Currently
her machine appears to be working properly, but I've not yet
reinstalled the Palm software. I'm considering buying a copy
of The Missing Sync instead, having been told that in addition
to its extra features, it's also more reliable. (Previously
observed strangeness with the Palm stuff is the subject
of another (future) entry...)
Part of what took so long was backing up, reinstalling,
checking things, letting Software Update do its thing to
bring the system current to 10.3.9 again (two more reboots
there, and long downloads even over broadband), etc. Another
big chunk of time was spent just figuring out what the heck had
happened that we needed to fix. And a rather huge
amount of time was spent figuring out or looking up on the web
how to do things that, well, I guess nobody is ever "supposed"
to need to do on an OS X box and that therefore aren't well documented
anywhere really convenient. (Yah, hiding "the hard stuff" is all
well and good until something goes wrong. Then I want the damned
tools to let me diagnose and repair things in full-on geek-mode
thankyewverymuch. And I never did find out how to bring up a
network interface or mount an external disk in single-user mode.)
I must say that the "be an external drive for another computer"
mode is pretty cool though. And a comment I saw somewhere gave
me the impression that older Macs could do this using SCSI instead
of Firewire and I just never knew about it ...?
Anyhow, I'm considering this evidence of a bug in MacOS 10.3.9,
because no matter how ill-behaved an application or how fugly a
force-quit, the OS and its filesystem routines should be in control
of what goes in which disk sectors. And the OS somehow wound up
allocating the same sectors to two different files ... thirty or
forty times. I'm not sure why this is even possible. (I don't
suppose there's an equivalent of the Devil Book for OS X, or for
Mach, is there? Reading up on internals of the filesystem code
would probably let me sleep easier at this point. I like
understanding stuff, the more so if it's got flaws in it that
might bite me.) What syntonic_comma found on the web regarding
similar problems pointed to force-quitting browsers under OS X,
but last night justgus37 said he'd seen the same allocation
error show up under other
flavours of Unix, so the problem is not new, nor unique to
MacOS ... but inherited or not, it's a bug. And if it's that
old and still with us, I'm guessing that it's probably not been
figured out and thus probably still exists in 10.4, though I'd be
quite happy to hear that I'm wrong.
So there's the tale. A frustrating, educational couple of
days. And a realization that I haven't properly kept up with
how things work under the hood. Though I'm glad I had all that
Unix-geek knowledge to use in figuring out where to start.