posted by [personal profile] eftychia at 03:47pm on 2005-09-29

[livejournal.com profile] anniemal's iBook started having problems several days ago. It got slow. Then various apps didn't want to behave. It could not open a web page in Safari or Opera. A lengthy poking-at session eventually revealed error messages (once I discovered the Console tool in Applications:Utilities) indicating a problem with one of the files in the Library folder with a web-related-sounding name. [livejournal.com profile] syntonic_comma copied the file from his iBook and handed it to me on a USB thumb drive, and all seemed to be well.

Then the machine got slower and slower again. The afternoon before last, it wouldn't let me switch login sessions -- I got the beach ball spinning endlessly. I couldn't launch Activity Monitor either -- same result. Thinking that perhaps we were running into problems with virtual memory (not that such a thing ought to happen, but I did have an awful lot of Opera windows open under my login session, which I couldn't get to because [livejournal.com profile] anniemal's login session was the one visible), I tried closing a few open apps. No improvement. But having quit Terminal, I couldn't re-launch it to issue 'kill' commands. Nor could I launch Console ... or, as it turned out, anything else.

So I shut the machine down.

Then it wouldn't boot. Whoops. It did the grey screen, then it did the blue screen, then it put that box in the middle of the blue screen where it reported that it was checking disks, starting the network, etc., and finally starting a login session ... and with a few pixels left on the progress bar, it wedged. Bleah.

It took me a while, without access to a browser and Google or any installed help files, to find out how to boot an iBook into single user mode. (I was actually looking for verbose-boot mode, which I didn't find out how to do until later. But that's okay, because what I would've seen there would have made single-user the obvious next step anyhow.) Knowing that OS X is Unix underneath, I knew there had to be a way to get it to show me all those reassuring progress messages I'd gotten used to from the time I administered my first Xenix box for the U.S. Army Corps of Engineers way back when. (For the record, in case anyone who needs this info doesn't already know it, holding down command-S while powering up will start the user in single-user mode with a white-on-black text console. Holding down command-V will cause an otherwise normal boot but with the string of initialization messages displayed until the blue screen takes over, and will also show a few messages at system shutdown. Booting with the 'C' key pressed attempts to boot from the CD drive. Maybe all iBook owners know this, but I've got an excuse: I don't own or have sysadmin responsibility for one, I just pick up [livejournal.com profile] anniemal's to check my mail via telnet or look at the web when I'm visiting.)

Okay, first off, having access to the console in single-user mode let me go peek at /var/log/system.log (I was expecting it to be /var/log/messages, as on my Linux machines, but that wasn't hard to figure out). There I saw complaints about a couple more files in /Library, including one that looked (based on a casual parsing of the name) as though it might be needed to open any window on the screen, or something. Urk! Second, I figured here was my chance to run fsck on /, which I did.

Heh. "You don't need to fsck this file system, Citizen; this is a journalled file system. Move along." Well, that's not the phrasing that fsck used, but that was the gist. I replied with "fsck -f dammit" (okay, okay, I only typed "fsck -f"; the "dammit" was merely spoken), and suddenly things looked a lot more a) informative, b) interesting, and c) depressing. "Overlapping allocation extent: file 57126d", and a Whole Lot more lines (one and a half to two screens full) just like that with different numbers. Oddly, fsck reported that it had Fixed Everything, cheerfully, and that the file system was A-OK, but running it again showed the same list of overlapping file allocations. Oh ew.
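For anyone facing a similar screenful of complaints, here is a minimal sketch of how the file numbers could be pulled out of saved fsck output into a plain list. The helper name overlap_ids is hypothetical, the line format is taken from the message as quoted above (the exact wording fsck prints may differ), and treating the trailing "d" as a decimal marker is an assumption rather than something confirmed here.

```shell
# Hypothetical helper: read saved fsck output on stdin and print each
# distinct file number once, one per line, in numeric order.
# Assumes lines shaped like the one quoted in the entry, e.g.
#   Overlapping allocation extent: file 57126d
# and assumes the trailing "d" only marks the number as decimal.
overlap_ids() {
    # Keep only lines that contain "file <digits>", strip everything else.
    sed -n 's/.*file \([0-9][0-9]*\)d*[^0-9]*$/\1/p' | sort -n -u
}
```

Used as "overlap_ids < saved-fsck-output.txt" (the file name is made up), this gives a list of numbers, with the duplicates collapsed, that can be fed to other tools.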

[livejournal.com profile] syntonic_comma came home, with ideas. I'd been trying to figure out how to mount an external (Firewire) drive, with no luck. I wanted to back up the user directories. There's no entry in /dev that could point to the external drive! As near as I can figure, some sort of magic occurs in the Finder to create the devices on the fly in order to mount them when a drive is detected being plugged in, but since Finder wasn't running ... Argh! (Something similar seems to be the case with network devices -- I thought if I could bring up Airport, I could use FTP or NFS or Appletalk to move files. But the only device I could start was 'lo'. Trying to ifconfig 'en0' or 'en1' got me a "no such interface" error. And yes, those are the names of the 10baseT and Airport interfaces when the rest of OS X is running.) Cute trick for a look-how-automagic-we-are system! Rather a PITA when whatever performs that magic isn't running and all I've got is the command line! Well [livejournal.com profile] syntonic_comma had found instructions for putting a Mac into "Firewire slave" mode (uh, hold down the 'T' key while powering up, IIRC), which made [livejournal.com profile] anniemal's iBook look like a simple external drive to [livejournal.com profile] syntonic_comma's iBook. Choosing to experiment to find out why the instructions said not to have any other Firewire devices attached at the same time (instead of taking whomever's word for it, since no explanation was given), we plugged both machines into the drive designated for system backups, and he copied out the stuff that we'd be really annoyed to lose (/Users, /Applications, and /usr/local). At that point the plan was to scrub the disk, reinstall the OS, and then put the user files back and spend however long it took to fix up ownerships and permissions.

Fortunately I found the bit in The Missing Manual that talked about doing a non-clean reinstall, and we figured that might save us time if it worked, and not cost us too terribly much time if it turned out to be a waste. In the meantime we poked around with "find / -inum" to see which files were affected (yup, a bunch of /Library files, plus a bunch of Opera and Safari cache files and some Palm stuff), though that wouldn't tell us which files were overlapping which. Installing OS 10.3.5 over 10.3.9 had a side effect of moving all the old system files (including all the ones with improper disk allocations) to /Previous\ Systems, so I did þe olde "rm -rf" on that (which took two passes because some stuff didn't want to go away the first time but went silently the second), and going back to single-user gave us a clean fsck (after some directory and inode fixups on the first pass).
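The "find / -inum" poking can be sketched roughly as below. This is an illustration rather than a transcript of what we typed: the function name paths_for_inodes is hypothetical, and it assumes the numbers fsck reports can be matched as inode numbers by find, which is what appeared to work for us.

```shell
# Hypothetical helper: print the path(s) that carry each given inode number.
# Usage: paths_for_inodes ROOT NUMBER...
paths_for_inodes() {
    root=$1
    shift
    for n in "$@"; do
        # -xdev keeps find on the filesystem that ROOT lives on, so other
        # mounted volumes (where the same number means a different file)
        # are not searched. Unreadable directories are skipped quietly.
        find "$root" -xdev -inum "$n" -print 2>/dev/null
    done
}
```

For example, "paths_for_inodes / 57126" would search the boot volume for the file number quoted earlier. It makes one full pass over the tree per number, which is slow with thirty or forty of them but keeps the sketch simple. As noted, it only names the affected files; it cannot say which file overlaps which.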

At that point we still had the stuff in place that we had backed up just in case, saving us some work. But the system was still terribly sluggish after a normal (multi-user) boot. Blowing away /Applications/Palm helped a lot there (I do not know whether the Palm software was the cause or another victim; I do know that some of its files were among the overlapped ones), but Privoxy was missing, so nothing that accessed the outside world via HTTP worked until we disabled the system-wide proxy setting. Reinstalling Privoxy later and re-enabling the proxy preference worked. Currently her machine appears to be working properly, but I've not yet reinstalled the Palm software. I'm considering buying a copy of The Missing Sync instead, having been told that in addition to its extra features, it's also more reliable. (Previously observed strangeness with the Palm stuff is the subject of another (future) entry...)

Part of what took so long was backing up, reinstalling, checking things, letting System Updater do its thing to bring the system current to 10.3.9 again (two more reboots there, and long downloads even over broadband), etc. Another big chunk of time was spent just figuring out what the heck had happened that we needed to fix. And a rather huge amount of time was spent figuring out or looking up on the web how to do things that, well, I guess nobody is ever "supposed" to need to do on an OS X box and that therefore aren't well documented anywhere really convenient. (Yah, hiding "the hard stuff" is all well and good until something goes wrong. Then I want the damned tools to let me diagnose and repair things in full-on geek-mode thankyewverymuch. And I never did find out how to bring up a network interface or mount an external disk in single-user mode.)

I must say that the "be an external drive for another computer" mode is pretty cool though. And a comment I saw somewhere gave me the impression that older Macs could do this using SCSI instead of Firewire and I just never knew about it ...?

Anyhow, I'm considering this evidence of a bug in MacOS 10.3.9, because no matter how ill-behaved an application or how fugly a force-quit, the OS and its filesystem routines should be in control of what goes in which disk sectors. And the OS somehow wound up allocating the same sectors to two different files ... thirty or forty times. I'm not sure why this is even possible. (I don't suppose there's an equivalent of the Devil Book for OS X, or for Mach, is there? Reading up on internals of the filesystem code would probably let me sleep easier at this point. I like understanding stuff, the more so if it's got flaws in it that might bite me.) What [livejournal.com profile] syntonic_comma found on the web regarding similar problems pointed to force-quitting browsers under OS X, but last night [livejournal.com profile] justgus37 said he'd seen the same allocation error show up under other flavours of Unix, so the problem is not new, nor unique to MacOS ... but inherited or not, it's a bug. And if it's that old and still with us, I'm guessing that it's probably not been figured out and thus probably still exists in 10.4, though I'd be quite happy to hear that I'm wrong.

So there's the tale. A frustrating, educational couple of days. And a realization that I haven't properly kept up with how things work under the hood. Though I'm glad I had all that Unix-geek knowledge to use in figuring out where to start.

There are 5 comments on this entry.
posted by [personal profile] ckd at 08:06pm on 2005-09-29
While I haven't seen that level of file system hosage on a Mac in a long time, it can obviously happen. A more GUI-friendly way of doing much of the same things, btw, is to boot from the OS disk (using the C key trick) and then, when the installer starts, hit the menus and go into Disk Utility. That has verify and repair options...which, er, run fsck.

Yes, the older SCSI PowerBooks had an HDI-30 SCSI plug, and there was some way the cable signaled to the machine to go into disk mode (shorting pin 30 to GND? shrug).
 
posted by [identity profile] en-ki.livejournal.com at 08:42pm on 2005-09-29
I was just gonna say that. I made the mistake of trying to fix my slightly fucked HFS+ via Unix and getting all frustrated and deciding my hard disk was dead, but then I said "what the hell, I'll try DU" and it worked. So I'm a bad nerd.
 
posted by [identity profile] merde.livejournal.com at 09:11pm on 2005-09-29
i emailed a pointer to this post to one of my friends at apple. he may have some useful input. or not.
 
posted by [identity profile] eviltomble.livejournal.com at 05:12am on 2005-09-30
I only really skimmed the JE because I am asleep today and stuff, and my brain feels like it's full of wire wool, but I have heard it said before that one can technically use a SCSI bus as a network of sorts in such a way, yes. Like SCSI hosts can act as targets, or whatever the terms are.

OTOH, I have no idea if there's electronic caveats (like with that weird nonsense about the bus termination, and "termpwr" and funny details like that), and nor do I know what software-level stuff is required, if it is even actually possible to do something useful with that technique.

Sorry if I made no sense there, am dragging brain along on a string ATM
 
posted by [identity profile] haineux.livejournal.com at 07:14pm on 2005-09-30
If you have a friend with a Tiger installer disk (10.4), borrow it and boot off of it. (I.e., insert the CD or DVD and hold down C as soon as you hear the boot beep. It takes longer than normal to boot from it.) Now select "Disk Utility" and tell it to Repair Disk and then Repair Permissions.

Quit, eject the disk, restart. The Disk Utility in 10.4 is somewhat better than the one in 10.3.x. If it gives you a clean bill of health after a couple of go-rounds, I'd trust it.

If for some reason the overlapping extents are still problematic, the BEST utility is Disk Warrior, which you pay about $75 for. The only problem is, you have to boot the machine off either the Disk Warrior CD or use the entire computer as a FireWire Hard Drive using "Target Disk Mode." (You can read up on that online. Basically, boot the unhappy computer holding T until you see an orange logo bouncing, then plug the computer into another Mac using a FireWire cable. It will appear as a disk (or two, if you've partitioned the internal drive) on the other computer's desktop. Newer Macs like iBooks do not do SCSI.)

Anyway, Disk Warrior will almost certainly fix it. If it can't, nobody can.

Once you've got the errors fixed, then go back to Disk Utility and repair disk and repair permissions.

If your hard disk has less than 1 GB free, try to free up some space. There's a dead simple way to do that: download Cocktail or Onyx. (Pretty much all Mac software that can be downloaded is available through http://www.versiontracker.com )

Use either Onyx or Cocktail to do all the routine maintenance like deleting caches, running the daily, weekly, and monthly maintenance tasks, rotating log files, etc. Most of this is nice but hardly necessary, but it all has the net effect of cleaning out crap.

Once you do that, if you still don't have a gig free, you're going to have to move some big files to CD or DVD. You need some free space.

Anyway, the short answer is, yeah, there were a lot of problems with disk extent overlaps. There are probably still a few, but there's a lot fewer in 10.4.latest than in 10.3.5 or even 10.3.9. Sometimes it can be caused by freak failures, but sometimes it's just buggy code, unfortunately.

Sorry I can't be more help, but I'm traveling and online via modem. email me if you have more questions. (Ask meredith for the address.)
