posted by [identity profile] eviltomble.livejournal.com at 01:15am on 2005-12-02
I didn't see where/how you said uniq and sort went wrong earlier so I'm afraid no suggestions for that...

...but on the topic of backups, would you be able to point me towards a good source of info on using tapes under *nix? Apart from mt(1), tar(1) or st(4)? ;) (or tar.info for that matter)

I think I've possibly figured out setmarks vs filemarks due to a M$ page, but amongst other things I'm a little unclear about how the "blocking factor" of tar interacts with gzip etc. And erasing vs overwriting data. And appending to tapes. And the form of the incremental snapshot files from gnu tar. Etc etc...

I'm roughly set to archive a load of stuff to one of my lovely new DAT tapes (and later start doing regular backups), but I want to make sure I know all the fiddly details before relying on it for things. Experiments with dd, mt-st and cat only manage to get me so far :P
 
posted by [identity profile] dglenn.livejournal.com at 02:51am on 2005-12-02
(Actually, by "backup tools" I meant "tools for when other tools fail", not "tools for making backups", but hey, it's still as good a place as any to hang your question...)

Alas, my relevant experience is from way long ago (I've got tape drives that I've not gotten around to setting up under Linux yet -- part of December's to-do list -- but I used tapes a lot with Xenix in the 1980s). All I ever used extensively under *ix was tar, but tar did what I needed it to do. I'm mostly hoping that once I install a Jumbo drive in a Linux box it'll work just like the old QIC-20 drives did way back when, and spare me having to climb a learning curve.

Appending to a tape would have been/will be useful, but I never quite trusted it, so I never really used that feature. The two backup schemes I used were: a) a simple rotation of tapes with a nightly full backup (when the entire filesystem fit on one tape (40MB seemed like so much space back then!)), occasionally setting a tape aside for longer-term storage and replacing it with a fresh one, and b) unattended incremental backups Monday night through Thursday night and a full backup Friday afternoon with me there to change tapes. A key element in both setups was that the cron script wrote the tape with "tar cv" then rewound it and did a "tar tv" to check whether each file that the first pass said it was writing was actually there (I may have also first built a manifest with 'find' to compare to, but I'm not sure). That way if there was a bad tape or some other glitch, I found out when I came in the next morning, not when someone desperately needed a file that didn't get backed up.
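The write-then-verify scheme described above can be sketched roughly like this, using an ordinary file as a stand-in for the tape device (on real hardware you'd write to e.g. /dev/st0 and rewind with "mt rewind" before the second pass; all the paths here are invented for the demo):

```shell
workdir=$(mktemp -d)
mkdir "$workdir/data"
printf 'hello\n' > "$workdir/data/a.txt"
printf 'world\n' > "$workdir/data/b.txt"

# Pass 1: write the archive, logging what tar says it wrote.
tar -C "$workdir" -cvf "$workdir/backup.tar" data > "$workdir/wrote.list" 2>&1

# Pass 2: "rewind" and list the archive, logging what is actually there.
tar -tf "$workdir/backup.tar" > "$workdir/read.list"

# Compare the two manifests; any difference means a bad tape or glitch.
sort "$workdir/wrote.list" > "$workdir/wrote.sorted"
sort "$workdir/read.list"  > "$workdir/read.sorted"
diff -q "$workdir/wrote.sorted" "$workdir/read.sorted" \
    && echo "verify ok" \
    || echo "verify FAILED" >&2
```

In a cron job you'd mail yourself the FAILED branch instead of echoing it, which is what makes the problem show up the next morning rather than at restore time.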

I think I played with 'tar u' and wound up deciding to use 'find / -newer archive-datestamp' for incremental backups instead (obviously, the cron job refreshed /archive-datestamp ...) Once I get the right sizes of tape drives hooked up, and if any of my ancient tapes are still readable, maybe I can find copies of my old backup scripts.
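The 'find -newer' trick might have looked something like the following sketch (the stamp filename and directory layout here are made up; the original used /archive-datestamp at the filesystem root):

```shell
root=$(mktemp -d)
touch "$root/old.txt"
touch "$root/stamp"        # pretend the last backup run left this behind
sleep 1
touch "$root/new.txt"      # modified since the last run

# Pick up only files newer than the stamp...
find "$root" -type f -newer "$root/stamp" ! -name stamp -print \
    > "$root/changed.list"

# ...archive just those, then refresh the stamp for the next run,
# as the cron job did.
tar -cf "$root/incr.tar" -T "$root/changed.list"
touch "$root/stamp"
```

The refresh has to happen after a successful write, of course, or a failed run silently drops that night's changes from the next incremental.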

I don't know about tar's blocking factor interacting with gzip (I should go read the gzip man page now); IIRC, the blocking factor was more a matter of what made the tape drive happiest, dealing with physical speed of the drive, buffer size, and reliability, hoping to come as close to "streaming" as your hardware would let you get away with. But there are cobwebs obscuring some of those corners of my memory.
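For what it's worth, the blocking factor (-b, in 512-byte units) only changes how the raw archive is padded into records for the device; once the stream goes through gzip it's just bytes, so the write-side factor is invisible to the compressor. A small demo of that, with invented filenames:

```shell
demo=$(mktemp -d)
printf 'some data\n' > "$demo/file.txt"

# Blocking factor 20 (GNU tar's default, 20 * 512 = 10240-byte records)
# versus 128 (128 * 512 = 65536-byte records).
tar -b 20  -C "$demo" -cf "$demo/b20.tar"  file.txt
tar -b 128 -C "$demo" -cf "$demo/b128.tar" file.txt

# The raw archives are padded out to different record sizes...
wc -c "$demo/b20.tar" "$demo/b128.tar"

# ...but a gzip pipeline reads back fine regardless of the factor used:
tar -b 128 -C "$demo" -c file.txt | gzip > "$demo/arc.tgz"
gzip -dc "$demo/arc.tgz" | tar -t
```

On a real drive the factor matters for keeping the mechanism streaming; for a compressed archive on disk it's mostly just padding.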

So I'm hoping other folks whose skills are more current (or at least whose memories are fresher) will chime in here.

As I recall, I mostly used 'dd' to read tapes other people had made using different techniques than I usually used, and to try to recover stuff off a tape that had a glitch, not to make backups myself. (But I do remember being really glad it was there.) Never did much with 'cpio' and 'backup' beyond experimenting enough to decide I preferred 'tar'; did use the rewind and retension commands of 'mt', but I think that was all I did with it.
 
posted by [identity profile] eviltomble.livejournal.com at 11:08pm on 2005-12-03
Actually, by "backup tools" I meant "tools for when other tools fail", not "tools for making backups",
Oh! Oops... I saw ftp and Kermit etc. as being systems for moving data about, that being roughly what backing up involves...
All I ever used extensively under *ix was tar, but tar did what I needed it to do. I'm mostly hoping that once I install a Jumbo drive in a Linux box it'll work just like the old QIC-20 drives did way back when, and spare me having to climb a learning curve.
*nod* Tar's what I'm still planning to use. I expect you know the Ftape driver is still supported on Linux, and there's an "FTape HOWTO" if that's of use to you :)
Appending to a tape would have been/will be useful, but I never quite trusted it, so I never really used that feature.
*nod* I've heard that it can trounce existing data, so the scheme I had in mind was to rotate through a set of 3 tapes, 1 a day, putting the latest L2 backup at the end. That way, whatever got damaged, if anything, should be stale data, and the previous working backup should be on another tape. L1s I hope to put on 2 alternating tapes in similar fashion, maybe once a week. L0/full backups wouldn't be appended because they'd be far too big, and I plan on doing them for separate "areas" of the filesystem to reduce the time for each one.
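The append step in that rotation could look like this sketch, with a plain file standing in for the tape since tar -r (append) also works on regular files. On a real DDS drive you'd use the non-rewinding device (e.g. /dev/nst0) and position past the existing archives with something like "mt -f /dev/nst0 fsf N" first, and the drive has to be one that can backspace over the end-of-archive marker:

```shell
tape=$(mktemp -d)
mkdir "$tape/day1" "$tape/day2"
touch "$tape/day1/mon.txt" "$tape/day2/tue.txt"

tar -C "$tape" -cf "$tape/tape.tar" day1    # Monday's L2 backup
tar -C "$tape" -rf "$tape/tape.tar" day2    # Tuesday's, appended at the end
tar -tf "$tape/tape.tar"                    # both days now listed
```

Since -r rewrites the end-of-archive blocks in place, a failure mid-append should only hit the newest data, which matches the "whatever gets damaged should be stale" reasoning above.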

Unfortunately I get the impression, looking again, that you can't remove a DDS tape without rewinding it, so it'd need to be fast-forwarded back into position each time. The main point of appending to tapes would be to avoid wearing out the start of the tape, and I'm not sure that still applies in this case. Of course, you're still supposed to "retension" the things anyway :P

A key element in both setups was that the cron script wrote the tape with "tar cv" then rewound it and did a "tar tv" to check whether each file that the first pass said it was writing was actually there
Yeah, I was thinking something like that too. Very wise :)
I think I played with 'tar u' and wound up deciding to use 'find / -newer archive-datestamp' for incremental backups instead
The recent versions of gnutar seem to have an even more useful system for this that also stores info about files that have gone, using the -G and -g options. These are for creating new incremental archives though, rather than altering existing ones.
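A minimal sketch of the -g (--listed-incremental) mechanism, with invented paths: the snapshot file records per-file state, so a second run with the same snapshot picks up only what changed (and the dumped directory records are what let a restore notice deletions):

```shell
inc=$(mktemp -d)
mkdir "$inc/src"
touch "$inc/src/keep.txt"

# Level 0: no snapshot exists yet, so everything is archived.
tar -g "$inc/snap" -C "$inc" -cf "$inc/level0.tar" src

# Something changes between runs...
touch "$inc/src/added.txt"

# Level 1: work on a copy of the snapshot (one per level, so earlier
# levels stay reusable); only the new file gets archived.
cp "$inc/snap" "$inc/snap.1"
tar -g "$inc/snap.1" -C "$inc" -cf "$inc/level1.tar" src
tar -tf "$inc/level1.tar"
```

-G (--incremental) is the older format of the same idea without the external snapshot file; -g is the one the GNU tar documentation steers you toward.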
I don't know about tar's blocking factor interacting with gzip (I should go read the gzip man page now)
Yeah, I got what the blocking factor was for, but I had various confusions about it. As I've not been able to unearth much more info on tar and tapes, I'll probably ask on an appropriate tech support type forum. Thanks for your suggestions! :)
 
posted by [identity profile] dglenn.livejournal.com at 02:56am on 2005-12-02
Oh, and as for how 'sort' and/or 'uniq' failed me earlier: in the web stats post, there are a few URLs that appear twice when they should have been summed into one line, and there were a few more that I fixed by hand. (I don't think it was a matter of inconsistent trailing spaces, but I've not bothered to recreate all my steps to make sure. I'll certainly watch out for those next time and hope it all works right. Or obtain a proper web stats analysis program.)
 
posted by [identity profile] eviltomble.livejournal.com at 11:18pm on 2005-12-03
Hmm, so there are... I suppose if you still have the source data, the thing to do would be to check the output of sort before it gets fed to uniq -c, and see if things appear out of place there. If not, try getting a diff of the lines out of uniq -c before they go into the next sort (I presume you used a sort on the initial numeric field at the end?). Getting the difference between lines might be fiddly though.
Maybe "od" or similar might be of more use? Personally I wouldn't want to resort to somebody else's program though, I'd just have to spend time learning the thing and likely find it isn't what I'm after.
 
posted by [identity profile] dglenn.livejournal.com at 06:03pm on 2005-12-05
Oh, I still have the source data, of course, but the obsessive spell broke and I wandered away from the problem. When I get around to doing the same thing with the full logs (a few years worth), I'll want to make sure it all sorts correctly then.

I'm wondering whether a 3-dimensional graph can be arranged in a way that'll show me interesting things about the relative popularity of different pages over time, rather than just looking pretty. (Maybe if I limit the graph to the few most popular pages, instead of trying to crowd all of them in there...)
