Well, every once in a while we’re forced to do something that isn’t particularly interesting or pleasant. Last week it happened again: I had to import a few pretty big PSTs (most 2Gig, one 10Gig, with about 100.000 emails) into our dovecot IMAP.
Doing this with Outlook itself was out of the question: it took way too long, even on the local network (many hours for a 1G file) and was prone to hanging and crashes, which were obviously a pain to debug and start over.
Thunderbird was unfortunately not much better, since – at least in our tests – it didn’t import the read status of the emails (they were all marked as unread) and also wasn’t particularly good at handling folders with strange names, containing “.”, “/” or some more obscure characters. We had used it before for smaller files, where manually dealing with the problems was acceptable, but this time it required something a bit more elaborate, if we were to keep our sanity.
Enter libpst. It includes the handy readpst utility which dumps all emails in usable formats in a directory tree, one directory per folder. Unfortunately the Debian version is somewhat outdated and doesn’t support the newer Outlook formats, so I did some packaging and even a little bit of patching. It seems Thunderbird also uses this library, which would explain why it didn’t handle the Read-Status (haven’t confirmed this though; just read it somewhere).
The last step was this not-so-little script, which uses the dumped directories from readpst and imports them in IMAP. It would have probably been a bit more elegant to use libpst directly, but I unfortunately didn’t have the time to mess around with that. I did have to mess around a lot with encodings though, ergo the unholy chaos with unicode()s and str.encode()s thrown around like rice at a wedding (I could never really wrap my head around charset problems; the subject boggles my mind to this very day).
code after the jump
Read the rest of this entry »