My Self-Hosted GMail Backup
For the past few months, I’ve had a bit of a background project going to ensure that my cloud-hosted personal data is safely archived on my own, self-hosted hardware, just in case. Google services are nice 'n' all, but I’m not 100% happy trusting them with everything in the long run.
Part of this project has been to archive my old email collection from GMail, which dates back to the initial public beta in 2004(ish?) -- and make it searchable, because what’s the point in having all that email if you can’t find the needle in the 20-year haystack when you need it?
Enter “notmuch” -- a “fast, global-search and tag-based email system”, which runs as a set of UNIX CLI commands, and is inspired by Sup, a mail reader I used previously. I have a self-hosted home server running Ubuntu 20.04 with a chunky SATA disk, so that's where I'll run it.
Here’s the process I followed:
Order a Google Takeout of your GMail account. This takes a couple of days to prepare. Request the 50GB tgz files.
When you get the email telling you it’s ready, download the files (this is awkward: you can only download one at a time, and only via your web browser -- not fun). Then scp them to your server, onto a disk with lots of free space (/x/4 in my case).
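For me that looked something like this (hypothetical hostname; adjust the paths to wherever your browser put the files and wherever your big disk lives):
scp ~/Downloads/takeout-20250322T145242Z-*.tgz jm@homeserver:/x/4/tmp/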
Extract each one:
cd /x/4/tmp
tar xvfz takeout-20250322T145242Z-001.tgz
tar xvfz takeout-20250322T145242Z-002.tgz
...
rm takeout-20250322T145242Z-00*tgz
You will wind up with a few bits of uninteresting metadata, and one gigantic mbox file: Takeout/Mail/All\ mail\ Including\ Spam\ and\ Trash.mbox . In order to make this useful, it needs to be converted into Maildir format, so install “mb2md”:
sudo apt install mb2md
Now run it, creating a GMailTakeout directory for the result:
mkdir -p /x/4/GMailTakeout
mb2md -s /x/4/tmp/Takeout/Mail/All\ mail\ Including\ Spam\ and\ Trash.mbox -d /x/4/GMailTakeout
This takes quite a while for 20 years of email! Unfortunately, the resulting single directory is still unusably huge, so split it into 100 new Maildir folders:
cd /x/4/GMailTakeout/cur
find . -type f -print > /tmp/dirlisting

# spread the messages evenly across 100 subdirs, dir_000 .. dir_099
perl -ne '
  $dir = sprintf("dir_%03d", ($. % 100));
  (-d $dir) or mkdir($dir);
  chop;
  rename($_, "$dir/$_") or die "cannot rename $_";
' /tmp/dirlisting

# turn each dir_NNN into a proper Maildir, maildir_NNN, with cur/new/tmp
cd /x/4/GMailTakeout
mv cur/* .
for f in dir_* ; do mkdir mail$f mail$f/{new,tmp} ; mv $f mail$f/cur ; done
The result of this is 100 Maildirs, /x/4/GMailTakeout/maildir_000 to /x/4/GMailTakeout/maildir_099, each containing about 300MB of email, in my case.
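A quick sanity check before going further (just eyeballing; the numbers are what the split above should produce):
ls -d /x/4/GMailTakeout/maildir_* | wc -l   # should print 100
du -sh /x/4/GMailTakeout/maildir_042        # ~300MB each, in my case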
There really isn't much need to keep the mails labelled as spam, so let's just nuke them in advance:
grep -rl 'X-Gmail-Labels: Spam' . | xargs -n 100 rm -f
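Before running that rm for real, a dry-run count of what will match is cheap reassurance:
grep -rl 'X-Gmail-Labels: Spam' . | wc -l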
The next step is to install “notmuch” and create a “notmuch” configuration. I used the Debian-packaged “notmuch”, version 0.29.3. Install it using apt-get, then run “notmuch”; accept the defaults for the config, and don’t add any mail folders yet.
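Concretely, that’s something like this (assuming the stock Ubuntu 20.04 package; the first bare run of notmuch walks you through the config interactively):
sudo apt-get install notmuch
notmuch    # answer the setup prompts; accept defaults, add no folders yet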
My initial attempt was simply to import the lot in one go. This went badly: it threw up a multi-day progress estimate, offered no safe way to checkpoint partial progress, and quickly started consuming lots of RAM, which made me suspect some leakage.
I aborted it and instead tried indexing each dir one by one:
for f in /x/4/GMailTakeout/maildir_* ; do ln -s $f ~/mail/ && nice notmuch new ; done
Unfortunately, this also turned out badly. The import of each maildir gradually slowed as data built up in notmuch’s Xapian indexes. After processing about 60 maildirs, memory consumption during the import became a problem, and the “notmuch” processes started being killed by the Linux OOM killer. In a couple of cases this resulted in corrupt index files and data loss. Ouch.
So I started again, with a new approach:
#!/bin/sh
set -exu

mkdir -p /x/4/GMailTakeout/notmuchbackup/xapian/

for f in /x/4/GMailTakeout/maildir_0*
do
  # link in the next maildir and index it
  ln -s $f ~/mail/ && nice notmuch new

  # trim the indexes, then snapshot them in case a later run dies
  nice notmuch compact
  cp /home/jm/mail/.notmuch/xapian/* /x/4/GMailTakeout/notmuchbackup/xapian/
done
Calling “notmuch compact” does seem to help, trimming the size of the indexes as it goes, and taking a copy of the Xapian indexes in a backup dir adds some extra safety. Since the “-e” shell flag is set, any OOM kill or other random failure will abort the entire script, ensuring the last backup is still safe to use for recovery.
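If an import does die and mangle the live index, recovery should just be a matter of copying the last snapshot back; a sketch, assuming the same paths as above:
rm -rf ~/mail/.notmuch/xapian
mkdir ~/mail/.notmuch/xapian
cp /x/4/GMailTakeout/notmuchbackup/xapian/* ~/mail/.notmuch/xapian/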
Unfortunately this still got bogged down and started OOMing fairly reliably after about maildir_065, two days into the process. At that point, I decided to keep that set of dirs as “notmuch config 1” and start a separate import process, into another index, as “notmuch config 2”. Accordingly, I moved ~/mail to ~/mail1 and ~/.notmuch-config to ~/.notmuch-config1, created a ~/mail2, and started a new notmuch config file pointing at that instead. Ideally I’ll be able to merge the indexes at some point, but it’s no biggie.
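In commands, that shuffle was roughly this (the last step is a hand-edit; the relevant setting is path under the [database] section):
mv ~/mail ~/mail1
mv ~/.notmuch-config ~/.notmuch-config1
mkdir ~/mail2
cp ~/.notmuch-config1 ~/.notmuch-config2
# then edit ~/.notmuch-config2 to set [database] path=/home/jm/mail2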
With these two aliases, it’s pretty painless:
alias notmuch1='notmuch --config=$HOME/.notmuch-config1'
alias notmuch2='notmuch --config=$HOME/.notmuch-config2'
After another day or so of indexing, this is the result --
du -sh /home/jm/mail?/.notmuch/xapian
19G     /home/jm/mail1/.notmuch/xapian
4.2G    /home/jm/mail2/.notmuch/xapian
Notmuch supports pretty much all the nice email search features that GMail does, but seemingly more reliably, and faster; I’ve already been able to use this new mail index to find a mail that (worryingly!) GMail's own search can’t seem to locate -- my license for the Moom OSX window manager tool purchased over a decade ago:
time notmuch1 search moom "Many Tricks"
thread:00000000000034fe   2013-10-15 [1/1] Many Tricks; Your Many Tricks purchase (inbox unread)
thread:00000000000c267b   2013-10-15 [1/1] sales@manytricks.com; Your Moom License (attachment inbox unread)

real    0m0.068s
user    0m0.048s
sys     0m0.016s
And it’s just nice to have 20 years of email archived safely, off the cloud, and indexed.
Next steps? Maybe lieer would be worth trying, to pull down incremental updates going forward. Let's see.
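If I do, the quickstart looks roughly like this, going by lieer's docs (gmi is its CLI; the address is hypothetical, and the sync dir should live under the notmuch database path):
pip install lieer
mkdir -p ~/mail1/gmail && cd ~/mail1/gmail
gmi init my.address@gmail.com   # hypothetical address; kicks off the OAuth flow
gmi sync                        # incremental pulls from here on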