(Mostly) Automatic File Backups in Linux

Introduction, and my philosophy on backups:

My philosophy for general file storage (or at least what works for me) is to have a centralized server, where I store all files that are worth keeping, and to have regular backups. I never store any files on my computers unless I am actively working on them (and even then, I usually have a cron job rsync the files to the server every few minutes). Any one of my computers can crash and burn, and it will be no big deal (in fact one of my laptops completely died two days ago... No worry!). Even if my server dies I won't have much downtime! I will describe my method for keeping backups in detail below, and hopefully I can pass some ideas on to you. The scripts I use are fairly specific to my needs, but I think they are still generic enough that they may be easily modified to suit the needs of others. I assume the reader knows enough to figure out what the scripts do on their own, but I am willing to assist anyone who needs help getting one of the scripts to work for their needs (just don't expect me to write a new script for you).

My file server is simply a Linux box running Samba and NFS servers, as well as password protected HTTPS so I can easily grab any files I need from remote locations so long as there is internet connectivity. My files are stored on external USB drives, one of which is the master, which is the drive I will actively use, and the second drive is the backup. Every night I have a cron job run a script (backup_files.sh) which uses rsync to mirror the two drives. In addition, the script removes any junk files that may have been left behind on the drive, such as backup files from vi, and metadata that my Mac likes to spew all over the place (which I can't stand!). The script also keeps a log, which is important to check every so often to make sure everything is actually working. Before doing anything, the script makes sure both drives are mounted, so nothing bad happens in the event one of the drives is missing (like the master drive crashing, and the backup script erasing everything on the backup to "synchronize" them). I will describe how this works later.

Simply having a drive and its backup is not enough, in my opinion. If one drive fails, you are left with only one good copy of everything. If there is a fire, both drives are gone. So, every month or so I swap the backup drive with another, and I keep the unused drive at a remote location. My server and the entire building it is in can be destroyed with everything in it, and my files will still be safe! In addition to the backup I keep on my home server, I also back up all my files to my computer at work with a separate script (backup_files_remotely.sh). This backup also runs nightly, and executes after the local backup has completed. The script requires an RSA key pair, with the private key on my home server, and the public key in $HOME/.ssh/authorized_keys on my work server.

For the most part, using the scripts I describe below, I have a fully automated backup system. The only things I need to do are check the logs once in a while to make sure nothing broke, and swap out my backup drives.

Main backup script:

Below is the main backup script on my home server, which uses rsync to mirror the contents of a master drive to a backup drive. Before running a script like the one below for the first time, it is useful to pass the -n option to rsync to do a dry run. This way rsync will tell you exactly what it plans on doing, which will save you from accidentally removing files with a broken script. The $@ argument after rsync allows any additional arguments given on the command line to be passed directly to rsync (such as -n to test the script before executing it for real) without having to edit the script.
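
As a small illustration of how $@ forwards extra options, here is a sketch that only prints the rsync command instead of running it. The function name show_rsync_command and the /src/ and /dst/ paths are made up for this example and are not part of my scripts.

```shell
#!/bin/sh
# Hypothetical helper: prints the rsync command a backup script would run,
# with any extra command-line options inserted where "$@" appears.
show_rsync_command() {
  echo rsync --delete-after -av "$@" /src/ /dst/
}

show_rsync_command        # prints: rsync --delete-after -av /src/ /dst/
show_rsync_command -n     # prints: rsync --delete-after -av -n /src/ /dst/
```

In the same way, running "./backup_files.sh -n" hands -n to rsync, so the whole backup becomes a dry run.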

This script will first check for the presence of a file named .identity in the root directory of the master and backup paths. This ensures that you do not delete the contents of the backup drive when the master drive is not mounted, and that you do not copy the contents of the master drive to the root drive of your system (potentially filling it to capacity) when the backup drive is not mounted. These are both problems I encountered before I finally added this drive checking feature to the script. The USB drive mounting script shown below also makes use of the .identity file to determine where to mount the drive, based on the name of the drive stored in the .identity file.
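
The drive check boils down to a test like the one sketched here. The function name drive_present is hypothetical, and the temporary directories only stand in for the real mount points so the example is safe to run anywhere.

```shell
#!/bin/sh
# Returns 0 only if the directory contains a .identity file,
# which is how a mounted drive is told apart from an empty mount point.
drive_present() {
  [ -e "$1/.identity" ]
}

# Throwaway directories standing in for the master and backup mount points.
demo=$(mktemp -d)
mkdir "$demo/files" "$demo/backup_files"
echo "files0" > "$demo/files/.identity"   # tag only the "master" drive

if drive_present "$demo/files" && drive_present "$demo/backup_files"; then
  echo "both drives present, safe to run rsync"
else
  echo "a drive is missing, refusing to run rsync"
fi
```

Tagging a real drive is just a matter of writing its name into the file, for example with echo "files0" > /home/nick/files/.identity while the drive is mounted there.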

backup_files.sh
#!/bin/bash
# Note:  FILES_DIR and BACKUP_FILES_DIR must both contain a file with filename .identity for this script to run
FILES_DIR=$HOME"/files/"
BACKUP_FILES_DIR=$HOME"/backup_files/"
EXCLUDE="lost+found .identity"
LOG_FILE=$HOME"/logs/backup_files_log.txt"
KEEP_LOG="1" # set to 0 to disable, 1 to keep a running log, 2 to delete the log and record only current session
TMP_FILE=$HOME"/.backup_files_running"

if [ ! -e $TMP_FILE ]; then
  touch $TMP_FILE

  EXCLUDED=""
  for i in $EXCLUDE; do
    EXCLUDED="$EXCLUDED --exclude=$i";
  done

  if [[ $KEEP_LOG -eq 1 || $KEEP_LOG -eq 2 ]]; then
    if [ $KEEP_LOG -eq 2 ]; then
      rm -f $LOG_FILE
    fi
    if [ -e $FILES_DIR/.identity ] && [ -e $BACKUP_FILES_DIR/.identity ]; then
      date +%F\ %T\ %A | tee -a $LOG_FILE

      ID_FILES=`cat $FILES_DIR/.identity`
      ID_BACKUP_FILES=`cat $BACKUP_FILES_DIR/.identity`
      echo "" | tee -a $LOG_FILE
      echo "Starting rsync backup, from" $ID_FILES "to" $ID_BACKUP_FILES | tee -a $LOG_FILE
      rsync $EXCLUDED --delete-after -av "$@" $FILES_DIR $BACKUP_FILES_DIR | tee -a $LOG_FILE
      ERROR=${PIPESTATUS[0]} # exit status of rsync, not of tee

      echo "" | tee -a $LOG_FILE
      date +%F\ %T\ %A | tee -a $LOG_FILE
      echo "--------------------------------------------------------------------------------" | tee -a $LOG_FILE

      rm -f $TMP_FILE
      exit $ERROR
    else
      date +%F\ %T\ %A | tee -a $LOG_FILE

      echo "" | tee -a $LOG_FILE
      echo "Drives are not mounted (or no .identity file exists on drive)" | tee -a $LOG_FILE 
      if [ ! -e $FILES_DIR/.identity ]; then
        echo $FILES_DIR "is not mounted" | tee -a $LOG_FILE
      fi
      if [ ! -e $BACKUP_FILES_DIR/.identity ]; then
        echo $BACKUP_FILES_DIR "is not mounted" | tee -a $LOG_FILE
      fi

      echo "" | tee -a $LOG_FILE
      echo "--------------------------------------------------------------------------------" | tee -a $LOG_FILE

      rm -f $TMP_FILE
      exit 3
    fi
  else
    if [ -e $FILES_DIR/.identity ] && [ -e $BACKUP_FILES_DIR/.identity ]; then
      date +%F\ %T\ %A

      ID_FILES=`cat $FILES_DIR/.identity`
      ID_BACKUP_FILES=`cat $BACKUP_FILES_DIR/.identity`
      echo ""
      echo "Starting rsync backup, from" $ID_FILES "to" $ID_BACKUP_FILES
      rsync $EXCLUDED --delete-after -av "$@" $FILES_DIR $BACKUP_FILES_DIR
      ERROR=$?

      echo ""
      date +%F\ %T\ %A
      echo "--------------------------------------------------------------------------------"

      rm -f $TMP_FILE
      exit $ERROR
    else
      date +%F\ %T\ %A

      echo ""
      echo "Drives are not mounted (or no .identity file exists on drive)"
      if [ ! -e $FILES_DIR/.identity ]; then
        echo $FILES_DIR "is not mounted"
      fi
      if [ ! -e $BACKUP_FILES_DIR/.identity ]; then
        echo $BACKUP_FILES_DIR "is not mounted"
      fi

      echo ""
      echo "--------------------------------------------------------------------------------"

      rm -f $TMP_FILE
      exit 3
    fi
  fi
else
  echo "Backup is already running"
  exit 2
fi

Backing up files remotely:

Below is the script I use for backing up my files on my server to a remote computer (at work). The script requires the use of an RSA key pair to run on its own, with the private key located on the local computer which initiates the script (in my case, my home server), and the public key located in $HOME/.ssh/authorized_keys of the remote computer (in my case, my computer at work). This script does not check for .identity files in the root directories like the backup_files.sh script does. I keep this script located on the master drive I am backing up, so it is impossible to have this script run and clear out the remote drive when the master drive is not mounted.
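
For anyone who has not set up a key pair before, the sketch below shows the idea. The ssh-keygen command in the comment is the usual way to create the pair. The helper install_public_key is a hypothetical function (not part of my scripts) that does by hand what has to happen on the remote computer: appending the public key to authorized_keys, without adding it twice.

```shell
#!/bin/sh
# On the computer that starts the backup, a passphrase-less key pair
# could be created with something like:
#   ssh-keygen -t rsa -b 4096 -N "" -f "$HOME/.ssh/rsa_key"
# The resulting rsa_key.pub then has to end up in authorized_keys
# on the remote computer.

# Hypothetical helper: append a public key file to an authorized_keys file,
# skipping the append if that exact line is already present.
install_public_key() {
  pub_file=$1
  auth_file=$2
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  chmod 600 "$auth_file"
  grep -qxF -f "$pub_file" "$auth_file" || cat "$pub_file" >> "$auth_file"
}
```

On the remote computer this would be called as install_public_key rsa_key.pub "$HOME/.ssh/authorized_keys". Where it is available, ssh-copy-id does the same job over the network.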

backup_files_remotely.sh
#!/bin/bash
# To backup multiple source dirs into the backup dir, separate dirs with a space and do not end dir paths with a slash
# To copy the contents of the source dir into the backup dir, end with a slash
USERNAME="nick"
SSH_KEY=$HOME"/.ssh/rsa_key"
SOURCE_DIR=$HOME"/files/"
BACKUP_DIR="remote.server_address.net:files/"
EXCLUDE="lost+found .identity"
LOG_FILE=$HOME"/logs/backup_files_remotely_log.txt"
KEEP_LOG="2" # set to 0 to disable, 1 to keep a running log, 2 to delete the log and record only current session
TMP_FILE=$HOME"/.backup_files_remotely_running"

EXCLUDED=""
for i in $EXCLUDE; do
  EXCLUDED="$EXCLUDED --exclude=$i";
done

if [ ! -e $TMP_FILE ]; then
  touch $TMP_FILE
  if [[ $KEEP_LOG -eq 1 || $KEEP_LOG -eq 2 ]]; then
    if [ $KEEP_LOG -eq 2 ]; then
      rm -f $LOG_FILE
    fi
    date +%F\ %T\ %A | tee -a $LOG_FILE

    echo "" | tee -a $LOG_FILE
    rsync -e "ssh -i $SSH_KEY" $EXCLUDED --delete-after -av "$@" $SOURCE_DIR $USERNAME@$BACKUP_DIR | tee -a $LOG_FILE
    ERROR=${PIPESTATUS[0]} # exit status of rsync, not of tee

    echo "" | tee -a $LOG_FILE
    date +%F\ %T\ %A | tee -a $LOG_FILE
    echo "--------------------------------------------------------------------------------" | tee -a $LOG_FILE
  else
    date +%F\ %T\ %A

    echo ""
    rsync -e "ssh -i $SSH_KEY" $EXCLUDED --delete-after -av "$@" $SOURCE_DIR $USERNAME@$BACKUP_DIR
    ERROR=$?
    date +%F\ %T\ %A

    echo ""
    echo "--------------------------------------------------------------------------------"
  fi
  rm -f $TMP_FILE
  exit $ERROR
else
  echo "Backup is already running"
  exit 2
fi

Backing up laptops and computers:

To automatically back up the contents of a laptop to a server (or remote computer), the backup_files_remotely.sh script may be used, with the private key on the laptop, and the public key in the server's $HOME/.ssh/authorized_keys. I keep a backup directory on the desktop of my laptops where I store any files that I actively work with, which synchronizes itself to a directory on my server every 15 minutes.
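
A 15 minute schedule like this can be set up with a single crontab line. The sketch below just prints such a line. The location $HOME/bin/ for the script is an assumption for illustration, as is throwing away the output (the script keeps its own log).

```shell
#!/bin/sh
# Prints a crontab line that runs the sync script every 15 minutes.
# The script path is an assumed location; adjust it to wherever the script lives.
print_laptop_cron_entry() {
  echo '*/15 * * * * $HOME/bin/backup_files_remotely.sh >/dev/null 2>&1'
}

print_laptop_cron_entry
# To add it to the existing crontab, one could run:
#   (crontab -l 2>/dev/null; print_laptop_cron_entry) | crontab -
```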

Checking for errors between master and backup copies:

While rsync generally works well for keeping everything backed up, by default the only things it checks when deciding whether a file has changed are the modification time and the file size. During a transfer, rsync verifies each copied file with a 128-bit whole-file checksum (MD4 in older versions, MD5 since rsync 3.0) to make sure it copied correctly. I have noticed instances where this seemed to fail, but it could be that the data in one of the files mutated sometime between the rsync transfer and the file comparison which I performed later. As a rough estimate, in my own experience about one file differs for every 500GB copied. If a program modifies the contents of a file but keeps the modification time the same, and the resulting file size happens to stay fixed, rsync will not catch any difference between the original and updated files. To get around this problem, we can force rsync to compare files based on file size and checksums with the -c switch. With this option, a checksum is generated for every file on the sending side, and on the receiving side only for files whose size is the same as that on the sending side. This results in a much slower backup, but at least it will catch some of the holes in the typical but much faster method. To verify the integrity of the backed up data, one could schedule a cron job to run the backup_files.sh script with the -c switch every few weeks. This job could be scheduled to start shortly before the nightly backup: as long as the checksumming version of the backup script starts before the usual one, the nightly backup without checksumming will see that a backup is already running and exit.
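
The way the two runs avoid stepping on each other can be sketched as follows. This is a simplified stand-in for the lock file logic in backup_files.sh: start_backup and finish_backup are hypothetical names, the lock lives in a temporary directory, and rsync is only echoed rather than run.

```shell
#!/bin/sh
# Stand-in for $HOME/.backup_files_running, kept in a throwaway directory.
LOCK_DIR=$(mktemp -d)
LOCK="$LOCK_DIR/.backup_files_running"

# Creates the lock and "starts" a backup, or gives up with status 2
# if another backup already holds the lock.
start_backup() {
  if [ -e "$LOCK" ]; then
    echo "Backup is already running"
    return 2
  fi
  touch "$LOCK"
  echo "started: rsync --delete-after -av $* SRC DST"
}

# Removes the lock once the backup is done.
finish_backup() {
  rm -f "$LOCK"
}

start_backup -c                                      # slow checksumming run begins first
start_backup || echo "nightly run returned $?"       # nightly run arrives while it is going
finish_backup
```

One caveat with this kind of lock: if the script is killed partway through, the lock file stays behind and has to be removed by hand before the next backup will run. Adding a line such as trap 'rm -f "$TMP_FILE"' EXIT after the lock is created is one possible way around that.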

Another option is to directly compare the files with diff. Twice a month I have a cron job run the script below to compare the master and backup drives with diff. If a file differs it will let me know, and I will compare the file with another backup (like my backup at work, using md5sum to compare checksums if the files are large), and recopy the file to the drive with the altered copy (and re-check that file with diff to make sure it took).
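
Working out which copy is the bad one might look like the sketch below. It assumes md5sum is installed (it is part of GNU coreutils), the function names are hypothetical, and the three small files only stand in for the master, backup, and remote copies of a real file.

```shell
#!/bin/sh
# Prints only the checksum field of md5sum's output for one file.
md5_of() {
  md5sum "$1" | cut -d ' ' -f 1
}

# Returns 0 if both files have the same MD5 checksum.
same_md5() {
  [ "$(md5_of "$1")" = "$(md5_of "$2")" ]
}

# Throwaway files standing in for the three copies of one file.
work=$(mktemp -d)
printf 'important data\n' > "$work/master"
printf 'important data\n' > "$work/remote"
printf 'important dat4\n' > "$work/backup"   # simulated corruption, same size

for copy in master backup; do
  if same_md5 "$work/$copy" "$work/remote"; then
    echo "$copy matches the remote copy"
  else
    echo "$copy differs from the remote copy"
  fi
done
```

For the copy at work, the checksum can be computed on the remote side by running md5sum there over ssh, so the large file itself never has to be transferred just to compare it.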

backup_diff.sh
#!/bin/bash
FILES_DIR=$HOME"/files/"
BACKUP_FILES_DIR=$HOME"/backup_files/"
LOG=$HOME"/logs/backup_diff.txt"
KEEP_LOG="1" # set to 0 to disable, 1 to keep a running log, 2 to delete the log and record only current session

if [[ $KEEP_LOG -eq 1 || $KEEP_LOG -eq 2 ]]; then
  if [ $KEEP_LOG -eq 2 ]; then
    rm -f $LOG
  fi
  date +%F\ %T\ %A | tee -a $LOG
  echo 'Starting backup diff between' `cat $FILES_DIR/.identity` 'and' `cat $BACKUP_FILES_DIR/.identity` | tee -a $LOG
  diff -rq $FILES_DIR $BACKUP_FILES_DIR | tee -a $LOG
  ERROR=${PIPESTATUS[0]} # exit status of diff, not of tee
  date +%F\ %T\ %A | tee -a $LOG
  echo "--------------------------------------------------------------------------------" | tee -a $LOG
else
  date +%F\ %T\ %A
  echo 'Starting backup diff between' `cat $FILES_DIR/.identity` 'and' `cat $BACKUP_FILES_DIR/.identity`
  diff -rq $FILES_DIR $BACKUP_FILES_DIR
  ERROR=$?
  date +%F\ %T\ %A
  echo "--------------------------------------------------------------------------------"
fi
exit $ERROR

External USB hard drive mounting script:

The problem with using USB drives is the same drive may be mapped to a different device after reboot or removal (for example, a drive which was originally mapped to /dev/sdc1 may later be /dev/sdd1 the next time it is detected by the computer). The way I decided to keep all my drives in order is to tag them with a file named .identity within the root of each drive, with the name of the drive in the contents of the .identity file. This way upon bootup, I can have a script read the .identity file, and decide where that drive needs to be mounted. If a device doesn't have a .identity file, I consider that drive not present (this is how the backup script above checks that the drives are both present before proceeding). Below is the script (mount_drives.sh) I use on my home server for mounting my master drive and one of the backups (note that I have four backup drives all mounting to the same location, but at any given time only one of them will ever be on the system).
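
The lookup from the contents of a .identity file to a mount point can be sketched compactly with a case statement. The drive names and paths are the ones I use on my server. The function name mount_point_for_id is hypothetical, and this is a more compact variation on the idea rather than the exact code in mount_drives.sh.

```shell
#!/bin/sh
# Maps the name stored in a drive's .identity file to its mount point.
# Returns 1 (and prints nothing) for a name that is not recognized.
mount_point_for_id() {
  case "$1" in
    files0) echo "/home/nick/files/" ;;
    files1|files2|files3|files4) echo "/home/nick/backup_files/" ;;
    *) return 1 ;;
  esac
}

for id in files0 files3 unknown; do
  if dir=$(mount_point_for_id "$id"); then
    echo "$id -> $dir"
  else
    echo "$id -> not a known drive"
  fi
done
```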

This script acts like a service (although once it mounts or unmounts the drives, it is finished and no longer resident in memory), and accepts start, stop, and restart commands. When you "start" the script (the default if no argument is passed to the script), the script steps through the devices listed in DRIVES and looks for drives to mount, and mounts them in the locations specified. Sending the "stop" argument to the script will have the script look at all of the directories where a drive may be mounted, and check if they contain a .identity file (indicating a drive is mounted), and then unmount any mounted drives. Sending "restart" to the script simply runs "stop" then "start" with a 2 second delay in between.

I have this set up as a service on my server, so that my external USB drives are mounted to the correct locations upon startup. In Ubuntu, this may be done by placing the script in /etc/init.d/, and running "sudo update-rc.d mount_drives.sh defaults" to create links in the /etc/rcX.d directories (which will automatically send start/stop commands to the script when changing run levels), where X is the run level. The script may be removed as a service with "sudo update-rc.d mount_drives.sh remove". When changing drives out, I'll manually run the script (as root) with the appropriate commands.

mount_drives.sh
#!/bin/bash
# bash is used because the == comparisons inside [ ] below are not supported by every /bin/sh
DRIVE_1_DIR="/home/nick/files/"
DRIVE_1_ID="files0"
DRIVE_2_DIR="/home/nick/backup_files/"
DRIVE_2_ID="files1"
DRIVE_3_DIR="/home/nick/backup_files/"
DRIVE_3_ID="files2"
DRIVE_4_DIR="/home/nick/backup_files/"
DRIVE_4_ID="files3"
DRIVE_5_DIR="/home/nick/backup_files/"
DRIVE_5_ID="files4"
TEMP_MOUNT_DIR="/mnt/"
DRIVES="/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1"

mount_drives_start() {
  for DRIVE in $DRIVES; do
    mount -t ext3 $DRIVE $TEMP_MOUNT_DIR
    if [ -e $TEMP_MOUNT_DIR/.identity ]; then
      ID=`cat $TEMP_MOUNT_DIR/.identity`
      echo "The ID for" $DRIVE "is" $ID
      umount $TEMP_MOUNT_DIR
      if [ $ID == $DRIVE_1_ID ]; then
        if [ ! -e $DRIVE_1_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_1_DIR
          mount -t ext3 $DRIVE $DRIVE_1_DIR
        else
          echo "Something is already mounted to" $DRIVE_1_DIR
          echo $DRIVE "will not be mounted"
        fi
      elif [ $ID == $DRIVE_2_ID  ]; then
        if [ ! -e $DRIVE_2_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_2_DIR
          mount -t ext3 $DRIVE $DRIVE_2_DIR
        else
          echo "Something is already mounted to" $DRIVE_2_DIR
          echo $DRIVE "will not be mounted"
        fi
      elif [ $ID == $DRIVE_3_ID  ]; then
        if [ ! -e $DRIVE_3_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_3_DIR
          mount -t ext3 $DRIVE $DRIVE_3_DIR
        else
          echo "Something is already mounted to" $DRIVE_3_DIR
          echo $DRIVE "will not be mounted"
        fi
      elif [ $ID == $DRIVE_4_ID  ]; then
        if [ ! -e $DRIVE_4_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_4_DIR
          mount -t ext3 $DRIVE $DRIVE_4_DIR
        else
          echo "Something is already mounted to" $DRIVE_4_DIR
          echo $DRIVE "will not be mounted"
        fi
      elif [ $ID == $DRIVE_5_ID  ]; then
        if [ ! -e $DRIVE_5_DIR/.identity ]; then
          echo "Mounting" $DRIVE "to" $DRIVE_5_DIR
          mount -t ext3 $DRIVE $DRIVE_5_DIR
        else
          echo "Something is already mounted to" $DRIVE_5_DIR
          echo $DRIVE "will not be mounted"
        fi
      else
        echo "The .identity file does not match any known drive"
        echo $DRIVE "will not be mounted"
      fi
    else
      umount $TEMP_MOUNT_DIR
      echo $DRIVE "does not exist, or does not have a .identity file"
    fi
  done
}

mount_drives_stop() {
  if [ -e $DRIVE_1_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_1_DIR
    umount $DRIVE_1_DIR
  else
    echo "No drive mounted at" $DRIVE_1_DIR", or no .identity file is on drive"
  fi

  if [ -e $DRIVE_2_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_2_DIR
    umount $DRIVE_2_DIR
  else
    echo "No drive mounted at" $DRIVE_2_DIR", or no .identity file is on drive"
  fi

  if [ -e $DRIVE_3_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_3_DIR
    umount $DRIVE_3_DIR
  else
    echo "No drive mounted at" $DRIVE_3_DIR", or no .identity file is on drive"
  fi

  if [ -e $DRIVE_4_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_4_DIR
    umount $DRIVE_4_DIR
  else
    echo "No drive mounted at" $DRIVE_4_DIR", or no .identity file is on drive"
  fi

  if [ -e $DRIVE_5_DIR/.identity ]; then
    echo "Unmounting" $DRIVE_5_DIR
    umount $DRIVE_5_DIR
  else
    echo "No drive mounted at" $DRIVE_5_DIR", or no .identity file is on drive"
  fi
}

mount_drives_restart() {
  mount_drives_stop
  sleep 2
  mount_drives_start
}

case "$1" in
'start')
  mount_drives_start
  ;;
'stop')
  mount_drives_stop
  ;;
'restart')
  mount_drives_restart
  ;;
*)
  mount_drives_start
esac

Drive cleanup:

Before I back up my files, I like to first clean out any metadata junk files left on the drive. The script below searches the directory passed to it on the command line for certain metadata files, and then removes them. These include "Thumbs.db" (from Windows when thumbnail caching is not disabled), ".DS_Store" and files prefixed with "._" (metadata files Mac OS X loves to spew all over anything it touches), a directory named ".TemporaryItems" in the path passed to the script (Mac OS X will leave this in the root directory of a drive after deleting files), and files suffixed with "~" (backup files from Vim and a few other text editors). This script may be used to clean a drive mounted at /home/nick/files with the command "./delete_metadata.sh /home/nick/files". If no argument is passed to the script, it will search the working directory.
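
Since the cleanup deletes files without asking, it can be worth previewing what would be removed first. The sketch below only lists matching files and deletes nothing. The function name list_metadata is hypothetical, and the temporary directory is just there to make the example safe to run.

```shell
#!/bin/sh
# Lists the junk files under a directory (default: the working directory)
# without deleting anything.
list_metadata() {
  find "${1:-.}" \( -name 'Thumbs.db' -o -name '.DS_Store' \
    -o -name '._*' -o -name '*~' \) -type f -print
}

# Throwaway directory with four junk files and one file worth keeping.
demo=$(mktemp -d)
touch "$demo/Thumbs.db" "$demo/.DS_Store" "$demo/._notes.txt" \
      "$demo/notes.txt~" "$demo/notes.txt"
list_metadata "$demo" | sort
```

The listing should show the four junk files but not notes.txt itself.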

delete_metadata.sh
#!/bin/sh
# Directory to clean; defaults to the working directory when no argument is given
DIR="${1:-.}"
echo "Searching for Thumbs.db Windows thumbnail metadata"
find "$DIR" -name 'Thumbs.db' -exec rm -vf {} \;
echo "Searching for .DS_Store Macintosh metadata"
find "$DIR" -name '.DS_Store' -exec rm -vf {} \;
echo "Searching for ._* Macintosh metadata"
find "$DIR" -name '._*' -exec rm -vf {} \;
echo "Searching for *~ backups"
find "$DIR" -name '*~' -exec rm -vf {} \;
echo "Checking for .TemporaryItems/ in" "$DIR"
# Quoting and the DIR default keep this from ever testing /.TemporaryItems/ by accident
if [ -e "$DIR/.TemporaryItems/" ]; then
  rm -rfv "$DIR/.TemporaryItems/"
fi

