High Availability Virtualization using Proxmox VE and Ceph

From Jacksonville Linux Users Group

Proxmox VE is a virtualization platform built on Linux KVM, QEMU, and OpenVZ. It is based on Debian but uses a kernel derived from RHEL 6.5. Combining Proxmox VE with Ceph yields a high availability virtualization solution with only 3 nodes and no single point of failure.

At the time of this writing, the current version of Proxmox is 3.2, and the current version of Ceph is Firefly (0.80.4).

Proxmox supports enhanced features such as live migration of VMs from one host to another, as well as auto-starting a VM on another host if the host it is running on fails.

Slides from Presentation

Testlab Overview

Network

  • VLANs:
    • 50 - management (proxmox web interface), with gateway access to the internet
      • 10.50.10.0/24 gateway 10.50.10.1
    • 55 - storage - Jumbo frames
      • 10.55.10.0/24
  • The switch is configured to use LACP (802.3ad) for the dual links to the Mac Minis. Ports are configured as 'trunk' ports carrying only tagged VLANs.

Installation and Configuration of the OS

MAC ONLY: pre-installation Steps

  • Boot into internet recovery
    • Power on Mac while holding the `Option` key
    • When boot menu appears, press `COMMAND-R` to start internet recovery
    • Note: must have ethernet plugged into onboard NIC, or Wifi available, requires DHCP to assign IP address and provide internet access!
  • Open Disk utility to perform basic disk partitioning to ready it for Linux (can this be done properly in the Debian installer? Specifically the hidden EFI partition?)
    • Choose Disk Utility from the OSX Utilities main screen
    • Highlight the main boot drive (in our case the SSD not the HDD)
    • If asked whether you want to restore the "Fusion Drive", say "Ignore"; we're not using that
    • Click on the "Partition" tab
    • Choose 1 partition as the partition layout
    • The defaults should be fine, hit apply
    • You can now go to the Apple (top left) and shutdown
  • Insert Network adapter compatible with the Debian Wheezy kernel (on-board NIC will NOT work and installer doesn't like not having a working NIC even if we aren't using it yet)

Debian 7.5 (wheezy) Base Install

  • Boot the Debian 7.5 netinstall media (dd it to a USB thumb drive)
    • If installing on Mac, hold the `Option` key during power-on and select the EFI boot option which will boot off the thumb drive
  • Skip network setup (we use VLANs, installer doesn't allow this)
    • You may need to select eth0, then when trying to detect DHCP on a network interface, cancel the operation, then choose `Do not configure the network at this time`
  • Choose `Manual` partitioning method:
    • Mac Note: Delete only the hfs+ partition, leave the EFIboot partition in place!
    • Create this partition table:
      • /boot 500M, ext4
      • Remaining disk space in LVM2 Physical Volume with VG named vg_${hostname}
      • LV 20G root, / , ext4
      • so you end up with something like:
        • /dev/sda1 is /boot
        • /dev/sda2 is a LVM2 physical volume with a volume group name of something like vg_proxmox3
        • and you have a single 20G logical volume named 'root' in that volume group that can be accessed via /dev/vg_proxmox3/root
        • The remaining disk space in the volume group is UNUSED
        • Mac Note: will also have an EFIboot partition!
      • Note: correct, we're not using swap
  • When prompted if you want to continue without a network mirror, say `yes`
  • When prompted for software to install, just select the `Standard system utilities` (should be the only option on the netinstall)

MAC ONLY: First Boot Steps

  • At this point, the Mac does not know how to boot Linux. We need to temporarily work around that by using rEFInd.
  • Obtain USB bootable version from http://sourceforge.net/projects/refind/files/0.8.2/refind-flashdrive-0.8.2.zip/download Extract the zip file and write it to the USB flash drive using dd on another computer.
  • Plug the flash drive into the Mac.
  • During boot, hold down the `Option` key and select the EFI option. This will bring up the rEFInd boot loader which can then chain load grub installed on the local hard drive.

First Login: Console (Bootstrap networking so we can SSH in)

  • Since we are using tagged VLANs, networking is down. Let's configure it just enough that we can leave the console and SSH in
    • Create our VLAN interface
      modprobe 8021q
      ifconfig eth0 up
      ip link add link eth0 name eth0.50 type vlan id 50
      
    • Get an IP address
      • via dhcp:
        dhclient eth0.50
        
      • or manually (e.g. no DHCP server available):
        ifconfig eth0.50 10.50.10.101 netmask 255.255.255.0
        route add default gw 10.50.10.1 eth0.50
        echo "nameserver 8.8.8.8" > /etc/resolv.conf
        
    • Check which IP address was assigned:
      ifconfig eth0.50 | grep "inet addr"
      
  • There is no SSH server installed yet, and we first need to configure some repositories. Make your `/etc/apt/sources.list` look like this:
    deb http://ftp.debian.org/debian/ wheezy main contrib non-free
    deb-src http://ftp.debian.org/debian/ wheezy main contrib non-free
    
    deb http://ftp.debian.org/debian/ wheezy-updates main contrib non-free
    deb-src http://ftp.debian.org/debian/ wheezy-updates main contrib non-free
    
    deb http://security.debian.org/ wheezy/updates main contrib non-free
    deb-src http://security.debian.org/ wheezy/updates main contrib non-free
    
  • Install openssh-server:
    apt-get update && apt-get install openssh-server -y
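
The bootstrap above mixes `ifconfig`/`route` (net-tools) with `ip`. As a rough sketch, the same manual setup can be expressed with iproute2 only. The function name `vlan_bootstrap_cmds` and its argument order are illustrative assumptions, not part of the original procedure. It only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the iproute2 commands that bring up a tagged
# VLAN interface with a static address and a default route.
# Usage: vlan_bootstrap_cmds PARENT_IF VLAN_ID ADDRESS/PREFIX GATEWAY
vlan_bootstrap_cmds() {
    parent="$1"
    vid="$2"
    cidr="$3"
    gw="$4"
    vif="$parent.$vid"
    echo "modprobe 8021q"
    echo "ip link set dev $parent up"
    echo "ip link add link $parent name $vif type vlan id $vid"
    echo "ip link set dev $vif up"
    echo "ip addr add $cidr dev $vif"
    echo "ip route add default via $gw dev $vif"
}

# Values taken from the testlab: VLAN 50, 10.50.10.0/24, gateway 10.50.10.1
vlan_bootstrap_cmds eth0 50 10.50.10.101/24 10.50.10.1
```

Once the printed commands look right, the output can be piped to `sh -e` as root to actually apply them.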
    

Now SSH in instead!

  • Now SSH into the IP address we grabbed!
  • The below steps are meant to be cut-and-pasted so where possible utilities like 'sed' are used to modify configuration files. It is not recommended to attempt to manually type the steps, but since you should be SSH'd into the VM host(s), this should not be an issue. With more complex statements, a detailed description of the task being performed is provided.

MAC ONLY: post-installation Steps

If you want to continue using a USB thumb drive to boot with, you can skip these steps.

Mac upgrade to 3.14 kernel

  • We need to install the 3.14 kernel because 'hfs-bless' relies on an ioctl that doesn't exist in the stock 3.2.0 Wheezy kernel
echo "deb http://ftp.debian.org/debian wheezy-backports main contrib non-free" >> /etc/apt/sources.list && \
apt-get update && aptitude -t wheezy-backports install linux-image-amd64 && \
reboot

Allow Mac to boot directly off the internal Disk

  • We need to create a fake MacOS system so the EFI in the Mac will allow us to boot from it without the need for rEFInd
  • Install HFS utilities so we can generate an HFS+ partition:
apt-get install hfsprogs gdisk -y
  • Unmount the EFI partition and reformat it from vfat to HFS+ to serve as our fake Mac system. It will remount at the same `/boot/efi` location, but as HFS+:
umount /boot/efi && \
cp /etc/fstab /etc/fstab.old && \
grep -v '/boot/efi' /etc/fstab.old > /etc/fstab && \
mkfs.hfsplus /dev/sda1 -v Debian && \
echo $(blkid -o export -s UUID /dev/sda1) /boot/efi auto defaults 0 0 >> /etc/fstab && \
mount -a
  • We need to explicitly retag the `/dev/sda1` partition as an apple boot partition with type code AF00
    • Run: `gdisk /dev/sda`
    • type `t` to change the partition's type code
    • choose partition `1`
    • Enter Hex Code: `AF00`
    • type `w` to write the changes
    • Choose `yes` to proceed
  • We need to modify `grub-install` to force it to install to our hfs+ `/boot/efi`, and run it so it installs a `/boot/efi/EFI/debian/grubx64.efi`:
sed -i -e 's/efi_fs=\(.*\)/efi_fs="fat" # XXX: \1/' /usr/sbin/grub-install && \
grub-install
  • Now it is time to make this `/boot/efi` look like a fake MacOS system:
mkdir -p /boot/efi/System/Library/CoreServices && \
ln /boot/efi/EFI/debian/grubx64.efi /boot/efi/System/Library/CoreServices/boot.efi && \
echo "This file is required for booting" > /boot/efi/mach_kernel
  • Finally we need to bless the grub efi bootloader so it is booted. We have 2 options for doing this. The first is to use a precompiled binary, second is to compile it ourselves.
Prepare to bless (make mac bootable)
First Option: Precompiled binary
  • Grab the pre-built binary package and extract it:
cd /tmp && \
wget http://www.brad-house.com/other/mactel-boot-0.9-wheezy-x64.tar.gz && \
tar -zxvpf mactel-boot-0.9-wheezy-x64.tar.gz && \
cd mactel-boot-0.9
Second Option: Compile it ourselves
  • Install a compiler and tools
apt-get install gcc make bzip2 icnsutils librsvg2-bin -y
  • Grab the source code and build it:
cd /tmp && \
wget http://www.codon.org.uk/~mjg59/mactel-boot/mactel-boot-0.9.tar.bz2 && \
tar -jxvpf mactel-boot-0.9.tar.bz2 && \
cd mactel-boot-0.9 && \
make PRODUCTVERSION=Debian
  • Generate a boot icon:
rsvg-convert -w 128 -h 128 -o ./debian.png /usr/share/reportbug/debian-swirl.svg && \
png2icns ./debian.icns ./debian.png
Make it bootable
  • Now Install (same instructions for both precompiled binary and self-compiled):
cp SystemVersion.plist /boot/efi/System/Library/CoreServices/ && \
./hfs-bless /boot/efi/System/Library/CoreServices/boot.efi && \
cp debian.icns /boot/efi/.VolumeIcon.icns
  • Reboot, this time remove the rEFInd thumb drive!
reboot

Configure networking permanently

  • Install deps
    apt-get install bridge-utils ifenslave -y
    
  • Create network interfaces:
    echo "8021q" >> /etc/modules && \
    echo "bonding" >> /etc/modules
    

    cat > /etc/network/interfaces << 'EOF'
    auto lo
    iface lo inet loopback
    
    # Bond for LACP/LAG ports
    auto bond0
    iface bond0 inet manual
      bond-mode 802.3ad
      bond-miimon 100
      bond-downdelay 200
      bond-updelay 200
      bond-lacp-rate 1
      bond-slaves none
      mtu 9000
    
    auto eth0
    iface eth0 inet manual
      bond-master bond0
    
    auto eth1
    iface eth1 inet manual
      bond-master bond0
    
    # Management/Proxmox network
    auto bond0.50
    iface bond0.50 inet manual
      vlan-raw-device bond0
      mtu 1500
    
    auto vmbr50
    iface vmbr50 inet static
      bridge_ports bond0.50
      bridge_stp off
      bridge_fd 0
      bridge_maxwait 60
      bridge_waitport 30
      # proxmox1 = 10.50.10.44, proxmox2 = 10.50.10.45, proxmox3 = 10.50.10.46
      address 10.50.10.44
      netmask 255.255.255.0
      gateway 10.50.10.1
      mtu 1500
      # Enable Multicast Querier on bridges or cman/corosync's
      # multicast may not work.
      post-up ( echo 1 > /sys/devices/virtual/net/vmbr50/bridge/multicast_querier && sleep 5 )
    
    # Storage network
    auto bond0.55
    iface bond0.55 inet static
      vlan-raw-device bond0
      address 10.55.10.44
      netmask 255.255.255.0
      mtu 9000
    
    EOF
    
    • NOTE: manually edit this to set the IP address appropriately for THIS machine
    • NOTE2: Proxmox requires bridges to be used for VMs to be named `vmbrN` where N is a value of 0-4095.
    • NOTE3: For some reason bond0.55 doesn't come up properly until after Proxmox VE is installed (dependency issue perhaps? or maybe they fixed a bug in debian's init system?)
  • DNS config (just using google's public DNS)
    cat > /etc/resolv.conf << 'EOF'
    nameserver 8.8.8.8
    nameserver 8.8.4.4
    EOF
    
  • Set your hostname
    • vm1 host
      echo "proxmox1.testdomain.com" > /etc/hostname 
      
    • vm2 host
      echo "proxmox2.testdomain.com" > /etc/hostname 
      
    • vm3 host
      echo "proxmox3.testdomain.com" > /etc/hostname 
      
  • Configure your `/etc/hosts` file appropriately so all nodes know about each other, even in the event of DNS failures:
    cat > /etc/hosts << 'EOF'
    127.0.0.1     localhost.localdomain localhost
    10.50.10.44   proxmox1.testdomain.com proxmox1
    10.50.10.45   proxmox2.testdomain.com proxmox2
    10.50.10.46   proxmox3.testdomain.com proxmox3
    10.55.10.44   ceph1.testdomain.com ceph1
    10.55.10.45   ceph2.testdomain.com ceph2
    10.55.10.46   ceph3.testdomain.com ceph3
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    ff02::3 ip6-allhosts
    EOF
    
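  • Make sure the timezone is set correctly (writing `/etc/timezone` alone does not update `/etc/localtime`, so reconfigure tzdata afterwards):
    echo "US/Eastern" > /etc/timezone && \
    dpkg-reconfigure -f noninteractive tzdata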
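
After the reboot in the next section, it can be useful to confirm that the LACP bond defined above actually came up with both slaves. The sketch below parses the bonding driver's status file. The function name `bond_summary` is an assumption for illustration, and the parser relies only on the `Bonding Mode:`, `Slave Interface:` and `MII Status:` lines of `/proc/net/bonding/bond0`.

```shell
#!/bin/sh
# Hypothetical helper: summarize a bonding status file as
#   mode=<bonding mode>
#   <slave>=<MII status>
# Usage: bond_summary [STATUS_FILE]   (defaults to /proc/net/bonding/bond0)
bond_summary() {
    awk -F': ' '
        /^Bonding Mode:/    { print "mode=" $2 }
        /^Slave Interface:/ { slave = $2 }
        # The first MII Status line belongs to the bond itself (no slave
        # seen yet), so only report the ones that follow a slave header.
        /^MII Status:/      { if (slave != "") { print slave "=" $2; slave = "" } }
    ' "${1:-/proc/net/bonding/bond0}"
}

# Only run against the live system when the bond exists.
if [ -r /proc/net/bonding/bond0 ]; then
    bond_summary
fi
```

With the configuration above, one would expect an 802.3ad mode line followed by `eth0=up` and `eth1=up`.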
    

Install Updates and Reboot

  • Install the latest updates and reboot (hopefully networking comes back up from the prior step)
apt-get update && apt-get dist-upgrade -y && reboot
  • NOTE: Machine will reboot after this step and will assume the IP address in the new network configuration (IT WILL BE DIFFERENT THAN BEFORE!). SSH back in after this reboot.

Install and configure any other useful utilities

Configure NTP

apt-get install ntp -y
  • Add our other servers as peers, and open it up so our peers can connect to us (you'll notice we also make sure no left-over dhcp ntp config exists so it reads the main ntp.conf file):
HOSTNAME=`hostname -s` && \
for x in proxmox1 proxmox2 proxmox3 ; do
  if [ "${HOSTNAME}" != "$x" ] ; then
     echo "peer $x" >> /etc/ntp.conf
  fi
done && \
echo "restrict 10.50.10.0 mask 255.255.255.0" >> /etc/ntp.conf && \
rm -f /var/lib/ntp/ntp.conf.dhcp && \
service ntp restart
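
Several steps in this guide append lines to config files with `>>` (`/etc/modules`, `/etc/ntp.conf`, `/etc/fstab`), so pasting a step twice leaves duplicate entries. A minimal sketch of an idempotent append is shown below. The helper name `append_once` is an assumption, not something the original steps use.

```shell
#!/bin/sh
# Hypothetical helper: append LINE to FILE only if that exact line is not
# already present (grep -x matches whole lines, -F disables regex matching).
# Usage: append_once LINE FILE
append_once() {
    line="$1"
    file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file rather than a real config file.
demo=$(mktemp)
append_once "peer proxmox2" "$demo"
append_once "peer proxmox2" "$demo"   # second call is a no-op
cat "$demo"
rm -f "$demo"
```

In the NTP loop above, `echo "peer $x" >> /etc/ntp.conf` could then become `append_once "peer $x" /etc/ntp.conf`.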

Install IRQBalance

  • On servers with high-bandwidth network cards and/or high-speed disks, interrupts should be distributed across the available CPUs; otherwise severe performance issues can result:
apt-get install irqbalance -y

Install Proxmox

  • Add the Proxmox repo, and update:
    echo "deb http://download.proxmox.com/debian wheezy pve-no-subscription" > /etc/apt/sources.list.d/proxmox.list && \
    wget -O- "http://download.proxmox.com/debian/key.asc" | apt-key add - && \
    apt-get update && \
    apt-get dist-upgrade -y
    
  • Install the latest 2.6.32 kernel:
    KERNELVER=`apt-cache search pve-kernel-2.6.32 | sort -r | head -n 1 | awk '{ print $1 };' | cut -d- -f3-` && \
    apt-get install pve-firmware pve-kernel-${KERNELVER} pve-headers-${KERNELVER} -y
    
  • Make sure the kvm_intel module gets loaded ... proxmox doesn't appear to automatically do this:
    echo "kvm_intel" >> /etc/modules
    
  • Mac Mini Only:
    • Add `noapic` to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` and run `update-grub`:
      sed -i -e 's/GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 noapic"/' /etc/default/grub && \
      update-grub
      
    • When booting the new kernel, our onboard nic and thunderbolt nic will work. However Linux will remember this temporary USB nic we've been using up to this point, so we need to tell it to forget about it:
      rm -f /etc/udev/rules.d/70-persistent-net.rules
      
    • Switch to on-board NIC
  • Reboot
    • NOTE: Make sure to select the 2.6.32 PVE KERNEL!!! It is NOT the Default!!!
  • Remove the Debian Kernel
    apt-get remove `dpkg -l | grep linux-image | awk '{ print $2 };'` linux-base -y && \
    update-grub
    
  • Install proxmox VE packages:
    apt-get install proxmox-ve-2.6.32 postfix ksm-control-daemon vzprocps open-iscsi bootlogd -y
    
  • An enterprise repo might have been added, get rid of it
    rm -f /etc/apt/sources.list.d/pve-enterprise.list
    
  • You can now connect to the server using https port 8006 on each respective node ... however they aren't yet clustered so don't do that yet.
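
Because the PVE kernel is not the default GRUB entry, it is easy to remove the Debian kernel while still running it. The check below is a small sketch that could be run right after the reboot step. It assumes PVE kernel releases end in `-pve` (as in the `pve-kernel-2.6.32-...-pve` package names). The function name `is_pve_kernel` is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given (or running) kernel release
# looks like a Proxmox kernel, i.e. ends in "-pve".
# Usage: is_pve_kernel [RELEASE]   (defaults to `uname -r`)
is_pve_kernel() {
    case "${1:-$(uname -r)}" in
        *-pve) return 0 ;;
        *)     return 1 ;;
    esac
}

if is_pve_kernel; then
    echo "Running PVE kernel $(uname -r)"
else
    echo "WARNING: $(uname -r) is not a PVE kernel; reboot and select it in GRUB" >&2
fi
```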

Install Ceph

  • Install the latest ceph version and tools:
pveceph install -version firefly && \
apt-get install ceph-deploy -y


Creating a Proxmox Cluster

Basic node joining

  • node1:
    • Create the cluster
      pvecm create proxmoxcluster
      
    • Check the status
      pvecm status
      
  • node2/3:
    • Join to the cluster:
      pvecm add proxmox1
      
    • Check the status
      pvecm status
      
  • Join the fence domain (All Nodes):
    • Enable fencing:
      sed -i -e 's/# FENCE_JOIN.*/FENCE_JOIN="yes"/' /etc/default/redhat-cluster-pve && \
      service cman restart && \
      service pve-cluster restart && \
      fence_tool join
      
    • You can now run `fence_tool ls` and you should see all nodes listed.

Make sure RGmanager is configured properly

For some reason, a default Proxmox install may be missing the <rm> section in the cluster.conf file, so we need to add it.

  • node1:
if ! grep '<rm>' /etc/pve/cluster.conf ; then
   # Match digits only so other quoted attributes on the same line cannot be captured
   VER=`sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' /etc/pve/cluster.conf`
   VER=`expr $VER + 1`
   sed -i -e "s|config_version=\"[0-9]*\"|config_version=\"$VER\"|" -e "s|</cluster>|  <rm></rm>\n</cluster>|" /etc/pve/cluster.conf
   if [ "$?" = "0" ] ; then
      echo 'Updated'
   else
      echo "ERROR UPDATING /etc/pve/cluster.conf"
   fi
else
  echo "No update needed"
fi

Reboot each node

  • Reboot each node one at a time, wait for it to come up and rejoin the cluster
    • you can monitor with `pvecm status` and `fence_tool ls`
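
Rather than re-running `fence_tool ls` by hand while a node reboots, the wait can be scripted. This is a sketch only. The helper names are assumptions, and the parser relies solely on the `member count` line that `fence_tool ls` prints (see the sample output under Troubleshooting).

```shell
#!/bin/sh
# Hypothetical helper: read `fence_tool ls` output on stdin and print the
# number of fence domain members.
fence_member_count() {
    awk '/^member count/ { print $3 }'
}

# Hypothetical helper: block until the fence domain has EXPECTED members.
# Usage: wait_for_members EXPECTED
wait_for_members() {
    expected="$1"
    until [ "$(fence_tool ls 2>/dev/null | fence_member_count)" = "$expected" ]; do
        sleep 5
    done
    echo "fence domain has $expected members"
}

# Example (run on a cluster node after rebooting one of its peers):
#   wait_for_members 3
```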

Configure fencing devices

Fencing devices are used to forcibly restart a troubled node, such as if it abruptly stops communicating in the cluster. These are important to ensure consistent cluster state.

The steps below show what you might do to enable fencing in a server environment where each server has a dedicated IPMI NIC, which is fairly common for server-class hardware.

  • All nodes:
    • Install IPMItool
      apt-get install ipmitool -y
      
  • node1 only:
    • Configure the fencing driver:
      • Copy the config file to a new file, then open that new file with an editor (no, you can't edit directly, the web commit relies on the cluster.conf.new filename):
        cp /etc/pve/cluster.conf /etc/pve/cluster.conf.new && \
        nano /etc/pve/cluster.conf.new
        
      • IMPORTANT: Increment the value of the `config_version` attribute on the `<cluster ...>` element
      • Add the fencing device information prior to the <clusternodes>
          <fencedevices>
            <fencedevice agent="fence_ipmilan" name="proxmoxipmi1" lanplus="1" ipaddr="192.168.11.106" login="ADMIN" passwd="ADMIN" power_wait="5"/>
            <fencedevice agent="fence_ipmilan" name="proxmoxipmi2" lanplus="1" ipaddr="192.168.11.149" login="ADMIN" passwd="ADMIN" power_wait="5"/>
            <fencedevice agent="fence_ipmilan" name="proxmoxipmi3" lanplus="1" ipaddr="192.168.11.123" login="ADMIN" passwd="ADMIN" power_wait="5"/>
          </fencedevices>
        
      • For each clusternode, bind it to one of the fencing devices:
            <clusternode name="proxmox1" votes="1" nodeid="1">
              <fence>
                <method name="1">
                  <device name="proxmoxipmi1"/>
                </method>
              </fence>
            </clusternode>
            <clusternode name="proxmox2" votes="1" nodeid="2">
              <fence>
                <method name="1">
                   <device name="proxmoxipmi2"/>
                </method>
              </fence>
            </clusternode>
            <clusternode name="proxmox3" votes="1" nodeid="3">
              <fence>
                <method name="1">
                  <device name="proxmoxipmi3"/>
                </method>
              </fence>
            </clusternode>
        
    • Validate the config
      ccs_config_validate -v -f /etc/pve/cluster.conf.new
      
    • Open the web interface https://proxmox1:8006 and log in
      • Go to Datacenter, then the HA tab. It should show the differences; click Activate to activate and sync the config
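
The `config_version` increment is easy to forget when editing `cluster.conf.new` by hand. A minimal sketch of scripting it follows. The function name `bump_config_version` is an assumption. It writes through a temporary file rather than relying on `sed -i`.

```shell
#!/bin/sh
# Hypothetical helper: increment the numeric config_version attribute in a
# cluster.conf-style file and print the new version.
# Usage: bump_config_version FILE
bump_config_version() {
    conf="$1"
    cur=$(sed -n 's/.*config_version="\([0-9][0-9]*\)".*/\1/p' "$conf" | head -n 1)
    if [ -z "$cur" ]; then
        echo "no config_version found in $conf" >&2
        return 1
    fi
    new=$((cur + 1))
    tmp=$(mktemp) || return 1
    # Rewrite via a temp file, then copy the content back in place.
    sed "s/config_version=\"$cur\"/config_version=\"$new\"/" "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"
    rc=$?
    rm -f "$tmp"
    [ "$rc" -eq 0 ] && echo "$new"
}

# Example (after copying cluster.conf to cluster.conf.new):
#   bump_config_version /etc/pve/cluster.conf.new
```

The result should still be checked with `ccs_config_validate` as described above before activating it.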

    Create some storage areas for VMs and ISOs

    • All tasks in this section must be performed on all nodes
    • Install the XFS tools. XFS is the recommended filesystem for Ceph OSDs (do NOT use ext4/ext3; btrfs is acceptable too)
      apt-get install xfsprogs -y
      
    • Create the pv/vg on the platter-based array (zero the first 80 MB of the disk so stale signatures cannot make pvcreate fail):
      HOSTNAME=`hostname -s` && \
      dd if=/dev/zero of=/dev/sdb bs=8192k count=10 && \
      pvcreate /dev/sdb && \
      vgcreate vg_platter_${HOSTNAME} /dev/sdb
      
    • Create the logical volume for fast SSD storage:
      HOSTNAME=`hostname -s` && \
      lvcreate -L 200G -n ceph_ssd vg_${HOSTNAME}
      
    • Create the logical volume for slow platter storage:
      HOSTNAME=`hostname -s` && \
      lvcreate -L 400G -n ceph_platter vg_platter_${HOSTNAME}
      
    • Create the logical volumes for the NFS/Gluster ISO share:
      HOSTNAME=`hostname -s` && \
      lvcreate -L 20G -n isos0 vg_platter_${HOSTNAME} && \
      lvcreate -L 20G -n isos1 vg_platter_${HOSTNAME}
      
      • Why 2 iso LVs, you may ask? Because we want to use replica 2 with glusterfs, which requires the total number of 'bricks' in the volume to be a multiple of the replica count. So 3 nodes * 2 bricks per node = 6 bricks, which is a multiple of 2.
    • Format using XFS
      HOSTNAME=`hostname -s` && \
      XFSOPTS="-i size=1024 -n size=16k" && \
      mkfs.xfs ${XFSOPTS} /dev/vg_${HOSTNAME}/ceph_ssd && \
      mkfs.xfs ${XFSOPTS} /dev/vg_platter_${HOSTNAME}/ceph_platter && \
      mkfs.xfs ${XFSOPTS} /dev/vg_platter_${HOSTNAME}/isos0 && \
      mkfs.xfs ${XFSOPTS} /dev/vg_platter_${HOSTNAME}/isos1
      
      • Ceph makes extensive use of extended attributes, make sure the inode size is large enough: `-i size=1024`
      • Increasing the directory block size from the default 4 KB reduces directory I/O, which in turn improves the performance of directory operations: `-n size=16k`
    • Create the mount points
      mkdir -p /data/ceph_ssd && \
      mkdir -p /data/ceph_platter && \
      mkdir -p /data/isos0 && \
      mkdir -p /data/isos1
      
    • Make them auto-mount
      HOSTNAME=`hostname -s` && \
      XFSOPTS="rw,noexec,nodev,noatime,allocsize=4M,inode64,logbufs=8,logbsize=256k,nobarrier" && \
      echo "/dev/vg_${HOSTNAME}/ceph_ssd             /data/ceph_ssd     xfs ${XFSOPTS} 0 0" >> /etc/fstab && \
      echo "/dev/vg_platter_${HOSTNAME}/ceph_platter /data/ceph_platter xfs ${XFSOPTS} 0 0" >> /etc/fstab && \
      echo "/dev/vg_platter_${HOSTNAME}/isos0        /data/isos0        xfs ${XFSOPTS} 0 0" >> /etc/fstab && \
      echo "/dev/vg_platter_${HOSTNAME}/isos1        /data/isos1        xfs ${XFSOPTS} 0 0" >> /etc/fstab
      
      • NOTE: `nobarrier` should really only be set if you have a BBU on your raid controller. (Or does ceph do enough checking that this is now safe?)
      • NOTE: there is evidence that `allocsize=4M` is no longer desirable after a change made in kernel 2.6.38 (but we're on 2.6.32)
    • Mount them
      mount -a
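
Since `mount -a` can fail quietly for a single entry, a quick check that all four mount points are really XFS is cheap insurance before handing them to Ceph and Gluster. This is a sketch. The function name `mount_fstype` is an assumption, and it simply reads the mount table in `/proc/mounts`.

```shell
#!/bin/sh
# Hypothetical helper: print the filesystem type mounted at MOUNTPOINT,
# or fail if nothing is mounted there.
# Usage: mount_fstype MOUNTPOINT [MOUNTS_FILE]   (defaults to /proc/mounts)
mount_fstype() {
    awk -v mp="$1" '
        $2 == mp { fs = $3 }                  # last matching entry wins
        END { if (fs == "") exit 1; print fs }
    ' "${2:-/proc/mounts}"
}

for mp in /data/ceph_ssd /data/ceph_platter /data/isos0 /data/isos1; do
    fs=$(mount_fstype "$mp" 2>/dev/null) || fs="NOT MOUNTED"
    echo "$mp: $fs"
done
```

Every line should report `xfs`; anything else means the corresponding `/etc/fstab` entry or logical volume needs another look.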
      

    Creating the Ceph Cluster

    node1: Initialize the ceph network

    pveceph init --network 10.55.10.0/24
    

    all nodes: Create the ceph monitors

    pveceph createmon
    

    node1: Delete default cruft

    • Delete the default pools, they don't suit our purposes:
    for x in rbd data metadata ; do 
      ceph osd pool delete $x $x --yes-i-really-really-mean-it
    done
    

    node1: Create the OSD

    • node1: Create the ceph OSD (basically the blocks for storage)
    cd /etc/ceph && \
    ceph-deploy gatherkeys 0 && \
    ceph-deploy --overwrite-conf osd prepare ceph1:/data/ceph_ssd ceph2:/data/ceph_ssd ceph3:/data/ceph_ssd ceph1:/data/ceph_platter ceph2:/data/ceph_platter ceph3:/data/ceph_platter && \
    cp /etc/ceph/ceph.conf /etc/pve/ceph.conf && \
    ceph-deploy osd activate ceph1:/data/ceph_ssd ceph2:/data/ceph_ssd ceph3:/data/ceph_ssd ceph1:/data/ceph_platter ceph2:/data/ceph_platter ceph3:/data/ceph_platter
    
    • Note we used '0' for the gatherkeys, that is because proxmox set it up as mon.${num} rather than mon.${hostname}
    • Also note we are using ceph-deploy not pveceph so we might need to fix things up, later, as you'll see

    all nodes: restore broken symlink

    • `/etc/ceph/ceph.conf` symlink was removed by ceph-deploy, fix it:
    rm -f /etc/ceph/ceph.conf && \
    ln -sf /etc/pve/ceph.conf /etc/ceph/ceph.conf
    
    • `/etc/ceph/ceph.conf` is normally a symlink to `/etc/pve/ceph.conf` on the Proxmox cluster filesystem (pmxcfs). The `--overwrite-conf` option replaces this symlink with a regular file, which is why we copied the newly generated config back to `/etc/pve/ceph.conf` in the previous step and restore the symlink here.
      • If we were using only the `pveceph` commands instead of ceph-deploy, we wouldn't need to worry about this, but we can't, since pveceph only allows raw disks for OSD creation.

    node1: Make sure ceph doesn't overwrite our crush map on startup

    • Important!: If we skip this step, Ceph can overwrite our crush map on startup and break our entire setup
    • Set the `osd_crush_update_on_start = false` flag in `ceph.conf`:
    sed -i -e 's/^public_network\(.*\)/public_network\1\nosd_crush_update_on_start = false/' /etc/pve/ceph.conf
    

    node1: Create our crush ruleset

    • This tells ceph how to distribute data
    • By default, Ceph groups our OSDs (disks/LVs) by host, but each host holds two different kinds of disk, so that isn't the best grouping for us. We want to group the SSDs together and the platter disks together, so we will do that below.
    • Let's look at the current (default) crush map:
    ceph osd getcrushmap -o /tmp/crushmap.bin && \
    crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt && \
    cat /tmp/crushmap.txt
    
    • Create one host entry per disk (if the host had multiple of the same type, they could be combined):
    ceph osd crush add-bucket ceph1-ssd host && \
    ceph osd crush add-bucket ceph2-ssd host && \
    ceph osd crush add-bucket ceph3-ssd host && \
    ceph osd crush add-bucket ceph1-platter host && \
    ceph osd crush add-bucket ceph2-platter host && \
    ceph osd crush add-bucket ceph3-platter host
    
    • Unassign our existing OSDs from their current hosts
    ceph osd crush rm osd.0 proxmox1 && \
    ceph osd crush rm osd.3 proxmox1 && \
    ceph osd crush rm osd.1 proxmox2 && \
    ceph osd crush rm osd.4 proxmox2 && \
    ceph osd crush rm osd.2 proxmox3 && \
    ceph osd crush rm osd.5 proxmox3
    
    • Delete the stale host entries
    ceph osd crush rm proxmox1 && \
    ceph osd crush rm proxmox2 && \
    ceph osd crush rm proxmox3
    
    • Add the OSDs to the new host entries:
    ceph osd crush add osd.0 0.200 host=ceph1-ssd && \
    ceph osd crush add osd.1 0.200 host=ceph2-ssd && \
    ceph osd crush add osd.2 0.200 host=ceph3-ssd && \
    ceph osd crush add osd.3 0.400 host=ceph1-platter && \
    ceph osd crush add osd.4 0.400 host=ceph2-platter && \
    ceph osd crush add osd.5 0.400 host=ceph3-platter
    
    • NOTE: The weight (0.200 and 0.400 above) is supposed to roughly reflect the size, with 1.000 meaning 1TB, though it can also be used to reflect disk speed.
    • Create new buckets that contain the relevant hosts
    ceph osd crush add-bucket platter root && \
    ceph osd crush add-bucket ssd root
    
    • Assign host entries to proper roots
    ceph osd crush move ceph1-ssd root=ssd && \
    ceph osd crush move ceph2-ssd root=ssd && \
    ceph osd crush move ceph3-ssd root=ssd && \
    ceph osd crush move ceph1-platter root=platter && \
    ceph osd crush move ceph2-platter root=platter && \
    ceph osd crush move ceph3-platter root=platter
    
    • Create rules using the root buckets we created
    ceph osd crush rule create-simple ssd-rule ssd host && \
    ceph osd crush rule create-simple platter-rule platter host
    
    • Delete stale stuff we don't use
    ceph osd crush rule rm replicated_ruleset && \
    ceph osd crush rm default
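
The weights passed to `ceph osd crush add` above follow the "1.000 means 1TB" convention from the note. As a small worked sketch, the conversion from logical volume size can be written as below. The function name `crush_weight` is an assumption, and it treats 1TB as 1000GB, which is what matches the 0.200 and 0.400 values used here.

```shell
#!/bin/sh
# Hypothetical helper: convert a size in GB into a CRUSH weight where
# 1.000 corresponds to 1TB (taken as 1000GB).
# Usage: crush_weight SIZE_GB
crush_weight() {
    LC_ALL=C awk -v gb="$1" 'BEGIN { printf "%.3f\n", gb / 1000 }'
}

crush_weight 200   # 200G ceph_ssd LV     -> prints 0.200
crush_weight 400   # 400G ceph_platter LV -> prints 0.400
```

Running `ceph osd tree` afterwards shows the resulting hierarchy and weights.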
    

    node1: Create ceph pools

    • This defines what pools use our crush map settings so we can distribute to ssds vs platter drives
    pveceph createpool ssd     -min_size 1 -pg_num 256 -size 2 -crush_ruleset 1 && \
    pveceph createpool platter -min_size 1 -pg_num 256 -size 2 -crush_ruleset 2
    
    • NOTE: During the presentation I got a timeout error after running this command with the latest ceph 0.80.4 release, running `/etc/init.d/ceph restart` on each node appeared to resolve the issue. It did appear the command had succeeded as seen by `ceph osd lspools`
    • Settings from above are as follows
      • min_size is the minimum number of replicas of the data that must be available for the pool to keep serving I/O
      • pg_num should be `(#OSDs * 100) / replicasize` rounded up to the next power of 2.
        • Each pool here has 3 OSDs: `(3 * 100) / 2 = 150`, and the next power of 2 above that is 256
      • size is the number of replicas across the cluster
    • Wait for `ceph health` to say HEALTH_OK
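
The pg_num rule of thumb above is simple arithmetic, so it can be checked with a few lines of shell. The function name `pg_num` is an assumption for illustration. It implements exactly the formula stated above and nothing more.

```shell
#!/bin/sh
# Hypothetical helper: (OSDS * 100) / REPLICAS, rounded up to the next
# power of 2.
# Usage: pg_num OSDS REPLICAS
pg_num() {
    osds="$1"
    replicas="$2"
    # Integer ceiling of (osds * 100) / replicas
    target=$(( (osds * 100 + replicas - 1) / replicas ))
    pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$((pg * 2))
    done
    echo "$pg"
}

pg_num 3 2   # 3 OSDs per pool, 2 replicas: 150 -> prints 256
```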

    node1: Create a trust between proxmox and ceph

    • node1: Create a trust for accessing the pool from proxmox
    mkdir -p /etc/pve/priv/ceph && \
    cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/ssd.keyring && \
    cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/platter.keyring
    
    • node1: Create the proxmox storage entries, one for the ssd pool and one for the platter pool, and name them the same as the pool names
      • Go to the web interface at https://proxmox1:8006, log in as root
        • Go to Datacenter->Storage
        • Click on Add
        • Choose RBD
        • Enter `ssd` (or `platter`) as the ID, or whatever you chose for the trust name
        • Enter the pool name created via pveceph createpool (we use the same name ourselves of `ssd` or `platter`)
        • For the monitor host, enter "ceph1 ceph2 ceph3" since each of our nodes runs a monitor.
          • It would be nice to use 'localhost', but unfortunately ceph doesn't listen on localhost by default, only the external ip.

    Using GlusterFS for simple ISO storage and distribution

    • Ceph is not meant for file storage for things like ISO images, and the CephFS add-on layer is still unstable and wouldn't necessarily work for our purposes, so we decided to use GlusterFS for ISO storage. In fact, GlusterFS has a built-in NFS server, which Proxmox prefers, so this works out well. We shouldn't have multiple potential writers, so we don't need to worry about problems like split-brain. Most of our usage will be read-only.
    • Install it on all nodes:
    apt-get install glusterfs-server -y
    
    • Basic config on all nodes:
    sed -i -e 's/option working-directory\(.*\)/option working-directory\1\n    option rpc-auth-allow-insecure on/' /etc/glusterfs/glusterd.vol
    
    • Start it on all nodes
    service rpcbind restart && \
    service glusterfs-server restart
    
    • Create storage directory for glusterfs on all nodes:
    for x in isos0/brick0 isos1/brick1 ; do
      mkdir /data/$x && \
      chmod 755 /data/$x
    done
    

    proxmox1 only: Create the cluster

    • Locate the peers:
    gluster peer probe ceph2 && \
    gluster peer probe ceph3
    

    proxmox1 only: Create the share and set options

    • Create the share, as stated previously, we have 2 bricks per node because we want replica 2.
    gluster volume create isos replica 2 ceph{1..3}:/data/isos0/brick0 ceph{1..3}:/data/isos1/brick1
    
    • Set some share options:
    gluster volume set isos server.allow-insecure on && \
    gluster volume set isos server-quorum-type server
    
    • Start the volume:
    gluster volume start isos
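
With `replica 2`, Gluster forms replica sets from consecutive bricks in the order they appear on the `gluster volume create` command line, so the brick order decides which hosts mirror each other. The sketch below prints the pairs for the order produced by the brace expansion above. The function name `replica_sets` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: group a brick list into replica sets of size
# REPLICA, in command-line order, one set per output line.
# Usage: replica_sets REPLICA BRICK...
replica_sets() {
    replica="$1"
    shift
    i=0
    group=""
    for brick in "$@"; do
        group="$group $brick"
        i=$((i + 1))
        if [ $((i % replica)) -eq 0 ]; then
            echo "${group# }"
            group=""
        fi
    done
}

# Same order as: ceph{1..3}:/data/isos0/brick0 ceph{1..3}:/data/isos1/brick1
replica_sets 2 \
    ceph1:/data/isos0/brick0 ceph2:/data/isos0/brick0 ceph3:/data/isos0/brick0 \
    ceph1:/data/isos1/brick1 ceph2:/data/isos1/brick1 ceph3:/data/isos1/brick1
```

Each printed line pairs bricks from two different hosts, so losing any single node still leaves one copy of every file.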
    

    Configure the GlusterFS ISO storage domain in Proxmox

    • Attach the storage group to the Gluster NFS storage
      • Go to https://proxmox1:8006 and log in as root
      • Go to Datacenter -> Storage
      • Click Add->NFS
        • NOTE: if you try glusterfs, the web image uploader will NOT work.
      • ID: isos
      • Server: localhost
      • Volume Name: /isos
      • Content: ISO

    Final Proxmox Configuration Steps

    Uploading ISOs

    • Go To DataCenter and select one of the hosts (why? dunno, it's shared storage), then select the isos
    • Click on Content, then Upload, and the rest is straightforward
    • You might get a "communication failure (0)" on large uploads, but it seems as though it is successful even with that error.

    Create a VM

    Create the Profile

    • At the top right corner of the screen, click on Create VM
    • Choose a node
    • Choose a name, e.g. 'centostest'
    • Next
    • Choose an OS type (most likely Linux 3.X/2.6 Kernel)
    • Next
    • Choose CD/DVD ISO Disc image (you uploaded one right?)
    • Next
    • Hard Disk Bus/Device: VIRTIO:0
    • Storage: ssd
    • Disk Size: 32GB
    • Cache: No Cache
    • Next
    • Select your desired sockets/cores
    • Select the type as Westmere (gives us stuff like aesni)
    • Next
    • Memory: 1024MB
    • Next
    • Choose Bridged mode
      • On vmbr50, the bridge defined in `/etc/network/interfaces` earlier (or another bridge, if you created other networks)
      • Model: Virtio
    • Next
    • Confirm (this may take a couple of minutes as it is creating the disk image, etc)

    Launch it

    • Under the host/node where you created the VM, you should now see your VM name. Click on it
    • You should see 'Start' at the top right of the screen. Click it
    • Click the console button, and the console should pop up

    Creating a new Network

    Viewing the Console

    Ubuntu 12.04

    • We need to update the Java plugin and not use IcedTea:
    sudo add-apt-repository ppa:webupd8team/java && \
    sudo apt-get update && \
    sudo apt-get install oracle-java7-installer && \
    sudo update-java-alternatives -s java-7-oracle
    

    Troubleshooting

    Cluster

    • Some useful commands:
    root@proxmox2:~# fence_tool ls
    fence domain
    member count  3
    victim count  0
    victim now    0
    master nodeid 2
    wait state    none
    members       1 2 3 
    
    root@proxmox2:~# pvecm status
    Version: 6.2.0
    Config Version: 4
    Cluster Name: proxmoxcluster
    Cluster Id: 12596
    Cluster Member: Yes
    Cluster Generation: 124
    Membership state: Cluster-Member
    Nodes: 3
    Expected votes: 3
    Total votes: 3
    Node votes: 1
    Quorum: 2  
    Active subsystems: 6
    Flags: 
    Ports Bound: 0 177  
    Node name: proxmox2
    Node ID: 3
    Multicast addresses: 239.192.49.101 
    Node addresses: 192.168.11.45 
    
    • If you have strange issues, check whether `/etc/pve/local/pve-ssl.key` and `/etc/pve/local/pve-ssl.pem` exist on the servers. If not, run `pvecm updatecerts` on the nodes where they are missing to regenerate them. Not sure why this happened!
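
When troubleshooting, the first thing worth confirming is usually whether the cluster is quorate. The sketch below reads `pvecm status` output in the format shown above and compares the `Total votes` and `Quorum` lines. The function name `has_quorum` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: read `pvecm status` output on stdin and succeed when
# the total votes reach the quorum threshold.
has_quorum() {
    awk -F': *' '
        /^Total votes:/ { total = $2 + 0 }
        /^Quorum:/      { need = $2 + 0 }
        END { exit !(need > 0 && total >= need) }
    '
}

# Example (on a cluster node):
#   if pvecm status | has_quorum; then echo "quorate"; else echo "NO QUORUM"; fi
```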

    VM Install

    • Sometimes the screen resolution of the Linux OS installer is too high, and because the Java VNC client does not perform scaling, the bottom of the screen gets chopped off. If that happens (for instance with CentOS 7), hit 'Tab' on the install option, then append `vga=0x315 nomodeset` to the kernel command line. This reduces the screen resolution to 800x600.