Upgrading Solaris 10 with Zones on ZFS

From Docupedia

Written By: Steve Ayotte

Date: 9/30/2008

Introduction

Right away, I'd like to point out that my intended audience is experienced sysadmins. The problem addressed in this article is one you should NEVER run into on a system that is in revenue-generating production, and the workaround is NOT suitable for use on a machine that might hurt your cashflow if it goes up in flames. On the other hand, it's an enlightening exercise with Zones. If you, like me, find yourself trying to fix some machines for a department whose already-committed customers will at worst be somewhat unimpressed if their internal machines have to be rebuilt from scratch, this might be the thing you've been looking for.

Some required knowledge ahead of time:

  • ZFS
  • Solaris Zones
  • Installing/configuring Solaris 10

The Problem

Way back in ancient history (June 2006), Sun Microsystems rolled out ZFS, a great new volume management / filesystem / software RAID technology, with their latest official release of Solaris 10. I've lost the link to the original post, but at the time one of the developers who was ecstatic about it wrote on his public blog, enumerating a bunch of GREAT IDEAS for how we could all take advantage of the power of ZFS. One of his examples was placing a Solaris Zone's root within a ZFS volume and then using that zone as a template for deploying more zones rapidly. The advantage of ZFS here was that you could clone a ZFS volume in no time and consume nearly no extra disk space in the process.

Whoops! That wasn't a recommendation, it was just an observation! When the next update of Solaris was released (November 2006), we found that it was impossible to upgrade a system that had zones with their roots on ZFS (this is still the case as of September 2008[1]). The reasons are a little complex, but the gist of it is that the Solaris Upgrade procedure won't declare a zone properly upgraded unless its patch/package/etc. sets are all 'exactly' the same as the global zone's, and having a zone's root on ZFS makes that impossible.

There's some work being done in OpenSolaris to make it possible to back out all patches on a zone which, if my understanding is correct, will make this kind of upgrade possible in the future, but it's not here yet (link?). As of the writing of this article (September 2008), there is still no support planned for upgrading zones created in this fashion in the next release of Solaris 10 (scheduled for October 2008[2]), so we're left with only a few options.

Our Current Options

  1. Detach the zones, copy them into a UFS area, re-attach them, and perform the upgrade. I haven't tried this, but I imagine it'd work pretty well if you have that much space left to stick a UFS filesystem into.
  2. Don't upgrade, and hope that Solaris 10u7 will support upgrading normally (10/08, aka U6, will *not*).
  3. Serious hackery. This is what this article is about.

A Hackish Workaround

On with the serious hackery!

Caveats

First, this workaround does not actually upgrade the zones, it upgrades the underlying system. If your zones inherit most of the libraries and binaries from the global zone then they'll still gain most of the benefit of the upgrade, but the truth is that your zones are going to be left in an inconsistent state. If you can't live with this, you need to rebuild the system and the zones from scratch.
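
To get a feel for how much of a zone rides on the global zone, you can list the directories it inherits. The following is a minimal sketch, not part of the original procedure: the helper name `inherited_dirs` is made up for illustration, and it assumes `zonecfg ... info inherit-pkg-dir` prints each entry as a `dir: /path` line.

```shell
#!/bin/sh
# inherited_dirs ZONE: print the directories ZONE inherits from the global zone.
# Hypothetical helper; assumes zonecfg prints each entry as "dir: /path".
inherited_dirs() {
    zonecfg -z "$1" info inherit-pkg-dir | awk '$1 == "dir:" { print $2 }'
}

# Example (not run here):
#   inherited_dirs nobby
```

A zone that lists /lib, /platform, /sbin and /usr here (like the examples later in this article) picks up most of its binaries from the upgraded global zone; a whole-root zone that lists nothing gains far less.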

The Workaround

  1. For each zone do:
    1. Save the zone's configuration:
      zonecfg -z <zone name> export > <zone name>.cfg
    2. Save this .cfg file somewhere safe (i.e. not on this machine)
    3. Detach the zone:
      zoneadm -z <zone name> detach
  2. Export the entire ZPool that the zones were living in:
    zpool export <zpool name>
  3. Rebuild the host OS, being sure not to destroy your ZPool
  4. Build a sample minimal zone on the rebuilt system
  5. Detach the sample minimal zone
  6. Within the sample zone's root you'll now find a file named "SUNWdetached.xml"; save / set this aside.
  7. Delete the sample minimal zone (or keep it, whatever, but we're done with it for now).
  8. Import the ZPool back into the system:
    zpool import <zpool name>
  9. Re-attach the zones from the old system to the new system (here comes the hackery). For each zone:
    1. Create a new SUNWdetached.xml for the zone using the template from the sample zone that we saved earlier. This is the real hack:
      1. From the SUNWdetached.xml already in the zone's root, copy to the side the lines from the <zone ...> line (inclusive) up to, but not including, the first <package ...> line
      2. Overwrite the SUNWdetached.xml file in the zone's root with the template we saved from the sample zone
      3. Edit the new SUNWdetached.xml file, replacing the same span of lines (<zone ...> up to the first <package ...>) with the ones we saved from the original file
      4. Leave all other lines identical to those in the sample zone's SUNWdetached.xml
        1. Here's an example of our template file:
          <?xml version="1.0" encoding="UTF-8"?>
          <!DOCTYPE zone PUBLIC "-//Sun Microsystems Inc//DTD Zones//EN" "file:///usr/share/lib/xml/dtd/zonecfg.dtd.1">
          <!--
              DO NOT EDIT THIS FILE.  Use zonecfg(1M) and zoneadm(1M) attach.
          -->
          <zone name="testzone" zonepath="/data/zones/testzone" autoboot="false">
            <inherited-pkg-dir directory="/lib"/>
            <inherited-pkg-dir directory="/platform"/>
            <inherited-pkg-dir directory="/sbin"/>
            <inherited-pkg-dir directory="/usr"/>
            <package name="SUNWocfd" version="11.10.0,REV=2005.01.21.15.53"/>
            <patch id="125095-15"/>
            <patch id="128010-10"/>
            <package name="SUNWlucfg" version="11.10,REV=2007.03.09.13.13"/>
          (snipped...)
          
        2. Here's an example of our hacked-up file based on the template:
          <?xml version="1.0" encoding="UTF-8"?>
          <!DOCTYPE zone PUBLIC "-//Sun Microsystems Inc//DTD Zones//EN" "file:///usr/share/lib/xml/dtd/zonecfg.dtd.1">
          <!--
              DO NOT EDIT THIS FILE.  Use zonecfg(1M) and zoneadm(1M) attach.
          -->
          <zone name="nobby" zonepath="/data/zones/nobby" autoboot="false">
            <inherited-pkg-dir directory="/lib"/>
            <inherited-pkg-dir directory="/platform"/>
            <inherited-pkg-dir directory="/sbin"/>
            <inherited-pkg-dir directory="/usr"/>
            <dataset name="zonepool/nobby_pool"/>
            <network address="10.88.3.183/24" physical="bge0"/>
            <device match="/dev/lofictl"/>
            <device match="/dev/lofi/*"/>
            <device match="/dev/rlofi/*"/>
            <package name="SUNWocfd" version="11.10.0,REV=2005.01.21.15.53"/>
            <patch id="125095-15"/>
            <patch id="128010-10"/>
            <package name="SUNWlucfg" version="11.10,REV=2007.03.09.13.13"/>
          (snipped...)
          
    2. Restore/import zonecfg, a la
      # zonecfg -z <zone name> -f <zone name>.cfg
    3. Attach the zone, a la
      # zoneadm -z <zone name> attach
    4. Boot zone into single user mode:
      # zoneadm -z <zone name> boot -s && zlogin -C <zone name>
    5. Fix the inetd connection_backlog default
      In one of the updates between 6/06 and 5/08, a default property named "connection_backlog" was added to the SMF configuration for inetd. Since our zones haven't been properly upgraded, they won't have this default set, and we need to set it manually. If it is left unset, the rpc/gss sub-service of inetd fails and consequently the zone will not boot normally.
      1. We can see the problem as follows:
         -bash-3.00# inetadm -l rpc/gss
         SCOPE    NAME=VALUE
                  name="100234"
                  endpoint_type="tli"
                  proto="ticotsord"
                  isrpc=TRUE
                  rpc_low_version=1
                  rpc_high_version=1
                  wait=TRUE
                  exec="/usr/lib/gss/gssd"
                  user="root"
         default  bind_addr=""
         default  bind_fail_max=-1
         default  bind_fail_interval=-1
         default  max_con_rate=-1
         default  max_copies=-1
         default  con_rate_offline=-1
         default  failrate_cnt=40
         default  failrate_interval=60
         default  inherit_env=TRUE
         default  tcp_trace=FALSE
         default  tcp_wrappers=FALSE
         Error: Property connection_backlog is missing and has no defined default value.
        
      2. And we can fix the problem as follows:
         # inetadm -M connection_backlog=10
    6. Reboot the zone.
  10.  ???
  11. PROFIT!!!
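
With more than a couple of zones, steps 1 and 2 get tedious to type by hand. Here is a rough sketch of how they could be scripted. The function names are made up for illustration, it assumes every zone is already halted (detach won't work on a running zone), and it relies on the colon-separated output of `zoneadm list -cp`, where the second field is the zone name.

```shell
#!/bin/sh
# Hypothetical helpers for steps 1 and 2; the function names are illustrative.

# local_zones: print the name of every configured non-global zone.
# Relies on the colon-separated output of "zoneadm list -cp" (field 2 = name).
local_zones() {
    zoneadm list -cp | awk -F: '$2 != "global" { print $2 }'
}

# save_and_detach DIR ZONE...: export each zone's config into DIR, then detach it.
# Stops at the first failure so a zone is never detached without a saved config.
save_and_detach() {
    outdir=$1
    shift
    for zone in "$@"; do
        zonecfg -z "$zone" export > "$outdir/$zone.cfg" || return 1
        zoneadm -z "$zone" detach || return 1
    done
}

# Example (not run here; the directory and pool name are placeholders):
#   save_and_detach /var/tmp/zonecfgs $(local_zones)
#   zpool export zonepool
```

Remember to copy the saved .cfg files off the machine before rebuilding the host OS.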

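The SUNWdetached.xml splice in step 9 is easy to get wrong by hand, so here is a minimal sketch of it using awk. The function name `splice_detached` is made up, and the sketch assumes each element sits on its own line, as in the examples above. It only prints the merged file; it never touches the originals.

```shell
#!/bin/sh
# splice_detached ORIGINAL TEMPLATE
# Hypothetical helper. Prints a merged manifest to stdout, made of:
#   1. the template's preamble (everything before its <zone ...> line)
#   2. the original's lines from <zone ...> up to, not including, the first <package ...>
#   3. the template's lines from its first <package ...> to the end of the file
splice_detached() {
    orig=$1
    tmpl=$2
    # 1. XML declaration, DOCTYPE and comment from the template.
    awk '/<zone[ >]/ { exit } { print }' "$tmpl"
    # 2. Zone-specific header (zonepath, datasets, network, devices) from the original.
    awk '/<package[ >]/ { exit } /<zone[ >]/ { keep = 1 } keep { print }' "$orig"
    # 3. Package and patch inventory, plus everything after it, from the template.
    awk '/<package[ >]/ { keep = 1 } keep { print }' "$tmpl"
}

# Example (not run here; paths are placeholders):
#   splice_detached /data/zones/nobby/SUNWdetached.xml /var/tmp/template.xml > /var/tmp/nobby.xml
```

Writing the result to a scratch file first leaves room to compare it against both inputs before copying it over the SUNWdetached.xml in the zone's root.
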
References

  1. http://opensolaris.org/os/community/zones/faq/#sa_zfs
  2. http://opensolaris.org/os/community/zones/faq/#cfg_zfsboot