Tuesday, January 30, 2007

What I Do

I've always maintained that what I do for a living is not who I am. I do not define myself by my job. The past few days at work have been a good example of the annoying stuff I wrestle with daily.

I am a Unix system administrator. Hear me weep.

Our database administrator informed me that he and the application developers wanted a server that was a copy of our new data warehouse, not yet in production, for testing purposes. Great idea.

I had set up the data warehouse as a non-global zone (virtual server) on a Solaris 10 server. I put the root of the zone on a ZFS (Zettabyte File System) file system so I could take snapshots of the entire server when necessary. I looked for an easy way to clone the zone to create a duplicate test zone. Sure enough, Solaris has a command to clone a zone. However, my current operating system version is Solaris 10 6/06, and the clone feature is brand new in the 11/06 version.

I downloaded five CD-format upgrade images from sun.com and burned them onto discs. That took a good chunk of time. I scanned through 266 pages of release notes to plan for a couple of "gotchas". I then went to the computer room to do the upgrade.

About half an hour into the upgrade, the system informed me that since I had a non-global zone installed, I would need to use the upgrade image in DVD format. So back to the office I went, and spent several more hours downloading the DVD image segments, concatenating them, and burning the result to a DVD. Throwing a 4GB+ file around on the network takes some time.

It then occurred to me that one of the servers I wanted to upgrade is old enough not to even have a DVD drive. The only option left would be to set up a Solaris Jumpstart installation server and install the upgrades over the network.

I picked a third server of mine, mounted the DVD iso image with a "dolofi" utility I'd downloaded, and ran the commands to set up an installation server. I set up the client parameters on the server, then booted the client (the server I wanted to upgrade) with the "boot net - nowin" command. It was unable to mount the installation image, and kept stopping with a disturbing "panic" message.

After some googling, I found that there is a bug in Solaris network installation that keeps it from working on supernetted networks. Sun doesn't call it a bug, but those who have run into the problem do. Short story is that Sun doesn't plan on fixing it.

The deal is that our network is a series of sequential class C networks. Our subnets are ranges within those numbers. For example, our main subnet is x.x.8.x through x.x.11.x. The installation server and the client are both 9.x numbers, but the default router is an 8.x. The default mask of a Sun client booting for a network installation is 255.255.255.0 (and annoyingly can't be changed by any boot arguments), so a 9.x number can't see an 8.x number even though they're on the same subnet. The easiest workaround is to install a DHCP (Dynamic Host Configuration Protocol) server that would assign a mask of 255.255.252.0 along with other network parameters to a client when it boots. That assignment would allow the client to see the installation server and the default router.
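To make the mask arithmetic concrete, here is a minimal Python sketch. The 10.1 prefix is a stand-in I made up for the elided x.x, and the host numbers are hypothetical too; only the two masks and the 8 and 9 third octets come from the story above.

```python
import ipaddress

# Hypothetical addresses: "10.1" stands in for the elided "x.x" prefix,
# and the host numbers are invented purely for illustration.
CLIENT = "10.1.9.50"   # a 9.x machine, like the install client
ROUTER = "10.1.8.1"    # an 8.x machine, like the default router


def same_subnet(addr_a, addr_b, netmask):
    """Return True if both addresses land in the same network under netmask."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{netmask}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{netmask}").network
    return net_a == net_b


# Default mask used by the booting client: 9.x and 8.x look like different networks.
print(same_subnet(CLIENT, ROUTER, "255.255.255.0"))  # False

# Mask handed out by the DHCP server: 8.x through 11.x are one network.
print(same_subnet(CLIENT, ROUTER, "255.255.252.0"))  # True
```

With the /22 mask the third octets 8, 9, 10 and 11 all collapse to the same network number, which is why the client can finally reach both the installation server and the router.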

So, I ramped up the learning curve a little more and dug into the DHCP manuals. It actually didn't take too long before I had one set up, but there are several "vendor options" that need to be entered and configured, and the platform type of each server you want to install has to be entered for each option. I set up options for a workstation on my desk I use for testing.

After an afternoon of wrestling over parameters to get them just right (the manuals conveniently left out a couple I needed), I was able to get my workstation to boot from the installation server with the "boot net:dhcp" command. Sweet. I proceeded to upgrade my workstation to the latest and greatest Solaris 10 11/06 release.

Then it was back to the original server I wanted to upgrade. From the release notes, I knew I had to temporarily remove the lofs (loopback file system) mounts of /usr/local and /usr/openv from the non-global zone with the zonecfg utility. I also had to detach one of the slices in the Solaris Volume Manager mirror that made up the boot partition and use "metaroot" to assign the boot device back to the device name c0t0d0s0 instead of the metadevice d0.

With those short tasks behind me, I requested more server downtime from the developers, and shut down the data warehouse server in preparation for the upgrade. I added it as a registered client on the installation server, and added an entry for it, keyed by ethernet address, on the DHCP server. I just duplicated the entry that worked for my workstation.

Hmmm. When I went to boot it, it panicked and said it could not mount the installation image.

ok boot net:dhcp - nowin
Boot device: /pci@1f,0/pci@1,1/network@0,1:dhcp File and args: - nowin
panic - boot: Could not mount filesystem.

I issued a snoop dhcp command on the installation server and saw that the client was talking to the DHCP server correctly, but for some reason it was unable to mount the Solaris miniroot. The location of the miniroot should have been provided as a DHCP option.

Back to Google, where I finally located someone who'd had a similar problem. I used this command to debug the DHCP conversation between the two computers:

$ snoop -vv ether 0:14:4f:3b:21:15 | grep DHCP

DHCP: Message type = DHCPDISCOVER
DHCP: Client Class Identifier = "SUNW.Sun-Fire-V490"
DHCP: Requested Options:
DHCP: 1 (Subnet Mask)
DHCP: 3 (Router)
DHCP: 12 (Client Hostname)
DHCP: 43 (Vendor Specific Options)
DHCP: Maximum DHCP Message Size = 1472 bytes

[ ..... ]

DHCP: Message type = DHCPOFFER
DHCP: DHCP Server Identifier = x.x.9.172
DHCP: IP Address Lease Time = -1 seconds
DHCP: Subnet Mask = 255.255.252.0
DHCP: Boot File Name = SUNW.sun4u

The "Client Class Identifier" reminded me about those platform types I had to enter as part of the vendor options for the macros on the DHCP server. I checked - sure enough! I had missed entering "SUNW.Sun-Fire-V490" on two of the options. I put it in, restarted the DHCP server, and rebooted the client.
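In hindsight, this is the kind of check a few lines of script could have done for me. Below is a rough Python sketch, not what I actually ran: it pulls the Client Class Identifier out of snoop lines like the ones above and reports which vendor options lack that class. The option names (vendor_option_1 and so on) and the workstation class "SUNW.Example-Workstation" are placeholders I invented; the real option names and class lists live on the DHCP server.

```python
import re

# Lines copied from the snoop output shown earlier in the post.
SNOOP_LINES = [
    'DHCP: Message type = DHCPDISCOVER',
    'DHCP: Client Class Identifier = "SUNW.Sun-Fire-V490"',
    'DHCP: Requested Options:',
    'DHCP: 43 (Vendor Specific Options)',
]

# Hypothetical stand-in for the DHCP server's vendor options: each option
# name maps to the client classes entered for it. All names are invented.
CONFIGURED = {
    "vendor_option_1": {"SUNW.Example-Workstation", "SUNW.Sun-Fire-V490"},
    "vendor_option_2": {"SUNW.Example-Workstation"},
    "vendor_option_3": {"SUNW.Example-Workstation"},
}

CLASS_RE = re.compile(r'Client Class Identifier = "([^"]+)"')


def client_classes(lines):
    """Return the distinct client class identifiers seen, in order of appearance."""
    found = []
    for line in lines:
        match = CLASS_RE.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


def options_missing_class(configured, client_class):
    """Return the sorted names of options that do not list client_class."""
    return sorted(
        name for name, classes in configured.items()
        if client_class not in classes
    )


for cls in client_classes(SNOOP_LINES):
    missing = options_missing_class(CONFIGURED, cls)
    print(cls, "is missing from:", ", ".join(missing) or "nothing")
```

In real use the lines would come from the snoop command's output and the table from the DHCP server's own configuration, rather than being hard-coded like this.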

Fantastic! It booted right up and started the installation program. I walked through the steps to identify the system parameters. A few minutes later, the process stopped and reported that the upgrade had failed because of the non-global zones, and that I'd have to restore them from backup. Yikes. Fortunately, I'd read in the release notes that this was a lie. However, it was supposed to have been fixed by removing the lofs file systems like I'd done.

Back to more googling. After an hour or so of reading manuals on sun.com and posts on Solaris forums, I found a note reporting that the installation utility for Solaris 10 11/06 had a bug that left it unable to upgrade non-global zones when their root was on ZFS. The developers are planning to fix it in the next release, sometime in 2007.

Sigh.

All in all, I've spent about a week working on this along with some other projects, but I'm back to where I'd started. I did upgrade my workstation, so that's nice, and I learned how to set up a network installation server and a DHCP server in the process, and that's nice too, but in the end I have two large servers that I wanted to upgrade but cannot.

My desire to have that handy little "clone" command that had prompted this whole process will have to wait for the next OS release "sometime" this year.

In short, I worked for about a week and really didn't accomplish anything tangible. So what else is new?

1 Comment:

At 3:42 PM, Blogger solobreak said...

An epic tale of resourcefulness... and you didn't take the easy way out:


history>todays_blog_entry.txt

 
