
Minimal-downtime PSUs on CRS with Cloned Golden Images

About a month or two ago, I was doing some work toward developing a process to patch CRS out-of-place using cloned golden images. I held off on publishing anything because I wanted to do some testing, but we’ve been so busy with deployments and maintenance over the past month that I haven’t had a chance. I think it might benefit a few people if I go ahead and post the work I’ve done so far, even though I’m not finished. Thus… note that this material is still very much a work in progress.

This is a relevant topic for any organization that deploys Oracle-based clusters with regularity and needs a solid process for managing the software. Many large companies already use a package/clone approach to manage various patch levels of database software (I’ve been directly involved in this). The company produces a golden-tarball of the database software at the current corporate standard patch level, including one-offs that remediate previously encountered bugs. That golden-tarball is the only thing that needs to be copied or deployed anywhere. It is internally version-controlled and centrally distributed, which reduces both time and mistakes when deploying Oracle software to new systems and updating existing ones. This approach is very advantageous and fully supported by Oracle.
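
To make the packaging side of that process concrete, here is a minimal sketch in Python. Everything named in it (paths, the exclusion list, the internal version string in the tarball name) is an assumption for illustration, and the clone step on the target host is only referenced in a comment; follow Oracle's cloning documentation for your release for the real procedure.

    # Hedged sketch: package a patched database ORACLE_HOME into a golden
    # tarball. All paths and the exclusion list are illustrative assumptions.
    # The target-side clone (extract the tarball, then run
    # $ORACLE_HOME/clone/bin/clone.pl per the cloning docs) is not shown.
    import tarfile

    SOURCE_HOME = "/u01/app/oracle/product/11.2.0/dbhome_1"   # assumed patched home
    GOLD_TARBALL = "/stage/gold/db_11203_corp_v23.tar.gz"     # assumed staging path

    # host-specific pieces that should not travel inside the golden image
    EXCLUDE = ("log", "cfgtoollogs", "dbs", "network/admin")

    def skip_host_specific(tarinfo):
        rel = tarinfo.name.lstrip("./")
        if any(rel == p or rel.startswith(p + "/") for p in EXCLUDE):
            return None                      # drop this entry from the archive
        return tarinfo

    with tarfile.open(GOLD_TARBALL, "w:gz") as tar:
        tar.add(SOURCE_HOME, arcname=".", filter=skip_host_specific)

The point is simply that one artifact captures the corporate standard patch level, and everything deployed downstream is a copy of it.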

The same process hasn’t yet been possible with CRS because Oracle has not yet produced documentation that sufficiently decouples the software installation part from the runtime update part. Therefore, with CRS we can’t replace the typical opatch-based approach with a tarball-and-clone approach.

The official docs do cover the following four scenarios for CRS:

  1. Create a new cluster by copying an already-patched home from some other cluster.
  2. Add a new node to an existing cluster by copying an already-patched home from the same cluster. (I have also tested this with an already-patched home from a different cluster and it works. [11.2.0.3])
  3. Rolling out-of-place upgrade of an existing, running cluster. (Major upgrades like 11.1 to 11.2 only, base releases only.)
  4. Moving the CRS software to a new location on the server.

These four procedures offer a starting point for our investigation toward a single, central, internally version-controlled golden-tarball of CRS with all required patches. It’s also worthwhile to review the docs about adding and removing nodes. The biggest outstanding problem I see is PSUs and one-offs. I would like to create a golden-tarball of CRS with the additional PSUs and one-offs which installs into a new GRID_HOME reflecting the internal version number that we assign (e.g. 11.2.0/grid_23). On my existing systems I’d like to automatically deploy this directory everywhere, then have a simple rolling process to switch the active CRS from the old home to the new one. Easy with RAC homes, apparently impossible with CRS homes…
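
To make the goal concrete, here is a rough Python sketch of the rolling process I have in mind; node names, paths, and the switch_home() helper are all made up for illustration. The crsctl commands are real, but switch_home() is exactly the per-node re-point step that isn't documented for PSU-level changes today, which is the whole point of this post.

    # Hedged sketch of the desired rolling switch: one node at a time, stop CRS
    # out of the old home, re-point the node at the new versioned home, start
    # CRS from there, and verify before moving on. Node names, paths, and
    # switch_home() are assumptions for illustration only.
    import subprocess

    NODES = ["rac01", "rac02", "rac03"]          # assumed node names
    OLD_HOME = "/u01/app/11.2.0/grid"            # assumed active grid home
    NEW_HOME = "/u01/app/11.2.0/grid_23"         # pre-deployed, already-patched home

    def run(node, cmd):
        """Run a command on a cluster node (assumes key-based ssh as root)."""
        subprocess.check_call(["ssh", "root@" + node, cmd])

    def switch_home(node):
        """Placeholder for the per-node re-point step (OLR location, /etc init
        scripts, and so on); this is the gap the rest of this post is about."""
        raise NotImplementedError

    for node in NODES:
        run(node, OLD_HOME + "/bin/crsctl stop crs")    # only this node goes down
        switch_home(node)
        run(node, NEW_HOME + "/bin/crsctl start crs")
        run(node, NEW_HOME + "/bin/crsctl check crs")   # verify before the next node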

Recently I discovered one additional interesting bit of information in an Oracle support note. (Apparently it’s been published for a few years now and I just didn’t find it until now!) Note 1136544.1 gives an official technique for out-of-place PSU application on CRS. I’m not entirely sure, but the steps in this note may have originally been generated by the new oplan utility. One of the interesting things about this note is that it uses a perl script called patch112.pl and a perl library called crsconfig_lib to reconfigure CRS to run in a new directory. The first three of the aforementioned processes (create, add, upgrade) from the official docs eventually call OUI and a root script to set up the copied CRS home. The fourth process (move) calls rootcrs.pl to reconfigure CRS to a new directory. From reading through patch112 and crsconfig_lib, I can see that they update the OLR location and the init scripts in /etc. The cdata directory in the grid home contains the OLR. Based on some cleanup procedures in the documentation, I think that the crf directory (the new cluster health monitor) might also contain clusterware config or state files that would need to be retained when using a golden-tarball. Note 1136544.1 is also interesting because, unlike the move process above, it grabs an inconsistent snapshot of these two directories (i.e. copies them without shutting down CRS first), and CRS doesn’t care: when it switches over to the cloned and patched home, it happily starts up and continues on.
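
As a small aside, the quickest way I know to see which home and OLR file the local node is currently registered against is the OLR pointer file on disk. Here is a tiny hedged check in Python; the file name and keys below are what I observe on my Linux 11.2 systems, so treat them as assumptions and verify on your own platform.

    # Hedged check: which OLR file and grid home does this node point to?
    # /etc/oracle/olr.loc and its olrconfig_loc / crs_home keys match what I
    # see on Linux 11.2; other platforms may differ.
    OLR_LOC = "/etc/oracle/olr.loc"

    with open(OLR_LOC) as f:
        settings = dict(
            line.strip().split("=", 1)
            for line in f
            if "=" in line and not line.startswith("#")
        )

    print("OLR file :", settings.get("olrconfig_loc"))
    print("CRS home :", settings.get("crs_home"))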

Based on all of this, I think that the golden-tarball/clone approach for CRS might actually be possible with some small modifications to the procedure in Note 1136544.1:

  1. Instead of copying existing home, copy an already-patched home from some other cluster. (Created with instructions from official CRS cloning docs including cleanup.)
  2. Copy only the following directories and files from the running clusterware home, instead of everything (see the sketch after this list):
    • Config & state files (inferred from cleanup procedures)
      • cdata
      • crf
      • gpnp
      • crs/install/crsconfig_params
      • crs/install/crsconfig_addparams
    • ASM config files (common knowledge)
      • network/admin/*.ora
      • dbs/*ASM*
    • A few log directories
      • log
      • cfgtoollogs
  3. Skip the step where you apply the patches.
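
Here is a hedged, untested sketch of what step 2 could look like. The path list simply mirrors the bullets above; the home locations are made up, ownership and permissions are not handled, and none of this has been run against a real cluster yet, so treat it as a starting point rather than a procedure.

    # Hedged, untested sketch of step 2: copy only config/state files from the
    # running grid home into the freshly cloned home, and no binaries at all.
    # Home paths are assumptions; run as a user that can read both homes and
    # fix up ownership afterwards (shutil does not preserve owners).
    import glob
    import os
    import shutil

    RUNNING_HOME = "/u01/app/11.2.0/grid"      # assumed currently-active home
    CLONED_HOME = "/u01/app/11.2.0/grid_23"    # assumed cloned, already-patched home

    CONFIG_PATHS = [
        # config & state files (inferred from cleanup procedures)
        "cdata", "crf", "gpnp",
        "crs/install/crsconfig_params",
        "crs/install/crsconfig_addparams",
        # ASM config files
        "network/admin/*.ora",
        "dbs/*ASM*",
        # a few log directories
        "log", "cfgtoollogs",
    ]

    for pattern in CONFIG_PATHS:
        for src in glob.glob(os.path.join(RUNNING_HOME, pattern)):
            dst = os.path.join(CLONED_HOME, os.path.relpath(src, RUNNING_HOME))
            if os.path.isdir(src):
                shutil.copytree(src, dst)      # assumes dst does not exist yet
            else:
                parent = os.path.dirname(dst)
                if not os.path.isdir(parent):
                    os.makedirs(parent)
                shutil.copy2(src, dst)         # copy2 keeps timestamps and modes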

The real trick here is that we have to be very careful to copy over all the important configuration files without missing anything, yet copy absolutely no binaries! The installation will be corrupted if we overwrite a patched file with an unpatched one. I think the list above should be safe and sufficiently complete, but as I said before, I haven’t tested this yet. I will likely give it a try sometime over the next few months and post my results. In the meantime I’m very interested in feedback about this idea – let me know what you think!
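
One way to gain a little confidence in that rule is to checksum the obviously binary parts of the cloned home before the config copy and verify that nothing changed afterwards. A hedged sketch follows; the choice of bin and lib is my own assumption about what counts as "binaries", so widen it as needed.

    # Hedged sanity check: snapshot checksums of the cloned home's bin/ and
    # lib/ trees before copying config files, then compare afterwards; any
    # difference means a patched file was overwritten by mistake.
    import hashlib
    import os

    def checksum_tree(home, subdirs=("bin", "lib")):
        sums = {}
        for sub in subdirs:
            for root, _dirs, files in os.walk(os.path.join(home, sub)):
                for name in files:
                    path = os.path.join(root, name)
                    if not os.path.isfile(path):
                        continue               # skip broken symlinks and such
                    digest = hashlib.md5()
                    with open(path, "rb") as fh:
                        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                            digest.update(chunk)
                    sums[os.path.relpath(path, home)] = digest.hexdigest()
        return sums

    # before = checksum_tree("/u01/app/11.2.0/grid_23")   # assumed cloned home
    # ...copy the config/state files from the running home...
    # after = checksum_tree("/u01/app/11.2.0/grid_23")
    # changed = [p for p in before if before[p] != after.get(p)]
    # assert not changed, "patched files were overwritten: %s" % changed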

About Jeremy

Building and running reliable data platforms that scale and perform. about.me/jeremy_schneider

Discussion

4 thoughts on “Minimal-downtime PSUs on CRS with Cloned Golden Images”

  1. FWIW, MOS note 1136544.1 was originally published in June 2010. The most recent update to the note was in June 2013, so it isn’t really “new” by any definition :).

    Posted by Dan Norris (@dannorris) | November 14, 2013, 7:10 am
    • How about that. I updated the wording in my blog article here – I could only see the modified date and I just never noticed the note until now. Not surprising to me; the Oracle Support KB is really full of hidden gems like this. I’m sure there are quite a few more interesting notes that I haven’t found yet!

      Posted by Jeremy | November 14, 2013, 8:21 am
  2. Hi Jeremy, caught this post and I did this exercise this past summer. The note you refer to is helpful, but there are some missing pieces. I found a presentation called “Minimal Downtime Patching” by Edgars Rudans to be helpful. You are correct about the config, OLR and wallet files being a problem. One of the other big problems that tripped me up is that you must use OPatch version 11.2.0.3.0 due to bug 16990706. The other key is using patch112.pl which you mentioned. The bottom line is that it is possible, but questionable if it’s really worth the effort. I concluded that you’re better off cloning from the existing grid home than trying to create a gold copy across platforms. It just seems to be less work.

    Posted by Andy Rivenes | November 14, 2013, 6:07 pm


