This is the third of twelve articles in a series called Operationally Scalable Practices. The first article gives an introduction and the second article contains a general overview. In short, this series suggests a comprehensive and cogent blueprint to best position organizations and DBAs for growth.
As a starting point for our discussion of scalable practices, it makes sense to talk about three fundamentals. These will provide a strong foundation for everything else we discuss. First, change history; second, checklists; and third, a few server basics.
Change History
Change History is the single most important concept in Operationally Scalable Practices. Read that sentence again.
The basic idea is simple:
- Every old version is kept forever.
- Incremental differences are easy to view.
- Each change has a timestamp and a person associated with it, and both are easy to view.
This idea can be applied to anything stored on a computer, and it's valuable even if you're the only one making changes. Meditate on it until it permeates your thinking and becomes integral to who you are. Someday you'll find yourself with an inexplicable sinking, nauseous feeling while working and suddenly realize it's because you're changing something without change history.
Getting started is easy, and everybody starts the same way. Before updating that Excel spreadsheet, you made a “backup” copy, named it “spreadsheet-orig.xls”, and kept it in the same directory. Eventually you end up with a whole bunch of files scattered across several computers with confusing names and inaccurate timestamps. You can’t really tell which is which anymore – but you can relax a little, because anyone who doesn’t admit they’ve done this is lying.
But not surprisingly, it’s 2013 and we can do much better. Strategies for keeping change history broadly fall into two categories: version control systems and document management. Each of these is a rich subject. I’m going to assume some basic familiarity and move directly into suggesting a few tips specifically for database operations teams.
Version Control Systems
It doesn’t have to be complicated, but the DBA team should have a VCS solution. I can’t state this strongly enough; there’s no excuse for not having one.
- git is an easy and good place to start
- if your company already has licenses and/or expertise for a VCS (Subversion, Team Foundation Server, ClearCase, etc.) then weigh the pros and cons of leveraging it for your DBA team
- everyone on the DBA team should know how to get a copy of code and how to commit changes to it.
- we’ll discuss specifics later, but when you set up your VCS, never compromise rule #3: it should be easy to see the person associated with each change (see the sketch after this list)
- if you’re choosing a version control system, at least make sure you understand the differences between centralized and distributed models. There are advantages and disadvantages to each.
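To make this concrete, here is a minimal sketch of what committing a change with proper attribution might look like. It assumes an existing git repository; the script path, commit message, and author are all hypothetical.

```python
# Minimal sketch (assumes an existing git repo; names are hypothetical).
# Rule #3 in action: every commit records a timestamp and a person.
import subprocess

def commit_change(path, message, author):
    """Commit one file so the change carries a timestamp and a person."""
    subprocess.run(["git", "add", path], check=True)
    subprocess.run(["git", "commit", "-m", message,
                    f"--author={author}"], check=True)

def show_history(path):
    """Print hash, timestamp, author, and subject for every change to path."""
    subprocess.run(["git", "log", "--format=%h %ai %an %s", "--", path],
                   check=True)

commit_change("ddl/add_orders_index.sql",
              "Add index on ORDERS(CUSTOMER_ID)",
              "Jane DBA <jane@example.com>")
show_history("ddl/add_orders_index.sql")
```

Plain `git add`, `git commit`, and `git log` at the command line accomplish exactly the same thing; the point is that old versions, diffs, timestamps, and authors all come for free.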
Document Management
- Some document formats have built-in change history – Microsoft Office and OpenDocument, for example. This can be an improvement over juggling multiple files, but as far as I’m concerned it doesn’t count as real change history.
- Wikis have become commonplace for good reason – they’re great for ad-hoc sharing of information with change history. Getting team buy-in can be hard work but pays huge dividends. One suggestion if you’re setting up a new wiki for your team: carefully weigh security considerations but deploy with a bias toward less restrictive, wider access. There are a lot of community-supported and commercial wiki packages – both hosted and on-premise – at many price points. Even the smallest company should be able to find something that fits their needs!
- More comprehensive and formal document management systems are integrated with approval and project workflows specific to a particular organization and have more sophisticated archival, auditing and coordination features. These are often necessary for critical documentation or regulatory requirements.
It doesn’t need to be complicated, but the DBA team needs some place to keep documents and processes with change history. And once you have this place designated, start building up your team’s process library! Any process that is repeated should be outlined as specifically as possible and added to your team’s documentation library. Even if it slows you down a little, write up processes the next time you execute them. This is another fundamental you can’t compromise on.
Checklists
I remember one time when I messed up a maintenance operation. My team was working on a cluster which hosted several RAC databases. Each database had a large number of consolidated applications, and one of our tasks was to move all activity away from a particular node. Many apps weren’t yet certified for RAC, so after relocating the services we would need to restart the apps and let them cleanly reconnect on the new cluster nodes.
I received notification that the maintenance window had begun, so I relocated the services and told my colleagues to run the script that restarted the applications. We then proceeded to check that all of the applications had come back online. Just as we were finishing up, I had a shocking revelation. I had not relocated the services for all of the databases on the cluster – I had forgotten one!
I had to relocate the services on this forgotten database, and then we had to repeat the restarts and tests for all of the applications on that DB, keeping everyone at their computers for an extra 45 minutes. I certainly didn’t impress anybody with that mistake. The worst part of this story is that an hour after we finished, I realized that I had forgotten to relocate services on one final small database on the cluster, which hosted three more applications.
That was really embarrassing, but it drove home a very important lesson – never go into a maintenance operation without a checklist.
Most people already know that checklists are important. Paul Vallee has a great talk about FIT-ACER where he points out that checklists aren’t just for forgetful people – many extraordinarily smart people won’t work without them. (Like doctors and astronauts!)
Anything is better than nothing, and a simple text document is a start. But personally I think spreadsheets are just as easy, and you can write the word “DONE” next to each item as you complete it. For activities that require coordination, a Google spreadsheet can work great. Of course there are sophisticated tools too (recently I’ve been learning about IBM UrbanCode Release) – but you can get a lot of mileage out of simple tools before you need one of these.
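To guard against exactly the mistake from my story, you can even generate the checklist from the cluster’s full database list instead of from memory. Here’s a toy sketch; the database names and steps are invented, and in real life you’d want to pull the list from the cluster itself rather than hard-code it.

```python
# Toy checklist generator (all names invented). Building the checklist from
# the complete database list means no database can be silently forgotten.
DATABASES = ["PROD1", "PROD2", "REPT1", "SMALL1"]  # ideally pulled live,
                                                   # e.g. via `srvctl config database`
STEPS = [
    "relocate services away from node 2",
    "restart dependent applications",
    "verify applications are back online",
]

with open("maintenance_checklist.txt", "w") as f:
    for db in DATABASES:
        for step in STEPS:
            f.write(f"[ ] {db}: {step}\n")

print("Wrote maintenance_checklist.txt - mark each line DONE as you go.")
```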
Checklists are closely related to documented processes. It takes extra time to plan something before you do it. It takes extra time to practice something several times, or to keep going back and updating the documented process as you actually execute it. But the investment pays dividends, and it’s a critical, fundamental concept in Operationally Scalable Practices.
Server Basics
I’m just going to touch on three basics about servers: inventory, monitoring and backups.
Inventory
This comes down to finding the right balance in staying organized: you can be too sloppy, or you can waste time endlessly trying to make things perfect.
I’ve seen plenty of commercial and in-house inventories, both automated and manually populated. I’ve even designed and coded some of my own systems to gather data and report on large numbers of databases and schemas. One bias I do have is that if you’re using manually populated documents (like spreadsheets) then I strongly favor collaborative platforms – see the document management section above. At a minimum this probably belongs in the wiki, until you need to move to something more sophisticated. Other than that I just think it’s important to keep things relatively simple.
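As an illustration of the automated end of that spectrum, here is a toy collector. It assumes the python-oracledb driver and a hand-maintained seed list of connect strings; the credentials, hostnames, and columns collected are all placeholders, not a prescription.

```python
# Toy inventory collector (all connection details are placeholders).
# Walks a seed list of databases and writes one CSV row per database.
import csv
import oracledb  # pip install oracledb

SEED = [("prod1", "prod1-scan.example.com/PROD1"),
        ("rept1", "rept1-scan.example.com/REPT1")]

with open("db_inventory.csv", "w", newline="") as f:
    out = csv.writer(f)
    out.writerow(["name", "version", "host", "user_count"])
    for name, dsn in SEED:
        with oracledb.connect(user="inventory", password="secret",
                              dsn=dsn) as conn:
            cur = conn.cursor()
            cur.execute("select version, host_name from v$instance")
            version, host = cur.fetchone()
            cur.execute("select count(*) from dba_users")
            (user_count,) = cur.fetchone()
            out.writerow([name, version, host, user_count])
```

Even a script this small illustrates the classic inventory problem: it’s only as complete as its seed list.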
Having a solid inventory can actually be more challenging than you’d expect. Often there are many lists of databases or servers, and each is incomplete in some regard. Hopefully pure utilitarianism will help your team find the right balance in this area!
Monitoring
The most basic centralized monitoring solution is your email inbox: scheduled jobs that send you emails are the first step. But beyond this there are plenty of open source and commercial centralized monitoring solutions. After email you move to paging and then you begin to think about on-call rotations and end-to-end redundancy in both technological and staff terms. I learned a lot about this at Pythian; I think it’s something they do well.
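As a concrete (and deliberately primitive) example of inbox monitoring, here’s a sketch of a scheduled job that emails the team when an instance isn’t open. The connection details, addresses, and SMTP relay are all assumptions.

```python
# Primitive "inbox monitoring" sketch: run from cron, email on trouble.
# Connection details, addresses, and the SMTP relay are placeholders.
import smtplib
from email.message import EmailMessage
import oracledb  # pip install oracledb

def check_instance():
    with oracledb.connect(user="monitor", password="secret",
                          dsn="prod1-scan.example.com/PROD1") as conn:
        cur = conn.cursor()
        cur.execute("select status from v$instance")
        (status,) = cur.fetchone()
        if status != "OPEN":
            raise RuntimeError(f"instance status is {status}")

try:
    check_instance()
except Exception as exc:
    msg = EmailMessage()
    msg["Subject"] = "ALERT: PROD1 health check failed"
    msg["From"] = "oracle@dbhost.example.com"
    msg["To"] = "dba-team@example.com"
    msg.set_content(str(exc))
    with smtplib.SMTP("mailrelay.example.com") as smtp:
        smtp.send_message(msg)
```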
I do have one suggestion on monitoring which I don’t think everyone gets right: always keep the business in mind. (Am I repeating myself again about business priorities?) When monitoring systems grow, they can take on a life of their own and lead to something I call Compulsive Monitoring Disorder. Each time you find (or create) a new metric to monitor, it’s easy to begin worrying that it’s wrong. (Yes, this is closely related to Compulsive Tuning Disorder and the BCHR.) Monitoring is important, but you’ve really got to keep it in its place. Sometimes you may have to treat it like a starving wild animal that wants to eat you alive…
Whatever it takes, always keep the business as your top priority. Even the things you monitor should be driven by business priorities – from end to end. If a certain event regularly causes a page and it’s not high enough priority to fix, then just adjust or disable the event and stop letting those pages waste your time. Add a note to the bottom of a checklist somewhere. Let’s be honest, you may never get to it – and that’s OK!
Backups
Finally, a very brief word on backups:
“YES!”
Backups are kindergarten for DBAs. It’s not hard, and you’d better have it down. But it also demands ongoing practice, like any professional skill. We’ll discuss specifics later, but you should be practicing on a regular basis to stay sharp and verify your processes. Your goal isn’t to do restores from memory; it’s to instantly know which wiki page or support note has the procedure you need – and then cut and paste. The last time I had to do a high-pressure restore of a critical production database, I wasn’t typing from memory… I was double-checking each command against docs or support notes before executing. Sometimes I would even have a teammate look over my shoulder at a command before I pressed the enter key. (Which, as Paul Vallee says, is the most dangerous key on the keyboard!) Just remember that your backups need to be solid; this is foundational for your operations.
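As one hedged example of what regular practice can look like, here’s a sketch of a scheduled job that exercises RMAN’s RESTORE … VALIDATE, which reads the backup pieces without actually restoring anything. The environment details and error handling are assumptions, not a complete solution.

```python
# Sketch of scheduled backup validation via RMAN's RESTORE ... VALIDATE
# (reads backups without overwriting the database). Details are assumed;
# run on the database host with a working RMAN environment.
import subprocess

RMAN_COMMANDS = """
connect target /
restore database validate;
exit;
"""

result = subprocess.run(["rman"], input=RMAN_COMMANDS,
                        capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0 or "RMAN-" in result.stdout:
    raise SystemExit("Backup validation reported errors - investigate now!")
```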
Summary
Change history (version control systems and document management), checklists and a few server basics (inventory, monitoring and backups): how do you score? These basics will give us a solid foundation to build on as we move into more specific and detailed topics related to managing Oracle database infrastructure.
Comments

I like the way you have presented this.
In my experience, people confuse Change Management and Change History, believing that because they have a “history of changes” they are ok. I’ve never used the term “Change History” because of that confusion, so I stick with “Version Control”.
I did want to make one criticism: you have simply assumed everyone understands “backups”. Even in your narrative you don’t discuss “backups” but “recovery”, so why simplify it?
Again, in my experience, many sites do “backups” or have “backup policies”, with no relationship to what they really need to meet recovery objectives and business requirements.
I annoy people by saying I won’t look at or write a “backup strategy” or “backup policy”, I only document “recovery”.
This can be critical when there are related data sources, like a database and flat files (look at Blackboard Learn), that must be treated as a consistent set for recovery purposes but are subject to different and unrelated backup strategies because they are handled by different “teams”. In “tests” they handle it, because it’s actually controlled, while in reality they frequently leave inconsistencies.
So, let’s not call it “kindergarten”.
Thanks for visiting, Andrew! You make some good points about backups; I’ll have to rethink how I’m presenting it here. I really like your idea of focusing on *recovery* rather than backups; I’ll probably rework the end of my article to reflect that idea.
I do still think that in at least one sense recovery is at least “elementary school” – but you raise another sense in which there’s a lot more to it. In my view, kindergarten recovery means you can at least recover *something*. I think that way too many small or growing companies aren’t even there.
But as you point out, recovery must be tied to business needs – especially looking at what RPOs and RTOs are needed, or what different pieces need to be recovered together. And then there’s the whole topic of efficiency with backups – bandwidth efficiency, storage efficiency… I once helped architect an Oracle database backup system that coupled rolling-forward image copies with dedupe storage (interesting project) and I can attest that recovery can get very complicated very fast!
Nonetheless, I stand by my assertion that it’s elementary school. My main reason isn’t because it’s simple. My biggest reason is that in my opinion, if you can’t recover what the business needs, then you have no business progressing any further in my series on “scalable practices” until this is resolved. You can’t go to high school until you finish elementary school. And DBAs should hold off on learning MAA or private clouds or other fun stuff like that until they can recover what the business needs.