Last week was busy… making travel arrangements for this week’s trip to New York (technically New Jersey), some light analysis of AWR reports from Exadata RAT runs, and some heavy troubleshooting of a Solaris x86 RAC cluster with random node reboots. (I think I finally traced the node reboots to a kernel CPU/scheduling problem.) I really did thoroughly enjoy my time in Africa despite being nowhere near Oracle software – but it feels good to be working on challenging cluster problems again!
Before I completely forget the details from my work in Africa, I want to wrap up the article I started earlier this week about high-level lessons learned. By the way, I’m not just stretching obscure aspects of my work in Africa to get stuff that sounds good. I view these cultural lessons, which we learned together, as the most central and most important technical aspect of my work at the hospital. And it might be surprising, but it’s true: the same cultural adjustments are important and oft-missed here in corporate America.
The first two lessons were to (1) understand the fundamentals and (2) avoid unjustifiable complexity. The remaining two lessons I want to talk about are slightly less technical but equally important.
- People First, Technology Second
I’m using the word “people” here to sum up three major components of our accomplishments at the hospital: organizational policy, user education and technical training.
First, a fun technical story:
On day 1 after our arrival at the hospital, several issues required immediate attention – I discussed one example in the previous article. A second issue involved the network links to the outside world. In particular, sites like Gmail were often completely unreachable. I worked hard on this one. Eventually I got a reproducible test case by connecting to an AWS instance very close to the other end of their link, and I compared low-level packet traces from both sides to see what was happening.
I could initiate a very large download and throttle the connection on our end, causing TCP window-full messages to propagate all the way back to the AWS source server – normal behavior for a throttled connection. But I did notice that the source server sent a very large burst of data before it settled into the same rate that I was demanding.
Next I opened a second connection with a different protocol on a different port. The throttled connection was using only a fraction of our contracted bandwidth – yet every single packet on the second port took a consistent 60 seconds or so to get through. If I killed the download, things would zoom along at normal high speeds again.
The killer was that the TLS handshake in HTTPS connections needed a response to its initial packet within 30 seconds in order to continue. Small file downloads had no impact – but if a download was large enough, then SSL quit working completely, although HTTP would still (slowly) connect.
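For what it’s worth, here’s roughly what that test looked like. This is a reconstruction rather than the actual script I used – the host name and URL are placeholders and the numbers are only illustrative. It throttles a bulk download by reading it slowly, then times a plain TCP connect and a TLS handshake on separate connections while the download is still running.

    # A reconstruction of the test, not the original script. BULK_URL and
    # PROBE_HOST are placeholders -- point them at a server you control
    # (I used an AWS instance near the far end of the provider's link).
    import socket
    import ssl
    import threading
    import time
    import urllib.request

    BULK_URL = "http://test-server.example.com/large-file.bin"   # hypothetical
    PROBE_HOST = "test-server.example.com"                        # hypothetical
    THROTTLE_BYTES_PER_SEC = 64 * 1024   # read slowly so the sender must throttle

    def throttled_download(stop_event):
        """Pull a large file while deliberately reading slowly, so our receive
        window fills and the sender has to drop back to our pace."""
        with urllib.request.urlopen(BULK_URL) as resp:
            while not stop_event.is_set():
                chunk = resp.read(THROTTLE_BYTES_PER_SEC)
                if not chunk:
                    break
                time.sleep(1)   # cap ourselves at roughly THROTTLE_BYTES_PER_SEC

    def probe_latency():
        """Time a plain TCP connect and a TLS handshake on separate connections;
        these are where the delays and handshake failures showed up."""
        start = time.time()
        with socket.create_connection((PROBE_HOST, 80), timeout=120):
            print(f"TCP connect on port 80 took {time.time() - start:.1f}s")
        start = time.time()
        context = ssl.create_default_context()
        try:
            with socket.create_connection((PROBE_HOST, 443), timeout=120) as raw:
                with context.wrap_socket(raw, server_hostname=PROBE_HOST):
                    print(f"TLS handshake took {time.time() - start:.1f}s")
        except OSError as exc:
            print(f"TLS handshake failed after {time.time() - start:.1f}s: {exc}")

    stop = threading.Event()
    threading.Thread(target=throttled_download, args=(stop,), daemon=True).start()
    time.sleep(30)    # give the download time to fill whatever queue sits in the middle
    probe_latency()   # run the probes while the throttled download is still active
    stop.set()

The point is just to make the symptom measurable: on a healthy link both probes finish almost instantly, while on ours – with the throttled download still running – they crawled or timed out.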
My best guess was that this network provider (not an African company, by the way) was running some enormous cache in the middle combined with a single packet queue per customer – no fair queuing by TCP connection. It doesn’t make complete sense, but I haven’t yet thought of anything better. Maybe those other packets were simply stuck in line in a huge buffer behind the download’s packets? I actually didn’t think it was possible to configure network equipment to be this stupid… but maybe? Whatever the cause, the end result was that a large download – even a throttled one – generally trashed our whole network connection.
I did pursue a ticket with the network provider, but it took a lot of effort to keep the issue moving. We discussed technical solutions on our end. Block large downloads? Sometimes they’re needed for the business. I spent quite a bit of time researching various QoS solutions – and this was when I learned a very important lesson. I learned especially from an ebook called How To Accelerate Your Internet (with several good African case studies) and some slides from Christian Benvenuti (International Centre for Theoretical Physics).
Communication and people can do things that technology can’t.
We weren’t able to resolve this issue in a “technical” way – so how did we solve it? We had something that I’ll call our PEOPLE strategy – it addressed this challenge and many others too. Here are a few elements of the strategy:
- Wrote a computer policy for the hospital, approved and enforceable at the top.
- Made sure a few tools were in place to identify violations.
- Set up a new wiki for the entire hospital community to use for any purpose.
- Designed the network so that every new device was redirected to a wiki page with the policy (a rough sketch of the idea follows this list).
- Held extensive discussions with all staff and compound residents on several technical subjects.
- Hand-picked a few users for special training on several topics.
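About that fourth bullet: the redirect for new devices was handled by the gateway we already had, but the mechanism is simple enough to sketch. This is only an illustration of the idea, not our actual configuration, and the wiki URL and port are placeholders – an unrecognized client’s first web request gets answered with a redirect to the policy page.

    # A minimal sketch of the "new device lands on the policy page" idea from
    # the list above -- an illustration of the mechanism, not our actual setup.
    # The wiki URL and listening port are placeholders.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    POLICY_URL = "http://wiki.hospital.local/computer-policy"   # hypothetical

    class PolicyRedirect(BaseHTTPRequestHandler):
        def do_GET(self):
            # Answer every request with a redirect to the policy page on the wiki.
            self.send_response(302)
            self.send_header("Location", POLICY_URL)
            self.end_headers()

        do_POST = do_GET   # treat form submissions the same way

    if __name__ == "__main__":
        # The gateway's firewall would steer web traffic from unrecognized
        # devices to this listener so their first page is always the policy.
        HTTPServer(("0.0.0.0", 8080), PolicyRedirect).serve_forever()

In practice the details depend entirely on the gateway or firewall you already have; the point is simply that every new device saw the policy before it saw anything else.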
We kept the policy short, simple, and easy to remember – it was three rules that I could count off on my fingers. But it covered our needs well, protecting us from bandwidth problems and from malware.
I worked to catalyze broad user education. Lots of conversations with non-technical people. Basic conversations, not fancy ones! It began to cultivate a new culture around technology.
I briefly mentioned a new wiki above – this was also a big part of our user education strategy. Except for a small restricted section, the wiki was 100% open to be updated by anybody. Part of our cultural change included training everyone at the hospital to use this wiki as a central repository for processes and information. With a high frequency of arrivals and departures, and people often rotating into and out of well-defined roles, the usefulness of the wiki was immediately apparent.
The third important part of our people strategy was the special training. There were three unique qualities of this training:
- Documentation: Much like the work I’ve done for Oracle RAC Attack, we worked very hard to boil down tough concepts and processes into extremely detailed step-by-step instructions on the wiki. Whenever possible, I taught people to start practicing step-by-step documentation for everything they do.
- Reproducibility: Enabled by the growing documentation library on the wiki, we expected that everyone who received training would be able to train someone else on the same thing. When we switched to a new internet access system, I trained a few people on setting up accounts and had those few people handle the rest of the compound.
- Functional Selection (rather than technical selection): Also enabled by the documentation library, we started choosing the most logically positioned people for jobs instead of the most technical people. As one example, several non-IT people helped set up the new internet accounts. As a second example, the special projects coordinator – despite not being an IT person at all – followed extensive documentation to practice a bare-metal server recovery… and built our new server in the process! As a former director, she was logically positioned for training which could potentially provide access to any data at the hospital.
In corporate America, we typically have more financial resources, more technology at our disposal, and much longer staff retention. Even if you have the same problems that we had in Africa, your situation will require different strategies. But the main point here is universal and absolutely crucial: there are two sides to every project, the technical side and the people side.
Even if you’re working with outside partners to provide technical expertise, there is an in-house “people” component to your project. Even if it’s being called an appliance, it’s gonna be your baby to use & maintain & retire someday. Even if it’s a cloud-based service, you have to deal with the upgrades & functionality changes & workarounds for buggy, uncommon use cases. Every new technology you start working with will require somebody in-house to start learning about it. Never underestimate the people side of technology! It’s always there and it’s always important.
I think that we easily get immersed in the technology side of our projects (myself included) and we often need a reminder to keep the people side in view.
- Adapt the Process instead of Customizing the Product
Finally, I have one quick word about processes and products. Now, I don’t want to overstate this point; obviously there’s a cost-benefit analysis for each case – and often changing business processes is not an easy thing. But I do believe that as technologists we have a tendency to favor customization a little more than we should. And in general (technologists, managers, and executives alike), I think we tend not to investigate many possible process changes because we assume they aren’t really feasible. If we start asking, we might be surprised how willing people sometimes are to change the way they work in order to make the business as a whole work better. If we’re getting some good new tools, then people can totally understand making changes to better accommodate those tools.
When I was working at the hospital, there were two specific places where this discussion happened. The first was around a new pharmacy system we were putting in place. The existing system is heavily paper-based, and for many reasons the hospital is pushing ahead with a new computerized system. It’s the classic conversation about software customization, and frankly I didn’t do an awful lot besides convince the hospital staff to work more closely with the company that writes the pharmacy software. (Great company – eager to help, and they totally understand this conversation.)
The second conversation was a little more unusual: server administration – specifically, NTFS file permissions and administrator access to user home directories. One of the hospital’s requirements was the ability to audit the contents of redirected user home directories. By default these directories are created so that even Administrators cannot peer into them.
Of course, none of us were experienced Windows server administrators. With some fiddling we managed to get folder redirection to work. I had found that many folder redirections on the old network didn’t work because of incorrect permissions on the home directory folders. We could have done more testing to figure out exactly which permissions worked – but we were running short on time. So this was where I encouraged the staff to “go light on changing things”. In the end we found another very good solution that didn’t require any NTFS permission changes.
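For what it’s worth, here’s the kind of quick permission check I have in mind – a rough sketch, not something we actually ran at the hospital. It assumes you’re on the Windows file server with rights to read the ACLs, the share path is made up, and it simply shells out to icacls and flags home folders where Administrators don’t appear in the ACL at all (the audit concern described above).

    # A rough sketch of a permission check, not something we actually ran.
    # Assumes it runs on the Windows file server with rights to read ACLs;
    # the share path below is made up.
    import subprocess
    from pathlib import Path

    HOME_ROOT = Path(r"D:\RedirectedFolders")   # hypothetical home-directory root

    def acl_text(path: Path) -> str:
        """Return the raw icacls output for a folder (icacls ships with Windows)."""
        result = subprocess.run(
            ["icacls", str(path)], capture_output=True, text=True, check=False
        )
        return result.stdout + result.stderr

    for home_dir in sorted(p for p in HOME_ROOT.iterdir() if p.is_dir()):
        # Folder redirection grants the user exclusive rights by default; if
        # Administrators never appear in the ACL, nobody can audit the folder.
        if "Administrators" not in acl_text(home_dir):
            print(f"No Administrators entry: {home_dir}")

Something this small would have told us quickly which of the old redirected folders had sane permissions and which didn’t, without changing anything.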
I often aim for environments that are as close as possible to what the engineers who build the software actually use. I’ll ask people from my vendors, “What do your engineers mostly develop on?” With some companies it’s hard to get a straight answer, but I like to ask!
When it comes to customization in Oracle databases, of course underscore parameters come to mind right away. Those little buggers can be life-savers sometimes… but their indiscriminately global effects can also be killers. Be careful!
Well, I hope you can see a bit of why I enjoyed the work in Africa. Even though I wasn’t working for a Fortune 500 company with millions to invest in bleeding-edge technology, there were still interesting and challenging projects. It was a great opportunity to hone my skills at helping people find the best ways to use technology in a real-life business. And to be honest, I think that’s the one thing I’m most passionate about.
I hope that I’ve challenged your thinking a little bit around how technology serves your company. I hope that these new ideas push you to the next level in your professional career. And with any luck I might have even convinced a few people to help out in the non-profit world.