Tag Archives: Operations

DevOps and Ownership

“You build it, you run it” has been my mantra for many years now. A number of times I was approached by management and they asked who should be operating stuff that I had built. Because, allegedly, my time was too precious for doing such a mundane task like operations.

This is to all managers: Operations is neither mundane nor something for junior staff. It is in fact exactly the opposite. Operations is what keeps the organization alive. Operations is where the best people should be, because here the rubber (the developed software) hits the road. Operations is your last line of defense, when (not if) something goes catastrophically wrong. Operations is a key influencing factor on your organization’s ROI. Operations determines your ability to be agile on the market. Operations is key for customer satisfaction. I could go on and on, but likely you long got my point.

Of course there are some aspects to operations that, when things are done the wrong way, are repetitive and far from challenging. But that should mostly be behind us. Yes, in the 1960s we had people who did nothing but enter data. And until not too long ago a lot of operations was just ticking off check boxes on a to-do list. But with things like infrastructure as code (see my recent post on starting with Chef Infra Server), this should really be something from the past. What you need today are people who take pride in running a lean, highly automated, highly resilient IT organization.

And that is where it should be clear to everybody, that DevOps is much more about organization, knowledge, and collaboration beyond traditional “borders”, than about technology.

By the way: My response to management about who should run my stuff, has always been “me”. Because the applications were built to be as maintenance-free as possible. Only the occasional support ticket had to be answered and with proper logging/auditing that is nothing that takes a lot of time. And fixing the occasional bug was not a big deal either, thanks to Clean Code and test-automation.

This allowed me to support 6 business-critical applications as a “side-project”, i.e. no time was officially allocated. Comparable applications operated by other departments had at least three people full-time for support only.

Ignoring Warnings in Log Files

I know that many people think along the lines of ignoring warnings in log files. But I am responsible for an environment where we  run applications that are sort-of business critical. So I need to take warnings seriously, especially if their wording is not really clear. The consequence is that any unnecessary warning causes operational problems. Because who can tell me, kind-of “written in blood”, that I can absolutely always and forever ignore this warning? Only in that case could I consider adding an exception to the log monitoring system and of course to the system documentation, the operations manual, etc. So for me, and from my consulting past I know many customer think the same, this is not just a small nuisance but a real issue.

On the other hand I have had many discussions with people who told me that I could just ignore this or that entry. In many cases it turned out after some discussion, that the log level was actually chosen badly and INFO would have been more appropriate. In that respect the semantics of the commonly used log levels deserve a closer look. Here are two good links (link 1, link2) for definitions. When I first read them, my initial thought was that I might have overreached with the first paragraph of this post. But looking at the example from link 1 about WARN a bit closer, I think my concerns are still valid.

So what can be done? Reality is that you rarely have the ability to get a log statement changed. So you do need a scalable approach to deal with log messages that you consciously choose to ignore. It involves primarily two things: Firstly, you need to have documentation why the decision was made that a given log message is not critical. Secondly, there should be an automated link with your log file monitoring system, that configures an exception in it. Depending on your business this whole area might also be regulated, so the legal side may very well play a role as well. But that is outside the scope of this post.

I know this post is not really actionable, but still wanted to share my thoughts.

 

Michael T. Nygard: Release It! Design and Deploy Production-Ready Software

[amtap amazon:asin=0978739213]

I bought this book about a year ago and -shame on me- only just read it. It’s really great for everyone that is interested in designing and developing robust software. So in that sense a must-read for all of us.

The book is organized in four general sections: Stability, capacity, general design issues and operations. For all of them a number of typical scenarios are described and general approaches discussed. The author seems to have a pretty strong Java and web application background (at least those are the areas of most of his examples), but the patterns and solutions are great for all systems, languages and use-cases.

So overall what we have here is a book that is fun to read and at the same time offers great insight into large-scale software systems. In my view everyone who works in this field can benefit from this book.

The author is also blogging on Amazon.com and seems to cover quite a few interesting topics.