Configuration Management – Part 9: The Audit Trail

Keeping track of changes is a critical capability of every configuration management system, not least because legal frameworks like SOX (the Sarbanes-Oxley Act) require it. It can be accomplished in several ways: basically, you can either use an existing tool like a VCS (version control system) or have something custom-built.

When possible, I tend to prefer a VCS because it is (hopefully) already part of your process and governance approach. In a typical workflow the underlying assets (i.e. configuration files) are changed and the VCS client is then used to commit the change. The commit message records the intent, which is the critical piece of information here.

But there are cases where you need to be able to track changes outside the VCS. In all cases I have seen so far, the reason was that some information should not be maintained within the VCS for security or operational reasons. While organizations are often relaxed about data like host names in non-PROD environments, this changes abruptly when PROD comes into play. I always think “security by obscurity” when I have that discussion, but it is also a fight not worth having.

The other reason is operational procedures. The operations team often has a well-established approach for maintaining configuration files for many applications in a unified way, typically a dedicated location on network storage where the configuration data sit. Ideally, there should also be a generic mechanism to track changes there. A dedicated VCS is of course a good option, but operations staff without a development background often (rightly) shy away from that route.

So it comes down to what the configuration management system itself offers. What I have implemented in WxConfig is a system where every operation that changes configuration data results in an audit event that gets persisted to disk. It includes metadata (e.g. which user initiated the change and from which IP address), the actual change (e.g. a file save from the UI or a value changed via the API), and the old and new versions of the affected configuration file.

The downside compared to a well-chosen VCS commit message is that the system cannot record the intent. On the other hand, no change is ever lost, because no manual activity is required. In practice this far outweighs the missing intent, at least for me. It has also proven helpful during development when I had accidentally removed data: restoring them from an audit record was far easier than looking them up in their original source.

All audit data are persisted to files, and the metadata is recorded as XML. That allows automated processing, should it be required by e.g. a GRC system (Governance, Risk Management, and Compliance) or legal frameworks like the aforementioned Sarbanes-Oxley Act.
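To give an idea of what such a record might contain, here is a simplified sketch; the element names and layout are for illustration only in this post, not the actual WxConfig format:

<!-- illustrative example only, not the real WxConfig schema -->
<auditEvent timestamp="2017-12-26T13:07:04Z">
  <user>jdoe</user>
  <clientAddress>192.0.2.15</clientAddress>
  <operation>setValue</operation>
  <target file="app/connection.properties" key="db.host"/>
  <oldValue>db-test.example.com</oldValue>
  <newValue>db-prod.example.com</newValue>
  <snapshot before="audit/20171226-130704/connection.properties.old"
            after="audit/20171226-130704/connection.properties.new"/>
</auditEvent>

A GRC tool or any other automated consumer could then pick up these files and process them without manual intervention.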

Ubiquiti Networks UniFi Controller on Raspberry Pi: Startup exits with RC=1

As mentioned in My WiFi Setup with Ubiquiti Networks UAP-AC-PRO, I run the UniFi Controller software on a Raspberry Pi 3. There is a ready-made package available for Debian and Ubuntu Linux that can easily be used for this, and I have been doing so for more than a year.

Just yesterday, though, I broke things by overdoing it a bit with the removal of unneeded software from the Raspberry Pi. Through some dependency “chain” the UniFi Controller had been removed as well, and after re-installation it did not work anymore. Instead, I saw constant CPU utilization of an entire core by Java and errors in /var/log/unifi/server.log :

[2017-12-26 13:07:04,783] <launcher> INFO system - *** Running for the first time, creating identity ***
[2017-12-26 13:07:04,791] <launcher> INFO system - UUID: yyyyyyy-yyyy-yyyy-yyyyyy-yyyyyyy
[2017-12-26 13:07:04,817] <launcher> INFO system - ======================================================================
[2017-12-26 13:07:04,819] <launcher> INFO system - UniFi 5.6.26 (build atag_5.6.26_10236 - release) is started
[2017-12-26 13:07:04,819] <launcher> INFO system - ======================================================================
[2017-12-26 13:07:04,867] <launcher> INFO system - BASE dir:/usr/lib/unifi
[2017-12-26 13:07:05,057] <launcher> INFO system - Current System IP:
[2017-12-26 13:07:05,059] <launcher> INFO system - Hostname: zzzz
[2017-12-26 13:07:05,071] <launcher> INFO system - Valid keystore is missing. Generating one ...
[2017-12-26 13:07:05,072] <launcher> INFO system - Generating Certificate[UniFi]... please wait...
[2017-12-26 13:08:33,574] <launcher> INFO system - Certificate[UniFi] generated!
[2017-12-26 13:08:53,004] <UniFi> ERROR system - [exec] error, rc=1

The last couple of lines kept showing up repeatedly, so obviously the system was trying to restart over and over again. When you search the Internet for this problem, you will find out that you are not alone. Most solutions address available memory, and not everyone succeeds with the various approaches to increase it (typically by reducing the memory reserved for graphics and increasing swap space).
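For completeness, those memory-related tweaks usually look roughly like this on Raspbian; the values are only examples, and as it turned out they were not my problem:

# in /boot/config.txt: reduce the memory reserved for the GPU
gpu_mem=16

# in /etc/dphys-swapfile: enlarge the swap file, then re-create it
CONF_SWAPSIZE=1024

sudo dphys-swapfile swapoff
sudo dphys-swapfile setup
sudo dphys-swapfile swapon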

What I realized was that most discussions referred to older versions, and a recurring theme was that things changed between minor versions. So something that had worked for v5.6.19 did not necessarily work for v5.6.22 and vice versa. Changes to how Java was dealt with were also mentioned quite often. Running Java-based applications on Linux can be somewhat delicate, so I do not blame the folks at Ubiquiti Networks for that.

That was when it dawned on me that the JVM on my system had changed. Before the accidental cleanup I had used the Oracle 8 JVM that gets installed via the Debian package oracle-java8-jdk. So I re-installed the latter and configured it as the default JVM via

sudo apt-get install oracle-java8-jdk
sudo update-alternatives --config java
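To double-check that the Oracle JVM is now the default, a quick look at the version output helps; the Oracle build reports “Java(TM) SE Runtime Environment”, whereas an OpenJDK build identifies itself as such:

java -version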

This solved my problems instantly and things are up and running again.

ESXi 6.5: Intel NICs Not Found

While playing around with an ESXi 6.5 test system, I accidentally killed all network connectivity by setting the NICs to pass-through. This post gives a bit of background and the solution that worked for me.

The system is home-built with a Fujitsu D3410-B2 motherboard and an Intel dual-port Gigabit NIC (HP OEM). The motherboard has a Realtek RTL8111G chip for its on-board NIC, which allegedly works with community drivers, but not out of the box. One of the things I want to run on this box is a pfSense router. So when I discovered that the Realtek NIC was available for pass-through, I enabled it. I also enabled one(!) of the two ports of my Intel dual-port NIC. At least, that is what I had intended to do.

Because what really happened was that all three NICs were set to pass-through, which of course meant that ESXi itself no longer had a NIC available. The issue showed up after the next reboot, when the console told me that no supported NICs had been found in the system. Perhaps not wrong in strict terms, but certainly a bit misleading when you are not very experienced with ESXi.

Searching the net did not provide a real answer. But after a couple of minutes it occurred to me that my pass-through change might be the culprit. The relevant file where these settings are stored is /etc/vmware/esx.conf. I searched for lines looking like this

/device/000:01.0/owner = "passthru"

and replaced them with

/device/000:01.0/owner = "vmkernel"

After that I just had to reboot and things were fine again.
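For reference, the replacement could also be scripted from the ESXi shell, assuming the BusyBox sed on your build supports in-place editing; back up the file first, and note that the device IDs will differ per system:

cp /etc/vmware/esx.conf /etc/vmware/esx.conf.bak
sed -i 's/owner = "passthru"/owner = "vmkernel"/' /etc/vmware/esx.conf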


Re-Writing Software from Scratch

It is not uncommon for an existing piece of non-trivial software to get re-written from scratch. Personally, I find this preference for a greenfield approach interesting, because it is not something I share. In fact, I believe it is a fundamentally flawed approach in many cases, so why are people still doing this today?

Probably because they just like it more, and because it is perceived, at least when things get started, as more glorious. But when you dig deeper, the picture changes dramatically. Disclaimer: of course there are situations where re-starting is the only option, but they typically involve aspects like required special hardware not being available any more.

When I started making notes for this post, I began with technical arguments. They are of course relevant, but the business side is much more critical. Just re-phrase the initial statement about re-writing to something like this:

Instead of gradually improving the existing software and learning along the way, we spend an enormous amount of money on something that delivers no additional value compared to what we have now. For this period, which we currently estimate to be 2 years (it is very likely to be longer), apart from very minor additions, the business will not get anything new, even if the market requires it; so long-term we risk the existence of the organization. And yes, we may actually lose some key members of staff along the way. Also, it is not certain that the new software will ever work as expected. But should it really do so, the market will have changed so much that the software is not very useful for doing business anyway, and we can start all over again.

Is there anyone who still thinks that re-writing is generally a good idea?

Let us now change perspective and look at it from a software vendor’s point of view, because the scenario above was written with an in-house application in mind. What is different when we look at a company that develops and sells enterprise software? For this text, the main properties of enterprise software are that it is used in large organizations to support critical business processes. How keen will customers be to bet their own future on something new, i.e. untested? But even if they waited long enough for things to stabilize, there would still be the migration effort. And if that effort is coming their way anyway, they might just as well look at the market for alternatives. So you would actively encourage your customer base to turn to the competition. Brilliant idea, right?

What becomes clear when looking at things this way is the core value proposition of enterprise software: investment protection. That is why we will, for the foreseeable future, continue to have mainframes with decades-old software running on them. Yes, these machines are expensive. But the alternatives are more expensive and, in addition, pose enormous risk.

In the context of commercial software vendors, one argument for a re-write is additional revenue. It is often seen as easier to get a given amount of money for a new product than for an improved version of something already out there. But that is the one-off view. What you typically want as a software vendor is a happy customer that pays maintenance every year and, whenever they need something new, turns to you first rather than to the competition. Also, such a happy customer is by far the best marketing you can get. It may not look as sexy as acquiring new customers all the time, but it certainly drives the financial bottom line.

Switching over to the technical side, there are a few arguments that are typically made in favor of a restart. My favorite is the better architecture of the new solution, which will be the basis for future flexibility, growth, etc. I believe that most people are sincere here and think they can come up with something better. But the truth is that unless someone has done something similar successfully in the past, there is a big chance that the effort will be hugely underestimated. Yes, technology has advanced and we have better languages and frameworks. But on the other hand, the requirements have also grown dramatically. Think of high availability, scalability, performance and all the rest. Even more important, though, is the business side. With something brand new, people will have greatly increased expectations. So giving them something like-for-like will probably not be seen as a success.

The not-invented-here syndrome is also relevant in this context, particularly with more junior teams. I have seen a case where an established framework used in more than 9,000 business-critical (i.e. with direct impact on revenue) installations was dismissed in favor of something the team wanted to develop themselves. And I can tell you that the latter was a really bad implementation. Whether it was a misguided sense of pride or a fundamental lack of knowledge I cannot say. But while this was certainly the most extreme incarnation I have seen so far, it was not the only one.

So much for my thoughts on the subject. Please let me know what you think about this topic.

Theory of Constraints and “The Goal”

Two years after reading Eliyahu M. Goldratt’s famous book “The Goal” for the first time, I had a second go at it. It proved, again, to be an entertaining and at the same time enlightening read. I will not come up with yet another summary; you will find plenty of those already. Instead, I want to point out a few interesting links to other areas.

Let me start with an interesting connection to my article about personal objectives. One of the statements I made there was that a global optimum cannot be reached by local optima everywhere; some local deficiencies need to be accepted to reach the overall goal. (Apart from the common-sense aspect, this is also formally proven in systems theory.) Exactly the same point is made in “The Goal” when Dr. Goldratt writes that at the end of the day, only the amount of shipped goods, i.e. goods sold, counts. Any intermediate over-achievement (“We beat the robot”) is really completely irrelevant.

Another interesting link is to the book “Turn the Ship Around” by David Marquet (see this for a quick summary). Mr Marquet makes the point that in his experience a deciding factor for an organization’s performance (in his case the crew of a nuclear attack submarine) is whether its members work to avoid individual mistakes or to achieve a common goal. Similarly, in “The Goal” there is a paragraph about the success of various changes in a manufacturing context. It is basically about the different intent (and Mr Marquet calls his style intent-based leadership, by the way). Instead of trying to reduce costs, the manufacturing team looked at things from a revenue-generation point of view.

In the addendum “Standing on the Shoulders of Giants” a short introduction to the Toyota Production System (TPS) is given. One interesting comment in this section is that other car manufacturers which have implemented a comparable system have not achieved the same level of improvement. This is attributed to the guiding principle: Toyota focused on throughput (and by that ultimately on revenue), whereas all the others looked at cost reduction. Analogous to that is a study (source unknown as of this writing) about customer satisfaction vs. profit optimization. Allegedly, those companies that focus on customers not only win on this front, but in the mid- to long-term they also consistently outperform those that focus on financials.