For almost two years now, The Phoenix Project has been making its way around our IT department. Every couple of months, it seems, someone else discovers this fictional narrative and is amazed at the insights it offers. We’ve discussed it over cube walls, at lunch, and even in meetings (rabbit trails!). We started asking things like, “Who’s the Brent of your team?”. Leaders, IT professionals, developers, security teams… it has something to say to everyone. Good non-fiction books about theology or business are adequate, but wrap those same ideas in a fictional narrative like The Lion, the Witch and the Wardrobe or The Phoenix Project and they capture the imagination so much better.
Below are a collection of direct takeaways from The Phoenix Project as well as my own thoughts…
IT isn’t a department. It’s a strategic capability of the business. It is a competency you need to possess as an entire company.
The DevOps approach: The result of the IT community realizing that the developer’s approach of Agile development is wonderful but has to be much broader than that in order to be useful to the business. Agile doesn’t work until it’s a part of something a lot bigger involving IT operations, UAT, the business, customers, and of course, application developers. That something bigger is what’s now being called the “DevOps approach”.
There are 4 distinct categories of IT Operations work (each with their own expectations):
- Business projects
- Internal IT projects
- Changes
- Unplanned work
Three ways to create fast flow of work through development and operations:
- Attack the constraints. A constraint is typically one person who wears multiple “expert” hats and through whom a disproportionate amount of work “has to” go in order to get done correctly. There are 5 steps for attacking the constraints, which is one of the main ways to increase the flow of work:
- Identify the constraint (who are they?)
- Exploit the constraint (get the most out of them – meaning: only critical work)
- Subordinate the constraint (everything else is secondary to achieving #2)
- Elevate the constraint (change other systems to increase capacity of constraint)
- If a new constraint arises during any of the previous steps, go back to step 1
- Eradicate sources of unplanned work
- Standardize and automate infrastructure. Without that we have an infrastructure snowflake problem (no two servers/load balancers/deployments are alike).
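One lightweight way to spot the snowflake problem is to fingerprint each server’s configuration and flag any machine that drifts from the fleet. A minimal sketch (the hostnames and config keys here are illustrative, not from the book):

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Hash a server's normalized config so identically built servers match exactly."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def find_snowflakes(fleet: dict) -> list:
    """Return hostnames whose config differs from the fleet's most common fingerprint."""
    prints = {host: config_fingerprint(cfg) for host, cfg in fleet.items()}
    if not prints:
        return []
    most_common = max(set(prints.values()), key=list(prints.values()).count)
    return [host for host, fp in prints.items() if fp != most_common]

# Hypothetical fleet: web3 was hand-tweaked and no longer matches its siblings.
fleet = {
    "web1": {"nginx": "1.24", "tls": "1.3"},
    "web2": {"nginx": "1.24", "tls": "1.3"},
    "web3": {"nginx": "1.18", "tls": "1.2"},
}
```

In practice a real configuration-management tool does this continuously, but the principle is the same: if two servers that should be identical hash differently, you have a snowflake.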
Work capacity planning
Every Service Catalog Item in our ITSM needs the following metadata attached to it:
A bill of inputs, list of required work centers, routing/flow needed between those work centers. Together this is called a “bill of resources”. Once we have that, along with the Service Catalog work orders and our resources, we’ll be able to get a better handle on what our demand and capacity is. This is how we discover whether we can accept new work or not and what type of work we can accept.
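The “bill of resources” idea can be sketched as a small data model: each catalog item carries its inputs, work centers, and routing, and summing expected hours per work center across open work orders gives a rough demand picture to compare against capacity. The class and field names below are illustrative assumptions, not anything prescribed by the book or a particular ITSM tool:

```python
from dataclasses import dataclass

@dataclass
class BillOfResources:
    """Metadata attached to one Service Catalog item (names are illustrative)."""
    item_name: str
    inputs: list        # bill of inputs: approvals, credentials, hardware, ...
    work_centers: list  # teams/people the work must pass through
    routing: list       # ordered flow between those work centers

def total_demand_hours(work_orders: list, hours_per_center: dict) -> dict:
    """Rough demand estimate: sum the expected hours each work center will absorb."""
    demand = {}
    for order in work_orders:
        for center in order.work_centers:
            demand[center] = demand.get(center, 0) + hours_per_center.get(center, 0)
    return demand

# Hypothetical catalog items and per-center effort estimates.
orders = [
    BillOfResources("New VM", ["approval"], ["virtualization", "networking"],
                    ["virtualization", "networking"]),
    BillOfResources("Firewall rule", ["ticket"], ["networking"], ["networking"]),
]
hours = {"virtualization": 2, "networking": 1}
```

Comparing the resulting per-center demand against each work center’s available hours is what tells you whether you can accept new work, and of what type.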
Preventative oil changes and vehicle maintenance policies are like preventative vendor patches and change management policies. By showing how IT risks jeopardize business performance measures, you can start making better business decisions. So preventative maintenance must be shown to prevent actual events that affect business performance, otherwise it’s just busy work. Routine maintenance policies had better be shown to aid business performance, or else drop them!!
Preventative maintenance (like managing a strong, comprehensive monitoring solution) needs to be elevated to a higher priority in order to ensure system availability. If you aren’t improving, entropy ensures you’re getting worse. This is part of the “Improvement kata”. One way to develop this culture is to require everyone to make at least one improvement, of any size, once a month and report what it was.
Why the 30-minute change took 2 weeks!!
Wait Time = %busy / %idle (or b/(100−b), where b = % utilized). While the book is over-simplifying things here, the truth is that whatever the exact curve is, it is definitely asymptotic (exponential-like; see the general shape of the graph below). Therefore, if a resource is 90% utilized, the wait time is 9x longer than if that person is 50% utilized. Wait time is 11x longer when 99% utilized compared to when you were 90% utilized. Here’s what the graph visually looks like (this is why work sits on our desks for so long before we get to it):
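The rule of thumb above is easy to check directly. A minimal sketch computing the relative wait-time factor at a few utilization levels:

```python
def wait_time_factor(utilization_pct: float) -> float:
    """Relative wait time from the book's rule of thumb: %busy / %idle."""
    busy = utilization_pct
    idle = 100 - utilization_pct
    if idle <= 0:
        raise ValueError("utilization must be below 100%")
    return busy / idle

# At 50% utilized: 50/50 = 1x  (the baseline)
# At 90% utilized: 90/10 = 9x  (9x longer than at 50%)
# At 99% utilized: 99/1  = 99x (11x longer than at 90%, since 99/9 = 11)
```

The numbers make the asymptote obvious: going from 90% to 99% busy, a seemingly small change, multiplies queue time elevenfold, which is exactly why work sits untouched for so long on a fully loaded expert’s desk.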
“Improving daily work is even more important than doing daily work” -Gene Kim
One terrible developer behavior is to spend all your cycles on features instead of stability, security, scalability, manageability, operability, continuity, and all those other beautiful ‘itties (aka “nonfunctional requirements”). This is bad because cutting those kinds of corners builds “technical debt”, which, like financial debt, compounds over time; if not paid back, it will cause the organization to pay more and more interest in the form of unplanned work (and outages). Unplanned work is very expensive because it comes at the expense of planned work, which causes us to cut corners on our planned work, causing more technical debt, and so on. This can be fought by putting IT Operations in the developers’ feedback loop and also by involving developers more in operations. Also, non-functional requirements are as important as the functional requirements. No more, no less.
Measure the Right Thing
Regardless of what else we measure, if we aren’t measuring these things then we aren’t measuring the right things (if we don’t know these, how do we ever know what’s important and what’s not?):
- Are we competitive?
- Do we know what to build? (do we understand customer needs and wants?)
- Do we have the right products? (Product portfolio)
- Can we build it effectively? (R&D Effectiveness)
- Can we ship it soon enough to matter? (Time to market)
- Can we convert products to interested prospects? (sales pipeline)
- Are we effective?
- Are customers getting what we promised them? (Customer on-time delivery)
- Are we gaining or losing customers? (Customer retention)
- Can we factor these into our sales planning process? (Sales forecast accuracy)
For each key business performance measure (see list above for example), IT leaders should know what IT systems those measures rely on. Additionally we need to understand the risk those IT systems pose to the business. Once we understand that we can develop IT controls that compensate for those risks. Demonstrating we can mitigate the IT risks means we can directly affect key business performance measures for the better!
Deciding which systems get replaced by better systems and when
Determine which applications and infrastructure are “fragile” and plan to replace them soon. Replacing these may be just as valuable to the business as replacing an application with one with more features/capabilities. Everyone knows which applications or infrastructure are the “fragile” ones: they’re the ones everyone is afraid to touch, because typically when they do, something goes down.
In order to design for quality, we need to create constant feedback loops from IT operations back into development, designing quality into the earliest stages of the product. To do that you can’t have nine-month-long releases. You need much faster feedback.
Dev, Ops, QA, and the business working together are a super-tribe that can achieve amazing things. The book Continuous Delivery is the seminal work that codifies the practices and principles of this “DevOps approach” to application delivery. In a nutshell, we need a deployment pipeline where we can create test and production environments, and then deploy code into them, entirely on-demand. Doing this will reduce setup times and eliminate errors so that we can match whatever rate of change developers set the tempo at.
Goal: Be able to manage 10 production deploys per day!
- Typical response: “What?!? I’m pretty sure no one is asking for that. Isn’t that setting the target higher than what the business needs?”
- Rebuttal: Even if you don’t need to do 10 deploys/day, having this capability means you are essentially prepared for a deploy at a moment’s notice whenever the business needs it and to have confidence it can be done reliably. Business agility is all about detecting and responding to changes in the market and being able to take larger and more calculated risks. This means being able to continually experiment. If you can’t out-experiment & beat your competition in time to market and agility you’re sunk. Features are always a gamble. So the faster you can get those features to market and test them, the better off you’ll be. Incidentally, this also means you’ll have something in the market faster so that the business starts making money faster too. Secondly, since it’s hard to quantify how reliable a deploy will be, the next best thing is to measure something that is quantifiable that tracks close with reliability. If you can handle 10 deploys per day then you can, with high assurance, likewise claim your deploys are reliable. Fragile deploys can’t be done at that rate. Aside from reliability, being able to handle 10 deploys per day also means deploys don’t take a lot of employee time and even though you don’t actually do that many, you can be ready to do one absolutely anytime it’s needed with little or no heads up.
- Outcomes matter. Processes and controls don’t matter; not even the work you complete matters. Only outcomes matter.
- In order to discover where the business relies on IT to achieve its goals, you have to leave the realm of IT
- Our goal is not just to improve business performance but to get earlier indicators of whether we’re going to achieve our goals or not
- When work is requested of IT operations, that work needs to be prioritized consistently based on a communicated set of values (otherwise our customers won’t be able to trust that the work being released will be worked in a timely-enough manner).
- A now fairly famous talk (a little NSFW language): “10+ Deploys per day: Dev and Ops Cooperation at Flickr” (this was DevOps before it had a name)
- Continuous Delivery (Humble and Farley)
- Five Dysfunctions of a Team (Patrick Lencioni)
- The Goal (Eliyahu M. Goldratt)
- The Phoenix Project Discussion Group
- IT Revolution Press Blog