Let’s avoid this: “We’re heading right at the ground, sir!” “Excellent, all engines full power!”
RackN is refining our “start to scale” message, and it’s also our one-year anniversary, so it’s a natural time for reflection. While it’s been a year since our founders made RackN a full-time obsession, the team has been working together for over 5 years now with the same vision: improve datacenter operations at scale.
As a backdrop, IT-Ops is under tremendous pressure to increase agility and reduce spending. Even worse, there’s a growing pipeline of container-driven change that we are still learning how to operate.
Over the year, we learned that:
- no one has time to improve ops
- everyone thinks their uniqueness is unique
- most sites have much more in common than is different
- the differences between sites are small
- small differences really do break automation
- once it breaks, it’s much harder to fix
- everyone plans to simplify once they stop changing everything
- the pace of change is accelerating
- rinse and repeat, starting over at lesson #1
Where does that leave us, besides stressed out? Ops is not keeping up. The solution is not to go faster: we have to improve first and then accelerate.
What makes general purpose datacenter automation so difficult? The obvious answer, variation, does not sufficiently explain the problem. What we have been learning is that the real challenge is ordering of interdependencies. This is especially true on physical systems where you have to really grok* networking.
The problem would be smaller if we were trying to build something for a bespoke site; however, I see ops snowflaking as one of the most significant barriers for new technologies. At RackN, we are determined to make physical ops repeatable and portable across sites.
What does that heterogeneous-first automation look like? First, we’ve learned that we have to adapt to customer datacenters. That means using the DNS, DHCP, and other services that you already have in place. It means dealing with heterogeneous hardware types and a mix of devops tools. It also means coping with arbitrary layer 2 and layer 3 networking topologies.
This was hard and tested both our patience and our architecture patterns. It would have been much easier to enforce a strict hardware guideline, but we knew that was not practical at scale. Instead, we “declared defeat” on forcing uniformity and built software that accepts variation.
So what did we do with a year? We spent a lot of time listening and learning what “real operations” need. Then we had to create software that accommodates variation without breaking downstream automation. Now we’ve made it small enough to run on a desktop or in a cloud for sandboxing, and a new learning cycle begins.
We’d love to have you try it out: rebar.digital.
* Grok is the correct word here. Merely thinking that you “understand networking” is often more dangerous when it comes to automation.