We built tools to let us (and anyone) run two code paths side-by-side.
Instrumentation then published the performance data into our Graphite cluster.
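As a rough illustration, here is a minimal sketch of that pattern using the open-sourced Scientist gem (which grew out of this tooling). The repository and permission method names and the `$stats` client are assumptions for the example, not GitHub's actual code.

```ruby
require "scientist"

# Run the legacy permission check (control) and the new abilities check
# (candidate) side-by-side. Both run; callers always get the control's answer.
def pull_allowed?(repo, user)
  experiment = Scientist::Default.new "repo-pull-access"
  experiment.use { repo.old_permissions_allow?(user) }   # old code path
  experiment.try { repo.abilities_allow?(user, :read) }  # new code path
  experiment.run
end

# Publishing results is a hook you override; here durations and mismatches
# go to an assumed statsd-style client ($stats) feeding Graphite.
class PermissionsExperiment < Scientist::Default
  def publish(result)
    $stats.timing "science.#{name}.control",   result.control.duration
    $stats.timing "science.#{name}.candidate", result.candidates.first.duration
    $stats.increment "science.#{name}.mismatch" unless result.matched?
  end
end
```

Because the control's value is always returned, the new path can be exercised against real production traffic without changing behavior.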
Don't Roadblock
Long-running branches are kryptonite.
We could land short-lived branches to "dark ship" new code.
With science and instrumentation tools we could gather real production measurements in a controlled fashion.
How to run two permission systems at the same time
One system is the source of all truth.
Have a migrator that creates the second system's data from the first.
The data for the second system is updated by actions on the first system.
Start with just a subset of users.
Trash the data and re-run the migrator.
Eventually, never re-run the migrator at all (see the sketch after this list).
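A hypothetical sketch of that backfill-plus-dual-write shape; the `Collaborator` and `Ability` models and their columns are illustrative, not GitHub's actual schema.

```ruby
require "active_record"

class AbilityMigrator
  # Rebuild the new system's rows for one repository from the legacy rows.
  # It trashes the derived data first, so it is always safe to re-run.
  def self.backfill!(repo)
    Ability.where(subject_type: "Repository", subject_id: repo.id).delete_all

    repo.collaborators.find_each do |collab|
      Ability.create!(actor_id: collab.user_id, subject_type: "Repository",
                      subject_id: repo.id, action: "read")
    end
  end
end

# Dual writes: actions on the legacy system also keep the new system current,
# so eventually the backfill never has to be re-run.
class Collaborator < ActiveRecord::Base
  belongs_to :repo
  belongs_to :user

  after_commit(on: :create)  { AbilityMigrator.backfill!(repo) }
  after_commit(on: :destroy) { AbilityMigrator.backfill!(repo) }
end
```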
Sometimes it's just hard to figure out if the code you are changing is even being used. Try to be 100% safe.
Just a glorified wrapper around Ruby's `caller` method
... which can be put to holy or horrible uses
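A hedged sketch of what such a wrapper can look like; the tracker module, the `$stats` client, and the instrumented method are all made up for illustration.

```ruby
# Record who still calls a code path you suspect is dead, before deleting it.
module CallsiteTracker
  def self.record(label)
    # caller(2, 1) skips this method and the instrumented method, leaving the
    # frame that actually called it (e.g. "app/models/repository.rb:42").
    site = Array(caller(2, 1)).first.to_s.split(":in").first
    $stats.increment "callsite.#{label}"                 # assumed statsd-style client
    Rails.logger.info "callsite #{label} hit from #{site}"
  end
end

class Repository
  def legacy_pull_permission?(user)
    CallsiteTracker.record("repository.legacy_pull_permission")
    # ... existing (possibly dead) implementation ...
  end
end
```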
The project will ebb and flow
Always have enough side project work available that you can stay fresh and not get bogged down.
People will come and go (and often return) on a long project.
You will be at an impasse many times.
Rely on your team, and allow yourself to experiment with crazy things.
Your creativity is an asset.
Keeping a long project running
Keep your sense of humor.
Rely on your peoples.
Keep close.
Work-life balance is important.
Treat it as a marathon.
Take a vacation.
You actually can leave and go do something else.
It's fine.
Setbacks
"Repository Networks" were all jacked up at the model level
"plan owner" data was all shitty due to old bad job processes
Forking and collaborators were all sorts of stupid
Setbacks
You re-learn that everything is connected
You end up having to fix way more things than you hoped
And you make some trade-offs and draw boundaries
Those of you familiar with DDD should be hearing bells ringing right now:
Bounded contexts!!!
Also, to reiterate an earlier point:
Your test suite is insufficient, even at 100% coverage
because your test suite can't account for your production data
and your whole history of bugs and bugfixes
Setbacks
So we wrote data quality scripts to find problems
We wrote transition scripts to clean up problems
We wrote throttling tools so that massive transitions never hurt production (sketched below)
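An illustrative sketch of the throttling idea; the batch size, the `abilities_backfilled` flag, and the replication-lag query are assumptions about how such a tool could look, not GitHub's actual implementation.

```ruby
class ThrottledTransition
  BATCH_SIZE = 500

  def self.run!
    Repository.where(abilities_backfilled: false)
              .find_in_batches(batch_size: BATCH_SIZE) do |batch|
      batch.each { |repo| AbilityMigrator.backfill!(repo) }
      Repository.where(id: batch.map(&:id)).update_all(abilities_backfilled: true)
      throttle!
    end
  end

  # Pause between batches, and back off hard whenever replicas fall behind,
  # so a massive transition never hurts production traffic.
  def self.throttle!
    sleep 0.1
    sleep 5 while replication_lag_seconds > 2
  end

  def self.replication_lag_seconds
    # In real life this would run against a replica connection.
    row = ActiveRecord::Base.connection.select_one("SHOW SLAVE STATUS")
    (row && row["Seconds_Behind_Master"]).to_i
  end
end
```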
It can be difficult or impossible to estimate significant changes to an ongoing system
Keep iterating
Continually communicate and re-negotiate next steps
Work on the most important blocker Right Now
Enterprise
Installed GitHub instances behind company firewalls
Back then, installs were often 6 months behind the github.com code base
Customers could be 300+ days behind the latest Enterprise release
We had almost zero visibility into installed versions and data quality at a customer's site
Some github.com features were enabled on Enterprise and others were disabled
Enterprise
Data transitions for github.com were bundled with migrations for Enterprise upgrades
Database tables and ActiveRecord models persisted in github.com code until Enterprise was fully upgraded
Shipping
Made abilities the "source of truth" for read queries on teams and orgs
Made abilities the "source of truth" for repository read queries
Continued writing data to both permission systems (see the sketch below)
Gradual removal of science experiment code
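A sketch of what flipping reads while keeping dual writes can look like; the flag helper and method names are illustrative, matching the earlier sketches.

```ruby
class Repository < ActiveRecord::Base
  # Reads consult the new abilities data once the flag is on; the legacy path
  # stays in place until the experiment code is finally deleted.
  def pull_allowed?(user)
    if FeatureFlag.enabled?(:abilities_reads)   # hypothetical flag check
      abilities_allow?(user, :read)
    else
      old_permissions_allow?(user)
    end
  end

  # Writes still land in both systems until the old one is retired.
  def add_collaborator(user)
    legacy_add_collaborator(user)
    Ability.create!(actor_id: user.id, subject_type: "Repository",
                    subject_id: id, action: "read")
  end
end
```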
Other GitHub "scientist" ships
Rails 3
Yup. LOL.
GitRPC
Puppet Labs
Landed in Engineering as a "Principal Engineer"
Since all the engineering leads were already, well, leading, I got the opportunity to just roam around for a bit looking at how things work.
### (Also, given that I had a history of being a "rescuer", pretty sure no one really wanted me on their team, hahaha)
So I hung out with the "Integration" team
### Which is a team of heavy hitters charged with cleaning up whatever comes downstream from the rest of engineering and needs to work well in the released product.
Lots of bottlenecks and process failures
A more classic cousin of Conway's Law...
At some point QA, release engineering, and test automation were moved out into a separate division of Engineering
As a result, developers eventually did not own how their software was tested, how it was built, or how it came together for release
Lots of things went "over the wall"
### Feedback loops got long, developers were unempowered, product releases slowed down, frustration and finger-pointing escalated
My hazy understanding: Release engineering eventually got sick of fixing software thrown over the wall, and so the Integration team was born (closer to home, but still "over the fence")
In early 2014 there were technical attempts to fix problems, starting with the CI (test) system, but the efforts petered out with no significant change
### In early 2015 there were "proof of concept" experiments (like we'd done at GitHub with permissioning systems), but they never moved forward
The fundamental problems were organizational
Interpersonal and team dynamics that persist over years
Self-reinforcing patterns of behavior that prevent real change from happening
Siloing of teams with thick walls preventing cross-cutting change
What now?
We moved testing pipeline definition into version control and made it more self-service
This only took a little consensus building, and the stability benefits were obvious
The side effect is that it makes developers more involved in the testing cycle (it's less over-the-wall)
We started a cross-functional group called The Pit Crew, which brings in high-level engineers and test/release folks
It is HIGHLY transparent (more so than any other high-level group in the company)
Its mandate is to find the biggest bottlenecks in the development-to-release cycle and have actual teams prioritize those fixes
For this we had to build high-level consensus that these fixes were necessary to deliver product in the future (because this effort pushes back on product feature work!)
The proof of the commitment is that we already have big bottlenecks being tackled by real people
We are rolling out a long-term (1 year+) revamp of our CI/build/release infrastructure
We need to alleviate fundamental performance and capacity problems
We need to address deep UX problems and make it even more self-service
Building consensus
Working 100% in the open
Gathering feedback, and conducting group therapy
Naming the problems to be solved, and the techniques used to solve them
Avoiding the pitfalls of previous attempts
Building a roadmap, but intentionally leaving the later stages un-detailed