Shane Hastie: Good day folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today I’m sitting down across the miles with Jessica Kerr. Jessica, welcome. Thanks for taking the time to talk to us today.
Jessica Kerr: Thank you, Shane. It’s great to talk with you.
Shane Hastie: If only we could be in the same room.
Jessica Kerr: Right. Exactly. It’s fine. It’s fine. We have removed a lot of disincentives to connect with people around the world.
Shane Hastie: We have indeed. The silver lining is this opportunity to at least be in the same virtual space with more people more frequently.
Jessica Kerr: Yes.
Shane Hastie: So a lot of our audience are probably aware of your content on InfoQ. You’re a frequent contributor to QCon conferences, and there’s a lot of stuff that we’ve got that you have both written and contributed to and spoken about, but there are some that probably don’t know. So give us the one minute overview. Who’s Jessica?
Jessica Kerr: Okay. A lot of that content on InfoQ is about various languages. So I've spoken about Scala, Clojure, Ruby, Elm, TypeScript, Java. There's more. I almost feel like I do a survey of the industry sometimes. But more recently I've really gotten into wider systems. So over the last few years, I've keynoted several conferences with my best talk. This one's really good. If you watch nothing else of mine, this one matters, because it's about symmathesy: learning systems made of learning parts. Once I found this word, I really see that in our software teams in particular, and more in software teams than in other symmathesies. Every forest is a symmathesy. Every ecosystem is a system that as a whole grows and changes because all of its parts are growing and changing. And every team is a symmathesy, because its people are constantly learning, and that changes our interactions and changes how the team works. But software even more so, because the code is learning because we teach it, and we can learn from our code.
The importance of accelerating learning in software development [02:06]
Jessica Kerr: I mean, we already do in a lot of ways, from tests and logs and databases. Much more so if you have good observability; that's all about learning from your code. But what matters is that we are learning systems. And in particular, with software, we get to really accelerate this learning and accelerate its impact on the world, because software changes the world. It changes what happens when I push a button on the screen. It changes physics in that sense. It changes the world that many people live in, so we have this huge impact in these learning systems, and we have not figured out how to do this well. So over the last couple of years, I've done systems thinking workshops with Kent Beck. I've done domain-driven design with Eric Evans, because domain-driven design has a lot to do with language: the learning about the business that we infuse into the software is so important.
The value of observability in software as it enables rapid learning [03:02]
Jessica Kerr: And now I work at Honeycomb, which is the OG of observability, the state of the art of observability, I think. And that's also about learning from our systems. So yeah, there's a lot for us to know. The beauty of this field is that we haven't figured it out. And we get to experiment, because software teams move so much faster, have an impact so much faster, and can learn so much faster than in, well, any other industry that I know of, which isn't saying much. I'm sure there are others that I'm just not aware of.
Shane Hastie: So observability, this is the culture and methods space. This is the Engineering Culture Podcast. Observability into culture, how do we get that right? And how do we use that?
Observability into culture – don’t measure the easy things [03:47]
Jessica Kerr: So first of all, observability is being able to see inside a system, to see what it's doing and why. To set up the analogy, in software that means getting every service to emit traces, to say, "Hey, I got this request, and I got this request because this other service sent it to me." Then we can weave those together and get a distributed trace, which is nice. But it's about the software being able to say, "I made this decision because of this attribute and that attribute and blah, blah, blah." It is data emitted purely for us, as operators, to get a better understanding of the software that we're coexisting with, co-evolving with even.
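To make the tracing idea concrete, here is a minimal, stdlib-only sketch of the kind of structured event a service might emit. This is not the real OpenTelemetry or Honeycomb API; every field name here is an illustrative assumption:

```python
import json
import uuid
from datetime import datetime, timezone

def emit_span(name, parent_id=None, trace_id=None, **attributes):
    """Emit one structured event describing a unit of work.

    trace_id ties spans from different services into one request;
    parent_id records which span caused this one. The attribute
    names are invented for illustration, not a real schema.
    """
    span = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **attributes,
    }
    print(json.dumps(span))  # in practice: send to a tracing backend
    return span

# A service records not just that it did something, but why:
root = emit_span("checkout", cart_items=3, user_tier="free")
emit_span("charge-card", parent_id=root["span_id"],
          trace_id=root["trace_id"], retried=False, amount_cents=4200)
```

Weaving the events back into a distributed trace is then just grouping by `trace_id` and following `parent_id` links.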
Jessica Kerr: So how do we do that in our teams? How do we get a view of what’s happening? And this overlaps with legibility. How do we make our teams report out such that we can understand what’s happening at scale? And when you’re talking about this with software, it’s easy, not always fast, but you insert some code to do that. In our teams I think there’s a lot that changes based on the questions we ask. As a manager of a team, for instance, or as a lead or as a product person, we’re probably asking, “Is it done yet?” Not literally. We’re probably asking, “How’s it going? What’s getting behind?” But what we really mean is, “When is it going to be done?” What else do we ask about? Do we ask about, “Oh, were there any security considerations with that change you just made?” Do we ask about, “Oh, how’s the error handling for that? How does this impact the user experience? And did the testers find any bits of it frustrating?” If you ask about those things, then you’re going to find out about them.
And these are all emergent system properties that we care about. We care about the software staying up. We care about it not letting hackers do something we didn't want it to do. We care about people who use it not getting frustrated, and having a delightful experience. None of which are, "When is it going to be done?" So it starts with what you ask about. You can also look for clues. Sometimes you can make the software give you those clues, like the big influencers over both stability and speed of delivery: the DORA metrics. So mean time to recover. No, not mean time to recover. Time to recover, because the mean is a BS metric on an asymmetrical distribution. Also change failure rate, deploy frequency, and lead time to production: how long between commit and available in prod? And you can look at even the easy ones to measure. Lead time: how long do your builds take, how long does your code review take, then your next build, and then how long till the deploy? You can measure that. And deploy frequency, you can definitely measure.
If you just look at those, you're getting a clue, especially if you look at the change in those. As we grow, are we getting more deploys, or are deploys getting more scary and people doing fewer of them? Major danger alert. So there are some numbers that you can look at as an indicator, but I really want to discourage people from looking at the numbers that are easy to get, whatever JIRA will spit out for you, because often what's easy to measure is not what's most important.
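As a sketch of what those "easy" measurements might look like, here is some Python over invented deploy records (the timestamps and record shape are assumptions, not any real tool's output); note the median rather than the mean, for the reason given above:

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deploy records: (commit time, live-in-prod time).
deploys = [
    (datetime(2023, 5, 1, 9, 0),  datetime(2023, 5, 1, 11, 30)),
    (datetime(2023, 5, 2, 14, 0), datetime(2023, 5, 2, 14, 45)),
    (datetime(2023, 5, 4, 10, 0), datetime(2023, 5, 4, 16, 0)),
]

def lead_time(deploys):
    """Median commit-to-production time. Median, not mean:
    one slow outlier should not dominate the number."""
    return median(live - commit for commit, live in deploys)

def deploy_frequency(deploys, days):
    """Deploys per day over the observation window."""
    return len(deploys) / days

print(lead_time(deploys))            # 2:30:00 for the data above
print(deploy_frequency(deploys, 7))
```

Watching how these numbers trend over months tells you more than any single reading, which is exactly the "look at the change in those" point.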
You can count the number of tickets a team has completed in a week. But what does that say about the… We usually say code quality. What I really think we mean is the malleability of the code that they’re writing. Are we going to be able to change it in the future? What does it say about security? Have we updated our libraries lately? There’s a lot more to it. Is our team becoming more or less able to work together? Are the new people on the team getting their skill level up closer to the experienced people? Are they able to kind of even out the work? Or are the experienced people changing the code so fast that the new people are just floundering? There is no sufficiently smart engineer to get integrated into a code base that’s changing under them without a lot of help, a lot of pairing usually.
Things you can’t measure but you can notice [07:46]
Jessica Kerr: So some of these things you can't measure, but you can notice. Notice conversations. Notice who is being super helpful answering questions in Slack. Notice who is writing the documentation. Who is doing the glue work of answering questions from customer support, maintaining relationships with other teams, and digging into, "Hey, there's this field we need to add to the API"? Who is skipping ahead, adding the field, just making a guess at the value, and calling it closed? Who is deeply investigating what this data means, what security validations it needs, where it comes from, whether it's safe to store, whether it's safe to log, and really gaining business domain knowledge? There's a lot. And most of it you can't measure, but you can ask about it, and you can notice these things if you try. So a lot of observability is about consciously deciding what's important and opening your eyes and ears for it.
Shane Hastie: The stereotypical technical lead hasn’t been trained in observing culture in teams.
Jessica Kerr: Oh, that’s so true.
Shane Hastie: How do we help them?
The need for technical leaders to build empathy [08:54]
Jessica Kerr: Well, first, do they want to be helped? Because if they don't want to be helped, we can't help them. How do you help a technical lead acquire this kind of empathy? I don't know. That's something I would look to others for. I think Sarah Mei and other people are working on that question. Okay, the only thing from my perspective that I can contribute is that as you get to be a technical lead… Or no. The reason you get to be a technical lead is that you're thinking about the system more widely. As a junior dev, you solve the puzzles that are presented to you by more senior members of the team, who are doing that work of investigating what this field is that I'm asking you to add to the API, what requirements are around it, and what we should watch out for. Then as you get to be a mid-level dev, you understand the whole piece of software that you're working on, or at least know where to go to get the information, and you start to have familiarity with adjacent systems, with your own interfaces.
And then as a team lead, you should be thinking about at least all of those adjacent systems, and the ones that might be adjacent in the future, and caring about the impact that our changes have on software that we talk to and teams that we talk to. So that is a widening. The trick is that when you widen your view of the system, you need to include the people, because that software doesn't get that way by itself and it doesn't stay that way by itself. Oh, Charity Majors has a great one: an individual developer can write software, but not deliver it. The unit of delivery is a team. And I think that's really important, because as a developer, I don't want to just deliver features; that's not, in fact, useful to the world. My objective as a software team is to provide valued capabilities to customers. That involves coding the software to provide those capabilities. It also involves that software being up, that software continuing to be malleable and secure, and a lot of different things around delivering and operating that software, not just writing code.
Shane Hastie: Changing direction, what are the limitations of business aligned product teams?
The risks of business aligned product teams [11:05]
Jessica Kerr: So business-aligned product teams are all the rage right now. People want product centricity. Project-to-product is the next agile, which is great, because software's not a project. It's not "deliver this feature." It is a product. It is an ongoing providing of a capability. But then where do you set the team's responsibility? This can go everywhere from, "We will tell you what capabilities you provide, and then we will ask you to provide more as time goes on," to, "You own the product, and your job is to provide business value, real money to the business, with your product, and you have complete autonomy over that." The word autonomy implies responsibility for everything else in the system, which really takes a human to perceive. And when you go to the extreme of that, when each team is responsible for providing business value, how do you account for the value that one team adds to another?
So if you have, like… I don't have a real example off the top of my head, but say it's a travel site, and one team is in charge of selling flights, another team is in charge of selling hotels, and another team is in charge of selling rental cars, and you want each of these to be profitable. Okay, that makes sense, but I could do this in a couple of different ways. I could make the hotel part of the page so obnoxious that you focus on that and ignore the flights, hypothetically. Or I could make the hotels part of the page direct you to flights, or I could make the flights part say, "Oh, here are hotels that are available for that date range." We can make these things work together or conflict with each other if each team has a number that it's responsible for. That number could be money. It could also be increased engagement, more clicks or something.
Then there's nothing to stop them from competing. How do we measure the systemic effects of your team? And also, how do we increase your ability to provide capabilities by having a self-service platform? Platform teams are definitely one of the key teams in Team Topologies. I love the book Team Topologies. But how do you justify that when each team is supposed to make a fixed amount of money? I think we are not good at measuring systemic contributions. I don't have an answer for you on how to do that. We can notice them. We can notice systemic contributions, but if we're data driven, then we're going to reward the teams that are hogging the page space or the load time or whatever. This is why I don't like data-driven work. I like data-informed decisions.
Shane Hastie: So let's pick that one apart if we can. Data driven is very fashionable, and it's very easy. How do we get from data to knowledge?
Make data informed decisions, don’t be data driven [13:59]
Jessica Kerr: Right. From data to useful information, which we can then use to decide on useful action. Yeah, when I hear "data driven," I just think, "It's not my fault. Blame the data. It wasn't me." But data informed means we turn that data into knowledge, and then we put it in context. Because when you look at a number like clicks, or "How much time did people spend with their mouse over this part of the page?", indicating some level of attention, we can get focused on the number. But the thing is, that has taken the data out of context. This is a property of all metrics. Everything that is legibility, everything that we can add up and sum and aggregate and divide and blah, blah, blah: it's all out of context. So you don't know whether my mouse was remaining over this page because I was reading it and pointing to it, or because actually I just got distracted and went somewhere else.
So when we look at that metric as information… First of all, if you ask a team to focus on it, then they'll naturally game it. You've asked them to. And then it becomes not-information. But if we haven't done that, and we have this observation, "People have their mouse over the hotel portion of the page more than the flight portion," we can ask why. We could do a little user research, and maybe it's because the hotel part is more confusing. Maybe it would help to combine that information with how many people are smoothly making it through reserving a hotel. How many people give up on the page while they're looking at the hotel part and leave, versus engaging further? Funnels and drop-offs are all attempts to go a little deeper. And then at the extreme, you can use Intercom to record everything that a user does on the screen and try to get ultra-fine details. Maybe you want to sample that a little bit, but also ask people.
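A funnel report like the one she describes can be sketched in a few lines; the step names and counts below are invented for illustration:

```python
# Hypothetical funnel: how many users reach each step of booking a hotel.
funnel = [
    ("viewed hotel list", 1000),
    ("opened a hotel page", 620),
    ("started reservation", 240),
    ("completed reservation", 180),
]

def drop_offs(funnel):
    """Fraction of users lost between each consecutive pair of steps."""
    losses = []
    for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
        losses.append((f"{prev_name} -> {name}", 1 - n / prev_n))
    return losses

for step, loss in drop_offs(funnel):
    print(f"{step}: {loss:.0%} drop off")
```

The biggest per-step loss tells you where to point the user research; the numbers alone still do not tell you why people leave.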
Customer support is really good for this, if you can just ask them what users struggle with. Yeah, so a little bit of context can go a long way in turning data into actionable information. Maybe the action we want to take is actually to reduce the time people spend on the hotel portion by making it clearer, to increase their real engagement of actually reserving a hotel, rather than their time spent in difficulty. And the problem with this is, it is absolutely different in every case. I mean, you can learn heuristics, like: when I see a number, I always ask why and see if people know. Always have more than one number. For instance, if you have OKRs, always have multiple key results per objective. It keeps us from narrowing in on the beauty of a number, the clarity of value that we get from that pristine, precise definition of good, which is also garbage, because it's ignoring everything else: all the emergent properties, like that we actually want people to enjoy being on our site or whatever.
Shane Hastie: Emergent properties.
Many qualities of a software system are emergent properties [17:06]
Jessica Kerr: Right. Emergent properties are properties of a system that are not isolated to one part. They exist as the result of interactions between the system of all the parts together. For instance, availability is an emergent property of software. It means not going down. It means no part of the system is crashing and taking everything else out. And all parts of the system are dealing with the errors that do happen. Security is an emergent property. It means we’re not doing anything we’re not supposed to do. Really tough one. User experience is highly emergent because it’s about the consistency, the expectations that you set up for people, and then how you fulfill those. Super dependent on all the different parts working smoothly together.
I like a lot of decoupling in the back end. We really want to decouple our code. But in the front end, there’s a problem because every part of the front end is coupled at the user, at the person who’s looking at both of those parts. So yes, we really do need that UI to be consistent even though I would love for the teams to be able to change at different rates. Very tricky.
Yeah, so these emergent properties are what make our software valuable. They allow it to provide capabilities to our internal or external customers. But we can't measure them directly. We can only get little clues. You can measure uptime and call that availability. But really, if your learning platform is down at midnight in the time zone of the university that is using it, it might impact some students who are trying to hit a last-minute deadline. But if it's down at 10:00 AM when professors are trying to give tests, that's a much bigger thing. That's a different level of availability. So it's better to measure events, and which ones were good and which ones were bad, than to measure uptime. But it's still just a clue. All of these numbers are clues. And if we treat them like that, they're information. If we treat them as a goal, there is some use to that, and there is danger. There's danger in spoiling the information. And more importantly, there's danger in trampling over these emergent properties.
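Her midnight-versus-10-AM example can be sketched as a comparison between wall-clock uptime and request-weighted availability; all the traffic numbers here are invented:

```python
def uptime(minutes_down, minutes_total=1440):
    """Wall-clock availability over a day: blind to who was affected."""
    return 1 - minutes_down / minutes_total

def request_availability(events):
    """Share of actual requests (good/bad events) that succeeded."""
    return sum(ok for ok in events) / len(events)

# Thirty minutes down is "97.9% uptime" no matter when it happens:
print(f"{uptime(30):.1%}")

# But the same 30 minutes fails very different numbers of real requests.
# Invented traffic: 5 requests at midnight, 500 during the 10 AM test rush,
# about 1000 other successful requests across the rest of the day.
quiet_outage = [False] * 5 + [True] * 1000    # midnight outage
busy_outage  = [False] * 500 + [True] * 505   # 10 AM outage, same duration
print(f"{request_availability(quiet_outage):.1%}")  # ~99.5%
print(f"{request_availability(busy_outage):.1%}")   # ~50.2%
```

Counting good and bad events weights the outage by the people it actually hurt, which is the point of measuring events rather than uptime, though even this number is still only a clue.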
Shane Hastie: Shifting away from the software, culture is an emergent property of the teams and the organization.
Culture is an emergent property of teams and organisations and you can only shift it slowly [19:26]
Jessica Kerr: Yes. Culture is the sum of everything we actually do and actually say. I struggle to think of culture as a property. I feel like we are just putting a label on something that is many things. But yeah, that is the word that we use to describe the overall feeling of a place, what is acceptable there.
Shane Hastie: So if I want to change some of these elements, the big ones at the moment, diversity and inclusion, consciousness of the impact that we are having on society as a whole…
Jessica Kerr: The part where, "Do we value security in this organization?", for instance. It's rarely stated in your quarterly goals, but some managers ask about it, and some teams always take it into consideration. That's totally a culture thing. My theory about culture is that you don't change it. You do shift it. Culture is constantly changing. It's changing itself. Can you shift the direction of that change? And if you think about it that way, that you're trying to shift the direction of the change that's always happening, then you recognize that it has to be slow. You can't get diversity and inclusion. You can only slowly shift the trend toward more or less of it. Toward, "Do we think about diversity and inclusion when we have a meeting, when we talk over each other? Do we think about it when we're…" Hiring is the obvious one, because hiring is one way to shift the culture, but not really, because everyone you hire will immediately be absorbed into the much wider system.
Shifting toward caring about security could be: do you ask about it? And how high up the organization do those questions go? What abilities do you give teams to know whether their software is secure? Do you give them the business knowledge that they need to validate the data properly? Do they have quick deploys and permission to upgrade libraries as soon as a new version comes out? There are a lot of things you can do to remove obstacles to the culture you want to have, and to shift the inherent cultural change in that direction. And it's always going to be slow, and you're always going to have to keep shifting it for years and years and years. It's not something you're going to accomplish in a quarter. You're not going to get anywhere in a quarter, because if you shift the direction a tiny bit for a quarter and then you stop, it goes right back.
Shane Hastie: Culture is an elastic band.
Jessica Kerr: Yeah, it’s boingy.
Shane Hastie: Jessica, thanks so much for taking the time to talk to us today. If people want to continue the conversation, where do they find you?
Jessica Kerr: I’m jessitron on Twitter, J-E-S-S-I-T-R-O-N. Also, jessitron.com. If you really want to chat with me, I have a Calendly and I have open office hours at honeycomb.io/office-hours.
Shane Hastie: Thank you so much.
Jessica Kerr: Thanks, Shane.