Doordash's Sudeeptha Jothiprakash - Where The Powerhouses Of Containers And Serverless Come In
Liran Haimovitch: Welcome to The Production First Mindset, a podcast where we discuss the world of building code from the lab, all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I’m your host, Liran Haimovitch, CTO and Co-founder of Rookout.
Today’s episode is all about cloud-native applications and modern observability. With us is Sudeeptha Jothiprakash, previously at AWS and a Product Lead at DoorDash. Thank you for joining us, and welcome to the show.
Sudeeptha Jothiprakash: Great to be here. Thank you so much for having me here, Liran.
Liran Haimovitch: Sudeeptha, what can you tell us about yourself?
Sudeeptha Jothiprakash: I’m Sudeeptha Jothiprakash. I’ve spent, as you’ve called out, six and a half years working for AWS, mostly on observability solutions. In my earliest years there, I worked on CloudWatch Logs, which was essentially the logging platform that evolved into log analytics over time. I also spent my time understanding observability requirements from customers and built CloudWatch ServiceLens, which brings together metrics, logs, and traces in a single place so that you can triage and troubleshoot your problems easily.
Today, I’m with DoorDash, and I’m solving some of their consumer issues by offering plans and partnerships to get customers onboarded and trying out DoorDash’s capabilities easily. As you’ve called out, I worked at AWS as a Product Manager, mostly focused on observability tools, for the last six years.
Liran Haimovitch: Pretty cool stuff, a lot of interesting projects you’ve been taking part in. Now, before we dive into observability, I want to take a step back and discuss cloud workloads in general. Today, we’re hearing more and more about cloud-native workloads, and there are two big powerhouses over there: you have containers and serverless. And I think recently, we’re even seeing that some of the boundaries between them are getting blurrier, and they’re becoming even closer to each other. What’s your take on that?
Sudeeptha Jothiprakash: As you’ve called out, the world has evolved toward being more sustainable in terms of the requirements customers place on their applications. We’re trying to be more effective in the way compute is used, and this is where the two powerhouses of containers and serverless come in.
When it comes to more predictable workloads, containers are the most efficient way for you to serve your customers’ requests. An example would be: let’s say you have constant traffic coming to your web platform, and on average you see a certain amount of traffic; containers are the right way to go about it.
Serverless essentially addresses a very different use case: when you have traffic that’s unpredictable but still highly critical for your business, you can ensure it’s serviced appropriately. An example would be IoT sensors plugged onto, let’s say, a truck, helping you monitor the temperature of goods that are pharmaceutical-based, for example. If there is a drop in temperature, you would like to be notified very quickly to make sure all of the goods stay at the temperature and quality you’re expecting. This is where having the IoT device talk to a Lambda function, pinging it and ensuring the message is transferred appropriately, is the right mechanism to retain the quality you’re hoping for in your application responses. I wanted to help you understand the two types of workloads a little more, simply because they become the fundamentals of the way you think about observability overall.
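The sensor-to-Lambda pattern described here can be sketched as a minimal handler. Everything below is illustrative, not DoorDash’s or AWS’s actual code: the safe range, the event shape, and the `notify` hook are all assumptions, and a real setup would wire the alert to SNS or a paging service.

```python
# Minimal sketch of a Lambda-style handler reacting to an IoT temperature
# reading. The threshold, event fields, and notify() are hypothetical.
SAFE_RANGE_CELSIUS = (2.0, 8.0)  # assumed safe range for pharmaceutical goods


def notify(message: str) -> None:
    # Placeholder for a real alerting integration (SNS, PagerDuty, ...).
    print(message)


def handler(event: dict, context=None) -> dict:
    """Entry point invoked once per sensor reading."""
    temp = float(event["temperature_c"])
    low, high = SAFE_RANGE_CELSIUS
    if not (low <= temp <= high):
        notify(f"Truck {event['truck_id']}: temperature {temp}C out of range")
        return {"status": "alerted", "temperature_c": temp}
    return {"status": "ok", "temperature_c": temp}
```

The function-per-event shape is what makes serverless a fit here: the handler only runs (and only costs anything) when a reading actually arrives.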
Liran Haimovitch: How do you think about observability when it comes to those kinds of workloads, workloads that scale up or down with demand? How is observability different?
Sudeeptha Jothiprakash: There are a lot of workloads and applications today that essentially have very unpredictable request patterns. These unpredictable patterns have to be met by an efficient way of capturing requests and responding to them. What containers help you do is scale up and scale down very quickly and be efficient about it, not just in terms of your responses, but also in terms of the cost-effectiveness of your application itself.
As ephemeral resources come and go, you want to be able to capture the data from them in the most efficient manner without any loss. The worst thing that could happen, for example, is that you have some really critical responses you need to send out to your customer and you haven’t scaled your container environment appropriately. When you’re trying to triage why your customers didn’t get those responses, not having the right logs and metrics puts you in a precarious situation. Ensuring that ephemeral resources are appropriately provisioned, and also instrumented so you get the right telemetry from them, is very critical when you’re thinking about observability.
Liran Haimovitch: How do you figure out what’s the correct way to instrument and what’s the data you actually need from those workloads, from those containers?
Sudeeptha Jothiprakash: This is where we recommend that you have an observability plan to begin with. Think of your application as having some parts that are critical workloads and some that essentially run in the background and don’t require that level of granularity in the data you’d like to capture. For your critical workloads, first ensure they’re scaled appropriately and have the right mechanisms for auto-scaling when traffic increases.
But you also have to be very mindful when you’re setting up the application about what types of data or insights you’d like to capture from it. You need to decide whether you want to enable transactions, whether you want to enable logging, and which critical metrics you’d like to capture. All of this should be configured well in advance, before you actually deploy the application, rather than trying to figure it out after you reach production and discovering that you’re missing some areas of visibility.
This happens more often than you’d think, as you’ve probably heard from others as well. We recommend being mindful of this when you’re getting started with application development itself. I mention this because you don’t want to overload yourself with data either. If you set up metrics, logs, and traces across all of your environments, there’s a real chance of information overload, with multiple signals telling you about the same issue. Rather, you’d like to aggregate insights from across the critical sources and be alerted appropriately. Did that help?
Liran Haimovitch: Yeah, makes perfect sense. Now, you keep speaking of scaling up workloads, and obviously, with serverless, things scale up on their own. On the other hand, you’ve said you don’t want too much data, you don’t want too many signals. I kind of have to wonder, as the environment scales up, as you’re running tens or even hundreds or thousands of containers at the same time, what happens to the observability part of the system? How does it affect cost or performance or signal-to-noise ratio or other elements?
Sudeeptha Jothiprakash: Great question there. When an application scales up very quickly, you ideally want to be looking at the same set of signals you did when it wasn’t scaled. Your business telemetry and the critical observability telemetry you’d like to capture should stay in a 1:1 ratio. What should happen in the backend is that the telemetry you’re collecting aggregates across the scaling that you do.
An example is, let’s say you have a workload environment of 10,000 containers, and that scales up to 100,000 because it’s a peak; a Black Friday sale, for example. When that scaling happens, you want to ensure the 100,000 containers are still sending you the same set of key telemetry through aggregation. This aggregation can happen not just from your logs, but also from signals in your metrics, because ultimately you’re trying to ensure that you’re alerted appropriately on the issues that matter to you most.
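The aggregation idea above can be sketched in a few lines: whether 10,000 or 100,000 containers report in, the alerting layer sees the same small set of rolled-up signals. The sample shape and metric names here are illustrative, not any particular agent’s format.

```python
# Roll per-container metric samples up into one signal per metric name,
# so the number of alertable signals stays constant as the fleet scales.
from collections import defaultdict


def aggregate(samples):
    """Aggregate {'container', 'metric', 'value'} samples into sum/avg/count."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sample in samples:
        sums[sample["metric"]] += sample["value"]
        counts[sample["metric"]] += 1
    return {name: {"sum": sums[name],
                   "avg": sums[name] / counts[name],
                   "count": counts[name]}
            for name in sums}


# The output shape does not change with fleet size; only the counts do.
fleet = [{"container": f"c{i}", "metric": "error_rate", "value": 0.01}
         for i in range(100_000)]
rollup = aggregate(fleet)
```

You alert on `rollup["error_rate"]["avg"]`, one signal, rather than on 100,000 individual streams.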
I also see a world in which having the data to dive deeper is very critical. Once you have that aggregation, it’s great as a first line of response, ensuring you know what the issue is. But when you dive deeper, you should be able to pinpoint, as a needle in the haystack among your 100,000 containers, that this particular container ended up failing this particular process, and that’s why your customer got a 5xx, for example. You should have that level of granularity when you need it most, when you have customer issues.
Liran Haimovitch: How do you go about having that level of granularity and that level of aggregation without getting your costs spiraling out of control?
Sudeeptha Jothiprakash: Speaking from experience, there are two key pillars in my mind. The first recommendation is to log everything. You want to make sure you’re logging very robustly so that you understand when things go wrong, and you can keep it in low-cost storage if you’d like, also for compliance and privacy purposes. Ensuring that you have logging enabled across at least your critical workloads is super important. From these logs, you can then create metrics that are aggregations, and those aggregations will help you manage the signal-to-noise ratio.
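The logs-to-metrics step can be sketched as a toy filter: scan log lines for a pattern and emit an aggregate count, so you alert on one number instead of raw lines. (CloudWatch offers this natively as metric filters; this standalone sketch, with made-up log lines, just illustrates the idea.)

```python
# Derive an aggregate metric from raw log lines: count the lines that
# match a pattern instead of alerting on each line individually.
def logs_to_metric(lines, pattern="ERROR"):
    """Return the number of log lines containing the given pattern."""
    return sum(1 for line in lines if pattern in line)


logs = [
    "INFO request served in 12ms",
    "ERROR upstream timeout",
    "INFO request served in 9ms",
    "ERROR upstream timeout",
]
error_count = logs_to_metric(logs)  # alert on this aggregate, not each line
```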
From there, the second pillar is transaction monitoring. While you can have logs across various parts of your application and workloads, what transactions help you do is understand how a request cuts through all of your workloads. Being able to tie a metric back to your transaction is very important when you’re troubleshooting, let’s say, a request coming from your customer. Ensuring that you have these fundamentals, logs, the metrics derived from them, and transactions, set up for your critical workloads will enable you to be more responsive and reduce your mean time to resolution, as we call it, as part of your root cause analysis.
Is that sort of where you were going with this, Liran? I’m trying to make sure we’re thinking of the right ways in which we capture data.
Liran Haimovitch: I would love to hear your insights about the cost aspects of observability, because personally, I’ve been hearing a lot of customers, Rookout customers and partners, sharing their pains of overpaying for observability. Keeping all of that data, especially logs, but also metrics, and definitely tracing, which by itself is incredibly expensive, adds up. As a system scales up, all of a sudden “log everything” can become extremely expensive. At the same time, there’s a huge fear of missing out on exactly those logs that would end up being important, that might save the day.
Sudeeptha Jothiprakash: I’m totally with you on that, Liran. I wish there were a world in which cost wasn’t the issue that created all of these nuances in application management. But I do understand that we need to be pragmatic about how we scale up and how we ensure we’re catching the right telemetry. The way I personally look at it, all the critical workloads can have multiple tiers of logging. When I say multiple tiers of logging, I mean some logs that are simply captured and some that can actually be analyzed using your log analytics solution. That tier can cover the first two or three days, because of the issues you see and triage, 90% actually fall into that first three-day window. So you want to make sure that data is readily available when you’re root-causing an issue.
After that, you essentially send the logs into some sort of storage. This can mean keeping a backup in your S3 bucket, which is also required for compliance reasons. Many of our previous customers would talk about how compliance might ask them whether they have retained their logs for ten years or some such time [00:12:46]. These are real-world requirements where logging costs on analytics platforms can add up very quickly. This is where keeping the data in low-cost storage makes sense. Compressing your logs is also a great way to ensure the costs don’t add up when you send them to storage.
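The compression point can be sketched with the standard library: gzip a batch of log lines before shipping it to archival storage. The exact savings depend on the data, but repetitive log text typically shrinks dramatically; the sample line below is made up.

```python
# Compress a batch of log lines before archiving to low-cost storage.
import gzip


def compress_batch(lines):
    """Join log lines and gzip them; return (raw_bytes, compressed_bytes)."""
    raw = "\n".join(lines).encode("utf-8")
    return raw, gzip.compress(raw)


batch = ["2024-01-01T00:00:00Z INFO request served path=/orders status=200"] * 1000
raw, packed = compress_batch(batch)
savings = 1 - len(packed) / len(raw)  # fraction of storage avoided
```

In practice the compressed batches would be uploaded to something like S3 with a lifecycle policy moving them to colder tiers over time.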
I’m speaking more about logging here just because I know it’s been one of the hot topics in observability. But as you call out, there’s also transaction monitoring. Transaction monitoring, or tracing, can quickly add up in cost when you have larger traces that cut across multiple parts of your application. But more often than not, those are the most critical ones, because the longest-running transactions are the ones that end up creating the most impact on your application. The way we’ve seen best practices evolve in this space is to first map out all of your critical paths and ensure those critical paths are traced at a 100% sample rate. For the non-critical paths, you can look at reducing your sample rate to even 5%, for example.
That keeps enough signals coming in that you know these paths are key, that they’re accessed by your customers, and what rate requests are coming in at, and so on, but it doesn’t really add up in terms of cost, as you call out. When it comes to the most critical paths across your application, though, I’ve seen plenty of partners and customers not setting those at 100% for transaction monitoring, and then a couple of high-profile customer issues come in and they don’t have the traceability just yet.
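The sampling policy described, 100% on critical paths and a reduced rate everywhere else, can be sketched as a tiny decision function. The path names and the 5% rate are illustrative; real tracing SDKs usually accept a sampler callback with this kind of logic.

```python
# Path-based trace sampling: always trace critical paths, sample the rest.
import random

CRITICAL_PATHS = {"/checkout", "/payment"}  # hypothetical critical routes
NON_CRITICAL_RATE = 0.05                    # 5% sampling elsewhere


def should_sample(path: str, rng=random.random) -> bool:
    """Decide whether to record a trace for this request path."""
    if path in CRITICAL_PATHS:
        return True  # 100% sample rate on the paths that matter most
    return rng() < NON_CRITICAL_RATE
```

The `rng` parameter is injected only so the decision is testable; production code would use the default.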
The final thing I want to mention is that we need a good relationship between your logging and your transactions. While you might have perfect coverage of all of your transaction paths, making sure they connect back to your logs is very critical for gathering the evidence to narrow down to the issue that matters to you.
Being able to identify an issue that customer ID XYZ reported, and to confirm the same thing from the logs, is very important. As I mentioned earlier, make sure your logging is set up for the first three days and your critical paths are set up for transactions; they need to be mapped 1:1 for all of your critical workloads. If not, you can end up in a situation where you have the transactions but not the logs, or the other way around.
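The logs-to-transactions link can be sketched with the standard `logging` module: attach the current trace ID to every log record, so a trace and its logs can be joined later. The trace-ID source here is a plain variable for illustration; real tracing libraries expose it through a context API.

```python
# Stamp every log record with the active trace ID so logs and traces
# can be correlated during root cause analysis.
import logging

current_trace_id = "trace-abc123"  # would come from the tracing context


class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = current_trace_id  # enrich the record in place
        return True  # never drop records; this filter only annotates


logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)
logger.info("order placed")  # emitted with the trace ID attached
```

With the ID on every line, finding all logs for a slow transaction becomes a single filter query in the log analytics tool.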
Liran Haimovitch: Makes sense. Earlier in our conversation, you mentioned the importance of not just relying on engineering metrics, but also bringing business metrics into the observability space and keeping track of the business in your application. That’s a topic I’m very passionate about and I would love to hear your take on it.
Sudeeptha Jothiprakash: This is actually one of the areas I feel we do the least about in observability. Each application and the business outcomes it drives are so bespoke and so unique that there’s no one template that fits all. What often happens is that the developers may not know which business signals to capture as part of application performance, and that leads to asymmetry in the data you’re able to capture and analyze. I’m 100% with you; this is an area that observability as a field is trying to solve, but it’s often harder to tackle than thinking about ephemeral or serverless resources.
Going back to the point about having the right observability plan as you build the application, this is where, for example, product managers come in and say, “Here are the levels of granularity I need in order to understand how the business is impacted by this particular application.” It’s very important to define that early on, while you’re building the product specs. If you have that ingrained, and you know which events to capture and what types of signals matter to you, then when issues actually occur, you’ll be able to relate them directly to how they impacted the number of customers coming in, or the number of orders dropped because a certain part of your application didn’t scale well, and so on. Being mindful very early on, as I called out, really sets up the application and the product managers for success.
Liran Haimovitch: Thinking of those plans, I definitely agree observability plans are super important. I wish observability were more agile, but it sometimes isn’t. I think one of the most important elements that comes into play here, as with everything cloud-based, is the division of responsibility. Now, you’ve been at DoorDash, you’ve been at AWS, and you’ve seen this from multiple angles. What’s your view on that? What can product managers and engineers rely on their cloud vendors or infrastructure providers to provide, versus where should they be spending most of their time, focus, and energy?
Sudeeptha Jothiprakash: I’ll answer the last part of your question first. The most important thing for DevOps engineers to be doing is building really cool stuff. Unfortunately, we have so many things going on around that, that more often than not they spend more time managing their applications than building. A lot of that load, as you called out, has to be divided, in my mind, into three categories.
One is what you can get natively from your cloud service providers, or from open-source tooling, to enable you to monitor your applications. Second, as you and I discussed, is the business telemetry that’s most important for the team to figure out through product management conversations, ensuring you understand what business objectives the application is trying to drive. And third is the aggregation workflow: reducing the signal-to-noise across those two segments, the signals coming from your applications that can be natively driven, and the custom data coming from your business. How are they all aggregated and correlated so that you can identify, “Ah, the drop in order rate is because we didn’t do X, Y, and Z properly in a certain application”?
Those three pillars need to come together very well. I fully acknowledge that it’s easier said than done when you’re thinking of an observability plan. And one more thing I’ll throw in as a wrench: we’re evolving applications so rapidly, I believe the rate of committing code right now is within a couple of days, or even sub-daily [00:20:00]. As you’re changing code so quickly, ensuring you have the right set of telemetry is also hard, and I fully acknowledge that aspect.
But it’s ultimately up to the developers and product managers to come together and make observability a critical aspect of their job. More often than not, and definitely in previous worlds where we had on-prem solutions and monolithic applications, the mindset was that observability was an afterthought. We can no longer afford that, because of how ephemeral these applications have become and how distributed the environments are. It’s very critical for observability to be at the front of all of your development.
Liran Haimovitch: I couldn’t agree more. I have to admit, I haven’t seen that many companies actually manage to get a grip on business metrics, especially in real time, let alone combine them with technical metrics and logs and so on. But I wish we were all there; you imagine a much better world than the one I know.
Sudeeptha Jothiprakash: I’m really with you on that. And I think this problem is also an opportunity for many companies to iterate on, because ultimately we need a framework for thinking about the business objectives of each product update we make. What are we trying to drive? How do we expect customers to react, and what are the key signals that help us understand success? Many companies do OKRs, and OKRs are a great way to think about the business objectives we’d like to drive. Ensuring we convert those OKRs into observability goals as well is very important. That’s a step I’m mindfully trying to inculcate as part of my own journey now, but I can see that a lot of businesses are still evolving that mindset, just as you say.
Liran Haimovitch: Makes sense. Now, there’s one last question I would love to ask you; a question I ask a lot of my guests. You were an engineer before you were a product manager, you’ve been a product manager for a long while, and you’ve been around software for a large portion of your life. I have to ask: what’s the single bug in your career that you remember the most?
Sudeeptha Jothiprakash: Without giving a lot of detail about what the bug was: there was a point in time when I pushed code that was missing the component that converted the number of bytes to gigabytes. The conversion was coming from, let’s say, a different pipeline; I was calling something residing in a completely different pipeline, and I expected that conversion package to be there.
This code was also helping us bill customers for their logging expenses, and you can imagine how that scales up quickly, right? It was a very minute, maybe one-hour bug, but it could have caused a lot of panic for our customers if they’d actually seen bills inflated many times over, because every gigabyte was now being represented in terms of the bytes they used. We were able to quickly catch this as part of our tests: the tests showed these crazy numbers, billions of dollars that we would have been charging customers, if not for those test setups.
Luckily, we were able to roll back and ensure it didn’t impact any production environments. But it’s those minute things, right? You want to ensure everything works really well, and I learned pretty early on that you should understand your dependencies very clearly and ensure they’re actually there when you call them.
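The class of bug described here, a missing unit conversion in a billing path, can be sketched along with the kind of guard-rail test that caught it. The rate and numbers are made up; this is not the actual AWS billing code.

```python
# Billing calculation with the bytes-to-gigabytes conversion in place.
# Omitting the division below is exactly the bug described: every
# gigabyte gets billed as if it were 10**9 gigabytes.
BYTES_PER_GB = 10**9


def monthly_charge(bytes_ingested: int, rate_per_gb: float = 0.50) -> float:
    gigabytes = bytes_ingested / BYTES_PER_GB  # the step that was missing
    return gigabytes * rate_per_gb


# A sanity test like the one that surfaced the bug: a plausible workload
# must never produce an astronomical bill.
charge = monthly_charge(250 * BYTES_PER_GB)  # 250 GB ingested
```

A test asserting that a realistic input stays under some sane ceiling turns a billion-fold pricing error into an immediate red build instead of a customer-facing incident.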
Liran Haimovitch: Definitely. Interesting story.
Sudeeptha Jothiprakash: Thank you.
Liran Haimovitch: Sudeeptha, thank you for joining us. It’s been a pleasure having you on the show.
Sudeeptha Jothiprakash: I’m very happy to be here, Liran. Thank you so much for the chat, and I really appreciate the conversation.
Liran Haimovitch: That’s a wrap on another episode of The Production First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show, and reach out to me on LinkedIn or Twitter, @productionfirst. Thanks again for joining us.