Episode 519: Kumar Ramaiyer on Building a SaaS
Kumar Ramaiyer, CTO of the Planning Business Unit at Workday, discusses the infrastructure services needed for, and the design and lifecycle of, a software-as-a-service (SaaS) application. Host Kanchan Shringi spoke with Ramaiyer about composing a cloud application from microservices, as well as key checklist items for choosing the platform services to use and the features needed for supporting the customer lifecycle. They explore the need and methodology for adding observability and how customers typically extend and integrate multiple SaaS applications. The episode ends with a discussion on the importance of DevOps in supporting SaaS applications.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Kanchan Shringi 00:00:16 Welcome all to this episode of Software Engineering Radio. Our topic today is building a SaaS application, and our guest is Kumar Ramaiyer. Kumar is the CTO of the Planning Business Unit at Workday. Kumar has experience at data management companies like Interlace, Informex, Ariba, and Oracle, and now SaaS at Workday. Welcome, Kumar. So glad to have you here. Is there something you’d like to add to your bio before we start?
Kumar Ramaiyer 00:00:46 Thanks, Kanchan, for the opportunity to discuss this important topic of SaaS applications in the cloud. No, I think you covered it all. I just want to add, I do have deep experience in planning, but over the last several years, I’ve been delivering planning applications in the cloud, first at Oracle, now at Workday. There are a lot of interesting things going on. Distributed computing and cloud deployment have come a long way. I’m learning a lot every day from my amazing co-workers. And also, there’s a lot of strong literature out there and well-established SaaS patterns. I’m happy to share many of my learnings in today’s discussion.
Kanchan Shringi 00:01:23 Thank you. So let’s start with just the basic design of how a SaaS application is deployed. The key terms that I’ve heard there are the control plane and the data plane. Can you talk more about the division of labor between the control plane and the data plane, and how that corresponds to deployment of the application?
Kumar Ramaiyer 00:01:45 Yeah. So before we get there, let’s talk about what the modern standard way of deploying applications in the cloud is. It’s all based on what we call a services architecture, and services are deployed as containers, often as Docker containers, using a Kubernetes deployment. So first, the applications are all containerized, and then these containers are put together in what is called a pod. A pod can contain one or more containers, and these pods are then run on what is called a node, which is basically the physical machine where the execution happens. Then there are one or more nodes in what is called a cluster, and from there you go on to other hierarchical concepts like regions and whatnot. So the basic architecture is cluster, node, pod, and container. You can have a very simple deployment, like one cluster, one node, one pod, and one container.
Kumar Ramaiyer 00:02:45 From there, we can go on to have hundreds of clusters; within each cluster, hundreds of nodes; and within each node, lots of pods, even scaled-out pods and replicated pods, and so on. And within each pod you can have lots of containers. So how do you manage this level of complexity and scale, especially when you are multi-tenant, with multiple customers running on all of these? Luckily we have the control plane, which allows us to define policies for networking and routing decisions, monitoring of cluster events and responding to them, and scheduling of these pods when they go down: how we bring them up, how many we bring up, and so on. And there are several other controllers that are part of the control plane. So it’s a declarative semantics, and Kubernetes allows us to do that by simply specifying those policies. The data plane is where the actual execution happens.
Kumar Ramaiyer 00:03:43 So it’s important to get the control plane and data plane roles and responsibilities correct in a well-defined architecture. Often, companies try to write a lot of the control plane logic in their own code, which should be completely avoided. We should leverage a lot of the out-of-the-box software that comes not only with Kubernetes but also with the other associated software, and all the effort should be focused on the data plane. Because if you put a lot of code around the control plane, then as Kubernetes evolves, or as all the other software evolves, software that has been proven in many other SaaS vendors, you won’t be able to take advantage of it: you’ll be stuck with all the logic you have put into the control plane. Also, this level of complexity needs very formal methods to reason about it, and Kubernetes provides that formal method. One should take advantage of that. I’m happy to answer any other questions here on this.
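The declarative semantics described above can be illustrated with a toy reconciliation loop. This is not Kubernetes code, just a minimal Python sketch of the idea: you declare a desired state (how many pod replicas should exist), and a controller repeatedly compares desired against actual state and computes the actions needed to converge.

```python
# Toy illustration (not Kubernetes itself) of the declarative model:
# a controller compares desired vs. actual state and closes the gap.

def reconcile(desired_replicas, running_pods):
    """Return the actions a controller would take to converge
    the actual state toward the declared desired state."""
    actions = []
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods running: schedule new ones.
        actions = [("start", f"pod-{i}")
                   for i in range(len(running_pods), desired_replicas)]
    elif diff < 0:
        # Too many pods running: stop the surplus.
        actions = [("stop", name) for name in running_pods[desired_replicas:]]
    return actions

# A pod crashed: desired state says 3 replicas, only 2 are running.
print(reconcile(3, ["pod-0", "pod-1"]))  # [('start', 'pod-2')]
```

The point of the declarative style is that the operator never issues imperative "start this pod" commands; the control plane derives them from the declared policy.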
Kanchan Shringi 00:04:43 While we are defining the terms though, let’s continue and talk maybe next about sidecar, and also about service mesh so that we have a little bit of a foundation for later in the discussion. So let’s start with sidecar.
Kumar Ramaiyer 00:04:57 Yeah. When we learn about Java and C, there are a lot of design patterns we learn in the programming language. Similarly, sidecar is an architectural pattern for cloud deployment in Kubernetes or other similar deployment architectures. It’s a separate container that runs alongside the application container in the Kubernetes pod, kind of like a helper for the application. This often comes in handy to enhance legacy code. Let’s say you have a monolithic legacy application and that got converted into a service and deployed as a container. And let’s say we didn’t do a good job and we quickly converted it into a container. Now you need to add a lot of additional capabilities to make it run well in the Kubernetes environment, and the sidecar container allows for that. You can put a lot of the additional logic in the sidecar that enhances the application container. Some of the examples are logging, messaging, monitoring, TLS, service discovery, and many other things which we can talk about later on. So sidecar is an important pattern that helps with cloud deployment.
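As a concrete sketch of the pattern, here is the shape of a pod manifest with an application container plus a logging sidecar, written as the Python dict form of a Kubernetes manifest (the kind you might hand to the official Kubernetes Python client). All names and images are illustrative, not from the episode.

```python
# Sketch of a pod that pairs a legacy app container with a sidecar.
# Containers in the same pod share the network namespace, so the
# sidecar can observe and enhance the app without code changes.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "legacy-app"},
    "spec": {
        "containers": [
            {   # the legacy application, containerized as-is
                "name": "app",
                "image": "example.com/legacy-app:1.0",
            },
            {   # sidecar adding logging/monitoring alongside the app
                "name": "log-shipper",
                "image": "example.com/log-shipper:2.3",
            },
        ],
    },
}

names = [c["name"] for c in pod_manifest["spec"]["containers"]]
print(names)  # ['app', 'log-shipper']
```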
Kanchan Shringi 00:06:10 What about service mesh?
Kumar Ramaiyer 00:06:11 So why do we need a service mesh? Let’s say once you start containerizing, you may start with one or two services, and quickly it’ll become 3, 4, 5, and many, many services. Once it gets to a non-trivial number of services, the management of service-to-service communication, and many other aspects of service management, becomes very difficult. It’s almost like an order-N-squared problem. How do you remember the host name and the port number or the IP address of each service? How do you establish service-to-service trust, and so on? So to help with this, the service mesh notion was introduced. From what I understand, Lyft, the ride-sharing company, first introduced it, because when they were implementing their SaaS application, it became pretty non-trivial. So they wrote this code and then contributed it to the public domain. Since then, it’s become pretty standard. Istio is one of the popular service meshes for enterprise cloud deployment.
Kumar Ramaiyer 00:07:13 So it takes all the complexities away from the service itself. The service can focus on its core logic, and then let the mesh deal with the service-to-service issues. What exactly happens is, in Istio, in the data plane, every service is augmented with a sidecar, like the one we just talked about. They call it Envoy, which is a proxy. These proxies mediate and control all the network communication between the microservices. They also collect and report telemetry on all the mesh traffic. This way the core service can focus on its business function. The proxy almost becomes part of the control plane: the control plane manages and configures the proxies and talks with them. So the data plane doesn’t directly talk to the control plane, but the sidecar proxy, Envoy, talks to the control plane to route all the traffic.
Kumar Ramaiyer 00:08:06 This allows us to do a number of things. For example, the Envoy sidecar in Istio can provide a number of functions like dynamic service discovery and load balancing. It can perform the duty of TLS termination. It can act like a circuit breaker. It can do health checks. It can do fault injection. It can do all the metrics collection and logging, and it can perform a number of other things. So basically, you can see that if there is a legacy application that became a container without actually re-architecting or rewriting the code, we can suddenly enhance the application container with all this rich functionality without much effort.
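One of the functions listed above, the circuit breaker, can be sketched in a few lines. This is a toy in-process version, not Envoy's actual implementation: after a threshold of consecutive failures the breaker "opens" and fails fast, instead of letting every caller keep hammering an unhealthy service.

```python
# Toy circuit breaker: one of the duties a sidecar proxy can take
# over from the application container.
class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.state = "closed"  # closed = traffic flows normally

    def call(self, func):
        if self.state == "open":
            # Fail fast instead of calling the sick service again.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.state = "open"
            raise
        self.failures = 0  # any success resets the count
        return result

breaker = CircuitBreaker(failure_threshold=2)

def flaky():
    raise ConnectionError("service unavailable")

for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

print(breaker.state)  # 'open' -- subsequent calls fail fast
```

A real mesh adds half-open probing, timeouts, and per-host statistics, but the core state machine is this small.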
Kanchan Shringi 00:08:46 So you mentioned legacy applications. Many of the legacy applications were not really microservices based; they would have been monolithic. But a lot of what you’ve been talking about, especially with the service mesh, is directly based on having multiple microservices in the architecture, in the system. So is that true? How does one take a legacy application and convert it to a modern cloud architecture, to convert it to SaaS? What else is needed? Is there a breakup process? At some point you start to feel the need for a service mesh. Can you talk a little bit more about that, and is a microservices architecture even absolutely critical to building a SaaS or converting a legacy application to SaaS?
Kumar Ramaiyer 00:09:32 Yeah, I think it is important to go with a microservices architecture. Let’s go through that, right? When do you feel the need to create a services architecture? As the legacy application becomes larger and larger, nowadays there is a lot of pressure to deliver applications in the cloud. Why is it important? Because for a long period of time, enterprise applications were delivered on premise. It was very expensive to upgrade. And also, every time you released new software, the customers wouldn’t upgrade, and the vendors were stuck supporting software that is almost 10, 15 years old. One of the things that cloud applications provide is automatic upgrade of all your applications to the latest version, and also, for the vendor, maintaining only one version of the software: keeping all the customers on the latest and then providing them with all the latest functionality.
Kumar Ramaiyer 00:10:29 That’s a nice advantage of delivering applications on the cloud. So then the question is, can we deliver a big monolithic application on the cloud? The problem is that a lot of the modern cloud deployment architectures are container based. We talked about the scale and complexity, because when you are actually running the customers’ applications on the cloud, let’s say you have 500 customers on-premise. That’s 500 different deployments. Now you’re taking on the burden of running all those deployments in your own cloud. It is not easy. So you need to use a Kubernetes type of architecture to manage that level of complex deployment in the cloud. That’s how you arrive at the decision: you can’t just simply run 500 monolithic deployments. To run it efficiently in the cloud, you need to have a containerized environment, so you start going down that path. Not only that, many of the SaaS vendors have more than one application. So imagine running several applications, each in its own legacy way of running: you just cannot scale. So there are systematic ways of breaking a monolithic application into a microservices architecture. We can go through those steps.
Kanchan Shringi 00:11:40 Let’s delve into that. How does one go about it? What is the methodology? Are there patterns that somebody can follow? Best practices?
Kumar Ramaiyer 00:11:47 Yeah. So, let me talk about some of the basics, right? SaaS applications can benefit from a services architecture. And if you look at it, almost all applications have many common platform components. Some of the examples are scheduling; almost all of them have persistent storage; they all need lifecycle management for a test-to-prod type of flow; and they all have to have data connectors to multiple external systems, virus scan, document storage, workflow, user management, authorization, monitoring and observability, a Lucene-type search, email, et cetera, right? A company that delivers multiple products has no reason to build all of these multiple times, right? These are all ideal candidates to be delivered as microservices and reused across the different SaaS applications one may have. Once you decide to create a services architecture, you want to only focus on building the service and do as good a job as possible, and then putting all of them together and deploying them is given to someone else, right?
Kumar Ramaiyer 00:12:52 And that’s where continuous deployment comes into the picture. So typically, one of the best practices is that we all build containers and then deliver them using what is called an artifactory, with appropriate version numbers. When you are actually deploying, you specify all the different containers that you need and the compatible version numbers; all of these are put together as a pod and then delivered in the cloud. That’s how it works, and it is proven to work well. The maturity level is pretty high, with widespread adoption among many, many vendors. The other way to look at it is that it’s just a new architectural way of developing applications. But the key thing then is, if you had a monolithic application, how do you go about breaking it up? We all see the benefit of it, and I can walk through some of the aspects that you have to pay attention to.
Kanchan Shringi 00:13:45 I think, Kumar, it’d be great if you used an example to get into the next level of detail.
Kumar Ramaiyer 00:13:50 Suppose you have an HR application that manages the employees of a company. You may have anywhere between five to a hundred attributes per employee in different implementations. Now let’s assume different personas are asking for different reports about employees with different conditions. For example, one report could be: give me all the employees who are at a certain level and making less than the average for their salary range. Another report could be: give me all the employees at a certain level, in a certain location, who are women and have been at least five years at the same level, et cetera. And let’s assume that we have a monolithic application that can satisfy all these requirements. Now, if you want to break that monolithic application into microservices, you might just decide: okay, let me put this employee, its attributes, and the management of that in a separate microservice.
Kumar Ramaiyer 00:14:47 So basically that microservice owns the employee entity, right? Anytime you want to ask for an employee, you’ve got to go to that microservice. That seems like a logical starting point. Now, because that service owns the employee entity, everybody else cannot have a copy of it. They will just need a key to query it, right? Let’s assume that is an employee ID or something like that. Now, when the report comes back, because you are running some other service and you got the results back, the report may return either 10 employees or 100,000 employees. And it may also return as output two attributes per employee or 100 attributes. So when you come back from the back end, you will only have employee IDs. Now you have to populate all the other information about these attributes. How do you do that? You need to go talk to this employee service to get that information.
Kumar Ramaiyer 00:15:45 So what would be the API design for that service, and what would be the payload? Do you pass a list of employee IDs, or do you pass a list of attributes, or do you make it a big uber API with a list of employee IDs and a list of attributes? If you call one at a time, it is too chatty, but if you call everything together as one API, it becomes a very big payload. And at the same time, if there are hundreds of personas running that report, what is going to happen in that microservice? It’ll be very busy creating copies of the entity object hundreds of times for the different workloads. So it becomes a massive memory problem for that microservice. That’s the crux of the problem. How do you design the API? There is no single answer here. The answer I’m going to give in this context: maybe having a distributed cache, where all the services share that employee entity, may make sense. But that’s what you need to pay attention to, right?
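The "uber API" option described above, passing both a list of employee IDs and a list of attributes, can be sketched as follows. This is a hypothetical illustration, not from the episode; the in-memory dict stands in for the employee service's own store, and the function stands in for a single network call.

```python
# Hypothetical bulk API for the employee service: one round trip,
# and only the requested attributes are copied into the payload,
# trading many chatty per-employee calls for one larger response.
EMPLOYEES = {
    101: {"name": "Asha", "level": 5, "location": "Pleasanton", "salary": 120000},
    102: {"name": "Ben",  "level": 4, "location": "Boston",     "salary": 95000},
}

def get_employees(employee_ids, attributes):
    """Bulk lookup: return only the requested attributes
    for the requested employees."""
    return {
        emp_id: {attr: EMPLOYEES[emp_id][attr] for attr in attributes}
        for emp_id in employee_ids
        if emp_id in EMPLOYEES
    }

# One call replaces hundreds of per-employee lookups.
report_rows = get_employees([101, 102], ["name", "level"])
print(report_rows)
# {101: {'name': 'Asha', 'level': 5}, 102: {'name': 'Ben', 'level': 4}}
```

Restricting the attribute list keeps the payload proportional to what the report actually needs, which is the knob the discussion above is turning.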
Kumar Ramaiyer 00:16:46 You have to go look at all the workloads: what are the touch points? Then put on the worst-case hat and think about the payload size, chattiness, and whatnot. In the monolithic application, we would simply be traversing some data structure in memory, and we’d be reusing a pointer instead of cloning the employee entity, so it would not have much of a burden. So we need to be aware of this latency-versus-throughput trade-off, right? It’s almost always going to cost you more in terms of latency when you are going to a remote process. But the benefit you get is in terms of scale-out. The employee service, for example, could be scaled to a hundred scale-out nodes. Now it can support a lot more workloads and a lot more report users, which otherwise wouldn’t be possible in a scale-up situation or in a monolithic situation.
Kumar Ramaiyer 00:17:37 So you offset the loss of latency by a gain in throughput, and by being able to support very large workloads. That’s something you want to be aware of, but if you cannot scale out, then you don’t gain anything. Similarly, the other thing you need to pay attention to: for a single-tenant application, it doesn’t make sense to create a services architecture. You should try to work on your algorithms to get better bounds, and try to scale up as much as possible to get to a good performance that satisfies all your workloads. But as you start introducing multi-tenancy, you are supporting lots of customers with lots of users, so you need to support a very large workload. A single process that is scaled up cannot satisfy that level of complexity and scale. At that point it’s important to think in terms of throughput and then scale-out of the various services. That’s another important notion, right? Multi-tenancy is a key driver for a services architecture.
Kanchan Shringi 00:18:36 So Kumar, you talked in your example of an employee service, and earlier you had hinted at more platform services like search. An employee service is not necessarily a platform service that you would use in other SaaS applications. So what is the justification for creating an employee service, breaking up the monolith even further, beyond the use of platform services?
Kumar Ramaiyer 00:18:59 Yeah, that’s a very good observation. I think the first step would be to create platform components that are common across multiple SaaS applications. But once you get to that point, sometimes even with that breakdown, you still may not be able to satisfy the large-scale workload in a scaled-up process. Then you want to start looking at how you can break it further. And there are common ways of breaking even the application-level entities into different microservices. The common examples, at least in the domain that I’m in, are to break it into a calculation engine, metadata engine, workflow engine, user service, and whatnot. Similarly, you may have consolidation, account reconciliation, allocation. There are many, many application-level concepts that you can break up further. So at the end of the day, what is a service, right? You want to be able to build it independently, reuse it, and scale it out. As you pointed out, the reusable aspect may not play a role here, but you can still scale out independently. For example, you may want to have multiple scaled-out instances of the calculation engine, but maybe not so many of the metadata engine, right? And that is possible with Kubernetes. So basically, if you want to scale out different parts of even the application logic, you may want to think about containerizing it even further.
Kanchan Shringi 00:20:26 So this assumes a multi-tenant deployment for these microservices?
Kumar Ramaiyer 00:20:30 That’s correct.
Kanchan Shringi 00:20:31 Is there any reason why you would still want to do it if it was a single-tenant application, just to adhere to the two-pizza team model, for example, for developing and deploying?
Kumar Ramaiyer 00:20:43 Right. I think, as I said, for a single tenant it doesn’t justify creating this complex architecture. You want to keep everything scaled up as much as possible and go to, particularly in the Java world, as large a JVM as possible, and see whether you can satisfy that, because the workload is pretty well known. Multi-tenancy brings in the complexity of lots of users from multiple companies who are active at different points in time, and it’s important to think in terms of the containerized world. So I can go into some of the other common issues you want to pay attention to when you are creating a service from a monolithic application. The key aspect is that each service should have its own independent business function or a logical ownership of an entity. That’s one thing. And then there’s the question of a wide, large, common data structure that is shared by a lot of services.
Kumar Ramaiyer 00:21:34 That’s generally not a good idea, in particular if it is often needed, leading to chattiness, or updated by multiple services. You want to pay attention to the payload size of different APIs. The API is the key, right? When you’re breaking it up, you need to pay a lot of attention and go through all your workloads: what are the different APIs, and what are the payload sizes and chattiness of the APIs? And you need to be aware that there will be a latency-versus-throughput trade-off. Then, in a multi-tenant situation, you want to be aware of routing and placement. For example, you want to know which of these pods contain which customer’s data. You are not going to replicate every customer’s information in every pod. So you need to cache that information, and you need to be able to do a lookup.
Kumar Ramaiyer 00:22:24 Suppose you have a workflow service. There are five copies of the service, and each copy runs workflows for some set of customers. So you need to know how to look that up. There are updates that need to be propagated to other services; you need to see how you are going to do that. The standard way of doing it nowadays is using a Kafka event service, and that needs to be part of your deployment architecture. We already talked about single tenant: generally, you don’t want to go through this level of complexity for a single tenant. And one thing that I keep thinking about is, in the earlier days, when we did entity-relationship modeling for databases, there was a normalization-versus-denormalization trade-off. Normalization, we all know, is good because there is the notion of separation of concerns. This way the update is very efficient.
Kumar Ramaiyer 00:23:12 You only update it in one place and there is a clear ownership. But then when you want to retrieve the data, if it is extremely normalized, you end up paying a price in terms of a lot of joins. A services architecture is similar to that, right? When you want to combine all the information, you have to go to all these services to collate the information and present it. So it helps to think in terms of normalization versus denormalization. Do you want to have some kind of read replica where all this information is collated, so that the read replica addresses some of the clients that are asking for information from a collection of services? Session management is another critical aspect you want to pay attention to: once you are authenticated, how do you pass that information around? Similarly, all these services may want to share database information, connection pools, where to log, and all of that. There’s a lot of configuration that you want to share. Between the service mesh and introducing a configuration service of your own, you can address some of those problems.
Kanchan Shringi 00:24:15 Given all this complexity, should people also pay attention to how many is too many? Certainly there’s a lot of benefit to not having microservices, and there are benefits to having them. But there must be a sweet spot. Can you comment on the number?
Kumar Ramaiyer 00:24:32 I think it’s important to look at service mesh and other complex deployments carefully, because they provide benefits, but at the same time the deployment becomes complex, and your DevOps suddenly needs to take on extra work, right? Anything more than five, I would say, is nontrivial and needs to be designed carefully. I think in the beginning, most of the deployments may not have all the complexity, the sidecars and the service mesh, but over a period of time, as you scale to thousands of customers, and then you have multiple applications, all of them deployed and delivered on the cloud, it is important to look at the full strength of the cloud deployment architecture.
Kanchan Shringi 00:25:15 Thank you, Kumar that certainly covers several topics. The one that strikes me, though, as very critical for a multi-tenant application is ensuring that data is isolated and there’s no leakage between your deployment, which is for multiple customers. Can you talk more about that and patterns to ensure this isolation?
Kumar Ramaiyer 00:25:37 Yeah, sure. When it comes to platform services, they are stateless, and we are not really worried about this issue. But when you break the application into multiple services and the application data needs to be shared between different services, how do you go about doing it? There are two common patterns. One is, if there are multiple services that need to update and also read the data, that is, all the read-write workloads have to be supported through multiple services, the most logical way to do it is using a Redis type of distributed cache. Then the caution is: if you’re using a distributed cache and you’re storing data from multiple tenants, how is this possible? Typically, what you do is use tenant ID plus object ID as the key. That way, even though they’re mixed up, they’re still well separated.
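The tenant-scoped keying just described can be sketched in a few lines. This is a minimal in-process stand-in for a shared distributed cache; the point is only the key scheme: every key is prefixed with the tenant ID, so entries from different tenants never collide.

```python
# Minimal sketch of tenant-scoped cache keys. A plain dict stands in
# for a shared distributed cache such as Redis.
cache = {}

def cache_key(tenant_id, object_id):
    # Tenant ID plus object ID: the composite key keeps tenants separated.
    return f"{tenant_id}:{object_id}"

def put(tenant_id, object_id, value):
    cache[cache_key(tenant_id, object_id)] = value

def get(tenant_id, object_id):
    return cache.get(cache_key(tenant_id, object_id))

put("acme", "emp-101", {"name": "Asha"})
put("globex", "emp-101", {"name": "Ben"})  # same object ID, different tenant

print(get("acme", "emp-101"))  # {'name': 'Asha'} -- no cross-tenant mixup
```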
Kumar Ramaiyer 00:26:30 But if you’re concerned, you can actually even keep that data in memory encrypted, using a tenant-specific key, right? That way, once you read from the distributed cache, and before the other services use the data, they can decrypt it using the tenant-specific key. That’s one thing, if you want to add an extra layer of security. The other pattern is when typically only one service owns the update, but all the others need a copy of that data, at regular intervals or almost in real time. The way it happens is that the owning service still updates the data, and then publishes all the updates as events through a Kafka stream, and all the other services subscribe to that. But here, what happens is that you need to have a clone of that object everywhere else, so that they can apply the update. That’s basically something you cannot avoid. In our example, all of the services will have a copy of the employee object, and when an update happens to an employee, those updates are propagated and applied locally. Those are the two patterns which are commonly adopted.
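The second pattern, a single owning service publishing change events that subscribers apply to their local clones, can be sketched as a toy in-process pub/sub. In a real deployment Kafka would carry the events between services; here a plain list of callbacks stands in for the stream, and all names are illustrative.

```python
# Toy event propagation: the owning service publishes updates,
# subscribers apply them to their local copies of the entity.
subscribers = []

def subscribe(handler):
    subscribers.append(handler)

def publish(event):
    # A Kafka topic would do this durably and asynchronously.
    for handler in subscribers:
        handler(event)

# A "report service" keeps its own clone of the employee object.
report_copy = {"emp-101": {"name": "Asha", "level": 5}}

def apply_update(event):
    report_copy[event["id"]].update(event["changes"])

subscribe(apply_update)

# The owning employee service makes the update and publishes it.
publish({"id": "emp-101", "changes": {"level": 6}})

print(report_copy["emp-101"]["level"])  # 6 -- the clone converged
```

The design choice here is eventual consistency: readers of the clone may briefly see stale data between the owner's write and the event being applied.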
Kanchan Shringi 00:27:38 So we’ve spent quite some time talking about how the SaaS application is composed from multiple platform services, and in some cases splitting the business functionality itself into microservices, beyond the platform services. I’d like to talk more about how you decide whether you build it or, you know, buy it; and buying could be subscribing to an existing cloud vendor, or maybe looking across your own organization to see if someone else has that specific platform service. What’s your experience with going through this process?
Kumar Ramaiyer 00:28:17 I know this is a pretty common problem, and I don’t think people always get it right, but I can talk about my own experience. It’s important that, within a large organization, everybody recognizes there shouldn’t be any duplication of effort, and one should design things in a way that allows for sharing. That’s a nice thing about the modern containerized world, because the artifactory allows for distribution of these containers in different versions, in an easy way to be shared across the organization. When you’re actually deploying, even though the different products may be using different versions of these containers, in the deployment configuration you can actually specify what version you want to use. That way, different versions don’t pose a problem. Many companies don’t even have a common artifactory for sharing, and that should be fixed. It’s an important investment, and they should take it seriously.
Kumar Ramaiyer 00:29:08 So I would say, for platform services, everybody should strive to share as much as possible. And we already talked about the fact that there are a lot of common services, like workflow and document service and all of that. When it comes to build versus buy, the other thing that people don’t understand is that even multiple platforms or multiple operating systems are not an issue. For example, the latest .NET version is compatible with Kubernetes. It’s not that you only need Linux versions of containers. So even if there is a good service that you want to consume, and it is on Windows, you can still consume it. We need to pay attention to that. Even if you want to build it on your own, it’s okay to get started with the containers that are available: you can go out, buy one, consume it quickly, and then over a period of time you can replace it. So I would say the decision is purely based on business interest: is it our core business to build such a thing, and do our priorities allow us to do it? Or should we just go get one and deploy it? Because the standard way of deploying containers allows for easy consumption, even if you buy externally.
Kanchan Shringi 00:30:22 What else do you need to ensure though, before you decide to, you know, quote unquote, buy externally? What compliance or security aspects should you pay attention to?
Kumar Ramaiyer 00:30:32 Yeah, I mean, I think that’s an important question. Security is very key. These containers should support TLS, and if there is data, they should support different types of encryption; we can talk about some of the security aspects of it. That’s one thing. Then, it should be compatible with your cloud architecture. Let’s say we are going to use a service mesh; there should be a way to deploy the container that you are buying that is compatible with that. We didn’t talk about the API gateway yet. If we’re going to use an API gateway, there should be an easy way for it to conform to our gateway. But security is an important aspect, and I can talk about that in general. There are three types of encryption, right? Encryption at rest, encryption in transit, and encryption in memory. Encryption at rest means when you store the data on a disk, that data should be kept encrypted.
Kumar Ramaiyer 00:31:24 Encryption in transit is when data moves between services; it should go in an encrypted way. And encryption in memory is when the data is in memory: even the data structures should be encrypted. That third one, encryption in memory, most of the vendors don’t do because it’s pretty expensive, but there are some critical parts that they do keep encrypted in memory. When it comes to encryption in transit, the modern standard is still TLS 1.2. And there are different algorithms requiring different levels of encryption, using 256 bits and so on, and it should conform to the AES standard where possible, right? That’s for the transit encryption. There are also different types of encryption algorithms, symmetric versus asymmetric, and using certificate authorities and all of that. So there is rich literature here, and this is a well-understood area.
Kumar Ramaiyer2 00:32:21 And it’s not that difficult to adopt the modern standards for this. And if you use the sidecar type of service mesh, adopting TLS becomes easier because the Envoy proxy performs the duty of a TLS endpoint. So it makes it easy. But when it comes to encryption at rest, there are fundamental questions you want to ask in terms of design. Do you encrypt the data in the application and then send the encrypted data to the persistent storage? Or do you rely on the database: you send the data unencrypted over TLS and then encrypt the data on disk, right? That’s one question. Typically people use two types of keys. One is called an envelope key, the other is called a data key. The envelope key is used to encrypt the data key, and the data key is what is used to encrypt the data. The envelope key is what is rotated often, and the data key is rotated very rarely, because you need to touch every piece of data to decrypt it, but rotation of both is important. And what frequency are you rotating all those keys? That’s another question. And then you have different environments for a customer, right? You may have dev, test, and prod. The data is encrypted. How do you move the encrypted data between these tenants? That’s an important question you need to have a good design for.
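The envelope-key/data-key scheme just described can be sketched as follows. This is a toy illustration of the key hierarchy only: a repeating-key XOR stands in for a real cipher such as AES-GCM, and all names here are hypothetical, not any vendor’s API. The point is that rotating the envelope key only re-wraps the small data key, which is why it can be rotated often while the bulk data stays untouched.

```python
import os

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for a real cipher (e.g., AES-GCM); never use XOR in production.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

class EnvelopeStore:
    """Sketch of envelope encryption: the envelope (key-encryption) key wraps
    the data key; only the wrapped data key is stored beside the ciphertext."""
    def __init__(self) -> None:
        self.envelope_key = os.urandom(16)

    def encrypt(self, plaintext: bytes):
        data_key = os.urandom(16)
        ciphertext = xor_bytes(plaintext, data_key)
        wrapped = xor_bytes(data_key, self.envelope_key)  # store this, not the raw key
        return ciphertext, wrapped

    def decrypt(self, ciphertext: bytes, wrapped: bytes) -> bytes:
        data_key = xor_bytes(wrapped, self.envelope_key)
        return xor_bytes(ciphertext, data_key)

    def rotate_envelope_key(self, wrapped: bytes) -> bytes:
        # Rotation re-wraps only the small data key; the bulk ciphertext is
        # untouched, which is why envelope keys can be rotated frequently.
        data_key = xor_bytes(wrapped, self.envelope_key)
        self.envelope_key = os.urandom(16)
        return xor_bytes(data_key, self.envelope_key)
```

Rotating the data key, by contrast, would require decrypting and re-encrypting every record, which is why it happens rarely, as noted above.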
Kanchan Shringi 00:33:37 So these are good compliance asks for any platform service you’re choosing. And of course, for any service you are building as well.
Kumar Ramaiyer2 00:33:44 That’s correct.
Kanchan Shringi 00:33:45 So you mentioned the API gateway and the fact that this platform service needs to be compatible. What does that mean?
Kumar Ramaiyer2 00:33:53 So typically what happens is, when you have lots of microservices, each of the microservices has its own APIs. To perform any useful business function, you need to call a sequence of APIs from all of these services. Like we talked about earlier, as the number of services explodes, you need to understand the APIs from all of them. And also, most of the vendors support lots of clients, and each one of these clients would have to understand all these services and all these APIs. Even though this decomposition serves an important function for internal complexity management and scale, from an external business perspective, exposing that level of complexity to external clients doesn’t make sense. This is where the API gateway comes in. The API gateway acts as an aggregator of the APIs from these multiple services and exposes a simple API which performs the holistic business function.
Kumar Ramaiyer2 00:34:56 So these clients then can become simpler. The clients call into the API gateway’s API, which either routes directly to an API of a service, or it does an orchestration: it may call anywhere from five to 10 APIs from these different services, and none of them have to be exposed to all the clients. That’s an important function performed by the API gateway. It’s very critical to start having an API gateway once you have a non-trivial number of microservices. The other functions it performs include what is called rate limiting, meaning if you want to enforce a certain rule, like this service cannot be called more than a certain number of times. And sometimes it does a lot of analytics of which API is called how many times, and authentication is also one of those functions. So you don’t have to authenticate at every service: the request gets authenticated at the gateway, which turns around and calls the internal API. It’s an important component of a cloud architecture.
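The gateway responsibilities just listed — authenticate once at the edge, rate-limit per client, then orchestrate several internal APIs into one business-level call — can be sketched in a few lines. Everything here is hypothetical (the service functions, token, and status codes stand in for real HTTP/gRPC backends):

```python
import time

# Hypothetical internal microservices; in a real system these would be
# HTTP/gRPC calls to separate deployments.
def fetch_user(user_id: int) -> dict:
    return {"id": user_id, "name": "Ada"}

def fetch_orders(user_id: int) -> list:
    return [{"order": 1}, {"order": 2}]

class ApiGateway:
    """Sketch of a gateway that authenticates once, rate-limits per client,
    and aggregates several internal APIs into one business-level endpoint."""
    def __init__(self, limit_per_window: int, window_s: float = 1.0):
        self.limit, self.window = limit_per_window, window_s
        self.calls: dict = {}

    def _allow(self, client: str) -> bool:
        # Simple sliding-window rate limit per client.
        now = time.monotonic()
        recent = [t for t in self.calls.get(client, []) if now - t < self.window]
        self.calls[client] = recent
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True

    def user_dashboard(self, client: str, token: str, user_id: int) -> dict:
        if token != "valid-token":      # authentication happens once, at the edge
            return {"status": 401}
        if not self._allow(client):     # enforce the per-client rate limit
            return {"status": 429}
        # Orchestration: fan out to internal services, return one response.
        return {"status": 200,
                "user": fetch_user(user_id),
                "orders": fetch_orders(user_id)}
```

Internal services never see unauthenticated traffic, and clients see one coarse-grained call instead of five to 10 fine-grained ones.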
Kanchan Shringi 00:35:51 The aggregation is that something that’s configurable with the API gateway?
Kumar Ramaiyer2 00:35:56 There are some gateways where it is possible to configure, but the standards there are still being established. More often this is written as code.
Kanchan Shringi 00:36:04 Got it. The other thing you talked about earlier was the different types of environments. So dev, test, and production: is it standard with SaaS that you provide these different types, and what is the implicit function of each of them?
Kumar Ramaiyer2 00:36:22 Right. I think different vendors have different contracts, and as part of selling the product those different contracts are established, like every customer gets certain types of tenants. So why do we need this? If we think about it, even in an on-premise world there will typically be a production deployment, and once somebody buys software, getting to production takes anywhere from several weeks to several months. So what happens during that time? They buy the software, they start doing development: they first convert their requirements into a model and then build that model. There will be a long phase of development, then it goes through different types of testing, user acceptance testing, performance testing, and whatnot, and then it gets deployed in production. So in the on-premise world, you will typically have multiple environments: development, test, UAT, prod, and whatnot.
Kumar Ramaiyer2 00:37:18 So, when we come to the cloud world, customers expect similar functionality, because unlike the on-premise world, the vendor now manages everything. In the on-premise world, if we had 500 customers and each one of those customers had four machines, those 2,000 machines now have to be managed by the vendor, because the vendor is administering all those aspects in the cloud. Without a significant level of tooling and automation, supporting all these customers as they go through this lifecycle is almost impossible. So you need a very formal definition of what these environments mean. Just because customers move from on-premise to cloud, they don’t want to give up on going through the dev-test-prod cycle. It still takes time to build a model, test the model, go through user acceptance, and whatnot. So almost all SaaS vendors have these types of concepts and have tooling around them.
Kumar Ramaiyer2 00:38:13 One of the differing aspects is how you move data from one environment to another. How do you automatically refresh from one to another? What kind of data gets promoted from one to another? So the refresh semantics become very critical, and do they have exclusions? A lot of vendors provide automatic refresh from prod to dev, automatic promotion from test to prod, and all of that. But it is very critical to build this, expose it to your customers, make them understand it, and make them part of it. Because all the things they used to do on-premise, now they have to do in the cloud. And if you have to scale to hundreds and thousands of customers, you need to have pretty good tooling.
Kanchan Shringi 00:38:55 Makes sense. The next question I had along the same vein was disaster recovery, and then perhaps we can talk about it in terms of these different types of environments. Would it be fair to assume that DR doesn’t have to apply to a dev environment or a test environment, but only to prod?
Kumar Ramaiyer2 00:39:13 More often when they design it, DR is an important requirement, and I think we’ll get to what applies to which environment in a short time, but let me first talk about DR. So DR has got two important metrics. One is called RTO, the recovery time objective. The other is called RPO, the recovery point objective. RTO is how much time it will take to recover from the time of the disaster: do you bring up the DR site within 10 hours, two hours, one hour? So that is clearly documented. RPO is, after the disaster, how much data is lost: is it zero, or one hour of data, or five minutes of data? So it’s important to understand what these metrics are, understand how your design delivers them, and clearly articulate these metrics; they’re part of the contract. And different values for these metrics call for different designs.
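To make the two metrics concrete, here is a small sketch with illustrative numbers of my own (not from the episode): with periodic backups, a disaster striking just before the next backup loses almost a full interval of data, so the backup cadence directly bounds the worst-case RPO.

```python
def worst_case_rpo_minutes(backup_interval_min: float,
                           replication_lag_min: float = 0.0) -> float:
    """With periodic backups, a disaster just before the next backup loses
    almost one full interval of data, plus any replication lag."""
    return backup_interval_min + replication_lag_min

def meets_dr_objectives(rto_actual_min: float, rto_target_min: float,
                        rpo_actual_min: float, rpo_target_min: float) -> bool:
    # Both objectives must hold; different targets call for different designs.
    return rto_actual_min <= rto_target_min and rpo_actual_min <= rpo_target_min
```

For example, hourly backups with five minutes of replication lag cannot meet a 15-minute RPO target; that mismatch is what forces a move to continuous replication in the design.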
Kumar Ramaiyer2 00:40:09 So that’s very important. Typically it’s very important for the prod environment to support DR, and most of the vendors support it even for the dev and test environments too, because it’s all implemented using clusters, and all the clusters with their associated persistent storage are backed up using an appropriate mechanism. The RTO may be different between environments: it’s okay for a dev environment to come up a little slowly, but the recovery point objective is typically common across all these environments. Along with DR, the associated aspects are high availability and scale up and out. High availability is provided automatically by most cloud architectures, because if your pod goes down, another pod is brought up and services that request, and so on; typically you have a redundant pod which can service the request, and the routing happens automatically. Whether an application can scale up and out is integral to the application’s algorithms, and it’s very critical to think about it at design time.
Kanchan Shringi 00:41:12 What about upgrades and deploying next versions? Is there a cadence, so test or dev gets upgraded first and then production? I assume that would have to follow the customer’s timelines, in terms of ensuring that their application is ready for and accepted into production.
Kumar Ramaiyer2 00:41:32 The industry expectation is zero downtime, and different companies have different methodologies to achieve that. So typically, almost all companies have different types of software delivery. We call them hotfixes, service packs, or feature-bearing releases. Hotfixes are the critical things that need to go in at some point, as close to the incident as possible. Service packs are regularly scheduled patches, and releases are also regularly scheduled, but at a much lower cadence compared to service packs. Often this is closely tied to the strong SLAs companies have promised to their customers, like four-nines availability, five-nines availability, and whatnot. There are good techniques to achieve zero downtime, but the software has to be designed in a way that allows for that. For instance, do you have a bundled build which contains all the containers together, or do you deploy each container separately?
Kumar Ramaiyer2 00:42:33 And then what about schema changes? How do you handle that, how do you upgrade the schema? Because every customer’s schema has to be upgraded. A lot of times the schema upgrade is probably the most challenging one. Sometimes you need to write compensating code to account for it, so that the application can work on the old schema and the new schema, and then at runtime you upgrade the schema. There are techniques to do that. Zero downtime is typically achieved using what is called a rolling upgrade, where different clusters are upgraded to the new version one at a time; because of the high availability, you can upgrade the remaining parts to the latest version while the rest serve traffic. So there are well-established patterns here, but it’s important to spend enough time thinking through it and designing it appropriately.
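The rolling-upgrade pattern just described can be sketched as a loop that drains one cluster at a time, deploys the new version, and gates on a health check before moving on; the data structures and field names here are hypothetical:

```python
def rolling_upgrade(clusters: list, new_version: str, health_check) -> bool:
    """Sketch of a rolling upgrade: take one cluster out of rotation at a
    time, upgrade it, and verify health before proceeding. The remaining
    clusters keep serving traffic, which is how zero downtime is achieved."""
    for cluster in clusters:
        cluster["in_rotation"] = False       # drain this cluster
        cluster["version"] = new_version     # deploy the new build to it
        if not health_check(cluster):        # gate on health before continuing
            cluster["in_rotation"] = True    # return it to rotation and abort
            return False                     # stop the rollout for rollback
        cluster["in_rotation"] = True        # healthy: back into rotation
    return True
```

In a real system the health check would probe readiness endpoints and error rates, and a failure would trigger a rollback of the already-upgraded clusters rather than just halting.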
Kanchan Shringi 00:43:16 So in terms of the upgrade cycles or deployment, how critical are customer notifications, letting the customer know what to expect when?
Kumar Ramaiyer2 00:43:26 I think almost all companies have a well-established protocol for this. They all have signed contracts about downtime and notification and all of that, and there are well-established patterns for it. But I think what is important is, if you’re changing the behavior of the UI or any functionality, it’s important to have very specific communication. Let’s say you are going to have downtime Friday from 5 to 10. They may get an email, but most companies now surface this in the enterprise software itself, in the UI, showing what time it will happen. I don’t have a complete answer, but most companies do have signed contracts for how they communicate, and often it is through email to a specific representative of the company and also through the UI. But the key thing is, if you’re changing behavior, you need to walk the customer through it very carefully.
Kanchan Shringi 00:44:23 Makes sense. So we’ve talked about key design principles, microservice composition for the application, and certain customer experiences and expectations. I wanted to talk next about regions and observability. In terms of deploying to multiple regions, how important is that? How many regions across the world, in your experience, make sense? And then how does one facilitate the CI/CD necessary to be able to do this?
Kumar Ramaiyer2 00:44:57 Sure. Let me walk through it slowly. First let me talk about the regions. When you’re a multinational company, a large vendor delivering to customers in different geographies, regions play a pretty critical role, and your data centers in different regions help achieve that. So regions are chosen typically to cover the broader geography: you’ll typically have the US, Europe, Australia, sometimes even Singapore, South America, and so on. And there are very strict data privacy rules that need to be enforced in these different regions, because sharing anything between regions is strictly prohibited, and you have to work with all your legal and other teams to clearly document what is shared and what is not shared. Having data centers in different regions helps you enforce this strict data privacy. So the terminology typically used here is the region.
Kumar Ramaiyer2 00:45:56 So these are all the different geographical locations where there are cloud data centers, and different regions offer different service qualities, in terms of latency, for example. Some products may not be offered in some regions, and the cost may also be different. For large vendors and cloud providers, these regions exist across the globe, and they enforce the governance rules of data sharing and other aspects as required by the respective governments. But within a region is what is called an availability zone. This refers to an isolated data center within a region, and each availability zone can also have multiple data centers. This is needed for DR purposes: for every availability zone, you will have an associated availability zone for DR, right? And I think there is a common vocabulary and a common standard that is being adopted by the different cloud vendors. Now, as I was saying, unlike the cloud, in the on-premise world, if there are a thousand customers, each customer may have five to 10 administrators.
Kumar Ramaiyer2 00:47:00 So let’s say that’s equivalent to 5,000 administrators. Now the role of those 5,000 administrators has to be played by the single vendor who’s delivering the application in the cloud. It’s impossible to do it without a significant amount of automation and tooling, right? Almost all vendors invest a lot in an observability and monitoring framework. This has gotten pretty sophisticated. I mean, it all starts with how much logging is happening, and it particularly becomes complicated with microservices. Let’s say there is a user request that goes and runs a report, and it touches, let’s say, seven or eight services as it goes through. Previously, in a monolithic application, it was easy to log different parts of the application. Now this request is touching all these services, maybe multiple times. How do you log that? It’s important, and most software has thought through this at design time: they establish a common context ID or something, and that is logged.
Kumar Ramaiyer2 00:48:00 So you have multi-tenant software, and you have a specific user within that tenant and a specific request. All that context has to be attached to all your logs and then tracked through all these services, right? What happens is these logs are then analyzed. There are multiple vendors, like ELK, Sumo Logic, and Splunk, and many, many vendors who provide very good monitoring and observability frameworks. These logs are analyzed, and they almost provide a real-time dashboard showing what is going on in the system. You can even create a multi-dimensional analytical dashboard on top of that to slice and dice by various aspects: which cluster, which customer, which tenant, which request is having a problem. You can then define thresholds, and based on the thresholds you can generate alerts. And then there is PagerDuty-type software and similar alerting tools; all of these can be used in conjunction with the alerts to send text messages and whatnot, right? I mean, it has gotten pretty sophisticated, and I think almost all vendors have a pretty rich observability framework. Without that, it’s very difficult to efficiently operate the cloud. You basically want to figure out any issue much earlier, before the customer even perceives it.
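The context-ID propagation described above can be sketched with Python’s standard `logging` and `contextvars` modules: the tenant, user, and request ID are set once at the edge, and a logging filter attaches them to every line emitted while serving that request. The field names and handler function here are hypothetical.

```python
import contextvars
import json
import logging

# Request-scoped context, set once at the service edge and attached to every
# log record so a single request can be traced across services.
request_ctx = contextvars.ContextVar("request_ctx", default={})

class ContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Stamp the current tenant/user/request context onto the record.
        record.ctx = json.dumps(request_ctx.get(), sort_keys=True)
        return True

logger = logging.getLogger("saas")
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter("%(levelname)s %(ctx)s %(message)s"))
logger.addHandler(_handler)
logger.addFilter(ContextFilter())
logger.setLevel(logging.INFO)

def handle_report_request(tenant: str, user: str, request_id: str) -> None:
    # Set the context once; every downstream log line now carries it.
    request_ctx.set({"tenant": tenant, "user": user, "request_id": request_id})
    logger.info("report started")
    logger.info("report finished")
```

When each service does this and forwards the same request ID to the next hop, the log aggregator can reassemble the full path of a request across all seven or eight services.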
Kanchan Shringi 00:49:28 And I assume capacity planning is also critical. It could be grouped under observability or not, but that would be something else the DevOps folks have to pay attention to.
Kumar Ramaiyer2 00:49:40 Completely agree. How do you know what capacity you need when you have these complex scale needs, right? Lots of customers, with each customer having lots of users. You could vastly over-provision and have a very large system, but then it cuts into your bottom line; you are spending a lot of money. If you have under-capacity, it causes all kinds of performance issues and stability issues, right? So what is the right way to do it? The only way is to have a good observability and monitoring framework, and then use that as a feedback loop to constantly adjust your capacity. And a Kubernetes deployment, which allows us to dynamically scale the pods, helps significantly in this aspect. Customers are also not going to ramp up on day one; they will probably slowly ramp up their users and whatnot.
Kumar Ramaiyer2 00:50:30 And it’s very important to pay very close attention to what’s going on in your production system, and then constantly use the capabilities provided by these cloud deployments to scale up or down, right? But you need to have the whole framework in place. You have to constantly know: let’s say you have 25 clusters, in each cluster you have 10 machines, and on those machines you have lots of pods running different workloads, like a user logging in, a user running some calculation, a user running some reports. For each one of the workloads, you need to deeply understand how it is performing, and different customers may be using different sizes of your model. For example, in my world, we have a multidimensional database, and customers create configurable types of databases. One customer may have five dimensions; another customer can have 15 dimensions. One customer can have a dimension with a hundred members; another customer can have its largest dimension with a million members. A hundred users versus 10,000 users. Different customers come in different sizes and shapes, and they stress the system in different ways. And of course, we need a pretty strong QA and performance lab which thinks through all of this using synthetic models and makes the system go through all these different workloads, but there is nothing like observing production, taking the feedback, and adjusting your capacity accordingly.
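The feedback loop from observed load to pod count that is described above is, in Kubernetes, what the Horizontal Pod Autoscaler implements. Its core scaling rule is simple enough to sketch; the bounds and metric values below are illustrative, not from the episode:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_r: int = 1, max_r: int = 50) -> int:
    """The scaling rule used by Kubernetes' Horizontal Pod Autoscaler:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to the configured minimum and maximum replica counts."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_r, min(max_r, desired))
```

For example, four pods averaging 90% CPU against a 60% target scale out to six pods, while the same pods at 30% scale in to two; the clamp keeps a misbehaving metric from requesting an unbounded fleet.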
Kanchan Shringi 00:51:57 So, starting to wrap up now. We’ve gone through several complex topics here, and it’s complex in itself to build the SaaS application, deploy it, and have customers onboard it. At the same time, this is just one piece of the puzzle at the customer site. Most customers choose between multiple best-of-breed SaaS applications. So what about extensibility? What about creating the ability to integrate your application with other SaaS applications, and then also integration with analytics that lets customers introspect as they go?
Kumar Ramaiyer2 00:52:29 That is one of the challenging issues. A typical customer may have multiple SaaS applications, and then you end up building an integration on the customer side. You may then go and buy a PaaS service where you write your own code to integrate data from all of these, or you buy a data warehouse that pulls data from these multiple applications and then put one of the BI tools on top of that. So the data warehouse acts as an aggregator for integrating with multiple SaaS applications, like Snowflake or any of the data warehouse vendors, where they pull data from multiple SaaS applications and you build analytical applications on top of that. And that’s the direction where things are moving. But if you want to build your own application that pulls data from multiple SaaS applications, again, it is all possible, because almost all SaaS vendors provide ways to extract data, but then it leads to a lot of complex questions, like how do you script that?
Kumar Ramaiyer2 00:53:32 How do you schedule that, and so on. But it is important to have a data warehouse strategy and a BI and analytics strategy. And there are a lot of possibilities and a lot of capabilities available in the cloud, whether it is Amazon Redshift, Snowflake, Google BigQuery, or many others. There are many data warehouses in the cloud, and all the BI vendors talk to all of these cloud warehouses. So it’s almost not necessary to have any data center footprint where you build complex applications or deploy your own data warehouse or anything like that.
Kanchan Shringi 00:54:08 So we covered several topics. Is there anything you feel that we did not talk about that is absolutely critical?
Kumar Ramaiyer2 00:54:15 I don’t think so. No, thanks Kanchan for this opportunity to talk about this; I think we covered a lot. One last point I would add is about DevOps: it’s a newer discipline, and they’re absolutely critical for the success of your cloud. Maybe that’s one aspect we didn’t talk about. DevOps automation, all the runbooks they create, and investing heavily in the DevOps organization is an absolute must, because they are the key folks. If there is a cloud vendor who’s delivering four or five SaaS applications to thousands of customers, the DevOps team basically runs the show. They’re an important part of the organization, and it’s important to have a good set of people.
Kanchan Shringi 00:54:56 How can people contact you?
Kumar Ramaiyer2 00:54:58 I think they can contact me through LinkedIn to start with, or my company email, but I would prefer that they start with LinkedIn.
Kanchan Shringi 00:55:04 Thank you so much for this today. I really enjoyed this conversation.
Kumar Ramaiyer2 00:55:08 Oh, thank you, Kanchan for taking time.
Kanchan Shringi 00:55:11 Thanks all for listening. [End of Audio]