A critical aspect of cloud services is service-to-service communication, and it's essential to do it securely. As an architect who designed centralized authentication and authorization services in CyberArk, a cyber security SaaS provider, I will share my take on developing a secure authorization mechanism between AWS cloud-based services, whether serverless or containers.
By the time you finish reading this post, you will not only understand the importance of service authentication and authorization, but also be equipped with the practical knowledge to implement it. This includes securely enabling cross-account access to resources for both synchronous and asynchronous communication patterns using AWS IAM.
This post includes JSON IAM policies and Python code examples.
This post is the first of many security topics in an upcoming series.
Authentication and Authorization Concepts
Authentication is:
Authorization, on the other hand, is:
the function of specifying access rights/privileges to resources - Wikipedia
In a secure service, authentication and authorization go hand in hand.
As a service developer who exposes a REST API, it's crucial to understand that leaving your API open to the world without proper authentication and authorization measures can lead to unauthorized access and data manipulation.
You want your service first to accept communication requests from authenticated services (principals), those whose identities are proven to be trusted and known. Once identified, you want to ensure that only authorized services can trigger your API or communicate with your service.
Or in other words:
Authentication aims to validate and identify that a principal is who it claims to be.
Authorization aims to define the relationship between principals, actions, and resources.
We can visualize authorization as the relationship between the principals (services), the actions they wish to take (e.g., execute an API), and the resources (the actual REST API endpoint and HTTP action type) on which their actions are called.
So, service-to-service communication security comes in two parts:
First, make sure the caller service is who it claims to be.
Then, assert that it has permission to take action on the resource.
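The two steps above can be sketched in a few lines of plain Python. This is only a toy model to make the principal, action, and resource relationship concrete; the service names, actions, and the `is_allowed` helper are hypothetical and are not part of any AWS API.

```python
from typing import NamedTuple


class Grant(NamedTuple):
    principal: str  # who is calling (hypothetical service name)
    action: str     # what the caller wants to do
    resource: str   # what the action is performed on


# Hypothetical grants, for illustration only.
GRANTS = {
    Grant("service-a", "invoke", "GET /orders"),
    Grant("service-b", "invoke", "POST /orders"),
}

# Hypothetical set of identities whose credentials were already verified.
AUTHENTICATED = {"service-a", "service-b"}


def is_allowed(principal: str, action: str, resource: str) -> bool:
    """Authenticate first, then authorize; everything else is denied."""
    if principal not in AUTHENTICATED:  # step 1: authentication
        return False
    return Grant(principal, action, resource) in GRANTS  # step 2: authorization
```

In the rest of the post, AWS IAM plays both roles: it verifies the caller's credentials and evaluates the policies that hold the grants.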
To better understand the importance of authentication and authorization, let's consider some real-world examples of service-to-service communication. These examples will highlight the critical role these measures play in ensuring the security and integrity of your REST API services.
Service-to-Service Communication Patterns
Developers spend a lot of time designing customer-facing REST APIs. These APIs introduce the concept of users to the system (human or non-human). AWS offers several user and service authentication options such as Amazon Cognito (Cognito) and AWS Identity and Access Management (IAM), with IAM being the main option for internal user management, while Cognito can connect to external identity providers such as CyberArk and others.
In this post, I'd like to focus on AWS IAM and its role in an important aspect of cloud services: service-to-service authentication and authorization.
Synchronous Communication Pattern
In this use case, there are three services: A, B, and C, each in its own AWS account.
Service A is serverless, B is EC2-based, and C serves a REST API with API Gateway. Both services A and B use service C's API.
*This part discusses public API Gateways. For private API Gateway, visit the appendix below.
However, service C wants to ensure that only these specific services can call its API and perhaps even fine-tune it to be least privileged so that only a particular Lambda function or an EC2 machine can trigger it.
Let's continue to the second pattern.
Asynchronous Communication Pattern
In this use case, we have the same three services as before, A, B, and C, each in its own AWS account.
C holds an SNS topic used as a centralized publisher-subscriber asynchronous communication service in the organization.
Service A is the publisher, and Service B subscribes to the SNS topic via an SQS queue.
Service C wants to ensure that only Service A can publish messages to the SNS topic and that only Service B's SQS can subscribe to the topic as the topic contains data that you should keep private.
This is a standard pattern. You can swap the SNS topic for an EventBridge bus or any other messaging service you may have, such as Amazon MSK.
Single or Multiple Accounts
As a side note, services A, B, and C may share the same account, which simplifies the IAM solution. However, following the IAM practices in this post is essential. Even if all services share the same account, do not take shortcuts. You might find yourself moving them to different accounts in the future. In that case, if you don't follow the best practices, you will have a hard time and many breaking changes ahead of you (from experience!). It's always best to follow best practices, especially when security is involved.
IAM Based Authentication & Authorization
Now that we've covered the basics of service-to-service authentication and authorization issues, let's discuss the solution and start using AWS IAM.
This solution will cover both cross-account and same-account authentication and authorization.
IAM authentication means that a principal, for instance, service A, must be authenticated (signed in to AWS) using its credentials to send a request to other AWS services and resources, or to service C in our example. Once authenticated, we can leverage IAM again to ensure service A is authorized to access service C.
We will combine two aspects of IAM, identity-based and resource-based policies, and provide authorization solutions for both synchronous and asynchronous communications. We will also provide cross-account access, i.e., authorizing services from different AWS accounts using the "assume role" or delegation mechanism.
Resource Based Policies
According to AWS documentation:
Resource-based policies grant the specified principal permission to perform specific actions on that resource and defines under what conditions this applies
Resource-based policies are JSON policy documents that you attach to a resource, such as API Gateway, SNS, S3, or another resource. With resource-based policies, you can specify who has access to the resource and what actions they can perform on it. In addition, they can be used to allow cross-account access. Sounds perfect, right? Well, it has limitations, and not all resource types support it.
Let's review the pros and cons of this IAM mechanism.
Pros:
Simple to define
Enable cross-account access.
Cons:
Not all AWS resources support resource-based policies.
Resource-based policies have a size limit like all IAM policies. As you add more principals and resources to the policy and reach the maximum size, there's no way to overcome it other than provisioning a new resource (duplicating part of service C, basically).
Assume Role Mechanism
The IAM access delegation mechanism, or "assume role," as I call it, is crucial in cross-account communication. However, it can also be used in a single AWS account scenario.
The IAM access delegation mechanism is underpinned by identity-based policies attached to an IAM role. In our example, role C is located in service C's account and grants access to the resource in question, which could be service C's API Gateway or SNS topic.
Whoever has permission to assume role C can get a set of temporary IAM security credentials that can be used for authentication and authorization to communicate with service C, whether to execute a REST API call or publish a message to an SNS topic.
Assuming a role involves making an AWS SDK call to AWS STS (AWS Security Token Service) and using the temporary credentials from the SDK response to initiate a communication session with the service you intend to communicate with. We will explore code examples later in this post to ensure you have a comprehensive understanding of the process.
However, we need to define who can assume this role and gain access to service C. We do that with a resource-based policy attached to the role itself (its trust policy), which states that only the roles of services A and B can assume role C and gain access to service C.
Let's review the pros and cons of this IAM mechanism.
Pros:
Supports all resource types as long as you can define an IAM policy.
Abstracts the resource and its ARN. You get a role that provides you access. The resource can change tomorrow, but all you need to know is the role ARN and what API call to use. Suppose the call is abstracted in an organizational SDK that encapsulates the resource. In that case, you can change resources (for example, SNS topic to EventBridge bus) and keep the changes at the role policy and SDK implementation levels, while the role ARNs remain the same.
Role C's resource-based policy, which defines who can assume the role, has a size limit. However, if we want to add more services, we can provision a new role for new services to assume. Unlike the previous mechanism, it's okay to provision a new resource as the protected resource remains unchanged (API Gateway or SNS topic).
Cons:
It is more complicated to define and requires an extra role in service C.
Assuming a role is another AWS SDK call that extends the overall runtime of services A and B and adds more error-prone code to maintain.
Let's see how we can solve our authentication and authorization issues in synchronous and asynchronous communication patterns with concrete IAM policies and Python code examples.
Synchronous Communication IAM Solution
Let's start by boosting the security of service C, the API Gateway. First, we will add an IAM authorizer to all its API endpoints. By doing so, by default, all unauthenticated and unauthorized requests are denied.
When IAM authorization is enabled, clients must use Signature Version 4 to sign their requests with AWS credentials. API Gateway invokes your API route only if the client has execute-api permission for the route. - AWS documentation
AWS IAM ensures that only requests with valid, authenticated IAM credentials that are also authorized will execute service C's API. All that remains is for service C to define which services (identities) and AWS accounts can execute its APIs.
There are two ways to achieve that:
Resource-based policy
Identity-based policies with the assume role mechanism.
Let's start with a resource-based policy.
Resource Based Policy
In this use case, we need to alter service C's API Gateway resource-based policy to allow services A and B (their Lambda function role, for instance, or their entire account, VPC endpoint, etc.) to execute the API and its endpoints. We can make the definition fine-grained and decide exactly which endpoints each service can communicate with.
Here's an example of such a resource policy for service C API Gateway:
In the policy's Principal element, we can allow an entire account, such as service A's or service B's account, or a specific role ARN in a different account (a better, least-privileged option). The Action element allows executing an API endpoint ('execute-api:Invoke'), and the Resource element sets the exact endpoint and HTTP method.
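A resource policy of this shape can be sketched as a Python dictionary and serialized to JSON. This is a minimal sketch, not the policy from the original post: the account IDs, role names, API ID, stage, and path are all placeholders, and `build_api_resource_policy` is a hypothetical helper.

```python
import json

# All identifiers below are placeholders, not values from the post.
SERVICE_A_ROLE_ARN = "arn:aws:iam::111111111111:role/service-a-lambda-role"
SERVICE_B_ROLE_ARN = "arn:aws:iam::222222222222:role/service-b-ec2-role"
# Format: arn:aws:execute-api:<region>:<account>:<api-id>/<stage>/<METHOD>/<path>
ORDERS_GET_ARN = "arn:aws:execute-api:us-east-1:333333333333:abc123defg/prod/GET/orders"


def build_api_resource_policy(principal_arns, endpoint_arn):
    """Allow only the given principals to invoke one endpoint and HTTP method."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": list(principal_arns)},
                "Action": "execute-api:Invoke",
                "Resource": endpoint_arn,
            }
        ],
    }


policy = build_api_resource_policy(
    [SERVICE_A_ROLE_ARN, SERVICE_B_ROLE_ARN], ORDERS_GET_ARN
)
print(json.dumps(policy, indent=2))
```

Listing specific role ARNs instead of whole accounts keeps the policy closer to least privilege, at the cost of a larger policy document.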
Be advised that at the moment, you cannot use this mechanism on an HTTP API Gateway, just the REST variant.
On their side, services A and B must define their roles (the Lambda function role for service A) with identity-based permissions for the same action specified in the resource policy. For cross-account access, you must define the policy in this two-sided manner. Service A defines its Lambda function's role with permissions to call service C's API Gateway, and service C allows service A to call it from the other account.
In addition, services A and B must sign their requests with their IAM credentials (SigV4) and send the signature in the HTTP Authorization header when calling service C's API Gateway.
Here's a Python example for this process:
In this example, we use our service's role to authenticate with IAM, create the SigV4 auth headers, and send them with the HTTP request.
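As a rough sketch of that flow (not the original embedded example), the snippet below signs a GET request with the caller's own role credentials using botocore's `SigV4Auth` and sends it with the standard library. The API ID, region, stage, and path are placeholders, and `build_endpoint_url` and `call_service_c` are hypothetical helper names.

```python
import urllib.request


def build_endpoint_url(api_id: str, region: str, stage: str, path: str) -> str:
    """Build the default invoke URL of a REST API Gateway stage."""
    return f"https://{api_id}.execute-api.{region}.amazonaws.com/{stage}/{path.lstrip('/')}"


def call_service_c(url: str, region: str):
    """Sign a GET request with the current role's credentials and send it."""
    # Imported lazily so the URL helper works without the AWS SDK installed.
    import boto3
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    # Credentials of the Lambda function (or EC2 instance) role.
    credentials = boto3.Session().get_credentials().get_frozen_credentials()

    # Create the SigV4 headers (Authorization, X-Amz-Date, session token).
    aws_request = AWSRequest(method="GET", url=url)
    SigV4Auth(credentials, "execute-api", region).add_auth(aws_request)

    # Send the signed headers with the actual HTTP call.
    signed = urllib.request.Request(
        url, headers=dict(aws_request.headers.items()), method="GET"
    )
    with urllib.request.urlopen(signed, timeout=10) as response:
        return response.status, response.read()
```

A caller would typically do something like `call_service_c(build_endpoint_url("abc123defg", "us-east-1", "prod", "/orders"), "us-east-1")`; an unsigned or unauthorized request to the same URL is rejected by API Gateway.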
Assume Role
In this use case, we need to create a role in service C's account with permissions to execute the API Gateway and let a principal in services A and B assume it.
We can start by defining the role's trust policy. In this policy, the resource is the role itself.
We let specific roles (principals) from services A and B assume this role (action). We can give a broader scope for services A or B, either to the entire account or to a role with a predefined prefix. However, it's usually best to minimize the scope, so a role prefix or a specific role is better and more secure.
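The two scoping options can be sketched as follows. This is an illustrative sketch with placeholder account IDs and role names; both helper functions are hypothetical. It assumes the prefix variant is expressed with the account as the principal plus an `ArnLike` condition on `aws:PrincipalArn`, because the Principal element itself does not accept partial wildcards.

```python
import json


def build_trust_policy(trusted_role_arns):
    """Trust policy that lets only specific roles assume role C."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": list(trusted_role_arns)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_prefix_trust_policy(account_id: str, role_name_prefix: str):
    """Broader variant: any role in the account whose name starts with a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "ArnLike": {
                        "aws:PrincipalArn": f"arn:aws:iam::{account_id}:role/{role_name_prefix}*"
                    }
                },
            }
        ],
    }


# Placeholder ARNs for the roles of services A and B.
specific = build_trust_policy(
    [
        "arn:aws:iam::111111111111:role/service-a-lambda-role",
        "arn:aws:iam::222222222222:role/service-b-ec2-role",
    ]
)
by_prefix = build_prefix_trust_policy("111111111111", "service-a-")
print(json.dumps(specific, indent=2))
```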
Next, we need to add the permissions to execute service C's API to the role:
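A permissions policy for such a role might look like the sketch below. Being an identity-based policy, it has no Principal element; whoever assumes the role receives these permissions. The endpoint ARNs are placeholders and `build_invoke_permissions` is a hypothetical helper.

```python
import json


def build_invoke_permissions(endpoint_arns):
    """Identity-based policy for role C: invoke only the listed endpoints."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "execute-api:Invoke",
                "Resource": list(endpoint_arns),
            }
        ],
    }


# Placeholder endpoint ARNs: <api-id>/<stage>/<METHOD>/<path>.
role_c_permissions = build_invoke_permissions(
    [
        "arn:aws:execute-api:us-east-1:333333333333:abc123defg/prod/GET/orders",
        "arn:aws:execute-api:us-east-1:333333333333:abc123defg/prod/POST/orders",
    ]
)
print(json.dumps(role_c_permissions, indent=2))
```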
Lastly, the code on service A is very similar to before, with the addition of the assumed role code.
You can find code examples for multiple services here or refer to the code below.
For 'RoleArn,' you need to provide the role that service C creates and shares its ARN with you. Typically, service teams exchange ARNs manually. I'd recommend saving that ARN as an environment variable in the Lambda function. Also, ensure that your Lambda role has the necessary permissions to assume roles.
The code is very similar to the previous example, with the addition of the STS API call and the use of the temporary credentials from its response to sign the request.
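A minimal sketch of that flow, under the assumption that the role ARN arrives from configuration such as an environment variable, could look like this. The helper names and the session name are hypothetical; the STS response fields (`Credentials.AccessKeyId`, `SecretAccessKey`, `SessionToken`) follow the `assume_role` response shape.

```python
def extract_credentials(sts_response: dict):
    """Pull the temporary credentials out of an STS AssumeRole response."""
    creds = sts_response["Credentials"]
    return creds["AccessKeyId"], creds["SecretAccessKey"], creds["SessionToken"]


def signed_headers_with_assumed_role(role_arn: str, url: str, region: str, method: str = "GET"):
    """Assume service C's role, then SigV4-sign a request with its credentials."""
    # Imported lazily so extract_credentials works without the AWS SDK installed.
    import boto3
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest
    from botocore.credentials import Credentials

    # The extra STS call: exchange our own identity for role C's credentials.
    response = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="service-a-session",  # hypothetical session name
    )
    access_key, secret_key, token = extract_credentials(response)

    # Sign with the temporary credentials instead of the function's own role.
    aws_request = AWSRequest(method=method, url=url)
    SigV4Auth(Credentials(access_key, secret_key, token), "execute-api", region).add_auth(aws_request)
    return dict(aws_request.headers.items())
```

The returned headers are sent with the HTTP request exactly as in the resource-based policy flow. Since the temporary credentials are valid for a limited time, caching them between invocations can reduce the cost of the extra STS call.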
If you wish to learn about private API Gateway use cases, refer to the appendix.
Asynchronous Communication Solution
Our goal is to allow service A to publish SNS messages to service C's SNS topic and to let service B subscribe via SQS to the SNS messages.
Let's review the two IAM implementation options we have.
Resource Based Policy
In this case, we will define an SNS access policy that allows service A (principal) to publish messages (action) to the SNS topic of service C (resource) and service B to subscribe to the messages. Be advised it's best to fine-tune these permissions to the role that can publish and to the specific SQS that can subscribe.
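Such an access policy could be sketched as below. The ARNs are placeholders and `build_topic_policy` is a hypothetical helper. Restricting the subscription with the `sns:Protocol` and `sns:Endpoint` condition keys is one way to pin it to a specific queue and is an assumption of this sketch, not something the post prescribes.

```python
import json

# Placeholders, not values from the post.
TOPIC_ARN = "arn:aws:sns:us-east-1:333333333333:central-events"
PUBLISHER_ROLE_ARN = "arn:aws:iam::111111111111:role/service-a-lambda-role"
SUBSCRIBER_ACCOUNT_ID = "222222222222"
SUBSCRIBER_QUEUE_ARN = "arn:aws:sqs:us-east-1:222222222222:service-b-queue"


def build_topic_policy(topic_arn, publisher_role_arn, subscriber_account_id, queue_arn):
    """SNS access policy: one publisher role, one SQS subscriber."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowServiceAPublish",
                "Effect": "Allow",
                "Principal": {"AWS": publisher_role_arn},
                "Action": "sns:Publish",
                "Resource": topic_arn,
            },
            {
                "Sid": "AllowServiceBSubscribeQueue",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{subscriber_account_id}:root"},
                "Action": "sns:Subscribe",
                "Resource": topic_arn,
                # Only this queue, and only over the SQS protocol.
                "Condition": {
                    "StringEquals": {
                        "sns:Protocol": "sqs",
                        "sns:Endpoint": queue_arn,
                    }
                },
            },
        ],
    }


topic_policy = build_topic_policy(
    TOPIC_ARN, PUBLISHER_ROLE_ARN, SUBSCRIBER_ACCOUNT_ID, SUBSCRIBER_QUEUE_ARN
)
print(json.dumps(topic_policy, indent=2))
```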
On the service A side, the Lambda role will define its permissions to publish to the SNS topic. Having the two sides define the permissions allows them to work when dealing with cross-account access.
When publishing a message to the topic, service A's Lambda function will utilize the AWS SDK to make the API call. The SDK uses the function role's credentials to take care of the IAM authentication and authorization side.
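A publish call along these lines might look like the following sketch. The topic ARN and event payload are placeholders, and the helper names are hypothetical. The client is created in the topic's region, which the sketch reads from the ARN, since the topic lives in service C's account and possibly another region.

```python
import json


def build_publish_kwargs(topic_arn: str, event: dict) -> dict:
    """Arguments for the SNS Publish call; the event is sent as a JSON string."""
    return {"TopicArn": topic_arn, "Message": json.dumps(event)}


def publish_event(topic_arn: str, event: dict) -> str:
    """Publish using the Lambda function role's own credentials."""
    # Imported lazily so build_publish_kwargs works without the AWS SDK installed.
    import boto3

    region = topic_arn.split(":")[3]  # arn:aws:sns:<region>:<account>:<name>
    sns = boto3.client("sns", region_name=region)
    # The SDK signs the request with the function role's credentials.
    return sns.publish(**build_publish_kwargs(topic_arn, event))["MessageId"]
```

No credentials appear in the code: the SDK resolves them from the execution environment, and IAM evaluates both service A's identity policy and the topic's access policy.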
On the service B side, we need to define an SQS subscription to the SNS topic by following the documentation here.
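On the queue side, the subscription only delivers messages if the queue's own policy lets SNS send to it. A sketch of such a queue policy, and of the subscribe call made from service B's account, is shown below. The ARNs are placeholders and the helper names are hypothetical.

```python
import json


def build_queue_policy(queue_arn: str, topic_arn: str) -> dict:
    """SQS queue policy: accept messages from SNS, but only from one topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }


def subscribe_queue(topic_arn: str, queue_arn: str) -> str:
    """Subscribe service B's queue to service C's topic."""
    # Imported lazily so build_queue_policy works without the AWS SDK installed.
    import boto3

    sns = boto3.client("sns", region_name=topic_arn.split(":")[3])
    response = sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
    return response["SubscriptionArn"]


queue_policy = build_queue_policy(
    "arn:aws:sqs:us-east-1:222222222222:service-b-queue",
    "arn:aws:sns:us-east-1:333333333333:central-events",
)
print(json.dumps(queue_policy, indent=2))
```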
Assume Role
In this case, we need to create a role in service C's account with permission to publish messages to the SNS topic. Then, we let a specific role of service A assume this role.
We can do this by defining the role's trust policy.
We can give a broader scope for service A to the entire account or a role with a predefined prefix. However, it's usually best to minimize the scope. Hence, a role prefix or a specific role is better and more secure.
Next, we need to add to the role the permissions to publish messages to service C's SNS topic:
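A sketch of that permissions policy, with a placeholder topic ARN and a hypothetical helper name, could be as small as this:

```python
import json


def build_publish_permissions(topic_arn: str) -> dict:
    """Identity-based policy for service C's role: publish to one topic only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sns:Publish",
                "Resource": topic_arn,
            }
        ],
    }


publish_permissions = build_publish_permissions(
    "arn:aws:sns:us-east-1:333333333333:central-events"  # placeholder ARN
)
print(json.dumps(publish_permissions, indent=2))
```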
Now, on service A's side, we need to assume the role and use the AWS SDK to send SNS messages. It's also important to ensure the Lambda function role has the 'sts:AssumeRole' permission for service C's role and that the assumed role grants 'sns:Publish'; otherwise, the AWS SDK calls will fail.
The service teams need to exchange the resource ARNs and account numbers for the policies to work.
Service A assumes the role with an SDK call and uses the temporary IAM credentials to create a boto client for the SNS publish message SDK call.
Service B remains as in the resource-based policy example; its subscription still relies on a resource policy that allows its SQS queue to subscribe. Service A, however, requires some code changes.
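The flow can be sketched as follows, assuming the role and topic ARNs were exchanged between the teams and are passed in as configuration. The helper names and the session name are hypothetical.

```python
def client_kwargs_from_sts(sts_response: dict, region: str) -> dict:
    """Map an STS AssumeRole response to boto3 client keyword arguments."""
    creds = sts_response["Credentials"]
    return {
        "region_name": region,
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def publish_via_role(role_arn: str, topic_arn: str, message: str) -> str:
    """Assume service C's role and publish with its temporary credentials."""
    # Imported lazily so client_kwargs_from_sts works without the AWS SDK installed.
    import boto3

    response = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="service-a-publisher",  # hypothetical session name
    )
    region = topic_arn.split(":")[3]  # arn:aws:sns:<region>:<account>:<name>
    sns = boto3.client("sns", **client_kwargs_from_sts(response, region))
    return sns.publish(TopicArn=topic_arn, Message=message)["MessageId"]
```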
Choosing Between Assume Role and Resource Based Policies
Your service communication will be secure with authentication and authorization, whether you choose resource-based policies or the 'assume role' solution. However, each implementation has its pros and cons that can make your life harder in the future if you ignore them.
I'll divide my recommendation by different parameters.
Suppose I had to choose just one implementation. In that case, I'd go with the 'assume role' path, as it allows my service to support multiple services in the future easily. I can create more roles to support more services that assume them and I'm not limited by IAM policy size.
However, resource-based policies are better if you only care about performance, as they don't add another SDK call to assume the role. Keep in mind that these policies have a maximum size. Suppose you expect to connect many services and different AWS accounts (think of a central pub-sub account or central API). In that case, you will encounter these limitations at some point in the future. As it doesn't make sense to duplicate the SNS topic or API Gateway for integration with new services, you are better off choosing the 'assume role' path: it's easier to provision new roles than to duplicate an SNS topic or an API Gateway.
Another deciding factor is whether the services are in the same account and who maintains them - the same team or not. The resource-based policy is excellent for internal service or micro-service communication when the same team maintains them, as it introduces some degree of coupling. Still, it is acceptable as it "stays" in the family. However, suppose different accounts and teams are involved. In that case, the 'assume role' route is better as it decouples the resources and teams in question and supports endless future extensions.
Lastly, you can always change the implementation, so don't be afraid to choose; just make sure you select one of these two options.
Summary
In this post, we have learned about authentication and authorization. We have also learned about two service-to-service communication patterns: asynchronous and synchronous.
We have implemented both authentication and authorization for those patterns using AWS IAM. We saw four different implementations and discussed their pros and cons, whether the resource-based policies route or the 'assume role' one.
In the following posts, we will discuss the challenges these patterns bring, how to solve them, and how to take authorization another step forward into the fine-grained domain.
Appendix: Private and Public API Gateways
AWS recommends building private API Gateways for service-to-service communication to enhance performance, reduce network costs (you don't leave the AWS network), and improve security.
...traffic to your private API uses secure connections and does not leave the Amazon network—it is isolated from the public internet - AWS documentation.
In reality, it's possible and easier to build APIs as public API Gateways. In that case, authentication and authorization become more critical and you must follow the guidelines in this post.
Private API Gateways require VPCs. They bring extra complexity as connecting services also need to use VPC and VPC endpoints.
AWS recommends connecting the services' VPC networks via resource policies for VPC endpoints or VPC peering. You can read more about it here.
When you use serverless and Lambda functions and wish to communicate with a private API Gateway, you need to put your Lambda functions inside VPCs. This is not ideal, as VPCs have unwanted effects on the Lambda functions, such as longer cold starts and increased costs.
VPCs Don't Replace Authentication and Authorization
I want to set this point straight.
Setting up a network connection between different VPC endpoints does not mean you implemented service authentication or authorization or that you are 100% secure.
Yes, it brings up an extra layer of security, but it does not replace IAM-based authentication and authorization.
First, it's not least privilege; any service inside those VPC networks can access your service. In addition, it does not scale. The more services you add, the more exposed ("breached") your service becomes, as more VPCs and services gain access.
In addition, if attackers gain access to one of the VPCs, they can communicate freely with your service because your service accepts any incoming calls.
However, combining IAM authentication and authorization with VPC provides the best and most comprehensive security for your API Gateway.
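One way to express that combination in a private API's resource policy is to require both a specific IAM principal and a specific VPC endpoint. The sketch below assumes the `aws:SourceVpce` condition key is used for the network restriction; the role ARN, endpoint ARN, and VPC endpoint ID are placeholders, and `build_private_api_policy` is a hypothetical helper.

```python
import json


def build_private_api_policy(role_arn: str, endpoint_arn: str, vpce_id: str) -> dict:
    """Allow invoke only for one role AND only through one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},  # IAM layer: who is calling
                "Action": "execute-api:Invoke",
                "Resource": endpoint_arn,
                # Network layer: where the call comes from.
                "Condition": {"StringEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


private_policy = build_private_api_policy(
    "arn:aws:iam::111111111111:role/service-a-lambda-role",
    "arn:aws:execute-api:us-east-1:333333333333:abc123defg/prod/GET/orders",
    "vpce-0123456789abcdef0",
)
print(json.dumps(private_policy, indent=2))
```

With a statement like this, a caller inside the VPC that lacks the right role is still denied, which addresses the least-privilege gap described above.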