Saturday, June 9, 2018

Building An AWS Multi-region Serverless Application With A Single Lambda, Multi-master Database, And Deep Ping Healthchecks

Introduction

Amazon published this excellent article: https://aws.amazon.com/blogs/compute/building-a-multi-region-serverless-application-with-amazon-api-gateway-and-aws-lambda, which moves us closer to pagerless computing (http://jimshowalter.blogspot.com/2018/06/the-promise-of-pagerless-computing.html).

But Amazon's solution has multiple lambdas, depends on API Gateway, and has no backend.

This article shows how to start with that solution, and:
  • Modify it to have a single lambda.
  • Connect to a DynamoDB backend that uses global tables (so the entire end-to-end stack fails over).
  • Check health with a tunable combination of shallow (front-end only) and deep (back-end) pings.
  • Keep almost the entire solution in the free tier.
There might be some controversy about the first two items, so let's start by addressing those:

Q: Why a single lambda?
  • We want the health check to be a reliable indicator that the application is working. If the health check hits its own lambda, separate from the constellation of N lambdas constituting the application, it could report that the health check lambda is working fine while the application is actually broken. (The terminology is a bit tricky here. There's actually a pool of lambdas running in instances, not just one lambda. Our point is that each instance is running the same code.)
  • We want to use the health check to keep the lambda warm, and hit the same lambda with our other calls (in this example, just "hello", but any number of other calls can hit the same lambda), so calls rarely encounter a cold start. (UPDATE: AWS introduced provisioned concurrency, which keeps a pool of lambdas ready to use, but it has to be paid for. We want to exploit the free tier as much as possible.)
  • We want to reuse the database connection as much as possible (although establishing a connection to DynamoDB is fast anyway).
  • We don't want to have to manage N different lambda files, particularly when they hit the same shared database and need much of the same code.
There are arguments for and against single lambdas, and much active discussion online. One of the cons is that a larger lambda takes longer to cold start, but our lambda really isn't very big (and we're not packaging very much with it).

For our particular use case, a single lambda works well.

Our design minimizes use of API Gateway:
  • Because we only have one lambda, we can only have one handler (an AWS limitation), and that drives us towards lambda proxy integration, which inherently minimizes the API Gateway configuration.
  • Using lambda proxy integration makes it easy to change our client-side application and lambda implementation without having to go back in and add more configuration to API Gateway. It speeds up development.
There are arguments for and against lambda proxy integration, and much active discussion online (for example, https://www.stackery.io/blog/why-you-should-use-api-gateway-proxy-integration-with-lambda/). Security is one of the main cons. But in our use case, the client application needs almost exactly the same permissions for all operations, so we're not appreciably increasing our attack surface by having a single endpoint.
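
With proxy integration, API Gateway hands the lambda the entire request as a single event, and the lambda is responsible for returning the complete HTTP response. The createResponse helper that shows up in the dispatcher later in this article does roughly the following (a sketch based on the proxy-integration response contract; the actual helper in thelambda.js may differ in detail):

// Sketch of a Lambda proxy integration response helper. API Gateway
// expects an object with statusCode, headers, and a string body.
function createResponse(statusCode, message) {
    return {
        statusCode: statusCode,
        headers: {
            "Content-Type": "application/json",
            "Access-Control-Allow-Origin": "*" // CORS header for the browser-based failover demo (assumption)
        },
        body: JSON.stringify({ message: message })
    };
}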

Note: This design doesn't address authentication, and adding authentication might require making more use of API Gateway. It might also introduce a snag, because Cognito doesn't sync user pools across regions.

Select Two Regions

Pick two regions that you will use throughout.

Because this is a multi-region failover solution, it probably doesn't make sense to pick regions on different continents.

We chose us-east-1 and us-west-2.

Create The Deep Ping Table In Both Regions

Select one of the regions.

Go to the DynamoDB service.

Select "Create Table".
  • Enter "prod.Hello" for the table name.
  • Enter "HelloKey" (type string) for the primary key.
  • Ignore the sort key.
Based on experience, we had to make autoscaling kick in earlier than the defaults, to allow for scale-up time:
  • Under "Table settings" uncheck "Use default settings".
  • In "Autoscaling", "Read capacity":
    • Set "Target utilization" to 50%.
    • Set "Minimum provisioned capacity" to 1.
    • Uncheck "Apply same settings to global secondary indexes".
  • In Autoscaling, "Write capacity":
    • Check "Same settings as read".
Click "Create Table".

Wait for the table to be created.

Click "Global Tables".
  • Click "Enable streams".
  • Click "Add region".
  • Select your second region.
  • Click Continue.
  • Wait for the table in the other region to be created.
Go to Items.
  • Click "Create item".
  • Switch to Text view.
  • For the HelloKey, enter "Hello".
  • Add an attribute Hello, with text "Deep hello from " (note the trailing space).
  • It should look like this:
{
  "HelloKey": "Hello",
  "Hello": "Deep hello from "
}
Click Save.

Switch to the other region, and verify that your item propagated automatically.

Notes:
  • Do not enable encryption. If you do that, the deep pings will exceed the number of Key Management Service requests in the free tier.
  • You may be tempted to enable Point-in-time recovery. Don't bother. It doesn't work for global tables. (Neither does optimistic locking.)
  • You may see an alarm similar to "Consumed read capacity < 0.3 for 15 minutes TargetTracking-table/prod.Hello-AlarmLow-some-uuid-string". Ignore this. The alarms are supposed to help you avoid overprovisioning, but it's not possible to provision less than 1 capacity unit, so this warning is pedantic. This has been reported to AWS (they filed a bug).
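
If you'd rather script this section than click through the console, the table setup can be approximated with the AWS SDK for JavaScript. This is a hedged sketch of the equivalent calls, not what we actually ran (the console steps above are authoritative, and the autoscaling targets would still need the separate Application Auto Scaling API, which is omitted here):

const AWS = require("aws-sdk");

const REGIONS = ["us-east-1", "us-west-2"];

async function createHelloTable(region) {
    const dynamodb = new AWS.DynamoDB({ region: region });
    // Streams (new and old images) must be enabled before a table can join a global table.
    await dynamodb.createTable({
        TableName: "prod.Hello",
        AttributeDefinitions: [{ AttributeName: "HelloKey", AttributeType: "S" }],
        KeySchema: [{ AttributeName: "HelloKey", KeyType: "HASH" }],
        ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 },
        StreamSpecification: { StreamEnabled: true, StreamViewType: "NEW_AND_OLD_IMAGES" }
    }).promise();
    await dynamodb.waitFor("tableExists", { TableName: "prod.Hello" }).promise();
}

async function setUp() {
    for (const region of REGIONS) {
        await createHelloTable(region);
    }
    // Link the per-region tables into one global table.
    const dynamodb = new AWS.DynamoDB({ region: REGIONS[0] });
    await dynamodb.createGlobalTable({
        GlobalTableName: "prod.Hello",
        ReplicationGroup: REGIONS.map(function (region) { return { RegionName: region }; })
    }).promise();
    // Seed the item the deep ping reads; it replicates to the other region automatically.
    const documentClient = new AWS.DynamoDB.DocumentClient({ region: REGIONS[0] });
    await documentClient.put({
        TableName: "prod.Hello",
        Item: { HelloKey: "Hello", Hello: "Deep hello from " } // note the trailing space
    }).promise();
}

setUp().catch(console.error);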

Set Up The Front End

Go to https://aws.amazon.com/blogs/compute/building-a-multi-region-serverless-application-with-amazon-api-gateway-and-aws-lambda.

Git clone that article's git repo.

Follow the steps under Prerequisites.

When your buckets are created:
  • Go to Properties for each of them and enable encryption, AES-256.
  • Go to Management for each of them and set up an expiration lifecycle rule of 1 day for everything. (There's no reason to keep temporary uploads. A scripted sketch of both bucket settings follows this list.)
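
For reference, here is a hedged SDK sketch of those two bucket settings (we set them in the console; replace the bucket name with yours):

const AWS = require("aws-sdk");

const s3 = new AWS.S3();

// Sketch: default AES-256 encryption plus a 1-day expiration rule
// for a SAM deployment bucket.
async function configureBucket(bucketName) {
    await s3.putBucketEncryption({
        Bucket: bucketName,
        ServerSideEncryptionConfiguration: {
            Rules: [{ ApplyServerSideEncryptionByDefault: { SSEAlgorithm: "AES256" } }]
        }
    }).promise();
    await s3.putBucketLifecycleConfiguration({
        Bucket: bucketName,
        LifecycleConfiguration: {
            Rules: [{
                ID: "ExpireEverythingAfterOneDay",
                Status: "Enabled",
                Prefix: "", // applies to everything in the bucket
                Expiration: { Days: 1 }
            }]
        }
    }).promise();
}

configureBucket("<yourbucketname>").catch(console.error);
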
In helloworld-api, move helloworld-sam.yaml to a backup.

Download https://github.com/jimshowalter/failover/blob/master/helloworld-sam.yaml to helloworld-api.

Note: There is a commented-out section in the downloaded yaml that shows how to set up permissions for a table you write to, not just read from. That might come in handy later (but not in this article).

Download https://github.com/jimshowalter/failover/blob/master/thelambda.js to helloworld-api.

Read through thelambda.js to get a feel for how it works (there are a lot of comments).
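
The interesting part is the deep ping: instead of just returning a canned string, it round-trips to DynamoDB and appends the region, so a successful response proves the whole stack in that region is healthy. Conceptually it boils down to something like this (a sketch; the actual code and names in thelambda.js differ):

const AWS = require("aws-sdk");

// Created outside the handler so warm invocations reuse the client.
const documentClient = new AWS.DynamoDB.DocumentClient();

// Deep ping: read the seeded item and append the region we're running in,
// e.g. "Deep hello from us-east-1".
async function deepHello() {
    const result = await documentClient.get({
        TableName: "prod.Hello",
        Key: { HelloKey: "Hello" }
    }).promise();
    return result.Item.Hello + process.env.AWS_REGION;
}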

In thelambda.js, follow the instructions for generating a random string, and use it to replace <PUT YOUR UNIQUE ID HERE>.
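
One quick way to generate such a string is Node's crypto module (just one possibility; the comments in thelambda.js may suggest another):

node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"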

Continue Amazon's blog where it says "You can only use SAM from the AWS CLI, so do the following from the command prompt" (execute the two sets of bash commands documented there).

Go to CloudWatch, Log Groups in the console and, for both regions, set "Expire Events After" to 1 day (the health checks generate a lot of logs).

Note: This only deletes log events--log streams are kept (even though they are empty). You should probably periodically clear out the empty log streams. AWS has filed an enhancement request to also delete the log streams.
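
If you prefer to script the retention setting, the CloudWatch Logs API can do it. A sketch (adjust the log group name to whatever SAM actually created for your function):

const AWS = require("aws-sdk");

// Sketch: set 1-day retention on the lambda's log group in one region.
async function setRetention(region, logGroupName) {
    const cloudWatchLogs = new AWS.CloudWatchLogs({ region: region });
    await cloudWatchLogs.putRetentionPolicy({
        logGroupName: logGroupName,
        retentionInDays: 1
    }).promise();
}

setRetention("us-east-1", "/aws/lambda/<yourfunctionname>").catch(console.error);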

Configure the endpoints to be regional as shown in Amazon's blog.

You should see a different API Gateway than what is shown in Amazon's blog:



Note that there is no /helloworld after /prod in the invoke URL.

Note: AWS SAM is what creates the stage Stage. CloudFormation doesn't do that. It's a known issue: https://github.com/awslabs/serverless-application-model/issues/191. Supposedly you can fix it by changing the template to use resource type "AWS::ApiGateway::RestApi" instead of "AWS::Serverless::Api". We left it as is, because it works, but it might be interesting to convert the template.

Where Amazon says to test with curl, replace the curl command with (adjusting the region if you didn't use us-east-1):
curl https://<yourinternaldomain>.execute-api.us-east-1.amazonaws.com/prod/health?healthCheckerId=<PUT YOUR UNIQUE ID HERE>
You should see:
{"message":"Shallow hello from us-east-1"}
or:
{"message":"Deep hello from us-east-1"}
depending on the deep-ping threshold percentage (you can keep executing the command until you see it flip from shallow to deep or vice-versa).
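
We haven't reproduced thelambda.js here, but conceptually the shallow/deep flip is just a percentage check against an environment variable, along these lines (a sketch; the variable name is our placeholder, not necessarily the one in thelambda.js):

// Sketch: decide whether this health check should go deep (hit DynamoDB)
// or stay shallow (front end only), based on a tunable percentage.
function shouldDeepPing() {
    const deepPingPercent = Number(process.env.DEEP_PING_PERCENT || "10");
    return Math.random() * 100 < deepPingPercent;
}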

Similarly:
curl https://<yourotherinternaldomain>.execute-api.us-west-2.amazonaws.com/prod/health?healthCheckerId=<PUT YOUR UNIQUE ID HERE>
You should see:
{"message":"Shallow hello from us-west-2"}
or:
{"message":"Deep hello from us-west-2"}
(again adjusting the region if you didn't use us-west-2).

Continue Amazon's blog, with the "Create the custom domain name" section, but be careful in the dialog for "New Custom Domain Name" to set the base path mapping destination to "thelambda" instead of "multiregion-hello".

Continue Amazon's blog, with the "Deploy Route 53 setup", but once you complete that section, go to Route 53 in the console and edit both health checks so their path is:
prod/health?healthCheckerId=<PUT YOUR UNIQUE ID HERE>
Note: This manual step wouldn't be necessary if someone could get https://github.com/jimshowalter/failover/blob/master/route53dns.yaml to work.
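
Until that template works, the manual edit can at least be scripted, because Route 53's UpdateHealthCheck accepts a resource path. A sketch (get the health check IDs from the console or from list-health-checks):

const AWS = require("aws-sdk");

// Route 53 is global, so no region is needed.
const route53 = new AWS.Route53();

// Sketch: point an existing health check at the health endpoint, passing
// the shared-secret health checker id as a query parameter.
async function fixHealthCheckPath(healthCheckId, healthCheckerId) {
    await route53.updateHealthCheck({
        HealthCheckId: healthCheckId,
        ResourcePath: "/prod/health?healthCheckerId=" + healthCheckerId
    }).promise();
}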

In both of the health checks, add an alarm that sends you email.

Optional: Our application is specific to the United States, so we didn't see the point of beating on it from areas outside the U.S. You can configure health checks to run from fewer locations by going to "Advanced configuration", "Health checker regions", "Customize", and deleting ones you don't want (down to a minimum of three).

Note: Your health checks will wind up in us-east-1, because that's where all health checks live (according to AWS support). Route 53 is global, and has no notion of what region a health check is for (and it can check the health of URLs not in AWS).

Continue Amazon's blog, with the "Using the Rest API from server-side applications" section, but change the curl URL to:
https://hellowordapi.<replacewithyourcompanyname>.com/v1/prod/hello?healthCheckerId=<PUT YOUR UNIQUE ID HERE>
Continue Amazon's blog, with the "Testing failover of Rest API in browser" section, but change client.js to:
$.ajax({
    url: 'https://hellowordapi.<replacewithyourcompanyname>.com/v1/prod/hello',
    data: {
        "healthCheckerId": "<PUT YOUR UNIQUE ID HERE>"
    },
    dataType: "json",
When Amazon's blog says to set the environment variable STATUS to fail in the Lambda console, note that we called it FORCE_FAIL, and that you set it to true rather than fail.
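
Inside thelambda.js the forced failure is just an environment-variable check near the top of the health path, along these lines (a sketch; the exact wording of the response differs):

// Sketch: returning a 500 when FORCE_FAIL is "true" makes the Route 53
// health check fail, which is what triggers the failover.
function forcedFailureResponse() {
    if (process.env.FORCE_FAIL !== "true") {
        return null;
    }
    return {
        statusCode: 500,
        body: JSON.stringify({ message: "Failed (forced by FORCE_FAIL)" })
    };
}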

Verify that you receive an email for the failover.

Continue through the rest of Amazon's blog.

Notes


Updating The Displayed Region

The Amazon blog says: "During an emulated failure like this, the browser might take some additional time to switch over due to connection keep-alive functionality". We haven't been able to get the browser to update. It seems to just cache the call, even if we go to developer tools and disable caching. But if we relaunch the browser, it displays the failed-over-to region. It would be great if someone could figure out how to reliably make the browser display fail over automatically, because it would be a much more compelling demo.

Spurious Health Checks Warning

If you look at the health checks that are mapped to the regional endpoints, you may notice a warning: "The selected health check specifies the endpoint by domain name. Confirm that the name of this resource record set isn’t the same as the domain name in the associated health check. If the names match, health checking won’t work correctly." That warning is spurious. It should only display that warning if the domains are the same. This has been reported to AWS (they agree it's a bug).

Performance

Review the CloudWatch logs to get a feel for how fast your lambda executes. You'll see three basic durations: sub-millisecond calls (presumably shallow pings), calls of roughly 80-150 ms, which we think are deep pings, and the occasional much longer call, which we think is a cold start.

Running Lambdas From Console

To run thelambda from the Lambda console, you need to supply the path parameter and the healthCheckerId. To do that, go to the Test dropdown, select "Configure test events", and enter:
{
  "pathParameters": {
    "proxy": "health"
  },
  "queryStringParameters": {
    "healthCheckerId": "<PUT YOUR UNIQUE ID HERE>"
  }
}
or:
{
  "pathParameters": {
    "proxy": "health"
  },
  "healthCheckerId": "<PUT YOUR UNIQUE ID HERE>"
}

Experimenting With Environment Variables

You can experiment with different settings by going to the Lambda console and changing the environment variables. For example, you can increase or decrease the deep-ping percentage.

Incomplete Requests When Running Lambdas From Console

When running lambdas directly from the Lambda console, keep in mind that the request event doesn't have content like what is shown in https://docs.aws.amazon.com/lambda/latest/dg/eventsources.html#eventsources-api-gateway-request, because the request didn't come through the API Gateway.
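
In practice that just means guarding against missing fields, so a console test event that only has what you typed doesn't blow up on undefined. A sketch:

// Sketch: tolerate console test events that lack the fields API Gateway
// normally provides with Lambda proxy integration.
function getHealthCheckerId(event) {
    const queryStringParameters = event.queryStringParameters || {};
    // Fall back to a top-level field so the second test-event format shown above also works.
    return queryStringParameters.healthCheckerId || event.healthCheckerId || null;
}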

Restricting Callers By IP Address

There's another way to restrict healthcheck calls: by IP address. You could set an address-based policy that only allows health checkers, and you, to call the health and hello endpoints. That would get rid of the need for the health check ID, which would simplify things a lot.

However, while you might be able to specify origin addresses for yourself that are stable, you can't do that for health checkers, because the addresses change.

It's possible to find out the current values from https://ip-ranges.amazonaws.com/ip-ranges.json, and you can subscribe to the SNS topic "AmazonIpSpaceChanged" to be notified whenever the AWS IP address ranges change, so it would be possible to write some tooling that updates your whitelisting and redeploys.
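
A sketch of the first half of such tooling, assuming the health checker ranges continue to be published under the ROUTE53_HEALTHCHECKS service in ip-ranges.json (which is how they appear today):

const https = require("https");

// Sketch: pull the current Route 53 health checker CIDR ranges, which could
// then be fed into an IP-based resource policy and redeployed.
function fetchHealthCheckerRanges(callback) {
    https.get("https://ip-ranges.amazonaws.com/ip-ranges.json", function (response) {
        let body = "";
        response.on("data", function (chunk) { body += chunk; });
        response.on("end", function () {
            const ranges = JSON.parse(body).prefixes
                .filter(function (prefix) { return prefix.service === "ROUTE53_HEALTHCHECKS"; })
                .map(function (prefix) { return prefix.ip_prefix; });
            callback(null, ranges);
        });
    }).on("error", callback);
}

fetchHealthCheckerRanges(function (error, ranges) {
    if (error) { throw error; }
    console.log(ranges);
});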

But why does it have to be so difficult? AWS could add a principal for "healthchecker", and whitelist those, internally maintaining whatever information they need to keep track of the healthcheck IP addresses. AWS has filed an enhancement request for this.

Paid Support

We got stuck on parts of this project, being entirely new to AWS and using this configuration to learn. While there was a lot of information online, sometimes our questions required talking with an expert (and a few times we demoed bugs we found). If you find yourself in this situation, sign up for paid support. If you don't need to screen share, you can probably get by with Developer. We wound up signing up for Business. It's only $100/month, they respond very quickly, the engineers who respond are experts, and you can shut it off once you're up to speed.

Opportunities For AWS Improvement


Single Handler Per Lambda

The biggest pain point for us is the restriction that a lambda can have only one handler. If that weren't the case, we could have used API Gateway directly, without lambda proxy integration, and wouldn't have needed a central dispatcher:
    if (utilsService.isNullOrEmpty(requestKind)) {
        response = createResponse(400, "INVALID REQUEST (missing requestKind)!!!");
    } else if (requestKind.indexOf("INVALID REQUEST (event requestKind ") === 0) {
        response = createResponse(400, requestKind);
    } else if (requestKind === "hello") {
        if (!healthCheckerIdMatch(event)) {
            response = createResponse(500, "Failed");
        } else {
            response = helloHandler(event, context);
        }
    } else if (requestKind === "health") {
        if (!healthCheckerIdMatch(event)) {
            response = createResponse(500, "Failed");
        } else {
            doCallBack = false;
            healthHandler(event, context, callback);
        }
    // TODO: Add more else/ifs for your application's functions, using something other than the healthCheckId for authentication.
Each of the request kinds would just have been separate handlers on the shared lambda.
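
In other words, we'd like to export each request kind as its own handler on the same lambda and have API Gateway route to them directly, something like this (purely hypothetical; today each exported handler would have to be wired to a separate Lambda function, defeating the shared warm pool):

// Hypothetical: multiple handlers on one lambda, with API Gateway routing
// /hello and /health to them directly instead of through a dispatcher.
exports.hello = async function (event) {
    // ... hello logic ...
};

exports.health = async function (event) {
    // ... shallow/deep ping logic ...
};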

Unsecured Health Check Endpoints

The second biggest pain point was having to use a cumbersome shared secret to protect the health and hello endpoints from malicious callers. Being able to whitelist health checkers as principals, plus our IP addresses, would have cleaned that up a lot.

Almost But Not Quite In Free Tier

We don't want to sound like ingrates, because AWS gives individuals access to billions of dollars of infrastructure for a few bucks a month, but... it would be great if the entire failover solution fit in the free tier. That way, anyone could start out knowing that their application was going to be secure, available, and performant before writing a single line of their application-specific code, and they would only be charged for their application. It's not a huge ask--pretty much the only costs in the configuration come from Route 53 (for string-matching health checks) and API Gateway.

Incomplete Scriptability

The Amazon article ends with "The setup was fully scripted using CloudFormation, the AWS Serverless Application Model (SAM), and the AWS CLI", but that's not really true. Much of it is, but parts aren't. For example, having to set endpoints to regional. We don't know if that was just the author not knowing how to do it through SAM, or if it's a gap. Ideally the whole configuration could be created from only templates, with no manual intervention.

Little Gaps In Managed Services

Even managed services can have little misses on the part of AWS. For example, deleting CloudWatch log events but not the log streams that contain them. Ideally every managed service could be configured to be completely automated, including garbage collection.

Productize Entire Configurations As Packaged Offerings

The opportunities listed above are tactical--they're minor improvements that would streamline things and make them a bit easier. But the biggest opportunity we see is for AWS to embrace the cross-region failover configuration as a first-class citizen. Imagine if it was a fully worked out bundled solution that could simply be ordered up from the console (or just from a web page). Lightsail on steroids. Up pops a wizard that asks for a domain name, regions, backend database (which pretty much is limited currently to DynamoDB, pending Aurora multi-master GA), and a few other pieces of information, then some time passes and boom, here is a complete failover solution, with a place for the developer to add their code and schema.

We want AWS to be a super-reliable, super-fast, super-available, planet-wide app server that requires no care or feeding on our part.

The Promise Of Pagerless Computing

Serverless computing is all the rage, as well it should be, but serverless is a how.

Pagerless is the why.

https://twitter.com/_siddharth_ram/status/992279542835306497

Maybe you're a developer who really enjoys being on-call, yanked out of sleep at 3:46 am to heroically deal with some kind of production crisis. Maybe you really like configuring CIDR blocks and subdomains and bastion hosts, and subscribing to security alerts, and keeping your machine images up to the latest patch levels, and paying for licenses, and so on. Maybe you actually miss buying physical hardware and installing it at the "co-lo".

If you are that kind of developer, AWS would love to hire you. You'll fit right in!

But for the rest of us, that's all undifferentiated heavy lifting. Yes, it's critically necessary, but there's nothing application-specific in any of it. We can't even use it to make our application stand out relative to other applications in terms of uptime, security, performance, etc., because these days users just expect stuff to work. The only way to stand out operationally is by screwing up.

Unless devops is a core competency, it's irresponsible not to outsource it to an army of technicians in white lab coats who specialize in this stuff.

Amazon didn't always understand this, but they learned (https://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html): "It became obvious that developers strongly preferred simplicity to fine-grained control as they voted "with their feet" and adopted cloud-based AWS solutions, like Amazon S3 and Amazon SimpleDB, over Dynamo. Dynamo might have been the best technology in the world at the time but it was still software you had to run yourself. And nobody wanted to learn how to do that if they didn't have to. Ultimately, developers wanted a service."

Bingo.

Instead of going "The developers aren't doing it right", Amazon went "Huh, that's weird, why are they doing that?", and learned from the answer.

Today, an individual developer can set up a cross-region multi-master HA/DR system in AWS for a couple bucks a month (https://aws.amazon.com/blogs/compute/building-a-multi-region-serverless-application-with-amazon-api-gateway-and-aws-lambda). It would have cost millions and millions of dollars to do that 20 years ago. Many reasonably large companies couldn't have pulled it off. It's the democratization of operational excellence.

With everything managed, developers only need to be paged when their software--the part where they are the experts--goes insane.

A lazy programmer is a good programmer.

See also:

http://jimshowalter.blogspot.com/2018/06/building-aws-multi-region-serverless.html

https://read.acloud.guru/simon-wardley-is-a-big-fan-of-containers-despite-what-you-might-think-18c9f5352147