Thursday, January 5, 2023

Lambda SnapStart is Harder than I Thought

Apologies: somewhat in violation of my own rules, I don't have any actual working example code. This is too complex for that, and there were just too many moving parts. But if you want any help or conversation about this, drop me a line and I'll do/share whatever I can.

When I saw that AWS had released a "SnapStart" feature for Lambda, I was ecstatic. I have taken to using Lambda as a way of delivering servers with minimum fuss, but I have somewhat abused the technology by basically moving my existing server-based code into a lambda, along with its long start time.

I grew up in a world where the logic has always been that what you want to optimize is the time spent doing frequent operations; initialization time is effectively "free" (up to the point where it becomes minutes or hours, at which point you really have to do something about it). Lambda, on the other hand, says you have 10s to initialize: if you fail to complete in that time, the attempt is abandoned and the whole thing starts again. When you are trying to configure multiple things with multiple services … it doesn't quite make it.

I finally went over the edge when I found that in order to support the latest version of JavaFX, I needed to copy a whole bunch of ".so" files from S3 to the "local disk". This was taking pretty much all the 10s … As I started to consider my options (I'd got as far as panicking, and decided that was not very productive), AWS announced the "SnapStart" feature: initialize once, use repeatedly. Excited, I turned it on for my functions (so excitedly that I just did it in the console, rather than using CloudFormation; more on that later).

And nothing happened. Or, at least, I continued to have problems. But why?

It only works on published functions

Uh, boss, it's not as simple as that. On the upside, it is very clear that the 10s timeout does not apply to SnapStart functions: initialization can take up to 900s, and it happens when the function is published. On the downside, this implies (and, indeed, it is explicitly stated) that you need to publish your functions in order to take advantage of this. My test environment does not bother with publishing and versioning lambdas, so there I still want to bring the initialization time down.

The Runtime Hooks

Before doing that, I decided to at least try and integrate the runtime hooks and issue tracing messages. My thought process was firstly to see if these were called when the lambda wasn't published, and even if they weren't, I would then at least know when I had successfully published the lambda, because my tracing would come out.

In order to turn on the runtime hooks, it is necessary to download the CRAC library and attach this to your project. The AWSHandler then needs to implement the CRAC Resource interface, and register with the CRAC global context. So I added code like this:
    public AWSHandler() {
        Core.getGlobalContext().register(this);
    }
and then it's a simple matter of implementing the callbacks provided in the Resource interface:
    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> arg0) {
        logger.info("before checkpoint");
    }
    @Override
    public void afterRestore(org.crac.Context<? extends Resource> arg0) {
        logger.info("after restore");
    }

When It Works …

I added the relevant code to publish and alias my lambdas, and also to ensure that the code inside API Gateway called the alias (and thus the published version), rather than the "$LATEST" version, and, lo and behold, it all worked.

During the publication, this happens:
INIT_START Runtime Version: java:11.v15	Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:0a25e3e7a1cc9ce404bc435eeb2ad358d8fa64338e618d0c224fe509403583ca
Picked up JAVA_TOOL_OPTIONS: -Dui4j.headless=true
-Dglass.platform=Monocle
-Dmonocle.platform=Headless
-Dprism.order=sw
-Djavafx.cachedir=/tmp/solibs
20230104-13:06:48.640 tdaserver/Thread-0 INFO: In config
The key thing here is the "INIT_START" rather than just "START". Interestingly, it doesn't seem to issue any message when the initialization is done; it just stops issuing messages.

And then, when the lambda is called during API Gateway access, I see this:
RESTORE_START Runtime Version: java:11.v15	Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:0a25e3e7a1cc9ce404bc435eeb2ad358d8fa64338e618d0c224fe509403583ca
RESTORE_REPORT Restore Duration: 383.49 ms
START RequestId: c55dfaca-70b4-4a7e-8e4e-b6fc920269aa Version: 1

END RequestId: c55dfaca-70b4-4a7e-8e4e-b6fc920269aa
REPORT RequestId: c55dfaca-70b4-4a7e-8e4e-b6fc920269aa	Duration: 785.60 ms	Billed Duration: 1044 ms	Memory Size: 1024 MB	Max Memory Used: 401 MB	Restore Duration: 383.49 ms
Here the RESTORE_START and RESTORE_REPORT lines make it clear that a SnapStart image is being used, and how long it took to restore the lambda from that image (383ms may seem a lot, until you realize that the initialization itself took over 15000ms).
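If you want to keep an eye on restore times without reading the logs by hand, the "Restore Duration" field is easy to pull out mechanically. What follows is only a minimal sketch based on the log lines shown above: the class name RestoreLogParser and its methods are my own invention for illustration, not part of any AWS library, and the regular expression assumes the field keeps the format seen here.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads SnapStart restore information out of Lambda log lines. */
final class RestoreLogParser {
    // Matches e.g. "Restore Duration: 383.49 ms", tolerating tabs or repeated spaces.
    private static final Pattern RESTORE_DURATION =
            Pattern.compile("Restore\\s+Duration:\\s+(\\d+(?:\\.\\d+)?)\\s+ms");

    private RestoreLogParser() {}

    /** Returns the restore duration in milliseconds, or empty if the line has none. */
    static OptionalDouble restoreDurationMs(String logLine) {
        Matcher m = RESTORE_DURATION.matcher(logLine);
        return m.find()
                ? OptionalDouble.of(Double.parseDouble(m.group(1)))
                : OptionalDouble.empty();
    }

    /** True if the line belongs to a snapshot restore rather than a normal start. */
    static boolean isSnapshotRestore(String logLine) {
        return logLine.startsWith("RESTORE_START") || logLine.startsWith("RESTORE_REPORT");
    }
}
```

Feeding it the RESTORE_REPORT or REPORT lines above should yield 383.49, while a plain START line yields an empty result.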

After that, the lambda proceeds in the normal way.

No CRAC Output

Interestingly, up to this point, I have not seen any of the tracing output I would expect from my CRAC callbacks. I don't know whether I didn't succeed in registering correctly, or whether my tracing is simply not coming out. In the fullness of time, I will need to sort this out, because after a restore it is necessary to check that all the initialization done before the snapshot was taken is still up to date.
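Until the callbacks are sorted out, one way to guard against stale initialization is to make the state itself age-aware, so that it does not matter whether the hooks fire. This is only a sketch of that idea in plain Java, not something I have running in the lambda: RefreshableValue is a hypothetical class, and the loader and the maximum age are whatever suits the thing being cached.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical holder for state that may have been computed before a snapshot. */
final class RefreshableValue<T> {
    private final Supplier<T> loader;
    private final Duration maxAge;
    private final Clock clock;
    private T value;
    private Instant loadedAt; // null means "not loaded, or explicitly invalidated"

    RefreshableValue(Supplier<T> loader, Duration maxAge, Clock clock) {
        this.loader = loader;
        this.maxAge = maxAge;
        this.clock = clock;
    }

    /** Returns the cached value, reloading it first if it is missing or too old. */
    synchronized T get() {
        Instant now = clock.instant();
        if (loadedAt == null || Duration.between(loadedAt, now).compareTo(maxAge) > 0) {
            value = loader.get();
            loadedAt = now;
        }
        return value;
    }

    /** Forces a reload on the next get(); an afterRestore callback could call this. */
    synchronized void invalidate() {
        loadedAt = null;
    }
}
```

In real use the clock would be Clock.systemUTC(); passing it in just makes the ageing logic easy to exercise. If the afterRestore callback does turn out to be called, it can simply call invalidate() on each such value.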

Configuring from CloudFormation

As a bleeding-edge adopter, when I first tried to use SnapStart, I couldn't find any CloudFormation documentation on using it. For all I know, it wasn't supported in CloudFormation at all. However, now, a couple of months later, all the relevant documentation is there.

As you'd expect, you configure this by adding a SnapStart property to your Lambda Function configuration. It is quite simple, and basically amounts to adding:
"SnapStart": {
    "ApplyOn": "PublishedVersions"
}
to your existing function declarations.
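
Since SnapStart only applies to published versions, the function declaration also needs a version and an alias alongside it. The fragment below is a hedged sketch of how the three resources fit together, not a copy of my actual template: the logical names (ServerFunction, ServerFunctionRole and so on), the handler, the bucket, the key and the alias name "live" are all placeholders, and ServerFunctionRole is assumed to be declared elsewhere in the template.

```json
{
  "Resources": {
    "ServerFunction": {
      "Type": "AWS::Lambda::Function",
      "Properties": {
        "Runtime": "java11",
        "Handler": "example.AWSHandler::handleRequest",
        "Role": { "Fn::GetAtt": ["ServerFunctionRole", "Arn"] },
        "Code": { "S3Bucket": "example-bucket", "S3Key": "server.jar" },
        "SnapStart": { "ApplyOn": "PublishedVersions" }
      }
    },
    "ServerFunctionVersion": {
      "Type": "AWS::Lambda::Version",
      "Properties": { "FunctionName": { "Ref": "ServerFunction" } }
    },
    "ServerFunctionAlias": {
      "Type": "AWS::Lambda::Alias",
      "Properties": {
        "FunctionName": { "Ref": "ServerFunction" },
        "FunctionVersion": { "Fn::GetAtt": ["ServerFunctionVersion", "Version"] },
        "Name": "live"
      }
    }
  }
}
```

API Gateway then needs to invoke the alias rather than "$LATEST", as described above.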

Conclusion

For a long time, Lambda on AWS with Java has been plagued by painfully slow startup times. It does seem that SnapStart makes major strides towards addressing these and, provided you are publishing your lambdas, is relatively easy to set up.

On the other hand, it seems somewhat opaque to use.