Thursday, July 17, 2025

Connecting to Neptune


Before we can go any further, we need to sort out the connectivity issues.

Looking at the console, I can see that I have two endpoints, one for reading and one for writing. For our current tasks, we obviously need the one for writing. I can copy this and hardcode it into my application (always remembering to add https:// on the front and :8182/ on the end):
    nodeCreator, err := NewNodeCreator("https://user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com:8182/")
    if err != nil {
        log.Fatal(err)
    }

    err = nodeCreator.Insert("HWX2")
    if err != nil {
        log.Fatal(err)
    }

NEPTUNE_ENDPOINT:neptune/cmd/create/main.go
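The "https:// on the front, :8182/ on the end" rule is easy to get wrong by hand, so it could be wrapped in a small helper. This is a minimal sketch rather than part of the code above: endpointURL is a hypothetical name, and the port is simply the 8182 used throughout this post.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// neptunePort is the port used for the Neptune endpoint in this post.
const neptunePort = "8182"

// endpointURL (a hypothetical helper) turns a bare hostname copied from the
// console into a base URL of the form https://<host>:8182/.
func endpointURL(host string) string {
	u := url.URL{
		Scheme: "https",
		Host:   net.JoinHostPort(host, neptunePort),
		Path:   "/",
	}
	return u.String()
}

func main() {
	fmt.Println(endpointURL("user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com"))
}
```

The result could be passed straight to NewNodeCreator in place of the hand-assembled string.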

I then update the code to connect using one of the "modify options" functions that NewFromConfig supports:
func NewNodeCreator(endpoint string) (*NodeCreator, error) {
    svc, err := openNeptune(endpoint)
    if err != nil {
        return nil, err
    }
    return &NodeCreator{svc: svc}, nil
}

func openNeptune(endpoint string) (*neptunedata.Client, error) {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        return nil, err
    }
    cli := neptunedata.NewFromConfig(cfg, func(opts *neptunedata.Options) {
        opts.BaseEndpoint = aws.String(endpoint)
        log.Printf("%s\n", *opts.BaseEndpoint)
    })
    return cli, nil
}

NEPTUNE_ENDPOINT:neptune/cmd/create/main.go

Now I get this error:
2025/07/11 08:07:22 https://user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com:8182
2025/07/11 08:07:22 operation error neptunedata: ExecuteOpenCypherQuery, https response error StatusCode: 0, RequestID: , request send failed, Post "https://user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com:8182/opencypher": dial tcp: lookup user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com: no such host
This is actually progress. This is the URL I want to connect to, but I don't have access to it (and the reason I'm willing to print it here is because you don't either). This is an address within a VPC.

In "the real world", I will want to package all of my code into a Lambda, deploy the Lambda into the VPC and connect that to an API Gateway (yes, I hope to get there if I don't get ground down). But for now, I want to connect from my development machine. How do I do that?

The sample code I found has the following to say on the matter:
  ------------------------------------------------------------------------------
  VPC NETWORKING REQUIREMENT:
  ------------------------------------------------------------------------------
  Amazon Neptune must be accessed from *within the same VPC* as the Neptune cluster.
  It does not expose a public endpoint, so this code must be executed from:
 
   - An AWS Lambda function configured to run inside the same VPC
   - An EC2 instance or *ECS task* running in the same VPC
   - A connected environment such as a VPN, AWS Direct Connect, or a peered VPC
So I'm going to try and connect a VPN to AWS from my machine.

Connecting a VPN

The documentation on this seems to start here. As somebody posing as an administrator, I also need to set up an endpoint using the administrator guide. Note that this service is NOT free (indeed, it charges you on an hourly basis just for having it set up), so be aware of the VPN pricing before you begin, if you are going to do this.

Since I am just following AWS instructions for this, I'm not going to go into too much detail, but for my own notes as much as anything:
  • Note down a CIDR block associated with your VPC and Neptune subnet
  • Go to the Client VPN endpoints tab of the VPC screen.
  • Click on the "Create client VPN" button.
  • Enter the information they want, including the CIDR block noted earlier.
I wasn't really sure what to do with the Authentication Information section, so, since I have certificates in ACM, I decided to go with mutual authentication and the same certificate on both sides. We'll see what happens.

Among the optional parameters, I chose to specify the VPC, although I figure this would be obvious (the fact that it offers me a choice suggests it is less obvious than I realise).
  • Click on "Create client VPN endpoint".
I had a couple of issues. Firstly, the CIDR range I had selected was invalid. My subnets are "24-bit" (/24), but apparently the smallest range you can have is a /22, so I had to widen it. Then it turned out the certificate I had selected was "not in the Issued state" (don't ask), and so I had to go and choose another one.

This seems like the end of the process, but it is not. The endpoint shows up in the list, but it is in the "pending-associate" state. Who knows what this means?

Scouring the documentation, it seems to be the case that we need to associate the endpoint with a target network, so we click on the "Target network associations" tab and click "Associate target network".

This then requires us to:
  • Choose a VPC
  • Choose a subnet
(I thought I did this already in optional properties, but there you go).

Ah, and then we run into problems. This subnet overlaps with the specified CIDR we used above, so apparently the CIDR they want you to specify is NOT the CIDR of your existing subnet, but a different one.
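Both CIDR problems (the range being too small, and the range overlapping the VPC) can be checked before filling in the form. The following is a minimal sketch under stated assumptions: checkClientCIDR is a hypothetical helper, the /12 to /22 limits are the ones AWS documents for the client CIDR range, and the VPC range 172.28.0.0/16 is a guess made purely for illustration.

```go
package main

import (
	"fmt"
	"net/netip"
)

// checkClientCIDR (hypothetical) reports why a candidate client CIDR would be
// rejected for the given VPC range, or nil if it passes both checks.
func checkClientCIDR(client, vpc string) error {
	c, err := netip.ParsePrefix(client)
	if err != nil {
		return fmt.Errorf("client CIDR: %w", err)
	}
	v, err := netip.ParsePrefix(vpc)
	if err != nil {
		return fmt.Errorf("VPC CIDR: %w", err)
	}
	// The client range must be at least as large as a /22 and no larger than a /12.
	if c.Bits() < 12 || c.Bits() > 22 {
		return fmt.Errorf("client CIDR %s must be between /12 and /22", c)
	}
	// The client range must be separate from the addresses used inside the VPC.
	if c.Overlaps(v) {
		return fmt.Errorf("client CIDR %s overlaps VPC range %s", c, v)
	}
	return nil
}

func main() {
	const vpc = "172.28.0.0/16" // assumed VPC range, for illustration only
	for _, candidate := range []string{"172.28.130.0/24", "172.28.128.0/22", "10.100.0.0/22"} {
		if err := checkClientCIDR(candidate, vpc); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("acceptable:", candidate)
	}
}
```

The first candidate fails on size, the second on overlap, and only a range outside the VPC passes both checks.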

OK, let's go back, delete this one and start again (this is turning into a long process; sorry about that, but one of the purposes I have here is to record the problems I ran into, mainly because nobody else does).

And when I have gone through all that, I can successfully associate the target network, but it still remains in the "pending-associate" state. However, I can see in the lower tab on "Target network associations" that there is an association and it is in the "Associating" state. I presume this is just a question of waiting.

So, now, turning to the client portion of the process, you need to click on "Download the AWS Client VPN from the self-service portal" on the user-getting-started page. We need to copy across the cvpn ID. When I enter this into the form, it tells me that I need to contact my IT administrator. Turning to myself, I ask, "Can you help me with this?" to which I answer, "I think we probably need to wait for the association to complete". OK, then, time for a cup of tea.

I didn't time it exactly, but to give an idea, it was about ten minutes from associating to reaching the "Associated" state. Let's try the client again.

No, it still doesn't work. Digging around some more, I found this article, which has a fair amount more detail than I'd seen before. At the bottom, it lists the additional steps I need to carry out. I'd already figured out the "pending-associate" one, but carrying on:
  • I need to add an authorization rule to a "destination" network. I would presume this is the CIDR of the VPN, but it might be the subnet. Anyway, 0.0.0.0/0 is an option, so I'm going for that. The endpoint now enters the "Authorizing" state, so I will have to wait for a bit, although it turns out to be less time than it's taken me to type this.
  • I need to export the client configuration file by going to the main list and clicking on "Download Client Configuration". The name and alleged action don't seem to match, but it does indeed download something.
  • Because I opted to use mutual certificate authentication, I now need to insert the information about my certificate into the downloaded file. Interestingly, it seems to have already completed the <ca> section, but not the <cert> or <key> aspects. I can do that.
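Adding the missing sections to the downloaded file can be scripted. This is a minimal sketch that works on strings only; addClientCredentials is a hypothetical helper, and in real use the three arguments would be the contents of the downloaded configuration, the client certificate and the client key.

```go
package main

import (
	"fmt"
	"strings"
)

// addClientCredentials (hypothetical) appends inline <cert> and <key> sections
// to the text of a downloaded client configuration file.
func addClientCredentials(config, certPEM, keyPEM string) string {
	var b strings.Builder
	b.WriteString(strings.TrimRight(config, "\n"))
	b.WriteString("\n\n<cert>\n")
	b.WriteString(strings.TrimSpace(certPEM))
	b.WriteString("\n</cert>\n\n<key>\n")
	b.WriteString(strings.TrimSpace(keyPEM))
	b.WriteString("\n</key>\n")
	return b.String()
}

func main() {
	// Placeholder values; real ones would be read from disk with os.ReadFile.
	config := "client\ndev tun\n"
	cert := "-----BEGIN CERTIFICATE-----\n(placeholder)\n-----END CERTIFICATE-----\n"
	key := "-----BEGIN PRIVATE KEY-----\n(placeholder)\n-----END PRIVATE KEY-----\n"
	fmt.Print(addClientCredentials(config, cert, key))
}
```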
Having done that, I imported that into my OpenVPN client on my Mac and attempted to connect. Sadly, it reported that it could not find the endpoint. Is this just DNS being slow to propagate, or is this something not right with my configuration, or is it an issue with the client? To eliminate one variable, I decided to use the official AWS client. Sadly, that didn't work either, but it said that the "connection failed because of a TLS handshake error". Now, that would not surprise me if it was to do with the certificate and CA issues.

Having gone around and around in circles for a while, I decided to spread my net further and came across this guide to creating what appear to be self-signed client and server certificates and importing them for this purpose.

There is a lot of detritus in the output here which I have excerpted for length.
$ mkdir neptune-vpn
$ cd neptune-vpn
$ git clone https://github.com/OpenVPN/easy-rsa.git
Cloning into 'easy-rsa'...
...
$ easy-rsa/easyrsa3/easyrsa init-pki
...
'init-pki' complete; you may now create a CA or requests.
...
$ easy-rsa/easyrsa3/easyrsa build-ca nopass
...
Common Name (eg: your user, host, or server name) [Easy-RSA CA]:gmmapowell.com
$ easy-rsa/easyrsa3/easyrsa --san=DNS:neptune.gmmapowell.com build-server-full server nopass
Generating a 2048 bit RSA private key
...
$ easy-rsa/easyrsa3/easyrsa build-client-full gareth.gmmapowell.com nopass
Generating a 2048 bit RSA private key
...
$ mkdir upload
$ cp pki/ca.crt upload/
$ cp pki/issued/*.crt upload
$ cp pki/private/*.key upload
$ cd upload/
$ AWS_PROFILE=ziniki-admin aws acm import-certificate --certificate fileb://server.crt --private-key fileb://server.key --certificate-chain fileb://ca.crt
{
    "CertificateArn": "arn:aws:acm:us-east-1:331358773365:certificate/3d12f227-54ec-4ffb-be6f-a02750b4b975"
}
$ AWS_PROFILE=ziniki-admin aws acm import-certificate --certificate fileb://gareth.gmmapowell.com.crt --private-key fileb://gareth.gmmapowell.com.key --certificate-chain fileb://ca.crt
{
    "CertificateArn": "arn:aws:acm:us-east-1:331358773365:certificate/bbdb1930-a190-4f83-b6e2-be4f70a73ca8"
}
Having done all this, it's now necessary to delete all of the VPN endpoint configuration files; this requires us to disassociate the network association and delete the endpoint. We can then start again.

And then, yes! I'm connected. Still, two hours of my life I'll never get back.

Limiting the Connection

As soon as I do this, I find I can't connect to the rest of the internet. Why not? Well, all of my internet traffic is being sent down the VPN.

I've been here before, but struggled. This answer says to use the argument route-nopull in the configuration file, but when I tried that, the Amazon client objected, saying it wasn't supported. Using this file with my (normal) OpenVPN command works providing I make a change to the remote name as follows:
remote cvpn-endpoint-02fe15ffc9f92baae.prod.clientvpn.us-east-1.amazonaws.com 443
remote-random-hostname
becomes
remote foo.cvpn-endpoint-02fe15ffc9f92baae.prod.clientvpn.us-east-1.amazonaws.com 443
# remote-random-hostname
The thing here is that the Amazon client can handle the task of adding "random" parts to the front of the remote name in order to bust the DNS cache. The standard client does not have that, so you need to specify a specific remote host, but any name will do.
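The same edit can be applied mechanically to a freshly downloaded file. The following is a minimal sketch, assuming the configuration contains a "remote" line and a "remote-random-hostname" line as shown above; adaptForOpenVPN and the label "foo" are hypothetical, and any label should do.

```go
package main

import (
	"fmt"
	"strings"
)

// adaptForOpenVPN (hypothetical) prefixes the remote hostname with a fixed
// label and comments out remote-random-hostname, so that a stock OpenVPN
// client gets a concrete hostname to resolve.
func adaptForOpenVPN(config, label string) string {
	lines := strings.Split(config, "\n")
	for i, line := range lines {
		fields := strings.Fields(line)
		switch {
		case len(fields) >= 2 && fields[0] == "remote":
			fields[1] = label + "." + fields[1]
			lines[i] = strings.Join(fields, " ")
		case len(fields) == 1 && fields[0] == "remote-random-hostname":
			lines[i] = "# " + line
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	config := "remote cvpn-endpoint-02fe15ffc9f92baae.prod.clientvpn.us-east-1.amazonaws.com 443\n" +
		"remote-random-hostname\n"
	fmt.Print(adaptForOpenVPN(config, "foo"))
}
```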

Attempting to Connect over the VPN

Running the program again leads to the problem that we still cannot resolve the hostname. Looking at this article, step 2 says that you should be able to set the DNS resolver as being the ".2" IP address in the VPC. This still does not work for me (I am trying with both the OpenVPN client and the Amazon Client).
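For reference, the ".2" address is the base address of the VPC range plus two. A minimal sketch of that arithmetic follows; vpcResolverAddr is a hypothetical helper, and the /16 prefix length is an assumption, since only the base address can be inferred from this post.

```go
package main

import (
	"fmt"
	"net/netip"
)

// vpcResolverAddr (hypothetical) returns the address two above the base of
// the VPC range, which is where the VPC's DNS resolver is expected to live.
func vpcResolverAddr(vpcCIDR string) (netip.Addr, error) {
	p, err := netip.ParsePrefix(vpcCIDR)
	if err != nil {
		return netip.Addr{}, err
	}
	// Masked() normalises the prefix to its base address; Next() steps up by one.
	return p.Masked().Addr().Next().Next(), nil
}

func main() {
	addr, err := vpcResolverAddr("172.28.0.0/16") // assumed VPC range
	if err != nil {
		fmt.Println("bad CIDR:", err)
		return
	}
	fmt.Println(addr)
}
```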

Can I connect directly via the IP address? Sadly, I cannot see an ip address anywhere, but I can look it up from the DNS server if, indeed, it is where the documentation says to look. Let's try that (this obviously requires the VPN to be connected):
$ nslookup user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com 172.28.0.2
Server: 172.28.0.2
Address: 172.28.0.2#53
** server can't find user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com: NXDOMAIN
Hands up anyone who is surprised by that.

As a side benefit, this has enabled me to check my OpenVPN connection is working properly, because I can see a clear difference here between being connected to the VPN (I get NXDOMAIN) and not being connected (connection times out). So, given that I receive the same NXDOMAIN message using the standard OpenVPN client as I do using the AWS client, I clearly have connectivity to the VPC through that. Very good.

So I'm left with one of two options: the DNS is not working, or I simply don't have an instance to connect to. The latter seems more likely. I can't see anywhere to add an instance to my current cluster, but interestingly there is an option to add an extra "reader". When I try this, it tells me that I cannot add a reader without a writer. So, no, I don't think I have an instance.

OK, this is frustrating, but this is why we are here: to have these problems in a nice little sandbox and to figure them out. Let me go back to my deployment script and add two DBInstances: one for writing and one for reading.
So now we can create the Cluster.  It would seem that you can create a cluster with little more
than a name and a SubnetGroup.

        ensure aws.Neptune.Cluster "user-stocks" => cluster
            @teardown delete
            SubnetGroupName <- subnet
            MinCapacity <- 1.0
            MaxCapacity <- 1.0

We need to create Neptune Writer and Reader instances

        ensure aws.Neptune.Instance "writer"
            @teardown delete
            Cluster <- cluster
            InstanceClass <- "serverless"

        ensure aws.Neptune.Instance "reader"
            @teardown delete
            Cluster <- cluster
            InstanceClass <- "serverless"

NEPTUNE_INSTANCES:neptune/dply/infrastructure.dply

Each instance needs an InstanceClass - the type of machine being provisioned to do the work. It turns out that one of the options is to specify the db.serverless class, which automatically scales relative to the work. It does however require parameters to be set to specify the minimum and maximum capacity of the system, and those need to be set on the cluster. So we can go back and add those parameters to the cluster and have it do all the relevant work for us.

Wow, that took a long time to create. 20 minutes for the writer and 35 for the reader (Note to self: should I allow these to be created in parallel in my deployer, or is that just 15 for the reader but it waits until the writer is available?).

And now, when I try to access the DNS, I see this:
$ nslookup user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com 172.28.0.2
Server: 172.28.0.2
Address: 172.28.0.2#53
Non-authoritative answer:
user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com canonical name = writer.ckgvna81hufy.us-east-1.neptune.amazonaws.com.
Name: writer.ckgvna81hufy.us-east-1.neptune.amazonaws.com
Address: 172.28.130.87
Great! Let's try the app again...

No, no improvement. But that's not surprising. The app still can't resolve the DNS name. Let's try the IP address now we know it.
2025/07/11 17:02:15 https://172.28.130.87:8182/
2025/07/11 17:02:17 operation error neptunedata: ExecuteOpenCypherQuery, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://172.28.130.87:8182/opencypher": tls: failed to verify certificate: x509: “*.ckgvna81hufy.us-east-1.neptune.amazonaws.com” certificate is not standards compliant
Well, well. Not working, but a different error which makes it quite clear that we have connectivity. I'm not convinced I like this error, but I'm going to assume that the problem is just that I am using an IP address. So how do we resolve this (no pun intended)?

The answer, of course, is to have a custom resolver that knows how to figure out what the IP address is, even though it isn't in the default DNS service. There appear to be such things in the Go library, but in my frazzled state I can't figure that out. Much simpler in the short term is just to add an entry to /etc/hosts:
172.28.130.87 user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com
(In the long term, we will be deploying everything inside the VPC and it should "just work".)
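For the record, a custom resolver in the Go standard library might look something like the following. This is an untested sketch under stated assumptions: newVPCResolver and newHTTPClient are hypothetical names, and the resolver address is the 172.28.0.2 used in the nslookup commands above.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// vpcDNS is the VPC resolver address used in the nslookup commands above.
const vpcDNS = "172.28.0.2:53"

// newVPCResolver (hypothetical) returns a resolver that sends every DNS query
// to dnsAddr instead of the servers configured on the machine.
func newVPCResolver(dnsAddr string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // use Go's built-in resolver so that Dial below is used
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, dnsAddr)
		},
	}
}

// newHTTPClient (hypothetical) returns an HTTP client whose connections
// resolve hostnames through the given resolver.
func newHTTPClient(r *net.Resolver) *http.Client {
	dialer := &net.Dialer{Timeout: 10 * time.Second, Resolver: r}
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.DialContext = dialer.DialContext
	return &http.Client{Transport: transport, Timeout: 30 * time.Second}
}

func main() {
	client := newHTTPClient(newVPCResolver(vpcDNS))
	fmt.Printf("%T will resolve hostnames via %s\n", client, vpcDNS)
}
```

Such a client could presumably be handed to the SDK through the HTTPClient option in the same function that sets BaseEndpoint, but that is an assumption and has not been tried here. The results that follow use the /etc/hosts entry, not this resolver.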

Sadly, though, the error remains the same:
2025/07/11 17:34:05 https://user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com:8182/
2025/07/11 17:34:08 operation error neptunedata: ExecuteOpenCypherQuery, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com:8182/opencypher": tls: failed to verify certificate: x509: “*.ckgvna81hufy.us-east-1.neptune.amazonaws.com” certificate is not standards compliant
Back to Google:

This GitHub issue seems to suggest that the root cause is a change Apple made at some point in the way they handle certain kinds of certificates, so it's probably worth trying this on Linux; I'll do that in a bit. Certainly it seems to be an issue that sits at the intersection of Apple, Go and AWS.

There was a suggestion that it might help to add the various AWS root certificates to the x509.SystemCertPool(). I tried this, but it didn't help me - maybe I did it wrong.
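For anyone wanting to repeat the experiment, adding extra roots to the system pool generally looks like the following. This is a minimal sketch and not the code that was actually tried; clientWithExtraRoots is a hypothetical helper, and the PEM bundle would be whatever file of AWS root certificates you have downloaded. It makes no claim to fix the error above.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
)

// clientWithExtraRoots (hypothetical) returns an HTTP client that trusts the
// system roots plus any certificates found in pemBundle.
func clientWithExtraRoots(pemBundle []byte) (*http.Client, error) {
	pool, err := x509.SystemCertPool()
	if err != nil {
		return nil, err
	}
	if !pool.AppendCertsFromPEM(pemBundle) {
		return nil, errors.New("no certificates found in PEM bundle")
	}
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.TLSClientConfig = &tls.Config{RootCAs: pool}
	return &http.Client{Transport: transport}, nil
}

func main() {
	// A real run would pass the contents of a PEM file here; this placeholder
	// only demonstrates the failure path.
	_, err := clientWithExtraRoots([]byte("not a certificate"))
	fmt.Println("result for a bundle with no certificates:", err)
}
```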

Thinking about it, I can eliminate the Go library from the equation by using curl (I tried this before, but since then I've resolved the DNS problem):
$ curl "https://user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com:8182/openCypher?query=RETURN%201"
{
  "results": [{
      "1": 1
    }]
}
Interesting. So that just works. I was thinking I might have to specify the -k flag (to ignore certificate issues).
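A complementary check is to make the same request from Go without the AWS SDK, which would separate "Go's TLS handling" from "the SDK" as the source of the problem. This is a minimal sketch under stated assumptions: openCypherURL is a hypothetical helper, and the request is the same unsigned GET that curl sends above.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// openCypherURL (hypothetical) builds the same URL as the curl command above,
// encoding spaces in the query as %20.
func openCypherURL(base, query string) string {
	escaped := strings.ReplaceAll(url.QueryEscape(query), "+", "%20")
	return strings.TrimRight(base, "/") + "/openCypher?query=" + escaped
}

func main() {
	target := openCypherURL("https://user-stocks.cluster-ckgvna81hufy.us-east-1.neptune.amazonaws.com:8182/", "RETURN 1")
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(target)
	if err != nil {
		// Off the VPN this is expected to be a DNS or connection error.
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("reading response failed:", err)
		return
	}
	fmt.Printf("%s\n%s\n", resp.Status, body)
}
```

If this fails on the Mac with the same x509 error while curl succeeds, the SDK is probably not the culprit.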

On Linux

OK, let's now try all this on Linux instead of macOS.

In getting it to run on Linux, I ran into a whole bunch of other random problems, like needing to update my dlv debugger (for which this eventually worked for me: go install github.com/go-delve/delve/cmd/dlv@latest).

But then once I had brought everything up to speed, including running the Linux openvpn client with the same configuration file (as root, of course):
$ sudo openvpn --config ~/Ziniki/Credentials/neptune-vpn/neptune-cert-nopull.ovpn
and updating my /etc/hosts file (also as root), I got this:
2025/07/11 20:23:01 operation error neptunedata: ExecuteOpenCypherQuery, https response error StatusCode: 400, RequestID: 1552f1a1-e1fa-4623-aded-d4a8aeda9d47, MalformedQueryException:
OK, it's not "working" but it's not a stupid certificate exception. At least that feels like progress. The setup I have in my flat makes it easier to develop on my Mac when I'm there, but I can slum it on this laptop if it means the code will work.

Removing the Hardcoded Hostname

OK, at a totally different level of complexity, can I remove the need to hardcode the endpoint in the code? Theoretically, at least, I can use the Neptune API to obtain the endpoints associated with a cluster. Let's try this.

The bulk of the changes are in the openNeptune() function which no longer takes an endpoint:
func openNeptune() (*neptunedata.Client, error) {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        return nil, err
    }
    nc := neptune.NewFromConfig(cfg)
    endpoints, err := nc.DescribeDBClusterEndpoints(context.TODO(), &neptune.DescribeDBClusterEndpointsInput{DBClusterIdentifier: aws.String("user-stocks")})
    if err != nil {
        return nil, err
    }
    if len(endpoints.DBClusterEndpoints) < 1 {
        return nil, fmt.Errorf("no cluster endpoints found")
    }
    endpoint := *endpoints.DBClusterEndpoints[0].Endpoint
    cli := neptunedata.NewFromConfig(cfg, func(opts *neptunedata.Options) {
        opts.BaseEndpoint = aws.String("https://" + endpoint + ":8182/")
    })
    return cli, nil
}

NEPTUNE_FIND_ENDPOINT:neptune/cmd/create/main.go

And that all works. Great.

(As noted above, this doesn't remove the need to hardcode the address in /etc/hosts, but that is not a long-term problem, whereas finding the Neptune host still will be once we deploy into a Lambda.)
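One possible refinement is to pick the endpoint by its type rather than taking the first one returned, since the cluster has both a reader and a writer endpoint. The following is a minimal sketch under stated assumptions: clusterEndpoint is a simplified stand-in for the SDK's endpoint record (which, as far as I know, carries Endpoint and EndpointType as string pointers), and pickEndpoint is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterEndpoint is a simplified, hypothetical stand-in for the endpoint
// records returned by DescribeDBClusterEndpoints.
type clusterEndpoint struct {
	Endpoint     string
	EndpointType string // assumed to be something like "WRITER" or "READER"
}

// pickEndpoint (hypothetical) returns the first endpoint of the wanted type,
// comparing the type case-insensitively.
func pickEndpoint(endpoints []clusterEndpoint, wantType string) (string, error) {
	for _, e := range endpoints {
		if e.Endpoint != "" && strings.EqualFold(e.EndpointType, wantType) {
			return e.Endpoint, nil
		}
	}
	return "", fmt.Errorf("no %s endpoint found among %d endpoints", wantType, len(endpoints))
}

func main() {
	// Illustrative values only.
	endpoints := []clusterEndpoint{
		{Endpoint: "reader.example.internal", EndpointType: "READER"},
		{Endpoint: "writer.example.internal", EndpointType: "WRITER"},
	}
	host, err := pickEndpoint(endpoints, "writer")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("https://" + host + ":8182/")
}
```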

Conclusion

This has been a truly frustrating day. I view myself as a software engineer, not someone who troubleshoots network and communication issues. And yet, I feel like I have spent all day doing one or another task that is just trying to figure out why some piece of infrastructure is not working the way I would expect it to. Hopefully I have reached the end of that now, and the last problem here is one in the code where I am not assembling a query correctly. We will see another day.
