
Debugging NodeJS Microservice with Shared Storage on Kubernetes

Kobbi Gal · 7 min read


Introduction

One of our largest customers recently had a problem loading a list of resources from our web application. The problem was a blocker for the customer, so we needed to identify the root cause and provide a workaround, if possible. I was assigned the task as I was the SME in this area (NodeJS microservices and infrastructure such as storage, microservice messaging, and configuration).

Initial Analysis and Reproduction

The issue was consistently reproducible, which always simplifies things. All I needed to do was restore the MongoDB dump, log in with a certain user, and attempt to load the webpage with the list of resources. I restored the MongoDB dump using:

mongorestore /path/to/dump

and logged in with the user and accessed the webpage. The webpage layout loaded just fine, but the actual resource tree was missing. I then opened the Chrome Developer Tools and navigated to the ‘Network’ tab, which allowed me to see the requests sent to the server and find the failing API call. Easily enough, I discovered that the request:

curl -X GET $SERVER/api/list

returned a 500 HTTP response code (Internal Server Error). This indicated that some uncaught exception was being thrown on the server side. The only way to understand the problem was to check the server logs and proceed from there.

Reviewing Server Logs

As I had a lot of experience with this specific microservice and its storage, I knew the architecture and dependencies pretty well. In short, it’s a NEAM (NodeJS + Express + Angular + MongoDB) stack. The list of resources was stored in a MongoDB collection (let’s call it resources). The DAL (data access layer) interacting with the resources collection was a NodeJS microservice running an Express server. The client generating the API call was running Angular. The microservices were running in Docker containers in a Kubernetes cluster.
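
Since the service runs as a Pod in the cluster, the quickest way to pull its logs is through kubectl. A minimal sketch, assuming the Deployment is named some-nodejs-service as in the manifests later in this post:

# Tail recent log lines from the microservice (deployment name is illustrative)
kubectl logs deployment/some-nodejs-service --tail=200

# Or follow the logs live while reproducing the failing API call
kubectl logs deployment/some-nodejs-service -f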

Reviewing the NodeJS microservice logs, I saw the following error:

Exceeded memory limit for $group, but didn't allow external sort

Researching this issue, I found that it’s related to the amount of memory the generated MongoDB aggregation needs. To understand the magnitude of the query, I checked the MongoDB collection size:

db.resources.count()
// 1242731

This was the largest number of documents I’d seen in this collection. It seemed that the aggregation query sent to MongoDB was too large to be processed in memory by the server.
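
The document count alone doesn’t tell the whole story, since the memory pressure comes from the amount of data the aggregation has to hold. As a quick, hedged check from the mongo shell (standard collection statistics, nothing specific to this application):

// Collection statistics: size (data size in bytes), avgObjSize, storageSize, count
db.resources.stats()

// Or just the uncompressed data size in bytes
db.resources.dataSize()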

Brainstorming Solutions

Starting to think about possible solutions, the first one that came to mind was to run some sort of cleanup script to remove duplicate documents from MongoDB. Since I had a tool like that at my disposal, I ran it, but it removed less than 1% of the documents. The second idea was also about decreasing the number of documents: have the customer review the system and purge irrelevant resources from the web app. The customer reviewed their list of resources but could not find anything that could be removed. So it seemed we needed to resolve the issue without shrinking the collection. After reviewing the NodeJS microservice source code, I found that the query generated by the DAL to MongoDB was the following (simplified version):

// /path/to/resources.dal.js
db.getCollection('resources').aggregate([
  {
    "$group": {
      "_id": null,                 // $group requires an _id key; null groups everything together
      "ids": { "$push": "$_id" }   // collect every document's _id into one array
    }
  }
])

So it seemed that one million documents with this specific query were not going to be processed. I needed to find a way to modify the query so it could execute successfully. I did some research and found that MongoDB enforces a 100 MB memory limit on blocking aggregation stages such as $group and $sort, and there is no way to raise that limit. There is, however, a way to work around it by letting the stage spill to disk to complement the RAM used by the query. MongoDB creates a temporary directory (_tmp) inside the dbPath storage location and uses it to process the query.

allowDiskUse and Changing Source Code

The configuration to let MongoDB use the disk and work around the memory limit is quite easy. All we need to do is add the following options object to the query:

{allowDiskUse: true}

After adding it to the source code on my test machine, the query looked like this:

// /path/to/resources.dal.js
db.getCollection('resources').aggregate([
  {
    "$group": {
      "_id": null,
      "ids": { "$push": "$_id" }
    }
  }
], { allowDiskUse: true })
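
The snippet above is mongo-shell syntax; in the actual DAL the same option is passed through the driver. As a minimal sketch of how that looks with the official MongoDB Node.js driver (the database name and connection handling here are assumptions for illustration, not the real DAL code):

// sketch.js -- illustration only, not the actual resources.dal.js
const { MongoClient } = require('mongodb');

async function listResourceIds(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const resources = client.db('app').collection('resources');

    // Same $group pipeline, with allowDiskUse passed as an aggregate() option
    // so the stage can spill to disk instead of failing at the 100 MB limit
    const cursor = resources.aggregate(
      [{ $group: { _id: null, ids: { $push: '$_id' } } }],
      { allowDiskUse: true }
    );
    return await cursor.toArray();
  } finally {
    await client.close();
  }
}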

Restarting the service for the changes to take effect, the failing GET API call now returned a 200. Success! But I faced one last problem: the issue was happening in production. How was I going to change the source code inside a Docker container when, upon recycling the Pod (kubectl delete pod $POD_NAME), the source code would revert to the original code packed into the image?

After a few hours of tinkering, trying to find an answer to this question, I found a direction that would lead me to a working solution. I started by reviewing a few things:

  • The NodeJS microservice Dockerfile – I found that the Dockerfile executed the following command to start the server:
CMD ["npm", "start"]
  • The Kubernetes Deployment (shortened for brevity) ran the same start command, but with additional commands to move some resources around between the container and the shared storage location:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-nodejs-service
  labels:
    purpose: backend
spec:
  containers:
    - name: some-nodejs-service
      image: alpine
      command: ["/bin/bash"]
      args:
        - "-c"
        - |
          mkdir -p /tmp/other-res;
          mv /path/to/app/tmp /tmp/other-res &&
          npm start
      volumeMounts:
        - mountPath: /opt/storage
          name: storage
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: storage-claim

So my thought was: since I had access to the shared storage through the web application (which hosts a Droppy file manager over the volume mounted at /opt/storage), could I just upload a modified NodeJS module with the fix and have it replace the original before the server initializes? It was worth a shot. The first thing I needed to do was copy the module from the container to the shared storage location. To do this, I ran the following command:

kubectl exec some-nodejs-service-dd888bb69-plr6j -- cp /path/to/resources.dal.js /opt/storage/resources.dal.js

I then modified the source code in the resources.dal.js module to add the allowDiskUse: true configuration. The next step was to add a couple more commands to run in the container before the NodeJS service starts. The first command renames the original module:

mv /path/to/resources.dal.js /path/to/resources.dal.js.bak

The next command is to copy the file from the shared storage (the one with the fix) into the container to replace the original module we renamed in the previous command:

cp /opt/storage/resources.dal.js /path/to/resources.dal.js

I modified the Kubernetes Deployment using kubectl edit deployment some-nodejs-service. The modified Kubernetes container commands looked like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-nodejs-service
  labels:
    purpose: backend
spec:
  containers:
    - name: some-nodejs-service
      image: alpine
      command: ["/bin/bash"]
      args:
        - "-c"
        - |
          mkdir -p /tmp/other-res;
          mv /path/to/app/tmp /tmp/other-res &&
          # Added this: back up the original module
          mv /path/to/resources.dal.js /path/to/resources.dal.js.bak
          # Added this: copy the patched module in from the shared storage
          cp /opt/storage/resources.dal.js /path/to/resources.dal.js
          npm start
      volumeMounts:
        - mountPath: /opt/storage
          name: storage
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: storage-claim

Upon saving the changes, I needed to terminate the Pods for the changes to take effect using:

kubectl scale deployment some-nodejs-service --replicas 0
kubectl scale deployment some-nodejs-service --replicas 1
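
Before re-testing the API, it’s worth a quick sanity check that the patched module actually landed inside the freshly created Pod (a hedged sketch; the module path is the same placeholder used above):

# Count occurrences of allowDiskUse in the module running inside the new Pod
kubectl exec deploy/some-nodejs-service -- grep -c allowDiskUse /path/to/resources.dal.js

A non-zero count means the copy from /opt/storage ran before npm start and the new Pod is serving the patched DAL.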

This fixed the problem, and another customer escalation was resolved!