Recommended way to self-heal crashed container (lock file) #289

Open
sdaschner opened this issue Apr 4, 2021 · 2 comments

Hi there 🙂 Not really a bug report (since it's expected behavior), but what is the recommended way to let a container with an attached volume self-heal from a crash?

Scenario

  • Dockerized Neo4j (in a Kubernetes cluster, deployed with Helm in standalone mode)
  • Pod (e.g. managed by a StatefulSet) crashes and gets restarted, i.e. a new container attaches the same volume
  • The usual lock file error appears: "Lock file has been locked by another process: /data/databases/store_lock", because the original process was killed rather than stopped cleanly

If I can ensure that the container/pod is the only one using that volume, is there a recommended way to clear the lock file at startup?

I understand that this makes sense from a concurrency perspective, but if the container crashes unexpectedly, the pod can't come up again on its own (CrashLoopBackOff) and has to be restarted manually...

@sdaschner (Author)

Update:

I played around with the init process (the Helm chart runs the init.sh script from the config map) and tried to check the lock file status from the command line. Interestingly, after the crash the file (/data/databases/store_lock) exists and can be locked & unlocked on the CLI with flock, yet the Java process still fails to obtain the lock...

This was my test snippet at the start of the container:

    # Probe the lock non-blockingly: flock runs the no-op (echo -n)
    # only if it can acquire an exclusive lock on the file.
    if flock -x -n /data/databases/store_lock echo -n; then
      echo "/data/databases/store_lock not locked, continuing"
    else
      echo "/data/databases/store_lock is locked; deleting"
      rm -f /data/databases/store_lock
    fi

which prints "... not locked", yet the Java process still fails afterwards. Is there another way to check the lock file, or should I simply delete it at startup (in my particular case, where no other Neo4j instance operates on that filesystem)?
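
Aside, in case it helps others: my working assumption (not verified) is that Java's FileChannel.lock() takes a POSIX fcntl lock on Linux, while the flock utility uses flock(2), and the two lock types don't see each other on local filesystems, so a flock probe can report the file as free while a fcntl lock is still held. /proc/locks lists both kinds, keyed by major:minor:inode, so a check along these lines should see whatever the JVM holds:

    # Sketch only: look up the file's inode and see whether any kernel
    # lock (POSIX or FLOCK) is registered on it in /proc/locks.
    inode=$(stat -c %i /data/databases/store_lock)
    if grep -q ":$inode " /proc/locks; then
      echo "/data/databases/store_lock is held by some process"
    else
      echo "no kernel lock on /data/databases/store_lock"
    fi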

@jennyowen (Member)

@sdaschner In regular non-Kubernetes Docker, you can just delete the store_lock file if you know that there are no running Neo4j databases using the mounted data volume. However, you specifically asked about Kubernetes and our Helm charts, so I know this doesn't exactly answer your question.
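
For the plain-Docker case, something like this (a minimal sketch, assuming a hypothetical named volume neo4j-data that no other container is using) would clear the stale lock before starting Neo4j:

    # Remove the stale lock from the volume, then start Neo4j.
    # "neo4j-data" is a placeholder volume name for illustration.
    docker run --rm -v neo4j-data:/data alpine rm -f /data/databases/store_lock
    docker run -d --name neo4j -v neo4j-data:/data neo4j:4.2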

Could you re-create this issue on the github repo for the helm charts please? They'll be better able to answer your question / feature request.
https://github.com/neo4j-contrib/neo4j-helm
(I can't transfer the issue myself because it's owned by a different account).

Thanks!
