Recommended way to self-heal crashed container (lock file) #289

Open
sdaschner opened this issue Apr 4, 2021 · 2 comments

Hi there 🙂 Not really a bug report (since it's expected behavior), but what is the recommended way to let a container with an attached volume self-heal from a crash?

Scenario

  • Dockerized Neo4j (in a Kubernetes cluster, deployed with Helm in standalone mode)
  • Pod (e.g. managed by a StatefulSet) crashes and gets restarted, i.e. a new container attaches the same volume
  • The usual lock file error appears: "Lock file has been locked by another process: /data/databases/store_lock", because the original process was killed rather than stopped cleanly

If I can ensure that the container/pod is the only one using that volume, is there a recommended way to clear the lock file at startup?

I understand that this makes sense from a concurrency perspective, but if the container crashes unexpectedly, the pod can't come up again on its own (CrashLoopBackOff) and has to be restarted manually...

@sdaschner (Author)

Update:

I played around with the init process (the Helm chart runs the init.sh script from the config map) and tried to check the lock file status from the command line. Interestingly, after the crash the file (/data/databases/store_lock) exists and can be locked & unlocked on the CLI with flock, yet the Java process still fails to obtain the lock...

This was my test snippet at the start of the container:

    # Probe the lock non-blockingly: flock runs the no-op (echo -n)
    # only if it can acquire an exclusive lock on the file.
    if flock -x -n /data/databases/store_lock echo -n; then
      echo "/data/databases/store_lock not locked, continuing"
    else
      echo "/data/databases/store_lock is locked; deleting"
      rm -f /data/databases/store_lock
    fi

which prints "... not locked", yet the Java process still fails afterwards. Is there another way to check the lock file, or should I simply delete it at startup (in my particular case, where no other Neo4j instance operates on that filesystem)?
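
Aside, in case it helps others: my working assumption (not verified) is that Java's FileChannel.lock() takes a POSIX fcntl lock on Linux, while the flock utility uses flock(2), and the two lock types don't see each other on local filesystems, so a flock probe can report the file as free while a fcntl lock is still held. /proc/locks lists both kinds, keyed by major:minor:inode, so a check along these lines should see whatever the JVM holds:

    # Sketch only: look up the file's inode and see whether any kernel
    # lock (POSIX or FLOCK) is registered on it in /proc/locks.
    inode=$(stat -c %i /data/databases/store_lock)
    if grep -q ":$inode " /proc/locks; then
      echo "/data/databases/store_lock is held by some process"
    else
      echo "no kernel lock on /data/databases/store_lock"
    fi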

@jennyowen (Member)

@sdaschner In regular non-Kubernetes Docker, you can just delete the store_lock file if you know that there are no running Neo4j databases using the mounted data volume. However, you specifically asked about Kubernetes and our Helm charts, so I know this doesn't exactly answer your question.
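
For the plain-Docker case, something like this (a minimal sketch, assuming a hypothetical named volume neo4j-data that no other container is using) would clear the stale lock before starting Neo4j:

    # Remove the stale lock from the volume, then start Neo4j.
    # "neo4j-data" is a placeholder volume name for illustration.
    docker run --rm -v neo4j-data:/data alpine rm -f /data/databases/store_lock
    docker run -d --name neo4j -v neo4j-data:/data neo4j:4.2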

Could you re-create this issue on the github repo for the helm charts please? They'll be better able to answer your question / feature request.
https://github.com/neo4j-contrib/neo4j-helm
(I can't transfer the issue myself because it's owned by a different account).

Thanks!
