Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redis cluster not recovering previously persisted data after host machine restart #1069

Open
MedAzizTousli opened this issue Sep 17, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@MedAzizTousli
Copy link

MedAzizTousli commented Sep 17, 2024

Redis Version: v7.0.12

Hello.

I have deployed a Redis Cluster in my Kubernetes Cluster using ot-helm/redis-operator with the following values:

redisCluster:
  redisSecret:
    secretName: redis-password
    secretKey: REDIS_PASSWORD
  leader:
    replicas: 3
    affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: test
                    operator: In
                    values:
                      - "true"
  follower:
    replicas: 3
    affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: test
                    operator: In
                    values:
                      - "true"
externalService:
  enabled: true
  serviceType: LoadBalancer
  port: 6379
redisExporter:
  enabled: true
storageSpec:
  volumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 10Gi
  nodeConfVolumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 1Gi

After adding a couple of keys to the cluster, I stop the host machine (EC2 instance) where the Redis Cluster is deployed, and start it again. Upon the restart of the EC2 instance, and the Redis Cluster, the couple of keys that I have added before the restart disappear.

I have both persistence methods enabled (RDB & AOF), and this is my configuration (default) for Redis Cluster regarding persistency:

config get dir # /data
config get dbfilename # dump.rdb
config get appendonly # yes
config get appendfilename # appendonly.aof

I have noticed that during/after the addition of the keys/data in Redis, /data/dump.rdb, and /data/appendonlydir/appendonly.aof.1.incr.aof (within my main Redis Cluster leader) increase in size, but when I restart the EC2 instance, /data/dump.rdb get back to 0 bytes, while /data/appendonlydir/appendonly.aof.1.incr.aof stays at the same size that was before the restart.

I can confirm this with this screenshot from my Grafana dashboard while monitoring the persistent volume that was attached to main leader of the Redis Cluster. From what I understood, the volume contains both AOF, and RDB data until few seconds after the restart of Redis Cluster, where RDB data is deleted.

This is the Prometheus metric I am using in case anyone is wondering:

sum(kubelet_volume_stats_used_bytes{namespace="test", persistentvolumeclaim="redis-cluster-leader-redis-cluster-leader-0"}/(1024*1024)) by (persistentvolumeclaim)

So, Redis Cluster is actually backing up the data using RDB, and AOF, but as soon as it is restarted (after the EC2 restart), it loses RDB data, and AOF is not enough to retrieve the keys/data for some reason.

Here are the logs of Redis Cluster when it is restarted:

ACL_MODE is not true, skipping ACL file modification
Starting redis service in cluster mode.....
12:C 17 Sep 2024 00:49:39.351 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
12:C 17 Sep 2024 00:49:39.351 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=12, just started
12:C 17 Sep 2024 00:49:39.351 # Configuration loaded
12:M 17 Sep 2024 00:49:39.352 * monotonic clock: POSIX clock_gettime
12:M 17 Sep 2024 00:49:39.353 * Node configuration loaded, I'm ef200bc9befd1c4fb0f6e5acbb1432002a7c2822
12:M 17 Sep 2024 00:49:39.353 * Running mode=cluster, port=6379.
12:M 17 Sep 2024 00:49:39.353 # Server initialized
12:M 17 Sep 2024 00:49:39.355 * Reading RDB base file on AOF loading...
12:M 17 Sep 2024 00:49:39.355 * Loading RDB produced by version 7.0.12
12:M 17 Sep 2024 00:49:39.355 * RDB age 2469 seconds
12:M 17 Sep 2024 00:49:39.355 * RDB memory usage when created 1.51 Mb
12:M 17 Sep 2024 00:49:39.355 * RDB is base AOF
12:M 17 Sep 2024 00:49:39.355 * Done loading RDB, keys loaded: 0, keys expired: 0.
12:M 17 Sep 2024 00:49:39.355 * DB loaded from base file appendonly.aof.1.base.rdb: 0.001 seconds
12:M 17 Sep 2024 00:49:39.598 * DB loaded from incr file appendonly.aof.1.incr.aof: 0.243 seconds
12:M 17 Sep 2024 00:49:39.598 * DB loaded from append only file: 0.244 seconds
12:M 17 Sep 2024 00:49:39.598 * Opening AOF incr file appendonly.aof.1.incr.aof on server start
12:M 17 Sep 2024 00:49:39.599 * Ready to accept connections
12:M 17 Sep 2024 00:49:41.611 # Cluster state changed: ok
12:M 17 Sep 2024 00:49:46.592 # Cluster state changed: fail
12:M 17 Sep 2024 00:50:02.258 * DB saved on disk
12:M 17 Sep 2024 00:50:21.376 # Cluster state changed: ok
12:M 17 Sep 2024 00:51:26.284 * Replica 192.168.58.43:6379 asks for synchronization
12:M 17 Sep 2024 00:51:26.284 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '995d7ac6eedc09d95c4fc184519686e9dc8f9b41', my replication IDs are '654e768d51433cc24667323f8f884c66e8e55566' and '0000000000000000000000000000000000000000')
12:M 17 Sep 2024 00:51:26.284 * Replication backlog created, my new replication IDs are 'de979d9aa433bf37f413a64aff751ed677794b00' and '0000000000000000000000000000000000000000'
12:M 17 Sep 2024 00:51:26.284 * Delay next BGSAVE for diskless SYNC
12:M 17 Sep 2024 00:51:31.195 * Starting BGSAVE for SYNC with target: replicas sockets
12:M 17 Sep 2024 00:51:31.195 * Background RDB transfer started by pid 218
218:C 17 Sep 2024 00:51:31.196 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
12:M 17 Sep 2024 00:51:31.196 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
12:M 17 Sep 2024 00:51:31.202 * Background RDB transfer terminated with success
12:M 17 Sep 2024 00:51:31.202 * Streamed RDB transfer with replica 192.168.58.43:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
12:M 17 Sep 2024 00:51:31.203 * Synchronization with replica 192.168.58.43:6379 succeeded

Here is the output of INFO PERSISTENCE redis-cli command, after the addition of some data:

# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1726552373
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_saves:5
rdb_last_cow_size:1093632
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:0
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:0
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:37092089
aof_base_size:89
aof_pending_rewrite:0
aof_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

In case anyone is wondering, the persistent volume is attached correctly to the Redis Cluster in /data mount path. Here is a snippet of the YAML definition of the main Redis Cluster leader (this is automatically generated via Helm & Redis Operator):

apiVersion: v1
kind: Pod
metadata:
  name: redis-cluster-leader-0
  namespace: test
[...]
spec:
  containers:
    [...]
    volumeMounts:
    - mountPath: /node-conf
      name: node-conf
    - mountPath: /data
      name: redis-cluster-leader
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-7ds8c
      readOnly: true
  [...]
  volumes:
  - name: node-conf
    persistentVolumeClaim:
      claimName: node-conf-redis-cluster-leader-0
  - name: redis-cluster-leader
    persistentVolumeClaim:
      claimName: redis-cluster-leader-redis-cluster-leader-0
[...]

I have already spent a couple of days on this issue, and I kind of looked everywhere, but in vain. I would appreciate any kind of help guys. I will also be available in case any additional information is needed. Thank you very much.

@MedAzizTousli MedAzizTousli added the bug Something isn't working label Sep 17, 2024
@MedAzizTousli
Copy link
Author

I noticed that /data/appendonlydir/appendonly.aof.1.incr.aof contains FLUSHALL command at the end, so, I solved the issue by adding rename-command FLUSHALL "" to my redis.conf.

@shubham-cmyk
Copy link
Member

@MedAzizTousli can you close this issue if that one is solved

@MedAzizTousli
Copy link
Author

Doesn't this need a permanent fix though? Why would Redis include a FLUSHALL command using AOF/RDB?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants