Redis cluster not recovering previously persisted data after host machine restart #1069

MedAzizTousli · 2024-09-17T06:29:02Z

Redis Version: v7.0.12

Hello.

I have deployed a Redis Cluster in my Kubernetes Cluster using ot-helm/redis-operator with the following values:

redisCluster:
  redisSecret:
    secretName: redis-password
    secretKey: REDIS_PASSWORD
  leader:
    replicas: 3
    affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: test
                    operator: In
                    values:
                      - "true"
  follower:
    replicas: 3
    affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: test
                    operator: In
                    values:
                      - "true"
externalService:
  enabled: true
  serviceType: LoadBalancer
  port: 6379
redisExporter:
  enabled: true
storageSpec:
  volumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 10Gi
  nodeConfVolumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 1Gi

After adding a couple of keys to the cluster, I stop the host machine (EC2 instance) where the Redis Cluster is deployed, and start it again. Upon the restart of the EC2 instance, and the Redis Cluster, the couple of keys that I have added before the restart disappear.

I have both persistence methods enabled (RDB & AOF), and this is my configuration (default) for Redis Cluster regarding persistency:

config get dir # /data
config get dbfilename # dump.rdb
config get appendonly # yes
config get appendfilename # appendonly.aof

I have noticed that during/after the addition of the keys/data in Redis, /data/dump.rdb, and /data/appendonlydir/appendonly.aof.1.incr.aof (within my main Redis Cluster leader) increase in size, but when I restart the EC2 instance, /data/dump.rdb get back to 0 bytes, while /data/appendonlydir/appendonly.aof.1.incr.aof stays at the same size that was before the restart.

I can confirm this with this screenshot from my Grafana dashboard while monitoring the persistent volume that was attached to main leader of the Redis Cluster. From what I understood, the volume contains both AOF, and RDB data until few seconds after the restart of Redis Cluster, where RDB data is deleted.

This is the Prometheus metric I am using in case anyone is wondering:

sum(kubelet_volume_stats_used_bytes{namespace="test", persistentvolumeclaim="redis-cluster-leader-redis-cluster-leader-0"}/(1024*1024)) by (persistentvolumeclaim)

So, Redis Cluster is actually backing up the data using RDB, and AOF, but as soon as it is restarted (after the EC2 restart), it loses RDB data, and AOF is not enough to retrieve the keys/data for some reason.

Here are the logs of Redis Cluster when it is restarted:

ACL_MODE is not true, skipping ACL file modification
Starting redis service in cluster mode.....
12:C 17 Sep 2024 00:49:39.351 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
12:C 17 Sep 2024 00:49:39.351 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=12, just started
12:C 17 Sep 2024 00:49:39.351 # Configuration loaded
12:M 17 Sep 2024 00:49:39.352 * monotonic clock: POSIX clock_gettime
12:M 17 Sep 2024 00:49:39.353 * Node configuration loaded, I'm ef200bc9befd1c4fb0f6e5acbb1432002a7c2822
12:M 17 Sep 2024 00:49:39.353 * Running mode=cluster, port=6379.
12:M 17 Sep 2024 00:49:39.353 # Server initialized
12:M 17 Sep 2024 00:49:39.355 * Reading RDB base file on AOF loading...
12:M 17 Sep 2024 00:49:39.355 * Loading RDB produced by version 7.0.12
12:M 17 Sep 2024 00:49:39.355 * RDB age 2469 seconds
12:M 17 Sep 2024 00:49:39.355 * RDB memory usage when created 1.51 Mb
12:M 17 Sep 2024 00:49:39.355 * RDB is base AOF
12:M 17 Sep 2024 00:49:39.355 * Done loading RDB, keys loaded: 0, keys expired: 0.
12:M 17 Sep 2024 00:49:39.355 * DB loaded from base file appendonly.aof.1.base.rdb: 0.001 seconds
12:M 17 Sep 2024 00:49:39.598 * DB loaded from incr file appendonly.aof.1.incr.aof: 0.243 seconds
12:M 17 Sep 2024 00:49:39.598 * DB loaded from append only file: 0.244 seconds
12:M 17 Sep 2024 00:49:39.598 * Opening AOF incr file appendonly.aof.1.incr.aof on server start
12:M 17 Sep 2024 00:49:39.599 * Ready to accept connections
12:M 17 Sep 2024 00:49:41.611 # Cluster state changed: ok
12:M 17 Sep 2024 00:49:46.592 # Cluster state changed: fail
12:M 17 Sep 2024 00:50:02.258 * DB saved on disk
12:M 17 Sep 2024 00:50:21.376 # Cluster state changed: ok
12:M 17 Sep 2024 00:51:26.284 * Replica 192.168.58.43:6379 asks for synchronization
12:M 17 Sep 2024 00:51:26.284 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '995d7ac6eedc09d95c4fc184519686e9dc8f9b41', my replication IDs are '654e768d51433cc24667323f8f884c66e8e55566' and '0000000000000000000000000000000000000000')
12:M 17 Sep 2024 00:51:26.284 * Replication backlog created, my new replication IDs are 'de979d9aa433bf37f413a64aff751ed677794b00' and '0000000000000000000000000000000000000000'
12:M 17 Sep 2024 00:51:26.284 * Delay next BGSAVE for diskless SYNC
12:M 17 Sep 2024 00:51:31.195 * Starting BGSAVE for SYNC with target: replicas sockets
12:M 17 Sep 2024 00:51:31.195 * Background RDB transfer started by pid 218
218:C 17 Sep 2024 00:51:31.196 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
12:M 17 Sep 2024 00:51:31.196 # Diskless rdb transfer, done reading from pipe, 1 replicas still up.
12:M 17 Sep 2024 00:51:31.202 * Background RDB transfer terminated with success
12:M 17 Sep 2024 00:51:31.202 * Streamed RDB transfer with replica 192.168.58.43:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
12:M 17 Sep 2024 00:51:31.203 * Synchronization with replica 192.168.58.43:6379 succeeded

Here is the output of INFO PERSISTENCE redis-cli command, after the addition of some data:

# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1726552373
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_saves:5
rdb_last_cow_size:1093632
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:0
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:0
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:37092089
aof_base_size:89
aof_pending_rewrite:0
aof_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

In case anyone is wondering, the persistent volume is attached correctly to the Redis Cluster in /data mount path. Here is a snippet of the YAML definition of the main Redis Cluster leader (this is automatically generated via Helm & Redis Operator):

apiVersion: v1
kind: Pod
metadata:
  name: redis-cluster-leader-0
  namespace: test
[...]
spec:
  containers:
    [...]
    volumeMounts:
    - mountPath: /node-conf
      name: node-conf
    - mountPath: /data
      name: redis-cluster-leader
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-7ds8c
      readOnly: true
  [...]
  volumes:
  - name: node-conf
    persistentVolumeClaim:
      claimName: node-conf-redis-cluster-leader-0
  - name: redis-cluster-leader
    persistentVolumeClaim:
      claimName: redis-cluster-leader-redis-cluster-leader-0
[...]

I have already spent a couple of days on this issue, and I kind of looked everywhere, but in vain. I would appreciate any kind of help guys. I will also be available in case any additional information is needed. Thank you very much.

The text was updated successfully, but these errors were encountered:

MedAzizTousli · 2024-09-19T02:11:16Z

I noticed that /data/appendonlydir/appendonly.aof.1.incr.aof contains FLUSHALL command at the end, so, I solved the issue by adding rename-command FLUSHALL "" to my redis.conf.

shubham-cmyk · 2024-09-19T11:14:10Z

@MedAzizTousli can you close this issue if that one is solved

MedAzizTousli · 2024-09-22T15:08:49Z

Doesn't this need a permanent fix though? Why would Redis include a FLUSHALL command using AOF/RDB?

MedAzizTousli added the bug Something isn't working label Sep 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redis cluster not recovering previously persisted data after host machine restart #1069

Redis cluster not recovering previously persisted data after host machine restart #1069

MedAzizTousli commented Sep 17, 2024 •

edited

Loading

MedAzizTousli commented Sep 19, 2024

shubham-cmyk commented Sep 19, 2024

MedAzizTousli commented Sep 22, 2024

Redis cluster not recovering previously persisted data after host machine restart #1069

Redis cluster not recovering previously persisted data after host machine restart #1069

Comments

MedAzizTousli commented Sep 17, 2024 • edited Loading

MedAzizTousli commented Sep 19, 2024

shubham-cmyk commented Sep 19, 2024

MedAzizTousli commented Sep 22, 2024

MedAzizTousli commented Sep 17, 2024 •

edited

Loading