Bug: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0 #12506

Open
kucharskim opened this issue Nov 23, 2024 · 8 comments
Labels: community (Issues created by community), investigation required

Comments

@kucharskim

Contact Details

This GitHub issue or techops at chorus one

Which network are you using?

mainnet

What happened?

  • downloaded a fresh snapshot from s3://near-protocol-public/backups/mainnet/rpc/${latest}/
  • put it in place in the data/ directory
  • headers synced properly
  • then, when blocks were about to sync, I got:
root@near31:~# journalctl -axlfu neard.service
...
Nov 23 06:47:23 near31 neard[13288]: 2024-11-23T06:47:23.845938Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:24 near31 neard[13288]: 2024-11-23T06:47:24.847105Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:25 near31 neard[13288]: 2024-11-23T06:47:25.848413Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:26 near31 neard[13288]: 2024-11-23T06:47:26.849420Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:27 near31 neard[13288]: 2024-11-23T06:47:27.850416Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:28 near31 neard[13288]: 2024-11-23T06:47:28.851421Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:29 near31 neard[13288]: 2024-11-23T06:47:29.852520Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:30 near31 neard[13288]: 2024-11-23T06:47:30.853449Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:31 near31 neard[13288]: 2024-11-23T06:47:31.854431Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:32 near31 neard[13288]: 2024-11-23T06:47:32.855412Z  WARN garbage collection: Error in gc: GC Error: bloc
Nov 23 06:47:33 near31 neard[13288]: 2024-11-23T06:47:33.856479Z  WARN garbage collection: Error in gc: GC Error: bloc
...

I see a previous issue with the same problem, #11927, but nothing in that GitHub issue indicates that the problem was solved.
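
To check whether the warning is still firing at the same rate or tapering off, the journal output can be bucketed by minute. The following is a minimal sketch, not part of neard: the function name `gc_warnings_per_minute` is hypothetical, and the regular expression assumes only the timestamp and message prefix visible in the log lines above.

```python
import re
from collections import Counter

# Matches the timestamp and message prefix seen in the journal output, e.g.
# "2024-11-23T06:47:23.845938Z  WARN garbage collection: Error in gc: ..."
# The captured group keeps the timestamp down to the minute only.
GC_WARN = re.compile(
    r"(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+Z\s+"
    r"WARN garbage collection: Error in gc"
)


def gc_warnings_per_minute(lines):
    """Count GC warning lines, bucketed by UTC minute (YYYY-MM-DDTHH:MM)."""
    counts = Counter()
    for line in lines:
        match = GC_WARN.search(line)
        if match:
            counts[match.group("minute")] += 1
    return dict(counts)
```

The function accepts any iterable of lines, so the saved output of `journalctl -u neard.service` can be passed in as an open file object.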

Relevant log output

$ /data/neard/neard --version
neard (release 2.3.1) (build 2.3.1) (rustc 1.81.0) (protocol 72) (db 40)
features: [default, json_rpc, rosetta_rpc]
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           124Gi       1.1Gi       1.7Gi       4.0Mi       122Gi       122Gi
Swap:             0B          0B          0B
root@near31:~# systemctl status neard.service | grep Memory
     Memory: 50.7G (max: 122.0G available: 71.2G)
$ df -hl /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      6.7T  1.5T  4.9T  23% /data
@kucharskim
Author

I tried again, and it failed again with s3://near-protocol-public/backups/mainnet/rpc/2024-11-23T00:00:40Z

@kucharskim
Author

Probably should add config.json to this GitHub issue.
near31-rpc-config.json.txt

@sbond14

sbond14 commented Nov 24, 2024

I had this same issue using 2.3.1:

thread 'main' panicked at chain/client/src/client_actor.rs:167:6:
called `Result::unwrap()` on an `Err` value: Chain(DBNotFoundErr("epoch block: 7Hjcdu4AmvkY3ZK6GnsDaz1HsdZ9m1XL6HgxFTpUXshn"))

This is after it just randomly stopped downloading blocks and started outputting a lot of these logs:

WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0

While not getting any new blocks, its RAM usage kept increasing until it OOMed, restarted itself, and failed with the DBNotFoundErr shown above.

@marcelo-gonzalez
Contributor

While the WARN garbage collection error looks like it definitely means something is wrong and should probably be fixed, I think it's possible that the node would eventually work properly if you left it running, because it might be clearing block data before a state sync. This process takes a while, but unfortunately there are no logs that tell you it's going to look stuck for half an hour+.

So maybe try leaving it running for at least an hour or so, and hopefully it will work. Also, we should:

  1. at least warn the user that this is going to happen
  2. look at why this pre-state-sync cleanup needs to be done up front before proceeding. Does it need to be? It feels like no, unless I'm missing something
  3. of course figure out what the GC error is
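
Until such a warning exists, one way to tell a slow cleanup from a node that is genuinely stuck is to record the block height periodically and check whether it has moved over a long enough window. The sketch below covers only that comparison. The helper names `blocks_advanced` and `looks_stuck` are hypothetical, and how the heights are sampled (for example, from the node's own status output) is left out because it depends on the setup.

```python
def blocks_advanced(samples, window_seconds):
    """Return how far the block height moved over the trailing window.

    samples: list of (unix_seconds, block_height) tuples, oldest first.
    Returns None when the samples do not yet span the whole window.
    """
    if not samples:
        return None
    latest_time, latest_height = samples[-1]
    cutoff = latest_time - window_seconds
    baseline = None
    for sample_time, height in samples:
        if sample_time <= cutoff:
            # Keep the newest sample taken at or before the cutoff.
            baseline = height
        else:
            break
    if baseline is None:
        return None
    return latest_height - baseline


def looks_stuck(samples, window_seconds=3600):
    """True only if a full window of samples shows no new blocks."""
    advanced = blocks_advanced(samples, window_seconds)
    return advanced is not None and advanced == 0
```

The default window of one hour mirrors the "at least an hour" suggestion above. A longer window may be more appropriate, since the stall can apparently last several hours.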

@marcelo-gonzalez
Contributor

thread 'main' panicked at chain/client/src/client_actor.rs:167:6:
called `Result::unwrap()` on an `Err` value: Chain(DBNotFoundErr("epoch block: 7Hjcdu4AmvkY3ZK6GnsDaz1HsdZ9m1XL6HgxFTpUXshn"))

It's hard to say, since that log line is unfortunately so opaque, but I think that is most likely an unrelated issue (it of course indicates something is wrong, but probably not because of these WARN garbage collection log lines).

@sbond14

sbond14 commented Nov 26, 2024

After I left the node running, it eventually started syncing after 5-6 hours.

I agree: adding some logging to indicate this could happen would be very useful. In addition, the hardware docs should be updated to recommend 64GB RAM, since 32GB was not sufficient.

@kucharskim
Author

Do you know whether an epoch rolled over during those 5-6 hours? I am wondering whether the time is related to the work done under the hood, or whether it was related to the new epoch.

@sbond14

sbond14 commented Nov 26, 2024

The epoch switch happened less than an hour into its sync. It seems to line up pretty closely with when the node stopped processing blocks.

Then, ~4 hours after the epoch switch, it started processing blocks again.
