`site unroll not supported for nSpin = 2 nColor = 32` in coarse-grid-deflated MG #1378

kostrzewa · 2023-05-16T10:31:25Z

We ran into this issue quite some time ago but haven't had time to report it until now.

When running coarse-grid-deflated MG from within tmLQCD using a relatively "recent" commit of QUDA's develop branch (32bb266) I encounter:

MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 19, host nid006334, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())
MG level 2 (GPU):        last kernel called was (name=N4quda7RNGInitE,volume=4x4x2x4,aux=GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)

I know that switching to much older commits "solves" this, so that's something we can explore if necessary (I don't know how compatible the current version of our interface is with these older QUDA versions).

I'm testing with higher verbosity to see what's going on but perhaps you might already have a change in mind from the past couple of months which could have caused this?

kostrzewa · 2023-05-16T10:34:28Z

Note that everything works fine when I disable coarse-grid deflation.

The failure seems to occur during the launch of the eigensolver:

[...]
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): ....done computing Y field
MG level 1 (GPU): Computing Yhat field......
MG level 1 (GPU): ....done computing Yhat field
MG level 2 (GPU): Using randStateMRG32k3a
MG level 2 (GPU): Tuned block=(128,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.00 GB/s for N4quda7RNGInitE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): Creating level 2
MG level 2 (GPU): Creating smoother
MG level 2 (GPU): Smoother done
MG level 2 (GPU): Setup of level 2 done
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host nid005036, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())

kostrzewa · 2023-05-16T11:33:50Z

> Not that everything works fine when I disable coarse-grid deflation.

This was meant to read "Note" :)

kostrzewa · 2023-05-26T10:54:11Z

Some more context:

# QUDA: QUDA Inverter Parameters:
# QUDA: struct_size = -2147483648
# QUDA: dslash_type = 10
# QUDA: inv_type = 2
# QUDA: kappa = 0.139427
# QUDA: mu = -0.00072
# QUDA: twist_flavor = 1
# QUDA: tm_rho = 0
# QUDA: tol = 1e-10
# QUDA: residual_type = 1
# QUDA: maxiter = 250
# QUDA: reliable_delta = 0.01
# QUDA: reliable_delta_refinement = 0.0001
# QUDA: use_alternative_reliable = 0
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: solution_accumulator_pipeline = 1
# QUDA: max_res_increase = 10
# QUDA: max_res_increase_total = 40
# QUDA: max_hq_res_increase = 1
# QUDA: max_hq_res_restart_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 24
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: split_grid[d] = 1
# QUDA: split_grid[d] = 1
# QUDA: split_grid[d] = 1
# QUDA: split_grid[d] = 1
# QUDA: num_src_per_sub_partition = 1
# QUDA: compute_action = 0
# QUDA: compute_true_res = 1
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: cuda_prec_refinement_sloppy = 8
# QUDA: cuda_prec_precondition = 2
# QUDA: cuda_prec_eigensolver = 2
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 2
# QUDA: gamma_basis = 2
# QUDA: dirac_order = 1
# QUDA: gcrNkrylov = 24
# QUDA: madwf_param_load = 0
# QUDA: madwf_param_save = 0
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: struct_size = -2147483648
# QUDA: clover_location = 2
# QUDA: clover_cpu_prec = 8
# QUDA: clover_cuda_prec = 8
# QUDA: clover_cuda_prec_sloppy = 4
# QUDA: clover_cuda_prec_refinement_sloppy = 8
# QUDA: clover_cuda_prec_precondition = 2
# QUDA: clover_cuda_prec_eigensolver = 2
# QUDA: compute_clover_trlog = 1
# QUDA: compute_clover = 1
# QUDA: compute_clover_inverse = 1
# QUDA: return_clover = 0
# QUDA: return_clover_inverse = 0
# QUDA: clover_rho = 0
# QUDA: clover_coeff = 0.235631
# QUDA: clover_csw = 0
# QUDA: clover_order = 9
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: cuda_prec_ritz = 4
# QUDA: n_ev = 8
# QUDA: max_search_dim = 64
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 1
# QUDA: eigcg_max_restarts = 4
# QUDA: max_restart_num = 3
# QUDA: tol_restart = 5e-05
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
# QUDA: chrono_use_resident = 0
# QUDA: chrono_make_resident = 0
# QUDA: chrono_replace_last = 0
# QUDA: chrono_max_dim = 0
# QUDA: chrono_index = 0
# QUDA: chrono_precision = 4
# QUDA: extlib_type = 1
# QUDA: native_blas_lapack = 1
# QUDA: use_mobius_fused_kernel = 1

and the setup process seems to work fine:

[...]
MG level 0 (GPU): CG: Convergence at 316 iterations, L2 relative residual: iterated = 4.996240e-07, true = 4.996240e-07 (requested = 5.000000e-07)
MG level 0 (GPU): Computing Y field......
MG level 0 (GPU): ....done computing Y field
MG level 0 (GPU): Computing Yhat field......
MG level 0 (GPU): ....done computing Yhat field
MG level 1 (GPU): WARNING: Exceeded maximum iterations 1500
MG level 1 (GPU): CG: Convergence at 1500 iterations, L2 relative residual: iterated = 1.924570e-06, true = 1.935714e-06 (requested = 5.000000e-07)
[...]
MG level 1 (GPU): WARNING: Exceeded maximum iterations 1500
MG level 1 (GPU): CG: Convergence at 1500 iterations, L2 relative residual: iterated = 2.006127e-06, true = 2.013764e-06 (requested = 5.000000e-07)
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): ....done computing Y field
MG level 1 (GPU): Computing Yhat field......
MG level 1 (GPU): ....done computing Yhat field
MG level 2 (GPU): Using randStateMRG32k3a
MG level 2 (GPU): Tuned block=(64,1,1), grid=(2,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.00 GB/s for N4quda7RNGInitE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): Creating level 2
MG level 2 (GPU): Creating smoother
MG level 2 (GPU): Smoother done
MG level 2 (GPU): Setup of level 2 done
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host nid005910, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())
MG level 2 (GPU):        last kernel called was (name=N4quda7RNGInitE,volume=4x4x2x4,aux=GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)
Local seed is 144041342  proc_id = 2

where I used CUDA_LAUNCH_BLOCKING=1.

kostrzewa · 2023-05-26T11:17:24Z

DEBUG_VERBOSE on level 2:

MG level 2 (GPU): Using randStateMRG32k3a
MG level 2 (GPU): Allocated array of random numbers with size: 0.01 MB
MG level 2 (GPU): PreTune N4quda7RNGInitE
MG level 2 (GPU): Tuning N4quda7RNGInitE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1 at vol=8x4x2x4
MG level 2 (GPU): About to call tunable.apply block=(64,1,1) grid=(2,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(64,1,1), grid=(2,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(128,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(128,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(192,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(192,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(256,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(256,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(320,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(320,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(384,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(384,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(448,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(448,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(512,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(512,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(576,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(576,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(640,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(640,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(704,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(704,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(768,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(768,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(832,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(832,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(896,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(896,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(960,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(960,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(1024,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(1024,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(64,2,1) grid=(2,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(64,2,1), grid=(2,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(128,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): C   block=(128,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(192,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(192,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(256,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(256,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(320,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(320,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(384,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(384,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(448,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(448,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): About to call tunable.apply block=(512,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU):     block=(512,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives unspecified launch failure
MG level 2 (GPU): Candidate tuning finished for N4quda7RNGInitE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1. Best time 0.000016 and now continuing with 62 iterations.
MG level 2 (GPU): About to call tunable.apply block=(192,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(192,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(256,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(256,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(64,1,1) grid=(2,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(64,1,1), grid=(2,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(128,1,1) grid=(1,2,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(128,1,1), grid=(1,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(64,2,1) grid=(2,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(64,2,1), grid=(2,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): About to call tunable.apply block=(128,2,1) grid=(1,1,1) shared_bytes=0 aux=(-1,-1,-1,-1)
MG level 2 (GPU): T   block=(128,2,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) gives 0.00 Gflop/s, 0.00 GB/s
MG level 2 (GPU): Tuned block=(64,1,1), grid=(2,2,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.00 GB/s for N4quda7RNGInitE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): PostTune N4quda7RNGInitE
MG level 2 (GPU): Creating level 2
MG level 2 (GPU): Creating smoother
MG level 2 (GPU): Smoother done
MG level 2 (GPU): Setup of level 2 done
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host nid005327, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())
MG level 2 (GPU):        last kernel called was (name=N4quda7RNGInitE,volume=4x4x2x4,aux=GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)
Local seed is 144041342  proc_id = 2

kostrzewa · 2023-05-26T11:45:09Z

Increasing the level of verbosity step-by-step, in particular increasing the verbosity on level 1, reveals some more details on where this is failing (likely because of buffers being flushed more frequently):

MG level 2 (GPU): PostTune N4quda7RNGInitE
MG level 2 (GPU): Creating level 2
MG level 2 (GPU): Creating smoother
MG level 2 (GPU): Smoother done
MG level 2 (GPU): Setup of level 2 done
MG level 1 (GPU): Creating coarse solver wrapper
MG level 1 (GPU): Creating a CA-GCR solver
MG level 1 (GPU): Tuned block=(64,1,1), grid=(2,2,1), shared_bytes=6401, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 4.93 GB/s for N4quda11SpinorNoiseIfLi2ELi32EEE with GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,uniform
MG level 2 (GPU): Tuned block=(64,1,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 23.71 GB/s for hipMemsetAsync with zero,color_spinor_field.cpp,436
MG level 2 (GPU): Tuned block=(64,1,4), grid=(16,1,16), shared_bytes=8001, aux=(8,1,1,1) giving 548.10 Gflop/s, 292.32 GB/s for N4quda12DslashCoarseIfssLi2ELi32ELb0ELb1ELb0ELNS_10DslashTypeE2EEE with policy_kernel,GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,comm=0111,full,halo=00111111,n_rhs=1
MG level 2 (GPU): Tuned block=(64,1,4), grid=(16,1,16), shared_bytes=4572, aux=(8,1,1,1) giving 559.01 Gflop/s, 298.14 GB/s for N4quda12DslashCoarseIfssLi2ELi32ELb0ELb1ELb0ELNS_10DslashTypeE2EEE with policy_kernel,GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,comm=0111,full,halo=00222222,n_rhs=1
MG level 2 (GPU): Tuned block=(64,1,1), grid=(1,1,1), shared_bytes=0, aux=(0,0,0,0) giving 545.83 Gflop/s, 291.11 GB/s for N4quda22DslashCoarsePolicyTuneINS_18DslashCoarseLaunchILb0ELi32EEEEE with policy,clover,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,gauge_prec=2,halo_prec=2,comm=0111,topo=1244,p2p=0,gdr=1,nvshmem=0,pol=11110011111,full,n_rhs=1
MG level 2 (GPU): Tuned block=(16,16,1), grid=(16,1,1), shared_bytes=4096, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.97 GB/s for N4quda9GhostPackIfsL16QudaFieldOrder_s2ELi2ELi32EEE with GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,halo_prec=2,comm=0111,topo=1244,dest=00111111,nFace=1,spins_per_thread=2,colors_per_thread=2,shmem=0,batched
MG level 2 (GPU): Tuned block=(64,1,4), grid=(8,1,32), shared_bytes=16384, aux=(4,2,1,1) giving 1003.82 Gflop/s, 519.81 GB/s for N4quda12DslashCoarseIfssLi2ELi32ELb1ELb0ELb0ELNS_10DslashTypeE2EEE with policy_kernel,GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,comm=0111,full,halo=00111111,n_rhs=1
MG level 2 (GPU): Tuned block=(16,16,1), grid=(16,1,1), shared_bytes=4096, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.97 GB/s for N4quda9GhostPackIfsL16QudaFieldOrder_s2ELi2ELi32EEE with GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,halo_prec=2,comm=0111,topo=1244,dest=00222222,nFace=1,spins_per_thread=2,colors_per_thread=2,shmem=0,batched
MG level 2 (GPU): Tuned block=(64,1,4), grid=(8,1,16), shared_bytes=8001, aux=(4,1,1,1) giving 269.55 Gflop/s, 139.58 GB/s for N4quda12DslashCoarseIfssLi2ELi32ELb1ELb0ELb0ELNS_10DslashTypeE2EEE with policy_kernel,GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,comm=0111,full,halo=00222222,n_rhs=1
MG level 2 (GPU): Tuned block=(64,1,1), grid=(1,1,1), shared_bytes=0, aux=(1,0,0,0) giving 73.15 Gflop/s, 37.88 GB/s for N4quda22DslashCoarsePolicyTuneINS_18DslashCoarseLaunchILb0ELi32EEEEE with policy,dslash,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1,gauge_prec=2,halo_prec=2,comm=0111,topo=1244,p2p=0,gdr=1,nvshmem=0,pol=11110011111,full,n_rhs=1
MG level 2 (GPU): Tuned block=(192,1,1), grid=(220,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 9.40 Gflop/s, 37.62 GB/s for N4quda4blas7axpbyz_IfEE with GPU-offline,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): Tuned block=(256,1,1), grid=(2,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 1.90 Gflop/s, 3.79 GB/s for N4quda4blas5Norm2IdfEE with GPU-offline,nParity=1,vol=128,parity=1,precision=4,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): Creating TR Lanczos eigensolver
MG level 2 (GPU): Tuned block=(64,1,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 15.64 GB/s for hipMemsetAsync with zero,color_spinor_field.cpp,436
MG level 2 (GPU): Running eigensolver in half precision
MG level 2 (GPU): Using randStateMRG32k3a
MG level 2 (GPU): Tuned block=(128,1,1), grid=(1,1,1), shared_bytes=0, aux=(-1,-1,-1,-1) giving 0.00 Gflop/s, 0.00 GB/s for N4quda7RNGInitE with GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host nid006249, reduce_quda.cu:76 in virtual void quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(const quda::qudaStream_t &) [Reducer = quda::blas::Norm2, store_t = short, y_store_t = short, nSpin = 4, coeff_t = double]())
MG level 2 (GPU):        last kernel called was (name=N4quda7RNGInitE,volume=4x4x2x4,aux=GPU-offline,vol=128,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)

At least from the last few lines here, it appears that the issue occurs in the eigensolver. I'm now running with a global QUDA_DEBUG_VERBOSE to see exactly what is happening.

kostrzewa · 2023-05-26T12:43:57Z

After adding a manual debug statement I've figured out that the issue comes from here:

quda/lib/eigensolve_quda.cpp

Lines 152 to 175 in 68d7d20

    
    void EigenSolver::prepareInitialGuess(std::vector<ColorSpinorField> &kSpace)
    {
      // Use 0th vector to extract meta data for the RNG.
      RNG rng(kSpace[0], 1234);
      for (int b = 0; b < block_size; b++) {
        // If the spinor contains initial data from the user
        // preserve it, else populate with rands.
        if (sqrt(blas::norm2(kSpace[b])) == 0.0) { spinorNoise(kSpace[b], rng, QUDA_NOISE_UNIFORM); }
      }
      bool orthed = false;
      int k = 0;
      while (!orthed && k < max_ortho_attempts) {
        orthonormalizeHMGS(kSpace, ortho_block_size, block_size);
        if (block_size > 1) {
          logQuda(QUDA_SUMMARIZE, "Orthonormalising initial guesses with Modified Gram Schmidt, iter k=%d\n", k);
        } else {
          logQuda(QUDA_SUMMARIZE, "Orthonormalising initial guess\n");
        }
        orthed = orthoCheck(kSpace, block_size);
        k++;
      }
      if (!orthed) errorQuda("Failed to orthonormalise initial guesses with %d orthonormalisation attempts (max = %d)", k+1, max_ortho_attempts);
    }

In particular, the issue seems to be with blas::norm2(kSpace[b]).

prepareInitialGuess(kSpace) is called from

quda/lib/eig_trlm.cpp

Lines 40 to 58 in 68d7d20

    
    void TRLM::operator()(std::vector<ColorSpinorField> &kSpace, std::vector<Complex> &evals)
    {
      // Override any user input for block size.
      block_size = 1;
      // Pre-launch checks and preparation
      //---------------------------------------------------------------------------
      queryPrec(kSpace[0].Precision());
      // Check to see if we are loading eigenvectors
      if (strcmp(eig_param->vec_infile, "") != 0) {
        logQuda(QUDA_VERBOSE, "Loading evecs from file name %s\n", eig_param->vec_infile);
        loadFromFile(kSpace, evals);
        return;
      }
      // Check for an initial guess. If none present, populate with rands, then
      // orthonormalise
      prepareInitialGuess(kSpace);

as far as I can tell.

I'm wondering if the check in

quda/lib/reduce_quda.cu

Lines 72 to 77 in 68d7d20

    
    void apply(const qudaStream_t &stream) override
    {
      constexpr bool site_unroll_check = !std::is_same<store_t, y_store_t>::value || isFixed<store_t>::value || decltype(r)::site_unroll;
      if (site_unroll_check && (x.Ncolor() != 3 || x.Nspin() == 2))
        errorQuda("site unroll not supported for nSpin = %d nColor = %d", x.Nspin(), x.Ncolor());

is warranted if the eigensolver (and hence blas::norm2) is to be used on the coarse operator. On the other hand, it has been in place for a long time, also when the coarse-deflated MG was working IIRC:

927d04d1a0 (Dean Howarth   2020-05-28 05:35:20 -0700  72)       void apply(const qudaStream_t &stream)
fe7252cba2 (Mathias Wagner 2019-04-04 23:23:12 +0200  73)       {
073f2d93cf (maddyscientist 2020-07-01 08:26:03 -0700  74)         constexpr bool site_unroll_check = !std::is_same<store_t, y_store_t>::value || isFixed<store_t>::value || decltype(r)::site_unroll;
073f2d93cf (maddyscientist 2020-07-01 08:26:03 -0700  75)         if (site_unroll_check && (x.Ncolor() != 3 || x.Nspin() == 2))
cb485e74b4 (maddyscientist 2020-06-17 13:49:25 -0700  76)           errorQuda("site unroll not supported for nSpin = %d nColor = %d", x.Nspin(), x.Ncolor());
cb485e74b4 (maddyscientist 2020-06-17 13:49:25 -0700  77)

and I see no changes in multigrid.cpp that would suggest anything relevant was changed...
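For reference, here is a minimal standalone sketch of how the condition quoted above evaluates for the field types seen in the error message. It is not QUDA code: `is_fixed` is a hypothetical stand-in for QUDA's `isFixed` trait, `would_error` is an illustrative helper, and `decltype(r)::site_unroll` is assumed to be false for a plain norm.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for QUDA's isFixed trait: true for fixed-point
// storage types (short for half, signed char for quarter), false otherwise.
template <typename T> struct is_fixed : std::false_type {};
template <> struct is_fixed<short> : std::true_type {};
template <> struct is_fixed<signed char> : std::true_type {};

// Mirrors the compile-time part of the quoted condition. The template
// parameter reducer_site_unroll stands in for decltype(r)::site_unroll
// and is ASSUMED to be false for Norm2.
template <typename store_t, typename y_store_t, bool reducer_site_unroll = false>
constexpr bool site_unroll_check()
{
  return !std::is_same<store_t, y_store_t>::value || is_fixed<store_t>::value || reducer_site_unroll;
}

// Illustrative helper: true when apply() would reach errorQuda for a field
// with the given number of spins and colors.
template <typename store_t, typename y_store_t>
constexpr bool would_error(int n_spin, int n_color)
{
  return site_unroll_check<store_t, y_store_t>() && (n_color != 3 || n_spin == 2);
}

// Coarse field (nSpin = 2, nColor = 32) stored in half precision: rejected.
static_assert(would_error<short, short>(2, 32), "half-precision coarse field trips the check");
// The same coarse field in single precision passes.
static_assert(!would_error<float, float>(2, 32), "single-precision coarse field is accepted");
// A fine-grid field (nSpin = 4, nColor = 3) in half precision also passes.
static_assert(!would_error<short, short>(4, 3), "half-precision fine field is accepted");
```

Under these assumptions the check only fires when the coarse vector is stored in a fixed-point type, which matches `store_t = short` in the error message and `Running eigensolver in half precision` in the log.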

maddyscientist · 2023-05-30T20:57:37Z

Hi @kostrzewa. This issue looks like a precision one I think: I don't think we should ever be using half precision on the coarse grids here. Can you enable QUDA_BACKWARDS=ON so I can see exactly where this is being called?

FWIW: the "site unrolling" refers to the fact that the entire site (all spin and color for a given site in spacetime) is handled by a single thread.
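To put a rough number on that, the sketch below counts the real values a single thread would have to handle per site. It is a back-of-the-envelope illustration only: `reals_per_site` is a hypothetical helper, not a QUDA function, and it says nothing about how many registers a real kernel would use.

```cpp
#include <cassert>
#include <cstddef>

// Number of real values in one site of a color-spinor field:
// n_spin * n_color complex numbers, two reals each.
constexpr std::size_t reals_per_site(int n_spin, int n_color)
{
  return static_cast<std::size_t>(n_spin) * static_cast<std::size_t>(n_color) * 2;
}

// Fine grid: 4 spins x 3 colors.
static_assert(reals_per_site(4, 3) == 24, "fine-grid site holds 24 reals");
// Coarse grid from this issue: 2 spins x 32 colors.
static_assert(reals_per_site(2, 32) == 128, "coarse-grid site holds 128 reals");
```

A coarse site with nSpin = 2 and nColor = 32 is therefore more than five times the size of a fine-grid site.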

kostrzewa · 2023-06-09T12:09:09Z

> This issue looks like a precision one I think: I don't think we should ever be using half precision on the coarse grids here.

Thanks for this hint! Setting all [clover_]cuda_prec_precondition and [clover_]cuda_prec_eigensolver to single precision does indeed resolve the problem. This seems to be somewhat inconsistent with https://github.com/lattice/quda/wiki/Twisted-clover-deflated-multigrid#improvement-2-using-coarse-level-deflation, however, where --prec-precondition half is passed, while *_prec_eigensolver does not seem to be set explicitly at all. I was also under the (apparently false) impression that it would make the most sense to run the coarse eigensolver in half precision as the level of convergence required is rather low (residual 1e-4 or so).

I'm aware of course that the Wiki page will be three years old in two weeks, so it might well have grown inconsistent. For example, n-conv is also not set, while it appears to be required now.

> Can you enable QUDA_BACKWARDS=ON so I can see exactly where this is being called?

Will do and report back, hopefully soon.

kostrzewa · 2023-06-09T12:13:57Z

> FWIW: the "site unrolling" refers to the fact that the entire site (all spin and color for a given site in spacetime) is handled by a single thread.

Thanks. How come this is being done on the coarsest grid?

kostrzewa · 2023-06-19T08:24:47Z

Very useful, will certainly use backward-cpp in the future!

#16   Object "libquda.so", at 0x14fa3aef8e8d, in newMultigridQuda
#15   Object "libquda.so", at 0x14fa3aef6995, in quda::multigrid_solver::multigrid_solver(QudaMultigridParam_s&, quda::TimeProfile&)
#14   Object "libquda.so", at 0x14fa3ae6f83c, in quda::MG::MG(quda::MGParam&, quda::TimeProfile&)
#13   Object "libquda.so", at 0x14fa3ae73c94, in quda::MG::reset(bool)
#12   Object "libquda.so", at 0x14fa3ae6f83c, in quda::MG::MG(quda::MGParam&, quda::TimeProfile&)
#11   Object "libquda.so", at 0x14fa3ae740c8, in quda::MG::reset(bool)
#10   Object "libquda.so", at 0x14fa3ae78de8, in quda::MG::createCoarseSolver()
#9    Object "libquda.so", at 0x14fa3ae7c323, in quda::PreconditionedSolver::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&)
#8    Object "libquda.so", at 0x14fa3aeb43ad, in quda::CAGCR::operator()(quda::ColorSpinorField&, quda::ColorSpinorField&)
#7    Object "libquda.so", at 0x14fa3ae41bd8, in quda::TRLM::operator()(std::vector<quda::ColorSpinorField, std::allocator<quda::ColorSpinorField> >&, std::vector<std::complex<double>, std::allocator<std::complex<double> > >&)
#6    Object "libquda.so", at 0x14fa3ae5c5de, in quda::EigenSolver::prepareInitialGuess(std::vector<quda::ColorSpinorField, std::allocator<quda::ColorSpinorField> >&)
#5    Object "libquda.so", at 0x14fa38aea640, in quda::blas::norm2(quda::ColorSpinorField const&)
#4    Object "libquda.so", at 0x14fa38afd472, in void quda::blas::instantiate<quda::blas::Norm2, quda::blas::Reduce, false, double, quda::ColorSpinorField const, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, double&>(double const&, double const&, double const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, double&)
#3    Object "libquda.so", at 0x14fa38b0127f, in quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::Reduce<quda::ColorSpinorField const, quda::ColorSpinorField const, quda::ColorSpinorField const, quda::ColorSpinorField const, quda::ColorSpinorField const>(double const&, double const&, double const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, quda::ColorSpinorField const&, double&)
#2    Object "libquda.so", at 0x14fa38b01b70, in quda::blas::Reduce<quda::blas::Norm2, short, short, 4, double>::apply(quda::qudaStream_t const&)
#1    Object "libquda.so", at 0x14fa3af33b46, in errorQuda_(char const*, char const*, int, ...)
#0    Object "libquda.so", at 0x14fa3af675e4, in quda::comm_abort(int)

maddyscientist · 2023-06-21T00:11:34Z

> > This issue looks like a precision one I think: I don't think we should ever be using half precision on the coarse grids here.
>
> Thanks for this hint! Setting all [clover_]cuda_prec_precondition and [clover_]cuda_prec_eigensolver to single precision does indeed resolve the problem. This seems to be somewhat inconsistent with https://github.com/lattice/quda/wiki/Twisted-clover-deflated-multigrid#improvement-2-using-coarse-level-deflation, however, where --prec-precondition half is passed, while *_prec_eigensolver does not seem to be set explicitly at all. I was also under the (apparently false) impression that it would make the most sense to run the coarse eigensolver in half precision as the level of convergence required is rather low (residual 1e-4 or so).
>
> I'm aware of course that the Wiki page will be three years old in two weeks, so it might well have grown inconsistent. For example, n-conv is also not set, while it appears to be required now.

This just looks like the wiki pages have grown stale: the eigenvector precision option was added after they were written. So we have five precisions to worry about now:

- `prec`: the outer precision of the solver, or "restart" precision
- `prec_sloppy`: the working precision of the solver, where most cycles are done
- `prec_precondition`: the precision of the preconditioner (for MG this is the precision of the fine-level smoother)
- `prec_null`: the precision that the null-space vectors are stored in (and also corresponds to the precision of the link variables on the coarse grids)
- `prec_eigensolver`: the precision of the eigensolver used for deflation (and thus the precision of the eigenvectors on the coarse grid when using deflation)

So in general one would want to use a double / single / half / half / single (respectively) solver. The coarse eigensolvers must use single precision, since we don't support half precision on the coarse-grid fermion fields. Supporting it would require "unrolling" the site vector, which would make for a combinatoric nightmare for compilation and also reduce parallelism, which would kill performance.
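As a rough illustration of that recommendation, the sketch below models the five precisions and rejects a half-precision eigensolver before any solver would be set up. Everything in it is hypothetical: `Precision`, `MGPrecisions` and `check_coarse_deflation` are illustrative names rather than QUDA API. Only the numeric values 2, 4 and 8 are taken from the parameter dump earlier in this thread.

```cpp
#include <cassert>
#include <string>

// Stand-in for a precision enum. The values match those printed in the
// parameter dump earlier in this thread (half = 2, single = 4, double = 8).
enum class Precision { Half = 2, Single = 4, Double = 8 };

// Hypothetical container for the five precisions listed above. The
// defaults follow the double / single / half / half / single suggestion.
struct MGPrecisions {
  Precision prec = Precision::Double;
  Precision prec_sloppy = Precision::Single;
  Precision prec_precondition = Precision::Half;
  Precision prec_null = Precision::Half;
  Precision prec_eigensolver = Precision::Single;
};

// Returns an empty string if the combination is usable with coarse-grid
// deflation, otherwise a short description of the problem.
inline std::string check_coarse_deflation(const MGPrecisions &p)
{
  if (p.prec_eigensolver == Precision::Half) {
    return "prec_eigensolver is half: coarse-grid fermion fields do not support half precision";
  }
  return "";
}
```

With the defaults the check returns an empty string. Setting `prec_eigensolver` to `Precision::Half`, as in the parameter dump above where `cuda_prec_eigensolver = 2`, yields the error description instead.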

I will update the wiki to fix this deficit, and apologies for this incongruity between the wiki and the code.

Glad you find the QUDA_BACKWARDS option helpful. I've updated the debugging page to note this, as it escaped documentation.
