Returned Policies and Exploitability #1215

bwr125 · 2024-04-26T18:12:14Z

Can I get some clarity on the input assumptions when calculating exploitability? See the following simple example:

from open_spiel.python.algorithms import sequence_form_lp, exploitability
import pyspiel

game = pyspiel.load_game('kuhn_poker')
(v1, v2, pi1, pi2) = sequence_form_lp.solve_zero_sum_game(game)
exploitability.exploitability(game, pi1)

The exploitability here is nonzero even though the returned values match the expected game values (-1/18, 1/18). This does not happen when I use the closed-form expression for the Kuhn poker Nash equilibrium:

from open_spiel.python.algorithms import exploitability
from open_spiel.python.games import data
import pyspiel

game = pyspiel.load_game('kuhn_poker')

pi_kuhn = data.kuhn_nash_equilibrium(0.219434553)
exploitability.exploitability(game, pi_kuhn)

This exploitability is zero. However, when I manually compare [pi1, pi2] and pi_kuhn, they look almost identical. What am I missing?

lanctot · 2024-04-27T01:12:07Z

Wow, that is quite the gotcha.... I'm surprised the first one even works!

The problem is that you're trying to compute exploitability for just a single policy. But exploitability is a function of the entire strategy profile (both policies), or "joint policy", not just one player's policy. The sequence-form LP code returns the two policies separately rather than in a single object (as, e.g., the CFR code does).

pi_kuhn, by contrast, is already the joint policy (the equilibrium profile).
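
For reference, here is the quantity written out for a two-player zero-sum game. This is a sketch under the assumption that exploitability is NashConv divided by the number of players (my reading of the library, not something stated in this thread); the symbols $u_i$ for player $i$'s expected utility and $\pi_i'$ for a deviating policy are notation introduced here.

```latex
\[
\mathrm{NashConv}(\pi_1, \pi_2)
  = \sum_{i \in \{1,2\}} \Bigl[ \max_{\pi_i'} u_i(\pi_i', \pi_{-i}) - u_i(\pi_1, \pi_2) \Bigr],
\qquad
u_1(\pi_1, \pi_2) + u_2(\pi_1, \pi_2) = 0
\]
\[
\mathrm{exploitability}(\pi_1, \pi_2)
  = \frac{1}{2} \Bigl[ \max_{\pi_1'} u_1(\pi_1', \pi_2) + \max_{\pi_2'} u_2(\pi_1, \pi_2') \Bigr]
\]
```

Both best-response terms depend on the opponent's policy, so the value is only zero when the pair $(\pi_1, \pi_2)$ is an equilibrium together.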

The fix is to just merge them into a single joint policy:

from open_spiel.python.algorithms import sequence_form_lp, exploitability
from open_spiel.python import policy
import pyspiel

game = pyspiel.load_game('kuhn_poker')
(v1, v2, pi1, pi2) = sequence_form_lp.solve_zero_sum_game(game)
merged_policy = policy.merge_tabular_policies([pi1, pi2], game)
exploitability.exploitability(game, merged_policy)

We should totally have this in a test. I was surprised not to see it in the sequence_form_lp_test, so please leave the issue open as a reminder to add a test that does the above so there's a reference to it in the code somewhere.

bwr125 · 2024-05-04T20:36:19Z

Oh that makes sense! I was curious why there were two returned policies, but each policy itself was still defined over both players' information states (with uniform random values in the other player's states). Thanks!!
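
The effect described above can be reproduced without OpenSpiel in a toy setting. The following is a minimal sketch, assuming a hypothetical 2x2 zero-sum matrix game (the payoff matrix and the helper `matrix_exploitability` are made up for illustration and are not part of OpenSpiel). It pairs one player's equilibrium policy first with the opponent's equilibrium policy and then with a uniform placeholder, which is analogous to passing pi1 alone.

```python
from fractions import Fraction

# Hypothetical 2x2 zero-sum matrix game (not from the thread): entry [i][j] is
# the row player's payoff; the column player receives the negative.
PAYOFFS = [[Fraction(2), Fraction(-1)],
           [Fraction(-1), Fraction(1)]]


def matrix_exploitability(row_policy, col_policy, payoffs=PAYOFFS):
    """Average of both players' best-response values (zero-sum case)."""
    n_rows, n_cols = len(payoffs), len(payoffs[0])
    # Row player's best-response value against the column policy.
    row_br = max(
        sum(payoffs[i][j] * col_policy[j] for j in range(n_cols))
        for i in range(n_rows)
    )
    # Column player's best-response value against the row policy
    # (the column player maximizes the negated payoff).
    col_br = max(
        sum(-payoffs[i][j] * row_policy[i] for i in range(n_rows))
        for j in range(n_cols)
    )
    return (row_br + col_br) / 2


# In this particular game both players mix (2/5, 3/5) at equilibrium.
equilibrium = [Fraction(2, 5), Fraction(3, 5)]
uniform = [Fraction(1, 2), Fraction(1, 2)]

# Full equilibrium profile: neither player can gain by deviating.
print(matrix_exploitability(equilibrium, equilibrium))  # 0
# Equilibrium row policy paired with a uniform placeholder for the column player.
print(matrix_exploitability(equilibrium, uniform))  # 3/20
```

The row policy is the same in both calls; only the opponent's half of the profile changes, and that alone moves the result away from zero.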

lanctot closed this as completed May 22, 2024

lanctot reopened this May 22, 2024
