Handle nym similarity #588
enum NonItemActType {
  NYM_CHANGE
}
Nice! Probably should've done this with donations!
This has already grown far beyond what I originally expected lol
* make edit nym invoiceable
* consolidate schema migrations
* only create an act instance if the cost is nonzero, since we only care about payments for nym changes, not free ones
* display paid nym changes in wallet history aka satistics
disable submit button while form is submitting
5648f9c to 547b4ba
I've been messing around with this. I don't think I've been working with a
If we go this route though, we'll probably want to first filter
Anyway, not suggesting you need to do this, but I wanted to update my status on this.
CREATE OR REPLACE FUNCTION are_visually_similar(ch1 char, ch2 char)
RETURNS boolean AS $$
BEGIN
  -- Characters that are equal ignoring case are always similar; otherwise the
  -- lowercased pair is looked up in the table of visually confusable characters.
  -- Each pair is listed in both orders so the check is symmetric.
  RETURN (LOWER(ch1) = LOWER(ch2)) OR (
    (LOWER(ch1), LOWER(ch2)) IN (
      ('0', 'o'), ('o', '0'),
      ('1', 'i'), ('i', '1'), ('1', 'l'), ('l', '1'),
      ('2', 'z'), ('z', '2'), ('2', '3'), ('3', '2'),
      ('5', 's'), ('s', '5'), ('5', '6'), ('6', '5'),
      ('6', '8'), ('8', '6'),
      ('7', '1'), ('1', '7'),
      ('9', 'q'), ('q', '9'), ('9', 'g'), ('g', '9'),
      ('a', '4'), ('4', 'a'),
      ('a', 'd'), ('d', 'a'),
      ('b', '6'), ('6', 'b'), ('b', 'h'), ('h', 'b'),
      ('c', 'o'), ('o', 'c'), ('c', 'e'), ('e', 'c'), ('c', 'g'), ('g', 'c'),
      ('d', '0'), ('0', 'd'), ('d', 'o'), ('o', 'd'),
      ('e', '3'), ('3', 'e'), ('e', 'o'), ('o', 'e'),
      ('f', 't'), ('t', 'f'), ('f', 'e'), ('e', 'f'), ('p', 'f'), ('f', 'p'),
      ('g', 'q'), ('q', 'g'),
      ('i', 'j'), ('j', 'i'),
      ('i', 'l'), ('l', 'i'),
      ('i', 't'), ('t', 'i'),
      ('k', 'x'), ('x', 'k'),
      ('m', 'n'), ('n', 'm'), ('m', 'r'), ('r', 'm'),
      ('n', 'u'), ('u', 'n'),
      ('o', 'u'), ('u', 'o'),
      ('p', 'q'), ('q', 'p'),
      ('q', 'o'), ('o', 'q'),
      ('v', 'w'), ('w', 'v'), ('v', 'y'), ('y', 'v'),
      ('x', 'z'), ('z', 'x'),
      -- Multi-character look-alikes: these can only match when the caller
      -- passes more than one character, which the distance function below never does.
      ('m', 'nn'), ('nn', 'm'),
      ('v', 'vv'), ('vv', 'v')
    )
  );
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
CREATE OR REPLACE FUNCTION damerau_levenshtein_distance(
  str1 text,
  str2 text,
  insertion_cost int DEFAULT 1,
  deletion_cost int DEFAULT 1,
  substitution_cost int DEFAULT 1,
  transposition_cost int DEFAULT 1
)
RETURNS int AS $$
DECLARE
  len1 int;
  len2 int;
  i int;
  j int;
  cost int;
  d int[][];
  ch1 char;
  ch2 char;
BEGIN
  len1 = LENGTH(str1);
  len2 = LENGTH(str2);
  -- Short-circuit: If length difference is greater than possible edit distance
  IF ABS(len1 - len2) > GREATEST(insertion_cost, deletion_cost) * GREATEST(len1, len2) THEN
    RETURN ABS(len1 - len2);
  END IF;
  -- Special cases: if either string is empty
  IF len1 = 0 THEN
    RETURN len2 * insertion_cost;
  ELSIF len2 = 0 THEN
    RETURN len1 * deletion_cost;
  END IF;
  -- Initialize 2D array with dimensions (len1+1) x (len2+1)
  d := ARRAY_FILL(0, ARRAY[len1+1, len2+1]);
  -- Initialize the first row and column
  FOR i IN 0..len1 LOOP
    d[i+1][1] := i * deletion_cost;
  END LOOP;
  FOR j IN 0..len2 LOOP
    d[1][j+1] := j * insertion_cost;
  END LOOP;
  -- Populate the matrix
  FOR i IN 1..len1 LOOP
    ch1 := SUBSTRING(str1 FROM i FOR 1);
    FOR j IN 1..len2 LOOP
      ch2 := SUBSTRING(str2 FROM j FOR 1);
      IF are_visually_similar(ch1, ch2) THEN
        cost := 0;
      ELSE
        cost := substitution_cost;
      END IF;
      d[i+1][j+1] := LEAST(
        d[i][j+1] + deletion_cost,  -- Deletion
        d[i+1][j] + insertion_cost, -- Insertion
        d[i][j] + cost              -- Substitution
      );
      -- Check for transposition
      IF i > 1 AND j > 1 THEN
        IF are_visually_similar(ch1, SUBSTRING(str2 FROM j-1 FOR 1))
           AND are_visually_similar(ch2, SUBSTRING(str1 FROM i-1 FOR 1)) THEN
          d[i+1][j+1] := LEAST(
            d[i+1][j+1],
            d[i-1][j-1] + transposition_cost -- Transposition
          );
        END IF;
      END IF;
    END LOOP;
  END LOOP;
  -- The distance is at the bottom-right corner of the matrix
  RETURN d[len1+1][len2+1];
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
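For trying out thresholds outside the database, the two SQL functions above can be approximated in Python. This is only a sketch under assumptions: the names `SIMILAR_PAIRS`, `visually_similar`, and `osa_distance` are hypothetical, the pair table is an abbreviated subset of the SQL list, all edit costs are fixed at 1, and characters that differ only by case are treated as identical. Like the SQL version, it only considers swaps of adjacent characters (the "optimal string alignment" flavor of Damerau-Levenshtein).

```python
# Hypothetical Python port of the SQL above, for offline experimentation only.
# The pair table is an abbreviated subset of are_visually_similar.
SIMILAR_PAIRS = {
    frozenset(pair)
    for pair in [
        ("0", "o"), ("1", "i"), ("1", "l"), ("i", "l"), ("5", "s"),
        ("2", "z"), ("e", "3"), ("a", "4"), ("m", "n"), ("v", "w"),
    ]
}


def visually_similar(c1: str, c2: str) -> bool:
    """True if the characters match ignoring case or form a look-alike pair."""
    a, b = c1.lower(), c2.lower()
    return a == b or frozenset((a, b)) in SIMILAR_PAIRS


def osa_distance(s1: str, s2: str) -> int:
    """Edit distance with unit costs where look-alike substitutions are free."""
    n, m = len(s1), len(s2)
    # d[i][j] is the distance between the first i chars of s1 and first j of s2.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete everything
    for j in range(m + 1):
        d[0][j] = j  # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if visually_similar(s1[i - 1], s2[j - 1]) else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition, using the same look-alike comparison.
            if (
                i > 1 and j > 1
                and visually_similar(s1[i - 1], s2[j - 2])
                and visually_similar(s1[i - 2], s2[j - 1])
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]
```

With this table, `osa_distance("satoshi", "sat0shi")` is 0 because `o` and `0` count as the same glyph, while a swapped pair of letters such as `"alice"` vs `"alcie"` costs 1.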
I agree, that's kinda what I meant by it being too heavy handed. It could probably be accounted for (somewhat) by adjusting the cost formula, but ultimately a better similarity detection algorithm is best.
Cool! It'll take me a bit to grok the algorithm, but that sounds like a great approach, assuming we can get the perf in good shape. I imagine we should be able to update the name-cost SQL query to use this new similarity function and the rest of this PR would just work, yea?
Yeah the rest of the PR is good to go afaict. We just need to be able to back up charging really well, which means we need to detect similarity really well ... which is hard. With the above function, it's not bad if tuned to:
e.g. Is pretty excellent at finding similar things
e.g.
Additionally, I think my cost function was too continuous. It should probably be more of a step function.
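One way to picture a step-shaped cost, as opposed to a continuous one, is a small lookup table keyed on the minimum distance to any existing nym. This is a sketch only: `COST_STEPS`, `nym_change_cost`, and every threshold and price in it are made-up placeholders, not values from this PR.

```python
# Hypothetical thresholds and prices (in sats); none of these come from the PR.
# Each entry is (maximum distance, cost charged at or below that distance).
COST_STEPS = [
    (0, 100_000),  # visually indistinguishable from an existing nym
    (1, 10_000),
    (2, 1_000),
]


def nym_change_cost(min_distance: int) -> int:
    """Return the fee for a nym whose closest existing nym is min_distance away."""
    for max_distance, cost in COST_STEPS:
        if min_distance <= max_distance:
            return cost
    # Far enough from every existing nym: the change stays free.
    return 0
```

Anything beyond the last step is free, which keeps ordinary low-risk nym changes at zero cost.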
I also wonder if we should exclude inactive accounts from comparisons. Like, if we haven't seen someone on site in 6 months, what are we protecting?
Yea, that makes sense. Fine-tuning the cost algo might take some experimentation, but I do really like how nym-changes are currently free, so nickel and diming folks for low-risk changes would be a worsening of the experience, IMO.
Yea, that would make sense. You could take it a step further and factor activity frequency into the cost, like a scaling factor, where anything older than 6 months is 0, but as you get progressively more recently active, the scaling factor increases. There are probably many similar variations we could apply - basically any method of identifying a high value account that someone would want to pretend to be - top stackers, verified contributors, verified corporate accounts, etc. Let me know how much of this you want me to tackle, if any. I figured I'd let you finalize the above sql functions first.
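The activity-based scaling idea could look roughly like the sketch below. The linear ramp, the 182-day cutoff standing in for "6 months", and the names `INACTIVITY_CUTOFF` and `activity_scale` are all assumptions for illustration; the thread does not settle on a formula.

```python
from datetime import datetime, timedelta

# Hypothetical cutoff: roughly six months of inactivity.
INACTIVITY_CUTOFF = timedelta(days=182)


def activity_scale(last_seen: datetime, now: datetime) -> float:
    """Scale factor in [0, 1]: 1.0 for an account active right now,
    falling linearly to 0.0 once it has been idle for the cutoff or longer."""
    idle = now - last_seen
    if idle >= INACTIVITY_CUTOFF:
        return 0.0
    if idle <= timedelta(0):
        return 1.0
    # Dividing two timedeltas yields a plain float ratio.
    return 1.0 - idle / INACTIVITY_CUTOFF
```

The base cost of imitating a nym would then be multiplied by this factor, so nyms of long-idle accounts cost nothing to approach.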
Thanks! I’ll let you know. The ball is fully in my court until I can figure this out.
Another approach: don't charge and instead let a nym "occupy" the entire grid of nyms that's 3 distance units away from it, i.e. consider all those nyms taken rather than for sale. Not actively pursuing any approach yet, but kind of background thinking through some of this.
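A rough sketch of the "occupy" idea, assuming that "3 distance units away" means a distance of 3 or less and using a plain Levenshtein distance as a stand-in for whatever similarity function ends up being chosen. `OCCUPIED_RADIUS`, `levenshtein`, and `is_nym_available` are hypothetical names.

```python
# Hypothetical radius: nyms at distance <= 3 from an existing nym count as taken.
OCCUPIED_RADIUS = 3


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def is_nym_available(candidate: str, existing_nyms: list[str]) -> bool:
    """A candidate is available only if it lies outside every existing nym's radius."""
    return all(
        levenshtein(candidate.lower(), nym.lower()) > OCCUPIED_RADIUS
        for nym in existing_nyms
    )
```

A fixed radius like this is harsh on short nyms, since almost any two four-letter nyms are within distance 3 of each other, so the radius would likely need to depend on nym length.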
I'll defer resolving the conflicts to whenever the overall approach of this PR is decided upon.
Going to put this in draft temporarily. We definitely need this and I'm pretty sure we want to use the DL algo, but it's hard to tell what the thresholds should be and what to do when a nym is on the wrong side of the threshold.
Sounds good to me, I agree that we need to exercise caution with changes like this 👍
Closes #580
This PR updates the edit nym feature by introducing fees if the new nym is too similar to another user's nym. Similarity is calculated via a levenshtein distance (see #580 for discussion).
The UX for the edit nym form was updated to include a cancel and a save button so users can back out of changing their nym if they want, which they can't explicitly do today.
The save button is largely a copy of the existing FeeButton component, built for this particular use case instead of being item-focused.
The save button also displays a receipt if the fee is non-zero, to explain to the user why changing their nym costs as much as it does.
The edit nym action is also now invoiceable, meaning a user can pay an invoice on-demand if their balance doesn't have enough to cover the transaction.
Paid nym changes are logged in a new NonItemAct table, and are therefore included in the wallet history aka satistics page. We still don't track a history of which nyms were used by any account - just how much was paid and when to change to a new nym.
Additionally, when new users are created, their auto-generated nyms are checked to see what their cost would be, and if the cost would be non-zero, a random nym is instead generated. This prevents a user from bypassing the cost to get a high-value nym by creating an account via email, twitter, or github login.
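The signup fallback described above can be sketched as follows. This is not the PR's implementation: `pick_signup_nym`, the `cost_of` callback, the `stacker-` prefix, and the retry limit are hypothetical stand-ins for whatever the real code uses to price and generate nyms.

```python
import secrets
from typing import Callable


def pick_signup_nym(
    preferred: str,
    cost_of: Callable[[str], int],
    max_attempts: int = 10,
) -> str:
    """Keep the auto-generated nym only if it would be free; otherwise fall
    back to a random nym that is free, so signup cannot bypass the fee."""
    if cost_of(preferred) == 0:
        return preferred
    for _ in range(max_attempts):
        # Hypothetical random nym format.
        candidate = f"stacker-{secrets.token_hex(4)}"
        if cost_of(candidate) == 0:
            return candidate
    raise RuntimeError("could not generate a free nym")
```

For example, `pick_signup_nym("alice", lambda nym: 0)` keeps `"alice"`, while a preferred nym with a non-zero cost is replaced by a random one.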
Some nuances in the code:
nonItemSpent because it needs a different UI rendering compared to spent types, but I show them as spent type in wallet history because users don't need to know the difference.