FSRS-AnnA Integration #22
I suppose I have found a reason :^) |
Seems fixed with this change at AnnA.py:772:

```python
self.df = pd.DataFrame(list_cardInfo)
self.df["cardId"] = self.df["cardId"].astype('int64')
self.df = self.df.set_index("cardId").sort_index()
self.df["interval"] = self.df["interval"].astype(float)
return True
```

My card IDs were very large for some reason and were overflowing when truncated to pandas 2's default int types. But now I get:

```
Computing distance matrix on all available cores...
Scaling each vertical row of the distance matrix...
Scaling: 100%|██████████| 687/687 [00:00<00:00, 6279.42card/s]
Computing mean and std of distance...
(excluding diagonal)
Mean distance: 0.78, std: 0.13
Traceback (most recent call last):
File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 2550, in <module>
anna = AnnA(**args)
File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 539, in __init__
self._compute_distance_matrix()
File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 1344, in _compute_distance_matrix
self._print_similar()
File "C:\GitProjects\AnnA_Anki_neuronal_Appendix\AnnA.py", line 1355, in _print_similar
signal.signal(signal.SIGALRM, time_watcher)
AttributeError: module 'signal' has no attribute 'SIGALRM'. Did you mean: 'SIGABRT'?
Process finished with exit code 1
```

This is likely because signal.SIGALRM is not cross-platform (I'm on Windows). I think there's a good Python timeout library; I may remember the name soon.
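For context on the overflow, a small hedged demo (the card ID is made up, and the exact behavior of the narrowing cast can differ between NumPy/pandas versions):

```python
import numpy as np
import pandas as pd

# Anki card IDs are epoch-millisecond timestamps, i.e. 13 digits; this one is invented.
card_id = 1_687_012_345_678
print(np.iinfo(np.int32).max)                # 2147483647, far below any card ID

# A narrowing cast may silently wrap instead of erroring, producing garbage IDs:
print(np.array([card_id]).astype(np.int32))  # e.g. a negative number

# Casting to int64 up front, as in the fix above, preserves the full value:
df = pd.DataFrame({"cardId": [card_id]})
print(df["cardId"].astype("int64").iloc[0])  # 1687012345678
```
|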
Found it. I think it's more cross-platform than signal, but I could be wrong; maybe I'll try to give you a PR with it to see if it works on your machine.

```python
from func_timeout import func_timeout, FunctionTimedOut
```

then at line 1344:

```python
if self.skip_print_similar is False:
    try:
        func_timeout(60, self._print_similar)
    except FunctionTimedOut:
        red("Taking too long to find similar nonequal cards, skipping")
return True
```
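For reference, a minimal stand-alone sketch of how func_timeout behaves (the slow() helper is a toy of mine, not anything from AnnA):

```python
import time
from func_timeout import func_timeout, FunctionTimedOut

def slow():
    time.sleep(5)
    return "done"

try:
    # Give slow() a 1-second budget. func_timeout runs the function in a
    # stoppable thread rather than relying on Unix-only signals.
    result = func_timeout(1, slow)
except FunctionTimedOut:
    result = "timed out"

print(result)  # -> "timed out"
```
|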
Seems to work now after fixing some int overflow errors caused by either the new pandas version or my specific deck. I'm not as well versed in the usual functioning of the script, though, so I'm not 100% confident that one of the "tasks" I wasn't testing isn't also failing for some reason. But it should be enough for a PR, perhaps. I do get this error a lot, and an eardrum-bursting beedoop every time, but it seems to work in spite of that. |
#23 I'll continue to mess around with it and see if I can't better understand what exactly this program is doing and how to integrate it with FSRS (if at all possible)! Also, it's probably worth mentioning that the dev version of the addon is incompatible with the main version, and figuring out how to "deploy" a dev version wasn't completely obvious to me, so maybe it warrants a section in the readme. |
The relative overdueness calculation seems incorrect: I have 100 cards scheduled with FSRS that are due today, and AnnA thinks 40% of them are dangerously overdue, even though only 12 are overdue at all. |
So I clarified the error message a bit. Basically, using the whole deck means applying TF-IDF with a vocabulary built from the whole deck, regardless of note types etc. For example, I have a source field for my medical cards that, thanks to OCR, is filled with words. That can be useful weighting for TF-IDF or not, depending on the user, hence this whole_deck argument. Another example is sorting your vocabulary cards without example sentences while vectorizing WITH example sentences. It's fine for it to fail.
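To illustrate the idea with a hedged scikit-learn sketch (the variable names are invented, not AnnA's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example data: the whole deck (all note types and fields, including
# OCR-filled source fields) versus only the cards actually being sorted.
whole_deck_texts = ["anatomy of the mitral valve ...", "OCR text from a source field ..."]
cards_to_sort = ["anatomy of the mitral valve ..."]

vectorizer = TfidfVectorizer()
vectorizer.fit(whole_deck_texts)         # vocabulary and IDF weights from the whole deck
X = vectorizer.transform(cards_to_sort)  # applied to just the subset being sorted
```
|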
Would you mind saving me some time and sourcing your claim that func_timeout is (at least on paper) cross-platform, please? |
Well, I'm very short on time, so the idea is that the dev version is usually not 'production ready'; I do a merge every couple of months and check the addon then. I have heard of FSRS but never actually took the time to understand what it's all about. Would you have a few links handy, or an ELI5? |
Please tell me all the kwargs you're using to get this result. If you can also post the logs here, that might help. IIRC I had a slightly different implementation of the RO but thought it should have the same real-life results. |
Not sure if there's a great ELI5 floating around, but I'll give it a shot.

ELI5: FSRS is a scheduling algorithm that learns from your review history to better schedule your reviews (and it is going to be implemented into Anki soon).

In more detail, it's a machine-learning-based algorithm that fits an explainable model of your retention patterns per deck, using all of your historical reviews as time-series data, in order to predict your recall of a card more accurately than basically any algorithm before it. In doing so, it minimizes the time you have to spend reviewing for the same retention level, often reducing the number of reviews by around 14%. There's lots of information about it on the GitHub wiki.

However, the algorithm doesn't take into account "conceptual siblings", as AnnA attempts to. I strongly believe that a good integration of these two ideas could lead to large gains in study efficiency, though it is certainly a very difficult problem, and depends to a large extent on the particulars of decks and on how well AnnA approximates the true, in-brain conceptual links between cards.
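To give a concrete flavor of the model, here is a hedged sketch of the FSRS-style power forgetting curve (the exact form and constants vary between FSRS versions, so treat this as illustrative only):

```python
# Retrievability R as a function of elapsed days t and a learned per-card
# "stability" S, where S is defined as the interval at which recall is ~90%.
def retrievability(t: float, stability: float) -> float:
    return (1 + t / (9 * stability)) ** -1

print(retrievability(0, 10))   # 1.0 -- just reviewed
print(retrievability(10, 10))  # 0.9 -- at t == S, by the definition of stability
print(retrievability(90, 10))  # 0.5 -- decay is a power law, not an exponential
```
|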
I don't recall the kwargs and don't have a copy of the logs, but I suppose the relative overdueness could have been due to the fact that, had the cards been buried, they would have become too relatively overdue, since they were freshly learned cards with short intervals like 1d. If that's how the calculation works, this is probably what was happening.
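To make my guess concrete, here is a hedged sketch of one common definition of relative overdueness (an assumption on my part, not necessarily AnnA's exact formula): time elapsed since the last review divided by the scheduled interval.

```python
# A card d days overdue on an interval of i days gets RO = (i + d) / i.
def relative_overdueness(days_overdue: float, interval: float) -> float:
    return (interval + days_overdue) / interval

# A freshly learned card (1d interval) that slips one day doubles its RO...
print(relative_overdueness(1, 1))    # 2.0
# ...while a mature card (100d interval) barely registers the same slip:
print(relative_overdueness(1, 100))  # 1.01
```
|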
https://pypi.org/project/func-timeout/ says:
and I can personally vouch that it works on Windows, while the signal implementation currently employed in AnnA has no possibility of doing so. As well, it seems to allow a much cleaner implementation in general, though my PR didn't do much more cleanup than was necessary. |
This is understandable and probably a good way to go about it; I was mostly just mad at myself for accidentally pulling main when I meant to pull dev while making my changes. |
That information was very helpful, thank you very much. I am now absolutely pumped at the idea of using AnnA to enhance FSRS :). I'm thinking that experimenting with a new feature as input to the NN, indicating whether a conceptually similar card was reviewed recently, might be interesting. It turns out that finding the k nearest neighbors of each card is not that computationally intensive and scales okay. The feature could simply be a softmaxed time distance to the most recent sibling. It might be necessary to add another feature indicating the grade of that latest sibling review. Just thinking out loud, of course, but the code of AnnA might be the quickest way to extract a distance matrix to experiment with FSRS.
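Something like this hedged sketch, maybe (every name here is invented, and the exponential decay merely stands in for the "softmaxed time distance" idea):

```python
import numpy as np

# For each card, find its k nearest conceptual siblings in a precomputed
# distance matrix, then map the time since the most recent sibling review
# into (0, 1] as a candidate input feature for the NN.
def sibling_recency_feature(dist: np.ndarray,
                            days_since_review: np.ndarray,
                            k: int = 5,
                            tau: float = 7.0) -> np.ndarray:
    n = dist.shape[0]
    feature = np.empty(n)
    for i in range(n):
        order = np.argsort(dist[i])
        neighbors = order[order != i][:k]      # k nearest siblings, self excluded
        freshest = days_since_review[neighbors].min()
        feature[i] = np.exp(-freshest / tau)   # ~1.0 if a sibling was seen today
    return feature
```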
If that happens again we can take a look. I might have made mistakes!
Great, happy to merge the PR once you make it cleaner and on the dev branch. Thanks a lot!
Happens to me all the time :) |
Good to hear :) I figured the idea would tickle your brain once you saw the merits of FSRS and the possibility of gains from integrating it with AnnA. I had written up a few thoughts while brainstorming the exact ways it could be integrated (open-spaced-repetition/fsrs4anki#352 (comment) and open-spaced-repetition/fsrs4anki#352 (comment)), if you can overlook the rambling and length.

tl;dr: I think using AnnA's distance matrix as a starting point, and then optimizing that matrix further against time-series review data to tease out the strength of the conceptual overlap (and then using that to inform retrievability calculations), seems like the ultimate path. This is of course a very difficult problem, and would probably be best attempted after something simpler as a proof of concept.

Building off your idea about k nearest neighbors, perhaps a simple approach would be to make the number of neighbors k (or maybe a similarity cutoff) a learned weight, so as to "learn" the optimal similarity threshold of a deck (below which conceptual siblings are not thought to help one another). Then maybe add some parameters for the magnitude of the effect, tying the stability/difficulty of conceptual neighbors' reviews together in some learned, parameterized way, while keeping the distance matrix constant. Done this way, one could directly quantify (offline) how well the changes to the algorithm predict the review data, which is a benefit: it could be more convincing than anecdotes and draw more attention to the method.

Another user has suggested simply using AnnA to inform the "disperse siblings" functionality of FSRS, which could be a quicker and easier win for practical use, though this change would not directly affect the prediction capability of the algorithm. However, the variety of ways one can tune AnnA for specific decks makes this seem like a difficult UX problem.
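For instance, a hedged sketch of what that learned similarity cutoff might look like (purely illustrative; the functional form and every name are my invention, not FSRS or AnnA code):

```python
import numpy as np

# Weight each sibling's influence by a smooth threshold on distance; the cutoff
# and sharpness would be fit by the optimizer alongside the existing weights.
def sibling_weights(dist_row: np.ndarray, cutoff: float, sharpness: float) -> np.ndarray:
    return 1.0 / (1.0 + np.exp((dist_row - cutoff) * sharpness))

# Cards much closer than the cutoff get weight ~1; much farther, ~0:
d = np.array([0.2, 0.5, 0.78, 0.95])
print(sibling_weights(d, cutoff=0.5, sharpness=10.0))  # ~[0.95, 0.5, 0.06, 0.01]
```
|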
Thanks again. I posted a comment to the thread. If you need me to push something to expose some parts of the code I'd be happy to. Like if you just want the dist matrix or whatever. |
pandas == 1.2.3 version lock?
I don't have any plans to work on this at least until the dust settles on FSRS's integration into Anki (which is currently ongoing, with a beta out), but I'll let you know here if I need any help! |
Perfect. Thanks! |
Just inquiring whether there's a reason pandas is locked at this version; my machine is having issues installing this specific version. I figure it might be due to ankipandas requiring it, but I didn't see any info about this after a cursory search.
FWIW, I found this addon via a discussion in the FSRS GitHub centered on improving the accuracy of the FSRS algorithm by identifying "conceptual" or (as you might call them) "semantic" siblings, so I wanted to try it out, as it seemed a very promising direction of research.