Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sequential learner syncing #11390

Merged

Conversation

bjester
Copy link
Member

@bjester bjester commented Oct 11, 2023

Summary

  • The Learn-only device syncing:

    • now persists syncing information in the SyncQueue model
    • operates through one task pipeline, using the queue to determine the time of the next run
    • re-attempts syncs or sync requests up to 5 times without a network change occurring
  • The SyncQueue API:

    • is backwards incompatible with the previous version
    • now prioritizes the devices that least recently synced
    • no longer aggressively deletes records from the queue, but marks things as stale when a device misses a rendezvous by at least half of the intended sync interval
    • no longer returns the sync_interval configured on the server (seems like something we can integrate into device management later on)

References

#11209
#11166

Reviewer guidance

Running kolibri manage migrate --database syncqueue may be required to fully migrate the models


Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@github-actions github-actions bot added DEV: backend Python, databases, networking, filesystem... APP: Learn Re: Learn App (content, quizzes, lessons, etc.) APP: User Re: User app (sign-in, sign-up, user profile, etc.) SIZE: very large labels Oct 11, 2023
Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I get the overall gist - although I can't claim to have followed the entire thing all the way through. One question about handling of network connection errors.

from kolibri.utils.translation import ugettext as _


logger = logging.getLogger(__name__)
SOUD_SYNC_PROCESSING_JOB_ID = "50"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, I am starting to wonder if we shouldn't just allow a task to register with static_id=True and then have the task machinery generate a hash of the task for the id, to avoid collisions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be helpful

instance_id=instance_id,
defaults={
"status": SyncQueueStatus.Pending,
"updated": time.time(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference in the application of the defaults between get_or_create and update_or_create shouldn't confuse me as much as it does, but I always forget the defaults will always be applied in the update.

self.set_next_attempt(28 + 2 ** self.attempts)

class Meta:
unique_together = ("user_id", "instance_id")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Came back to check for this, and wasn't disappointed!

context
)
)
sync_queue.increment_and_backoff_next_attempt()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be backing off in this case, or waiting for the device to come back online again and then requeueing instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to say that we could skip the backoff, but I think there are scenarios where we do get some benefit here in the time being, particularly with static locations that may not be on the same subnet. During the next run, if the network location doesn't exist or is marked with a connection error, it won't make any requests, but continue to backoff until it exceeds the max attempts

@bjester bjester marked this pull request as ready for review October 20, 2023 20:50
@pcenov pcenov self-requested a review October 24, 2023 07:34
@pcenov
Copy link
Member

pcenov commented Oct 24, 2023

Hi @bjester, I was asked to start regression testing on this one however using the latest available .deb asset and the Android build I can't get the sync to start working.

Once I import the learner on the LOD the assigned lesson resources are correctly displayed on the device but after that the status is constantly 'Not connected to server'. I am testing with a clean install of Kolibri on all of the devices.

sync.not.working.mp4

Logs:

AdminLogs.zip
LearnerLogs.zip

cc @radinamatic @marcellamaki

Copy link
Member

@jredrejo jredrejo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been testing some syncs and the merge facility feature and everything works fine in the merging process. However, after restarting the device, when it tries to sync again, this error happens:

INFO     2023-10-24 19:52:03,759 Response failure: http://192.168.0.201:80/api/public/info/
ERROR    2023-10-24 19:52:03,958 Internal Server Error: /api/public/syncqueue/
Traceback (most recent call last):
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/query.py", line 464, in get_or_create
    return self.get(**lookup), False
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/query.py", line 378, in get
    raise self.model.DoesNotExist(
kolibri.core.device.models.DoesNotExist: SyncQueue matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/backends/utils.py", line 69, in execute
    return self.cursor.execute(sql, params)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/backends/sqlite3/base.py", line 328, in execute
    return Database.Cursor.execute(self, query, params)
sqlite3.IntegrityError: UNIQUE constraint failed: device_syncqueue.user_id, device_syncqueue.instance_id

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/core/handlers/exception.py", line 41, in inner
    response = get_response(request)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/datos/le/mio/kolibri/kolibri/core/public/api.py", line 342, in create
    queue_object = self.create_or_update(request)
  File "/datos/le/mio/kolibri/kolibri/core/public/api.py", line 305, in create_or_update
    queue_object, created = SyncQueue.objects.get_or_create(
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/query.py", line 466, in get_or_create
    return self._create_object_from_params(lookup, params)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/query.py", line 509, in _create_object_from_params
    six.reraise(*exc_info)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/utils/six.py", line 686, in reraise
    raise value
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/query.py", line 500, in _create_object_from_params
    obj = self.create(**params)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/query.py", line 394, in create
    obj.save(force_insert=True, using=self.db)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/base.py", line 807, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/base.py", line 838, in save_base
    updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/base.py", line 924, in _save_table
    result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/base.py", line 962, in _do_insert
    return manager._insert([self], fields=fields, return_id=update_pk,
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/query.py", line 1079, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1112, in execute_sql
    cursor.execute(sql, params)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/backends/utils.py", line 65, in execute
    with self.db.wrap_database_errors:
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/backends/utils.py", line 69, in execute
    return self.cursor.execute(sql, params)
  File "/datos/le/mio/kolibri/venv/lib/python3.10/site-packages/django/db/backends/sqlite3/base.py", line 328, in execute
    return Database.Cursor.execute(self, query, params)
django.db.utils.IntegrityError: UNIQUE constraint failed: device_syncqueue.user_id, device_syncqueue.instance_id
INFO     2023-10-24 19:52:04,036 192.168.0.60 - - "POST /api/public/syncqueue/" 500 0 "" "Kolibri/0.16.0b6.dev0+git.211.g1a3009ce python-requests/2.27.1"

I've tried it both using an old kolibri instance migrated from 0.15.x or with a new instance created from scratch. In both cases the result is the same

On the code side, nothing called my attention excepting the small problem with timestamp.

kolibri/core/public/api.py Outdated Show resolved Hide resolved
@rtibbles
Copy link
Member

Just to leave a paper trail here - there seem to be two issues at play here. The first is the issue that both @pcenov and @jredrejo have reproduced, which is that syncing seems to get into a failure state, where trying to create a SyncQueue object on the server continuously results in integrity errors.

The other appears to be a regression in the merge user workflow, whereby after the user is merged, and the old user deleted, this is not properly propagated to the frontend, and the cached FacilityUser that is now deleted is still being referenced.

@rtibbles
Copy link
Member

Have made fixes here for currently failing test cases as a start. Now focusing on the behaviour issues.

Prevent errors by relying on unique together enforcement.
@rtibbles
Copy link
Member

Just to leave a paper trail here - there seem to be two issues at play here. The first is the issue that both @pcenov and @jredrejo have reproduced, which is that syncing seems to get into a failure state, where trying to create a SyncQueue object on the server continuously results in integrity errors.

The other appears to be a regression in the merge user workflow, whereby after the user is merged, and the old user deleted, this is not properly propagated to the frontend, and the cached FacilityUser that is now deleted is still being referenced.

It appears that these were actually related - resolving the former also resolved the latter. However, there still appear to be issues with sync status being properly reported to the frontend.

@rtibbles
Copy link
Member

A sync status is now being generated, unfortunately, it always seems to be 'not recently synced'. Working to resolve.

@rtibbles
Copy link
Member

Sync status should now also be properly reported on the LOD side and the server side (I have confirmed both locally). This should be ready for more testing.

@pcenov
Copy link
Member

pcenov commented Oct 25, 2023

Hi @rtibbles syncing is still not working at all. Tried with both Ubuntu VM and Windows 11 as server and clean installs of Kolibri.

On the LOD the main error in the logs is:
Unable to request SoUD sync from unavailable network location

2023-10-25_12-39-00

The sync status seems to be partially working - initially on the coach side it's displayed correctly as 'Synced' but on the LOD it's displayed as 'Not recently synced' right from the start.

Logs:
UbuntuLOD.zip
UbuntuServer.zip
AndroidLearnerLogs.zip
WindowsServer.zip

jredrejo
jredrejo previously approved these changes Oct 25, 2023
Copy link
Member

@jredrejo jredrejo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything seems to work now fine after the latest commits

kolibri/core/public/api_urls.py Show resolved Hide resolved
@rtibbles
Copy link
Member

OK, @pcenov - this should be good for you to test again, to the best of my and @jredrejo's attempts to test, this should now be working as intended, and I have done an additional tweak to make sure the sync status is being properly updated on both LOD and server.

🤞

@rtibbles rtibbles dismissed jredrejo’s stale review October 25, 2023 18:48

Happy for the validation, but will await QA approval for final sign off.

@pcenov
Copy link
Member

pcenov commented Oct 26, 2023

Hi @rtibbles, the 2 main issues I was able to identify are:

  1. After I merge a learner created through the 'On my own flow', all the assigned lessons/quizzes get correctly synced to the learner device however after I complete any of the assigned lessons/quizzes the sync won't start automatically - I have to restart the server after which syncing seems to be working correctly and occurs frequently.
2023-10-26_16-55-15.mp4
  1. Syncing with the Android app is not working correctly for most of the time - after the import of the learner all of the assigned resources get synced correctly but after that the sync won't start even after restarting both the server and the app repeatedly and waiting for extended periods of time. I did manage to get it to sync once or twice but it's not easy.

The good news is that syncing to a LOD on Ubuntu is working correctly and occurs frequently.

UbuntuServer.zip
AndroidLODLogs.zip
AndroidMergedLODLogs.zip
MergedUserLogs.zip
UbuntuLODLogs.zip

@rtibbles
Copy link
Member

For Android, is the behaviour the same regardless of whether it's the merged user flow or the import flow?

@rtibbles
Copy link
Member

OK, I replicated the merged user syncing issues and they should now be resolved.

I will need to do some local replication of the Android syncing issues using an Android device, as I suspect there is something happening at the Android app layer that is not getting caught in the logs. I think we can merge this PR regardless of this, and I can file a follow up issue to investigate further.

Copy link
Member

@pcenov pcenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rtibbles I confirm that the merged user syncing issues are indeed fixed now. No issues observed after regression testing so I think this is good to go.
I also confirm my observations from yesterday about Android (for both merged and not merged learner), and today I was not able to sync it with the server as well. Adding the logs from today in case you are able to spot something but as you say it can be tracked separately.

AndroidMergedLOD.zip
Admin.zip
AndroidLOD.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
APP: Learn Re: Learn App (content, quizzes, lessons, etc.) APP: User Re: User app (sign-in, sign-up, user profile, etc.) DEV: backend Python, databases, networking, filesystem... SIZE: very large
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants