Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race conditions in tree operations causing Internal Server Errors #1518

Open
timobrembeck opened this issue Jun 1, 2022 · 21 comments · Fixed by #2082 or #2596 · May be fixed by #3076
Open

Race conditions in tree operations causing Internal Server Errors #1518

timobrembeck opened this issue Jun 1, 2022 · 21 comments · Fixed by #2082 or #2596 · May be fixed by #3076
Assignees
Labels
bug Something isn't working effort: medium Should be doable in <12h prio: urgent Needs to be resolved now(?)
Milestone

Comments

@timobrembeck
Copy link
Member

timobrembeck commented Jun 1, 2022

Describe the Bug

When content forms are submitted several times in succession too quickly, the same translation version can be created multiple times.

Steps to Reproduce

Send edit requests multiple times after another (for me, it didn't work to push the "save" button multiple times, but I could simulate this beaviour and we know that it already happened in the production system, so somehow this problem definitely exists)

Maybe, this is a combination of autosave and the real saving process?

Expected Behavior

There should only exist exactly one database object for each translation version.

Actual Behavior

In some cases, a version can be created multiple times, which breaks the CMS functionality in multiple places.

@timobrembeck timobrembeck added bug Something isn't working prio: high Needs to be resolved ASAP. labels Jun 1, 2022
@timobrembeck timobrembeck added this to the Version 1.2 milestone Jun 1, 2022
@svenseeberg svenseeberg added the effort: medium Should be doable in <12h label Jun 9, 2022
@ulliholtgrave
Copy link
Member

As a first approach we could deactivate the form buttons after they got clicked. That won't really hurt and we can eliminate this possible source for the problem.

@timobrembeck timobrembeck modified the milestones: 22Q2, 22Q3 Jul 18, 2022
@dkehne
Copy link

dkehne commented Jul 20, 2022

Can you specify that a bit more that we can priorize it in our service team?

@timobrembeck
Copy link
Member Author

Can you specify that a bit more that we can priorize it in our service team?

We already released a quick fix so no duplicate versions are created and an internal server error is shown instead.
If you received no complaints from cities that they get errors when trying to save pages, this does not seem to be a big problem anymore. But we can also evaluate our log files to estimate how frequently this error occurs.

@ulliholtgrave
Copy link
Member

However, this is probably the same problem that created the two Landshuts yesterday. As this error always requires some manual cleanup I would prioritize it rather high.

@svenseeberg svenseeberg added prio: urgent Needs to be resolved now(?) and removed prio: high Needs to be resolved ASAP. labels Jul 20, 2022
@timobrembeck timobrembeck changed the title Race condition in content forms Race conditions causing corrupt tree structure Aug 30, 2022
@timobrembeck
Copy link
Member Author

For reference: This is the corresponding issue in our tree library which could have an impact on this:
django-treebeard/django-treebeard#53

@timobrembeck timobrembeck changed the title Race conditions causing corrupt tree structure Race conditions in tree operations causing Internal Server Errors Sep 17, 2022
@timobrembeck
Copy link
Member Author

One example traceback during link replacement:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/views/generic/base.py", line 70, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/views/generic/base.py", line 98, in dispatch
    return handler(request, *args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/integreat_cms/cms/views/linkcheck/linkcheck_list_view.py", line 179, in post
    new_translation.save()
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/integreat_cms/cms/models/abstract_content_translation.py", line 470, in save
    super().save(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 739, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 776, in save_base
    updated = self._save_table(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 881, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 919, in _do_insert
    return manager._insert(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
    cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/cacheops/transaction.py", line 98, in execute
    result = self._no_monkey.execute(self, sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pagetranslation_unique_version"
DETAIL:  Key (page_id, language_id, version)=(9652, 1, 3) already exists.

Traceback in page form:

    super().save(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 739, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 776, in save_base
    updated = self._save_table(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 881, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 919, in _do_insert
    return manager._insert(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
    cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/cacheops/transaction.py", line 98, in execute
    result = self._no_monkey.execute(self, sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pagetranslation_unique_version"
DETAIL:  Key (page_id, language_id, version)=(15753, 3, 18) already exists.

@svenseeberg
Copy link
Member

svenseeberg commented Sep 17, 2022

As far as I understand there are two things that go wrong:

  • Treebeard does ignore the atomic view. It does not use the Django ORM but directly interacts with the database. It could honor the atomic view decorator and the issue would be solved.
  • Treebeard itself does not use a transaction for the multiple database updates it performs.

Is this correct?

@timobrembeck
Copy link
Member Author

Yes, however converting the library to using the Django ORM is probably unrealistic (even for a large upstream contribution), and I don't know how a custom mutex would be implemented here... maybe the same way as django-linkcheck would suffice?

https://github.com/DjangoAdminHackers/django-linkcheck/blob/8cb97b165b1d26f498896c8cda61f26d98f7a8ec/linkcheck/__init__.py#L7

@svenseeberg
Copy link
Member

maybe the same way as django-linkcheck would suffice?

True, this would probably solve most issues that do currently arise. There are some edge cases (crashing processes, full hard disk, ...) for which a real transaction would be even better. But these are probably extremely rare in comparison to those race conditions.

@svenseeberg
Copy link
Member

I grepped the logs today. We have only 12 of these messages in 14 days. I think it is okay to remove the "urgent" label.

@svenseeberg svenseeberg added prio: high Needs to be resolved ASAP. and removed prio: urgent Needs to be resolved now(?) labels Dec 6, 2022
@svenseeberg
Copy link
Member

svenseeberg commented Feb 22, 2023

Right. I expect we should see at least a decrease in our error rate with the current solution. If that is the case I'd be confident enough in the solution to create an upstream PR that fixes the problem for all queries.

@timobrembeck timobrembeck modified the milestones: 23Q1, 23Q2 Mar 23, 2023
@timobrembeck
Copy link
Member Author

Quoting from the other issue:

Now that we have the mutex, I think we can definitely say that we can barely see a difference to before. We currently have 18 duplicate key violations in our log files. AFAICT this is roughly the same amount we had before we introduced the locking mechanism. I think we should remove the mutex and need to rethink the cause of the duplicate key violations.

I'm actually not sure anymore, I could also imagine that page moving was never really the problem and most of the race conditions come from normal page form saving (as described above)...
If that's the case, then the lack of reduction of errors does not necessarily mean that django-db-mutex is not working, but maybe only that we're using it in the wrong places.
Before discarding that library completely, we could also try to wrap all the node model methods into the mutex as well.

@timobrembeck
Copy link
Member Author

Another idea from @ulliholtgrave: We could also move the page tree into their separate db tables based on the region, this would also make sure that they're independent and prevent and cross-region-race-conditions...
This would also require some refactoring, see e.g. https://stackoverflow.com/questions/5036357/single-django-model-multiple-tables

@timobrembeck timobrembeck modified the milestones: 23Q2, 23Q3 Jul 2, 2023
@svenseeberg svenseeberg modified the milestones: 23Q3, 23Q4 Sep 23, 2023
@svenseeberg
Copy link
Member

django-treebeard/django-treebeard#278 could be relevant to solve this issue.

@svenseeberg
Copy link
Member

svenseeberg commented Sep 27, 2023

Another idea from @ulliholtgrave: We could also move the page tree into their separate db tables based on the region, this would also make sure that they're independent and prevent and cross-region-race-conditions... This would also require some refactoring, see e.g. https://stackoverflow.com/questions/5036357/single-django-model-multiple-tables

Personally, I dislike for the idea of creating tables per region. That happened in WP and always was tedious to work with.

@PeterNerlich
Copy link
Collaborator

not closed until confirmed

@PeterNerlich PeterNerlich reopened this Sep 27, 2024
@osmers osmers modified the milestones: 24Q3, 24Q4 Oct 2, 2024
@JoeyStk JoeyStk modified the milestones: 24Q4, Next-up Dec 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment