
merge imports #1394

Draft · wants to merge 2 commits into main

Conversation

obfusk (Contributor) commented Jun 26, 2023

Replaces #1389.

This renumbers IDs -- both from the CSV and in the image file names -- that are present in the imported data, to make sure they do not conflict with the existing data.
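For illustration, the renumbering boils down to something like this sketch (names hypothetical; the real code operates on the CSV rows and image file names):

```java
// Illustrative sketch only, not Catima's actual code: shift every imported ID
// above the highest existing ID so imported data cannot conflict.
import java.util.*;

class IdRenumberer {
    static Map<Integer, Integer> renumber(Set<Integer> existingIds, Set<Integer> importedIds) {
        int next = existingIds.stream().max(Integer::compare).orElse(0) + 1;
        Map<Integer, Integer> mapping = new HashMap<>();
        // TreeSet keeps the relative order of the imported IDs stable.
        for (int id : new TreeSet<>(importedIds)) {
            mapping.put(id, next++);
        }
        return mapping;
    }
}
```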

It also removes the newly added image files if the import fails, which is now safe since there are no conflicts with existing IDs.

Only the Catima importer has to deal with IDs being present in the imported data; all other importers create new IDs, so they are not affected.

I'm not really handling duplicate group names, but errors from DBHelper.insertGroup() are ignored, and I think it makes sense to just effectively merge groups with identical names anyway; we will need to deal with that once we have proper group IDs.

obfusk marked this pull request as ready for review June 26, 2023 15:05
obfusk mentioned this pull request Jun 26, 2023
obfusk (Contributor, Author) commented Jun 28, 2023

After this change, the only errors ignored during import should be duplicate groups (which is fine), unless the data being imported somehow does not conform to the expected format, which should not happen under normal circumstances.

But a lot of the code still doesn't actually check for errors.
That seems like something we might still want to address, in a separate PR.
Let me know how you'd like to proceed with that :)

obfusk (Contributor, Author) commented Jun 28, 2023

> unless the data being imported does not conform to the expected format somehow

It looks like the parser is handling that already. I can't guarantee there are no overlooked edge cases, but I'm not seeing any potential errors being missed now, except for duplicate groups and group mappings (which are perfectly fine to ignore) or something being wrong with the database itself. I don't expect the parser to produce invalid/conflicting data, and I expect valid data not to result in errors.

I still think error checking is a good idea anyway. But it doesn't look like a high priority after this.

obfusk (Contributor, Author) commented Jun 28, 2023

There could be an NPE from invalid data. But that should be caught by the importer and cause the import to fail.

obfusk mentioned this pull request Jun 28, 2023
TheLastProject (Member)

I'm wondering, wouldn't it be cleaner to change how the id database value for cards is generated upon card creation (e.g. a Unix timestamp instead of incrementing by 1 for each new card)?

As I understand your change, it doesn't "merge" imports; it just assumes that nothing in the import exists yet, so it switches the brokenness from silently dropping cards to duplicating cards. I agree that's an improvement, but it's still not the desired state.

TheLastProject (Member) commented Jun 29, 2023

See also #513 and especially #512 (comment) for what feels like a better flow. And CatimaLoyalty/Docs#4 (comment).

obfusk (Contributor, Author) commented Jun 29, 2023

> I'm wondering, wouldn't it be cleaner to change how the id database value for cards is generated upon card creation (e.g. a Unix timestamp instead of incrementing by 1 for each new card)?

We would still need to deal with existing exported data using old IDs though, so that would add complexity. And it might still not produce the desired result if you e.g. add the same card to two devices and then want to import one into the other. This would pretty much only help avoid duplicating the exact same card added on the same device. And then you could still have two different versions that were modified independently on different devices or when merging exports from different times.

> As I understand your change, it doesn't "merge" imports; it just assumes that nothing in the import exists yet, so it switches the brokenness from silently dropping cards to duplicating cards. I agree that's an improvement, but it's still not the desired state.

Fair enough. The point here is to make sure nothing gets overwritten/broken. Having to delete a few duplicates manually is much better than the current situation. I was focused on that, and not so much on dealing with duplicates.

Adding a conflict dialog like Nextcloud seems like quite a bit of work, and since duplicate file/card names are not an issue here, I'm not sure it makes sense to do the same. If you want to go that route I'd suggest merging this for now and making a separate PR for the additional work.

It would be relatively easy to add something like this though:

  • prefer existing data / prefer imported data / don't check for duplicates (i.e. what it does now)
  • choose which fields must be equal for the card to be considered "the same", defaulting to cardid + name maybe

I can add that to this PR if you'd like.
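If we go that route, the field-matching part could look roughly like this (illustrative data model, not Catima's actual classes):

```java
// Illustrative only: a minimal card with a configurable "same card" check,
// defaulting to cardid + name as proposed above. Not Catima's real data model.
import java.util.*;

class DuplicateCheck {
    static class Card {
        final String cardId, name, note;
        Card(String cardId, String name, String note) {
            this.cardId = cardId; this.name = name; this.note = note;
        }
    }

    static final Set<String> DEFAULT_FIELDS = Set.of("cardid", "name");

    // Two cards are considered "the same" when every chosen field matches.
    static boolean sameCard(Card a, Card b, Set<String> fields) {
        for (String f : fields) {
            switch (f) {
                case "cardid": if (!a.cardId.equals(b.cardId)) return false; break;
                case "name":   if (!a.name.equals(b.name))     return false; break;
                case "note":   if (!a.note.equals(b.note))     return false; break;
                default: break; // unknown field names are ignored in this sketch
            }
        }
        return true;
    }
}
```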

obfusk (Contributor, Author) commented Jun 29, 2023

Also note that only Catima exports contain IDs. If we want "intelligent merging", it seems good to apply that to all other import formats as well, not just Catima, for consistency.

obfusk (Contributor, Author) commented Jun 29, 2023

> prefer existing data / prefer imported data / don't check for conflicts

A fourth option would be to prefer latest lastused, but that seems a bit brittle.

obfusk (Contributor, Author) commented Jun 29, 2023

> I can add that to this PR if you'd like.

Or we could merge this as-is and I'll make a follow-up PR for that. Let me know what you prefer.

obfusk (Contributor, Author) commented Jun 30, 2023

Actually, I think this would not be too hard after all:

  • ask / prefer existing data / prefer imported data / don't check for duplicates (i.e. what it does now)
  • choose which fields must be equal for the card to be considered "the same", defaulting to cardid + name

I think I'll implement that in catimerge.

what do you think?

obfusk marked this pull request as draft June 30, 2023 13:13
TheLastProject (Member)

The main issue is that the current import code needs a complete rewrite anyway (see #513). Because that rewrite will require moving the whole thing into a background thread and spawning a notification, using a notification on conflicts to offer options seemed easier, but I guess it is probably still challenging.

My main concern is complicating the current importing code and making it more work to move away from the current deprecated and crashy method.

My secondary concern is switching to a new method that won't work well for multi-device sync. I think bumping the IDs whenever an ID is already in use will be one of those things that severely complicates multi-device sync and risks getting stuck in loops where IDs keep getting incremented on each sync because they keep conflicting.

Aside from that, messing with the IDs seems very hacky, like the kind of thing that will bite us in the future.

You do raise a good point that other importers don't share Catima's ID system; the only combination of fields we can reasonably assume to always exist is name, type and value, so perhaps merging based on those values makes the most sense. It would be nice to detect duplicates based on those fields, but without any way to notify users that a duplicate was found, that is currently useless.

So, in my mind, the cleanest solution is:

  1. (Temporary hack) Drop all data before import
  2. Migrate the importer to a background thread so it won't use deprecated code and won't crash when users navigate away or try to import from, for example, Nextcloud
  3. Write a good merging solution and notify users on conflicts to ask them what to do (at this point we can revert step 1)

I don't really think writing merging code while the importing code is in such a broken state and we can't reliably notify users of conflicts makes a lot of sense as it'll just translate into more code to move to the new system and more things to test before the new system can be considered stable.

obfusk (Contributor, Author) commented Jul 1, 2023

> severely complicate multi-device sync

That wasn't something I was considering in this context. But good point. We should address that.

> (Temporary hack) Drop all data before import

#1389 does that. So feel free to merge that as-is (or request changes). Unfortunately, if the import fails the DB will be unchanged but images may have been overwritten. Preventing that would be quite hacky with the current code, even more so than this PR which at least should keep both old and new data completely intact.

> Migrate the importer to a background thread

That does seem like a good idea. Sadly, I have no experience with that so it'll take more time to figure out. But I'd like to help with that. Any pointers on how to get started with that?

> Write a good merging solution and notify users on conflicts to ask them what to do

My proposal above is an attempt to formulate such a merging strategy. It could almost certainly use some tweaking but the basic concept seems sound to me personally, though I could easily have missed a use case I don't have myself that it doesn't cover.

Feedback on that idea so we can figure out the best merge strategy before implementing it would be appreciated. I would then like to implement that in catimerge so it can be evaluated before modifying Catima itself.

And yes, if we're going to implement a UI for conflict resolution, we should probably do it as part of the rewrite and not based on the current implementation.

TheLastProject (Member)

> Unfortunately, if the import fails the DB will be unchanged but images may have been overwritten. Preventing that would be quite hacky with the current code, even more so than this PR which at least should keep both old and new data completely intact.

I may be missing something but I'd expect this to be fairly simple to solve?

  1. Get a list of all IDs
  2. Start a database transaction
  3. Delete all cards
  4. Try to import all cards (on fail, exit)
  5. Delete the images for every card ID retrieved in step 1
  6. Import all images from the import
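That flow could be sketched like this, with placeholder hooks instead of the real Android/SQLite calls (the point being that image files are only touched after the database transaction has succeeded):

```java
// Hedged sketch of the numbered steps above; all hooks are placeholders,
// not Catima's actual import code.
import java.util.*;
import java.util.function.*;

class ImportFlow {
    static boolean runImport(List<Integer> existingIds,      // step 1
                             Supplier<Boolean> importCards,  // steps 2-4 (transactional)
                             Consumer<Integer> deleteImage,  // step 5
                             Runnable importImages) {        // step 6
        if (!importCards.get()) {
            return false; // transaction rolled back, images untouched
        }
        // Only after the transaction commits do we touch the image files.
        for (int id : existingIds) {
            deleteImage.accept(id);
        }
        importImages.run();
        return true;
    }
}
```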

But maybe it's better to just do nothing until we have a good fix, because any workaround we do now we'll just have to revert again in the future.

> > Migrate the importer to a background thread
>
> That does seem like a good idea. Sadly, I have no experience with that so it'll take more time to figure out. But I'd like to help with that. Any pointers on how to get started with that?

At #622 (comment) I talked about using a Service, but I might be wrong there too. I don't know what the recommended Android method is for long-running background processing (maybe WorkManager?).

> My proposal above is an attempt to formulate such a merging strategy. It could almost certainly use some tweaking but the basic concept seems sound to me personally, though I could easily have missed a use case I don't have myself that it doesn't cover.

I see one big issue with this: asking the user what they want. You can't use that for sync constantly; you'd have to bother users every sync. I'm thinking we should just change the ID to be the current Unix timestamp on creation, consider any difference in a field a conflict, and offer the option of keeping 1, 2 or both (when keeping both we can insert the new card with the current timestamp). That way, the only conflict anyone could possibly get would be multiple users saving a card in the exact same second, which seems unlikely.
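A minimal sketch of that ID scheme (illustrative, not actual Catima code):

```java
// Sketch of the proposed scheme: the card ID is the creation Unix timestamp,
// bumped past any existing ID on the (unlikely) same-second collision.
import java.util.Set;

class TimestampIds {
    static long newId(Set<Long> existingIds, long nowSeconds) {
        long id = nowSeconds;
        while (existingIds.contains(id)) {
            id++; // resolve a same-second collision deterministically
        }
        return id;
    }
}
```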

obfusk (Contributor, Author) commented Jul 13, 2023

> I may be missing something but I'd expect this to be fairly simple to solve?

In theory, yes. But currently, parsing the data and updating the database is interleaved with saving the new images. So existing images may have been overwritten by the time the database transaction is aborted.

We could e.g. save new images with a tmp_ prefix and rename them after the transaction is committed (or delete them when it's aborted), that would be simpler than modifying the current control flow (which doesn't seem worth doing given that we plan to rewrite it all). I can add that to #1389 if you want.
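The name handling for that would be trivial, something like this (illustrative only; the real code would go through the Android file APIs):

```java
// Sketch of the tmp_-prefix idea: save new images under a temporary name, then
// rename on commit or delete on abort. Only the name handling is shown here;
// actual file IO is out of scope for this sketch.
class TmpImageNames {
    static String tempName(String imageName) {
        return "tmp_" + imageName;
    }

    static String commitName(String tempName) {
        if (!tempName.startsWith("tmp_")) {
            throw new IllegalArgumentException("not a temp image name: " + tempName);
        }
        return tempName.substring("tmp_".length());
    }
}
```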

But to me this PR seemed the better interim solution as it guarantees that no data is lost or overwritten. Though of course it may be duplicated, which is clearly a concern for you. Just let me know how you want to proceed :)

obfusk (Contributor, Author) commented Jul 13, 2023

> I see one big issue with this: asking the user what they want. You can't use that for sync constantly; you'd have to bother users every sync. I'm thinking we should just change the ID to be the current Unix timestamp on creation [...] That way, the only conflict anyone could possibly get would be multiple users saving a card in the exact same second, which seems unlikely.

I agree it makes sense to use timestamps to avoid duplicate IDs. That does indeed make syncing and importing easier since we rarely need to modify the ID. It does not however solve the problem with existing IDs (and imports from formats without IDs). As I see it, that leaves us with:

1. Same ID, 100% identical data:

Trivial :)

Though what about e.g. two identical cards with different images?

2. Same ID, different data (e.g. modified independently, could be a completely different name or barcode or just a modified note or colour):

> consider any difference in a field a conflict, and offer the option of keeping 1, 2 or both (when keeping both we can insert the new card with the current timestamp).

Agreed. But what will we do next sync?

Also: last used timestamp will likely differ, so maybe we should ignore that one field (not sure about header colour and star status).

3. Different ID, 100% identical data (e.g. adding the exact same card on two devices independently):

We need to choose which ID to use, but that should not be too hard.

Also: last used timestamp will likely differ, leading us to 4 unless we ignore that.

4. Different ID, not 100% identical data:

This is the hard one. I think that if e.g. the name and barcode are identical (I suggested letting the user choose which fields to use for this, but those seem like a good default), we should consider this a conflict as well and handle it the same way as case 2.

Also: this could happen to more than one pair of cards at a time. And what will we do next sync? If we keep both, there should not be a conflict as we have an identical card with the same ID.

Also: how is sync intended to work? Unidirectional? Bidirectional? Will it remember how conflicts were handled before?
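For reference, the four cases condense into a decision function along these lines (names purely illustrative, not an actual Catima API):

```java
// The four merge cases above as one decision function. "keyFieldsMatch" means
// the user-chosen fields (e.g. name + barcode) are identical.
class MergeDecision {
    enum Outcome { KEEP_ONE, ASK_USER, KEEP_BOTH }

    static Outcome decide(boolean sameId, boolean sameData, boolean keyFieldsMatch) {
        if (sameId) {
            return sameData ? Outcome.KEEP_ONE  // case 1: trivial
                            : Outcome.ASK_USER; // case 2: real conflict
        }
        if (sameData) {
            return Outcome.KEEP_ONE;            // case 3: just pick one ID
        }
        if (keyFieldsMatch) {
            return Outcome.ASK_USER;            // case 4: treat like case 2
        }
        return Outcome.KEEP_BOTH;               // unrelated cards
    }
}
```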
