Conversation

@carlesalbasboix

Bank statements usually have inconsistent diacritics. I propose removing them when performing a case-insensitive search.

I've only added the diacritics that I'm interested in but I can add more.

@jralls
Member

jralls commented May 9, 2025

What's the exact use case, and how do you propose to protect users who don't want to ignore diacritics? Have you examined every single use of the string predicate to ensure that this change affects your use case and no other?

@carlesalbasboix
Author

carlesalbasboix commented May 10, 2025

In Catalan and Spanish (and I guess in other languages too) diacritics are often ignored in bank statements. For example, sometimes they write "pasteleria" and other times "pastelería".

I set it so that diacritics only get ignored when matching case insensitive. Maybe we could rename the checkbox to "Match case and diacritics"? I don't think that matching diacritics but ignoring case is common. Otherwise I can add another separate checkbox.

@jralls
Member

jralls commented May 10, 2025

> In Catalan and Spanish (and I guess in other languages too) diacritics are often ignored in bank statements. For example, sometimes they write "pasteleria" and other times "pastelería".

You seem to have ignored the diacritics yourself: there's no diacritic in either string (never mind that I couldn't find an example spelling of pasteleria or patisseria that contains diacritics anyway, though the French write pâtisserie).

Anyway, I get the idea. The problem is the "I guess in other languages too": there may be languages in which the same spelling with and without diacritics, or with different diacritics, yields completely different words, and this change would create false matches.

Another problem is that the case-insensitive query is used in two places: one is the Find dialog, which I suspect is your concern, but the other is the register's Invoice/Bill autocompletion feature, where it's unconditional.

I think you could fix both instances by testing the input string for diacritics and modifying the output string only if there are none. That way, if the user types a diacritic they get an exact diacritic match; if they don't, the input is ambiguous and the query can sensibly return matches both with and without diacritics.
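The rule described above could be sketched as follows; this is a minimal illustration in Python using the standard `unicodedata` module as a stand-in for glib's `g_utf8_normalize` (the helper names are hypothetical, not anything in GnuCash):

```python
import unicodedata

def strip_diacritics(s):
    # Decompose (NFD), drop combining marks, recompose (NFC).
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize(
        "NFC", "".join(c for c in nfd if not unicodedata.combining(c)))

def has_diacritics(s):
    # True if the string contains any combining mark after decomposition.
    return any(unicodedata.combining(c)
               for c in unicodedata.normalize("NFD", s))

def matches(needle, haystack):
    # Normalize first, then casefold for the existing
    # case-insensitive behaviour.
    needle = unicodedata.normalize("NFC", needle).casefold()
    haystack = unicodedata.normalize("NFC", haystack).casefold()
    # Only when the user's input carries no diacritics do we strip
    # them from the stored string; an accented query matches exactly.
    if not has_diacritics(needle):
        haystack = strip_diacritics(haystack)
    return needle in haystack

print(matches("pasteleria", "Pastelería Sans"))  # True
print(matches("pastelería", "Pastelería Sans"))  # True
print(matches("pastelería", "Pasteleria Sans"))  # False
```

An accentless query matches both spellings, while an accented query only matches the accented one, which is the asymmetry proposed here.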

Obviously all diacritics in all languages need to be included; otherwise we'll get complaints about inconsistency.

A technical note: diacritics can be represented in two ways, so you need to call g_utf8_normalize(utf8_string, -1, G_NORMALIZE_NFC) on both strings.
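To make those two representations concrete, here is the same word in precomposed and decomposed form, demonstrated with Python's `unicodedata` as an illustrative stand-in for `g_utf8_normalize`:

```python
import unicodedata

precomposed = "pasteler\u00eda"    # "í" as one code point, U+00ED
decomposed  = "pasteleri\u0301a"   # "i" followed by combining acute, U+0301

# The strings render identically but compare unequal byte-for-byte...
print(precomposed == decomposed)                                # False
# ...until both are normalized to the same form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Without normalizing both sides first, a precomposed query would silently fail to match decomposed data even when the visible text is identical.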

Please review the coding standard, in particular that we place open braces ({) on a separate line unless the closing brace can go on the same line.

@carlesalbasboix
Author

There is indeed a diacritic in the second string, on the "í". That's the proper spelling in Spanish.

What about including another checkbox to ignore diacritics? I think that's the most transparent option. I can also add a more exhaustive list of diacritics if I know that the PR will be accepted.

@jralls
Member

jralls commented May 10, 2025

Another checkbox in the Find dialog is fine, but you still need a way to distinguish in the Invoice/Bill autocompletion where there is no dialog box. I suppose you could add an item in Preferences>Business, though @gjanssens might object.

@jralls
Member

jralls commented May 10, 2025

BTW, you do know that this will work in only those two places, right? There are a lot of other string matching functions that don't involve QofQuery.

@christopherlam
Contributor

Shouldn't this be a locale-specific change? And should use ICU?

@jralls
Member

jralls commented May 11, 2025

Inserting ICU into the middle of a g_utf8 workflow doesn't make a lot of sense.

As for locale sensitivity, hmm. That implies a bunch of locale-specific hash tables. There would be a performance benefit, as it would reduce the number of diacritic pairs to the ones used in the current locale, but it would increase the complexity a bit.

@carlesalbasboix
Author

Technically std::unordered_map::find() has average time complexity O(1), so the number of diacritics shouldn't impact performance.

@jralls
Member

jralls commented May 11, 2025

> Technically std::unordered_map::find() has average time complexity O(1), so the number of diacritics shouldn't impact performance.

Hash tables like std::unordered_map have two levels of storage and therefore two components of lookup cost. The outer level is the bucket array: computing the hash and indexing into it is O(1). The inner level is a linked list of all items whose hashes map to the same bucket, and traversing it is O(n), where n is the number of items in that bucket, so the average cost grows with the load factor. Chain traversal is accomplished by pointer chasing, but since in this case the whole hash table is created at once the storage locations should be fairly close together, so it won't suffer the cache-miss problem inherent to such containers when they're loaded one item at a time. But size still does matter: a larger table occupies more cache lines, and searching it can be slower than searching a smaller one.
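As a toy illustration of those two levels (a sketch of separate chaining in Python, not how std::unordered_map is actually implemented): the bucket index is computed in constant time, while the chain scan is linear in the bucket's occupancy.

```python
class ToyHashMap:
    """Separate-chaining hash table: a fixed bucket array of lists."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _chain(self, key):
        # Outer level: O(1) hash computation and bucket indexing.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:          # replace an existing entry
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def find(self, key):
        # Inner level: O(n) scan of the colliding entries in one bucket.
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

m = ToyHashMap()
m.put("á", "a")
m.put("é", "e")
print(m.find("é"))  # e
print(m.find("x"))  # None
```

The average lookup cost depends on the load factor (entries per bucket), not directly on the total entry count; the total size matters mainly through cache behavior, as noted above.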

@jralls
Member

jralls commented May 30, 2025

> Shouldn't this be a locale-specific change? And should use ICU?

Revisiting because I'm working on applying ICU's string search to fix bug 799521 and the situation is very similar to this one. The ICU docs have a discussion about exactly this topic.

@jralls
Member

jralls commented Jun 2, 2025

#2097 is IMO a much better way to accomplish this: it's locale-sensitive, and since it's based on ICU it's far more likely to be correct than anything we're likely to devise on our own.

@carlesalbasboix
Author

The problem with locale-specific behavior is that it doesn't work well in multilingual environments. I have GnuCash in English, but I want to be able to search transactions in a diacritic-insensitive way across multiple languages.

@jralls
Member

jralls commented Jun 5, 2025

Maybe you should test before making blanket statements. #2097 works in the en_US locale; I think it likely that accented/accentless matching will too. It's easy enough to test: apply the PR and rebuild GnuCash, create some transactions with accented characters, and see if they match when you enter the words without the accents.

@christopherlam
Contributor

Would this PR be obsolete once #2097 is merged in?

@jralls
Member

jralls commented Jun 14, 2025

> Would this PR be obsolete once #2097 is merged in?

No, because #2097 doesn't touch QofQuery... but gnc_unicode_has_substring_basic_chars is a better way to accomplish this PR's goal. There's still the matter of what UI adjustments might be appropriate for this in the Find dialog.
