Conversation

@carlesalbasboix

Bank statements usually have inconsistent diacritics. I propose removing them when performing a case-insensitive search.

I've only added the diacritics that I'm interested in but I can add more.

@jralls
Member

jralls commented May 9, 2025

What's the exact use case, and how do you propose to protect users who don't want to ignore diacritics? Have you examined every single use of the string predicate to ensure that this change affects your use case and no other?

@carlesalbasboix
Author

carlesalbasboix commented May 10, 2025

In Catalan and Spanish (and I guess in other languages too) diacritics are often ignored in bank statements. For example, sometimes they write "pasteleria" and other times "pastelería".

I set it so that diacritics only get ignored when matching case insensitive. Maybe we could rename the checkbox to "Match case and diacritics"? I don't think that matching diacritics but ignoring case is common. Otherwise I can add another separate checkbox.

@jralls
Member

jralls commented May 10, 2025

> In Catalan and Spanish (and I guess in other languages too) diacritics are often ignored in bank statements. For example, sometimes they write "pasteleria" and other times "pastelería".

You seem to have ignored the diacritics yourself: there's no diacritic in either string (never mind that I couldn't find an example spelling of pasteleria or patisseria that contains diacritics anyway, though the French write pâtisserie).

Anyway, I get the idea. The problem is the "I guess in other languages too": there may be languages in which the same spelling with and without diacritics, or with different diacritics, yields completely different words, and this change would create false matches.

Another problem is that the case-insensitive query is used in two places: one is the Find dialog, which I suspect is your concern, but the other is the register's Invoice/Bill autocompletion feature, where it's unconditional.

I think you could fix both instances by testing the input string for diacritics and modifying the output string only if there are none. That way, if the user types a diacritic they get an exact diacritic match; if they don't, the input is ambiguous and the query can sensibly return matches both with and without diacritics.
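The rule described above could be sketched as follows; this is a minimal illustration in Python using the standard `unicodedata` module as a stand-in for glib's `g_utf8_normalize` (the helper names are hypothetical, not anything in GnuCash):

```python
import unicodedata

def strip_diacritics(s):
    # Decompose (NFD), drop combining marks, recompose (NFC).
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize(
        "NFC", "".join(c for c in nfd if not unicodedata.combining(c)))

def has_diacritics(s):
    # True if the string contains any combining mark after decomposition.
    return any(unicodedata.combining(c)
               for c in unicodedata.normalize("NFD", s))

def matches(needle, haystack):
    # Normalize first, then casefold for the existing
    # case-insensitive behaviour.
    needle = unicodedata.normalize("NFC", needle).casefold()
    haystack = unicodedata.normalize("NFC", haystack).casefold()
    # Only when the user's input carries no diacritics do we strip
    # them from the stored string; an accented query matches exactly.
    if not has_diacritics(needle):
        haystack = strip_diacritics(haystack)
    return needle in haystack

print(matches("pasteleria", "Pastelería Sans"))  # True
print(matches("pastelería", "Pastelería Sans"))  # True
print(matches("pastelería", "Pasteleria Sans"))  # False
```

An accentless query matches both spellings, while an accented query only matches the accented one, which is the asymmetry proposed here.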

Obviously all diacritics in all languages need to be included; otherwise we'll get complaints about inconsistency.

A technical note: diacritics can be represented in two ways, so you need to call g_utf8_normalize(utf8_string, -1, G_NORMALIZE_NFC) on both strings.
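To make those two representations concrete, here is the same word in precomposed and decomposed form, demonstrated with Python's `unicodedata` as an illustrative stand-in for `g_utf8_normalize`:

```python
import unicodedata

precomposed = "pasteler\u00eda"    # "í" as one code point, U+00ED
decomposed  = "pasteleri\u0301a"   # "i" followed by combining acute, U+0301

# The strings render identically but compare unequal byte-for-byte...
print(precomposed == decomposed)                                # False
# ...until both are normalized to the same form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Without normalizing both sides first, a precomposed query would silently fail to match decomposed data even when the visible text is identical.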

Please review the coding standard, in particular that we place open braces ({) on a separate line unless the closing brace can go on the same line.

@carlesalbasboix
Author

There is indeed a diacritic in the second string, on the "í". That's the proper spelling in Spanish.

What about including another checkbox to ignore diacritics? I think that's the most transparent option. I can also add a more exhaustive list of diacritics if I know that the PR will be accepted.

@jralls
Member

jralls commented May 10, 2025

Another checkbox in the Find dialog is fine, but you still need a way to distinguish in the Invoice/Bill autocompletion where there is no dialog box. I suppose you could add an item in Preferences>Business, though @gjanssens might object.

@jralls
Member

jralls commented May 10, 2025

BTW, you do know that this will work in only those two places, right? There are a lot of other string matching functions that don't involve QofQuery.

@christopherlam
Contributor

Shouldn't this be a locale-specific change? And should use ICU?

@jralls
Member

jralls commented May 11, 2025

Inserting ICU into the middle of a g_utf8 workflow doesn't make a lot of sense.

As for locale sensitivity, hmm. That implies a bunch of locale-specific hash tables. There would be a performance benefit, as it would reduce the number of diacritic pairs to the ones used in the current locale, but it would increase the complexity a bit.

@carlesalbasboix
Author

Technically std::unordered_map::find() has average time complexity O(1), so the number of diacritics shouldn't impact performance.

@jralls
Member

jralls commented May 11, 2025

> Technically std::unordered_map::find() has average time complexity O(1), so the number of diacritics shouldn't impact performance.

Hash tables like std::unordered_map have two levels of storage and therefore two components of lookup cost. The outer level is the bucket array: computing the hash and indexing into it is O(1). The inner level is a linked list of all items whose hashes map to the same bucket, and traversing it is O(n), where n is the number of items in that bucket, so the average cost grows with the load factor. Chain traversal is accomplished by pointer chasing, but since in this case the whole hash table is created at once the storage locations should be fairly close together, so it won't suffer the cache-miss problem inherent to such containers when they're loaded one item at a time. But size still does matter: a larger table occupies more cache lines, and searching it can be slower than searching a smaller one.
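As a toy illustration of those two levels (a sketch of separate chaining in Python, not how std::unordered_map is actually implemented): the bucket index is computed in constant time, while the chain scan is linear in the bucket's occupancy.

```python
class ToyHashMap:
    """Separate-chaining hash table: a fixed bucket array of lists."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _chain(self, key):
        # Outer level: O(1) hash computation and bucket indexing.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:          # replace an existing entry
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def find(self, key):
        # Inner level: O(n) scan of the colliding entries in one bucket.
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

m = ToyHashMap()
m.put("á", "a")
m.put("é", "e")
print(m.find("é"))  # e
print(m.find("x"))  # None
```

The average lookup cost depends on the load factor (entries per bucket), not directly on the total entry count; the total size matters mainly through cache behavior, as noted above.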

@jralls
Member

jralls commented May 30, 2025

> Shouldn't this be a locale-specific change? And should use ICU?

Revisiting because I'm working on applying ICU's string search to fix bug 799521 and the situation is very similar to this one. The ICU docs have a discussion about exactly this topic.

@jralls
Member

jralls commented Jun 2, 2025

#2097 is IMO a much better way to accomplish this: it's locale-sensitive, and since it's based on ICU it's far more likely to be correct than anything we're likely to devise on our own.

@carlesalbasboix
Author

The problem with locale-specific behavior is that it doesn't work well in multilingual environments. I have GnuCash in English, but I want to be able to search transactions in a diacritic-insensitive way across multiple languages.

@jralls
Member

jralls commented Jun 5, 2025

Maybe you should test before making blanket statements. #2097 works in the en_US locale; I think it likely that accented/accentless matching will too. It's easy enough to test: apply the PR and rebuild GnuCash, create some transactions with accented characters, and see if they match when you enter the words without the accents.

@christopherlam
Contributor

Would this PR be obsolete once #2097 is merged in?

@jralls
Member

jralls commented Jun 14, 2025

> Would this PR be obsolete once #2097 is merged in?

No, because #2097 doesn't touch QofQuery... but gnc_unicode_has_substring_basic_chars is a better way to accomplish this PR's goal. There's still the matter of what UI adjustments might be appropriate for this in the Find dialog.
