Bug #1053
Badword and unicode support
| Status: | Closed | Start date: | 06/29/2010 | |
|---|---|---|---|---|
| Priority: | Normal | Due date: | ||
| Assignee: | - | % Done: | 100% |
|
| Category: | Post Parser | |||
| Target version: | 1.6.1 | |||
| Reproducibility: | Always | Database Type: | ||
| Reported In MyBB Version: | 1.4.13 | Database Version: | ||
| PHP Version: | SQA assignments: | Kieran Dunbar | ||
| Browser: |
Description
For some reason, the badwords filter doesn't work as expected with non-latin languages.
Reproduction steps:
I added the words "filter", "?????", "low-impact" and "??-???" to the badword filter, then I created a post with all those words in it (the post also contain the pular forms "filters" and "???????"). For the enlgish words, only the singular form was masked, and the hyphen-separated phrase was also masked. For the Persian ones, the singular "?????" was not masked, but the pluarl "???????" was shown as "*****???" (that is, partly-masked); the hyphen-separated Persian word was not masked at all.
Additional information:
A Persian user once sent me an email suggesting a regex change in the badword filter function; it required changing the starting (|\b) to (|\s|\b) and changing the ending (\b|$) to (\b|\s|$). I didn't give it thorough tests, and I wondered why it should solve the problem, because theoretically, checking for word boundaries should be enough.
PS: Redmine is unable to show the Persian characters; however, you can get the concept even by seeing the question marks only.
Associated revisions
Fixes Badword and unicode support (fixes:1053)
Fixes Badword and unicode support (fixes:1053)
History
#1 Updated by Ryan Gordon almost 2 years ago
When you added \b to the badword filter you broke unicode support because PCRE "wordbreaks" doesn't support unicode - That is why I suggested using ^ and $ instead. Maybe I didn't make myself through though, it shouldn't been instead, not addition to.
Anyway this is a regression of that previous related bug report then.
#2 Updated by Huji Lee almost 2 years ago
First, I didn't add the \b's here. I added them somewhere else (that is bug #1028 and has nothing to do with this one), but if we figure this is the cause of the current bug, I'll remove them from there too.
Second, removing \b's doesn't solve the problem. Changing that line of code to:
$message = preg_replace("#^".$badword['badword']."$#i", "\\1".$badword['replacement']."\\2", $message);
will make the code even unable to remove English badwords.
#3 Updated by Huji Lee almost 2 years ago
Besides, \b is a totally different pattern from $ and ^.
\b is a word boundary (supposed to match at the beginning or end of a word) as described here: http://www.regular-expressions.info/wordboundaries.html
^ and $ match the beginning and the end of a string (and not a word) respectively) as mentioned here: http://www.regular-expressions.info/anchors.html
We can't replace one for another, unless we're working with one-word sentences.
#4 Updated by Ryan Gordon almost 2 years ago
Ah - sorry, thought we were talking about the changes to your commit. That is probably still a regression but just separate.
Do you have your PCRE compiled with unicode support?
#5 Updated by Huji Lee almost 2 years ago
I need to check that. From what I know, there are many other users using famous shared hosts having this problem. So either they all don't compile PCRE with Unicode support, or this has nothing to do with the problem.
For the record, using mb_eregi_replace did not solve the issue.
#6 Updated by Huji Lee almost 2 years ago
But if you have yours compiled with Unicode support, I can give you a test code and we can see if it works.
#7 Updated by Ryan Gordon almost 2 years ago
The trick is to have a working regex expression that doesn't care about unicode characters.
Technically we're just matching bytes to bytes with badwords so as long as you don't have anything in the regex that incurs logic that could break this (i.e. \b - word boundaries) then it would work fine.
Word boundaries break unicode support. I've seen it, it makes them go all wonky if you have multibyte characters in your regex. That's why we don't use them. If you want to check for a word boundary then check for a space on both sides.
#8 Updated by Huji Lee almost 2 years ago
- Status changed from New to Assigned
- Assignee set to Huji Lee
- Target version set to 1.6.0 Final
But spaces are not the only word boundaries. Full stops, hyphens, etc are also word boundaries.
If we want to avoid using \b, then we need to put in our own set of word boundary checks. Let's see what I can do.
PS: By the way, are we supposed to allow wildcards in badwords? Otherwise, we could use my_str_replace safely.
#9 Updated by Huji Lee almost 2 years ago
- Status changed from Assigned to Resolved
- % Done changed from 0 to 100
Applied in changeset r5081.
This is not ideal; it cannot recognize Arabic semicolon as a word boundary, nor the Zero-Width-Non-Joiner (or Zero-Width-Joiner). However, none of the regexp engines support these either, so we're leveled up with them at least.
Discription of the patch:
With \s, tabs and new lines are also covered.
With \w, full stops, commas, and semicolons are covered.
#10 Updated by Chris Köcher almost 2 years ago
There's a little regression in the r5081:
In my test-installation I'm replacing the word "test" with "bla".
I wrote a post with the text "test test test test test". The output was the following: "bla test bla test bla". The output with r5080 and before was this: "bla bla bla bla bla".
And there's another little problem:
This revision introduces the support of wildcards. But they didn't work without any problems. I replaced "My*BB" with "Forum". The result of a post with "My_BB" was "Forum_". I think the "_" shouldn't be there... ;-)
#11 Updated by Polar Bear over 1 year ago
- Status changed from Resolved to Closed
#12 Updated by Stefan T. over 1 year ago
- Status changed from Closed to Feedback
- Target version deleted (
1.6.0 Final)
#13 Updated by Ryan Gordon over 1 year ago
- Target version set to 1.6.1
#14 Updated by Ryan Gordon over 1 year ago
- Assignee changed from Huji Lee to Ryan Gordon
#15 Updated by Ryan Gordon over 1 year ago
- Status changed from Feedback to Resolved
Applied in changeset r5283.
#16 Updated by Polar Bear over 1 year ago
- SQA assignments set to Stefan T.
#17 Updated by Ryan Gordon over 1 year ago
If this one could be tested thoroughly and with high priority, I'd appreciate it
#18 Updated by Polar Bear over 1 year ago
- Assignee deleted (
Ryan Gordon) - SQA assignments changed from Stefan T. to Kieran Dunbar
#19 Updated by Polar Bear over 1 year ago
- Status changed from Resolved to Closed
Seems to work fine for me with all the provided reproduction steps and some other bits I tried :).