Elasticsearch's search behavior can be surprising. In our case it handled queries in every language except Chinese and Japanese, for example Chinese “餐馆” and Japanese “大戸屋”. The front-end workaround is to wrap those queries in double quotes. Chinese, Japanese and Korean are usually grouped together as CJK because they are similar, but Korean seems to work fine. So here we discuss how to detect Chinese and Japanese text in JavaScript.
Check Japanese
You can find the details on Stack Overflow.
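The regex from that answer is not reproduced here, so the following is only a minimal sketch of the idea. It assumes the standard Unicode blocks for hiragana, katakana and CJK ideographs; the names `REGEX_JAPANESE` and `hasJapanese` are hypothetical. Keep in mind that kanji share code points with Chinese characters, so a kanji-only string such as “大戸屋” cannot be told apart from Chinese by code point alone.

```javascript
// Hypothetical sketch: these are standard Unicode block ranges,
// not necessarily the exact regex from the linked answer.
//   \u3040-\u309f  Hiragana
//   \u30a0-\u30ff  Katakana
//   \u4e00-\u9fff  CJK Unified Ideographs (kanji, shared with Chinese)
const REGEX_JAPANESE = /[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]/;

// Returns true if the text contains at least one character in those ranges.
function hasJapanese(text) {
  return REGEX_JAPANESE.test(text);
}

console.log(hasJapanese('大戸屋'));     // true
console.log(hasJapanese('ひらがな'));   // true
console.log(hasJapanese('restaurant')); // false
```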
Check Chinese
You can find the details here. Both checks are simple regexes that test Unicode ranges. Once you know the Unicode range for each language, the check is easy to write. I spent a lot of time working out the regex for Chinese, so let's go through the details.
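As a rough illustration, a Chinese check might look like the sketch below. The name `REGEX_CHINESE` comes from this post, but the exact ranges are an assumption based on the Unicode CJK blocks, and `hasChinese` is a hypothetical helper. The supplementary ranges above U+FFFF are the reason the `\u{...}` escape and the `u` flag are needed.

```javascript
// Sketch only: the ranges below are assumed from the Unicode CJK blocks.
//   \u4e00-\u9fff        CJK Unified Ideographs
//   \u3400-\u4dbf        CJK Unified Ideographs Extension A
//   \uf900-\ufaff        CJK Compatibility Ideographs
//   \u{20000}-\u{2a6df}  CJK Unified Ideographs Extension B
//   \u{2f800}-\u{2fa1f}  CJK Compatibility Ideographs Supplement
const REGEX_CHINESE =
  /[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\u{20000}-\u{2a6df}\u{2f800}-\u{2fa1f}]/u;

// Returns true if the text contains at least one character in those ranges.
function hasChinese(text) {
  return REGEX_CHINESE.test(text);
}

console.log(hasChinese('餐馆'));      // true
console.log(hasChinese('\u{20000}')); // true (a supplementary-plane ideograph)
console.log(hasChinese('식당'));      // false (Korean Hangul)
console.log(hasChinese('hello'));     // false
```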
Unicode Regex
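In JavaScript, there are several ways to write the same string in source code: the literal character, a hexadecimal escape (\xXX), a Unicode escape (\uXXXX), and a Unicode code point escape (\u{X} to \u{XXXXXX}).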
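The forms can be compared side by side. This is a small sketch using the letter "A" (U+0041); the variable names are only for illustration.

```javascript
// Several ways to write the same one-character string "A" (U+0041).
const literal = 'A';
const hexEscape = '\x41';         // exactly two hex digits, so U+0000 to U+00FF only
const unicodeEscape = '\u0041';   // exactly four hex digits, so up to U+FFFF
const codePointEscape = '\u{41}'; // ES2015: one to six hex digits, up to U+10FFFF

console.log(literal === hexEscape);       // true
console.log(literal === unicodeEscape);   // true
console.log(literal === codePointEscape); // true
```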
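Why do we need the Unicode code point escape? Because \uXXXX takes exactly four hex digits, so it cannot express code points above U+FFFF, such as the supplementary ranges for Chinese characters used in REGEX_CHINESE. For example, '\u20000' is parsed as the character U+2000 followed by a literal "0", while '\u{20000}' is the single code point U+20000.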
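The difference between the two escapes is easy to check in a console. The sketch below uses hypothetical variable names `fourDigit` and `codePoint`.

```javascript
const fourDigit = '\u20000';   // parsed as '\u2000' (EN QUAD) followed by the digit '0'
const codePoint = '\u{20000}'; // the single code point U+20000

console.log(fourDigit === '\u2000' + '0');          // true
console.log(fourDigit.length);                      // 2 (two separate characters)
console.log(codePoint.length);                      // 2 (one character stored as a UTF-16 surrogate pair)
console.log([...codePoint].length);                 // 1 (string iteration goes by code point)
console.log(codePoint.codePointAt(0).toString(16)); // "20000"
```

Both strings have a length of 2, but for different reasons: the first holds two characters, the second holds one character encoded as two UTF-16 code units.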
Now we understand why the Chinese regex uses \u{20000} rather than \u20000. This is a very nice article that covers JavaScript's Unicode handling in detail.
Also add the u flag to the regex. It means “unicode; treat pattern as a sequence of unicode code points”. Without it, \u{20000} inside a regex is not read as a code point escape.
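To see what the flag changes, here is a short sketch with hypothetical names `withoutFlag` and `withFlag`. Without the `u` flag, the pattern `\u{20000}` is read as the letter "u" followed by the quantifier `{20000}`.

```javascript
const withoutFlag = /^\u{20000}$/; // no "u" flag: the letter "u" repeated 20000 times
const withFlag = /^\u{20000}$/u;   // "u" flag: the single code point U+20000

const ideograph = '\u{20000}';

console.log(withoutFlag.test(ideograph));         // false
console.log(withoutFlag.test('u'.repeat(20000))); // true
console.log(withFlag.test(ideograph));            // true

// The flag also makes "." match a whole code point instead of one UTF-16 code unit.
console.log(/^.$/.test(ideograph));  // false
console.log(/^.$/u.test(ideograph)); // true
```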