Detecting Chinese and Japanese Characters in JavaScript


Elasticsearch's search behavior can be surprising. It handles searches in every language well except Chinese and Japanese, for example Chinese “餐馆” and Japanese “大戸屋”. The front-end workaround is to wrap those characters in double quotes. Chinese, Japanese and Korean are usually grouped together as CJK because they are similar, but Korean seems to work fine. So this post discusses how to detect Chinese and Japanese characters in JavaScript.

Check Japanese

<figure class="highlight"><pre><code class="language-js" data-lang="js">    <span class="kd">const</span> <span class="nx">REGEX_JAPANESE</span> <span class="o">=</span> <span class="sr">/</span><span class="se">[\u</span><span class="sr">3000-</span><span class="se">\u</span><span class="sr">303f</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">3040-</span><span class="se">\u</span><span class="sr">309f</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">30a0-</span><span class="se">\u</span><span class="sr">30ff</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">ff00-</span><span class="se">\u</span><span class="sr">ff9f</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">4e00-</span><span class="se">\u</span><span class="sr">9faf</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">3400-</span><span class="se">\u</span><span class="sr">4dbf</span><span class="se">]</span><span class="sr">/</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">hasJapanese</span> <span class="o">=</span> <span class="p">(</span><span class="nx">str</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">REGEX_JAPANESE</span><span class="p">.</span><span class="nf">test</span><span class="p">(</span><span class="nx">str</span><span class="p">);</span>
</code></pre></figure>

You can find the details on Stack Overflow.
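A quick check of the helper above, as a sketch. The sample string “ラーメン” is made up for illustration, and `hasKana` is a hypothetical helper that is not part of the original snippet. One thing worth noticing: kanji and Chinese characters share the CJK Unified Ideographs block (U+4E00 and up), so this regex also matches Chinese text. Only hiragana and katakana are specific to Japanese.

```javascript
// Same ranges as REGEX_JAPANESE above, merged into one character class.
const REGEX_JAPANESE = /[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/;
const hasJapanese = (str) => REGEX_JAPANESE.test(str);

// Hypothetical helper: hiragana and katakana only.
const REGEX_KANA = /[\u3040-\u309f\u30a0-\u30ff]/;
const hasKana = (str) => REGEX_KANA.test(str);

console.log(hasJapanese('大戸屋'));   // true
console.log(hasJapanese('ラーメン')); // true
console.log(hasJapanese('hello'));    // false
console.log(hasJapanese('餐馆'));     // true, the ideograph range is shared with Chinese
console.log(hasKana('餐馆'));         // false
console.log(hasKana('ラーメン'));     // true
```

The kana-only check is not a full replacement either: a Japanese word written purely in kanji, such as “大戸屋”, contains no kana and would not be detected by it.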

Check Chinese

<figure class="highlight"><pre><code class="language-js" data-lang="js">    <span class="kd">const</span> <span class="nx">REGEX_CHINESE</span> <span class="o">=</span> <span class="sr">/</span><span class="se">[\u</span><span class="sr">4e00-</span><span class="se">\u</span><span class="sr">9fff</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">3400-</span><span class="se">\u</span><span class="sr">4dbf</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">{20000}-</span><span class="se">\u</span><span class="sr">{2a6df}</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">{2a700}-</span><span class="se">\u</span><span class="sr">{2b73f}</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">{2b740}-</span><span class="se">\u</span><span class="sr">{2b81f}</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">{2b820}-</span><span class="se">\u</span><span class="sr">{2ceaf}</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">f900-</span><span class="se">\u</span><span class="sr">faff</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">3300-</span><span class="se">\u</span><span class="sr">33ff</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">fe30-</span><span class="se">\u</span><span class="sr">fe4f</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">f900-</span><span class="se">\u</span><span class="sr">faff</span><span class="se">]</span><span class="sr">|</span><span class="se">[\u</span><span class="sr">{2f800}-</span><span class="se">\u</span><span class="sr">{2fa1f}</span><span class="se">]</span><span class="sr">/u</span><span 
class="p">;</span>
<span class="kd">const</span> <span class="nx">hasChinese</span> <span class="o">=</span> <span class="p">(</span><span class="nx">str</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">REGEX_CHINESE</span><span class="p">.</span><span class="nf">test</span><span class="p">(</span><span class="nx">str</span><span class="p">);</span>
</code></pre></figure>

You can find the details here. Both are simple regular expressions that check Unicode ranges. If you know the Unicode ranges for each language, this is easy to do. I spent a lot of time figuring out the regex for Chinese, so let’s look at the details.
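To connect the two checks back to the workaround from the introduction, here is a minimal sketch. `quoteIfCjk` and `hasChineseOrJapanese` are hypothetical names, not from the original code, and the assumption is that the front end simply wraps the whole query in double quotes before sending it to Elasticsearch. The regex merges the Japanese and Chinese ranges listed above into one character class.

```javascript
// Kana, CJK punctuation and full-width forms (from the Japanese regex)
// plus the ideograph and compatibility blocks (from the Chinese regex).
// The u flag is required for the \u{...} escapes.
const REGEX_CJ = /[\u3000-\u30ff\uff00-\uff9f\u3300-\u33ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\ufe30-\ufe4f\u{20000}-\u{2a6df}\u{2a700}-\u{2ceaf}\u{2f800}-\u{2fa1f}]/u;

const hasChineseOrJapanese = (str) => REGEX_CJ.test(str);

// Hypothetical helper: quote the query only when it contains such
// characters and is not already wrapped in double quotes.
const quoteIfCjk = (query) => {
  const trimmed = query.trim();
  const alreadyQuoted =
    trimmed.length >= 2 && trimmed.startsWith('"') && trimmed.endsWith('"');
  if (!hasChineseOrJapanese(trimmed) || alreadyQuoted) {
    return trimmed;
  }
  return `"${trimmed}"`;
};

console.log(quoteIfCjk('餐馆'));   // "餐馆"
console.log(quoteIfCjk('大戸屋')); // "大戸屋"
console.log(quoteIfCjk('pizza'));  // pizza
console.log(quoteIfCjk('"餐馆"')); // "餐馆"
```

This sketch does not escape double quotes that appear inside the query; a real search box would probably need to handle that case as well.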

Unicode Regex

In JavaScript, there are several ways to write the same string:

<figure class="highlight"><pre><code class="language-js" data-lang="js">    <span class="kd">const</span> <span class="nx">a</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">A</span><span class="dl">'</span><span class="p">;</span>
<span class="c1">// hexadecimal escape for code points from U+0000 to U+00FF</span>
<span class="kd">const</span> <span class="nx">b</span> <span class="o">=</span> <span class="dl">'</span><span class="se">\</span><span class="s1">x41</span><span class="dl">'</span><span class="p">;</span>
<span class="c1">// Unicode escape for code points from U+0000 to U+FFFF</span>
<span class="kd">const</span> <span class="nx">c</span> <span class="o">=</span> <span class="dl">'</span><span class="se">\</span><span class="s1">u0041</span><span class="dl">'</span><span class="p">;</span>
<span class="c1">// Unicode code point escape: one to six hex digits, covers every Unicode code point</span>
<span class="kd">const</span> <span class="nx">d</span> <span class="o">=</span> <span class="s2">`\u{0041}`</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">isTrue</span> <span class="o">=</span> <span class="nx">a</span> <span class="o">===</span> <span class="nx">b</span> <span class="o">&amp;&amp;</span> <span class="nx">b</span> <span class="o">===</span> <span class="nx">c</span> <span class="o">&amp;&amp;</span> <span class="nx">c</span> <span class="o">===</span> <span class="nx">d</span><span class="p">;</span>
</code></pre></figure>

Why do we need the Unicode code point escape? Because we need to support code points that take more than four hex digits, such as the one below or the ranges for Chinese characters used in REGEX_CHINESE. Here is another example:

<figure class="highlight"><pre><code class="language-js" data-lang="js">    <span class="kd">const</span> <span class="nx">a</span> <span class="o">=</span> <span class="dl">'</span><span class="se">\</span><span class="s1">u{1f4a9}</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">b</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">💩</span><span class="dl">'</span><span class="p">;</span>
<span class="kd">const</span> <span class="nx">c</span> <span class="o">=</span> <span class="dl">'</span><span class="se">\</span><span class="s1">uD83D</span><span class="se">\</span><span class="s1">uDCA9</span><span class="dl">'</span><span class="p">;</span>
<span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="nx">a</span> <span class="o">===</span> <span class="nx">b</span> <span class="o">&amp;&amp;</span> <span class="nx">b</span> <span class="o">===</span> <span class="nx">c</span><span class="p">);</span>    <span class="c1">// true</span>
<span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="nx">a</span><span class="p">.</span><span class="nx">length</span> <span class="o">===</span> <span class="mi">2</span><span class="p">);</span>        <span class="c1">// true</span>
<span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="nx">a</span><span class="p">.</span><span class="nf">charCodeAt</span><span class="p">(</span><span class="mi">0</span><span class="p">).</span><span class="nf">toString</span><span class="p">(</span><span class="mi">16</span><span class="p">),</span> <span class="nx">a</span><span class="p">.</span><span class="nf">charCodeAt</span><span class="p">(</span><span class="mi">1</span><span class="p">).</span><span class="nf">toString</span><span class="p">(</span><span class="mi">16</span><span class="p">));</span>    <span class="c1">// d83d, dca9</span>
<span class="c1">// codePointAt(1) starts in the middle of the surrogate pair, so it returns the lone low surrogate dca9; codePointAt(2) is past the end of the string and returns undefined.</span>
<span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="nx">a</span><span class="p">.</span><span class="nf">codePointAt</span><span class="p">(</span><span class="mi">0</span><span class="p">).</span><span class="nf">toString</span><span class="p">(</span><span class="mi">16</span><span class="p">),</span> <span class="nx">a</span><span class="p">.</span><span class="nf">codePointAt</span><span class="p">(</span><span class="mi">1</span><span class="p">).</span><span class="nf">toString</span><span class="p">(</span><span class="mi">16</span><span class="p">));</span> <span class="c1">// 1f4a9, dca9</span>
</code></pre></figure>

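Now we understand why the Chinese regex uses \u{20000} rather than \u20000: \u reads exactly four hex digits, so \u20000 would mean U+2000 followed by a literal 0. This is a very nice article that introduces the details of Unicode in JavaScript.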
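The surrogate-pair behavior also matters when counting or looping over characters. A small sketch using only standard string APIs (the variable names are illustrative): the string iterator, and therefore spread and `Array.from`, walks code points, while `length` and index access work on UTF-16 code units.

```javascript
// U+0061, U+1F4A9 and U+20000 written as code point escapes.
const s = 'a\u{1f4a9}\u{20000}';

console.log(s.length);      // 5 (1 + 2 + 2 UTF-16 code units)
console.log([...s].length); // 3 (code points)

// Collect each code point as hex, without splitting surrogate pairs.
const codePoints = Array.from(s, (ch) => ch.codePointAt(0).toString(16));
console.log(codePoints);    // [ '61', '1f4a9', '20000' ]

// String.fromCodePoint is the inverse and accepts values above U+FFFF.
console.log(String.fromCodePoint(0x20000) === '\u{20000}'); // true
```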

Also add the u flag to the regex, which means “unicode; treat pattern as a sequence of unicode code points”. Without it, \u{...} escapes are not recognized inside a regex.
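A short sketch of what the u flag changes, using the emoji from the earlier example and U+20000, the first code point of one of the ranges in REGEX_CHINESE. The variable names are illustrative.

```javascript
const poo = '\u{1f4a9}'; // one code point, two UTF-16 code units
const han = '\u{20000}'; // first code point of CJK Unified Ideographs Extension B

// Without u, "." matches a single code unit, so it cannot span the pair.
console.log(/^.$/.test(poo));  // false
console.log(/^.$/u.test(poo)); // true

// With u, \u{...} is a code point escape and ranges above U+FFFF work.
console.log(/[\u{20000}-\u{2a6df}]/u.test(han)); // true
console.log(/\u{1f4a9}/u.test(poo));             // true

// Without u, \u{1f4a9} is not read as a code point escape.
console.log(/\u{1f4a9}/.test(poo));              // false
```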