JavaScript validation issue with international characters

We use the excellent validator plugin for jQuery here on Stack Overflow to do client-side validation of input before it is submitted to the server.

It generally works well; however, this one has us scratching our heads.

The following validator method is used on the ask/answer form for the user name field (note that you must be logged out to see this field on the live site; it’s on every /question page and the /ask page)

$.validator.addMethod("validUserName",
  function(value, element) {
    return this.optional(element) ||
      /^[\w\-\s\dÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/.test(value);
  },
  "Can only contain A-Z, 0-9, spaces, and hyphens.");

Now this regex looks weird but it’s pretty simple:

  • match the beginning of the string (^)
  • match one or more (+) of any of these..
    • word character (\w)
    • dash (-)
    • whitespace (\s)
    • digit (\d)
    • crazy moon language characters (àèìòù etc)
  • now match the end of the string ($)

Yes, we ran into the Internationalized Regular Expressions problem. JavaScript’s definition of “word character” does not include international characters.. at all.
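To make that concrete, here is a minimal sketch of how \w behaves in a plain JavaScript regex (no flags). The helper name isAsciiWord is made up for illustration, and the accented letter is written as a \u escape so the snippet itself stays pure ASCII.

```javascript
// Without flags, \w in JavaScript is equivalent to [A-Za-z0-9_].
// isAsciiWord is a hypothetical helper, not part of the validator plugin.
function isAsciiWord(value) {
  return /^\w+$/.test(value);
}

console.log(isAsciiWord('Bill'));       // true
console.log(isAsciiWord('\u00D3ra'));   // false: U+00D3 (O with acute) is not in [A-Za-z0-9_]
```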

Here’s the weird part: even though we’ve gone to the trouble of manually adding tons of the valid international characters to the regex, it doesn’t work. You cannot enter these international characters in the input box for user name without getting the..

Can only contain A-Z, 0-9, spaces, and hyphens

.. validation return!

Obviously the validation is working for the other parts of the regex.. so.. what gives?

The other strange part is that this validation works in the browser’s JavaScript console but not when executed as a part of our standard *.js includes.

/^[\w-\sÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/
.test('ÓBill de hÓra') === true

We’ve run into some really bizarre international character issues in JavaScript code before, resulting in some very, very nasty hacks. We’d like to understand what’s going on here and why. Please enlighten us!


I think the email and URL validation methods are a good reference here, e.g. the email method:

email: function(value, element) {
    return this.optional(element) || /^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i.test(value);
},

The script to compile that regex.

In other words, replacing your arbitrary list of “crazy moon” characters with this could help:

[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]

Basically this avoids the character encoding issues you have elsewhere by replacing the needs-encoding characters with more general definitions. While not necessarily more readable, so far it’s shorter than your full list.
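As a rough sketch of that suggestion, the user name check could look something like the following. The names USER_NAME_PATTERN and isValidUserName are assumptions for illustration; in the real page the regex would sit inside the $.validator.addMethod callback shown in the question.

```javascript
// Hypothetical standalone version of the user name check.
// All non-ASCII characters are expressed as \u escapes, so the result
// does not depend on how the .js file is saved or served.
var USER_NAME_PATTERN = /^[\w\-\s\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]+$/;

function isValidUserName(value) {
  return USER_NAME_PATTERN.test(value);
}

console.log(isValidUserName('\u00D3Bill de h\u00D3ra')); // true
console.log(isValidUserName('Bill!'));                   // false: "!" is not in the class
```

Note that this range is much broader than the hand-picked list: it also admits non-letter characters in that block, such as U+00D7 (multiplication sign), so it trades precision for robustness.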

This isn’t really an answer but I don’t have 50 rep yet to add a comment… It can definitely be attributed to encoding issues.

Yeah, “ECMA shouldn’t care about encoding…” blah blah. Well, if you’re on Firefox, go to View > Character Encoding > Western (ISO-8859-1), then try using the Name field.

It works fine for me after changing the encoding manually (granted the rest of the page doesn’t like the encoding switch, :P)

(on IE8 you can go to Page > Encoding > Western European (Windows) to get the same effect)

What is the character encoding of the JS file?

For XML QNames I use this RegExp:

/**
 * Definition of an XML Name
 */
var NameStartChar = "A-Za-z:_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"+
                    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"+
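                    // \uXXXX reads only four hex digits, so code points above U+FFFF use \u{...} plus the "u" flag
                    "\uF900-\uFDCF\uFDF0-\uFFFD\u{10000}-\u{EFFFF}";
var NameChar = NameStartChar+"\\-\\.0-9\u00B7\u0300-\u036F\u203F-\u2040";
var Name = "^["+NameStartChar+"]["+NameChar+"]*$";
new RegExp(Name, "u").test(value);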

It works like a charm also with internationalized characters. Note the escaping. Due to that I’m able to restrict the JS file to ASCII characters only. Therefore I don’t get into trouble when dealing with ISO-8859 vs UTF-8 charsets.


This no longer holds if you use a character encoding of which ASCII is not a true subset (UTF-16, for example, which is common in Asia).

Cheers,
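The ASCII-only approach described above can be automated. The following is a minimal sketch of a helper that rewrites every non-ASCII character in a string as a \uXXXX escape, ready to paste into a regex or string literal; toAsciiEscapes is a made-up name, not part of any library.

```javascript
// Hypothetical helper: replace each non-ASCII UTF-16 code unit with its \uXXXX escape.
function toAsciiEscapes(text) {
  return text.replace(/[^\x00-\x7F]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    // Left-pad to four hex digits, as \u escapes require.
    return '\\u' + ('0000' + hex).slice(-4);
  });
}

console.log(toAsciiEscapes('a\u00D3b')); // a\u00D3b (printed as the literal escape sequence)
```

Because the replacement works on UTF-16 code units, a character above U+FFFF comes out as two escapes (its surrogate pair), which is still valid inside a JavaScript string or a regex without the u flag.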

Late to the game here, but I just used this expression and it seemed to work well for me. Seems to be fairly comprehensive and relatively simple:

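// No g flag here: with g, test() resumes from lastIndex, so repeated calls on different strings misreport.
// A-Za-z instead of A-z, which would also admit [ \ ] ^ _ and the backtick.
var re = /^[A-Za-zÀ-Ÿ\s\d-]*$/;
var str1 = 'casa-me,pois 99 estou farto! Eis a lista:uma;duas;três';
var str2 = 'casa-me pois 99 estou farto Eis a lista uma duas três';
var str3 = 'àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ';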

alert(re.test(str1));
alert(re.test(str2));
alert(re.test(str3));

The international characters listed are part of extended ASCII. The ones added by you are certainly not.

Seeing as the statement works in the console, could this have to do with the way your .js files are saved (i.e. ASCII or UTF-8), and that the browser is loading them thusly and in the process translates the characters?

Use something like Fiddler or Charles (not Firebug’s Net panel, or anything else that’s actually inside the browser) to examine what’s actually coming over the wire. It’s almost certainly an encoding issue: either the file has been saved in some Microsoft character set and is being sent as UTF-8, or maybe the other way around.
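As an illustration of one direction of such a mismatch, the sketch below takes a character saved as UTF-8 and reads its bytes back as ISO-8859-1. It assumes Node.js, since Buffer is a Node API and not available in browsers; the variable names are made up for the example.

```javascript
// Assumes Node.js (Buffer is not a browser API).
var original = '\u00D3';                        // O with acute, as typed by the user
var utf8Bytes = Buffer.from(original, 'utf8');  // two bytes: C3 93
var misread = utf8Bytes.toString('latin1');     // each byte decoded as one character

console.log(misread.length);                    // 2
console.log(misread === '\u00C3\u0093');        // true

// A character class built from the misread text no longer contains the original character.
var brokenClass = new RegExp('^[' + misread + ']+$');
console.log(brokenClass.test(original));        // false
```

The same kind of substitution happens silently when a script file is saved in one encoding and interpreted in another, which would explain why a regex that passes in the console fails when loaded from the include.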

In the case of JS RegExps you can, as Boldewyn points out, avoid these problems by specifying the Unicode code point for the characters you want that are outside the US-ASCII range. It would still be as well to make sure you aren’t mixing up encodings between the place where the file is saved and the place where it’s served, though.



The answers/resolutions are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
