From: Samuel L. Bayer
Date: Mon May 14 10:09:44 2007
Subject: Re: Problems with Perl Asian encodings?
Samuel L. Bayer wrote:

> Has anyone else done such a comparison of GNU recode and Perl Encode? 
> I'd very much prefer to move to Perl, not simply for efficiency but 
> because, unlike GNU recode, it appears to be actively maintained; 
> however, the error rate is just too high, especially considering that 
> the GNU recode output looks clean, and our users have not complained 
> about it.

Hi again all -

Last week, I sent out a query about Asian encodings and Perl Encode vs. 
GNU recode. Martin Thurn graciously helped me debug this problem, and I 
can now summarize as follows, quoting Martin:

"  In the sample data you sent, in the original GB2312, right after the
word "diode", there is an octal \244 and octal \112.  Octal \244 =
decimal 164 which is not a legal first-byte in GB2312.
   Recode apparently dropped the \244 and left the \112 as-is, a capital
J.
   Encode apparently converted the \244 to a default UTF-8 "unknown
character" and left the \112 as-is, a capital J."
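The byte arithmetic in Martin's note can be checked with a few lines of Perl. This is only an illustrative sketch; the variable names are made up for the example and are not from the original exchange.

```perl
use strict;
use warnings;

# Convert the two octal escapes from the sample data to numbers.
my $first  = oct("244");   # the byte right after "diode"
my $second = oct("112");   # the byte that follows it

# Show each byte in decimal and hex; the second is also a printable ASCII character.
printf "first byte:  %d decimal, 0x%02X hex\n", $first, $first;
printf "second byte: %d decimal, 0x%02X hex, character '%s'\n",
    $second, $second, chr($second);
```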

So it turns out that GNU recode has a mode which drops these illegal 
first bytes. The question, then, is whether the same thing is possible 
in Perl Encode. The documentation for some of the FB_ constants is 
tempting, but pretty opaque.
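One possible workaround, offered as a sketch and not as an answer confirmed in this thread: let Encode decode with its default fallback, which substitutes U+FFFD for input it cannot map, and then strip those substitution characters. The helper name decode_dropping_bad_bytes is hypothetical, and "euc-cn" is assumed to be the right Encode name for the GB2312 data in question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode GB2312 (EUC-CN) bytes and drop anything
# Encode could not map, instead of keeping the U+FFFD placeholder.
sub decode_dropping_bad_bytes {
    my ($bytes) = @_;

    # With no CHECK argument, malformed input is replaced by U+FFFD.
    my $text = decode("euc-cn", $bytes);

    # Remove the substitution characters, approximating what recode did.
    $text =~ s/\x{FFFD}//g;

    return $text;
}

# Sample modeled on the thread: "diode" followed by octal \244 and \112.
my $out = decode_dropping_bad_bytes("diode\244\112");
print encode("UTF-8", $out), "\n";
```

This does not distinguish between illegal bytes and legal sequences that simply have no mapping; both end up removed. Newer Encode releases also document passing a code reference as CHECK, so something like decode("euc-cn", $bytes, sub { "" }) might do the same in one step, but that is an assumption about versions later than the one bundled with Perl 5.8.7.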

Again, I'm using Perl 5.8.7, with the version of Encode that comes with 
that distribution.

Thanks so much in advance -

Sam Bayer
The MITRE Corporation
sam@mitre.org

Posted to group perl.i18n at server nntp.perl.org