2
Kaji
4y

OK, so I've been working on processing a Japanese dictionary file and things are going smoothly for the most part. Out of ~185,000 entries, I've got 35 that are still causing problems.

The error I'm getting is "Incorrect string value '\xF0\xA4\xAD\xAF' for column...". I've checked all of my encoding and collation settings, and I'm pretty sure I've got it set to properly implement all of Unicode (as well as it does, anyway), as shown in the image attached. My suspicion is the problem characters are likely among the JIS X 0213 character set; in either case we're clearly dealing with a 4-byte character encoding issue here.

If needed I can attach a flag in the database and base64 encode these particular entries so the data isn't lost, but I'd like to just get it to handle the data properly in the first place if possible.

Anyone have any ideas on other items I can check to resolve the error?

Comments
  • 4
    I've spent way too much of my life fighting with MySQL character encodings.

    Unfortunately I dont have any answers just want to let you know you are not alone.
  • 2
    Have you definitely checked you have the correct encoding on all tables and columns, not just globally?

    Had a similar issue a while back, and turns out we still had one sneaky column that was stuck on something non-unicode.
  • 1
    @AlmondSauce All of the tables and columns were set to utf8mb4 from the outset, and I double-checked them as I updated the collations after changing that globally (was originally on utf8mb4_general_ci). Another check just now indicates it's set at the table/column level as well.
  • 1
    @Kaji Have you looked up the char?

    U+24B6F is it if I'm not mistaken...

    That's part of CJK Unified Ideographs Extension B... What's the servers mysql version?
  • 0
    @IntrusionCM Yeah, the 𠮟 character in particular seems to account for half the issues.

    Dev server is running MariaDB 10.4.11 (which came with the latest version of XAMPP); the server I'm hosting things on is using MySQL 5.5.65.

    Interestingly, I went ahead and did a hack where it base64 encoded the information and flagged it if it got this error, and I was able to fix it manually on the server directly using Sequel Pro. The SQL dump to send things over to the remote server converted the offending characters to question marks, but it's letting me fix them there as well without any issues.

    Based on that, I'm almost wondering if it's just PHP's MySQLi having issues with it, perhaps?
  • 1
    @Kaji when you tested against mariadb 10.4 it's definitely Unicode 3.1 support.

    Did you set the correct client encoding?

    @highlight
    $mysqli = new mysqli(...)
    $mysqli->set_charset('utf8mb4');

    Or (DSN)
    charset=utf8mb4

    To verify, I'd dump the Session variables straight from that connection, eg.:

    @highlight
    SHOW SESSION VARIABLES LIKE 'character\_set\_%';

    SHOW SESSION VARIABLES LIKE 'collation\_%';

    Do not use SET NAMEs... Or stuff like that.
  • 0
    @IntrusionCM Just checked in my config.php file for the project and saw the following:

    $db->set_charset("utf8");

    Think you may have just hit on the missing link there. The variable dumps in the top of the post were gotten using the following:

    SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
  • 2
    @Kaji would be nice. It's pretty common to forget this....

    Happens all the time xD
  • 0
    @IntrusionCM Just did a check to be certain, and yeah, it was the set_charset() call in the PHP code. Changed it to $db->set_charset("utf8mb4"); and then ran a query inserting the string '𠮟る叱る' (which should definitely trip things, based on past tests) and it entered it without issue.

    Thanks to everyone for the assistance sorting this one out!
  • 1
    @Kaji Happy to hear that. :)
Add Comment