Been trying to get JMdict (a relatively comprehensive Japanese dictionary file) into a database so I can do some analysis on the data, and it's been a bit of a pain. The KANJIDIC XML file had me thinking it'd be fairly straightforward, but this thing uses just about every trick possible to complicate what you'd expect to be a simple dictionary file:

* Readings and spellings (kanji usage) are related many-to-many, with the only thing tying them together being an arbitrary ID. Not everything within a group is related, however: certain readings apply only to specific spellings within the group, and vice versa. In short, there's no really meaningful way to establish a headword for a given entry.

* Definitions are buried within broader Sense groups, which clumsily attach metadata and have the same many-to-many (except when not) structure as the readings/spellings.

Suffice it to say, this has made coming up with a logical database schema for it a bit more interesting than usual.
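To make the "many-to-many, except when not" problem concrete, here's a rough sketch of one way it could be modeled, using Python's built-in `sqlite3`. Everything in it is an assumption for illustration: the table and column names are made up, the demo rows are placeholders rather than real dictionary data, and the pairing rule (a reading with no restriction rows goes with every spelling in its entry) is just one interpretation of the structure described above.

```python
import sqlite3

# Hypothetical schema: every table and column name here is invented for illustration.
SCHEMA = """
CREATE TABLE entry   (ent_seq INTEGER PRIMARY KEY);
CREATE TABLE kanji   (id INTEGER PRIMARY KEY, ent_seq INTEGER REFERENCES entry, text TEXT);
CREATE TABLE reading (id INTEGER PRIMARY KEY, ent_seq INTEGER REFERENCES entry, text TEXT);
-- A row here means "this reading applies only to this spelling".
-- Assumption: a reading with no rows applies to every spelling in its entry.
CREATE TABLE reading_restr (
    reading_id INTEGER REFERENCES reading,
    kanji_id   INTEGER REFERENCES kanji
);
-- Senses group the definitions (glosses) and carry the metadata.
CREATE TABLE sense (id INTEGER PRIMARY KEY, ent_seq INTEGER REFERENCES entry);
CREATE TABLE gloss (sense_id INTEGER REFERENCES sense, text TEXT);
"""

# Every spelling/reading pair that is valid under the rule above.
PAIRS = """
SELECT k.text, r.text
FROM kanji k
JOIN reading r ON r.ent_seq = k.ent_seq
WHERE NOT EXISTS (SELECT 1 FROM reading_restr x WHERE x.reading_id = r.id)
   OR EXISTS (SELECT 1 FROM reading_restr x
              WHERE x.reading_id = r.id AND x.kanji_id = k.id)
ORDER BY k.id, r.id
"""


def build_demo():
    """Create an in-memory database holding one placeholder entry."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO entry VALUES (1)")
    db.executemany("INSERT INTO kanji VALUES (?, 1, ?)", [(1, "A"), (2, "B")])
    db.executemany("INSERT INTO reading VALUES (?, 1, ?)", [(1, "x"), (2, "y")])
    # Reading "y" is restricted to spelling "B"; reading "x" is unrestricted.
    db.execute("INSERT INTO reading_restr VALUES (2, 2)")
    return db


def valid_pairs(db):
    """Return (spelling, reading) tuples that are allowed to go together."""
    return db.execute(PAIRS).fetchall()
```

With the placeholder entry, `valid_pairs(build_demo())` pairs "x" with both spellings but "y" only with "B". The same restriction-table idea could presumably be repeated for senses, though it still doesn't solve the headword question.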

It's at least an improvement over the original format, however, which had a couple different ways of setting up the headword section and could splatter tagging information across any part of a given entry. Fine if you're going to grep the flat file, but annoying if you're looking for something more nuanced.

Was looking online last night to see if anyone had a PHP class written to handle entries and didn't turn anything up, but *did* find this amusing exchange from a while back where the creator basically said, "I like my idiosyncratic format and it works for me. Deal with it!": https://sci.lang.japan.narkive.com/...

Grateful to the creator for producing the dictionary I've used most in my studies over the years, but still...

  • 1
    @irene I'm using an XML file, but the original format is a flat text file
  • 1
    Just opened the XML file in nano to see if the .gz archive extracted properly, now feeling lucky I didn't crash the server for trying. 4,621,224 lines total.
  • 0
    Had to do a fresh install of OS X on my main machine today, so while waiting on that I decided to hook up some external hard drives that have been languishing in my laptop case for a while to see what's on them. Need to load it up to confirm, but I *think* I found an old version of the database that I built from the flat file a long time ago. It's not the same schema I'm using for the XML file, but if it loads up properly it'll probably be far easier to just write a migration script between the two than to figure out how to split an XML file into segments...
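If it does come down to the XML file, splitting it into segments may not be necessary: a streaming parser can hand over one entry at a time and throw it away afterwards. The following is only a sketch under assumptions. The element names (`entry`, `ent_seq`, `k_ele`/`keb`, `r_ele`/`reb`, `gloss`) are what I understand the JMdict XML to use and should be checked against the actual file, and `SAMPLE` is placeholder data, not real dictionary content.

```python
import io
import xml.etree.ElementTree as ET

# Placeholder document using the element names assumed above.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<JMdict>
  <entry>
    <ent_seq>1</ent_seq>
    <k_ele><keb>A</keb></k_ele>
    <r_ele><reb>x</reb></r_ele>
    <sense><gloss>first meaning</gloss><gloss>second meaning</gloss></sense>
  </entry>
  <entry>
    <ent_seq>2</ent_seq>
    <r_ele><reb>y</reb></r_ele>
    <sense><gloss>another meaning</gloss></sense>
  </entry>
</JMdict>
"""


def iter_entries(source):
    """Yield one dict per <entry>, without holding the whole tree in memory."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "entry":
            yield {
                "ent_seq": int(elem.findtext("ent_seq")),
                "kanji": [k.findtext("keb") for k in elem.findall("k_ele")],
                "readings": [r.findtext("reb") for r in elem.findall("r_ele")],
                "glosses": [g.text for g in elem.iter("gloss")],
            }
            # Drop the entries parsed so far so memory use stays flat.
            root.clear()


if __name__ == "__main__":
    for entry in iter_entries(io.BytesIO(SAMPLE)):
        print(entry)
```

`ET.iterparse` accepts any binary file object, so something like `gzip.open(path, "rb")` should let it read the .gz archive directly without extracting it first. Each yielded dict could then be written to the database in batches.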