xml

Ranter

ilechuks73

700

Comments

2

Hazarth

9185

3y

Good luck with that. That's a mixed format.

The main document follows some sort of custom format. You can see that the header is marked as "SGML", which is just a standard. At the very least, this means they should follow the standard, so you can parse the main tags like SEC-DOCUMENT and SEC-HEADER into some sort of SecDocument object.

Each custom tag also seems to allow additional info on the same line as the tag, which looks like [filename : ] timestamp.
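
A minimal Node.js sketch of that tag-line split, assuming the trailing info really is an optional filename, then " : ", then a timestamp. The function name parseTagLine and the sample lines are made up for illustration, not taken from a real filing:

```javascript
// Sketch: split a header tag line into tag name, optional filename, and timestamp.
// Assumption: the layout is "<TAG>[filename : ]timestamp", as guessed in the comment above.
function parseTagLine(line) {
  const match = /^<([A-Za-z0-9-]+)>\s*(.*)$/.exec(line.trim());
  if (!match) return null; // not an opening tag line (closing tags land here too)

  const [, tag, rest] = match;
  if (rest === "") return { tag, filename: null, timestamp: null };

  // Either "filename : timestamp" or just "timestamp".
  const sep = rest.indexOf(" : ");
  if (sep === -1) return { tag, filename: null, timestamp: rest };

  return {
    tag,
    filename: rest.slice(0, sep).trim(),
    timestamp: rest.slice(sep + 3).trim(),
  };
}

// Hypothetical sample lines.
console.log(parseTagLine("<SEC-DOCUMENT>example-filing.txt : 20200101"));
console.log(parseTagLine("<SEC-HEADER>20200101"));
```

The timestamp is kept as a string here, since the exact date format is not confirmed by anything in this thread.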

Additionally, the actual header content is a separate format: an indentation tree plus a tab-separated dictionary. The whole thing should be simple to parse into an object and thus translate directly into JSON.
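
Something like the following could handle the tree part. It is only a sketch under a few assumptions: nesting depth is the number of leading tabs, keys end with a colon, and a key with no value opens a nested section. parseHeaderTree and the sample data are hypothetical:

```javascript
// Sketch: turn a tab-indented "KEY:<tabs>VALUE" header into a nested object.
// Assumptions: depth = number of leading tabs; a key without a value opens a section.
function parseHeaderTree(text) {
  const root = {};
  const stack = [{ depth: -1, node: root }]; // the root is never popped

  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "") continue; // skip blank lines

    const depth = /^\t*/.exec(line)[0].length;
    const [rawKey, ...rest] = line.trim().split(/\t+/);
    const key = rawKey.replace(/:$/, "");
    const value = rest.join(" ").trim();

    // Climb back up until the top of the stack is this line's parent.
    while (stack[stack.length - 1].depth >= depth) stack.pop();
    const parent = stack[stack.length - 1].node;

    if (value === "") {
      const child = {};
      parent[key] = child;
      stack.push({ depth, node: child });
    } else {
      parent[key] = value;
    }
  }
  return root;
}

// Hypothetical sample data.
const sampleHeader = [
  "ACCESSION NUMBER:\t\t0000000000-00-000000",
  "FILER:",
  "\tCOMPANY DATA:",
  "\t\tCOMPANY CONFORMED NAME:\t\tEXAMPLE CORP",
  "\tBUSINESS ADDRESS:",
  "\t\tCITY:\t\tSPRINGFIELD",
].join("\n");

console.log(JSON.stringify(parseHeaderTree(sampleHeader), null, 2));
```

One limitation of this sketch: a key that repeats at the same level overwrites the earlier entry, so repeated sections would need to be collected into arrays instead.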

Lastly, the actual document can be in any other format they want; your example is XML. Here you will need to check the <TEXT> tag content for a format tag, probably always the first line after <TEXT>, e.g. <XML>, and then use the proper format parser to transform that into JSON.
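
A rough sketch of that format check, assuming the format tag sits alone on the first line after <TEXT> and is closed by a matching tag before </TEXT>. extractTextBody and the sample document are illustrative only, and the actual XML-to-JSON step is left to whichever XML parser you pick:

```javascript
// Sketch: pull out the <TEXT> body and detect its format from the first line.
// Assumption: the first line after <TEXT> is a wrapper tag such as <XML>.
function extractTextBody(doc) {
  const m = /<TEXT>\s*\n([\s\S]*?)\n\s*<\/TEXT>/.exec(doc);
  if (!m) return null; // no <TEXT> section found

  const lines = m[1].split(/\r?\n/);
  const tag = /^<([A-Za-z0-9]+)>$/.exec(lines[0].trim());
  if (!tag) return { format: "unknown", content: m[1] };

  // Drop the wrapper tag line, then cut at the matching closing tag if present.
  const name = tag[1];
  let content = lines.slice(1).join("\n");
  const close = content.lastIndexOf(`</${name}>`);
  if (close !== -1) content = content.slice(0, close);

  return { format: name.toLowerCase(), content: content.trim() };
}

// Hypothetical sample document.
const sampleDoc = [
  "<DOCUMENT>",
  "<TEXT>",
  "<XML>",
  '<?xml version="1.0"?>',
  "<ownershipDocument><documentType>4</documentType></ownershipDocument>",
  "</XML>",
  "</TEXT>",
  "</DOCUMENT>",
].join("\n");

console.log(extractTextBody(sampleDoc));
```

The returned format string can then be used to pick the parser for content.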

j0n4s

5088

3y

Seems pretty easy: the first part before the real XML seems to be irrelevant.

Then you only parse the real XML and check for documentType.

Or you just search the full text, without parsing, for:

"<documentType>4</documentType>"

Or am I missing something here?
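
In Node.js, the plain-text search could be as small as the sketch below. The name isDocumentType4 is made up, and allowing whitespace inside the tag is an extra assumption beyond the exact string quoted above:

```javascript
// Sketch: check the raw file text for documentType 4 without parsing anything.
// Assumption: optional whitespace around the value is tolerated.
function isDocumentType4(text) {
  return /<documentType>\s*4\s*<\/documentType>/.test(text);
}

console.log(isDocumentType4("<documentType>4</documentType>")); // true
console.log(isDocumentType4("<documentType>3</documentType>")); // false
```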

Hazarth

9185

3y

So really, you need to implement (or find implemented) at least 3 parsers: one for their custom SGML, one for the tree/object data, and one for whatever format the file is in.

Unless this is some sort of standardized format that you can find a library for, but I have never seen this one before, so I wouldn't know. Ultimately, you should be able to parse it easily if you split your workload into separate parsers and then just combine them as needed.

ScriptCoded

18427

3y

Looks like some XML-like format with natural text-like values. Also, some tags such as <acceptance-datetime> are not being closed, which is fun.

The text section has an embedded XML document, which also is interesting.

Do you have some more documents so that we can see if the format deviates? Looks fun and rather straightforward to parse :)

What language are you using btw?

ilechuks73

700

3y

@ScriptCoded I have checked multiple documents and they are the same. I use Node.js.

question

help

parsing