8
Awlex
340d

Fuck me, I spent the last 2 days trying to populate a database about the game Satisfactory from the wikia, only to read on a subreddit that they ship a JSON file with the API data I want. I need to check that tomorrow, because right now I just want to sleep, but if it's true, just kill me.

I FUCKING HATE PARSING WEBSITES

Comments
  • 3
Parsing websites is already bad, but have you tried parsing PDFs? It's even worse (ノ°益°)ノ
  • 0
    @TheSilent sounds like fun. What was your experience?
  • 1
    @Wisecrack Dread. There are multiple special cases that made it much more difficult than just parsing table rows.

    https://satisfactory.fandom.com/wik...
    Items that have more than 2 inputs get a rowspan of 2, which means the parser would need to be stateful

    https://satisfactory.fandom.com/wik...
    Multiple related items on the same page

    And I didn't want to continue this any further.

There totally are sites whose data you can index by parsing like that, but this isn't one of them, unless you're ready to go the extra mile (rough sketch of that below).
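
    Going that extra mile would look roughly like this: a minimal, untested sketch with BeautifulSoup that carries rowspan cells over into the following rows, so the parser stays stateful without hand-editing every special case (assumes well-formed rows; this is not the actual Fandom markup or anyone's real code):

    # minimal sketch: stateful parse of a wiki table where some cells span 2+ rows
    from bs4 import BeautifulSoup

    def parse_table(html):
        soup = BeautifulSoup(html, "html.parser")
        rows = []
        carry = {}  # column index -> [rows still to fill, cell text]
        for tr in soup.find_all("tr"):
            cells, col = [], 0
            pending = iter(tr.find_all(["td", "th"]))
            while True:
                if col in carry:  # cell spilled over from a rowspan in an earlier row
                    carry[col][0] -= 1
                    cells.append(carry[col][1])
                    if carry[col][0] == 0:
                        del carry[col]
                    col += 1
                    continue
                cell = next(pending, None)
                if cell is None:
                    break
                text = cell.get_text(strip=True)
                span = int(cell.get("rowspan", 1))
                if span > 1:  # remember this cell for the next span-1 rows
                    carry[col] = [span - 1, text]
                cells.append(text)
                col += 1
            if cells:
                rows.append(cells)
        return rows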
  • 1
    @Awlex

    Sometimes something as simple as a counter that tracks closing structures (parens, braces, closing tags, etc.) is enough to make a parser stateful.

    I had a similar issue when I was outputting what turned out to be a run-length-encoded data structure that was all binary.

    If you know the count of something, a simple switch in a loop that checks whether the count increases can toggle an exception, or use that count as a form of indexing for variable-length encodings.

    For example, in my case I had an array like so:

    [1111000110000000000111001111111] etc., and I wanted to pack each run into its own subarray.

    I did something to the effect of

    # split data into runs: a new sublist starts whenever the bit value changes
    results = []
    sublist = []
    for current in data:
        if sublist and current != sublist[-1]:
            results.append(sublist)    # run boundary: flush the finished run
            sublist = []
        sublist.append(current)
    if sublist:
        results.append(sublist)        # don't drop the last run
    Handled all the cases w/o knowing sublist lengths. Just an idea.
  • 1
    @Wisecrack We tried to parse them for automated text processing. But since a PDF just has text boxes with positions, it's already hard to get the text order right, let alone figure out which boxes are relevant. They're also unstructured by nature, since they're usually created by different people in different styles, so getting structured text out of them was not a lot of fun.
  • 0
    @TheSilent You probably want to look up a PostScript bytecode manual or some other kind of reference. Sounds super fun.
  • 1
    @TheSilent Some relevant links that took just a little googling:

    https://pypi.org/project/...

    https://code.activestate.com/recipe...

    I'd write a script to OCR each page, compute a similarity measure (for example, any hash algorithm designed specifically for measuring similarity) over a one-hot sparse encoding of each paragraph of OCR'd text, do the same for the text decoded from the PDF, and use that to suss out the ordering (rough sketch at the end of this comment).

    Looking at it now, I could probably do this in an afternoon.

    Shit that sounded arrogant. I'm just a junior, don't throw any chairs at my head.
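
    Rough sketch of the matching step, with plain Jaccard similarity over token sets standing in for the similarity hash (all names hypothetical and untested; the OCR and PDF-decode steps are assumed to have already produced the two string lists):

    def tokens(text):
        # poor man's one-hot encoding: the set of lowercased words
        return set(text.lower().split())

    def similarity(a, b):
        # Jaccard similarity between two token sets, 0.0 .. 1.0
        sa, sb = tokens(a), tokens(b)
        if not sa or not sb:
            return 0.0
        return len(sa & sb) / len(sa | sb)

    def reorder(decoded_boxes, ocr_paragraphs):
        # decoded_boxes:  text boxes pulled out of the PDF, order unknown
        # ocr_paragraphs: OCR'd paragraphs, already in reading order
        remaining = list(enumerate(decoded_boxes))
        ordered = []
        for para in ocr_paragraphs:
            if not remaining:
                break
            # greedily take the decoded box most similar to this OCR paragraph
            idx, best = max(remaining, key=lambda pair: similarity(pair[1], para))
            ordered.append(best)
            remaining = [p for p in remaining if p[0] != idx]
        ordered.extend(box for _, box in remaining)  # whatever OCR never matched
        return ordered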
  • 1
    @Wisecrack Thank you for all the pointers, but the project was more of a prototype and has been "completed" (read: I don't work on it anymore).
    I think back then we also needed some heuristic based on font sizes to eliminate page numbers and other irrelevant text snippets.
    The project was a platform for automated document organization, using transformers for summarization plus a Lucene-based search as well as a semantic search based on text embeddings (roughly the kind of thing sketched below).
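
    (Not the project's actual code, just a generic illustration of the embedding-search part: cosine similarity between a query embedding and precomputed document embeddings, whatever encoder they come from.)

    import numpy as np

    def semantic_search(query_vec, doc_vecs, top_k=5):
        # query_vec: (d,) embedding of the query
        # doc_vecs:  (n, d) matrix of precomputed document embeddings
        # cosine similarity is just a dot product of L2-normalized vectors
        q = query_vec / np.linalg.norm(query_vec)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        scores = d @ q
        return np.argsort(scores)[::-1][:top_k]  # indices of the best matches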