Comments
-
Parsing websites is already bad, but have you tried parsing PDFs? It's even worse (ノ°益°)ノ
-
Awlex (2y): @Wisecrack Dread. There are multiple special cases that made it much more difficult than just parsing table rows.
https://satisfactory.fandom.com/wik...
Items that have more than 2 inputs get a rowspan of 2, which means the parser would need to be stateful
https://satisfactory.fandom.com/wik...
Multiple related items on the same page
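A minimal sketch of what "stateful" could mean for the rowspan case, using only Python's standard `html.parser`. The class name and the `_carry` bookkeeping are hypothetical, the wiki's real markup is not reproduced here, and colspan and nested tables are ignored:

```python
from html.parser import HTMLParser


class RowspanTableParser(HTMLParser):
    """Collects table rows, repeating cells that span several rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell texts
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the cell being read
        self._span = 1     # rowspan of the cell being read
        self._carry = {}   # column index -> (text, rows still to fill)

    def _fill_carried(self):
        # Copy down cells from earlier rows that still cover this column.
        while len(self._row) in self._carry:
            col = len(self._row)
            text, left = self._carry.pop(col)
            self._row.append(text)
            if left > 1:
                self._carry[col] = (text, left - 1)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._fill_carried()
            self._cell = []
            self._span = int(dict(attrs).get("rowspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            col = len(self._row)
            self._row.append(text)
            if self._span > 1:
                # Remember this cell so the next row(s) get a copy.
                self._carry[col] = (text, self._span - 1)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._fill_carried()
            self.rows.append(self._row)
            self._row = None


# Made-up table: the first cell spans two rows.
parser = RowspanTableParser()
parser.feed(
    "<table><tr><td rowspan='2'>A</td><td>B</td></tr>"
    "<tr><td>C</td></tr></table>"
)
rows = parser.rows
```

The only state carried between rows is the `_carry` dictionary, which is roughly the extra bookkeeping a row-by-row parser lacks.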
And I didn't want to continue this any further.
There totally are sites you can index by parsing data like that, but this isn't one of them, unless you're ready to go the extra mile.
-
@Awlex
Sometimes doing something as simple as a counter that catches the count of closing structures (parens, braces, closing tags, etc) can be enough to make a parser stateful.
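As a rough illustration of the counter idea (not anyone's actual parser), here is a sketch that splits on a separator only while no bracket is open. The function name `split_top_level` is made up, and the counter only tracks depth; it does not check that bracket types match:

```python
PAIRS = {")": "(", "]": "[", "}": "{"}


def split_top_level(text, sep=","):
    """Split text on sep, but only where the bracket depth is zero."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in PAIRS.values():
            depth += 1  # an opening bracket increases the depth
        elif ch in PAIRS:
            depth -= 1  # a closing bracket decreases it
            if depth < 0:
                raise ValueError("unbalanced closing " + ch)
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unclosed bracket")
    parts.append("".join(current).strip())
    return parts


pieces = split_top_level("a, f(b, c), [d, e]")
```

The single integer `depth` is all the state needed to tell a top-level comma from a nested one.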
I had a similar issue when I was outputting what turned out to be a run-length-encoded datastructure that was all binary.
If you know the count of something, a simple check in a loop for whether the count has increased can trigger an exception, or the count itself can serve as a kind of index for variable-length encodings.
For example in my case I had an array like so
[1111000110000000000111001111111] etc. And I wanted to pack each run length into subarrays.
I did something to the effect of
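data = [1, 1, 1, 1, 0, 0, 0, 1, 1]  # first few values of the array above
results = []
sublist = []
for current in data:
    # a change in value closes the current run and starts a new one
    if sublist and current != sublist[-1]:
        results.append(sublist)
        sublist = []
    sublist.append(current)
if sublist:
    results.append(sublist)  # flush the final run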
Handled all the cases w/o knowing sublist lengths. Just an idea.
-
@Wisecrack We tried to parse them for automated text processing. But since a PDF just has text boxes with positions, it is already hard to get the text order right, let alone find out which boxes are relevant. They are also unstructured by nature, since they are usually created by different people in different styles, so it was not a lot of fun to get structured text out of them.
-
@TheSilent You'd probably want to look up a PostScript bytecode manual or some other kind of reference. Sounds super fun.
-
@TheSilent Some relevant links that took just a little googling:
https://pypi.org/project/...
https://code.activestate.com/recipe...
I'd write a script to OCR each page, compute a similarity measure (for example, any hash algorithm designed specifically for measuring similarity) over a one-hot sparse encoding of each paragraph of OCR'd text, do the same for the text decoded from the PDF, and use that to suss out the ordering.
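A rough sketch of the matching step only, assuming the OCR and the PDF text extraction have already produced plain lists of strings. It swaps in Jaccard similarity over word sets (a binary bag of words) for the similarity hash, and the function names are made up for illustration:

```python
import re


def tokens(text):
    """Binary bag of words: the set of lowercase word tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Overlap of two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def order_boxes(ocr_paragraphs, pdf_boxes):
    """Sort pdf_boxes by the position of their best-matching OCR paragraph."""
    ocr_sets = [tokens(p) for p in ocr_paragraphs]
    if not ocr_sets:
        return list(pdf_boxes)  # nothing to compare against

    def best_position(box):
        box_set = tokens(box)
        scores = [jaccard(box_set, s) for s in ocr_sets]
        return max(range(len(scores)), key=scores.__getitem__)

    return sorted(pdf_boxes, key=best_position)


# Made-up data: OCR text in reading order (with one OCR typo),
# PDF text boxes in arbitrary order.
ocr = [
    "Introducton to parsing PDF files",
    "Text boxes have positions only",
    "Conclusion and future work",
]
boxes = [
    "Conclusion and future work",
    "Text boxes have positions only",
    "Introduction to parsing PDF files",
]
ordered = order_boxes(ocr, boxes)
```

Set overlap tolerates a few OCR errors per paragraph, but very short or near-duplicate paragraphs would need something stronger.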
Looking at it now, I could probably do this in an afternoon.
Shit, that sounded arrogant. I'm just a junior, don't throw any chairs at my head.
-
@Wisecrack Thank you for all the pointers, but the project was more of a prototype and has been "completed" (read: I don't work on it anymore).
I think back then we also needed to use some heuristic based on font sizes to eliminate page numbers and other irrelevant text snippets.
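One way such a font-size heuristic could look, as a sketch under assumptions: each text box is a hypothetical `(text, font_size)` tuple, the body size is taken to be the size covering the most characters, and anything outside a tolerance is dropped (which would also drop headings):

```python
from collections import Counter


def keep_body_text(boxes, tolerance=0.5):
    """Keep boxes whose font size is close to the dominant body size."""
    if not boxes:
        return []
    # Weight each font size by how many characters are set in it.
    weight = Counter()
    for text, size in boxes:
        weight[round(size, 1)] += len(text)
    body_size = weight.most_common(1)[0][0]
    return [text for text, size in boxes if abs(size - body_size) <= tolerance]


# Made-up page: a heading, two body paragraphs and a page number.
page = [
    ("Chapter 1", 18.0),
    ("First paragraph of body text.", 10.0),
    ("Second paragraph of body text.", 10.0),
    ("12", 8.0),
]
body = keep_body_text(page)
```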
The project was a platform for automated document organization, using transformers for automated summarization, a Lucene-based search, and a semantic search based on text embeddings.
Fuck me, I spent the last 2 days trying to populate a database about the game Satisfactory from the wikia, only to read on a subreddit that they shipped a JSON file with the data I want. I need to check that tomorrow, because I just want to sleep, but if that's true, just kill me.
I FUCKING HATE PARSING WEBSITES
rant