Comments
Lensflare (14h): If OCR works better than direct text, you could write a script to convert the text into an image at the preferred size.
Feels like treating the symptoms, though.
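A minimal sketch of that kind of text-to-image conversion, assuming Pillow; the page size, font, and wrap width are placeholders:

import textwrap
from PIL import Image, ImageDraw, ImageFont

def text_to_image(text: str, path: str, size=(1240, 1754), margin=60):
    # Render plain text onto a white page-sized canvas for a vision model.
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in a real TTF for legibility
    wrapped = textwrap.fill(text, width=90)
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    img.save(path)

text_to_image("Chapter 1. It was a dark and stormy night...", "page_001.png")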
cuddlyogre (14h): @Lensflare I've got a Python script that spits out the pages as PNGs, with the option of merging pages (see the sketch after this comment). It seems to work best with one page at a time, and it works better than processing an equivalent amount of text. It also seems to be easier on the context window.
I have a 32GB 5090. I've tried Gemma 3 27B, Mistral Small 3.2 24B, and Qwen VL 30B, and they all seem to prefer processing images over the text. I think it's because they scan the image each time instead of filling the context with the text, so they have more room to work with.
I put the entire thing into the Grok API and it does better, but it used up more than 3 million tokens in about 2 hours of testing, so that's not very scalable.
I am really looking forward to some unknown person becoming a trillionaire when they figure out non-proprietary AI processing hardware.
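A rough sketch of that page-to-PNG export, assuming PyMuPDF; the DPI is a guess to tune per model:

import fitz  # PyMuPDF (pip install pymupdf)

def pdf_to_pngs(pdf_path: str, dpi: int = 150) -> list[str]:
    # One PNG per page; merging would just concatenate these images.
    doc = fitz.open(pdf_path)
    paths = []
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=dpi)
        out = f"page_{i:03d}.png"
        pix.save(out)
        paths.append(out)
    doc.close()
    return paths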
cuddlyogre (14h): @Lensflare Reasoning vision models like Magistral read the entire document into the context, so it fills up pretty quickly; I don't know how well they perform over several pages compared to non-reasoning models.
retoor (13h): I understand everything you say.
For the older messages you want to keep, make summaries of them. Summaries of summaries :P So it's just compression (sketch below).
But how many documents are we talking about? Consider gpt-4.1-nano or something; it has a 1 million token context window, is flat-out cheap, and summarizes well imho.
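A rough sketch of that summaries-of-summaries compression; summarize() is a stand-in for whatever model call you actually use:

def summarize(text: str, max_chars: int = 2000) -> str:
    # Placeholder: call your LLM here; truncation stands in for a real summary.
    return text[:max_chars]

def compress(messages: list[str], batch: int = 10) -> str:
    # Summarize each message, then summarize groups of summaries
    # until a single top-level summary remains.
    layer = [summarize(m) for m in messages]
    while len(layer) > 1:
        layer = [summarize("\n".join(layer[i:i + batch]))
                 for i in range(0, len(layer), batch)]
    return layer[0] if layer else ""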
Lensflare (13h): @retoor
> I understand everything you say.
Ah, good. Because I understand nothing. It goes way over my head :)
cuddlyogre (13h): @retoor I'm trying to get this working locally so that I don't have to send my or a client's material to a proprietary LLM, in case they don't trust those providers or want to fine-tune on their own data.
I'm testing with a 15k story I wrote so that I know for sure where it gets things wrong. The chunking strategy I have works decently even with 7B and smaller models, with larger models giving more precise summaries (sketch after this comment). It stays fast and accurate this way, whereas sending the entire document at once leads to errors and hallucinations.
I've tested the chunking strategy on huge open models like OSS 120B, and one more that was even larger that I can't recall, on RunPod, and the results aren't so much better than a ~30B model that I can justify the cost, so I'm sticking with local.
I'm really hoping that using images can extend the context for general use. For non-reasoning models, this seems to be a promising experiment.
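A sketch of that per-chunk summarization loop, assuming a local OpenAI-compatible server (llama.cpp / Ollama style); the URL, model name, and glossary line are made-up placeholders:

import requests

GLOSSARY = "Terms: 'the Order' = the antagonist faction."  # hypothetical example

def summarize_chunk(chunk: str) -> str:
    # Terms and definitions ride along in the system prompt, as described above.
    r = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local-model",
            "messages": [
                {"role": "system",
                 "content": "Summarize this chunk faithfully. " + GLOSSARY},
                {"role": "user", "content": chunk},
            ],
        },
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]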
jestdotty (13h): AI seems to take out all the interesting parts of a text and leave just really boring, reductive stuff.
I want reverse AI summaries: just the interesting parts.
We worked with RAGs at work.
After a few months, we stopped working with RAGs at work.
It's a braindead system that cannot be perfected; you only ever hit some percentage of accuracy.
Semantic search SUCKS for specific information. For example, if you have a bunch of data that says "my phone number is xxx-xxx-xxyz" and then you ask "What is Sandy's phone number?", it will say "I have no fucking clue!" because RAGs suck dick.
The best approach is hybrid: have a RAG that searches both a semantic index AND a traditional index (toy sketch at the end of this comment). This way, you get both semantic and literal matches.
But it's still just throwing more money at the bullshit and hoping it grows into a flower. For us, it never did, and we got bored of spending money.
Good luck!
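A toy sketch of that hybrid scoring: blend a literal keyword match with semantic similarity so exact strings like "xxx-xxx-xxyz" still get found; embed() stands in for a real embedding model:

import math

def embed(text: str) -> list[float]:
    # Placeholder: a real embedding model goes here; this is a toy hashed bag-of-words.
    vec = [0.0] * 16
    for tok in text.lower().split():
        vec[hash(tok) % 16] += 1.0
    return vec

def keyword_score(query: str, doc: str) -> float:
    # Literal overlap catches exact matches like phone numbers.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query: str, doc: str, alpha: float = 0.5) -> float:
    # alpha weights literal matches against semantic similarity; rank docs on this.
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine(embed(query), embed(doc))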

The hoops you have to jump through to summarize a document, even using LLMs with hundreds of billions of parameters, are insane. Even when you get something that "works" with RAG, all you are really getting is a summary of a distilled version of the document, not a pure summary.
I've got a script that breaks documents down into manageable chunks with an overlap, so meaning isn't lost between paragraphs (sketch below), and it works decently enough, especially when you add terms and definitions to the system prompt for things it has trouble with. But the context window is still a problem, so you have to discard older entries, which means you can't correct previous items based on new entries.
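A minimal version of that overlapping chunker; chunk and overlap sizes are arbitrary and need tuning per model:

def chunk_text(text: str, chunk_chars: int = 4000, overlap: int = 400) -> list[str]:
    # Each chunk re-reads the tail of the previous one so meaning
    # isn't lost across paragraph boundaries.
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks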
Using vision models to OCR an image instead of reading in a text document seems to work a bit better, but it relies on the image being the right size, and you can't load in too many at a time.
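One way to handle the sizing constraint before handing pages to a vision model, assuming Pillow; the maximum side length is a guess:

import base64, io
from PIL import Image

def prepare_page(path: str, max_side: int = 1536) -> str:
    # Downscale oversized pages in place (aspect ratio preserved),
    # then base64-encode for an image payload.
    img = Image.open(path)
    img.thumbnail((max_side, max_side))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()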