Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Related Rants
based on my previous rant about dataset I downloaded
https://devrant.com/rants/9870922/...
I filtered data from single language and removed duplicates.
The first problem I spotted are advertisements and kudos at movie start and at end in the subtitles.
The second is that some text files with subtitles don’t have extensions.
However I managed to extract text files with subtitles and it turned out there is only 2.8gb of data in my native language.
I postponed model training for now as it will be long, painful process and will try to get some nice results faster by leveraging different approach.
I figured out I can try to load this data to vector database and see if I can query it with text fragment. 2.8gb will easily fit into ram so queries should be fast.
Output I want is time of this text fragment, movie name and couple lines before and after.
It will be faster and simpler test to find out if dataset is ok.
Will try to make it this week as I don’t have much todo besides sending CVs and talking with people.
random
vectors