3 • vane • 1y

Based on my previous rant about the dataset I downloaded:
https://devrant.com/rants/9870922/...

I filtered the data down to a single language and removed duplicates.

The first problem I spotted is advertisements and kudos at the start and end of the movie in the subtitles.
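
A rough cleaning pass could look like this (a minimal sketch in Python; parse_srt and the AD_PATTERNS list are my own guesses at what the ad/kudos lines look like, not something taken from the dataset):

    import re

    # Hypothetical markers for ad/kudos lines ("subtitles by ...", URLs, etc.).
    AD_PATTERNS = re.compile(r"(www\.|https?://|opensubtitles|subtitles by|sync(ed)? by)", re.I)

    # One SRT cue: index line, timing line, then text up to a blank line.
    CUE_RE = re.compile(
        r"\d+\s*\n"
        r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
        r"(.*?)(?:\n\s*\n|\Z)",
        re.S,
    )

    def parse_srt(text):
        """Yield (start, end, text) tuples from raw SRT content."""
        for m in CUE_RE.finditer(text):
            yield m.group(1), m.group(2), m.group(3).strip()

    def strip_ads(cues, edge=5):
        """Drop ad/kudos-looking cues, but only within the first/last few cues."""
        cues = list(cues)
        kept = []
        for i, (start, end, txt) in enumerate(cues):
            near_edge = i < edge or i >= len(cues) - edge
            if near_edge and AD_PATTERNS.search(txt):
                continue
            kept.append((start, end, txt))
        return kept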

The second is that some of the subtitle text files don’t have file extensions.
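
Files without extensions can still be picked out by sniffing their content instead of trusting the filename, e.g. looking for an SRT-style timing line in the first few kilobytes (sketch; the "dump" directory is just a placeholder):

    import re
    from pathlib import Path

    TIMING_RE = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

    def looks_like_srt(path, probe_bytes=4096):
        """True if the first few KB contain an SRT timing line."""
        try:
            head = Path(path).read_bytes()[:probe_bytes]
        except OSError:
            return False
        return TIMING_RE.search(head.decode("utf-8", errors="replace")) is not None

    # Collect extension-less files under ./dump that are really subtitles.
    subs = [p for p in Path("dump").rglob("*")
            if p.is_file() and not p.suffix and looks_like_srt(p)]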

However, I managed to extract the subtitle text files, and it turned out there is only 2.8 GB of data in my native language.

I postponed model training for now, as it will be a long, painful process, and will instead try to get some nice results faster by leveraging a different approach.

I figured out I can try to load this data into a vector database and see if I can query it with a text fragment. 2.8 GB will easily fit into RAM, so queries should be fast.

The output I want is the timestamp of the text fragment, the movie name, and a couple of lines before and after.
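
One way to sketch this with chromadb (in-memory by default; parse_srt, the collection name and the context window are my own assumptions, and Chroma's default embedding model is English-centric, so it may need swapping for my language):

    import chromadb

    client = chromadb.Client()              # ephemeral, in-memory client
    col = client.create_collection("subs")  # uses Chroma's default embedder

    all_cues = {}  # movie -> list of (start, end, text), kept for context lines

    def index_movie(movie, cues):
        """cues: list of (start, end, text), e.g. from the parse_srt sketch above."""
        all_cues[movie] = cues
        col.add(
            ids=[f"{movie}:{i}" for i in range(len(cues))],
            documents=[c[2] for c in cues],
            metadatas=[{"movie": movie, "start": c[0], "line": i}
                       for i, c in enumerate(cues)],
        )

    def search(fragment, n=3, context=2):
        """Print movie name, timestamp and a couple of lines around each hit."""
        res = col.query(query_texts=[fragment], n_results=n)
        for meta in res["metadatas"][0]:
            movie, start, line = meta["movie"], meta["start"], meta["line"]
            cues = all_cues[movie]
            around = [t for _, _, t in cues[max(0, line - context):line + context + 1]]
            print(f"{movie} @ {start}")
            print("\n".join(around))
            print("---")

Indexing would then just be a loop over the cleaned files calling index_movie() once per movie.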

It will be a faster and simpler test to find out if the dataset is OK.

Will try to make it this week, as I don’t have much to do besides sending CVs and talking with people.

Comments
  • 1
    You could possibly use the English dubs vs foreign dubs to train translation.

    Very cool project.

    Btw, did you lose your job or are you just changing jobs?

    Regardless, this is pretty cool. Surprised you didn't do the Indian thing and blog about the entire process on Medium. Technical breakdowns are great, and double as something you can more easily show others.
  • 1
    @Wisecrack Just losing, as always. Someone needs to lose so somebody else wins. That’s life.

    I thought about translation, but it’s something everyone does and it’s obvious how to do it. Just use an LSTM and you’re done. Also, there are better datasets for it.

    I don’t want to go that way; I’ve already pushed more into video and search.