
I had a flash of inspiration. I would like to develop a method for analyzing unknown bitstreams. The method would determine the format of the data by trial and error, using machine learning algorithms. That would make it possible to work out the data types, byte formats, and meaning of a stream, which could be useful in data forensics. I would call the method heuristic translation machine learning. I am currently developing code that does this. It will be fun to learn about reinforcement learning algorithms.
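
A minimal sketch of the trial-and-error idea, not the code I am developing: it treats every combination of bit offset, word width, and bit order as a hypothesis and scores each one by how many of the resulting values land in the printable ASCII range. The function names, the candidate widths (7 and 8), and the printable-fraction score are all assumptions chosen for illustration; a learned scoring function could be swapped in for `printable_fraction`.

```python
from itertools import product


def bits_from_bytes(data: bytes) -> list[int]:
    """Expand bytes into a flat list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def words(bits: list[int], offset: int, width: int, msb_first: bool = True) -> list[int]:
    """Cut the bit list into fixed-width integers starting at `offset`.

    A trailing partial word is dropped, since the stream may be incomplete.
    """
    out = []
    for start in range(offset, len(bits) - width + 1, width):
        chunk = bits[start:start + width]
        if not msb_first:
            chunk = chunk[::-1]
        value = 0
        for bit in chunk:
            value = (value << 1) | bit
        out.append(value)
    return out


def printable_fraction(values: list[int]) -> float:
    """Score a hypothesis: share of values in the printable ASCII range."""
    if not values:
        return 0.0
    return sum(1 for v in values if 0x20 <= v <= 0x7E) / len(values)


def rank_hypotheses(bits: list[int], widths: tuple[int, ...] = (7, 8)):
    """Try every (offset, width, bit order) and return them best score first."""
    results = []
    for width, msb_first in product(widths, (True, False)):
        for offset in range(width):
            score = printable_fraction(words(bits, offset, width, msb_first))
            results.append((score, offset, width, msb_first))
    return sorted(results, reverse=True)
```

Calling `rank_hypotheses(bits)[:5]` on a list of 0/1 integers gives the five most text-like framings to inspect by hand. A high score is only a hint; short or noisy streams can make several framings look equally plausible.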

Comments
  • 2
    'man file'
    :)
  • 1
    Cool! But change the name "heuristic translation machine learning". Do we want a second HTML? ;p
  • 0
    Machine learning is roughly about pattern matching.

    The patterns come from known data.

    This seems a bit different from looking at what *unknown* data might do or consist of.

    And yes, 'file' does this already without needing terabytes of storage, dozens of CPUs, and gigabytes of RAM.
    Although it only identifies known formats, of course.
  • 1
    @Yamakuzure The data in question is a bitstream, not a bytestream. I have a data source where it is not known how the bits are encoded, or in some cases whether they even form complete bytes. I intend to search for possible characters in the Latin character set (the stream is old, around 40 to 50 years) and possibly other data types. I have no confidence that the data is complete, even for individual bytes. Yes, for data made up of whole bytes in an unknown format, other tools are available.
  • 1
    @Demolishun Also try EBCDIC if that data is that old.
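
The EBCDIC suggestion can be checked with a rough sketch like the one below, which assumes the stream has already been framed into 8-bit bytes. It decodes the same bytes as ASCII and as EBCDIC and compares how much of the result is printable Latin text. The helper names are hypothetical, and `cp037` (one of the EBCDIC code pages shipped with Python's standard codecs) is only an assumption about which EBCDIC variant an old source might use.

```python
import string

# Characters counted as plausible Latin text (an illustrative choice).
PRINTABLE = set(string.ascii_letters + string.digits + string.punctuation + " ")


def text_score(data: bytes, codec: str) -> float:
    """Fraction of bytes that decode to a printable Latin character under `codec`."""
    if not data:
        return 0.0
    # Undecodable bytes become U+FFFD, which is not in PRINTABLE.
    decoded = data.decode(codec, errors="replace")
    return sum(ch in PRINTABLE for ch in decoded) / len(decoded)


def best_codec(data: bytes, codecs: tuple[str, ...] = ("ascii", "cp037")) -> str:
    """Return the candidate codec whose decoding looks most like text."""
    return max(codecs, key=lambda codec: text_score(data, codec))
```

Other EBCDIC code pages such as `cp500` can be added to `codecs`; they agree with `cp037` on letters and digits, so ties between them are expected and the first one listed wins.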