Not a question per se, but an assignment:

Design an application that could find logs between two timestamps where the logs are stored in 10000 files, each with a file size of ~16GB.

For an entry level position this was a really good and interesting problem to solve.

  • 1
    Out of curiosity, how were these documents provided and was there any sorted order? If sorted, not much of a problem. Otherwise ...
  • 1
    @cb219 they did not provide any of these documents. They just provided a sample log line showing the information present in a single entry.
    And yeah, the logs inside a file are sorted by time. Nothing was mentioned about the files or filenames themselves.
  • 0
    Feed them into appInsights

    then use traces where timestamp between ....

    I wonder if they would accept that as answer.
  • 1
    Elasticsearch + some sort of ingester, and then Kibana ftw?

    16 GB * 10k is a very large set.
  • 0
    @NoToJavaScript @magicMirror these are the solutions that are used in production and are the correct way.
    But the essence of the assignment was the ability to adapt a basic binary search to such a huge data set.
    This was the very thing that I liked about this assignment.
  • 5
    Well, the first thing I'd ask is whether the log files are time-stamped...

    And if they aren't whoever wrote these logs should be fired
  • 2
    @donuts I too had that thought. But then I realised that it can be done without this assumption: I can simply check the last modified date in the file metadata to get that info.
    But yeah, later in the interview I was told that the filenames are indeed timestamped and that I could have asked this upfront.
  • 1
    @-devpool- does it work if the file is copied/moved?
  • 0
    @donuts interesting question .. did not think about that .. maybe something to look into
  • 1
    @-devpool- yeah, I would have gone with last modified too. The tricky part is timezones, though. I've worked with log entries that were from a different time than the system time.
  • 0
    @hjk101 so different timezones in the same set of logs? Or just a different timezone from the system time?
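
For anyone curious what "adapting a basic binary search" to a 16 GB file might look like, here is a rough sketch, not the solution the interviewers had in mind. It assumes (none of this was in the assignment) that every line starts with a fixed-width UTC timestamp like 2021-03-04T05:06:07Z, so comparing the raw bytes matches chronological order, and that lines within a file are sorted by time as mentioned above. The names TS_LEN, first_offset_at_or_after and logs_between are made up for illustration. The search works on byte offsets, so it only reads a handful of lines instead of scanning the whole file.

```python
import os

# Assumption: each line begins with a 20-character timestamp such as
# "2021-03-04T05:06:07Z". Fixed width plus UTC means byte order == time order.
TS_LEN = 20


def _line_start_at_or_after(f, pos):
    """Return the offset of the first line that starts at or after byte `pos`."""
    if pos == 0:
        return 0
    f.seek(pos - 1)
    f.readline()  # consume the rest of the line that contains byte pos - 1
    return f.tell()


def first_offset_at_or_after(f, size, target):
    """Binary search on byte offsets for the first line with timestamp >= target."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(_line_start_at_or_after(f, mid))
        line = f.readline()
        # An empty read means we ran off the end of the file, so go left.
        if not line or line[:TS_LEN] >= target:
            hi = mid
        else:
            lo = mid + 1
    return _line_start_at_or_after(f, lo)


def logs_between(path, start, end):
    """Yield the lines of one file whose timestamp is in [start, end], inclusive."""
    start_b, end_b = start.encode("ascii"), end.encode("ascii")
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(first_offset_at_or_after(f, size, start_b))
        for line in f:
            if line[:TS_LEN] > end_b:
                break  # file is sorted, so nothing later can match
            yield line.decode("utf-8", "replace").rstrip("\n")
```

Across the 10,000 files you would first skip any file whose time range cannot overlap the query (from timestamped filenames, or by peeking at each file's first and last line), then run logs_between only on the rest. Mixed timezones or variable-width timestamps would break the byte comparison, so you would need to parse the timestamps instead.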