40

My System Analysis professor wants to fail me because I refuse to store PDF files in the database in my project.

He wants me to store THE WHOLE BINARY FILE in the database instead of on the filesystem.

When I tried to explain why that would be bad, he interrupted me and began the "you think you know more than I do? I've been teaching this for X years" speech.

How do such people become professors?

Comments
  • 11
    Could just be me, but as far as I know, although it's not best practice, it's widely used, I thought :). Also, welcome!
  • 9
    Storing PDFs in databases makes sense, as it solves a few problems: one, it reduces space use for a large number of small files; two, it avoids limiting factors from the OS, like the maximum number of files in a directory. Databases have BLOB datatypes just for this purpose. However, there are downsides...
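    The BLOB approach the comment describes can be sketched in a few lines. This is a minimal example, assuming SQLite and a made-up `documents` table; the idea is the same for any database with a BLOB type:

    ```python
    import sqlite3

    # In-memory DB for demo purposes; a real app would use a file or a DB server.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")

    # Stand-in for the bytes of a real PDF read from disk.
    pdf_bytes = b"%PDF-1.4\n...demo bytes..."
    conn.execute("INSERT INTO documents (name, content) VALUES (?, ?)", ("book.pdf", pdf_bytes))
    conn.commit()

    # Retrieving the file is just another query -- no filesystem paths involved.
    row = conn.execute("SELECT content FROM documents WHERE name = ?", ("book.pdf",)).fetchone()
    assert row[0] == pdf_bytes
    ```

    Note that the driver parameterizes the bytes like any other value; the file rides along in regular transactions and backups.
    
    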
  • 5
    Also solves a lot of issues with multi server setups and backups
  • 21
    Although the prof is right, he/she could have been more explanatory... He/she could explain why instead of just brushing it off with an air of "I'm better than thou"...
  • 1
    I understand that I can store them as BLOBs, but I believe it's a very bad idea in this case. My project is a lame library management system prototype. The PDFs are digital copies of books and documents, saved on the server and served to a desktop client via HTTP. You really think this would be a good case for storing the files in the database?
  • 6
    @UltimateZero Personally, I think so, yeah. Also, since Let's Encrypt is free, and if time isn't much of the essence, I'd recommend adding SSL!
  • 2
    Also, yeah, I'm only ranting because of the way he said it! Some people teach computer science classes like it's a history class; they don't welcome debates or alternative ways of doing something. That's the part I really hate.
  • 3
    @linuxxx thanks for the welcome btw!
    How is it a good idea? The only way I imagine it going is: each time a PDF is requested, it'll need to query the DB, read the whole file into memory, write it to a temp file (?), and serve it the way a normal static file would be served, byte ranges and all that...
  • 2
    @UltimateZero I don't know the details, but it's cross-platform-proof, just in case. There are other advantages, but I can't remember them right now 😅
  • 3
    Personally, I would store files < 1 GB as BLOBs. Modern databases can handle them pretty darn well. Plus, it's in the "same place" as the other application data.
  • 5
    If it's a school project, follow the advice of the teacher; if it's your personal project, do whatever you think is right.
  • 2
    There is a kind of cursor (I don't know its name right now) that iterates over the query result instead of caching it all at once. Also, you can serve the bytes as a binary file directly to the HTTP client by manually setting the headers, so it doesn't have to be written to a temp file before being sent to the user.
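    The no-temp-file idea can be sketched like this. A minimal example, assuming SQLite and a made-up `documents` table: the handler yields the stored bytes chunk by chunk straight to the HTTP response, after setting the headers by hand. (SQLite still fetches the whole value into memory; for truly incremental reads you'd use the cursor the comment mentions, e.g. a server-side cursor or the database's large-object API.)

    ```python
    import io
    import sqlite3

    CHUNK = 64 * 1024  # stream in 64 KiB pieces instead of one big write

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content BLOB)")
    conn.execute("INSERT INTO documents (content) VALUES (?)", (b"x" * 200_000,))

    def stream_document(conn, doc_id, chunk_size=CHUNK):
        """Yield the stored bytes chunk by chunk, so the HTTP layer can
        write them to the socket without ever creating a temp file."""
        (blob,) = conn.execute(
            "SELECT content FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        buf = io.BytesIO(blob)
        while chunk := buf.read(chunk_size):
            yield chunk

    # Headers a handler would set before writing the chunks:
    headers = {
        "Content-Type": "application/pdf",
        "Content-Disposition": 'inline; filename="book.pdf"',
    }
    total = sum(len(c) for c in stream_document(conn, 1))
    ```

    Any framework that accepts an iterable response body (WSGI apps do, for instance) can consume such a generator directly.
    
    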
  • 2
    Regardless of who is right, the argument the professor used was not the right one. He should teach, not enforce "hierarchy"...
  • 2
    He should give you a reason instead of a mouthful of "I'm better than you"
  • 4
    Sorry to hear about your non-constructive debate with your instructor.
    Implementation choices always depend on the application and its expected behavior.
    A quick example would be seamless data migration or sharding, e.g. for load-balancing purposes. In such a case, storing BLOBs in a database would be rather convenient because it simplifies the distribution logic.
    I don't think a college project could use such an insight, but it seems your instructor is desperately trying to "teach you a thing or two". Nonetheless, I advise you to go with the instructor's flow, as they may have wrongly expressed a good intention.
  • 1
    You can just tell the prof to his face that he is (probably) wrong. Ask him something like "My idea was this and this, what do you think of it?" or "Couldn't we also do it like this... or am I missing something?"
  • 3
    I work for a company that has twenty years of PDF test files, because they built a PDF/PostScript/XPS interpreter.

    I'm involved with trying to design and build a next gen regression test system that can handle the sheer number of files we deal with.

    Trust me when I say we have major performance issues with filesystem access for the PDF files we have. We've done the research, crunched the numbers, and we want to put the files in a database!
  • 1
    You could use a separate file in the database for PDFs. Not sure which DB you're using, but in SQL Server that's pretty easy.

    For the most part, "those who can't or don't want to do, teach," although I had an amazing professor who became a professor because he had six kids who were just about to go off to college, so he did it for the free tuition. That dude was amazingly brilliant. Another became a prof once he retired. I think it all just depends.
  • 1
    @nmunro Is the company one whose name would suggest it "saves paper"? If so, you guys have a really amazing product. I've seen some of the stuff it can do and was truly impressed.
  • 1
    @ninjatini I have no idea what company you might be referring to but if you clicked my profile you'd see who I work for, it's no secret...
  • 1
    @nmunro Ah, I figured no one would offer up that info, so I never even looked. I was referring to papersave. They do some crazy stuff with OCR and integration into CRM systems. Sounded pretty similar.
  • 2
    For the most part, I always store files in the DB as opposed to the file server. Depending on the data being stored, using a table's foreign/primary key is a lot easier for referencing than using the file name or storing the file's path in the DB.
  • 1
    Years doesn't mean anything!
  • 2
    @ninjatini Nope, we build a raster image processor (well, a few, for different purposes), but we have thousands of test suites that usually contain thousands or even hundreds of thousands of PDF files.

    It's not exciting, but this thread has given me a potential way to save even more space. Since a PDF file is a header, binary data, and an xref table, our database solution could strip out the header (since all our PDF files are the same version, 1.4, for compatibility), and if the binary data sits at the same offset, even the xref table could be stripped out, storing only the internal binary image as a BLOB. When a file is downloaded, the header, binary blob, and xref table could be stitched back together and sent as a file.

    I mean, I know the header and xref table are small, but they're repeated in every single file. Over about a billion files, this would save some disk space!
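    The stitching idea above can be sketched as a split/reassemble round trip. This is purely illustrative: the header and xref bytes here are placeholders, and the scheme only works under the commenter's own assumption that every file shares an identical header and that the xref offsets line up (real xref tables contain byte offsets into the file).

    ```python
    # Shared pieces kept once, not per file (assumed identical across the corpus).
    SHARED_HEADER = b"%PDF-1.4\n"                 # common header (assumption)
    SHARED_XREF   = b"xref\n0 1\n...\n%%EOF"      # placeholder trailer/xref bytes

    def split(pdf: bytes) -> bytes:
        """Strip the shared header and trailer, keeping only the unique body
        that would actually be stored in the BLOB column."""
        assert pdf.startswith(SHARED_HEADER) and pdf.endswith(SHARED_XREF)
        return pdf[len(SHARED_HEADER):len(pdf) - len(SHARED_XREF)]

    def stitch(body: bytes) -> bytes:
        """Reassemble a servable file from the stored body on download."""
        return SHARED_HEADER + body + SHARED_XREF

    original = SHARED_HEADER + b"<binary image data>" + SHARED_XREF
    assert stitch(split(original)) == original  # lossless round trip
    ```

    The saving per file is just `len(SHARED_HEADER) + len(SHARED_XREF)` bytes, which is why it only pays off at the billion-file scale the comment describes.
    
    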