Password-protection checker for PDF, DOC, and DOCX files

Project: Rails (monolith)
Objective: Verify if uploaded files are password-protected and prompt users to provide unprotected versions if necessary.

Initially, I considered using plain JavaScript for the file check, assuming it would be a straightforward process with minimal dependencies. However, I soon realized this approach had several limitations:

  1. Users could bypass the check if they wanted to. (Not a big deal, but still.) Either way, we would need to replicate the same validation on the server side.
  2. The most reliable approach would be to use a system library to try to open the file and see whether it is password-protected; then we would know for sure. System-level libraries are only available on the server side.
  3. But where is the fun in that? Coding is learning too, so I started coding the solution in JavaScript anyway.

My research revealed that PDF files were the most straightforward to analyze. By reading the file as binary data, we could search for specific encryption-related byte patterns. Here's how it works:

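  // Resolves to true when the PDF references an encryption dictionary.
  async #isPdfPasswordProtected(file) {
    return new Promise((resolve, reject) => {
      const reader = new FileReader();

      reader.onload = function (e) {
        const data = new Uint8Array(e.target.result);
        let pdfString = '';
        const chunkSize = 8192;

        // Convert in chunks: passing a very large array to apply() can overflow the call stack.
        for (let i = 0; i < data.length; i += chunkSize) {
          pdfString += String.fromCharCode.apply(null, data.subarray(i, i + chunkSize));
        }

        // Look for an /Encrypt entry that points to an indirect object, e.g. "/Encrypt 21 0 R".
        const isEncrypted = /\/Encrypt\s*\d+\s+\d+\s+R/.test(pdfString);
        resolve(isEncrypted);
      };
      reader.onerror = function (e) {
        reject(e);
      };

      // Read the raw bytes; onload fires once the whole file is in memory.
      reader.readAsArrayBuffer(file);
    });
  }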

That worked for read-protected PDF files but not for edit-protected ones (which wasn't a requirement for this project anyway).
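To make the pattern match concrete, here is a minimal sketch that applies the same regular expression to raw bytes outside the browser. The helper name hasEncryptReference and the two hand-written trailers are assumptions made for illustration; the trailers are not complete PDF files, and real documents may store the entry in ways this simple search does not see.

```javascript
// Hypothetical helper: the same regex as the FileReader version, applied to raw bytes.
function hasEncryptReference(bytes) {
  // latin1 decodes every byte to exactly one character, so binary streams cannot break decoding.
  const text = new TextDecoder('latin1').decode(bytes);
  return /\/Encrypt\s*\d+\s+\d+\s+R/.test(text);
}

// Hand-written trailers for illustration only; these are not complete PDFs.
const encode = (s) => new TextEncoder().encode(s);
const plainTrailer = encode('trailer\n<< /Size 22 /Root 1 0 R >>\n');
const encryptedTrailer = encode('trailer\n<< /Size 22 /Root 1 0 R /Encrypt 21 0 R >>\n');

console.log(hasEncryptReference(plainTrailer));     // false
console.log(hasEncryptReference(encryptedTrailer)); // true
```

Keeping the byte-level check separate from the FileReader plumbing also makes it easy to run the same function in a unit test.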

For DOCX files, the approach was more complex but equally interesting. DOCX files are essentially ZIP archives containing XML files. The strategy was to open the DOCX file as a ZIP and check for the presence of [Content_Types].xml, which should be the first file in a valid, unprotected DOCX. If this file is missing or inaccessible, it's a strong indicator that the file is password protected.

This method proved effective in testing, accurately identifying protected DOCX files. However, it required introducing a new dependency, JSZip, to handle the ZIP file operations in JavaScript. While this added some complexity to the project, it provided a robust solution for DOCX encryption detection.

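Here's a version of the DOCX checking function. This one looks for the EncryptionInfo and EncryptedPackage entries first and then tries to read word/document.xml: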

  async #checkDocxEncryption(arrayBuffer) {
    try {
      const zip = await JSZip.loadAsync(arrayBuffer);
      const hasEncryptionInfo = zip.file('EncryptionInfo') !== null;
      const hasEncryptedPackage = zip.file('EncryptedPackage') !== null;

      if (hasEncryptionInfo || hasEncryptedPackage) {
        return true;
      } else {
        try {
          const documentXml = await zip.file('word/document.xml')?.async('string');
          if (documentXml) {
            // Able to read document.xml, file is not encrypted.
            return false;
          } else {
            // Cannot read document.xml, file may be encrypted.
            return true;
          }
        } catch (error) {
          // Error reading document.xml, file may be encrypted
          return true;
        }
      }
    } catch (err) {
      console.error('Error reading .docx file, likely encrypted:', err);
      return true;
    }
  }
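The outer catch matters more than it may seem. A password-protected DOCX is normally stored in an OLE compound file rather than a plain ZIP, so a ZIP reader is expected to fail on it. Under that assumption, a dependency-free pre-check can look at the first bytes of the file. The sketch below is only an illustration: sniffContainer and startsWith are hypothetical names, and the sample buffers hold just the signatures, not real documents.

```javascript
// Well-known container signatures found at the very start of a file.
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04];                         // "PK\x03\x04"
const OLE_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]; // compound file header

// True when the byte array begins with the given signature.
function startsWith(bytes, signature) {
  return bytes.length >= signature.length &&
    signature.every((value, index) => bytes[index] === value);
}

// Hypothetical helper: returns 'zip', 'ole', or 'unknown' for an ArrayBuffer.
function sniffContainer(arrayBuffer) {
  const head = new Uint8Array(arrayBuffer).subarray(0, 8);
  if (startsWith(head, ZIP_SIGNATURE)) return 'zip';
  if (startsWith(head, OLE_SIGNATURE)) return 'ole';
  return 'unknown';
}

// Sample buffers containing only the signatures, for illustration.
const zipLike = new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00]).buffer;
const oleLike = new Uint8Array(OLE_SIGNATURE).buffer;

console.log(sniffContainer(zipLike)); // zip
console.log(sniffContainer(oleLike)); // ole
```

A file named .docx that sniffs as 'ole' is probably encrypted (or a mislabeled legacy DOC), while 'zip' means the JSZip check above can run as usual.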

After successfully implementing checks for PDF and DOCX files, I encountered a significant challenge with DOC files. The older DOC format, based on Microsoft's proprietary compound file binary format, proved difficult to analyze using JavaScript alone.

My breakthrough came when I discovered a Ruby library capable of detecting password protection in DOC files. This discovery led me to shift my approach, moving all file encryption checks to the backend. This decision allowed for a more unified and robust solution across different file formats.

Here's the final code of the backend solution:

Gist with the final code

I extracted and reused parts of the docx gem and the msworddoc-extractor gem.

Requirements:

  • Install ImageMagick on the Linux server.
  • Add the image_processing gem for PDF checking.
  • Add the rubyzip gem for DOCX checking.
  • Add the ruby-ole gem for DOC checking.

With all of this considered, another option would be to install LibreOffice on the server and try to open each file with the LibreOffice CLI. It's a nice alternative, perhaps a better one. It's a big package to install, though, and I don't know its impact on overall system performance or on the processing time for each file.

I learned a lot from this project. It was challenging and fun.