Password protected checker for PDF, DOC and DOCX files
Project: Rails (monolith)
Objective: Verify if uploaded files are password-protected and prompt users to provide unprotected versions if necessary.
Initially, I considered using plain JavaScript for the file check, assuming it would be a straightforward process with minimal dependencies. However, I soon realized this approach had several limitations:
- Users could bypass the check if they wanted to (not a big deal, but still), so we would need to replicate the same validation on the server side anyway.
- The most reliable approach would be to use a system library to try to open the file and see whether it is password-protected; then we would know for sure. At the system level, we can only do that on the server side.
- But where's the fun in that? Coding is learning too, so I started coding the solution in JavaScript anyway.
My research revealed that PDF files were the most straightforward to analyze. By reading the file as binary data, we can search for the marker an encrypted PDF carries: a reference to its encryption dictionary, such as /Encrypt 12 0 R. Here's how it works:
async #isPdfPasswordProtected(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = function (e) {
      const data = new Uint8Array(e.target.result);
      // Build a one-character-per-byte string in chunks, so that
      // String.fromCharCode.apply never receives too many arguments at once.
      let pdfString = '';
      const chunkSize = 8192;
      for (let i = 0; i < data.length; i += chunkSize) {
        pdfString += String.fromCharCode.apply(null, data.subarray(i, i + chunkSize));
      }
      // Look for a reference to an encryption dictionary, e.g. "/Encrypt 12 0 R".
      const isEncrypted = /\/Encrypt\s*\d+\s+\d+\s+R/.test(pdfString);
      resolve(isEncrypted);
    };
    reader.onerror = function (e) {
      reject(e);
    };
    reader.readAsArrayBuffer(file);
  });
}
That worked for read-protected PDF files but not for edit-protected PDF files (which wasn't a requirement for this project, though).
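To make the pattern easier to try out, here is a minimal sketch of the same check as a pure function over raw bytes, so it also runs outside the browser (no FileReader). The helper name `pdfHasEncryptReference` and the two hand-written trailers are made up for illustration; they are not complete PDF files.

```javascript
// Hypothetical helper (not part of the project code): takes a Uint8Array.
function pdfHasEncryptReference(bytes) {
  // 'latin1' decodes every byte to exactly one character, so ASCII
  // keywords such as "/Encrypt" stay searchable inside binary data.
  const text = new TextDecoder('latin1').decode(bytes);
  // Same pattern as above: an indirect reference like "/Encrypt 12 0 R".
  return /\/Encrypt\s*\d+\s+\d+\s+R/.test(text);
}

// Hand-written trailer fragments used only as sample input.
const encode = (s) => new TextEncoder().encode(s);
const protectedTrailer = encode('trailer\n<< /Root 1 0 R /Encrypt 12 0 R >>');
const plainTrailer = encode('trailer\n<< /Root 1 0 R /Size 13 >>');

console.log(pdfHasEncryptReference(protectedTrailer)); // true
console.log(pdfHasEncryptReference(plainTrailer)); // false
```

Keeping the byte-level check separate from the file reading makes it easy to unit test with small in-memory samples.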
For DOCX files, the approach was more complex but equally interesting. DOCX files are essentially ZIP archives containing XML files. The strategy was to open the DOCX file as a ZIP and check for the presence of [Content_Types].xml, which should be the first file in a valid, unprotected DOCX. If this file is missing or inaccessible, it's a strong indicator that the file is password protected.
This method proved effective in testing, accurately identifying protected DOCX files. However, it required introducing a new dependency, JSZip, to handle the ZIP file operations in JavaScript. While this added some complexity to the project, it provided a robust solution for DOCX encryption detection.
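Here's a version of the DOCX checking function. This variant looks for EncryptionInfo and EncryptedPackage entries and then tries to read word/document.xml rather than [Content_Types].xml: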
// Requires JSZip, e.g. `import JSZip from 'jszip';` at the top of the module.
async #checkDocxEncryption(arrayBuffer) {
  try {
    const zip = await JSZip.loadAsync(arrayBuffer);
    // Entry names associated with encrypted Office packages.
    const hasEncryptionInfo = zip.file('EncryptionInfo') !== null;
    const hasEncryptedPackage = zip.file('EncryptedPackage') !== null;
    if (hasEncryptionInfo || hasEncryptedPackage) {
      return true;
    } else {
      try {
        // zip.file() returns null for a missing entry, hence the optional chaining.
        const documentXml = await zip.file('word/document.xml')?.async('string');
        if (documentXml) {
          // Able to read document.xml, file is not encrypted.
          return false;
        } else {
          // Cannot read document.xml, file may be encrypted.
          return true;
        }
      } catch (error) {
        // Error reading document.xml, file may be encrypted.
        return true;
      }
    }
  } catch (err) {
    // JSZip could not parse the data as a ZIP archive at all.
    console.error('Error reading .docx file, likely encrypted:', err);
    return true;
  }
}
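A possible dependency-free pre-check, sketched under the assumption that Office wraps a password-protected DOCX in an OLE compound file (the same container DOC uses) instead of a plain ZIP archive. If that holds, the first bytes of the upload already tell the two apart, which would also explain why the outer catch above fires for protected files. The helper names `startsWith` and `sniffContainer` are made up for this sketch; the byte signatures are the standard ZIP local-file-header and compound-file ones.

```javascript
// Standard file signatures ("magic numbers").
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"
const CFB_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// Hypothetical helper: true if `bytes` begins with every byte of `magic`.
function startsWith(bytes, magic) {
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

// Hypothetical helper: classifies a Uint8Array by its leading bytes.
function sniffContainer(bytes) {
  if (startsWith(bytes, ZIP_MAGIC)) return 'zip';
  if (startsWith(bytes, CFB_MAGIC)) return 'cfb';
  return 'unknown';
}

console.log(sniffContainer(new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x14]))); // zip
console.log(sniffContainer(new Uint8Array(CFB_MAGIC))); // cfb
console.log(sniffContainer(new Uint8Array([0x25, 0x50, 0x44, 0x46]))); // unknown ("%PDF")
```

A .docx upload that comes back as 'cfb' is either protected or a legacy .doc with the wrong extension, so this is a hint to combine with the real check, not a replacement for it.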
After successfully implementing checks for PDF and DOCX files, I encountered a significant challenge with DOC files. The older DOC format, based on Microsoft's proprietary compound file binary format, proved difficult to analyze using JavaScript alone.
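To give an idea of what such a check has to look at: according to the published Word binary format specification, the main "WordDocument" stream starts with a header (the FIB) whose 16-bit flags field at offset 10 includes an fEncrypted bit (0x0100). The sketch below only reads that flag. It assumes the "WordDocument" stream has already been extracted from the compound file, which is the hard part in plain JavaScript and is not shown here; `docFibSaysEncrypted` is a made-up name.

```javascript
// Hypothetical helper: `wordDocumentStream` is assumed to be a Uint8Array
// holding the bytes of the "WordDocument" stream, extracted elsewhere.
function docFibSaysEncrypted(wordDocumentStream) {
  if (wordDocumentStream.length < 12) {
    throw new RangeError('stream too short to contain the FIB flags');
  }
  const view = new DataView(
    wordDocumentStream.buffer,
    wordDocumentStream.byteOffset,
    wordDocumentStream.byteLength
  );
  // wIdent: 0xA5EC identifies a Word binary file (little-endian on disk).
  if (view.getUint16(0, true) !== 0xa5ec) {
    throw new Error('not a Word binary document stream');
  }
  // Flags word at offset 10; bit 8 (0x0100) is fEncrypted.
  const flags = view.getUint16(10, true);
  return (flags & 0x0100) !== 0;
}

// Synthetic 12-byte header used only as sample input.
const header = new Uint8Array(12);
header.set([0xec, 0xa5], 0); // wIdent
header[11] = 0x01; // high byte of the flags word, sets bit 8
console.log(docFibSaysEncrypted(header)); // true
```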
My breakthrough came when I discovered a Ruby library capable of detecting password protection in DOC files. This discovery led me to shift my approach, moving all file encryption checks to the backend. This decision allowed for a more unified and robust solution across different file formats.
Here's the final code of the backend solution:
I extracted and reused some parts of the docx gem and of the msworddoc-extractor gem.
Requirements:
- Install imagemagick on the Linux server.
- Add the image_processing gem for PDF checking.
- Add the rubyzip gem for DOCX checking.
- Add the ruby-ole gem for DOC checking.
With all of this considered, we could instead install libreoffice on the server and try to open each file with the libreoffice CLI. It's a nice alternative, perhaps a better one. It's a big package to put on the server, though, and I don't know its impact on overall system performance or on the processing time of each file.
I learned a lot from this project. It was challenging and fun.