Ticket #4825 (new enhancement)
extract worksheets embedded in pdf files
Reported by: jason | Owned by: boothby
Description (last modified by jason)
This is an ongoing discussion on sage-devel right now: http://groups.google.com/group/sage-devel/browse_frm/thread/65a932ea328b1afb/91ced495a0a1c27a
Basically, we'd like to embed an sws file in a pdf and then be able to upload the pdf file to the notebook and have the notebook automatically extract the sws file and create the worksheet.
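As a rough sketch of the first step, the upload handler would need to tell a PDF upload apart from a plain sws upload before attempting any extraction. The helper name `looks_like_pdf` below is hypothetical and not part of the notebook; it relies only on the fact that a PDF file begins with a `%PDF-` header.

```python
def looks_like_pdf(data):
    """Return True if the uploaded bytes start with the PDF header."""
    # Hypothetical helper: a PDF file begins with "%PDF-" followed by a version.
    return data.startswith(b"%PDF-")
```

An upload that passes this check could be handed to the extraction step; anything else would go through the existing upload path.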
We can use pdfminer to extract the data. Here's a sample program that extracts the first file attachment embedded in a PDF named 'foo.pdf' and writes its contents to standard output.
from pdflib.pdfparser import PDFDocument, PDFParser
import sys

stdout = sys.stdout
doc = PDFDocument()
fp = open('foo.pdf', 'rb')
parser = PDFParser(doc, fp)
doc.initialize()
for xref in doc.xrefs:
    for objid in xref.objids():
        try:
            obj = doc.getobj(objid)
        except Exception:
            # Skip objects that cannot be loaded.
            continue
        if isinstance(obj, dict) and 'Type' in obj and obj['Type'].name == "Annot":
            if 'Subtype' in obj and obj['Subtype'].name == "FileAttachment":
                # We have an attached file!
                filespec = obj['FS']
                # Look for embedded file; we could try to extract the
                # filename too (and make sure it's an sws file), but that
                # is platform dependent. See page 182 (Section 3.10.2) of
                # http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.
                if 'EF' in filespec:
                    fileobj = filespec['EF']['F']
                    stdout.write(fileobj.resolve().get_data())
                    # Just output the first file found.
                    sys.exit()
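The comment in the sample mentions making sure the attachment really is an sws file. One way to do that without relying on the platform-dependent filename might be to inspect the extracted bytes themselves. The sketch below assumes that an sws file is a tar archive (typically bzip2-compressed) containing a `worksheet.txt` member; that layout is an assumption here and should be checked against the notebook's own export code. The names `sws_members` and `looks_like_sws` are hypothetical.

```python
import io
import tarfile


def sws_members(data):
    """Return the member names if data is a readable tar archive, else None."""
    try:
        # Mode "r:*" lets tarfile detect the compression (bzip2, gzip, or none).
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as archive:
            return archive.getnames()
    except tarfile.TarError:
        return None


def looks_like_sws(data):
    """Assumed check: the archive contains a worksheet.txt member."""
    names = sws_members(data)
    return names is not None and any(n.endswith("worksheet.txt") for n in names)
```

In the sample program, the bytes returned by `get_data()` could be passed to `looks_like_sws` before the notebook creates a worksheet from them.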