Opened 11 years ago

Last modified 5 years ago

#4825 new enhancement

extract worksheets embedded in pdf files

Reported by: jason
Owned by: boothby
Priority: major
Milestone: sage-6.4
Component: notebook
Keywords:
Cc: ddrake
Merged in:
Authors:
Reviewers:
Report Upstream: N/A
Work issues:
Branch:
Commit:
Dependencies:
Stopgaps:

Description (last modified by jason)

This is an ongoing discussion on sage-devel right now: http://groups.google.com/group/sage-devel/browse_frm/thread/65a932ea328b1afb/91ced495a0a1c27a

Basically, we'd like to embed an sws file in a PDF, upload that PDF to the notebook, and have the notebook automatically extract the sws file and create the worksheet from it.
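A minimal sketch of how the upload side of this might look. Everything here is hypothetical: `classify_upload`, `worksheet_bytes_from_upload` and the `extract_embedded` callback are names made up for illustration and are not part of the notebook's code. The only fact relied on is that a PDF file starts with a `%PDF-` header.

```python
def classify_upload(data):
    """Return 'pdf' if the uploaded bytes look like a PDF, else 'other'."""
    # A PDF begins with a "%PDF-" header. Some readers tolerate a little
    # leading junk, so this sketch searches the first 1024 bytes.
    return 'pdf' if b'%PDF-' in data[:1024] else 'other'


def worksheet_bytes_from_upload(data, extract_embedded):
    """Return the bytes that should be handed to the worksheet importer.

    extract_embedded is a hypothetical callback that takes the PDF bytes
    and returns the embedded file's bytes, or None if nothing is embedded.
    """
    if classify_upload(data) == 'pdf':
        embedded = extract_embedded(data)
        if embedded is None:
            raise ValueError("PDF contains no embedded worksheet")
        return embedded
    # Not a PDF: assume the upload is already a worksheet file.
    return data
```

Keeping the extractor as a callback means the dispatch logic does not depend on which PDF library ends up doing the extraction.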

We can use pdfminer to extract the data. Here's a sample program which extracts the first embedded file in a PDF named 'foo.pdf' and writes it to standard output.

from pdflib.pdfparser import PDFDocument, PDFParser
import sys

# The embedded file is binary data, so write to the byte-oriented stream
# when the interpreter provides one.
stdout = getattr(sys.stdout, 'buffer', sys.stdout)

doc = PDFDocument()
fp = open('foo.pdf', 'rb')
parser = PDFParser(doc, fp)
doc.initialize()

# Walk every object in every cross-reference table, looking for
# file-attachment annotations.
for xref in doc.xrefs:
    for objid in xref.objids():
        try:
            obj = doc.getobj(objid)
        except Exception:
            # Skip objects that cannot be resolved.
            continue
        if isinstance(obj, dict) and 'Type' in obj and obj['Type'].name == "Annot":
            if 'Subtype' in obj and obj['Subtype'].name == "FileAttachment":
                # We have an attached file!
                filespec = obj['FS']
                # Look for the embedded file. We could also try to extract
                # the filename (and make sure it's an sws file), but that is
                # platform dependent. See page 182 (Section 3.10.2) of
                # http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.
                if 'EF' in filespec:
                    fileobj = filespec['EF']['F']
                    stdout.write(fileobj.resolve().get_data())
                    # Just output the first file found.
                    fp.close()
                    sys.exit(0)
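The comment in the program above leaves out the filename check because the name can live under several platform-specific keys. A rough sketch of that check follows, using plain dicts as stand-ins for the already-resolved file specification dictionary. The key names (`F`, `UF`, `DOS`, `Mac`, `Unix`, `EF`) come from the PDF 1.7 reference; the helper names, the order in which keys are tried, and the latin-1 decoding are assumptions made for illustration.

```python
# Keys that may carry the file name (and, inside 'EF', the embedded stream).
# The preference order is an assumption of this sketch.
_NAME_KEYS = ('UF', 'F', 'Unix', 'Mac', 'DOS')


def _as_text(value):
    # Names may arrive as bytes; latin-1 is a simplification that never
    # fails, not a faithful decoding of every PDF text string.
    if isinstance(value, bytes):
        return value.decode('latin-1')
    return value


def pick_filename(filespec):
    """Return the first file name found in the file specification, or None."""
    for key in _NAME_KEYS:
        if key in filespec:
            return _as_text(filespec[key])
    return None


def pick_embedded_stream(filespec):
    """Return the first embedded-file entry under 'EF', or None."""
    ef = filespec.get('EF')
    if not ef:
        return None
    for key in _NAME_KEYS:
        if key in ef:
            return ef[key]
    return None


def looks_like_sws(filespec):
    """True if the recorded file name ends in '.sws' (case-insensitive)."""
    name = pick_filename(filespec)
    return name is not None and name.lower().endswith('.sws')
```

With real pdfminer objects, the values would still need to be resolved before being passed in, as the sample program does with `fileobj.resolve()`.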

Change History (10)

comment:1 Changed 11 years ago by mabshoff

  • Milestone changed from sage-3.3 to sage-3.4

3.3 is foremost about the ReST transition, so all tickets should be opened against 3.4.

Cheers,

Michael

comment:2 Changed 11 years ago by AlexGhitza

  • Type changed from defect to enhancement

comment:3 Changed 10 years ago by jason

  • Description modified
  • Report Upstream set to N/A

comment:4 Changed 10 years ago by jason

pdfminer is about 350 KB of code.

comment:5 Changed 9 years ago by ddrake

  • Cc ddrake added

comment:6 Changed 6 years ago by jdemeyer

  • Milestone changed from sage-5.11 to sage-5.12

comment:7 Changed 6 years ago by vbraun_spam

  • Milestone changed from sage-6.1 to sage-6.2

comment:8 Changed 5 years ago by vbraun_spam

  • Milestone changed from sage-6.2 to sage-6.3

comment:9 Changed 5 years ago by vbraun_spam

  • Milestone changed from sage-6.3 to sage-6.4

comment:10 Changed 5 years ago by kcrisman

I feel like maybe this is possible now?
