Opened 13 years ago
Closed 21 months ago
#4825 closed enhancement (invalid)
extract worksheets embedded in pdf files
Reported by: | jason | Owned by: | boothby |
---|---|---|---|
Priority: | major | Milestone: | sage-duplicate/invalid/wontfix |
Component: | notebook | Keywords: | |
Cc: | ddrake | Merged in: | |
Authors: | | Reviewers: | Dima Pasechnik |
Report Upstream: | N/A | Work issues: | |
Branch: | | Commit: | |
Dependencies: | | Stopgaps: | |
Description
This is an ongoing discussion on sage-devel right now: http://groups.google.com/group/sage-devel/browse_frm/thread/65a932ea328b1afb/91ced495a0a1c27a
Basically, we'd like to embed an sws file in a pdf and then be able to upload the pdf file to the notebook and have the notebook automatically extract the sws file and create the worksheet.
We can use pdfminer to extract the data. Here's a sample program which extracts the first embedded file in a pdf named 'foo.pdf'.
    from pdflib.pdfparser import PDFDocument, PDFParser
    import sys

    stdout = sys.stdout
    doc = PDFDocument()
    fp = open('foo.pdf', 'rb')
    parser = PDFParser(doc, fp)
    doc.initialize()

    # Walk every object in every cross-reference table.
    for xref in doc.xrefs:
        for objid in xref.objids():
            try:
                obj = doc.getobj(objid)
            except Exception:
                # Skip objects that cannot be resolved.
                continue
            if isinstance(obj, dict) and 'Type' in obj and obj['Type'].name == "Annot":
                if 'Subtype' in obj and obj['Subtype'].name == "FileAttachment":
                    # We have an attached file!
                    filespec = obj['FS']
                    # Look for embedded file; we could try to extract the
                    # filename too (and make sure it's an sws file), but
                    # that is platform dependent. See page 182
                    # (Section 3.10.2) of
                    # http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.
                    if 'EF' in filespec:
                        fileobj = filespec['EF']['F']
                        stdout.write(fileobj.resolve().get_data())
                        # Just output the first file found.
                        sys.exit()
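The sample program leaves open how to make sure the extracted data really is an sws file without relying on the platform-dependent filename. One possible approach is to inspect the bytes themselves. The sketch below assumes that an sws file is a bzip2-compressed tar archive containing a member named `worksheet.html` (typically under a `sage_worksheet/` directory); that layout is an assumption here, not something stated in this ticket, and the function name `looks_like_sws` is hypothetical. It uses only the Python standard library.

```python
import io
import posixpath
import tarfile


def looks_like_sws(data):
    """Return True if `data` (bytes) appears to be a Sage worksheet archive.

    Assumption: an sws file is a bzip2-compressed tar archive that
    contains a member whose base name is 'worksheet.html'.
    """
    try:
        # Mode "r:bz2" rejects anything that is not a bzip2-compressed tar.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:bz2") as archive:
            names = archive.getnames()
    except (tarfile.TarError, OSError, EOFError, ValueError):
        # Not a bzip2 tar archive, or a truncated/corrupt one.
        return False
    # Compare base names so the enclosing directory name does not matter.
    return any(posixpath.basename(name) == "worksheet.html" for name in names)
```

With a check like this, the extraction loop could keep scanning attachments until one passes, rather than stopping at the first embedded file regardless of its type.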
Change History (14)
comment:1 Changed 13 years ago by
- Milestone changed from sage-3.3 to sage-3.4
comment:2 Changed 13 years ago by
- Type changed from defect to enhancement
comment:3 Changed 12 years ago by
- Description modified (diff)
- Report Upstream set to N/A
comment:4 Changed 12 years ago by
pdfminer is about 350 KB of code.
comment:5 Changed 11 years ago by
- Cc ddrake added
comment:6 Changed 9 years ago by
- Milestone changed from sage-5.11 to sage-5.12
comment:7 Changed 8 years ago by
- Milestone changed from sage-6.1 to sage-6.2
comment:8 Changed 8 years ago by
- Milestone changed from sage-6.2 to sage-6.3
comment:9 Changed 8 years ago by
- Milestone changed from sage-6.3 to sage-6.4
comment:10 Changed 8 years ago by
I feel like maybe this is possible now?
comment:11 Changed 2 years ago by
Unsure if this should be closed, as it could conceivably be useful for historical purposes. Thoughts?
comment:12 Changed 21 months ago by
- Milestone changed from sage-6.4 to sage-duplicate/invalid/wontfix
- Status changed from new to needs_review
sagenb is gone, so...
comment:13 Changed 21 months ago by
- Reviewers set to Dima Pasechnik
- Status changed from needs_review to positive_review
comment:14 Changed 21 months ago by
- Resolution set to invalid
- Status changed from positive_review to closed
3.3 is foremost about the ReST transition, so all tickets should be opened against 3.4.
Cheers,
Michael