Object creation/manipulation
$doc->new($package, $content)
$doc->new($package, $content, $ownerpass, $userpass)
$doc->new($package, $content, $ownerpass, $userpass, $prompt)
$doc->new($package, $content, $ownerpass, $userpass, $options)
Instantiate a new CAM::PDF object. $content can be a document in a string, a filename, or '-'. The
latter indicates that the document should be read from standard input. If the document is password
protected, the passwords should be passed as additional arguments. If they are not known, a boolean
$prompt argument allows the programmer to suggest that the constructor prompt the user for a
password. This is rudimentary prompting: passwords are in the clear on the console.
This constructor takes an optional final argument which is a hash reference. This hash can contain
any of the following optional parameters:
prompt_for_password => $boolean
This is the same as the $prompt argument described above.
fault_tolerant => $boolean
This flag causes the instance to be more lenient when reading the input PDF. Currently, this
only affects PDFs which cannot be successfully decrypted.
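The following is a minimal sketch of passing the options hash. It assumes the constructor is invoked as a class method on CAM::PDF, that undefined passwords are acceptable for unprotected documents, and that a false value is returned on failure. The helper name open_pdf is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: open a PDF without ever prompting on the console.
# Assumption: CAM::PDF->new() returns a false value when the document
# cannot be read or decrypted.
sub open_pdf {
    my ($content, $ownerpass, $userpass) = @_;

    require CAM::PDF;    # loaded on first use

    my $doc = CAM::PDF->new(
        $content, $ownerpass, $userpass,
        {
            prompt_for_password => 0,    # never echo passwords on the console
            fault_tolerant      => 1,    # be lenient with undecryptable input
        },
    );
    die "Could not open PDF\n" if !$doc;
    return $doc;
}
```

A caller might write open_pdf('report.pdf') for an unprotected file, or open_pdf('-', $ownerpass, $userpass) to read a protected document from standard input.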
$doc->toPDF()
Serializes the data structure as a PDF document stream and returns it as a scalar.
$doc->toString()
Returns a serialized representation of the data structure. Implemented via Data::Dumper.
Document reading
(all of these functions are intended for internal use only)
$doc->getRootDict()
Returns the Root dictionary for the PDF.
$doc->getPagesDict()
Returns the root Pages dictionary for the PDF.
$doc->parseObj($string)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return an object Node. This can be called as a
class method in most circumstances, but is intended as an instance method.
$doc->parseInlineImage($string)
$doc->parseInlineImage($string, $objnum)
$doc->parseInlineImage($string, $objnum, $gennum)
Given a fragment of PDF page content, parse it and return an object Node. This can be called as a
class method in some cases, but is intended as an instance method.
$doc->writeInlineImage($objectnode)
This is the inverse of parseInlineImage(), intended for use only in the CAM::PDF::Content class.
$doc->parseStream($string, $objnum, $gennum, $dictnode)
This should only be used by parseObj(), or other specialized cases.
Given a fragment of PDF page content, parse it and return a stream Node. This can be called as a
class method in most circumstances, but is intended as an instance method.
The dictionary Node argument is typically the body of the object Node that precedes this stream.
$doc->parseDict($string)
$doc->parseDict($string, $objnum)
$doc->parseDict($string, $objnum, $gennum)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a dictionary Node. This can be called as
a class method in most circumstances, but is intended as an instance method.
$doc->parseArray($string)
$doc->parseArray($string, $objnum)
$doc->parseArray($string, $objnum, $gennum)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return an array Node. This can be called as a
class or instance method.
$doc->parseLabel($string)
$doc->parseLabel($string, $objnum)
$doc->parseLabel($string, $objnum, $gennum)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a label Node. This can be called as a
class or instance method.
$doc->parseRef($string)
$doc->parseRef($string, $objnum)
$doc->parseRef($string, $objnum, $gennum)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a reference Node. This can be called as a
class or instance method.
$doc->parseNum($string)
$doc->parseNum($string, $objnum)
$doc->parseNum($string, $objnum, $gennum)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a number Node. This can be called as a
class or instance method.
$doc->parseString($string)
$doc->parseString($string, $objnum)
$doc->parseString($string, $objnum, $gennum)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a string Node. This can be called as a
class or instance method.
$doc->parseHexString($string)
$doc->parseHexString($string, $objnum)
$doc->parseHexString($string, $objnum, $gennum)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a hex string Node. This can be called as a
class or instance method.
$doc->parseBoolean($string)
$doc->parseBoolean($string, $objnum)
$doc->parseBoolean($string, $objnum, $gennum)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a boolean Node. This can be called as a
class or instance method.
$doc->parseNull($string)
$doc->parseNull($string, $objnum)
$doc->parseNull($string, $objnum, $gennum)
Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a null Node. This can be called as a class
or instance method.
$doc->parseAny($string)
$doc->parseAny($string, $objnum)
$doc->parseAny($string, $objnum, $gennum)
Given a fragment of PDF page content, parse it and return a Node of the appropriate type. This can
be called as a class or instance method.
Data Accessors
$doc->getValue($object)
For INTERNAL use
Dereference a data object, return a value. Given a node object of any kind, returns a raw scalar
object: hashref, arrayref, string, number. This function follows all references, and descends into
all objects.
$doc->getObjValue($objectnum)
For INTERNAL use
Dereference a data object, and return a value. Behaves just like the getValue() function, but used
when all you know is the object number.
$doc->dereference($objectnum)
$doc->dereference($name, $pagenum)
For INTERNAL use
Dereference a data object, return a PDF object as a node. This function makes heavy use of the
internal object cache. Most (if not all) object requests should go through this function.
$name should look something like '/R12'.
$doc->getPropertyNames($pagenum)
$doc->getProperty($pagenum, $propertyname)
Each PDF page contains a list of resources that it uses (images, fonts, etc). getPropertyNames()
returns an array of the names of those resources. getProperty() returns a node representing a named
property (most likely a reference node).
$doc->getFont($pagenum, $fontname)
For INTERNAL use
Returns a dictionary for a given font identified by its label, referenced by page.
$doc->getFontNames($pagenum)
For INTERNAL use
Returns a list of fonts for a given page.
$doc->getFonts($pagenum)
For INTERNAL use
Returns an array of font objects for a given page.
$doc->getFontByBaseName($pagenum, $fontname)
For INTERNAL use
Returns a dictionary for a given font, referenced by page and the name of the base font.
$doc->getFontMetrics($properties, $fontname)
For INTERNAL use
Returns a data structure representing the font metrics for the named font. The property list is the
result of something like the following:
$self->_buildNameTable($pagenum);
my $properties = $self->{Names}->{$pagenum};
Alternatively, if you know the page number, it might be easier to do:
my $font = $self->dereference($fontlabel, $pagenum);
my $fontmetrics = $font->{value}->{value};
where the $fontlabel is something like '/Helv'. The getFontMetrics() method is useful in the cases
where you've forgotten which page number you are working on (e.g. in CAM::PDF::GS), or if your
property list isn't part of any page (e.g. working with form field annotation objects).
$doc->addFont($pagenum, $fontname, $fontlabel)
$doc->addFont($pagenum, $fontname, $fontlabel, $fontmetrics)
Adds a reference to the specified font to the page.
If a font metrics hash is supplied (it is required for a font other than the 14 core fonts), then it
is cloned and inserted into the new font structure. Note that if those font metrics contain
references (e.g. to the "FontDescriptor"), the referred objects are not copied -- you must do that
part yourself.
For Type1 fonts, the font metrics must minimally contain the following fields: "Subtype",
"FirstChar", "LastChar", "Widths", "FontDescriptor".
$doc->deEmbedFont($pagenum, $fontname)
$doc->deEmbedFont($pagenum, $fontname, $basefont)
Removes embedded font data, leaving font reference intact. Returns true if the font exists and 1)
font is not embedded or 2) embedded data was successfully discarded. Returns false if the font does
not exist, or the embedded data could not be discarded.
The optional $basefont parameter allows you to change the font. This is useful when some
applications embed a standard font (see below) and give it a funny name, like "SYLXNP+Helvetica". In
this example, it's important to change the basename back to the standard "Helvetica" when de-
embedding.
De-embedding the font does NOT remove it from the PDF document; it just removes references to it. To
get a size reduction by throwing away unused font data, call the following some time after this
method:
$self->cleanse();
For reference, the standard fonts are "Times-Roman", "Helvetica", and "Courier" (and their bold,
italic and bold-italic forms) plus "Symbol" and "ZapfDingbats". (Adobe PDF Reference v1.4, p.319)
$doc->deEmbedFontByBaseName($pagenum, $fontname)
$doc->deEmbedFontByBaseName($pagenum, $fontname, $basefont)
Just like deEmbedFont(), except that the font name parameter refers to the name of the current base
font instead of the PDF label for the font.
$doc->wrapString($string, $width, $fontsize, $fontmetrics)
$doc->wrapString($string, $width, $fontsize, $pagenum, $fontlabel)
Returns an array of strings wrapped to the specified width.
$doc->getStringWidth($fontmetrics, $string)
For INTERNAL use
Returns the width of the string, using the font metrics if possible.
$doc->numPages()
Returns the number of pages in the PDF document.
$doc->getPage($pagenum)
For INTERNAL use
Returns a dictionary for a given numbered page.
$doc->getPageObjnum($pagenum)
For INTERNAL use
Return the number of the PDF object in which the specified page occurs.
$doc->getPageText($pagenum)
Extracts the text from a PDF page as a string.
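As an illustration, numPages() and getPageText() can be combined to collect the text of a whole document. This is only a sketch: text_by_page is a hypothetical helper, page numbers are assumed to start at 1, and an undefined return value is treated as empty text.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash reference mapping page number => text.
# $doc is any object offering numPages() and getPageText(), such as a
# CAM::PDF instance. Assumption: pages are numbered from 1.
sub text_by_page {
    my ($doc) = @_;
    my %text;
    for my $pagenum (1 .. $doc->numPages()) {
        my $str = $doc->getPageText($pagenum);
        $text{$pagenum} = defined $str ? $str : '';    # undefined becomes ''
    }
    return \%text;
}
```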
$doc->getPageContentTree($pagenum)
Retrieves a parsed page content data structure, or undef if there is a syntax error or if the page
does not exist.
$doc->getPageContent($pagenum)
Return a string with the layout contents of one page.
$doc->getPageDimensions($pagenum)
Returns an array of "x", "y", "width" and "height" numbers that define the dimensions of the
specified page in points (1/72 inches). Technically, this is the "MediaBox" dimensions, which
explains why it's possible for "x" and "y" to be non-zero, but that's a rare case.
For example, given a simple 8.5 by 11 inch page, this method will return "(0,0,612,792)".
This method will die() if the specified page number does not exist.
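Because the values are in points, converting them to inches is a division by 72. A small sketch follows; page_size_inches is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: return (width, height) of a page in inches.
# $doc is any object offering getPageDimensions(), such as a CAM::PDF instance.
sub page_size_inches {
    my ($doc, $pagenum) = @_;
    my (undef, undef, $width, $height) = $doc->getPageDimensions($pagenum);
    return ($width / 72, $height / 72);    # 72 points per inch
}
```

For the (0,0,612,792) page described above, this yields 8.5 and 11.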
$doc->getName($object)
For INTERNAL use
Given a PDF object reference, return its name, if it has one. This is useful for indirect
references to images in particular.
$doc->getPrefs()
Return an array of security information for the document:
owner password
user password
print boolean
modify boolean
copy boolean
add boolean
See the PDF reference for the intended use of the latter four booleans.
This module publishes the array indices of these values for your convenience:
$CAM::PDF::PREF_OPASS
$CAM::PDF::PREF_UPASS
$CAM::PDF::PREF_PRINT
$CAM::PDF::PREF_MODIFY
$CAM::PDF::PREF_COPY
$CAM::PDF::PREF_ADD
So, you can retrieve the value of the Copy boolean via:
my ($canCopy) = ($self->getPrefs())[$CAM::PDF::PREF_COPY];
$doc->canPrint()
Return a boolean indicating whether the Print permission is enabled on the PDF.
$doc->canModify()
Return a boolean indicating whether the Modify permission is enabled on the PDF.
$doc->canCopy()
Return a boolean indicating whether the Copy permission is enabled on the PDF.
$doc->canAdd()
Return a boolean indicating whether the Add permission is enabled on the PDF.
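The four permission accessors can be gathered into one structure for logging or display. This is a sketch only; permission_summary is a hypothetical helper and the hash keys are chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the four permission booleans in one hash.
# $doc is any object offering canPrint(), canModify(), canCopy() and
# canAdd(), such as a CAM::PDF instance.
sub permission_summary {
    my ($doc) = @_;
    my %summary = (
        'print'  => ($doc->canPrint()  ? 1 : 0),
        'modify' => ($doc->canModify() ? 1 : 0),
        'copy'   => ($doc->canCopy()   ? 1 : 0),
        'add'    => ($doc->canAdd()    ? 1 : 0),
    );
    return \%summary;
}
```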
$doc->getFormFieldList()
Return an array of the names of all of the PDF form fields. The names are the full hierarchical
names constructed as explained in the PDF reference manual. These names are useful for the
fillFormFields() function.
$doc->getFormField($name)
For INTERNAL use
Return the object containing the form field definition for the specified field name. $name can be
either the full name or the "short/alternate" name.
$doc->getFormFieldDict($formfieldobject)
For INTERNAL use
Return a hash reference representing the accumulated property list for a form field, including all of
its inherited properties. This should be treated as a read-only hash! It ONLY retrieves the
properties it knows about.
Data/Object Manipulation
$doc->setPrefs($ownerpass, $userpass, $print?, $modify?, $copy?, $add?)
Alter the document's security information. Note that modifying these parameters must be done
respecting the intellectual property of the original document. See Adobe's statement in the
introduction of the reference manual.
Important Note: Most PDF readers (Acrobat, Preview.app) only offer one password field for opening
documents. So, if the $ownerpass and $userpass are different, those applications cannot read the
documents. (Perhaps this is a bug in CAM::PDF?)
Note: any omitted booleans default to false. So, these two are equivalent:
$doc->setPrefs('password', 'password');
$doc->setPrefs('password', 'password', 0, 0, 0, 0);
$doc->setName($object, $name)
For INTERNAL use
Change the name of a PDF object structure.
$doc->removeName($object)
For INTERNAL use
Delete the name of a PDF object structure.
$doc->pageAddName($pagenum, $name, $objectnum)
For INTERNAL use
Append a named object to the metadata for a given page.
$doc->setPageContent($pagenum, $content)
$doc->setPageContent($pagenum, $tree->toString)
Replace the content of the specified page with a new version. This function is often used after the
getPageContent() function and some manipulation of the returned string from that function.
If your content is a parsed tree (i.e. the result of getPageContentTree) then you should serialize it
via toString first.
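The get/modify/set cycle described above might be wrapped as follows. This is a sketch under stated assumptions: edit_page_content is a hypothetical helper, and getPageContent() is assumed to return undef for a page it cannot read.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch the content of a page, pass it through a
# caller-supplied code reference, and store the result if it changed.
# Returns 1 if the page was updated, 0 otherwise.
sub edit_page_content {
    my ($doc, $pagenum, $edit) = @_;

    my $content = $doc->getPageContent($pagenum);
    return 0 if !defined $content;           # assumption: undef means no content

    my $new = $edit->($content);
    return 0 if $new eq $content;            # nothing to write back

    $doc->setPageContent($pagenum, $new);
    return 1;
}
```

A caller could pass a code reference that performs a substitution on its argument and returns the edited string.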
$doc->appendPageContent($pagenum, $content)
Add more content to the specified page. Note that this function does NOT do any page metadata work
for you (like creating font objects for any newly defined fonts).
$doc->extractPages($pages...)
Remove all pages from the PDF except the specified ones. Like deletePages(), the pages can be
multiple arguments, comma separated lists, ranges (open or closed).
$doc->deletePages($pages...)
Remove the specified pages from the PDF. The pages can be multiple arguments, comma separated lists,
ranges (open or closed).
$doc->deletePage($pagenum)
Remove the specified page from the PDF. If the PDF has only one page, this method will fail.
$doc->decachePages($pagenum, $pagenum, ...)
Clears cached copies of the specified page data structures. This is useful if an operation has been
performed that changes a page.
$doc->addPageResources($pagenum, $resourcehash)
Add the resources from the given object to the page resource dictionary. If the page does not have a
resource dictionary, create one. This function avoids duplicating resources where feasible.
$doc->appendPDF($pdf)
Append pages from another PDF document to this one. No optimization is done -- the pieces are just
appended and the internal table of contents is updated.
Note that this can break documents with annotations. See the appendpdf.pl script for a workaround.
$doc->prependPDF($pdf)
Just like appendPDF() except the new document is inserted on page 1 instead of at the end.
$doc->duplicatePage($pagenum)
$doc->duplicatePage($pagenum, $leaveblank)
Inserts an identical copy of the specified page into the document. The new page's number will be
"$pagenum + 1".
If $leaveblank is true, the new page does not get any content. Thus, the document is broken until
you subsequently call setPageContent().
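Since a blank duplicate leaves the document broken until it receives content, the two calls are best kept together. A minimal sketch, with insert_page_after as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: insert a new page after $pagenum and give it content
# straight away, so the document is never left in the broken state.
# Returns the number of the new page.
sub insert_page_after {
    my ($doc, $pagenum, $content) = @_;

    $doc->duplicatePage($pagenum, 1);            # true $leaveblank: no content yet
    my $newpage = $pagenum + 1;                  # the copy lands right after the original
    $doc->setPageContent($newpage, $content);
    return $newpage;
}
```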
$doc->createStreamObject($content)
$doc->createStreamObject($content, $filter ...)
For INTERNAL use
Create a new Stream object. This object is NOT added to the document. Use the appendObject()
function to do that after calling this function.
$doc->uninlineImages()
$doc->uninlineImages($pagenum)
Search the content of the specified page (or all pages if the page number is omitted) for embedded
images. If there are any, replace them with indirect objects. This procedure uses heuristics to
detect in-line images, and is subject to confusion in extremely rare cases of text that uses "BI" and
"ID" a lot.
$doc->appendObject($doc, $objectnum, $recurse?)
$doc->appendObject($undef, $object, $recurse?)
Duplicate an object from another PDF document and add it to this document, optionally descending into
the object and copying any other objects it references.
Like replaceObject(), the second form allows you to append a newly-created block to the PDF.
$doc->replaceObject($objectnum, $doc, $objectnum, $recurse?)
$doc->replaceObject($objectnum, $undef, $object)
Duplicate an object from another PDF document and insert it into this document, replacing an existing
object. Optionally descend into the original object and copy any other objects it references.
If the other document is undefined, then the object to copy is taken to be an anonymous object that
is not part of any other document. This is useful when you've just created that anonymous object.
$doc->deleteObject($objectnum)
Remove an object from the document. This function does NOT take care of dependencies on this object.
$doc->cleanse()
Remove unused objects. WARNING: this function breaks some PDF documents because it removes objects
that are not strictly part of the page model hierarchy, but which are required anyway (like some font
definition objects).
$doc->createID()
For INTERNAL use
Generate a new document ID. Contrary to the Adobe recommendation, this is a random number.
$doc->fillFormFields($name => $value, ...)
$doc->fillFormFields($opts_hash, $name => $value, ...)
Set the default values of PDF form fields. The name should be the full hierarchical name of the
field as output by the getFormFieldList() function. The argument list can be a hash if you like. A
simple way to use this function is something like this:
my %fields = (fname => 'John', lname => 'Smith', state => 'WI');
$fields{zip} = 53703;
$self->fillFormFields(%fields);
If the first argument is a hash reference, it is interpreted as options for how to render the filled
data:
background_color => 'none' | $gray | [$r, $g, $b]
Specify the background color for the text field.
max_autoscale_fontsize => $size
min_autoscale_fontsize => $size
If the form field is set to auto-size the text to fit, then you may use these options to
constrain the limits of that autoscaling. Otherwise, for example, a very long string will become
arbitrarily small to fit in the box.
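A sketch of the second calling form, with the options hash reference as the first argument. The helper name fill_fields_plain and the font size limits of 6 and 12 are illustrative assumptions, not recommended values.

```perl
use strict;
use warnings;

# Hypothetical helper: fill form fields with no background color and with
# autoscaled text kept within an illustrative 6 to 12 point range.
# $doc is any object offering fillFormFields(), such as a CAM::PDF instance.
sub fill_fields_plain {
    my ($doc, %fields) = @_;
    my %opts = (
        background_color       => 'none',
        min_autoscale_fontsize => 6,     # illustrative lower limit
        max_autoscale_fontsize => 12,    # illustrative upper limit
    );
    return $doc->fillFormFields(\%opts, %fields);
}
```

The field names passed in should be the full hierarchical names reported by getFormFieldList().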
$doc->clearFormFieldTriggers($name, $name, ...)
Disable any triggers set on data entry for the specified form field names. This is useful in the
case where, for example, the data entry JavaScript forbids punctuation and you want to prefill with a
hyphenated word. If you don't clear the trigger, the prefill may not happen.
$doc->clearAnnotations()
Remove all annotations from the document. If form fields are encountered, their text is added to the
appropriate page.
$doc->previousRevision()
If this PDF was previously saved in append mode (that is, if "clean()" was not invoked on it), return
a new instance representing that previous version. Otherwise return void. If this is an encrypted
PDF, this method assumes that previous revisions were encrypted with the same password, which may be
an incorrect assumption.
$doc->allRevisions()
Accumulate CAM::PDF instances returned by "previousRevision" until there are no more previous
revisions. Returns a list of instances from newest to oldest including this instance as the newest.
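For example, the revision list can be used to see how the page count changed over time. This is a sketch; revision_page_counts is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: return the page count of every revision,
# newest first, following the order returned by allRevisions().
sub revision_page_counts {
    my ($doc) = @_;
    return map { $_->numPages() } $doc->allRevisions();
}
```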
Document Writing
$doc->preserveOrder()
Try to recreate the original document as much as possible. This may help in recreating documents
which use undocumented tricks of saving font information in adjacent objects.
$doc->isLinearized()
Returns a boolean indicating whether this PDF is linearized (aka "optimized").
$doc->delinearize()
For INTERNAL use
Undo the tweaks used to make the document 'optimized'. This function is automatically called on
every save or output since this library does not yet support linearized documents.
$doc->clean()
Cache all parts of the document and throw away its old structure. This is useful for writing PDFs
anew, instead of simply appending changes to the existing documents. This is called by cleansave()
and cleanoutput().
$doc->needsSave()
Returns a boolean indicating whether the save() method needs to be called. Like save(), this has
nothing to do with whether the document has been saved to disk, but whether the in-memory
representation of the document has been serialized.
$doc->save()
Serialize the document into a single string. All changed document elements are normalized, and a new
index and an updated trailer are created.
This function operates solely in memory. It DOES NOT write the document to a file. See the output()
function for that.
$doc->cleansave()
Call the clean() function, then call the save() function.
$doc->output($filename)
$doc->output()
Save the document to a file. The save() function is called first to serialize the data structure.
If no filename is specified, or if the filename is '-', the document is written to standard output.
Note: it is the responsibility of the application to ensure that the PDF document has either the
Modify or Add permission. You can do this like the following:
if ($self->canModify()) {
    $self->output($outfile);
} else {
    die "The PDF file denies permission to make modifications\n";
}
$doc->cleanoutput($file)
$doc->cleanoutput()
Call the clean() function, then call the output() function to write a fresh copy of the document to a
file.
$doc->writeObject($objnum)
Return the serialization of the specified object.
$doc->writeString($string)
Return the serialization of the specified string. Works on normal or hex strings. If encryption is
desired, the string should be encrypted before being passed here.
$doc->writeAny($node)
Returns the serialization of the specified node. This handles all Node types, including object
Nodes.
Document Traversing
$doc->traverse($dereference, $node, $callbackfunc, $callbackdata)
Recursive traversal of a PDF data structure.
In many cases, it's useful to apply one action to every node in an object tree. The routines below
all use this traverse() function. One of the most important parameters is the first: the
$dereference boolean. If true, the traversal follows reference Nodes. If false, it does not descend
into reference Nodes.
Optionally, you can pass in a hashref as a final argument to reduce redundant traversing across
multiple calls. Just pass in an empty hashref the first time and pass in the same hashref each time.
See "changeRefKeys()" for an example.
$doc->decodeObject($objectnum)
For INTERNAL use
Remove any filters (like compression, etc) from a data stream indicated by the object number.
$doc->decodeAll($object)
For INTERNAL use
Remove any filters from any data stream in this object or any object referenced by it.
$doc->decodeOne($object)
$doc->decodeOne($object, $save?)
For INTERNAL use
Remove any filters from an object. The boolean flag $save (defaults to false) indicates whether this
removal should be permanent or just this once. If true, the function returns success or failure. If
false, the function returns the defiltered content.
$doc->fixDecode($streamdata, $filter, $params)
This is a utility method to do any tweaking after removing the filter from a data stream.
$doc->encodeObject($objectnum, $filter)
Apply the specified filter to the object.
$doc->encodeOne($object, $filter)
Apply the specified filter to the object.
$doc->setObjNum($object, $objectnum, $gennum)
Descend into an object and change all of the INTERNAL object number flags to a new number. This is
just for consistency of internal accounting.
$doc->getRefList($object)
For INTERNAL use
Return an array of all objects referred to in this object.
$doc->changeRefKeys($object, $hashref)
For INTERNAL use
Renumber all references in an object.
$doc->abbrevInlineImage($object)
Contract all image keywords to inline abbreviations.
$doc->unabbrevInlineImage($object)
Expand all inline image abbreviations.
$doc->changeString($object, $hashref)
Alter all instances of a given string. The hashref is a dictionary of from-string and to-string. If
the from-string looks like "regex(...)" then it is interpreted as a Perl regular expression and is
eval'ed. Otherwise the search-and-replace is literal.
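A sketch of how changeString() might be applied to a single object, assuming that the node returned by dereference() is an acceptable $object argument. The helper name replace_in_object is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a from-string => to-string map to one object.
# Assumption: the node returned by dereference() can be passed directly
# to changeString(). Returns 1 if the object was found, 0 otherwise.
sub replace_in_object {
    my ($doc, $objnum, $replacements) = @_;

    my $node = $doc->dereference($objnum);
    return 0 if !$node;

    $doc->changeString($node, $replacements);
    return 1;
}
```

A replacement map could look like { 'Draft' => 'Final', 'regex(\d{4})' => 'YYYY' }, where the first key is matched literally and the second uses the "regex(...)" form described above.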
Utility functions
$doc->rangeToArray($min, $max, $list...)
Converts string lists of numbers to an array. For example,
CAM::PDF->rangeToArray(1, 15, '1,3-5,12,9', '14-', '8 - 6, -2');
becomes
(1,3,4,5,12,9,14,15,8,7,6,1,2)
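To make the range syntax concrete, here is a standalone sketch that reproduces the documented example. It is not the module's implementation: expand_ranges is a hypothetical function, and it does not clamp or validate values against $min and $max.

```perl
use strict;
use warnings;

# Hypothetical re-implementation of the range syntax, for illustration only.
# Open ranges fall back to $min or $max; descending ranges count downwards.
sub expand_ranges {
    my ($min, $max, @lists) = @_;
    my @pages;
    for my $raw (map { split /,/ } @lists) {
        (my $item = $raw) =~ s/\s+//g;               # "8 - 6" becomes "8-6"
        if ($item =~ /^(\d*)-(\d*)$/) {
            my $from = length($1) ? $1 : $min;       # "-2" starts at $min
            my $to   = length($2) ? $2 : $max;       # "14-" runs to $max
            push @pages, $from <= $to ? ($from .. $to) : reverse($to .. $from);
        }
        elsif ($item =~ /^\d+$/) {
            push @pages, $item;                      # a single page number
        }
    }
    return @pages;
}
```

Calling expand_ranges(1, 15, '1,3-5,12,9', '14-', '8 - 6, -2') returns the same list as the example above.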
$doc->trimstr($string)
Used solely for debugging. Trims a string to a max of 40 characters, handling nulls and non-Unix
line endings.
$doc->copyObject($node)
Clones a node via Data::Dumper and eval().
$doc->cacheObjects()
Parses all object Nodes and stores them in the cache. This is useful for cases where you intend to
do some global manipulation and want all of the data conveniently in RAM.
$doc->asciify($string)
Helper class/instance method to massage a string, cleaning up some non-ASCII problems. This is a
very incomplete list. Specifically:
f-i ligatures
(R) symbol