--- lib/HTML/Encoding.pm.orig 2011-03-27 14:53:03.000000000 -0700 +++ lib/HTML/Encoding.pm 2011-03-27 15:28:44.000000000 -0700 @@ -523,20 +523,20 @@ =head1 WARNING -The interface and implementation are guranteed to change before this +The interface and implementation are guaranteed to change before this module reaches version 1.00! Please send feedback to the author of this module. =head1 DESCRIPTION HTML::Encoding helps to determine the encoding of HTML and XML/XHTML -documents... +documents. =head1 DEFAULT ENCODINGS -Most routines need to know some suspected character encodings which +Most routines need to know some suspected character encodings; these can be provided through the C option. This option always -defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference +defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference, which means the following encodings are considered by default: * ISO-8859-1 @@ -546,7 +546,7 @@ * UTF-32BE * UTF-8 -If you change the values or pass custom values to the routines note +If you change the values or pass custom values to the routines, note that L must support them in order for this module to work correctly. @@ -554,7 +554,7 @@ C, C, and C return in list context the encoding -source and the encoding name, possible encoding sources are +source and the encoding name. Possible encoding sources are: * protocol (Content-Type: text/html;charset=encoding) * bom (leading U+FEFF) @@ -565,21 +565,21 @@ =head1 ROUTINES -Routines exported by this module at user option. By default, nothing -is exported. +Routines may be exported by this module at the user's option. By +default, nothing is exported. =over 2 =item encoding_from_content_type($content_type) Takes a byte string and uses L to extract the -charset parameter from the C header value and returns +charset parameter from the C header value. Returns its value or C (or an empty list in list context) if there is no such value. 
Only the first component will be examined (HTTP/1.1 only allows for one component), any backslash escapes in strings will be unescaped, all leading and trailing quote marks and white-space characters will be removed, all white-space will be -collapsed to a single space, empty charset values will be ignored +collapsed to a single space, empty charset values will be ignored, and no case folding is performed. Examples: @@ -596,28 +596,28 @@ | "text/html;charset=\" UTF-8 \"" | 'UTF-8' | +-----------------------------------------+-----------+ -If you pass a string with the UTF-8 flag turned on the string will +If you pass a string with the UTF-8 flag turned on, the string will be converted to bytes before it is passed to L. -The return value will thus never have the UTF-8 flag turned on (this -might change in future versions). +The return value will thus never have the UTF-8 flag turned on. (This +might change in future versions.) =item encoding_from_byte_order_mark($octets [, %options]) -Takes a sequence of octets and attempts to read a byte order mark -at the beginning of the octet sequence. It will go through the list -of $options{encodings} or the list of default encodings if no -encodings are specified and match the beginning of the string against -any byte order mark octet sequence found. - -The result can be ambiguous, for example qq(\xFF\xFE\x00\x00) could -be both, a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a -U+0000 character. It is also possible that C<$octets> starts with +Takes a sequence of octets and attempts to read a byte order mark at the +beginning of the octet sequence. It will go through the list of +$options{encodings} (or the list of default encodings if no encodings +are specified) and match the beginning of the string against any byte +order mark octet sequence found. + +The result can be ambiguous. For example, qq(\xFF\xFE\x00\x00) could +be either a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a +U+0000 character. 
It is also possible for C<$octets> to start with something that looks like a byte order mark but actually is not. -encoding_from_byte_order_mark sorts the list of possible encodings -by the length of their BOM octet sequence and returns in scalar -context only the encoding with the longest match, and all encodings -ordered by length of their BOM octet sequence in list context. +encoding_from_byte_order_mark sorts the list of possible encodings by +the length of their BOM octet sequence. In scalar context, it returns +only the encoding with the longest match. In list context, it returns +all encodings ordered by length of their BOM octet sequence. Examples: @@ -634,9 +634,9 @@ | "\x2B\x2F\x76\x38\x2D" | UTF-7 | qw(UTF-7) | +-------------------------+------------+-----------------------+ -Note however that for UTF-7 it is in theory possible that the U+FEFF -combines with other characters in which case such detection would fail, -for example consider: +Note, however, that for UTF-7 it is theoretically possible for U+FEFF to +combine with other characters, in which case such detection would fail. +For example, consider: +--------------------------------------+-----------+-----------+ | Input | Encodings | Result | @@ -649,15 +649,17 @@ relevant for most applications as there should never be need to use UTF-7 in the encoding list for existing documents. -If no BOM can be found it returns C in scalar context and an -empty list in list context. This routine should not be used with -strings with the UTF-8 flag turned on. +If no BOM can be found, it returns C in scalar context or an +empty list in list context. + +This routine should not be used with strings with the UTF-8 flag turned +on. =item encoding_from_xml_declaration($declaration) Attempts to extract the value of the encoding pseudo-attribute in an XML declaration or text declaration in the character string $declaration. If -there does not appear to be such a value it returns nothing. 
This would +there does not appear to be such a value, it returns nothing. This would typically be used with the return values of xml_declaration_from_octets. Normalizes whitespaces like encoding_from_content_type. @@ -688,12 +690,14 @@ =item xml_declaration_from_octets($octets [, %options]) Attempts to find a ">" character in the byte string $octets using the -encodings in $encodings and upon success attempts to find a preceding -"<" character. Returns all the strings found this way in the order of -number of successful matches in list context and the best match in -scalar context. Should probably be combined with the only user of this -routine, encoding_from_xml_declaration... You can modify the list of -suspected encodings using $options{encodings}; +encodings in $encodings, and upon success attempts to find a preceding +"<" character. In list context, returns all the strings found this way +in the order of number of successful matches; or in scalar context, +returns the best match. You can modify the list of suspected encodings +using $options{encodings}; + +(Should probably be combined with the only user of this routine, +encoding_from_xml_declaration...) =item encoding_from_first_chars($octets [, %options]) @@ -707,9 +711,9 @@ document is a HTML document) to get at least a base encoding which can be used to decode enough of the document to find elements using encoding_from_meta_element. $options{whitespace} defaults to qw/CR LF SP TB/. -Returns nothing if unsuccessful. Returns the matching encodings in order -of the number of octets matched in list context and the best match in -scalar context. +Returns nothing if unsuccessful. In list context, returns the matching +encodings in order of the number of octets matched. In scalar context, +returns the best match. Examples: @@ -742,14 +746,14 @@ are found, uses encoding_from_content_type to extract the charset -parameter. 
It returns all such encodings it could find in document -order in list context or the first encoding in scalar context (it -will currently look for others regardless of calling context) or -nothing if that fails for some reason. +parameter. It returns (in list context) all such encodings it could find +in document order, or (in scalar context) the first encoding, or nothing +if that fails for some reason. (Currently it will look for any and all +encodings even when called in scalar context.) -Note that there are many edge cases where this does not yield in +Note that there are many edge cases where this does not yield "proper" results depending on the capabilities of the HTML::Parser -version and the options you pass for it, for example, +version and the options you pass for it. For example: @@ -759,19 +763,19 @@

...

This would likely not detect the C value if HTML::Parser -does not resolve the entity. This should however only be a concern +does not resolve the entity. This should, however, only be a concern for documents specifically crafted to break the encoding detection. =item encoding_from_xml_document($octets, [, %options]) -Uses encoding_from_byte_order_mark to detect the encoding using a -byte order mark in the byte string and returns the return value of -that routine if it succeeds. Uses xml_declaration_from_octets and -encoding_from_xml_declaration and returns the encoding for which -the latter routine found most matches in scalar context, and all -encodings ordered by number of occurences in list context. It -does not return a value of neither byte order mark not inbound -declarations declare a character encoding. +Uses encoding_from_byte_order_mark to detect the encoding using a byte +order mark in the byte string. Returns the return value of that routine +if it succeeds. Uses xml_declaration_from_octets and +encoding_from_xml_declaration, and (in scalar context) returns the +encoding for which the latter routine found most matches, or (in list +context) all encodings ordered by number of occurrences. It does not +return a value if neither byte order mark nor inbound declarations +declare a character encoding. Examples: @@ -787,12 +791,12 @@ +----------------------------+----------+-----------+----------+ Lacking a return value from this routine and higher-level protocol -information (such as protocol encoding defaults) processors would +information (such as protocol encoding defaults), processors would be required to assume that the document is UTF-8 encoded. -Note however that the return value depends on the set of suspected +Note, however, that the return value depends on the set of suspected encodings you pass to it.
For example, by default, EBCDIC encodings -would not be considered and thus for +would not be considered, and thus for @@ -803,7 +807,7 @@ Uses encoding_from_xml_document and encoding_from_meta_element to determine the encoding of HTML documents. If $options{xhtml} is -set to a false value uses encoding_from_byte_order_mark and +set to a false value, uses encoding_from_byte_order_mark and encoding_from_meta_element to determine the encoding. The xhtml option is on by default. The $options{encodings} can be used to modify the suspected encodings and $options{parser_options} can @@ -811,13 +815,13 @@ encoding_from_meta_element (see the relevant documentation). Returns nothing if no declaration could be found, the winning -declaration in scalar context and a list of encoding source -and encoding name in list context, see ENCODING SOURCES. +declaration in scalar context, or a list of encoding source +and encoding name in list context. See L. ... Other problems arise from differences between HTML and XHTML syntax -and encoding detection rules, for example, the input could be +and encoding detection rules. For example, the input could be: Content-Type: text/html @@ -829,14 +833,14 @@

...

-This is a perfectly legal HTML 4.01 document and implementations -might be expected to consider the document ISO-8859-2 encoded as -XML rules for encoding detection do not apply to HTML documents. -This module attempts to avoid making decisions which rules apply -for a specific document and would thus by default return 'utf-8' -for this input. +This is a perfectly legal HTML 4.01 document and implementations might +be expected to consider the document to have ISO-8859-2 encoding, as XML +rules for encoding detection do not apply to HTML documents. This +module attempts to avoid making decisions on which rules apply for a +specific document, and would thus by default return 'utf-8' for this +input. -On the other hand, if the input omits the encoding declaration, +On the other hand, if the input omits the encoding declaration, thus: Content-Type: text/html @@ -848,8 +852,10 @@

...

-It would return 'iso-8859-2'. Similar problems would arise from -other differences between HTML and XHTML, for example consider +it would return 'iso-8859-2'. + +Similar problems would arise from other differences between HTML and +XHTML. For example, consider: Content-Type: text/html @@ -864,69 +870,70 @@ If this is processed using HTML rules, the first > will end the processing instruction and the XHTML document type declaration -would be the relevant declaration for the document, if it is +would be the relevant declaration for the document. If it is processed using XHTML rules, the ?> will end the processing instruction and the HTML document type declaration would be the relevant declaration. -IOW, an application would need to assume a certain character -encoding (family) to process enough of the document to determine -whether it is XHTML or HTML and the result of this detection would -depend on which processing rules are assumed in order to process it. -It is thus in essence not possible to write a "perfect" detection -algorithm, which is why this routine attempts to avoid making any -decisions on this matter. +In other words, an application would need to assume a certain character +encoding (family) to process enough of the document to determine whether +it is XHTML or HTML, and the result of this detection would depend on +which processing rules are assumed in order to process it. It is thus +in essence not possible to write a "perfect" detection algorithm, which +is why this routine attempts to avoid making any decisions on this +matter. =item encoding_from_http_message($message [, %options]) -Determines the encoding of HTML / XML / XHTML documents enclosed -in HTTP message. $message is an object compatible to L, -e.g. a L object. %options is a hash with the following -possible entries: +Determines the encoding of HTML/XML/XHTML documents enclosed in an HTTP +message. $message is an object compatible with L, e.g. a +L object.
%options is a hash with the following possible +entries: =over 2 =item encodings -array references of suspected character encodings, defaults to +Array references of suspected character encodings; defaults to C<$HTML::Encoding::DEFAULT_ENCODINGS>. =item is_html Regular expression matched against the content_type of the message -to determine whether to use HTML rules for the entity body, defaults +to determine whether to use HTML rules for the entity body; defaults to C. =item is_xml Regular expression matched against the content_type of the message -to determine whether to use XML rules for the entity body, defaults +to determine whether to use XML rules for the entity body; defaults to C. =item is_text_xml Regular expression matched against the content_type of the message -to determine whether to use text/html rules for the message, defaults +to determine whether to use text/html rules for the message; defaults to C. This will only be checked if is_xml -matches aswell. +matches as well. =item html_default -Default encoding for documents determined (by is_html) as HTML, +Default encoding for documents determined (by is_html) as HTML; defaults to C. =item xml_default -Default encoding for documents determined (by is_xml) as XML, +Default encoding for documents determined (by is_xml) as XML; defaults to C. =item text_xml_default -Default encoding for documents determined (by is_text_xml) as text/xml, -defaults to C in which case the default is ignored. This should -be set to C if desired as this module is by default -inconsistent with RFC 3023 which requires that for text/xml documents -without a charset parameter in the HTTP header C is assumed. +Default encoding for documents determined (by is_text_xml) as text/xml; +defaults to C, in which case the default is ignored. 
This should +be set to C if desired, as this module is by default +inconsistent with RFC 3023; that RFC requires that for text/xml +documents without a charset parameter in the HTTP header, C is +assumed. This requirement is inconsistent with RFC 2616 (HTTP/1.1) which requires to assume C, has been widely ignored and is thus disabled by @@ -935,18 +942,18 @@ =item xhtml Whether the routine should look for an encoding declaration in the -XML declaration of the document (if any), defaults to C<1>. +XML declaration of the document (if any); defaults to C<1>. =item default Whether the relevant default value should be returned when no other -information can be determined, defaults to C<1>. +information can be determined; defaults to C<1>. =back -This is furhter possibly inconsistent with XML MIME types that differ -in other ways from application/xml, for example if the MIME Type does -not allow for a charset parameter in which case applications might be +This is possibly further inconsistent with XML MIME types that differ +in other ways from application/xml (for example, if the MIME type does +not allow for a charset parameter), in which case applications might be expected to ignore the charset parameter if erroneously provided. =back @@ -954,17 +961,17 @@ =head1 EBCDIC SUPPORT By default, this module does not support EBCDIC encodings. To enable -support for EBCDIC encodings you can either change the +support for EBCDIC encodings, you can either change the $HTML::Encodings::DEFAULT_ENCODINGS array reference or pass the -encodings to the routines you use using the encodings option, for -example +encodings to the routines you use using the encodings option; for +example: my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../; my $enc = encoding_from_xml_document($doc, encodings => \@try); Note that there are some subtle differences between various EBCDIC -encodings, for example C is mapped to 0x5A in C and -to 0x4F in C; these differences might affect processing in +encodings. 
For example, C is mapped to 0x5A in C and +to 0x4F in C. These differences might affect processing in yet undetermined ways. =head1 TODO @@ -994,4 +1001,8 @@ Copyright (c) 2004-2008 Bjoern Hoehrmann . This module is licensed under the same terms as Perl itself. + This document has been edited for grammar, spelling, and clarity by + Larry Gilbert for the MacPorts Project. (Some + especially opaque passages have been left alone.) + =cut
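Reviewer note, not part of the patch: the documentation for encoding_from_byte_order_mark describes matching the start of the octet string against each candidate BOM, ordering matches by BOM length, and returning the longest match in scalar context or all matches in list context. A minimal sketch of that idea in core Perl follows. It is not the module's implementation; the %BOM table and the helper name sketch_bom_candidates are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical BOM table for illustration; not copied from HTML::Encoding.
my %BOM = (
    'UTF-8'    => "\xEF\xBB\xBF",
    'UTF-16BE' => "\xFE\xFF",
    'UTF-16LE' => "\xFF\xFE",
    'UTF-32BE' => "\x00\x00\xFE\xFF",
    'UTF-32LE' => "\xFF\xFE\x00\x00",
);

# sketch_bom_candidates is a hypothetical helper, not a module routine.
sub sketch_bom_candidates {
    my ($octets, @encodings) = @_;

    # Fall back to every encoding in the table if none are given.
    @encodings = sort keys %BOM unless @encodings;

    # Keep the encodings whose BOM is a prefix of the input.
    my @hits = grep {
        exists $BOM{$_} && index($octets, $BOM{$_}) == 0
    } @encodings;

    # Order the matches by BOM length, longest first.
    @hits = sort { length($BOM{$b}) <=> length($BOM{$a}) } @hits;

    # List context: all matches. Scalar context: longest match or undef.
    return wantarray ? @hits : $hits[0];
}

# The ambiguous input discussed in the documentation.
my @all  = sketch_bom_candidates("\xFF\xFE\x00\x00");
my $best = sketch_bom_candidates("\xFF\xFE\x00\x00");
print "list: @all\n";
print "scalar: $best\n";
```

With the ambiguous input "\xFF\xFE\x00\x00", both the UTF-32LE and UTF-16LE entries match, and the longer UTF-32LE BOM sorts first. Passing an explicit encoding list narrows the candidates in the same way the encodings option is described as doing.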