DEHTML: Strip HTML tags from a file and dump the contents to standard output.
Syntax:
DEHTML /A:attribs /B /C /CP:n /E /H /M /N /N: /O:n /P /R /S filename…
/A:attribs | attributes mask; valid flags are -ACEHIORS |
/B | exclude text outside the body and title |
/C | include text in <!-- comments --> |
/CP:n | interpret non-Unicode input text using code page n |
/E | omit empty (blank) lines |
/H | display filenames |
/M | look in <meta> tags for charset info |
/N | by itself: include text in <noscript> or <applet> tags |
/N: | with suboptions: disable features |
/O:n | include text inside <option> tags: 0 = don’t include any (the default); 1 = include only the first <option>; 2 = include all <option> text |
/P | page output |
/R | remove title |
/S | search in subdirectories for matching files |
… | Range options are also supported. |
Input filenames may be specified on the command line, or text may be
redirected or piped into DEHTML. If you want to pipe to DEHTML, remember that
pipes open a new shell. To pipe to a plugin command, you must either ensure
that the plugin is loaded in the transient shell, e.g. by installing the .DLL
file in the shell’s PlugIns directory; or else use temporary files or an
in-process pipe.
You may specify more than one filename; wildcards and directory aliases are
supported. You can search recursively into subdirectories for matching files
with /S. @File lists and internet files are supported. You may also specify
CLIP: to dump the clipboard if it contains HTML.
DEHTML will strip HTML tags from the file and replace HTML entities with the
corresponding characters; most of the remaining text will be dumped to
stdout. This command will also discard: any text in the header which does not
appear within <title> tags; anything in <script> or <style> tags; anything
within an HTML comment unless you specify /C; anything in <noscript> or
<applet> tags unless you specify /N; and anything in <option> tags within a
<select> block unless you specify /O:1 or /O:2.
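To make the discard rules above more concrete, here is a minimal Python sketch of the same general idea: drop the tags, turn entities into characters, skip <script> and <style> content, and keep comment text only on request. It is an illustration built on Python's standard html.parser module, not DEHTML's own implementation. The names TextExtractor, strip_html and keep_comments are hypothetical, and the header, <noscript> and <option> rules are left out for brevity.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text outside <script>/<style>; keep comments only on request."""

    SKIP = {"script", "style"}

    def __init__(self, keep_comments=False):
        # convert_charrefs=True makes the parser turn entities such as
        # &amp; into the corresponding characters before handle_data runs.
        super().__init__(convert_charrefs=True)
        self.keep_comments = keep_comments  # rough analogue of /C (assumption)
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text inside a skipped element is discarded.
        if not self.skip_depth:
            self.parts.append(data)

    def handle_comment(self, data):
        if self.keep_comments:
            self.parts.append(data)


def strip_html(html, keep_comments=False):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor(keep_comments)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

For example, strip_html("<p>Fish &amp; chips</p>") yields the text with the entity decoded and the tags gone, while script bodies and (by default) comments never reach the output.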
If you specify /M, DEHTML will look in <meta> tags in the header for
information about the document’s character encoding. This only works if the
file is not in Unicode; /M has no effect with Unicode files.
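The kind of lookup that /M describes can be sketched in Python as follows. This is an assumption-laden illustration, not DEHTML's actual logic: the function name sniff_meta_charset, the cp1252 fallback, the 4096-byte search window, and the use of a byte order mark to recognize a Unicode file are all choices made for the example.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)

# Byte order marks for UTF-8, UTF-16 LE and UTF-16 BE.
UNICODE_BOMS = (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff")


def sniff_meta_charset(raw, default="cp1252"):
    """Return the charset named in a <meta> tag, or None for Unicode input."""
    # Assumption: a BOM marks the file as Unicode, so <meta> is not consulted.
    if raw.startswith(UNICODE_BOMS):
        return None
    match = META_CHARSET.search(raw[:4096])  # only look near the top of the file
    if match:
        return match.group(1).decode("ascii").lower()
    return default  # hypothetical fallback when no <meta> charset is found
```

The returned name could then be passed to bytes.decode() to interpret the rest of the file.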
/N with suboptions disables features:
/NB | do not write a Byte Order Mark |
/NC | disable highlight |
/ND | do not search into hidden directories; only useful with /S |
/NF | suppress the file-not-found error |
/NJ | do not search into junctions; only useful with /S |
/NZ | do not search into system directories; only useful with /S |
You can combine these, e.g. /NDJ.
• Note: HTML files often include some unusual characters like non-breaking
spaces, bullets, em dashes, ellipses, and guillemets. If you want to pipe or
redirect the output from this command, it’s a good idea to enable Unicode
output with OPTION //UNICODEOUTPUT=YES. If Unicode output is disabled, some
characters may be mangled in translation.
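The effect described in the note can be reproduced in miniature with Python. The sketch below is only an analogy: it uses plain ASCII as a stand-in for whatever non-Unicode encoding the shell would apply (the source does not say which), and UTF-16 LE as a stand-in for Unicode output. The function name encode_for_pipe is hypothetical.

```python
def encode_for_pipe(text, unicode_output):
    """Encode text as a pipe or redirect might, replacing unmappable characters."""
    # Assumption: UTF-16 LE models Unicode output, ASCII models the lossy case.
    encoding = "utf-16-le" if unicode_output else "ascii"
    return text.encode(encoding, errors="replace")


# Guillemets, a bullet and an ellipsis, written as escapes for clarity.
sample = "\u00ab one \u2022 two\u2026 \u00bb"

lossy = encode_for_pipe(sample, unicode_output=False)
print(lossy)  # every character outside ASCII has become "?"

safe = encode_for_pipe(sample, unicode_output=True)
print(safe.decode("utf-16-le") == sample)  # the Unicode path round-trips
```

Once a character has been replaced in the lossy path it cannot be recovered downstream, which is why enabling Unicode output before piping or redirecting is the safer choice.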