DEHTML — Strip HTML tags from a file and dump the contents to standard output.

Syntax:
DEHTML /A:attribs /B /C /CP:n /E /H /M /N /N: /O:n /P /R /S filename…

/A:attribs  attributes mask; valid flags are -ACEHIORS
/B          exclude text outside the body and title
/C          include text in <!-- comments -->
/CP:n       interpret non-Unicode input text using code page n
/E          omit empty (blank) lines
/H          display filenames
/M          look in <meta> tags for charset info
/N          by itself: include text in <noscript> or <applet> tags
/N:         with suboptions: disable features
/O:n        include text inside <option> tags:
               0 = don’t include any (the default)
               1 = include only the first <option>
               2 = include all <option> text
/P          page output
/R          remove title
/S          search in subdirectories for matching files
Range options are also supported.

Input filenames may be specified on the command line, or text may be redirected or piped into DEHTML. If you pipe to DEHTML, remember that a pipe starts a new, transient shell, so the plugin must be available in that shell too. Either ensure that the plugin is loaded in the transient shell (e.g. by installing the .DLL file in the shell’s PlugIns directory), or use temporary files or an in-process pipe instead.

You may specify more than one filename; wildcards and directory aliases are supported. You can search recursively into subdirectories for matching files with /S. @File lists and internet files are supported. You may also specify CLIP: to dump the clipboard if it contains HTML.

DEHTML will strip HTML tags from the file and replace HTML entities with the corresponding characters; most of the remaining text will be dumped to stdout. This command will also discard: any text in the header which does not appear within <title> tags; anything in <script> or <style> tags; anything within an HTML comment unless you specify /C; anything in <noscript> or <applet> tags unless you specify /N; and anything in <option> tags within a <select> block unless you specify /O:1 or /O:2.
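The filtering rules above can be illustrated with a short Python sketch. This is not DEHTML's implementation; it only mirrors the documented behaviour using the standard html.parser module. The names TextExtractor, strip_html, keep_comments, keep_noscript and option_mode are hypothetical stand-ins for /C, /N and /O:n, and the sketch makes no attempt to reproduce DEHTML's line layout.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect document text, loosely following the rules described above."""

    def __init__(self, keep_comments=False, keep_noscript=False, option_mode=0):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.keep_comments = keep_comments      # hypothetical analogue of /C
        self.option_mode = option_mode          # hypothetical analogue of /O:n
        self.skip_tags = {"script", "style"}    # always discarded
        if not keep_noscript:                   # hypothetical analogue of /N
            self.skip_tags |= {"noscript", "applet"}
        self.skip_depth = 0       # > 0 while inside a discarded element
        self.in_head = False
        self.in_title = False
        self.in_option = False
        self.option_index = 0     # position of the current <option> in its <select>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag == "head":
            self.in_head = True
        elif tag == "title":
            self.in_title = True
        elif tag == "select":
            self.option_index = 0
        elif tag == "option":
            self.in_option = True
            self.option_index += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False
        elif tag in ("option", "select"):
            self.in_option = False

    def _wanted(self):
        if self.skip_depth:
            return False
        if self.in_head and not self.in_title:
            return False          # header text outside <title> is dropped
        if self.in_option:
            return self.option_mode == 2 or (
                self.option_mode == 1 and self.option_index == 1
            )
        return True

    def handle_data(self, data):
        if self._wanted():
            self.parts.append(data)

    def handle_comment(self, data):
        if self.keep_comments:
            self.parts.append(data)


def strip_html(html, **options):
    """Return the text that survives the filtering rules."""
    parser = TextExtractor(**options)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_html("<p>Fish &amp; chips</p><script>x()</script>"))  # prints: Fish & chips
```

Passing option_mode=1 or option_mode=2 keeps the first or all <option> texts respectively, and keep_comments=True keeps comment text, in the same spirit as /O:1, /O:2 and /C.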

If you specify /M, DEHTML will look in <meta> tags in the header for information about the document’s character encoding. /M applies only to non-Unicode input; it has no effect when the file is already in Unicode.
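As a rough illustration of what looking in <meta> tags for charset information can involve, here is a minimal Python sketch. It is an assumption-laden stand-in, not DEHTML's algorithm: the function name sniff_meta_charset is hypothetical, "Unicode" is assumed to mean "starts with a UTF-8 or UTF-16 byte order mark", and only the first 4096 bytes are examined.

```python
import codecs
import re

# Assumption: a file counts as Unicode if it starts with one of these BOMs.
_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw: bytes):
    """Return the charset named in a <meta> tag, or None if not applicable."""
    if raw.startswith(_BOMS):
        return None                      # Unicode input: the meta tag is ignored
    match = _META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii") if match else None


print(sniff_meta_charset(b'<head><meta charset="windows-1252"></head>'))
```

The returned name could then be used to choose a decoder for the rest of the file, much as /CP:n selects a code page explicitly.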

/N with suboptions disables features:

/NB  do not write a Byte Order Mark
/NC  disable highlight
/ND  do not search into hidden directories; only useful with /S
/NF  suppress the file-not-found error
/NJ  do not search into junctions; only useful with /S
/NZ  do not search into system directories; only useful with /S

You can combine these, e.g. /NDJ.


•  Note: HTML files often include some unusual characters like non-breaking spaces, bullets, em dashes, ellipses, and guillemets. If you want to pipe or redirect the output from this command, it’s a good idea to enable Unicode output with OPTION //UNICODEOUTPUT=YES. If Unicode output is disabled, some characters may be mangled in translation.