WORDS — Count words, sentences, and paragraphs in English text.
Syntax:

WORDS /A:attribs /C /CP:n /D /F:fmt /K /M:n /N /S /U:mode /X filename…
/A:attribs | attribute mask; valid flags are -ACEHIORS |
/C | code mode; words may contain underscores and dollar signs |
/CP:n | interpret non-Unicode input text using code page n |
/D | dumps lists of unique words, sorted by frequency |
/F:fmt | specifies the format for input text; fmt is one of: |
| 0 — best guess (default) |
| 1 — unformatted (line breaks are used only to end paragraphs) |
| 2 — prewrapped (line breaks are used to wrap text) |
| 3 — unformatted, with no blank line added after each paragraph |
/K | keeps hyphens when reassembling split words |
/M:n | minimum number of letters in a word |
/N | by itself: no words containing digits; with suboptions: disable features |
/S | search in subdirectories for matching files |
/U:mode | controls the counting of unique words; mode is one of: |
| 0 — do not count unique words (faster for large files) |
| 1 — count unique words for each file individually (the default) |
| 2 — count unique words for all files together (slower) |
| 3 — separate counts for each file and for all files together (double oink!) |
/X | no words beginning with a digit |
… | Range options are also supported. |
WORDS counts words, sentences, and paragraphs in English text. It can read
text from standard input, or from one or more files specified on the command
line. A report is written to standard output; this report can be piped or
redirected. The results of the last file processed are also saved internally,
and can be accessed through internal variables.
Note: This command was designed specifically for use with English text. I
make many Anglocentric assumptions about what constitutes a ‘word’, a
‘sentence’, a ‘paragraph’, ‘forms’ of a word, and so on. These assumptions
are probably not useful for any other language. WORDS may give strange or
undesired results when used on source code, program output, HTML, or whatnot.
If standard input (stdin) is redirected, WORDS will read from stdin before
any filenames specified on the command line. If no filenames are specified,
then WORDS will read from stdin whether it is redirected or not.

Filenames may include wildcards and directory aliases. You can search into
subdirectories for matching files with /S. @File lists and internet files are
supported. You may also specify CLIP: to count words on the clipboard.
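For example (story.txt here is only illustrative), any of the following will
feed text to the command:

C:\> words story.txt
C:\> type story.txt | words
C:\> words CLIP: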
This command’s definition of a ‘word’ is complex and subject to ongoing
tweaking. In general, though, a word may contain only letters, digits (unless
/N is specified), periods, apostrophes, and hyphens; at least one character
must be a letter. For instance, 20th, 1920s, 1969's, and post-1941 are all
considered words, but 1984 is not. The first character must be alphanumeric
or (very rarely) an apostrophe.
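For instance (names.txt is a hypothetical file), you can exclude all words
containing digits, or only those beginning with one:

C:\> words /n names.txt
C:\> words /x names.txt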
If /C is specified, words may also contain underscores and dollar signs, but
must not begin with a digit or dollar sign. /C also suppresses the count of
sentences and paragraphs in the final report.
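As an illustration (util.c is a hypothetical source file), code mode lets
identifiers such as my_var or total$ count as words:

C:\> words /c util.c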
Words that differ only in case are counted as the same word. In the phrase polish Polish furniture using Polish furniture polish, this command will find only three ‘unique’ words.
A word is counted as ‘proper’ only if it never occurs in an all-lowercase form; no proper nouns will be found in Polish polish. Acronyms like NATO will be counted as ‘proper nouns’; so will ordinary words capitalized at the start of a sentence. The latter are often common words like articles and prepositions, which tend to be weeded out in longer files as they recur midsentence.
Note that a hyphenate is always counted as a single word. Without a dictionary, the command has no way of knowing whether it is composed of actual words (red-eye, half-baked) or not (pre-K, Wi-Fi).
WORDS also gives counts of sentences, paragraphs, lines, characters, and
bytes. All counts should be viewed as estimates rather than gospel truth. The
sentence count in particular must be taken with a healthy dose of salt; the
command has no good way to determine whether a period ends an abbreviation, a
sentence, or both.
A line, or a series of lines, which contains one or more sentences is counted as a ‘paragraph’. A line or series of lines which contains one or more words, but no recognized sentences, is instead counted as a ‘title’. It might actually be a title, subtitle, or chapter heading; or it might be a byline, date line, attribution, salutation, signature, line of poetry….
The number of lines reported may differ from the number of carriage returns or line feeds in the text, e.g. if the last line in the file is not terminated. A line containing only whitespace characters (spaces and tabs) is considered blank. The character and byte counts do not include any Unicode byte-order mark at the beginning of the file.
Split words: If a hyphenated word is split across a line break, WORDS will
reassemble it and treat it as a single word. By default, the hyphen is
dropped — the command has no way of knowing whether a hyphenated compound
word was broken at a hyphen, or whether a normal word was divided between
syllables and a hyphen added. The latter seems more common, and I wanted to
avoid cluttering the vocabulary list with differently hyphenated versions of
the same word. If /K is specified, the command will instead retain hyphens
when reassembling words broken at the end of a line. This option may cause a
larger number of ‘unique’ words to be reported.
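For example, to keep the hyphens in words reassembled from a hypothetical
manuscript.txt:

C:\> words /k manuscript.txt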
Vocabularies: In order to count unique words and ‘proper nouns’, WORDS must
build a list of all words found. Building this list can slow down the process
and use a good deal of memory if the text file involved is large. /U:mode
controls the vocabulary lists.

/U:0 disables vocabularies; the command executes faster, but there will be no
counts of unique and proper words. /U:1 causes WORDS to build a vocabulary
list for each file it processes; this is the default behavior. /U:2 builds a
combined vocabulary for all files that WORDS processes; this is slower than
the default. Finally, /U:3 builds a vocabulary for each file that WORDS
reads, and at the same time builds a master vocabulary for all files
together; this is much slower than the default behavior, and devours memory
shamelessly.

If you are processing extremely large text files, or files which are not
English prose — e.g. output from a program or command — I strongly recommend
using /U:0 to disable vocabulary lists.
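For example, to count a large non-prose file (log.txt is hypothetical)
without building a vocabulary:

C:\> words /u:0 log.txt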
Dump: If /D is specified, the vocabulary for each file will be dumped to
stdout. If /D is combined with /U:2, you’ll instead get a combined vocabulary
for all files. The list is sorted by frequency, with more common words
appearing first. Note that words may be shown in a different case than they
appear in the input text. This is because the command stores all words in
lowercase internally for speed (lowercase letters are more streamlined).
Text format: Text files use line-break characters in different ways. In some
files, line-break characters are used only to mark where a line end should
occur: the end of a paragraph. In other files, line breaks are used to wrap
text to some desired width. You can use /F:n to tell WORDS how to handle line
breaks. /F:1 indicates that the text is unformatted, with line breaks only at
the ends of paragraphs. WORDS will honor all line breaks, and add an extra
blank line after each paragraph. /F:2 means that the input text is
prewrapped, having line breaks within paragraphs and even within sentences.
WORDS will skip single line breaks, honoring only sequences of two or more in
a row. /F:3 is also for unformatted text and acts like /F:1, but does not
insert a blank line after each paragraph. If you specify /F:0 or do not
specify any /F:n, WORDS will attempt to guess how the input text is
formatted. (Guessing is not reliable when there isn’t much input text.)
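For example, to force prewrapped handling of a hypothetical wrapped.txt
rather than relying on the guess:

C:\> words /f:2 wrapped.txt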
Text encoding: WORDS automatically detects Unicode text files. If the file is
not Unicode, the command has no way of detecting the character encoding; the
default Windows code page is assumed. You can specify a different code page
for non-Unicode text files with /CP:n. Most single-byte (i.e., alphabetic)
code pages are supported, but multibyte code pages (Chinese, Japanese,
Korean) are not. This option only affects non-Unicode files.
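For example, to read a non-Unicode file saved in the Windows Western European
code page (legacy.txt is hypothetical):

C:\> words /cp:1252 legacy.txt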
Disabling features: /N with suboptions disables features:

/NB | do not write a byte-order mark |
/NC | disable highlighting |
/ND | do not search into hidden directories; only useful with /S |
/NF | suppress the file-not-found error |
/NJ | do not search into junctions; only useful with /S |
/NZ | do not search into system directories; only useful with /S |

You can combine these, e.g. /NDJ.
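For example, to search subdirectories while skipping hidden directories and
junctions (*.txt is only illustrative):

C:\> words /s /ndj *.txt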
C:\> type EBS.txt
This is a test. For the next sixty seconds, this station will conduct a test
of the Emergency Broadcast System. This is only a test.
C:\> words /d EBS.txt
File "C:\EBS.txt" :
25 words total, 17 unique, 4 proper. 25 runs of non-blanks.
3 sentences total: 3. 0! 0? Average sentence 8.3 words.
1 paragraph, 0 titles. Average paragraph 3.0 sentences.
2 lines total, 2 not blank; the longest had 77 characters.
137 characters in 137 bytes (OEM, prewrapped).
3: a test this
2: is the
1: Broadcast conduct Emergency For next of only seconds sixty station System will
C:\>
The results from the last file processed are saved, and can be accessed using these internal variables:
_WORDS | _UNIQUEWORDS | _PROPERNOUNS | _WC |
_SENTENCES | _SENTENCESD | _SENTENCESE | _SENTENCESQ |
_SENTENCEWORDS | _PARAGRAPHS | _TITLES | |
_LINES | _NONBLANKLINES | _LONGESTLINE | _CHARACTERS |
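For example (report.txt is hypothetical, and I assume the shell’s usual
%name syntax for expanding internal variables), you could print a quick
summary after a run:

C:\> words /u:1 report.txt > nul
C:\> echo %_words words, %_sentences sentences, %_paragraphs paragraphs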
The cumulative results from all files processed by the last invocation of
WORDS can be accessed through these variables: