WORDS — Count words, sentences, and paragraphs in English text.

Syntax:
WORDS /A:attribs /C /CP:n /D /F:fmt /K /M:n /N /S /U:mode /X filename…

/A:attribs  attributes mask; valid flags are -ACEHIORS
/C          code mode; words may contain underscores and dollar signs
/CP:n       interpret non-Unicode input text using code page n
/D          dumps lists of unique words, sorted by frequency
/F:fmt      specifies the format for input text; fmt is one of:
   0 — best guess (default)
   1 — unformatted (line breaks are used only to end paragraphs)
   2 — prewrapped (line breaks are used to wrap text)
/K          keeps hyphens when reassembling split words
/M:n        minimum number of letters in a word
/N          by itself: no words containing digits
/N          with suboptions: disable features
/S          search in subdirectories for matching files
/U:mode     controls the counting of unique words; mode is one of:
   0 — do not count unique words (faster for large files)
   1 — count unique words for each file individually (the default)
   2 — count unique words for all files together (slower)
   3 — separate counts for each file and for all files together (double oink!)
/X          no words beginning with a digit
Range options are also supported.

WORDS counts words, sentences, and paragraphs in English text. It can read text from standard input, or from one or more files specified on the command line. A report is written to standard output; this report can be piped or redirected. The results of the last file processed are also saved internally, and can be accessed through internal variables.

Note:  This command was designed specifically for use with English text. I make many Anglocentric assumptions about what constitutes a ‘word’, a ‘sentence’, a ‘paragraph’, ‘forms’ of a word, and so on. These assumptions are probably not useful for any other language. WORDS may give strange or undesired results when used on source code, program output, HTML, or whatnot.

If standard input (stdin) is redirected, WORDS will read from stdin before any filenames specified on the command line. If no filenames are specified, then WORDS will read from stdin whether it is redirected or not. Filenames may include wildcards and directory aliases. You can search into subdirectories for matching files with /S. @File lists and internet files are supported. You may also specify CLIP: to count words on the clipboard.

This command’s definition of a ‘word’ is complex and subject to ongoing tweaking. In general, though, a word may contain only letters, digits (unless /N is specified), periods, apostrophes, and hyphens; at least one character must be a letter. For instance, 20th, 1920s, 1969's, and post-1941 are all considered words, but 1984 is not. The first character must be alphanumeric or (very rarely) an apostrophe.
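
The rule above can be sketched in Python. This is a rough approximation for illustration only, not the command's actual code; the pattern and helper name are my own.

```python
import re

# Approximation of the 'word' rule described above: letters, digits,
# periods, apostrophes, and hyphens are allowed; the first character
# must be alphanumeric or an apostrophe; and at least one character
# must be a letter.
WORD = re.compile(r"['A-Za-z0-9][A-Za-z0-9.'-]*")

def is_word(token):
    return bool(WORD.fullmatch(token)) and any(c.isalpha() for c in token)

for t in ["20th", "1920s", "1969's", "post-1941", "1984"]:
    print(t, is_word(t))
```

With this sketch, 20th, 1920s, 1969's, and post-1941 all qualify, while 1984 fails the at-least-one-letter test, matching the examples above.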

If /C is specified, words may also contain underscores and dollar signs, but must not begin with a digit or dollar sign. /C also suppresses the count of sentences and paragraphs in the final report.

Words that differ only in case are counted as the same word. In the phrase polish Polish furniture using Polish furniture polish, this command will find only three ‘unique’ words.
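
Case-folding before counting, sketched in Python (my own illustration, not the command's implementation):

```python
from collections import Counter

# Words that differ only in case are folded together before counting.
phrase = "polish Polish furniture using Polish furniture polish"
vocab = Counter(w.lower() for w in phrase.split())
print(len(vocab))   # 3 unique words: polish, furniture, using
```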

A word is counted as ‘proper’ only if it never occurs in an all-lowercase form; no proper nouns will be found in Polish polish. Acronyms like NATO will be counted as ‘proper nouns’; so will ordinary words capitalized at the start of a sentence. The latter are often common words like articles and prepositions, which tend to be weeded out in longer files as they recur midsentence.

Note that a hyphenate is always counted as a single word. Without a dictionary, the command has no way of knowing whether it is composed of actual words (red-eye, half-baked) or not (pre-K, Wi-Fi).

WORDS also gives counts of sentences, paragraphs, lines, characters, and bytes. All counts should be viewed as estimates rather than gospel truth. The sentences count in particular must be taken with a healthy dose of salt; the command has no good way to determine whether a period ends an abbreviation, a sentence, or both.
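
To see why the sentence count is only an estimate, consider the naive approach of counting terminators (my own sketch, using the sample text from the example below):

```python
import re

text = ("This is a test.  For the next sixty seconds, this station will "
        "conduct a test of the Emergency Broadcast System.  "
        "This is only a test.")

# Counting terminators works here, but an abbreviation such as "Mr."
# would be miscounted as a sentence end -- exactly the ambiguity
# described above.
terminators = re.findall(r"[.!?]", text)
print(len(terminators))   # 3
```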

A line, or a series of lines, which contains one or more sentences is counted as a ‘paragraph’. A line or series of lines which contains one or more words, but no recognized sentences, is instead counted as a ‘title’. It might actually be a title, subtitle, or chapter heading; or it might be a byline, date line, attribution, salutation, signature, line of poetry….
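
The paragraph/title distinction reduces to a simple test, sketched here (a simplification of whatever the command actually does to recognize sentences):

```python
def classify(block):
    # A block containing at least one recognized sentence is a
    # 'paragraph'; a block with words but no sentence terminator
    # is a 'title'.
    if any(t in block for t in ".!?"):
        return "paragraph"
    return "title"

print(classify("Chapter One"))            # title
print(classify("It was a dark night."))   # paragraph
```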

The number of lines reported may differ from the number of carriage returns or line feeds in the text, e.g. if the last line in the file is not terminated. A line containing only whitespace characters (spaces and tabs) is considered blank. The character and byte counts do not include any Unicode byte-order mark at the beginning of the file.

Split words: If a hyphenated word is split across a line break, WORDS will reassemble it and treat it as a single word. By default, the hyphen is dropped — the command has no way of knowing whether a hyphenated compound word was broken at a hyphen, or whether a normal word was divided between syllables and a hyphen added. The latter seems more common, and I wanted to avoid cluttering the vocabulary list with differently hyphenated versions of the same word. If /K is specified, the command will instead retain hyphens when reassembling words broken at the end of a line. This option may cause a larger number of ‘unique’ words to be reported.
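
The reassembly and the /K behavior can be sketched like so (my own illustration; the function name and logic are assumptions, not the command's code):

```python
def reassemble(lines, keep_hyphen=False):
    # Rejoin words split across line breaks. The hyphen is dropped by
    # default; with keep_hyphen=True (the /K behavior) it is retained.
    text = ""
    for line in lines:
        if line.endswith("-"):
            text += line if keep_hyphen else line[:-1]
        else:
            text += line + " "
    return text.strip()

lines = ["the experi-", "ment succeeded"]
print(reassemble(lines))                    # the experiment succeeded
print(reassemble(lines, keep_hyphen=True))  # the experi-ment succeeded
```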

Vocabularies: In order to count unique words and ‘proper nouns’, WORDS must build a list of all words found. Building this list can slow down the process and use a good deal of memory if the text file involved is large. /U:mode controls the vocabulary lists. /U:0 disables vocabularies; the command executes faster, but there will be no counts of unique and proper words. /U:1 causes WORDS to build a vocabulary list for each file it processes; this is the default behavior. /U:2 builds a combined vocabulary for all files that WORDS processes; this is slower than the default. Finally, /U:3 builds a vocabulary for each file that WORDS reads, and at the same time builds a master vocabulary for all files together; this is much slower than the default behavior, and devours memory shamelessly.

If you are processing extremely large text files, or files which are not English prose — e.g. output from a program or command — I strongly recommend using /U:0 to disable vocabulary lists.
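
The difference between per-file and combined vocabularies (/U:1 versus /U:2) amounts to the following, sketched with hypothetical file contents:

```python
from collections import Counter

files = {
    "a.txt": "red fish blue fish",
    "b.txt": "one fish two fish",
}

# /U:1 - one vocabulary per file; /U:2 - one combined vocabulary;
# /U:3 - both at once, at a cost in time and memory.
per_file = {name: Counter(text.lower().split()) for name, text in files.items()}
combined = Counter()
for vocab in per_file.values():
    combined += vocab

print(len(per_file["a.txt"]))   # 3 unique words in a.txt
print(len(combined))            # 5 unique words across both files
```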

Dump: If /D is specified, the vocabulary for each file will be dumped to stdout. If /D is combined with /U:2, you’ll instead get a combined vocabulary for all files. The list is sorted by frequency, with more common words appearing first. Note that words may be shown in a different case than they appear in the input text. This is because the command stores all words in lowercase internally for speed (lowercase letters are more streamlined).
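
The frequency-grouped layout of the dump can be reproduced in a few lines (a sketch; the grouping and tie-breaking order are my assumptions):

```python
from collections import Counter
from itertools import groupby

text = "a test this a test this a test is the is the"
vocab = Counter(text.split())

# Group words by frequency, most common first, like the /D dump.
items = sorted(vocab.items(), key=lambda kv: (-kv[1], kv[0]))
for count, group in groupby(items, key=lambda kv: kv[1]):
    print(f"{count}:  " + " ".join(word for word, _ in group))
```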

Text format: Text files use line-break characters in different ways. In some files, a line break marks only a true line ending: the end of a paragraph. In other files, line breaks are used to wrap text to some desired width. You can use /F:fmt to tell WORDS how to interpret line breaks. /F:1 indicates that the text is unformatted, with line breaks only at the ends of paragraphs; every line break ends a paragraph. /F:2 means that the input text is prewrapped, with line breaks within paragraphs and even within sentences; WORDS treats a single line break as wrapping, and only a sequence of two or more as a paragraph break. If you specify /F:0, or do not specify any /F:fmt, WORDS will attempt to guess how the input text is formatted. (Guessing is not reliable when there isn’t much input text.)
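
The two interpretations of line breaks can be sketched as follows (my own illustration of the /F:1 and /F:2 behaviors described above):

```python
import re

text = "A paragraph wrapped\nonto two lines.\n\nA second paragraph."

# /F:1 - unformatted text: every line break ends a paragraph.
unformatted = [p for p in text.split("\n") if p.strip()]

# /F:2 - prewrapped text: only runs of two or more line breaks end a
# paragraph; single breaks merely wrap lines.
prewrapped = [p for p in re.split(r"\n{2,}", text) if p.strip()]

print(len(unformatted))   # 3 paragraphs if every break counts
print(len(prewrapped))    # 2 paragraphs if single breaks are wrapping
```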

Text encoding: WORDS automatically detects Unicode text files. If the file is not Unicode, the command has no way of detecting the character encoding; the default Windows code page is assumed. You can specify a different code page for non-Unicode text files with /CP:n. Most single-byte (i.e., alphabetic) code pages are supported, but multibyte code pages (Chinese, Japanese, Korean) are not. This option only affects non-Unicode files.

Disabling features: /N with suboptions disables features:

/NB  do not write a Byte Order Mark
/NC  disable highlight
/ND  do not search into hidden directories; only useful with /S
/NF  suppress the file-not-found error
/NJ  do not search into junctions; only useful with /S
/NZ  do not search into system directories; only useful with /S

You can combine these, e.g. /NDJ.


C:\> type EBS.txt
This is a test.  For the next sixty seconds, this station will conduct a test
of the Emergency Broadcast System.  This is only a test.

C:\> words /d EBS.txt

File "C:\EBS.txt" :
  25 words total, 17 unique, 4 proper.  25 runs of non-blanks.
  3 sentences total:  3.  0!  0?   Average sentence 8.3 words.
  1 paragraph, 0 titles.  Average paragraph 3.0 sentences.
  2 lines total, 2 not blank; the longest had 77 characters.
  137 characters in 137 bytes (OEM, prewrapped).

3:  a test this
2:  is the
1:  Broadcast conduct Emergency For next of only seconds sixty station System will

C:\>


The results from the last file processed are saved, and can be accessed using these internal variables:

_WORDS          _UNIQUEWORDS     _PROPERNOUNS    _WC
_SENTENCES      _SENTENCESD      _SENTENCESE     _SENTENCESQ
_SENTENCEWORDS  _PARAGRAPHS      _TITLES
_LINES          _NONBLANKLINES   _LONGESTLINE    _CHARACTERS

The cumulative results from all files processed by the last invocation of WORDS can be accessed through these variables:

_WORDSALL          _UNIQUEWORDSALL     _PROPERNOUNSALL    _WCALL
_SENTENCESALL      _SENTENCESDALL      _SENTENCESEALL     _SENTENCESQALL
_SENTENCEWORDSALL  _PARAGRAPHSALL      _TITLESALL         _WORDFILES
_LINESALL          _NONBLANKLINESALL   _LONGESTLINEALL    _CHARACTERSALL