Annotation of src/usr.bin/sort/sort.1, Revision 1.29
1.29 ! jmc 1: .\" $OpenBSD: sort.1,v 1.28 2007/05/30 04:41:34 jmc Exp $
1.1 millert 2: .\"
3: .\" Copyright (c) 1991, 1993
4: .\" The Regents of the University of California. All rights reserved.
5: .\"
6: .\" This code is derived from software contributed to Berkeley by
7: .\" the Institute of Electrical and Electronics Engineers, Inc.
8: .\"
9: .\" Redistribution and use in source and binary forms, with or without
10: .\" modification, are permitted provided that the following conditions
11: .\" are met:
12: .\" 1. Redistributions of source code must retain the above copyright
13: .\" notice, this list of conditions and the following disclaimer.
14: .\" 2. Redistributions in binary form must reproduce the above copyright
15: .\" notice, this list of conditions and the following disclaimer in the
16: .\" documentation and/or other materials provided with the distribution.
1.20 millert 17: .\" 3. Neither the name of the University nor the names of its contributors
1.1 millert 18: .\" may be used to endorse or promote products derived from this software
19: .\" without specific prior written permission.
20: .\"
21: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
22: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
24: .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
25: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
27: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
28: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
29: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
30: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
31: .\" SUCH DAMAGE.
32: .\"
33: .\" @(#)sort.1 8.1 (Berkeley) 6/6/93
34: .\"
1.29 ! jmc 35: .Dd $Mdocdate$
1.1 millert 36: .Dt SORT 1
37: .Os
38: .Sh NAME
39: .Nm sort
40: .Nd sort or merge text files
41: .Sh SYNOPSIS
42: .Nm sort
1.23 jmc 43: .Op Fl bcdfHimnruz
44: .Sm off
1.24 jmc 45: .Op Fl k\ \& Ar field1 Op , Ar field2
1.23 jmc 46: .Sm on
47: .Op Fl o Ar output
48: .Op Fl R Ar char
49: .Bk -words
1.1 millert 50: .Op Fl T Ar dir
1.23 jmc 51: .Ek
52: .Op Fl t Ar char
53: .Op Ar file ...
1.1 millert 54: .Sh DESCRIPTION
55: The
1.8 aaron 56: .Nm
1.12 aaron 57: utility sorts text files by lines.
1.1 millert 58: Comparisons are based on one or more sort keys extracted
1.8 aaron 59: from each line of input, and are performed lexicographically.
60: By default, if keys are not given,
61: .Nm
1.1 millert 62: regards each input line as a single field.
63: .Pp
1.7 aaron 64: The options are as follows:
1.21 jmc 65: .Bl -tag -width Ds
1.1 millert 66: .It Fl c
67: Check that the single input file is sorted.
68: If the file is not sorted,
1.8 aaron 69: .Nm
1.12 aaron 70: produces the appropriate error messages and exits with code 1; otherwise,
1.8 aaron 71: .Nm
1.1 millert 72: returns 0.
1.8 aaron 73: .Nm
1.1 millert 74: .Fl c
1.6 pjanzen 75: produces no output, except the error messages on
76: .Em stderr .
1.1 millert 77: .It Fl m
78: Merge only; the input files are assumed to be pre-sorted.
79: .It Fl o Ar output
80: The argument given is the name of an
81: .Ar output
1.12 aaron 82: file to be used instead of the standard output.
83: This file can be the same as one of the input files.
1.1 millert 84: .It Fl T Ar dir
85: Use
86: .Ar dir
1.8 aaron 87: as the directory for temporary files.
88: The default is the contents of the environment variable
1.1 millert 89: .Ev TMPDIR
90: or
91: .Pa /var/tmp
92: if
93: .Ev TMPDIR
94: is not set.
95: .It Fl u
1.12 aaron 96: Unique: suppress all but one in each set of lines having equal keys.
1.1 millert 97: If used with the
98: .Fl c
1.26 jmc 99: option, also check that there are no lines with duplicate keys.
1.1 millert 100: .El
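As a rough illustration of the merge and unique options described above, the sketch below merges two pre-sorted files with -m and then repeats the merge with -u. The sample data, the variable names, the scratch directory made with mktemp, and the use of LC_ALL=C to pin byte-order comparison are all assumptions made for the example, not part of this manual.

```shell
# Scratch directory holding two inputs that are already sorted (assumed data).
tmp=$(mktemp -d)
printf 'apple\npear\n' > "$tmp/a"
printf 'fig\npear\n' > "$tmp/b"

# -m merges without re-sorting; both inputs must already be in order.
merged=$(LC_ALL=C sort -m "$tmp/a" "$tmp/b")

# Adding -u keeps only one line from each set of lines with equal keys.
merged_unique=$(LC_ALL=C sort -m -u "$tmp/a" "$tmp/b")

rm -rf "$tmp"
printf '%s\n' "$merged_unique"
```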
101: .Pp
102: The following options override the default ordering rules.
103: When ordering options appear independent of key field
104: specifications, the requested field ordering rules are
105: applied globally to all sort keys.
106: When attached to a specific key (see
107: .Fl k ) ,
108: the ordering options override
109: all global ordering options for that key.
110: .Bl -tag -width indent
111: .It Fl d
112: Only blank space and alphanumeric characters
113: .\" according
114: .\" to the current setting of LC_CTYPE
1.12 aaron 115: are used in making comparisons.
1.1 millert 116: .It Fl f
117: Treats lowercase characters that have uppercase equivalents
1.12 aaron 118: as identical to those equivalents for purposes of comparison.
1.23 jmc 119: .It Fl H
120: Use a merge sort instead of a radix sort.
121: This option should be used for files larger than 60MB.
1.1 millert 122: .It Fl i
123: Ignore all non-printable characters.
124: .It Fl n
1.12 aaron 125: An initial numeric string, consisting of optional blank space, optional
126: minus sign, and zero or more digits (including a decimal point)
1.1 millert 127: .\" with
128: .\" optional radix character and thousands
129: .\" separator
130: .\" (as defined in the current locale),
131: is sorted by arithmetic value.
132: (The
133: .Fl n
1.12 aaron 134: option no longer implies the
1.1 millert 135: .Fl b
136: option.)
137: .It Fl r
138: Reverse the sense of comparisons.
139: .El
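The difference between the default lexicographic ordering and the -n and -r options can be seen with a small sketch. The input values and variable names are made up for illustration, and LC_ALL=C is assumed so that the default comparison is by byte value.

```shell
# Assumed sample input: three integers, one per line.
nums=$(printf '10\n9\n100\n')

# Default: lexicographic comparison, so "10" and "100" sort before "9".
lexical=$(printf '%s\n' "$nums" | LC_ALL=C sort)

# -n: compare the initial numeric string by arithmetic value.
numeric=$(printf '%s\n' "$nums" | LC_ALL=C sort -n)

# -n with -r: arithmetic comparison with the sense reversed.
reversed=$(printf '%s\n' "$nums" | LC_ALL=C sort -nr)

printf '%s\n' "$numeric"
```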
140: .Pp
1.12 aaron 141: The treatment of field separators can be altered using these options:
1.1 millert 142: .Bl -tag -width indent
143: .It Fl b
144: Ignores leading blank space when determining the start
145: and end of a restricted sort key.
146: A
147: .Fl b
148: option specified before the first
149: .Fl k
150: option applies globally to all
151: .Fl k
152: options.
153: Otherwise, the
154: .Fl b
1.12 aaron 155: option can be attached independently to each
1.1 millert 156: .Ar field
157: argument of the
158: .Fl k
159: option (see below).
160: Note that the
161: .Fl b
1.12 aaron 162: option has no effect unless key fields are specified.
1.23 jmc 163: .It Xo
164: .Sm off
165: .Fl k\ \& Ar field1 Op , Ar field2
166: .Sm on
167: .Xc
168: Designates the starting position,
169: .Ar field1 ,
170: and optional ending position,
171: .Ar field2 ,
172: of a key field.
1.25 jmc 173: The
174: .Fl k
175: option may be specified multiple times,
176: in which case subsequent keys are compared after earlier keys compare equal.
1.23 jmc 177: The
178: .Fl k
179: option replaces the obsolescent options
180: .Cm \(pl Ns Ar pos1
181: and
182: .Fl Ns Ar pos2 .
183: .It Fl R Ar char
184: .Ar char
185: is used as the record separator character.
186: This should be used with discretion;
187: .Fl R Aq Ar alphanumeric
188: usually produces undesirable results.
189: The default record separator is newline.
1.1 millert 190: .It Fl t Ar char
1.3 aaron 191: .Ar char
1.8 aaron 192: is used as the field separator character.
193: The initial
1.1 millert 194: .Ar char
1.12 aaron 195: is not considered to be part of a field when determining key offsets.
1.1 millert 196: Each occurrence of
197: .Ar char
198: is significant (for example,
199: .Dq Ar charchar
200: delimits an empty field).
201: If
202: .Fl t
1.6 pjanzen 203: is not specified, the default field separator is a sequence of
204: blank-space characters, and consecutive blank spaces do
205: .Em not
206: delimit an empty field; further, the initial blank space
207: .Em is
208: considered part of a field when determining key offsets.
1.22 dlg 209: .It Fl z
210: Uses the NUL character as the record separator.
1.1 millert 211: .El
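A minimal sketch of -t combined with multiple -k options follows. The colon-separated records and the variable names are hypothetical; LC_ALL=C is assumed for reproducible ordering. The second field is compared numerically first, and the first field is consulted only when the second fields compare equal.

```shell
# Assumed sample records of the form name:age.
records=$(printf 'carol:30\nalice:25\nbob:30\n')

# -t : makes the colon the field separator.
# -k 2,2n restricts the first key to field 2 and compares it numerically.
# -k 1,1 is consulted only for lines whose first key compares equal.
by_age_then_name=$(printf '%s\n' "$records" | LC_ALL=C sort -t : -k 2,2n -k 1,1)

printf '%s\n' "$by_age_then_name"
```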
212: .Pp
213: The following operands are available:
214: .Bl -tag -width indent
1.3 aaron 215: .It Ar file
216: The pathname of a file to be sorted, merged, or checked.
217: If no
1.1 millert 218: .Ar file
1.12 aaron 219: operands are specified, or if a
1.3 aaron 220: .Ar file
221: operand is
1.1 millert 222: .Fl ,
223: the standard input is used.
1.3 aaron 224: .El
1.1 millert 225: .Pp
1.12 aaron 226: A field is defined as a maximal sequence of characters other than the
1.6 pjanzen 227: field separator and record separator
228: .Pq newline by default .
229: Initial blank spaces are included in the field unless
230: .Fl b
231: has been specified;
232: the first blank space of a sequence of blank spaces acts as the field
233: separator and is included in the field (unless
234: .Fl t
235: is specified).
236: For example, by default all blank spaces at the beginning of a line are
237: considered to be part of the first field.
1.1 millert 238: .Pp
1.12 aaron 239: Fields are specified by the
1.23 jmc 240: .Sm off
241: .Fl k\ \& Ar field1 Op , Ar field2
242: .Sm on
1.8 aaron 243: argument.
244: A missing
1.1 millert 245: .Ar field2
246: argument defaults to the end of a line.
247: .Pp
248: The arguments
249: .Ar field1
250: and
251: .Ar field2
252: have the form
253: .Em m.n
1.6 pjanzen 254: .Em (m,n > 0)
255: and can be followed by one or more of the letters
256: .Cm b , d , f , i ,
1.10 aaron 257: .Cm n ,
1.6 pjanzen 258: and
259: .Cm r ,
260: which correspond to the options discussed above.
1.1 millert 261: A
262: .Ar field1
263: position specified by
264: .Em m.n
265: is interpreted as the
266: .Em n Ns th
1.6 pjanzen 267: character from the beginning of the
1.1 millert 268: .Em m Ns th
269: field.
270: A missing
271: .Em \&.n
272: in
273: .Ar field1
274: means
275: .Ql \&.1 ,
276: indicating the first character of the
277: .Em m Ns th
1.12 aaron 278: field; if the
1.1 millert 279: .Fl b
280: option is in effect,
281: .Em n
1.12 aaron 282: is counted from the first non-blank character in the
1.1 millert 283: .Em m Ns th
284: field;
285: .Em m Ns \&.1b
1.12 aaron 286: refers to the first non-blank character in the
1.1 millert 287: .Em m Ns th
288: field.
1.6 pjanzen 289: .No 1\&. Ns Em n
290: refers to the
291: .Em n Ns th
292: character from the beginning of the line;
293: if
294: .Em n
295: is greater than the length of the line, the field is taken to be empty.
1.1 millert 296: .Pp
297: A
298: .Ar field2
299: position specified by
300: .Em m.n
1.12 aaron 301: is interpreted as the
1.1 millert 302: .Em n Ns th
303: character (including separators) of the
304: .Em m Ns th
305: field.
306: A missing
307: .Em \&.n
1.5 aaron 308: indicates the last character of the
1.1 millert 309: .Em m Ns th
310: field;
1.5 aaron 311: .Em m
1.1 millert 312: = \&0
313: designates the end of a line.
314: Thus the option
315: .Fl k Ar v.x,w.y
316: is synonymous with the obsolescent option
317: .Cm \(pl Ns Ar v-\&1.x-\&1
318: .Fl Ns Ar w-\&1.y ;
319: when
320: .Em y
321: is omitted,
322: .Fl k Ar v.x,w
323: is synonymous with
1.5 aaron 324: .Cm \(pl Ns Ar v-\&1.x-\&1
1.19 tdeval 325: .Fl Ns Ar w\&.0 .
1.1 millert 326: The obsolescent
327: .Cm \(pl Ns Ar pos1
328: .Fl Ns Ar pos2
329: option is still supported, except for
1.3 aaron 330: .Fl Ns Ar w\&.0b ,
1.1 millert 331: which has no
332: .Fl k
333: equivalent.
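The m.n form can be sketched with lines that contain no blanks, so that each whole line is field 1 and 1.n counts characters from the beginning of the line. The identifiers below are invented for the example, and LC_ALL=C is assumed. The key runs from the 4th through the 7th character of field 1, so the two-letter prefix and the hyphen play no part in the comparison.

```shell
# Assumed sample input: a two-letter prefix, a hyphen, and four digits.
ids=$(printf 'xb-0042\nxa-0100\nxc-0007\n')

# Whole-line comparison orders by the prefix.
by_line=$(printf '%s\n' "$ids" | LC_ALL=C sort)

# -k 1.4,1.7 starts the key at character 4 of field 1 and ends it at
# character 7 of field 1, so only the digits are compared.
by_digits=$(printf '%s\n' "$ids" | LC_ALL=C sort -k 1.4,1.7)

printf '%s\n' "$by_digits"
```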
1.8 aaron 334: .Pp
335: The
336: .Nm
337: utility shall exit with one of the following values:
338: .Pp
339: .Bl -tag -width flag -compact
340: .It 0
341: Normal behavior.
342: .It 1
343: On disorder (or non-uniqueness) with the
344: .Fl c
345: option.
346: .It 2
347: An error occurred.
348: .El
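The exit values listed above can be observed with the -c option, as in this sketch. The inputs and variable names are assumptions for illustration, LC_ALL=C is assumed, and error messages are discarded so that only the exit status is examined.

```shell
# Sorted input: -c succeeds with status 0.
sorted_status=0
printf 'a\nb\nc\n' | LC_ALL=C sort -c || sorted_status=$?

# Out-of-order input: -c reports disorder with status 1.
unsorted_status=0
printf 'b\na\n' | LC_ALL=C sort -c 2>/dev/null || unsorted_status=$?

# Sorted input with a duplicate line: -c with -u also reports status 1.
dup_status=0
printf 'a\na\n' | LC_ALL=C sort -c -u 2>/dev/null || dup_status=$?

printf '%s %s %s\n' "$sorted_status" "$unsorted_status" "$dup_status"
```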
1.1 millert 349: .Sh ENVIRONMENT
350: .Bl -tag -width Fl
351: .It Ev TMPDIR
1.3 aaron 352: Path in which to store temporary files.
353: Note that
1.1 millert 354: .Ev TMPDIR
355: may be overridden by the
356: .Fl T
357: option.
1.11 aaron 358: .El
1.1 millert 359: .Sh FILES
360: .Bl -tag -width Pa -compact
361: .It Pa /var/tmp/sort.*
1.3 aaron 362: default temporary files
1.1 millert 363: .It Pa Ar output Ns #PID
1.3 aaron 364: temporary name for
1.1 millert 365: .Ar output
366: if
367: .Ar output
1.3 aaron 368: already exists
1.1 millert 369: .El
370: .Sh SEE ALSO
371: .Xr comm 1 ,
1.3 aaron 372: .Xr join 1 ,
1.18 fgsch 373: .Xr uniq 1 ,
374: .Xr radixsort 3
1.27 dlg 375: .Sh STANDARDS
376: The
377: .Nm
1.28 jmc 378: utility is compliant with the
379: .St -p1003.1-2004
1.27 dlg 380: specification.
381: .Pp
382: The flags
383: .Op Fl HRTz
1.28 jmc 384: are extensions to that specification.
1.1 millert 385: .Sh HISTORY
386: A
1.8 aaron 387: .Nm
1.1 millert 388: command appeared in
1.16 mickey 389: .At v3 .
1.1 millert 390: .Sh NOTES
1.14 ericj 391: .Nm
392: has no limit on input line length (other than that imposed by available
393: memory) and no restriction on the bytes allowed within lines.
394: .Pp
395: To protect data,
396: .Nm
397: .Fl o
398: calls
399: .Xr link 2
400: and
401: .Xr unlink 2 ,
402: and thus fails on protected directories.
403: .Pp
1.1 millert 404: The current sort command uses lexicographic radix sorting, which requires
1.12 aaron 405: that sort keys be kept in memory (as opposed to previous versions which
406: used quick and merge sorts and did not).
1.1 millert 407: Thus performance depends highly on efficient choice of sort keys, and the
408: .Fl b
409: option and the
410: .Ar field2
411: argument of the
412: .Fl k
413: option should be used whenever possible.
414: Similarly,
1.8 aaron 415: .Nm
1.1 millert 416: .Fl k1f
417: is equivalent to
1.8 aaron 418: .Nm
1.1 millert 419: .Fl f
420: and may take twice as long.
1.12 aaron 421: .Sh BUGS
422: To sort files larger than 60MB, use
423: .Nm
424: .Fl H ;
425: files larger than 704MB must be sorted in smaller pieces, then merged.