Annotation of src/usr.bin/sort/sort.1, Revision 1.49
1.49 ! jmc 1: .\" $OpenBSD: sort.1,v 1.48 2015/03/21 21:19:25 jmc Exp $
1.1 millert 2: .\"
3: .\" Copyright (c) 1991, 1993
4: .\" The Regents of the University of California. All rights reserved.
5: .\"
6: .\" This code is derived from software contributed to Berkeley by
7: .\" the Institute of Electrical and Electronics Engineers, Inc.
8: .\"
9: .\" Redistribution and use in source and binary forms, with or without
10: .\" modification, are permitted provided that the following conditions
11: .\" are met:
12: .\" 1. Redistributions of source code must retain the above copyright
13: .\" notice, this list of conditions and the following disclaimer.
14: .\" 2. Redistributions in binary form must reproduce the above copyright
15: .\" notice, this list of conditions and the following disclaimer in the
16: .\" documentation and/or other materials provided with the distribution.
1.20 millert 17: .\" 3. Neither the name of the University nor the names of its contributors
1.1 millert 18: .\" may be used to endorse or promote products derived from this software
19: .\" without specific prior written permission.
20: .\"
21: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
22: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
24: .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
25: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
27: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
28: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
29: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
30: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
31: .\" SUCH DAMAGE.
32: .\"
33: .\" @(#)sort.1 8.1 (Berkeley) 6/6/93
34: .\"
1.47 jmc 35: .Dd $Mdocdate: March 21 2015 $
1.1 millert 36: .Dt SORT 1
37: .Os
38: .Sh NAME
39: .Nm sort
1.41 millert 40: .Nd sort, merge, or sequence check text and binary files
1.1 millert 41: .Sh SYNOPSIS
42: .Nm sort
1.43 jmc 43: .Op Fl bCcdfgHhiMmnRrsuVz
1.42 jmc 44: .Op Fl k Ar field1 Ns Op , Ns Ar field2
1.23 jmc 45: .Op Fl o Ar output
1.42 jmc 46: .Op Fl S Ar size
1.1 millert 47: .Op Fl T Ar dir
1.23 jmc 48: .Op Fl t Ar char
1.34 sobrado 49: .Op Ar
1.1 millert 50: .Sh DESCRIPTION
51: The
1.8 aaron 52: .Nm
1.41 millert 53: utility sorts text and binary files by lines.
54: A line is a record separated from the subsequent record by a
55: newline (default) or NUL \'\\0\' character (-z option).
56: A record can contain any printable or unprintable characters.
57: Comparisons are based on one or more sort keys extracted from
58: each line of input, and are performed lexicographically,
59: according to the current locale's collating rules and the
60: specified command-line options that can tune the actual
61: sorting behavior.
1.8 aaron 62: By default, if keys are not given,
63: .Nm
1.41 millert 64: uses entire lines for comparison.
1.1 millert 65: .Pp
1.49 ! jmc 66: If no
! 67: .Ar file
! 68: is specified, or if
! 69: .Ar file
! 70: is
! 71: .Sq - ,
! 72: the standard input is used.
! 73: .Pp
1.7 aaron 74: The options are as follows:
1.21 jmc 75: .Bl -tag -width Ds
1.41 millert 76: .It Fl C , Fl Fl check=silent|quiet
1.35 schwarze 77: Check that the single input file is sorted.
78: If it is, exit 0; if it's not, exit 1.
79: In either case, produce no output.
1.41 millert 80: .It Fl c , Fl Fl check
1.35 schwarze 81: Like
82: .Fl C ,
1.37 jmc 83: but additionally write a message to
1.35 schwarze 84: .Em stderr
85: if the input file is not sorted.
1.41 millert 86: .It Fl m , Fl Fl merge
1.1 millert 87: Merge only; the input files are assumed to be pre-sorted.
1.41 millert 88: If they are not sorted, the output order is undefined.
89: .It Fl o Ar output , Fl Fl output Ns = Ns Ar output
90: Write the output to the
1.1 millert 91: .Ar output
1.41 millert 92: file instead of the standard output.
1.12 aaron 93: This file can be the same as one of the input files.
1.42 jmc 94: .It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
1.41 millert 95: Use a memory buffer no larger than
96: .Ar size .
97: The modifiers %, b, K, M, G, T, P, E, Z, and Y can be used.
98: If no memory limit is specified,
99: .Nm
100: may use up to about 90% of available memory.
101: If the input is too big to fit into the memory buffer,
102: temporary files are used.
1.42 jmc 103: .It Fl s
104: Stable sort; maintains the original record order of records that have
105: an equal key.
106: This is a non-standard feature, but it is widely accepted and used.
1.41 millert 107: .It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
108: Store temporary files in the directory
109: .Ar dir .
110: The default path is the value of the environment variable
1.1 millert 111: .Ev TMPDIR
112: or
113: .Pa /var/tmp
114: if
115: .Ev TMPDIR
1.41 millert 116: is not defined.
117: .It Fl u , Fl Fl unique
1.12 aaron 118: Unique: suppress all but one in each set of lines having equal keys.
1.41 millert 119: This option implies a stable sort (see above).
120: If used with
1.35 schwarze 121: .Fl C
122: or
1.41 millert 123: .Fl c ,
124: .Nm
125: also checks that there are no lines with duplicate keys.
1.38 jmc 126: .El
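.Pp
As a rough illustration of the stable sort option described above,
the following sketch sorts a few made-up records on their first field only.
The sample data and the variable names are invented for this example;
the key syntax it relies on is covered under the key option.

```shell
# Hypothetical input: a letter followed by a sequence number.
records='b 2
a 2
b 1
a 1'

# Sort on the first field only; -s keeps records with equal keys
# in their original relative order.
stable=$(printf '%s\n' "$records" | LC_ALL=C sort -s -k 1,1)
printf '%s\n' "$stable"
```

Without the stable sort option, the relative order of the two "a" records
(and of the two "b" records) is not guaranteed.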
127: .Pp
1.1 millert 128: The following options override the default ordering rules.
1.37 jmc 129: If ordering options appear before the first
130: .Fl k
131: option, they apply globally to all sort keys.
1.1 millert 132: When attached to a specific key (see
133: .Fl k ) ,
1.41 millert 134: the ordering options override all global ordering options for that key.
1.37 jmc 135: Note that the ordering options intended to apply globally should not
136: appear after
137: .Fl k
138: or results may be unexpected.
1.1 millert 139: .Bl -tag -width indent
1.41 millert 140: .It Fl b , Fl Fl ignore-leading-blanks
141: Ignore leading blank characters when comparing lines.
142: .It Fl d , Fl Fl dictionary-order
143: Consider only blank spaces and alphanumeric characters in comparisons.
144: .It Fl f , Fl Fl ignore-case
145: Consider all lowercase characters that have uppercase
1.12 aaron 146: equivalents to be the same for purposes of comparison.
1.41 millert 147: .It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
148: Sort by general numerical value.
149: As opposed to
150: .Fl n ,
151: this option handles general floating point numbers, which have a much
152: more permissive format than that allowed by
153: .Fl n ,
154: but it has a significant performance drawback.
155: .It Fl h, Fl Fl human-numeric-sort, Fl Fl sort=human-numeric
156: Sort by numerical value, but take into account the SI suffix,
157: if present.
158: Sorts first by numeric sign (negative, zero, or
159: positive); then by SI suffix (either empty, or `k' or `K', or one
160: of `MGTPEZY', in that order); and finally by numeric value.
161: The SI suffix must immediately follow the number.
162: For example, '12345K' sorts before '1M', because M is "larger" than K.
163: This sort option is useful for sorting the output of a single invocation
164: of the 'df' command with
165: .Fl h
166: or
167: .Fl H
168: options (human-readable).
169: .It Fl i , Fl Fl ignore-nonprinting
1.1 millert 170: Ignore all non-printable characters.
1.41 millert 171: .It Fl M , Fl Fl month-sort , Fl Fl sort=month
172: Sort by month abbreviations.
173: Unknown strings are considered smaller than valid month names.
174: .It Fl n , Fl Fl numeric-sort, Fl Fl sort=numeric
1.12 aaron 175: An initial numeric string, consisting of optional blank space, optional
176: minus sign, and zero or more digits (including decimal point)
1.1 millert 177: .\" with
178: .\" optional radix character and thousands
179: .\" separator
180: .\" (as defined in the current locale),
181: is sorted by arithmetic value.
1.41 millert 182: Leading blank characters are ignored.
183: .It Fl R , Fl Fl random-sort , Fl Fl sort=random
184: Sort lines in random order.
185: This is a random permutation of the inputs with the exception that
186: equal keys sort together.
187: It is implemented by hashing the input keys and sorting the hash values.
188: The hash function is randomized with data from
1.47 jmc 189: .Xr arc4random_buf 3 ,
1.41 millert 190: or by file content if one is specified via
191: .Fl Fl random-source .
192: If multiple sort fields are specified,
193: the same random hash function is used for all of them.
194: .It Fl r , Fl Fl reverse
195: Sort in reverse order.
196: .It Fl V , Fl Fl version-sort
197: Sort version numbers.
198: The input lines are treated as file names in the form
199: PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
200: "(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
201: The files are compared by their prefixes and versions (leading
202: zeros are ignored in version numbers, see example below).
203: If an input string does not match the pattern, then it is compared
204: using the byte compare function.
205: All string comparisons are performed in the C locale.
1.44 jmc 206: .Pp
207: For example:
208: .Bd -literal -offset indent
209: $ ls sort* | sort -V
210: sort-1.022.tgz
211: sort-1.23.tgz
212: sort-1.23.1.tgz
213: sort-1.024.tgz
214: sort-1.024.003.
215: sort-1.024.003.tgz
216: sort-1.024.07.tgz
217: sort-1.024.009.tgz
218: .Ed
1.1 millert 219: .El
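.Pp
The difference between plain numeric and human-numeric ordering can be
sketched as follows.
The size values and variable names are made up for illustration;
real input would more likely come from a command that prints
human-readable sizes.

```shell
# Hypothetical size values with and without SI suffixes.
sizes='1M
12345K
512
2G'

# -n compares only the leading digits, so the suffixes are ignored.
numeric=$(printf '%s\n' "$sizes" | LC_ALL=C sort -n)

# -h orders by SI suffix first, then by numeric value.
human=$(printf '%s\n' "$sizes" | LC_ALL=C sort -h)

# Print each result on one line for easy comparison.
printf 'numeric: %s\n' "$(echo $numeric)"
printf 'human:   %s\n' "$(echo $human)"
```

Note that the human-numeric ordering places 12345K before 1M,
as described above, even though 12345K is the larger quantity.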
220: .Pp
1.12 aaron 221: The treatment of field separators can be altered using these options:
1.1 millert 222: .Bl -tag -width indent
1.41 millert 223: .It Fl b , Fl Fl ignore-leading-blanks
224: Ignore leading blank space when determining the start
225: and end of a restricted sort key (see
226: .Fl k ) .
227: If
1.1 millert 228: .Fl b
1.41 millert 229: is specified before the first
1.1 millert 230: .Fl k
1.41 millert 231: option, it applies globally to all key specifications.
232: Otherwise,
1.1 millert 233: .Fl b
1.41 millert 234: can be attached independently to each
1.1 millert 235: .Ar field
1.41 millert 236: argument of the key specifications.
237: .It Xo
1.42 jmc 238: .Fl k Ar field1 Ns Op , Ns Ar field2 ,
239: .Fl Fl key Ns = Ns Ar field1 Ns Op , Ns Ar field2
1.41 millert 240: .Xc
241: Define a restricted sort key that has the starting position
242: .Ar field1 ,
243: and optional ending position
244: .Ar field2
245: of a key field.
246: The
247: .Fl k
248: option may be specified multiple times,
249: in which case subsequent keys are compared after earlier keys compare equal.
250: The
1.1 millert 251: .Fl k
1.41 millert 252: option replaces the obsolete options
253: .Cm \(pl Ns Ar pos1
254: and
255: .Fl Ns Ar pos2 ,
256: but the old notation is also supported.
257: .It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
258: Use
1.3 aaron 259: .Ar char
1.41 millert 260: as the field separator character.
1.8 aaron 261: The initial
1.1 millert 262: .Ar char
1.12 aaron 263: is not considered to be part of a field when determining key offsets.
1.1 millert 264: Each occurrence of
265: .Ar char
266: is significant (for example,
267: .Dq Ar charchar
268: delimits an empty field).
269: If
270: .Fl t
1.6 pjanzen 271: is not specified, the default field separator is a sequence of
272: blank-space characters, and consecutive blank spaces do
273: .Em not
274: delimit an empty field; further, the initial blank space
275: .Em is
276: considered part of a field when determining key offsets.
1.41 millert 277: To use NUL as field separator, use
278: .Fl t
279: \'\\0\'.
280: .It Fl z , Fl Fl zero-terminated
281: Use NUL as the record separator.
282: By default, records in the files are expected to be separated by
283: newline characters.
284: With this option, NUL (\'\\0\') is used as the record separator character.
1.37 jmc 285: .El
286: .Pp
1.41 millert 287: Other options:
1.37 jmc 288: .Bl -tag -width indent
1.41 millert 289: .It Fl Fl batch-size Ns = Ns Ar num
290: Specify the maximum number of files that can be opened by
291: .Nm
292: at once.
293: This option affects behavior when having many input files or using
294: temporary files.
295: The default value is 16.
296: .It Fl Fl compress-program Ns = Ns Ar program
297: Use
298: .Ar program
299: to compress temporary files.
300: When invoked with no arguments,
301: .Ar program
302: must compress standard input to standard output.
303: When called with the
304: .Fl d
305: option, it must decompress standard input to standard output.
306: If
307: .Ar program
308: fails,
309: .Nm
310: will exit with an error.
1.37 jmc 311: The
1.41 millert 312: .Xr compress 1
313: and
314: .Xr gzip 1
315: utilities meet these requirements.
316: .It Fl Fl debug
317: Print some extra information about the sorting process to the
318: standard output.
319: .It Fl Fl files0-from Ns = Ns Ar filename
320: Take the input file list from the file
1.44 jmc 321: .Ar filename .
1.41 millert 322: The file names must be separated by NUL
323: (like the output produced by the command
324: .Dq find ... -print0 ) .
1.49 ! jmc 325: .It Fl Fl heapsort
! 326: Try to use heap sort, if the sort specifications allow.
! 327: This sort algorithm cannot be used with
! 328: .Fl u
! 329: and
! 330: .Fl s .
! 331: .It Fl Fl help
! 332: Print the help text and exit.
! 333: .It Fl Fl mergesort , Fl H
1.41 millert 334: Use mergesort.
335: This is a universal algorithm that can always be used,
336: but it is not always the fastest.
1.49 ! jmc 337: .It Fl Fl mmap
! 338: Try to use the file memory mapping system call.
! 339: It may increase speed in some cases.
1.41 millert 340: .It Fl Fl qsort
341: Try to use quick sort, if the sort specifications allow.
342: This sort algorithm cannot be used with
343: .Fl u
344: and
345: .Fl s .
1.49 ! jmc 346: .It Fl Fl radixsort
! 347: Try to use radix sort, if the sort specifications allow.
! 348: The radix sort can only be used for trivial locales (C and POSIX),
! 349: and it cannot be used for numeric or month sort.
! 350: Radix sort is very fast and stable.
! 351: .It Fl Fl random-source Ns = Ns Ar filename
! 352: For random sort, the contents of
! 353: .Ar filename
! 354: are used as the source of the
! 355: .Sq seed
! 356: data for the hash function.
! 357: Two invocations of random sort with the same seed data will
! 358: produce the same result if the input is also identical.
! 359: By default, the
! 360: .Xr arc4random_buf 3
! 361: function is used instead.
! 362: .It Fl Fl version
! 363: Print the version and exit.
1.3 aaron 364: .El
1.1 millert 365: .Pp
1.12 aaron 366: A field is defined as a maximal sequence of characters other than the
1.6 pjanzen 367: field separator and record separator
368: .Pq newline by default .
369: Initial blank spaces are included in the field unless
370: .Fl b
371: has been specified;
372: the first blank space of a sequence of blank spaces acts as the field
373: separator and is included in the field (unless
374: .Fl t
375: is specified).
376: For example, by default all blank spaces at the beginning of a line are
377: considered to be part of the first field.
1.1 millert 378: .Pp
1.12 aaron 379: Fields are specified by the
1.45 jmc 380: .Fl k Ar field1 Ns Op , Ns Ar field2
1.41 millert 381: option.
382: If
1.1 millert 383: .Ar field2
1.41 millert 384: is missing, the end of the key defaults to the end of the line.
1.1 millert 385: .Pp
386: The arguments
387: .Ar field1
388: and
389: .Ar field2
390: have the form
391: .Em m.n
1.6 pjanzen 392: .Em (m,n > 0)
1.41 millert 393: and can be followed by one or more of the modifiers
1.6 pjanzen 394: .Cm b , d , f , i ,
1.41 millert 395: .Cm n , g , M
1.6 pjanzen 396: and
397: .Cm r ,
398: which correspond to the options discussed above.
1.41 millert 399: When
400: .Cm b
401: is specified it applies only to
402: .Ar field1
403: or
404: .Ar field2
405: where it is specified while the rest of the modifiers
406: apply to the whole key field regardless of whether they are
407: specified only with
408: .Ar field1
409: or
410: .Ar field2
411: or both.
1.1 millert 412: A
413: .Ar field1
414: position specified by
415: .Em m.n
416: is interpreted as the
417: .Em n Ns th
1.6 pjanzen 418: character from the beginning of the
1.1 millert 419: .Em m Ns th
420: field.
421: A missing
422: .Em \&.n
423: in
424: .Ar field1
425: means
426: .Ql \&.1 ,
427: indicating the first character of the
428: .Em m Ns th
1.12 aaron 429: field; if the
1.1 millert 430: .Fl b
431: option is in effect,
432: .Em n
1.12 aaron 433: is counted from the first non-blank character in the
1.1 millert 434: .Em m Ns th
435: field;
436: .Em m Ns \&.1b
1.12 aaron 437: refers to the first non-blank character in the
1.1 millert 438: .Em m Ns th
439: field.
1.6 pjanzen 440: .No 1\&. Ns Em n
441: refers to the
442: .Em n Ns th
443: character from the beginning of the line;
444: if
445: .Em n
446: is greater than the length of the line, the field is taken to be empty.
1.1 millert 447: .Pp
1.41 millert 448: .Em n Ns th
449: positions are always counted from the field beginning, even if the field
450: is shorter than the number of specified positions.
451: Thus, the key can really start from a position in a subsequent field.
452: .Pp
1.1 millert 453: A
454: .Ar field2
455: position specified by
456: .Em m.n
1.12 aaron 457: is interpreted as the
1.1 millert 458: .Em n Ns th
1.41 millert 459: character (including separators) from the beginning of the
1.1 millert 460: .Em m Ns th
461: field.
462: A missing
463: .Em \&.n
1.5 aaron 464: indicates the last character of the
1.1 millert 465: .Em m Ns th
466: field;
1.5 aaron 467: .Em m
1.1 millert 468: = \&0
469: designates the end of a line.
470: Thus the option
471: .Fl k Ar v.x,w.y
1.41 millert 472: is synonymous with the obsolete option
1.1 millert 473: .Cm \(pl Ns Ar v-\&1.x-\&1
474: .Fl Ns Ar w-\&1.y ;
475: when
476: .Em y
477: is omitted,
478: .Fl k Ar v.x,w
479: is synonymous with
1.5 aaron 480: .Cm \(pl Ns Ar v-\&1.x-\&1
1.19 tdeval 481: .Fl Ns Ar w\&.0 .
1.41 millert 482: The obsolete
1.1 millert 483: .Cm \(pl Ns Ar pos1
484: .Fl Ns Ar pos2
485: option is still supported, except for
1.3 aaron 486: .Fl Ns Ar w\&.0b ,
1.1 millert 487: which has no
488: .Fl k
489: equivalent.
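.Pp
The key syntax described above can be sketched with a small example.
The colon-separated records, the field layout (name, department, salary)
and the variable names are assumptions made up for illustration.
The first key restricts the comparison to the third field and attaches
the numeric and reverse modifiers to that key only;
the second key breaks ties on the first field in ascending order.

```shell
# Hypothetical records: name:department:salary
data='carol:eng:120
alice:ops:95
bob:eng:95
dave:ops:120'

# Key 1: third field only, numeric, descending (modifiers n and r).
# Key 2: first field only, used when the first key compares equal.
by_salary=$(printf '%s\n' "$data" | LC_ALL=C sort -t : -k 3,3nr -k 1,1)
printf '%s\n' "$by_salary"
```

Giving both a start and an end position (3,3 rather than just 3) keeps
the key from extending to the end of the line.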
490: .Sh ENVIRONMENT
491: .Bl -tag -width Fl
1.46 jmc 492: .It Ev GNUSORT_NUMERIC_COMPATIBILITY
493: If defined,
494: .Fl t
495: will not override the locale numeric symbols, that is, thousand
496: separators and decimal separators.
497: By default, if we specify
498: .Fl t
499: with the same symbol as the thousand separator or decimal point,
500: the symbol will be treated as the field separator.
501: Older behavior was less definite; the symbol was treated as both field
502: separator and numeric separator, simultaneously.
503: This environment variable enables the old behavior.
504: .It Ev LANG
505: Used as a last resort to determine different kinds of locale-specific
506: behavior if neither the respective environment variable, nor
507: .Ev LC_ALL
508: is set.
509: .It Ev LC_ALL
510: Locale settings that override all of the other locale settings.
511: This environment variable can be used to set all these settings
512: to the same value at once.
1.41 millert 513: .It Ev LC_COLLATE
514: Locale settings to be used to determine the collation for
515: sorting records.
516: .It Ev LC_CTYPE
517: Locale settings to be used for case conversion and classification
518: of characters, that is, which characters are considered
519: whitespace, etc.
520: .It Ev LC_MESSAGES
521: Locale settings that determine the language of output messages
522: that
523: .Nm
524: prints out.
525: .It Ev LC_NUMERIC
526: Locale settings that determine the number format used in numeric sort.
527: .It Ev LC_TIME
528: Locale settings that determine the month format used in month sort.
1.1 millert 529: .It Ev TMPDIR
1.41 millert 530: Path to the directory in which temporary files will be stored.
1.3 aaron 531: Note that
1.1 millert 532: .Ev TMPDIR
533: may be overridden by the
534: .Fl T
535: option.
1.11 aaron 536: .El
1.1 millert 537: .Sh FILES
538: .Bl -tag -width Pa -compact
1.41 millert 539: .It Pa /var/tmp/.bsdsort.PID.*
540: Temporary files.
1.39 jmc 541: .El
542: .Sh EXIT STATUS
543: The
544: .Nm
545: utility exits with one of the following values:
546: .Pp
547: .Bl -tag -width Ds -offset indent -compact
548: .It 0
1.41 millert 549: Successfully sorted the input files or, if used with
550: .Fl C
551: or
552: .Fl c ,
553: the input file already met the sorting criteria.
1.39 jmc 554: .It 1
1.41 millert 555: On disorder (or non-uniqueness) with the
1.39 jmc 556: .Fl C
557: or
558: .Fl c
1.41 millert 559: options.
1.39 jmc 560: .It 2
561: An error occurred.
1.1 millert 562: .El
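.Pp
A script can act on these exit values roughly as follows.
The helper function name and the sample inputs are hypothetical;
the sketch only assumes the three exit values listed above.

```shell
# Hypothetical helper: classify standard input using sort's exit status.
# Any extra arguments (for example -u) are passed through to sort.
check_sorted() {
    status=0
    LC_ALL=C sort -C "$@" || status=$?
    case $status in
        0) echo "sorted" ;;
        1) echo "not sorted" ;;
        *) echo "error" ;;
    esac
}

in_order=$(printf 'a\nb\nc\n' | check_sorted)
out_of_order=$(printf 'b\na\nc\n' | check_sorted)
# With -u, duplicate keys also count as a failed check.
duplicates=$(printf 'a\na\nb\n' | check_sorted -u)

printf '%s\n' "$in_order" "$out_of_order" "$duplicates"
```

Distinguishing exit value 1 from larger values lets a caller tell
unsorted input apart from a genuine failure such as an unreadable file.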
563: .Sh SEE ALSO
564: .Xr comm 1 ,
1.3 aaron 565: .Xr join 1 ,
1.47 jmc 566: .Xr uniq 1
1.27 dlg 567: .Sh STANDARDS
568: The
569: .Nm
1.28 jmc 570: utility is compliant with the
1.33 jmc 571: .St -p1003.1-2008
1.27 dlg 572: specification.
573: .Pp
574: The flags
1.43 jmc 575: .Op Fl gHhiMRSsTVz
1.28 jmc 576: are extensions to that specification.
1.41 millert 577: .Pp
578: All long options are extensions to the specification.
579: Some are provided for compatibility with GNU
580: .Nm ,
581: others are specific to this implementation.
582: .Pp
583: The historic key notations
584: .Cm \(pl Ns Ar pos1
585: and
586: .Fl Ns Ar pos2
587: are supported for compatibility with older versions of
588: .Nm
589: but their use is highly discouraged.
1.1 millert 590: .Sh HISTORY
591: A
1.8 aaron 592: .Nm
1.1 millert 593: command appeared in
1.16 mickey 594: .At v3 .
1.41 millert 595: .Sh AUTHORS
1.44 jmc 596: .An Gabor Kovesdan Aq Mt gabor@FreeBSD.org
597: .An Oleg Moskalenko Aq Mt mom040267@gmail.com
1.45 jmc 598: .Sh CAVEATS
1.41 millert 599: This implementation of
1.14 ericj 600: .Nm
601: has no limits on input line length (other than those imposed by available
602: memory) or any restrictions on bytes allowed within lines.
603: .Pp
1.41 millert 604: The performance depends highly on locale settings,
605: efficient choice of sort keys and key complexity.
606: The fastest sort is with the C locale, on whole lines, with option
607: .Fl s .
608: In general, the C locale is the fastest, followed by single-byte
609: locales, with multi-byte locales being the slowest.
610: The correct collation order is respected in all cases.
611: For the key specification, the simpler the lines are to process,
612: the faster the sort will be.
1.14 ericj 613: .Pp
1.41 millert 614: When sorting by arithmetic value, using
615: .Fl n
616: results in much better performance than
617: .Fl g
618: so its use is encouraged whenever possible.