Annotation of src/usr.bin/sort/sort.1, Revision 1.43
1.43 ! jmc 1: .\" $OpenBSD: sort.1,v 1.42 2015/03/19 11:38:12 jmc Exp $
1.1 millert 2: .\"
3: .\" Copyright (c) 1991, 1993
4: .\" The Regents of the University of California. All rights reserved.
5: .\"
6: .\" This code is derived from software contributed to Berkeley by
7: .\" the Institute of Electrical and Electronics Engineers, Inc.
8: .\"
9: .\" Redistribution and use in source and binary forms, with or without
10: .\" modification, are permitted provided that the following conditions
11: .\" are met:
12: .\" 1. Redistributions of source code must retain the above copyright
13: .\" notice, this list of conditions and the following disclaimer.
14: .\" 2. Redistributions in binary form must reproduce the above copyright
15: .\" notice, this list of conditions and the following disclaimer in the
16: .\" documentation and/or other materials provided with the distribution.
1.20 millert 17: .\" 3. Neither the name of the University nor the names of its contributors
1.1 millert 18: .\" may be used to endorse or promote products derived from this software
19: .\" without specific prior written permission.
20: .\"
21: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
22: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
24: .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
25: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
27: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
28: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
29: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
30: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
31: .\" SUCH DAMAGE.
32: .\"
33: .\" @(#)sort.1 8.1 (Berkeley) 6/6/93
34: .\"
1.43 ! jmc 35: .Dd $Mdocdate: March 19 2015 $
1.1 millert 36: .Dt SORT 1
37: .Os
38: .Sh NAME
39: .Nm sort
1.41 millert 40: .Nd sort, merge, or sequence check text and binary files
1.1 millert 41: .Sh SYNOPSIS
42: .Nm sort
1.43 ! jmc 43: .Op Fl bCcdfgHhiMmnRrsuVz
1.42 jmc 44: .Op Fl k Ar field1 Ns Op , Ns Ar field2
1.23 jmc 45: .Op Fl o Ar output
1.42 jmc 46: .Op Fl S Ar size
1.1 millert 47: .Op Fl T Ar dir
1.23 jmc 48: .Op Fl t Ar char
1.34 sobrado 49: .Op Ar
1.1 millert 50: .Sh DESCRIPTION
51: The
1.8 aaron 52: .Nm
1.41 millert 53: utility sorts text and binary files by lines.
54: A line is a record separated from the subsequent record by a
55: newline (default) or NUL \'\\0\' character (-z option).
56: A record can contain any printable or unprintable characters.
57: Comparisons are based on one or more sort keys extracted from
58: each line of input, and are performed lexicographically,
59: according to the current locale's collating rules and the
60: specified command-line options that can tune the actual
61: sorting behavior.
1.8 aaron 62: By default, if keys are not given,
63: .Nm
1.41 millert 64: uses entire lines for comparison.
1.1 millert 65: .Pp
1.7 aaron 66: The options are as follows:
1.21 jmc 67: .Bl -tag -width Ds
1.41 millert 68: .It Fl C , Fl Fl check=silent|quiet
1.35 schwarze 69: Check that the single input file is sorted.
70: If it is, exit 0; if it's not, exit 1.
71: In either case, produce no output.
1.41 millert 72: .It Fl c , Fl Fl check
1.35 schwarze 73: Like
74: .Fl C ,
1.37 jmc 75: but additionally write a message to
1.35 schwarze 76: .Em stderr
77: if the input file is not sorted.
1.41 millert 78: .It Fl m , Fl Fl merge
1.1 millert 79: Merge only; the input files are assumed to be pre-sorted.
1.41 millert 80: If they are not sorted, the output order is undefined.
81: .It Fl o Ar output , Fl Fl output Ns = Ns Ar output
82: Write the output to the
1.1 millert 83: .Ar output
1.41 millert 84: file instead of the standard output.
1.12 aaron 85: This file can be the same as one of the input files.
1.42 jmc 86: .It Fl S Ar size , Fl Fl buffer-size Ns = Ns Ar size
1.41 millert 87: Use a memory buffer no larger than
88: .Ar size .
89: The modifiers %, b, K, M, G, T, P, E, Z, and Y can be used.
90: If no memory limit is specified,
91: .Nm
92: may use up to about 90% of available memory.
93: If the input is too big to fit into the memory buffer,
94: temporary files are used.
1.42 jmc 95: .It Fl s
96: Stable sort; maintains the original record order of records that have
97: an equal key.
98: This is a non-standard feature, but it is widely accepted and used.
1.41 millert 99: .It Fl T Ar dir , Fl Fl temporary-directory Ns = Ns Ar dir
100: Store temporary files in the directory
101: .Ar dir .
102: The default path is the value of the environment variable
1.1 millert 103: .Ev TMPDIR
104: or
105: .Pa /var/tmp
106: if
107: .Ev TMPDIR
1.41 millert 108: is not defined.
109: .It Fl u , Fl Fl unique
1.12 aaron 110: Unique: suppress all but one in each set of lines having equal keys.
1.41 millert 111: This option implies a stable sort (see above).
112: If used with
1.35 schwarze 113: .Fl C
114: or
1.41 millert 115: .Fl c ,
116: .Nm
117: also checks that there are no lines with duplicate keys.
118: .It Fl Fl version
119: Print the version and exit.
120: .It Fl Fl help
121: Print the help text and exit.
1.38 jmc 122: .El
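The check, unique, and output options described above can be combined as in the following minimal sketch. It assumes a POSIX shell, that mktemp(1) is available, and a sort that implements -C, -u, and -o as documented here; the scratch file and variable names are hypothetical.

```shell
# Hypothetical scratch file holding unsorted input with a duplicate.
tmp=$(mktemp)
printf 'b\na\nb\n' > "$tmp"

# -C checks silently; an exit status of 1 reports disorder.
unsorted_status=0
LC_ALL=C sort -C "$tmp" || unsorted_status=$?

# -u drops lines with duplicate keys; -o may name one of the input files.
LC_ALL=C sort -u -o "$tmp" "$tmp"
contents=$(cat "$tmp")

# With -u, the check also rejects duplicate keys; this file now has none.
sorted_status=0
LC_ALL=C sort -C -u "$tmp" || sorted_status=$?

rm -f "$tmp"
echo "$unsorted_status $sorted_status"
```

Capturing the status with `||` keeps the sketch usable in scripts that run under `set -e`, where a non-zero exit from the check would otherwise abort the script.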
123: .Pp
1.1 millert 124: The following options override the default ordering rules.
1.37 jmc 125: If ordering options appear before the first
126: .Fl k
127: option, they apply globally to all sort keys.
1.1 millert 128: When attached to a specific key (see
129: .Fl k ) ,
1.41 millert 130: the ordering options override all global ordering options for that key.
1.37 jmc 131: Note that the ordering options intended to apply globally should not
132: appear after
133: .Fl k
134: or results may be unexpected.
1.1 millert 135: .Bl -tag -width indent
1.41 millert 136: .It Fl b , Fl Fl ignore-leading-blanks
137: Ignore leading blank characters when comparing lines.
138: .It Fl d , Fl Fl dictionary-order
139: Consider only blank spaces and alphanumeric characters in comparisons.
140: .It Fl f , Fl Fl ignore-case
141: Consider all lowercase characters that have uppercase
1.12 aaron 142: equivalents to be the same for purposes of comparison.
1.41 millert 143: .It Fl g , Fl Fl general-numeric-sort , Fl Fl sort=general-numeric
144: Sort by general numerical value.
145: As opposed to
146: .Fl n ,
147: this option handles general floating point numbers, which have a much
148: more permissive format than those allowed by
149: .Fl n ,
150: but it has a significant performance drawback.
151: .It Fl h , Fl Fl human-numeric-sort , Fl Fl sort=human-numeric
152: Sort by numerical value, but take into account the SI suffix,
153: if present.
154: Sorts first by numeric sign (negative, zero, or
155: positive); then by SI suffix (either empty, or `k' or `K', or one
156: of `MGTPEZY', in that order); and finally by numeric value.
157: The SI suffix must immediately follow the number.
158: For example, '12345K' sorts before '1M', because M is "larger" than K.
159: This sort option is useful for sorting the output of a single invocation
160: of the 'df' command with the
161: .Fl h
162: or
163: .Fl H
164: options (human-readable).
165: .It Fl i , Fl Fl ignore-nonprinting
1.1 millert 166: Ignore all non-printable characters.
1.41 millert 167: .It Fl M , Fl Fl month-sort , Fl Fl sort=month
168: Sort by month abbreviations.
169: Unknown strings are considered smaller than valid month names.
170: .It Fl n , Fl Fl numeric-sort , Fl Fl sort=numeric
1.12 aaron 171: An initial numeric string, consisting of optional blank space, optional
172: minus sign, and zero or more digits (including decimal point)
1.1 millert 173: .\" with
174: .\" optional radix character and thousands
175: .\" separator
176: .\" (as defined in the current locale),
177: is sorted by arithmetic value.
1.41 millert 178: Leading blank characters are ignored.
179: .It Fl R , Fl Fl random-sort , Fl Fl sort=random
180: Sort lines in random order.
181: This is a random permutation of the inputs with the exception that
182: equal keys sort together.
183: It is implemented by hashing the input keys and sorting the hash values.
184: The hash function is randomized with data from
185: .Fn arc4random_buf ,
186: or by file content if one is specified via
187: .Fl Fl random-source .
188: If multiple sort fields are specified,
189: the same random hash function is used for all of them.
190: .It Fl r , Fl Fl reverse
191: Sort in reverse order.
192: .It Fl V , Fl Fl version-sort
193: Sort version numbers.
194: The input lines are treated as file names in the form
195: PREFIX VERSION SUFFIX, where SUFFIX matches the regular expression
196: "(\.([A-Za-z~][A-Za-z0-9~]*)?)*".
197: The files are compared by their prefixes and versions (leading
198: zeros are ignored in version numbers, see example below).
199: If an input string does not match the pattern, then it is compared
200: using the byte compare function.
201: All string comparisons are performed in the C locale.
202: .Bl -tag -width indent
203: .It Example:
204: .It $ ls sort* | sort -V
205: .It sort-1.022.tgz
206: .It sort-1.23.tgz
207: .It sort-1.23.1.tgz
208: .It sort-1.024.tgz
209: .It sort-1.024.003.
210: .It sort-1.024.003.tgz
211: .It sort-1.024.07.tgz
212: .It sort-1.024.009.tgz
213: .El
1.1 millert 214: .El
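The effect of the ordering options above is easiest to see side by side. The following is a small sketch, assuming a POSIX shell and a sort that supports -n, -r, and -h as described here; the variable names are illustrative only, and the C locale is forced so the default comparison is bytewise.

```shell
# Default comparison is lexicographic: "10" < "100" < "9".
lexical=$(printf '10\n9\n100\n' | LC_ALL=C sort)

# -n compares by arithmetic value: 9 < 10 < 100.
numeric=$(printf '10\n9\n100\n' | LC_ALL=C sort -n)

# -r reverses the result of the chosen ordering.
reversed=$(printf '10\n9\n100\n' | LC_ALL=C sort -rn)

# -h ranks by SI suffix before numeric value, so 12345K sorts before 1M.
human=$(printf '1M\n12345K\n2G\n' | LC_ALL=C sort -h)

printf '%s\n' "$lexical" "$numeric" "$reversed" "$human"
```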
215: .Pp
1.12 aaron 216: The treatment of field separators can be altered using these options:
1.1 millert 217: .Bl -tag -width indent
1.41 millert 218: .It Fl b , Fl Fl ignore-leading-blanks
219: Ignore leading blank space when determining the start
220: and end of a restricted sort key (see
221: .Fl k ) .
222: If
1.1 millert 223: .Fl b
1.41 millert 224: is specified before the first
1.1 millert 225: .Fl k
1.41 millert 226: option, it applies globally to all key specifications.
227: Otherwise,
1.1 millert 228: .Fl b
1.41 millert 229: can be attached independently to each
1.1 millert 230: .Ar field
1.41 millert 231: argument of the key specifications.
232: .It Xo
1.42 jmc 233: .Fl k Ar field1 Ns Op , Ns Ar field2 ,
234: .Fl Fl key Ns = Ns Ar field1 Ns Op , Ns Ar field2
1.41 millert 235: .Xc
236: Define a restricted sort key that has the starting position
237: .Ar field1 ,
238: and optional ending position
239: .Ar field2
240: of a key field.
241: The
242: .Fl k
243: option may be specified multiple times,
244: in which case subsequent keys are compared after earlier keys compare equal.
245: The
1.1 millert 246: .Fl k
1.41 millert 247: option replaces the obsolete options
248: .Cm \(pl Ns Ar pos1
249: and
250: .Fl Ns Ar pos2 ,
251: but the old notation is also supported.
252: .It Fl t Ar char , Fl Fl field-separator Ns = Ns Ar char
253: Use
1.3 aaron 254: .Ar char
1.41 millert 255: as the field separator character.
1.8 aaron 256: The initial
1.1 millert 257: .Ar char
1.12 aaron 258: is not considered to be part of a field when determining key offsets.
1.1 millert 259: Each occurrence of
260: .Ar char
261: is significant (for example,
262: .Dq Ar charchar
263: delimits an empty field).
264: If
265: .Fl t
1.6 pjanzen 266: is not specified, the default field separator is a sequence of
267: blank-space characters, and consecutive blank spaces do
268: .Em not
269: delimit an empty field; further, the initial blank space
270: .Em is
271: considered part of a field when determining key offsets.
1.41 millert 272: To use NUL as field separator, use
273: .Fl t
274: \'\\0\'.
275: .It Fl z , Fl Fl zero-terminated
276: Use NUL as the record separator.
277: By default, records in the files are expected to be separated by
278: newline characters.
279: With this option, NUL (\'\\0\') is used as the record separator character.
1.37 jmc 280: .El
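The treatment of leading blanks and of record separators described above can be illustrated with a short sketch. It assumes a POSIX shell whose printf understands the `\0` escape and a sort that supports -b, -k, and -z as documented here; the variable names are hypothetical.

```shell
# Without -t, the blanks preceding a field belong to that field, so the
# second key of "x  b" is "  b" and it sorts before " a".
with_blanks=$(printf 'x  b\ny a\n' | LC_ALL=C sort -k 2)

# -b given before -k applies globally and skips those leading blanks,
# so the keys compared are "b" and "a".
without_blanks=$(printf 'x  b\ny a\n' | LC_ALL=C sort -b -k 2)

# -z: records end in NUL rather than newline; tr makes the result printable.
nul_records=$(printf 'b\0a\0' | LC_ALL=C sort -z | tr '\0' '\n')

printf '%s\n' "$with_blanks" "$without_blanks" "$nul_records"
```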
281: .Pp
1.41 millert 282: Other options:
1.37 jmc 283: .Bl -tag -width indent
1.41 millert 284: .It Fl Fl batch-size Ns = Ns Ar num
285: Specify the maximum number of files that can be opened by
286: .Nm
287: at once.
288: This option affects behavior when there are many input files or when using
289: temporary files.
290: The default value is 16.
291: .It Fl Fl compress-program Ns = Ns Ar program
292: Use
293: .Ar program
294: to compress temporary files.
295: When invoked with no arguments,
296: .Ar program
297: must compress standard input to standard output.
298: When called with the
299: .Fl d
300: option, it must decompress standard input to standard output.
301: If
302: .Ar program
303: fails,
304: .Nm
305: will exit with an error.
1.37 jmc 306: The
1.41 millert 307: .Xr compress 1
308: and
309: .Xr gzip 1
310: utilities meet these requirements.
311: .It Fl Fl random-source Ns = Ns Ar filename
312: For random sort, the contents of
313: .Ar filename
314: are used as the source of the
315: .Sq seed
316: data for the hash function.
317: Two invocations of random sort with the same seed data will
318: produce the same result if the input is also identical.
319: By default, the
320: .Fn arc4random_buf
321: function is used instead.
322: .It Fl Fl debug
323: Print some extra information about the sorting process to the
324: standard output.
325: .It Fl Fl files0-from Ns = Ns Ar filename
326: Take the input file list from the file
327: .Ar filename .
328: The file names must be separated by NUL
329: (like the output produced by the command
330: .Dq find ... -print0 ) .
331: .It Fl Fl radixsort
332: Try to use radix sort, if the sort specifications allow.
333: The radix sort can only be used for trivial locales (C and POSIX),
334: and it cannot be used for numeric or month sort.
335: Radix sort is very fast and stable.
336: .It Fl H , Fl Fl mergesort
337: Use mergesort.
338: This is a universal algorithm that can always be used,
339: but it is not always the fastest.
340: .It Fl Fl qsort
341: Try to use quick sort, if the sort specifications allow.
342: This sort algorithm cannot be used with
343: .Fl u
344: and
345: .Fl s .
346: .It Fl Fl heapsort
347: Try to use heap sort, if the sort specifications allow.
348: This sort algorithm cannot be used with
349: .Fl u
1.37 jmc 350: and
1.41 millert 351: .Fl s .
352: .It Fl Fl mmap
353: Try to use the file memory mapping system call.
354: It may increase speed in some cases.
1.1 millert 355: .El
356: .Pp
357: The following operands are available:
358: .Bl -tag -width indent
1.3 aaron 359: .It Ar file
360: The pathname of a file to be sorted, merged, or checked.
361: If no
1.1 millert 362: .Ar file
1.12 aaron 363: operands are specified, or if a
1.3 aaron 364: .Ar file
365: operand is
1.1 millert 366: .Fl ,
367: the standard input is used.
1.3 aaron 368: .El
1.1 millert 369: .Pp
1.12 aaron 370: A field is defined as a maximal sequence of characters other than the
1.6 pjanzen 371: field separator and record separator
372: .Pq newline by default .
373: Initial blank spaces are included in the field unless
374: .Fl b
375: has been specified;
376: the first blank space of a sequence of blank spaces acts as the field
377: separator and is included in the field (unless
378: .Fl t
379: is specified).
380: For example, by default all blank spaces at the beginning of a line are
381: considered to be part of the first field.
1.1 millert 382: .Pp
1.12 aaron 383: Fields are specified by the
1.23 jmc 384: .Sm off
385: .Fl k\ \& Ar field1 Op , Ar field2
386: .Sm on
1.41 millert 387: option.
388: If
1.1 millert 389: .Ar field2
1.41 millert 390: is missing, the end of the key defaults to the end of the line.
1.1 millert 391: .Pp
392: The arguments
393: .Ar field1
394: and
395: .Ar field2
396: have the form
397: .Em m.n
1.6 pjanzen 398: .Em (m,n > 0)
1.41 millert 399: and can be followed by one or more of the modifiers
1.6 pjanzen 400: .Cm b , d , f , i ,
1.41 millert 401: .Cm n , g , M
1.6 pjanzen 402: and
403: .Cm r ,
404: which correspond to the options discussed above.
1.41 millert 405: When
406: .Cm b
407: is specified it applies only to
408: .Ar field1
409: or
410: .Ar field2
411: where it is specified, while the rest of the modifiers
412: apply to the whole key field regardless of whether they are
413: specified only with
414: .Ar field1
415: or
416: .Ar field2
417: or both.
1.1 millert 418: A
419: .Ar field1
420: position specified by
421: .Em m.n
422: is interpreted as the
423: .Em n Ns th
1.6 pjanzen 424: character from the beginning of the
1.1 millert 425: .Em m Ns th
426: field.
427: A missing
428: .Em \&.n
429: in
430: .Ar field1
431: means
432: .Ql \&.1 ,
433: indicating the first character of the
434: .Em m Ns th
1.12 aaron 435: field; if the
1.1 millert 436: .Fl b
437: option is in effect,
438: .Em n
1.12 aaron 439: is counted from the first non-blank character in the
1.1 millert 440: .Em m Ns th
441: field;
442: .Em m Ns \&.1b
1.12 aaron 443: refers to the first non-blank character in the
1.1 millert 444: .Em m Ns th
445: field.
1.6 pjanzen 446: .No 1\&. Ns Em n
447: refers to the
448: .Em n Ns th
449: character from the beginning of the line;
450: if
451: .Em n
452: is greater than the length of the line, the field is taken to be empty.
1.1 millert 453: .Pp
1.41 millert 454: .Em n Ns th
455: positions are always counted from the field beginning, even if the field
456: is shorter than the number of specified positions.
457: Thus, the key can really start from a position in a subsequent field.
458: .Pp
1.1 millert 459: A
460: .Ar field2
461: position specified by
462: .Em m.n
1.12 aaron 463: is interpreted as the
1.1 millert 464: .Em n Ns th
1.41 millert 465: character (including separators) from the beginning of the
1.1 millert 466: .Em m Ns th
467: field.
468: A missing
469: .Em \&.n
1.5 aaron 470: indicates the last character of the
1.1 millert 471: .Em m Ns th
472: field;
1.5 aaron 473: .Em m
1.1 millert 474: = \&0
475: designates the end of a line.
476: Thus the option
477: .Fl k Ar v.x,w.y
1.41 millert 478: is synonymous with the obsolete option
1.1 millert 479: .Cm \(pl Ns Ar v-\&1.x-\&1
480: .Fl Ns Ar w-\&1.y ;
481: when
482: .Em y
483: is omitted,
484: .Fl k Ar v.x,w
485: is synonymous with
1.5 aaron 486: .Cm \(pl Ns Ar v-\&1.x-\&1
1.19 tdeval 487: .Fl Ns Ar w\&.0 .
1.41 millert 488: The obsolete
1.1 millert 489: .Cm \(pl Ns Ar pos1
490: .Fl Ns Ar pos2
491: option is still supported, except for
1.3 aaron 492: .Fl Ns Ar w\&.0b ,
1.1 millert 493: which has no
494: .Fl k
495: equivalent.
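The key syntax above can be tried out on a small colon-separated data set. This is a minimal sketch, assuming a POSIX shell and a sort that implements -t, -k, and -s as described here; the sample records and variable names are invented for illustration.

```shell
# Hypothetical records: name:age:team
data='carol:30:ops
bob:25:dev
alice:25:dev'

# Field 2 compared numerically, then field 1 to break ties.
by_age=$(printf '%s\n' "$data" | LC_ALL=C sort -t : -k 2,2n -k 1,1)

# -s keeps the input order of records whose keys compare equal,
# so bob stays ahead of alice.
stable=$(printf '%s\n' "$data" | LC_ALL=C sort -s -t : -k 2,2n)

# m.n form: the key starts at the 2nd character of field 1
# and ends at the last character of field 1.
by_second_char=$(printf '%s\n' "$data" | LC_ALL=C sort -t : -k 1.2,1)

printf '%s\n' "$by_age" "$stable" "$by_second_char"
```

Restricting each key with an explicit end field, as in `-k 2,2n`, avoids the default where a key runs to the end of the line.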
496: .Sh ENVIRONMENT
497: .Bl -tag -width Fl
1.41 millert 498: .It Ev LC_COLLATE
499: Locale settings to be used to determine the collation for
500: sorting records.
501: .It Ev LC_CTYPE
502: Locale settings to be used for case conversion and classification
503: of characters, that is, which characters are considered
504: whitespace, etc.
505: .It Ev LC_MESSAGES
506: Locale settings that determine the language of output messages
507: that
508: .Nm
509: prints out.
510: .It Ev LC_NUMERIC
511: Locale settings that determine the number format used in numeric sort.
512: .It Ev LC_TIME
513: Locale settings that determine the month format used in month sort.
514: .It Ev LC_ALL
515: Locale settings that override all of the above locale settings.
516: This environment variable can be used to set all these settings
517: to the same value at once.
518: .It Ev LANG
519: Used as a last resort to determine different kinds of locale-specific
520: behavior if neither the respective environment variable nor
521: .Ev LC_ALL
522: is set.
1.1 millert 523: .It Ev TMPDIR
1.41 millert 524: Path to the directory in which temporary files will be stored.
1.3 aaron 525: Note that
1.1 millert 526: .Ev TMPDIR
527: may be overridden by the
528: .Fl T
529: option.
1.41 millert 530: .It Ev GNUSORT_NUMERIC_COMPATIBILITY
531: If defined,
532: .Fl t
533: will not override the locale numeric symbols, that is, thousand
534: separators and decimal separators.
535: By default, if we specify
536: .Fl t
537: with the same symbol as the thousand separator or decimal point,
538: the symbol will be treated as the field separator.
539: Older behavior was less definite; the symbol was treated as both field
540: separator and numeric separator, simultaneously.
541: This environment variable enables the old behavior.
1.11 aaron 542: .El
1.1 millert 543: .Sh FILES
544: .Bl -tag -width Pa -compact
1.41 millert 545: .It Pa /var/tmp/.bsdsort.PID.*
546: Temporary files.
1.39 jmc 547: .El
548: .Sh EXIT STATUS
549: The
550: .Nm
551: utility exits with one of the following values:
552: .Pp
553: .Bl -tag -width Ds -offset indent -compact
554: .It 0
1.41 millert 555: Successfully sorted the input files or, if used with
556: .Fl C
557: or
558: .Fl c ,
559: the input file already met the sorting criteria.
1.39 jmc 560: .It 1
1.41 millert 561: On disorder (or non-uniqueness) with the
1.39 jmc 562: .Fl C
563: or
564: .Fl c
1.41 millert 565: options.
1.39 jmc 566: .It 2
567: An error occurred.
1.1 millert 568: .El
569: .Sh SEE ALSO
570: .Xr comm 1 ,
1.3 aaron 571: .Xr join 1 ,
1.18 fgsch 572: .Xr uniq 1 ,
1.41 millert 573: .Xr arc4random_buf 3
1.27 dlg 574: .Sh STANDARDS
575: The
576: .Nm
1.28 jmc 577: utility is compliant with the
1.33 jmc 578: .St -p1003.1-2008
1.27 dlg 579: specification.
580: .Pp
581: The flags
1.43 ! jmc 582: .Op Fl gHhiMRSsTVz
1.28 jmc 583: are extensions to that specification.
1.41 millert 584: .Pp
585: All long options are extensions to the specification.
586: Some are provided for compatibility with GNU
587: .Nm ,
588: others are specific to this implementation.
589: .Pp
590: The historic key notations
591: .Cm \(pl Ns Ar pos1
592: and
593: .Fl Ns Ar pos2
594: are supported for compatibility with older versions of
595: .Nm
596: but their use is highly discouraged.
1.1 millert 597: .Sh HISTORY
598: A
1.8 aaron 599: .Nm
1.1 millert 600: command appeared in
1.16 mickey 601: .At v3 .
1.41 millert 602: .Sh AUTHORS
603: Gabor Kovesdan <gabor@FreeBSD.org>
604: .br
605: Oleg Moskalenko <mom040267@gmail.com>
1.1 millert 606: .Sh NOTES
1.41 millert 607: This implementation of
1.14 ericj 608: .Nm
609: has no limits on input line length (other than those imposed by available
610: memory) or any restrictions on bytes allowed within lines.
611: .Pp
1.41 millert 612: The performance depends highly on locale settings,
613: efficient choice of sort keys and key complexity.
614: The fastest sort is with the C locale, on whole lines, with option
615: .Fl s .
616: In general, the C locale is the fastest, followed by single-byte
617: locales with multi-byte locales being the slowest.
618: The correct collation order is respected in all cases.
619: For the key specification, the simpler the keys are to process,
620: the faster the sort will be.
1.14 ericj 621: .Pp
1.41 millert 622: When sorting by arithmetic value, using
623: .Fl n
624: results in much better performance than
625: .Fl g
626: so its use is encouraged whenever possible.