LCOV - code coverage report
Current view: top level - src/backend/utils/sort - tuplesort.c (source / functions)
Test: PostgreSQL 15devel       Lines:     1181 hit, 1279 total, 92.3 % coverage
Date: 2021-12-05 01:09:12      Functions: 78 hit, 80 total, 97.5 % coverage

          Line data    Source code
       1             : /*-------------------------------------------------------------------------
       2             :  *
       3             :  * tuplesort.c
       4             :  *    Generalized tuple sorting routines.
       5             :  *
       6             :  * This module handles sorting of heap tuples, index tuples, or single
       7             :  * Datums (and could easily support other kinds of sortable objects,
       8             :  * if necessary).  It works efficiently for both small and large amounts
       9             :  * of data.  Small amounts are sorted in-memory using qsort().  Large
      10             :  * amounts are sorted using temporary files and a standard external sort
      11             :  * algorithm.
      12             :  *
      13             :  * See Knuth, volume 3, for more than you want to know about external
      14             :  * sorting algorithms.  The algorithm we use is a balanced k-way merge.
      15             :  * Before PostgreSQL 15, we used the polyphase merge algorithm (Knuth's
      16             :  * Algorithm 5.4.2D), but with modern hardware, a straightforward balanced
      17             :  * merge is better.  Knuth is assuming that tape drives are expensive
      18             :  * beasts, and in particular that there will always be many more runs than
      19             :  * tape drives.  The polyphase merge algorithm was good at keeping all the
      20             :  * tape drives busy, but in our implementation a "tape drive" doesn't cost
      21             :  * much more than a few Kb of memory buffers, so we can afford to have
      22             :  * lots of them.  In particular, if we can have as many tape drives as
       23             :  * sorted runs, we can eliminate repeated I/O entirely.
      24             :  *
      25             :  * Historically, we divided the input into sorted runs using replacement
      26             :  * selection, in the form of a priority tree implemented as a heap
      27             :  * (essentially Knuth's Algorithm 5.2.3H), but now we always use quicksort
      28             :  * for run generation.
      29             :  *
      30             :  * The approximate amount of memory allowed for any one sort operation
      31             :  * is specified in kilobytes by the caller (most pass work_mem).  Initially,
      32             :  * we absorb tuples and simply store them in an unsorted array as long as
      33             :  * we haven't exceeded workMem.  If we reach the end of the input without
      34             :  * exceeding workMem, we sort the array using qsort() and subsequently return
      35             :  * tuples just by scanning the tuple array sequentially.  If we do exceed
      36             :  * workMem, we begin to emit tuples into sorted runs in temporary tapes.
      37             :  * When tuples are dumped in batch after quicksorting, we begin a new run
      38             :  * with a new output tape.  If we reach the max number of tapes, we write
      39             :  * subsequent runs on the existing tapes in a round-robin fashion.  We will
      40             :  * need multiple merge passes to finish the merge in that case.  After the
      41             :  * end of the input is reached, we dump out remaining tuples in memory into
      42             :  * a final run, then merge the runs.
      43             :  *
      44             :  * When merging runs, we use a heap containing just the frontmost tuple from
      45             :  * each source run; we repeatedly output the smallest tuple and replace it
      46             :  * with the next tuple from its source tape (if any).  When the heap empties,
      47             :  * the merge is complete.  The basic merge algorithm thus needs very little
      48             :  * memory --- only M tuples for an M-way merge, and M is constrained to a
      49             :  * small number.  However, we can still make good use of our full workMem
      50             :  * allocation by pre-reading additional blocks from each source tape.  Without
      51             :  * prereading, our access pattern to the temporary file would be very erratic;
      52             :  * on average we'd read one block from each of M source tapes during the same
      53             :  * time that we're writing M blocks to the output tape, so there is no
      54             :  * sequentiality of access at all, defeating the read-ahead methods used by
      55             :  * most Unix kernels.  Worse, the output tape gets written into a very random
      56             :  * sequence of blocks of the temp file, ensuring that things will be even
      57             :  * worse when it comes time to read that tape.  A straightforward merge pass
      58             :  * thus ends up doing a lot of waiting for disk seeks.  We can improve matters
      59             :  * by prereading from each source tape sequentially, loading about workMem/M
      60             :  * bytes from each tape in turn, and making the sequential blocks immediately
      61             :  * available for reuse.  This approach helps to localize both read and write
       62             :  * accesses.  The pre-reading is handled by logtape.c; we just tell it how
      63             :  * much memory to use for the buffers.
      64             :  *
      65             :  * In the current code we determine the number of input tapes M on the basis
      66             :  * of workMem: we want workMem/M to be large enough that we read a fair
      67             :  * amount of data each time we read from a tape, so as to maintain the
      68             :  * locality of access described above.  Nonetheless, with large workMem we
      69             :  * can have many tapes.  The logical "tapes" are implemented by logtape.c,
      70             :  * which avoids space wastage by recycling disk space as soon as each block
      71             :  * is read from its "tape".
      72             :  *
      73             :  * When the caller requests random access to the sort result, we form
      74             :  * the final sorted run on a logical tape which is then "frozen", so
      75             :  * that we can access it randomly.  When the caller does not need random
      76             :  * access, we return from tuplesort_performsort() as soon as we are down
      77             :  * to one run per logical tape.  The final merge is then performed
      78             :  * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this
      79             :  * saves one cycle of writing all the data out to disk and reading it in.
      80             :  *
      81             :  * This module supports parallel sorting.  Parallel sorts involve coordination
      82             :  * among one or more worker processes, and a leader process, each with its own
      83             :  * tuplesort state.  The leader process (or, more accurately, the
      84             :  * Tuplesortstate associated with a leader process) creates a full tapeset
       85             :  * consisting of worker tapes, with one run to merge per worker process.
       86             :  * This tapeset is then merged.  Worker processes are guaranteed to
      87             :  * produce exactly one output run from their partial input.
      88             :  *
      89             :  *
      90             :  * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
      91             :  * Portions Copyright (c) 1994, Regents of the University of California
      92             :  *
      93             :  * IDENTIFICATION
      94             :  *    src/backend/utils/sort/tuplesort.c
      95             :  *
      96             :  *-------------------------------------------------------------------------
      97             :  */
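
Editor's note: the header comment above describes a balanced k-way merge
driven by a heap holding the frontmost tuple of each run.  Below is a
minimal, self-contained sketch of that idea, merging sorted int runs; the
names (RunCursor, heap_sift_down, kway_merge) are hypothetical, and
tuplesort.c itself merges SortTuples read from logical tapes rather than
in-memory arrays.

#include <stddef.h>

typedef struct RunCursor
{
    const int  *data;           /* one sorted input run */
    size_t      pos;            /* next unread element */
    size_t      len;            /* number of elements in the run */
} RunCursor;

/* Restore the min-heap property at index i, keyed on each run's next element. */
static void
heap_sift_down(RunCursor *heap, size_t nruns, size_t i)
{
    for (;;)
    {
        size_t      l = 2 * i + 1;
        size_t      r = 2 * i + 2;
        size_t      min = i;
        RunCursor   tmp;

        if (l < nruns && heap[l].data[heap[l].pos] < heap[min].data[heap[min].pos])
            min = l;
        if (r < nruns && heap[r].data[heap[r].pos] < heap[min].data[heap[min].pos])
            min = r;
        if (min == i)
            return;
        tmp = heap[i];
        heap[i] = heap[min];
        heap[min] = tmp;
        i = min;
    }
}

/*
 * Merge nruns non-empty sorted runs into out[].  heap[] must already be a
 * valid min-heap on each run's first element (heapify it beforehand).
 */
static void
kway_merge(RunCursor *heap, size_t nruns, int *out)
{
    size_t      n = 0;

    while (nruns > 0)
    {
        out[n++] = heap[0].data[heap[0].pos++];
        if (heap[0].pos == heap[0].len)
            heap[0] = heap[--nruns];    /* run exhausted; shrink the heap */
        heap_sift_down(heap, nruns, 0);
    }
}
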
      98             : 
      99             : #include "postgres.h"
     100             : 
     101             : #include <limits.h>
     102             : 
     103             : #include "access/hash.h"
     104             : #include "access/htup_details.h"
     105             : #include "access/nbtree.h"
     106             : #include "catalog/index.h"
     107             : #include "catalog/pg_am.h"
     108             : #include "commands/tablespace.h"
     109             : #include "executor/executor.h"
     110             : #include "miscadmin.h"
     111             : #include "pg_trace.h"
     112             : #include "utils/datum.h"
     113             : #include "utils/logtape.h"
     114             : #include "utils/lsyscache.h"
     115             : #include "utils/memutils.h"
     116             : #include "utils/pg_rusage.h"
     117             : #include "utils/rel.h"
     118             : #include "utils/sortsupport.h"
     119             : #include "utils/tuplesort.h"
     120             : 
     121             : 
     122             : /* sort-type codes for sort__start probes */
     123             : #define HEAP_SORT       0
     124             : #define INDEX_SORT      1
     125             : #define DATUM_SORT      2
     126             : #define CLUSTER_SORT    3
     127             : 
     128             : /* Sort parallel code from state for sort__start probes */
     129             : #define PARALLEL_SORT(state)    ((state)->shared == NULL ? 0 : \
     130             :                                  (state)->worker >= 0 ? 1 : 2)
     131             : 
     132             : /*
      133             :  * Initial size of the memtuples array.  We try to select this size so that
      134             :  * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the
      135             :  * allocation overhead might possibly be lowered.  However, we don't consider
      136             :  * array sizes less than 1024.
     137             :  *
     138             :  */
     139             : #define INITIAL_MEMTUPSIZE Max(1024, \
     140             :     ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
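
Editor's note: a worked example of this macro, assuming a typical 64-bit
build where ALLOCSET_SEPARATE_THRESHOLD is 8192 and sizeof(SortTuple) is 24:
8192 / 24 + 1 = 342, so INITIAL_MEMTUPSIZE = Max(1024, 342) = 1024.  In
other words, the 1024 floor normally governs.
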
     141             : 
     142             : /* GUC variables */
     143             : #ifdef TRACE_SORT
     144             : bool        trace_sort = false;
     145             : #endif
     146             : 
     147             : #ifdef DEBUG_BOUNDED_SORT
     148             : bool        optimize_bounded_sort = true;
     149             : #endif
     150             : 
     151             : 
     152             : /*
     153             :  * The objects we actually sort are SortTuple structs.  These contain
     154             :  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
     155             :  * which is a separate palloc chunk --- we assume it is just one chunk and
     156             :  * can be freed by a simple pfree() (except during merge, when we use a
     157             :  * simple slab allocator).  SortTuples also contain the tuple's first key
     158             :  * column in Datum/nullflag format, and a source/input tape number that
     159             :  * tracks which tape each heap element/slot belongs to during merging.
     160             :  *
     161             :  * Storing the first key column lets us save heap_getattr or index_getattr
     162             :  * calls during tuple comparisons.  We could extract and save all the key
     163             :  * columns not just the first, but this would increase code complexity and
     164             :  * overhead, and wouldn't actually save any comparison cycles in the common
     165             :  * case where the first key determines the comparison result.  Note that
     166             :  * for a pass-by-reference datatype, datum1 points into the "tuple" storage.
     167             :  *
     168             :  * There is one special case: when the sort support infrastructure provides an
     169             :  * "abbreviated key" representation, where the key is (typically) a pass by
     170             :  * value proxy for a pass by reference type.  In this case, the abbreviated key
     171             :  * is stored in datum1 in place of the actual first key column.
     172             :  *
     173             :  * When sorting single Datums, the data value is represented directly by
     174             :  * datum1/isnull1 for pass by value types (or null values).  If the datatype is
     175             :  * pass-by-reference and isnull1 is false, then "tuple" points to a separately
     176             :  * palloc'd data value, otherwise "tuple" is NULL.  The value of datum1 is then
     177             :  * either the same pointer as "tuple", or is an abbreviated key value as
     178             :  * described above.  Accordingly, "tuple" is always used in preference to
     179             :  * datum1 as the authoritative value for pass-by-reference cases.
     180             :  */
     181             : typedef struct
     182             : {
     183             :     void       *tuple;          /* the tuple itself */
     184             :     Datum       datum1;         /* value of first key column */
     185             :     bool        isnull1;        /* is first key column NULL? */
     186             :     int         srctape;        /* source tape number */
     187             : } SortTuple;
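
Editor's note: a hedged sketch (not the actual comparetup_heap) of how
datum1 serves as a comparison fast path.  It assumes no abbreviated keys
are in use; with abbreviation, a tie on datum1 must be re-checked against
the authoritative values reachable through "tuple".  The Tuplesortstate
field used here (sortKeys) is defined later in this file.

static int
comparetup_sketch(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
{
    /* The leading key was cached at copytup time; no heap_getattr needed. */
    int         compare = ApplySortComparator(a->datum1, a->isnull1,
                                              b->datum1, b->isnull1,
                                              state->sortKeys);

    if (compare != 0)
        return compare;

    /*
     * Any remaining sort keys would be fetched from a->tuple / b->tuple
     * (e.g. with heap_getattr) and compared here; elided in this sketch.
     */
    return 0;
}
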
     188             : 
     189             : /*
      190             :  * During merge, we use a pre-allocated set of fixed-size slots to hold
      191             :  * tuples, to avoid palloc/pfree overhead.
      192             :  *
      193             :  * Merge doesn't require a lot of memory, so we can afford to waste some
      194             :  * by using gratuitously-sized slots.  If a tuple is larger than 1 kB, the
     195             :  * palloc() overhead is not significant anymore.
     196             :  *
     197             :  * 'nextfree' is valid when this chunk is in the free list.  When in use, the
     198             :  * slot holds a tuple.
     199             :  */
     200             : #define SLAB_SLOT_SIZE 1024
     201             : 
     202             : typedef union SlabSlot
     203             : {
     204             :     union SlabSlot *nextfree;
     205             :     char        buffer[SLAB_SLOT_SIZE];
     206             : } SlabSlot;
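
Editor's note: a hedged sketch of slot allocation from the free list,
modeled on readtup_alloc() further down in this file.  Oversized tuples
fall back to an ordinary allocation in the sort context.

static void *
slab_alloc_sketch(Tuplesortstate *state, Size tuplen)
{
    SlabSlot   *buf;

    if (tuplen > SLAB_SLOT_SIZE || state->slabFreeHead == NULL)
        return MemoryContextAlloc(state->sortcontext, tuplen);

    buf = state->slabFreeHead;
    state->slabFreeHead = buf->nextfree;    /* pop the free list */
    return buf;
}
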
     207             : 
     208             : /*
     209             :  * Possible states of a Tuplesort object.  These denote the states that
     210             :  * persist between calls of Tuplesort routines.
     211             :  */
     212             : typedef enum
     213             : {
     214             :     TSS_INITIAL,                /* Loading tuples; still within memory limit */
     215             :     TSS_BOUNDED,                /* Loading tuples into bounded-size heap */
     216             :     TSS_BUILDRUNS,              /* Loading tuples; writing to tape */
     217             :     TSS_SORTEDINMEM,            /* Sort completed entirely in memory */
     218             :     TSS_SORTEDONTAPE,           /* Sort completed, final run is on tape */
     219             :     TSS_FINALMERGE              /* Performing final merge on-the-fly */
     220             : } TupSortStatus;
     221             : 
     222             : /*
     223             :  * Parameters for calculation of number of tapes to use --- see inittapes()
     224             :  * and tuplesort_merge_order().
     225             :  *
      226             :  * In this calculation we assume that each tape will cost us about one
      227             :  * block's worth of buffer space.  This ignores the overhead of all the other
      228             :  * data structures needed for each tape, but it's probably close enough.
     229             :  *
     230             :  * MERGE_BUFFER_SIZE is how much buffer space we'd like to allocate for each
     231             :  * input tape, for pre-reading (see discussion at top of file).  This is *in
     232             :  * addition to* the 1 block already included in TAPE_BUFFER_OVERHEAD.
     233             :  */
     234             : #define MINORDER        6       /* minimum merge order */
     235             : #define MAXORDER        500     /* maximum merge order */
     236             : #define TAPE_BUFFER_OVERHEAD        BLCKSZ
     237             : #define MERGE_BUFFER_SIZE           (BLCKSZ * 32)
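
Editor's note: a hedged sketch of how these constants combine into a merge
order, modeled on tuplesort_merge_order().  Each input tape needs its own
buffer (TAPE_BUFFER_OVERHEAD) plus MERGE_BUFFER_SIZE of preread workspace,
and roughly one more TAPE_BUFFER_OVERHEAD per input tape is budgeted for
the output side; the result is clamped to [MINORDER, MAXORDER].  Assuming
BLCKSZ = 8192, a 4 MB workMem gives 4194304 / (2 * 8192 + 32 * 8192) = 15
input tapes.

static int
merge_order_sketch(int64 allowedMem)
{
    int         mOrder;

    mOrder = (int) (allowedMem / (2 * TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE));
    mOrder = Max(mOrder, MINORDER);
    mOrder = Min(mOrder, MAXORDER);

    return mOrder;
}
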
     238             : 
     239             : typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
     240             :                                     Tuplesortstate *state);
     241             : 
     242             : /*
     243             :  * Private state of a Tuplesort operation.
     244             :  */
     245             : struct Tuplesortstate
     246             : {
     247             :     TupSortStatus status;       /* enumerated value as shown above */
     248             :     int         nKeys;          /* number of columns in sort key */
     249             :     bool        randomAccess;   /* did caller request random access? */
     250             :     bool        bounded;        /* did caller specify a maximum number of
     251             :                                  * tuples to return? */
     252             :     bool        boundUsed;      /* true if we made use of a bounded heap */
     253             :     int         bound;          /* if bounded, the maximum number of tuples */
     254             :     bool        tuples;         /* Can SortTuple.tuple ever be set? */
     255             :     int64       availMem;       /* remaining memory available, in bytes */
     256             :     int64       allowedMem;     /* total memory allowed, in bytes */
     257             :     int         maxTapes;       /* max number of input tapes to merge in each
     258             :                                  * pass */
      259             :     int64       maxSpace;      /* maximum amount of space occupied across
      260             :                                  * sort groups, either in-memory or on-disk */
      261             :     bool        isMaxSpaceDisk; /* true when maxSpace is the value for on-disk
      262             :                                  * space, false when it is the value for
      263             :                                  * in-memory space */
     264             :     TupSortStatus maxSpaceStatus;   /* sort status when maxSpace was reached */
     265             :     MemoryContext maincontext;  /* memory context for tuple sort metadata that
     266             :                                  * persists across multiple batches */
     267             :     MemoryContext sortcontext;  /* memory context holding most sort data */
     268             :     MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
     269             :     LogicalTapeSet *tapeset;    /* logtape.c object for tapes in a temp file */
     270             : 
     271             :     /*
     272             :      * These function pointers decouple the routines that must know what kind
     273             :      * of tuple we are sorting from the routines that don't need to know it.
     274             :      * They are set up by the tuplesort_begin_xxx routines.
     275             :      *
     276             :      * Function to compare two tuples; result is per qsort() convention, ie:
     277             :      * <0, 0, >0 according as a<b, a=b, a>b.  The API must match
     278             :      * qsort_arg_comparator.
     279             :      */
     280             :     SortTupleComparator comparetup;
     281             : 
     282             :     /*
     283             :      * Function to copy a supplied input tuple into palloc'd space and set up
     284             :      * its SortTuple representation (ie, set tuple/datum1/isnull1).  Also,
     285             :      * state->availMem must be decreased by the amount of space used for the
     286             :      * tuple copy (note the SortTuple struct itself is not counted).
     287             :      */
     288             :     void        (*copytup) (Tuplesortstate *state, SortTuple *stup, void *tup);
     289             : 
     290             :     /*
     291             :      * Function to write a stored tuple onto tape.  The representation of the
     292             :      * tuple on tape need not be the same as it is in memory; requirements on
     293             :      * the tape representation are given below.  Unless the slab allocator is
     294             :      * used, after writing the tuple, pfree() the out-of-line data (not the
     295             :      * SortTuple struct!), and increase state->availMem by the amount of
     296             :      * memory space thereby released.
     297             :      */
     298             :     void        (*writetup) (Tuplesortstate *state, LogicalTape *tape,
     299             :                              SortTuple *stup);
     300             : 
     301             :     /*
     302             :      * Function to read a stored tuple from tape back into memory. 'len' is
     303             :      * the already-read length of the stored tuple.  The tuple is allocated
     304             :      * from the slab memory arena, or is palloc'd, see readtup_alloc().
     305             :      */
     306             :     void        (*readtup) (Tuplesortstate *state, SortTuple *stup,
     307             :                             LogicalTape *tape, unsigned int len);
     308             : 
     309             :     /*
     310             :      * This array holds the tuples now in sort memory.  If we are in state
     311             :      * INITIAL, the tuples are in no particular order; if we are in state
     312             :      * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
     313             :      * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
     314             :      * H.  In state SORTEDONTAPE, the array is not used.
     315             :      */
     316             :     SortTuple  *memtuples;      /* array of SortTuple structs */
     317             :     int         memtupcount;    /* number of tuples currently present */
     318             :     int         memtupsize;     /* allocated length of memtuples array */
     319             :     bool        growmemtuples;  /* memtuples' growth still underway? */
     320             : 
     321             :     /*
     322             :      * Memory for tuples is sometimes allocated using a simple slab allocator,
     323             :      * rather than with palloc().  Currently, we switch to slab allocation
     324             :      * when we start merging.  Merging only needs to keep a small, fixed
     325             :      * number of tuples in memory at any time, so we can avoid the
     326             :      * palloc/pfree overhead by recycling a fixed number of fixed-size slots
     327             :      * to hold the tuples.
     328             :      *
     329             :      * For the slab, we use one large allocation, divided into SLAB_SLOT_SIZE
     330             :      * slots.  The allocation is sized to have one slot per tape, plus one
     331             :      * additional slot.  We need that many slots to hold all the tuples kept
      332             :  * in the heap during merge, plus the one most recently returned from the
      333             :  * sort by tuplesort_gettuple.
      334             :  *
      335             :  * Initially, all the slots are kept in a linked list of free slots.  When
      336             :  * a tuple is read from a tape, it is put into the next available slot, if
      337             :  * it fits.  If the tuple is larger than SLAB_SLOT_SIZE, it is palloc'd
      338             :  * instead.
      339             :  *
      340             :  * When we're done processing a tuple, we return its slot to the free
      341             :  * list, or pfree() it if it was palloc'd.  We know that a tuple was
      342             :  * allocated from the slab if its pointer value lies between
      343             :  * slabMemoryBegin and slabMemoryEnd.
     344             :      *
     345             :      * When the slab allocator is used, the USEMEM/LACKMEM mechanism of
     346             :      * tracking memory usage is not used.
     347             :      */
     348             :     bool        slabAllocatorUsed;
     349             : 
     350             :     char       *slabMemoryBegin;    /* beginning of slab memory arena */
     351             :     char       *slabMemoryEnd;  /* end of slab memory arena */
     352             :     SlabSlot   *slabFreeHead;   /* head of free list */
     353             : 
     354             :     /* Memory used for input and output tape buffers. */
     355             :     size_t      tape_buffer_mem;
     356             : 
     357             :     /*
      358             :      * When tuplesort_gettuple_XXX returns a tuple to the caller that
      359             :      * came from a tape (that is, in TSS_SORTEDONTAPE or TSS_FINALMERGE
     360             :      * modes), we remember the tuple in 'lastReturnedTuple', so that we can
     361             :      * recycle the memory on next gettuple call.
     362             :      */
     363             :     void       *lastReturnedTuple;
     364             : 
     365             :     /*
     366             :      * While building initial runs, this is the current output run number.
     367             :      * Afterwards, it is the number of initial runs we made.
     368             :      */
     369             :     int         currentRun;
     370             : 
     371             :     /*
     372             :      * Logical tapes, for merging.
     373             :      *
      374             :      * The initial runs are written to the output tapes.  In each merge pass,
     375             :      * the output tapes of the previous pass become the input tapes, and new
     376             :      * output tapes are created as needed.  When nInputTapes equals
     377             :      * nInputRuns, there is only one merge pass left.
     378             :      */
     379             :     LogicalTape **inputTapes;
     380             :     int         nInputTapes;
     381             :     int         nInputRuns;
     382             : 
     383             :     LogicalTape **outputTapes;
     384             :     int         nOutputTapes;
     385             :     int         nOutputRuns;
     386             : 
     387             :     LogicalTape *destTape;      /* current output tape */
     388             : 
     389             :     /*
     390             :      * These variables are used after completion of sorting to keep track of
     391             :      * the next tuple to return.  (In the tape case, the tape's current read
     392             :      * position is also critical state.)
     393             :      */
     394             :     LogicalTape *result_tape;   /* actual tape of finished output */
     395             :     int         current;        /* array index (only used if SORTEDINMEM) */
     396             :     bool        eof_reached;    /* reached EOF (needed for cursors) */
     397             : 
     398             :     /* markpos_xxx holds marked position for mark and restore */
     399             :     long        markpos_block;  /* tape block# (only used if SORTEDONTAPE) */
     400             :     int         markpos_offset; /* saved "current", or offset in tape block */
     401             :     bool        markpos_eof;    /* saved "eof_reached" */
     402             : 
     403             :     /*
     404             :      * These variables are used during parallel sorting.
     405             :      *
     406             :      * worker is our worker identifier.  Follows the general convention that
      407             :      * a value of -1 denotes a leader tuplesort, and values >= 0 denote worker
      408             :      * tuplesorts.  (-1 can also denote a serial tuplesort.)
     409             :      *
     410             :      * shared is mutable shared memory state, which is used to coordinate
     411             :      * parallel sorts.
     412             :      *
     413             :      * nParticipants is the number of worker Tuplesortstates known by the
     414             :      * leader to have actually been launched, which implies that they must
     415             :      * finish a run that the leader needs to merge.  Typically includes a
     416             :      * worker state held by the leader process itself.  Set in the leader
     417             :      * Tuplesortstate only.
     418             :      */
     419             :     int         worker;
     420             :     Sharedsort *shared;
     421             :     int         nParticipants;
     422             : 
     423             :     /*
     424             :      * The sortKeys variable is used by every case other than the hash index
     425             :      * case; it is set by tuplesort_begin_xxx.  tupDesc is only used by the
     426             :      * MinimalTuple and CLUSTER routines, though.
     427             :      */
     428             :     TupleDesc   tupDesc;
     429             :     SortSupport sortKeys;       /* array of length nKeys */
     430             : 
     431             :     /*
     432             :      * This variable is shared by the single-key MinimalTuple case and the
     433             :      * Datum case (which both use qsort_ssup()).  Otherwise it's NULL.
     434             :      */
     435             :     SortSupport onlyKey;
     436             : 
     437             :     /*
     438             :      * Additional state for managing "abbreviated key" sortsupport routines
     439             :      * (which currently may be used by all cases except the hash index case).
     440             :      * Tracks the intervals at which the optimization's effectiveness is
     441             :      * tested.
     442             :      */
     443             :     int64       abbrevNext;     /* Tuple # at which to next check
     444             :                                  * applicability */
     445             : 
     446             :     /*
     447             :      * These variables are specific to the CLUSTER case; they are set by
     448             :      * tuplesort_begin_cluster.
     449             :      */
     450             :     IndexInfo  *indexInfo;      /* info about index being used for reference */
     451             :     EState     *estate;         /* for evaluating index expressions */
     452             : 
     453             :     /*
     454             :      * These variables are specific to the IndexTuple case; they are set by
     455             :      * tuplesort_begin_index_xxx and used only by the IndexTuple routines.
     456             :      */
     457             :     Relation    heapRel;        /* table the index is being built on */
     458             :     Relation    indexRel;       /* index being built */
     459             : 
     460             :     /* These are specific to the index_btree subcase: */
     461             :     bool        enforceUnique;  /* complain if we find duplicate tuples */
     462             : 
     463             :     /* These are specific to the index_hash subcase: */
     464             :     uint32      high_mask;      /* masks for sortable part of hash code */
     465             :     uint32      low_mask;
     466             :     uint32      max_buckets;
     467             : 
     468             :     /*
     469             :      * These variables are specific to the Datum case; they are set by
     470             :      * tuplesort_begin_datum and used only by the DatumTuple routines.
     471             :      */
     472             :     Oid         datumType;
     473             :     /* we need typelen in order to know how to copy the Datums. */
     474             :     int         datumTypeLen;
     475             : 
     476             :     /*
     477             :      * Resource snapshot for time of sort start.
     478             :      */
     479             : #ifdef TRACE_SORT
     480             :     PGRUsage    ru_start;
     481             : #endif
     482             : };
     483             : 
     484             : /*
      485             :  * Private mutable state of a parallel tuplesort operation.  This is allocated
     486             :  * in shared memory.
     487             :  */
     488             : struct Sharedsort
     489             : {
     490             :     /* mutex protects all fields prior to tapes */
     491             :     slock_t     mutex;
     492             : 
     493             :     /*
     494             :      * currentWorker generates ordinal identifier numbers for parallel sort
     495             :      * workers.  These start from 0, and are always gapless.
     496             :      *
     497             :      * Workers increment workersFinished to indicate having finished.  If this
      498             :      * is equal to state.nParticipants within the leader, the leader is ready to
     499             :      * merge worker runs.
     500             :      */
     501             :     int         currentWorker;
     502             :     int         workersFinished;
     503             : 
     504             :     /* Temporary file space */
     505             :     SharedFileSet fileset;
     506             : 
     507             :     /* Size of tapes flexible array */
     508             :     int         nTapes;
     509             : 
     510             :     /*
     511             :      * Tapes array used by workers to report back information needed by the
     512             :      * leader to concatenate all worker tapes into one for merging
      513             :      * leader to concatenate all worker tapes into one for merging.
     514             :     TapeShare   tapes[FLEXIBLE_ARRAY_MEMBER];
     515             : };
     516             : 
     517             : /*
     518             :  * Is the given tuple allocated from the slab memory arena?
     519             :  */
     520             : #define IS_SLAB_SLOT(state, tuple) \
     521             :     ((char *) (tuple) >= (state)->slabMemoryBegin && \
     522             :      (char *) (tuple) < (state)->slabMemoryEnd)
     523             : 
     524             : /*
     525             :  * Return the given tuple to the slab memory free list, or free it
     526             :  * if it was palloc'd.
     527             :  */
     528             : #define RELEASE_SLAB_SLOT(state, tuple) \
     529             :     do { \
     530             :         SlabSlot *buf = (SlabSlot *) tuple; \
     531             :         \
     532             :         if (IS_SLAB_SLOT((state), buf)) \
     533             :         { \
     534             :             buf->nextfree = (state)->slabFreeHead; \
     535             :             (state)->slabFreeHead = buf; \
     536             :         } else \
     537             :             pfree(buf); \
     538             :     } while(0)
     539             : 
     540             : #define COMPARETUP(state,a,b)   ((*(state)->comparetup) (a, b, state))
     541             : #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
     542             : #define WRITETUP(state,tape,stup)   ((*(state)->writetup) (state, tape, stup))
     543             : #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
     544             : #define LACKMEM(state)      ((state)->availMem < 0 && !(state)->slabAllocatorUsed)
     545             : #define USEMEM(state,amt)   ((state)->availMem -= (amt))
     546             : #define FREEMEM(state,amt)  ((state)->availMem += (amt))
     547             : #define SERIAL(state)       ((state)->shared == NULL)
     548             : #define WORKER(state)       ((state)->shared && (state)->worker != -1)
     549             : #define LEADER(state)       ((state)->shared && (state)->worker == -1)
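
Editor's note: a hedged sketch of the accounting pattern these macros
support (cf. puttuple_common and the copytup routines): charge the actual
chunk size for each tuple copy, then spill to tape once LACKMEM trips.
charge_tuple_sketch is hypothetical, not a function in this file, and the
returned pointer is assumed to be kept by the caller.

static void *
charge_tuple_sketch(Tuplesortstate *state, Size tuplen)
{
    void       *tuple = MemoryContextAlloc(state->tuplecontext, tuplen);

    /* Count actual chunk space, not the requested size. */
    USEMEM(state, GetMemoryChunkSpace(tuple));
    if (LACKMEM(state))
    {
        /* Out of budget: tuplesort would call dumptuples(state, false). */
    }
    return tuple;
}
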
     550             : 
     551             : /*
     552             :  * NOTES about on-tape representation of tuples:
     553             :  *
     554             :  * We require the first "unsigned int" of a stored tuple to be the total size
     555             :  * on-tape of the tuple, including itself (so it is never zero; an all-zero
     556             :  * unsigned int is used to delimit runs).  The remainder of the stored tuple
     557             :  * may or may not match the in-memory representation of the tuple ---
     558             :  * any conversion needed is the job of the writetup and readtup routines.
     559             :  *
     560             :  * If state->randomAccess is true, then the stored representation of the
     561             :  * tuple must be followed by another "unsigned int" that is a copy of the
     562             :  * length --- so the total tape space used is actually sizeof(unsigned int)
     563             :  * more than the stored length value.  This allows read-backwards.  When
     564             :  * randomAccess is not true, the write/read routines may omit the extra
     565             :  * length word.
     566             :  *
     567             :  * writetup is expected to write both length words as well as the tuple
     568             :  * data.  When readtup is called, the tape is positioned just after the
     569             :  * front length word; readtup must read the tuple data and advance past
     570             :  * the back length word (if present).
     571             :  *
     572             :  * The write/read routines can make use of the tuple description data
     573             :  * stored in the Tuplesortstate record, if needed.  They are also expected
     574             :  * to adjust state->availMem by the amount of memory space (not tape space!)
     575             :  * released or consumed.  There is no error return from either writetup
     576             :  * or readtup; they should ereport() on failure.
     577             :  *
     578             :  *
     579             :  * NOTES about memory consumption calculations:
     580             :  *
     581             :  * We count space allocated for tuples against the workMem limit, plus
     582             :  * the space used by the variable-size memtuples array.  Fixed-size space
     583             :  * is not counted; it's small enough to not be interesting.
     584             :  *
     585             :  * Note that we count actual space used (as shown by GetMemoryChunkSpace)
     586             :  * rather than the originally-requested size.  This is important since
     587             :  * palloc can add substantial overhead.  It's not a complete answer since
     588             :  * we won't count any wasted space in palloc allocation blocks, but it's
     589             :  * a lot better than what we were doing before 7.3.  As of 9.6, a
      590             :  * separate memory context is used for caller-passed tuples.  Resetting
     591             :  * it at certain key increments significantly ameliorates fragmentation.
     592             :  * Note that this places a responsibility on copytup routines to use the
     593             :  * correct memory context for these tuples (and to not use the reset
     594             :  * context for anything whose lifetime needs to span multiple external
     595             :  * sort runs).  readtup routines use the slab allocator (they cannot use
     596             :  * the reset context because it gets deleted at the point that merging
     597             :  * begins).
     598             :  */
     599             : 
     600             : /* When using this macro, beware of double evaluation of len */
     601             : #define LogicalTapeReadExact(tape, ptr, len) \
     602             :     do { \
     603             :         if (LogicalTapeRead(tape, ptr, len) != (size_t) (len)) \
     604             :             elog(ERROR, "unexpected end of data"); \
     605             :     } while(0)
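
Editor's note: a hedged sketch of a writetup/readtup pair obeying the
on-tape layout just described; signatures are simplified (real writetup
routines take a SortTuple and also adjust state->availMem), and
readtup_alloc() is the slab-aware allocator declared further down in this
file.

static void
writetup_sketch(Tuplesortstate *state, LogicalTape *tape,
                void *payload, unsigned int payloadlen)
{
    /* Leading length word counts itself, so it is never zero. */
    unsigned int tuplen = payloadlen + sizeof(unsigned int);

    LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
    LogicalTapeWrite(tape, payload, payloadlen);
    if (state->randomAccess)    /* trailing word enables read-backwards */
        LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
}

static void
readtup_sketch(Tuplesortstate *state, SortTuple *stup,
               LogicalTape *tape, unsigned int len)
{
    /* 'len' is the leading length word, already consumed by the caller. */
    unsigned int payloadlen = len - sizeof(unsigned int);
    void        *payload = readtup_alloc(state, payloadlen);

    LogicalTapeReadExact(tape, payload, payloadlen);
    stup->tuple = payload;
    if (state->randomAccess)    /* advance past the trailing length word */
        LogicalTapeReadExact(tape, &len, sizeof(len));
}
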
     606             : 
     607             : 
     608             : static Tuplesortstate *tuplesort_begin_common(int workMem,
     609             :                                               SortCoordinate coordinate,
     610             :                                               bool randomAccess);
     611             : static void tuplesort_begin_batch(Tuplesortstate *state);
     612             : static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
     613             : static bool consider_abort_common(Tuplesortstate *state);
     614             : static void inittapes(Tuplesortstate *state, bool mergeruns);
     615             : static void inittapestate(Tuplesortstate *state, int maxTapes);
     616             : static void selectnewtape(Tuplesortstate *state);
     617             : static void init_slab_allocator(Tuplesortstate *state, int numSlots);
     618             : static void mergeruns(Tuplesortstate *state);
     619             : static void mergeonerun(Tuplesortstate *state);
     620             : static void beginmerge(Tuplesortstate *state);
     621             : static bool mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup);
     622             : static void dumptuples(Tuplesortstate *state, bool alltuples);
     623             : static void make_bounded_heap(Tuplesortstate *state);
     624             : static void sort_bounded_heap(Tuplesortstate *state);
     625             : static void tuplesort_sort_memtuples(Tuplesortstate *state);
     626             : static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple);
     627             : static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple);
     628             : static void tuplesort_heap_delete_top(Tuplesortstate *state);
     629             : static void reversedirection(Tuplesortstate *state);
     630             : static unsigned int getlen(LogicalTape *tape, bool eofOK);
     631             : static void markrunend(LogicalTape *tape);
     632             : static void *readtup_alloc(Tuplesortstate *state, Size tuplen);
     633             : static int  comparetup_heap(const SortTuple *a, const SortTuple *b,
     634             :                             Tuplesortstate *state);
     635             : static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
     636             : static void writetup_heap(Tuplesortstate *state, LogicalTape *tape,
     637             :                           SortTuple *stup);
     638             : static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
     639             :                          LogicalTape *tape, unsigned int len);
     640             : static int  comparetup_cluster(const SortTuple *a, const SortTuple *b,
     641             :                                Tuplesortstate *state);
     642             : static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
     643             : static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
     644             :                              SortTuple *stup);
     645             : static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
     646             :                             LogicalTape *tape, unsigned int len);
     647             : static int  comparetup_index_btree(const SortTuple *a, const SortTuple *b,
     648             :                                    Tuplesortstate *state);
     649             : static int  comparetup_index_hash(const SortTuple *a, const SortTuple *b,
     650             :                                   Tuplesortstate *state);
     651             : static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
     652             : static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
     653             :                            SortTuple *stup);
     654             : static void readtup_index(Tuplesortstate *state, SortTuple *stup,
     655             :                           LogicalTape *tape, unsigned int len);
     656             : static int  comparetup_datum(const SortTuple *a, const SortTuple *b,
     657             :                              Tuplesortstate *state);
     658             : static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
     659             : static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
     660             :                            SortTuple *stup);
     661             : static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
     662             :                           LogicalTape *tape, unsigned int len);
     663             : static int  worker_get_identifier(Tuplesortstate *state);
     664             : static void worker_freeze_result_tape(Tuplesortstate *state);
     665             : static void worker_nomergeruns(Tuplesortstate *state);
     666             : static void leader_takeover_tapes(Tuplesortstate *state);
     667             : static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
     668             : static void tuplesort_free(Tuplesortstate *state);
     669             : static void tuplesort_updatemax(Tuplesortstate *state);
     670             : 
     671             : /*
     672             :  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
     673             :  * any variant of SortTuples, using the appropriate comparetup function.
     674             :  * qsort_ssup() is specialized for the case where the comparetup function
     675             :  * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts
     676             :  * and Datum sorts.
     677             :  */
     678             : 
     679             : #define ST_SORT qsort_tuple
     680             : #define ST_ELEMENT_TYPE SortTuple
     681             : #define ST_COMPARE_RUNTIME_POINTER
     682             : #define ST_COMPARE_ARG_TYPE Tuplesortstate
     683             : #define ST_CHECK_FOR_INTERRUPTS
     684             : #define ST_SCOPE static
     685             : #define ST_DECLARE
     686             : #define ST_DEFINE
     687             : #include "lib/sort_template.h"
     688             : 
     689             : #define ST_SORT qsort_ssup
     690             : #define ST_ELEMENT_TYPE SortTuple
     691             : #define ST_COMPARE(a, b, ssup) \
     692             :     ApplySortComparator((a)->datum1, (a)->isnull1, \
     693             :                         (b)->datum1, (b)->isnull1, (ssup))
     694             : #define ST_COMPARE_ARG_TYPE SortSupportData
     695             : #define ST_CHECK_FOR_INTERRUPTS
     696             : #define ST_SCOPE static
     697             : #define ST_DEFINE
     698             : #include "lib/sort_template.h"
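
Editor's note: a hedged sketch of how the two generated routines are
dispatched, modeled on tuplesort_sort_memtuples(): the specialized
qsort_ssup path applies whenever a single SortSupport comparator suffices.

static void
sort_memtuples_sketch(Tuplesortstate *state)
{
    if (state->memtupcount > 1)
    {
        if (state->onlyKey != NULL)
            qsort_ssup(state->memtuples, state->memtupcount, state->onlyKey);
        else
            qsort_tuple(state->memtuples, state->memtupcount,
                        state->comparetup, state);
    }
}
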
     699             : 
     700             : /*
     701             :  *      tuplesort_begin_xxx
     702             :  *
     703             :  * Initialize for a tuple sort operation.
     704             :  *
     705             :  * After calling tuplesort_begin, the caller should call tuplesort_putXXX
     706             :  * zero or more times, then call tuplesort_performsort when all the tuples
     707             :  * have been supplied.  After performsort, retrieve the tuples in sorted
     708             :  * order by calling tuplesort_getXXX until it returns false/NULL.  (If random
     709             :  * access was requested, rescan, markpos, and restorepos can also be called.)
     710             :  * Call tuplesort_end to terminate the operation and release memory/disk space.
     711             :  *
     712             :  * Each variant of tuplesort_begin has a workMem parameter specifying the
     713             :  * maximum number of kilobytes of RAM to use before spilling data to disk.
     714             :  * (The normal value of this parameter is work_mem, but some callers use
     715             :  * other values.)  Each variant also has a randomAccess parameter specifying
     716             :  * whether the caller needs non-sequential access to the sort result.
     717             :  */
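
Editor's note: a hedged sketch of the caller protocol described above.
fetch_next_tuple_into() and process_sorted_tuple() are hypothetical
stand-ins for the caller's input and output paths; the tuplesort calls
themselves follow the public API as declared in utils/tuplesort.h.

extern bool fetch_next_tuple_into(TupleTableSlot *slot);    /* hypothetical */
extern void process_sorted_tuple(TupleTableSlot *slot);     /* hypothetical */

static void
sort_all_tuples_sketch(TupleDesc tupDesc, int nkeys, AttrNumber *attNums,
                       Oid *sortOperators, Oid *sortCollations,
                       bool *nullsFirstFlags, TupleTableSlot *slot)
{
    Tuplesortstate *state;

    state = tuplesort_begin_heap(tupDesc, nkeys, attNums,
                                 sortOperators, sortCollations,
                                 nullsFirstFlags, work_mem,
                                 NULL /* serial */ , false /* no random access */ );

    /* Feed tuples; tuplesort spills to tapes if work_mem is exceeded. */
    while (fetch_next_tuple_into(slot))
        tuplesort_puttupleslot(state, slot);

    tuplesort_performsort(state);

    /* Drain the sorted output. */
    while (tuplesort_gettupleslot(state, true, false, slot, NULL))
        process_sorted_tuple(slot);

    tuplesort_end(state);
}
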
     718             : 
     719             : static Tuplesortstate *
     720     1932216 : tuplesort_begin_common(int workMem, SortCoordinate coordinate,
     721             :                        bool randomAccess)
     722             : {
     723             :     Tuplesortstate *state;
     724             :     MemoryContext maincontext;
     725             :     MemoryContext sortcontext;
     726             :     MemoryContext oldcontext;
     727             : 
     728             :     /* See leader_takeover_tapes() remarks on randomAccess support */
     729     1932216 :     if (coordinate && randomAccess)
     730           0 :         elog(ERROR, "random access disallowed under parallel sort");
     731             : 
     732             :     /*
     733             :      * Memory context surviving tuplesort_reset.  This memory context holds
     734             :      * data which is useful to keep while sorting multiple similar batches.
     735             :      */
     736     1932216 :     maincontext = AllocSetContextCreate(CurrentMemoryContext,
     737             :                                         "TupleSort main",
     738             :                                         ALLOCSET_DEFAULT_SIZES);
     739             : 
     740             :     /*
     741             :      * Create a working memory context for one sort operation.  The content of
     742             :      * this context is deleted by tuplesort_reset.
     743             :      */
     744     1932216 :     sortcontext = AllocSetContextCreate(maincontext,
     745             :                                         "TupleSort sort",
     746             :                                         ALLOCSET_DEFAULT_SIZES);
     747             : 
     748             :     /*
      749             :      * Additionally, a working memory context for tuples is set up in
     750             :      * tuplesort_begin_batch.
     751             :      */
     752             : 
     753             :     /*
     754             :      * Make the Tuplesortstate within the per-sortstate context.  This way, we
     755             :      * don't need a separate pfree() operation for it at shutdown.
     756             :      */
     757     1932216 :     oldcontext = MemoryContextSwitchTo(maincontext);
     758             : 
     759     1932216 :     state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
     760             : 
     761             : #ifdef TRACE_SORT
     762     1932216 :     if (trace_sort)
     763           0 :         pg_rusage_init(&state->ru_start);
     764             : #endif
     765             : 
     766     1932216 :     state->randomAccess = randomAccess;
     767     1932216 :     state->tuples = true;
     768             : 
     769             :     /*
     770             :      * workMem is forced to be at least 64KB, the current minimum valid value
     771             :      * for the work_mem GUC.  This is a defense against parallel sort callers
     772             :      * that divide out memory among many workers in a way that leaves each
     773             :      * with very little memory.
     774             :      */
     775     1932216 :     state->allowedMem = Max(workMem, 64) * (int64) 1024;
     776     1932216 :     state->sortcontext = sortcontext;
     777     1932216 :     state->maincontext = maincontext;
     778             : 
     779             :     /*
      780             :      * The initial allocation of the memtuples array must exceed
      781             :      * ALLOCSET_SEPARATE_THRESHOLD; see comments in grow_memtuples().
     782             :      */
     783     1932216 :     state->memtupsize = INITIAL_MEMTUPSIZE;
     784     1932216 :     state->memtuples = NULL;
     785             : 
     786             :     /*
      787             :      * After all of the other non-parallel-related state, we set up all of the
     788             :      * state needed for each batch.
     789             :      */
     790     1932216 :     tuplesort_begin_batch(state);
     791             : 
     792             :     /*
     793             :      * Initialize parallel-related state based on coordination information
     794             :      * from caller
     795             :      */
     796     1932216 :     if (!coordinate)
     797             :     {
     798             :         /* Serial sort */
     799     1931796 :         state->shared = NULL;
     800     1931796 :         state->worker = -1;
     801     1931796 :         state->nParticipants = -1;
     802             :     }
     803         420 :     else if (coordinate->isWorker)
     804             :     {
     805             :         /* Parallel worker produces exactly one final run from all input */
     806         280 :         state->shared = coordinate->sharedsort;
     807         280 :         state->worker = worker_get_identifier(state);
     808         280 :         state->nParticipants = -1;
     809             :     }
     810             :     else
     811             :     {
     812             :         /* Parallel leader state only used for final merge */
     813         140 :         state->shared = coordinate->sharedsort;
     814         140 :         state->worker = -1;
     815         140 :         state->nParticipants = coordinate->nParticipants;
     816             :         Assert(state->nParticipants >= 1);
     817             :     }
     818             : 
     819     1932216 :     MemoryContextSwitchTo(oldcontext);
     820             : 
     821     1932216 :     return state;
     822             : }
     823             : 
     824             : /*
     825             :  *      tuplesort_begin_batch
     826             :  *
      827             :  * Set up, or reset, all state needed for processing a new set of tuples with this
     828             :  * sort state. Called both from tuplesort_begin_common (the first time sorting
     829             :  * with this sort state) and tuplesort_reset (for subsequent usages).
     830             :  */
     831             : static void
     832     1932416 : tuplesort_begin_batch(Tuplesortstate *state)
     833             : {
     834             :     MemoryContext oldcontext;
     835             : 
     836     1932416 :     oldcontext = MemoryContextSwitchTo(state->maincontext);
     837             : 
     838             :     /*
     839             :      * Caller tuple (e.g. IndexTuple) memory context.
     840             :      *
      841             :      * A dedicated child context used exclusively for caller-passed tuples
     842             :      * eases memory management.  Resetting at key points reduces
     843             :      * fragmentation. Note that the memtuples array of SortTuples is allocated
     844             :      * in the parent context, not this context, because there is no need to
     845             :      * free memtuples early.
     846             :      */
     847     1932416 :     state->tuplecontext = AllocSetContextCreate(state->sortcontext,
     848             :                                                 "Caller tuples",
     849             :                                                 ALLOCSET_DEFAULT_SIZES);
     850             : 
     851     1932416 :     state->status = TSS_INITIAL;
     852     1932416 :     state->bounded = false;
     853     1932416 :     state->boundUsed = false;
     854             : 
     855     1932416 :     state->availMem = state->allowedMem;
     856             : 
     857     1932416 :     state->tapeset = NULL;
     858             : 
     859     1932416 :     state->memtupcount = 0;
     860             : 
     861             :     /*
     862             :      * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
     863             :      * see comments in grow_memtuples().
     864             :      */
     865     1932416 :     state->growmemtuples = true;
     866     1932416 :     state->slabAllocatorUsed = false;
     867     1932416 :     if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)
     868             :     {
     869           0 :         pfree(state->memtuples);
     870           0 :         state->memtuples = NULL;
     871           0 :         state->memtupsize = INITIAL_MEMTUPSIZE;
     872             :     }
     873     1932416 :     if (state->memtuples == NULL)
     874             :     {
     875     1932216 :         state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
     876     1932216 :         USEMEM(state, GetMemoryChunkSpace(state->memtuples));
     877             :     }
     878             : 
     879             :     /* workMem must be large enough for the minimal memtuples array */
     880     1932416 :     if (LACKMEM(state))
     881           0 :         elog(ERROR, "insufficient memory allowed for sort");
     882             : 
     883     1932416 :     state->currentRun = 0;
     884             : 
     885             :     /*
     886             :      * Tape variables (inputTapes, outputTapes, etc.) will be initialized by
     887             :      * inittapes(), if needed.
     888             :      */
     889             : 
     890     1932416 :     state->result_tape = NULL;   /* flag that result tape has not been formed */
     891             : 
     892     1932416 :     MemoryContextSwitchTo(oldcontext);
     893     1932416 : }
     894             : 
     895             : Tuplesortstate *
     896     1670924 : tuplesort_begin_heap(TupleDesc tupDesc,
     897             :                      int nkeys, AttrNumber *attNums,
     898             :                      Oid *sortOperators, Oid *sortCollations,
     899             :                      bool *nullsFirstFlags,
     900             :                      int workMem, SortCoordinate coordinate, bool randomAccess)
     901             : {
     902     1670924 :     Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
     903             :                                                    randomAccess);
     904             :     MemoryContext oldcontext;
     905             :     int         i;
     906             : 
     907     1670924 :     oldcontext = MemoryContextSwitchTo(state->maincontext);
     908             : 
     909             :     AssertArg(nkeys > 0);
     910             : 
     911             : #ifdef TRACE_SORT
     912     1670924 :     if (trace_sort)
     913           0 :         elog(LOG,
     914             :              "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
     915             :              nkeys, workMem, randomAccess ? 't' : 'f');
     916             : #endif
     917             : 
     918     1670924 :     state->nKeys = nkeys;
     919             : 
     920             :     TRACE_POSTGRESQL_SORT_START(HEAP_SORT,
     921             :                                 false,  /* no unique check */
     922             :                                 nkeys,
     923             :                                 workMem,
     924             :                                 randomAccess,
     925             :                                 PARALLEL_SORT(state));
     926             : 
     927     1670924 :     state->comparetup = comparetup_heap;
     928     1670924 :     state->copytup = copytup_heap;
     929     1670924 :     state->writetup = writetup_heap;
     930     1670924 :     state->readtup = readtup_heap;
     931             : 
     932     1670924 :     state->tupDesc = tupDesc;    /* assume we need not copy tupDesc */
     933     1670924 :     state->abbrevNext = 10;
     934             : 
     935             :     /* Prepare SortSupport data for each column */
     936     1670924 :     state->sortKeys = (SortSupport) palloc0(nkeys * sizeof(SortSupportData));
     937             : 
     938     3353442 :     for (i = 0; i < nkeys; i++)
     939             :     {
     940     1682522 :         SortSupport sortKey = state->sortKeys + i;
     941             : 
     942             :         AssertArg(attNums[i] != 0);
     943             :         AssertArg(sortOperators[i] != 0);
     944             : 
     945     1682522 :         sortKey->ssup_cxt = CurrentMemoryContext;
     946     1682522 :         sortKey->ssup_collation = sortCollations[i];
     947     1682522 :         sortKey->ssup_nulls_first = nullsFirstFlags[i];
     948     1682522 :         sortKey->ssup_attno = attNums[i];
     949             :         /* Convey if abbreviation optimization is applicable in principle */
     950     1682522 :         sortKey->abbreviate = (i == 0);
     951             : 
     952     1682522 :         PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
     953             :     }
     954             : 
     955             :     /*
     956             :      * The "onlyKey" optimization cannot be used with abbreviated keys, since
     957             :      * tie-breaker comparisons may be required.  Typically, the optimization
     958             :      * is only of value to pass-by-value types anyway, whereas abbreviated
     959             :      * keys are typically only of value to pass-by-reference types.
     960             :      */
     961     1670920 :     if (nkeys == 1 && !state->sortKeys->abbrev_converter)
     962     1637824 :         state->onlyKey = state->sortKeys;
     963             : 
     964     1670920 :     MemoryContextSwitchTo(oldcontext);
     965             : 
     966     1670920 :     return state;
     967             : }
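
For orientation, here is a minimal sketch of the full lifecycle around tuplesort_begin_heap, using only entry points declared for this module. The single int4 sort column, the Int4LessOperator ordering, and the fetch_next_slot/process_slot helpers are illustrative assumptions, not part of this file.

    /* Illustrative sketch: serial sort of heap tuples on one int4 column. */
    AttrNumber  attNum = 1;
    Oid         sortOp = Int4LessOperator;
    Oid         collation = InvalidOid;     /* int4 is not collatable */
    bool        nullsFirst = false;
    Tuplesortstate *sortstate;

    sortstate = tuplesort_begin_heap(tupDesc, 1, &attNum,
                                     &sortOp, &collation, &nullsFirst,
                                     work_mem, NULL /* serial */,
                                     false /* no random access */);

    while (fetch_next_slot(slot))           /* assumed input loop */
        tuplesort_puttupleslot(sortstate, slot);

    tuplesort_performsort(sortstate);

    while (tuplesort_gettupleslot(sortstate, true, false, slot, NULL))
        process_slot(slot);                 /* assumed consumer */

    tuplesort_end(sortstate);
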
     968             : 
     969             : Tuplesortstate *
     970          56 : tuplesort_begin_cluster(TupleDesc tupDesc,
     971             :                         Relation indexRel,
     972             :                         int workMem,
     973             :                         SortCoordinate coordinate, bool randomAccess)
     974             : {
     975          56 :     Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
     976             :                                                    randomAccess);
     977             :     BTScanInsert indexScanKey;
     978             :     MemoryContext oldcontext;
     979             :     int         i;
     980             : 
     981             :     Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
     982             : 
     983          56 :     oldcontext = MemoryContextSwitchTo(state->maincontext);
     984             : 
     985             : #ifdef TRACE_SORT
     986          56 :     if (trace_sort)
     987           0 :         elog(LOG,
     988             :              "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
     989             :              RelationGetNumberOfAttributes(indexRel),
     990             :              workMem, randomAccess ? 't' : 'f');
     991             : #endif
     992             : 
     993          56 :     state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
     994             : 
     995             :     TRACE_POSTGRESQL_SORT_START(CLUSTER_SORT,
     996             :                                 false,  /* no unique check */
     997             :                                 state->nKeys,
     998             :                                 workMem,
     999             :                                 randomAccess,
    1000             :                                 PARALLEL_SORT(state));
    1001             : 
    1002          56 :     state->comparetup = comparetup_cluster;
    1003          56 :     state->copytup = copytup_cluster;
    1004          56 :     state->writetup = writetup_cluster;
    1005          56 :     state->readtup = readtup_cluster;
    1006          56 :     state->abbrevNext = 10;
    1007             : 
    1008          56 :     state->indexInfo = BuildIndexInfo(indexRel);
    1009             : 
    1010          56 :     state->tupDesc = tupDesc;    /* assume we need not copy tupDesc */
    1011             : 
    1012          56 :     indexScanKey = _bt_mkscankey(indexRel, NULL);
    1013             : 
    1014          56 :     if (state->indexInfo->ii_Expressions != NULL)
    1015             :     {
    1016             :         TupleTableSlot *slot;
    1017             :         ExprContext *econtext;
    1018             : 
    1019             :         /*
    1020             :          * We will need to use FormIndexDatum to evaluate the index
    1021             :          * expressions.  To do that, we need an EState, as well as a
    1022             :          * TupleTableSlot to put the table tuples into.  The econtext's
    1023             :          * scantuple has to point to that slot, too.
    1024             :          */
    1025           8 :         state->estate = CreateExecutorState();
    1026           8 :         slot = MakeSingleTupleTableSlot(tupDesc, &TTSOpsHeapTuple);
    1027           8 :         econtext = GetPerTupleExprContext(state->estate);
    1028           8 :         econtext->ecxt_scantuple = slot;
    1029             :     }
    1030             : 
    1031             :     /* Prepare SortSupport data for each column */
    1032          56 :     state->sortKeys = (SortSupport) palloc0(state->nKeys *
    1033             :                                             sizeof(SortSupportData));
    1034             : 
    1035         124 :     for (i = 0; i < state->nKeys; i++)
    1036             :     {
    1037          68 :         SortSupport sortKey = state->sortKeys + i;
    1038          68 :         ScanKey     scanKey = indexScanKey->scankeys + i;
    1039             :         int16       strategy;
    1040             : 
    1041          68 :         sortKey->ssup_cxt = CurrentMemoryContext;
    1042          68 :         sortKey->ssup_collation = scanKey->sk_collation;
    1043          68 :         sortKey->ssup_nulls_first =
    1044          68 :             (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
    1045          68 :         sortKey->ssup_attno = scanKey->sk_attno;
    1046             :         /* Convey if abbreviation optimization is applicable in principle */
    1047          68 :         sortKey->abbreviate = (i == 0);
    1048             : 
    1049             :         AssertState(sortKey->ssup_attno != 0);
    1050             : 
    1051          68 :         strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
    1052             :             BTGreaterStrategyNumber : BTLessStrategyNumber;
    1053             : 
    1054          68 :         PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
    1055             :     }
    1056             : 
    1057          56 :     pfree(indexScanKey);
    1058             : 
    1059          56 :     MemoryContextSwitchTo(oldcontext);
    1060             : 
    1061          56 :     return state;
    1062             : }
    1063             : 
    1064             : Tuplesortstate *
    1065      187798 : tuplesort_begin_index_btree(Relation heapRel,
    1066             :                             Relation indexRel,
    1067             :                             bool enforceUnique,
    1068             :                             int workMem,
    1069             :                             SortCoordinate coordinate,
    1070             :                             bool randomAccess)
    1071             : {
    1072      187798 :     Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
    1073             :                                                    randomAccess);
    1074             :     BTScanInsert indexScanKey;
    1075             :     MemoryContext oldcontext;
    1076             :     int         i;
    1077             : 
    1078      187798 :     oldcontext = MemoryContextSwitchTo(state->maincontext);
    1079             : 
    1080             : #ifdef TRACE_SORT
    1081      187798 :     if (trace_sort)
    1082           0 :         elog(LOG,
    1083             :              "begin index sort: unique = %c, workMem = %d, randomAccess = %c",
    1084             :              enforceUnique ? 't' : 'f',
    1085             :              workMem, randomAccess ? 't' : 'f');
    1086             : #endif
    1087             : 
    1088      187798 :     state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
    1089             : 
    1090             :     TRACE_POSTGRESQL_SORT_START(INDEX_SORT,
    1091             :                                 enforceUnique,
    1092             :                                 state->nKeys,
    1093             :                                 workMem,
    1094             :                                 randomAccess,
    1095             :                                 PARALLEL_SORT(state));
    1096             : 
    1097      187798 :     state->comparetup = comparetup_index_btree;
    1098      187798 :     state->copytup = copytup_index;
    1099      187798 :     state->writetup = writetup_index;
    1100      187798 :     state->readtup = readtup_index;
    1101      187798 :     state->abbrevNext = 10;
    1102             : 
    1103      187798 :     state->heapRel = heapRel;
    1104      187798 :     state->indexRel = indexRel;
    1105      187798 :     state->enforceUnique = enforceUnique;
    1106             : 
    1107      187798 :     indexScanKey = _bt_mkscankey(indexRel, NULL);
    1108             : 
    1109             :     /* Prepare SortSupport data for each column */
    1110      187798 :     state->sortKeys = (SortSupport) palloc0(state->nKeys *
    1111             :                                             sizeof(SortSupportData));
    1112             : 
    1113      508486 :     for (i = 0; i < state->nKeys; i++)
    1114             :     {
    1115      320688 :         SortSupport sortKey = state->sortKeys + i;
    1116      320688 :         ScanKey     scanKey = indexScanKey->scankeys + i;
    1117             :         int16       strategy;
    1118             : 
    1119      320688 :         sortKey->ssup_cxt = CurrentMemoryContext;
    1120      320688 :         sortKey->ssup_collation = scanKey->sk_collation;
    1121      320688 :         sortKey->ssup_nulls_first =
    1122      320688 :             (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
    1123      320688 :         sortKey->ssup_attno = scanKey->sk_attno;
    1124             :         /* Convey if abbreviation optimization is applicable in principle */
    1125      320688 :         sortKey->abbreviate = (i == 0);
    1126             : 
    1127             :         AssertState(sortKey->ssup_attno != 0);
    1128             : 
    1129      320688 :         strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
    1130             :             BTGreaterStrategyNumber : BTLessStrategyNumber;
    1131             : 
    1132      320688 :         PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
    1133             :     }
    1134             : 
    1135      187798 :     pfree(indexScanKey);
    1136             : 
    1137      187798 :     MemoryContextSwitchTo(oldcontext);
    1138             : 
    1139      187798 :     return state;
    1140             : }
    1141             : 
    1142             : Tuplesortstate *
    1143           4 : tuplesort_begin_index_hash(Relation heapRel,
    1144             :                            Relation indexRel,
    1145             :                            uint32 high_mask,
    1146             :                            uint32 low_mask,
    1147             :                            uint32 max_buckets,
    1148             :                            int workMem,
    1149             :                            SortCoordinate coordinate,
    1150             :                            bool randomAccess)
    1151             : {
    1152           4 :     Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
    1153             :                                                    randomAccess);
    1154             :     MemoryContext oldcontext;
    1155             : 
    1156           4 :     oldcontext = MemoryContextSwitchTo(state->maincontext);
    1157             : 
    1158             : #ifdef TRACE_SORT
    1159           4 :     if (trace_sort)
    1160           0 :         elog(LOG,
    1161             :              "begin index sort: high_mask = 0x%x, low_mask = 0x%x, "
    1162             :              "max_buckets = 0x%x, workMem = %d, randomAccess = %c",
    1163             :              high_mask,
    1164             :              low_mask,
    1165             :              max_buckets,
    1166             :              workMem, randomAccess ? 't' : 'f');
    1167             : #endif
    1168             : 
    1169           4 :     state->nKeys = 1;            /* Only one sort column, the hash code */
    1170             : 
    1171           4 :     state->comparetup = comparetup_index_hash;
    1172           4 :     state->copytup = copytup_index;
    1173           4 :     state->writetup = writetup_index;
    1174           4 :     state->readtup = readtup_index;
    1175             : 
    1176           4 :     state->heapRel = heapRel;
    1177           4 :     state->indexRel = indexRel;
    1178             : 
    1179           4 :     state->high_mask = high_mask;
    1180           4 :     state->low_mask = low_mask;
    1181           4 :     state->max_buckets = max_buckets;
    1182             : 
    1183           4 :     MemoryContextSwitchTo(oldcontext);
    1184             : 
    1185           4 :     return state;
    1186             : }
    1187             : 
    1188             : Tuplesortstate *
    1189         118 : tuplesort_begin_index_gist(Relation heapRel,
    1190             :                            Relation indexRel,
    1191             :                            int workMem,
    1192             :                            SortCoordinate coordinate,
    1193             :                            bool randomAccess)
    1194             : {
    1195         118 :     Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
    1196             :                                                    randomAccess);
    1197             :     MemoryContext oldcontext;
    1198             :     int         i;
    1199             : 
    1200         118 :     oldcontext = MemoryContextSwitchTo(state->sortcontext);
    1201             : 
    1202             : #ifdef TRACE_SORT
    1203         118 :     if (trace_sort)
    1204           0 :         elog(LOG,
    1205             :              "begin index sort: workMem = %d, randomAccess = %c",
    1206             :              workMem, randomAccess ? 't' : 'f');
    1207             : #endif
    1208             : 
    1209         118 :     state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
    1210             : 
    1211         118 :     state->comparetup = comparetup_index_btree;
    1212         118 :     state->copytup = copytup_index;
    1213         118 :     state->writetup = writetup_index;
    1214         118 :     state->readtup = readtup_index;
    1215             : 
    1216         118 :     state->heapRel = heapRel;
    1217         118 :     state->indexRel = indexRel;
    1218             : 
    1219             :     /* Prepare SortSupport data for each column */
    1220         118 :     state->sortKeys = (SortSupport) palloc0(state->nKeys *
    1221             :                                             sizeof(SortSupportData));
    1222             : 
    1223         236 :     for (i = 0; i < state->nKeys; i++)
    1224             :     {
    1225         118 :         SortSupport sortKey = state->sortKeys + i;
    1226             : 
    1227         118 :         sortKey->ssup_cxt = CurrentMemoryContext;
    1228         118 :         sortKey->ssup_collation = indexRel->rd_indcollation[i];
    1229         118 :         sortKey->ssup_nulls_first = false;
    1230         118 :         sortKey->ssup_attno = i + 1;
    1231             :         /* Convey if abbreviation optimization is applicable in principle */
    1232         118 :         sortKey->abbreviate = (i == 0);
    1233             : 
    1234             :         AssertState(sortKey->ssup_attno != 0);
    1235             : 
    1236             :         /* Look for a sort support function */
    1237         118 :         PrepareSortSupportFromGistIndexRel(indexRel, sortKey);
    1238             :     }
    1239             : 
    1240         118 :     MemoryContextSwitchTo(oldcontext);
    1241             : 
    1242         118 :     return state;
    1243             : }
    1244             : 
    1245             : Tuplesortstate *
    1246       73316 : tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
    1247             :                       bool nullsFirstFlag, int workMem,
    1248             :                       SortCoordinate coordinate, bool randomAccess)
    1249             : {
    1250       73316 :     Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
    1251             :                                                    randomAccess);
    1252             :     MemoryContext oldcontext;
    1253             :     int16       typlen;
    1254             :     bool        typbyval;
    1255             : 
    1256       73316 :     oldcontext = MemoryContextSwitchTo(state->maincontext);
    1257             : 
    1258             : #ifdef TRACE_SORT
    1259       73316 :     if (trace_sort)
    1260           0 :         elog(LOG,
    1261             :              "begin datum sort: workMem = %d, randomAccess = %c",
    1262             :              workMem, randomAccess ? 't' : 'f');
    1263             : #endif
    1264             : 
    1265       73316 :     state->nKeys = 1;            /* always a one-column sort */
    1266             : 
    1267             :     TRACE_POSTGRESQL_SORT_START(DATUM_SORT,
    1268             :                                 false,  /* no unique check */
    1269             :                                 1,
    1270             :                                 workMem,
    1271             :                                 randomAccess,
    1272             :                                 PARALLEL_SORT(state));
    1273             : 
    1274       73316 :     state->comparetup = comparetup_datum;
    1275       73316 :     state->copytup = copytup_datum;
    1276       73316 :     state->writetup = writetup_datum;
    1277       73316 :     state->readtup = readtup_datum;
    1278       73316 :     state->abbrevNext = 10;
    1279             : 
    1280       73316 :     state->datumType = datumType;
    1281             : 
    1282             :     /* lookup necessary attributes of the datum type */
    1283       73316 :     get_typlenbyval(datumType, &typlen, &typbyval);
    1284       73316 :     state->datumTypeLen = typlen;
    1285       73316 :     state->tuples = !typbyval;
    1286             : 
    1287             :     /* Prepare SortSupport data */
    1288       73316 :     state->sortKeys = (SortSupport) palloc0(sizeof(SortSupportData));
    1289             : 
    1290       73316 :     state->sortKeys->ssup_cxt = CurrentMemoryContext;
    1291       73316 :     state->sortKeys->ssup_collation = sortCollation;
    1292       73316 :     state->sortKeys->ssup_nulls_first = nullsFirstFlag;
    1293             : 
    1294             :     /*
    1295             :      * Abbreviation is possible here only for by-reference types.  In theory,
    1296             :      * a pass-by-value datatype could have an abbreviated form that is cheaper
    1297             :      * to compare.  In a tuple sort, we could support that, because we can
    1298             :      * always extract the original datum from the tuple as needed.  Here, we
    1299             :      * can't, because a datum sort only stores a single copy of the datum; the
    1300             :      * "tuple" field of each SortTuple is NULL.
    1301             :      */
    1302       73316 :     state->sortKeys->abbreviate = !typbyval;
    1303             : 
    1304       73316 :     PrepareSortSupportFromOrderingOp(sortOperator, state->sortKeys);
    1305             : 
    1306             :     /*
    1307             :      * The "onlyKey" optimization cannot be used with abbreviated keys, since
    1308             :      * tie-breaker comparisons may be required.  Typically, the optimization
    1309             :      * is only of value to pass-by-value types anyway, whereas abbreviated
    1310             :      * keys are typically only of value to pass-by-reference types.
    1311             :      */
    1312       73316 :     if (!state->sortKeys->abbrev_converter)
    1313       73136 :         state->onlyKey = state->sortKeys;
    1314             : 
    1315       73316 :     MemoryContextSwitchTo(oldcontext);
    1316             : 
    1317       73316 :     return state;
    1318             : }
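
A parallel sketch for the datum case; INT4OID, Int4LessOperator, and the input/use_value helpers are again illustrative assumptions rather than anything defined in this file.

    /* Illustrative sketch: sort bare int4 Datums in ascending order. */
    Tuplesortstate *sortstate;
    Datum       value;
    bool        isnull;

    sortstate = tuplesort_begin_datum(INT4OID, Int4LessOperator,
                                      InvalidOid, false /* nulls last */,
                                      work_mem, NULL, false);

    for (int i = 0; i < ninputs; i++)       /* assumed input array */
        tuplesort_putdatum(sortstate, Int32GetDatum(input[i]), false);

    tuplesort_performsort(sortstate);

    while (tuplesort_getdatum(sortstate, true, &value, &isnull, NULL))
        use_value(DatumGetInt32(value));    /* assumed consumer */

    tuplesort_end(sortstate);
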
    1319             : 
    1320             : /*
    1321             :  * tuplesort_set_bound
    1322             :  *
    1323             :  *  Advise tuplesort that at most the first N result tuples are required.
    1324             :  *
    1325             :  * Must be called before inserting any tuples.  (Actually, we could allow it
    1326             :  * as long as the sort hasn't spilled to disk, but there seems no need for
    1327             :  * delayed calls at the moment.)
    1328             :  *
    1329             :  * This is a hint only. The tuplesort may still return more tuples than
    1330             :  * requested.  Parallel leader tuplesorts will always ignore the hint.
    1331             :  */
    1332             : void
    1333        1276 : tuplesort_set_bound(Tuplesortstate *state, int64 bound)
    1334             : {
    1335             :     /* Assert we're called before loading any tuples */
    1336             :     Assert(state->status == TSS_INITIAL && state->memtupcount == 0);
    1337             :     /* Can't set the bound twice, either */
    1338             :     Assert(!state->bounded);
    1339             :     /* Also, this shouldn't be called in a parallel worker */
    1340             :     Assert(!WORKER(state));
    1341             : 
    1342             :     /* Parallel leader allows but ignores hint */
    1343        1276 :     if (LEADER(state))
    1344           0 :         return;
    1345             : 
    1346             : #ifdef DEBUG_BOUNDED_SORT
    1347             :     /* Honor GUC setting that disables the feature (for easy testing) */
    1348             :     if (!optimize_bounded_sort)
    1349             :         return;
    1350             : #endif
    1351             : 
    1352             :     /* We want to be able to compute bound * 2, so limit the setting */
    1353        1276 :     if (bound > (int64) (INT_MAX / 2))
    1354           0 :         return;
    1355             : 
    1356        1276 :     state->bounded = true;
    1357        1276 :     state->bound = (int) bound;
    1358             : 
    1359             :     /*
    1360             :      * Bounded sorts are not an effective target for abbreviated key
    1361             :      * optimization.  Disable by setting state to be consistent with no
    1362             :      * abbreviation support.
    1363             :      */
    1364        1276 :     state->sortKeys->abbrev_converter = NULL;
    1365        1276 :     if (state->sortKeys->abbrev_full_comparator)
    1366           8 :         state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
    1367             : 
    1368             :     /* Not strictly necessary, but be tidy */
    1369        1276 :     state->sortKeys->abbrev_abort = NULL;
    1370        1276 :     state->sortKeys->abbrev_full_comparator = NULL;
    1371             : }
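
A sketch of how a LIMIT-style caller combines the bound with the normal lifecycle; the executor's Sort node does essentially this when the plan carries a bound, though the exact shape below (and the reused tupDesc/attNum/sortOp variables from the earlier sketch) is illustrative.

    /* Illustrative sketch: top-N sort keeping only the 10 best tuples. */
    Tuplesortstate *sortstate = tuplesort_begin_heap(tupDesc, 1, &attNum,
                                                     &sortOp, &collation,
                                                     &nullsFirst, work_mem,
                                                     NULL, false);

    tuplesort_set_bound(sortstate, 10);     /* before inserting any tuples */

    while (fetch_next_slot(slot))           /* assumed input loop */
        tuplesort_puttupleslot(sortstate, slot);

    tuplesort_performsort(sortstate);

    /* If the hint was honored, at most 10 tuples come back sorted. */
    if (tuplesort_used_bound(sortstate))
        elog(DEBUG1, "bounded (top-N) heapsort was used");
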
    1372             : 
    1373             : /*
    1374             :  * tuplesort_used_bound
    1375             :  *
    1376             :  * Allow callers to find out if the sort state was able to use a bound.
    1377             :  */
    1378             : bool
    1379          70 : tuplesort_used_bound(Tuplesortstate *state)
    1380             : {
    1381          70 :     return state->boundUsed;
    1382             : }
    1383             : 
    1384             : /*
    1385             :  * tuplesort_free
    1386             :  *
    1387             :  *  Internal routine for freeing the resources of a tuplesort.
    1388             :  */
    1389             : static void
    1390     1932312 : tuplesort_free(Tuplesortstate *state)
    1391             : {
    1392             :     /* context swap probably not needed, but let's be safe */
    1393     1932312 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    1394             : 
    1395             : #ifdef TRACE_SORT
    1396             :     long        spaceUsed;
    1397             : 
    1398     1932312 :     if (state->tapeset)
    1399         456 :         spaceUsed = LogicalTapeSetBlocks(state->tapeset);
    1400             :     else
    1401     1931856 :         spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
    1402             : #endif
    1403             : 
    1404             :     /*
    1405             :      * Delete temporary "tape" files, if any.
    1406             :      *
    1407             :      * Note: want to include this in reported total cost of sort, hence need
    1408             :      * for two #ifdef TRACE_SORT sections.
    1409             :      *
    1410             :      * We don't bother to destroy the individual tapes here. They will go away
    1411             :      * with the sortcontext.  (In TSS_FINALMERGE state, we have closed
    1412             :      * finished tapes already.)
    1413             :      */
    1414     1932312 :     if (state->tapeset)
    1415         456 :         LogicalTapeSetClose(state->tapeset);
    1416             : 
    1417             : #ifdef TRACE_SORT
    1418     1932312 :     if (trace_sort)
    1419             :     {
    1420           0 :         if (state->tapeset)
    1421           0 :             elog(LOG, "%s of worker %d ended, %ld disk blocks used: %s",
    1422             :                  SERIAL(state) ? "external sort" : "parallel external sort",
    1423             :                  state->worker, spaceUsed, pg_rusage_show(&state->ru_start));
    1424             :         else
    1425           0 :             elog(LOG, "%s of worker %d ended, %ld KB used: %s",
    1426             :                  SERIAL(state) ? "internal sort" : "unperformed parallel sort",
    1427             :                  state->worker, spaceUsed, pg_rusage_show(&state->ru_start));
    1428             :     }
    1429             : 
    1430             :     TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);
    1431             : #else
    1432             : 
    1433             :     /*
    1434             :      * If TRACE_SORT is disabled, you can still probe sort__done, but you
    1435             :      * won't get space-used stats.
    1436             :      */
    1437             :     TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
    1438             : #endif
    1439             : 
    1440             :     /* Free any execution state created for CLUSTER case */
    1441     1932312 :     if (state->estate != NULL)
    1442             :     {
    1443           8 :         ExprContext *econtext = GetPerTupleExprContext(state->estate);
    1444             : 
    1445           8 :         ExecDropSingleTupleTableSlot(econtext->ecxt_scantuple);
    1446           8 :         FreeExecutorState(state->estate);
    1447             :     }
    1448             : 
    1449     1932312 :     MemoryContextSwitchTo(oldcontext);
    1450             : 
    1451             :     /*
    1452             :      * Free the per-sort memory context, thereby releasing all working memory.
    1453             :      */
    1454     1932312 :     MemoryContextReset(state->sortcontext);
    1455     1932312 : }
    1456             : 
    1457             : /*
    1458             :  * tuplesort_end
    1459             :  *
    1460             :  *  Release resources and clean up.
    1461             :  *
    1462             :  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
    1463             :  * pointing to garbage.  Be careful not to attempt to use or free such
    1464             :  * pointers afterwards!
    1465             :  */
    1466             : void
    1467     1932112 : tuplesort_end(Tuplesortstate *state)
    1468             : {
    1469     1932112 :     tuplesort_free(state);
    1470             : 
    1471             :     /*
    1472             :      * Free the main memory context, including the Tuplesortstate struct
    1473             :      * itself.
    1474             :      */
    1475     1932112 :     MemoryContextDelete(state->maincontext);
    1476     1932112 : }
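
Given the warning above, a caller that must keep a returned tuple alive past tuplesort_end should ask tuplesort_gettupleslot for its own copy. A minimal sketch, assuming sortstate and slot from the earlier examples:

    /*
     * Illustrative sketch: copy = true gives the slot a tuple that is
     * independent of the sort's memory contexts, so it remains valid
     * after tuplesort_end() releases them.
     */
    if (tuplesort_gettupleslot(sortstate, true, true /* copy */, slot, NULL))
    {
        tuplesort_end(sortstate);
        /* the slot's contents are still safe to use here */
    }
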
    1477             : 
    1478             : /*
    1479             :  * tuplesort_updatemax
    1480             :  *
    1481             :  *  Update maximum resource usage statistics.
    1482             :  */
    1483             : static void
    1484         396 : tuplesort_updatemax(Tuplesortstate *state)
    1485             : {
    1486             :     int64       spaceUsed;
    1487             :     bool        isSpaceDisk;
    1488             : 
    1489             :     /*
    1490             :      * Note: it might seem we should provide both memory and disk usage for a
    1491             :      * disk-based sort.  However, the current code doesn't track memory space
    1492             :      * accurately once we have begun to return tuples to the caller (since we
    1493             :      * don't account for pfree's the caller is expected to do), so we cannot
    1494             :      * rely on availMem in a disk sort.  This does not seem worth the overhead
    1495             :      * to fix.  Is it worth creating an API for the memory context code to
    1496             :      * tell us how much is actually used in sortcontext?
    1497             :      */
    1498         396 :     if (state->tapeset)
    1499             :     {
    1500           4 :         isSpaceDisk = true;
    1501           4 :         spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
    1502             :     }
    1503             :     else
    1504             :     {
    1505         392 :         isSpaceDisk = false;
    1506         392 :         spaceUsed = state->allowedMem - state->availMem;
    1507             :     }
    1508             : 
    1509             :     /*
    1510             :      * A sort spills data to disk when that data cannot fit into main
    1511             :      * memory.  For that reason, we treat space used on disk as more
    1512             :      * important for tracking resource usage than space used in memory.
    1513             :      * Note that the amount of space occupied by a set of tuples on disk
    1514             :      * might be less than the amount occupied by the same tuples in memory,
    1515             :      * owing to a more compact on-disk representation.
    1516             :      */
    1517         396 :     if ((isSpaceDisk && !state->isMaxSpaceDisk) ||
    1518         392 :         (isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))
    1519             :     {
    1520         188 :         state->maxSpace = spaceUsed;
    1521         188 :         state->isMaxSpaceDisk = isSpaceDisk;
    1522         188 :         state->maxSpaceStatus = state->status;
    1523             :     }
    1524         396 : }
    1525             : 
    1526             : /*
    1527             :  * tuplesort_reset
    1528             :  *
    1529             :  *  Reset the tuplesort.  Discard all the data in the tuplesort, but keep
    1530             :  *  the meta-information.  After tuplesort_reset, the tuplesort is ready to
    1531             :  *  start a new sort.  This avoids recreating tuplesort states (and thus
    1532             :  *  saves resources) when sorting multiple small batches.
    1533             :  */
    1534             : void
    1535         200 : tuplesort_reset(Tuplesortstate *state)
    1536             : {
    1537         200 :     tuplesort_updatemax(state);
    1538         200 :     tuplesort_free(state);
    1539             : 
    1540             :     /*
    1541             :      * After we've freed up per-batch memory, set up again all of the state
    1542             :      * common to both the first batch and any subsequent batch.
    1543             :      */
    1544         200 :     tuplesort_begin_batch(state);
    1545             : 
    1546         200 :     state->lastReturnedTuple = NULL;
    1547         200 :     state->slabMemoryBegin = NULL;
    1548         200 :     state->slabMemoryEnd = NULL;
    1549         200 :     state->slabFreeHead = NULL;
    1550         200 : }
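
A sketch of the multi-batch pattern this enables; the load_batch and drain_batch helpers are assumed stand-ins for the put/performsort/get sequences shown earlier.

    /* Illustrative sketch: reuse one sort state across many small batches. */
    Tuplesortstate *sortstate = tuplesort_begin_datum(INT4OID,
                                                      Int4LessOperator,
                                                      InvalidOid, false,
                                                      work_mem, NULL, false);

    for (int batch = 0; batch < nbatches; batch++)  /* assumed loop */
    {
        load_batch(sortstate, batch);   /* assumed: tuplesort_putdatum calls */
        tuplesort_performsort(sortstate);
        drain_batch(sortstate);         /* assumed: tuplesort_getdatum calls */
        tuplesort_reset(sortstate);     /* ready for the next batch */
    }

    tuplesort_end(sortstate);
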
    1551             : 
    1552             : /*
    1553             :  * Grow the memtuples[] array, if possible within our memory constraint.  We
    1554             :  * must not exceed INT_MAX tuples in memory or the caller-provided memory
    1555             :  * limit.  Return true if we were able to enlarge the array, false if not.
    1556             :  *
    1557             :  * Normally, at each increment we double the size of the array.  When doing
    1558             :  * that would exceed a limit, we attempt one last, smaller increase (and then
    1559             :  * clear the growmemtuples flag so we don't try any more).  That allows us to
    1560             :  * use memory as fully as permitted; sticking to the pure doubling rule could
    1561             :  * result in almost half going unused.  Because availMem moves around with
    1562             :  * tuple addition/removal, we need some rule to prevent making repeated small
    1563             :  * increases in memtupsize, which would just be useless thrashing.  The
    1564             :  * growmemtuples flag accomplishes that and also prevents useless
    1565             :  * recalculations in this function.
    1566             :  */
    1567             : static bool
    1568        8100 : grow_memtuples(Tuplesortstate *state)
    1569             : {
    1570             :     int         newmemtupsize;
    1571        8100 :     int         memtupsize = state->memtupsize;
    1572        8100 :     int64       memNowUsed = state->allowedMem - state->availMem;
    1573             : 
    1574             :     /* Forget it if we've already maxed out memtuples, per comment above */
    1575        8100 :     if (!state->growmemtuples)
    1576          52 :         return false;
    1577             : 
    1578             :     /* Select new value of memtupsize */
    1579        8048 :     if (memNowUsed <= state->availMem)
    1580             :     {
    1581             :         /*
    1582             :          * We've used no more than half of allowedMem; double our usage,
    1583             :          * clamping at INT_MAX tuples.
    1584             :          */
    1585        7988 :         if (memtupsize < INT_MAX / 2)
    1586        7988 :             newmemtupsize = memtupsize * 2;
    1587             :         else
    1588             :         {
    1589           0 :             newmemtupsize = INT_MAX;
    1590           0 :             state->growmemtuples = false;
    1591             :         }
    1592             :     }
    1593             :     else
    1594             :     {
    1595             :         /*
    1596             :          * This will be the last increment of memtupsize.  Abandon doubling
    1597             :          * strategy and instead increase as much as we safely can.
    1598             :          *
    1599             :          * To stay within allowedMem, we can't increase memtupsize by more
    1600             :          * than availMem / sizeof(SortTuple) elements.  In practice, we want
    1601             :          * to increase it by considerably less, because we need to leave some
    1602             :          * space for the tuples to which the new array slots will refer.  We
    1603             :          * assume the new tuples will be about the same size as the tuples
    1604             :          * we've already seen, and thus we can extrapolate from the space
    1605             :          * consumption so far to estimate an appropriate new size for the
    1606             :          * memtuples array.  The optimal value might be higher or lower than
    1607             :          * this estimate, but it's hard to know that in advance.  We again
    1608             :          * clamp at INT_MAX tuples.
    1609             :          *
    1610             :          * This calculation is safe against enlarging the array so much that
    1611             :          * LACKMEM becomes true, because the memory currently used includes
    1612             :          * the present array; thus, there would be enough allowedMem for the
    1613             :          * new array elements even if no other memory were currently used.
    1614             :          *
    1615             :          * We do the arithmetic in float8, because otherwise the product of
    1616             :          * memtupsize and allowedMem could overflow.  Any inaccuracy in the
    1617             :          * result should be insignificant; but even if we computed a
    1618             :          * completely insane result, the checks below will prevent anything
    1619             :          * really bad from happening.
    1620             :          */
    1621             :         double      grow_ratio;
    1622             : 
    1623          60 :         grow_ratio = (double) state->allowedMem / (double) memNowUsed;
    1624          60 :         if (memtupsize * grow_ratio < INT_MAX)
    1625          60 :             newmemtupsize = (int) (memtupsize * grow_ratio);
    1626             :         else
    1627           0 :             newmemtupsize = INT_MAX;
    1628             : 
    1629             :         /* We won't make any further enlargement attempts */
    1630          60 :         state->growmemtuples = false;
    1631             :     }
    1632             : 
    1633             :     /* Must enlarge array by at least one element, else report failure */
    1634        8048 :     if (newmemtupsize <= memtupsize)
    1635           0 :         goto noalloc;
    1636             : 
    1637             :     /*
    1638             :      * On a 32-bit machine, allowedMem could exceed MaxAllocHugeSize.  Clamp
    1639             :      * to ensure our request won't be rejected.  Note that we can easily
    1640             :      * exhaust address space before facing this outcome.  (This is presently
    1641             :      * impossible due to guc.c's MAX_KILOBYTES limitation on work_mem, but
    1642             :      * don't rely on that at this distance.)
    1643             :      */
    1644        8048 :     if ((Size) newmemtupsize >= MaxAllocHugeSize / sizeof(SortTuple))
    1645             :     {
    1646           0 :         newmemtupsize = (int) (MaxAllocHugeSize / sizeof(SortTuple));
    1647           0 :         state->growmemtuples = false;    /* can't grow any more */
    1648             :     }
    1649             : 
    1650             :     /*
    1651             :      * We need to be sure that we do not cause LACKMEM to become true, else
    1652             :      * the space management algorithm will go nuts.  The code above should
    1653             :      * never generate a dangerous request, but to be safe, check explicitly
    1654             :      * that the array growth fits within availMem.  (We could still cause
    1655             :      * LACKMEM if the memory chunk overhead associated with the memtuples
    1656             :      * array were to increase.  That shouldn't happen because we chose the
    1657             :      * initial array size large enough to ensure that palloc will be treating
    1658             :      * both old and new arrays as separate chunks.  But we'll check LACKMEM
    1659             :      * explicitly below just in case.)
    1660             :      */
    1661        8048 :     if (state->availMem < (int64) ((newmemtupsize - memtupsize) * sizeof(SortTuple)))
    1662           0 :         goto noalloc;
    1663             : 
    1664             :     /* OK, do it */
    1665        8048 :     FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
    1666        8048 :     state->memtupsize = newmemtupsize;
    1667        8048 :     state->memtuples = (SortTuple *)
    1668        8048 :         repalloc_huge(state->memtuples,
    1669        8048 :                       state->memtupsize * sizeof(SortTuple));
    1670        8048 :     USEMEM(state, GetMemoryChunkSpace(state->memtuples));
    1671        8048 :     if (LACKMEM(state))
    1672           0 :         elog(ERROR, "unexpected out-of-memory situation in tuplesort");
    1673        8048 :     return true;
    1674             : 
    1675           0 : noalloc:
    1676             :     /* If for any reason we didn't realloc, shut off future attempts */
    1677           0 :     state->growmemtuples = false;
    1678           0 :     return false;
    1679             : }
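
To make the final-growth arithmetic concrete, with assumed numbers: if allowedMem is 4 MB and memNowUsed is 3 MB when growth is requested, availMem is only 1 MB, so the doubling branch is rejected and the last-increment path runs. grow_ratio = allowedMem / memNowUsed = 4/3, so a 30,000-slot memtuples array is enlarged once more to about 40,000 slots and growmemtuples is cleared; the estimate presumes that future tuples will average the same size as the 30,000 already absorbed.
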
    1680             : 
    1681             : /*
    1682             :  * Accept one tuple while collecting input data for sort.
    1683             :  *
    1684             :  * Note that the input data is always copied; the caller need not save it.
    1685             :  */
    1686             : void
    1687     7935892 : tuplesort_puttupleslot(Tuplesortstate *state, TupleTableSlot *slot)
    1688             : {
    1689     7935892 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    1690             :     SortTuple   stup;
    1691             : 
    1692             :     /*
    1693             :      * Copy the given tuple into memory we control, and decrease availMem.
    1694             :      * Then call the common code.
    1695             :      */
    1696     7935892 :     COPYTUP(state, &stup, (void *) slot);
    1697             : 
    1698     7935892 :     puttuple_common(state, &stup);
    1699             : 
    1700     7935892 :     MemoryContextSwitchTo(oldcontext);
    1701     7935892 : }
    1702             : 
    1703             : /*
    1704             :  * Accept one tuple while collecting input data for sort.
    1705             :  *
    1706             :  * Note that the input data is always copied; the caller need not save it.
    1707             :  */
    1708             : void
    1709      366434 : tuplesort_putheaptuple(Tuplesortstate *state, HeapTuple tup)
    1710             : {
    1711      366434 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    1712             :     SortTuple   stup;
    1713             : 
    1714             :     /*
    1715             :      * Copy the given tuple into memory we control, and decrease availMem.
    1716             :      * Then call the common code.
    1717             :      */
    1718      366434 :     COPYTUP(state, &stup, (void *) tup);
    1719             : 
    1720      366434 :     puttuple_common(state, &stup);
    1721             : 
    1722      366434 :     MemoryContextSwitchTo(oldcontext);
    1723      366434 : }
    1724             : 
    1725             : /*
    1726             :  * Collect one index tuple while collecting input data for sort, building
    1727             :  * it from caller-supplied values.
    1728             :  */
    1729             : void
    1730    19350504 : tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
    1731             :                               ItemPointer self, Datum *values,
    1732             :                               bool *isnull)
    1733             : {
    1734    19350504 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
    1735             :     SortTuple   stup;
    1736             :     Datum       original;
    1737             :     IndexTuple  tuple;
    1738             : 
    1739    19350504 :     stup.tuple = index_form_tuple(RelationGetDescr(rel), values, isnull);
    1740    19350504 :     tuple = ((IndexTuple) stup.tuple);
    1741    19350504 :     tuple->t_tid = *self;
    1742    19350504 :     USEMEM(state, GetMemoryChunkSpace(stup.tuple));
    1743             :     /* set up first-column key value */
    1744    19350504 :     original = index_getattr(tuple,
    1745             :                              1,
    1746             :                              RelationGetDescr(state->indexRel),
    1747             :                              &stup.isnull1);
    1748             : 
    1749    19350504 :     MemoryContextSwitchTo(state->sortcontext);
    1750             : 
    1751    19350504 :     if (!state->sortKeys || !state->sortKeys->abbrev_converter || stup.isnull1)
    1752             :     {
    1753             :         /*
    1754             :          * Store the ordinary Datum representation, or the NULL value.  If
    1755             :          * there is a converter, it won't expect NULL values, and the cost
    1756             :          * model is not required to account for NULLs, so in that case we
    1757             :          * avoid calling the converter and just set datum1 to a zeroed
    1758             :          * representation (to be consistent, and to support cheap inequality
    1759             :          * tests for NULL abbreviated keys).
    1760             :          */
    1761    17959968 :         stup.datum1 = original;
    1762             :     }
    1763     1390536 :     else if (!consider_abort_common(state))
    1764             :     {
    1765             :         /* Store abbreviated key representation */
    1766     1390496 :         stup.datum1 = state->sortKeys->abbrev_converter(original,
    1767             :                                                         state->sortKeys);
    1768             :     }
    1769             :     else
    1770             :     {
    1771             :         /* Abort abbreviation */
    1772             :         int         i;
    1773             : 
    1774          40 :         stup.datum1 = original;
    1775             : 
    1776             :         /*
    1777             :          * Set state to be consistent with never trying abbreviation.
    1778             :          *
    1779             :          * Alter datum1 representation in already-copied tuples, so as to
    1780             :          * ensure a consistent representation (current tuple was just
    1781             :          * handled).  It does not matter if some dumped tuples are already
    1782             :          * sorted on tape, since serialized tuples lack abbreviated keys
    1783             :          * (TSS_BUILDRUNS state prevents control reaching here in any case).
    1784             :          */
    1785      409640 :         for (i = 0; i < state->memtupcount; i++)
    1786             :         {
    1787      409600 :             SortTuple  *mtup = &state->memtuples[i];
    1788             : 
    1789      409600 :             tuple = mtup->tuple;
    1790      409600 :             mtup->datum1 = index_getattr(tuple,
    1791             :                                          1,
    1792             :                                          RelationGetDescr(state->indexRel),
    1793             :                                          &mtup->isnull1);
    1794             :         }
    1795             :     }
    1796             : 
    1797    19350504 :     puttuple_common(state, &stup);
    1798             : 
    1799    19350504 :     MemoryContextSwitchTo(oldcontext);
    1800    19350504 : }
    1801             : 
    1802             : /*
    1803             :  * Accept one Datum while collecting input data for sort.
    1804             :  *
    1805             :  * If the Datum is pass-by-ref type, the value will be copied.
    1806             :  */
    1807             : void
    1808     2585898 : tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
    1809             : {
    1810     2585898 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
    1811             :     SortTuple   stup;
    1812             : 
    1813             :     /*
    1814             :      * Pass-by-value types or null values are just stored directly in
    1815             :      * stup.datum1 (and stup.tuple is not used and set to NULL).
    1816             :      *
    1817             :      * Non-null pass-by-reference values need to be copied into memory we
    1818             :      * control, and possibly abbreviated. The copied value is pointed to by
    1819             :      * stup.tuple and is treated as the canonical copy (e.g. to return via
    1820             :      * tuplesort_getdatum or when writing to tape); stup.datum1 gets the
    1821             :      * abbreviated value if abbreviation is happening, otherwise it's
    1822             :      * identical to stup.tuple.
    1823             :      */
    1824             : 
    1825     2585898 :     if (isNull || !state->tuples)
    1826             :     {
    1827             :         /*
    1828             :          * Set datum1 to zeroed representation for NULLs (to be consistent,
    1829             :          * and to support cheap inequality tests for NULL abbreviated keys).
    1830             :          */
    1831     1366064 :         stup.datum1 = !isNull ? val : (Datum) 0;
    1832     1366064 :         stup.isnull1 = isNull;
    1833     1366064 :         stup.tuple = NULL;      /* no separate storage */
    1834     1366064 :         MemoryContextSwitchTo(state->sortcontext);
    1835             :     }
    1836             :     else
    1837             :     {
    1838     1219834 :         Datum       original = datumCopy(val, false, state->datumTypeLen);
    1839             : 
    1840     1219834 :         stup.isnull1 = false;
    1841     1219834 :         stup.tuple = DatumGetPointer(original);
    1842     1219834 :         USEMEM(state, GetMemoryChunkSpace(stup.tuple));
    1843     1219834 :         MemoryContextSwitchTo(state->sortcontext);
    1844             : 
    1845     1219834 :         if (!state->sortKeys->abbrev_converter)
    1846             :         {
    1847      816968 :             stup.datum1 = original;
    1848             :         }
    1849      402866 :         else if (!consider_abort_common(state))
    1850             :         {
    1851             :             /* Store abbreviated key representation */
    1852      402858 :             stup.datum1 = state->sortKeys->abbrev_converter(original,
    1853             :                                                             state->sortKeys);
    1854             :         }
    1855             :         else
    1856             :         {
    1857             :             /* Abort abbreviation */
    1858             :             int         i;
    1859             : 
    1860           8 :             stup.datum1 = original;
    1861             : 
    1862             :             /*
    1863             :              * Set state to be consistent with never trying abbreviation.
    1864             :              *
    1865             :              * Alter datum1 representation in already-copied tuples, so as to
    1866             :              * ensure a consistent representation (current tuple was just
    1867             :              * handled).  It does not matter if some dumped tuples are already
    1868             :              * sorted on tape, since serialized tuples lack abbreviated keys
    1869             :              * (TSS_BUILDRUNS state prevents control reaching here in any
    1870             :              * case).
    1871             :              */
    1872       81928 :             for (i = 0; i < state->memtupcount; i++)
    1873             :             {
    1874       81920 :                 SortTuple  *mtup = &state->memtuples[i];
    1875             : 
    1876       81920 :                 mtup->datum1 = PointerGetDatum(mtup->tuple);
    1877             :             }
    1878             :         }
    1879             :     }
    1880             : 
    1881     2585898 :     puttuple_common(state, &stup);
    1882             : 
    1883     2585898 :     MemoryContextSwitchTo(oldcontext);
    1884     2585898 : }
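                     : 
                     : /*
                     :  * Editor's note: a minimal sketch of the datum-sort lifecycle that feeds
                     :  * tuplesort_putdatum().  The begin/put/performsort/get/end sequence is this
                     :  * module's real API; the operator constant, work_mem usage, and the exact
                     :  * begin signature shown are assumptions (see tuplesort.h for the
                     :  * authoritative declarations).
                     :  */
                     : #ifdef EDITORS_EXAMPLE_NOT_COMPILED
                     : static void
                     : sort_some_int4_datums(Datum *vals, int nvals)
                     : {
                     :     Tuplesortstate *ts;
                     :     Datum       val;
                     :     bool        isnull;
                     : 
                     :     /* assumed: type, "<" operator, collation, nulls-first, workMem (kB),
                     :      * parallel coordinate, randomAccess */
                     :     ts = tuplesort_begin_datum(INT4OID, Int4LessOperator, InvalidOid,
                     :                                false, work_mem, NULL, false);
                     : 
                     :     for (int i = 0; i < nvals; i++)
                     :         tuplesort_putdatum(ts, vals[i], false);
                     : 
                     :     tuplesort_performsort(ts);
                     : 
                     :     while (tuplesort_getdatum(ts, true, &val, &isnull, NULL))
                     :         ;                       /* consume values in ascending order */
                     : 
                     :     tuplesort_end(ts);
                     : }
                     : #endif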
    1885             : 
    1886             : /*
    1887             :  * Shared code for tuple and datum cases.
    1888             :  */
    1889             : static void
    1890    30238728 : puttuple_common(Tuplesortstate *state, SortTuple *tuple)
    1891             : {
    1892             :     Assert(!LEADER(state));
    1893             : 
    1894    30238728 :     switch (state->status)
    1895             :     {
    1896    26248392 :         case TSS_INITIAL:
    1897             : 
    1898             :             /*
    1899             :              * Save the tuple into the unsorted array.  First, grow the array
    1900             :              * as needed.  Note that we try to grow the array when there is
    1901             :              * still one free slot remaining --- if we fail, there'll still be
    1902             :              * room to store the incoming tuple, and then we'll switch to
    1903             :              * tape-based operation.
    1904             :              */
    1905    26248392 :             if (state->memtupcount >= state->memtupsize - 1)
    1906             :             {
    1907        8100 :                 (void) grow_memtuples(state);
    1908             :                 Assert(state->memtupcount < state->memtupsize);
    1909             :             }
    1910    26248392 :             state->memtuples[state->memtupcount++] = *tuple;
    1911             : 
    1912             :             /*
    1913             :              * Check if it's time to switch over to a bounded heapsort. We do
    1914             :              * so if the input tuple count exceeds twice the desired tuple
    1915             :              * count (this is a heuristic for where heapsort becomes cheaper
    1916             :              * than a quicksort), or if we've just filled workMem and have
    1917             :              * enough tuples to meet the bound.
    1918             :              *
    1919             :              * Note that once we enter TSS_BOUNDED state we will always try to
    1920             :              * complete the sort that way.  In the worst case, if later input
    1921             :              * tuples are larger than earlier ones, this might cause us to
    1922             :              * exceed workMem significantly.
    1923             :              */
    1924    26248392 :             if (state->bounded &&
    1925       39932 :                 (state->memtupcount > state->bound * 2 ||
    1926       39610 :                  (state->memtupcount > state->bound && LACKMEM(state))))
    1927             :             {
    1928             : #ifdef TRACE_SORT
    1929         322 :                 if (trace_sort)
    1930           0 :                     elog(LOG, "switching to bounded heapsort at %d tuples: %s",
    1931             :                          state->memtupcount,
    1932             :                          pg_rusage_show(&state->ru_start));
    1933             : #endif
    1934         322 :                 make_bounded_heap(state);
    1935         322 :                 return;
    1936             :             }
    1937             : 
    1938             :             /*
    1939             :              * Done if we still fit in available memory and have array slots.
    1940             :              */
    1941    26248070 :             if (state->memtupcount < state->memtupsize && !LACKMEM(state))
    1942    26247990 :                 return;
    1943             : 
    1944             :             /*
    1945             :              * Nope; time to switch to tape-based operation.
    1946             :              */
    1947          80 :             inittapes(state, true);
    1948             : 
    1949             :             /*
    1950             :              * Dump all tuples.
    1951             :              */
    1952          80 :             dumptuples(state, false);
    1953          80 :             break;
    1954             : 
    1955     3215572 :         case TSS_BOUNDED:
    1956             : 
    1957             :             /*
    1958             :              * We don't want to grow the array here, so check whether the new
    1959             :              * tuple can be discarded before putting it in.  This should be a
    1960             :              * good speed optimization, too, since when there are many more
    1961             :              * input tuples than the bound, most input tuples can be discarded
    1962             :              * with just this one comparison.  Note that because we currently
    1963             :              * have the sort direction reversed, we must check for <= not >=.
    1964             :              */
    1965     3215572 :             if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)
    1966             :             {
    1967             :                 /* new tuple <= top of the heap, so we can discard it */
    1968     2879528 :                 free_sort_tuple(state, tuple);
    1969     2879528 :                 CHECK_FOR_INTERRUPTS();
    1970             :             }
    1971             :             else
    1972             :             {
    1973             :                 /* discard top of heap, replacing it with the new tuple */
    1974      336044 :                 free_sort_tuple(state, &state->memtuples[0]);
    1975      336044 :                 tuplesort_heap_replace_top(state, tuple);
    1976             :             }
    1977     3215572 :             break;
    1978             : 
    1979      774764 :         case TSS_BUILDRUNS:
    1980             : 
    1981             :             /*
    1982             :              * Save the tuple into the unsorted array (there must be space)
    1983             :              */
    1984      774764 :             state->memtuples[state->memtupcount++] = *tuple;
    1985             : 
    1986             :             /*
    1987             :              * If we are over the memory limit, dump all tuples.
    1988             :              */
    1989      774764 :             dumptuples(state, false);
    1990      774764 :             break;
    1991             : 
    1992           0 :         default:
    1993           0 :             elog(ERROR, "invalid tuplesort state");
    1994             :             break;
    1995             :     }
    1996             : }
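                     : 
                     : /*
                     :  * Editor's note: a compact summary of the state transitions driven by
                     :  * puttuple_common() above; this restates the code, adding no new behavior:
                     :  *
                     :  *   TSS_INITIAL   -- bounded sort and enough input --> TSS_BOUNDED
                     :  *   TSS_INITIAL   -- LACKMEM or array full; inittapes() --> TSS_BUILDRUNS
                     :  *   TSS_BOUNDED   -- stays put: the heap keeps only the top 'bound' tuples
                     :  *   TSS_BUILDRUNS -- stays put: dumptuples() spills to tape as needed
                     :  */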
    1997             : 
    1998             : static bool
    1999     2749820 : consider_abort_common(Tuplesortstate *state)
    2000             : {
    2001             :     Assert(state->sortKeys[0].abbrev_converter != NULL);
    2002             :     Assert(state->sortKeys[0].abbrev_abort != NULL);
    2003             :     Assert(state->sortKeys[0].abbrev_full_comparator != NULL);
    2004             : 
    2005             :     /*
    2006             :      * Check effectiveness of abbreviation optimization.  Consider aborting
    2007             :      * when still within memory limit.
    2008             :      */
    2009     2749820 :     if (state->status == TSS_INITIAL &&
    2010     2455516 :         state->memtupcount >= state->abbrevNext)
    2011             :     {
    2012      132586 :         state->abbrevNext *= 2;
    2013             : 
    2014             :         /*
    2015             :          * Check opclass-supplied abbreviation abort routine.  It may indicate
    2016             :          * that abbreviation should not proceed.
    2017             :          */
    2018      132586 :         if (!state->sortKeys->abbrev_abort(state->memtupcount,
    2019             :                                            state->sortKeys))
    2020      132522 :             return false;
    2021             : 
    2022             :         /*
    2023             :          * Finally, restore the authoritative comparator, and indicate that
    2024             :          * abbreviation is not in play by setting abbrev_converter to NULL.
    2025             :          */
    2026          64 :         state->sortKeys[0].comparator = state->sortKeys[0].abbrev_full_comparator;
    2027          64 :         state->sortKeys[0].abbrev_converter = NULL;
    2028             :         /* Not strictly necessary, but be tidy */
    2029          64 :         state->sortKeys[0].abbrev_abort = NULL;
    2030          64 :         state->sortKeys[0].abbrev_full_comparator = NULL;
    2031             : 
    2032             :         /* Give up - expect original pass-by-value representation */
    2033          64 :         return true;
    2034             :     }
    2035             : 
    2036     2617234 :     return false;
    2037             : }
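                     : 
                     : /*
                     :  * Editor's note: a hedged sketch of the opclass side of the abort check
                     :  * above.  The callback shape matches SortSupport's abbrev_abort hook; the
                     :  * ToyAbbrevState struct and the 1-in-20 distinctness threshold are purely
                     :  * illustrative, not PostgreSQL's actual policy for any type.
                     :  */
                     : #ifdef EDITORS_EXAMPLE_NOT_COMPILED
                     : typedef struct ToyAbbrevState
                     : {
                     :     int64       input_count;        /* datums seen by the converter */
                     :     int64       distinct_estimate;  /* e.g. from a hyperloglog sketch */
                     : } ToyAbbrevState;
                     : 
                     : static bool
                     : toy_abbrev_abort(int memtupcount, SortSupport ssup)
                     : {
                     :     ToyAbbrevState *st = (ToyAbbrevState *) ssup->ssup_extra;
                     : 
                     :     if (memtupcount < 100)
                     :         return false;           /* too few tuples to judge yet */
                     : 
                     :     /* Give up when fewer than 1 in 20 abbreviated keys look distinct */
                     :     return st->distinct_estimate < st->input_count / 20;
                     : }
                     : #endif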
    2038             : 
    2039             : /*
    2040             :  * All tuples have been provided; finish the sort.
    2041             :  */
    2042             : void
    2043     1843824 : tuplesort_performsort(Tuplesortstate *state)
    2044             : {
    2045     1843824 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    2046             : 
    2047             : #ifdef TRACE_SORT
    2048     1843824 :     if (trace_sort)
    2049           0 :         elog(LOG, "performsort of worker %d starting: %s",
    2050             :              state->worker, pg_rusage_show(&state->ru_start));
    2051             : #endif
    2052             : 
    2053     1843824 :     switch (state->status)
    2054             :     {
    2055     1843422 :         case TSS_INITIAL:
    2056             : 
    2057             :             /*
    2058             :              * We were able to accumulate all the tuples within the allowed
    2059             :              * memory, or the leader is about to take over the worker tapes.
    2060             :              */
    2061     1843422 :             if (SERIAL(state))
    2062             :             {
    2063             :                 /* Just qsort 'em and we're done */
    2064     1843046 :                 tuplesort_sort_memtuples(state);
    2065     1843002 :                 state->status = TSS_SORTEDINMEM;
    2066             :             }
    2067         376 :             else if (WORKER(state))
    2068             :             {
    2069             :                 /*
    2070             :                  * Parallel workers must still dump out tuples to tape.  No
    2071             :                  * merge is required to produce a single output run, though.
    2072             :                  */
    2073         280 :                 inittapes(state, false);
    2074         280 :                 dumptuples(state, true);
    2075         280 :                 worker_nomergeruns(state);
    2076         280 :                 state->status = TSS_SORTEDONTAPE;
    2077             :             }
    2078             :             else
    2079             :             {
    2080             :                 /*
    2081             :                  * Leader will take over worker tapes and merge worker runs.
    2082             :                  * Note that mergeruns sets the correct state->status.
    2083             :                  */
    2084          96 :                 leader_takeover_tapes(state);
    2085          96 :                 mergeruns(state);
    2086             :             }
    2087     1843378 :             state->current = 0;
    2088     1843378 :             state->eof_reached = false;
    2089     1843378 :             state->markpos_block = 0L;
    2090     1843378 :             state->markpos_offset = 0;
    2091     1843378 :             state->markpos_eof = false;
    2092     1843378 :             break;
    2093             : 
    2094         322 :         case TSS_BOUNDED:
    2095             : 
    2096             :             /*
    2097             :              * We were able to accumulate all the tuples required for output
    2098             :              * in memory, using a heap to eliminate excess tuples.  Now we
    2099             :              * have to transform the heap to a properly-sorted array.
    2100             :              */
    2101         322 :             sort_bounded_heap(state);
    2102         322 :             state->current = 0;
    2103         322 :             state->eof_reached = false;
    2104         322 :             state->markpos_offset = 0;
    2105         322 :             state->markpos_eof = false;
    2106         322 :             state->status = TSS_SORTEDINMEM;
    2107         322 :             break;
    2108             : 
    2109          80 :         case TSS_BUILDRUNS:
    2110             : 
    2111             :             /*
    2112             :              * Finish tape-based sort.  First, flush all tuples remaining in
    2113             :              * memory out to tape; then merge until we have a single remaining
    2114             :              * run (or, if !randomAccess and !WORKER(), one run per tape).
    2115             :              * Note that mergeruns sets the correct state->status.
    2116             :              */
    2117          80 :             dumptuples(state, true);
    2118          80 :             mergeruns(state);
    2119          80 :             state->eof_reached = false;
    2120          80 :             state->markpos_block = 0L;
    2121          80 :             state->markpos_offset = 0;
    2122          80 :             state->markpos_eof = false;
    2123          80 :             break;
    2124             : 
    2125           0 :         default:
    2126           0 :             elog(ERROR, "invalid tuplesort state");
    2127             :             break;
    2128             :     }
    2129             : 
    2130             : #ifdef TRACE_SORT
    2131     1843780 :     if (trace_sort)
    2132             :     {
    2133           0 :         if (state->status == TSS_FINALMERGE)
    2134           0 :             elog(LOG, "performsort of worker %d done (except %d-way final merge): %s",
    2135             :                  state->worker, state->nInputTapes,
    2136             :                  pg_rusage_show(&state->ru_start));
    2137             :         else
    2138           0 :             elog(LOG, "performsort of worker %d done: %s",
    2139             :                  state->worker, pg_rusage_show(&state->ru_start));
    2140             :     }
    2141             : #endif
    2142             : 
    2143     1843780 :     MemoryContextSwitchTo(oldcontext);
    2144     1843780 : }
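                     : 
                     : /*
                     :  * Editor's note: the TRACE_SORT blocks above are compiled in by default
                     :  * but stay silent unless the developer GUC is on.  Assuming a standard
                     :  * build, something like
                     :  *
                     :  *   SET client_min_messages = log;
                     :  *   SET trace_sort = on;
                     :  *
                     :  * makes the messages constructed above (with pg_rusage_show timings)
                     :  * visible while a query sorts.
                     :  */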
    2145             : 
    2146             : /*
    2147             :  * Internal routine to fetch the next tuple in either forward or back
    2148             :  * direction into *stup.  Returns false if no more tuples.
    2149             :  * The returned tuple belongs to the tuplesort memory context and must not be
    2150             :  * freed by the caller.  Note that the fetched tuple is stored in memory that
    2151             :  * may be recycled by any future fetch.
    2152             :  */
    2153             : static bool
    2154    29972582 : tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
    2155             :                           SortTuple *stup)
    2156             : {
    2157             :     unsigned int tuplen;
    2158             :     size_t      nmoved;
    2159             : 
    2160             :     Assert(!WORKER(state));
    2161             : 
    2162    29972582 :     switch (state->status)
    2163             :     {
    2164    25982602 :         case TSS_SORTEDINMEM:
    2165             :             Assert(forward || state->randomAccess);
    2166             :             Assert(!state->slabAllocatorUsed);
    2167    25982602 :             if (forward)
    2168             :             {
    2169    25982558 :                 if (state->current < state->memtupcount)
    2170             :                 {
    2171    24139722 :                     *stup = state->memtuples[state->current++];
    2172    24139722 :                     return true;
    2173             :                 }
    2174     1842836 :                 state->eof_reached = true;
    2175             : 
    2176             :                 /*
    2177             :                  * Complain if caller tries to retrieve more tuples than
    2178             :                  * originally asked for in a bounded sort.  This is because
    2179             :                  * returning EOF here might be the wrong thing.
    2180             :                  */
    2181     1842836 :                 if (state->bounded && state->current >= state->bound)
    2182           0 :                     elog(ERROR, "retrieved too many tuples in a bounded sort");
    2183             : 
    2184     1842836 :                 return false;
    2185             :             }
    2186             :             else
    2187             :             {
    2188          44 :                 if (state->current <= 0)
    2189           0 :                     return false;
    2190             : 
    2191             :                 /*
    2192             :                  * If all tuples have been fetched already, return the last
    2193             :                  * tuple; otherwise, return the tuple before the last one returned.
    2194             :                  */
    2195          44 :                 if (state->eof_reached)
    2196           8 :                     state->eof_reached = false;
    2197             :                 else
    2198             :                 {
    2199          36 :                     state->current--;    /* last returned tuple */
    2200          36 :                     if (state->current <= 0)
    2201           4 :                         return false;
    2202             :                 }
    2203          40 :                 *stup = state->memtuples[state->current - 1];
    2204          40 :                 return true;
    2205             :             }
    2206             :             break;
    2207             : 
    2208      181996 :         case TSS_SORTEDONTAPE:
    2209             :             Assert(forward || state->randomAccess);
    2210             :             Assert(state->slabAllocatorUsed);
    2211             : 
    2212             :             /*
    2213             :              * The slot that held the tuple that we returned in previous
    2214             :              * gettuple call can now be reused.
    2215             :              */
    2216      181996 :             if (state->lastReturnedTuple)
    2217             :             {
    2218      101900 :                 RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
    2219      101900 :                 state->lastReturnedTuple = NULL;
    2220             :             }
    2221             : 
    2222      181996 :             if (forward)
    2223             :             {
    2224      181976 :                 if (state->eof_reached)
    2225           0 :                     return false;
    2226             : 
    2227      181976 :                 if ((tuplen = getlen(state->result_tape, true)) != 0)
    2228             :                 {
    2229      181960 :                     READTUP(state, stup, state->result_tape, tuplen);
    2230             : 
    2231             :                     /*
    2232             :                      * Remember the tuple we return, so that we can recycle
    2233             :                      * its memory on next call.  (This can be NULL, in the
    2234             :                      * !state->tuples case).
    2235             :                      */
    2236      181960 :                     state->lastReturnedTuple = stup->tuple;
    2237             : 
    2238      181960 :                     return true;
    2239             :                 }
    2240             :                 else
    2241             :                 {
    2242          16 :                     state->eof_reached = true;
    2243          16 :                     return false;
    2244             :                 }
    2245             :             }
    2246             : 
    2247             :             /*
    2248             :              * Backward.
    2249             :              *
    2250             :              * If all tuples have been fetched already, return the last tuple;
    2251             :              * otherwise, return the tuple before the last one returned.
    2252             :              */
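                     : 
                     :             /*
                     :              * Editor's note: with randomAccess, each tuple on tape is framed
                     :              * by its length word on both sides, and the run ends with a zero
                     :              * length word, roughly:
                     :              *
                     :              *   [len1][tuple1][len1] [len2][tuple2][len2] ... [0]
                     :              *
                     :              * which is what lets the code below backspace over trailing
                     :              * length words to walk the file in reverse.
                     :              */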
    2253          20 :             if (state->eof_reached)
    2254             :             {
    2255             :                 /*
    2256             :                  * Seek position is pointing just past the zero tuplen at the
    2257             :                  * end of file; back up to fetch last tuple's ending length
    2258             :                  * word.  If seek fails we must have a completely empty file.
    2259             :                  */
    2260           8 :                 nmoved = LogicalTapeBackspace(state->result_tape,
    2261             :                                               2 * sizeof(unsigned int));
    2262           8 :                 if (nmoved == 0)
    2263           0 :                     return false;
    2264           8 :                 else if (nmoved != 2 * sizeof(unsigned int))
    2265           0 :                     elog(ERROR, "unexpected tape position");
    2266           8 :                 state->eof_reached = false;
    2267             :             }
    2268             :             else
    2269             :             {
    2270             :                 /*
    2271             :                  * Back up and fetch previously-returned tuple's ending length
    2272             :                  * word.  If seek fails, assume we are at start of file.
    2273             :                  */
    2274          12 :                 nmoved = LogicalTapeBackspace(state->result_tape,
    2275             :                                               sizeof(unsigned int));
    2276          12 :                 if (nmoved == 0)
    2277           0 :                     return false;
    2278          12 :                 else if (nmoved != sizeof(unsigned int))
    2279           0 :                     elog(ERROR, "unexpected tape position");
    2280          12 :                 tuplen = getlen(state->result_tape, false);
    2281             : 
    2282             :                 /*
    2283             :                  * Back up to get ending length word of tuple before it.
    2284             :                  */
    2285          12 :                 nmoved = LogicalTapeBackspace(state->result_tape,
    2286             :                                               tuplen + 2 * sizeof(unsigned int));
    2287          12 :                 if (nmoved == tuplen + sizeof(unsigned int))
    2288             :                 {
    2289             :                     /*
    2290             :                      * ending length word before it.  That means the previous
    2291             :                      * tuple is the first tuple in the file, and it is now the
    2292             :                      * next one to read in the forward direction (not obviously
    2293             :                      * right, but it matches what the in-memory case does).
    2294             :                      * but that is what in-memory case does).
    2295             :                      */
    2296           4 :                     return false;
    2297             :                 }
    2298           8 :                 else if (nmoved != tuplen + 2 * sizeof(unsigned int))
    2299           0 :                     elog(ERROR, "bogus tuple length in backward scan");
    2300             :             }
    2301             : 
    2302          16 :             tuplen = getlen(state->result_tape, false);
    2303             : 
    2304             :             /*
    2305             :              * Now we have the length of the prior tuple, back up and read it.
    2306             :              * Note: READTUP expects we are positioned after the initial
    2307             :              * length word of the tuple, so back up to that point.
    2308             :              */
    2309          16 :             nmoved = LogicalTapeBackspace(state->result_tape,
    2310             :                                           tuplen);
    2311          16 :             if (nmoved != tuplen)
    2312           0 :                 elog(ERROR, "bogus tuple length in backward scan");
    2313          16 :             READTUP(state, stup, state->result_tape, tuplen);
    2314             : 
    2315             :             /*
    2316             :              * Remember the tuple we return, so that we can recycle its memory
    2317             :              * on next call. (This can be NULL, in the Datum case).
    2318             :              */
    2319          16 :             state->lastReturnedTuple = stup->tuple;
    2320             : 
    2321          16 :             return true;
    2322             : 
    2323     3807984 :         case TSS_FINALMERGE:
    2324             :             Assert(forward);
    2325             :             /* We are managing memory ourselves, with the slab allocator. */
    2326             :             Assert(state->slabAllocatorUsed);
    2327             : 
    2328             :             /*
    2329             :              * The slab slot holding the tuple that we returned in previous
    2330             :              * gettuple call can now be reused.
    2331             :              */
    2332     3807984 :             if (state->lastReturnedTuple)
    2333             :             {
    2334     3687752 :                 RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
    2335     3687752 :                 state->lastReturnedTuple = NULL;
    2336             :             }
    2337             : 
    2338             :             /*
    2339             :              * This code should match the inner loop of mergeonerun().
    2340             :              */
    2341     3807984 :             if (state->memtupcount > 0)
    2342             :             {
    2343     3807832 :                 int         srcTapeIndex = state->memtuples[0].srctape;
    2344     3807832 :                 LogicalTape *srcTape = state->inputTapes[srcTapeIndex];
    2345             :                 SortTuple   newtup;
    2346             : 
    2347     3807832 :                 *stup = state->memtuples[0];
    2348             : 
    2349             :                 /*
    2350             :                  * Remember the tuple we return, so that we can recycle its
    2351             :                  * memory on next call. (This can be NULL, in the Datum case).
    2352             :                  */
    2353     3807832 :                 state->lastReturnedTuple = stup->tuple;
    2354             : 
    2355             :                 /*
    2356             :                  * Pull next tuple from tape, and replace the returned tuple
    2357             :                  * at top of the heap with it.
    2358             :                  */
    2359     3807832 :                 if (!mergereadnext(state, srcTape, &newtup))
    2360             :                 {
    2361             :                     /*
    2362             :                      * If no more data, we've reached end of run on this tape.
    2363             :                      * Remove the top node from the heap.
    2364             :                      */
    2365         228 :                     tuplesort_heap_delete_top(state);
    2366         228 :                     state->nInputRuns--;
    2367             : 
    2368             :                     /*
    2369             :                      * Close the tape.  It'd go away at the end of the sort
    2370             :                      * anyway, but better to release the memory early.
    2371             :                      */
    2372         228 :                     LogicalTapeClose(srcTape);
    2373         228 :                     return true;
    2374             :                 }
    2375     3807604 :                 newtup.srctape = srcTapeIndex;
    2376     3807604 :                 tuplesort_heap_replace_top(state, &newtup);
    2377     3807604 :                 return true;
    2378             :             }
    2379         152 :             return false;
    2380             : 
    2381           0 :         default:
    2382           0 :             elog(ERROR, "invalid tuplesort state");
    2383             :             return false;       /* keep compiler quiet */
    2384             :     }
    2385             : }
    2386             : 
    2387             : /*
    2388             :  * Fetch the next tuple in either forward or back direction.
    2389             :  * If successful, put tuple in slot and return true; else, clear the slot
    2390             :  * and return false.
    2391             :  *
    2392             :  * Caller may optionally be passed back abbreviated value (on true return
    2393             :  * value) when abbreviation was used, which can be used to cheaply avoid
    2394             :  * equality checks that might otherwise be required.  Caller can safely make a
    2395             :  * determination of "non-equal tuple" based on simple binary inequality.  A
    2396             :  * NULL value in leading attribute will set abbreviated value to zeroed
    2397             :  * representation, which caller may rely on in abbreviated inequality check.
    2398             :  *
    2399             :  * If copy is true, the slot receives a tuple that's been copied into the
    2400             :  * caller's memory context, so that it will stay valid regardless of future
    2401             :  * manipulations of the tuplesort's state (up to and including deleting the
    2402             :  * tuplesort).  If copy is false, the slot will just receive a pointer to a
    2403             :  * tuple held within the tuplesort, which is more efficient, but only safe for
    2404             :  * callers that are prepared to have any subsequent manipulation of the
    2405             :  * tuplesort's state invalidate slot contents.
    2406             :  */
    2407             : bool
    2408     8305534 : tuplesort_gettupleslot(Tuplesortstate *state, bool forward, bool copy,
    2409             :                        TupleTableSlot *slot, Datum *abbrev)
    2410             : {
    2411     8305534 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    2412             :     SortTuple   stup;
    2413             : 
    2414     8305534 :     if (!tuplesort_gettuple_common(state, forward, &stup))
    2415     1671276 :         stup.tuple = NULL;
    2416             : 
    2417     8305534 :     MemoryContextSwitchTo(oldcontext);
    2418             : 
    2419     8305534 :     if (stup.tuple)
    2420             :     {
    2421             :         /* Record abbreviated key for caller */
    2422     6634258 :         if (state->sortKeys->abbrev_converter && abbrev)
    2423          96 :             *abbrev = stup.datum1;
    2424             : 
    2425     6634258 :         if (copy)
    2426       21154 :             stup.tuple = heap_copy_minimal_tuple((MinimalTuple) stup.tuple);
    2427             : 
    2428     6634258 :         ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, copy);
    2429     6634258 :         return true;
    2430             :     }
    2431             :     else
    2432             :     {
    2433     1671276 :         ExecClearTuple(slot);
    2434     1671276 :         return false;
    2435             :     }
    2436             : }
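                     : 
                     : /*
                     :  * Editor's note: a minimal sketch of the copy-flag contract described in
                     :  * the header comment above; the surrounding executor state is assumed.
                     :  */
                     : #ifdef EDITORS_EXAMPLE_NOT_COMPILED
                     : static void
                     : consume_one(Tuplesortstate *state, TupleTableSlot *slot)
                     : {
                     :     /* Pass-through consumer: slot used before the sort is touched again */
                     :     (void) tuplesort_gettupleslot(state, true, false, slot, NULL);
                     : 
                     :     /* Tuple must survive later tuplesort manipulation: ask for a copy */
                     :     (void) tuplesort_gettupleslot(state, true, true, slot, NULL);
                     : }
                     : #endif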
    2437             : 
    2438             : /*
    2439             :  * Fetch the next tuple in either forward or back direction.
    2440             :  * Returns NULL if no more tuples.  The returned tuple belongs to the tuplesort
    2441             :  * memory context and must not be freed by the caller, who also may not rely on
    2442             :  * it remaining valid after any further manipulation of the tuplesort.
    2443             :  */
    2444             : HeapTuple
    2445      366490 : tuplesort_getheaptuple(Tuplesortstate *state, bool forward)
    2446             : {
    2447      366490 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    2448             :     SortTuple   stup;
    2449             : 
    2450      366490 :     if (!tuplesort_gettuple_common(state, forward, &stup))
    2451          56 :         stup.tuple = NULL;
    2452             : 
    2453             :  * Returns NULL if no more tuples.  The returned tuple belongs to the tuplesort
    2454             :  * memory context and must not be freed by the caller, who also may not rely on
    2455             :  * it remaining valid after any further manipulation of the tuplesort.
    2456             : }
    2457             : 
    2458             : /*
    2459             :  * Fetch the next index tuple in either forward or back direction.
    2460             :  * Returns NULL if no more tuples.  Returned tuple belongs to tuplesort memory
    2461             :  * context, and must not be freed by caller.  Caller may not rely on tuple
    2462             :  * remaining valid after any further manipulation of tuplesort.
    2463             :  */
    2464             : IndexTuple
    2465    19449368 : tuplesort_getindextuple(Tuplesortstate *state, bool forward)
    2466             : {
    2467    19449368 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    2468             :     SortTuple   stup;
    2469             : 
    2470    19449368 :     if (!tuplesort_gettuple_common(state, forward, &stup))
    2471       99056 :         stup.tuple = NULL;
    2472             : 
    2473    19449368 :     MemoryContextSwitchTo(oldcontext);
    2474             : 
    2475    19449368 :     return (IndexTuple) stup.tuple;
    2476             : }
    2477             : 
    2478             : /*
    2479             :  * Fetch the next Datum in either forward or back direction.
    2480             :  * Returns false if no more datums.
    2481             :  *
    2482             :  * If the Datum is of a pass-by-reference type, the returned value is freshly
    2483             :  * palloc'd in the caller's context and is now owned by the caller (this
    2484             :  * differs from the similar routines for other types of tuplesorts).
    2485             :  *
    2486             :  * Caller may optionally be passed back abbreviated value (on true return
    2487             :  * value) when abbreviation was used, which can be used to cheaply avoid
    2488             :  * equality checks that might otherwise be required.  Caller can safely make a
    2489             :  * determination of "non-equal tuple" based on simple binary inequality.  A
    2490             :  * NULL value will have a zeroed abbreviated value representation, which caller
    2491             :  * may rely on in abbreviated inequality check.
    2492             :  */
    2493             : bool
    2494     1691118 : tuplesort_getdatum(Tuplesortstate *state, bool forward,
    2495             :                    Datum *val, bool *isNull, Datum *abbrev)
    2496             : {
    2497     1691118 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    2498             :     SortTuple   stup;
    2499             : 
    2500     1691118 :     if (!tuplesort_gettuple_common(state, forward, &stup))
    2501             :     {
    2502       72624 :         MemoryContextSwitchTo(oldcontext);
    2503       72624 :         return false;
    2504             :     }
    2505             : 
    2506             :     /* Ensure we copy into caller's memory context */
    2507     1618494 :     MemoryContextSwitchTo(oldcontext);
    2508             : 
    2509             :     /* Record abbreviated key for caller */
    2510     1618494 :     if (state->sortKeys->abbrev_converter && abbrev)
    2511          72 :         *abbrev = stup.datum1;
    2512             : 
    2513     1618494 :     if (stup.isnull1 || !state->tuples)
    2514             :     {
    2515      798786 :         *val = stup.datum1;
    2516      798786 :         *isNull = stup.isnull1;
    2517             :     }
    2518             :     else
    2519             :     {
    2520             :         /* use stup.tuple because stup.datum1 may be an abbreviation */
    2521      819708 :         *val = datumCopy(PointerGetDatum(stup.tuple), false, state->datumTypeLen);
    2522      819708 :         *isNull = false;
    2523             :     }
    2524             : 
    2525     1618494 :     return true;
    2526             : }
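                     : 
                     : /*
                     :  * Editor's note: per the header comment, pass-by-reference results are
                     :  * palloc'd in the caller's context, so a caller that does not retain them
                     :  * should free them itself; a sketch under that assumption:
                     :  */
                     : #ifdef EDITORS_EXAMPLE_NOT_COMPILED
                     : static void
                     : drain_datums(Tuplesortstate *state, bool typbyval)
                     : {
                     :     Datum       val;
                     :     bool        isnull;
                     : 
                     :     while (tuplesort_getdatum(state, true, &val, &isnull, NULL))
                     :     {
                     :         if (!isnull && !typbyval)
                     :             pfree(DatumGetPointer(val));    /* caller owns the copy */
                     :     }
                     : }
                     : #endif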
    2527             : 
    2528             : /*
    2529             :  * Advance over N tuples in either forward or back direction,
    2530             :  * without returning any data.  N==0 is a no-op.
    2531             :  * Returns true if successful, false if ran out of tuples.
    2532             :  */
    2533             : bool
    2534         268 : tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
    2535             : {
    2536             :     MemoryContext oldcontext;
    2537             : 
    2538             :     /*
    2539             :      * We don't actually support backwards skip yet, because no callers need
    2540             :      * it.  The API is designed to allow for that later, though.
    2541             :      */
    2542             :     Assert(forward);
    2543             :     Assert(ntuples >= 0);
    2544             :     Assert(!WORKER(state));
    2545             : 
    2546         268 :     switch (state->status)
    2547             :     {
    2548         252 :         case TSS_SORTEDINMEM:
    2549         252 :             if (state->memtupcount - state->current >= ntuples)
    2550             :             {
    2551         252 :                 state->current += ntuples;
    2552         252 :                 return true;
    2553             :             }
    2554           0 :             state->current = state->memtupcount;
    2555           0 :             state->eof_reached = true;
    2556             : 
    2557             :             /*
    2558             :              * Complain if caller tries to retrieve more tuples than
    2559             :              * originally asked for in a bounded sort.  This is because
    2560             :              * returning EOF here might be the wrong thing.
    2561             :              */
    2562           0 :             if (state->bounded && state->current >= state->bound)
    2563           0 :                 elog(ERROR, "retrieved too many tuples in a bounded sort");
    2564             : 
    2565           0 :             return false;
    2566             : 
    2567          16 :         case TSS_SORTEDONTAPE:
    2568             :         case TSS_FINALMERGE:
    2569             : 
    2570             :             /*
    2571             :              * We could probably optimize these cases better, but for now it's
    2572             :              * not worth the trouble.
    2573             :              */
    2574          16 :             oldcontext = MemoryContextSwitchTo(state->sortcontext);
    2575      160088 :             while (ntuples-- > 0)
    2576             :             {
    2577             :                 SortTuple   stup;
    2578             : 
    2579      160072 :                 if (!tuplesort_gettuple_common(state, forward, &stup))
    2580             :                 {
    2581           0 :                     MemoryContextSwitchTo(oldcontext);
    2582           0 :                     return false;
    2583             :                 }
    2584      160072 :                 CHECK_FOR_INTERRUPTS();
    2585             :             }
    2586          16 :             MemoryContextSwitchTo(oldcontext);
    2587          16 :             return true;
    2588             : 
    2589           0 :         default:
    2590           0 :             elog(ERROR, "invalid tuplesort state");
    2591             :             return false;       /* keep compiler quiet */
    2592             :     }
    2593             : }
    2594             : 
    2595             : /*
    2596             :  * tuplesort_merge_order - report merge order we'll use for given memory
    2597             :  * (note: "merge order" just means the number of input tapes in the merge).
    2598             :  *
    2599             :  * This is exported for use by the planner.  allowedMem is in bytes.
    2600             :  */
    2601             : int
    2602        9840 : tuplesort_merge_order(int64 allowedMem)
    2603             : {
    2604             :     int         mOrder;
    2605             : 
    2606             :     /*----------
    2607             :      * In the merge phase, we need buffer space for each input and output tape.
    2608             :      * Each pass in the balanced merge algorithm reads from M input tapes, and
    2609             :      * writes to N output tapes.  Each tape consumes TAPE_BUFFER_OVERHEAD bytes
    2610             :      * of memory.  In addition to that, we want MERGE_BUFFER_SIZE workspace per
    2611             :      * input tape.
    2612             :      *
    2613             :      * totalMem = M * (TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE) +
    2614             :      *            N * TAPE_BUFFER_OVERHEAD
    2615             :      *
    2616             :      * Except for the last and next-to-last merge passes, where there can be
    2617             :      * fewer tapes left to process, M = N.  We choose M so that we have the
    2618             :      * desired amount of memory available for the input buffers
    2619             :      * (TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE), given the total memory
    2620             :      * available for the tape buffers (allowedMem).
    2621             :      *
    2622             :      * Note: you might be thinking we need to account for the memtuples[]
    2623             :      * array in this calculation, but we effectively treat that as part of the
    2624             :      * MERGE_BUFFER_SIZE workspace.
    2625             :      *----------
    2626             :      */
    2627        9840 :     mOrder = allowedMem /
    2628             :         (2 * TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE);
    2629             : 
    2630             :     /*
    2631             :      * Even in minimum memory, use at least a MINORDER merge.  On the other
    2632             :      * hand, even when we have lots of memory, do not use more than a MAXORDER
    2633             :      * merge.  Tapes are pretty cheap, but they're not entirely free.  Each
    2634             :      * additional tape reduces the amount of memory available to build runs,
    2635             :      * which in turn can cause the same sort to need more runs, which makes
    2636             :      * merging slower even if it can still be done in a single pass.  Also,
    2637             :      * high order merges are quite slow due to CPU cache effects; it can be
    2638             :      * faster to pay the I/O cost of a multi-pass merge than to perform a
    2639             :      * single merge pass across many hundreds of tapes.
    2640             :      */
    2641        9840 :     mOrder = Max(mOrder, MINORDER);
    2642        9840 :     mOrder = Min(mOrder, MAXORDER);
    2643             : 
    2644        9840 :     return mOrder;
    2645             : }
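                     : 
                     : /*
                     :  * Editor's note: a worked instance of the formula above, assuming the
                     :  * defaults in this file with an 8kB BLCKSZ (TAPE_BUFFER_OVERHEAD = BLCKSZ,
                     :  * MERGE_BUFFER_SIZE = BLCKSZ * 32):
                     :  *
                     :  *   allowedMem = 4MB = 4194304 bytes
                     :  *   mOrder     = 4194304 / (2 * 8192 + 262144)
                     :  *              = 4194304 / 278528 = 15 (integer division)
                     :  *
                     :  * so a 4MB sort budget yields roughly a 15-way merge, clamped to the
                     :  * [MINORDER, MAXORDER] range either way.
                     :  */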
    2646             : 
    2647             : /*
    2648             :  * Helper function to calculate how much memory to allocate for the read buffer
    2649             :  * of each input tape in a merge pass.
    2650             :  *
    2651             :  * 'avail_mem' is the amount of memory available for the buffers of all the
    2652             :  *      tapes, both input and output.
    2653             :  * 'nInputTapes' and 'nInputRuns' are the number of input tapes and runs.
    2654             :  * 'maxOutputTapes' is the max. number of output tapes we should produce.
    2655             :  */
    2656             : static int64
    2657         240 : merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns,
    2658             :                        int maxOutputTapes)
    2659             : {
    2660             :     int         nOutputRuns;
    2661             :     int         nOutputTapes;
    2662             : 
    2663             :     /*
    2664             :      * How many output tapes will we produce in this pass?
    2665             :      *
    2666             :      * This is nInputRuns / nInputTapes, rounded up.
    2667             :      */
    2668         240 :     nOutputRuns = (nInputRuns + nInputTapes - 1) / nInputTapes;
    2669             : 
    2670         240 :     nOutputTapes = Min(nOutputRuns, maxOutputTapes);
    2671             : 
    2672             :     /*
    2673             :      * Each output tape consumes TAPE_BUFFER_OVERHEAD bytes of memory.  All
    2674             :      * remaining memory is divided evenly between the input tapes.
    2675             :      *
    2676             :      * This also follows from the formula in tuplesort_merge_order, but here
    2677             :      * we derive the input buffer size from the amount of memory available,
    2678             :      * and M and N.
    2679             :      */
    2680         240 :     return Max((avail_mem - TAPE_BUFFER_OVERHEAD * nOutputTapes) / nInputTapes, 0);
    2681             : }
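                     : 
                     : /*
                     :  * Editor's note: a worked instance of the calculation above, with
                     :  * illustrative numbers: avail_mem = 1MB, nInputTapes = 4, nInputRuns = 7,
                     :  * ample maxOutputTapes, and the default 8kB TAPE_BUFFER_OVERHEAD:
                     :  *
                     :  *   nOutputRuns      = ceil(7 / 4) = 2
                     :  *   nOutputTapes     = Min(2, maxOutputTapes) = 2
                     :  *   per-input buffer = (1048576 - 2 * 8192) / 4 = 258048 bytes
                     :  */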
    2682             : 
    2683             : /*
    2684             :  * inittapes - initialize for tape sorting.
    2685             :  *
    2686             :  * This is called only if we have found we won't sort in memory.
    2687             :  */
    2688             : static void
    2689         360 : inittapes(Tuplesortstate *state, bool mergeruns)
    2690             : {
    2691             :     Assert(!LEADER(state));
    2692             : 
    2693         360 :     if (mergeruns)
    2694             :     {
    2695             :         /* Compute number of input tapes to use when merging */
    2696          80 :         state->maxTapes = tuplesort_merge_order(state->allowedMem);
    2697             :     }
    2698             :     else
    2699             :     {
    2700             :         /* Workers can sometimes produce single run, output without merge */
    2701             :         /* Workers can sometimes produce a single run, output without a merge */
    2702         280 :         state->maxTapes = MINORDER;
    2703             :     }
    2704             : 
    2705             : #ifdef TRACE_SORT
    2706         360 :     if (trace_sort)
    2707           0 :         elog(LOG, "worker %d switching to external sort with %d tapes: %s",
    2708             :              state->worker, state->maxTapes, pg_rusage_show(&state->ru_start));
    2709             : #endif
    2710             : 
    2711             :     /* Create the tape set */
    2712         360 :     inittapestate(state, state->maxTapes);
    2713         360 :     state->tapeset =
    2714         360 :         LogicalTapeSetCreate(false,
    2715         360 :                              state->shared ? &state->shared->fileset : NULL,
    2716             :                              state->worker);
    2717             : 
    2718         360 :     state->currentRun = 0;
    2719             : 
    2720             :     /*
    2721             :      * Initialize logical tape arrays.
    2722             :      */
    2723         360 :     state->inputTapes = NULL;
    2724         360 :     state->nInputTapes = 0;
    2725         360 :     state->nInputRuns = 0;
    2726             : 
    2727         360 :     state->outputTapes = palloc0(state->maxTapes * sizeof(LogicalTape *));
    2728         360 :     state->nOutputTapes = 0;
    2729         360 :     state->nOutputRuns = 0;
    2730             : 
    2731         360 :     state->status = TSS_BUILDRUNS;
    2732             : 
    2733         360 :     selectnewtape(state);
    2734         360 : }
    2735             : 
    2736             : /*
    2737             :  * inittapestate - initialize generic tape management state
    2738             :  */
    2739             : static void
    2740         456 : inittapestate(Tuplesortstate *state, int maxTapes)
    2741             : {
    2742             :     int64       tapeSpace;
    2743             : 
    2744             :     /*
    2745             :      * Decrease availMem to reflect the space needed for tape buffers; but
    2746             :      * don't decrease it to the point that we have no room for tuples. (That
    2747             :      * case is only likely to occur if sorting pass-by-value Datums; in all
    2748             :      * other scenarios the memtuples[] array is unlikely to occupy more than
    2749             :      * half of allowedMem.  In the pass-by-value case it's not important to
    2750             :      * account for tuple space, so we don't care if LACKMEM becomes
    2751             :      * inaccurate.)
    2752             :      */
    2753         456 :     tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;
    2754             : 
    2755         456 :     if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
    2756         408 :         USEMEM(state, tapeSpace);
    2757             : 
    2758             :     /*
    2759             :      * Make sure that the temp file(s) underlying the tape set are created in
    2760             :      * suitable temp tablespaces.  For parallel sorts, this should have been
    2761             :      * called already, but it doesn't matter if it is called a second time.
    2762             :      */
    2763         456 :     PrepareTempTablespaces();
    2764         456 : }
    2765             : 
    2766             : /*
    2767             :  * selectnewtape -- select next tape to output to.
    2768             :  *
    2769             :  * This is called after finishing a run when we know another run
    2770             :  * must be started.  This is used both when building the initial
    2771             :  * runs, and during merge passes.
    2772             :  */
    2773             : static void
    2774        4184 : selectnewtape(Tuplesortstate *state)
    2775             : {
    2776             :     /*
    2777             :      * At the beginning of each merge pass, nOutputTapes and nOutputRuns are
    2778             :      * both zero.  On each call, we create a new output tape to hold the next
    2779             :      * run, until maxTapes is reached.  After that, we assign new runs to the
    2780             :      * existing tapes in a round robin fashion.
    2781             :      */
    2782        4184 :     if (state->nOutputTapes < state->maxTapes)
    2783             :     {
    2784             :         /* Create a new tape to hold the next run */
    2785             :         Assert(state->outputTapes[state->nOutputRuns] == NULL);
    2786             :         Assert(state->nOutputRuns == state->nOutputTapes);
    2787         928 :         state->destTape = LogicalTapeCreate(state->tapeset);
    2788         928 :         state->outputTapes[state->nOutputTapes] = state->destTape;
    2789         928 :         state->nOutputTapes++;
    2790         928 :         state->nOutputRuns++;
    2791             :     }
    2792             :     else
    2793             :     {
    2794             :         /*
    2795             :          * We have reached the max number of tapes.  Append to an existing
    2796             :          * tape.
    2797             :          */
    2798        3256 :         state->destTape = state->outputTapes[state->nOutputRuns % state->nOutputTapes];
    2799        3256 :         state->nOutputRuns++;
    2800             :     }
    2801        4184 : }
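                     : 
                     : /*
                     :  * Editor's note: how the assignment above plays out with maxTapes = 4.
                     :  * The first four calls create tapes T0..T3; on the fifth call,
                     :  * nOutputRuns is 4, so 4 % 4 selects T0, the sixth call selects T1,
                     :  * and so on -- later runs cycle round-robin over the existing tapes.
                     :  */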
    2802             : 
    2803             : /*
    2804             :  * Initialize the slab allocation arena, for the given number of slots.
    2805             :  */
    2806             : static void
    2807         176 : init_slab_allocator(Tuplesortstate *state, int numSlots)
    2808             : {
    2809         176 :     if (numSlots > 0)
    2810             :     {
    2811             :         char       *p;
    2812             :         int         i;
    2813             : 
    2814         164 :         state->slabMemoryBegin = palloc(numSlots * SLAB_SLOT_SIZE);
    2815         164 :         state->slabMemoryEnd = state->slabMemoryBegin +
    2816         164 :             numSlots * SLAB_SLOT_SIZE;
    2817         164 :         state->slabFreeHead = (SlabSlot *) state->slabMemoryBegin;
    2818         164 :         USEMEM(state, numSlots * SLAB_SLOT_SIZE);
    2819             : 
    2820         164 :         p = state->slabMemoryBegin;
    2821         672 :         for (i = 0; i < numSlots - 1; i++)
    2822             :         {
    2823         508 :             ((SlabSlot *) p)->nextfree = (SlabSlot *) (p + SLAB_SLOT_SIZE);
    2824         508 :             p += SLAB_SLOT_SIZE;
    2825             :         }
    2826         164 :         ((SlabSlot *) p)->nextfree = NULL;
    2827             :     }
    2828             :     else
    2829             :     {
    2830          12 :         state->slabMemoryBegin = state->slabMemoryEnd = NULL;
    2831          12 :         state->slabFreeHead = NULL;
    2832             :     }
    2833         176 :     state->slabAllocatorUsed = true;
    2834         176 : }
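                     : 
                     : /*
                     :  * Editor's note: after the loop above, the arena is one contiguous
                     :  * allocation threaded into a singly linked free list:
                     :  *
                     :  *   slabFreeHead -> slot0 -> slot1 -> ... -> slot(numSlots-1) -> NULL
                     :  *
                     :  * Allocation pops from slabFreeHead; RELEASE_SLAB_SLOT (seen in the
                     :  * fetch paths earlier) pushes slab-owned slots back onto the list.
                     :  */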
    2835             : 
    2836             : /*
    2837             :  * mergeruns -- merge all the completed initial runs.
    2838             :  *
    2839             :  * This implements the Balanced k-Way Merge Algorithm.  All input data has
    2840             :  * already been written to initial runs on tape (see dumptuples).
    2841             :  */
    2842             : static void
    2843         176 : mergeruns(Tuplesortstate *state)
    2844             : {
    2845             :     int         tapenum;
    2846             : 
    2847             :     Assert(state->status == TSS_BUILDRUNS);
    2848             :     Assert(state->memtupcount == 0);
    2849             : 
    2850         176 :     if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
    2851             :     {
    2852             :         /*
    2853             :          * If there are multiple runs to be merged, when we go to read back
    2854             :          * tuples from disk, abbreviated keys will not have been stored, and
    2855             :          * we don't care to regenerate them.  Disable abbreviation from this
    2856             :          * point on.
    2857             :          */
    2858          16 :         state->sortKeys->abbrev_converter = NULL;
    2859          16 :         state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
    2860             : 
    2861             :         /* Not strictly necessary, but be tidy */
    2862          16 :         state->sortKeys->abbrev_abort = NULL;
    2863          16 :         state->sortKeys->abbrev_full_comparator = NULL;
    2864             :     }
    2865             : 
    2866             :     /*
    2867             :      * Reset tuple memory.  We've freed all the tuples that we previously
    2868             :      * allocated.  We will use the slab allocator from now on.
    2869             :      */
    2870         176 :     MemoryContextResetOnly(state->tuplecontext);
    2871             : 
    2872             :     /*
    2873             :      * We no longer need a large memtuples array.  (We will allocate a smaller
    2874             :      * one for the heap later.)
    2875             :      */
    2876         176 :     FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
    2877         176 :     pfree(state->memtuples);
    2878         176 :     state->memtuples = NULL;
    2879             : 
    2880             :     /*
    2881             :      * Initialize the slab allocator.  We need one slab slot per input tape,
    2882             :      * for the tuples in the heap, plus one to hold the tuple last returned
    2883             :      * from tuplesort_gettuple.  (If we're sorting pass-by-val Datums,
    2884             :  * however, we don't need to allocate anything.)
    2885             :      *
    2886             :      * In a multi-pass merge, we could shrink this allocation for the last
    2887             :      * merge pass, if it has fewer tapes than previous passes, but we don't
    2888             :      * bother.
    2889             :      *
    2890             :      * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism
    2891             :      * to track memory usage of individual tuples.
    2892             :      */
    2893         176 :     if (state->tuples)
    2894         164 :         init_slab_allocator(state, state->nOutputTapes + 1);
    2895             :     else
    2896          12 :         init_slab_allocator(state, 0);
    2897             : 
    2898             :     /*
    2899             :      * Allocate a new 'memtuples' array, for the heap.  It will hold one tuple
    2900             :      * from each input tape.
    2901             :      *
    2902             :      * We could shrink this, too, between passes in a multi-pass merge, but we
    2903             :      * don't bother.  (The initial input tapes are still in outputTapes.  The
    2904             :      * number of input tapes will not increase between passes.)
    2905             :      */
    2906         176 :     state->memtupsize = state->nOutputTapes;
    2907         352 :     state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
    2908         176 :                                                         state->nOutputTapes * sizeof(SortTuple));
    2909         176 :     USEMEM(state, GetMemoryChunkSpace(state->memtuples));
    2910             : 
    2911             :     /*
    2912             :      * Use all the remaining memory we have available for tape buffers among
    2913             :      * all the input tapes.  At the beginning of each merge pass, we will
    2914             :      * divide this memory between the input and output tapes in the pass.
    2915             :      */
    2916         176 :     state->tape_buffer_mem = state->availMem;
    2917         176 :     USEMEM(state, state->tape_buffer_mem);
    2918             : #ifdef TRACE_SORT
    2919         176 :     if (trace_sort)
    2920           0 :         elog(LOG, "worker %d using %zu KB of memory for tape buffers",
    2921             :              state->worker, state->tape_buffer_mem / 1024);
    2922             : #endif
    2923             : 
    2924             :     for (;;)
    2925             :     {
    2926             :         /*
    2927             :          * On the first iteration, or if we have read all the runs from the
    2928             :          * input tapes in a multi-pass merge, it's time to start a new pass.
    2929             :          * Rewind all the output tapes, and make them inputs for the next
    2930             :          * pass.
    2931             :          */
    2932         804 :         if (state->nInputRuns == 0)
    2933             :         {
    2934             :             int64       input_buffer_size;
    2935             : 
    2936             :             /* Close the old, emptied, input tapes */
    2937         240 :             if (state->nInputTapes > 0)
    2938             :             {
    2939         448 :                 for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)
    2940         384 :                     LogicalTapeClose(state->inputTapes[tapenum]);
    2941          64 :                 pfree(state->inputTapes);
    2942             :             }
    2943             : 
    2944             :             /* Previous pass's outputs become next pass's inputs. */
    2945         240 :             state->inputTapes = state->outputTapes;
    2946         240 :             state->nInputTapes = state->nOutputTapes;
    2947         240 :             state->nInputRuns = state->nOutputRuns;
    2948             : 
    2949             :             /*
    2950             :              * Reset output tape variables.  The actual LogicalTapes will be
    2951             :              * created as needed; here we only allocate the array to hold
    2952             :              * them.
    2953             :              */
    2954         240 :             state->outputTapes = palloc0(state->nInputTapes * sizeof(LogicalTape *));
    2955         240 :             state->nOutputTapes = 0;
    2956         240 :             state->nOutputRuns = 0;
    2957             : 
    2958             :             /*
    2959             :              * Redistribute the memory allocated for tape buffers, among the
    2960             :              * new input and output tapes.
    2961             :              */
    2962         240 :             input_buffer_size = merge_read_buffer_size(state->tape_buffer_mem,
    2963             :                                                        state->nInputTapes,
    2964             :                                                        state->nInputRuns,
    2965             :                                                        state->maxTapes);
    2966             : 
    2967             : #ifdef TRACE_SORT
    2968         240 :             if (trace_sort)
    2969           0 :                 elog(LOG, "starting merge pass of %d input runs on %d tapes, " INT64_FORMAT " KB of memory for each input tape: %s",
    2970             :                      state->nInputRuns, state->nInputTapes, input_buffer_size / 1024,
    2971             :                      pg_rusage_show(&state->ru_start));
    2972             : #endif
    2973             : 
    2974             :             /* Prepare the new input tapes for merge pass. */
    2975        1068 :             for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)
    2976         828 :                 LogicalTapeRewindForRead(state->inputTapes[tapenum], input_buffer_size);
    2977             : 
    2978             :             /*
    2979             :              * If there's at most one run left on each input tape, then only one
    2980             :              * merge pass remains.  If we don't have to produce a materialized
    2981             :              * sorted tape, we can stop at this point and do the final merge
    2982             :              * on-the-fly.
    2983             :              */
    2984         240 :             if (!state->randomAccess && state->nInputRuns <= state->nInputTapes
    2985         164 :                 && !WORKER(state))
    2986             :             {
    2987             :                 /* Tell logtape.c we won't be writing anymore */
    2988         164 :                 LogicalTapeSetForgetFreeSpace(state->tapeset);
    2989             :                 /* Initialize for the final merge pass */
    2990         164 :                 beginmerge(state);
    2991         164 :                 state->status = TSS_FINALMERGE;
    2992         164 :                 return;
    2993             :             }
    2994             :         }
    2995             : 
    2996             :         /* Select an output tape */
    2997         640 :         selectnewtape(state);
    2998             : 
    2999             :         /* Merge one run from each input tape. */
    3000         640 :         mergeonerun(state);
    3001             : 
    3002             :         /*
    3003             :          * If the input tapes are empty, and we've written only one run,
    3004             :          * we're done.  The current output tape contains the final result.
    3005             :          */
    3006         640 :         if (state->nInputRuns == 0 && state->nOutputRuns <= 1)
    3007          12 :             break;
    3008             :     }
    3009             : 
    3010             :     /*
    3011             :      * Done.  The result is on a single run on a single tape.
    3012             :      */
    3013          12 :     state->result_tape = state->outputTapes[0];
    3014          12 :     if (!WORKER(state))
    3015          12 :         LogicalTapeFreeze(state->result_tape, NULL);
    3016             :     else
    3017           0 :         worker_freeze_result_tape(state);
    3018          12 :     state->status = TSS_SORTEDONTAPE;
    3019             : 
    3020             :     /* Close all the now-empty input tapes, to release their read buffers. */
    3021          56 :     for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)
    3022          44 :         LogicalTapeClose(state->inputTapes[tapenum]);
    3023             : }
    3024             : 
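Because each pass of the loop above merges up to nInputTapes runs into one output run, the number of merge passes grows only logarithmically in the number of initial runs.  A small standalone sketch of that arithmetic (illustrative; tuplesort derives this implicitly from the loop rather than computing it up front):

    #include <stdio.h>

    /*
     * Number of merge passes needed to reduce 'nruns' initial runs to one,
     * merging at most 'ntapes' runs per pass (a balanced k-way merge).
     */
    static int
    merge_passes(int nruns, int ntapes)
    {
        int         passes = 0;

        while (nruns > 1)
        {
            /* each pass merges groups of up to 'ntapes' runs into one run */
            nruns = (nruns + ntapes - 1) / ntapes;
            passes++;
        }
        return passes;
    }

    int
    main(void)
    {
        /* e.g. 928 initial runs on 32 tapes: 928 -> 29 -> 1, two passes */
        printf("%d\n", merge_passes(928, 32));
        return 0;
    }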
    3025             : /*
    3026             :  * Merge one run from each input tape.
    3027             :  */
    3028             : static void
    3029         640 : mergeonerun(Tuplesortstate *state)
    3030             : {
    3031             :     int         srcTapeIndex;
    3032             :     LogicalTape *srcTape;
    3033             : 
    3034             :     /*
    3035             :      * Start the merge by loading one tuple from each active source tape into
    3036             :      * the heap.
    3037             :      */
    3038         640 :     beginmerge(state);
    3039             : 
    3040             :     /*
    3041             :      * Execute merge by repeatedly extracting lowest tuple in heap, writing it
    3042             :      * out, and replacing it with next tuple from same tape (if there is
    3043             :      * another one).
    3044             :      */
    3045     1411356 :     while (state->memtupcount > 0)
    3046             :     {
    3047             :         SortTuple   stup;
    3048             : 
    3049             :         /* write the tuple to destTape */
    3050     1410716 :         srcTapeIndex = state->memtuples[0].srctape;
    3051     1410716 :         srcTape = state->inputTapes[srcTapeIndex];
    3052     1410716 :         WRITETUP(state, state->destTape, &state->memtuples[0]);
    3053             : 
    3054             :         /* recycle the slot of the tuple we just wrote out, for the next read */
    3055     1410716 :         if (state->memtuples[0].tuple)
    3056     1330592 :             RELEASE_SLAB_SLOT(state, state->memtuples[0].tuple);
    3057             : 
    3058             :         /*
    3059             :          * pull next tuple from the tape, and replace the written-out tuple in
    3060             :          * the heap with it.
    3061             :          */
    3062     1410716 :         if (mergereadnext(state, srcTape, &stup))
    3063             :         {
    3064     1407032 :             stup.srctape = srcTapeIndex;
    3065     1407032 :             tuplesort_heap_replace_top(state, &stup);
    3066             : 
    3067             :         }
    3068             :         else
    3069             :         {
    3070        3684 :             tuplesort_heap_delete_top(state);
    3071        3684 :             state->nInputRuns--;
    3072             :         }
    3073             :     }
    3074             : 
    3075             :     /*
    3076             :      * When the heap empties, we're done.  Write an end-of-run marker on the
    3077             :      * output tape.
    3078             :      */
    3079         640 :     markrunend(state->destTape);
    3080         640 : }
    3081             : 
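The replace-top pattern above (write out the root, then overwrite it with the next tuple from the same source and sift down) costs one heap operation per output tuple instead of a delete plus an insert.  The self-contained sketch below applies the same pattern to merge sorted int arrays; all names and types are invented for illustration and are not the tuplesort API.

    #include <stdio.h>

    typedef struct
    {
        int         value;          /* current value from this input */
        const int  *rest;           /* remaining values */
        int         nleft;          /* count of remaining values */
    } HeapItem;

    /* Sift the item at 'start' down to its place in a min-heap of size n. */
    static void
    sift_down(HeapItem *heap, int start, int n)
    {
        int         i = start;

        for (;;)
        {
            int         j = 2 * i + 1;
            HeapItem    tmp;

            if (j >= n)
                break;
            if (j + 1 < n && heap[j + 1].value < heap[j].value)
                j++;
            if (heap[i].value <= heap[j].value)
                break;
            tmp = heap[i];
            heap[i] = heap[j];
            heap[j] = tmp;
            i = j;
        }
    }

    int
    main(void)
    {
        const int   a[] = {1, 4, 9};
        const int   b[] = {2, 3, 8};
        const int   c[] = {5, 6, 7};
        HeapItem    heap[3] = {
            {a[0], a + 1, 2},
            {b[0], b + 1, 2},
            {c[0], c + 1, 2},
        };
        int         n = 3;

        /* build the heap, as beginmerge() does with the first tuple per tape */
        for (int i = n / 2 - 1; i >= 0; i--)
            sift_down(heap, i, n);

        while (n > 0)
        {
            printf("%d ", heap[0].value);   /* "write out" the smallest */
            if (heap[0].nleft > 0)
            {
                /* replace the top with the next value from the same input */
                heap[0].value = *heap[0].rest++;
                heap[0].nleft--;
                sift_down(heap, 0, n);
            }
            else
            {
                /* this input is exhausted: shrink the heap */
                heap[0] = heap[--n];
                sift_down(heap, 0, n);
            }
        }
        printf("\n");               /* prints: 1 2 3 4 5 6 7 8 9 */
        return 0;
    }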
    3082             : /*
    3083             :  * beginmerge - initialize for a merge pass
    3084             :  *
    3085             :  * Fill the merge heap with the first tuple from each input tape.
    3086             :  */
    3087             : static void
    3088         804 : beginmerge(Tuplesortstate *state)
    3089             : {
    3090             :     int         activeTapes;
    3091             :     int         srcTapeIndex;
    3092             : 
    3093             :     /* Heap should be empty here */
    3094             :     Assert(state->memtupcount == 0);
    3095             : 
    3096         804 :     activeTapes = Min(state->nInputTapes, state->nInputRuns);
    3097             : 
    3098        4888 :     for (srcTapeIndex = 0; srcTapeIndex < activeTapes; srcTapeIndex++)
    3099             :     {
    3100             :         SortTuple   tup;
    3101             : 
    3102        4084 :         if (mergereadnext(state, state->inputTapes[srcTapeIndex], &tup))
    3103             :         {
    3104        3932 :             tup.srctape = srcTapeIndex;
    3105        3932 :             tuplesort_heap_insert(state, &tup);
    3106             :         }
    3107             :     }
    3108         804 : }
    3109             : 
    3110             : /*
    3111             :  * mergereadnext - read next tuple from one merge input tape
    3112             :  *
    3113             :  * Returns false on EOF.
    3114             :  */
    3115             : static bool
    3116     5222632 : mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup)
    3117             : {
    3118             :     unsigned int tuplen;
    3119             : 
    3120             :     /* read next tuple, if any */
    3121     5222632 :     if ((tuplen = getlen(srcTape, true)) == 0)
    3122        4064 :         return false;
    3123     5218568 :     READTUP(state, stup, srcTape, tuplen);
    3124             : 
    3125     5218568 :     return true;
    3126             : }
    3127             : 
    3128             : /*
    3129             :  * dumptuples - remove tuples from memtuples and write an initial run to tape
    3130             :  *
    3131             :  * When alltuples = true, dump everything currently in memory.  (This case is
    3132             :  * only used at end of input data.)
    3133             :  */
    3134             : static void
    3135      775204 : dumptuples(Tuplesortstate *state, bool alltuples)
    3136             : {
    3137             :     int         memtupwrite;
    3138             :     int         i;
    3139             : 
    3140             :     /*
    3141             :      * Nothing to do if we still fit in available memory and have array slots,
    3142             :      * unless this is the final call during initial run generation.
    3143             :      */
    3144      775204 :     if (state->memtupcount < state->memtupsize && !LACKMEM(state) &&
    3145      772020 :         !alltuples)
    3146      771660 :         return;
    3147             : 
    3148             :     /*
    3149             :      * Final call might require no sorting, in rare cases where we just so
    3150             :      * happen to have previously LACKMEM()'d at the point where exactly all
    3151             :      * remaining tuples are loaded into memory, just before input was
    3152             :      * exhausted.  In general, short final runs are quite possible, but avoid
    3153             :      * creating a completely empty run.  In a worker, though, we must produce
    3154             :      * at least one tape, even if it's empty.
    3155             :      */
    3156        3544 :     if (state->memtupcount == 0 && state->currentRun > 0)
    3157           0 :         return;
    3158             : 
    3159             :     Assert(state->status == TSS_BUILDRUNS);
    3160             : 
    3161             :     /*
    3162             :      * It seems unlikely that this limit will ever be exceeded, but take no
    3163             :      * chances.
    3164             :      */
    3165        3544 :     if (state->currentRun == INT_MAX)
    3166           0 :         ereport(ERROR,
    3167             :                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
    3168             :                  errmsg("cannot have more than %d runs for an external sort",
    3169             :                         INT_MAX)));
    3170             : 
    3171        3544 :     if (state->currentRun > 0)
    3172        3184 :         selectnewtape(state);
    3173             : 
    3174        3544 :     state->currentRun++;
    3175             : 
    3176             : #ifdef TRACE_SORT
    3177        3544 :     if (trace_sort)
    3178           0 :         elog(LOG, "worker %d starting quicksort of run %d: %s",
    3179             :              state->worker, state->currentRun,
    3180             :              pg_rusage_show(&state->ru_start));
    3181             : #endif
    3182             : 
    3183             :     /*
    3184             :      * Sort all tuples accumulated within the allowed amount of memory for
    3185             :      * this run using quicksort
    3186             :      */
    3187        3544 :     tuplesort_sort_memtuples(state);
    3188             : 
    3189             : #ifdef TRACE_SORT
    3190        3544 :     if (trace_sort)
    3191           0 :         elog(LOG, "worker %d finished quicksort of run %d: %s",
    3192             :              state->worker, state->currentRun,
    3193             :              pg_rusage_show(&state->ru_start));
    3194             : #endif
    3195             : 
    3196        3544 :     memtupwrite = state->memtupcount;
    3197     4141544 :     for (i = 0; i < memtupwrite; i++)
    3198             :     {
    3199     4138000 :         WRITETUP(state, state->destTape, &state->memtuples[i]);
    3200     4138000 :         state->memtupcount--;
    3201             :     }
    3202             : 
    3203             :     /*
    3204             :      * Reset tuple memory.  We've freed all of the tuples that we previously
    3205             :      * allocated.  It's important to avoid fragmentation when there is a stark
    3206             :      * change in the sizes of incoming tuples.  Fragmentation due to
    3207             :      * AllocSetFree's bucketing by size class might be particularly bad if
    3208             :      * this step wasn't taken.
    3209             :      */
    3210        3544 :     MemoryContextReset(state->tuplecontext);
    3211             : 
    3212        3544 :     markrunend(state->destTape);
    3213             : 
    3214             : #ifdef TRACE_SORT
    3215        3544 :     if (trace_sort)
    3216           0 :         elog(LOG, "worker %d finished writing run %d to tape %d: %s",
    3217             :              state->worker, state->currentRun, (state->currentRun - 1) % state->nOutputTapes + 1,
    3218             :              pg_rusage_show(&state->ru_start));
    3219             : #endif
    3220             : }
    3221             : 
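Run-to-tape assignment during run generation follows selectnewtape() above: the first maxTapes runs each get a fresh output tape, after which later runs are appended to the existing tapes in a cycle.  A toy model of that policy (invented names, illustrative only):

    #include <stdio.h>

    int
    main(void)
    {
        int         maxtapes = 4;
        int         ntapes = 0;

        for (int run = 0; run < 10; run++)
        {
            int         dest;

            if (ntapes < maxtapes)
                dest = ntapes++;        /* create a new tape */
            else
                dest = run % ntapes;    /* append, round-robin */
            printf("run %d -> tape %d\n", run, dest);
        }
        return 0;
    }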
    3222             : /*
    3223             :  * tuplesort_rescan     - rewind and replay the scan
    3224             :  */
    3225             : void
    3226          36 : tuplesort_rescan(Tuplesortstate *state)
    3227             : {
    3228          36 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    3229             : 
    3230             :     Assert(state->randomAccess);
    3231             : 
    3232          36 :     switch (state->status)
    3233             :     {
    3234          32 :         case TSS_SORTEDINMEM:
    3235          32 :             state->current = 0;
    3236          32 :             state->eof_reached = false;
    3237          32 :             state->markpos_offset = 0;
    3238          32 :             state->markpos_eof = false;
    3239          32 :             break;
    3240           4 :         case TSS_SORTEDONTAPE:
    3241           4 :             LogicalTapeRewindForRead(state->result_tape, 0);
    3242           4 :             state->eof_reached = false;
    3243           4 :             state->markpos_block = 0L;
    3244           4 :             state->markpos_offset = 0;
    3245           4 :             state->markpos_eof = false;
    3246           4 :             break;
    3247           0 :         default:
    3248           0 :             elog(ERROR, "invalid tuplesort state");
    3249             :             break;
    3250             :     }
    3251             : 
    3252          36 :     MemoryContextSwitchTo(oldcontext);
    3253          36 : }
    3254             : 
    3255             : /*
    3256             :  * tuplesort_markpos    - saves current position in the merged sort file
    3257             :  */
    3258             : void
    3259      354220 : tuplesort_markpos(Tuplesortstate *state)
    3260             : {
    3261      354220 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    3262             : 
    3263             :     Assert(state->randomAccess);
    3264             : 
    3265      354220 :     switch (state->status)
    3266             :     {
    3267      348348 :         case TSS_SORTEDINMEM:
    3268      348348 :             state->markpos_offset = state->current;
    3269      348348 :             state->markpos_eof = state->eof_reached;
    3270      348348 :             break;
    3271        5872 :         case TSS_SORTEDONTAPE:
    3272        5872 :             LogicalTapeTell(state->result_tape,
    3273             :                             &state->markpos_block,
    3274             :                             &state->markpos_offset);
    3275        5872 :             state->markpos_eof = state->eof_reached;
    3276        5872 :             break;
    3277           0 :         default:
    3278           0 :             elog(ERROR, "invalid tuplesort state");
    3279             :             break;
    3280             :     }
    3281             : 
    3282      354220 :     MemoryContextSwitchTo(oldcontext);
    3283      354220 : }
    3284             : 
    3285             : /*
    3286             :  * tuplesort_restorepos - restores current position in merged sort file to
    3287             :  *                        last saved position
    3288             :  */
    3289             : void
    3290       17050 : tuplesort_restorepos(Tuplesortstate *state)
    3291             : {
    3292       17050 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
    3293             : 
    3294             :     Assert(state->randomAccess);
    3295             : 
    3296       17050 :     switch (state->status)
    3297             :     {
    3298       12922 :         case TSS_SORTEDINMEM:
    3299       12922 :             state->current = state->markpos_offset;
    3300       12922 :             state->eof_reached = state->markpos_eof;
    3301       12922 :             break;
    3302        4128 :         case TSS_SORTEDONTAPE:
    3303        4128 :             LogicalTapeSeek(state->result_tape,
    3304             :                             state->markpos_block,
    3305             :                             state->markpos_offset);
    3306        4128 :             state->eof_reached = state->markpos_eof;
    3307        4128 :             break;
    3308           0 :         default:
    3309           0 :             elog(ERROR, "invalid tuplesort state");
    3310             :             break;
    3311             :     }
    3312             : 
    3313       17050 :     MemoryContextSwitchTo(oldcontext);
    3314       17050 : }
    3315             : 
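The three routines above implement the mark/restore protocol that merge join depends on: save the position just before a group of equal-keyed tuples, then rewind to it to rescan the group.  A toy, self-contained analogy over an in-memory cursor (not the PostgreSQL API):

    #include <stdio.h>

    /*
     * A cursor over sorted data that can save one position and rewind to
     * it, mirroring tuplesort_markpos()/tuplesort_restorepos().
     */
    typedef struct
    {
        const int  *data;
        int         n;
        int         current;        /* next element to return */
        int         markpos;        /* saved position */
    } Cursor;

    static int
    cursor_next(Cursor *c, int *out)
    {
        if (c->current >= c->n)
            return 0;               /* "EOF" */
        *out = c->data[c->current++];
        return 1;
    }

    static void cursor_mark(Cursor *c)    { c->markpos = c->current; }
    static void cursor_restore(Cursor *c) { c->current = c->markpos; }

    int
    main(void)
    {
        const int   vals[] = {10, 20, 20, 30};
        Cursor      c = {vals, 4, 0, 0};
        int         v;

        cursor_next(&c, &v);        /* 10 */
        cursor_mark(&c);            /* remember where the 20s group starts */
        cursor_next(&c, &v);        /* 20 */
        cursor_next(&c, &v);        /* 20 */
        cursor_restore(&c);         /* rescan the group */
        while (cursor_next(&c, &v))
            printf("%d ", v);       /* prints: 20 20 30 */
        printf("\n");
        return 0;
    }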
    3316             : /*
    3317             :  * tuplesort_get_stats - extract summary statistics
    3318             :  *
    3319             :  * This can be called after tuplesort_performsort() finishes to obtain
    3320             :  * printable summary information about how the sort was performed.
    3321             :  */
    3322             : void
    3323         196 : tuplesort_get_stats(Tuplesortstate *state,
    3324             :                     TuplesortInstrumentation *stats)
    3325             : {
    3326             :     /*
    3327             :      * Note: it might seem we should provide both memory and disk usage for a
    3328             :      * disk-based sort.  However, the current code doesn't track memory space
    3329             :      * accurately once we have begun to return tuples to the caller (since we
    3330             :      * don't account for pfree's the caller is expected to do), so we cannot
    3331             :      * rely on availMem in a disk sort.  This does not seem worth the overhead
    3332             :      * to fix.  Is it worth creating an API for the memory context code to
    3333             :      * tell us how much is actually used in sortcontext?
    3334             :      */
    3335         196 :     tuplesort_updatemax(state);
    3336             : 
    3337         196 :     if (state->isMaxSpaceDisk)
    3338           4 :         stats->spaceType = SORT_SPACE_TYPE_DISK;
    3339             :     else
    3340         192 :         stats->spaceType = SORT_SPACE_TYPE_MEMORY;
    3341         196 :     stats->spaceUsed = (state->maxSpace + 1023) / 1024;
    3342             : 
    3343         196 :     switch (state->maxSpaceStatus)
    3344             :     {
    3345         192 :         case TSS_SORTEDINMEM:
    3346         192 :             if (state->boundUsed)
    3347          28 :                 stats->sortMethod = SORT_TYPE_TOP_N_HEAPSORT;
    3348             :             else
    3349         164 :                 stats->sortMethod = SORT_TYPE_QUICKSORT;
    3350         192 :             break;
    3351           0 :         case TSS_SORTEDONTAPE:
    3352           0 :             stats->sortMethod = SORT_TYPE_EXTERNAL_SORT;
    3353           0 :             break;
    3354           4 :         case TSS_FINALMERGE:
    3355           4 :             stats->sortMethod = SORT_TYPE_EXTERNAL_MERGE;
    3356           4 :             break;
    3357           0 :         default:
    3358           0 :             stats->sortMethod = SORT_TYPE_STILL_IN_PROGRESS;
    3359           0 :             break;
    3360             :     }
    3361         196 : }
    3362             : 
    3363             : /*
    3364             :  * Convert TuplesortMethod to a string.
    3365             :  */
    3366             : const char *
    3367         128 : tuplesort_method_name(TuplesortMethod m)
    3368             : {
    3369         128 :     switch (m)
    3370             :     {
    3371           0 :         case SORT_TYPE_STILL_IN_PROGRESS:
    3372           0 :             return "still in progress";
    3373          28 :         case SORT_TYPE_TOP_N_HEAPSORT:
    3374          28 :             return "top-N heapsort";
    3375          96 :         case SORT_TYPE_QUICKSORT:
    3376          96 :             return "quicksort";
    3377           0 :         case SORT_TYPE_EXTERNAL_SORT:
    3378           0 :             return "external sort";
    3379           4 :         case SORT_TYPE_EXTERNAL_MERGE:
    3380           4 :             return "external merge";
    3381             :     }
    3382             : 
    3383           0 :     return "unknown";
    3384             : }
    3385             : 
    3386             : /*
    3387             :  * Convert TuplesortSpaceType to a string.
    3388             :  */
    3389             : const char *
    3390         104 : tuplesort_space_type_name(TuplesortSpaceType t)
    3391             : {
    3392             :     Assert(t == SORT_SPACE_TYPE_DISK || t == SORT_SPACE_TYPE_MEMORY);
    3393         104 :     return t == SORT_SPACE_TYPE_DISK ? "Disk" : "Memory";
    3394             : }
    3395             : 
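Taken together, these three functions are how a caller such as EXPLAIN ANALYZE describes a finished sort.  A minimal sketch of such a caller, assuming a backend context and a completed Tuplesortstate (so, unlike the other examples here, this fragment is not standalone-compilable):

    static void
    report_sort(Tuplesortstate *sortstate)
    {
        TuplesortInstrumentation stats;

        tuplesort_get_stats(sortstate, &stats);

        /* e.g. "Sort Method: quicksort  Memory: 25kB" */
        elog(LOG, "Sort Method: %s  %s: " INT64_FORMAT "kB",
             tuplesort_method_name(stats.sortMethod),
             tuplesort_space_type_name(stats.spaceType),
             stats.spaceUsed);
    }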
    3396             : 
    3397             : /*
    3398             :  * Heap manipulation routines, per Knuth's Algorithm 5.2.3H.
    3399             :  */
    3400             : 
    3401             : /*
    3402             :  * Convert the existing unordered array of SortTuples to a bounded heap,
    3403             :  * discarding all but the smallest "state->bound" tuples.
    3404             :  *
    3405             :  * When working with a bounded heap, we want to keep the largest entry
    3406             :  * at the root (array entry zero), instead of the smallest as in the normal
    3407             :  * sort case.  This allows us to discard the largest entry cheaply.
    3408             :  * Therefore, we temporarily reverse the sort direction.
    3409             :  */
    3410             : static void
    3411         322 : make_bounded_heap(Tuplesortstate *state)
    3412             : {
    3413         322 :     int         tupcount = state->memtupcount;
    3414             :     int         i;
    3415             : 
    3416             :     Assert(state->status == TSS_INITIAL);
    3417             :     Assert(state->bounded);
    3418             :     Assert(tupcount >= state->bound);
    3419             :     Assert(SERIAL(state));
    3420             : 
    3421             :     /* Reverse sort direction so largest entry will be at root */
    3422         322 :     reversedirection(state);
    3423             : 
    3424         322 :     state->memtupcount = 0;      /* make the heap empty */
    3425       27408 :     for (i = 0; i < tupcount; i++)
    3426             :     {
    3427       27086 :         if (state->memtupcount < state->bound)
    3428             :         {
    3429             :             /* Insert next tuple into heap */
    3430             :             /* Must copy source tuple to avoid possible overwrite */
    3431       13382 :             SortTuple   stup = state->memtuples[i];
    3432             : 
    3433       13382 :             tuplesort_heap_insert(state, &stup);
    3434             :         }
    3435             :         else
    3436             :         {
    3437             :             /*
    3438             :              * The heap is full.  Replace the largest entry with the new
    3439             :              * tuple, or just discard it, if it's larger than anything already
    3440             :              * in the heap.
    3441             :              */
    3442       13704 :             if (COMPARETUP(state, &state->memtuples[i], &state->memtuples[0]) <= 0)
    3443             :             {
    3444        7430 :                 free_sort_tuple(state, &state->memtuples[i]);
    3445        7430 :                 CHECK_FOR_INTERRUPTS();
    3446             :             }
    3447             :             else
    3448        6274 :                 tuplesort_heap_replace_top(state, &state->memtuples[i]);
    3449             :         }
    3450             :     }
    3451             : 
    3452             :     Assert(state->memtupcount == state->bound);
    3453         322 :     state->status = TSS_BOUNDED;
    3454         322 : }
    3455             : 
    3456             : /*
    3457             :  * Convert the bounded heap to a properly-sorted array
    3458             :  */
    3459             : static void
    3460         322 : sort_bounded_heap(Tuplesortstate *state)
    3461             : {
    3462         322 :     int         tupcount = state->memtupcount;
    3463             : 
    3464             :     Assert(state->status == TSS_BOUNDED);
    3465             :     Assert(state->bounded);
    3466             :     Assert(tupcount == state->bound);
    3467             :     Assert(SERIAL(state));
    3468             : 
    3469             :     /*
    3470             :      * We can unheapify in place because each delete-top call will remove the
    3471             :      * largest entry, which we can promptly store in the newly freed slot at
    3472             :      * the end.  Once we're down to a single-entry heap, we're done.
    3473             :      */
    3474       13382 :     while (state->memtupcount > 1)
    3475             :     {
    3476       13060 :         SortTuple   stup = state->memtuples[0];
    3477             : 
    3478             :         /* this sifts-up the next-largest entry and decreases memtupcount */
    3479       13060 :         tuplesort_heap_delete_top(state);
    3480       13060 :         state->memtuples[state->memtupcount] = stup;
    3481             :     }
    3482         322 :     state->memtupcount = tupcount;
    3483             : 
    3484             :     /*
    3485             :      * Reverse sort direction back to the original state.  This is not
    3486             :      * actually necessary but seems like a good idea for tidiness.
    3487             :      */
    3488         322 :     reversedirection(state);
    3489             : 
    3490         322 :     state->status = TSS_SORTEDINMEM;
    3491         322 :     state->boundUsed = true;
    3492         322 : }
    3493             : 
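The pair of functions above is the classic bounded top-N pattern: a max-heap of size N lets each incoming tuple either evict the current maximum or be discarded in O(log N), and the final in-place unheapify yields ascending order.  A self-contained sketch with plain ints (invented names, not the tuplesort API):

    #include <stdio.h>

    /* Sift the item at 'start' down to its place in a max-heap of size n. */
    static void
    max_sift_down(int *h, int start, int n)
    {
        int         i = start;

        for (;;)
        {
            int         j = 2 * i + 1;
            int         tmp;

            if (j >= n)
                break;
            if (j + 1 < n && h[j + 1] > h[j])
                j++;
            if (h[i] >= h[j])
                break;
            tmp = h[i];
            h[i] = h[j];
            h[j] = tmp;
            i = j;
        }
    }

    int
    main(void)
    {
        const int   input[] = {9, 1, 8, 2, 7, 3, 6, 4, 5};
        const int   bound = 3;      /* we only want the 3 smallest */
        int         heap[3];
        int         n = 0;

        for (int k = 0; k < 9; k++)
        {
            if (n < bound)
            {
                /* heap not full yet: append and restore the invariant */
                heap[n++] = input[k];
                for (int i = n / 2 - 1; i >= 0; i--)
                    max_sift_down(heap, i, n);
            }
            else if (input[k] < heap[0])
            {
                /* smaller than the current maximum: replace the root */
                heap[0] = input[k];
                max_sift_down(heap, 0, n);
            }
            /* else: larger than anything kept, discard */
        }

        /* unheapify in place: move each max into the freed slot at the end */
        for (int m = n; m > 1; m--)
        {
            int         top = heap[0];

            heap[0] = heap[m - 1];
            max_sift_down(heap, 0, m - 1);
            heap[m - 1] = top;
        }

        for (int i = 0; i < n; i++)
            printf("%d ", heap[i]); /* prints: 1 2 3 */
        printf("\n");
        return 0;
    }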
    3494             : /*
    3495             :  * Sort all memtuples using specialized qsort() routines.
    3496             :  *
    3497             :  * Quicksort is used both for small in-memory sorts and for external sort runs.
    3498             :  */
    3499             : static void
    3500     1846590 : tuplesort_sort_memtuples(Tuplesortstate *state)
    3501             : {
    3502             :     Assert(!LEADER(state));
    3503             : 
    3504     1846590 :     if (state->memtupcount > 1)
    3505             :     {
    3506             :         /* Can we use the single-key sort function? */
    3507       86378 :         if (state->onlyKey != NULL)
    3508       54766 :             qsort_ssup(state->memtuples, state->memtupcount,
    3509       54766 :                        state->onlyKey);
    3510             :         else
    3511       31612 :             qsort_tuple(state->memtuples,
    3512       31612 :                         state->memtupcount,
    3513             :                         state->comparetup,
    3514             :                         state);
    3515             :     }
    3516     1846546 : }
    3517             : 
    3518             : /*
    3519             :  * Insert a new tuple into an empty or existing heap, maintaining the
    3520             :  * heap invariant.  Caller is responsible for ensuring there's room.
    3521             :  *
    3522             :  * Note: For some callers, tuple points to a memtuples[] entry above the
    3523             :  * end of the heap.  This is safe as long as it's not immediately adjacent
    3524             :  * to the end of the heap (ie, in the [memtupcount] array entry) --- if it
    3525             :  * is, it might get overwritten before being moved into the heap!
    3526             :  */
    3527             : static void
    3528       17314 : tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple)
    3529             : {
    3530             :     SortTuple  *memtuples;
    3531             :     int         j;
    3532             : 
    3533       17314 :     memtuples = state->memtuples;
    3534             :     Assert(state->memtupcount < state->memtupsize);
    3535             : 
    3536       17314 :     CHECK_FOR_INTERRUPTS();
    3537             : 
    3538             :     /*
    3539             :      * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
    3540             :      * using 1-based array indexes, not 0-based.
    3541             :      */
    3542       17314 :     j = state->memtupcount++;
    3543       47992 :     while (j > 0)
    3544             :     {
    3545       41042 :         int         i = (j - 1) >> 1;
    3546             : 
    3547       41042 :         if (COMPARETUP(state, tuple, &memtuples[i]) >= 0)
    3548       10364 :             break;
    3549       30678 :         memtuples[j] = memtuples[i];
    3550       30678 :         j = i;
    3551             :     }
    3552       17314 :     memtuples[j] = *tuple;
    3553       17314 : }
    3554             : 
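The sift-up above is easier to see with scalar data: start at the vacated slot at the end of the array, pull parents down while they are larger, then drop the new element in.  (The parent of slot j is (j - 1) / 2 in a 0-based heap.)  A minimal standalone sketch with hypothetical names:

    #include <stdio.h>

    static void
    heap_insert(int *heap, int *count, int value)
    {
        int         j = (*count)++;

        while (j > 0)
        {
            int         i = (j - 1) >> 1;   /* parent index */

            if (value >= heap[i])
                break;
            heap[j] = heap[i];      /* pull the parent down */
            j = i;
        }
        heap[j] = value;
    }

    int
    main(void)
    {
        int         heap[8];
        int         count = 0;
        const int   vals[] = {5, 3, 8, 1};

        for (int k = 0; k < 4; k++)
            heap_insert(heap, &count, vals[k]);

        printf("min = %d\n", heap[0]);      /* prints: min = 1 */
        return 0;
    }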
    3555             : /*
    3556             :  * Remove the tuple at state->memtuples[0] from the heap.  Decrement
    3557             :  * memtupcount, and sift up to maintain the heap invariant.
    3558             :  *
    3559             :  * The caller has already free'd the tuple the top node points to,
    3560             :  * if necessary.
    3561             :  */
    3562             : static void
    3563       16972 : tuplesort_heap_delete_top(Tuplesortstate *state)
    3564             : {
    3565       16972 :     SortTuple  *memtuples = state->memtuples;
    3566             :     SortTuple  *tuple;
    3567             : 
    3568       16972 :     if (--state->memtupcount <= 0)
    3569         724 :         return;
    3570             : 
    3571             :     /*
    3572             :      * Remove the last tuple in the heap, and re-insert it, by replacing the
    3573             :      * current top node with it.
    3574             :      */
    3575       16248 :     tuple = &memtuples[state->memtupcount];
    3576       16248 :     tuplesort_heap_replace_top(state, tuple);
    3577             : }
    3578             : 
    3579             : /*
    3580             :  * Replace the tuple at state->memtuples[0] with a new tuple.  Sift up to
    3581             :  * maintain the heap invariant.
    3582             :  *
    3583             :  * This corresponds to Knuth's "sift-up" algorithm (Algorithm 5.2.3H,
    3584             :  * Heapsort, steps H3-H8).
    3585             :  */
    3586             : static void
    3587     5573202 : tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple)
    3588             : {
    3589     5573202 :     SortTuple  *memtuples = state->memtuples;
    3590             :     unsigned int i,
    3591             :                 n;
    3592             : 
    3593             :     Assert(state->memtupcount >= 1);
    3594             : 
    3595     5573202 :     CHECK_FOR_INTERRUPTS();
    3596             : 
    3597             :     /*
    3598             :      * state->memtupcount is "int", but we use "unsigned int" for i, j, n.
    3599             :      * This prevents overflow in the "2 * i + 1" calculation, since at the top
    3600             :      * of the loop we must have i < n <= INT_MAX <= UINT_MAX/2.
    3601             :      */
    3602     5573202 :     n = state->memtupcount;
    3603     5573202 :     i = 0;                      /* i is where the "hole" is */
    3604             :     for (;;)
    3605     1184614 :     {
    3606     6757816 :         unsigned int j = 2 * i + 1;
    3607             : 
    3608     6757816 :         if (j >= n)
    3609     1277232 :             break;
    3610     7502954 :         if (j + 1 < n &&
    3611     2022370 :             COMPARETUP(state, &memtuples[j], &memtuples[j + 1]) > 0)
    3612      875810 :             j++;
    3613     5480584 :         if (COMPARETUP(state, tuple, &memtuples[j]) <= 0)
    3614     4295970 :             break;
    3615     1184614 :         memtuples[i] = memtuples[j];
    3616     1184614 :         i = j;
    3617             :     }
    3618     5573202 :     memtuples[i] = *tuple;
    3619     5573202 : }
    3620             : 
    3621             : /*
    3622             :  * Function to reverse the sort direction from its current state
    3623             :  *
    3624             :  * It is not safe to call this when performing hash tuplesorts
    3625             :  */
    3626             : static void
    3627         644 : reversedirection(Tuplesortstate *state)
    3628             : {
    3629         644 :     SortSupport sortKey = state->sortKeys;
    3630             :     int         nkey;
    3631             : 
    3632        1560 :     for (nkey = 0; nkey < state->nKeys; nkey++, sortKey++)
    3633             :     {
    3634         916 :         sortKey->ssup_reverse = !sortKey->ssup_reverse;
    3635         916 :         sortKey->ssup_nulls_first = !sortKey->ssup_nulls_first;
    3636             :     }
    3637         644 : }
    3638             : 
    3639             : 
    3640             : /*
    3641             :  * Tape interface routines
    3642             :  */
    3643             : 
    3644             : static unsigned int
    3645     5404636 : getlen(LogicalTape *tape, bool eofOK)
    3646             : {
    3647             :     unsigned int len;
    3648             : 
    3649     5404636 :     if (LogicalTapeRead(tape,
    3650             :                         &len, sizeof(len)) != sizeof(len))
    3651           0 :         elog(ERROR, "unexpected end of tape");
    3652     5404636 :     if (len == 0 && !eofOK)
    3653           0 :         elog(ERROR, "unexpected end of data");
    3654     5404636 :     return len;
    3655             : }
    3656             : 
    3657             : static void
    3658        4184 : markrunend(LogicalTape *tape)
    3659             : {
    3660        4184 :     unsigned int len = 0;
    3661             : 
    3662        4184 :     LogicalTapeWrite(tape, (void *) &len, sizeof(len));
    3663        4184 : }
    3664             : 
    3665             : /*
    3666             :  * Get memory for tuple from within READTUP() routine.
    3667             :  *
    3668             :  * We use the next free slot from the slab allocator, or palloc() if the tuple
    3669             :  * is too large for that.
    3670             :  */
    3671             : static void *
    3672     5120276 : readtup_alloc(Tuplesortstate *state, Size tuplen)
    3673             : {
    3674             :     SlabSlot   *buf;
    3675             : 
    3676             :     /*
    3677             :      * We pre-allocate enough slots in the slab arena that we should never run
    3678             :      * out.
    3679             :      */
    3680             :     Assert(state->slabFreeHead);
    3681             : 
    3682     5120276 :     if (tuplen > SLAB_SLOT_SIZE || !state->slabFreeHead)
    3683           0 :         return MemoryContextAlloc(state->sortcontext, tuplen);
    3684             :     else
    3685             :     {
    3686     5120276 :         buf = state->slabFreeHead;
    3687             :         /* Reuse this slot */
    3688     5120276 :         state->slabFreeHead = buf->nextfree;
    3689             : 
    3690     5120276 :         return buf;
    3691             :     }
    3692             : }
    3693             : 
    3694             : 
    3695             : /*
    3696             :  * Routines specialized for HeapTuple (actually MinimalTuple) case
    3697             :  */
    3698             : 
    3699             : static int
    3700    34738726 : comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
    3701             : {
    3702    34738726 :     SortSupport sortKey = state->sortKeys;
    3703             :     HeapTupleData ltup;
    3704             :     HeapTupleData rtup;
    3705             :     TupleDesc   tupDesc;
    3706             :     int         nkey;
    3707             :     int32       compare;
    3708             :     AttrNumber  attno;
    3709             :     Datum       datum1,
    3710             :                 datum2;
    3711             :     bool        isnull1,
    3712             :                 isnull2;
    3713             : 
    3714             : 
    3715             :     /* Compare the leading sort key */
    3716    34738726 :     compare = ApplySortComparator(a->datum1, a->isnull1,
    3717    34738726 :                                   b->datum1, b->isnull1,
    3718             :                                   sortKey);
    3719    34738726 :     if (compare != 0)
    3720    20280896 :         return compare;
    3721             : 
    3722             :     /* Compare additional sort keys */
    3723    14457830 :     ltup.t_len = ((MinimalTuple) a->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
    3724    14457830 :     ltup.t_data = (HeapTupleHeader) ((char *) a->tuple - MINIMAL_TUPLE_OFFSET);
    3725    14457830 :     rtup.t_len = ((MinimalTuple) b->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
    3726    14457830 :     rtup.t_data = (HeapTupleHeader) ((char *) b->tuple - MINIMAL_TUPLE_OFFSET);
    3727    14457830 :     tupDesc = state->tupDesc;
    3728             : 
    3729    14457830 :     if (sortKey->abbrev_converter)
    3730             :     {
    3731      631712 :         attno = sortKey->ssup_attno;
    3732             : 
    3733      631712 :         datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1);
    3734      631712 :         datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2);
    3735             : 
    3736      631712 :         compare = ApplySortAbbrevFullComparator(datum1, isnull1,
    3737             :                                                 datum2, isnull2,
    3738             :                                                 sortKey);
    3739      631712 :         if (compare != 0)
    3740      535566 :             return compare;
    3741             :     }
    3742             : 
    3743    13922264 :     sortKey++;
    3744    14764328 :     for (nkey = 1; nkey < state->nKeys; nkey++, sortKey++)
    3745             :     {
    3746    14010028 :         attno = sortKey->ssup_attno;
    3747             : 
    3748    14010028 :         datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1);
    3749    14010028 :         datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2);
    3750             : 
    3751    14010028 :         compare = ApplySortComparator(datum1, isnull1,
    3752             :                                       datum2, isnull2,
    3753             :                                       sortKey);
    3754    14010028 :         if (compare != 0)
    3755    13167964 :             return compare;
    3756             :     }
    3757             : 
    3758      754300 :     return 0;
    3759             : }
    3760             : 
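The shape of comparetup_heap() — the cheap leading-key comparison first, then the abbreviated-key full comparator, then the remaining keys — is the abbreviated-key pattern in miniature: resolve most comparisons on a fixed-size prefix and fall back to the authoritative comparison only on ties.  A self-contained sketch with strings; the 4-byte big-endian packing below is an invented stand-in for a real abbrev_converter:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Pack the first 4 bytes of a string big-endian into a uint32. */
    static uint32_t
    abbreviate(const char *s)
    {
        uint32_t    abbrev = 0;

        for (int i = 0; i < 4; i++)
        {
            abbrev <<= 8;
            if (*s)
                abbrev |= (uint8_t) *s++;
        }
        return abbrev;
    }

    static int
    compare_strings(const char *a, const char *b)
    {
        uint32_t    aa = abbreviate(a);
        uint32_t    ab = abbreviate(b);

        if (aa != ab)                   /* cheap comparison decides */
            return (aa < ab) ? -1 : 1;
        return strcmp(a, b);            /* authoritative tiebreak */
    }

    int
    main(void)
    {
        printf("%d\n", compare_strings("apple", "banana"));      /* < 0 */
        printf("%d\n", compare_strings("applesauce", "apples")); /* tie on the
                                                                  * abbreviation;
                                                                  * strcmp decides */
        return 0;
    }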
    3761             : static void
    3762     7935892 : copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
    3763             : {
    3764             :     /*
    3765             :      * We expect the passed "tup" to be a TupleTableSlot, and form a
    3766             :      * MinimalTuple using the exported interface for that.
    3767             :      */
    3768     7935892 :     TupleTableSlot *slot = (TupleTableSlot *) tup;
    3769             :     Datum       original;
    3770             :     MinimalTuple tuple;
    3771             :     HeapTupleData htup;
    3772     7935892 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
    3773             : 
    3774             :     /* copy the tuple into sort storage */
    3775     7935892 :     tuple = ExecCopySlotMinimalTuple(slot);
    3776     7935892 :     stup->tuple = (void *) tuple;
    3777     7935892 :     USEMEM(state, GetMemoryChunkSpace(tuple));
    3778             :     /* set up first-column key value */
    3779     7935892 :     htup.t_len = tuple->t_len + MINIMAL_TUPLE_OFFSET;
    3780     7935892 :     htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET);
    3781     7935892 :     original = heap_getattr(&htup,
    3782             :                             state->sortKeys[0].ssup_attno,
    3783             :                             state->tupDesc,
    3784             :                             &stup->isnull1);
    3785             : 
    3786     7935892 :     MemoryContextSwitchTo(oldcontext);
    3787             : 
    3788     7935892 :     if (!state->sortKeys->abbrev_converter || stup->isnull1)
    3789             :     {
    3790             :         /*
    3791             :          * Store ordinary Datum representation, or NULL value.  If there is a
    3792             :          * converter it won't expect NULL values, and the cost model is not
    3793             :          * required to account for NULL, so in that case we avoid calling the
    3794             :          * converter and just set datum1 to a zeroed representation (to be
    3795             :          * consistent, and to support cheap inequality tests for NULL
    3796             :          * abbreviated keys).
    3797             :          */
    3798     7221466 :         stup->datum1 = original;
    3799             :     }
    3800      714426 :     else if (!consider_abort_common(state))
    3801             :     {
    3802             :         /* Store abbreviated key representation */
    3803      714418 :         stup->datum1 = state->sortKeys->abbrev_converter(original,
    3804             :                                                          state->sortKeys);
    3805             :     }
    3806             :     else
    3807             :     {
    3808             :         /* Abort abbreviation */
    3809             :         int         i;
    3810             : 
    3811           8 :         stup->datum1 = original;
    3812             : 
    3813             :         /*
    3814             :          * Set state to be consistent with never trying abbreviation.
    3815             :          *
    3816             :          * Alter datum1 representation in already-copied tuples, so as to
    3817             :          * ensure a consistent representation (current tuple was just
    3818             :          * handled).  It does not matter if some dumped tuples are already
    3819             :          * sorted on tape, since serialized tuples lack abbreviated keys
    3820             :          * (TSS_BUILDRUNS state prevents control reaching here in any case).
    3821             :          */
    3822       81928 :         for (i = 0; i < state->memtupcount; i++)
    3823             :         {
    3824       81920 :             SortTuple  *mtup = &state->memtuples[i];
    3825             : 
    3826       81920 :             htup.t_len = ((MinimalTuple) mtup->tuple)->t_len +
    3827             :                 MINIMAL_TUPLE_OFFSET;
    3828       81920 :             htup.t_data = (HeapTupleHeader) ((char *) mtup->tuple -
    3829             :                                              MINIMAL_TUPLE_OFFSET);
    3830             : 
    3831       81920 :             mtup->datum1 = heap_getattr(&htup,
    3832             :                                         state->sortKeys[0].ssup_attno,
    3833             :                                         state->tupDesc,
    3834             :                                         &mtup->isnull1);
    3835             :         }
    3836             :     }
    3837     7935892 : }
    3838             : 
    3839             : static void
    3840      507788 : writetup_heap(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
    3841             : {
    3842      507788 :     MinimalTuple tuple = (MinimalTuple) stup->tuple;
    3843             : 
    3844             :     /* the part of the MinimalTuple we'll write: */
    3845      507788 :     char       *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
    3846      507788 :     unsigned int tupbodylen = tuple->t_len - MINIMAL_TUPLE_DATA_OFFSET;
    3847             : 
    3848             :     /* total on-disk footprint: */
    3849      507788 :     unsigned int tuplen = tupbodylen + sizeof(int);
    3850             : 
    3851      507788 :     LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
    3852      507788 :     LogicalTapeWrite(tape, (void *) tupbody, tupbodylen);
    3853      507788 :     if (state->randomAccess) /* need trailing length word? */
    3854       20000 :         LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
    3855             : 
    3856      507788 :     if (!state->slabAllocatorUsed)
    3857             :     {
    3858      257684 :         FREEMEM(state, GetMemoryChunkSpace(tuple));
    3859      257684 :         heap_free_minimal_tuple(tuple);
    3860             :     }
    3861      507788 : }
    3862             : 
    3863             : static void
    3864      439608 : readtup_heap(Tuplesortstate *state, SortTuple *stup,
    3865             :              LogicalTape *tape, unsigned int len)
    3866             : {
    3867      439608 :     unsigned int tupbodylen = len - sizeof(int);
    3868      439608 :     unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
    3869      439608 :     MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tuplen);
    3870      439608 :     char       *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
    3871             :     HeapTupleData htup;
    3872             : 
    3873             :     /* read in the tuple proper */
    3874      439608 :     tuple->t_len = tuplen;
    3875      439608 :     LogicalTapeReadExact(tape, tupbody, tupbodylen);
    3876      439608 :     if (state->randomAccess) /* need trailing length word? */
    3877       31856 :         LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
    3878      439608 :     stup->tuple = (void *) tuple;
    3879             :     /* set up first-column key value */
    3880      439608 :     htup.t_len = tuple->t_len + MINIMAL_TUPLE_OFFSET;
    3881      439608 :     htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET);
    3882      439608 :     stup->datum1 = heap_getattr(&htup,
    3883             :                                 state->sortKeys[0].ssup_attno,
    3884             :                                 state->tupDesc,
    3885             :                                 &stup->isnull1);
    3886      439608 : }
    3887             : 
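writetup_heap() and readtup_heap() define the on-tape record framing: a leading length word, the tuple body, and, only when random access was requested, a trailing copy of the length word so the record can also be traversed backwards.  A standalone sketch of the same framing over stdio, with tmpfile() standing in for logtape.c:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        FILE       *f = tmpfile();
        const char *recs[] = {"alpha", "bravo", "charlie"};
        unsigned int len;
        char        buf[32];

        if (f == NULL)
            return 1;

        /* write each record: leading length + body + trailing length */
        for (int i = 0; i < 3; i++)
        {
            len = (unsigned int) strlen(recs[i]);
            fwrite(&len, sizeof(len), 1, f);
            fwrite(recs[i], 1, len, f);
            fwrite(&len, sizeof(len), 1, f);    /* trailing length word */
        }

        /* read the last record backwards, via the trailing length word */
        fseek(f, -(long) sizeof(len), SEEK_END);
        if (fread(&len, sizeof(len), 1, f) != 1)
            return 1;
        fseek(f, -(long) (sizeof(len) + len), SEEK_END);
        if (fread(buf, 1, len, f) != len)
            return 1;
        buf[len] = '\0';
        printf("last record: %s\n", buf);       /* prints: charlie */

        fclose(f);
        return 0;
    }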
    3888             : /*
    3889             :  * Routines specialized for the CLUSTER case (HeapTuple data, with
    3890             :  * comparisons per a btree index definition)
    3891             :  */
    3892             : 
    3893             : static int
    3894     7299090 : comparetup_cluster(const SortTuple *a, const SortTuple *b,
    3895             :                    Tuplesortstate *state)
    3896             : {
    3897     7299090 :     SortSupport sortKey = state->sortKeys;
    3898             :     HeapTuple   ltup;
    3899             :     HeapTuple   rtup;
    3900             :     TupleDesc   tupDesc;
    3901             :     int         nkey;
    3902             :     int32       compare;
    3903             :     Datum       datum1,
    3904             :                 datum2;
    3905             :     bool        isnull1,
    3906             :                 isnull2;
    3907     7299090 :     AttrNumber  leading = state->indexInfo->ii_IndexAttrNumbers[0];
    3908             : 
    3909             :     /* Be prepared to compare additional sort keys */
    3910     7299090 :     ltup = (HeapTuple) a->tuple;
    3911     7299090 :     rtup = (HeapTuple) b->tuple;
    3912     7299090 :     tupDesc = state->tupDesc;
    3913             : 
    3914             :     /* Compare the leading sort key, if it's simple */
    3915     7299090 :     if (leading != 0)
    3916             :     {
    3917     7239206 :         compare = ApplySortComparator(a->datum1, a->isnull1,
    3918     7239206 :                                       b->datum1, b->isnull1,
    3919             :                                       sortKey);
    3920     7239206 :         if (compare != 0)
    3921     6858162 :             return compare;
    3922             : 
    3923      381044 :         if (sortKey->abbrev_converter)
    3924             :         {
    3925       96300 :             datum1 = heap_getattr(ltup, leading, tupDesc, &isnull1);
    3926       96300 :             datum2 = heap_getattr(rtup, leading, tupDesc, &isnull2);
    3927             : 
    3928       96300 :             compare = ApplySortAbbrevFullComparator(datum1, isnull1,
    3929             :                                                     datum2, isnull2,
    3930             :                                                     sortKey);
    3931             :         }
    3932      381044 :         if (compare != 0 || state->nKeys == 1)
    3933       99352 :             return compare;
    3934             :         /* Compare additional columns the hard way */
    3935      281692 :         sortKey++;
    3936      281692 :         nkey = 1;
    3937             :     }
    3938             :     else
    3939             :     {
    3940             :         /* Must compare all keys the hard way */
    3941       59884 :         nkey = 0;
    3942             :     }
    3943             : 
    3944      341576 :     if (state->indexInfo->ii_Expressions == NULL)
    3945             :     {
    3946             :         /* If not expression index, just compare the proper heap attrs */
    3947             : 
    3948      391892 :         for (; nkey < state->nKeys; nkey++, sortKey++)
    3949             :         {
    3950      391892 :             AttrNumber  attno = state->indexInfo->ii_IndexAttrNumbers[nkey];
    3951             : 
    3952      391892 :             datum1 = heap_getattr(ltup, attno, tupDesc, &isnull1);
    3953      391892 :             datum2 = heap_getattr(rtup, attno, tupDesc, &isnull2);
    3954             : 
    3955      391892 :             compare = ApplySortComparator(datum1, isnull1,
    3956             :                                           datum2, isnull2,
    3957             :                                           sortKey);
    3958      391892 :             if (compare != 0)
    3959      281692 :                 return compare;
    3960             :         }
    3961             :     }
    3962             :     else
    3963             :     {
    3964             :         /*
    3965             :          * In the expression index case, compute the whole index tuple and
    3966             :          * then compare values.  It would perhaps be faster to compute only as
    3967             :          * many columns as we need to compare, but that would require
    3968             :          * duplicating all the logic in FormIndexDatum.
    3969             :          */
    3970             :         Datum       l_index_values[INDEX_MAX_KEYS];
    3971             :         bool        l_index_isnull[INDEX_MAX_KEYS];
    3972             :         Datum       r_index_values[INDEX_MAX_KEYS];
    3973             :         bool        r_index_isnull[INDEX_MAX_KEYS];
    3974             :         TupleTableSlot *ecxt_scantuple;
    3975             : 
    3976             :         /* Reset context each time to prevent memory leakage */
    3977       59884 :         ResetPerTupleExprContext(state->estate);
    3978             : 
    3979       59884 :         ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
    3980             : 
    3981       59884 :         ExecStoreHeapTuple(ltup, ecxt_scantuple, false);
    3982       59884 :         FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
    3983             :                        l_index_values, l_index_isnull);
    3984             : 
    3985       59884 :         ExecStoreHeapTuple(rtup, ecxt_scantuple, false);
    3986       59884 :         FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
    3987             :                        r_index_values, r_index_isnull);
    3988             : 
    3989       60500 :         for (; nkey < state->nKeys; nkey++, sortKey++)
    3990             :         {
    3991       60500 :             compare = ApplySortComparator(l_index_values[nkey],
    3992       60500 :                                           l_index_isnull[nkey],
    3993             :                                           r_index_values[nkey],
    3994       60500 :                                           r_index_isnull[nkey],
    3995             :                                           sortKey);
    3996       60500 :             if (compare != 0)
    3997       59884 :                 return compare;
    3998             :         }
    3999             :     }
    4000             : 
    4001           0 :     return 0;
    4002             : }
    4003             : 
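comparetup_cluster() follows the same tie-breaking discipline as the other comparetup_* routines when abbreviated keys are in use: an abbreviated comparison can prove inequality but never equality, so an abbreviated tie must be re-checked with the authoritative comparator before any lower-order keys are consulted. A minimal sketch of that discipline, using strcmp() as a stand-in for the full comparator:

    #include <string.h>

    /* Hypothetical leading-key comparison: 'abbrev' values are cheap
     * fixed-size digests of the 'full' keys.  Unequal abbreviations are
     * authoritative; equal ones are not. */
    static int
    compare_leading_key(unsigned long abbrev_a, unsigned long abbrev_b,
                        const char *full_a, const char *full_b)
    {
        if (abbrev_a != abbrev_b)
            return (abbrev_a < abbrev_b) ? -1 : 1;  /* inequality is proven */
        return strcmp(full_a, full_b);              /* tie: consult full keys */
    }
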
    4004             : static void
    4005      366434 : copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
    4006             : {
    4007      366434 :     HeapTuple   tuple = (HeapTuple) tup;
    4008             :     Datum       original;
    4009      366434 :     MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
    4010             : 
    4011             :     /* copy the tuple into sort storage */
    4012      366434 :     tuple = heap_copytuple(tuple);
    4013      366434 :     stup->tuple = (void *) tuple;
    4014      366434 :     USEMEM(state, GetMemoryChunkSpace(tuple));
    4015             : 
    4016      366434 :     MemoryContextSwitchTo(oldcontext);
    4017             : 
    4018             :     /*
    4019             :      * set up first-column key value, and potentially abbreviate, if it's a
    4020             :      * simple column
    4021             :      */
    4022      366434 :     if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
    4023        1064 :         return;
    4024             : 
    4025      365370 :     original = heap_getattr(tuple,
    4026             :                             state->indexInfo->ii_IndexAttrNumbers[0],
    4027             :                             state->tupDesc,
    4028             :                             &stup->isnull1);
    4029             : 
    4030      365370 :     if (!state->sortKeys->abbrev_converter || stup->isnull1)
    4031             :     {
    4032             :         /*
    4033             :          * Store the ordinary Datum representation, or the NULL value.  If
    4034             :          * there is a converter, it won't expect NULL values, and the cost
    4035             :          * model is not required to account for NULLs, so in that case we
    4036             :          * avoid calling the converter and just set datum1 to a zeroed
    4037             :          * representation (to be consistent, and to support cheap
    4038             :          * inequality tests for NULL abbreviated keys).
    4039             :          */
    4040      123378 :         stup->datum1 = original;
    4041             :     }
    4042      241992 :     else if (!consider_abort_common(state))
    4043             :     {
    4044             :         /* Store abbreviated key representation */
    4045      241984 :         stup->datum1 = state->sortKeys->abbrev_converter(original,
    4046             :                                                          state->sortKeys);
    4047             :     }
    4048             :     else
    4049             :     {
    4050             :         /* Abort abbreviation */
    4051             :         int         i;
    4052             : 
    4053           8 :         stup->datum1 = original;
    4054             : 
    4055             :         /*
    4056             :          * Set state to be consistent with never trying abbreviation.
    4057             :          *
    4058             :          * Alter datum1 representation in already-copied tuples, so as to
    4059             :          * ensure a consistent representation (current tuple was just
    4060             :          * handled).  It does not matter if some dumped tuples are already
    4061             :          * sorted on tape, since serialized tuples lack abbreviated keys
    4062             :          * (TSS_BUILDRUNS state prevents control reaching here in any case).
    4063             :          */
    4064       81928 :         for (i = 0; i < state->memtupcount; i++)
    4065             :         {
    4066       81920 :             SortTuple  *mtup = &state->memtuples[i];
    4067             : 
    4068       81920 :             tuple = (HeapTuple) mtup->tuple;
    4069       81920 :             mtup->datum1 = heap_getattr(tuple,
    4070             :                                         state->indexInfo->ii_IndexAttrNumbers[0],
    4071             :                                         state->tupDesc,
    4072             :                                         &mtup->isnull1);
    4073             :         }
    4074             :     }
    4075             : }
    4076             : 
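When consider_abort_common() decides abbreviation is not paying off, every datum1 already cached in memtuples must be rewritten from the abbreviated to the ordinary representation, which is what the loop above does via heap_getattr(). A stripped-down sketch of the same idea, with a hypothetical struct standing in for SortTuple and the attribute re-fetch:

    /* Hypothetical in-memory tuple: 'datum1' caches the leading key and may
     * hold an abbreviated value; 'full_key' stands in for what
     * heap_getattr() would re-fetch from the tuple itself. */
    typedef struct
    {
        long datum1;
        long full_key;
    } MemTuple;

    static void
    abort_abbreviation(MemTuple *tuples, int count)
    {
        for (int i = 0; i < count; i++)
            tuples[i].datum1 = tuples[i].full_key;  /* back to ordinary form */
    }
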
    4077             : static void
    4078       40000 : writetup_cluster(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
    4079             : {
    4080       40000 :     HeapTuple   tuple = (HeapTuple) stup->tuple;
    4081       40000 :     unsigned int tuplen = tuple->t_len + sizeof(ItemPointerData) + sizeof(int);
    4082             : 
    4083             :     /* We need to store t_self, but not other fields of HeapTupleData */
    4084       40000 :     LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
    4085       40000 :     LogicalTapeWrite(tape, &tuple->t_self, sizeof(ItemPointerData));
    4086       40000 :     LogicalTapeWrite(tape, tuple->t_data, tuple->t_len);
    4087       40000 :     if (state->randomAccess) /* need trailing length word? */
    4088           0 :         LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
    4089             : 
    4090       40000 :     if (!state->slabAllocatorUsed)
    4091             :     {
    4092       40000 :         FREEMEM(state, GetMemoryChunkSpace(tuple));
    4093       40000 :         heap_freetuple(tuple);
    4094             :     }
    4095       40000 : }
    4096             : 
    4097             : static void
    4098       40000 : readtup_cluster(Tuplesortstate *state, SortTuple *stup,
    4099             :                 LogicalTape *tape, unsigned int tuplen)
    4100             : {
    4101       40000 :     unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
    4102       40000 :     HeapTuple   tuple = (HeapTuple) readtup_alloc(state,
    4103             :                                                   t_len + HEAPTUPLESIZE);
    4104             : 
    4105             :     /* Reconstruct the HeapTupleData header */
    4106       40000 :     tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
    4107       40000 :     tuple->t_len = t_len;
    4108       40000 :     LogicalTapeReadExact(tape, &tuple->t_self, sizeof(ItemPointerData));
    4109             :     /* We don't currently bother to reconstruct t_tableOid */
    4110       40000 :     tuple->t_tableOid = InvalidOid;
    4111             :     /* Read in the tuple body */
    4112       40000 :     LogicalTapeReadExact(tape, tuple->t_data, tuple->t_len);
    4113       40000 :     if (state->randomAccess) /* need trailing length word? */
    4114           0 :         LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
    4115       40000 :     stup->tuple = (void *) tuple;
    4116             :     /* set up first-column key value, if it's a simple column */
    4117       40000 :     if (state->indexInfo->ii_IndexAttrNumbers[0] != 0)
    4118       40000 :         stup->datum1 = heap_getattr(tuple,
    4119             :                                     state->indexInfo->ii_IndexAttrNumbers[0],
    4120             :                                     state->tupDesc,
    4121             :                                     &stup->isnull1);
    4122       40000 : }
    4123             : 
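readtup_cluster() reconstitutes a HeapTuple in a single allocation: the header struct comes first, and t_data points just past it (HEAPTUPLESIZE being the padded size of HeapTupleData). A generic sketch of that header-plus-body layout, assuming malloc() in place of readtup_alloc():

    #include <stdlib.h>

    /* Hypothetical header-plus-body allocation in the style of
     * readtup_cluster(): one chunk holds both the descriptor and the
     * variable-length body it describes. */
    typedef struct
    {
        unsigned int len;       /* body length in bytes */
        char        *data;      /* points into this same allocation */
    } TupleBuf;

    static TupleBuf *
    alloc_tuple_buf(unsigned int bodylen)
    {
        TupleBuf   *tup = malloc(sizeof(TupleBuf) + bodylen);

        if (tup != NULL)
        {
            tup->len = bodylen;
            tup->data = (char *) tup + sizeof(TupleBuf);  /* body follows header */
        }
        return tup;
    }
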
    4124             : /*
    4125             :  * Routines specialized for IndexTuple case
    4126             :  *
    4127             :  * The btree and hash cases require separate comparison functions, but the
    4128             :  * IndexTuple representation is the same so the copy/write/read support
    4129             :  * functions can be shared.
    4130             :  */
    4131             : 
    4132             : static int
    4133   174234090 : comparetup_index_btree(const SortTuple *a, const SortTuple *b,
    4134             :                        Tuplesortstate *state)
    4135             : {
    4136             :     /*
    4137             :      * This is similar to comparetup_heap(), but expects index tuples.  There
    4138             :      * is also special handling for enforcing uniqueness, and special
    4139             :      * treatment for equal keys at the end.
    4140             :      */
    4141   174234090 :     SortSupport sortKey = state->sortKeys;
    4142             :     IndexTuple  tuple1;
    4143             :     IndexTuple  tuple2;
    4144             :     int         keysz;
    4145             :     TupleDesc   tupDes;
    4146   174234090 :     bool        equal_hasnull = false;
    4147             :     int         nkey;
    4148             :     int32       compare;
    4149             :     Datum       datum1,
    4150             :                 datum2;
    4151             :     bool        isnull1,
    4152             :                 isnull2;
    4153             : 
    4154             : 
    4155             :     /* Compare the leading sort key */
    4156   174234090 :     compare = ApplySortComparator(a->datum1, a->isnull1,
    4157   174234090 :                                   b->datum1, b->isnull1,
    4158             :                                   sortKey);
    4159   174234090 :     if (compare != 0)
    4160   150237056 :         return compare;
    4161             : 
    4162             :     /* Compare additional sort keys */
    4163    23997034 :     tuple1 = (IndexTuple) a->tuple;
    4164    23997034 :     tuple2 = (IndexTuple) b->tuple;
    4165    23997034 :     keysz = state->nKeys;
    4166    23997034 :     tupDes = RelationGetDescr(state->indexRel);
    4167             : 
    4168    23997034 :     if (sortKey->abbrev_converter)
    4169             :     {
    4170      567832 :         datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
    4171      567832 :         datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
    4172             : 
    4173      567832 :         compare = ApplySortAbbrevFullComparator(datum1, isnull1,
    4174             :                                                 datum2, isnull2,
    4175             :                                                 sortKey);
    4176      567832 :         if (compare != 0)
    4177      538186 :             return compare;
    4178             :     }
    4179             : 
    4180             :     /* they are equal, so we only need to examine one null flag */
    4181    23458848 :     if (a->isnull1)
    4182         276 :         equal_hasnull = true;
    4183             : 
    4184    23458848 :     sortKey++;
    4185    29504270 :     for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
    4186             :     {
    4187    21192606 :         datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
    4188    21192606 :         datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
    4189             : 
    4190    21192606 :         compare = ApplySortComparator(datum1, isnull1,
    4191             :                                       datum2, isnull2,
    4192             :                                       sortKey);
    4193    21192606 :         if (compare != 0)
    4194    15147184 :             return compare;     /* done when we find unequal attributes */
    4195             : 
    4196             :         /* they are equal, so we only need to examine one null flag */
    4197     6045422 :         if (isnull1)
    4198        3996 :             equal_hasnull = true;
    4199             :     }
    4200             : 
    4201             :     /*
    4202             :      * If btree has asked us to enforce uniqueness, complain if two equal
    4203             :      * tuples are detected (unless there was at least one NULL field).
    4204             :      *
    4205             :      * It is sufficient to make the test here, because if two tuples are equal
    4206             :      * they *must* get compared at some stage of the sort --- otherwise the
    4207             :      * sort algorithm wouldn't have checked whether one must appear before the
    4208             :      * other.
    4209             :      */
    4210     8311664 :     if (state->enforceUnique && !equal_hasnull)
    4211             :     {
    4212             :         Datum       values[INDEX_MAX_KEYS];
    4213             :         bool        isnull[INDEX_MAX_KEYS];
    4214             :         char       *key_desc;
    4215             : 
    4216             :         /*
    4217             :          * Some rather brain-dead implementations of qsort (such as the one in
    4218             :          * QNX 4) will sometimes call the comparison routine to compare a
    4219             :          * value to itself, but we always use our own implementation, which
    4220             :          * does not.
    4221             :          */
    4222             :         Assert(tuple1 != tuple2);
    4223             : 
    4224          44 :         index_deform_tuple(tuple1, tupDes, values, isnull);
    4225             : 
    4226          44 :         key_desc = BuildIndexValueDescription(state->indexRel, values, isnull);
    4227             : 
    4228          44 :         ereport(ERROR,
    4229             :                 (errcode(ERRCODE_UNIQUE_VIOLATION),
    4230             :                  errmsg("could not create unique index \"%s\"",
    4231             :                         RelationGetRelationName(state->indexRel)),
    4232             :                  key_desc ? errdetail("Key %s is duplicated.", key_desc) :
    4233             :                  errdetail("Duplicate keys exist."),
    4234             :                  errtableconstraint(state->heapRel,
    4235             :                                     RelationGetRelationName(state->indexRel))));
    4236             :     }
    4237             : 
    4238             :     /*
    4239             :      * If key values are equal, we sort on ItemPointer.  This is required for
    4240             :      * btree indexes, since heap TID is treated as an implicit last key
    4241             :      * attribute in order to ensure that all keys in the index are physically
    4242             :      * unique.
    4243             :      */
    4244             :     {
    4245     8311620 :         BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
    4246     8311620 :         BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
    4247             : 
    4248     8311620 :         if (blk1 != blk2)
    4249     7581996 :             return (blk1 < blk2) ? -1 : 1;
    4250             :     }
    4251             :     {
    4252      729624 :         OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
    4253      729624 :         OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
    4254             : 
    4255      729624 :         if (pos1 != pos2)
    4256      729624 :             return (pos1 < pos2) ? -1 : 1;
    4257             :     }
    4258             : 
    4259             :     /* ItemPointer values should never be equal */
    4260             :     Assert(false);
    4261             : 
    4262           0 :     return 0;
    4263             : }
    4264             : 
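The final TID comparison above gives equal-keyed entries a total order: block number first, then offset within the block. A self-contained sketch of the same lexicographic tiebreak, with a simplified stand-in for ItemPointerData:

    /* Simplified TID: (block, offset) compared lexicographically,
     * exactly the order the code above produces. */
    typedef struct
    {
        unsigned int   block;
        unsigned short offset;
    } SimpleTid;

    static int
    compare_tid(SimpleTid a, SimpleTid b)
    {
        if (a.block != b.block)
            return (a.block < b.block) ? -1 : 1;
        if (a.offset != b.offset)
            return (a.offset < b.offset) ? -1 : 1;
        return 0;           /* unreachable for real TIDs, which are unique */
    }
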
    4265             : static int
    4266      560280 : comparetup_index_hash(const SortTuple *a, const SortTuple *b,
    4267             :                       Tuplesortstate *state)
    4268             : {
    4269             :     Bucket      bucket1;
    4270             :     Bucket      bucket2;
    4271             :     IndexTuple  tuple1;
    4272             :     IndexTuple  tuple2;
    4273             : 
    4274             :     /*
    4275             :      * Fetch hash keys and mask off bits we don't want to sort by. We know
    4276             :      * that the first column of the index tuple is the hash key.
    4277             :      */
    4278             :     Assert(!a->isnull1);
    4279      560280 :     bucket1 = _hash_hashkey2bucket(DatumGetUInt32(a->datum1),
    4280             :                                    state->max_buckets, state->high_mask,
    4281             :                                    state->low_mask);
    4282             :     Assert(!b->isnull1);
    4283      560280 :     bucket2 = _hash_hashkey2bucket(DatumGetUInt32(b->datum1),
    4284             :                                    state->max_buckets, state->high_mask,
    4285             :                                    state->low_mask);
    4286      560280 :     if (bucket1 > bucket2)
    4287      176388 :         return 1;
    4288      383892 :     else if (bucket1 < bucket2)
    4289      157188 :         return -1;
    4290             : 
    4291             :     /*
    4292             :      * If hash values are equal, we sort on ItemPointer.  This does not
    4293             :      * affect the validity of the finished index, but it may be useful to
    4294             :      * have index scans in physical order.
    4295             :      */
    4296      226704 :     tuple1 = (IndexTuple) a->tuple;
    4297      226704 :     tuple2 = (IndexTuple) b->tuple;
    4298             : 
    4299             :     {
    4300      226704 :         BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
    4301      226704 :         BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
    4302             : 
    4303      226704 :         if (blk1 != blk2)
    4304      223192 :             return (blk1 < blk2) ? -1 : 1;
    4305             :     }
    4306             :     {
    4307        3512 :         OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
    4308        3512 :         OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
    4309             : 
    4310        3512 :         if (pos1 != pos2)
    4311        3512 :             return (pos1 < pos2) ? -1 : 1;
    4312             :     }
    4313             : 
    4314             :     /* ItemPointer values should never be equal */
    4315             :     Assert(false);
    4316             : 
    4317           0 :     return 0;
    4318             : }
    4319             : 
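For the hash case, the sort key is the raw 32-bit hash value, and _hash_hashkey2bucket() reduces it to a bucket number using the index's mask pair. A sketch in the spirit of that function (the real one lives in PostgreSQL's hash access method; the exact masking scheme shown here is an assumption for illustration):

    /* Map a 32-bit hash key to a bucket with a high/low mask pair, falling
     * back to the low mask when the masked value exceeds the current
     * maximum bucket number. */
    static unsigned int
    hashkey_to_bucket(unsigned int hashkey, unsigned int max_bucket,
                      unsigned int high_mask, unsigned int low_mask)
    {
        unsigned int bucket = hashkey & high_mask;

        if (bucket > max_bucket)
            bucket &= low_mask;
        return bucket;
    }
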
    4320             : static void
    4321           0 : copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
    4322             : {
    4323             :     /* Not currently needed */
    4324           0 :     elog(ERROR, "copytup_index() should not be called");
    4325             : }
    4326             : 
    4327             : static void
    4328     3200008 : writetup_index(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
    4329             : {
    4330     3200008 :     IndexTuple  tuple = (IndexTuple) stup->tuple;
    4331             :     unsigned int tuplen;
    4332             : 
    4333     3200008 :     tuplen = IndexTupleSize(tuple) + sizeof(tuplen);
    4334     3200008 :     LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
    4335     3200008 :     LogicalTapeWrite(tape, (void *) tuple, IndexTupleSize(tuple));
    4336     3200008 :     if (state->randomAccess) /* need trailing length word? */
    4337           0 :         LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
    4338             : 
    4339     3200008 :     if (!state->slabAllocatorUsed)
    4340             :     {
    4341     3200008 :         FREEMEM(state, GetMemoryChunkSpace(tuple));
    4342     3200008 :         pfree(tuple);
    4343             :     }
    4344     3200008 : }
    4345             : 
    4346             : static void
    4347     3200008 : readtup_index(Tuplesortstate *state, SortTuple *stup,
    4348             :               LogicalTape *tape, unsigned int len)
    4349             : {
    4350     3200008 :     unsigned int tuplen = len - sizeof(unsigned int);
    4351     3200008 :     IndexTuple  tuple = (IndexTuple) readtup_alloc(state, tuplen);
    4352             : 
    4353     3200008 :     LogicalTapeReadExact(tape, tuple, tuplen);
    4354     3200008 :     if (state->randomAccess) /* need trailing length word? */
    4355           0 :         LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
    4356     3200008 :     stup->tuple = (void *) tuple;
    4357             :     /* set up first-column key value */
    4358     3200008 :     stup->datum1 = index_getattr(tuple,
    4359             :                                  1,
    4360             :                                  RelationGetDescr(state->indexRel),
    4361             :                                  &stup->isnull1);
    4362     3200008 : }
    4363             : 
    4364             : /*
    4365             :  * Routines specialized for DatumTuple case
    4366             :  */
    4367             : 
    4368             : static int
    4369    10475136 : comparetup_datum(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
    4370             : {
    4371             :     int         compare;
    4372             : 
    4373    10475136 :     compare = ApplySortComparator(a->datum1, a->isnull1,
    4374    10475136 :                                   b->datum1, b->isnull1,
    4375             :                                   state->sortKeys);
    4376    10475136 :     if (compare != 0)
    4377     9075480 :         return compare;
    4378             : 
    4379             :     /* if we have abbreviations, then "tuple" has the original value */
    4380             : 
    4381     1399656 :     if (state->sortKeys->abbrev_converter)
    4382     1251838 :         compare = ApplySortAbbrevFullComparator(PointerGetDatum(a->tuple), a->isnull1,
    4383     1251838 :                                                 PointerGetDatum(b->tuple), b->isnull1,
    4384             :                                                 state->sortKeys);
    4385             : 
    4386     1399656 :     return compare;
    4387             : }
    4388             : 
    4389             : static void
    4390           0 : copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup)
    4391             : {
    4392             :     /* Not currently needed */
    4393           0 :     elog(ERROR, "copytup_datum() should not be called");
    4394             : }
    4395             : 
    4396             : static void
    4397     1800920 : writetup_datum(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
    4398             : {
    4399             :     void       *waddr;
    4400             :     unsigned int tuplen;
    4401             :     unsigned int writtenlen;
    4402             : 
    4403     1800920 :     if (stup->isnull1)
    4404             :     {
    4405         116 :         waddr = NULL;
    4406         116 :         tuplen = 0;
    4407             :     }
    4408     1800804 :     else if (!state->tuples)
    4409             :     {
    4410      280132 :         waddr = &stup->datum1;
    4411      280132 :         tuplen = sizeof(Datum);
    4412             :     }
    4413             :     else
    4414             :     {
    4415     1520672 :         waddr = stup->tuple;
    4416     1520672 :         tuplen = datumGetSize(PointerGetDatum(stup->tuple), false, state->datumTypeLen);
    4417             :         Assert(tuplen != 0);
    4418             :     }
    4419             : 
    4420     1800920 :     writtenlen = tuplen + sizeof(unsigned int);
    4421             : 
    4422     1800920 :     LogicalTapeWrite(tape, (void *) &writtenlen, sizeof(writtenlen));
    4423     1800920 :     LogicalTapeWrite(tape, waddr, tuplen);
    4424     1800920 :     if (state->randomAccess) /* need trailing length word? */
    4425      480264 :         LogicalTapeWrite(tape, (void *) &writtenlen, sizeof(writtenlen));
    4426             : 
    4427     1800920 :     if (!state->slabAllocatorUsed && stup->tuple)
    4428             :     {
    4429      440184 :         FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
    4430      440184 :         pfree(stup->tuple);
    4431             :     }
    4432     1800920 : }
    4433             : 
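writetup_datum() distinguishes three encodings: a NULL is written as a zero-length body, a pass-by-value datum is written as the Datum word itself, and a pass-by-reference datum is written as the bytes it points to (sized via datumGetSize()). A condensed sketch of that dispatch, with uintptr_t standing in for Datum and a memory buffer in place of the tape:

    #include <stdint.h>
    #include <string.h>

    typedef uintptr_t Datum;    /* stand-in for PostgreSQL's Datum */

    /* Returns total bytes written: a length word followed by the body. */
    static unsigned int
    encode_datum(char *dst, Datum value, const void *ref,
                 unsigned int reflen, int isnull)
    {
        unsigned int bodylen;

        if (isnull)
            bodylen = 0;                    /* NULL: no body at all */
        else if (ref == NULL)
        {
            bodylen = sizeof(Datum);        /* by-value: the word itself */
            memcpy(dst + sizeof(unsigned int), &value, bodylen);
        }
        else
        {
            bodylen = reflen;               /* by-reference: pointed-to bytes */
            memcpy(dst + sizeof(unsigned int), ref, bodylen);
        }

        unsigned int len = bodylen + sizeof(unsigned int);

        memcpy(dst, &len, sizeof(len));     /* leading length word */
        return len;
    }
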
    4434             : static void
    4435     1720928 : readtup_datum(Tuplesortstate *state, SortTuple *stup,
    4436             :               LogicalTape *tape, unsigned int len)
    4437             : {
    4438     1720928 :     unsigned int tuplen = len - sizeof(unsigned int);
    4439             : 
    4440     1720928 :     if (tuplen == 0)
    4441             :     {
    4442             :         /* it's NULL */
    4443         132 :         stup->datum1 = (Datum) 0;
    4444         132 :         stup->isnull1 = true;
    4445         132 :         stup->tuple = NULL;
    4446             :     }
    4447     1720796 :     else if (!state->tuples)
    4448             :     {
    4449             :         Assert(tuplen == sizeof(Datum));
    4450      280136 :         LogicalTapeReadExact(tape, &stup->datum1, tuplen);
    4451      280136 :         stup->isnull1 = false;
    4452      280136 :         stup->tuple = NULL;
    4453             :     }
    4454             :     else
    4455             :     {
    4456     1440660 :         void       *raddr = readtup_alloc(state, tuplen);
    4457             : 
    4458     1440660 :         LogicalTapeReadExact(tape, raddr, tuplen);
    4459     1440660 :         stup->datum1 = PointerGetDatum(raddr);
    4460     1440660 :         stup->isnull1 = false;
    4461     1440660 :         stup->tuple = raddr;
    4462             :     }
    4463             : 
    4464     1720928 :     if (state->randomAccess) /* need trailing length word? */
    4465      480296 :         LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
    4466     1720928 : }
    4467             : 
    4468             : /*
    4469             :  * Parallel sort routines
    4470             :  */
    4471             : 
    4472             : /*
    4473             :  * tuplesort_estimate_shared - estimate required shared memory allocation
    4474             :  *
    4475             :  * nWorkers is an estimate of the number of workers (it's the number that
    4476             :  * will be requested).
    4477             :  */
    4478             : Size
    4479          96 : tuplesort_estimate_shared(int nWorkers)
    4480             : {
    4481             :     Size        tapesSize;
    4482             : 
    4483             :     Assert(nWorkers > 0);
    4484             : 
    4485             :     /* Make sure that BufFile shared state is MAXALIGN'd */
    4486          96 :     tapesSize = mul_size(sizeof(TapeShare), nWorkers);
    4487          96 :     tapesSize = MAXALIGN(add_size(tapesSize, offsetof(Sharedsort, tapes)));
    4488             : 
    4489          96 :     return tapesSize;
    4490             : }
    4491             : 
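The estimate above is the standard flexible-array-member sizing pattern: offsetof() to the array, plus one TapeShare per worker, rounded up to MAXALIGN so that whatever follows in the shared segment stays aligned. A generic version, assuming 8-byte alignment in place of MAXALIGN and a simplified stand-in for Sharedsort:

    #include <stddef.h>

    #define ALIGN8(len)     (((len) + 7) & ~(size_t) 7)  /* stand-in for MAXALIGN */

    struct shared_state
    {
        int  nslots;
        long slots[];       /* one entry per worker, like Sharedsort.tapes */
    };

    static size_t
    estimate_shared(int nworkers)
    {
        return ALIGN8(offsetof(struct shared_state, slots)
                      + (size_t) nworkers * sizeof(long));
    }
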
    4492             : /*
    4493             :  * tuplesort_initialize_shared - initialize shared tuplesort state
    4494             :  *
    4495             :  * Must be called from leader process before workers are launched, to
    4496             :  * establish state needed up-front for worker tuplesortstates.  nWorkers
    4497             :  * should match the argument passed to tuplesort_estimate_shared().
    4498             :  */
    4499             : void
    4500         140 : tuplesort_initialize_shared(Sharedsort *shared, int nWorkers, dsm_segment *seg)
    4501             : {
    4502             :     int         i;
    4503             : 
    4504             :     Assert(nWorkers > 0);
    4505             : 
    4506         140 :     SpinLockInit(&shared->mutex);
    4507         140 :     shared->currentWorker = 0;
    4508         140 :     shared->workersFinished = 0;
    4509         140 :     SharedFileSetInit(&shared->fileset, seg);
    4510         140 :     shared->nTapes = nWorkers;
    4511         420 :     for (i = 0; i < nWorkers; i++)
    4512             :     {
    4513         280 :         shared->tapes[i].firstblocknumber = 0L;
    4514             :     }
    4515         140 : }
    4516             : 
    4517             : /*
    4518             :  * tuplesort_attach_shared - attach to shared tuplesort state
    4519             :  *
    4520             :  * Must be called by all worker processes.
    4521             :  */
    4522             : void
    4523         140 : tuplesort_attach_shared(Sharedsort *shared, dsm_segment *seg)
    4524             : {
    4525             :     /* Attach to SharedFileSet */
    4526         140 :     SharedFileSetAttach(&shared->fileset, seg);
    4527         140 : }
    4528             : 
    4529             : /*
    4530             :  * worker_get_identifier - Assign and return ordinal identifier for worker
    4531             :  *
    4532             :  * The order in which these are assigned is not well defined, and should not
    4533             :  * matter; worker numbers across parallel sort participants need only be
    4534             :  * distinct and gapless.  logtape.c requires this.
    4535             :  *
    4536             :  * Note that the identifiers assigned from here have no relation to the
    4537             :  * ParallelWorkerNumber value, to avoid making any assumptions about the
    4538             :  * caller's requirements.  However, we do follow the ParallelWorkerNumber
    4539             :  * convention of representing a non-worker with worker number -1.  This
    4540             :  * includes the leader, as well as serial Tuplesort processes.
    4541             :  */
    4542             : static int
    4543         280 : worker_get_identifier(Tuplesortstate *state)
    4544             : {
    4545         280 :     Sharedsort *shared = state->shared;
    4546             :     int         worker;
    4547             : 
    4548             :     Assert(WORKER(state));
    4549             : 
    4550         280 :     SpinLockAcquire(&shared->mutex);
    4551         280 :     worker = shared->currentWorker++;
    4552         280 :     SpinLockRelease(&shared->mutex);
    4553             : 
    4554         280 :     return worker;
    4555             : }
    4556             : 
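The identifier assignment is just a mutually exclusive post-increment of a shared counter, which is what makes the numbers distinct and gapless. A portable sketch of the same idea using C11 atomics in place of the spinlock:

    #include <stdatomic.h>

    static atomic_int next_worker;      /* shared counter, starts at 0 */

    /* Each caller receives the next ordinal: 0, 1, 2, ... with no gaps,
     * regardless of the order in which callers arrive. */
    static int
    get_worker_identifier(void)
    {
        return atomic_fetch_add(&next_worker, 1);
    }
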
    4557             : /*
    4558             :  * worker_freeze_result_tape - freeze worker's result tape for leader
    4559             :  *
    4560             :  * This is called by workers just after the result tape has been determined,
    4561             :  * instead of calling LogicalTapeFreeze() directly.  They do so because
    4562             :  * workers require a few additional steps beyond the similar serial
    4563             :  * TSS_SORTEDONTAPE external sort case, and those steps also happen here:
    4564             :  * freeing now-unneeded resources, and advertising to the leader that the
    4565             :  * worker's run is available as input for its merge.
    4566             :  *
    4567             :  * There should only be one final output run for each worker, which consists
    4568             :  * of all tuples that were originally input into that worker.
    4569             :  */
    4570             : static void
    4571         280 : worker_freeze_result_tape(Tuplesortstate *state)
    4572             : {
    4573         280 :     Sharedsort *shared = state->shared;
    4574             :     TapeShare   output;
    4575             : 
    4576             :     Assert(WORKER(state));
    4577             :     Assert(state->result_tape != NULL);
    4578             :     Assert(state->memtupcount == 0);
    4579             : 
    4580             :     /*
    4581             :      * Free most remaining memory, in case caller is sensitive to our holding
    4582             :      * on to it.  memtuples may not be a tiny merge heap at this point.
    4583             :      */
    4584         280 :     pfree(state->memtuples);
    4585             :     /* Be tidy */
    4586         280 :     state->memtuples = NULL;
    4587         280 :     state->memtupsize = 0;
    4588             : 
    4589             :     /*
    4590             :      * The leader requires the worker's result tape metadata, so it must be
    4591             :      * stored in shared memory.
    4592             :      */
    4593         280 :     LogicalTapeFreeze(state->result_tape, &output);
    4594             : 
    4595             :     /* Store properties of output tape, and update finished worker count */
    4596         280 :     SpinLockAcquire(&shared->mutex);
    4597         280 :     shared->tapes[state->worker] = output;
    4598         280 :     shared->workersFinished++;
    4599         280 :     SpinLockRelease(&shared->mutex);
    4600         280 : }
    4601             : 
    4602             : /*
    4603             :  * worker_nomergeruns - dump memtuples in worker, without merging
    4604             :  *
    4605             :  * This is called as an alternative to mergeruns() when a worker requires
    4606             :  * no merging.
    4607             :  */
    4608             : static void
    4609         280 : worker_nomergeruns(Tuplesortstate *state)
    4610             : {
    4611             :     Assert(WORKER(state));
    4612             :     Assert(state->result_tape == NULL);
    4613             :     Assert(state->nOutputRuns == 1);
    4614             : 
    4615         280 :     state->result_tape = state->destTape;
    4616         280 :     worker_freeze_result_tape(state);
    4617         280 : }
    4618             : 
    4619             : /*
    4620             :  * leader_takeover_tapes - create tapeset for leader from worker tapes
    4621             :  *
    4622             :  * So far, the leader Tuplesortstate has performed no actual sorting.  By
    4623             :  * now, all sorting has occurred in workers, all of which must have already
    4624             :  * returned from tuplesort_performsort().
    4625             :  *
    4626             :  * When this returns, the leader process is left in a state that is
    4627             :  * virtually indistinguishable from having generated the runs itself, as a
    4628             :  * serial external sort might have.
    4629             :  */
    4630             : static void
    4631          96 : leader_takeover_tapes(Tuplesortstate *state)
    4632             : {
    4633          96 :     Sharedsort *shared = state->shared;
    4634          96 :     int         nParticipants = state->nParticipants;
    4635             :     int         workersFinished;
    4636             :     int         j;
    4637             : 
    4638             :     Assert(LEADER(state));
    4639             :     Assert(nParticipants >= 1);
    4640             : 
    4641          96 :     SpinLockAcquire(&shared->mutex);
    4642          96 :     workersFinished = shared->workersFinished;
    4643          96 :     SpinLockRelease(&shared->mutex);
    4644             : 
    4645          96 :     if (nParticipants != workersFinished)
    4646           0 :         elog(ERROR, "cannot take over tapes before all workers finish");
    4647             : 
    4648             :     /*
    4649             :      * Create the tapeset from worker tapes, including a leader-owned tape at
    4650             :      * the end.  Parallel workers are far more expensive than logical tapes,
    4651             :      * so the number of tapes allocated here should never be excessive.
    4652             :      */
    4653          96 :     inittapestate(state, nParticipants);
    4654          96 :     state->tapeset = LogicalTapeSetCreate(false, &shared->fileset, -1);
    4655             : 
    4656             :     /*
    4657             :      * Set currentRun to reflect the number of runs we will merge (it's not
    4658             :      * used for anything, this is just pro forma)
    4659             :      */
    4660          96 :     state->currentRun = nParticipants;
    4661             : 
    4662             :     /*
    4663             :      * Initialize the state to look the same as after building the initial
    4664             :      * runs.
    4665             :      *
    4666             :      * There will always be exactly 1 run per worker, and exactly one input
    4667             :      * tape per run, because workers always output exactly 1 run, even when
    4668             :      * there were no input tuples for workers to sort.
    4669             :      */
    4670          96 :     state->inputTapes = NULL;
    4671          96 :     state->nInputTapes = 0;
    4672          96 :     state->nInputRuns = 0;
    4673             : 
    4674          96 :     state->outputTapes = palloc0(nParticipants * sizeof(LogicalTape *));
    4675          96 :     state->nOutputTapes = nParticipants;
    4676          96 :     state->nOutputRuns = nParticipants;
    4677             : 
    4678         288 :     for (j = 0; j < nParticipants; j++)
    4679             :     {
    4680         192 :         state->outputTapes[j] = LogicalTapeImport(state->tapeset, j, &shared->tapes[j]);
    4681             :     }
    4682             : 
    4683          96 :     state->status = TSS_BUILDRUNS;
    4684          96 : }
    4685             : 
    4686             : /*
    4687             :  * Convenience routine to free a tuple previously loaded into sort memory
    4688             :  */
    4689             : static void
    4690     3223002 : free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
    4691             : {
    4692     3223002 :     if (stup->tuple)
    4693             :     {
    4694     3106204 :         FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
    4695     3106204 :         pfree(stup->tuple);
    4696     3106204 :         stup->tuple = NULL;
    4697             :     }
    4698     3223002 : }

Generated by: LCOV version 1.14