LCOV - code coverage report
Current view: top level - src/backend/utils/sort - logtape.c (source / functions) Hit Total Coverage
Test: PostgreSQL 13beta1 Lines: 304 328 92.7 %
Date: 2020-05-25 06:06:29 Functions: 25 25 100.0 %
Legend: Lines: hit not hit

          Line data    Source code
       1             : /*-------------------------------------------------------------------------
       2             :  *
       3             :  * logtape.c
       4             :  *    Management of "logical tapes" within temporary files.
       5             :  *
       6             :  * This module exists to support sorting via multiple merge passes (see
       7             :  * tuplesort.c).  Merging is an ideal algorithm for tape devices, but if
       8             :  * we implement it on disk by creating a separate file for each "tape",
       9             :  * there is an annoying problem: the peak space usage is at least twice
      10             :  * the volume of actual data to be sorted.  (This must be so because each
      11             :  * datum will appear in both the input and output tapes of the final
      12             :  * merge pass.  For seven-tape polyphase merge, which is otherwise a
      13             :  * pretty good algorithm, peak usage is more like 4x actual data volume.)
      14             :  *
      15             :  * We can work around this problem by recognizing that any one tape
      16             :  * dataset (with the possible exception of the final output) is written
      17             :  * and read exactly once in a perfectly sequential manner.  Therefore,
      18             :  * a datum once read will not be required again, and we can recycle its
      19             :  * space for use by the new tape dataset(s) being generated.  In this way,
      20             :  * the total space usage is essentially just the actual data volume, plus
      21             :  * insignificant bookkeeping and start/stop overhead.
      22             :  *
      23             :  * Few OSes allow arbitrary parts of a file to be released back to the OS,
      24             :  * so we have to implement this space-recycling ourselves within a single
      25             :  * logical file.  logtape.c exists to perform this bookkeeping and provide
      26             :  * the illusion of N independent tape devices to tuplesort.c.  Note that
      27             :  * logtape.c itself depends on buffile.c to provide a "logical file" of
      28             :  * larger size than the underlying OS may support.
      29             :  *
      30             :  * For simplicity, we allocate and release space in the underlying file
      31             :  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
      32             :  * of which blocks in the underlying file belong to which logical tape,
      33             :  * plus any blocks that are free (recycled and not yet reused).
      34             :  * The blocks in each logical tape form a chain, with a prev- and next-
      35             :  * pointer in each block.
      36             :  *
      37             :  * The initial write pass is guaranteed to fill the underlying file
      38             :  * perfectly sequentially, no matter how data is divided into logical tapes.
      39             :  * Once we begin merge passes, the access pattern becomes considerably
      40             :  * less predictable --- but the seeking involved should be comparable to
      41             :  * what would happen if we kept each logical tape in a separate file,
      42             :  * so there's no serious performance penalty paid to obtain the space
      43             :  * savings of recycling.  We try to localize the write accesses by always
      44             :  * writing to the lowest-numbered free block when we have a choice; it's
      45             :  * not clear this helps much, but it can't hurt.  (XXX perhaps a LIFO
      46             :  * policy for free blocks would be better?)
      47             :  *
      48             :  * To further make the I/Os more sequential, we can use a larger buffer
      49             :  * when reading, and read multiple blocks from the same tape in one go,
      50             :  * whenever the buffer becomes empty.
      51             :  *
      52             :  * To support the above policy of writing to the lowest free block, the
      53             :  * freelist is a min heap.
      54             :  *
      55             :  * Since all the bookkeeping and buffer memory is allocated with palloc(),
      56             :  * and the underlying file(s) are made with OpenTemporaryFile, all resources
      57             :  * for a logical tape set are certain to be cleaned up even if processing
      58             :  * is aborted by ereport(ERROR).  To avoid confusion, the caller should take
      59             :  * care that all calls for a single LogicalTapeSet are made in the same
      60             :  * palloc context.
      61             :  *
      62             :  * To support parallel sort operations involving coordinated callers to
      63             :  * tuplesort.c routines across multiple workers, it is necessary to
      64             :  * concatenate each worker BufFile/tapeset into one single logical tapeset
      65             :  * managed by the leader.  Workers should have produced one final
      66             :  * materialized tape (their entire output) when this happens in leader.
      67             :  * There will always be the same number of runs as input tapes, and the same
      68             :  * number of input tapes as participants (worker Tuplesortstates).
      69             :  *
      70             :  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
      71             :  * Portions Copyright (c) 1994, Regents of the University of California
      72             :  *
      73             :  * IDENTIFICATION
      74             :  *    src/backend/utils/sort/logtape.c
      75             :  *
      76             :  *-------------------------------------------------------------------------
      77             :  */
      78             : 
      79             : #include "postgres.h"
      80             : 
      81             : #include "storage/buffile.h"
      82             : #include "utils/builtins.h"
      83             : #include "utils/logtape.h"
      84             : #include "utils/memdebug.h"
      85             : #include "utils/memutils.h"
      86             : 
      87             : /*
      88             :  * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
      89             :  *
      90             :  * The first block of a tape has prev == -1.  The last block of a tape
      91             :  * stores the number of valid bytes on the block, inverted, in 'next'
      92             :  * Therefore next < 0 indicates the last block.
      93             :  */
      94             : typedef struct TapeBlockTrailer
      95             : {
      96             :     long        prev;           /* previous block on this tape, or -1 on first
      97             :                                  * block */
      98             :     long        next;           /* next block on this tape, or # of valid
      99             :                                  * bytes on last block (if < 0) */
     100             : } TapeBlockTrailer;
     101             : 
     102             : #define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
     103             : #define TapeBlockGetTrailer(buf) \
     104             :     ((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
     105             : 
     106             : #define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
     107             : #define TapeBlockGetNBytes(buf) \
     108             :     (TapeBlockIsLast(buf) ? \
     109             :      (- TapeBlockGetTrailer(buf)->next) : TapeBlockPayloadSize)
     110             : #define TapeBlockSetNBytes(buf, nbytes) \
     111             :     (TapeBlockGetTrailer(buf)->next = -(nbytes))
     112             : 
     113             : 
     114             : /*
     115             :  * This data structure represents a single "logical tape" within the set
     116             :  * of logical tapes stored in the same file.
     117             :  *
     118             :  * While writing, we hold the current partially-written data block in the
     119             :  * buffer.  While reading, we can hold multiple blocks in the buffer.  Note
     120             :  * that we don't retain the trailers of a block when it's read into the
     121             :  * buffer.  The buffer therefore contains one large contiguous chunk of data
     122             :  * from the tape.
     123             :  */
     124             : typedef struct LogicalTape
     125             : {
     126             :     bool        writing;        /* T while in write phase */
     127             :     bool        frozen;         /* T if blocks should not be freed when read */
     128             :     bool        dirty;          /* does buffer need to be written? */
     129             : 
     130             :     /*
     131             :      * Block numbers of the first, current, and next block of the tape.
     132             :      *
     133             :      * The "current" block number is only valid when writing, or reading from
     134             :      * a frozen tape.  (When reading from an unfrozen tape, we use a larger
     135             :      * read buffer that holds multiple blocks, so the "current" block is
     136             :      * ambiguous.)
     137             :      *
     138             :      * When concatenation of worker tape BufFiles is performed, an offset to
     139             :      * the first block in the unified BufFile space is applied during reads.
     140             :      */
     141             :     long        firstBlockNumber;
     142             :     long        curBlockNumber;
     143             :     long        nextBlockNumber;
     144             :     long        offsetBlockNumber;
     145             : 
     146             :     /*
     147             :      * Buffer for current data block(s).
     148             :      */
     149             :     char       *buffer;         /* physical buffer (separately palloc'd) */
     150             :     int         buffer_size;    /* allocated size of the buffer */
     151             :     int         max_size;       /* highest useful, safe buffer_size */
     152             :     int         pos;            /* next read/write position in buffer */
     153             :     int         nbytes;         /* total # of valid bytes in buffer */
     154             : } LogicalTape;
     155             : 
     156             : /*
     157             :  * This data structure represents a set of related "logical tapes" sharing
     158             :  * space in a single underlying file.  (But that "file" may be multiple files
     159             :  * if needed to escape OS limits on file size; buffile.c handles that for us.)
     160             :  * The number of tapes is fixed at creation.
     161             :  */
     162             : struct LogicalTapeSet
     163             : {
     164             :     BufFile    *pfile;          /* underlying file for whole tape set */
     165             : 
     166             :     /*
     167             :      * File size tracking.  nBlocksWritten is the size of the underlying file,
     168             :      * in BLCKSZ blocks.  nBlocksAllocated is the number of blocks allocated
     169             :      * by ltsReleaseBlock(), and it is always greater than or equal to
     170             :      * nBlocksWritten.  Blocks between nBlocksAllocated and nBlocksWritten are
     171             :      * blocks that have been allocated for a tape, but have not been written
     172             :      * to the underlying file yet.  nHoleBlocks tracks the total number of
     173             :      * blocks that are in unused holes between worker spaces following BufFile
     174             :      * concatenation.
     175             :      */
     176             :     long        nBlocksAllocated;   /* # of blocks allocated */
     177             :     long        nBlocksWritten; /* # of blocks used in underlying file */
     178             :     long        nHoleBlocks;    /* # of "hole" blocks left */
     179             : 
     180             :     /*
     181             :      * We store the numbers of recycled-and-available blocks in freeBlocks[].
     182             :      * When there are no such blocks, we extend the underlying file.
     183             :      *
     184             :      * If forgetFreeSpace is true then any freed blocks are simply forgotten
     185             :      * rather than being remembered in freeBlocks[].  See notes for
     186             :      * LogicalTapeSetForgetFreeSpace().
     187             :      */
     188             :     bool        forgetFreeSpace;    /* are we remembering free blocks? */
     189             :     long       *freeBlocks;     /* resizable array holding minheap */
     190             :     long        nFreeBlocks;    /* # of currently free blocks */
     191             :     Size        freeBlocksLen;  /* current allocated length of freeBlocks[] */
     192             : 
     193             :     /* The array of logical tapes. */
     194             :     int         nTapes;         /* # of logical tapes in set */
     195             :     LogicalTape *tapes;         /* has nTapes nentries */
     196             : };
     197             : 
     198             : static void ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer);
     199             : static void ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer);
     200             : static long ltsGetFreeBlock(LogicalTapeSet *lts);
     201             : static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
     202             : static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
     203             :                                  SharedFileSet *fileset);
     204             : static void ltsInitTape(LogicalTape *lt);
     205             : static void ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt);
     206             : 
     207             : 
     208             : /*
     209             :  * Write a block-sized buffer to the specified block of the underlying file.
     210             :  *
     211             :  * No need for an error return convention; we ereport() on any error.
     212             :  */
     213             : static void
     214       45502 : ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
     215             : {
     216             :     /*
     217             :      * BufFile does not support "holes", so if we're about to write a block
     218             :      * that's past the current end of file, fill the space between the current
     219             :      * end of file and the target block with zeros.
     220             :      *
     221             :      * This should happen rarely, otherwise you are not writing very
     222             :      * sequentially.  In current use, this only happens when the sort ends
     223             :      * writing a run, and switches to another tape.  The last block of the
     224             :      * previous tape isn't flushed to disk until the end of the sort, so you
     225             :      * get one-block hole, where the last block of the previous tape will
     226             :      * later go.
     227             :      *
     228             :      * Note that BufFile concatenation can leave "holes" in BufFile between
     229             :      * worker-owned block ranges.  These are tracked for reporting purposes
     230             :      * only.  We never read from nor write to these hole blocks, and so they
     231             :      * are not considered here.
     232             :      */
     233       47038 :     while (blocknum > lts->nBlocksWritten)
     234             :     {
     235             :         PGAlignedBlock zerobuf;
     236             : 
     237        1536 :         MemSet(zerobuf.data, 0, sizeof(zerobuf));
     238             : 
     239        1536 :         ltsWriteBlock(lts, lts->nBlocksWritten, zerobuf.data);
     240             :     }
     241             : 
     242             :     /* Write the requested block */
     243       91004 :     if (BufFileSeekBlock(lts->pfile, blocknum) != 0 ||
     244       45502 :         BufFileWrite(lts->pfile, buffer, BLCKSZ) != BLCKSZ)
     245           0 :         ereport(ERROR,
     246             :                 (errcode_for_file_access(),
     247             :                  errmsg("could not write block %ld of temporary file: %m",
     248             :                         blocknum)));
     249             : 
     250             :     /* Update nBlocksWritten, if we extended the file */
     251       45502 :     if (blocknum == lts->nBlocksWritten)
     252       15146 :         lts->nBlocksWritten++;
     253       45502 : }
     254             : 
     255             : /*
     256             :  * Read a block-sized buffer from the specified block of the underlying file.
     257             :  *
     258             :  * No need for an error return convention; we ereport() on any error.   This
     259             :  * module should never attempt to read a block it doesn't know is there.
     260             :  */
     261             : static void
     262       43714 : ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
     263             : {
     264       87428 :     if (BufFileSeekBlock(lts->pfile, blocknum) != 0 ||
     265       43714 :         BufFileRead(lts->pfile, buffer, BLCKSZ) != BLCKSZ)
     266           0 :         ereport(ERROR,
     267             :                 (errcode_for_file_access(),
     268             :                  errmsg("could not read block %ld of temporary file: %m",
     269             :                         blocknum)));
     270       43714 : }
     271             : 
     272             : /*
     273             :  * Read as many blocks as we can into the per-tape buffer.
     274             :  *
     275             :  * Returns true if anything was read, 'false' on EOF.
     276             :  */
     277             : static bool
     278       55864 : ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt)
     279             : {
     280       55864 :     lt->pos = 0;
     281       55864 :     lt->nbytes = 0;
     282             : 
     283             :     do
     284             :     {
     285       66390 :         char       *thisbuf = lt->buffer + lt->nbytes;
     286       66390 :         long        datablocknum = lt->nextBlockNumber;
     287             : 
     288             :         /* Fetch next block number */
     289       66390 :         if (datablocknum == -1L)
     290       22988 :             break;              /* EOF */
     291             :         /* Apply worker offset, needed for leader tapesets */
     292       43402 :         datablocknum += lt->offsetBlockNumber;
     293             : 
     294             :         /* Read the block */
     295       43402 :         ltsReadBlock(lts, datablocknum, (void *) thisbuf);
     296       43402 :         if (!lt->frozen)
     297       42874 :             ltsReleaseBlock(lts, datablocknum);
     298       43402 :         lt->curBlockNumber = lt->nextBlockNumber;
     299             : 
     300       43402 :         lt->nbytes += TapeBlockGetNBytes(thisbuf);
     301       43402 :         if (TapeBlockIsLast(thisbuf))
     302             :         {
     303       23648 :             lt->nextBlockNumber = -1L;
     304             :             /* EOF */
     305       23648 :             break;
     306             :         }
     307             :         else
     308       19754 :             lt->nextBlockNumber = TapeBlockGetTrailer(thisbuf)->next;
     309             : 
     310             :         /* Advance to next block, if we have buffer space left */
     311       19754 :     } while (lt->buffer_size - lt->nbytes > BLCKSZ);
     312             : 
     313       55864 :     return (lt->nbytes > 0);
     314             : }
     315             : 
     316             : static inline void
     317      195712 : swap_nodes(long *heap, unsigned long a, unsigned long b)
     318             : {
     319             :     unsigned long swap;
     320             : 
     321      195712 :     swap = heap[a];
     322      195712 :     heap[a] = heap[b];
     323      195712 :     heap[b] = swap;
     324      195712 : }
     325             : 
     326             : static inline unsigned long
     327      122360 : left_offset(unsigned long i)
     328             : {
     329      122360 :     return 2 * i + 1;
     330             : }
     331             : 
     332             : static inline unsigned long
     333      122360 : right_offset(unsigned i)
     334             : {
     335      122360 :     return 2 * i + 2;
     336             : }
     337             : 
     338             : static inline unsigned long
     339      114948 : parent_offset(unsigned long i)
     340             : {
     341      114948 :     return (i - 1) / 2;
     342             : }
     343             : 
     344             : /*
     345             :  * Select the lowest currently unused block by taking the first element from
     346             :  * the freelist min heap.
     347             :  */
     348             : static long
     349       43966 : ltsGetFreeBlock(LogicalTapeSet *lts)
     350             : {
     351       43966 :     long       *heap = lts->freeBlocks;
     352             :     long        blocknum;
     353             :     int         heapsize;
     354             :     unsigned long pos;
     355             : 
     356             :     /* freelist empty; allocate a new block */
     357       43966 :     if (lts->nFreeBlocks == 0)
     358       15146 :         return lts->nBlocksAllocated++;
     359             : 
     360       28820 :     if (lts->nFreeBlocks == 1)
     361             :     {
     362         244 :         lts->nFreeBlocks--;
     363         244 :         return lts->freeBlocks[0];
     364             :     }
     365             : 
     366             :     /* take top of minheap */
     367       28576 :     blocknum = heap[0];
     368             : 
     369             :     /* replace with end of minheap array */
     370       28576 :     heap[0] = heap[--lts->nFreeBlocks];
     371             : 
     372             :     /* sift down */
     373       28576 :     pos = 0;
     374       28576 :     heapsize = lts->nFreeBlocks;
     375             :     while (true)
     376       93784 :     {
     377      122360 :         unsigned long left = left_offset(pos);
     378      122360 :         unsigned long right = right_offset(pos);
     379             :         unsigned long min_child;
     380             : 
     381      122360 :         if (left < heapsize && right < heapsize)
     382       93808 :             min_child = (heap[left] < heap[right]) ? left : right;
     383       28552 :         else if (left < heapsize)
     384       12576 :             min_child = left;
     385       15976 :         else if (right < heapsize)
     386           0 :             min_child = right;
     387             :         else
     388       15976 :             break;
     389             : 
     390      106384 :         if (heap[min_child] >= heap[pos])
     391       12600 :             break;
     392             : 
     393       93784 :         swap_nodes(heap, min_child, pos);
     394       93784 :         pos = min_child;
     395             :     }
     396             : 
     397       28576 :     return blocknum;
     398             : }
     399             : 
     400             : /*
     401             :  * Return a block# to the freelist.
     402             :  */
     403             : static void
     404       42874 : ltsReleaseBlock(LogicalTapeSet *lts, long blocknum)
     405             : {
     406             :     long       *heap;
     407             :     unsigned long pos;
     408             : 
     409             :     /*
     410             :      * Do nothing if we're no longer interested in remembering free space.
     411             :      */
     412       42874 :     if (lts->forgetFreeSpace)
     413       11710 :         return;
     414             : 
     415             :     /*
     416             :      * Enlarge freeBlocks array if full.
     417             :      */
     418       31164 :     if (lts->nFreeBlocks >= lts->freeBlocksLen)
     419             :     {
     420             :         /*
     421             :          * If the freelist becomes very large, just return and leak this free
     422             :          * block.
     423             :          */
     424          32 :         if (lts->freeBlocksLen * 2 > MaxAllocSize)
     425           0 :             return;
     426             : 
     427          32 :         lts->freeBlocksLen *= 2;
     428          32 :         lts->freeBlocks = (long *) repalloc(lts->freeBlocks,
     429          32 :                                             lts->freeBlocksLen * sizeof(long));
     430             :     }
     431             : 
     432       31164 :     heap = lts->freeBlocks;
     433       31164 :     pos = lts->nFreeBlocks;
     434             : 
     435             :     /* place entry at end of minheap array */
     436       31164 :     heap[pos] = blocknum;
     437       31164 :     lts->nFreeBlocks++;
     438             : 
     439             :     /* sift up */
     440      133092 :     while (pos != 0)
     441             :     {
     442      114948 :         unsigned long parent = parent_offset(pos);
     443             : 
     444      114948 :         if (heap[parent] < heap[pos])
     445       13020 :             break;
     446             : 
     447      101928 :         swap_nodes(heap, parent, pos);
     448      101928 :         pos = parent;
     449             :     }
     450             : }
     451             : 
     452             : /*
     453             :  * Claim ownership of a set of logical tapes from existing shared BufFiles.
     454             :  *
     455             :  * Caller should be leader process.  Though tapes are marked as frozen in
     456             :  * workers, they are not frozen when opened within leader, since unfrozen tapes
     457             :  * use a larger read buffer. (Frozen tapes have smaller read buffer, optimized
     458             :  * for random access.)
     459             :  */
     460             : static void
     461          88 : ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
     462             :                      SharedFileSet *fileset)
     463             : {
     464          88 :     LogicalTape *lt = NULL;
     465          88 :     long        tapeblocks = 0L;
     466          88 :     long        nphysicalblocks = 0L;
     467             :     int         i;
     468             : 
     469             :     /* Should have at least one worker tape, plus leader's tape */
     470             :     Assert(lts->nTapes >= 2);
     471             : 
     472             :     /*
     473             :      * Build concatenated view of all BufFiles, remembering the block number
     474             :      * where each source file begins.  No changes are needed for leader/last
     475             :      * tape.
     476             :      */
     477         264 :     for (i = 0; i < lts->nTapes - 1; i++)
     478             :     {
     479             :         char        filename[MAXPGPATH];
     480             :         BufFile    *file;
     481             :         int64       filesize;
     482             : 
     483         176 :         lt = &lts->tapes[i];
     484             : 
     485         176 :         pg_itoa(i, filename);
     486         176 :         file = BufFileOpenShared(fileset, filename);
     487         176 :         filesize = BufFileSize(file);
     488             : 
     489             :         /*
     490             :          * Stash first BufFile, and concatenate subsequent BufFiles to that.
     491             :          * Store block offset into each tape as we go.
     492             :          */
     493         176 :         lt->firstBlockNumber = shared[i].firstblocknumber;
     494         176 :         if (i == 0)
     495             :         {
     496          88 :             lts->pfile = file;
     497          88 :             lt->offsetBlockNumber = 0L;
     498             :         }
     499             :         else
     500             :         {
     501          88 :             lt->offsetBlockNumber = BufFileAppend(lts->pfile, file);
     502             :         }
     503             :         /* Don't allocate more for read buffer than could possibly help */
     504         176 :         lt->max_size = Min(MaxAllocSize, filesize);
     505         176 :         tapeblocks = filesize / BLCKSZ;
     506         176 :         nphysicalblocks += tapeblocks;
     507             :     }
     508             : 
     509             :     /*
     510             :      * Set # of allocated blocks, as well as # blocks written.  Use extent of
     511             :      * new BufFile space (from 0 to end of last worker's tape space) for this.
     512             :      * Allocated/written blocks should include space used by holes left
     513             :      * between concatenated BufFiles.
     514             :      */
     515          88 :     lts->nBlocksAllocated = lt->offsetBlockNumber + tapeblocks;
     516          88 :     lts->nBlocksWritten = lts->nBlocksAllocated;
     517             : 
     518             :     /*
     519             :      * Compute number of hole blocks so that we can later work backwards, and
     520             :      * instrument number of physical blocks.  We don't simply use physical
     521             :      * blocks directly for instrumentation because this would break if we ever
     522             :      * subsequently wrote to the leader tape.
     523             :      *
     524             :      * Working backwards like this keeps our options open.  If shared BufFiles
     525             :      * ever support being written to post-export, logtape.c can automatically
     526             :      * take advantage of that.  We'd then support writing to the leader tape
     527             :      * while recycling space from worker tapes, because the leader tape has a
     528             :      * zero offset (write routines won't need to have extra logic to apply an
     529             :      * offset).
     530             :      *
     531             :      * The only thing that currently prevents writing to the leader tape from
     532             :      * working is the fact that BufFiles opened using BufFileOpenShared() are
     533             :      * read-only by definition, but that could be changed if it seemed
     534             :      * worthwhile.  For now, writing to the leader tape will raise a "Bad file
     535             :      * descriptor" error, so tuplesort must avoid writing to the leader tape
     536             :      * altogether.
     537             :      */
     538          88 :     lts->nHoleBlocks = lts->nBlocksAllocated - nphysicalblocks;
     539          88 : }
     540             : 
     541             : /*
     542             :  * Initialize per-tape struct.  Note we allocate the I/O buffer lazily.
     543             :  */
     544             : static void
     545       22080 : ltsInitTape(LogicalTape *lt)
     546             : {
     547       22080 :     lt->writing = true;
     548       22080 :     lt->frozen = false;
     549       22080 :     lt->dirty = false;
     550       22080 :     lt->firstBlockNumber = -1L;
     551       22080 :     lt->curBlockNumber = -1L;
     552       22080 :     lt->nextBlockNumber = -1L;
     553       22080 :     lt->offsetBlockNumber = 0L;
     554       22080 :     lt->buffer = NULL;
     555       22080 :     lt->buffer_size = 0;
     556             :     /* palloc() larger than MaxAllocSize would fail */
     557       22080 :     lt->max_size = MaxAllocSize;
     558       22080 :     lt->pos = 0;
     559       22080 :     lt->nbytes = 0;
     560       22080 : }
     561             : 
     562             : /*
     563             :  * Lazily allocate and initialize the read buffer. This avoids waste when many
     564             :  * tapes are open at once, but not all are active between rewinding and
     565             :  * reading.
     566             :  */
     567             : static void
     568       23692 : ltsInitReadBuffer(LogicalTapeSet *lts, LogicalTape *lt)
     569             : {
     570             :     Assert(lt->buffer_size > 0);
     571       23692 :     lt->buffer = palloc(lt->buffer_size);
     572             : 
     573             :     /* Read the first block, or reset if tape is empty */
     574       23692 :     lt->nextBlockNumber = lt->firstBlockNumber;
     575       23692 :     lt->pos = 0;
     576       23692 :     lt->nbytes = 0;
     577       23692 :     ltsReadFillBuffer(lts, lt);
     578       23692 : }
     579             : 
     580             : /*
     581             :  * Create a set of logical tapes in a temporary underlying file.
     582             :  *
     583             :  * Each tape is initialized in write state.  Serial callers pass ntapes,
     584             :  * NULL argument for shared, and -1 for worker.  Parallel worker callers
     585             :  * pass ntapes, a shared file handle, NULL shared argument,  and their own
     586             :  * worker number.  Leader callers, which claim shared worker tapes here,
     587             :  * must supply non-sentinel values for all arguments except worker number,
     588             :  * which should be -1.
     589             :  *
     590             :  * Leader caller is passing back an array of metadata each worker captured
     591             :  * when LogicalTapeFreeze() was called for their final result tapes.  Passed
     592             :  * tapes array is actually sized ntapes - 1, because it includes only
     593             :  * worker tapes, whereas leader requires its own leader tape.  Note that we
     594             :  * rely on the assumption that reclaimed worker tapes will only be read
     595             :  * from once by leader, and never written to again (tapes are initialized
     596             :  * for writing, but that's only to be consistent).  Leader may not write to
     597             :  * its own tape purely due to a restriction in the shared buffile
     598             :  * infrastructure that may be lifted in the future.
     599             :  */
     600             : LogicalTapeSet *
     601         456 : LogicalTapeSetCreate(int ntapes, TapeShare *shared, SharedFileSet *fileset,
     602             :                      int worker)
     603             : {
     604             :     LogicalTapeSet *lts;
     605             :     int         i;
     606             : 
     607             :     /*
     608             :      * Create top-level struct including per-tape LogicalTape structs.
     609             :      */
     610             :     Assert(ntapes > 0);
     611         456 :     lts = (LogicalTapeSet *) palloc(sizeof(LogicalTapeSet));
     612         456 :     lts->nBlocksAllocated = 0L;
     613         456 :     lts->nBlocksWritten = 0L;
     614         456 :     lts->nHoleBlocks = 0L;
     615         456 :     lts->forgetFreeSpace = false;
     616         456 :     lts->freeBlocksLen = 32; /* reasonable initial guess */
     617         456 :     lts->freeBlocks = (long *) palloc(lts->freeBlocksLen * sizeof(long));
     618         456 :     lts->nFreeBlocks = 0;
     619         456 :     lts->nTapes = ntapes;
     620         456 :     lts->tapes = (LogicalTape *) palloc(ntapes * sizeof(LogicalTape));
     621             : 
     622        3692 :     for (i = 0; i < ntapes; i++)
     623        3236 :         ltsInitTape(&lts->tapes[i]);
     624             : 
     625             :     /*
     626             :      * Create temp BufFile storage as required.
     627             :      *
     628             :      * Leader concatenates worker tapes, which requires special adjustment to
     629             :      * final tapeset data.  Things are simpler for the worker case and the
     630             :      * serial case, though.  They are generally very similar -- workers use a
     631             :      * shared fileset, whereas serial sorts use a conventional serial BufFile.
     632             :      */
     633         456 :     if (shared)
     634          88 :         ltsConcatWorkerTapes(lts, shared, fileset);
     635         368 :     else if (fileset)
     636             :     {
     637             :         char        filename[MAXPGPATH];
     638             : 
     639         248 :         pg_itoa(worker, filename);
     640         248 :         lts->pfile = BufFileCreateShared(fileset, filename);
     641             :     }
     642             :     else
     643         120 :         lts->pfile = BufFileCreateTemp(false);
     644             : 
     645         456 :     return lts;
     646             : }
     647             : 
     648             : /*
     649             :  * Close a logical tape set and release all resources.
     650             :  */
     651             : void
     652         456 : LogicalTapeSetClose(LogicalTapeSet *lts)
     653             : {
     654             :     LogicalTape *lt;
     655             :     int         i;
     656             : 
     657         456 :     BufFileClose(lts->pfile);
     658       22536 :     for (i = 0; i < lts->nTapes; i++)
     659             :     {
     660       22080 :         lt = &lts->tapes[i];
     661       22080 :         if (lt->buffer)
     662         478 :             pfree(lt->buffer);
     663             :     }
     664         456 :     pfree(lts->tapes);
     665         456 :     pfree(lts->freeBlocks);
     666         456 :     pfree(lts);
     667         456 : }
     668             : 
     669             : /*
     670             :  * Mark a logical tape set as not needing management of free space anymore.
     671             :  *
     672             :  * This should be called if the caller does not intend to write any more data
     673             :  * into the tape set, but is reading from un-frozen tapes.  Since no more
     674             :  * writes are planned, remembering free blocks is no longer useful.  Setting
     675             :  * this flag lets us avoid wasting time and space in ltsReleaseBlock(), which
     676             :  * is not designed to handle large numbers of free blocks.
     677             :  */
     678             : void
     679         156 : LogicalTapeSetForgetFreeSpace(LogicalTapeSet *lts)
     680             : {
     681         156 :     lts->forgetFreeSpace = true;
     682         156 : }
     683             : 
     684             : /*
     685             :  * Write to a logical tape.
     686             :  *
     687             :  * There are no error returns; we ereport() on failure.
     688             :  */
     689             : void
     690    13027632 : LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
     691             :                  void *ptr, size_t size)
     692             : {
     693             :     LogicalTape *lt;
     694             :     size_t      nthistime;
     695             : 
     696             :     Assert(tapenum >= 0 && tapenum < lts->nTapes);
     697    13027632 :     lt = &lts->tapes[tapenum];
     698             :     Assert(lt->writing);
     699             :     Assert(lt->offsetBlockNumber == 0L);
     700             : 
     701             :     /* Allocate data buffer and first block on first write */
     702    13027632 :     if (lt->buffer == NULL)
     703             :     {
     704       23772 :         lt->buffer = (char *) palloc(BLCKSZ);
     705       23772 :         lt->buffer_size = BLCKSZ;
     706             :     }
     707    13027632 :     if (lt->curBlockNumber == -1)
     708             :     {
     709             :         Assert(lt->firstBlockNumber == -1);
     710             :         Assert(lt->pos == 0);
     711             : 
     712       23772 :         lt->curBlockNumber = ltsGetFreeBlock(lts);
     713       23772 :         lt->firstBlockNumber = lt->curBlockNumber;
     714             : 
     715       23772 :         TapeBlockGetTrailer(lt->buffer)->prev = -1L;
     716             :     }
     717             : 
     718             :     Assert(lt->buffer_size == BLCKSZ);
     719    26069852 :     while (size > 0)
     720             :     {
     721    13042220 :         if (lt->pos >= TapeBlockPayloadSize)
     722             :         {
     723             :             /* Buffer full, dump it out */
     724             :             long        nextBlockNumber;
     725             : 
     726       20194 :             if (!lt->dirty)
     727             :             {
     728             :                 /* Hmm, went directly from reading to writing? */
     729           0 :                 elog(ERROR, "invalid logtape state: should be dirty");
     730             :             }
     731             : 
     732             :             /*
     733             :              * First allocate the next block, so that we can store it in the
     734             :              * 'next' pointer of this block.
     735             :              */
     736       20194 :             nextBlockNumber = ltsGetFreeBlock(lts);
     737             : 
     738             :             /* set the next-pointer and dump the current block. */
     739       20194 :             TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
     740       20194 :             ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
     741             : 
     742             :             /* initialize the prev-pointer of the next block */
     743       20194 :             TapeBlockGetTrailer(lt->buffer)->prev = lt->curBlockNumber;
     744       20194 :             lt->curBlockNumber = nextBlockNumber;
     745       20194 :             lt->pos = 0;
     746       20194 :             lt->nbytes = 0;
     747             :         }
     748             : 
     749    13042220 :         nthistime = TapeBlockPayloadSize - lt->pos;
     750    13042220 :         if (nthistime > size)
     751    13022022 :             nthistime = size;
     752             :         Assert(nthistime > 0);
     753             : 
     754    13042220 :         memcpy(lt->buffer + lt->pos, ptr, nthistime);
     755             : 
     756    13042220 :         lt->dirty = true;
     757    13042220 :         lt->pos += nthistime;
     758    13042220 :         if (lt->nbytes < lt->pos)
     759    13042220 :             lt->nbytes = lt->pos;
     760    13042220 :         ptr = (void *) ((char *) ptr + nthistime);
     761    13042220 :         size -= nthistime;
     762             :     }
     763    13027632 : }
     764             : 
     765             : /*
     766             :  * Rewind logical tape and switch from writing to reading.
     767             :  *
     768             :  * The tape must currently be in writing state, or "frozen" in read state.
     769             :  *
     770             :  * 'buffer_size' specifies how much memory to use for the read buffer.
     771             :  * Regardless of the argument, the actual amount of memory used is between
     772             :  * BLCKSZ and MaxAllocSize, and is a multiple of BLCKSZ.  The given value is
     773             :  * rounded down and truncated to fit those constraints, if necessary.  If the
     774             :  * tape is frozen, the 'buffer_size' argument is ignored, and a small BLCKSZ
     775             :  * byte buffer is used.
     776             :  */
     777             : void
     778       23812 : LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
     779             : {
     780             :     LogicalTape *lt;
     781             : 
     782             :     Assert(tapenum >= 0 && tapenum < lts->nTapes);
     783       23812 :     lt = &lts->tapes[tapenum];
     784             : 
     785             :     /*
     786             :      * Round and cap buffer_size if needed.
     787             :      */
     788       23812 :     if (lt->frozen)
     789           4 :         buffer_size = BLCKSZ;
     790             :     else
     791             :     {
     792             :         /* need at least one block */
     793       23808 :         if (buffer_size < BLCKSZ)
     794         236 :             buffer_size = BLCKSZ;
     795             : 
     796             :         /* palloc() larger than max_size is unlikely to be helpful */
     797       23808 :         if (buffer_size > lt->max_size)
     798         176 :             buffer_size = lt->max_size;
     799             : 
     800             :         /* round down to BLCKSZ boundary */
     801       23808 :         buffer_size -= buffer_size % BLCKSZ;
     802             :     }
     803             : 
     804       23812 :     if (lt->writing)
     805             :     {
     806             :         /*
     807             :          * Completion of a write phase.  Flush last partial data block, and
     808             :          * rewind for normal (destructive) read.
     809             :          */
     810       23808 :         if (lt->dirty)
     811             :         {
     812             :             /*
     813             :              * As long as we've filled the buffer at least once, its contents
     814             :              * are entirely defined from valgrind's point of view, even though
     815             :              * contents beyond the current end point may be stale.  But it's
     816             :              * possible - at least in the case of a parallel sort - to sort
     817             :              * such small amount of data that we do not fill the buffer even
     818             :              * once.  Tell valgrind that its contents are defined, so it
     819             :              * doesn't bleat.
     820             :              */
     821             :             VALGRIND_MAKE_MEM_DEFINED(lt->buffer + lt->nbytes,
     822             :                                       lt->buffer_size - lt->nbytes);
     823             : 
     824       23512 :             TapeBlockSetNBytes(lt->buffer, lt->nbytes);
     825       23512 :             ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
     826             :         }
     827       23808 :         lt->writing = false;
     828             :     }
     829             :     else
     830             :     {
     831             :         /*
     832             :          * This is only OK if tape is frozen; we rewind for (another) read
     833             :          * pass.
     834             :          */
     835             :         Assert(lt->frozen);
     836             :     }
     837             : 
     838             :     /* Allocate a read buffer (unless the tape is empty) */
     839       23812 :     if (lt->buffer)
     840       23516 :         pfree(lt->buffer);
     841             : 
     842             :     /* the buffer is lazily allocated, but set the size here */
     843       23812 :     lt->buffer = NULL;
     844       23812 :     lt->buffer_size = buffer_size;
     845       23812 : }
     846             : 
     847             : /*
     848             :  * Rewind logical tape and switch from reading to writing.
     849             :  *
     850             :  * NOTE: we assume the caller has read the tape to the end; otherwise
     851             :  * untouched data will not have been freed. We could add more code to free
     852             :  * any unread blocks, but in current usage of this module it'd be useless
     853             :  * code.
     854             :  */
     855             : void
     856       23474 : LogicalTapeRewindForWrite(LogicalTapeSet *lts, int tapenum)
     857             : {
     858             :     LogicalTape *lt;
     859             : 
     860             :     Assert(tapenum >= 0 && tapenum < lts->nTapes);
     861       23474 :     lt = &lts->tapes[tapenum];
     862             : 
     863             :     Assert(!lt->writing && !lt->frozen);
     864       23474 :     lt->writing = true;
     865       23474 :     lt->dirty = false;
     866       23474 :     lt->firstBlockNumber = -1L;
     867       23474 :     lt->curBlockNumber = -1L;
     868       23474 :     lt->pos = 0;
     869       23474 :     lt->nbytes = 0;
     870       23474 :     if (lt->buffer)
     871       23470 :         pfree(lt->buffer);
     872       23474 :     lt->buffer = NULL;
     873       23474 :     lt->buffer_size = 0;
     874       23474 : }
     875             : 
     876             : /*
     877             :  * Read from a logical tape.
     878             :  *
     879             :  * Early EOF is indicated by return value less than #bytes requested.
     880             :  */
     881             : size_t
     882    13566632 : LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
     883             :                 void *ptr, size_t size)
     884             : {
     885             :     LogicalTape *lt;
     886    13566632 :     size_t      nread = 0;
     887             :     size_t      nthistime;
     888             : 
     889             :     Assert(tapenum >= 0 && tapenum < lts->nTapes);
     890    13566632 :     lt = &lts->tapes[tapenum];
     891             :     Assert(!lt->writing);
     892             : 
     893    13566632 :     if (lt->buffer == NULL)
     894       23692 :         ltsInitReadBuffer(lts, lt);
     895             : 
     896    27117072 :     while (size > 0)
     897             :     {
     898    13573428 :         if (lt->pos >= lt->nbytes)
     899             :         {
     900             :             /* Try to load more data into buffer. */
     901       32172 :             if (!ltsReadFillBuffer(lts, lt))
     902       22988 :                 break;          /* EOF */
     903             :         }
     904             : 
     905    13550440 :         nthistime = lt->nbytes - lt->pos;
     906    13550440 :         if (nthistime > size)
     907    13517624 :             nthistime = size;
     908             :         Assert(nthistime > 0);
     909             : 
     910    13550440 :         memcpy(ptr, lt->buffer + lt->pos, nthistime);
     911             : 
     912    13550440 :         lt->pos += nthistime;
     913    13550440 :         ptr = (void *) ((char *) ptr + nthistime);
     914    13550440 :         size -= nthistime;
     915    13550440 :         nread += nthistime;
     916             :     }
     917             : 
     918    13566632 :     return nread;
     919             : }
     920             : 
     921             : /*
     922             :  * "Freeze" the contents of a tape so that it can be read multiple times
     923             :  * and/or read backwards.  Once a tape is frozen, its contents will not
     924             :  * be released until the LogicalTapeSet is destroyed.  This is expected
     925             :  * to be used only for the final output pass of a merge.
     926             :  *
     927             :  * This *must* be called just at the end of a write pass, before the
     928             :  * tape is rewound (after rewind is too late!).  It performs a rewind
     929             :  * and switch to read mode "for free".  An immediately following rewind-
     930             :  * for-read call is OK but not necessary.
     931             :  *
     932             :  * share output argument is set with details of storage used for tape after
     933             :  * freezing, which may be passed to LogicalTapeSetCreate within leader
     934             :  * process later.  This metadata is only of interest to worker callers
     935             :  * freezing their final output for leader (single materialized tape).
     936             :  * Serial sorts should set share to NULL.
     937             :  */
     938             : void
     939         260 : LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
     940             : {
     941             :     LogicalTape *lt;
     942             : 
     943             :     Assert(tapenum >= 0 && tapenum < lts->nTapes);
     944         260 :     lt = &lts->tapes[tapenum];
     945             :     Assert(lt->writing);
     946             :     Assert(lt->offsetBlockNumber == 0L);
     947             : 
     948             :     /*
     949             :      * Completion of a write phase.  Flush last partial data block, and rewind
     950             :      * for nondestructive read.
     951             :      */
     952         260 :     if (lt->dirty)
     953             :     {
     954             :         /*
     955             :          * As long as we've filled the buffer at least once, its contents are
     956             :          * entirely defined from valgrind's point of view, even though
     957             :          * contents beyond the current end point may be stale.  But it's
     958             :          * possible - at least in the case of a parallel sort - to sort such
     959             :          * small amount of data that we do not fill the buffer even once. Tell
     960             :          * valgrind that its contents are defined, so it doesn't bleat.
     961             :          */
     962             :         VALGRIND_MAKE_MEM_DEFINED(lt->buffer + lt->nbytes,
     963             :                                   lt->buffer_size - lt->nbytes);
     964             : 
     965         260 :         TapeBlockSetNBytes(lt->buffer, lt->nbytes);
     966         260 :         ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
     967         260 :         lt->writing = false;
     968             :     }
     969         260 :     lt->writing = false;
     970         260 :     lt->frozen = true;
     971             : 
     972             :     /*
     973             :      * The seek and backspace functions assume a single block read buffer.
     974             :      * That's OK with current usage.  A larger buffer is helpful to make the
     975             :      * read pattern of the backing file look more sequential to the OS, when
     976             :      * we're reading from multiple tapes.  But at the end of a sort, when a
     977             :      * tape is frozen, we only read from a single tape anyway.
     978             :      */
     979         260 :     if (!lt->buffer || lt->buffer_size != BLCKSZ)
     980             :     {
     981           0 :         if (lt->buffer)
     982           0 :             pfree(lt->buffer);
     983           0 :         lt->buffer = palloc(BLCKSZ);
     984           0 :         lt->buffer_size = BLCKSZ;
     985             :     }
     986             : 
     987             :     /* Read the first block, or reset if tape is empty */
     988         260 :     lt->curBlockNumber = lt->firstBlockNumber;
     989         260 :     lt->pos = 0;
     990         260 :     lt->nbytes = 0;
     991             : 
     992         260 :     if (lt->firstBlockNumber == -1L)
     993           0 :         lt->nextBlockNumber = -1L;
     994         260 :     ltsReadBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
     995         260 :     if (TapeBlockIsLast(lt->buffer))
     996         218 :         lt->nextBlockNumber = -1L;
     997             :     else
     998          42 :         lt->nextBlockNumber = TapeBlockGetTrailer(lt->buffer)->next;
     999         260 :     lt->nbytes = TapeBlockGetNBytes(lt->buffer);
    1000             : 
    1001             :     /* Handle extra steps when caller is to share its tapeset */
    1002         260 :     if (share)
    1003             :     {
    1004         248 :         BufFileExportShared(lts->pfile);
    1005         248 :         share->firstblocknumber = lt->firstBlockNumber;
    1006             :     }
    1007         260 : }
    1008             : 
    1009             : /*
    1010             :  * Add additional tapes to this tape set. Not intended to be used when any
    1011             :  * tapes are frozen.
    1012             :  */
    1013             : void
    1014        8948 : LogicalTapeSetExtend(LogicalTapeSet *lts, int nAdditional)
    1015             : {
    1016             :     int         i;
    1017        8948 :     int         nTapesOrig = lts->nTapes;
    1018             : 
    1019        8948 :     lts->nTapes += nAdditional;
    1020             : 
    1021       17896 :     lts->tapes = (LogicalTape *) repalloc(lts->tapes,
    1022        8948 :                                           lts->nTapes * sizeof(LogicalTape));
    1023             : 
    1024       27792 :     for (i = nTapesOrig; i < lts->nTapes; i++)
    1025       18844 :         ltsInitTape(&lts->tapes[i]);
    1026        8948 : }
    1027             : 
    1028             : /*
    1029             :  * Backspace the tape a given number of bytes.  (We also support a more
    1030             :  * general seek interface, see below.)
    1031             :  *
    1032             :  * *Only* a frozen-for-read tape can be backed up; we don't support
    1033             :  * random access during write, and an unfrozen read tape may have
    1034             :  * already discarded the desired data!
    1035             :  *
    1036             :  * Returns the number of bytes backed up.  It can be less than the
    1037             :  * requested amount, if there isn't that much data before the current
    1038             :  * position.  The tape is positioned to the beginning of the tape in
    1039             :  * that case.
    1040             :  */
    1041             : size_t
    1042          48 : LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
    1043             : {
    1044             :     LogicalTape *lt;
    1045          48 :     size_t      seekpos = 0;
    1046             : 
    1047             :     Assert(tapenum >= 0 && tapenum < lts->nTapes);
    1048          48 :     lt = &lts->tapes[tapenum];
    1049             :     Assert(lt->frozen);
    1050             :     Assert(lt->buffer_size == BLCKSZ);
    1051             : 
    1052          48 :     if (lt->buffer == NULL)
    1053           0 :         ltsInitReadBuffer(lts, lt);
    1054             : 
    1055             :     /*
    1056             :      * Easy case for seek within current block.
    1057             :      */
    1058          48 :     if (size <= (size_t) lt->pos)
    1059             :     {
    1060          44 :         lt->pos -= (int) size;
    1061          44 :         return size;
    1062             :     }
    1063             : 
    1064             :     /*
    1065             :      * Not-so-easy case, have to walk back the chain of blocks.  This
    1066             :      * implementation would be pretty inefficient for long seeks, but we
    1067             :      * really aren't doing that (a seek over one tuple is typical).
    1068             :      */
    1069           4 :     seekpos = (size_t) lt->pos; /* part within this block */
    1070           4 :     while (size > seekpos)
    1071             :     {
    1072           4 :         long        prev = TapeBlockGetTrailer(lt->buffer)->prev;
    1073             : 
    1074           4 :         if (prev == -1L)
    1075             :         {
    1076             :             /* Tried to back up beyond the beginning of tape. */
    1077           4 :             if (lt->curBlockNumber != lt->firstBlockNumber)
    1078           0 :                 elog(ERROR, "unexpected end of tape");
    1079           4 :             lt->pos = 0;
    1080           4 :             return seekpos;
    1081             :         }
    1082             : 
    1083           0 :         ltsReadBlock(lts, prev, (void *) lt->buffer);
    1084             : 
    1085           0 :         if (TapeBlockGetTrailer(lt->buffer)->next != lt->curBlockNumber)
    1086           0 :             elog(ERROR, "broken tape, next of block %ld is %ld, expected %ld",
    1087             :                  prev,
    1088             :                  TapeBlockGetTrailer(lt->buffer)->next,
    1089             :                  lt->curBlockNumber);
    1090             : 
    1091           0 :         lt->nbytes = TapeBlockPayloadSize;
    1092           0 :         lt->curBlockNumber = prev;
    1093           0 :         lt->nextBlockNumber = TapeBlockGetTrailer(lt->buffer)->next;
    1094             : 
    1095           0 :         seekpos += TapeBlockPayloadSize;
    1096             :     }
    1097             : 
    1098             :     /*
    1099             :      * 'seekpos' can now be greater than 'size', because it points to the
    1100             :      * beginning the target block.  The difference is the position within the
    1101             :      * page.
    1102             :      */
    1103           0 :     lt->pos = seekpos - size;
    1104           0 :     return size;
    1105             : }
    1106             : 
    1107             : /*
    1108             :  * Seek to an arbitrary position in a logical tape.
    1109             :  *
    1110             :  * *Only* a frozen-for-read tape can be seeked.
    1111             :  *
    1112             :  * Must be called with a block/offset previously returned by
    1113             :  * LogicalTapeTell().
    1114             :  */
    1115             : void
    1116        4128 : LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
    1117             :                 long blocknum, int offset)
    1118             : {
    1119             :     LogicalTape *lt;
    1120             : 
    1121             :     Assert(tapenum >= 0 && tapenum < lts->nTapes);
    1122        4128 :     lt = &lts->tapes[tapenum];
    1123             :     Assert(lt->frozen);
    1124             :     Assert(offset >= 0 && offset <= TapeBlockPayloadSize);
    1125             :     Assert(lt->buffer_size == BLCKSZ);
    1126             : 
    1127        4128 :     if (lt->buffer == NULL)
    1128           0 :         ltsInitReadBuffer(lts, lt);
    1129             : 
    1130        4128 :     if (blocknum != lt->curBlockNumber)
    1131             :     {
    1132          52 :         ltsReadBlock(lts, blocknum, (void *) lt->buffer);
    1133          52 :         lt->curBlockNumber = blocknum;
    1134          52 :         lt->nbytes = TapeBlockPayloadSize;
    1135          52 :         lt->nextBlockNumber = TapeBlockGetTrailer(lt->buffer)->next;
    1136             :     }
    1137             : 
    1138        4128 :     if (offset > lt->nbytes)
    1139           0 :         elog(ERROR, "invalid tape seek position");
    1140        4128 :     lt->pos = offset;
    1141        4128 : }
    1142             : 
    1143             : /*
    1144             :  * Obtain current position in a form suitable for a later LogicalTapeSeek.
    1145             :  *
    1146             :  * NOTE: it'd be OK to do this during write phase with intention of using
    1147             :  * the position for a seek after freezing.  Not clear if anyone needs that.
    1148             :  */
    1149             : void
    1150        5872 : LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
    1151             :                 long *blocknum, int *offset)
    1152             : {
    1153             :     LogicalTape *lt;
    1154             : 
    1155             :     Assert(tapenum >= 0 && tapenum < lts->nTapes);
    1156        5872 :     lt = &lts->tapes[tapenum];
    1157             : 
    1158        5872 :     if (lt->buffer == NULL)
    1159           0 :         ltsInitReadBuffer(lts, lt);
    1160             : 
    1161             :     Assert(lt->offsetBlockNumber == 0L);
    1162             : 
    1163             :     /* With a larger buffer, 'pos' wouldn't be the same as offset within page */
    1164             :     Assert(lt->buffer_size == BLCKSZ);
    1165             : 
    1166        5872 :     *blocknum = lt->curBlockNumber;
    1167        5872 :     *offset = lt->pos;
    1168        5872 : }
    1169             : 
    1170             : /*
    1171             :  * Obtain total disk space currently used by a LogicalTapeSet, in blocks.
    1172             :  */
    1173             : long
    1174       23448 : LogicalTapeSetBlocks(LogicalTapeSet *lts)
    1175             : {
    1176       23448 :     return lts->nBlocksAllocated - lts->nHoleBlocks;
    1177             : }

Generated by: LCOV version 1.13