LCOV - code coverage report
Current view: top level - src/backend/utils/sort - logtape.c (source / functions) Hit Total Coverage
Test: PostgreSQL 15devel Lines: 307 334 91.9 %
Date: 2021-12-05 02:08:31 Functions: 27 27 100.0 %
Legend: Lines: hit not hit

          Line data    Source code
       1             : /*-------------------------------------------------------------------------
       2             :  *
       3             :  * logtape.c
       4             :  *    Management of "logical tapes" within temporary files.
       5             :  *
       6             :  * This module exists to support sorting via multiple merge passes (see
       7             :  * tuplesort.c).  Merging is an ideal algorithm for tape devices, but if
       8             :  * we implement it on disk by creating a separate file for each "tape",
       9             :  * there is an annoying problem: the peak space usage is at least twice
      10             :  * the volume of actual data to be sorted.  (This must be so because each
      11             :  * datum will appear in both the input and output tapes of the final
      12             :  * merge pass.)
      13             :  *
      14             :  * We can work around this problem by recognizing that any one tape
      15             :  * dataset (with the possible exception of the final output) is written
      16             :  * and read exactly once in a perfectly sequential manner.  Therefore,
      17             :  * a datum once read will not be required again, and we can recycle its
      18             :  * space for use by the new tape dataset(s) being generated.  In this way,
      19             :  * the total space usage is essentially just the actual data volume, plus
      20             :  * insignificant bookkeeping and start/stop overhead.
      21             :  *
      22             :  * Few OSes allow arbitrary parts of a file to be released back to the OS,
      23             :  * so we have to implement this space-recycling ourselves within a single
      24             :  * logical file.  logtape.c exists to perform this bookkeeping and provide
      25             :  * the illusion of N independent tape devices to tuplesort.c.  Note that
      26             :  * logtape.c itself depends on buffile.c to provide a "logical file" of
      27             :  * larger size than the underlying OS may support.
      28             :  *
      29             :  * For simplicity, we allocate and release space in the underlying file
      30             :  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
      31             :  * of which blocks in the underlying file belong to which logical tape,
      32             :  * plus any blocks that are free (recycled and not yet reused).
      33             :  * The blocks in each logical tape form a chain, with a prev- and next-
      34             :  * pointer in each block.
      35             :  *
      36             :  * The initial write pass is guaranteed to fill the underlying file
      37             :  * perfectly sequentially, no matter how data is divided into logical tapes.
      38             :  * Once we begin merge passes, the access pattern becomes considerably
      39             :  * less predictable --- but the seeking involved should be comparable to
      40             :  * what would happen if we kept each logical tape in a separate file,
      41             :  * so there's no serious performance penalty paid to obtain the space
      42             :  * savings of recycling.  We try to localize the write accesses by always
      43             :  * writing to the lowest-numbered free block when we have a choice; it's
      44             :  * not clear this helps much, but it can't hurt.  (XXX perhaps a LIFO
      45             :  * policy for free blocks would be better?)
      46             :  *
      47             :  * To further make the I/Os more sequential, we can use a larger buffer
      48             :  * when reading, and read multiple blocks from the same tape in one go,
      49             :  * whenever the buffer becomes empty.
      50             :  *
      51             :  * To support the above policy of writing to the lowest free block, the
      52             :  * freelist is a min heap.
      53             :  *
      54             :  * Since all the bookkeeping and buffer memory is allocated with palloc(),
      55             :  * and the underlying file(s) are made with OpenTemporaryFile, all resources
      56             :  * for a logical tape set are certain to be cleaned up even if processing
      57             :  * is aborted by ereport(ERROR).  To avoid confusion, the caller should take
      58             :  * care that all calls for a single LogicalTapeSet are made in the same
      59             :  * palloc context.
      60             :  *
      61             :  * To support parallel sort operations involving coordinated callers to
      62             :  * tuplesort.c routines across multiple workers, it is necessary to
      63             :  * concatenate each worker BufFile/tapeset into one single logical tapeset
      64             :  * managed by the leader.  Workers should have produced one final
      65             :  * materialized tape (their entire output) when this happens in leader.
      66             :  * There will always be the same number of runs as input tapes, and the same
      67             :  * number of input tapes as participants (worker Tuplesortstates).
      68             :  *
      69             :  * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
      70             :  * Portions Copyright (c) 1994, Regents of the University of California
      71             :  *
      72             :  * IDENTIFICATION
      73             :  *    src/backend/utils/sort/logtape.c
      74             :  *
      75             :  *-------------------------------------------------------------------------
      76             :  */
      77             : 
      78             : #include "postgres.h"
      79             : 
      80             : #include <fcntl.h>
      81             : 
      82             : #include "storage/buffile.h"
      83             : #include "utils/builtins.h"
      84             : #include "utils/logtape.h"
      85             : #include "utils/memdebug.h"
      86             : #include "utils/memutils.h"
      87             : 
      88             : /*
      89             :  * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
      90             :  *
      91             :  * The first block of a tape has prev == -1.  The last block of a tape
      92             :  * stores the number of valid bytes on the block, inverted, in 'next'
      93             :  * Therefore next < 0 indicates the last block.
      94             :  */
      95             : typedef struct TapeBlockTrailer
      96             : {
      97             :     long        prev;           /* previous block on this tape, or -1 on first
      98             :                                  * block */
      99             :     long        next;           /* next block on this tape, or # of valid
     100             :                                  * bytes on last block (if < 0) */
     101             : } TapeBlockTrailer;
     102             : 
     103             : #define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
     104             : #define TapeBlockGetTrailer(buf) \
     105             :     ((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
     106             : 
     107             : #define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
     108             : #define TapeBlockGetNBytes(buf) \
     109             :     (TapeBlockIsLast(buf) ? \
     110             :      (- TapeBlockGetTrailer(buf)->next) : TapeBlockPayloadSize)
     111             : #define TapeBlockSetNBytes(buf, nbytes) \
     112             :     (TapeBlockGetTrailer(buf)->next = -(nbytes))
     113             : 
     114             : /*
     115             :  * When multiple tapes are being written to concurrently (as in HashAgg),
     116             :  * avoid excessive fragmentation by preallocating block numbers to individual
     117             :  * tapes. Each preallocation doubles in size starting at
     118             :  * TAPE_WRITE_PREALLOC_MIN blocks up to TAPE_WRITE_PREALLOC_MAX blocks.
     119             :  *
     120             :  * No filesystem operations are performed for preallocation; only the block
     121             :  * numbers are reserved. This may lead to sparse writes, which will cause
     122             :  * ltsWriteBlock() to fill in holes with zeros.
     123             :  */
     124             : #define TAPE_WRITE_PREALLOC_MIN 8
     125             : #define TAPE_WRITE_PREALLOC_MAX 128
     126             : 
     127             : /*
     128             :  * This data structure represents a single "logical tape" within the set
     129             :  * of logical tapes stored in the same file.
     130             :  *
     131             :  * While writing, we hold the current partially-written data block in the
     132             :  * buffer.  While reading, we can hold multiple blocks in the buffer.  Note
     133             :  * that we don't retain the trailers of a block when it's read into the
     134             :  * buffer.  The buffer therefore contains one large contiguous chunk of data
     135             :  * from the tape.
     136             :  */
     137             : struct LogicalTape
     138             : {
     139             :     LogicalTapeSet *tapeSet;    /* tape set this tape is part of */
     140             : 
     141             :     bool        writing;        /* T while in write phase */
     142             :     bool        frozen;         /* T if blocks should not be freed when read */
     143             :     bool        dirty;          /* does buffer need to be written? */
     144             : 
     145             :     /*
     146             :      * Block numbers of the first, current, and next block of the tape.
     147             :      *
     148             :      * The "current" block number is only valid when writing, or reading from
     149             :      * a frozen tape.  (When reading from an unfrozen tape, we use a larger
     150             :      * read buffer that holds multiple blocks, so the "current" block is
     151             :      * ambiguous.)
     152             :      *
     153             :      * When concatenation of worker tape BufFiles is performed, an offset to
     154             :      * the first block in the unified BufFile space is applied during reads.
     155             :      */
     156             :     long        firstBlockNumber;
     157             :     long        curBlockNumber;
     158             :     long        nextBlockNumber;
     159             :     long        offsetBlockNumber;
     160             : 
     161             :     /*
     162             :      * Buffer for current data block(s).
     163             :      */
     164             :     char       *buffer;         /* physical buffer (separately palloc'd) */
     165             :     int         buffer_size;    /* allocated size of the buffer */
     166             :     int         max_size;       /* highest useful, safe buffer_size */
     167             :     int         pos;            /* next read/write position in buffer */
     168             :     int         nbytes;         /* total # of valid bytes in buffer */
     169             : 
     170             :     /*
     171             :      * Preallocated block numbers are held in an array sorted in descending
     172             :      * order; blocks are consumed from the end of the array (lowest block
     173             :      * numbers first).
     174             :      */
     175             :     long       *prealloc;
     176             :     int         nprealloc;      /* number of elements in list */
     177             :     int         prealloc_size;  /* number of elements list can hold */
     178             : };
     179             : 
     180             : /*
     181             :  * This data structure represents a set of related "logical tapes" sharing
     182             :  * space in a single underlying file.  (But that "file" may be multiple files
     183             :  * if needed to escape OS limits on file size; buffile.c handles that for us.)
     184             :  * Tapes belonging to a tape set can be created and destroyed on-the-fly, on
     185             :  * demand.
     186             :  */
     187             : struct LogicalTapeSet
     188             : {
     189             :     BufFile    *pfile;          /* underlying file for whole tape set */
     190             :     SharedFileSet *fileset;
     191             :     int         worker;         /* worker # if shared, -1 for leader/serial */
     192             : 
     193             :     /*
     194             :      * File size tracking.  nBlocksWritten is the size of the underlying file,
     195             :      * in BLCKSZ blocks.  nBlocksAllocated is the number of blocks allocated
     196             :      * by ltsReleaseBlock(), and it is always greater than or equal to
     197             :      * nBlocksWritten.  Blocks between nBlocksAllocated and nBlocksWritten are
     198             :      * blocks that have been allocated for a tape, but have not been written
     199             :      * to the underlying file yet.  nHoleBlocks tracks the total number of
     200             :      * blocks that are in unused holes between worker spaces following BufFile
     201             :      * concatenation.
     202             :      */
     203             :     long        nBlocksAllocated;   /* # of blocks allocated */
     204             :     long        nBlocksWritten; /* # of blocks used in underlying file */
     205             :     long        nHoleBlocks;    /* # of "hole" blocks left */
     206             : 
     207             :     /*
     208             :      * We store the numbers of recycled-and-available blocks in freeBlocks[].
     209             :      * When there are no such blocks, we extend the underlying file.
     210             :      *
     211             :      * If forgetFreeSpace is true then any freed blocks are simply forgotten
     212             :      * rather than being remembered in freeBlocks[].  See notes for
     213             :      * LogicalTapeSetForgetFreeSpace().
     214             :      */
     215             :     bool        forgetFreeSpace;    /* are we remembering free blocks? */
     216             :     long       *freeBlocks;     /* resizable array holding minheap */
     217             :     long        nFreeBlocks;    /* # of currently free blocks */
     218             :     Size        freeBlocksLen;  /* current allocated length of freeBlocks[] */
     219             :     bool        enable_prealloc;    /* preallocate write blocks? */
     220             : };
     221             : 
     222             : static LogicalTape *ltsCreateTape(LogicalTapeSet *lts);
     223             : static void ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer);
     224             : static void ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer);
     225             : static long ltsGetBlock(LogicalTapeSet *lts, LogicalTape *lt);
     226             : static long ltsGetFreeBlock(LogicalTapeSet *lts);
     227             : static long ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt);
     228             : static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
     229             : static void ltsInitReadBuffer(LogicalTape *lt);
     230             : 
     231             : 
     232             : /*
     233             :  * Write a block-sized buffer to the specified block of the underlying file.
     234             :  *
     235             :  * No need for an error return convention; we ereport() on any error.
     236             :  */
     237             : static void
     238       39470 : ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
     239             : {
     240             :     /*
     241             :      * BufFile does not support "holes", so if we're about to write a block
     242             :      * that's past the current end of file, fill the space between the current
     243             :      * end of file and the target block with zeros.
     244             :      *
     245             :      * This can happen either when tapes preallocate blocks; or for the last
     246             :      * block of a tape which might not have been flushed.
     247             :      *
     248             :      * Note that BufFile concatenation can leave "holes" in BufFile between
     249             :      * worker-owned block ranges.  These are tracked for reporting purposes
     250             :      * only.  We never read from nor write to these hole blocks, and so they
     251             :      * are not considered here.
     252             :      */
     253       42414 :     while (blocknum > lts->nBlocksWritten)
     254             :     {
     255             :         PGAlignedBlock zerobuf;
     256             : 
     257        2944 :         MemSet(zerobuf.data, 0, sizeof(zerobuf));
     258             : 
     259        2944 :         ltsWriteBlock(lts, lts->nBlocksWritten, zerobuf.data);
     260             :     }
     261             : 
     262             :     /* Write the requested block */
     263       39470 :     if (BufFileSeekBlock(lts->pfile, blocknum) != 0)
     264           0 :         ereport(ERROR,
     265             :                 (errcode_for_file_access(),
     266             :                  errmsg("could not seek to block %ld of temporary file",
     267             :                         blocknum)));
     268       39470 :     BufFileWrite(lts->pfile, buffer, BLCKSZ);
     269             : 
     270             :     /* Update nBlocksWritten, if we extended the file */
     271       39470 :     if (blocknum == lts->nBlocksWritten)
     272       15546 :         lts->nBlocksWritten++;
     273       39470 : }
     274             : 
     275             : /*
     276             :  * Read a block-sized buffer from the specified block of the underlying file.
     277             :  *
     278             :  * No need for an error return convention; we ereport() on any error.   This
     279             :  * module should never attempt to read a block it doesn't know is there.
     280             :  */
     281             : static void
     282       36294 : ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
     283             : {
     284             :     size_t      nread;
     285             : 
     286       36294 :     if (BufFileSeekBlock(lts->pfile, blocknum) != 0)
     287           0 :         ereport(ERROR,
     288             :                 (errcode_for_file_access(),
     289             :                  errmsg("could not seek to block %ld of temporary file",
     290             :                         blocknum)));
     291       36294 :     nread = BufFileRead(lts->pfile, buffer, BLCKSZ);
     292       36294 :     if (nread != BLCKSZ)
     293           0 :         ereport(ERROR,
     294             :                 (errcode_for_file_access(),
     295             :                  errmsg("could not read block %ld of temporary file: read only %zu of %zu bytes",
     296             :                         blocknum, nread, (size_t) BLCKSZ)));
     297       36294 : }
     298             : 
     299             : /*
     300             :  * Read as many blocks as we can into the per-tape buffer.
     301             :  *
     302             :  * Returns true if anything was read, 'false' on EOF.
     303             :  */
     304             : static bool
     305       44056 : ltsReadFillBuffer(LogicalTape *lt)
     306             : {
     307       44056 :     lt->pos = 0;
     308       44056 :     lt->nbytes = 0;
     309             : 
     310             :     do
     311             :     {
     312       54994 :         char       *thisbuf = lt->buffer + lt->nbytes;
     313       54994 :         long        datablocknum = lt->nextBlockNumber;
     314             : 
     315             :         /* Fetch next block number */
     316       54994 :         if (datablocknum == -1L)
     317       19044 :             break;              /* EOF */
     318             :         /* Apply worker offset, needed for leader tapesets */
     319       35950 :         datablocknum += lt->offsetBlockNumber;
     320             : 
     321             :         /* Read the block */
     322       35950 :         ltsReadBlock(lt->tapeSet, datablocknum, (void *) thisbuf);
     323       35950 :         if (!lt->frozen)
     324       35482 :             ltsReleaseBlock(lt->tapeSet, datablocknum);
     325       35950 :         lt->curBlockNumber = lt->nextBlockNumber;
     326             : 
     327       35950 :         lt->nbytes += TapeBlockGetNBytes(thisbuf);
     328       35950 :         if (TapeBlockIsLast(thisbuf))
     329             :         {
     330       19872 :             lt->nextBlockNumber = -1L;
     331             :             /* EOF */
     332       19872 :             break;
     333             :         }
     334             :         else
     335       16078 :             lt->nextBlockNumber = TapeBlockGetTrailer(thisbuf)->next;
     336             : 
     337             :         /* Advance to next block, if we have buffer space left */
     338       16078 :     } while (lt->buffer_size - lt->nbytes > BLCKSZ);
     339             : 
     340       44056 :     return (lt->nbytes > 0);
     341             : }
     342             : 
     343             : static inline void
     344     1436460 : swap_nodes(long *heap, unsigned long a, unsigned long b)
     345             : {
     346             :     unsigned long swap;
     347             : 
     348     1436460 :     swap = heap[a];
     349     1436460 :     heap[a] = heap[b];
     350     1436460 :     heap[b] = swap;
     351     1436460 : }
     352             : 
     353             : static inline unsigned long
     354     1059020 : left_offset(unsigned long i)
     355             : {
     356     1059020 :     return 2 * i + 1;
     357             : }
     358             : 
     359             : static inline unsigned long
     360     1059020 : right_offset(unsigned i)
     361             : {
     362     1059020 :     return 2 * i + 2;
     363             : }
     364             : 
     365             : static inline unsigned long
     366      662868 : parent_offset(unsigned long i)
     367             : {
     368      662868 :     return (i - 1) / 2;
     369             : }
     370             : 
     371             : /*
     372             :  * Get the next block for writing.
     373             :  */
     374             : static long
     375       36526 : ltsGetBlock(LogicalTapeSet *lts, LogicalTape *lt)
     376             : {
     377       36526 :     if (lts->enable_prealloc)
     378       20044 :         return ltsGetPreallocBlock(lts, lt);
     379             :     else
     380       16482 :         return ltsGetFreeBlock(lts);
     381             : }
     382             : 
     383             : /*
     384             :  * Select the lowest currently unused block from the tape set's global free
     385             :  * list min heap.
     386             :  */
     387             : static long
     388      169090 : ltsGetFreeBlock(LogicalTapeSet *lts)
     389             : {
     390      169090 :     long       *heap = lts->freeBlocks;
     391             :     long        blocknum;
     392             :     int         heapsize;
     393             :     unsigned long pos;
     394             : 
     395             :     /* freelist empty; allocate a new block */
     396      169090 :     if (lts->nFreeBlocks == 0)
     397       15806 :         return lts->nBlocksAllocated++;
     398             : 
     399      153284 :     if (lts->nFreeBlocks == 1)
     400             :     {
     401         416 :         lts->nFreeBlocks--;
     402         416 :         return lts->freeBlocks[0];
     403             :     }
     404             : 
     405             :     /* take top of minheap */
     406      152868 :     blocknum = heap[0];
     407             : 
     408             :     /* replace with end of minheap array */
     409      152868 :     heap[0] = heap[--lts->nFreeBlocks];
     410             : 
     411             :     /* sift down */
     412      152868 :     pos = 0;
     413      152868 :     heapsize = lts->nFreeBlocks;
     414             :     while (true)
     415      906152 :     {
     416     1059020 :         unsigned long left = left_offset(pos);
     417     1059020 :         unsigned long right = right_offset(pos);
     418             :         unsigned long min_child;
     419             : 
     420     1059020 :         if (left < heapsize && right < heapsize)
     421      915004 :             min_child = (heap[left] < heap[right]) ? left : right;
     422      144016 :         else if (left < heapsize)
     423       30128 :             min_child = left;
     424      113888 :         else if (right < heapsize)
     425           0 :             min_child = right;
     426             :         else
     427      113888 :             break;
     428             : 
     429      945132 :         if (heap[min_child] >= heap[pos])
     430       38980 :             break;
     431             : 
     432      906152 :         swap_nodes(heap, min_child, pos);
     433      906152 :         pos = min_child;
     434             :     }
     435             : 
     436      152868 :     return blocknum;
     437             : }
     438             : 
     439             : /*
     440             :  * Return the lowest free block number from the tape's preallocation list.
     441             :  * Refill the preallocation list with blocks from the tape set's free list if
     442             :  * necessary.
     443             :  */
     444             : static long
     445       20044 : ltsGetPreallocBlock(LogicalTapeSet *lts, LogicalTape *lt)
     446             : {
     447             :     /* sorted in descending order, so return the last element */
     448       20044 :     if (lt->nprealloc > 0)
     449         984 :         return lt->prealloc[--lt->nprealloc];
     450             : 
     451       19060 :     if (lt->prealloc == NULL)
     452             :     {
     453       19044 :         lt->prealloc_size = TAPE_WRITE_PREALLOC_MIN;
     454       19044 :         lt->prealloc = (long *) palloc(sizeof(long) * lt->prealloc_size);
     455             :     }
     456          16 :     else if (lt->prealloc_size < TAPE_WRITE_PREALLOC_MAX)
     457             :     {
     458             :         /* when the preallocation list runs out, double the size */
     459          16 :         lt->prealloc_size *= 2;
     460          16 :         if (lt->prealloc_size > TAPE_WRITE_PREALLOC_MAX)
     461           0 :             lt->prealloc_size = TAPE_WRITE_PREALLOC_MAX;
     462          16 :         lt->prealloc = (long *) repalloc(lt->prealloc,
     463          16 :                                          sizeof(long) * lt->prealloc_size);
     464             :     }
     465             : 
     466             :     /* refill preallocation list */
     467       19060 :     lt->nprealloc = lt->prealloc_size;
     468      171668 :     for (int i = lt->nprealloc; i > 0; i--)
     469             :     {
     470      152608 :         lt->prealloc[i - 1] = ltsGetFreeBlock(lts);
     471             : 
     472             :         /* verify descending order */
     473             :         Assert(i == lt->nprealloc || lt->prealloc[i - 1] > lt->prealloc[i]);
     474             :     }
     475             : 
     476       19060 :     return lt->prealloc[--lt->nprealloc];
     477             : }
     478             : 
     479             : /*
     480             :  * Return a block# to the freelist.
     481             :  */
     482             : static void
     483      168046 : ltsReleaseBlock(LogicalTapeSet *lts, long blocknum)
     484             : {
     485             :     long       *heap;
     486             :     unsigned long pos;
     487             : 
     488             :     /*
     489             :      * Do nothing if we're no longer interested in remembering free space.
     490             :      */
     491      168046 :     if (lts->forgetFreeSpace)
     492       11778 :         return;
     493             : 
     494             :     /*
     495             :      * Enlarge freeBlocks array if full.
     496             :      */
     497      156268 :     if (lts->nFreeBlocks >= lts->freeBlocksLen)
     498             :     {
     499             :         /*
     500             :          * If the freelist becomes very large, just return and leak this free
     501             :          * block.
     502             :          */
     503          40 :         if (lts->freeBlocksLen * 2 * sizeof(long) > MaxAllocSize)
     504           0 :             return;
     505             : 
     506          40 :         lts->freeBlocksLen *= 2;
     507          40 :         lts->freeBlocks = (long *) repalloc(lts->freeBlocks,
     508          40 :                                             lts->freeBlocksLen * sizeof(long));
     509             :     }
     510             : 
     511      156268 :     heap = lts->freeBlocks;
     512      156268 :     pos = lts->nFreeBlocks;
     513             : 
     514             :     /* place entry at end of minheap array */
     515      156268 :     heap[pos] = blocknum;
     516      156268 :     lts->nFreeBlocks++;
     517             : 
     518             :     /* sift up */
     519      686576 :     while (pos != 0)
     520             :     {
     521      662868 :         unsigned long parent = parent_offset(pos);
     522             : 
     523      662868 :         if (heap[parent] < heap[pos])
     524      132560 :             break;
     525             : 
     526      530308 :         swap_nodes(heap, parent, pos);
     527      530308 :         pos = parent;
     528             :     }
     529             : }
     530             : 
     531             : /*
     532             :  * Lazily allocate and initialize the read buffer. This avoids waste when many
     533             :  * tapes are open at once, but not all are active between rewinding and
     534             :  * reading.
     535             :  */
     536             : static void
     537       19876 : ltsInitReadBuffer(LogicalTape *lt)
     538             : {
     539             :     Assert(lt->buffer_size > 0);
     540       19876 :     lt->buffer = palloc(lt->buffer_size);
     541             : 
     542             :     /* Read the first block, or reset if tape is empty */
     543       19876 :     lt->nextBlockNumber = lt->firstBlockNumber;
     544       19876 :     lt->pos = 0;
     545       19876 :     lt->nbytes = 0;
     546       19876 :     ltsReadFillBuffer(lt);
     547       19876 : }
     548             : 
     549             : /*
     550             :  * Create a tape set, backed by a temporary underlying file.
     551             :  *
     552             :  * The tape set is initially empty. Use LogicalTapeCreate() to create
     553             :  * tapes in it.
     554             :  *
     555             :  * Serial callers pass NULL argument for shared, and -1 for worker.  Parallel
     556             :  * worker callers pass a shared file handle and their own worker number.
     557             :  *
     558             :  * Leader callers pass a shared file handle and -1 for worker. After creating
     559             :  * the tape set, use LogicalTapeImport() to import the worker tapes into it.
     560             :  *
     561             :  * Currently, the leader will only import worker tapes into the set, it does
     562             :  * not create tapes of its own, although in principle that should work.
     563             :  */
     564             : LogicalTapeSet *
     565         500 : LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)
     566             : {
     567             :     LogicalTapeSet *lts;
     568             : 
     569             :     /*
     570             :      * Create top-level struct including per-tape LogicalTape structs.
     571             :      */
     572         500 :     lts = (LogicalTapeSet *) palloc(sizeof(LogicalTapeSet));
     573         500 :     lts->nBlocksAllocated = 0L;
     574         500 :     lts->nBlocksWritten = 0L;
     575         500 :     lts->nHoleBlocks = 0L;
     576         500 :     lts->forgetFreeSpace = false;
     577         500 :     lts->freeBlocksLen = 32; /* reasonable initial guess */
     578         500 :     lts->freeBlocks = (long *) palloc(lts->freeBlocksLen * sizeof(long));
     579         500 :     lts->nFreeBlocks = 0;
     580         500 :     lts->enable_prealloc = preallocate;
     581             : 
     582         500 :     lts->fileset = fileset;
     583         500 :     lts->worker = worker;
     584             : 
     585             :     /*
     586             :      * Create temp BufFile storage as required.
     587             :      *
     588             :      * In leader, we hijack the BufFile of the first tape that's imported, and
     589             :      * concatenate the BufFiles of any subsequent tapes to that. Hence don't
     590             :      * create a BufFile here. Things are simpler for the worker case and the
     591             :      * serial case, though.  They are generally very similar -- workers use a
     592             :      * shared fileset, whereas serial sorts use a conventional serial BufFile.
     593             :      */
     594         500 :     if (fileset && worker == -1)
     595          96 :         lts->pfile = NULL;
     596         404 :     else if (fileset)
     597             :     {
     598             :         char        filename[MAXPGPATH];
     599             : 
     600         280 :         pg_itoa(worker, filename);
     601         280 :         lts->pfile = BufFileCreateFileSet(&fileset->fs, filename);
     602             :     }
     603             :     else
     604         124 :         lts->pfile = BufFileCreateTemp(false);
     605             : 
     606         500 :     return lts;
     607             : }
     608             : 
     609             : /*
     610             :  * Claim ownership of a logical tape from an existing shared BufFile.
     611             :  *
     612             :  * Caller should be leader process.  Though tapes are marked as frozen in
     613             :  * workers, they are not frozen when opened within leader, since unfrozen tapes
     614             :  * use a larger read buffer. (Frozen tapes have smaller read buffer, optimized
     615             :  * for random access.)
     616             :  */
     617             : LogicalTape *
     618         192 : LogicalTapeImport(LogicalTapeSet *lts, int worker, TapeShare *shared)
     619             : {
     620             :     LogicalTape *lt;
     621             :     long        tapeblocks;
     622             :     char        filename[MAXPGPATH];
     623             :     BufFile    *file;
     624             :     int64       filesize;
     625             : 
     626         192 :     lt = ltsCreateTape(lts);
     627             : 
     628             :     /*
     629             :      * build concatenated view of all buffiles, remembering the block number
     630             :      * where each source file begins.
     631             :      */
     632         192 :     pg_itoa(worker, filename);
     633         192 :     file = BufFileOpenFileSet(&lts->fileset->fs, filename, O_RDONLY, false);
     634         192 :     filesize = BufFileSize(file);
     635             : 
     636             :     /*
     637             :      * Stash first BufFile, and concatenate subsequent BufFiles to that. Store
     638             :      * block offset into each tape as we go.
     639             :      */
     640         192 :     lt->firstBlockNumber = shared->firstblocknumber;
     641         192 :     if (lts->pfile == NULL)
     642             :     {
     643          96 :         lts->pfile = file;
     644          96 :         lt->offsetBlockNumber = 0L;
     645             :     }
     646             :     else
     647             :     {
     648          96 :         lt->offsetBlockNumber = BufFileAppend(lts->pfile, file);
     649             :     }
     650             :     /* Don't allocate more for read buffer than could possibly help */
     651         192 :     lt->max_size = Min(MaxAllocSize, filesize);
     652         192 :     tapeblocks = filesize / BLCKSZ;
     653             : 
     654             :     /*
     655             :      * Update # of allocated blocks and # blocks written to reflect the
     656             :      * imported BufFile.  Allocated/written blocks include space used by holes
     657             :      * left between concatenated BufFiles.  Also track the number of hole
     658             :      * blocks so that we can later work backwards to calculate the number of
     659             :      * physical blocks for instrumentation.
     660             :      */
     661         192 :     lts->nHoleBlocks += lt->offsetBlockNumber - lts->nBlocksAllocated;
     662             : 
     663         192 :     lts->nBlocksAllocated = lt->offsetBlockNumber + tapeblocks;
     664         192 :     lts->nBlocksWritten = lts->nBlocksAllocated;
     665             : 
     666         192 :     return lt;
     667             : }
     668             : 
     669             : /*
     670             :  * Close a logical tape set and release all resources.
     671             :  *
     672             :  * NOTE: This doesn't close any of the tapes!  You must close them
     673             :  * first, or you can let them be destroyed along with the memory context.
     674             :  */
     675             : void
     676         500 : LogicalTapeSetClose(LogicalTapeSet *lts)
     677             : {
     678         500 :     BufFileClose(lts->pfile);
     679         500 :     pfree(lts->freeBlocks);
     680         500 :     pfree(lts);
     681         500 : }
     682             : 
     683             : /*
     684             :  * Create a logical tape in the given tapeset.
     685             :  *
     686             :  * The tape is initialized in write state.
     687             :  */
     688             : LogicalTape *
     689       35648 : LogicalTapeCreate(LogicalTapeSet *lts)
     690             : {
     691             :     /*
     692             :      * The only thing that currently prevents creating new tapes in leader is
     693             :      * the fact that BufFiles opened using BufFileOpenShared() are read-only
     694             :      * by definition, but that could be changed if it seemed worthwhile.  For
     695             :      * now, writing to the leader tape will raise a "Bad file descriptor"
     696             :      * error, so tuplesort must avoid writing to the leader tape altogether.
     697             :      */
     698       35648 :     if (lts->fileset && lts->worker == -1)
     699           0 :         elog(ERROR, "cannot create new tapes in leader process");
     700             : 
     701       35648 :     return ltsCreateTape(lts);
     702             : }
     703             : 
     704             : static LogicalTape *
     705       35840 : ltsCreateTape(LogicalTapeSet *lts)
     706             : {
     707             :     LogicalTape *lt;
     708             : 
     709             :     /*
     710             :      * Create per-tape struct.  Note we allocate the I/O buffer lazily.
     711             :      */
     712       35840 :     lt = palloc(sizeof(LogicalTape));
     713       35840 :     lt->tapeSet = lts;
     714       35840 :     lt->writing = true;
     715       35840 :     lt->frozen = false;
     716       35840 :     lt->dirty = false;
     717       35840 :     lt->firstBlockNumber = -1L;
     718       35840 :     lt->curBlockNumber = -1L;
     719       35840 :     lt->nextBlockNumber = -1L;
     720       35840 :     lt->offsetBlockNumber = 0L;
     721       35840 :     lt->buffer = NULL;
     722       35840 :     lt->buffer_size = 0;
     723             :     /* palloc() larger than MaxAllocSize would fail */
     724       35840 :     lt->max_size = MaxAllocSize;
     725       35840 :     lt->pos = 0;
     726       35840 :     lt->nbytes = 0;
     727       35840 :     lt->prealloc = NULL;
     728       35840 :     lt->nprealloc = 0;
     729       35840 :     lt->prealloc_size = 0;
     730             : 
     731       35840 :     return lt;
     732             : }
     733             : 
     734             : /*
     735             :  * Close a logical tape.
     736             :  *
     737             :  * Note: This doesn't return any blocks to the free list!  You must read
     738             :  * the tape to the end first, to reuse the space.  In current use, though,
     739             :  * we only close tapes after fully reading them.
     740             :  */
     741             : void
     742       19702 : LogicalTapeClose(LogicalTape *lt)
     743             : {
     744       19702 :     if (lt->buffer)
     745       19702 :         pfree(lt->buffer);
     746       19702 :     pfree(lt);
     747       19702 : }
     748             : 
     749             : /*
     750             :  * Mark a logical tape set as not needing management of free space anymore.
     751             :  *
     752             :  * This should be called if the caller does not intend to write any more data
     753             :  * into the tape set, but is reading from un-frozen tapes.  Since no more
     754             :  * writes are planned, remembering free blocks is no longer useful.  Setting
     755             :  * this flag lets us avoid wasting time and space in ltsReleaseBlock(), which
     756             :  * is not designed to handle large numbers of free blocks.
     757             :  */
     758             : void
     759         164 : LogicalTapeSetForgetFreeSpace(LogicalTapeSet *lts)
     760             : {
     761         164 :     lts->forgetFreeSpace = true;
     762         164 : }
     763             : 
     764             : /*
     765             :  * Write to a logical tape.
     766             :  *
     767             :  * There are no error returns; we ereport() on failure.
     768             :  */
     769             : void
     770    12808760 : LogicalTapeWrite(LogicalTape *lt, void *ptr, size_t size)
     771             : {
     772    12808760 :     LogicalTapeSet *lts = lt->tapeSet;
     773             :     size_t      nthistime;
     774             : 
     775             :     Assert(lt->writing);
     776             :     Assert(lt->offsetBlockNumber == 0L);
     777             : 
     778             :     /* Allocate data buffer and first block on first write */
     779    12808760 :     if (lt->buffer == NULL)
     780             :     {
     781       19972 :         lt->buffer = (char *) palloc(BLCKSZ);
     782       19972 :         lt->buffer_size = BLCKSZ;
     783             :     }
     784    12808760 :     if (lt->curBlockNumber == -1)
     785             :     {
     786             :         Assert(lt->firstBlockNumber == -1);
     787             :         Assert(lt->pos == 0);
     788             : 
     789       19972 :         lt->curBlockNumber = ltsGetBlock(lts, lt);
     790       19972 :         lt->firstBlockNumber = lt->curBlockNumber;
     791             : 
     792       19972 :         TapeBlockGetTrailer(lt->buffer)->prev = -1L;
     793             :     }
     794             : 
     795             :     Assert(lt->buffer_size == BLCKSZ);
     796    25628412 :     while (size > 0)
     797             :     {
     798    12819652 :         if (lt->pos >= (int) TapeBlockPayloadSize)
     799             :         {
     800             :             /* Buffer full, dump it out */
     801             :             long        nextBlockNumber;
     802             : 
     803       16554 :             if (!lt->dirty)
     804             :             {
     805             :                 /* Hmm, went directly from reading to writing? */
     806           0 :                 elog(ERROR, "invalid logtape state: should be dirty");
     807             :             }
     808             : 
     809             :             /*
     810             :              * First allocate the next block, so that we can store it in the
     811             :              * 'next' pointer of this block.
     812             :              */
     813       16554 :             nextBlockNumber = ltsGetBlock(lt->tapeSet, lt);
     814             : 
     815             :             /* set the next-pointer and dump the current block. */
     816       16554 :             TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
     817       16554 :             ltsWriteBlock(lt->tapeSet, lt->curBlockNumber, (void *) lt->buffer);
     818             : 
     819             :             /* initialize the prev-pointer of the next block */
     820       16554 :             TapeBlockGetTrailer(lt->buffer)->prev = lt->curBlockNumber;
     821       16554 :             lt->curBlockNumber = nextBlockNumber;
     822       16554 :             lt->pos = 0;
     823       16554 :             lt->nbytes = 0;
     824             :         }
     825             : 
     826    12819652 :         nthistime = TapeBlockPayloadSize - lt->pos;
     827    12819652 :         if (nthistime > size)
     828    12803098 :             nthistime = size;
     829             :         Assert(nthistime > 0);
     830             : 
     831    12819652 :         memcpy(lt->buffer + lt->pos, ptr, nthistime);
     832             : 
     833    12819652 :         lt->dirty = true;
     834    12819652 :         lt->pos += nthistime;
     835    12819652 :         if (lt->nbytes < lt->pos)
     836    12819652 :             lt->nbytes = lt->pos;
     837    12819652 :         ptr = (void *) ((char *) ptr + nthistime);
     838    12819652 :         size -= nthistime;
     839             :     }
     840    12808760 : }
     841             : 
     842             : /*
     843             :  * Rewind logical tape and switch from writing to reading.
     844             :  *
     845             :  * The tape must currently be in writing state, or "frozen" in read state.
     846             :  *
     847             :  * 'buffer_size' specifies how much memory to use for the read buffer.
     848             :  * Regardless of the argument, the actual amount of memory used is between
     849             :  * BLCKSZ and MaxAllocSize, and is a multiple of BLCKSZ.  The given value is
     850             :  * rounded down and truncated to fit those constraints, if necessary.  If the
     851             :  * tape is frozen, the 'buffer_size' argument is ignored, and a small BLCKSZ
     852             :  * byte buffer is used.
     853             :  */
     854             : void
     855       19876 : LogicalTapeRewindForRead(LogicalTape *lt, size_t buffer_size)
     856             : {
     857       19876 :     LogicalTapeSet *lts = lt->tapeSet;
     858             : 
     859             :     /*
     860             :      * Round and cap buffer_size if needed.
     861             :      */
     862       19876 :     if (lt->frozen)
     863           4 :         buffer_size = BLCKSZ;
     864             :     else
     865             :     {
     866             :         /* need at least one block */
     867       19872 :         if (buffer_size < BLCKSZ)
     868         384 :             buffer_size = BLCKSZ;
     869             : 
     870             :         /* palloc() larger than max_size is unlikely to be helpful */
     871       19872 :         if (buffer_size > lt->max_size)
     872         192 :             buffer_size = lt->max_size;
     873             : 
     874             :         /* round down to BLCKSZ boundary */
     875       19872 :         buffer_size -= buffer_size % BLCKSZ;
     876             :     }
     877             : 
     878       19876 :     if (lt->writing)
     879             :     {
     880             :         /*
     881             :          * Completion of a write phase.  Flush last partial data block, and
     882             :          * rewind for normal (destructive) read.
     883             :          */
     884       19872 :         if (lt->dirty)
     885             :         {
     886             :             /*
     887             :              * As long as we've filled the buffer at least once, its contents
     888             :              * are entirely defined from valgrind's point of view, even though
     889             :              * contents beyond the current end point may be stale.  But it's
     890             :              * possible - at least in the case of a parallel sort - to sort
     891             :              * such small amount of data that we do not fill the buffer even
     892             :              * once.  Tell valgrind that its contents are defined, so it
     893             :              * doesn't bleat.
     894             :              */
     895             :             VALGRIND_MAKE_MEM_DEFINED(lt->buffer + lt->nbytes,
     896             :                                       lt->buffer_size - lt->nbytes);
     897             : 
     898       19680 :             TapeBlockSetNBytes(lt->buffer, lt->nbytes);
     899       19680 :             ltsWriteBlock(lt->tapeSet, lt->curBlockNumber, (void *) lt->buffer);
     900             :         }
     901       19872 :         lt->writing = false;
     902             :     }
     903             :     else
     904             :     {
     905             :         /*
     906             :          * This is only OK if tape is frozen; we rewind for (another) read
     907             :          * pass.
     908             :          */
     909             :         Assert(lt->frozen);
     910             :     }
     911             : 
     912       19876 :     if (lt->buffer)
     913       19684 :         pfree(lt->buffer);
     914             : 
     915             :     /* the buffer is lazily allocated, but set the size here */
     916       19876 :     lt->buffer = NULL;
     917       19876 :     lt->buffer_size = buffer_size;
     918             : 
     919             :     /* free the preallocation list, and return unused block numbers */
     920       19876 :     if (lt->prealloc != NULL)
     921             :     {
     922      151608 :         for (int i = lt->nprealloc; i > 0; i--)
     923      132564 :             ltsReleaseBlock(lts, lt->prealloc[i - 1]);
     924       19044 :         pfree(lt->prealloc);
     925       19044 :         lt->prealloc = NULL;
     926       19044 :         lt->nprealloc = 0;
     927       19044 :         lt->prealloc_size = 0;
     928             :     }
     929       19876 : }
     930             : 
     931             : /*
     932             :  * Read from a logical tape.
     933             :  *
     934             :  * Early EOF is indicated by return value less than #bytes requested.
     935             :  */
     936             : size_t
     937    13124196 : LogicalTapeRead(LogicalTape *lt, void *ptr, size_t size)
     938             : {
     939    13124196 :     size_t      nread = 0;
     940             :     size_t      nthistime;
     941             : 
     942             :     Assert(!lt->writing);
     943             : 
     944    13124196 :     if (lt->buffer == NULL)
     945       19876 :         ltsInitReadBuffer(lt);
     946             : 
     947    26232264 :     while (size > 0)
     948             :     {
     949    13127112 :         if (lt->pos >= lt->nbytes)
     950             :         {
     951             :             /* Try to load more data into buffer. */
     952       24180 :             if (!ltsReadFillBuffer(lt))
     953       19044 :                 break;          /* EOF */
     954             :         }
     955             : 
     956    13108068 :         nthistime = lt->nbytes - lt->pos;
     957    13108068 :         if (nthistime > size)
     958    13083064 :             nthistime = size;
     959             :         Assert(nthistime > 0);
     960             : 
     961    13108068 :         memcpy(ptr, lt->buffer + lt->pos, nthistime);
     962             : 
     963    13108068 :         lt->pos += nthistime;
     964    13108068 :         ptr = (void *) ((char *) ptr + nthistime);
     965    13108068 :         size -= nthistime;
     966    13108068 :         nread += nthistime;
     967             :     }
     968             : 
     969    13124196 :     return nread;
     970             : }
     971             : 
     972             : /*
     973             :  * "Freeze" the contents of a tape so that it can be read multiple times
     974             :  * and/or read backwards.  Once a tape is frozen, its contents will not
     975             :  * be released until the LogicalTapeSet is destroyed.  This is expected
     976             :  * to be used only for the final output pass of a merge.
     977             :  *
     978             :  * This *must* be called just at the end of a write pass, before the
     979             :  * tape is rewound (after rewind is too late!).  It performs a rewind
     980             :  * and switch to read mode "for free".  An immediately following rewind-
     981             :  * for-read call is OK but not necessary.
     982             :  *
     983             :  * share output argument is set with details of storage used for tape after
     984             :  * freezing, which may be passed to LogicalTapeSetCreate within leader
     985             :  * process later.  This metadata is only of interest to worker callers
     986             :  * freezing their final output for leader (single materialized tape).
     987             :  * Serial sorts should set share to NULL.
     988             :  */
     989             : void
     990         292 : LogicalTapeFreeze(LogicalTape *lt, TapeShare *share)
     991             : {
     992         292 :     LogicalTapeSet *lts = lt->tapeSet;
     993             : 
     994             :     Assert(lt->writing);
     995             :     Assert(lt->offsetBlockNumber == 0L);
     996             : 
     997             :     /*
     998             :      * Completion of a write phase.  Flush last partial data block, and rewind
     999             :      * for nondestructive read.
    1000             :      */
    1001         292 :     if (lt->dirty)
    1002             :     {
    1003             :         /*
    1004             :          * As long as we've filled the buffer at least once, its contents are
    1005             :          * entirely defined from valgrind's point of view, even though
    1006             :          * contents beyond the current end point may be stale.  But it's
    1007             :          * possible - at least in the case of a parallel sort - to sort such
    1008             :          * small amount of data that we do not fill the buffer even once. Tell
    1009             :          * valgrind that its contents are defined, so it doesn't bleat.
    1010             :          */
    1011             :         VALGRIND_MAKE_MEM_DEFINED(lt->buffer + lt->nbytes,
    1012             :                                   lt->buffer_size - lt->nbytes);
    1013             : 
    1014         292 :         TapeBlockSetNBytes(lt->buffer, lt->nbytes);
    1015         292 :         ltsWriteBlock(lt->tapeSet, lt->curBlockNumber, (void *) lt->buffer);
    1016             :     }
    1017         292 :     lt->writing = false;
    1018         292 :     lt->frozen = true;
    1019             : 
    1020             :     /*
    1021             :      * The seek and backspace functions assume a single block read buffer.
    1022             :      * That's OK with current usage.  A larger buffer is helpful to make the
    1023             :      * read pattern of the backing file look more sequential to the OS, when
    1024             :      * we're reading from multiple tapes.  But at the end of a sort, when a
    1025             :      * tape is frozen, we only read from a single tape anyway.
    1026             :      */
    1027         292 :     if (!lt->buffer || lt->buffer_size != BLCKSZ)
    1028             :     {
    1029           0 :         if (lt->buffer)
    1030           0 :             pfree(lt->buffer);
    1031           0 :         lt->buffer = palloc(BLCKSZ);
    1032           0 :         lt->buffer_size = BLCKSZ;
    1033             :     }
    1034             : 
    1035             :     /* Read the first block, or reset if tape is empty */
    1036         292 :     lt->curBlockNumber = lt->firstBlockNumber;
    1037         292 :     lt->pos = 0;
    1038         292 :     lt->nbytes = 0;
    1039             : 
    1040         292 :     if (lt->firstBlockNumber == -1L)
    1041           0 :         lt->nextBlockNumber = -1L;
    1042         292 :     ltsReadBlock(lt->tapeSet, lt->curBlockNumber, (void *) lt->buffer);
    1043         292 :     if (TapeBlockIsLast(lt->buffer))
    1044         246 :         lt->nextBlockNumber = -1L;
    1045             :     else
    1046          46 :         lt->nextBlockNumber = TapeBlockGetTrailer(lt->buffer)->next;
    1047         292 :     lt->nbytes = TapeBlockGetNBytes(lt->buffer);
    1048             : 
    1049             :     /* Handle extra steps when caller is to share its tapeset */
    1050         292 :     if (share)
    1051             :     {
    1052         280 :         BufFileExportFileSet(lts->pfile);
    1053         280 :         share->firstblocknumber = lt->firstBlockNumber;
    1054             :     }
    1055         292 : }
    1056             : 
    1057             : /*
    1058             :  * Backspace the tape a given number of bytes.  (We also support a more
    1059             :  * general seek interface, see below.)
    1060             :  *
    1061             :  * *Only* a frozen-for-read tape can be backed up; we don't support
    1062             :  * random access during write, and an unfrozen read tape may have
    1063             :  * already discarded the desired data!
    1064             :  *
    1065             :  * Returns the number of bytes backed up.  It can be less than the
    1066             :  * requested amount, if there isn't that much data before the current
    1067             :  * position.  The tape is positioned to the beginning of the tape in
    1068             :  * that case.
    1069             :  */
    1070             : size_t
    1071          48 : LogicalTapeBackspace(LogicalTape *lt, size_t size)
    1072             : {
    1073          48 :     size_t      seekpos = 0;
    1074             : 
    1075             :     Assert(lt->frozen);
    1076             :     Assert(lt->buffer_size == BLCKSZ);
    1077             : 
    1078          48 :     if (lt->buffer == NULL)
    1079           0 :         ltsInitReadBuffer(lt);
    1080             : 
    1081             :     /*
    1082             :      * Easy case for seek within current block.
    1083             :      */
    1084          48 :     if (size <= (size_t) lt->pos)
    1085             :     {
    1086          44 :         lt->pos -= (int) size;
    1087          44 :         return size;
    1088             :     }
    1089             : 
    1090             :     /*
    1091             :      * Not-so-easy case, have to walk back the chain of blocks.  This
    1092             :      * implementation would be pretty inefficient for long seeks, but we
    1093             :      * really aren't doing that (a seek over one tuple is typical).
    1094             :      */
    1095           4 :     seekpos = (size_t) lt->pos; /* part within this block */
    1096           4 :     while (size > seekpos)
    1097             :     {
    1098           4 :         long        prev = TapeBlockGetTrailer(lt->buffer)->prev;
    1099             : 
    1100           4 :         if (prev == -1L)
    1101             :         {
    1102             :             /* Tried to back up beyond the beginning of tape. */
    1103           4 :             if (lt->curBlockNumber != lt->firstBlockNumber)
    1104           0 :                 elog(ERROR, "unexpected end of tape");
    1105           4 :             lt->pos = 0;
    1106           4 :             return seekpos;
    1107             :         }
    1108             : 
    1109           0 :         ltsReadBlock(lt->tapeSet, prev, (void *) lt->buffer);
    1110             : 
    1111           0 :         if (TapeBlockGetTrailer(lt->buffer)->next != lt->curBlockNumber)
    1112           0 :             elog(ERROR, "broken tape, next of block %ld is %ld, expected %ld",
    1113             :                  prev,
    1114             :                  TapeBlockGetTrailer(lt->buffer)->next,
    1115             :                  lt->curBlockNumber);
    1116             : 
    1117           0 :         lt->nbytes = TapeBlockPayloadSize;
    1118           0 :         lt->curBlockNumber = prev;
    1119           0 :         lt->nextBlockNumber = TapeBlockGetTrailer(lt->buffer)->next;
    1120             : 
    1121           0 :         seekpos += TapeBlockPayloadSize;
    1122             :     }
    1123             : 
    1124             :     /*
    1125             :      * 'seekpos' can now be greater than 'size', because it points to the
    1126             :      * beginning the target block.  The difference is the position within the
    1127             :      * page.
    1128             :      */
    1129           0 :     lt->pos = seekpos - size;
    1130           0 :     return size;
    1131             : }
    1132             : 
    1133             : /*
    1134             :  * Seek to an arbitrary position in a logical tape.
    1135             :  *
    1136             :  * *Only* a frozen-for-read tape can be seeked.
    1137             :  *
    1138             :  * Must be called with a block/offset previously returned by
    1139             :  * LogicalTapeTell().
    1140             :  */
    1141             : void
    1142        4128 : LogicalTapeSeek(LogicalTape *lt, long blocknum, int offset)
    1143             : {
    1144             :     Assert(lt->frozen);
    1145             :     Assert(offset >= 0 && offset <= TapeBlockPayloadSize);
    1146             :     Assert(lt->buffer_size == BLCKSZ);
    1147             : 
    1148        4128 :     if (lt->buffer == NULL)
    1149           0 :         ltsInitReadBuffer(lt);
    1150             : 
    1151        4128 :     if (blocknum != lt->curBlockNumber)
    1152             :     {
    1153          52 :         ltsReadBlock(lt->tapeSet, blocknum, (void *) lt->buffer);
    1154          52 :         lt->curBlockNumber = blocknum;
    1155          52 :         lt->nbytes = TapeBlockPayloadSize;
    1156          52 :         lt->nextBlockNumber = TapeBlockGetTrailer(lt->buffer)->next;
    1157             :     }
    1158             : 
    1159        4128 :     if (offset > lt->nbytes)
    1160           0 :         elog(ERROR, "invalid tape seek position");
    1161        4128 :     lt->pos = offset;
    1162        4128 : }
    1163             : 
    1164             : /*
    1165             :  * Obtain current position in a form suitable for a later LogicalTapeSeek.
    1166             :  *
    1167             :  * NOTE: it'd be OK to do this during write phase with intention of using
    1168             :  * the position for a seek after freezing.  Not clear if anyone needs that.
    1169             :  */
    1170             : void
    1171        5872 : LogicalTapeTell(LogicalTape *lt, long *blocknum, int *offset)
    1172             : {
    1173        5872 :     if (lt->buffer == NULL)
    1174           0 :         ltsInitReadBuffer(lt);
    1175             : 
    1176             :     Assert(lt->offsetBlockNumber == 0L);
    1177             : 
    1178             :     /* With a larger buffer, 'pos' wouldn't be the same as offset within page */
    1179             :     Assert(lt->buffer_size == BLCKSZ);
    1180             : 
    1181        5872 :     *blocknum = lt->curBlockNumber;
    1182        5872 :     *offset = lt->pos;
    1183        5872 : }
    1184             : 
    1185             : /*
    1186             :  * Obtain total disk space currently used by a LogicalTapeSet, in blocks.
    1187             :  *
    1188             :  * This should not be called while there are open write buffers; otherwise it
    1189             :  * may not account for buffered data.
    1190             :  */
    1191             : long
    1192       19548 : LogicalTapeSetBlocks(LogicalTapeSet *lts)
    1193             : {
    1194       19548 :     return lts->nBlocksWritten - lts->nHoleBlocks;
    1195             : }

Generated by: LCOV version 1.14