LCOV - code coverage report
Current view: top level - src/backend/replication/logical - reorderbuffer.c (source / functions) Hit Total Coverage
Test: PostgreSQL 13beta1 Lines: 1080 1169 92.4 %
Date: 2020-05-31 23:07:13 Functions: 65 65 100.0 %

          Line data    Source code
       1             : /*-------------------------------------------------------------------------
       2             :  *
       3             :  * reorderbuffer.c
       4             :  *    PostgreSQL logical replay/reorder buffer management
       5             :  *
       6             :  *
       7             :  * Copyright (c) 2012-2020, PostgreSQL Global Development Group
       8             :  *
       9             :  *
      10             :  * IDENTIFICATION
      11             :  *    src/backend/replication/logical/reorderbuffer.c
      12             :  *
      13             :  * NOTES
      14             :  *    This module gets handed individual pieces of transactions in the order
      15             :  *    they are written to the WAL and is responsible for reassembling them into
      16             :  *    toplevel transaction sized pieces. When a transaction is completely
      17             :  *    reassembled - signalled by reading the transaction commit record - it
      18             :  *    will then call the output plugin (cf. ReorderBufferCommit()) with the
      19             :  *    individual changes. The output plugins rely on snapshots built by
      20             :  *    snapbuild.c which hands them to us.
      21             :  *
      22             :  *    Transactions and subtransactions/savepoints in postgres are not
      23             :  *    immediately linked to each other from outside the performing
      24             :  *    backend. Only at commit/abort (or special xact_assignment records) are they
      25             :  *    linked together, which means that we will have to splice together a
      26             :  *    toplevel transaction from its subtransactions. To do that efficiently we
      27             :  *    build a binary heap indexed by the smallest current lsn of the individual
      28             :  *    subtransactions' changestreams. As the individual streams are inherently
      29             :  *    ordered by LSN - since that is where we build them from - the transaction
      30             :  *    can easily be reassembled by always using the subtransaction with the
      31             :  *    smallest current LSN from the heap.
      32             :  *
      33             :  *    In order to cope with large transactions - which can be several times as
      34             :  *    big as the available memory - this module supports spooling the contents
      35             :  *    of large transactions to disk. When the transaction is replayed the
      36             :  *    contents of individual (sub-)transactions will be read from disk in
      37             :  *    chunks.
      38             :  *
      39             :  *    This module also has to deal with reassembling toast records from the
      40             :  *    individual chunks stored in WAL. When a new (or initial) version of a
      41             :  *    tuple is stored in WAL it will always be preceded by the toast chunks
      42             :  *    emitted for the columns stored out of line. Within a single toplevel
      43             :  *    transaction there will be no other data carrying records between a row's
      44             :  *    toast chunks and the row data itself. See ReorderBufferToast* for
      45             :  *    details.
      46             :  *
      47             :  *    ReorderBuffer uses two special memory context types - SlabContext for
      48             :  *    allocations of fixed-length structures (changes and transactions), and
      49             :  *    GenerationContext for the variable-length transaction data (allocated
      50             :  *    and freed in groups with similar lifespan).
      51             :  *
      52             :  *    To limit the amount of memory used by decoded changes, we track memory
      53             :  *    used at the reorder buffer level (i.e. total amount of memory), and for
      54             :  *    each transaction. When the total amount of used memory exceeds the
      55             :  *    limit, the transaction consuming the most memory is then serialized to
      56             :  *    disk.
      57             :  *
      58             :  *    Only decoded changes are evicted from memory (spilled to disk), not the
      59             :  *    transaction records. The number of toplevel transactions is limited,
      60             :  *    but a transaction with many subtransactions may still consume significant
      61             :  *    amounts of memory. The transaction records are fairly small, though, and
      62             :  *    are not included in the memory limit.
      63             :  *
      64             :  *    The current eviction algorithm is very simple - the transaction is
      65             :  *    picked merely by size, while it might be useful to also consider age
      66             :  *    (LSN) of the changes for example. With the new Generational memory
      67             :  *    allocator, evicting the oldest changes would make it more likely the
      68             :  *    memory actually gets freed.
      69             :  *
      70             :  *    We still rely on max_changes_in_memory when loading serialized changes
      71             :  *    back into memory. At that point we can't use the memory limit directly
      72             :  *    as we load the subxacts independently. One option to deal with this
      73             :  *    would be to count the subxacts, and allow each to allocate 1/N of the
      74             :  *    memory limit. That however does not seem very appealing, because with
      75             :  *    many subtransactions it may easily cause thrashing (short cycles of
      76             :  *    deserializing and applying very few changes). We probably should give
      77             :  *    a bit more memory to the oldest subtransactions, because they are likely
      78             :  *    the source for the next sequence of changes.
      79             :  *
      80             :  * -------------------------------------------------------------------------
      81             :  */
      82             : #include "postgres.h"
      83             : 
      84             : #include <unistd.h>
      85             : #include <sys/stat.h>
      86             : 
      87             : #include "access/detoast.h"
      88             : #include "access/heapam.h"
      89             : #include "access/rewriteheap.h"
      90             : #include "access/transam.h"
      91             : #include "access/xact.h"
      92             : #include "access/xlog_internal.h"
      93             : #include "catalog/catalog.h"
      94             : #include "lib/binaryheap.h"
      95             : #include "miscadmin.h"
      96             : #include "pgstat.h"
      97             : #include "replication/logical.h"
      98             : #include "replication/reorderbuffer.h"
      99             : #include "replication/slot.h"
     100             : #include "replication/snapbuild.h"    /* just for SnapBuildSnapDecRefcount */
     101             : #include "storage/bufmgr.h"
     102             : #include "storage/fd.h"
     103             : #include "storage/sinval.h"
     104             : #include "utils/builtins.h"
     105             : #include "utils/combocid.h"
     106             : #include "utils/memdebug.h"
     107             : #include "utils/memutils.h"
     108             : #include "utils/rel.h"
     109             : #include "utils/relfilenodemap.h"
     110             : 
     111             : 
     112             : /* entry for a hash table we use to map from xid to our transaction state */
     113             : typedef struct ReorderBufferTXNByIdEnt
     114             : {
     115             :     TransactionId xid;
     116             :     ReorderBufferTXN *txn;
     117             : } ReorderBufferTXNByIdEnt;
     118             : 
     119             : /* data structures for (relfilenode, ctid) => (cmin, cmax) mapping */
     120             : typedef struct ReorderBufferTupleCidKey
     121             : {
     122             :     RelFileNode relnode;
     123             :     ItemPointerData tid;
     124             : } ReorderBufferTupleCidKey;
     125             : 
     126             : typedef struct ReorderBufferTupleCidEnt
     127             : {
     128             :     ReorderBufferTupleCidKey key;
     129             :     CommandId   cmin;
     130             :     CommandId   cmax;
     131             :     CommandId   combocid;       /* just for debugging */
     132             : } ReorderBufferTupleCidEnt;
     133             : 
     134             : /* Virtual file descriptor with file offset tracking */
     135             : typedef struct TXNEntryFile
     136             : {
     137             :     File        vfd;            /* -1 when the file is closed */
     138             :     off_t       curOffset;      /* offset for next write or read. Reset to 0
     139             :                                  * when vfd is opened. */
     140             : } TXNEntryFile;
     141             : 
     142             : /* k-way in-order change iteration support structures */
     143             : typedef struct ReorderBufferIterTXNEntry
     144             : {
     145             :     XLogRecPtr  lsn;
     146             :     ReorderBufferChange *change;
     147             :     ReorderBufferTXN *txn;
     148             :     TXNEntryFile file;
     149             :     XLogSegNo   segno;
     150             : } ReorderBufferIterTXNEntry;
     151             : 
     152             : typedef struct ReorderBufferIterTXNState
     153             : {
     154             :     binaryheap *heap;
     155             :     Size        nr_txns;
     156             :     dlist_head  old_change;
     157             :     ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
     158             : } ReorderBufferIterTXNState;
     159             : 
     160             : /* toast data structures */
     161             : typedef struct ReorderBufferToastEnt
     162             : {
     163             :     Oid         chunk_id;       /* toast_table.chunk_id */
     164             :     int32       last_chunk_seq; /* toast_table.chunk_seq of the last chunk we
     165             :                                  * have seen */
     166             :     Size        num_chunks;     /* number of chunks we've already seen */
     167             :     Size        size;           /* combined size of chunks seen */
     168             :     dlist_head  chunks;         /* linked list of chunks */
     169             :     struct varlena *reconstructed;  /* reconstructed varlena now pointed to in
     170             :                                      * main tup */
     171             : } ReorderBufferToastEnt;
     172             : 
     173             : /* Disk serialization support data structures */
     174             : typedef struct ReorderBufferDiskChange
     175             : {
     176             :     Size        size;
     177             :     ReorderBufferChange change;
     178             :     /* data follows */
     179             : } ReorderBufferDiskChange;
     180             : 
     181             : /*
     182             :  * Maximum number of changes kept in memory, per transaction. After that,
     183             :  * changes are spooled to disk.
     184             :  *
     185             :  * The current value should be sufficient to decode the entire transaction
     186             :  * without hitting disk in OLTP workloads, while starting to spool to disk in
     187             :  * other workloads reasonably fast.
     188             :  *
     189             :  * At some point in the future it probably makes sense to have a more elaborate
     190             :  * resource management here, but it's not entirely clear what that would look
     191             :  * like.
     192             :  */
     193             : int         logical_decoding_work_mem;
     194             : static const Size max_changes_in_memory = 4096; /* XXX for restore only */
     195             : 
     196             : /* ---------------------------------------
     197             :  * primary reorderbuffer support routines
     198             :  * ---------------------------------------
     199             :  */
     200             : static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
     201             : static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
     202             : static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
     203             :                                                TransactionId xid, bool create, bool *is_new,
     204             :                                                XLogRecPtr lsn, bool create_as_top);
     205             : static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
     206             :                                               ReorderBufferTXN *subtxn);
     207             : 
     208             : static void AssertTXNLsnOrder(ReorderBuffer *rb);
     209             : 
     210             : /* ---------------------------------------
     211             :  * support functions for lsn-order iterating over the ->changes of a
     212             :  * transaction and its subtransactions
     213             :  *
     214             :  * used for iteration over the k-way heap merge of a transaction and its
     215             :  * subtransactions
     216             :  * ---------------------------------------
     217             :  */
     218             : static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
     219             :                                      ReorderBufferIterTXNState *volatile *iter_state);
     220             : static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
     221             : static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
     222             :                                        ReorderBufferIterTXNState *state);
     223             : static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
     224             : 
     225             : /*
     226             :  * ---------------------------------------
     227             :  * Disk serialization support functions
     228             :  * ---------------------------------------
     229             :  */
     230             : static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb);
     231             : static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
     232             : static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
     233             :                                          int fd, ReorderBufferChange *change);
     234             : static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
     235             :                                         TXNEntryFile *file, XLogSegNo *segno);
     236             : static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
     237             :                                        char *change);
     238             : static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
     239             : static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
     240             : static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
     241             :                                         TransactionId xid, XLogSegNo segno);
     242             : 
     243             : static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
     244             : static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
     245             :                                       ReorderBufferTXN *txn, CommandId cid);
     246             : 
     247             : /* ---------------------------------------
     248             :  * toast reassembly support
     249             :  * ---------------------------------------
     250             :  */
     251             : static void ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn);
     252             : static void ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn);
     253             : static void ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
     254             :                                       Relation relation, ReorderBufferChange *change);
     255             : static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
     256             :                                           Relation relation, ReorderBufferChange *change);
     257             : 
     258             : /*
     259             :  * ---------------------------------------
     260             :  * memory accounting
     261             :  * ---------------------------------------
     262             :  */
     263             : static Size ReorderBufferChangeSize(ReorderBufferChange *change);
     264             : static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
     265             :                                             ReorderBufferChange *change, bool addition);
     266             : 
     267             : /*
     268             :  * Allocate a new ReorderBuffer and clean out any old serialized state from
     269             :  * prior ReorderBuffer instances for the same slot.
     270             :  */
     271             : ReorderBuffer *
     272         594 : ReorderBufferAllocate(void)
     273             : {
     274             :     ReorderBuffer *buffer;
     275             :     HASHCTL     hash_ctl;
     276             :     MemoryContext new_ctx;
     277             : 
     278             :     Assert(MyReplicationSlot != NULL);
     279             : 
     280             :     /* allocate memory in own context, to have better accountability */
     281         594 :     new_ctx = AllocSetContextCreate(CurrentMemoryContext,
     282             :                                     "ReorderBuffer",
     283             :                                     ALLOCSET_DEFAULT_SIZES);
     284             : 
     285             :     buffer =
     286         594 :         (ReorderBuffer *) MemoryContextAlloc(new_ctx, sizeof(ReorderBuffer));
     287             : 
     288         594 :     memset(&hash_ctl, 0, sizeof(hash_ctl));
     289             : 
     290         594 :     buffer->context = new_ctx;
     291             : 
     292         594 :     buffer->change_context = SlabContextCreate(new_ctx,
     293             :                                                "Change",
     294             :                                                SLAB_DEFAULT_BLOCK_SIZE,
     295             :                                                sizeof(ReorderBufferChange));
     296             : 
     297         594 :     buffer->txn_context = SlabContextCreate(new_ctx,
     298             :                                             "TXN",
     299             :                                             SLAB_DEFAULT_BLOCK_SIZE,
     300             :                                             sizeof(ReorderBufferTXN));
     301             : 
     302         594 :     buffer->tup_context = GenerationContextCreate(new_ctx,
     303             :                                                   "Tuples",
     304             :                                                   SLAB_LARGE_BLOCK_SIZE);
     305             : 
     306         594 :     hash_ctl.keysize = sizeof(TransactionId);
     307         594 :     hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
     308         594 :     hash_ctl.hcxt = buffer->context;
     309             : 
     310         594 :     buffer->by_txn = hash_create("ReorderBufferByXid", 1000, &hash_ctl,
     311             :                                  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
     312             : 
     313         594 :     buffer->by_txn_last_xid = InvalidTransactionId;
     314         594 :     buffer->by_txn_last_txn = NULL;
     315             : 
     316         594 :     buffer->outbuf = NULL;
     317         594 :     buffer->outbufsize = 0;
     318         594 :     buffer->size = 0;
     319             : 
     320         594 :     buffer->spillCount = 0;
     321         594 :     buffer->spillTxns = 0;
     322         594 :     buffer->spillBytes = 0;
     323             : 
     324         594 :     buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
     325             : 
     326         594 :     dlist_init(&buffer->toplevel_by_lsn);
     327         594 :     dlist_init(&buffer->txns_by_base_snapshot_lsn);
     328             : 
     329             :     /*
     330             :      * Ensure there's no stale data from prior uses of this slot, in case some
     331             :      * prior exit avoided calling ReorderBufferFree. Failure to do this can
     332             :      * produce duplicated txns, and it's very cheap if there's nothing there.
     333             :      */
     334         594 :     ReorderBufferCleanupSerializedTXNs(NameStr(MyReplicationSlot->data.name));
     335             : 
     336         594 :     return buffer;
     337             : }
     338             : 
     339             : /*
     340             :  * Free a ReorderBuffer
     341             :  */
     342             : void
     343         544 : ReorderBufferFree(ReorderBuffer *rb)
     344             : {
     345         544 :     MemoryContext context = rb->context;
     346             : 
     347             :     /*
     348             :      * We free separately allocated data by entirely scrapping reorderbuffer's
     349             :      * memory context.
     350             :      */
     351         544 :     MemoryContextDelete(context);
     352             : 
     353             :     /* Free disk space used by unconsumed reorder buffers */
     354         544 :     ReorderBufferCleanupSerializedTXNs(NameStr(MyReplicationSlot->data.name));
     355         544 : }
     356             : 
     357             : /*
     358             :  * Get an unused, possibly preallocated, ReorderBufferTXN.
     359             :  */
     360             : static ReorderBufferTXN *
     361        4832 : ReorderBufferGetTXN(ReorderBuffer *rb)
     362             : {
     363             :     ReorderBufferTXN *txn;
     364             : 
     365             :     txn = (ReorderBufferTXN *)
     366        4832 :         MemoryContextAlloc(rb->txn_context, sizeof(ReorderBufferTXN));
     367             : 
     368        4832 :     memset(txn, 0, sizeof(ReorderBufferTXN));
     369             : 
     370        4832 :     dlist_init(&txn->changes);
     371        4832 :     dlist_init(&txn->tuplecids);
     372        4832 :     dlist_init(&txn->subtxns);
     373             : 
     374        4832 :     return txn;
     375             : }
     376             : 
     377             : /*
     378             :  * Free a ReorderBufferTXN.
     379             :  */
     380             : static void
     381        4814 : ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
     382             : {
     383             :     /* clean the lookup cache if we were cached (quite likely) */
     384        4814 :     if (rb->by_txn_last_xid == txn->xid)
     385             :     {
     386        4358 :         rb->by_txn_last_xid = InvalidTransactionId;
     387        4358 :         rb->by_txn_last_txn = NULL;
     388             :     }
     389             : 
     390             :     /* free data that's contained */
     391             : 
     392        4814 :     if (txn->tuplecid_hash != NULL)
     393             :     {
     394         366 :         hash_destroy(txn->tuplecid_hash);
     395         366 :         txn->tuplecid_hash = NULL;
     396             :     }
     397             : 
     398        4814 :     if (txn->invalidations)
     399             :     {
     400         940 :         pfree(txn->invalidations);
     401         940 :         txn->invalidations = NULL;
     402             :     }
     403             : 
     404        4814 :     pfree(txn);
     405        4814 : }
     406             : 
     407             : /*
     408             :  * Get a fresh ReorderBufferChange.
     409             :  */
     410             : ReorderBufferChange *
     411     2398018 : ReorderBufferGetChange(ReorderBuffer *rb)
     412             : {
     413             :     ReorderBufferChange *change;
     414             : 
     415             :     change = (ReorderBufferChange *)
     416     2398018 :         MemoryContextAlloc(rb->change_context, sizeof(ReorderBufferChange));
     417             : 
     418     2398018 :     memset(change, 0, sizeof(ReorderBufferChange));
     419     2398018 :     return change;
     420             : }
     421             : 
     422             : /*
     423             :  * Free a ReorderBufferChange.
     424             :  */
     425             : void
     426     2398004 : ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
     427             : {
     428             :     /* update memory accounting info */
     429     2398004 :     ReorderBufferChangeMemoryUpdate(rb, change, false);
     430             : 
     431             :     /* free contained data */
     432     2398004 :     switch (change->action)
     433             :     {
     434     2301886 :         case REORDER_BUFFER_CHANGE_INSERT:
     435             :         case REORDER_BUFFER_CHANGE_UPDATE:
     436             :         case REORDER_BUFFER_CHANGE_DELETE:
     437             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
     438     2301886 :             if (change->data.tp.newtuple)
     439             :             {
     440     2032208 :                 ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
     441     2032208 :                 change->data.tp.newtuple = NULL;
     442             :             }
     443             : 
     444     2301886 :             if (change->data.tp.oldtuple)
     445             :             {
     446      132066 :                 ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
     447      132066 :                 change->data.tp.oldtuple = NULL;
     448             :             }
     449     2301886 :             break;
     450          46 :         case REORDER_BUFFER_CHANGE_MESSAGE:
     451          46 :             if (change->data.msg.prefix != NULL)
     452          46 :                 pfree(change->data.msg.prefix);
     453          46 :             change->data.msg.prefix = NULL;
     454          46 :             if (change->data.msg.message != NULL)
     455          46 :                 pfree(change->data.msg.message);
     456          46 :             change->data.msg.message = NULL;
     457          46 :             break;
     458        1002 :         case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
     459        1002 :             if (change->data.snapshot)
     460             :             {
     461        1002 :                 ReorderBufferFreeSnap(rb, change->data.snapshot);
     462        1002 :                 change->data.snapshot = NULL;
     463             :             }
     464        1002 :             break;
     465             :             /* no data in addition to the struct itself */
     466          20 :         case REORDER_BUFFER_CHANGE_TRUNCATE:
     467          20 :             if (change->data.truncate.relids != NULL)
     468             :             {
     469          20 :                 ReorderBufferReturnRelids(rb, change->data.truncate.relids);
     470          20 :                 change->data.truncate.relids = NULL;
     471             :             }
     472          20 :             break;
     473       95050 :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
     474             :         case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
     475             :         case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
     476       95050 :             break;
     477             :     }
     478             : 
     479     2398004 :     pfree(change);
     480     2398004 : }
     481             : 
     482             : /*
     483             :  * Get a fresh ReorderBufferTupleBuf fitting at least a tuple of size
     484             :  * tuple_len (excluding header overhead).
     485             :  */
     486             : ReorderBufferTupleBuf *
     487     2164284 : ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)
     488             : {
     489             :     ReorderBufferTupleBuf *tuple;
     490             :     Size        alloc_len;
     491             : 
     492     2164284 :     alloc_len = tuple_len + SizeofHeapTupleHeader;
     493             : 
     494             :     tuple = (ReorderBufferTupleBuf *)
     495     2164284 :         MemoryContextAlloc(rb->tup_context,
     496             :                            sizeof(ReorderBufferTupleBuf) +
     497             :                            MAXIMUM_ALIGNOF + alloc_len);
     498     2164284 :     tuple->alloc_tuple_size = alloc_len;
     499     2164284 :     tuple->tuple.t_data = ReorderBufferTupleBufData(tuple);
     500             : 
     501     2164284 :     return tuple;
     502             : }
     503             : 
     504             : /*
     505             :  * Free a ReorderBufferTupleBuf.
     506             :  */
     507             : void
     508     2164274 : ReorderBufferReturnTupleBuf(ReorderBuffer *rb, ReorderBufferTupleBuf *tuple)
     509             : {
     510     2164274 :     pfree(tuple);
     511     2164274 : }
     512             : 
     513             : /*
     514             :  * Get an array for relids of truncated relations.
     515             :  *
     516             :  * We use the global memory context (for the whole reorder buffer), because
     517             :  * none of the existing ones seems like a good match (some are SLAB, so we
     518             :  * can't use those, and tup_context is meant for tuple data, not relids). We
     519             :  * could add yet another context, but it seems like overkill - TRUNCATE is
     520             :  * not a particularly common operation, so it does not seem worth it.
     521             :  */
     522             : Oid *
     523          20 : ReorderBufferGetRelids(ReorderBuffer *rb, int nrelids)
     524             : {
     525             :     Oid        *relids;
     526             :     Size        alloc_len;
     527             : 
     528          20 :     alloc_len = sizeof(Oid) * nrelids;
     529             : 
     530          20 :     relids = (Oid *) MemoryContextAlloc(rb->context, alloc_len);
     531             : 
     532          20 :     return relids;
     533             : }
     534             : 
     535             : /*
     536             :  * Free an array of relids.
     537             :  */
     538             : void
     539          20 : ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
     540             : {
     541          20 :     pfree(relids);
     542          20 : }
     543             : 
     544             : /*
     545             :  * Return the ReorderBufferTXN from the given buffer, specified by Xid.
     546             :  * If create is true, and a transaction doesn't already exist, create it
     547             :  * (with the given LSN, and as top transaction if that's specified);
     548             :  * when this happens, is_new is set to true.
     549             :  */
     550             : static ReorderBufferTXN *
     551     7624150 : ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
     552             :                       bool *is_new, XLogRecPtr lsn, bool create_as_top)
     553             : {
     554             :     ReorderBufferTXN *txn;
     555             :     ReorderBufferTXNByIdEnt *ent;
     556             :     bool        found;
     557             : 
     558             :     Assert(TransactionIdIsValid(xid));
     559             : 
     560             :     /*
     561             :      * Check the one-entry lookup cache first
     562             :      */
     563     7624150 :     if (TransactionIdIsValid(rb->by_txn_last_xid) &&
     564     7619788 :         rb->by_txn_last_xid == xid)
     565             :     {
     566     6545418 :         txn = rb->by_txn_last_txn;
     567             : 
     568     6545418 :         if (txn != NULL)
     569             :         {
     570             :             /* found it, and it's valid */
     571     6545412 :             if (is_new)
     572        3418 :                 *is_new = false;
     573     6545412 :             return txn;
     574             :         }
     575             : 
     576             :         /*
     577             :          * cached as non-existent, and asked not to create? Then nothing else
     578             :          * to do.
     579             :          */
     580           6 :         if (!create)
     581           6 :             return NULL;
     582             :         /* otherwise fall through to create it */
     583             :     }
     584             : 
     585             :     /*
     586             :      * Either the cache wasn't hit, or it yielded a "does-not-exist" and we
     587             :      * want to create an entry.
     588             :      */
     589             : 
     590             :     /* search the lookup table */
     591             :     ent = (ReorderBufferTXNByIdEnt *)
     592     1078732 :         hash_search(rb->by_txn,
     593             :                     (void *) &xid,
     594             :                     create ? HASH_ENTER : HASH_FIND,
     595             :                     &found);
     596     1078732 :     if (found)
     597     1073434 :         txn = ent->txn;
     598        5298 :     else if (create)
     599             :     {
     600             :         /* initialize the new entry, if creation was requested */
     601             :         Assert(ent != NULL);
     602             :         Assert(lsn != InvalidXLogRecPtr);
     603             : 
     604        4832 :         ent->txn = ReorderBufferGetTXN(rb);
     605        4832 :         ent->txn->xid = xid;
     606        4832 :         txn = ent->txn;
     607        4832 :         txn->first_lsn = lsn;
     608        4832 :         txn->restart_decoding_lsn = rb->current_restart_decoding_lsn;
     609             : 
     610        4832 :         if (create_as_top)
     611             :         {
     612        3624 :             dlist_push_tail(&rb->toplevel_by_lsn, &txn->node);
     613        3624 :             AssertTXNLsnOrder(rb);
     614             :         }
     615             :     }
     616             :     else
     617         466 :         txn = NULL;             /* not found and not asked to create */
     618             : 
     619             :     /* update cache */
     620     1078732 :     rb->by_txn_last_xid = xid;
     621     1078732 :     rb->by_txn_last_txn = txn;
     622             : 
     623     1078732 :     if (is_new)
     624        4994 :         *is_new = !found;
     625             : 
     626             :     Assert(!create || txn != NULL);
     627     1078732 :     return txn;
     628             : }
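/*
 * Illustrative sketch, not part of reorderbuffer.c: the lookup above puts a
 * one-entry cache in front of the hash table and also caches "not found"
 * results.  The version below shows the same control flow in isolation.
 * Everything in it (Table, Entry, slow_find, lookup, the fixed-size probing
 * table standing in for dynahash) is a hypothetical stand-in chosen for the
 * example, assuming ids are never removed from the table.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64           /* hypothetical fixed capacity */
#define INVALID_ID 0u           /* plays the role of InvalidTransactionId */

typedef struct Entry
{
    uint32_t    id;
    int         payload;
    bool        used;
} Entry;

typedef struct Table
{
    Entry       slots[TABLE_SIZE];
    uint32_t    last_id;        /* id of the most recent lookup, if any */
    Entry      *last_entry;     /* its result; NULL means "known absent" */
    long        slow_lookups;   /* how often the cache was bypassed */
} Table;

/* Slow path: linear probing; optionally creates the entry. */
static Entry *
slow_find(Table *t, uint32_t id, bool create)
{
    t->slow_lookups++;
    for (size_t i = 0; i < TABLE_SIZE; i++)
    {
        Entry      *e = &t->slots[(id + i) % TABLE_SIZE];

        if (e->used && e->id == id)
            return e;
        if (!e->used)
        {
            if (!create)
                return NULL;
            e->used = true;
            e->id = id;
            e->payload = 0;
            return e;
        }
    }
    return NULL;                /* table full */
}

/* Fast path first: consult the one-entry cache, then fall back. */
static Entry *
lookup(Table *t, uint32_t id, bool create)
{
    Entry      *e;

    assert(id != INVALID_ID);

    if (t->last_id == id)
    {
        if (t->last_entry != NULL)
            return t->last_entry;   /* cached hit */
        if (!create)
            return NULL;            /* cached miss, nothing to create */
        /* cached miss but creation requested: fall through */
    }

    e = slow_find(t, id, create);

    /* remember the outcome, including a negative one */
    t->last_id = id;
    t->last_entry = e;
    return e;
}
```

/*
 * A zero-initialized Table starts with an empty cache, because last_id is
 * then INVALID_ID and can never match a valid id.
 */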
     629             : 
     630             : /*
     631             :  * Queue a change into a transaction so it can be replayed upon commit.
     632             :  */
     633             : void
     634     2076136 : ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
     635             :                          ReorderBufferChange *change)
     636             : {
     637             :     ReorderBufferTXN *txn;
     638             : 
     639     2076136 :     txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
     640             : 
     641     2076136 :     change->lsn = lsn;
     642     2076136 :     change->txn = txn;
     643             : 
     644             :     Assert(InvalidXLogRecPtr != lsn);
     645     2076136 :     dlist_push_tail(&txn->changes, &change->node);
     646     2076136 :     txn->nentries++;
     647     2076136 :     txn->nentries_mem++;
     648             : 
     649             :     /* update memory accounting information */
     650     2076136 :     ReorderBufferChangeMemoryUpdate(rb, change, true);
     651             : 
     652             :     /* check the memory limits and evict something if needed */
     653     2076136 :     ReorderBufferCheckMemoryLimit(rb);
     654     2076136 : }
     655             : 
     656             : /*
     657             :  * Queue message into a transaction so it can be processed upon commit.
     658             :  */
     659             : void
     660          50 : ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
     661             :                           Snapshot snapshot, XLogRecPtr lsn,
     662             :                           bool transactional, const char *prefix,
     663             :                           Size message_size, const char *message)
     664             : {
     665          50 :     if (transactional)
     666             :     {
     667             :         MemoryContext oldcontext;
     668             :         ReorderBufferChange *change;
     669             : 
     670             :         Assert(xid != InvalidTransactionId);
     671             : 
     672          44 :         oldcontext = MemoryContextSwitchTo(rb->context);
     673             : 
     674          44 :         change = ReorderBufferGetChange(rb);
     675          44 :         change->action = REORDER_BUFFER_CHANGE_MESSAGE;
     676          44 :         change->data.msg.prefix = pstrdup(prefix);
     677          44 :         change->data.msg.message_size = message_size;
     678          44 :         change->data.msg.message = palloc(message_size);
     679          44 :         memcpy(change->data.msg.message, message, message_size);
     680             : 
     681          44 :         ReorderBufferQueueChange(rb, xid, lsn, change);
     682             : 
     683          44 :         MemoryContextSwitchTo(oldcontext);
     684             :     }
     685             :     else
     686             :     {
     687           6 :         ReorderBufferTXN *txn = NULL;
     688           6 :         volatile Snapshot snapshot_now = snapshot;
     689             : 
     690           6 :         if (xid != InvalidTransactionId)
     691           4 :             txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
     692             : 
     693             :         /* setup snapshot to allow catalog access */
     694           6 :         SetupHistoricSnapshot(snapshot_now, NULL);
     695           6 :         PG_TRY();
     696             :         {
     697           6 :             rb->message(rb, txn, lsn, false, prefix, message_size, message);
     698             : 
     699           6 :             TeardownHistoricSnapshot(false);
     700             :         }
     701           0 :         PG_CATCH();
     702             :         {
     703           0 :             TeardownHistoricSnapshot(true);
     704           0 :             PG_RE_THROW();
     705             :         }
     706           6 :         PG_END_TRY();
     707             :     }
     708          50 : }
     709             : 
     710             : /*
     711             :  * AssertTXNLsnOrder
     712             :  *      Verify LSN ordering of transaction lists in the reorderbuffer
     713             :  *
     714             :  * Other LSN-related invariants are checked too.
     715             :  *
     716             :  * No-op if assertions are not in use.
     717             :  */
     718             : static void
     719        9552 : AssertTXNLsnOrder(ReorderBuffer *rb)
     720             : {
     721             : #ifdef USE_ASSERT_CHECKING
     722             :     dlist_iter  iter;
     723             :     XLogRecPtr  prev_first_lsn = InvalidXLogRecPtr;
     724             :     XLogRecPtr  prev_base_snap_lsn = InvalidXLogRecPtr;
     725             : 
     726             :     dlist_foreach(iter, &rb->toplevel_by_lsn)
     727             :     {
     728             :         ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN, node,
     729             :                                                     iter.cur);
     730             : 
     731             :         /* start LSN must be set */
     732             :         Assert(cur_txn->first_lsn != InvalidXLogRecPtr);
     733             : 
     734             :         /* If there is an end LSN, it must not be lower than the start LSN */
     735             :         if (cur_txn->end_lsn != InvalidXLogRecPtr)
     736             :             Assert(cur_txn->first_lsn <= cur_txn->end_lsn);
     737             : 
     738             :         /* Current initial LSN must be strictly higher than previous */
     739             :         if (prev_first_lsn != InvalidXLogRecPtr)
     740             :             Assert(prev_first_lsn < cur_txn->first_lsn);
     741             : 
     742             :         /* known-as-subtxn txns must not be listed */
     743             :         Assert(!rbtxn_is_known_subxact(cur_txn));
     744             : 
     745             :         prev_first_lsn = cur_txn->first_lsn;
     746             :     }
     747             : 
     748             :     dlist_foreach(iter, &rb->txns_by_base_snapshot_lsn)
     749             :     {
     750             :         ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN,
     751             :                                                     base_snapshot_node,
     752             :                                                     iter.cur);
     753             : 
     754             :         /* base snapshot (and its LSN) must be set */
     755             :         Assert(cur_txn->base_snapshot != NULL);
     756             :         Assert(cur_txn->base_snapshot_lsn != InvalidXLogRecPtr);
     757             : 
     758             :         /* current LSN must be strictly higher than previous */
     759             :         if (prev_base_snap_lsn != InvalidXLogRecPtr)
     760             :             Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
     761             : 
     762             :         /* known-as-subtxn txns must not be listed */
     763             :         Assert(!rbtxn_is_known_subxact(cur_txn));
     764             : 
     765             :         prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
     766             :     }
     767             : #endif
     768        9552 : }
     769             : 
     770             : /*
     771             :  * ReorderBufferGetOldestTXN
     772             :  *      Return oldest transaction in reorderbuffer
     773             :  */
     774             : ReorderBufferTXN *
     775         170 : ReorderBufferGetOldestTXN(ReorderBuffer *rb)
     776             : {
     777             :     ReorderBufferTXN *txn;
     778             : 
     779         170 :     AssertTXNLsnOrder(rb);
     780             : 
     781         170 :     if (dlist_is_empty(&rb->toplevel_by_lsn))
     782         144 :         return NULL;
     783             : 
     784          26 :     txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
     785             : 
     786             :     Assert(!rbtxn_is_known_subxact(txn));
     787             :     Assert(txn->first_lsn != InvalidXLogRecPtr);
     788          26 :     return txn;
     789             : }
     790             : 
     791             : /*
     792             :  * ReorderBufferGetOldestXmin
     793             :  *      Return oldest Xmin in reorderbuffer
     794             :  *
     795             :  * Returns oldest possibly running Xid from the point of view of snapshots
     796             :  * used in the transactions kept by reorderbuffer, or InvalidTransactionId if
     797             :  * there are none.
     798             :  *
     799             :  * Since snapshots are assigned monotonically, this equals the Xmin of the
     800             :  * base snapshot with minimal base_snapshot_lsn.
     801             :  */
     802             : TransactionId
     803         184 : ReorderBufferGetOldestXmin(ReorderBuffer *rb)
     804             : {
     805             :     ReorderBufferTXN *txn;
     806             : 
     807         184 :     AssertTXNLsnOrder(rb);
     808             : 
     809         184 :     if (dlist_is_empty(&rb->txns_by_base_snapshot_lsn))
     810         158 :         return InvalidTransactionId;
     811             : 
     812          26 :     txn = dlist_head_element(ReorderBufferTXN, base_snapshot_node,
     813             :                              &rb->txns_by_base_snapshot_lsn);
     814          26 :     return txn->base_snapshot->xmin;
     815             : }
     816             : 
     817             : void
     818         186 : ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr)
     819             : {
     820         186 :     rb->current_restart_decoding_lsn = ptr;
     821         186 : }
     822             : 
     823             : /*
     824             :  * ReorderBufferAssignChild
     825             :  *
     826             :  * Make note that we know that subxid is a subtransaction of xid, seen as of
     827             :  * the given lsn.
     828             :  */
     829             : void
     830        2420 : ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
     831             :                          TransactionId subxid, XLogRecPtr lsn)
     832             : {
     833             :     ReorderBufferTXN *txn;
     834             :     ReorderBufferTXN *subtxn;
     835             :     bool        new_top;
     836             :     bool        new_sub;
     837             : 
     838        2420 :     txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
     839        2420 :     subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
     840             : 
     841        2420 :     if (!new_sub)
     842             :     {
     843        1212 :         if (rbtxn_is_known_subxact(subtxn))
     844             :         {
     845             :             /* already associated, nothing to do */
     846         418 :             return;
     847             :         }
     848             :         else
     849             :         {
     850             :             /*
     851             :              * We already saw this transaction, but initially added it to the
     852             :              * list of top-level txns.  Now that we know it's not top-level,
     853             :              * remove it from there.
     854             :              */
     855         794 :             dlist_delete(&subtxn->node);
     856             :         }
     857             :     }
     858             : 
     859        2002 :     subtxn->txn_flags |= RBTXN_IS_SUBXACT;
     860        2002 :     subtxn->toplevel_xid = xid;
     861             :     Assert(subtxn->nsubtxns == 0);
     862             : 
     863             :     /* add to subtransaction list */
     864        2002 :     dlist_push_tail(&txn->subtxns, &subtxn->node);
     865        2002 :     txn->nsubtxns++;
     866             : 
     867             :     /* Possibly transfer the subtxn's snapshot to its top-level txn. */
     868        2002 :     ReorderBufferTransferSnapToParent(txn, subtxn);
     869             : 
     870             :     /* Verify LSN-ordering invariant */
     871        2002 :     AssertTXNLsnOrder(rb);
     872             : }
     873             : 
     874             : /*
     875             :  * ReorderBufferTransferSnapToParent
     876             :  *      Transfer base snapshot from subtxn to top-level txn, if needed
     877             :  *
     878             :  * This is done if the top-level txn doesn't have a base snapshot, or if the
     879             :  * subtxn's base snapshot has an earlier LSN than the top-level txn's base
     880             :  * snapshot's LSN.  This can happen if there are no changes in the toplevel
     881             :  * txn but there are some in the subtxn, or the first change in subtxn has
     882             :  * earlier LSN than first change in the top-level txn and we learned about
     883             :  * their kinship only now.
     884             :  *
     885             :  * The subtransaction's snapshot is cleared regardless of the transfer
     886             :  * happening, since it's not needed anymore in either case.
     887             :  *
     888             :  * We do this as soon as we become aware of their kinship, to avoid queueing
     889             :  * extra snapshots to txns known-as-subtxns -- only top-level txns will
     890             :  * receive further snapshots.
     891             :  */
     892             : static void
     893        2002 : ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
     894             :                                   ReorderBufferTXN *subtxn)
     895             : {
     896             :     Assert(subtxn->toplevel_xid == txn->xid);
     897             : 
     898        2002 :     if (subtxn->base_snapshot != NULL)
     899             :     {
     900         792 :         if (txn->base_snapshot == NULL ||
     901         784 :             subtxn->base_snapshot_lsn < txn->base_snapshot_lsn)
     902             :         {
     903             :             /*
     904             :              * If the toplevel transaction already has a base snapshot but
     905             :              * it's newer than the subxact's, purge it.
     906             :              */
     907          10 :             if (txn->base_snapshot != NULL)
     908             :             {
     909           2 :                 SnapBuildSnapDecRefcount(txn->base_snapshot);
     910           2 :                 dlist_delete(&txn->base_snapshot_node);
     911             :             }
     912             : 
     913             :             /*
     914             :              * The snapshot is now the top transaction's; transfer it, and
     915             :              * adjust the list position of the top transaction in the list by
     916             :              * moving it to where the subtransaction is.
     917             :              */
     918          10 :             txn->base_snapshot = subtxn->base_snapshot;
     919          10 :             txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
     920          10 :             dlist_insert_before(&subtxn->base_snapshot_node,
     921             :                                 &txn->base_snapshot_node);
     922             : 
     923             :             /*
     924             :              * The subtransaction doesn't have a snapshot anymore (so it
     925             :              * mustn't be in the list.)
     926             :              */
     927          10 :             subtxn->base_snapshot = NULL;
     928          10 :             subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
     929          10 :             dlist_delete(&subtxn->base_snapshot_node);
     930             :         }
     931             :         else
     932             :         {
     933             :             /* Base snap of toplevel is fine, so subxact's is not needed */
     934         782 :             SnapBuildSnapDecRefcount(subtxn->base_snapshot);
     935         782 :             dlist_delete(&subtxn->base_snapshot_node);
     936         782 :             subtxn->base_snapshot = NULL;
     937         782 :             subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
     938             :         }
     939             :     }
     940        2002 : }
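/*
 * Illustrative sketch, not part of reorderbuffer.c: the decision made by
 * ReorderBufferTransferSnapToParent, reduced to its three outcomes.  The
 * types Snap and Txn and the helper snap_decref are hypothetical stand-ins,
 * and the maintenance of the txns_by_base_snapshot_lsn list is deliberately
 * left out, so this shows only who ends up owning which snapshot.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Snap
{
    int         refcount;
} Snap;

typedef struct Txn
{
    Snap       *base_snap;      /* NULL if the txn has no base snapshot */
    uint64_t    base_snap_lsn;  /* 0 stands in for InvalidXLogRecPtr */
} Txn;

static void
snap_decref(Snap *s)
{
    assert(s->refcount > 0);
    s->refcount--;
}

static void
transfer_snap_to_parent(Txn *top, Txn *sub)
{
    /* Outcome 1: the subtransaction has nothing to hand over. */
    if (sub->base_snap == NULL)
        return;

    if (top->base_snap == NULL ||
        sub->base_snap_lsn < top->base_snap_lsn)
    {
        /* Outcome 2: the subxact's snapshot is older (or the only one). */
        if (top->base_snap != NULL)
            snap_decref(top->base_snap);    /* purge the newer one */
        top->base_snap = sub->base_snap;
        top->base_snap_lsn = sub->base_snap_lsn;
    }
    else
    {
        /* Outcome 3: the top-level snapshot is fine; drop the subxact's. */
        snap_decref(sub->base_snap);
    }

    /* In both remaining outcomes the subxact keeps no snapshot. */
    sub->base_snap = NULL;
    sub->base_snap_lsn = 0;
}
```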
     941             : 
     942             : /*
     943             :  * Associate a subtransaction with its toplevel transaction at commit
     944             :  * time. No further changes may be added after this.
     945             :  */
     946             : void
     947         486 : ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
     948             :                          TransactionId subxid, XLogRecPtr commit_lsn,
     949             :                          XLogRecPtr end_lsn)
     950             : {
     951             :     ReorderBufferTXN *subtxn;
     952             : 
     953         486 :     subtxn = ReorderBufferTXNByXid(rb, subxid, false, NULL,
     954             :                                    InvalidXLogRecPtr, false);
     955             : 
     956             :     /*
     957             :      * No need to do anything if that subtxn didn't contain any changes
     958             :      */
     959         486 :     if (!subtxn)
     960          30 :         return;
     961             : 
     962         456 :     subtxn->final_lsn = commit_lsn;
     963         456 :     subtxn->end_lsn = end_lsn;
     964             : 
     965             :     /*
     966             :      * Assign this subxact as a child of the toplevel xact (no-op if already
     967             :      * done.)
     968             :      */
     969         456 :     ReorderBufferAssignChild(rb, xid, subxid, InvalidXLogRecPtr);
     970             : }
     971             : 
     972             : 
     973             : /*
     974             :  * Support for efficiently iterating over a transaction's and its
     975             :  * subtransactions' changes.
     976             :  *
     977             :  * We do this with a k-way merge between transactions/subtransactions. For that
     978             :  * we model the current heads of the different transactions as a binary heap
     979             :  * so we easily know which (sub-)transaction has the change with the smallest
     980             :  * lsn next.
     981             :  *
     982             :  * We assume the changes in individual transactions are already sorted by LSN.
     983             :  */
     984             : 
     985             : /*
     986             :  * Binary heap comparison function.
     987             :  */
     988             : static int
     989      103784 : ReorderBufferIterCompare(Datum a, Datum b, void *arg)
     990             : {
     991      103784 :     ReorderBufferIterTXNState *state = (ReorderBufferIterTXNState *) arg;
     992      103784 :     XLogRecPtr  pos_a = state->entries[DatumGetInt32(a)].lsn;
     993      103784 :     XLogRecPtr  pos_b = state->entries[DatumGetInt32(b)].lsn;
     994             : 
     995      103784 :     if (pos_a < pos_b)
     996      101148 :         return 1;
     997        2636 :     else if (pos_a == pos_b)
     998           2 :         return 0;
     999        2634 :     return -1;
    1000             : }
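/*
 * Illustrative sketch, not part of reorderbuffer.c: a k-way merge over
 * already-sorted LSN streams, using a max-heap with an inverted comparison
 * (returning 1 when a < b, as ReorderBufferIterCompare does) so that the
 * smallest LSN sits at the top.  Stream, Merge, merge_init and merge_next are
 * hypothetical names; the small array heap is a stand-in for lib/binaryheap,
 * and spilling to disk is not modelled.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STREAMS 8           /* hypothetical upper bound for the sketch */

typedef struct Stream
{
    const uint64_t *lsns;       /* sorted ascending */
    size_t      len;
    size_t      pos;            /* next unread element */
} Stream;

typedef struct Merge
{
    Stream     *streams;
    int         heap[MAX_STREAMS];  /* indexes into streams */
    int         size;
} Merge;

/* Inverted: "greater" means "has the smaller current LSN". */
static int
merge_cmp(const Merge *m, int a, int b)
{
    uint64_t    pa = m->streams[a].lsns[m->streams[a].pos];
    uint64_t    pb = m->streams[b].lsns[m->streams[b].pos];

    if (pa < pb)
        return 1;
    else if (pa == pb)
        return 0;
    return -1;
}

static void
sift_down(Merge *m, int i)
{
    for (;;)
    {
        int         l = 2 * i + 1;
        int         r = l + 1;
        int         best = i;

        if (l < m->size && merge_cmp(m, m->heap[l], m->heap[best]) > 0)
            best = l;
        if (r < m->size && merge_cmp(m, m->heap[r], m->heap[best]) > 0)
            best = r;
        if (best == i)
            return;

        int         tmp = m->heap[i];

        m->heap[i] = m->heap[best];
        m->heap[best] = tmp;
        i = best;
    }
}

/* Add every non-empty stream unordered, then build the heap in one pass. */
static void
merge_init(Merge *m, Stream *streams, int nstreams)
{
    assert(nstreams <= MAX_STREAMS);
    m->streams = streams;
    m->size = 0;
    for (int i = 0; i < nstreams; i++)
    {
        if (streams[i].len > 0)
        {
            streams[i].pos = 0;
            m->heap[m->size++] = i;
        }
    }
    for (int i = m->size / 2 - 1; i >= 0; i--)
        sift_down(m, i);
}

/* Store the next LSN in *out and return 1, or return 0 when exhausted. */
static int
merge_next(Merge *m, uint64_t *out)
{
    if (m->size == 0)
        return 0;

    Stream     *s = &m->streams[m->heap[0]];

    *out = s->lsns[s->pos++];
    if (s->pos == s->len)
        m->heap[0] = m->heap[--m->size];    /* stream done: drop it */
    sift_down(m, 0);                        /* restore heap order */
    return 1;
}
```

/*
 * Each merge_next call costs O(log k) for k streams, which is why a heap is
 * preferable to rescanning every stream head for each returned change.
 */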
    1001             : 
    1002             : /*
    1003             :  * Allocate & initialize an iterator which iterates in lsn order over a
    1004             :  * transaction and all its subtransactions.
    1005             :  *
    1006             :  * Note: The iterator state is returned through iter_state parameter rather
    1007             :  * than the function's return value.  This is because the state gets cleaned up
    1008             :  * in a PG_CATCH block in the caller, so we want to make sure the caller gets
    1009             :  * back the state even if this function throws an exception.
    1010             :  */
    1011             : static void
    1012         974 : ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
    1013             :                          ReorderBufferIterTXNState *volatile *iter_state)
    1014             : {
    1015         974 :     Size        nr_txns = 0;
    1016             :     ReorderBufferIterTXNState *state;
    1017             :     dlist_iter  cur_txn_i;
    1018             :     int32       off;
    1019             : 
    1020         974 :     *iter_state = NULL;
    1021             : 
    1022             :     /*
    1023             :      * Calculate the size of our heap: one element for every transaction that
    1024             :      * contains changes.  (Besides the transactions already in the reorder
    1025             :      * buffer, we count the one we were directly passed.)
    1026             :      */
    1027         974 :     if (txn->nentries > 0)
    1028         948 :         nr_txns++;
    1029             : 
    1030        1430 :     dlist_foreach(cur_txn_i, &txn->subtxns)
    1031             :     {
    1032             :         ReorderBufferTXN *cur_txn;
    1033             : 
    1034         456 :         cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
    1035             : 
    1036         456 :         if (cur_txn->nentries > 0)
    1037         322 :             nr_txns++;
    1038             :     }
    1039             : 
    1040             :     /* allocate iteration state */
    1041             :     state = (ReorderBufferIterTXNState *)
    1042         974 :         MemoryContextAllocZero(rb->context,
    1043             :                                sizeof(ReorderBufferIterTXNState) +
    1044         974 :                                sizeof(ReorderBufferIterTXNEntry) * nr_txns);
    1045             : 
    1046         974 :     state->nr_txns = nr_txns;
    1047         974 :     dlist_init(&state->old_change);
    1048             : 
    1049        2244 :     for (off = 0; off < state->nr_txns; off++)
    1050             :     {
    1051        1270 :         state->entries[off].file.vfd = -1;
    1052        1270 :         state->entries[off].segno = 0;
    1053             :     }
    1054             : 
    1055             :     /* allocate heap */
    1056         974 :     state->heap = binaryheap_allocate(state->nr_txns,
    1057             :                                       ReorderBufferIterCompare,
    1058             :                                       state);
    1059             : 
    1060             :     /* Now that the state fields are initialized, it is safe to return it. */
    1061         974 :     *iter_state = state;
    1062             : 
    1063             :     /*
    1064             :      * Now insert items into the binary heap, in an unordered fashion.  (We
    1065             :      * will run a heap assembly step at the end; this is more efficient.)
    1066             :      */
    1067             : 
    1068         974 :     off = 0;
    1069             : 
    1070             :     /* add toplevel transaction if it contains changes */
    1071         974 :     if (txn->nentries > 0)
    1072             :     {
    1073             :         ReorderBufferChange *cur_change;
    1074             : 
    1075         948 :         if (rbtxn_is_serialized(txn))
    1076             :         {
    1077             :             /* serialize remaining changes */
    1078          26 :             ReorderBufferSerializeTXN(rb, txn);
    1079          26 :             ReorderBufferRestoreChanges(rb, txn, &state->entries[off].file,
    1080             :                                         &state->entries[off].segno);
    1081             :         }
    1082             : 
    1083         948 :         cur_change = dlist_head_element(ReorderBufferChange, node,
    1084             :                                         &txn->changes);
    1085             : 
    1086         948 :         state->entries[off].lsn = cur_change->lsn;
    1087         948 :         state->entries[off].change = cur_change;
    1088         948 :         state->entries[off].txn = txn;
    1089             : 
    1090         948 :         binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
    1091             :     }
    1092             : 
    1093             :     /* add subtransactions if they contain changes */
    1094        1430 :     dlist_foreach(cur_txn_i, &txn->subtxns)
    1095             :     {
    1096             :         ReorderBufferTXN *cur_txn;
    1097             : 
    1098         456 :         cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
    1099             : 
    1100         456 :         if (cur_txn->nentries > 0)
    1101             :         {
    1102             :             ReorderBufferChange *cur_change;
    1103             : 
    1104         322 :             if (rbtxn_is_serialized(cur_txn))
    1105             :             {
    1106             :                 /* serialize remaining changes */
    1107          32 :                 ReorderBufferSerializeTXN(rb, cur_txn);
    1108          32 :                 ReorderBufferRestoreChanges(rb, cur_txn,
    1109             :                                             &state->entries[off].file,
    1110             :                                             &state->entries[off].segno);
    1111             :             }
    1112         322 :             cur_change = dlist_head_element(ReorderBufferChange, node,
    1113             :                                             &cur_txn->changes);
    1114             : 
    1115         322 :             state->entries[off].lsn = cur_change->lsn;
    1116         322 :             state->entries[off].change = cur_change;
    1117         322 :             state->entries[off].txn = cur_txn;
    1118             : 
    1119         322 :             binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
    1120             :         }
    1121             :     }
    1122             : 
    1123             :     /* assemble a valid binary heap */
    1124         974 :     binaryheap_build(state->heap);
    1125         974 : }
    1126             : 
    1127             : /*
    1128             :  * Return the next change when iterating over a transaction and its
    1129             :  * subtransactions.
    1130             :  *
    1131             :  * Returns NULL when no further changes exist.
    1132             :  */
    1133             : static ReorderBufferChange *
    1134      315802 : ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
    1135             : {
    1136             :     ReorderBufferChange *change;
    1137             :     ReorderBufferIterTXNEntry *entry;
    1138             :     int32       off;
    1139             : 
    1140             :     /* nothing there anymore */
    1141      315802 :     if (state->heap->bh_size == 0)
    1142         974 :         return NULL;
    1143             : 
    1144      314828 :     off = DatumGetInt32(binaryheap_first(state->heap));
    1145      314828 :     entry = &state->entries[off];
    1146             : 
    1147             :     /* free memory we might have "leaked" in the previous *Next call */
    1148      314828 :     if (!dlist_is_empty(&state->old_change))
    1149             :     {
    1150          78 :         change = dlist_container(ReorderBufferChange, node,
    1151             :                                  dlist_pop_head_node(&state->old_change));
    1152          78 :         ReorderBufferReturnChange(rb, change);
    1153             :         Assert(dlist_is_empty(&state->old_change));
    1154             :     }
    1155             : 
    1156      314828 :     change = entry->change;
    1157             : 
    1158             :     /*
    1159             :      * update heap with information about which transaction has the next
    1160             :      * relevant change in LSN order
    1161             :      */
    1162             : 
    1163             :     /* there are in-memory changes */
    1164      314828 :     if (dlist_has_next(&entry->txn->changes, &entry->change->node))
    1165             :     {
    1166      313498 :         dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
    1167      313498 :         ReorderBufferChange *next_change =
    1168      313498 :         dlist_container(ReorderBufferChange, node, next);
    1169             : 
    1170             :         /* txn stays the same */
    1171      313498 :         state->entries[off].lsn = next_change->lsn;
    1172      313498 :         state->entries[off].change = next_change;
    1173             : 
    1174      313498 :         binaryheap_replace_first(state->heap, Int32GetDatum(off));
    1175      313498 :         return change;
    1176             :     }
    1177             : 
    1178             :     /* try to load changes from disk */
    1179        1330 :     if (entry->txn->nentries != entry->txn->nentries_mem)
    1180             :     {
    1181             :         /*
    1182             :          * Ugly: restoring changes will reuse *Change records, thus delete the
    1183             :          * current one from the per-tx list and only free in the next call.
    1184             :          */
    1185         108 :         dlist_delete(&change->node);
    1186         108 :         dlist_push_tail(&state->old_change, &change->node);
    1187             : 
    1188         108 :         if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->file,
    1189             :                                         &state->entries[off].segno))
    1190             :         {
    1191             :             /* successfully restored changes from disk */
    1192             :             ReorderBufferChange *next_change =
    1193          60 :             dlist_head_element(ReorderBufferChange, node,
    1194             :                                &entry->txn->changes);
    1195             : 
    1196          60 :             elog(DEBUG2, "restored %u/%u changes from disk",
    1197             :                  (uint32) entry->txn->nentries_mem,
    1198             :                  (uint32) entry->txn->nentries);
    1199             : 
    1200             :             Assert(entry->txn->nentries_mem);
    1201             :             /* txn stays the same */
    1202          60 :             state->entries[off].lsn = next_change->lsn;
    1203          60 :             state->entries[off].change = next_change;
    1204          60 :             binaryheap_replace_first(state->heap, Int32GetDatum(off));
    1205             : 
    1206          60 :             return change;
    1207             :         }
    1208             :     }
    1209             : 
    1210             :     /* ok, no changes there anymore, remove */
    1211        1270 :     binaryheap_remove_first(state->heap);
    1212             : 
    1213        1270 :     return change;
    1214             : }
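
The iterator above is a k-way merge: each (sub)transaction contributes one LSN-sorted list, and a binary min-heap keyed on each list's current head decides which list supplies the next change. The following is a minimal standalone sketch of that idea, not the PostgreSQL implementation. The names `Stream`, `MergeIter`, `merge_init` and `merge_next` are hypothetical, plain `int` values stand in for LSNs, and the fixed heap capacity of 8 is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One sorted input stream; pos is the cursor into vals. */
typedef struct Stream
{
    const int  *vals;
    size_t      len;
    size_t      pos;
} Stream;

/* Hypothetical iterator: min-heap of stream indexes, keyed by current value. */
typedef struct MergeIter
{
    Stream     *streams;
    size_t      heap[8];        /* arbitrary capacity for this sketch */
    size_t      size;
} MergeIter;

/* Current head value of the stream stored in the given heap slot. */
static int
cur_val(const MergeIter *it, size_t slot)
{
    const Stream *s = &it->streams[it->heap[slot]];

    return s->vals[s->pos];
}

/* Restore the heap property downwards from slot i. */
static void
sift_down(MergeIter *it, size_t i)
{
    for (;;)
    {
        size_t      l = 2 * i + 1;
        size_t      r = l + 1;
        size_t      m = i;
        size_t      tmp;

        if (l < it->size && cur_val(it, l) < cur_val(it, m))
            m = l;
        if (r < it->size && cur_val(it, r) < cur_val(it, m))
            m = r;
        if (m == i)
            break;
        tmp = it->heap[i];
        it->heap[i] = it->heap[m];
        it->heap[m] = tmp;
        i = m;
    }
}

/* Add every non-empty stream unordered, then build a valid heap. */
void
merge_init(MergeIter *it, Stream *streams, size_t nstreams)
{
    size_t      i;

    assert(nstreams <= 8);
    it->streams = streams;
    it->size = 0;
    for (i = 0; i < nstreams; i++)
    {
        streams[i].pos = 0;
        if (streams[i].len > 0)
            it->heap[it->size++] = i;
    }
    for (i = it->size / 2; i-- > 0;)
        sift_down(it, i);
}

/* Store the next smallest value in *out; return 0 once all are exhausted. */
int
merge_next(MergeIter *it, int *out)
{
    Stream     *s;

    if (it->size == 0)
        return 0;
    s = &it->streams[it->heap[0]];
    *out = s->vals[s->pos++];
    if (s->pos == s->len)
        it->heap[0] = it->heap[--it->size];     /* drop exhausted stream */
    sift_down(it, 0);           /* head key changed: re-establish order */
    return 1;
}
```

Replacing the head in place and sifting down costs one O(log k) pass per returned element, which is why the real code prefers a replace-first operation over a remove followed by an add when the same transaction still has changes left.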
    1215             : 
    1216             : /*
    1217             :  * Deallocate the iterator
    1218             :  */
    1219             : static void
    1220         974 : ReorderBufferIterTXNFinish(ReorderBuffer *rb,
    1221             :                            ReorderBufferIterTXNState *state)
    1222             : {
    1223             :     int32       off;
    1224             : 
    1225        2244 :     for (off = 0; off < state->nr_txns; off++)
    1226             :     {
    1227        1270 :         if (state->entries[off].file.vfd != -1)
    1228           0 :             FileClose(state->entries[off].file.vfd);
    1229             :     }
    1230             : 
    1231             :     /* free memory we might have "leaked" in the last *Next call */
    1232         974 :     if (!dlist_is_empty(&state->old_change))
    1233             :     {
    1234             :         ReorderBufferChange *change;
    1235             : 
    1236          28 :         change = dlist_container(ReorderBufferChange, node,
    1237             :                                  dlist_pop_head_node(&state->old_change));
    1238          28 :         ReorderBufferReturnChange(rb, change);
    1239             :         Assert(dlist_is_empty(&state->old_change));
    1240             :     }
    1241             : 
    1242         974 :     binaryheap_free(state->heap);
    1243         974 :     pfree(state);
    1244         974 : }
    1245             : 
    1246             : /*
    1247             :  * Cleanup the contents of a transaction, usually after the transaction
    1248             :  * committed or aborted.
    1249             :  */
    1250             : static void
    1251        4814 : ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
    1252             : {
    1253             :     bool        found;
    1254             :     dlist_mutable_iter iter;
    1255             : 
    1256             :     /* cleanup subtransactions & their changes */
    1257        5270 :     dlist_foreach_modify(iter, &txn->subtxns)
    1258             :     {
    1259             :         ReorderBufferTXN *subtxn;
    1260             : 
    1261         456 :         subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
    1262             : 
    1263             :         /*
    1264             :          * Subtransactions are always associated with the toplevel TXN, even
    1265             :          * if they originally happened inside another subtxn, so we won't
    1266             :          * ever recurse more than one level deep here.
    1267             :          */
    1268             :         Assert(rbtxn_is_known_subxact(subtxn));
    1269             :         Assert(subtxn->nsubtxns == 0);
    1270             : 
    1271         456 :         ReorderBufferCleanupTXN(rb, subtxn);
    1272             :     }
    1273             : 
    1274             :     /* cleanup changes in the toplevel txn */
    1275      102144 :     dlist_foreach_modify(iter, &txn->changes)
    1276             :     {
    1277             :         ReorderBufferChange *change;
    1278             : 
    1279       97330 :         change = dlist_container(ReorderBufferChange, node, iter.cur);
    1280             : 
    1281             :         /* Check we're not mixing changes from different transactions. */
    1282             :         Assert(change->txn == txn);
    1283             : 
    1284       97330 :         ReorderBufferReturnChange(rb, change);
    1285             :     }
    1286             : 
    1287             :     /*
    1288             :      * Cleanup the tuplecids we stored for decoding catalog snapshot access.
    1289             :      * They are always stored in the toplevel transaction.
    1290             :      */
    1291       32564 :     dlist_foreach_modify(iter, &txn->tuplecids)
    1292             :     {
    1293             :         ReorderBufferChange *change;
    1294             : 
    1295       27750 :         change = dlist_container(ReorderBufferChange, node, iter.cur);
    1296             : 
    1297             :         /* Check we're not mixing changes from different transactions. */
    1298             :         Assert(change->txn == txn);
    1299             :         Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
    1300             : 
    1301       27750 :         ReorderBufferReturnChange(rb, change);
    1302             :     }
    1303             : 
    1304             :     /*
    1305             :      * Cleanup the base snapshot, if set.
    1306             :      */
    1307        4814 :     if (txn->base_snapshot != NULL)
    1308             :     {
    1309        2778 :         SnapBuildSnapDecRefcount(txn->base_snapshot);
    1310        2778 :         dlist_delete(&txn->base_snapshot_node);
    1311             :     }
    1312             : 
    1313             :     /*
    1314             :      * Remove TXN from its containing list.
    1315             :      *
    1316             :      * Note: if txn is known as subxact, we are deleting the TXN from its
    1317             :      * parent's list of known subxacts; this leaves the parent's nsubtxns
    1318             :      * count too high, but we don't care.  Otherwise, we are deleting the TXN
    1319             :      * from the LSN-ordered list of toplevel TXNs.
    1320             :      */
    1321        4814 :     dlist_delete(&txn->node);
    1322             : 
    1323             :     /* now remove reference from buffer */
    1324        4814 :     hash_search(rb->by_txn,
    1325        4814 :                 (void *) &txn->xid,
    1326             :                 HASH_REMOVE,
    1327             :                 &found);
    1328             :     Assert(found);
    1329             : 
    1330             :     /* remove entries spilled to disk */
    1331        4814 :     if (rbtxn_is_serialized(txn))
    1332         318 :         ReorderBufferRestoreCleanup(rb, txn);
    1333             : 
    1334             :     /* deallocate */
    1335        4814 :     ReorderBufferReturnTXN(rb, txn);
    1336        4814 : }
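
The cleanup loops above free each list element while walking the list, which only works because the iteration macro remembers the successor before the current node is released. The following is a minimal sketch of that deletion-safe traversal on a hand-rolled circular doubly linked list. The types `Node` and `List` and the functions `list_init`, `list_push_tail` and `list_free_all` are hypothetical stand-ins, not PostgreSQL's ilist API.

```c
#include <assert.h>
#include <stdlib.h>

/* Circular doubly linked list with a sentinel head node. */
typedef struct Node
{
    struct Node *prev;
    struct Node *next;
    int         value;
} Node;

typedef struct List
{
    Node        head;           /* sentinel; head.next is the first element */
} List;

void
list_init(List *l)
{
    l->head.prev = l->head.next = &l->head;
}

/* Append a value; returns 0 on allocation failure, 1 on success. */
int
list_push_tail(List *l, int value)
{
    Node       *n = malloc(sizeof(*n));

    if (n == NULL)
        return 0;
    n->value = value;
    n->prev = l->head.prev;
    n->next = &l->head;
    n->prev->next = n;
    l->head.prev = n;
    return 1;
}

/* Free every element; returns how many nodes were released. */
size_t
list_free_all(List *l)
{
    size_t      freed = 0;
    Node       *cur = l->head.next;

    while (cur != &l->head)
    {
        Node       *next = cur->next;   /* save before cur is freed */

        free(cur);
        cur = next;
        freed++;
    }
    list_init(l);               /* leave the list empty and reusable */
    return freed;
}
```

Reading `cur->next` after `free(cur)` would be a use-after-free, so the successor has to be captured first; a plain (non-modifying) iteration makes no such promise.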
    1337             : 
    1338             : /*
    1339             :  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
    1340             :  * HeapTupleSatisfiesHistoricMVCC.
    1341             :  */
    1342             : static void
    1343         974 : ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
    1344             : {
    1345             :     dlist_iter  iter;
    1346             :     HASHCTL     hash_ctl;
    1347             : 
    1348         974 :     if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
    1349         608 :         return;
    1350             : 
    1351         366 :     memset(&hash_ctl, 0, sizeof(hash_ctl));
    1352             : 
    1353         366 :     hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
    1354         366 :     hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
    1355         366 :     hash_ctl.hcxt = rb->context;
    1356             : 
    1357             :     /*
    1358             :      * create the hash with the exact number of to-be-stored tuplecids from
    1359             :      * the start
    1360             :      */
    1361         366 :     txn->tuplecid_hash =
    1362         366 :         hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
    1363             :                     HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
    1364             : 
    1365       11964 :     dlist_foreach(iter, &txn->tuplecids)
    1366             :     {
    1367             :         ReorderBufferTupleCidKey key;
    1368             :         ReorderBufferTupleCidEnt *ent;
    1369             :         bool        found;
    1370             :         ReorderBufferChange *change;
    1371             : 
    1372       11598 :         change = dlist_container(ReorderBufferChange, node, iter.cur);
    1373             : 
    1374             :         Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
    1375             : 
    1376             :         /* be careful about padding */
    1377       11598 :         memset(&key, 0, sizeof(ReorderBufferTupleCidKey));
    1378             : 
    1379       11598 :         key.relnode = change->data.tuplecid.node;
    1380             : 
    1381       11598 :         ItemPointerCopy(&change->data.tuplecid.tid,
    1382             :                         &key.tid);
    1383             : 
    1384             :         ent = (ReorderBufferTupleCidEnt *)
    1385       11598 :             hash_search(txn->tuplecid_hash,
    1386             :                         (void *) &key,
    1387             :                         HASH_ENTER,
    1388             :                         &found);
    1389       11598 :         if (!found)
    1390             :         {
    1391        8690 :             ent->cmin = change->data.tuplecid.cmin;
    1392        8690 :             ent->cmax = change->data.tuplecid.cmax;
    1393        8690 :             ent->combocid = change->data.tuplecid.combocid;
    1394             :         }
    1395             :         else
    1396             :         {
    1397             :             /*
    1398             :              * Maybe we already saw this tuple before in this transaction, but
    1399             :              * if so it must have the same cmin.
    1400             :              */
    1401             :             Assert(ent->cmin == change->data.tuplecid.cmin);
    1402             : 
    1403             :             /*
    1404             :              * cmax may be initially invalid, but once set it can only grow,
    1405             :              * and never become invalid again.
    1406             :              */
    1407             :             Assert((ent->cmax == InvalidCommandId) ||
    1408             :                    ((change->data.tuplecid.cmax != InvalidCommandId) &&
    1409             :                     (change->data.tuplecid.cmax > ent->cmax)));
    1410        2908 :             ent->cmax = change->data.tuplecid.cmax;
    1411             :         }
    1412             :     }
    1413             : }
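
The "be careful about padding" step above matters because a blob-style hash key is hashed and compared byte by byte, so any uninitialized padding inside the key struct would make two logically equal keys look different. The following is a minimal sketch of the idiom, assuming a hypothetical `DemoKey` struct whose layout leaves padding after the 16-bit member on typical ABIs. It relies, as the original code does, on compilers leaving padding bytes untouched when individual members are assigned after the `memset`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key: on common ABIs two padding bytes follow offset. */
typedef struct DemoKey
{
    uint32_t    relnode;
    uint16_t    offset;
} DemoKey;

/* Zero the whole key first so padding bytes have a defined value. */
void
demo_key_init(DemoKey *key, uint32_t relnode, uint16_t offset)
{
    memset(key, 0, sizeof(*key));
    key->relnode = relnode;
    key->offset = offset;
}

/* Byte-wise comparison, the way a blob-keyed hash table compares keys. */
int
demo_key_equal(const DemoKey *a, const DemoKey *b)
{
    return memcmp(a, b, sizeof(DemoKey)) == 0;
}
```

Without the `memset`, two keys built on the stack with identical members could still differ in their padding bytes and would then land in different hash entries.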
    1414             : 
    1415             : /*
    1416             :  * Copy a provided snapshot so we can modify it privately. This is needed so
    1417             :  * that catalog modifying transactions can look into intermediate catalog
    1418             :  * states.
    1419             :  */
    1420             : static Snapshot
    1421         734 : ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
    1422             :                       ReorderBufferTXN *txn, CommandId cid)
    1423             : {
    1424             :     Snapshot    snap;
    1425             :     dlist_iter  iter;
    1426         734 :     int         i = 0;
    1427             :     Size        size;
    1428             : 
    1429         734 :     size = sizeof(SnapshotData) +
    1430        1468 :         sizeof(TransactionId) * orig_snap->xcnt +
    1431         734 :         sizeof(TransactionId) * (txn->nsubtxns + 1);
    1432             : 
    1433         734 :     snap = MemoryContextAllocZero(rb->context, size);
    1434         734 :     memcpy(snap, orig_snap, sizeof(SnapshotData));
    1435             : 
    1436         734 :     snap->copied = true;
    1437         734 :     snap->active_count = 1;      /* mark as active so nobody frees it */
    1438         734 :     snap->regd_count = 0;
    1439         734 :     snap->xip = (TransactionId *) (snap + 1);
    1440             : 
    1441         734 :     memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
    1442             : 
    1443             :     /*
    1444             :      * snap->subxip contains all txids that belong to our transaction which we
    1445             :      * need to check via cmin/cmax. That's why we store the toplevel
    1446             :      * transaction in there as well.
    1447             :      */
    1448         734 :     snap->subxip = snap->xip + snap->xcnt;
    1449         734 :     snap->subxip[i++] = txn->xid;
    1450             : 
    1451             :     /*
    1452             :      * txn->nsubtxns isn't decreased when subtransactions abort, so count
    1453             :      * manually. Since it's an upper bound it is safe to use it for the
    1454             :      * allocation above.
    1455             :      */
    1456         734 :     snap->subxcnt = 1;
    1457             : 
    1458         740 :     dlist_foreach(iter, &txn->subtxns)
    1459             :     {
    1460             :         ReorderBufferTXN *sub_txn;
    1461             : 
    1462           6 :         sub_txn = dlist_container(ReorderBufferTXN, node, iter.cur);
    1463           6 :         snap->subxip[i++] = sub_txn->xid;
    1464           6 :         snap->subxcnt++;
    1465             :     }
    1466             : 
    1467             :     /* sort so we can bsearch() later */
    1468         734 :     qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);
    1469             : 
    1470             :     /* store the specified current CommandId */
    1471         734 :     snap->curcid = cid;
    1472             : 
    1473         734 :     return snap;
    1474             : }
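
The snapshot copy above uses a single allocation that holds the struct followed by both xid arrays, then sorts the sub-xid array so that later membership checks can use binary search. The following is a minimal standalone sketch of that layout, assuming a hypothetical `MiniSnap` struct and plain `uint32_t` xids compared numerically. It uses `calloc` in place of a memory-context allocator, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;

/* Hypothetical, heavily reduced snapshot; arrays live right after it. */
typedef struct MiniSnap
{
    Xid        *xip;
    size_t      xcnt;
    Xid        *subxip;
    size_t      subxcnt;
} MiniSnap;

/* Plain numeric ordering, suitable for both qsort() and bsearch(). */
static int
xid_cmp(const void *a, const void *b)
{
    Xid         x = *(const Xid *) a;
    Xid         y = *(const Xid *) b;

    return (x > y) - (x < y);
}

/* Copy into one block: [MiniSnap][xip...][top_xid + subxids...]. */
MiniSnap *
mini_snap_copy(const Xid *xip, size_t xcnt,
               Xid top_xid, const Xid *subxids, size_t nsub)
{
    size_t      size = sizeof(MiniSnap) +
        sizeof(Xid) * xcnt +
        sizeof(Xid) * (nsub + 1);
    MiniSnap   *snap = calloc(1, size);

    if (snap == NULL)
        return NULL;

    snap->xip = (Xid *) (snap + 1);     /* first array follows the struct */
    snap->xcnt = xcnt;
    if (xcnt > 0)
        memcpy(snap->xip, xip, sizeof(Xid) * xcnt);

    snap->subxip = snap->xip + xcnt;    /* second array follows the first */
    snap->subxip[0] = top_xid;          /* toplevel xid is stored here too */
    if (nsub > 0)
        memcpy(snap->subxip + 1, subxids, sizeof(Xid) * nsub);
    snap->subxcnt = nsub + 1;

    /* sort so membership checks can use bsearch() */
    qsort(snap->subxip, snap->subxcnt, sizeof(Xid), xid_cmp);
    return snap;
}

/* Does xid belong to the copied transaction (toplevel or subxact)? */
int
mini_snap_is_own_xid(const MiniSnap *snap, Xid xid)
{
    return bsearch(&xid, snap->subxip, snap->subxcnt,
                   sizeof(Xid), xid_cmp) != NULL;
}
```

Because everything lives in one block, releasing the copy is a single `free(snap)`, which mirrors how the copied snapshot above is released with one `pfree()`.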
    1475             : 
    1476             : /*
    1477             :  * Free a previously ReorderBufferCopySnap'ed snapshot
    1478             :  */
    1479             : static void
    1480        1736 : ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
    1481             : {
    1482        1736 :     if (snap->copied)
    1483         738 :         pfree(snap);
    1484             :     else
    1485         998 :         SnapBuildSnapDecRefcount(snap);
    1486        1736 : }
    1487             : 
    1488             : /*
    1489             :  * Perform the replay of a transaction and its non-aborted subtransactions.
    1490             :  *
    1491             :  * Subtransactions must already have been processed by
    1492             :  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
    1493             :  * transaction with ReorderBufferAssignChild().
    1494             :  *
    1495             :  * We currently can only decode a transaction's contents when its commit
    1496             :  * record is read because that's the only place where we know about cache
    1497             :  * invalidations. Thus, once a toplevel commit is read, we iterate over the top
    1498             :  * and subtransactions (using a k-way merge) and replay the changes in lsn
    1499             :  * order.
    1500             :  */
    1501             : void
    1502         978 : ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
    1503             :                     XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
    1504             :                     TimestampTz commit_time,
    1505             :                     RepOriginId origin_id, XLogRecPtr origin_lsn)
    1506             : {
    1507             :     ReorderBufferTXN *txn;
    1508             :     volatile Snapshot snapshot_now;
    1509         978 :     volatile CommandId command_id = FirstCommandId;
    1510             :     bool        using_subtxn;
    1511         978 :     ReorderBufferIterTXNState *volatile iterstate = NULL;
    1512             : 
    1513         978 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    1514             :                                 false);
    1515             : 
    1516             :     /* unknown transaction, nothing to replay */
    1517         978 :     if (txn == NULL)
    1518           2 :         return;
    1519             : 
    1520         976 :     txn->final_lsn = commit_lsn;
    1521         976 :     txn->end_lsn = end_lsn;
    1522         976 :     txn->commit_time = commit_time;
    1523         976 :     txn->origin_id = origin_id;
    1524         976 :     txn->origin_lsn = origin_lsn;
    1525             : 
    1526             :     /*
    1527             :      * If this transaction has no snapshot, it didn't make any changes to the
    1528             :      * database, so there's nothing to decode.  Note that
    1529             :      * ReorderBufferCommitChild will have transferred any snapshots from
    1530             :      * subtransactions if there were any.
    1531             :      */
    1532         976 :     if (txn->base_snapshot == NULL)
    1533             :     {
    1534             :         Assert(txn->ninvalidations == 0);
    1535           2 :         ReorderBufferCleanupTXN(rb, txn);
    1536           2 :         return;
    1537             :     }
    1538             : 
    1539         974 :     snapshot_now = txn->base_snapshot;
    1540             : 
    1541             :     /* build data to be able to lookup the CommandIds of catalog tuples */
    1542         974 :     ReorderBufferBuildTupleCidHash(rb, txn);
    1543             : 
    1544             :     /* setup the initial snapshot */
    1545         974 :     SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
    1546             : 
    1547             :     /*
    1548             :      * Decoding needs access to syscaches et al., which in turn use
    1549             :      * heavyweight locks and such. Thus we need to have enough state around to
    1550             :      * keep track of those.  The easiest way is to simply use a transaction
    1551             :      * internally.  That also allows us to easily enforce that nothing writes
    1552             :      * to the database by checking for xid assignments.
    1553             :      *
    1554             :      * When we're called via the SQL SRF there's already a transaction
    1555             :      * started, so start an explicit subtransaction there.
    1556             :      */
    1557         974 :     using_subtxn = IsTransactionOrTransactionBlock();
    1558             : 
    1559         974 :     PG_TRY();
    1560             :     {
    1561             :         ReorderBufferChange *change;
    1562         974 :         ReorderBufferChange *specinsert = NULL;
    1563             : 
    1564         974 :         if (using_subtxn)
    1565         660 :             BeginInternalSubTransaction("replay");
    1566             :         else
    1567         314 :             StartTransactionCommand();
    1568             : 
    1569         974 :         rb->begin(rb, txn);
    1570             : 
    1571         974 :         ReorderBufferIterTXNInit(rb, txn, &iterstate);
    1572      316776 :         while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
    1573             :         {
    1574      314828 :             Relation    relation = NULL;
    1575             :             Oid         reloid;
    1576             : 
    1577      314828 :             switch (change->action)
    1578             :             {
    1579        3564 :                 case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    1580             : 
    1581             :                     /*
    1582             :                      * Confirmation for speculative insertion arrived. Simply
    1583             :                      * use as a normal record. It'll be cleaned up at the end
    1584             :                      * of INSERT processing.
    1585             :                      */
    1586        3564 :                     if (specinsert == NULL)
    1587           0 :                         elog(ERROR, "invalid ordering of speculative insertion changes");
    1588             :                     Assert(specinsert->data.tp.oldtuple == NULL);
    1589        3564 :                     change = specinsert;
    1590        3564 :                     change->action = REORDER_BUFFER_CHANGE_INSERT;
    1591             : 
    1592             :                     /* intentionally fall through */
    1593      299438 :                 case REORDER_BUFFER_CHANGE_INSERT:
    1594             :                 case REORDER_BUFFER_CHANGE_UPDATE:
    1595             :                 case REORDER_BUFFER_CHANGE_DELETE:
    1596             :                     Assert(snapshot_now);
    1597             : 
    1598      299438 :                     reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
    1599             :                                                 change->data.tp.relnode.relNode);
    1600             : 
    1601             :                     /*
    1602             :                      * Mapped catalog tuple without data, emitted while
    1603             :                      * catalog table was in the process of being rewritten. We
    1604             :                      * can fail to look up the relfilenode, because the
    1605             :                      * relmapper has no "historic" view, in contrast to the
    1606             :                      * normal catalog during decoding. Thus repeated
    1607             :                      * rewrites can cause a lookup failure. That's OK because
    1608             :                      * we do not decode catalog changes anyway. Normally such
    1609             :                      * tuples would be skipped over below, but we can't
    1610             :                      * identify whether the table should be logically logged
    1611             :                      * without mapping the relfilenode to the oid.
    1612             :                      */
    1613      299438 :                     if (reloid == InvalidOid &&
    1614         154 :                         change->data.tp.newtuple == NULL &&
    1615         154 :                         change->data.tp.oldtuple == NULL)
    1616         154 :                         goto change_done;
    1617      299284 :                     else if (reloid == InvalidOid)
    1618           0 :                         elog(ERROR, "could not map filenode \"%s\" to relation OID",
    1619             :                              relpathperm(change->data.tp.relnode,
    1620             :                                          MAIN_FORKNUM));
    1621             : 
    1622      299284 :                     relation = RelationIdGetRelation(reloid);
    1623             : 
    1624      299284 :                     if (!RelationIsValid(relation))
    1625           0 :                         elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
    1626             :                              reloid,
    1627             :                              relpathperm(change->data.tp.relnode,
    1628             :                                          MAIN_FORKNUM));
    1629             : 
    1630      299284 :                     if (!RelationIsLogicallyLogged(relation))
    1631        3858 :                         goto change_done;
    1632             : 
    1633             :                     /*
    1634             :                      * Ignore temporary heaps created during DDL unless the
    1635             :                      * plugin has asked for them.
    1636             :                      */
    1637      295426 :                     if (relation->rd_rel->relrewrite && !rb->output_rewrites)
    1638          40 :                         goto change_done;
    1639             : 
    1640             :                     /*
    1641             :                      * For now ignore sequence changes entirely. Most of the
    1642             :                      * time they don't log changes using records we
    1643             :                      * understand, so it doesn't make sense to handle the few
    1644             :                      * cases we do.
    1645             :                      */
    1646      295386 :                     if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
    1647           0 :                         goto change_done;
    1648             : 
    1649             :                     /* user-triggered change */
    1650      295386 :                     if (!IsToastRelation(relation))
    1651             :                     {
    1652      294316 :                         ReorderBufferToastReplace(rb, txn, relation, change);
    1653      294316 :                         rb->apply_change(rb, txn, relation, change);
    1654             : 
    1655             :                         /*
    1656             :                          * Only clear reassembled toast chunks if we're sure
    1657             :                          * they're not required anymore. The creator of the
    1658             :                          * tuple tells us.
    1659             :                          */
    1660      294316 :                         if (change->data.tp.clear_toast_afterwards)
    1661      293912 :                             ReorderBufferToastReset(rb, txn);
    1662             :                     }
    1663             :                     /* we're not interested in toast deletions */
    1664        1070 :                     else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
    1665             :                     {
    1666             :                         /*
    1667             :                          * We need to reassemble the full toasted Datum in
    1668             :                          * memory. To ensure the chunks don't get reused until
    1669             :                          * we're done, remove the change from the list of this
    1670             :                          * transaction's changes. Otherwise it will get
    1671             :                          * freed/reused while restoring spooled data from
    1672             :                          * disk.
    1673             :                          */
    1674             :                         Assert(change->data.tp.newtuple != NULL);
    1675             : 
    1676         608 :                         dlist_delete(&change->node);
    1677         608 :                         ReorderBufferToastAppendChunk(rb, txn, relation,
    1678             :                                                       change);
    1679             :                     }
    1680             : 
    1681         462 :             change_done:
    1682             : 
    1683             :                     /*
    1684             :                      * Either speculative insertion was confirmed, or it was
    1685             :                      * unsuccessful and the record isn't needed anymore.
    1686             :                      */
    1687      299438 :                     if (specinsert != NULL)
    1688             :                     {
    1689        3564 :                         ReorderBufferReturnChange(rb, specinsert);
    1690        3564 :                         specinsert = NULL;
    1691             :                     }
    1692             : 
    1693      299438 :                     if (relation != NULL)
    1694             :                     {
    1695      299284 :                         RelationClose(relation);
    1696      299284 :                         relation = NULL;
    1697             :                     }
    1698      299438 :                     break;
    1699             : 
    1700        3564 :                 case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    1701             : 
    1702             :                     /*
    1703             :                      * Speculative insertions are dealt with by delaying the
    1704             :                      * processing of the insert until the confirmation record
    1705             :                      * arrives. For that we simply unlink the record from the
    1706             :                      * chain, so it does not get freed/reused while restoring
    1707             :                      * spooled data from disk.
    1708             :                      *
    1709             :                      * This is safe in the face of concurrent catalog changes
    1710             :                      * because the relevant relation can't be changed between
    1711             :                      * speculative insertion and confirmation due to
    1712             :                      * CheckTableNotInUse() and locking.
    1713             :                      */
    1714             : 
    1715             :                     /* clear out a pending (and thus failed) speculation */
    1716        3564 :                     if (specinsert != NULL)
    1717             :                     {
    1718           0 :                         ReorderBufferReturnChange(rb, specinsert);
    1719           0 :                         specinsert = NULL;
    1720             :                     }
    1721             : 
    1722             :                     /* and memorize the pending insertion */
    1723        3564 :                     dlist_delete(&change->node);
    1724        3564 :                     specinsert = change;
    1725        3564 :                     break;
    1726             : 
    1727          20 :                 case REORDER_BUFFER_CHANGE_TRUNCATE:
    1728             :                     {
    1729             :                         int         i;
    1730          20 :                         int         nrelids = change->data.truncate.nrelids;
    1731          20 :                         int         nrelations = 0;
    1732             :                         Relation   *relations;
    1733             : 
    1734          20 :                         relations = palloc0(nrelids * sizeof(Relation));
    1735          50 :                         for (i = 0; i < nrelids; i++)
    1736             :                         {
    1737          30 :                             Oid         relid = change->data.truncate.relids[i];
    1738             :                             Relation    relation;
    1739             : 
    1740          30 :                             relation = RelationIdGetRelation(relid);
    1741             : 
    1742          30 :                             if (!RelationIsValid(relation))
    1743           0 :                                 elog(ERROR, "could not open relation with OID %u", relid);
    1744             : 
    1745          30 :                             if (!RelationIsLogicallyLogged(relation))
    1746           0 :                                 continue;
    1747             : 
    1748          30 :                             relations[nrelations++] = relation;
    1749             :                         }
    1750             : 
    1751          20 :                         rb->apply_truncate(rb, txn, nrelations, relations, change);
    1752             : 
    1753          50 :                         for (i = 0; i < nrelations; i++)
    1754          30 :                             RelationClose(relations[i]);
    1755             : 
    1756          20 :                         break;
    1757             :                     }
    1758             : 
    1759          10 :                 case REORDER_BUFFER_CHANGE_MESSAGE:
    1760          30 :                     rb->message(rb, txn, change->lsn, true,
    1761          10 :                                 change->data.msg.prefix,
    1762             :                                 change->data.msg.message_size,
    1763          10 :                                 change->data.msg.message);
    1764          10 :                     break;
    1765             : 
    1766         404 :                 case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
    1767             :                     /* get rid of the old */
    1768         404 :                     TeardownHistoricSnapshot(false);
    1769             : 
    1770         404 :                     if (snapshot_now->copied)
    1771             :                     {
    1772         368 :                         ReorderBufferFreeSnap(rb, snapshot_now);
    1773         368 :                         snapshot_now =
    1774         368 :                             ReorderBufferCopySnap(rb, change->data.snapshot,
    1775             :                                                   txn, command_id);
    1776             :                     }
    1777             : 
    1778             :                     /*
    1779             :                      * Restored from disk, need to be careful not to double
    1780             :                      * free. We could introduce refcounting for that, but for
    1781             :                      * now this seems infrequent enough not to care.
    1782             :                      */
    1783          36 :                     else if (change->data.snapshot->copied)
    1784             :                     {
    1785           0 :                         snapshot_now =
    1786           0 :                             ReorderBufferCopySnap(rb, change->data.snapshot,
    1787             :                                                   txn, command_id);
    1788             :                     }
    1789             :                     else
    1790             :                     {
    1791          36 :                         snapshot_now = change->data.snapshot;
    1792             :                     }
    1793             : 
    1794             : 
    1795             :                     /* and continue with the new one */
    1796         404 :                     SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
    1797         404 :                     break;
    1798             : 
    1799       11392 :                 case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
    1800             :                     Assert(change->data.command_id != InvalidCommandId);
    1801             : 
    1802       11392 :                     if (command_id < change->data.command_id)
    1803             :                     {
    1804        1674 :                         command_id = change->data.command_id;
    1805             : 
    1806        1674 :                         if (!snapshot_now->copied)
    1807             :                         {
    1808             :                             /* we don't use the global one anymore */
    1809         366 :                             snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
    1810             :                                                                  txn, command_id);
    1811             :                         }
    1812             : 
    1813        1674 :                         snapshot_now->curcid = command_id;
    1814             : 
    1815        1674 :                         TeardownHistoricSnapshot(false);
    1816        1674 :                         SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
    1817             : 
    1818             :                         /*
    1819             :                          * Every time the CommandId is incremented, we could
    1820             :                          * see new catalog contents, so execute all
    1821             :                          * invalidations.
    1822             :                          */
    1823        1674 :                         ReorderBufferExecuteInvalidations(rb, txn);
    1824             :                     }
    1825             : 
    1826       11392 :                     break;
    1827             : 
    1828           0 :                 case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
    1829           0 :                     elog(ERROR, "tuplecid value in changequeue");
    1830             :                     break;
    1831             :             }
    1832      315802 :         }
    1833             : 
    1834             :         /*
    1835             :          * There's a speculative insertion remaining; just clean it up. It
    1836             :          * can't have been successful, otherwise we'd have gotten a
    1837             :          * confirmation record.
    1838             :          */
    1839         974 :         if (specinsert)
    1840             :         {
    1841           0 :             ReorderBufferReturnChange(rb, specinsert);
    1842           0 :             specinsert = NULL;
    1843             :         }
    1844             : 
    1845             :         /* clean up the iterator */
    1846         974 :         ReorderBufferIterTXNFinish(rb, iterstate);
    1847         974 :         iterstate = NULL;
    1848             : 
    1849             :         /* call commit callback */
    1850         974 :         rb->commit(rb, txn, commit_lsn);
    1851             : 
    1852             :         /* this is just a sanity check against bad output plugin behaviour */
    1853         974 :         if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
    1854           0 :             elog(ERROR, "output plugin used XID %u",
    1855             :                  GetCurrentTransactionId());
    1856             : 
    1857             :         /* cleanup */
    1858         974 :         TeardownHistoricSnapshot(false);
    1859             : 
    1860             :         /*
    1861             :          * Aborting the current (sub-)transaction as a whole has the right
    1862             :          * semantics. We want all locks acquired in here to be released, not
    1863             :          * reassigned to the parent, and we do not want any database access
    1864             :          * to have persistent effects.
    1865             :          */
    1866         974 :         AbortCurrentTransaction();
    1867             : 
    1868             :         /* make sure there's no cache pollution */
    1869         974 :         ReorderBufferExecuteInvalidations(rb, txn);
    1870             : 
    1871         974 :         if (using_subtxn)
    1872         660 :             RollbackAndReleaseCurrentSubTransaction();
    1873             : 
    1874         974 :         if (snapshot_now->copied)
    1875         366 :             ReorderBufferFreeSnap(rb, snapshot_now);
    1876             : 
    1877             :         /* remove potential on-disk data, and deallocate */
    1878         974 :         ReorderBufferCleanupTXN(rb, txn);
    1879             :     }
    1880           0 :     PG_CATCH();
    1881             :     {
    1882             :         /* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
    1883           0 :         if (iterstate)
    1884           0 :             ReorderBufferIterTXNFinish(rb, iterstate);
    1885             : 
    1886           0 :         TeardownHistoricSnapshot(true);
    1887             : 
    1888             :         /*
    1889             :          * Force cache invalidation to happen outside of a valid transaction
    1890             :          * to prevent catalog access as we just caught an error.
    1891             :          */
    1892           0 :         AbortCurrentTransaction();
    1893             : 
    1894             :         /* make sure there's no cache pollution */
    1895           0 :         ReorderBufferExecuteInvalidations(rb, txn);
    1896             : 
    1897           0 :         if (using_subtxn)
    1898           0 :             RollbackAndReleaseCurrentSubTransaction();
    1899             : 
    1900           0 :         if (snapshot_now->copied)
    1901           0 :             ReorderBufferFreeSnap(rb, snapshot_now);
    1902             : 
    1903             :         /* remove potential on-disk data, and deallocate */
    1904           0 :         ReorderBufferCleanupTXN(rb, txn);
    1905             : 
    1906           0 :         PG_RE_THROW();
    1907             :     }
    1908         974 :     PG_END_TRY();
    1909             : }
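
The speculative-insertion handling in the replay loop above can be illustrated in isolation. The following is a minimal sketch, not the PostgreSQL implementation: `ChangeKind`, `Change`, and `replay_changes` are hypothetical names, and it assumes that a confirmation record releases the held insert (that case is handled outside the lines shown here). A speculative insert is held back instead of being emitted; a second speculative insert or an ordinary change discards a still-pending one; anything left pending at the end is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified change kinds; not the PostgreSQL enum. */
typedef enum
{
    CH_INSERT,          /* ordinary insert, emitted immediately */
    CH_SPEC_INSERT,     /* speculative insert, held until confirmed */
    CH_SPEC_CONFIRM     /* confirmation of the pending speculative insert */
} ChangeKind;

typedef struct
{
    ChangeKind  kind;
    int         payload;
} Change;

/*
 * Walk `changes` in order and store in `out` the payloads that would be
 * handed to the output plugin. Returns the number of payloads stored.
 * `out` must have room for at least `n` entries.
 */
static size_t
replay_changes(const Change *changes, size_t n, int *out)
{
    const Change *pending = NULL;   /* plays the role of specinsert */
    size_t      nout = 0;

    assert(changes != NULL && out != NULL);

    for (size_t i = 0; i < n; i++)
    {
        switch (changes[i].kind)
        {
            case CH_SPEC_INSERT:
                /* a still-pending speculation failed; replace it */
                pending = &changes[i];
                break;

            case CH_SPEC_CONFIRM:
                /* emit the held insert, if there is one */
                if (pending != NULL)
                {
                    out[nout++] = pending->payload;
                    pending = NULL;
                }
                break;

            case CH_INSERT:
                out[nout++] = changes[i].payload;
                /* an unconfirmed speculation isn't needed anymore */
                pending = NULL;
                break;
        }
    }

    /* anything still pending here was never confirmed and is dropped */
    return nout;
}
```

Holding a single pointer is enough in this sketch because at most one speculative insertion is outstanding at any point in the stream.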
    1910             : 
    1911             : /*
    1912             :  * Abort a transaction that possibly has previous changes. Needs to be first
    1913             :  * called for subtransactions and then for the toplevel xid.
    1914             :  *
    1915             :  * NB: Transactions handled here have to have actively aborted (i.e. have
    1916             :  * produced an abort record). Implicitly aborted transactions are handled via
    1917             :  * ReorderBufferAbortOld(); transactions we're just not interested in, but
    1918             :  * which have committed are handled in ReorderBufferForget().
    1919             :  *
    1920             :  * This function purges this transaction and its contents from memory and
    1921             :  * disk.
    1922             :  */
    1923             : void
    1924          46 : ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
    1925             : {
    1926             :     ReorderBufferTXN *txn;
    1927             : 
    1928          46 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    1929             :                                 false);
    1930             : 
    1931             :     /* unknown, nothing to remove */
    1932          46 :     if (txn == NULL)
    1933           0 :         return;
    1934             : 
    1935             :     /* cosmetic... */
    1936          46 :     txn->final_lsn = lsn;
    1937             : 
    1938             :     /* remove potential on-disk data, and deallocate */
    1939          46 :     ReorderBufferCleanupTXN(rb, txn);
    1940             : }
    1941             : 
    1942             : /*
    1943             :  * Abort all transactions that aren't actually running anymore because the
    1944             :  * server restarted.
    1945             :  *
    1946             :  * NB: These really have to be transactions that have aborted due to a server
    1947             :  * crash/immediate restart, as we don't deal with invalidations here.
    1948             :  */
    1949             : void
    1950         754 : ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
    1951             : {
    1952             :     dlist_mutable_iter it;
    1953             : 
    1954             :     /*
    1955             :      * Iterate through all (potential) toplevel TXNs and abort all that are
    1956             :      * older than what possibly can be running. Once we've found the first
    1957             :      * that is alive we stop. There might be some that acquired an xid
    1958             :      * earlier but started writing later, but that's unlikely and they will
    1959             :      * be cleaned up in a later call to this function.
    1960             :      */
    1961         758 :     dlist_foreach_modify(it, &rb->toplevel_by_lsn)
    1962             :     {
    1963             :         ReorderBufferTXN *txn;
    1964             : 
    1965          30 :         txn = dlist_container(ReorderBufferTXN, node, it.cur);
    1966             : 
    1967          30 :         if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
    1968             :         {
    1969           4 :             elog(DEBUG2, "aborting old transaction %u", txn->xid);
    1970             : 
    1971             :             /* remove potential on-disk data, and deallocate this tx */
    1972           4 :             ReorderBufferCleanupTXN(rb, txn);
    1973             :         }
    1974             :         else
    1975          26 :             return;
    1976             :     }
    1977             : }
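
The early-stop scan in ReorderBufferAbortOld() relies on a wraparound-aware XID comparison. The sketch below is a simplified illustration under stated assumptions: `xid_precedes` and `count_abortable` are hypothetical names, the comparison ignores the special (non-normal) XIDs that the real TransactionIdPrecedes() treats separately, and the LSN-ordered list is modeled as a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified modulo-2^32 comparison: `a` precedes `b` if the signed
 * difference is negative. Assumes two's-complement conversion and normal
 * XIDs only (an assumption of this sketch).
 */
static int
xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Count the leading transactions that are older than `oldest_running`.
 * Like the loop above, this stops at the first transaction that may still
 * be running, even if older ones follow it in the array.
 */
static size_t
count_abortable(const uint32_t *xids, size_t n, uint32_t oldest_running)
{
    size_t      k = 0;

    assert(xids != NULL || n == 0);

    while (k < n && xid_precedes(xids[k], oldest_running))
        k++;

    return k;
}
```

With XIDs {100, 101, 105, 102} and an oldest running XID of 103, the scan stops at 105 and reports two abortable entries; 102 is left for a later call, matching the behaviour the comment in the function describes.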
    1978             : 
    1979             : /*
    1980             :  * Forget the contents of a transaction if we aren't interested in its
    1981             :  * contents. Needs to be first called for subtransactions and then for the
    1982             :  * toplevel xid.
    1983             :  *
    1984             :  * This is significantly different to ReorderBufferAbort() because
    1985             :  * transactions that have committed need to be treated differently from aborted
    1986             :  * ones since they may have modified the catalog.
    1987             :  *
    1988             :  * Note that this is only allowed to be called at the moment a transaction
    1989             :  * commit has just been read, not earlier; otherwise later records referring
    1990             :  * to this xid might re-create the transaction incompletely.
    1991             :  */
    1992             : void
    1993        3536 : ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
    1994             : {
    1995             :     ReorderBufferTXN *txn;
    1996             : 
    1997        3536 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    1998             :                                 false);
    1999             : 
    2000             :     /* unknown, nothing to forget */
    2001        3536 :     if (txn == NULL)
    2002         204 :         return;
    2003             : 
    2004             :     /* cosmetic... */
    2005        3332 :     txn->final_lsn = lsn;
    2006             : 
    2007             :     /*
    2008             :      * Process cache invalidation messages if there are any. Even if we're not
    2009             :      * interested in the transaction's contents, it could have manipulated the
    2010             :      * catalog and we need to update the caches according to that.
    2011             :      */
    2012        3332 :     if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
    2013         572 :         ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
    2014             :                                            txn->invalidations);
    2015             :     else
    2016             :         Assert(txn->ninvalidations == 0);
    2017             : 
    2018             :     /* remove potential on-disk data, and deallocate */
    2019        3332 :     ReorderBufferCleanupTXN(rb, txn);
    2020             : }
    2021             : 
    2022             : /*
    2023             :  * Execute invalidations happening outside the context of a decoded
    2024             :  * transaction. That currently happens either for xid-less commits
    2025             :  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
    2026             :  * transactions (via ReorderBufferForget()).
    2027             :  */
    2028             : void
    2029         572 : ReorderBufferImmediateInvalidation(ReorderBuffer *rb, uint32 ninvalidations,
    2030             :                                    SharedInvalidationMessage *invalidations)
    2031             : {
    2032         572 :     bool        use_subtxn = IsTransactionOrTransactionBlock();
    2033             :     int         i;
    2034             : 
    2035         572 :     if (use_subtxn)
    2036         534 :         BeginInternalSubTransaction("replay");
    2037             : 
    2038             :     /*
    2039             :      * Force invalidations to happen outside of a valid transaction - that way
    2040             :      * entries will just be marked as invalid without accessing the catalog.
    2041             :      * That's advantageous because we don't need to set up the full state
    2042             :      * necessary for catalog access.
    2043             :      */
    2044         572 :     if (use_subtxn)
    2045         534 :         AbortCurrentTransaction();
    2046             : 
    2047       26990 :     for (i = 0; i < ninvalidations; i++)
    2048       26418 :         LocalExecuteInvalidationMessage(&invalidations[i]);
    2049             : 
    2050         572 :     if (use_subtxn)
    2051         534 :         RollbackAndReleaseCurrentSubTransaction();
    2052         572 : }
    2053             : 
    2054             : /*
    2055             :  * Tell reorderbuffer about an xid seen in the WAL stream. Has to be called at
    2056             :  * least once for every xid in XLogRecord->xl_xid (other places in records
    2057             :  * may, but do not have to be passed through here).
    2058             :  *
    2059             :  * Reorderbuffer keeps some data structures about transactions in LSN order,
    2060             :  * for efficiency. To do that it has to know about when transactions are seen
    2061             :  * first in the WAL. As many types of records are not actually interesting for
    2062             :  * logical decoding, they do not necessarily pass through here.
    2063             :  */
    2064             : void
    2065     2873628 : ReorderBufferProcessXid(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
    2066             : {
    2067             :     /* many records won't have an xid assigned, centralize check here */
    2068     2873628 :     if (xid != InvalidTransactionId)
    2069     2872282 :         ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    2070     2873628 : }
    2071             : 
    2072             : /*
    2073             :  * Add a new snapshot to this transaction that may only be used after lsn
    2074             :  * 'lsn' because the previous snapshot doesn't describe the catalog correctly
    2075             :  * for following rows.
    2076             :  */
    2077             : void
    2078        1002 : ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
    2079             :                          XLogRecPtr lsn, Snapshot snap)
    2080             : {
    2081        1002 :     ReorderBufferChange *change = ReorderBufferGetChange(rb);
    2082             : 
    2083        1002 :     change->data.snapshot = snap;
    2084        1002 :     change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
    2085             : 
    2086        1002 :     ReorderBufferQueueChange(rb, xid, lsn, change);
    2087        1002 : }
    2088             : 
    2089             : /*
    2090             :  * Set up the transaction's base snapshot.
    2091             :  *
    2092             :  * If we know that xid is a subtransaction, set the base snapshot on the
    2093             :  * top-level transaction instead.
    2094             :  */
    2095             : void
    2096        3572 : ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
    2097             :                              XLogRecPtr lsn, Snapshot snap)
    2098             : {
    2099             :     ReorderBufferTXN *txn;
    2100             :     bool        is_new;
    2101             : 
    2102             :     AssertArg(snap != NULL);
    2103             : 
    2104             :     /*
    2105             :      * Fetch the transaction to operate on.  If we know it's a subtransaction,
    2106             :      * operate on its top-level transaction instead.
    2107             :      */
    2108        3572 :     txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
    2109        3572 :     if (rbtxn_is_known_subxact(txn))
    2110         154 :         txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
    2111             :                                     NULL, InvalidXLogRecPtr, false);
    2112             :     Assert(txn->base_snapshot == NULL);
    2113             : 
    2114        3572 :     txn->base_snapshot = snap;
    2115        3572 :     txn->base_snapshot_lsn = lsn;
    2116        3572 :     dlist_push_tail(&rb->txns_by_base_snapshot_lsn, &txn->base_snapshot_node);
    2117             : 
    2118        3572 :     AssertTXNLsnOrder(rb);
    2119        3572 : }
    2120             : 
    2121             : /*
    2122             :  * Access the catalog with this CommandId at this point in the changestream.
    2123             :  *
    2124             :  * May only be called for command ids > 1
    2125             :  */
    2126             : void
    2127       27750 : ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
    2128             :                              XLogRecPtr lsn, CommandId cid)
    2129             : {
    2130       27750 :     ReorderBufferChange *change = ReorderBufferGetChange(rb);
    2131             : 
    2132       27750 :     change->data.command_id = cid;
    2133       27750 :     change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
    2134             : 
    2135       27750 :     ReorderBufferQueueChange(rb, xid, lsn, change);
    2136       27750 : }
    2137             : 
    2138             : /*
    2139             :  * Update the memory accounting info. We track memory used by the whole
    2140             :  * reorder buffer and the transaction containing the change.
    2141             :  */
    2142             : static void
    2143     4769152 : ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
    2144             :                                 ReorderBufferChange *change,
    2145             :                                 bool addition)
    2146             : {
    2147             :     Size        sz;
    2148             : 
    2149             :     Assert(change->txn);
    2150             : 
    2151             :     /*
    2152             :      * Ignore tuple CID changes, because those are not evicted when reaching
    2153             :      * the memory limit. We just don't count them, because counting them
    2154             :      * might easily trigger a pointless attempt to spill.
    2155             :      */
    2156     4769152 :     if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
    2157       27750 :         return;
    2158             : 
    2159     4741402 :     sz = ReorderBufferChangeSize(change);
    2160             : 
    2161     4741402 :     if (addition)
    2162             :     {
    2163     2370708 :         change->txn->size += sz;
    2164     2370708 :         rb->size += sz;
    2165             :     }
    2166             :     else
    2167             :     {
    2168             :         Assert((rb->size >= sz) && (change->txn->size >= sz));
    2169     2370694 :         change->txn->size -= sz;
    2170     2370694 :         rb->size -= sz;
    2171             :     }
    2172             : }
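
The two-level accounting in ReorderBufferChangeMemoryUpdate() can be shown with toy types. This is a minimal sketch, not the PostgreSQL data structures: `ToyBuffer`, `ToyTxn`, `ToyChange`, and `toy_memory_update` are hypothetical, and the `counted` flag stands in for the check that skips tuple CID changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    size_t      size;       /* bytes accounted to this transaction */
} ToyTxn;

typedef struct
{
    size_t      size;       /* bytes accounted to the whole buffer */
} ToyBuffer;

typedef struct
{
    ToyTxn     *txn;        /* owning transaction */
    size_t      bytes;      /* size of this change */
    bool        counted;    /* false for change types that are never evicted */
} ToyChange;

/*
 * Add or subtract the size of `change` from both its transaction and the
 * buffer-wide total, keeping the two counters in step.
 */
static void
toy_memory_update(ToyBuffer *rb, const ToyChange *change, bool addition)
{
    assert(change->txn != NULL);

    /* uncounted changes never touch either counter */
    if (!change->counted)
        return;

    if (addition)
    {
        change->txn->size += change->bytes;
        rb->size += change->bytes;
    }
    else
    {
        /* never subtract more than was previously added */
        assert(rb->size >= change->bytes &&
               change->txn->size >= change->bytes);
        change->txn->size -= change->bytes;
        rb->size -= change->bytes;
    }
}
```

Tracking both totals lets a caller compare the buffer-wide size against a limit and then pick a transaction to spill by its own size, without rescanning every change.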
    2173             : 
    2174             : /*
    2175             :  * Add new (relfilenode, tid) -> (cmin, cmax) mappings.
    2176             :  *
    2177             :  * We do not include this change type in memory accounting, because we
    2178             :  * keep CIDs in a separate list and do not evict them when reaching
    2179             :  * the memory limit.
    2180             :  */
    2181             : void
    2182       27750 : ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
    2183             :                              XLogRecPtr lsn, RelFileNode node,
    2184             :                              ItemPointerData tid, CommandId cmin,
    2185             :                              CommandId cmax, CommandId combocid)
    2186             : {
    2187       27750 :     ReorderBufferChange *change = ReorderBufferGetChange(rb);
    2188             :     ReorderBufferTXN *txn;
    2189             : 
    2190       27750 :     txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    2191             : 
    2192       27750 :     change->data.tuplecid.node = node;
    2193       27750 :     change->data.tuplecid.tid = tid;
    2194       27750 :     change->data.tuplecid.cmin = cmin;
    2195       27750 :     change->data.tuplecid.cmax = cmax;
    2196       27750 :     change->data.tuplecid.combocid = combocid;
    2197       27750 :     change->lsn = lsn;
    2198       27750 :     change->txn = txn;
    2199       27750 :     change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
    2200             : 
    2201       27750 :     dlist_push_tail(&txn->tuplecids, &change->node);
    2202       27750 :     txn->ntuplecids++;
    2203       27750 : }
    2204             : 
    2205             : /*
    2206             :  * Set up the invalidations of the toplevel transaction.
    2207             :  *
    2208             :  * This needs to be done before ReorderBufferCommit is called!
    2209             :  */
    2210             : void
    2211         940 : ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
    2212             :                               XLogRecPtr lsn, Size nmsgs,
    2213             :                               SharedInvalidationMessage *msgs)
    2214             : {
    2215             :     ReorderBufferTXN *txn;
    2216             : 
    2217         940 :     txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    2218             : 
    2219         940 :     if (txn->ninvalidations != 0)
    2220           0 :         elog(ERROR, "only ever add one set of invalidations");
    2221             : 
    2222             :     Assert(nmsgs > 0);
    2223             : 
    2224         940 :     txn->ninvalidations = nmsgs;
    2225         940 :     txn->invalidations = (SharedInvalidationMessage *)
    2226         940 :         MemoryContextAlloc(rb->context,
    2227             :                            sizeof(SharedInvalidationMessage) * nmsgs);
    2228         940 :     memcpy(txn->invalidations, msgs,
    2229             :            sizeof(SharedInvalidationMessage) * nmsgs);
    2230         940 : }
    2231             : 
    2232             : /*
    2233             :  * Apply all invalidations we know. Possibly we only need parts at this point
    2234             :  * in the changestream but we don't know which those are.
    2235             :  */
    2236             : static void
    2237        2648 : ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
    2238             : {
    2239             :     int         i;
    2240             : 
    2241      221308 :     for (i = 0; i < txn->ninvalidations; i++)
    2242      218660 :         LocalExecuteInvalidationMessage(&txn->invalidations[i]);
    2243        2648 : }
    2244             : 
    2245             : /*
    2246             :  * Mark a transaction as containing catalog changes
    2247             :  */
    2248             : void
    2249       29922 : ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
    2250             :                                   XLogRecPtr lsn)
    2251             : {
    2252             :     ReorderBufferTXN *txn;
    2253             : 
    2254       29922 :     txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    2255             : 
    2256       29922 :     txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
    2257       29922 : }
    2258             : 
    2259             : /*
    2260             :  * Query whether a transaction is already *known* to contain catalog
    2261             :  * changes. This can be wrong until directly before the commit!
    2262             :  */
    2263             : bool
    2264        5000 : ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
    2265             : {
    2266             :     ReorderBufferTXN *txn;
    2267             : 
    2268        5000 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    2269             :                                 false);
    2270        5000 :     if (txn == NULL)
    2271         236 :         return false;
    2272             : 
    2273        4764 :     return rbtxn_has_catalog_changes(txn);
    2274             : }
    2275             : 
    2276             : /*
    2277             :  * ReorderBufferXidHasBaseSnapshot
    2278             :  *      Have we already set the base snapshot for the given txn/subtxn?
    2279             :  */
    2280             : bool
    2281     2067452 : ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
    2282             : {
    2283             :     ReorderBufferTXN *txn;
    2284             : 
    2285     2067452 :     txn = ReorderBufferTXNByXid(rb, xid, false,
    2286             :                                 NULL, InvalidXLogRecPtr, false);
    2287             : 
    2288             :     /* transaction isn't known yet, ergo no snapshot */
    2289     2067452 :     if (txn == NULL)
    2290           0 :         return false;
    2291             : 
    2292             :     /* a known subtxn? operate on top-level txn instead */
    2293     2067452 :     if (rbtxn_is_known_subxact(txn))
    2294      531052 :         txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
    2295             :                                     NULL, InvalidXLogRecPtr, false);
    2296             : 
    2297     2067452 :     return txn->base_snapshot != NULL;
    2298             : }
    2299             : 
    2300             : 
    2301             : /*
    2302             :  * ---------------------------------------
    2303             :  * Disk serialization support
    2304             :  * ---------------------------------------
    2305             :  */
    2306             : 
    2307             : /*
    2308             :  * Ensure the IO buffer is >= sz.
    2309             :  */
    2310             : static void
    2311     4511394 : ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
    2312             : {
    2313     4511394 :     if (!rb->outbufsize)
    2314             :     {
    2315          58 :         rb->outbuf = MemoryContextAlloc(rb->context, sz);
    2316          58 :         rb->outbufsize = sz;
    2317             :     }
    2318     4511336 :     else if (rb->outbufsize < sz)
    2319             :     {
    2320         456 :         rb->outbuf = repalloc(rb->outbuf, sz);
    2321         456 :         rb->outbufsize = sz;
    2322             :     }
    2323     4511394 : }
    2324             : 
    2325             : /*
    2326             :  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
    2327             :  *
    2328             :  * XXX With many subtransactions this might be quite slow, because we'll have
    2329             :  * to walk through all of them. There are some options for improving that:
    2330             :  * (a) maintaining a secondary structure with transactions sorted by amount
    2331             :  * of changes, (b) not looking for the single largest transaction, but e.g.
    2332             :  * for a transaction using at least some fraction of the memory limit, and
    2333             :  * (c) evicting multiple transactions at once, e.g. to free a given portion
    2334             :  * of the memory limit (e.g. 50%).
    2335             :  */
    2336             : static ReorderBufferTXN *
    2337        4456 : ReorderBufferLargestTXN(ReorderBuffer *rb)
    2338             : {
    2339             :     HASH_SEQ_STATUS hash_seq;
    2340             :     ReorderBufferTXNByIdEnt *ent;
    2341        4456 :     ReorderBufferTXN *largest = NULL;
    2342             : 
    2343        4456 :     hash_seq_init(&hash_seq, rb->by_txn);
    2344       11528 :     while ((ent = hash_seq_search(&hash_seq)) != NULL)
    2345             :     {
    2346        7072 :         ReorderBufferTXN *txn = ent->txn;
    2347             : 
    2348             :         /* if the current transaction is larger, remember it */
    2349        7072 :         if ((!largest) || (txn->size > largest->size))
    2350        6060 :             largest = txn;
    2351             :     }
    2352             : 
    2353             :     Assert(largest);
    2354             :     Assert(largest->size > 0);
    2355             :     Assert(largest->size <= rb->size);
    2356             : 
    2357        4456 :     return largest;
    2358             : }
    2359             : 
    2360             : /*
    2361             :  * Check whether the logical_decoding_work_mem limit was reached, and if yes
    2362             :  * pick the transaction to evict and spill the changes to disk.
    2363             :  *
    2364             :  * XXX At this point we select just a single (largest) transaction, but
    2365             :  * we might also adopt a more elaborate eviction strategy - for example
    2366             :  * evicting enough transactions to free a certain fraction (e.g. 50%) of
    2367             :  * the memory limit.
    2368             :  */
    2369             : static void
    2370     2076136 : ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
    2371             : {
    2372             :     ReorderBufferTXN *txn;
    2373             : 
    2374             :     /* bail out if we haven't exceeded the memory limit */
    2375     2076136 :     if (rb->size < logical_decoding_work_mem * 1024L)
    2376     2071680 :         return;
    2377             : 
    2378             :     /*
    2379             :      * Pick the largest transaction (or subtransaction) and evict it from
    2380             :      * memory by serializing it to disk.
    2381             :      */
    2382        4456 :     txn = ReorderBufferLargestTXN(rb);
    2383             : 
    2384        4456 :     ReorderBufferSerializeTXN(rb, txn);
    2385             : 
    2386             :     /*
    2387             :      * After eviction, the transaction should have no entries in memory, and
    2388             :      * should use 0 bytes for changes.
    2389             :      */
    2390             :     Assert(txn->size == 0);
    2391             :     Assert(txn->nentries_mem == 0);
    2392             : 
    2393             :     /*
    2394             :      * And furthermore, evicting the transaction should get us below the
    2395             :      * memory limit again - it is not possible that we're still exceeding the
    2396             :      * memory limit after evicting the transaction.
    2397             :      *
    2398             :      * This follows from the simple fact that the selected transaction is at
    2399             :      * least as large as the most recent change (which caused us to go over
    2400             :      * the memory limit). So by evicting it we're definitely back below the
    2401             :      * memory limit.
    2402             :      */
    2403             :     Assert(rb->size < logical_decoding_work_mem * 1024L);
    2404             : }
    2405             : 
    2406             : /*
    2407             :  * Spill data of a large transaction (and its subtransactions) to disk.
    2408             :  */
    2409             : static void
    2410        4954 : ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
    2411             : {
    2412             :     dlist_iter  subtxn_i;
    2413             :     dlist_mutable_iter change_i;
    2414        4954 :     int         fd = -1;
    2415        4954 :     XLogSegNo   curOpenSegNo = 0;
    2416        4954 :     Size        spilled = 0;
    2417        4954 :     Size        size = txn->size;
    2418             : 
    2419        4954 :     elog(DEBUG2, "spill %u changes in XID %u to disk",
    2420             :          (uint32) txn->nentries_mem, txn->xid);
    2421             : 
    2422             :     /* do the same to all child TXs */
    2423        5394 :     dlist_foreach(subtxn_i, &txn->subtxns)
    2424             :     {
    2425             :         ReorderBufferTXN *subtxn;
    2426             : 
    2427         440 :         subtxn = dlist_container(ReorderBufferTXN, node, subtxn_i.cur);
    2428         440 :         ReorderBufferSerializeTXN(rb, subtxn);
    2429             :     }
    2430             : 
    2431             :     /* serialize changestream */
    2432     1983626 :     dlist_foreach_modify(change_i, &txn->changes)
    2433             :     {
    2434             :         ReorderBufferChange *change;
    2435             : 
    2436     1978672 :         change = dlist_container(ReorderBufferChange, node, change_i.cur);
    2437             : 
    2438             :         /*
    2439             :          * store the change in the segment it belongs to by start LSN; don't
    2440             :          * split it over multiple segments, though
    2441             :          */
    2442     1978672 :         if (fd == -1 ||
    2443     1974132 :             !XLByteInSeg(change->lsn, curOpenSegNo, wal_segment_size))
    2444             :         {
    2445             :             char        path[MAXPGPATH];
    2446             : 
    2447        4544 :             if (fd != -1)
    2448           4 :                 CloseTransientFile(fd);
    2449             : 
    2450        4544 :             XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size);
    2451             : 
    2452             :             /*
    2453             :              * No need to care about TLIs here, only used during a single run,
    2454             :              * so each LSN only maps to a specific WAL record.
    2455             :              */
    2456        4544 :             ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid,
    2457             :                                         curOpenSegNo);
    2458             : 
    2459             :             /* open segment, create it if necessary */
    2460        4544 :             fd = OpenTransientFile(path,
    2461             :                                    O_CREAT | O_WRONLY | O_APPEND | PG_BINARY);
    2462             : 
    2463        4544 :             if (fd < 0)
    2464           0 :                 ereport(ERROR,
    2465             :                         (errcode_for_file_access(),
    2466             :                          errmsg("could not open file \"%s\": %m", path)));
    2467             :         }
    2468             : 
    2469     1978672 :         ReorderBufferSerializeChange(rb, txn, fd, change);
    2470     1978672 :         dlist_delete(&change->node);
    2471     1978672 :         ReorderBufferReturnChange(rb, change);
    2472             : 
    2473     1978672 :         spilled++;
    2474             :     }
    2475             : 
    2476             :     /* update the statistics */
    2477        4954 :     rb->spillCount += 1;
    2478        4954 :     rb->spillBytes += size;
    2479             : 
    2480             :     /* Don't consider already serialized transactions. */
    2481        4954 :     rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1;
    2482             : 
    2483             :     Assert(spilled == txn->nentries_mem);
    2484             :     Assert(dlist_is_empty(&txn->changes));
    2485        4954 :     txn->nentries_mem = 0;
    2486        4954 :     txn->txn_flags |= RBTXN_IS_SERIALIZED;
    2487             : 
    2488        4954 :     if (fd != -1)
    2489        4540 :         CloseTransientFile(fd);
    2490        4954 : }
    2491             : 
    2492             : /*
    2493             :  * Serialize individual change to disk.
    2494             :  */
    2495             : static void
    2496     1978672 : ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
    2497             :                              int fd, ReorderBufferChange *change)
    2498             : {
    2499             :     ReorderBufferDiskChange *ondisk;
    2500     1978672 :     Size        sz = sizeof(ReorderBufferDiskChange);
    2501             : 
    2502     1978672 :     ReorderBufferSerializeReserve(rb, sz);
    2503             : 
    2504     1978672 :     ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    2505     1978672 :     memcpy(&ondisk->change, change, sizeof(ReorderBufferChange));
    2506             : 
    2507     1978672 :     switch (change->action)
    2508             :     {
    2509             :             /* fall through these, they're all similar enough */
    2510     1944370 :         case REORDER_BUFFER_CHANGE_INSERT:
    2511             :         case REORDER_BUFFER_CHANGE_UPDATE:
    2512             :         case REORDER_BUFFER_CHANGE_DELETE:
    2513             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    2514             :             {
    2515             :                 char       *data;
    2516             :                 ReorderBufferTupleBuf *oldtup,
    2517             :                            *newtup;
    2518     1944370 :                 Size        oldlen = 0;
    2519     1944370 :                 Size        newlen = 0;
    2520             : 
    2521     1944370 :                 oldtup = change->data.tp.oldtuple;
    2522     1944370 :                 newtup = change->data.tp.newtuple;
    2523             : 
    2524     1944370 :                 if (oldtup)
    2525             :                 {
    2526      120054 :                     sz += sizeof(HeapTupleData);
    2527      120054 :                     oldlen = oldtup->tuple.t_len;
    2528      120054 :                     sz += oldlen;
    2529             :                 }
    2530             : 
    2531     1944370 :                 if (newtup)
    2532             :                 {
    2533     1716908 :                     sz += sizeof(HeapTupleData);
    2534     1716908 :                     newlen = newtup->tuple.t_len;
    2535     1716908 :                     sz += newlen;
    2536             :                 }
    2537             : 
    2538             :                 /* make sure we have enough space */
    2539     1944370 :                 ReorderBufferSerializeReserve(rb, sz);
    2540             : 
    2541     1944370 :                 data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
    2542             :                 /* might have been reallocated above */
    2543     1944370 :                 ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    2544             : 
    2545     1944370 :                 if (oldlen)
    2546             :                 {
    2547      120054 :                     memcpy(data, &oldtup->tuple, sizeof(HeapTupleData));
    2548      120054 :                     data += sizeof(HeapTupleData);
    2549             : 
    2550      120054 :                     memcpy(data, oldtup->tuple.t_data, oldlen);
    2551      120054 :                     data += oldlen;
    2552             :                 }
    2553             : 
    2554     1944370 :                 if (newlen)
    2555             :                 {
    2556     1716908 :                     memcpy(data, &newtup->tuple, sizeof(HeapTupleData));
    2557     1716908 :                     data += sizeof(HeapTupleData);
    2558             : 
    2559     1716908 :                     memcpy(data, newtup->tuple.t_data, newlen);
    2560     1716908 :                     data += newlen;
    2561             :                 }
    2562     1944370 :                 break;
    2563             :             }
    2564          24 :         case REORDER_BUFFER_CHANGE_MESSAGE:
    2565             :             {
    2566             :                 char       *data;
    2567          24 :                 Size        prefix_size = strlen(change->data.msg.prefix) + 1;
    2568             : 
    2569          24 :                 sz += prefix_size + change->data.msg.message_size +
    2570             :                     sizeof(Size) + sizeof(Size);
    2571          24 :                 ReorderBufferSerializeReserve(rb, sz);
    2572             : 
    2573          24 :                 data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
    2574             : 
    2575             :                 /* might have been reallocated above */
    2576          24 :                 ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    2577             : 
    2578             :                 /* write the prefix including the size */
    2579          24 :                 memcpy(data, &prefix_size, sizeof(Size));
    2580          24 :                 data += sizeof(Size);
    2581          24 :                 memcpy(data, change->data.msg.prefix,
    2582             :                        prefix_size);
    2583          24 :                 data += prefix_size;
    2584             : 
    2585             :                 /* write the message including the size */
    2586          24 :                 memcpy(data, &change->data.msg.message_size, sizeof(Size));
    2587          24 :                 data += sizeof(Size);
    2588          24 :                 memcpy(data, change->data.msg.message,
    2589             :                        change->data.msg.message_size);
    2590          24 :                 data += change->data.msg.message_size;
    2591             : 
    2592          24 :                 break;
    2593             :             }
    2594           4 :         case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
    2595             :             {
    2596             :                 Snapshot    snap;
    2597             :                 char       *data;
    2598             : 
    2599           4 :                 snap = change->data.snapshot;
    2600             : 
    2601           4 :                 sz += sizeof(SnapshotData) +
    2602           8 :                     sizeof(TransactionId) * snap->xcnt +
    2603           4 :                     sizeof(TransactionId) * snap->subxcnt;
    2604             : 
    2605             :                 /* make sure we have enough space */
    2606           4 :                 ReorderBufferSerializeReserve(rb, sz);
    2607           4 :                 data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
    2608             :                 /* might have been reallocated above */
    2609           4 :                 ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    2610             : 
    2611           4 :                 memcpy(data, snap, sizeof(SnapshotData));
    2612           4 :                 data += sizeof(SnapshotData);
    2613             : 
    2614           4 :                 if (snap->xcnt)
    2615             :                 {
    2616           4 :                     memcpy(data, snap->xip,
    2617           4 :                            sizeof(TransactionId) * snap->xcnt);
    2618           4 :                     data += sizeof(TransactionId) * snap->xcnt;
    2619             :                 }
    2620             : 
    2621           4 :                 if (snap->subxcnt)
    2622             :                 {
    2623           0 :                     memcpy(data, snap->subxip,
    2624           0 :                            sizeof(TransactionId) * snap->subxcnt);
    2625           0 :                     data += sizeof(TransactionId) * snap->subxcnt;
    2626             :                 }
    2627           4 :                 break;
    2628             :             }
    2629           0 :         case REORDER_BUFFER_CHANGE_TRUNCATE:
    2630             :             {
    2631             :                 Size        size;
    2632             :                 char       *data;
    2633             : 
    2634             :                 /* account for the OIDs of truncated relations */
    2635           0 :                 size = sizeof(Oid) * change->data.truncate.nrelids;
    2636           0 :                 sz += size;
    2637             : 
    2638             :                 /* make sure we have enough space */
    2639           0 :                 ReorderBufferSerializeReserve(rb, sz);
    2640             : 
    2641           0 :                 data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
    2642             :                 /* might have been reallocated above */
    2643           0 :                 ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    2644             : 
    2645           0 :                 memcpy(data, change->data.truncate.relids, size);
    2646           0 :                 data += size;
    2647             : 
    2648           0 :                 break;
    2649             :             }
    2650       34274 :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    2651             :         case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
    2652             :         case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
    2653             :             /* ReorderBufferChange contains everything important */
    2654       34274 :             break;
    2655             :     }
    2656             : 
    2657     1978672 :     ondisk->size = sz;
    2658             : 
    2659     1978672 :     errno = 0;
    2660     1978672 :     pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE);
    2661     1978672 :     if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
    2662             :     {
    2663           0 :         int         save_errno = errno;
    2664             : 
    2665           0 :         CloseTransientFile(fd);
    2666             : 
    2667             :         /* if write didn't set errno, assume problem is no disk space */
    2668           0 :         errno = save_errno ? save_errno : ENOSPC;
    2669           0 :         ereport(ERROR,
    2670             :                 (errcode_for_file_access(),
    2671             :                  errmsg("could not write to data file for XID %u: %m",
    2672             :                         txn->xid)));
    2673             :     }
    2674     1978672 :     pgstat_report_wait_end();
    2675             : 
    2676             :     /*
    2677             :      * Keep the transaction's final_lsn up to date with each change we send to
    2678             :      * disk, so that ReorderBufferRestoreCleanup works correctly.  (We used to
    2679             :      * only do this on commit and abort records, but that doesn't work if a
    2680             :      * system crash leaves a transaction without its abort record).
    2681             :      *
    2682             :      * Make sure not to move it backwards.
    2683             :      */
    2684     1978672 :     if (txn->final_lsn < change->lsn)
    2685     1970412 :         txn->final_lsn = change->lsn;
    2686             : 
    2687             :     Assert(ondisk->change.action == change->action);
    2688     1978672 : }
    2689             : 
    2690             : /*
    2691             :  * Size of a change in memory.
    2692             :  */
    2693             : static Size
    2694     4741402 : ReorderBufferChangeSize(ReorderBufferChange *change)
    2695             : {
    2696     4741402 :     Size        sz = sizeof(ReorderBufferChange);
    2697             : 
    2698     4741402 :     switch (change->action)
    2699             :     {
    2700             :             /* fall through these, they're all similar enough */
    2701     4604662 :         case REORDER_BUFFER_CHANGE_INSERT:
    2702             :         case REORDER_BUFFER_CHANGE_UPDATE:
    2703             :         case REORDER_BUFFER_CHANGE_DELETE:
    2704             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    2705             :             {
    2706             :                 ReorderBufferTupleBuf *oldtup,
    2707             :                            *newtup;
    2708     4604662 :                 Size        oldlen = 0;
    2709     4604662 :                 Size        newlen = 0;
    2710             : 
    2711     4604662 :                 oldtup = change->data.tp.oldtuple;
    2712     4604662 :                 newtup = change->data.tp.newtuple;
    2713             : 
    2714     4604662 :                 if (oldtup)
    2715             :                 {
    2716      264148 :                     sz += sizeof(HeapTupleData);
    2717      264148 :                     oldlen = oldtup->tuple.t_len;
    2718      264148 :                     sz += oldlen;
    2719             :                 }
    2720             : 
    2721     4604662 :                 if (newtup)
    2722             :                 {
    2723     4065306 :                     sz += sizeof(HeapTupleData);
    2724     4065306 :                     newlen = newtup->tuple.t_len;
    2725     4065306 :                     sz += newlen;
    2726             :                 }
    2727             : 
    2728     4604662 :                 break;
    2729             :             }
    2730          92 :         case REORDER_BUFFER_CHANGE_MESSAGE:
    2731             :             {
    2732          92 :                 Size        prefix_size = strlen(change->data.msg.prefix) + 1;
    2733             : 
    2734          92 :                 sz += prefix_size + change->data.msg.message_size +
    2735             :                     sizeof(Size) + sizeof(Size);
    2736             : 
    2737          92 :                 break;
    2738             :             }
    2739        2008 :         case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
    2740             :             {
    2741             :                 Snapshot    snap;
    2742             : 
    2743        2008 :                 snap = change->data.snapshot;
    2744             : 
    2745        2008 :                 sz += sizeof(SnapshotData) +
    2746        4016 :                     sizeof(TransactionId) * snap->xcnt +
    2747        2008 :                     sizeof(TransactionId) * snap->subxcnt;
    2748             : 
    2749        2008 :                 break;
    2750             :             }
    2751          40 :         case REORDER_BUFFER_CHANGE_TRUNCATE:
    2752             :             {
    2753          40 :                 sz += sizeof(Oid) * change->data.truncate.nrelids;
    2754             : 
    2755          40 :                 break;
    2756             :             }
    2757      134600 :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    2758             :         case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
    2759             :         case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
    2760             :             /* ReorderBufferChange contains everything important */
    2761      134600 :             break;
    2762             :     }
    2763             : 
    2764     4741402 :     return sz;
    2765             : }
    2766             : 
    2767             : 
    2768             : /*
    2769             :  * Restore a number of changes spilled to disk back into memory.
    2770             :  */
    2771             : static Size
    2772         166 : ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
    2773             :                             TXNEntryFile *file, XLogSegNo *segno)
    2774             : {
    2775         166 :     Size        restored = 0;
    2776             :     XLogSegNo   last_segno;
    2777             :     dlist_mutable_iter cleanup_iter;
    2778         166 :     File       *fd = &file->vfd;
    2779             : 
    2780             :     Assert(txn->first_lsn != InvalidXLogRecPtr);
    2781             :     Assert(txn->final_lsn != InvalidXLogRecPtr);
    2782             : 
    2783             :     /* free current entries, so we have memory for more */
    2784      290140 :     dlist_foreach_modify(cleanup_iter, &txn->changes)
    2785             :     {
    2786      289974 :         ReorderBufferChange *cleanup =
    2787      289974 :         dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
    2788             : 
    2789      289974 :         dlist_delete(&cleanup->node);
    2790      289974 :         ReorderBufferReturnChange(rb, cleanup);
    2791             :     }
    2792         166 :     txn->nentries_mem = 0;
    2793             :     Assert(dlist_is_empty(&txn->changes));
    2794             : 
    2795         166 :     XLByteToSeg(txn->final_lsn, last_segno, wal_segment_size);
    2796             : 
    2797      294358 :     while (restored < max_changes_in_memory && *segno <= last_segno)
    2798             :     {
    2799             :         int         readBytes;
    2800             :         ReorderBufferDiskChange *ondisk;
    2801             : 
    2802      294192 :         if (*fd == -1)
    2803             :         {
    2804             :             char        path[MAXPGPATH];
    2805             : 
    2806             :             /* first time in */
    2807          60 :             if (*segno == 0)
    2808          58 :                 XLByteToSeg(txn->first_lsn, *segno, wal_segment_size);
    2809             : 
    2810             :             Assert(*segno != 0 || dlist_is_empty(&txn->changes));
    2811             : 
    2812             :             /*
    2813             :              * No need to care about TLIs here, only used during a single run,
    2814             :              * so each LSN only maps to a specific WAL record.
    2815             :              */
    2816          60 :             ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid,
    2817             :                                         *segno);
    2818             : 
    2819          60 :             *fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
    2820             : 
    2821             :             /* No harm in resetting the offset even in case of failure */
    2822          60 :             file->curOffset = 0;
    2823             : 
    2824          60 :             if (*fd < 0 && errno == ENOENT)
    2825             :             {
    2826           0 :                 *fd = -1;
    2827           0 :                 (*segno)++;
    2828           0 :                 continue;
    2829             :             }
    2830          60 :             else if (*fd < 0)
    2831           0 :                 ereport(ERROR,
    2832             :                         (errcode_for_file_access(),
    2833             :                          errmsg("could not open file \"%s\": %m",
    2834             :                                 path)));
    2835             :         }
    2836             : 
    2837             :         /*
    2838             :          * Read the statically sized part of a change which has information
    2839             :          * about the total size. If we couldn't read a record, we're at the
    2840             :          * end of this file.
    2841             :          */
    2842      294192 :         ReorderBufferSerializeReserve(rb, sizeof(ReorderBufferDiskChange));
    2843      294192 :         readBytes = FileRead(file->vfd, rb->outbuf,
    2844             :                              sizeof(ReorderBufferDiskChange),
    2845             :                              file->curOffset, WAIT_EVENT_REORDER_BUFFER_READ);
    2846             : 
    2847             :         /* eof */
    2848      294192 :         if (readBytes == 0)
    2849             :         {
    2850          60 :             FileClose(*fd);
    2851          60 :             *fd = -1;
    2852          60 :             (*segno)++;
    2853          60 :             continue;
    2854             :         }
    2855      294132 :         else if (readBytes < 0)
    2856           0 :             ereport(ERROR,
    2857             :                     (errcode_for_file_access(),
    2858             :                      errmsg("could not read from reorderbuffer spill file: %m")));
    2859      294132 :         else if (readBytes != sizeof(ReorderBufferDiskChange))
    2860           0 :             ereport(ERROR,
    2861             :                     (errcode_for_file_access(),
    2862             :                      errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes",
    2863             :                             readBytes,
    2864             :                             (uint32) sizeof(ReorderBufferDiskChange))));
    2865             : 
    2866      294132 :         file->curOffset += readBytes;
    2867             : 
    2868      294132 :         ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    2869             : 
    2870      294132 :         ReorderBufferSerializeReserve(rb,
    2871      294132 :                                       sizeof(ReorderBufferDiskChange) + ondisk->size);
    2872      294132 :         ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    2873             : 
    2874      882396 :         readBytes = FileRead(file->vfd,
    2875      294132 :                              rb->outbuf + sizeof(ReorderBufferDiskChange),
    2876      294132 :                              ondisk->size - sizeof(ReorderBufferDiskChange),
    2877             :                              file->curOffset,
    2878             :                              WAIT_EVENT_REORDER_BUFFER_READ);
    2879             : 
    2880      294132 :         if (readBytes < 0)
    2881           0 :             ereport(ERROR,
    2882             :                     (errcode_for_file_access(),
    2883             :                      errmsg("could not read from reorderbuffer spill file: %m")));
    2884      294132 :         else if (readBytes != ondisk->size - sizeof(ReorderBufferDiskChange))
    2885           0 :             ereport(ERROR,
    2886             :                     (errcode_for_file_access(),
    2887             :                      errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes",
    2888             :                             readBytes,
    2889             :                             (uint32) (ondisk->size - sizeof(ReorderBufferDiskChange)))));
    2890             : 
    2891      294132 :         file->curOffset += readBytes;
    2892             : 
    2893             :         /*
    2894             :          * ok, read a full change from disk, now restore it into proper
    2895             :          * in-memory format
    2896             :          */
    2897      294132 :         ReorderBufferRestoreChange(rb, txn, rb->outbuf);
    2898      294132 :         restored++;
    2899             :     }
    2900             : 
    2901         166 :     return restored;
    2902             : }
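
The read loop above fetches each change in two steps: first the fixed-size header, which carries the total record size, then the remaining `size - sizeof(header)` bytes. Below is a minimal sketch of that length-prefixed pattern over an in-memory buffer rather than a file. `DemoDiskHeader`, `demo_write_record` and `demo_read_record` are hypothetical names for illustration and are not part of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size header; size covers the header plus the payload. */
typedef struct DemoDiskHeader
{
    uint32_t    size;
    uint32_t    action;
} DemoDiskHeader;

/* Append one record at buf[*len]; cap is the capacity of buf. */
static void
demo_write_record(unsigned char *buf, size_t cap, size_t *len,
                  uint32_t action, const void *payload, uint32_t payload_len)
{
    DemoDiskHeader hdr;

    hdr.size = (uint32_t) sizeof(hdr) + payload_len;
    hdr.action = action;
    assert(*len + hdr.size <= cap);

    memcpy(buf + *len, &hdr, sizeof(hdr));
    if (payload_len > 0)
        memcpy(buf + *len + sizeof(hdr), payload, payload_len);
    *len += hdr.size;
}

/*
 * Read one record starting at buf[*offset].
 * Returns 1 on success, 0 at a clean end of input, -1 on a short or
 * inconsistent record. On success *offset advances by the record size.
 */
static int
demo_read_record(const unsigned char *buf, size_t len, size_t *offset,
                 DemoDiskHeader *hdr, unsigned char *payload,
                 size_t payload_cap)
{
    size_t      remaining;
    size_t      body;

    assert(*offset <= len);
    remaining = len - *offset;

    if (remaining == 0)
        return 0;               /* eof: nothing left to read */
    if (remaining < sizeof(DemoDiskHeader))
        return -1;              /* short read of the static part */

    /* step 1: the statically sized part tells us the total size */
    memcpy(hdr, buf + *offset, sizeof(DemoDiskHeader));
    if (hdr->size < sizeof(DemoDiskHeader))
        return -1;
    body = hdr->size - sizeof(DemoDiskHeader);
    if (body > remaining - sizeof(DemoDiskHeader) || body > payload_cap)
        return -1;              /* short read of the variable part */

    /* step 2: the variable-length remainder */
    memcpy(payload, buf + *offset + sizeof(DemoDiskHeader), body);
    *offset += hdr->size;
    return 1;
}
```

Unlike the listing, which raises an error on a short read, this sketch reports it through the return value so the caller can decide what to do.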
    2903             : 
    2904             : /*
    2905             :  * Convert change from its on-disk format to in-memory format and queue it onto
    2906             :  * the TXN's ->changes list.
    2907             :  *
    2908             :  * Note: although "data" is declared char*, at entry it points to a
    2909             :  * maxalign'd buffer, making it safe in most of this function to assume
    2910             :  * that the pointed-to data is suitably aligned for direct access.
    2911             :  */
    2912             : static void
    2913      294132 : ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
    2914             :                            char *data)
    2915             : {
    2916             :     ReorderBufferDiskChange *ondisk;
    2917             :     ReorderBufferChange *change;
    2918             : 
    2919      294132 :     ondisk = (ReorderBufferDiskChange *) data;
    2920             : 
    2921      294132 :     change = ReorderBufferGetChange(rb);
    2922             : 
    2923             :     /* copy static part */
    2924      294132 :     memcpy(change, &ondisk->change, sizeof(ReorderBufferChange));
    2925             : 
    2926      294132 :     data += sizeof(ReorderBufferDiskChange);
    2927             : 
    2928             :     /* restore individual stuff */
    2929      294132 :     switch (change->action)
    2930             :     {
    2931             :             /* fall through these, they're all similar enough */
    2932      290376 :         case REORDER_BUFFER_CHANGE_INSERT:
    2933             :         case REORDER_BUFFER_CHANGE_UPDATE:
    2934             :         case REORDER_BUFFER_CHANGE_DELETE:
    2935             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    2936      290376 :             if (change->data.tp.oldtuple)
    2937             :             {
    2938       10012 :                 uint32      tuplelen = ((HeapTuple) data)->t_len;
    2939             : 
    2940       10012 :                 change->data.tp.oldtuple =
    2941       10012 :                     ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader);
    2942             : 
    2943             :                 /* restore ->tuple */
    2944       10012 :                 memcpy(&change->data.tp.oldtuple->tuple, data,
    2945             :                        sizeof(HeapTupleData));
    2946       10012 :                 data += sizeof(HeapTupleData);
    2947             : 
    2948             :                 /* reset t_data pointer into the new tuplebuf */
    2949       20024 :                 change->data.tp.oldtuple->tuple.t_data =
    2950       20024 :                     ReorderBufferTupleBufData(change->data.tp.oldtuple);
    2951             : 
    2952             :                 /* restore tuple data itself */
    2953       10012 :                 memcpy(change->data.tp.oldtuple->tuple.t_data, data, tuplelen);
    2954       10012 :                 data += tuplelen;
    2955             :             }
    2956             : 
    2957      290376 :             if (change->data.tp.newtuple)
    2958             :             {
    2959             :                 /* here, data might not be suitably aligned! */
    2960             :                 uint32      tuplelen;
    2961             : 
    2962      269936 :                 memcpy(&tuplelen, data + offsetof(HeapTupleData, t_len),
    2963             :                        sizeof(uint32));
    2964             : 
    2965      269936 :                 change->data.tp.newtuple =
    2966      269936 :                     ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader);
    2967             : 
    2968             :                 /* restore ->tuple */
    2969      269936 :                 memcpy(&change->data.tp.newtuple->tuple, data,
    2970             :                        sizeof(HeapTupleData));
    2971      269936 :                 data += sizeof(HeapTupleData);
    2972             : 
    2973             :                 /* reset t_data pointer into the new tuplebuf */
    2974      539872 :                 change->data.tp.newtuple->tuple.t_data =
    2975      539872 :                     ReorderBufferTupleBufData(change->data.tp.newtuple);
    2976             : 
    2977             :                 /* restore tuple data itself */
    2978      269936 :                 memcpy(change->data.tp.newtuple->tuple.t_data, data, tuplelen);
    2979      269936 :                 data += tuplelen;
    2980             :             }
    2981             : 
    2982      290376 :             break;
    2983           2 :         case REORDER_BUFFER_CHANGE_MESSAGE:
    2984             :             {
    2985             :                 Size        prefix_size;
    2986             : 
    2987             :                 /* read prefix */
    2988           2 :                 memcpy(&prefix_size, data, sizeof(Size));
    2989           2 :                 data += sizeof(Size);
    2990           2 :                 change->data.msg.prefix = MemoryContextAlloc(rb->context,
    2991             :                                                              prefix_size);
    2992           2 :                 memcpy(change->data.msg.prefix, data, prefix_size);
    2993             :                 Assert(change->data.msg.prefix[prefix_size - 1] == '\0');
    2994           2 :                 data += prefix_size;
    2995             : 
    2996             :                 /* read the message */
    2997           2 :                 memcpy(&change->data.msg.message_size, data, sizeof(Size));
    2998           2 :                 data += sizeof(Size);
    2999           2 :                 change->data.msg.message = MemoryContextAlloc(rb->context,
    3000             :                                                               change->data.msg.message_size);
    3001           2 :                 memcpy(change->data.msg.message, data,
    3002             :                        change->data.msg.message_size);
    3003           2 :                 data += change->data.msg.message_size;
    3004             : 
    3005           2 :                 break;
    3006             :             }
    3007           4 :         case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
    3008             :             {
    3009             :                 Snapshot    oldsnap;
    3010             :                 Snapshot    newsnap;
    3011             :                 Size        size;
    3012             : 
    3013           4 :                 oldsnap = (Snapshot) data;
    3014             : 
    3015           4 :                 size = sizeof(SnapshotData) +
    3016           8 :                     sizeof(TransactionId) * oldsnap->xcnt +
    3017           4 :                     sizeof(TransactionId) * (oldsnap->subxcnt + 0);
    3018             : 
    3019           4 :                 change->data.snapshot = MemoryContextAllocZero(rb->context, size);
    3020             : 
    3021           4 :                 newsnap = change->data.snapshot;
    3022             : 
    3023           4 :                 memcpy(newsnap, data, size);
    3024           4 :                 newsnap->xip = (TransactionId *)
    3025             :                     (((char *) newsnap) + sizeof(SnapshotData));
    3026           4 :                 newsnap->subxip = newsnap->xip + newsnap->xcnt;
    3027           4 :                 newsnap->copied = true;
    3028           4 :                 break;
    3029             :             }
    3030             :             /* the relids array follows the static part, copy it out */
    3031           0 :         case REORDER_BUFFER_CHANGE_TRUNCATE:
    3032             :             {
    3033             :                 Oid        *relids;
    3034             : 
    3035           0 :                 relids = ReorderBufferGetRelids(rb,
    3036           0 :                                                 change->data.truncate.nrelids);
    3037           0 :                 memcpy(relids, data, change->data.truncate.nrelids * sizeof(Oid));
    3038           0 :                 change->data.truncate.relids = relids;
    3039             : 
    3040           0 :                 break;
    3041             :             }
    3042        3750 :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    3043             :         case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
    3044             :         case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
    3045        3750 :             break;
    3046             :     }
    3047             : 
    3048      294132 :     dlist_push_tail(&txn->changes, &change->node);
    3049      294132 :     txn->nentries_mem++;
    3050             : 
    3051             :     /*
    3052             :      * Update memory accounting for the restored change.  We need to do this
    3053             :      * although we don't check the memory limit when restoring the changes in
    3054             :      * this branch (we only do that when initially queueing the changes after
    3055             :      * decoding), because we will release the changes later, and that will
    3056             :      * update the accounting too (subtracting the size from the counters). And
    3057             :      * we don't want to underflow there.
    3058             :      */
    3059      294132 :     ReorderBufferChangeMemoryUpdate(rb, change, true);
    3060      294132 : }
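
The INTERNAL_SNAPSHOT branch above restores a struct whose arrays are stored directly behind it: it copies the flat image in one `memcpy` and then repoints `xip` and `subxip` into the new allocation, because the pointer values saved on disk are meaningless after a reload. A minimal sketch of that fix-up follows. `DemoSnapshot` and the `demo_snapshot_*` helpers are hypothetical stand-ins, use `malloc` instead of memory contexts, and do not reproduce the real `SnapshotData` layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a snapshot with two trailing arrays. */
typedef struct DemoSnapshot
{
    uint32_t    xcnt;
    uint32_t    subxcnt;
    uint32_t   *xip;            /* points just past the struct */
    uint32_t   *subxip;         /* points just past xip[] */
} DemoSnapshot;

static size_t
demo_snapshot_size(uint32_t xcnt, uint32_t subxcnt)
{
    return sizeof(DemoSnapshot) + sizeof(uint32_t) * ((size_t) xcnt + subxcnt);
}

/* Build a flat image: struct, then xip[], then subxip[]. Caller frees. */
static void *
demo_snapshot_flatten(const uint32_t *xip, uint32_t xcnt,
                      const uint32_t *subxip, uint32_t subxcnt)
{
    DemoSnapshot hdr;
    char       *flat = malloc(demo_snapshot_size(xcnt, subxcnt));

    if (flat == NULL)
        return NULL;

    memset(&hdr, 0, sizeof(hdr));   /* stored pointers are not meaningful */
    hdr.xcnt = xcnt;
    hdr.subxcnt = subxcnt;
    memcpy(flat, &hdr, sizeof(hdr));
    if (xcnt > 0)
        memcpy(flat + sizeof(hdr), xip, sizeof(uint32_t) * xcnt);
    if (subxcnt > 0)
        memcpy(flat + sizeof(hdr) + sizeof(uint32_t) * xcnt,
               subxip, sizeof(uint32_t) * subxcnt);
    return flat;
}

/* Copy a flat image and repoint the arrays into the copy. Caller frees. */
static DemoSnapshot *
demo_snapshot_restore(const void *flat)
{
    DemoSnapshot hdr;
    DemoSnapshot *snap;
    size_t      size;

    assert(flat != NULL);

    /* read the counts via memcpy, since flat may not be aligned */
    memcpy(&hdr, flat, sizeof(hdr));
    size = demo_snapshot_size(hdr.xcnt, hdr.subxcnt);

    snap = malloc(size);
    if (snap == NULL)
        return NULL;

    memcpy(snap, flat, size);
    snap->xip = (uint32_t *) ((char *) snap + sizeof(DemoSnapshot));
    snap->subxip = snap->xip + snap->xcnt;
    return snap;
}
```

Keeping struct and arrays in one allocation means a single `free` releases the whole snapshot in this sketch.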
    3061             : 
    3062             : /*
    3063             :  * Remove all on-disk data stored for the passed-in transaction.
    3064             :  */
    3065             : static void
    3066         318 : ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
    3067             : {
    3068             :     XLogSegNo   first;
    3069             :     XLogSegNo   cur;
    3070             :     XLogSegNo   last;
    3071             : 
    3072             :     Assert(txn->first_lsn != InvalidXLogRecPtr);
    3073             :     Assert(txn->final_lsn != InvalidXLogRecPtr);
    3074             : 
    3075         318 :     XLByteToSeg(txn->first_lsn, first, wal_segment_size);
    3076         318 :     XLByteToSeg(txn->final_lsn, last, wal_segment_size);
    3077             : 
    3078             :     /* iterate over all possible filenames, and delete them */
    3079         640 :     for (cur = first; cur <= last; cur++)
    3080             :     {
    3081             :         char        path[MAXPGPATH];
    3082             : 
    3083         322 :         ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, cur);
    3084         322 :         if (unlink(path) != 0 && errno != ENOENT)
    3085           0 :             ereport(ERROR,
    3086             :                     (errcode_for_file_access(),
    3087             :                      errmsg("could not remove file \"%s\": %m", path)));
    3088             :     }
    3089         318 : }
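
The cleanup routine above maps the transaction's first and final LSN to WAL segment numbers and tries every segment in between, tolerating files that do not exist. A small sketch of that arithmetic, assuming a segment number is simply the LSN divided by the segment size. The function names are hypothetical, and the 16 MB segment size in the usage note is only an example value.

```c
#include <assert.h>
#include <stdint.h>

/* Segment number containing the given LSN. */
static uint64_t
demo_lsn_to_segno(uint64_t lsn, uint64_t seg_size)
{
    assert(seg_size > 0);
    return lsn / seg_size;
}

/*
 * Number of candidate spill files for a transaction spanning
 * [first_lsn, final_lsn]: one per segment, both ends inclusive.
 */
static uint64_t
demo_candidate_segments(uint64_t first_lsn, uint64_t final_lsn,
                        uint64_t seg_size)
{
    uint64_t    first = demo_lsn_to_segno(first_lsn, seg_size);
    uint64_t    last = demo_lsn_to_segno(final_lsn, seg_size);

    return (last >= first) ? last - first + 1 : 0;
}
```

With 16 MB segments, a transaction running from LSN `0xFFFFFF` to `0x2000000` touches segments 0 through 2, so three file names would be tried.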
    3090             : 
    3091             : /*
    3092             :  * Remove any leftover serialized reorder buffers from a slot directory after a
    3093             :  * prior crash or decoding session exit.
    3094             :  */
    3095             : static void
    3096        1168 : ReorderBufferCleanupSerializedTXNs(const char *slotname)
    3097             : {
    3098             :     DIR        *spill_dir;
    3099             :     struct dirent *spill_de;
    3100             :     struct stat statbuf;
    3101             :     char        path[MAXPGPATH * 2 + 12];
    3102             : 
    3103        1168 :     sprintf(path, "pg_replslot/%s", slotname);
    3104             : 
    3105             :     /* we're only handling directories here, skip if it's not ours */
    3106        1168 :     if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
    3107           0 :         return;
    3108             : 
    3109        1168 :     spill_dir = AllocateDir(path);
    3110        4672 :     while ((spill_de = ReadDirExtended(spill_dir, path, INFO)) != NULL)
    3111             :     {
    3112             :         /* only look at names that can be ours */
    3113        3504 :         if (strncmp(spill_de->d_name, "xid", 3) == 0)
    3114             :         {
    3115           0 :             snprintf(path, sizeof(path),
    3116             :                      "pg_replslot/%s/%s", slotname,
    3117           0 :                      spill_de->d_name);
    3118             : 
    3119           0 :             if (unlink(path) != 0)
    3120           0 :                 ereport(ERROR,
    3121             :                         (errcode_for_file_access(),
    3122             :                          errmsg("could not remove file \"%s\" during removal of pg_replslot/%s/xid*: %m",
    3123             :                                 path, slotname)));
    3124             :         }
    3125             :     }
    3126        1168 :     FreeDir(spill_dir);
    3127             : }
    3128             : 
    3129             : /*
    3130             :  * Given a replication slot, transaction ID and segment number, write the name
    3131             :  * of the corresponding spill file into 'path', which is a caller-owned buffer
    3132             :  * of size at least MAXPGPATH.
    3133             :  */
    3134             : static void
    3135        4926 : ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid,
    3136             :                             XLogSegNo segno)
    3137             : {
    3138             :     XLogRecPtr  recptr;
    3139             : 
    3140        4926 :     XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, recptr);
    3141             : 
    3142       14778 :     snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.spill",
    3143        4926 :              NameStr(MyReplicationSlot->data.name),
    3144             :              xid,
    3145        4926 :              (uint32) (recptr >> 32), (uint32) recptr);
    3146        4926 : }
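
The spill file name above encodes the slot name, the transaction ID, and the start LSN of the segment split into two 32-bit halves. The sketch below reproduces that format string in a standalone function and adds the `xid` prefix test used by the directory cleanup. `demo_spill_path` and `demo_is_spill_name` are hypothetical names, and the start LSN is assumed to be the segment number times the segment size.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the spill file path for (slotname, xid, segno).
 * Returns 1 if the whole path fit into the buffer, 0 otherwise.
 */
static int
demo_spill_path(char *path, size_t pathlen, const char *slotname,
                uint32_t xid, uint64_t segno, uint64_t seg_size)
{
    uint64_t    recptr;
    int         n;

    assert(path != NULL && pathlen > 0);

    /* start LSN of the segment, i.e. offset 0 within segno */
    recptr = segno * seg_size;

    n = snprintf(path, pathlen,
                 "pg_replslot/%s/xid-%" PRIu32 "-lsn-%" PRIX32 "-%" PRIX32 ".spill",
                 slotname, xid,
                 (uint32_t) (recptr >> 32), (uint32_t) recptr);

    return n >= 0 && (size_t) n < pathlen;
}

/* Same prefix test as the cleanup loop: only names starting with "xid". */
static int
demo_is_spill_name(const char *name)
{
    return strncmp(name, "xid", 3) == 0;
}
```

Returning whether the path fit makes truncation visible to the caller. The listing instead relies on the buffer being at least MAXPGPATH bytes.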
    3147             : 
    3148             : /*
    3149             :  * Delete all data spilled to disk after we've restarted/crashed. It will be
    3150             :  * recreated when the respective slots are reused.
    3151             :  */
    3152             : void
    3153        1390 : StartupReorderBuffer(void)
    3154             : {
    3155             :     DIR        *logical_dir;
    3156             :     struct dirent *logical_de;
    3157             : 
    3158        1390 :     logical_dir = AllocateDir("pg_replslot");
    3159        4200 :     while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
    3160             :     {
    3161        2810 :         if (strcmp(logical_de->d_name, ".") == 0 ||
    3162        1420 :             strcmp(logical_de->d_name, "..") == 0)
    3163        2780 :             continue;
    3164             : 
    3165             :         /* if it cannot be a slot, skip the directory */
    3166          30 :         if (!ReplicationSlotValidateName(logical_de->d_name, DEBUG2))
    3167           0 :             continue;
    3168             : 
    3169             :         /*
    3170             :          * ok, has to be a surviving logical slot, iterate and delete
    3171             :          * everything starting with xid-*
    3172             :          */
    3173          30 :         ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
    3174             :     }
    3175        1390 :     FreeDir(logical_dir);
    3176        1390 : }
    3177             : 
    3178             : /* ---------------------------------------
    3179             :  * toast reassembly support
    3180             :  * ---------------------------------------
    3181             :  */
    3182             : 
    3183             : /*
    3184             :  * Initialize per tuple toast reconstruction support.
    3185             :  */
    3186             : static void
    3187          36 : ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
    3188             : {
    3189             :     HASHCTL     hash_ctl;
    3190             : 
    3191             :     Assert(txn->toast_hash == NULL);
    3192             : 
    3193          36 :     memset(&hash_ctl, 0, sizeof(hash_ctl));
    3194          36 :     hash_ctl.keysize = sizeof(Oid);
    3195          36 :     hash_ctl.entrysize = sizeof(ReorderBufferToastEnt);
    3196          36 :     hash_ctl.hcxt = rb->context;
    3197          36 :     txn->toast_hash = hash_create("ReorderBufferToastHash", 5, &hash_ctl,
    3198             :                                   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
    3199          36 : }
    3200             : 
    3201             : /*
    3202             :  * Per toast-chunk handling for toast reconstruction
    3203             :  *
    3204             :  * Appends a toast chunk so we can reconstruct it when the tuple "owning" the
    3205             :  * toasted Datum comes along.
    3206             :  */
    3207             : static void
    3208         608 : ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
    3209             :                               Relation relation, ReorderBufferChange *change)
    3210             : {
    3211             :     ReorderBufferToastEnt *ent;
    3212             :     ReorderBufferTupleBuf *newtup;
    3213             :     bool        found;
    3214             :     int32       chunksize;
    3215             :     bool        isnull;
    3216             :     Pointer     chunk;
    3217         608 :     TupleDesc   desc = RelationGetDescr(relation);
    3218             :     Oid         chunk_id;
    3219             :     int32       chunk_seq;
    3220             : 
    3221         608 :     if (txn->toast_hash == NULL)
    3222          36 :         ReorderBufferToastInitHash(rb, txn);
    3223             : 
    3224             :     Assert(IsToastRelation(relation));
    3225             : 
    3226         608 :     newtup = change->data.tp.newtuple;
    3227         608 :     chunk_id = DatumGetObjectId(fastgetattr(&newtup->tuple, 1, desc, &isnull));
    3228             :     Assert(!isnull);
    3229         608 :     chunk_seq = DatumGetInt32(fastgetattr(&newtup->tuple, 2, desc, &isnull));
    3230             :     Assert(!isnull);
    3231             : 
    3232             :     ent = (ReorderBufferToastEnt *)
    3233         608 :         hash_search(txn->toast_hash,
    3234             :                     (void *) &chunk_id,
    3235             :                     HASH_ENTER,
    3236             :                     &found);
    3237             : 
    3238         608 :     if (!found)
    3239             :     {
    3240             :         Assert(ent->chunk_id == chunk_id);
    3241          44 :         ent->num_chunks = 0;
    3242          44 :         ent->last_chunk_seq = 0;
    3243          44 :         ent->size = 0;
    3244          44 :         ent->reconstructed = NULL;
    3245          44 :         dlist_init(&ent->chunks);
    3246             : 
    3247          44 :         if (chunk_seq != 0)
    3248           0 :             elog(ERROR, "got sequence entry %d for toast chunk %u instead of seq 0",
    3249             :                  chunk_seq, chunk_id);
    3250             :     }
    3251         564 :     else if (found && chunk_seq != ent->last_chunk_seq + 1)
    3252           0 :         elog(ERROR, "got sequence entry %d for toast chunk %u instead of seq %d",
    3253             :              chunk_seq, chunk_id, ent->last_chunk_seq + 1);
    3254             : 
    3255         608 :     chunk = DatumGetPointer(fastgetattr(&newtup->tuple, 3, desc, &isnull));
    3256             :     Assert(!isnull);
    3257             : 
    3258             :     /* calculate size so we can allocate the right size at once later */
    3259         608 :     if (!VARATT_IS_EXTENDED(chunk))
    3260         608 :         chunksize = VARSIZE(chunk) - VARHDRSZ;
    3261           0 :     else if (VARATT_IS_SHORT(chunk))
    3262             :         /* could happen due to heap_form_tuple doing its thing */
    3263           0 :         chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
    3264             :     else
    3265           0 :         elog(ERROR, "unexpected type of toast chunk");
    3266             : 
    3267         608 :     ent->size += chunksize;
    3268         608 :     ent->last_chunk_seq = chunk_seq;
    3269         608 :     ent->num_chunks++;
    3270         608 :     dlist_push_tail(&ent->chunks, &change->node);
    3271         608 : }
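
The chunk bookkeeping above enforces a simple invariant: the first chunk seen for a value must have sequence number 0, and every later chunk must follow its predecessor by exactly one, while the payload sizes accumulate so the full datum can be allocated in one step later. A minimal sketch of that check follows. `DemoToastEnt` and its helpers are hypothetical, and the sketch returns false where the listing raises an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-value bookkeeping, loosely modelled on the entry above. */
typedef struct DemoToastEnt
{
    int32_t     num_chunks;
    int32_t     last_chunk_seq;
    size_t      size;           /* payload bytes collected so far */
} DemoToastEnt;

static void
demo_toast_init(DemoToastEnt *ent)
{
    assert(ent != NULL);
    ent->num_chunks = 0;
    ent->last_chunk_seq = 0;
    ent->size = 0;
}

/*
 * Account for one chunk. Returns false, leaving the entry untouched,
 * if the chunk arrives out of sequence.
 */
static bool
demo_toast_append(DemoToastEnt *ent, int32_t chunk_seq, size_t chunksize)
{
    int32_t     expected;

    assert(ent != NULL);

    /* a fresh entry must start at 0, later chunks continue the run */
    expected = (ent->num_chunks == 0) ? 0 : ent->last_chunk_seq + 1;
    if (chunk_seq != expected)
        return false;

    ent->size += chunksize;
    ent->last_chunk_seq = chunk_seq;
    ent->num_chunks++;
    return true;
}
```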
    3272             : 
    3273             : /*
    3274             :  * Rejigger change->newtuple to point to in-memory toast tuples instead of
    3275             :  * on-disk toast tuples that may no longer exist (think DROP TABLE or VACUUM).
    3276             :  *
    3277             :  * We cannot replace unchanged toast tuples though, so those will still point
    3278             :  * to on-disk toast data.
    3279             :  *
    3280             :  * While updating the existing change with detoasted tuple data, we need to
    3281             :  * update the memory accounting info, because the change size will differ.
    3282             :  * Otherwise the accounting may get out of sync, triggering serialization
    3283             :  * at unexpected times.
    3284             :  *
    3285             :  * We simply subtract the size of the change before rejiggering the tuple, and
    3286             :  * then add the new size. This makes it look like the change was removed
    3287             :  * and then added back, except it only tweaks the accounting info.
    3288             :  *
    3289             :  * In particular it can't trigger serialization, which would be pointless
    3290             :  * anyway as it happens during commit processing right before handing
    3291             :  * the change to the output plugin.
    3292             :  */
    3293             : static void
    3294      294316 : ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
    3295             :                           Relation relation, ReorderBufferChange *change)
    3296             : {
    3297             :     TupleDesc   desc;
    3298             :     int         natt;
    3299             :     Datum      *attrs;
    3300             :     bool       *isnull;
    3301             :     bool       *free;
    3302             :     HeapTuple   tmphtup;
    3303             :     Relation    toast_rel;
    3304             :     TupleDesc   toast_desc;
    3305             :     MemoryContext oldcontext;
    3306             :     ReorderBufferTupleBuf *newtup;
    3307             : 
    3308             :     /* no toast tuples changed */
    3309      294316 :     if (txn->toast_hash == NULL)
    3310      293876 :         return;
    3311             : 
    3312             :     /*
    3313             :      * We're going to modify the size of the change, so to make sure the
    3314             :      * accounting is correct we'll make it look like we're removing the change
    3315             :      * now (with the old size), and then re-add it at the end.
    3316             :      */
    3317         440 :     ReorderBufferChangeMemoryUpdate(rb, change, false);
    3318             : 
    3319         440 :     oldcontext = MemoryContextSwitchTo(rb->context);
    3320             : 
    3321             :     /* we should only have toast tuples in an INSERT or UPDATE */
    3322             :     Assert(change->data.tp.newtuple);
    3323             : 
    3324         440 :     desc = RelationGetDescr(relation);
    3325             : 
    3326         440 :     toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
    3327         440 :     if (!RelationIsValid(toast_rel))
    3328           0 :         elog(ERROR, "could not open relation with OID %u",
    3329             :              relation->rd_rel->reltoastrelid);
    3330             : 
    3331         440 :     toast_desc = RelationGetDescr(toast_rel);
    3332             : 
    3333             :     /* should we allocate from stack instead? */
    3334         440 :     attrs = palloc0(sizeof(Datum) * desc->natts);
    3335         440 :     isnull = palloc0(sizeof(bool) * desc->natts);
    3336         440 :     free = palloc0(sizeof(bool) * desc->natts);
    3337             : 
    3338         440 :     newtup = change->data.tp.newtuple;
    3339             : 
    3340         440 :     heap_deform_tuple(&newtup->tuple, desc, attrs, isnull);
    3341             : 
    3342        1404 :     for (natt = 0; natt < desc->natts; natt++)
    3343             :     {
    3344         964 :         Form_pg_attribute attr = TupleDescAttr(desc, natt);
    3345             :         ReorderBufferToastEnt *ent;
    3346             :         struct varlena *varlena;
    3347             : 
    3348             :         /* va_rawsize is the size of the original datum -- including header */
    3349             :         struct varatt_external toast_pointer;
    3350             :         struct varatt_indirect redirect_pointer;
    3351         964 :         struct varlena *new_datum = NULL;
    3352             :         struct varlena *reconstructed;
    3353             :         dlist_iter  it;
    3354         964 :         Size        data_done = 0;
    3355             : 
    3356             :         /* system columns aren't toasted */
    3357         964 :         if (attr->attnum < 0)
    3358         920 :             continue;
    3359             : 
    3360         964 :         if (attr->attisdropped)
    3361           0 :             continue;
    3362             : 
    3363             :         /* not a varlena datatype */
    3364         964 :         if (attr->attlen != -1)
    3365         476 :             continue;
    3366             : 
    3367             :         /* no data */
    3368         488 :         if (isnull[natt])
    3369          24 :             continue;
    3370             : 
    3371             :         /* ok, we know we have a toast datum */
    3372         464 :         varlena = (struct varlena *) DatumGetPointer(attrs[natt]);
    3373             : 
    3374             :         /* no need to do anything if the tuple isn't external */
    3375         464 :         if (!VARATT_IS_EXTERNAL(varlena))
    3376         404 :             continue;
    3377             : 
    3378          60 :         VARATT_EXTERNAL_GET_POINTER(toast_pointer, varlena);
    3379             : 
    3380             :         /*
    3381             :          * Check whether the toast tuple changed, replace if so.
    3382             :          */
    3383             :         ent = (ReorderBufferToastEnt *)
    3384          60 :             hash_search(txn->toast_hash,
    3385             :                         (void *) &toast_pointer.va_valueid,
    3386             :                         HASH_FIND,
    3387             :                         NULL);
    3388          60 :         if (ent == NULL)
    3389          16 :             continue;
    3390             : 
    3391             :         new_datum =
    3392          44 :             (struct varlena *) palloc0(INDIRECT_POINTER_SIZE);
    3393             : 
    3394          44 :         free[natt] = true;
    3395             : 
    3396          44 :         reconstructed = palloc0(toast_pointer.va_rawsize);
    3397             : 
    3398          44 :         ent->reconstructed = reconstructed;
    3399             : 
    3400             :         /* stitch toast tuple back together from its parts */
    3401         652 :         dlist_foreach(it, &ent->chunks)
    3402             :         {
    3403             :             bool        isnull;
    3404             :             ReorderBufferChange *cchange;
    3405             :             ReorderBufferTupleBuf *ctup;
    3406             :             Pointer     chunk;
    3407             : 
    3408         608 :             cchange = dlist_container(ReorderBufferChange, node, it.cur);
    3409         608 :             ctup = cchange->data.tp.newtuple;
    3410         608 :             chunk = DatumGetPointer(fastgetattr(&ctup->tuple, 3, toast_desc, &isnull));
    3411             : 
    3412             :             Assert(!isnull);
    3413             :             Assert(!VARATT_IS_EXTERNAL(chunk));
    3414             :             Assert(!VARATT_IS_SHORT(chunk));
    3415             : 
    3416        1216 :             memcpy(VARDATA(reconstructed) + data_done,
    3417         608 :                    VARDATA(chunk),
    3418         608 :                    VARSIZE(chunk) - VARHDRSZ);
    3419         608 :             data_done += VARSIZE(chunk) - VARHDRSZ;
    3420             :         }
    3421             :         Assert(data_done == toast_pointer.va_extsize);
    3422             : 
    3423             :         /* make sure it's marked as compressed or not */
    3424          44 :         if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
    3425           8 :             SET_VARSIZE_COMPRESSED(reconstructed, data_done + VARHDRSZ);
    3426             :         else
    3427          36 :             SET_VARSIZE(reconstructed, data_done + VARHDRSZ);
    3428             : 
    3429          44 :         memset(&redirect_pointer, 0, sizeof(redirect_pointer));
    3430          44 :         redirect_pointer.pointer = reconstructed;
    3431             : 
    3432          44 :         SET_VARTAG_EXTERNAL(new_datum, VARTAG_INDIRECT);
    3433          44 :         memcpy(VARDATA_EXTERNAL(new_datum), &redirect_pointer,
    3434             :                sizeof(redirect_pointer));
    3435             : 
    3436          44 :         attrs[natt] = PointerGetDatum(new_datum);
    3437             :     }
    3438             : 
    3439             :     /*
    3440             :      * Build tuple in separate memory & copy tuple back into the tuplebuf
    3441             :      * passed to the output plugin. We can't directly heap_fill_tuple() into
    3442             :      * the tuplebuf because attrs[] will point back into the current content.
    3443             :      */
    3444         440 :     tmphtup = heap_form_tuple(desc, attrs, isnull);
    3445             :     Assert(newtup->tuple.t_len <= MaxHeapTupleSize);
    3446             :     Assert(ReorderBufferTupleBufData(newtup) == newtup->tuple.t_data);
    3447             : 
    3448         440 :     memcpy(newtup->tuple.t_data, tmphtup->t_data, tmphtup->t_len);
    3449         440 :     newtup->tuple.t_len = tmphtup->t_len;
    3450             : 
    3451             :     /*
    3452             :      * free resources we won't further need, more persistent stuff will be
    3453             :      * freed in ReorderBufferToastReset().
    3454             :      */
    3455         440 :     RelationClose(toast_rel);
    3456         440 :     pfree(tmphtup);
    3457        1404 :     for (natt = 0; natt < desc->natts; natt++)
    3458             :     {
    3459         964 :         if (free[natt])
    3460          44 :             pfree(DatumGetPointer(attrs[natt]));
    3461             :     }
    3462         440 :     pfree(attrs);
    3463         440 :     pfree(free);
    3464         440 :     pfree(isnull);
    3465             : 
    3466         440 :     MemoryContextSwitchTo(oldcontext);
    3467             : 
    3468             :     /* now add the change back, with the correct size */
    3469         440 :     ReorderBufferChangeMemoryUpdate(rb, change, true);
    3470             : }
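
The chunk-stitching loop in the function above can be looked at in isolation. What follows is a minimal sketch, not PostgreSQL code: `Chunk` and `stitch_chunks` are hypothetical names, and a plain (pointer, length) pair stands in for the varlena toast chunks. It only shows the running-offset pattern that `data_done` implements.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one toast chunk: payload bytes plus length. */
typedef struct Chunk
{
    const char *data;
    size_t      len;
} Chunk;

/*
 * Copy nchunks chunks, in order, into dst (capacity dst_size).
 * Returns the number of bytes written, playing the role of data_done.
 */
static size_t
stitch_chunks(char *dst, size_t dst_size, const Chunk *chunks, size_t nchunks)
{
    size_t      data_done = 0;

    for (size_t i = 0; i < nchunks; i++)
    {
        /* the caller is expected to size dst for the whole value up front */
        assert(chunks[i].len <= dst_size - data_done);
        memcpy(dst + data_done, chunks[i].data, chunks[i].len);
        data_done += chunks[i].len;
    }
    return data_done;
}
```

As in the listing, the caller can compare the returned byte count against the expected total size to detect missing chunks.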
    3471             : 
    3472             : /*
    3473             :  * Free all resources allocated for toast reconstruction.
    3474             :  */
    3475             : static void
    3476      293912 : ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
    3477             : {
    3478             :     HASH_SEQ_STATUS hstat;
    3479             :     ReorderBufferToastEnt *ent;
    3480             : 
    3481      293912 :     if (txn->toast_hash == NULL)
    3482      293876 :         return;
    3483             : 
    3484             :     /* sequentially walk over the hash and free everything */
    3485          36 :     hash_seq_init(&hstat, txn->toast_hash);
    3486          80 :     while ((ent = (ReorderBufferToastEnt *) hash_seq_search(&hstat)) != NULL)
    3487             :     {
    3488             :         dlist_mutable_iter it;
    3489             : 
    3490          44 :         if (ent->reconstructed != NULL)
    3491          44 :             pfree(ent->reconstructed);
    3492             : 
    3493         652 :         dlist_foreach_modify(it, &ent->chunks)
    3494             :         {
    3495         608 :             ReorderBufferChange *change =
    3496         608 :             dlist_container(ReorderBufferChange, node, it.cur);
    3497             : 
    3498         608 :             dlist_delete(&change->node);
    3499         608 :             ReorderBufferReturnChange(rb, change);
    3500             :         }
    3501             :     }
    3502             : 
    3503          36 :     hash_destroy(txn->toast_hash);
    3504          36 :     txn->toast_hash = NULL;
    3505             : }
    3506             : 
    3507             : 
    3508             : /* ---------------------------------------
    3509             :  * Visibility support for logical decoding
    3510             :  *
    3511             :  *
    3512             :  * Lookup actual cmin/cmax values when using decoding snapshot. We can't
    3513             :  * always rely on stored cmin/cmax values because of two scenarios:
    3514             :  *
    3515             :  * * A tuple got changed multiple times during a single transaction and thus
    3516             :  *   has got a combocid. Combocids are only valid for the duration of a
    3517             :  *   single transaction.
    3518             :  * * A tuple with a cmin but no cmax (and thus no combocid) got
    3519             :  *   deleted/updated in another transaction than the one which created it
    3520             :  *   which we are looking at right now. As only one of cmin, cmax or combocid
    3521             :  *   is actually stored in the heap we don't have access to the value we
    3522             :  *   need anymore.
    3523             :  *
    3524             :  * To resolve those problems we have a per-transaction hash of (cmin,
    3525             :  * cmax) tuples keyed by (relfilenode, ctid) which contains the actual
    3526             :  * (cmin, cmax) values. That also takes care of combocids by simply
    3527             :  * not caring about them at all. As we have the real cmin/cmax values
    3528             :  * combocids aren't interesting.
    3529             :  *
    3530             :  * As we only care about catalog tuples here the overhead of this
    3531             :  * hashtable should be acceptable.
    3532             :  *
    3533             :  * Heap rewrites complicate this a bit, check rewriteheap.c for
    3534             :  * details.
    3535             :  * -------------------------------------------------------------------------
    3536             :  */
    3537             : 
    3538             : /* struct for sorting mapping files by LSN efficiently */
    3539             : typedef struct RewriteMappingFile
    3540             : {
    3541             :     XLogRecPtr  lsn;
    3542             :     char        fname[MAXPGPATH];
    3543             : } RewriteMappingFile;
    3544             : 
    3545             : #ifdef NOT_USED
    3546             : static void
    3547             : DisplayMapping(HTAB *tuplecid_data)
    3548             : {
    3549             :     HASH_SEQ_STATUS hstat;
    3550             :     ReorderBufferTupleCidEnt *ent;
    3551             : 
    3552             :     hash_seq_init(&hstat, tuplecid_data);
    3553             :     while ((ent = (ReorderBufferTupleCidEnt *) hash_seq_search(&hstat)) != NULL)
    3554             :     {
    3555             :         elog(DEBUG3, "mapping: node: %u/%u/%u tid: %u/%u cmin: %u, cmax: %u",
    3556             :              ent->key.relnode.dbNode,
    3557             :              ent->key.relnode.spcNode,
    3558             :              ent->key.relnode.relNode,
    3559             :              ItemPointerGetBlockNumber(&ent->key.tid),
    3560             :              ItemPointerGetOffsetNumber(&ent->key.tid),
    3561             :              ent->cmin,
    3562             :              ent->cmax
    3563             :             );
    3564             :     }
    3565             : }
    3566             : #endif
    3567             : 
    3568             : /*
    3569             :  * Apply a single mapping file to tuplecid_data.
    3570             :  *
    3571             :  * The mapping file has to have been verified to be a) committed b) for our
    3572             :  * transaction c) applied in LSN order.
    3573             :  */
    3574             : static void
    3575          44 : ApplyLogicalMappingFile(HTAB *tuplecid_data, Oid relid, const char *fname)
    3576             : {
    3577             :     char        path[MAXPGPATH];
    3578             :     int         fd;
    3579             :     int         readBytes;
    3580             :     LogicalRewriteMappingData map;
    3581             : 
    3582          44 :     sprintf(path, "pg_logical/mappings/%s", fname);
    3583          44 :     fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
    3584          44 :     if (fd < 0)
    3585           0 :         ereport(ERROR,
    3586             :                 (errcode_for_file_access(),
    3587             :                  errmsg("could not open file \"%s\": %m", path)));
    3588             : 
    3589             :     while (true)
    3590         238 :     {
    3591             :         ReorderBufferTupleCidKey key;
    3592             :         ReorderBufferTupleCidEnt *ent;
    3593             :         ReorderBufferTupleCidEnt *new_ent;
    3594             :         bool        found;
    3595             : 
    3596             :         /* be careful about padding */
    3597         282 :         memset(&key, 0, sizeof(ReorderBufferTupleCidKey));
    3598             : 
    3599             :         /* read all mappings till the end of the file */
    3600         282 :         pgstat_report_wait_start(WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ);
    3601         282 :         readBytes = read(fd, &map, sizeof(LogicalRewriteMappingData));
    3602         282 :         pgstat_report_wait_end();
    3603             : 
    3604         282 :         if (readBytes < 0)
    3605           0 :             ereport(ERROR,
    3606             :                     (errcode_for_file_access(),
    3607             :                      errmsg("could not read file \"%s\": %m",
    3608             :                             path)));
    3609         282 :         else if (readBytes == 0)    /* EOF */
    3610          44 :             break;
    3611         238 :         else if (readBytes != sizeof(LogicalRewriteMappingData))
    3612           0 :             ereport(ERROR,
    3613             :                     (errcode_for_file_access(),
    3614             :                      errmsg("could not read from file \"%s\": read %d instead of %d bytes",
    3615             :                             path, readBytes,
    3616             :                             (int32) sizeof(LogicalRewriteMappingData))));
    3617             : 
    3618         238 :         key.relnode = map.old_node;
    3619         238 :         ItemPointerCopy(&map.old_tid,
    3620             :                         &key.tid);
    3621             : 
    3622             : 
    3623             :         ent = (ReorderBufferTupleCidEnt *)
    3624         238 :             hash_search(tuplecid_data,
    3625             :                         (void *) &key,
    3626             :                         HASH_FIND,
    3627             :                         NULL);
    3628             : 
    3629             :         /* no existing mapping, no need to update */
    3630         238 :         if (!ent)
    3631           0 :             continue;
    3632             : 
    3633         238 :         key.relnode = map.new_node;
    3634         238 :         ItemPointerCopy(&map.new_tid,
    3635             :                         &key.tid);
    3636             : 
    3637             :         new_ent = (ReorderBufferTupleCidEnt *)
    3638         238 :             hash_search(tuplecid_data,
    3639             :                         (void *) &key,
    3640             :                         HASH_ENTER,
    3641             :                         &found);
    3642             : 
    3643         238 :         if (found)
    3644             :         {
    3645             :             /*
    3646             :              * Make sure the existing mapping makes sense. We sometimes update
    3647             :              * old records that did not yet have a cmax (e.g. pg_class' own
    3648             :              * entry while rewriting it) during rewrites, so allow that.
    3649             :              */
    3650             :             Assert(ent->cmin == InvalidCommandId || ent->cmin == new_ent->cmin);
    3651             :             Assert(ent->cmax == InvalidCommandId || ent->cmax == new_ent->cmax);
    3652             :         }
    3653             :         else
    3654             :         {
    3655             :             /* update mapping */
    3656         226 :             new_ent->cmin = ent->cmin;
    3657         226 :             new_ent->cmax = ent->cmax;
    3658         226 :             new_ent->combocid = ent->combocid;
    3659             :         }
    3660             :     }
    3661             : 
    3662          44 :     if (CloseTransientFile(fd) != 0)
    3663           0 :         ereport(ERROR,
    3664             :                 (errcode_for_file_access(),
    3665             :                  errmsg("could not close file \"%s\": %m", path)));
    3666          44 : }
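
The "be careful about padding" memset above is presumably needed because the key is hashed and compared as raw bytes, so uninitialized padding could make two logically equal keys differ. The sketch below illustrates that idea only: `DemoKey`, `demo_key_init` and `demo_key_equal` are hypothetical names, and the struct layout is an assumption chosen so that typical compilers insert trailing padding.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key; most ABIs add 2 padding bytes after 'offset'. */
typedef struct DemoKey
{
    uint32_t    relnode;
    uint16_t    offset;
} DemoKey;

/* Zero the whole struct first so padding bytes have a known value. */
static void
demo_key_init(DemoKey *key, uint32_t relnode, uint16_t offset)
{
    memset(key, 0, sizeof(DemoKey));
    key->relnode = relnode;
    key->offset = offset;
}

/* Byte-wise comparison, the way a blob-keyed hash table would compare. */
static bool
demo_key_equal(const DemoKey *a, const DemoKey *b)
{
    return memcmp(a, b, sizeof(DemoKey)) == 0;
}
```

Without the memset, `demo_key_equal` could report a mismatch for keys whose named fields are identical, depending on what happened to be in the padding bytes.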
    3667             : 
    3668             : 
    3669             : /*
    3670             :  * Check whether the TransactionId 'xid' is in the pre-sorted array 'xip'.
    3671             :  */
    3672             : static bool
    3673         580 : TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
    3674             : {
    3675         580 :     return bsearch(&xid, xip, num,
    3676         580 :                    sizeof(TransactionId), xidComparator) != NULL;
    3677             : }
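
The membership test above relies on the array being pre-sorted in the same order the comparator defines. A minimal standalone sketch of the same bsearch pattern follows; `cmp_u32` and `u32_in_sorted_array` are hypothetical names, with a plain numeric comparator on `uint32_t` standing in for `xidComparator`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Plain numeric three-way comparison of two uint32_t values. */
static int
cmp_u32(const void *a, const void *b)
{
    uint32_t    x = *(const uint32_t *) a;
    uint32_t    y = *(const uint32_t *) b;

    if (x < y)
        return -1;
    if (x > y)
        return 1;
    return 0;
}

/* True if v occurs in arr, which must already be sorted by cmp_u32. */
static bool
u32_in_sorted_array(uint32_t v, const uint32_t *arr, size_t n)
{
    return bsearch(&v, arr, n, sizeof(uint32_t), cmp_u32) != NULL;
}
```

If the array is not sorted by the same comparator, bsearch may miss elements that are present, which is why the comment above stresses "pre-sorted".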
    3678             : 
    3679             : /*
    3680             :  * list_sort() comparator for sorting RewriteMappingFiles in LSN order.
    3681             :  */
    3682             : static int
    3683          52 : file_sort_by_lsn(const ListCell *a_p, const ListCell *b_p)
    3684             : {
    3685          52 :     RewriteMappingFile *a = (RewriteMappingFile *) lfirst(a_p);
    3686          52 :     RewriteMappingFile *b = (RewriteMappingFile *) lfirst(b_p);
    3687             : 
    3688          52 :     if (a->lsn < b->lsn)
    3689          28 :         return -1;
    3690          24 :     else if (a->lsn > b->lsn)
    3691          24 :         return 1;
    3692           0 :     return 0;
    3693             : }
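
The comparator above uses explicit `<` and `>` tests rather than returning a difference, which matters for 64-bit LSNs because a subtraction truncated to int can report the wrong sign. The sketch below shows the same three-way comparison with the standard `qsort`; `DemoFile`, `demo_cmp_by_lsn` and `demo_sort_by_lsn` are hypothetical names, not part of PostgreSQL.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: a 64-bit position plus a file name. */
typedef struct DemoFile
{
    uint64_t    lsn;
    const char *name;
} DemoFile;

/* Three-way comparison by lsn without any subtraction. */
static int
demo_cmp_by_lsn(const void *a_p, const void *b_p)
{
    const DemoFile *a = (const DemoFile *) a_p;
    const DemoFile *b = (const DemoFile *) b_p;

    if (a->lsn < b->lsn)
        return -1;
    if (a->lsn > b->lsn)
        return 1;
    return 0;
}

/* Sort an array of records into ascending lsn order. */
static void
demo_sort_by_lsn(DemoFile *files, size_t n)
{
    qsort(files, n, sizeof(DemoFile), demo_cmp_by_lsn);
}
```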
    3694             : 
    3695             : /*
    3696             :  * Apply any existing logical remapping files if there are any targeted at our
    3697             :  * transaction for relid.
    3698             :  */
    3699             : static void
    3700          10 : UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
    3701             : {
    3702             :     DIR        *mapping_dir;
    3703             :     struct dirent *mapping_de;
    3704          10 :     List       *files = NIL;
    3705             :     ListCell   *file;
    3706          10 :     Oid         dboid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
    3707             : 
    3708          10 :     mapping_dir = AllocateDir("pg_logical/mappings");
    3709         920 :     while ((mapping_de = ReadDir(mapping_dir, "pg_logical/mappings")) != NULL)
    3710             :     {
    3711             :         Oid         f_dboid;
    3712             :         Oid         f_relid;
    3713             :         TransactionId f_mapped_xid;
    3714             :         TransactionId f_create_xid;
    3715             :         XLogRecPtr  f_lsn;
    3716             :         uint32      f_hi,
    3717             :                     f_lo;
    3718             :         RewriteMappingFile *f;
    3719             : 
    3720         910 :         if (strcmp(mapping_de->d_name, ".") == 0 ||
    3721         900 :             strcmp(mapping_de->d_name, "..") == 0)
    3722         866 :             continue;
    3723             : 
    3724             :         /* Ignore files that aren't ours */
    3725         890 :         if (strncmp(mapping_de->d_name, "map-", 4) != 0)
    3726           0 :             continue;
    3727             : 
    3728         890 :         if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
    3729             :                    &f_dboid, &f_relid, &f_hi, &f_lo,
    3730             :                    &f_mapped_xid, &f_create_xid) != 6)
    3731           0 :             elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
    3732             : 
    3733         890 :         f_lsn = ((uint64) f_hi) << 32 | f_lo;
    3734             : 
    3735             :         /* mapping for another database */
    3736         890 :         if (f_dboid != dboid)
    3737           0 :             continue;
    3738             : 
    3739             :         /* mapping for another relation */
    3740         890 :         if (f_relid != relid)
    3741          90 :             continue;
    3742             : 
    3743             :         /* did the creating transaction abort? */
    3744         800 :         if (!TransactionIdDidCommit(f_create_xid))
    3745         220 :             continue;
    3746             : 
    3747             :         /* not for our transaction */
    3748         580 :         if (!TransactionIdInArray(f_mapped_xid, snapshot->subxip, snapshot->subxcnt))
    3749         536 :             continue;
    3750             : 
    3751             :         /* ok, relevant, queue for apply */
    3752          44 :         f = palloc(sizeof(RewriteMappingFile));
    3753          44 :         f->lsn = f_lsn;
    3754          44 :         strcpy(f->fname, mapping_de->d_name);
    3755          44 :         files = lappend(files, f);
    3756             :     }
    3757          10 :     FreeDir(mapping_dir);
    3758             : 
    3759             :     /* sort files so we apply them in LSN order */
    3760          10 :     list_sort(files, file_sort_by_lsn);
    3761             : 
    3762          54 :     foreach(file, files)
    3763             :     {
    3764          44 :         RewriteMappingFile *f = (RewriteMappingFile *) lfirst(file);
    3765             : 
    3766          44 :         elog(DEBUG1, "applying mapping: \"%s\" in %u", f->fname,
    3767             :              snapshot->subxip[0]);
    3768          44 :         ApplyLogicalMappingFile(tuplecid_data, relid, f->fname);
    3769          44 :         pfree(f);
    3770             :     }
    3771          10 : }
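
The filename parsing and LSN reconstruction in the function above can be sketched on their own. This is an illustration under stated assumptions: the format string `DEMO_REWRITE_FORMAT` is assumed to match `LOGICAL_REWRITE_FORMAT` (whose definition is not shown in this listing), and `DemoMapName` and `parse_mapping_name` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: map-<dboid>-<relid>-<lsn hi>_<lsn lo>-<mapped xid>-<create xid> */
#define DEMO_REWRITE_FORMAT "map-%x-%x-%X_%X-%x-%x"

typedef struct DemoMapName
{
    unsigned int dboid;
    unsigned int relid;
    unsigned int mapped_xid;
    unsigned int create_xid;
    uint64_t    lsn;
} DemoMapName;

/* Parse a mapping file name; returns false unless all six fields match. */
static bool
parse_mapping_name(const char *name, DemoMapName *out)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(name, DEMO_REWRITE_FORMAT,
               &out->dboid, &out->relid, &hi, &lo,
               &out->mapped_xid, &out->create_xid) != 6)
        return false;

    /* widen before shifting so the high half is not lost */
    out->lsn = ((uint64_t) hi << 32) | lo;
    return true;
}
```

The cast to a 64-bit type before the shift is the important detail: shifting a 32-bit value by 32 bits would be undefined behavior in C.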
    3772             : 
    3773             : /*
    3774             :  * Lookup cmin/cmax of a tuple, during logical decoding where we can't rely on
    3775             :  * combocids.
    3776             :  */
    3777             : bool
    3778         878 : ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
    3779             :                               Snapshot snapshot,
    3780             :                               HeapTuple htup, Buffer buffer,
    3781             :                               CommandId *cmin, CommandId *cmax)
    3782             : {
    3783             :     ReorderBufferTupleCidKey key;
    3784             :     ReorderBufferTupleCidEnt *ent;
    3785             :     ForkNumber  forkno;
    3786             :     BlockNumber blockno;
    3787         878 :     bool        updated_mapping = false;
    3788             : 
    3789             :     /* be careful about padding */
    3790         878 :     memset(&key, 0, sizeof(key));
    3791             : 
    3792             :     Assert(!BufferIsLocal(buffer));
    3793             : 
    3794             :     /*
    3795             :      * get relfilenode from the buffer, no convenient way to access it other
    3796             :      * than that.
    3797             :      */
    3798         878 :     BufferGetTag(buffer, &key.relnode, &forkno, &blockno);
    3799             : 
    3800             :     /* tuples can only be in the main fork */
    3801             :     Assert(forkno == MAIN_FORKNUM);
    3802             :     Assert(blockno == ItemPointerGetBlockNumber(&htup->t_self));
    3803             : 
    3804         878 :     ItemPointerCopy(&htup->t_self,
    3805             :                     &key.tid);
    3806             : 
    3807         888 : restart:
    3808             :     ent = (ReorderBufferTupleCidEnt *)
    3809         888 :         hash_search(tuplecid_data,
    3810             :                     (void *) &key,
    3811             :                     HASH_FIND,
    3812             :                     NULL);
    3813             : 
    3814             :     /*
    3815             :      * failed to find a mapping, check whether the table was rewritten and
    3816             :      * apply mapping if so, but only do that once - there can be no new
    3817             :      * mappings while we are in here since we have to hold a lock on the
    3818             :      * relation.
    3819             :      */
    3820         888 :     if (ent == NULL && !updated_mapping)
    3821             :     {
    3822          10 :         UpdateLogicalMappings(tuplecid_data, htup->t_tableOid, snapshot);
    3823             :         /* now check but don't update for a mapping again */
    3824          10 :         updated_mapping = true;
    3825          10 :         goto restart;
    3826             :     }
    3827         878 :     else if (ent == NULL)
    3828           0 :         return false;
    3829             : 
    3830         878 :     if (cmin)
    3831         878 :         *cmin = ent->cmin;
    3832         878 :     if (cmax)
    3833         878 :         *cmax = ent->cmax;
    3834         878 :     return true;
    3835             : }

Generated by: LCOV version 1.13