LCOV - code coverage report
Current view: top level - src/backend/replication/logical - reorderbuffer.c (source / functions)
Test: PostgreSQL 15devel
Date: 2021-12-09 03:08:47

              Hit    Total    Coverage
Lines:       1377     1493      92.2 %
Functions:     86       86     100.0 %

          Line data    Source code
       1             : /*-------------------------------------------------------------------------
       2             :  *
       3             :  * reorderbuffer.c
       4             :  *    PostgreSQL logical replay/reorder buffer management
       5             :  *
       6             :  *
       7             :  * Copyright (c) 2012-2021, PostgreSQL Global Development Group
       8             :  *
       9             :  *
      10             :  * IDENTIFICATION
      11             :  *    src/backend/replication/logical/reorderbuffer.c
      12             :  *
      13             :  * NOTES
      14             :  *    This module gets handed individual pieces of transactions in the order
      15             :  *    they are written to the WAL and is responsible for reassembling them into
      16             :  *    toplevel transaction sized pieces. When a transaction is completely
      17             :  *    reassembled - signaled by reading the transaction commit record - it
      18             :  *    will then call the output plugin (cf. ReorderBufferCommit()) with the
      19             :  *    individual changes. The output plugins rely on snapshots built by
      20             :  *    snapbuild.c which hands them to us.
      21             :  *
      22             :  *    Transactions and subtransactions/savepoints in postgres are not
      23             :  *    immediately linked to each other from outside the performing
      24             :  *    backend. They are linked together only at commit/abort (or by special
      25             :  *    xact_assignment records), which means that we have to splice together a
      26             :  *    toplevel transaction from its subtransactions. To do that efficiently we
      27             :  *    build a binary heap indexed by the smallest current lsn of the individual
      28             :  *    subtransactions' changestreams. As the individual streams are inherently
      29             :  *    ordered by LSN - since that is where we build them from - the transaction
      30             :  *    can easily be reassembled by always using the subtransaction with the
      31             :  *    smallest current LSN from the heap.
      32             :  *
      33             :  *    In order to cope with large transactions - which can be several times as
      34             :  *    big as the available memory - this module supports spooling the contents
      35             :  *    of a large transaction to disk. When the transaction is replayed, the
      36             :  *    contents of individual (sub-)transactions will be read from disk in
      37             :  *    chunks.
      38             :  *
      39             :  *    This module also has to deal with reassembling toast records from the
      40             :  *    individual chunks stored in WAL. When a new (or initial) version of a
      41             :  *    tuple is stored in WAL it will always be preceded by the toast chunks
      42             :  *    emitted for the columns stored out of line. Within a single toplevel
      43             :  *    transaction there will be no other data-carrying records between a row's
      44             :  *    toast chunks and the row data itself. See ReorderBufferToast* for
      45             :  *    details.
      46             :  *
      47             :  *    ReorderBuffer uses two special memory context types - SlabContext for
      48             :  *    allocations of fixed-length structures (changes and transactions), and
      49             :  *    GenerationContext for the variable-length transaction data (allocated
      50             :  *    and freed in groups with similar lifespans).
      51             :  *
      52             :  *    To limit the amount of memory used by decoded changes, we track memory
      53             :  *    used at the reorder buffer level (i.e. total amount of memory), and for
      54             :  *    each transaction. When the total amount of used memory exceeds the
      55             :  *    limit, the transaction consuming the most memory is then serialized to
      56             :  *    disk.
      57             :  *
      58             :  *    Only decoded changes are evicted from memory (spilled to disk), not the
      59             :  *    transaction records. The number of toplevel transactions is limited,
      60             :  *    but a transaction with many subtransactions may still consume significant
      61             :  *    amounts of memory. However, the transaction records are fairly small and
      62             :  *    are not included in the memory limit.
      63             :  *
      64             :  *    The current eviction algorithm is very simple - the transaction is
      65             :  *    picked merely by size, while it might be useful to also consider age
      66             :  *    (LSN) of the changes for example. With the new Generational memory
      67             :  *    allocator, evicting the oldest changes would make it more likely that the
      68             :  *    memory actually gets freed.
      69             :  *
      70             :  *    We still rely on max_changes_in_memory when loading serialized changes
      71             :  *    back into memory. At that point we can't use the memory limit directly
      72             :  *    as we load the subxacts independently. One option to deal with this
      73             :  *    would be to count the subxacts, and allow each to allocate 1/N of the
      74             :  *    memory limit. That however does not seem very appealing, because with
      75             :  *    many subtransactions it may easily cause thrashing (short cycles of
      76             :  *    deserializing and applying very few changes). We probably should give
      77             :  *    a bit more memory to the oldest subtransactions, because it's likely
      78             :  *    they are the source for the next sequence of changes.
      79             :  *
      80             :  * -------------------------------------------------------------------------
      81             :  */
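
The header comment above describes reassembling a toplevel transaction by repeatedly taking the (sub)transaction with the smallest current LSN from a binary heap. The following is a minimal standalone sketch of that k-way merge, not the code used by this file: SketchStream, sketch_sift_down, sketch_merge and SKETCH_MAX_STREAMS are hypothetical names, plain integers stand in for LSNs, and the heap is hand-rolled rather than built on lib/binaryheap.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_STREAMS 16   /* hypothetical cap, for illustration only */

/* Hypothetical stand-in for one (sub)transaction's LSN-ordered changes. */
typedef struct SketchStream
{
    const uint64_t *lsns;       /* ascending LSNs */
    size_t      len;            /* number of entries in lsns[] */
    size_t      pos;            /* index of the next unread entry */
} SketchStream;

/* Current (smallest unread) LSN of a stream that is not exhausted. */
static uint64_t
sketch_head(const SketchStream *s)
{
    return s->lsns[s->pos];
}

/* Restore the min-heap property below index i; heap[] holds stream indexes. */
static void
sketch_sift_down(size_t *heap, size_t n, size_t i, const SketchStream *s)
{
    for (;;)
    {
        size_t      l = 2 * i + 1;
        size_t      r = l + 1;
        size_t      m = i;
        size_t      tmp;

        if (l < n && sketch_head(&s[heap[l]]) < sketch_head(&s[heap[m]]))
            m = l;
        if (r < n && sketch_head(&s[heap[r]]) < sketch_head(&s[heap[m]]))
            m = r;
        if (m == i)
            return;
        tmp = heap[i];
        heap[i] = heap[m];
        heap[m] = tmp;
        i = m;
    }
}

/*
 * Merge the streams into out[] in global LSN order and return the number
 * of entries written. out[] must have room for the sum of all lengths.
 */
static size_t
sketch_merge(SketchStream *s, size_t nstreams, uint64_t *out)
{
    size_t      heap[SKETCH_MAX_STREAMS];
    size_t      n = 0;
    size_t      written = 0;

    assert(nstreams <= SKETCH_MAX_STREAMS);

    /* Only non-empty streams take part in the merge. */
    for (size_t i = 0; i < nstreams; i++)
    {
        s[i].pos = 0;
        if (s[i].len > 0)
            heap[n++] = i;
    }

    /* Build the heap bottom-up. */
    for (size_t i = n / 2; i-- > 0;)
        sketch_sift_down(heap, n, i, s);

    while (n > 0)
    {
        SketchStream *top = &s[heap[0]];

        out[written++] = top->lsns[top->pos++];

        /* Drop an exhausted stream, then re-establish heap order. */
        if (top->pos == top->len)
            heap[0] = heap[--n];
        sketch_sift_down(heap, n, 0, s);
    }

    return written;
}
```

Because each stream is already sorted, only the head of each stream is ever compared, so every emitted change costs one O(log k) heap adjustment for k streams.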
      82             : #include "postgres.h"
      83             : 
      84             : #include <unistd.h>
      85             : #include <sys/stat.h>
      86             : 
      87             : #include "access/detoast.h"
      88             : #include "access/heapam.h"
      89             : #include "access/rewriteheap.h"
      90             : #include "access/transam.h"
      91             : #include "access/xact.h"
      92             : #include "access/xlog_internal.h"
      93             : #include "catalog/catalog.h"
      94             : #include "lib/binaryheap.h"
      95             : #include "miscadmin.h"
      96             : #include "pgstat.h"
      97             : #include "replication/logical.h"
      98             : #include "replication/reorderbuffer.h"
      99             : #include "replication/slot.h"
     100             : #include "replication/snapbuild.h"    /* just for SnapBuildSnapDecRefcount */
     101             : #include "storage/bufmgr.h"
     102             : #include "storage/fd.h"
     103             : #include "storage/sinval.h"
     104             : #include "utils/builtins.h"
     105             : #include "utils/combocid.h"
     106             : #include "utils/memdebug.h"
     107             : #include "utils/memutils.h"
     108             : #include "utils/rel.h"
     109             : #include "utils/relfilenodemap.h"
     110             : 
     111             : 
     112             : /* entry for a hash table we use to map from xid to our transaction state */
     113             : typedef struct ReorderBufferTXNByIdEnt
     114             : {
     115             :     TransactionId xid;
     116             :     ReorderBufferTXN *txn;
     117             : } ReorderBufferTXNByIdEnt;
     118             : 
     119             : /* data structures for (relfilenode, ctid) => (cmin, cmax) mapping */
     120             : typedef struct ReorderBufferTupleCidKey
     121             : {
     122             :     RelFileNode relnode;
     123             :     ItemPointerData tid;
     124             : } ReorderBufferTupleCidKey;
     125             : 
     126             : typedef struct ReorderBufferTupleCidEnt
     127             : {
     128             :     ReorderBufferTupleCidKey key;
     129             :     CommandId   cmin;
     130             :     CommandId   cmax;
     131             :     CommandId   combocid;       /* just for debugging */
     132             : } ReorderBufferTupleCidEnt;
     133             : 
     134             : /* Virtual file descriptor with file offset tracking */
     135             : typedef struct TXNEntryFile
     136             : {
     137             :     File        vfd;            /* -1 when the file is closed */
     138             :     off_t       curOffset;      /* offset for next write or read. Reset to 0
     139             :                                  * when vfd is opened. */
     140             : } TXNEntryFile;
     141             : 
     142             : /* k-way in-order change iteration support structures */
     143             : typedef struct ReorderBufferIterTXNEntry
     144             : {
     145             :     XLogRecPtr  lsn;
     146             :     ReorderBufferChange *change;
     147             :     ReorderBufferTXN *txn;
     148             :     TXNEntryFile file;
     149             :     XLogSegNo   segno;
     150             : } ReorderBufferIterTXNEntry;
     151             : 
     152             : typedef struct ReorderBufferIterTXNState
     153             : {
     154             :     binaryheap *heap;
     155             :     Size        nr_txns;
     156             :     dlist_head  old_change;
     157             :     ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
     158             : } ReorderBufferIterTXNState;
     159             : 
     160             : /* toast data structures */
     161             : typedef struct ReorderBufferToastEnt
     162             : {
     163             :     Oid         chunk_id;       /* toast_table.chunk_id */
     164             :     int32       last_chunk_seq; /* toast_table.chunk_seq of the last chunk we
     165             :                                  * have seen */
     166             :     Size        num_chunks;     /* number of chunks we've already seen */
     167             :     Size        size;           /* combined size of chunks seen */
     168             :     dlist_head  chunks;         /* linked list of chunks */
     169             :     struct varlena *reconstructed;  /* reconstructed varlena now pointed to in
     170             :                                      * main tup */
     171             : } ReorderBufferToastEnt;
     172             : 
     173             : /* Disk serialization support data structures */
     174             : typedef struct ReorderBufferDiskChange
     175             : {
     176             :     Size        size;
     177             :     ReorderBufferChange change;
     178             :     /* data follows */
     179             : } ReorderBufferDiskChange;
     180             : 
     181             : #define IsSpecInsert(action) \
     182             : ( \
     183             :     ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT) \
     184             : )
     185             : #define IsSpecConfirmOrAbort(action) \
     186             : ( \
     187             :     (((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM) || \
     188             :     ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT)) \
     189             : )
     190             : #define IsInsertOrUpdate(action) \
     191             : ( \
     192             :     (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
     193             :     ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
     194             :     ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)) \
     195             : )
     196             : 
     197             : /*
     198             :  * Maximum number of changes kept in memory, per transaction. After that,
     199             :  * changes are spooled to disk.
     200             :  *
     201             :  * The current value should be sufficient to decode the entire transaction
     202             :  * without hitting disk in OLTP workloads, while starting to spool to disk in
     203             :  * other workloads reasonably fast.
     204             :  *
     205             :  * At some point in the future it probably makes sense to have a more elaborate
     206             :  * resource management here, but it's not entirely clear what that would look
     207             :  * like.
     208             :  */
     209             : int         logical_decoding_work_mem;
     210             : static const Size max_changes_in_memory = 4096; /* XXX for restore only */
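
The header comment of this file says that when total memory use exceeds the limit, the transaction consuming the most memory is serialized to disk. A minimal sketch of that size-only eviction policy follows. SketchTxnMem, sketch_largest and sketch_enforce_limit are hypothetical names, "spilling" is reduced to zeroing the in-memory size, and the strict "greater than" comparison at the limit is an assumption taken from the wording "exceeds", not from the real ReorderBufferCheckMemoryLimit().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-transaction accounting record. */
typedef struct SketchTxnMem
{
    unsigned    xid;
    size_t      size;           /* bytes of decoded changes held in memory */
    int         spilled;        /* set once the changes were "serialized" */
} SketchTxnMem;

/* Return the transaction holding the most memory, or NULL if none holds any. */
static SketchTxnMem *
sketch_largest(SketchTxnMem *txns, size_t ntxns)
{
    SketchTxnMem *largest = NULL;

    for (size_t i = 0; i < ntxns; i++)
    {
        if (txns[i].size > 0 &&
            (largest == NULL || txns[i].size > largest->size))
            largest = &txns[i];
    }
    return largest;
}

/*
 * Evict the largest transactions until the total no longer exceeds the
 * limit. Returns the total memory still accounted for afterwards.
 */
static size_t
sketch_enforce_limit(SketchTxnMem *txns, size_t ntxns, size_t limit)
{
    size_t      total = 0;

    for (size_t i = 0; i < ntxns; i++)
        total += txns[i].size;

    while (total > limit)
    {
        SketchTxnMem *victim = sketch_largest(txns, ntxns);

        if (victim == NULL)
            break;              /* nothing left to evict */

        assert(victim->size <= total);
        total -= victim->size;  /* changes leave memory ... */
        victim->size = 0;
        victim->spilled = 1;    /* ... and now live on disk */
    }
    return total;
}
```

Picking purely by size keeps the policy simple. As the header comment notes, it ignores the age (LSN) of the changes, which might matter for how much memory is actually released.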
     211             : 
     212             : /* ---------------------------------------
     213             :  * primary reorderbuffer support routines
     214             :  * ---------------------------------------
     215             :  */
     216             : static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
     217             : static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
     218             : static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
     219             :                                                TransactionId xid, bool create, bool *is_new,
     220             :                                                XLogRecPtr lsn, bool create_as_top);
     221             : static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
     222             :                                               ReorderBufferTXN *subtxn);
     223             : 
     224             : static void AssertTXNLsnOrder(ReorderBuffer *rb);
     225             : 
     226             : /* ---------------------------------------
     227             :  * support functions for lsn-order iterating over the ->changes of a
     228             :  * transaction and its subtransactions
     229             :  *
     230             :  * used for iteration over the k-way heap merge of a transaction and its
     231             :  * subtransactions
     232             :  * ---------------------------------------
     233             :  */
     234             : static void ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
     235             :                                      ReorderBufferIterTXNState *volatile *iter_state);
     236             : static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
     237             : static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
     238             :                                        ReorderBufferIterTXNState *state);
     239             : static void ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs);
     240             : 
     241             : /*
     242             :  * ---------------------------------------
     243             :  * Disk serialization support functions
     244             :  * ---------------------------------------
     245             :  */
     246             : static void ReorderBufferCheckMemoryLimit(ReorderBuffer *rb);
     247             : static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
     248             : static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
     249             :                                          int fd, ReorderBufferChange *change);
     250             : static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
     251             :                                         TXNEntryFile *file, XLogSegNo *segno);
     252             : static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
     253             :                                        char *change);
     254             : static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
     255             : static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
     256             :                                      bool txn_prepared);
     257             : static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
     258             : static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
     259             :                                         TransactionId xid, XLogSegNo segno);
     260             : 
     261             : static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
     262             : static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
     263             :                                       ReorderBufferTXN *txn, CommandId cid);
     264             : 
     265             : /*
     266             :  * ---------------------------------------
     267             :  * Streaming support functions
     268             :  * ---------------------------------------
     269             :  */
     270             : static inline bool ReorderBufferCanStream(ReorderBuffer *rb);
     271             : static inline bool ReorderBufferCanStartStreaming(ReorderBuffer *rb);
     272             : static void ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
     273             : static void ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn);
     274             : 
     275             : /* ---------------------------------------
     276             :  * toast reassembly support
     277             :  * ---------------------------------------
     278             :  */
     279             : static void ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn);
     280             : static void ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn);
     281             : static void ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
     282             :                                       Relation relation, ReorderBufferChange *change);
     283             : static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
     284             :                                           Relation relation, ReorderBufferChange *change);
     285             : 
     286             : /*
     287             :  * ---------------------------------------
     288             :  * memory accounting
     289             :  * ---------------------------------------
     290             :  */
     291             : static Size ReorderBufferChangeSize(ReorderBufferChange *change);
     292             : static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
     293             :                                             ReorderBufferChange *change,
     294             :                                             bool addition, Size sz);
     295             : 
     296             : /*
     297             :  * Allocate a new ReorderBuffer and clean out any old serialized state from
     298             :  * prior ReorderBuffer instances for the same slot.
     299             :  */
     300             : ReorderBuffer *
     301        1060 : ReorderBufferAllocate(void)
     302             : {
     303             :     ReorderBuffer *buffer;
     304             :     HASHCTL     hash_ctl;
     305             :     MemoryContext new_ctx;
     306             : 
     307             :     Assert(MyReplicationSlot != NULL);
     308             : 
     309             :     /* allocate memory in own context, to have better accountability */
     310        1060 :     new_ctx = AllocSetContextCreate(CurrentMemoryContext,
     311             :                                     "ReorderBuffer",
     312             :                                     ALLOCSET_DEFAULT_SIZES);
     313             : 
     314             :     buffer =
     315        1060 :         (ReorderBuffer *) MemoryContextAlloc(new_ctx, sizeof(ReorderBuffer));
     316             : 
     317        1060 :     memset(&hash_ctl, 0, sizeof(hash_ctl));
     318             : 
     319        1060 :     buffer->context = new_ctx;
     320             : 
     321        1060 :     buffer->change_context = SlabContextCreate(new_ctx,
     322             :                                                "Change",
     323             :                                                SLAB_DEFAULT_BLOCK_SIZE,
     324             :                                                sizeof(ReorderBufferChange));
     325             : 
     326        1060 :     buffer->txn_context = SlabContextCreate(new_ctx,
     327             :                                             "TXN",
     328             :                                             SLAB_DEFAULT_BLOCK_SIZE,
     329             :                                             sizeof(ReorderBufferTXN));
     330             : 
     331        1060 :     buffer->tup_context = GenerationContextCreate(new_ctx,
     332             :                                                   "Tuples",
     333             :                                                   SLAB_LARGE_BLOCK_SIZE);
     334             : 
     335        1060 :     hash_ctl.keysize = sizeof(TransactionId);
     336        1060 :     hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
     337        1060 :     hash_ctl.hcxt = buffer->context;
     338             : 
     339        1060 :     buffer->by_txn = hash_create("ReorderBufferByXid", 1000, &hash_ctl,
     340             :                                  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
     341             : 
     342        1060 :     buffer->by_txn_last_xid = InvalidTransactionId;
     343        1060 :     buffer->by_txn_last_txn = NULL;
     344             : 
     345        1060 :     buffer->outbuf = NULL;
     346        1060 :     buffer->outbufsize = 0;
     347        1060 :     buffer->size = 0;
     348             : 
     349        1060 :     buffer->spillTxns = 0;
     350        1060 :     buffer->spillCount = 0;
     351        1060 :     buffer->spillBytes = 0;
     352        1060 :     buffer->streamTxns = 0;
     353        1060 :     buffer->streamCount = 0;
     354        1060 :     buffer->streamBytes = 0;
     355        1060 :     buffer->totalTxns = 0;
     356        1060 :     buffer->totalBytes = 0;
     357             : 
     358        1060 :     buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
     359             : 
     360        1060 :     dlist_init(&buffer->toplevel_by_lsn);
     361        1060 :     dlist_init(&buffer->txns_by_base_snapshot_lsn);
     362             : 
     363             :     /*
     364             :      * Ensure there's no stale data from prior uses of this slot, in case some
     365             :      * prior exit avoided calling ReorderBufferFree. Failure to do this can
     366             :      * produce duplicated txns, and it's very cheap if there's nothing there.
     367             :      */
     368        1060 :     ReorderBufferCleanupSerializedTXNs(NameStr(MyReplicationSlot->data.name));
     369             : 
     370        1060 :     return buffer;
     371             : }
     372             : 
     373             : /*
     374             :  * Free a ReorderBuffer
     375             :  */
     376             : void
     377         934 : ReorderBufferFree(ReorderBuffer *rb)
     378             : {
     379         934 :     MemoryContext context = rb->context;
     380             : 
     381             :     /*
     382             :      * We free separately allocated data by entirely scrapping reorderbuffer's
     383             :      * memory context.
     384             :      */
     385         934 :     MemoryContextDelete(context);
     386             : 
     387             :     /* Free disk space used by unconsumed reorder buffers */
     388         934 :     ReorderBufferCleanupSerializedTXNs(NameStr(MyReplicationSlot->data.name));
     389         934 : }
     390             : 
     391             : /*
     392             :  * Get an unused, possibly preallocated, ReorderBufferTXN.
     393             :  */
     394             : static ReorderBufferTXN *
     395        4810 : ReorderBufferGetTXN(ReorderBuffer *rb)
     396             : {
     397             :     ReorderBufferTXN *txn;
     398             : 
     399             :     txn = (ReorderBufferTXN *)
     400        4810 :         MemoryContextAlloc(rb->txn_context, sizeof(ReorderBufferTXN));
     401             : 
     402        4810 :     memset(txn, 0, sizeof(ReorderBufferTXN));
     403             : 
     404        4810 :     dlist_init(&txn->changes);
     405        4810 :     dlist_init(&txn->tuplecids);
     406        4810 :     dlist_init(&txn->subtxns);
     407             : 
     408             :     /* InvalidCommandId is not zero, so set it explicitly */
     409        4810 :     txn->command_id = InvalidCommandId;
     410        4810 :     txn->output_plugin_private = NULL;
     411             : 
     412        4810 :     return txn;
     413             : }
     414             : 
     415             : /*
     416             :  * Free a ReorderBufferTXN.
     417             :  */
     418             : static void
     419        4748 : ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
     420             : {
     421             :     /* clean the lookup cache if we were cached (quite likely) */
     422        4748 :     if (rb->by_txn_last_xid == txn->xid)
     423             :     {
     424        4396 :         rb->by_txn_last_xid = InvalidTransactionId;
     425        4396 :         rb->by_txn_last_txn = NULL;
     426             :     }
     427             : 
     428             :     /* free data that's contained */
     429             : 
     430        4748 :     if (txn->gid != NULL)
     431             :     {
     432          64 :         pfree(txn->gid);
     433          64 :         txn->gid = NULL;
     434             :     }
     435             : 
     436        4748 :     if (txn->tuplecid_hash != NULL)
     437             :     {
     438         492 :         hash_destroy(txn->tuplecid_hash);
     439         492 :         txn->tuplecid_hash = NULL;
     440             :     }
     441             : 
     442        4748 :     if (txn->invalidations)
     443             :     {
     444        1340 :         pfree(txn->invalidations);
     445        1340 :         txn->invalidations = NULL;
     446             :     }
     447             : 
     448             :     /* Reset the toast hash */
     449        4748 :     ReorderBufferToastReset(rb, txn);
     450             : 
     451        4748 :     pfree(txn);
     452        4748 : }
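
ReorderBufferReturnTXN() above clears by_txn_last_xid and by_txn_last_txn before freeing a transaction, so that a one-entry lookup cache never points at freed memory. The sketch below illustrates that pattern in isolation. SketchBuffer, sketch_create, sketch_lookup and sketch_free are hypothetical names, a small array stands in for the by_txn hash table, and xid 0 plays the role of InvalidTransactionId.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_TXNS 8

typedef struct SketchTxn
{
    unsigned    xid;
    int         in_use;
} SketchTxn;

typedef struct SketchBuffer
{
    SketchTxn   slots[SKETCH_MAX_TXNS]; /* stands in for the hash table */
    unsigned    last_xid;       /* one-entry cache: last xid looked up */
    SketchTxn  *last_txn;       /* one-entry cache: its transaction */
    unsigned    slow_lookups;   /* counts full searches, for illustration */
} SketchBuffer;

/* Register a new transaction; returns NULL when the table is full. */
static SketchTxn *
sketch_create(SketchBuffer *rb, unsigned xid)
{
    assert(xid != 0);
    for (size_t i = 0; i < SKETCH_MAX_TXNS; i++)
    {
        if (!rb->slots[i].in_use)
        {
            rb->slots[i].xid = xid;
            rb->slots[i].in_use = 1;
            return &rb->slots[i];
        }
    }
    return NULL;
}

/* Look up a transaction, consulting the one-entry cache first. */
static SketchTxn *
sketch_lookup(SketchBuffer *rb, unsigned xid)
{
    assert(xid != 0);

    if (rb->last_txn != NULL && rb->last_xid == xid)
        return rb->last_txn;    /* cache hit, no search needed */

    rb->slow_lookups++;
    for (size_t i = 0; i < SKETCH_MAX_TXNS; i++)
    {
        if (rb->slots[i].in_use && rb->slots[i].xid == xid)
        {
            rb->last_xid = xid;
            rb->last_txn = &rb->slots[i];
            return rb->last_txn;
        }
    }
    return NULL;
}

/* Release a transaction, invalidating the cache if it refers to it. */
static void
sketch_free(SketchBuffer *rb, SketchTxn *txn)
{
    if (rb->last_xid == txn->xid)
    {
        rb->last_xid = 0;
        rb->last_txn = NULL;
    }
    txn->in_use = 0;
    txn->xid = 0;
}
```

Consecutive changes often belong to the same transaction, so a single cached entry can skip many full lookups. The price is that every path that frees a transaction must reset the cache, as the function above does.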
     453             : 
     454             : /*
     455             :  * Get a fresh ReorderBufferChange.
     456             :  */
     457             : ReorderBufferChange *
     458     3165170 : ReorderBufferGetChange(ReorderBuffer *rb)
     459             : {
     460             :     ReorderBufferChange *change;
     461             : 
     462             :     change = (ReorderBufferChange *)
     463     3165170 :         MemoryContextAlloc(rb->change_context, sizeof(ReorderBufferChange));
     464             : 
     465     3165170 :     memset(change, 0, sizeof(ReorderBufferChange));
     466     3165170 :     return change;
     467             : }
     468             : 
     469             : /*
     470             :  * Free a ReorderBufferChange and update memory accounting, if requested.
     471             :  */
     472             : void
     473     3162534 : ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
     474             :                           bool upd_mem)
     475             : {
     476             :     /* update memory accounting info */
     477     3162534 :     if (upd_mem)
     478     3161440 :         ReorderBufferChangeMemoryUpdate(rb, change, false,
     479             :                                         ReorderBufferChangeSize(change));
     480             : 
     481             :     /* free contained data */
     482     3162534 :     switch (change->action)
     483             :     {
     484     3048708 :         case REORDER_BUFFER_CHANGE_INSERT:
     485             :         case REORDER_BUFFER_CHANGE_UPDATE:
     486             :         case REORDER_BUFFER_CHANGE_DELETE:
     487             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
     488     3048708 :             if (change->data.tp.newtuple)
     489             :             {
     490     2686980 :                 ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
     491     2686980 :                 change->data.tp.newtuple = NULL;
     492             :             }
     493             : 
     494     3048708 :             if (change->data.tp.oldtuple)
     495             :             {
     496      229770 :                 ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
     497      229770 :                 change->data.tp.oldtuple = NULL;
     498             :             }
     499     3048708 :             break;
     500          78 :         case REORDER_BUFFER_CHANGE_MESSAGE:
     501          78 :             if (change->data.msg.prefix != NULL)
     502          78 :                 pfree(change->data.msg.prefix);
     503          78 :             change->data.msg.prefix = NULL;
     504          78 :             if (change->data.msg.message != NULL)
     505          78 :                 pfree(change->data.msg.message);
     506          78 :             change->data.msg.message = NULL;
     507          78 :             break;
     508        6366 :         case REORDER_BUFFER_CHANGE_INVALIDATION:
     509        6366 :             if (change->data.inval.invalidations)
     510        6366 :                 pfree(change->data.inval.invalidations);
     511        6366 :             change->data.inval.invalidations = NULL;
     512        6366 :             break;
     513        1344 :         case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
     514        1344 :             if (change->data.snapshot)
     515             :             {
     516        1344 :                 ReorderBufferFreeSnap(rb, change->data.snapshot);
     517        1344 :                 change->data.snapshot = NULL;
     518             :             }
     519        1344 :             break;
     520          40 :         case REORDER_BUFFER_CHANGE_TRUNCATE:
     521          40 :             if (change->data.truncate.relids != NULL)
     522             :             {
     523          40 :                 ReorderBufferReturnRelids(rb, change->data.truncate.relids);
     524          40 :                 change->data.truncate.relids = NULL;
     525             :             }
     526          40 :             break;
     527             :             /* no data in addition to the struct itself */
     528      105998 :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
     529             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
     530             :         case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
     531             :         case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
     532      105998 :             break;
     533             :     }
     534             : 
     535     3162534 :     pfree(change);
     536     3162534 : }
     537             : 
     538             : /*
     539             :  * Get a fresh ReorderBufferTupleBuf fitting at least a tuple of size
     540             :  * tuple_len (excluding header overhead).
     541             :  */
     542             : ReorderBufferTupleBuf *
     543     2919196 : ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)
     544             : {
     545             :     ReorderBufferTupleBuf *tuple;
     546             :     Size        alloc_len;
     547             : 
     548     2919196 :     alloc_len = tuple_len + SizeofHeapTupleHeader;
     549             : 
     550             :     tuple = (ReorderBufferTupleBuf *)
     551     2919196 :         MemoryContextAlloc(rb->tup_context,
     552             :                            sizeof(ReorderBufferTupleBuf) +
     553             :                            MAXIMUM_ALIGNOF + alloc_len);
     554     2919196 :     tuple->alloc_tuple_size = alloc_len;
     555     2919196 :     tuple->tuple.t_data = ReorderBufferTupleBufData(tuple);
     556             : 
     557     2919196 :     return tuple;
     558             : }
     559             : 
     560             : /*
     561             :  * Free a ReorderBufferTupleBuf.
     562             :  */
     563             : void
     564     2916750 : ReorderBufferReturnTupleBuf(ReorderBuffer *rb, ReorderBufferTupleBuf *tuple)
     565             : {
     566     2916750 :     pfree(tuple);
     567     2916750 : }
     568             : 
     569             : /*
     570             :  * Get an array for relids of truncated relations.
     571             :  *
     572             :  * We use the global memory context (for the whole reorder buffer), because
     573             :  * none of the existing ones seems like a good match (some are SLAB, so we
     574             :  * can't use those, and tup_context is meant for tuple data, not relids). We
     575             :  * could add yet another context, but that seems like overkill - TRUNCATE is
     576             :  * not a particularly common operation, so it does not seem worth it.
     577             :  */
     578             : Oid *
     579          40 : ReorderBufferGetRelids(ReorderBuffer *rb, int nrelids)
     580             : {
     581             :     Oid        *relids;
     582             :     Size        alloc_len;
     583             : 
     584          40 :     alloc_len = sizeof(Oid) * nrelids;
     585             : 
     586          40 :     relids = (Oid *) MemoryContextAlloc(rb->context, alloc_len);
     587             : 
     588          40 :     return relids;
     589             : }
     590             : 
     591             : /*
     592             :  * Free an array of relids.
     593             :  */
     594             : void
     595          40 : ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
     596             : {
     597          40 :     pfree(relids);
     598          40 : }
     599             : 
     600             : /*
     601             :  * Return the ReorderBufferTXN from the given buffer, specified by Xid.
     602             :  * If create is true, and a transaction doesn't already exist, create it
     603             :  * (with the given LSN, and as top transaction if that's specified);
     604             :  * when this happens, is_new is set to true.
     605             :  */
     606             : static ReorderBufferTXN *
     607    10774198 : ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
     608             :                       bool *is_new, XLogRecPtr lsn, bool create_as_top)
     609             : {
     610             :     ReorderBufferTXN *txn;
     611             :     ReorderBufferTXNByIdEnt *ent;
     612             :     bool        found;
     613             : 
     614             :     Assert(TransactionIdIsValid(xid));
     615             : 
     616             :     /*
     617             :      * Check the one-entry lookup cache first
     618             :      */
     619    10774198 :     if (TransactionIdIsValid(rb->by_txn_last_xid) &&
     620    10769768 :         rb->by_txn_last_xid == xid)
     621             :     {
     622     8672956 :         txn = rb->by_txn_last_txn;
     623             : 
     624     8672956 :         if (txn != NULL)
     625             :         {
     626             :             /* found it, and it's valid */
     627     8672944 :             if (is_new)
     628        3364 :                 *is_new = false;
     629     8672944 :             return txn;
     630             :         }
     631             : 
     632             :         /*
     633             :          * cached as non-existent, and asked not to create? Then nothing else
     634             :          * to do.
     635             :          */
     636          12 :         if (!create)
     637          12 :             return NULL;
     638             :         /* otherwise fall through to create it */
     639             :     }
     640             : 
     641             :     /*
     642             :      * Either the cache wasn't hit, or it yielded a "does-not-exist" and we
     643             :      * want to create an entry.
     644             :      */
     645             : 
     646             :     /* search the lookup table */
     647             :     ent = (ReorderBufferTXNByIdEnt *)
     648     2101242 :         hash_search(rb->by_txn,
     649             :                     (void *) &xid,
     650             :                     create ? HASH_ENTER : HASH_FIND,
     651             :                     &found);
     652     2101242 :     if (found)
     653     2093876 :         txn = ent->txn;
     654        7366 :     else if (create)
     655             :     {
     656             :         /* initialize the new entry, if creation was requested */
     657             :         Assert(ent != NULL);
     658             :         Assert(lsn != InvalidXLogRecPtr);
     659             : 
     660        4810 :         ent->txn = ReorderBufferGetTXN(rb);
     661        4810 :         ent->txn->xid = xid;
     662        4810 :         txn = ent->txn;
     663        4810 :         txn->first_lsn = lsn;
     664        4810 :         txn->restart_decoding_lsn = rb->current_restart_decoding_lsn;
     665             : 
     666        4810 :         if (create_as_top)
     667             :         {
     668        3530 :             dlist_push_tail(&rb->toplevel_by_lsn, &txn->node);
     669        3530 :             AssertTXNLsnOrder(rb);
     670             :         }
     671             :     }
     672             :     else
     673        2556 :         txn = NULL;             /* not found and not asked to create */
     674             : 
     675             :     /* update cache */
     676     2101242 :     rb->by_txn_last_xid = xid;
     677     2101242 :     rb->by_txn_last_txn = txn;
     678             : 
     679     2101242 :     if (is_new)
     680        3394 :         *is_new = !found;
     681             : 
     682             :     Assert(!create || txn != NULL);
     683     2101242 :     return txn;
     684             : }
     685             : 
     686             : /*
     687             :  * Record the partial change for the streaming of in-progress transactions.  We
     688             :  * can stream only complete changes so if we have a partial change like toast
     689             :  * table insert or speculative insert then we mark such a 'txn' so that it
     690             :  * can't be streamed.  We also ensure that if the changes in such a 'txn' are
     691             :  * above logical_decoding_work_mem threshold then we stream them as soon as we
     692             :  * have a complete change.
     693             :  */
     694             : static void
     695     2823746 : ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
     696             :                                   ReorderBufferChange *change,
     697             :                                   bool toast_insert)
     698             : {
     699             :     ReorderBufferTXN *toptxn;
     700             : 
     701             :     /*
     702             :      * The partial changes need to be processed only while streaming
     703             :      * in-progress transactions.
     704             :      */
     705     2823746 :     if (!ReorderBufferCanStream(rb))
     706     2430984 :         return;
     707             : 
     708             :     /* Get the top transaction. */
     709      392762 :     if (txn->toptxn != NULL)
     710       48188 :         toptxn = txn->toptxn;
     711             :     else
     712      344574 :         toptxn = txn;
     713             : 
     714             :     /*
     715             :      * Indicate a partial change for toast inserts.  The change will be
     716             :      * considered as complete once we get the insert or update on the main
     717             :      * table and we are sure that the pending toast chunks are not required
     718             :      * anymore.
     719             :      *
     720             :      * If we allow streaming when there are pending toast chunks then such
     721             :      * chunks won't be released till the insert (multi_insert) is complete and
     722             :      * we expect the txn to have streamed all changes after streaming.  This
     723             :      * restriction is mainly to ensure the correctness of streamed
     724             :      * transactions and it doesn't seem worth lifting such a restriction
     725             :      * just to allow this case because we will stream the transaction anyway
     726             :      * once such an insert is complete.
     727             :      */
     728      392762 :     if (toast_insert)
     729        2902 :         toptxn->txn_flags |= RBTXN_HAS_PARTIAL_CHANGE;
     730      389860 :     else if (rbtxn_has_partial_change(toptxn) &&
     731          66 :              IsInsertOrUpdate(change->action) &&
     732          66 :              change->data.tp.clear_toast_afterwards)
     733          46 :         toptxn->txn_flags &= ~RBTXN_HAS_PARTIAL_CHANGE;
     734             : 
     735             :     /*
     736             :      * Indicate a partial change for speculative inserts.  The change will be
     737             :      * considered as complete once we get the speculative confirm or abort
     738             :      * token.
     739             :      */
     740      392762 :     if (IsSpecInsert(change->action))
     741           0 :         toptxn->txn_flags |= RBTXN_HAS_PARTIAL_CHANGE;
     742      392762 :     else if (rbtxn_has_partial_change(toptxn) &&
     743        2922 :              IsSpecConfirmOrAbort(change->action))
     744           0 :         toptxn->txn_flags &= ~RBTXN_HAS_PARTIAL_CHANGE;
     745             : 
     746             :     /*
     747             :      * Stream the transaction if it was serialized before and the changes are
     748             :      * now complete in the top-level transaction.
     749             :      *
     750             :      * The reason for doing the streaming of such a transaction as soon as we
     751             :      * get the complete change for it is that previously it would have reached
     752             :      * the memory threshold and wouldn't get streamed because of incomplete
     753             :      * changes.  Delaying such transactions would increase apply lag for them.
     754             :      */
     755      392762 :     if (ReorderBufferCanStartStreaming(rb) &&
     756      309586 :         !(rbtxn_has_partial_change(toptxn)) &&
     757      306744 :         rbtxn_is_serialized(txn))
     758           6 :         ReorderBufferStreamTXN(rb, toptxn);
     759             : }
     760             : 
     761             : /*
     762             :  * Queue a change into a transaction so it can be replayed upon commit, or
     763             :  * streamed once we reach the logical_decoding_work_mem threshold.
     764             :  */
     765             : void
     766     2824840 : ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
     767             :                          ReorderBufferChange *change, bool toast_insert)
     768             : {
     769             :     ReorderBufferTXN *txn;
     770             : 
     771     2824840 :     txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
     772             : 
     773             :     /*
     774             :      * While streaming the previous changes we detected that the
     775             :      * transaction was aborted, so there is no point in collecting further
     776             :      * changes for it.
     777             :      */
     778     2824840 :     if (txn->concurrent_abort)
     779             :     {
     780             :         /*
     781             :          * We don't need to update memory accounting for this change as we
     782             :          * have not added it to the queue yet.
     783             :          */
     784        1094 :         ReorderBufferReturnChange(rb, change, false);
     785        1094 :         return;
     786             :     }
     787             : 
     788     2823746 :     change->lsn = lsn;
     789     2823746 :     change->txn = txn;
     790             : 
     791             :     Assert(InvalidXLogRecPtr != lsn);
     792     2823746 :     dlist_push_tail(&txn->changes, &change->node);
     793     2823746 :     txn->nentries++;
     794     2823746 :     txn->nentries_mem++;
     795             : 
     796             :     /* update memory accounting information */
     797     2823746 :     ReorderBufferChangeMemoryUpdate(rb, change, true,
     798             :                                     ReorderBufferChangeSize(change));
     799             : 
     800             :     /* process partial change */
     801     2823746 :     ReorderBufferProcessPartialChange(rb, txn, change, toast_insert);
     802             : 
     803             :     /* check the memory limits and evict something if needed */
     804     2823746 :     ReorderBufferCheckMemoryLimit(rb);
     805             : }
     806             : 
     807             : /*
     808             :  * A transactional message is queued to be processed upon commit and a
     809             :  * non-transactional message gets processed immediately.
     810             :  */
     811             : void
     812          88 : ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
     813             :                           Snapshot snapshot, XLogRecPtr lsn,
     814             :                           bool transactional, const char *prefix,
     815             :                           Size message_size, const char *message)
     816             : {
     817          88 :     if (transactional)
     818             :     {
     819             :         MemoryContext oldcontext;
     820             :         ReorderBufferChange *change;
     821             : 
     822             :         Assert(xid != InvalidTransactionId);
     823             : 
     824          76 :         oldcontext = MemoryContextSwitchTo(rb->context);
     825             : 
     826          76 :         change = ReorderBufferGetChange(rb);
     827          76 :         change->action = REORDER_BUFFER_CHANGE_MESSAGE;
     828          76 :         change->data.msg.prefix = pstrdup(prefix);
     829          76 :         change->data.msg.message_size = message_size;
     830          76 :         change->data.msg.message = palloc(message_size);
     831          76 :         memcpy(change->data.msg.message, message, message_size);
     832             : 
     833          76 :         ReorderBufferQueueChange(rb, xid, lsn, change, false);
     834             : 
     835          76 :         MemoryContextSwitchTo(oldcontext);
     836             :     }
     837             :     else
     838             :     {
     839          12 :         ReorderBufferTXN *txn = NULL;
     840          12 :         volatile Snapshot snapshot_now = snapshot;
     841             : 
     842          12 :         if (xid != InvalidTransactionId)
     843           6 :             txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
     844             : 
     845             :         /* setup snapshot to allow catalog access */
     846          12 :         SetupHistoricSnapshot(snapshot_now, NULL);
     847          12 :         PG_TRY();
     848             :         {
     849          12 :             rb->message(rb, txn, lsn, false, prefix, message_size, message);
     850             : 
     851          12 :             TeardownHistoricSnapshot(false);
     852             :         }
     853           0 :         PG_CATCH();
     854             :         {
     855           0 :             TeardownHistoricSnapshot(true);
     856           0 :             PG_RE_THROW();
     857             :         }
     858          12 :         PG_END_TRY();
     859             :     }
     860          88 : }
     861             : 
     862             : /*
     863             :  * AssertTXNLsnOrder
     864             :  *      Verify LSN ordering of transaction lists in the reorderbuffer
     865             :  *
     866             :  * Other LSN-related invariants are checked too.
     867             :  *
     868             :  * No-op if assertions are not in use.
     869             :  */
     870             : static void
     871        8858 : AssertTXNLsnOrder(ReorderBuffer *rb)
     872             : {
     873             : #ifdef USE_ASSERT_CHECKING
     874             :     dlist_iter  iter;
     875             :     XLogRecPtr  prev_first_lsn = InvalidXLogRecPtr;
     876             :     XLogRecPtr  prev_base_snap_lsn = InvalidXLogRecPtr;
     877             : 
     878             :     dlist_foreach(iter, &rb->toplevel_by_lsn)
     879             :     {
     880             :         ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN, node,
     881             :                                                     iter.cur);
     882             : 
     883             :         /* start LSN must be set */
     884             :         Assert(cur_txn->first_lsn != InvalidXLogRecPtr);
     885             : 
     886             :         /* If there is an end LSN, it must not be lower than start LSN */
     887             :         if (cur_txn->end_lsn != InvalidXLogRecPtr)
     888             :             Assert(cur_txn->first_lsn <= cur_txn->end_lsn);
     889             : 
     890             :         /* Current initial LSN must be strictly higher than previous */
     891             :         if (prev_first_lsn != InvalidXLogRecPtr)
     892             :             Assert(prev_first_lsn < cur_txn->first_lsn);
     893             : 
     894             :         /* known-as-subtxn txns must not be listed */
     895             :         Assert(!rbtxn_is_known_subxact(cur_txn));
     896             : 
     897             :         prev_first_lsn = cur_txn->first_lsn;
     898             :     }
     899             : 
     900             :     dlist_foreach(iter, &rb->txns_by_base_snapshot_lsn)
     901             :     {
     902             :         ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN,
     903             :                                                     base_snapshot_node,
     904             :                                                     iter.cur);
     905             : 
     906             :         /* base snapshot (and its LSN) must be set */
     907             :         Assert(cur_txn->base_snapshot != NULL);
     908             :         Assert(cur_txn->base_snapshot_lsn != InvalidXLogRecPtr);
     909             : 
     910             :         /* current LSN must be strictly higher than previous */
     911             :         if (prev_base_snap_lsn != InvalidXLogRecPtr)
     912             :             Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
     913             : 
     914             :         /* known-as-subtxn txns must not be listed */
     915             :         Assert(!rbtxn_is_known_subxact(cur_txn));
     916             : 
     917             :         prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
     918             :     }
     919             : #endif
     920        8858 : }
     921             : 
     922             : /*
     923             :  * AssertChangeLsnOrder
     924             :  *
     925             :  * Check ordering of changes in the (sub)transaction.
     926             :  */
     927             : static void
     928        2654 : AssertChangeLsnOrder(ReorderBufferTXN *txn)
     929             : {
     930             : #ifdef USE_ASSERT_CHECKING
     931             :     dlist_iter  iter;
     932             :     XLogRecPtr  prev_lsn = txn->first_lsn;
     933             : 
     934             :     dlist_foreach(iter, &txn->changes)
     935             :     {
     936             :         ReorderBufferChange *cur_change;
     937             : 
     938             :         cur_change = dlist_container(ReorderBufferChange, node, iter.cur);
     939             : 
     940             :         Assert(txn->first_lsn != InvalidXLogRecPtr);
     941             :         Assert(cur_change->lsn != InvalidXLogRecPtr);
     942             :         Assert(txn->first_lsn <= cur_change->lsn);
     943             : 
     944             :         if (txn->end_lsn != InvalidXLogRecPtr)
     945             :             Assert(cur_change->lsn <= txn->end_lsn);
     946             : 
     947             :         Assert(prev_lsn <= cur_change->lsn);
     948             : 
     949             :         prev_lsn = cur_change->lsn;
     950             :     }
     951             : #endif
     952        2654 : }
     953             : 
     954             : /*
     955             :  * ReorderBufferGetOldestTXN
     956             :  *      Return oldest transaction in reorderbuffer
     957             :  */
     958             : ReorderBufferTXN *
     959         266 : ReorderBufferGetOldestTXN(ReorderBuffer *rb)
     960             : {
     961             :     ReorderBufferTXN *txn;
     962             : 
     963         266 :     AssertTXNLsnOrder(rb);
     964             : 
     965         266 :     if (dlist_is_empty(&rb->toplevel_by_lsn))
     966         230 :         return NULL;
     967             : 
     968          36 :     txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
     969             : 
     970             :     Assert(!rbtxn_is_known_subxact(txn));
     971             :     Assert(txn->first_lsn != InvalidXLogRecPtr);
     972          36 :     return txn;
     973             : }
     974             : 
     975             : /*
     976             :  * ReorderBufferGetOldestXmin
     977             :  *      Return oldest Xmin in reorderbuffer
     978             :  *
     979             :  * Returns oldest possibly running Xid from the point of view of snapshots
     980             :  * used in the transactions kept by reorderbuffer, or InvalidTransactionId if
     981             :  * there are none.
     982             :  *
     983             :  * Since snapshots are assigned monotonically, this equals the Xmin of the
     984             :  * base snapshot with minimal base_snapshot_lsn.
     985             :  */
     986             : TransactionId
     987         292 : ReorderBufferGetOldestXmin(ReorderBuffer *rb)
     988             : {
     989             :     ReorderBufferTXN *txn;
     990             : 
     991         292 :     AssertTXNLsnOrder(rb);
     992             : 
     993         292 :     if (dlist_is_empty(&rb->txns_by_base_snapshot_lsn))
     994         256 :         return InvalidTransactionId;
     995             : 
     996          36 :     txn = dlist_head_element(ReorderBufferTXN, base_snapshot_node,
     997             :                              &rb->txns_by_base_snapshot_lsn);
     998          36 :     return txn->base_snapshot->xmin;
     999             : }
    1000             : 
    1001             : void
    1002         290 : ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr)
    1003             : {
    1004         290 :     rb->current_restart_decoding_lsn = ptr;
    1005         290 : }
    1006             : 
    1007             : /*
    1008             :  * ReorderBufferAssignChild
    1009             :  *
    1010             :  * Make note that we know that subxid is a subtransaction of xid, seen as of
    1011             :  * the given lsn.
    1012             :  */
    1013             : void
    1014        1634 : ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
    1015             :                          TransactionId subxid, XLogRecPtr lsn)
    1016             : {
    1017             :     ReorderBufferTXN *txn;
    1018             :     ReorderBufferTXN *subtxn;
    1019             :     bool        new_top;
    1020             :     bool        new_sub;
    1021             : 
    1022        1634 :     txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
    1023        1634 :     subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
    1024             : 
    1025        1634 :     if (!new_sub)
    1026             :     {
    1027         354 :         if (rbtxn_is_known_subxact(subtxn))
    1028             :         {
    1029             :             /* already associated, nothing to do */
    1030         354 :             return;
    1031             :         }
    1032             :         else
    1033             :         {
    1034             :             /*
    1035             :              * We already saw this transaction, but initially added it to the
    1036             :              * list of top-level txns.  Now that we know it's not top-level,
    1037             :              * remove it from there.
    1038             :              */
    1039           0 :             dlist_delete(&subtxn->node);
    1040             :         }
    1041             :     }
    1042             : 
    1043        1280 :     subtxn->txn_flags |= RBTXN_IS_SUBXACT;
    1044        1280 :     subtxn->toplevel_xid = xid;
    1045             :     Assert(subtxn->nsubtxns == 0);
    1046             : 
    1047             :     /* set the reference to top-level transaction */
    1048        1280 :     subtxn->toptxn = txn;
    1049             : 
    1050             :     /* add to subtransaction list */
    1051        1280 :     dlist_push_tail(&txn->subtxns, &subtxn->node);
    1052        1280 :     txn->nsubtxns++;
    1053             : 
    1054             :     /* Possibly transfer the subtxn's snapshot to its top-level txn. */
    1055        1280 :     ReorderBufferTransferSnapToParent(txn, subtxn);
    1056             : 
    1057             :     /* Verify LSN-ordering invariant */
    1058        1280 :     AssertTXNLsnOrder(rb);
    1059             : }
    1060             : 
    1061             : /*
    1062             :  * ReorderBufferTransferSnapToParent
    1063             :  *      Transfer base snapshot from subtxn to top-level txn, if needed
    1064             :  *
    1065             :  * This is done if the top-level txn doesn't have a base snapshot, or if the
    1066             :  * subtxn's base snapshot has an earlier LSN than the top-level txn's base
    1067             :  * snapshot's LSN.  This can happen if there are no changes in the toplevel
    1068             :  * txn but there are some in the subtxn, or the first change in subtxn has
    1069             :  * earlier LSN than first change in the top-level txn and we learned about
    1070             :  * their kinship only now.
    1071             :  *
    1072             :  * The subtransaction's snapshot is cleared regardless of the transfer
    1073             :  * happening, since it's not needed anymore in either case.
    1074             :  *
    1075             :  * We do this as soon as we become aware of their kinship, to avoid queueing
    1076             :  * extra snapshots to txns known-as-subtxns -- only top-level txns will
    1077             :  * receive further snapshots.
    1078             :  */
    1079             : static void
    1080        1288 : ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
    1081             :                                   ReorderBufferTXN *subtxn)
    1082             : {
    1083             :     Assert(subtxn->toplevel_xid == txn->xid);
    1084             : 
    1085        1288 :     if (subtxn->base_snapshot != NULL)
    1086             :     {
    1087           0 :         if (txn->base_snapshot == NULL ||
    1088           0 :             subtxn->base_snapshot_lsn < txn->base_snapshot_lsn)
    1089             :         {
    1090             :             /*
    1091             :              * If the toplevel transaction already has a base snapshot but
    1092             :              * it's newer than the subxact's, purge it.
    1093             :              */
    1094           0 :             if (txn->base_snapshot != NULL)
    1095             :             {
    1096           0 :                 SnapBuildSnapDecRefcount(txn->base_snapshot);
    1097           0 :                 dlist_delete(&txn->base_snapshot_node);
    1098             :             }
    1099             : 
    1100             :             /*
    1101             :              * The snapshot is now the top transaction's; transfer it, and
    1102             :              * adjust the position of the top transaction in the list by
    1103             :              * moving it to where the subtransaction is.
    1104             :              */
    1105           0 :             txn->base_snapshot = subtxn->base_snapshot;
    1106           0 :             txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
    1107           0 :             dlist_insert_before(&subtxn->base_snapshot_node,
    1108             :                                 &txn->base_snapshot_node);
    1109             : 
    1110             :             /*
    1111             :              * The subtransaction doesn't have a snapshot anymore (so it
    1112             :              * mustn't be in the list.)
    1113             :              */
    1114           0 :             subtxn->base_snapshot = NULL;
    1115           0 :             subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
    1116           0 :             dlist_delete(&subtxn->base_snapshot_node);
    1117             :         }
    1118             :         else
    1119             :         {
    1120             :             /* Base snap of toplevel is fine, so subxact's is not needed */
    1121           0 :             SnapBuildSnapDecRefcount(subtxn->base_snapshot);
    1122           0 :             dlist_delete(&subtxn->base_snapshot_node);
    1123           0 :             subtxn->base_snapshot = NULL;
    1124           0 :             subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
    1125             :         }
    1126             :     }
    1127        1288 : }
    1128             : 
    1129             : /*
    1130             :  * Associate a subtransaction with its toplevel transaction at commit
    1131             :  * time. No further changes may be added to it after this.
    1132             :  */
    1133             : void
    1134         516 : ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
    1135             :                          TransactionId subxid, XLogRecPtr commit_lsn,
    1136             :                          XLogRecPtr end_lsn)
    1137             : {
    1138             :     ReorderBufferTXN *subtxn;
    1139             : 
    1140         516 :     subtxn = ReorderBufferTXNByXid(rb, subxid, false, NULL,
    1141             :                                    InvalidXLogRecPtr, false);
    1142             : 
    1143             :     /*
    1144             :      * No need to do anything if that subtxn didn't contain any changes
    1145             :      */
    1146         516 :     if (!subtxn)
    1147         162 :         return;
    1148             : 
    1149         354 :     subtxn->final_lsn = commit_lsn;
    1150         354 :     subtxn->end_lsn = end_lsn;
    1151             : 
    1152             :     /*
    1153             :      * Assign this subxact as a child of the toplevel xact (no-op if already
    1154             :      * done.)
    1155             :      */
    1156         354 :     ReorderBufferAssignChild(rb, xid, subxid, InvalidXLogRecPtr);
    1157             : }
    1158             : 
    1159             : 
    1160             : /*
    1161             :  * Support for efficiently iterating over a transaction's and its
    1162             :  * subtransactions' changes.
    1163             :  *
    1164             :  * We do this by doing a k-way merge between transactions/subtransactions. For that
    1165             :  * we model the current heads of the different transactions as a binary heap
    1166             :  * so we easily know which (sub-)transaction has the change with the smallest
    1167             :  * lsn next.
    1168             :  *
    1169             :  * We assume the changes in individual transactions are already sorted by LSN.
    1170             :  */
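
/*
 * Illustrative sketch only, not part of reorderbuffer.c: the k-way merge
 * described above, reduced to plain arrays. Every name here (SketchStream,
 * sketch_sift_down, sketch_kway_merge) is hypothetical. Each stream stands
 * in for one (sub)transaction whose changes are already sorted by LSN, and
 * the heap stores stream indexes ordered by the LSN at each stream's head.
 * A hand-rolled min-heap is used here instead of PostgreSQL's binaryheap.
 */
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one (sub)transaction's LSN-sorted change list. */
typedef struct SketchStream
{
    const uint64_t *lsns;       /* ascending LSNs */
    size_t      len;            /* number of entries in lsns */
    size_t      pos;            /* index of the current head */
} SketchStream;

static uint64_t
sketch_head_lsn(const SketchStream *s)
{
    return s->lsns[s->pos];
}

/* Restore the min-heap property below index i; heap[] holds stream indexes. */
static void
sketch_sift_down(size_t *heap, size_t n, size_t i, const SketchStream *streams)
{
    for (;;)
    {
        size_t      l = 2 * i + 1;
        size_t      r = 2 * i + 2;
        size_t      m = i;
        size_t      tmp;

        if (l < n && sketch_head_lsn(&streams[heap[l]]) <
            sketch_head_lsn(&streams[heap[m]]))
            m = l;
        if (r < n && sketch_head_lsn(&streams[heap[r]]) <
            sketch_head_lsn(&streams[heap[m]]))
            m = r;
        if (m == i)
            break;
        tmp = heap[i];
        heap[i] = heap[m];
        heap[m] = tmp;
        i = m;
    }
}

/*
 * Merge nstreams sorted streams into out[] and return the number of values
 * written. heap[] needs room for nstreams entries, out[] for all values.
 */
static size_t
sketch_kway_merge(SketchStream *streams, size_t nstreams,
                  size_t *heap, uint64_t *out)
{
    size_t      n = 0;
    size_t      nout = 0;
    size_t      i;

    assert(heap != NULL && out != NULL);

    /* Add every non-empty stream unordered, then build the heap once. */
    for (i = 0; i < nstreams; i++)
    {
        streams[i].pos = 0;
        if (streams[i].len > 0)
            heap[n++] = i;
    }
    for (i = n / 2; i-- > 0;)
        sketch_sift_down(heap, n, i, streams);

    while (n > 0)
    {
        SketchStream *s = &streams[heap[0]];

        out[nout++] = sketch_head_lsn(s);
        if (++s->pos >= s->len)
            heap[0] = heap[--n];    /* stream exhausted: remove it */
        /* Either way the root changed, so re-establish heap order. */
        sketch_sift_down(heap, n, 0, streams);
    }
    return nout;
}
```
/*
 * The sketch skips empty streams, adds the rest unordered, and heapifies
 * once. Advancing a stream then costs one sift-down, so merging N changes
 * from k streams takes O(N log k) comparisons.
 */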
    1171             : 
    1172             : /*
    1173             :  * Binary heap comparison function.
    1174             :  */
    1175             : static int
    1176      110868 : ReorderBufferIterCompare(Datum a, Datum b, void *arg)
    1177             : {
    1178      110868 :     ReorderBufferIterTXNState *state = (ReorderBufferIterTXNState *) arg;
    1179      110868 :     XLogRecPtr  pos_a = state->entries[DatumGetInt32(a)].lsn;
    1180      110868 :     XLogRecPtr  pos_b = state->entries[DatumGetInt32(b)].lsn;
    1181             : 
    1182      110868 :     if (pos_a < pos_b)
    1183      108216 :         return 1;
    1184        2652 :     else if (pos_a == pos_b)
    1185           0 :         return 0;
    1186        2652 :     return -1;
    1187             : }
    1188             : 
    1189             : /*
    1190             :  * Allocate & initialize an iterator which iterates in lsn order over a
    1191             :  * transaction and all its subtransactions.
    1192             :  *
    1193             :  * Note: The iterator state is returned through the iter_state parameter rather
    1194             :  * than the function's return value.  This is because the state gets cleaned up
    1195             :  * in a PG_CATCH block in the caller, so we want to make sure the caller gets
    1196             :  * back the state even if this function throws an exception.
    1197             :  */
    1198             : static void
    1199        2096 : ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
    1200             :                          ReorderBufferIterTXNState *volatile *iter_state)
    1201             : {
    1202        2096 :     Size        nr_txns = 0;
    1203             :     ReorderBufferIterTXNState *state;
    1204             :     dlist_iter  cur_txn_i;
    1205             :     int32       off;
    1206             : 
    1207        2096 :     *iter_state = NULL;
    1208             : 
    1209             :     /* Check ordering of changes in the toplevel transaction. */
    1210        2096 :     AssertChangeLsnOrder(txn);
    1211             : 
    1212             :     /*
    1213             :      * Calculate the size of our heap: one element for every transaction that
    1214             :      * contains changes.  (Besides the transactions already in the reorder
    1215             :      * buffer, we count the one we were directly passed.)
    1216             :      */
    1217        2096 :     if (txn->nentries > 0)
    1218        1954 :         nr_txns++;
    1219             : 
    1220        2654 :     dlist_foreach(cur_txn_i, &txn->subtxns)
    1221             :     {
    1222             :         ReorderBufferTXN *cur_txn;
    1223             : 
    1224         558 :         cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
    1225             : 
    1226             :         /* Check ordering of changes in this subtransaction. */
    1227         558 :         AssertChangeLsnOrder(cur_txn);
    1228             : 
    1229         558 :         if (cur_txn->nentries > 0)
    1230         482 :             nr_txns++;
    1231             :     }
    1232             : 
    1233             :     /* allocate iteration state */
    1234             :     state = (ReorderBufferIterTXNState *)
    1235        2096 :         MemoryContextAllocZero(rb->context,
    1236             :                                sizeof(ReorderBufferIterTXNState) +
    1237        2096 :                                sizeof(ReorderBufferIterTXNEntry) * nr_txns);
    1238             : 
    1239        2096 :     state->nr_txns = nr_txns;
    1240        2096 :     dlist_init(&state->old_change);
    1241             : 
    1242        4532 :     for (off = 0; off < state->nr_txns; off++)
    1243             :     {
    1244        2436 :         state->entries[off].file.vfd = -1;
    1245        2436 :         state->entries[off].segno = 0;
    1246             :     }
    1247             : 
    1248             :     /* allocate heap */
    1249        2096 :     state->heap = binaryheap_allocate(state->nr_txns,
    1250             :                                       ReorderBufferIterCompare,
    1251             :                                       state);
    1252             : 
    1253             :     /* Now that the state fields are initialized, it is safe to return it. */
    1254        2096 :     *iter_state = state;
    1255             : 
    1256             :     /*
    1257             :      * Now insert items into the binary heap, in an unordered fashion.  (We
    1258             :      * will run a heap assembly step at the end; this is more efficient.)
    1259             :      */
    1260             : 
    1261        2096 :     off = 0;
    1262             : 
    1263             :     /* add toplevel transaction if it contains changes */
    1264        2096 :     if (txn->nentries > 0)
    1265             :     {
    1266             :         ReorderBufferChange *cur_change;
    1267             : 
    1268        1954 :         if (rbtxn_is_serialized(txn))
    1269             :         {
    1270             :             /* serialize remaining changes */
    1271          34 :             ReorderBufferSerializeTXN(rb, txn);
    1272          34 :             ReorderBufferRestoreChanges(rb, txn, &state->entries[off].file,
    1273             :                                         &state->entries[off].segno);
    1274             :         }
    1275             : 
    1276        1954 :         cur_change = dlist_head_element(ReorderBufferChange, node,
    1277             :                                         &txn->changes);
    1278             : 
    1279        1954 :         state->entries[off].lsn = cur_change->lsn;
    1280        1954 :         state->entries[off].change = cur_change;
    1281        1954 :         state->entries[off].txn = txn;
    1282             : 
    1283        1954 :         binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
    1284             :     }
    1285             : 
    1286             :     /* add subtransactions if they contain changes */
    1287        2654 :     dlist_foreach(cur_txn_i, &txn->subtxns)
    1288             :     {
    1289             :         ReorderBufferTXN *cur_txn;
    1290             : 
    1291         558 :         cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
    1292             : 
    1293         558 :         if (cur_txn->nentries > 0)
    1294             :         {
    1295             :             ReorderBufferChange *cur_change;
    1296             : 
    1297         482 :             if (rbtxn_is_serialized(cur_txn))
    1298             :             {
    1299             :                 /* serialize remaining changes */
    1300          32 :                 ReorderBufferSerializeTXN(rb, cur_txn);
    1301          32 :                 ReorderBufferRestoreChanges(rb, cur_txn,
    1302             :                                             &state->entries[off].file,
    1303             :                                             &state->entries[off].segno);
    1304             :             }
    1305         482 :             cur_change = dlist_head_element(ReorderBufferChange, node,
    1306             :                                             &cur_txn->changes);
    1307             : 
    1308         482 :             state->entries[off].lsn = cur_change->lsn;
    1309         482 :             state->entries[off].change = cur_change;
    1310         482 :             state->entries[off].txn = cur_txn;
    1311             : 
    1312         482 :             binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
    1313             :         }
    1314             :     }
    1315             : 
    1316             :     /* assemble a valid binary heap */
    1317        2096 :     binaryheap_build(state->heap);
    1318        2096 : }
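
/*
 * Illustrative sketch only, not part of reorderbuffer.c: why the iterator
 * state is handed back through an out-parameter. setjmp/longjmp stands in
 * for PG_TRY/PG_CATCH and elog(ERROR); all names (sketch_env,
 * sketch_iter_init, sketch_run) are hypothetical. The state is published
 * through the pointer before anything that can fail, so the "catch" branch
 * can still free it. A return value would never reach the caller.
 */
```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

static jmp_buf sketch_env;

/* Allocate the state, publish it via *out, then optionally "throw". */
static void
sketch_iter_init(int *volatile *out, int fail)
{
    int        *state;

    *out = NULL;

    state = malloc(sizeof(int));
    if (state == NULL)
        longjmp(sketch_env, 1);
    *state = 42;

    /* Publish the state before any later step that may fail. */
    *out = state;

    if (fail)
        longjmp(sketch_env, 1);     /* stands in for an error being raised */
}

/*
 * Returns 0 on the normal path, 1 if the error path found and freed the
 * state, 2 if the error path saw no state at all.
 */
static int
sketch_run(int fail)
{
    /* volatile: the pointer is modified between setjmp and longjmp */
    int        *volatile state = NULL;

    if (setjmp(sketch_env) == 0)
    {
        sketch_iter_init(&state, fail);
        free(state);
        state = NULL;
        return 0;
    }

    /* "catch" branch: clean up whatever was published before the failure */
    if (state != NULL)
    {
        free(state);
        state = NULL;
        return 1;
    }
    return 2;
}
```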
    1319             : 
    1320             : /*
    1321             :  * Return the next change when iterating over a transaction and its
    1322             :  * subtransactions.
    1323             :  *
    1324             :  * Returns NULL when no further changes exist.
    1325             :  */
    1326             : static ReorderBufferChange *
    1327      635932 : ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
    1328             : {
    1329             :     ReorderBufferChange *change;
    1330             :     ReorderBufferIterTXNEntry *entry;
    1331             :     int32       off;
    1332             : 
    1333             :     /* nothing there anymore */
    1334      635932 :     if (state->heap->bh_size == 0)
    1335        2080 :         return NULL;
    1336             : 
    1337      633852 :     off = DatumGetInt32(binaryheap_first(state->heap));
    1338      633852 :     entry = &state->entries[off];
    1339             : 
    1340             :     /* free memory we might have "leaked" in the previous *Next call */
    1341      633852 :     if (!dlist_is_empty(&state->old_change))
    1342             :     {
    1343          80 :         change = dlist_container(ReorderBufferChange, node,
    1344             :                                  dlist_pop_head_node(&state->old_change));
    1345          80 :         ReorderBufferReturnChange(rb, change, true);
    1346             :         Assert(dlist_is_empty(&state->old_change));
    1347             :     }
    1348             : 
    1349      633852 :     change = entry->change;
    1350             : 
    1351             :     /*
    1352             :      * update heap with information about which transaction has the next
    1353             :      * relevant change in LSN order
    1354             :      */
    1355             : 
    1356             :     /* there are in-memory changes */
    1357      633852 :     if (dlist_has_next(&entry->txn->changes, &entry->change->node))
    1358             :     {
    1359      631370 :         dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
    1360      631370 :         ReorderBufferChange *next_change =
    1361      631370 :         dlist_container(ReorderBufferChange, node, next);
    1362             : 
    1363             :         /* txn stays the same */
    1364      631370 :         state->entries[off].lsn = next_change->lsn;
    1365      631370 :         state->entries[off].change = next_change;
    1366             : 
    1367      631370 :         binaryheap_replace_first(state->heap, Int32GetDatum(off));
    1368      631370 :         return change;
    1369             :     }
    1370             : 
    1371             :     /* try to load changes from disk */
    1372        2482 :     if (entry->txn->nentries != entry->txn->nentries_mem)
    1373             :     {
    1374             :         /*
    1375             :          * Ugly: restoring changes will reuse *Change records, thus delete the
    1376             :          * current one from the per-tx list and only free in the next call.
    1377             :          */
    1378         112 :         dlist_delete(&change->node);
    1379         112 :         dlist_push_tail(&state->old_change, &change->node);
    1380             : 
    1381             :         /*
    1382             :          * Update the total bytes processed by the txn for which we are
    1383             :          * releasing the current set of changes and restoring the new set of
    1384             :          * changes.
    1385             :          */
    1386         112 :         rb->totalBytes += entry->txn->size;
    1387         112 :         if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->file,
    1388             :                                         &state->entries[off].segno))
    1389             :         {
    1390             :             /* successfully restored changes from disk */
    1391             :             ReorderBufferChange *next_change =
    1392          62 :             dlist_head_element(ReorderBufferChange, node,
    1393             :                                &entry->txn->changes);
    1394             : 
    1395          62 :             elog(DEBUG2, "restored %u/%u changes from disk",
    1396             :                  (uint32) entry->txn->nentries_mem,
    1397             :                  (uint32) entry->txn->nentries);
    1398             : 
    1399             :             Assert(entry->txn->nentries_mem);
    1400             :             /* txn stays the same */
    1401          62 :             state->entries[off].lsn = next_change->lsn;
    1402          62 :             state->entries[off].change = next_change;
    1403          62 :             binaryheap_replace_first(state->heap, Int32GetDatum(off));
    1404             : 
    1405          62 :             return change;
    1406             :         }
    1407             :     }
    1408             : 
    1409             :     /* ok, no changes there anymore, remove */
    1410        2420 :     binaryheap_remove_first(state->heap);
    1411             : 
    1412        2420 :     return change;
    1413             : }
    1414             : 
    1415             : /*
    1416             :  * Deallocate the iterator
    1417             :  */
    1418             : static void
    1419        2092 : ReorderBufferIterTXNFinish(ReorderBuffer *rb,
    1420             :                            ReorderBufferIterTXNState *state)
    1421             : {
    1422             :     int32       off;
    1423             : 
    1424        4524 :     for (off = 0; off < state->nr_txns; off++)
    1425             :     {
    1426        2432 :         if (state->entries[off].file.vfd != -1)
    1427           0 :             FileClose(state->entries[off].file.vfd);
    1428             :     }
    1429             : 
    1430             :     /* free memory we might have "leaked" in the last *Next call */
    1431        2092 :     if (!dlist_is_empty(&state->old_change))
    1432             :     {
    1433             :         ReorderBufferChange *change;
    1434             : 
    1435          30 :         change = dlist_container(ReorderBufferChange, node,
    1436             :                                  dlist_pop_head_node(&state->old_change));
    1437          30 :         ReorderBufferReturnChange(rb, change, true);
    1438             :         Assert(dlist_is_empty(&state->old_change));
    1439             :     }
    1440             : 
    1441        2092 :     binaryheap_free(state->heap);
    1442        2092 :     pfree(state);
    1443        2092 : }
    1444             : 
    1445             : /*
    1446             :  * Clean up the contents of a transaction, usually after the transaction
    1447             :  * committed or aborted.
    1448             :  */
    1449             : static void
    1450        4748 : ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
    1451             : {
    1452             :     bool        found;
    1453             :     dlist_mutable_iter iter;
    1454             : 
    1455             :     /* cleanup subtransactions & their changes */
    1456        5100 :     dlist_foreach_modify(iter, &txn->subtxns)
    1457             :     {
    1458             :         ReorderBufferTXN *subtxn;
    1459             : 
    1460         352 :         subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
    1461             : 
    1462             :         /*
    1463             :          * Subtransactions are always associated to the toplevel TXN, even if
    1464             :          * they originally were happening inside another subtxn, so we won't
    1465             :          * ever recurse more than one level deep here.
    1466             :          */
    1467             :         Assert(rbtxn_is_known_subxact(subtxn));
    1468             :         Assert(subtxn->nsubtxns == 0);
    1469             : 
    1470         352 :         ReorderBufferCleanupTXN(rb, subtxn);
    1471             :     }
    1472             : 
    1473             :     /* cleanup changes in the txn */
    1474      119092 :     dlist_foreach_modify(iter, &txn->changes)
    1475             :     {
    1476             :         ReorderBufferChange *change;
    1477             : 
    1478      114344 :         change = dlist_container(ReorderBufferChange, node, iter.cur);
    1479             : 
    1480             :         /* Check we're not mixing changes from different transactions. */
    1481             :         Assert(change->txn == txn);
    1482             : 
    1483      114344 :         ReorderBufferReturnChange(rb, change, true);
    1484             :     }
    1485             : 
    1486             :     /*
    1487             :      * Cleanup the tuplecids we stored for decoding catalog snapshot access.
    1488             :      * They are always stored in the toplevel transaction.
    1489             :      */
    1490       37692 :     dlist_foreach_modify(iter, &txn->tuplecids)
    1491             :     {
    1492             :         ReorderBufferChange *change;
    1493             : 
    1494       32944 :         change = dlist_container(ReorderBufferChange, node, iter.cur);
    1495             : 
    1496             :         /* Check we're not mixing changes from different transactions. */
    1497             :         Assert(change->txn == txn);
    1498             :         Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
    1499             : 
    1500       32944 :         ReorderBufferReturnChange(rb, change, true);
    1501             :     }
    1502             : 
    1503             :     /*
    1504             :      * Cleanup the base snapshot, if set.
    1505             :      */
    1506        4748 :     if (txn->base_snapshot != NULL)
    1507             :     {
    1508        3444 :         SnapBuildSnapDecRefcount(txn->base_snapshot);
    1509        3444 :         dlist_delete(&txn->base_snapshot_node);
    1510             :     }
    1511             : 
    1512             :     /*
    1513             :      * Cleanup the snapshot for the last streamed run.
    1514             :      */
    1515        4748 :     if (txn->snapshot_now != NULL)
    1516             :     {
    1517             :         Assert(rbtxn_is_streamed(txn));
    1518          52 :         ReorderBufferFreeSnap(rb, txn->snapshot_now);
    1519             :     }
    1520             : 
    1521             :     /*
    1522             :      * Remove TXN from its containing list.
    1523             :      *
    1524             :      * Note: if txn is known as subxact, we are deleting the TXN from its
    1525             :      * parent's list of known subxacts; this leaves the parent's nsubtxns
    1526             :      * count too high, but we don't care.  Otherwise, we are deleting the TXN
    1527             :      * from the LSN-ordered list of toplevel TXNs.
    1528             :      */
    1529        4748 :     dlist_delete(&txn->node);
    1530             : 
    1531             :     /* now remove reference from buffer */
    1532        4748 :     hash_search(rb->by_txn,
    1533        4748 :                 (void *) &txn->xid,
    1534             :                 HASH_REMOVE,
    1535             :                 &found);
    1536             :     Assert(found);
    1537             : 
    1538             :     /* remove entries spilled to disk */
    1539        4748 :     if (rbtxn_is_serialized(txn))
    1540         418 :         ReorderBufferRestoreCleanup(rb, txn);
    1541             : 
    1542             :     /* deallocate */
    1543        4748 :     ReorderBufferReturnTXN(rb, txn);
    1544        4748 : }
    1545             : 
    1546             : /*
    1547             :  * Discard changes from a transaction (and subtransactions), either after
    1548             :  * streaming or decoding them at PREPARE. Keep the remaining info -
    1549             :  * transactions, tuplecids, invalidations and snapshots.
    1550             :  *
    1551             :  * We additionally remove tuplecids after decoding the transaction at prepare
    1552             :  * time as we only need to perform invalidation at rollback or commit prepared.
    1553             :  *
    1554             :  * 'txn_prepared' indicates that we have decoded the transaction at prepare
    1555             :  * time.
    1556             :  */
    1557             : static void
    1558        1112 : ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
    1559             : {
    1560             :     dlist_mutable_iter iter;
    1561             : 
    1562             :     /* cleanup subtransactions & their changes */
    1563        1342 :     dlist_foreach_modify(iter, &txn->subtxns)
    1564             :     {
    1565             :         ReorderBufferTXN *subtxn;
    1566             : 
    1567         230 :         subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
    1568             : 
    1569             :         /*
    1570             :          * Subtransactions are always associated to the toplevel TXN, even if
    1571             :          * they originally were happening inside another subtxn, so we won't
    1572             :          * ever recurse more than one level deep here.
    1573             :          */
    1574             :         Assert(rbtxn_is_known_subxact(subtxn));
    1575             :         Assert(subtxn->nsubtxns == 0);
    1576             : 
    1577         230 :         ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
    1578             :     }
    1579             : 
    1580             :     /* cleanup changes in the txn */
    1581      304996 :     dlist_foreach_modify(iter, &txn->changes)
    1582             :     {
    1583             :         ReorderBufferChange *change;
    1584             : 
    1585      303884 :         change = dlist_container(ReorderBufferChange, node, iter.cur);
    1586             : 
    1587             :         /* Check we're not mixing changes from different transactions. */
    1588             :         Assert(change->txn == txn);
    1589             : 
    1590             :         /* remove the change from its containing list */
    1591      303884 :         dlist_delete(&change->node);
    1592             : 
    1593      303884 :         ReorderBufferReturnChange(rb, change, true);
    1594             :     }
    1595             : 
    1596             :     /*
    1597             :      * Mark the transaction as streamed.
    1598             :      *
    1599             :      * The toplevel transaction, identified by (toptxn==NULL), is marked as
    1600             :      * streamed always, even if it does not contain any changes (that is, when
    1601             :      * all the changes are in subtransactions).
    1602             :      *
    1603             :      * For subtransactions, we only mark them as streamed when there are
    1604             :      * changes in them.
    1605             :      *
    1606             :      * We do it this way because of aborts - we don't want to send aborts for
    1607             :      * XIDs the downstream is not aware of. And of course, it always knows
    1608             :      * about the toplevel xact (we send the XID in all messages), but we never
    1609             :      * stream XIDs of empty subxacts.
    1610             :      */
    1611        1112 :     if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
    1612         952 :         txn->txn_flags |= RBTXN_IS_STREAMED;
    1613             : 
    1614        1112 :     if (txn_prepared)
    1615             :     {
    1616             :         /*
    1617             :          * If this is a prepared txn, cleanup the tuplecids we stored for
    1618             :          * decoding catalog snapshot access. They are always stored in the
    1619             :          * toplevel transaction.
    1620             :          */
    1621         332 :         dlist_foreach_modify(iter, &txn->tuplecids)
    1622             :         {
    1623             :             ReorderBufferChange *change;
    1624             : 
    1625         246 :             change = dlist_container(ReorderBufferChange, node, iter.cur);
    1626             : 
    1627             :             /* Check we're not mixing changes from different transactions. */
    1628             :             Assert(change->txn == txn);
    1629             :             Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
    1630             : 
    1631             :             /* Remove the change from its containing list. */
    1632         246 :             dlist_delete(&change->node);
    1633             : 
    1634         246 :             ReorderBufferReturnChange(rb, change, true);
    1635             :         }
    1636             :     }
    1637             : 
    1638             :     /*
    1639             :      * Destroy the (relfilenode, ctid) hashtable, so that we don't leak any
    1640             :      * memory. We could also keep the hash table and update it with new ctid
    1641             :      * values, but this seems simpler and good enough for now.
    1642             :      */
    1643        1112 :     if (txn->tuplecid_hash != NULL)
    1644             :     {
    1645          32 :         hash_destroy(txn->tuplecid_hash);
    1646          32 :         txn->tuplecid_hash = NULL;
    1647             :     }
    1648             : 
    1649             :     /* If this txn is serialized then clean the disk space. */
    1650        1112 :     if (rbtxn_is_serialized(txn))
    1651             :     {
    1652           6 :         ReorderBufferRestoreCleanup(rb, txn);
    1653           6 :         txn->txn_flags &= ~RBTXN_IS_SERIALIZED;
    1654             : 
    1655             :         /*
    1656             :          * We set this flag to indicate whether the transaction was ever serialized.
    1657             :          * We need this to accurately update the stats as otherwise the same
    1658             :          * transaction can be counted as serialized multiple times.
    1659             :          */
    1660           6 :         txn->txn_flags |= RBTXN_IS_SERIALIZED_CLEAR;
    1661             :     }
    1662             : 
    1663             :     /* also reset the number of entries in the transaction */
    1664        1112 :     txn->nentries_mem = 0;
    1665        1112 :     txn->nentries = 0;
    1666        1112 : }
    1667             : 
    1668             : /*
    1669             :  * Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
    1670             :  * HeapTupleSatisfiesHistoricMVCC.
    1671             :  */
    1672             : static void
    1673        2096 : ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
    1674             : {
    1675             :     dlist_iter  iter;
    1676             :     HASHCTL     hash_ctl;
    1677             : 
    1678        2096 :     if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
    1679        1572 :         return;
    1680             : 
    1681         524 :     hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
    1682         524 :     hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
    1683         524 :     hash_ctl.hcxt = rb->context;
    1684             : 
    1685             :     /*
    1686             :      * create the hash with the exact number of to-be-stored tuplecids from
    1687             :      * the start
    1688             :      */
    1689         524 :     txn->tuplecid_hash =
    1690         524 :         hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
    1691             :                     HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
    1692             : 
    1693       13610 :     dlist_foreach(iter, &txn->tuplecids)
    1694             :     {
    1695             :         ReorderBufferTupleCidKey key;
    1696             :         ReorderBufferTupleCidEnt *ent;
    1697             :         bool        found;
    1698             :         ReorderBufferChange *change;
    1699             : 
    1700       13086 :         change = dlist_container(ReorderBufferChange, node, iter.cur);
    1701             : 
    1702             :         Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
    1703             : 
    1704             :         /* be careful about padding */
    1705       13086 :         memset(&key, 0, sizeof(ReorderBufferTupleCidKey));
    1706             : 
    1707       13086 :         key.relnode = change->data.tuplecid.node;
    1708             : 
    1709       13086 :         ItemPointerCopy(&change->data.tuplecid.tid,
    1710             :                         &key.tid);
    1711             : 
    1712             :         ent = (ReorderBufferTupleCidEnt *)
    1713       13086 :             hash_search(txn->tuplecid_hash,
    1714             :                         (void *) &key,
    1715             :                         HASH_ENTER,
    1716             :                         &found);
    1717       13086 :         if (!found)
    1718             :         {
    1719       10094 :             ent->cmin = change->data.tuplecid.cmin;
    1720       10094 :             ent->cmax = change->data.tuplecid.cmax;
    1721       10094 :             ent->combocid = change->data.tuplecid.combocid;
    1722             :         }
    1723             :         else
    1724             :         {
    1725             :             /*
    1726             :              * Maybe we already saw this tuple before in this transaction, but
    1727             :              * if so it must have the same cmin.
    1728             :              */
    1729             :             Assert(ent->cmin == change->data.tuplecid.cmin);
    1730             : 
    1731             :             /*
    1732             :              * cmax may be initially invalid, but once set it can only grow,
    1733             :              * and never become invalid again.
    1734             :              */
    1735             :             Assert((ent->cmax == InvalidCommandId) ||
    1736             :                    ((change->data.tuplecid.cmax != InvalidCommandId) &&
    1737             :                     (change->data.tuplecid.cmax > ent->cmax)));
    1738        2992 :             ent->cmax = change->data.tuplecid.cmax;
    1739             :         }
    1740             :     }
    1741             : }
    1742             : 
    1743             : /*
    1744             :  * Copy a provided snapshot so we can modify it privately. This is needed so
    1745             :  * that catalog modifying transactions can look into intermediate catalog
    1746             :  * states.
    1747             :  */
    1748             : static Snapshot
    1749        1820 : ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
    1750             :                       ReorderBufferTXN *txn, CommandId cid)
    1751             : {
    1752             :     Snapshot    snap;
    1753             :     dlist_iter  iter;
    1754        1820 :     int         i = 0;
    1755             :     Size        size;
    1756             : 
    1757        1820 :     size = sizeof(SnapshotData) +
    1758        1820 :         sizeof(TransactionId) * orig_snap->xcnt +
    1759        1820 :         sizeof(TransactionId) * (txn->nsubtxns + 1);
    1760             : 
    1761        1820 :     snap = MemoryContextAllocZero(rb->context, size);
    1762        1820 :     memcpy(snap, orig_snap, sizeof(SnapshotData));
    1763             : 
    1764        1820 :     snap->copied = true;
    1765        1820 :     snap->active_count = 1;      /* mark as active so nobody frees it */
    1766        1820 :     snap->regd_count = 0;
    1767        1820 :     snap->xip = (TransactionId *) (snap + 1);
    1768             : 
    1769        1820 :     memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
    1770             : 
    1771             :     /*
    1772             :      * snap->subxip contains all txids that belong to our transaction which we
    1773             :      * need to check via cmin/cmax. That's why we store the toplevel
    1774             :      * transaction in there as well.
    1775             :      */
    1776        1820 :     snap->subxip = snap->xip + snap->xcnt;
    1777        1820 :     snap->subxip[i++] = txn->xid;
    1778             : 
    1779             :     /*
    1780             :      * subxcnt isn't decreased when subtransactions abort, so count manually.
    1781             :      * Since it's an upper boundary it is safe to use it for the allocation
    1782             :      * above.
    1783             :      */
    1784        1820 :     snap->subxcnt = 1;
    1785             : 
    1786        2066 :     dlist_foreach(iter, &txn->subtxns)
    1787             :     {
    1788             :         ReorderBufferTXN *sub_txn;
    1789             : 
    1790         246 :         sub_txn = dlist_container(ReorderBufferTXN, node, iter.cur);
    1791         246 :         snap->subxip[i++] = sub_txn->xid;
    1792         246 :         snap->subxcnt++;
    1793             :     }
    1794             : 
    1795             :     /* sort so we can bsearch() later */
    1796        1820 :     qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);
    1797             : 
    1798             :     /* store the specified current CommandId */
    1799        1820 :     snap->curcid = cid;
    1800             : 
    1801        1820 :     return snap;
    1802             : }
    1803             : 
    1804             : /*
    1805             :  * Free a previously ReorderBufferCopySnap'ed snapshot
    1806             :  */
    1807             : static void
    1808        3152 : ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
    1809             : {
    1810        3152 :     if (snap->copied)
    1811        1812 :         pfree(snap);
    1812             :     else
    1813        1340 :         SnapBuildSnapDecRefcount(snap);
    1814        3152 : }
    1815             : 
    1816             : /*
    1817             :  * If the transaction was (partially) streamed, we need to prepare or commit
    1818             :  * it in a 'streamed' way.  That is, we first stream the remaining part of the
    1819             :  * transaction, and then invoke the stream_prepare or stream_commit callback,
    1820             :  * as appropriate.
    1821             :  */
    1822             : static void
    1823          54 : ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
    1824             : {
    1825             :     /* we should only call this for previously streamed transactions */
    1826             :     Assert(rbtxn_is_streamed(txn));
    1827             : 
    1828          54 :     ReorderBufferStreamTXN(rb, txn);
    1829             : 
    1830          54 :     if (rbtxn_prepared(txn))
    1831             :     {
    1832             :         /*
    1833             :          * Note, we send stream prepare even if a concurrent abort is
    1834             :          * detected. See DecodePrepare for more information.
    1835             :          */
    1836          18 :         rb->stream_prepare(rb, txn, txn->final_lsn);
    1837             : 
    1838             :         /*
    1839             :          * This is a PREPARED transaction, part of a two-phase commit. The
    1840             :          * full cleanup will happen as part of the COMMIT PREPAREDs, so now
    1841             :          * just truncate txn by removing changes and tuple_cids.
    1842             :          */
    1843          18 :         ReorderBufferTruncateTXN(rb, txn, true);
    1844             :         /* Reset the CheckXidAlive */
    1845          18 :         CheckXidAlive = InvalidTransactionId;
    1846             :     }
    1847             :     else
    1848             :     {
    1849          36 :         rb->stream_commit(rb, txn, txn->final_lsn);
    1850          36 :         ReorderBufferCleanupTXN(rb, txn);
    1851             :     }
    1852          54 : }
    1853             : 
    1854             : /*
    1855             :  * Set xid to detect concurrent aborts.
    1856             :  *
    1857             :  * While streaming an in-progress transaction or decoding a prepared
    1858             :  * transaction there is a possibility that the (sub)transaction might get
    1859             :  * aborted concurrently.  In such a case, if the (sub)transaction has a catalog
    1860             :  * update, we might decode the tuple using the wrong catalog version.  For
    1861             :  * example, suppose there is one catalog tuple with (xmin: 500, xmax: 0).  Now,
    1862             :  * the transaction 501 updates the catalog tuple and after that we will have
    1863             :  * two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0).  Now, if 501 is
    1864             :  * aborted and some other transaction say 502 updates the same catalog tuple
    1865             :  * then the first tuple will be changed to (xmin: 500, xmax: 502).  So, the
    1866             :  * problem is that when we try to decode the tuple inserted/updated in 501
    1867             :  * after the catalog update, we will see the catalog tuple with (xmin: 500,
    1868             :  * xmax: 502) as visible because it will consider that the tuple is deleted by
    1869             :  * xid 502 which is not visible to our snapshot.  And when we will try to
    1870             :  * decode with that catalog tuple, it can lead to a wrong result or a crash.
    1871             :  * So, it is necessary to detect concurrent aborts to allow streaming of
    1872             :  * in-progress transactions or decoding of prepared transactions.
    1873             :  *
    1874             :  * To detect a concurrent abort we set CheckXidAlive to the xid of the
    1875             :  * (sub)transaction to which this change belongs.  And, during
    1876             :  * catalog scan we can check the status of the xid and if it is aborted we will
    1877             :  * report a specific error so that we can stop streaming current transaction
    1878             :  * and discard the already streamed changes on such an error.  We might have
    1879             :  * already streamed some of the changes for the aborted (sub)transaction, but
    1880             :  * that is fine because when we decode the abort we will stream an abort
    1881             :  * message to truncate the changes in the subscriber. Similarly, for prepared
    1882             :  * transactions, we stop decoding if a concurrent abort is detected and then
    1883             :  * roll back the changes when ROLLBACK PREPARED is encountered. See
    1884             :  * DecodePrepare.
    1885             :  */
    1886             : static inline void
    1887      304746 : SetupCheckXidLive(TransactionId xid)
    1888             : {
    1889             :     /*
    1890             :      * If the input transaction id is already set as CheckXidAlive, there
    1891             :      * is nothing to do.
    1892             :      */
    1893      304746 :     if (TransactionIdEquals(CheckXidAlive, xid))
    1894      179192 :         return;
    1895             : 
    1896             :     /*
    1897             :      * Set up CheckXidAlive if the xid is not committed yet.  We don't check
    1898             :      * if the xid is aborted.  That will happen during catalog access.
    1899             :      */
    1900      125554 :     if (!TransactionIdDidCommit(xid))
    1901         538 :         CheckXidAlive = xid;
    1902             :     else
    1903      125016 :         CheckXidAlive = InvalidTransactionId;
    1904             : }
    1905             : 
    1906             : /*
    1907             :  * Helper function for ReorderBufferProcessTXN for applying change.
    1908             :  */
    1909             : static inline void
    1910      606114 : ReorderBufferApplyChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
    1911             :                          Relation relation, ReorderBufferChange *change,
    1912             :                          bool streaming)
    1913             : {
    1914      606114 :     if (streaming)
    1915      301372 :         rb->stream_change(rb, txn, relation, change);
    1916             :     else
    1917      304742 :         rb->apply_change(rb, txn, relation, change);
    1918      606110 : }
    1919             : 
    1920             : /*
    1921             :  * Helper function for ReorderBufferProcessTXN for applying the truncate.
    1922             :  */
    1923             : static inline void
    1924          20 : ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
    1925             :                            int nrelations, Relation *relations,
    1926             :                            ReorderBufferChange *change, bool streaming)
    1927             : {
    1928          20 :     if (streaming)
    1929           0 :         rb->stream_truncate(rb, txn, nrelations, relations, change);
    1930             :     else
    1931          20 :         rb->apply_truncate(rb, txn, nrelations, relations, change);
    1932          20 : }
    1933             : 
    1934             : /*
    1935             :  * Helper function for ReorderBufferProcessTXN for applying the message.
    1936             :  */
    1937             : static inline void
    1938          22 : ReorderBufferApplyMessage(ReorderBuffer *rb, ReorderBufferTXN *txn,
    1939             :                           ReorderBufferChange *change, bool streaming)
    1940             : {
    1941          22 :     if (streaming)
    1942           6 :         rb->stream_message(rb, txn, change->lsn, true,
    1943           6 :                            change->data.msg.prefix,
    1944             :                            change->data.msg.message_size,
    1945           6 :                            change->data.msg.message);
    1946             :     else
    1947          16 :         rb->message(rb, txn, change->lsn, true,
    1948          16 :                     change->data.msg.prefix,
    1949             :                     change->data.msg.message_size,
    1950          16 :                     change->data.msg.message);
    1951          22 : }
    1952             : 
    1953             : /*
    1954             :  * Function to store the command id and snapshot at the end of the current
    1955             :  * stream so that we can reuse the same while sending the next stream.
    1956             :  */
    1957             : static inline void
    1958         818 : ReorderBufferSaveTXNSnapshot(ReorderBuffer *rb, ReorderBufferTXN *txn,
    1959             :                              Snapshot snapshot_now, CommandId command_id)
    1960             : {
    1961         818 :     txn->command_id = command_id;
    1962             : 
    1963             :     /* Avoid copying if it's already copied. */
    1964         818 :     if (snapshot_now->copied)
    1965         818 :         txn->snapshot_now = snapshot_now;
    1966             :     else
    1967           0 :         txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
    1968             :                                                   txn, command_id);
    1969         818 : }
    1970             : 
    1971             : /*
    1972             :  * Helper function for ReorderBufferProcessTXN to handle the concurrent
    1973             :  * abort of the streaming transaction.  This resets the TXN such that it
    1974             :  * can be used to stream the remaining data of transaction being processed.
    1975             :  * This can happen when the subtransaction is aborted and we still want to
    1976             :  * continue processing the main or other subtransactions data.
    1977             :  */
    1978             : static void
    1979          12 : ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
    1980             :                       Snapshot snapshot_now,
    1981             :                       CommandId command_id,
    1982             :                       XLogRecPtr last_lsn,
    1983             :                       ReorderBufferChange *specinsert)
    1984             : {
    1985             :     /* Discard the changes that we just streamed */
    1986          12 :     ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
    1987             : 
    1988             :     /* Free all resources allocated for toast reconstruction */
    1989          12 :     ReorderBufferToastReset(rb, txn);
    1990             : 
    1991             :     /* Return the spec insert change if it is not NULL */
    1992          12 :     if (specinsert != NULL)
    1993             :     {
    1994           0 :         ReorderBufferReturnChange(rb, specinsert, true);
    1995           0 :         specinsert = NULL;
    1996             :     }
    1997             : 
    1998             :     /*
    1999             :      * For the streaming case, stop the stream and remember the command ID and
    2000             :      * snapshot for the streaming run.
    2001             :      */
    2002          12 :     if (rbtxn_is_streamed(txn))
    2003             :     {
    2004          12 :         rb->stream_stop(rb, txn, last_lsn);
    2005          12 :         ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
    2006             :     }
    2007          12 : }
    2008             : 
    2009             : /*
    2010             :  * Helper function for ReorderBufferReplay and ReorderBufferStreamTXN.
    2011             :  *
    2012             :  * Send data of a transaction (and its subtransactions) to the
    2013             :  * output plugin. We iterate over the top and subtransactions (using a k-way
    2014             :  * merge) and replay the changes in lsn order.
    2015             :  *
    2016             :  * If streaming is true then data will be sent using stream API.
    2017             :  *
    2018             :  * Note: "volatile" markers on some parameters are to avoid trouble with
    2019             :  * PG_TRY inside the function.
    2020             :  */
    2021             : static void
    2022        2096 : ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
    2023             :                         XLogRecPtr commit_lsn,
    2024             :                         volatile Snapshot snapshot_now,
    2025             :                         volatile CommandId command_id,
    2026             :                         bool streaming)
    2027             : {
    2028             :     bool        using_subtxn;
    2029        2096 :     MemoryContext ccxt = CurrentMemoryContext;
    2030        2096 :     ReorderBufferIterTXNState *volatile iterstate = NULL;
    2031        2096 :     volatile XLogRecPtr prev_lsn = InvalidXLogRecPtr;
    2032        2096 :     ReorderBufferChange *volatile specinsert = NULL;
    2033        2096 :     volatile bool stream_started = false;
    2034        2096 :     ReorderBufferTXN *volatile curtxn = NULL;
    2035             : 
    2036             :     /* build data to be able to lookup the CommandIds of catalog tuples */
    2037        2096 :     ReorderBufferBuildTupleCidHash(rb, txn);
    2038             : 
    2039             :     /* setup the initial snapshot */
    2040        2096 :     SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
    2041             : 
    2042             :     /*
    2043             :      * Decoding needs access to syscaches et al., which in turn use
    2044             :      * heavyweight locks and such. Thus we need to have enough state around to
    2045             :      * keep track of those.  The easiest way is to simply use a transaction
    2046             :      * internally.  That also allows us to easily enforce that nothing writes
    2047             :      * to the database by checking for xid assignments.
    2048             :      *
    2049             :      * When we're called via the SQL SRF there's already a transaction
    2050             :      * started, so start an explicit subtransaction there.
    2051             :      */
    2052        2096 :     using_subtxn = IsTransactionOrTransactionBlock();
    2053             : 
    2054        2096 :     PG_TRY();
    2055             :     {
    2056             :         ReorderBufferChange *change;
    2057             : 
    2058        2096 :         if (using_subtxn)
    2059         806 :             BeginInternalSubTransaction(streaming ? "stream" : "replay");
    2060             :         else
    2061        1290 :             StartTransactionCommand();
    2062             : 
    2063             :         /*
    2064             :          * We only need to send begin/begin-prepare for non-streamed
    2065             :          * transactions.
    2066             :          */
    2067        2096 :         if (!streaming)
    2068             :         {
    2069        1274 :             if (rbtxn_prepared(txn))
    2070          46 :                 rb->begin_prepare(rb, txn);
    2071             :             else
    2072        1228 :                 rb->begin(rb, txn);
    2073             :         }
    2074             : 
    2075        2096 :         ReorderBufferIterTXNInit(rb, txn, &iterstate);
    2076      638028 :         while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
    2077             :         {
    2078      633852 :             Relation    relation = NULL;
    2079             :             Oid         reloid;
    2080             : 
    2081             :             /*
    2082             :              * We can't call the stream_start callback before processing the
    2083             :              * first change.
    2084             :              */
    2085      633852 :             if (prev_lsn == InvalidXLogRecPtr)
    2086             :             {
    2087        2090 :                 if (streaming)
    2088             :                 {
    2089         816 :                     txn->origin_id = change->origin_id;
    2090         816 :                     rb->stream_start(rb, txn, change->lsn);
    2091         816 :                     stream_started = true;
    2092             :                 }
    2093             :             }
    2094             : 
    2095             :             /*
    2096             :              * Enforce correct ordering of changes, merged from multiple
    2097             :              * subtransactions. The changes may have the same LSN due to
    2098             :              * MULTI_INSERT xlog records.
    2099             :              */
    2100             :             Assert(prev_lsn == InvalidXLogRecPtr || prev_lsn <= change->lsn);
    2101             : 
    2102      633852 :             prev_lsn = change->lsn;
    2103             : 
    2104             :             /*
    2105             :              * Set the current xid to detect concurrent aborts. This is
    2106             :              * required for the cases when we decode the changes before the
    2107             :              * COMMIT record is processed.
    2108             :              */
    2109      633852 :             if (streaming || rbtxn_prepared(change->txn))
    2110             :             {
    2111      304746 :                 curtxn = change->txn;
    2112      304746 :                 SetupCheckXidLive(curtxn->xid);
    2113             :             }
    2114             : 
    2115      633852 :             switch (change->action)
    2116             :             {
    2117        3564 :                 case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    2118             : 
    2119             :                     /*
    2120             :                      * Confirmation for speculative insertion arrived. Simply
    2121             :                      * use as a normal record. It'll be cleaned up at the end
    2122             :                      * of INSERT processing.
    2123             :                      */
    2124        3564 :                     if (specinsert == NULL)
    2125           0 :                         elog(ERROR, "invalid ordering of speculative insertion changes");
    2126             :                     Assert(specinsert->data.tp.oldtuple == NULL);
    2127        3564 :                     change = specinsert;
    2128        3564 :                     change->action = REORDER_BUFFER_CHANGE_INSERT;
    2129             : 
    2130             :                     /* intentionally fall through */
    2131      614562 :                 case REORDER_BUFFER_CHANGE_INSERT:
    2132             :                 case REORDER_BUFFER_CHANGE_UPDATE:
    2133             :                 case REORDER_BUFFER_CHANGE_DELETE:
    2134             :                     Assert(snapshot_now);
    2135             : 
    2136      614562 :                     reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
    2137             :                                                 change->data.tp.relnode.relNode);
    2138             : 
    2139             :                     /*
    2140             :                      * Mapped catalog tuple without data, emitted while
    2141             :                      * catalog table was in the process of being rewritten. We
    2142             :                      * can fail to look up the relfilenode, because the
    2143             :                      * relmapper has no "historic" view, in contrast to the
    2144             :                      * normal catalog during decoding. Thus repeated rewrites
    2145             :                      * can cause a lookup failure. That's OK because we do not
    2146             :                      * decode catalog changes anyway. Normally such tuples
    2147             :                      * would be skipped over below, but we can't identify
    2148             :                      * whether the table should be logically logged without
    2149             :                      * mapping the relfilenode to the oid.
    2150             :                      */
    2151      614550 :                     if (reloid == InvalidOid &&
    2152         152 :                         change->data.tp.newtuple == NULL &&
    2153         152 :                         change->data.tp.oldtuple == NULL)
    2154         152 :                         goto change_done;
    2155      614398 :                     else if (reloid == InvalidOid)
    2156           0 :                         elog(ERROR, "could not map filenode \"%s\" to relation OID",
    2157             :                              relpathperm(change->data.tp.relnode,
    2158             :                                          MAIN_FORKNUM));
    2159             : 
    2160      614398 :                     relation = RelationIdGetRelation(reloid);
    2161             : 
    2162      614398 :                     if (!RelationIsValid(relation))
    2163           0 :                         elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
    2164             :                              reloid,
    2165             :                              relpathperm(change->data.tp.relnode,
    2166             :                                          MAIN_FORKNUM));
    2167             : 
    2168      614398 :                     if (!RelationIsLogicallyLogged(relation))
    2169        4334 :                         goto change_done;
    2170             : 
    2171             :                     /*
    2172             :                      * Ignore temporary heaps created during DDL unless the
    2173             :                      * plugin has asked for them.
    2174             :                      */
    2175      610064 :                     if (relation->rd_rel->relrewrite && !rb->output_rewrites)
    2176          48 :                         goto change_done;
    2177             : 
    2178             :                     /*
    2179             :                      * For now ignore sequence changes entirely. Most of the
    2180             :                      * time they don't log changes using records we
    2181             :                      * understand, so it doesn't make sense to handle the few
    2182             :                      * cases we do.
    2183             :                      */
    2184      610016 :                     if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
    2185           0 :                         goto change_done;
    2186             : 
    2187             :                     /* user-triggered change */
    2188      610016 :                     if (!IsToastRelation(relation))
    2189             :                     {
    2190      606114 :                         ReorderBufferToastReplace(rb, txn, relation, change);
    2191      606114 :                         ReorderBufferApplyChange(rb, txn, relation, change,
    2192             :                                                  streaming);
    2193             : 
    2194             :                         /*
    2195             :                          * Only clear reassembled toast chunks if we're sure
    2196             :                          * they're not required anymore. The creator of the
    2197             :                          * tuple tells us.
    2198             :                          */
    2199      606110 :                         if (change->data.tp.clear_toast_afterwards)
    2200      605666 :                             ReorderBufferToastReset(rb, txn);
    2201             :                     }
    2202             :                     /* we're not interested in toast deletions */
    2203        3902 :                     else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
    2204             :                     {
    2205             :                         /*
    2206             :                          * Need to reassemble the full toasted Datum in
    2207             :                          * memory. To ensure the chunks don't get reused till
    2208             :                          * we're done, remove it from the list of this
    2209             :                          * transaction's changes. Otherwise it will get
    2210             :                          * freed/reused while restoring spooled data from
    2211             :                          * disk.
    2212             :                          */
    2213             :                         Assert(change->data.tp.newtuple != NULL);
    2214             : 
    2215        3440 :                         dlist_delete(&change->node);
    2216        3440 :                         ReorderBufferToastAppendChunk(rb, txn, relation,
    2217             :                                                       change);
    2218             :                     }
    2219             : 
    2220         462 :             change_done:
    2221             : 
    2222             :                     /*
    2223             :                      * If speculative insertion was confirmed, the record
    2224             :                      * isn't needed anymore.
    2225             :                      */
    2226      614546 :                     if (specinsert != NULL)
    2227             :                     {
    2228        3564 :                         ReorderBufferReturnChange(rb, specinsert, true);
    2229        3564 :                         specinsert = NULL;
    2230             :                     }
    2231             : 
    2232      614546 :                     if (RelationIsValid(relation))
    2233             :                     {
    2234      614394 :                         RelationClose(relation);
    2235      614394 :                         relation = NULL;
    2236             :                     }
    2237      614546 :                     break;
    2238             : 
    2239        3564 :                 case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    2240             : 
    2241             :                     /*
    2242             :                      * Speculative insertions are dealt with by delaying the
    2243             :                      * processing of the insert until the confirmation record
    2244             :                      * arrives. For that we simply unlink the record from the
    2245             :                      * chain, so it does not get freed/reused while restoring
    2246             :                      * spooled data from disk.
    2247             :                      *
    2248             :                      * This is safe in the face of concurrent catalog changes
    2249             :                      * because the relevant relation can't be changed between
    2250             :                      * speculative insertion and confirmation due to
    2251             :                      * CheckTableNotInUse() and locking.
    2252             :                      */
    2253             : 
    2254             :                     /* clear out a pending (and thus failed) speculation */
    2255        3564 :                     if (specinsert != NULL)
    2256             :                     {
    2257           0 :                         ReorderBufferReturnChange(rb, specinsert, true);
    2258           0 :                         specinsert = NULL;
    2259             :                     }
    2260             : 
    2261             :                     /* and memorize the pending insertion */
    2262        3564 :                     dlist_delete(&change->node);
    2263        3564 :                     specinsert = change;
    2264        3564 :                     break;
    2265             : 
    2266           0 :                 case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
    2267             : 
    2268             :                     /*
    2269             :                      * Abort for speculative insertion arrived, so clean up the
    2270             :                      * specinsert tuple and toast hash.
    2271             :                      *
    2272             :                      * Note that we get the spec abort change for each toast
    2273             :                      * entry but we need to perform the cleanup only the first
    2274             :                      * time we get it for the main table.
    2275             :                      */
    2276           0 :                     if (specinsert != NULL)
    2277             :                     {
    2278             :                         /*
    2279             :                          * We must clean the toast hash before processing a
    2280             :                          * completely new tuple to avoid confusion about the
    2281             :                          * previous tuple's toast chunks.
    2282             :                          */
    2283             :                         Assert(change->data.tp.clear_toast_afterwards);
    2284           0 :                         ReorderBufferToastReset(rb, txn);
    2285             : 
    2286             :                         /* We don't need this record anymore. */
    2287           0 :                         ReorderBufferReturnChange(rb, specinsert, true);
    2288           0 :                         specinsert = NULL;
    2289             :                     }
    2290           0 :                     break;
    2291             : 
    2292          20 :                 case REORDER_BUFFER_CHANGE_TRUNCATE:
    2293             :                     {
    2294             :                         int         i;
    2295          20 :                         int         nrelids = change->data.truncate.nrelids;
    2296          20 :                         int         nrelations = 0;
    2297             :                         Relation   *relations;
    2298             : 
    2299          20 :                         relations = palloc0(nrelids * sizeof(Relation));
    2300          50 :                         for (i = 0; i < nrelids; i++)
    2301             :                         {
    2302          30 :                             Oid         relid = change->data.truncate.relids[i];
    2303             :                             Relation    relation;
    2304             : 
    2305          30 :                             relation = RelationIdGetRelation(relid);
    2306             : 
    2307          30 :                             if (!RelationIsValid(relation))
    2308           0 :                                 elog(ERROR, "could not open relation with OID %u", relid);
    2309             : 
    2310          30 :                             if (!RelationIsLogicallyLogged(relation))
    2311           0 :                                 continue;
    2312             : 
    2313          30 :                             relations[nrelations++] = relation;
    2314             :                         }
    2315             : 
    2316             :                         /* Apply the truncate. */
    2317          20 :                         ReorderBufferApplyTruncate(rb, txn, nrelations,
    2318             :                                                    relations, change,
    2319             :                                                    streaming);
    2320             : 
    2321          50 :                         for (i = 0; i < nrelations; i++)
    2322          30 :                             RelationClose(relations[i]);
    2323             : 
    2324          20 :                         break;
    2325             :                     }
    2326             : 
    2327          22 :                 case REORDER_BUFFER_CHANGE_MESSAGE:
    2328          22 :                     ReorderBufferApplyMessage(rb, txn, change, streaming);
    2329          22 :                     break;
    2330             : 
    2331        2380 :                 case REORDER_BUFFER_CHANGE_INVALIDATION:
    2332             :                     /* Execute the invalidation messages locally */
    2333        2380 :                     ReorderBufferExecuteInvalidations(
    2334             :                                                       change->data.inval.ninvalidations,
    2335             :                                                       change->data.inval.invalidations);
    2336        2380 :                     break;
    2337             : 
    2338         540 :                 case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
    2339             :                     /* get rid of the old */
    2340         540 :                     TeardownHistoricSnapshot(false);
    2341             : 
    2342         540 :                     if (snapshot_now->copied)
    2343             :                     {
    2344         502 :                         ReorderBufferFreeSnap(rb, snapshot_now);
    2345         502 :                         snapshot_now =
    2346         502 :                             ReorderBufferCopySnap(rb, change->data.snapshot,
    2347             :                                                   txn, command_id);
    2348             :                     }
    2349             : 
    2350             :                     /*
    2351             :                      * Restored from disk, need to be careful not to double
    2352             :                      * free. We could introduce refcounting for that, but for
    2353             :                      * now this seems infrequent enough not to care.
    2354             :                      */
    2355          38 :                     else if (change->data.snapshot->copied)
    2356             :                     {
    2357           0 :                         snapshot_now =
    2358           0 :                             ReorderBufferCopySnap(rb, change->data.snapshot,
    2359             :                                                   txn, command_id);
    2360             :                     }
    2361             :                     else
    2362             :                     {
    2363          38 :                         snapshot_now = change->data.snapshot;
    2364             :                     }
    2365             : 
    2366             :                     /* and continue with the new one */
    2367         540 :                     SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
    2368         540 :                     break;
    2369             : 
    2370       12764 :                 case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
    2371             :                     Assert(change->data.command_id != InvalidCommandId);
    2372             : 
    2373       12764 :                     if (command_id < change->data.command_id)
    2374             :                     {
    2375        1952 :                         command_id = change->data.command_id;
    2376             : 
    2377        1952 :                         if (!snapshot_now->copied)
    2378             :                         {
    2379             :                             /* we don't use the global one anymore */
    2380         496 :                             snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
    2381             :                                                                  txn, command_id);
    2382             :                         }
    2383             : 
    2384        1952 :                         snapshot_now->curcid = command_id;
    2385             : 
    2386        1952 :                         TeardownHistoricSnapshot(false);
    2387        1952 :                         SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
    2388             :                     }
    2389             : 
    2390       12764 :                     break;
    2391             : 
    2392           0 :                 case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
    2393           0 :                     elog(ERROR, "tuplecid value in changequeue");
    2394             :                     break;
    2395             :             }
    2396      635932 :         }
    2397             : 
    2398             :         /* speculative insertion record must be freed by now */
    2399             :         Assert(!specinsert);
    2400             : 
    2401             :         /* clean up the iterator */
    2402        2080 :         ReorderBufferIterTXNFinish(rb, iterstate);
    2403        2080 :         iterstate = NULL;
    2404             : 
    2405             :         /*
    2406             :          * Update the total transaction count and the total bytes processed by
    2407             :          * the transaction and its subtransactions. Make sure not to count a
    2408             :          * streamed transaction multiple times.
    2409             :          *
    2410             :          * Note that the statistics computation has to be done after
    2411             :          * ReorderBufferIterTXNFinish as it releases the serialized change
    2412             :          * which we have already accounted in ReorderBufferIterTXNNext.
    2413             :          */
    2414        2080 :         if (!rbtxn_is_streamed(txn))
    2415        1330 :             rb->totalTxns++;
    2416             : 
    2417        2080 :         rb->totalBytes += txn->total_size;
    2418             : 
    2419             :         /*
    2420             :          * Done with current changes, send the last message for this set of
    2421             :          * changes depending upon streaming mode.
    2422             :          */
    2423        2080 :         if (streaming)
    2424             :         {
    2425         806 :             if (stream_started)
    2426             :             {
    2427         800 :                 rb->stream_stop(rb, txn, prev_lsn);
    2428         800 :                 stream_started = false;
    2429             :             }
    2430             :         }
    2431             :         else
    2432             :         {
    2433             :             /*
    2434             :              * Call either PREPARE (for two-phase transactions) or COMMIT (for
    2435             :              * regular ones).
    2436             :              */
    2437        1274 :             if (rbtxn_prepared(txn))
    2438          46 :                 rb->prepare(rb, txn, commit_lsn);
    2439             :             else
    2440        1228 :                 rb->commit(rb, txn, commit_lsn);
    2441             :         }
    2442             : 
    2443             :         /* this is just a sanity check against bad output plugin behaviour */
    2444        2080 :         if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
    2445           0 :             elog(ERROR, "output plugin used XID %u",
    2446             :                  GetCurrentTransactionId());
    2447             : 
    2448             :         /*
    2449             :          * Remember the command ID and snapshot for the next set of changes in
    2450             :          * streaming mode.
    2451             :          */
    2452        2080 :         if (streaming)
    2453         806 :             ReorderBufferSaveTXNSnapshot(rb, txn, snapshot_now, command_id);
    2454        1274 :         else if (snapshot_now->copied)
    2455         496 :             ReorderBufferFreeSnap(rb, snapshot_now);
    2456             : 
    2457             :         /* cleanup */
    2458        2080 :         TeardownHistoricSnapshot(false);
    2459             : 
    2460             :         /*
    2461             :          * Aborting the current (sub-)transaction as a whole has the right
    2462             :          * semantics. We want all locks acquired in here to be released, not
    2463             :          * reassigned to the parent, and we do not want any database access to
    2464             :          * have persistent effects.
    2465             :          */
    2466        2080 :         AbortCurrentTransaction();
    2467             : 
    2468             :         /* make sure there's no cache pollution */
    2469        2080 :         ReorderBufferExecuteInvalidations(txn->ninvalidations, txn->invalidations);
    2470             : 
    2471        2080 :         if (using_subtxn)
    2472         800 :             RollbackAndReleaseCurrentSubTransaction();
    2473             : 
    2474             :         /*
    2475             :          * We are here for one of four reasons: 1. Decoding an
    2476             :          * in-progress txn. 2. Decoding a prepared txn. 3. Decoding a
    2477             :          * prepared txn that was (partially) streamed. 4. Decoding a committed
    2478             :          * txn.
    2479             :          *
    2480             :          * For 1, we allow truncation of txn data by removing the changes
    2481             :          * already streamed but still keeping other things like invalidations,
    2482             :          * snapshot, and tuplecids. For 2 and 3, we tell
    2483             :          * ReorderBufferTruncateTXN to do a more elaborate truncation of txn
    2484             :          * data as the entire transaction has been decoded except for commit.
    2485             :          * For 4, as the entire txn has been decoded, we can fully clean up
    2486             :          * the TXN reorder buffer.
    2487             :          */
    2488        2080 :         if (streaming || rbtxn_prepared(txn))
    2489             :         {
    2490         852 :             ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
    2491             :             /* Reset the CheckXidAlive */
    2492         852 :             CheckXidAlive = InvalidTransactionId;
    2493             :         }
    2494             :         else
    2495        1228 :             ReorderBufferCleanupTXN(rb, txn);
    2496             :     }
    2497          12 :     PG_CATCH();
    2498             :     {
    2499          12 :         MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
    2500          12 :         ErrorData  *errdata = CopyErrorData();
    2501             : 
    2502             :         /* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
    2503          12 :         if (iterstate)
    2504          12 :             ReorderBufferIterTXNFinish(rb, iterstate);
    2505             : 
    2506          12 :         TeardownHistoricSnapshot(true);
    2507             : 
    2508             :         /*
    2509             :          * Force cache invalidation to happen outside of a valid transaction
    2510             :          * to prevent catalog access as we just caught an error.
    2511             :          */
    2512          12 :         AbortCurrentTransaction();
    2513             : 
    2514             :         /* make sure there's no cache pollution */
    2515          12 :         ReorderBufferExecuteInvalidations(txn->ninvalidations,
    2516             :                                           txn->invalidations);
    2517             : 
    2518          12 :         if (using_subtxn)
    2519           6 :             RollbackAndReleaseCurrentSubTransaction();
    2520             : 
    2521             :         /*
    2522             :          * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
    2523             :          * abort of the (sub)transaction we are streaming or preparing. We
    2524             :          * need to do the cleanup and return gracefully on this error, see
    2525             :          * SetupCheckXidLive.
    2526             :          *
    2527             :          * This error code can be thrown by one of the callbacks we call
    2528             :          * during decoding so we need to ensure that we return gracefully only
    2529             :          * when we are sending the data in streaming mode and the streaming is
    2530             :          * not finished yet or when we are sending the data out on a PREPARE
    2531             :          * during a two-phase commit.
    2532             :          */
    2533          12 :         if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
    2534          12 :             (stream_started || rbtxn_prepared(txn)))
    2535             :         {
    2536             :             /* curtxn must be set for streaming or prepared transactions */
    2537             :             Assert(curtxn);
    2538             : 
    2539             :             /* Clean up the temporary error state. */
    2540          12 :             FlushErrorState();
    2541          12 :             FreeErrorData(errdata);
    2542          12 :             errdata = NULL;
    2543          12 :             curtxn->concurrent_abort = true;
    2544             : 
    2545             :             /* Reset the TXN so that it is allowed to stream remaining data. */
    2546          12 :             ReorderBufferResetTXN(rb, txn, snapshot_now,
    2547             :                                   command_id, prev_lsn,
    2548             :                                   specinsert);
    2549             :         }
    2550             :         else
    2551             :         {
    2552           0 :             ReorderBufferCleanupTXN(rb, txn);
    2553           0 :             MemoryContextSwitchTo(ecxt);
    2554           0 :             PG_RE_THROW();
    2555             :         }
    2556             :     }
    2557        2092 :     PG_END_TRY();
    2558        2092 : }
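
The two decisions that close ReorderBufferProcessTXN() above (which cleanup to run after a successful pass, and whether a caught error is swallowed or re-thrown) can be summarized in a small standalone sketch. This is only an illustration under stated assumptions: every `Sketch`/`sketch_` name is hypothetical and does not exist in reorderbuffer.c, and the boolean parameters stand in for `streaming`, `rbtxn_prepared(txn)`, `stream_started`, and the `sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK` test.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the cleanup paths; not PostgreSQL identifiers. */
typedef enum
{
    SKETCH_TRUNCATE_STREAMED,   /* like ReorderBufferTruncateTXN(rb, txn, false) */
    SKETCH_TRUNCATE_PREPARED,   /* like ReorderBufferTruncateTXN(rb, txn, true) */
    SKETCH_CLEANUP_FULL         /* like ReorderBufferCleanupTXN(rb, txn) */
} SketchCleanup;

/* Mirrors the "streaming || rbtxn_prepared(txn)" test after a successful pass. */
SketchCleanup
sketch_choose_cleanup(bool streaming, bool prepared)
{
    if (streaming || prepared)
        return prepared ? SKETCH_TRUNCATE_PREPARED : SKETCH_TRUNCATE_STREAMED;
    return SKETCH_CLEANUP_FULL;
}

/* Hypothetical stand-ins for the two outcomes of the PG_CATCH block. */
typedef enum
{
    SKETCH_RESET_AND_RETURN,    /* concurrent abort: reset the txn, return */
    SKETCH_CLEANUP_AND_RETHROW  /* any other error: clean up, re-throw */
} SketchErrorAction;

/* Mirrors the condition guarding the graceful return in PG_CATCH. */
SketchErrorAction
sketch_choose_error_action(bool is_rollback_error, bool stream_started,
                           bool prepared)
{
    if (is_rollback_error && (stream_started || prepared))
        return SKETCH_RESET_AND_RETURN;
    return SKETCH_CLEANUP_AND_RETHROW;
}
```

Note that the error path checks `stream_started` rather than `streaming`, so in this sketch a rollback error raised before any stream was started is still re-thrown unless the transaction is prepared.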
    2559             : 
    2560             : /*
    2561             :  * Perform the replay of a transaction and its non-aborted subtransactions.
    2562             :  *
    2563             :  * Subtransactions must already have been processed by
    2564             :  * ReorderBufferCommitChild(), even if previously assigned to the toplevel
    2565             :  * transaction with ReorderBufferAssignChild.
    2566             :  *
    2567             :  * This interface is called once a prepare or toplevel commit is read for both
    2568             :  * streamed as well as non-streamed transactions.
    2569             :  */
    2570             : static void
    2571        1330 : ReorderBufferReplay(ReorderBufferTXN *txn,
    2572             :                     ReorderBuffer *rb, TransactionId xid,
    2573             :                     XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
    2574             :                     TimestampTz commit_time,
    2575             :                     RepOriginId origin_id, XLogRecPtr origin_lsn)
    2576             : {
    2577             :     Snapshot    snapshot_now;
    2578        1330 :     CommandId   command_id = FirstCommandId;
    2579             : 
    2580        1330 :     txn->final_lsn = commit_lsn;
    2581        1330 :     txn->end_lsn = end_lsn;
    2582        1330 :     txn->xact_time.commit_time = commit_time;
    2583        1330 :     txn->origin_id = origin_id;
    2584        1330 :     txn->origin_lsn = origin_lsn;
    2585             : 
    2586             :     /*
    2587             :      * If the transaction was (partially) streamed, we need to commit it in a
    2588             :      * 'streamed' way. That is, we first stream the remaining part of the
    2589             :      * transaction, and then invoke the stream_commit callback.
    2590             :      *
    2591             :      * Called after everything (origin ID, LSN, ...) is stored in the
    2592             :      * transaction to avoid passing that information directly.
    2593             :      */
    2594        1330 :     if (rbtxn_is_streamed(txn))
    2595             :     {
    2596          54 :         ReorderBufferStreamCommit(rb, txn);
    2597          54 :         return;
    2598             :     }
    2599             : 
    2600             :     /*
    2601             :      * If this transaction has no snapshot, it didn't make any changes to the
    2602             :      * database, so there's nothing to decode.  Note that
    2603             :      * ReorderBufferCommitChild will have transferred any snapshots from
    2604             :      * subtransactions if there were any.
    2605             :      */
    2606        1276 :     if (txn->base_snapshot == NULL)
    2607             :     {
    2608             :         Assert(txn->ninvalidations == 0);
    2609             : 
    2610             :         /*
    2611             :          * Removing this txn before a commit might result in the computation
    2612             :          * of an incorrect restart_lsn. See SnapBuildProcessRunningXacts.
    2613             :          */
    2614           2 :         if (!rbtxn_prepared(txn))
    2615           2 :             ReorderBufferCleanupTXN(rb, txn);
    2616           2 :         return;
    2617             :     }
    2618             : 
    2619        1274 :     snapshot_now = txn->base_snapshot;
    2620             : 
    2621             :     /* Process and send the changes to output plugin. */
    2622        1274 :     ReorderBufferProcessTXN(rb, txn, commit_lsn, snapshot_now,
    2623             :                             command_id, false);
    2624             : }
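
As a reading aid, the early-exit order in ReorderBufferReplay() above can be written as a pure function. This is a minimal sketch, not the actual implementation: the `SketchReplayAction` enum and `sketch_replay_action()` are hypothetical names, and the three booleans stand in for `rbtxn_is_streamed(txn)`, `txn->base_snapshot != NULL`, and `rbtxn_prepared(txn)`.

```c
#include <stdbool.h>

/* Hypothetical outcome labels; not PostgreSQL identifiers. */
typedef enum
{
    SKETCH_REPLAY_STREAM_COMMIT,    /* hand over to the streamed-commit path */
    SKETCH_REPLAY_SKIP_AND_CLEANUP, /* nothing to decode, drop the txn */
    SKETCH_REPLAY_SKIP_AND_KEEP,    /* nothing to decode, keep a prepared txn */
    SKETCH_REPLAY_PROCESS           /* decode and send changes to the plugin */
} SketchReplayAction;

/*
 * The checks are ordered as in the function above: the streamed check comes
 * first, then the missing-snapshot check, and only then normal processing.
 */
SketchReplayAction
sketch_replay_action(bool is_streamed, bool has_base_snapshot, bool prepared)
{
    if (is_streamed)
        return SKETCH_REPLAY_STREAM_COMMIT;

    if (!has_base_snapshot)
        return prepared ? SKETCH_REPLAY_SKIP_AND_KEEP
                        : SKETCH_REPLAY_SKIP_AND_CLEANUP;

    return SKETCH_REPLAY_PROCESS;
}
```

The prepared case keeps the transaction because, as the comment in the function notes, removing it before the commit could lead to an incorrect restart_lsn.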
    2625             : 
    2626             : /*
    2627             :  * Commit a transaction.
    2628             :  *
    2629             :  * See comments for ReorderBufferReplay().
    2630             :  */
    2631             : void
    2632        1268 : ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
    2633             :                     XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
    2634             :                     TimestampTz commit_time,
    2635             :                     RepOriginId origin_id, XLogRecPtr origin_lsn)
    2636             : {
    2637             :     ReorderBufferTXN *txn;
    2638             : 
    2639        1268 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    2640             :                                 false);
    2641             : 
    2642             :     /* unknown transaction, nothing to replay */
    2643        1268 :     if (txn == NULL)
    2644           2 :         return;
    2645             : 
    2646        1266 :     ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn, commit_time,
    2647             :                         origin_id, origin_lsn);
    2648             : }
    2649             : 
    2650             : /*
    2651             :  * Record the prepare information for a transaction.
    2652             :  */
    2653             : bool
    2654         208 : ReorderBufferRememberPrepareInfo(ReorderBuffer *rb, TransactionId xid,
    2655             :                                  XLogRecPtr prepare_lsn, XLogRecPtr end_lsn,
    2656             :                                  TimestampTz prepare_time,
    2657             :                                  RepOriginId origin_id, XLogRecPtr origin_lsn)
    2658             : {
    2659             :     ReorderBufferTXN *txn;
    2660             : 
    2661         208 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
    2662             : 
    2663             :     /* unknown transaction, nothing to do */
    2664         208 :     if (txn == NULL)
    2665           0 :         return false;
    2666             : 
    2667             :     /*
    2668             :      * Remember the prepare information so that commit prepared can use it
    2669             :      * later in case we skip doing the prepare.
    2670             :      */
    2671         208 :     txn->final_lsn = prepare_lsn;
    2672         208 :     txn->end_lsn = end_lsn;
    2673         208 :     txn->xact_time.prepare_time = prepare_time;
    2674         208 :     txn->origin_id = origin_id;
    2675         208 :     txn->origin_lsn = origin_lsn;
    2676             : 
    2677         208 :     return true;
    2678             : }
    2679             : 
    2680             : /* Remember that we have skipped prepare */
    2681             : void
    2682         146 : ReorderBufferSkipPrepare(ReorderBuffer *rb, TransactionId xid)
    2683             : {
    2684             :     ReorderBufferTXN *txn;
    2685             : 
    2686         146 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr, false);
    2687             : 
    2688             :     /* unknown transaction, nothing to do */
    2689         146 :     if (txn == NULL)
    2690           0 :         return;
    2691             : 
    2692         146 :     txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
    2693             : }
    2694             : 
    2695             : /*
    2696             :  * Prepare a two-phase transaction.
    2697             :  *
    2698             :  * See comments for ReorderBufferReplay().
    2699             :  */
    2700             : void
    2701          62 : ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
    2702             :                      char *gid)
    2703             : {
    2704             :     ReorderBufferTXN *txn;
    2705             : 
    2706          62 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    2707             :                                 false);
    2708             : 
    2709             :     /* unknown transaction, nothing to replay */
    2710          62 :     if (txn == NULL)
    2711           0 :         return;
    2712             : 
    2713          62 :     txn->txn_flags |= RBTXN_PREPARE;
    2714          62 :     txn->gid = pstrdup(gid);
    2715             : 
    2716             :     /* The prepare info must have been updated in txn by now. */
    2717             :     Assert(txn->final_lsn != InvalidXLogRecPtr);
    2718             : 
    2719          62 :     ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
    2720          62 :                         txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
    2721             : 
    2722             :     /*
    2723             :      * We send the prepare for concurrently aborted xacts so that later,
    2724             :      * when rollback prepared is decoded and sent, the downstream is able
    2725             :      * to roll back such an xact. See comments atop DecodePrepare.
    2726             :      *
    2727             :      * Note, for the concurrent_abort + streaming case a stream_prepare was
    2728             :      * already sent within the ReorderBufferReplay call above.
    2729             :      */
    2730          62 :     if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
    2731           0 :         rb->prepare(rb, txn, txn->final_lsn);
    2732             : }
    2733             : 
    2734             : /*
    2735             :  * This is used to handle COMMIT/ROLLBACK PREPARED.
    2736             :  */
    2737             : void
    2738          64 : ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
    2739             :                             XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
    2740             :                             XLogRecPtr two_phase_at,
    2741             :                             TimestampTz commit_time, RepOriginId origin_id,
    2742             :                             XLogRecPtr origin_lsn, char *gid, bool is_commit)
    2743             : {
    2744             :     ReorderBufferTXN *txn;
    2745             :     XLogRecPtr  prepare_end_lsn;
    2746             :     TimestampTz prepare_time;
    2747             : 
    2748          64 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, commit_lsn, false);
    2749             : 
    2750             :     /* unknown transaction, nothing to do */
    2751          64 :     if (txn == NULL)
    2752           0 :         return;
    2753             : 
    2754             :     /*
    2755             :      * By this time the txn has the prepare record information; remember it
    2756             :      * so that it can be used later for rollback.
    2757             :      */
    2758          64 :     prepare_end_lsn = txn->end_lsn;
    2759          64 :     prepare_time = txn->xact_time.prepare_time;
    2760             : 
    2761             :     /* add the gid in the txn */
    2762          64 :     txn->gid = pstrdup(gid);
    2763             : 
    2764             :     /*
    2765             :      * It is possible that this transaction was not decoded at prepare time,
    2766             :      * either because by that time we didn't have a consistent snapshot, or
    2767             :      * two_phase was not enabled, or it was decoded earlier but we have
    2768             :      * restarted. We only need to send the prepare if it was not decoded
    2769             :      * earlier. We don't need to decode the xact for aborts if it is not done
    2770             :      * already.
    2771             :      */
    2772          64 :     if ((txn->final_lsn < two_phase_at) && is_commit)
    2773             :     {
    2774           2 :         txn->txn_flags |= RBTXN_PREPARE;
    2775             : 
    2776             :         /*
    2777             :          * The prepare info must have been updated in txn even if we skip
    2778             :          * prepare.
    2779             :          */
    2780             :         Assert(txn->final_lsn != InvalidXLogRecPtr);
    2781             : 
    2782             :         /*
    2783             :          * By this time the txn has the prepare record information and it is
    2784             :          * important to use that so that downstream gets the accurate
    2785             :          * information. If we passed the commit information here instead,
    2786             :          * the downstream could behave as if it had already replayed the
    2787             :          * commit prepared after the restart.
    2788             :          */
    2789           2 :         ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
    2790           2 :                             txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
    2791             :     }
    2792             : 
    2793          64 :     txn->final_lsn = commit_lsn;
    2794          64 :     txn->end_lsn = end_lsn;
    2795          64 :     txn->xact_time.commit_time = commit_time;
    2796          64 :     txn->origin_id = origin_id;
    2797          64 :     txn->origin_lsn = origin_lsn;
    2798             : 
    2799          64 :     if (is_commit)
    2800          48 :         rb->commit_prepared(rb, txn, commit_lsn);
    2801             :     else
    2802          16 :         rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
    2803             : 
    2804             :     /* cleanup: make sure there's no cache pollution */
    2805          64 :     ReorderBufferExecuteInvalidations(txn->ninvalidations,
    2806             :                                       txn->invalidations);
    2807          64 :     ReorderBufferCleanupTXN(rb, txn);
    2808             : }
    2809             : 
    2810             : /*
    2811             :  * Abort a transaction that possibly has previous changes. Needs to be first
    2812             :  * called for subtransactions and then for the toplevel xid.
    2813             :  *
    2814             :  * NB: Transactions handled here have to have actively aborted (i.e. have
    2815             :  * produced an abort record). Implicitly aborted transactions are handled via
    2816             :  * ReorderBufferAbortOld(); transactions we're just not interested in, but
    2817             :  * which have committed are handled in ReorderBufferForget().
    2818             :  *
    2819             :  * This function purges this transaction and its contents from memory and
    2820             :  * disk.
    2821             :  */
    2822             : void
    2823         144 : ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
    2824             : {
    2825             :     ReorderBufferTXN *txn;
    2826             : 
    2827         144 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    2828             :                                 false);
    2829             : 
    2830             :     /* unknown, nothing to remove */
    2831         144 :     if (txn == NULL)
    2832           0 :         return;
    2833             : 
    2834             :     /* For streamed transactions notify the remote node about the abort. */
    2835         144 :     if (rbtxn_is_streamed(txn))
    2836             :     {
    2837          28 :         rb->stream_abort(rb, txn, lsn);
    2838             : 
    2839             :         /*
    2840             :          * We might have decoded changes for this transaction that could load
    2841             :          * the cache as per the current transaction's view (consider DDL that
    2842             :          * happened in this transaction). We don't want the decoding of future
    2843             :          * transactions to use those cache entries, so execute invalidations.
    2844             :          */
    2845          28 :         if (txn->ninvalidations > 0)
    2846           0 :             ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
    2847             :                                                txn->invalidations);
    2848             :     }
    2849             : 
    2850             :     /* cosmetic... */
    2851         144 :     txn->final_lsn = lsn;
    2852             : 
    2853             :     /* remove potential on-disk data, and deallocate */
    2854         144 :     ReorderBufferCleanupTXN(rb, txn);
    2855             : }
    2856             : 
    2857             : /*
    2858             :  * Abort all transactions that aren't actually running anymore because the
    2859             :  * server restarted.
    2860             :  *
    2861             :  * NB: These really have to be transactions that have aborted due to a server
    2862             :  * crash/immediate restart, as we don't deal with invalidations here.
    2863             :  */
    2864             : void
    2865        1300 : ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
    2866             : {
    2867             :     dlist_mutable_iter it;
    2868             : 
    2869             :     /*
    2870             :      * Iterate through all (potential) toplevel TXNs and abort all that are
    2871             :      * older than what possibly can be running. Once we've found the first
    2872             :      * that is alive we stop. There might be some that acquired an xid earlier
    2873             :      * but started writing later, but that is unlikely and they will be
    2874             :      * cleaned up in a later call to this function.
    2875             :      */
    2876        1304 :     dlist_foreach_modify(it, &rb->toplevel_by_lsn)
    2877             :     {
    2878             :         ReorderBufferTXN *txn;
    2879             : 
    2880          40 :         txn = dlist_container(ReorderBufferTXN, node, it.cur);
    2881             : 
    2882          40 :         if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
    2883             :         {
    2884           4 :             elog(DEBUG2, "aborting old transaction %u", txn->xid);
    2885             : 
    2886             :             /* remove potential on-disk data, and deallocate this tx */
    2887           4 :             ReorderBufferCleanupTXN(rb, txn);
    2888             :         }
    2889             :         else
    2890          36 :             return;
    2891             :     }
    2892             : }
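
The early-exit scan described above can be illustrated outside the reorder buffer. This is only a sketch under stated assumptions: `sketch_xid_precedes` and `sketch_count_old` are hypothetical names, the list is modelled as a plain array ordered by first LSN, and the comparison uses modulo-2^32 arithmetic for ordinary xids only (special xids are not handled). It counts how many leading entries would be aborted rather than freeing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: wraparound-aware "a is older than b" for ordinary
 * xids, comparing modulo 2^32. Special xids are ignored in this sketch.
 */
static int
sketch_xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical helper: walk xids in first-LSN order and count the leading
 * entries that are older than oldest_running. The walk stops at the first
 * entry that may still be running, so older entries further down the list
 * are left for a later call.
 */
static size_t
sketch_count_old(const uint32_t *xids_by_lsn, size_t n,
                 uint32_t oldest_running)
{
    size_t      naborted = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!sketch_xid_precedes(xids_by_lsn[i], oldest_running))
            break;
        naborted++;
    }

    return naborted;
}
```

With xids `{90, 95, 100, 97}` and an oldest running xid of 100, the walk stops at 100, so the trailing 97 is picked up on a later pass.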
    2893             : 
    2894             : /*
    2895             :  * Forget the contents of a transaction if we aren't interested in its
    2896             :  * contents. Needs to be first called for subtransactions and then for the
    2897             :  * toplevel xid.
    2898             :  *
    2899             :  * This is significantly different from ReorderBufferAbort() because
    2900             :  * transactions that have committed need to be treated differently from aborted
    2901             :  * ones since they may have modified the catalog.
    2902             :  *
    2903             :  * Note that this may only be called at the moment a transaction commit has
    2904             :  * just been read, not earlier; otherwise later records referring to this
    2905             :  * xid might re-create the transaction incompletely.
    2906             :  */
    2907             : void
    2908        4038 : ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
    2909             : {
    2910             :     ReorderBufferTXN *txn;
    2911             : 
    2912        4038 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    2913             :                                 false);
    2914             : 
    2915             :     /* unknown, nothing to forget */
    2916        4038 :     if (txn == NULL)
    2917        1120 :         return;
    2918             : 
    2919             :     /* For streamed transactions notify the remote node about the abort. */
    2920        2918 :     if (rbtxn_is_streamed(txn))
    2921           0 :         rb->stream_abort(rb, txn, lsn);
    2922             : 
    2923             :     /* cosmetic... */
    2924        2918 :     txn->final_lsn = lsn;
    2925             : 
    2926             :     /*
    2927             :      * Process cache invalidation messages if there are any. Even if we're not
    2928             :      * interested in the transaction's contents, it could have manipulated the
    2929             :      * catalog and we need to update the caches according to that.
    2930             :      */
    2931        2918 :     if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
    2932         820 :         ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
    2933             :                                            txn->invalidations);
    2934             :     else
    2935             :         Assert(txn->ninvalidations == 0);
    2936             : 
    2937             :     /* remove potential on-disk data, and deallocate */
    2938        2918 :     ReorderBufferCleanupTXN(rb, txn);
    2939             : }
    2940             : 
    2941             : /*
    2942             :  * Invalidate cache for those transactions that need to be skipped just in case
    2943             :  * catalogs were manipulated as part of the transaction.
    2944             :  *
    2945             :  * Note that this is a special-purpose function for prepared transactions where
    2946             :  * we don't want to clean up the TXN even when we decide to skip it. See
    2947             :  * DecodePrepare.
    2948             :  */
    2949             : void
    2950         140 : ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
    2951             : {
    2952             :     ReorderBufferTXN *txn;
    2953             : 
    2954         140 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    2955             :                                 false);
    2956             : 
    2957             :     /* unknown, nothing to do */
    2958         140 :     if (txn == NULL)
    2959           0 :         return;
    2960             : 
    2961             :     /*
    2962             :      * Process cache invalidation messages if there are any. Even if we're not
    2963             :      * interested in the transaction's contents, it could have manipulated the
    2964             :      * catalog and we need to update the caches according to that.
    2965             :      */
    2966         140 :     if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
    2967          46 :         ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
    2968             :                                            txn->invalidations);
    2969             :     else
    2970             :         Assert(txn->ninvalidations == 0);
    2971             : }
    2972             : 
    2973             : 
    2974             : /*
    2975             :  * Execute invalidations happening outside the context of a decoded
    2976             :  * transaction. That currently happens either for xid-less commits
    2977             :  * (cf. RecordTransactionCommit()) or for invalidations in uninteresting
    2978             :  * transactions (via ReorderBufferForget()).
    2979             :  */
    2980             : void
    2981         866 : ReorderBufferImmediateInvalidation(ReorderBuffer *rb, uint32 ninvalidations,
    2982             :                                    SharedInvalidationMessage *invalidations)
    2983             : {
    2984         866 :     bool        use_subtxn = IsTransactionOrTransactionBlock();
    2985             :     int         i;
    2986             : 
    2987         866 :     if (use_subtxn)
    2988         778 :         BeginInternalSubTransaction("replay");
    2989             : 
    2990             :     /*
    2991             :      * Force invalidations to happen outside of a valid transaction - that way
    2992             :      * entries will just be marked as invalid without accessing the catalog.
    2993             :      * That's advantageous because we don't need to set up the full state
    2994             :      * necessary for catalog access.
    2995             :      */
    2996         866 :     if (use_subtxn)
    2997         778 :         AbortCurrentTransaction();
    2998             : 
    2999       39648 :     for (i = 0; i < ninvalidations; i++)
    3000       38782 :         LocalExecuteInvalidationMessage(&invalidations[i]);
    3001             : 
    3002         866 :     if (use_subtxn)
    3003         778 :         RollbackAndReleaseCurrentSubTransaction();
    3004         866 : }
    3005             : 
    3006             : /*
    3007             :  * Tell reorderbuffer about an xid seen in the WAL stream. Has to be called at
    3008             :  * least once for every xid in XLogRecord->xl_xid (other places in records
    3009             :  * may, but do not have to be passed through here).
    3010             :  *
    3011             :  * Reorderbuffer keeps some data structures about transactions in LSN order,
    3012             :  * for efficiency. To do that it has to know when transactions are first seen
    3013             :  * in the WAL. As many types of records are not actually interesting for
    3014             :  * logical decoding, they do not necessarily pass through here.
    3015             :  */
    3016             : void
    3017     4041210 : ReorderBufferProcessXid(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
    3018             : {
    3019             :     /* many records won't have an xid assigned, centralize check here */
    3020     4041210 :     if (xid != InvalidTransactionId)
    3021     4038730 :         ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    3022     4041210 : }
    3023             : 
    3024             : /*
    3025             :  * Add a new snapshot to this transaction that may only be used after lsn 'lsn'
    3026             :  * because the previous snapshot doesn't describe the catalog correctly for
    3027             :  * following rows.
    3028             :  */
    3029             : void
    3030        1346 : ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
    3031             :                          XLogRecPtr lsn, Snapshot snap)
    3032             : {
    3033        1346 :     ReorderBufferChange *change = ReorderBufferGetChange(rb);
    3034             : 
    3035        1346 :     change->data.snapshot = snap;
    3036        1346 :     change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
    3037             : 
    3038        1346 :     ReorderBufferQueueChange(rb, xid, lsn, change, false);
    3039        1346 : }
    3040             : 
    3041             : /*
    3042             :  * Set up the transaction's base snapshot.
    3043             :  *
    3044             :  * If we know that xid is a subtransaction, set the base snapshot on the
    3045             :  * top-level transaction instead.
    3046             :  */
    3047             : void
    3048        3490 : ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
    3049             :                              XLogRecPtr lsn, Snapshot snap)
    3050             : {
    3051             :     ReorderBufferTXN *txn;
    3052             :     bool        is_new;
    3053             : 
    3054             :     AssertArg(snap != NULL);
    3055             : 
    3056             :     /*
    3057             :      * Fetch the transaction to operate on.  If we know it's a subtransaction,
    3058             :      * operate on its top-level transaction instead.
    3059             :      */
    3060        3490 :     txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
    3061        3490 :     if (rbtxn_is_known_subxact(txn))
    3062         232 :         txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
    3063             :                                     NULL, InvalidXLogRecPtr, false);
    3064             :     Assert(txn->base_snapshot == NULL);
    3065             : 
    3066        3490 :     txn->base_snapshot = snap;
    3067        3490 :     txn->base_snapshot_lsn = lsn;
    3068        3490 :     dlist_push_tail(&rb->txns_by_base_snapshot_lsn, &txn->base_snapshot_node);
    3069             : 
    3070        3490 :     AssertTXNLsnOrder(rb);
    3071        3490 : }
    3072             : 
    3073             : /*
    3074             :  * Access the catalog with this CommandId at this point in the changestream.
    3075             :  *
    3076             :  * May only be called for command ids > 1
    3077             :  */
    3078             : void
    3079       33312 : ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
    3080             :                              XLogRecPtr lsn, CommandId cid)
    3081             : {
    3082       33312 :     ReorderBufferChange *change = ReorderBufferGetChange(rb);
    3083             : 
    3084       33312 :     change->data.command_id = cid;
    3085       33312 :     change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
    3086             : 
    3087       33312 :     ReorderBufferQueueChange(rb, xid, lsn, change, false);
    3088       33312 : }
    3089             : 
    3090             : /*
    3091             :  * Update memory counters to account for the new or removed change.
    3092             :  *
    3093             :  * We update two counters - in the reorder buffer, and in the transaction
    3094             :  * containing the change. The reorder buffer counter allows us to quickly
    3095             :  * decide if we reached the memory limit, the transaction counter allows
    3096             :  * us to quickly pick the largest transaction for eviction.
    3097             :  *
    3098             :  * When streaming is enabled, we need to update the toplevel transaction
    3099             :  * counters instead - we don't really care about subtransactions as we
    3100             :  * can't stream them individually anyway, and we only pick toplevel
    3101             :  * transactions for eviction. So only toplevel transactions matter.
    3102             :  */
    3103             : static void
    3104     6293184 : ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
    3105             :                                 ReorderBufferChange *change,
    3106             :                                 bool addition, Size sz)
    3107             : {
    3108             :     ReorderBufferTXN *txn;
    3109             :     ReorderBufferTXN *toptxn;
    3110             : 
    3111             :     Assert(change->txn);
    3112             : 
    3113             :     /*
    3114             :      * Ignore tuple CID changes, because those are not evicted when reaching
    3115             :      * the memory limit. We don't count them, because doing so might easily
    3116             :      * trigger a pointless attempt to spill.
    3117             :      */
    3118     6293184 :     if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
    3119       33190 :         return;
    3120             : 
    3121     6259994 :     txn = change->txn;
    3122             : 
    3123             :     /*
    3124             :      * Update the total size in the top-level transaction as well. This is
    3125             :      * later used to compute the decoding stats.
    3126             :      */
    3127     6259994 :     if (txn->toptxn != NULL)
    3128     2322336 :         toptxn = txn->toptxn;
    3129             :     else
    3130     3937658 :         toptxn = txn;
    3131             : 
    3132     6259994 :     if (addition)
    3133             :     {
    3134     3131254 :         txn->size += sz;
    3135     3131254 :         rb->size += sz;
    3136             : 
    3137             :         /* Update the total size in the top transaction. */
    3138     3131254 :         toptxn->total_size += sz;
    3139             :     }
    3140             :     else
    3141             :     {
    3142             :         Assert((rb->size >= sz) && (txn->size >= sz));
    3143     3128740 :         txn->size -= sz;
    3144     3128740 :         rb->size -= sz;
    3145             : 
    3146             :         /* Update the total size in the top transaction. */
    3147     3128740 :         toptxn->total_size -= sz;
    3148             :     }
    3149             : 
    3150             :     Assert(txn->size <= rb->size);
    3151             : }
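
The two-level accounting above can be tried in isolation with a reduced model. The sketch below is an assumption-laden illustration, not the PostgreSQL code: `SketchTxn`, `SketchBuffer` and `sketch_memory_update` are hypothetical stand-ins that keep only the three counters involved, and `assert` stands in for `Assert`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for ReorderBufferTXN. */
typedef struct SketchTxn
{
    struct SketchTxn *toptxn;   /* NULL for a top-level transaction */
    size_t      size;           /* bytes held by this (sub)transaction */
    size_t      total_size;     /* maintained on top-level transactions */
} SketchTxn;

/* Hypothetical, reduced stand-in for ReorderBuffer. */
typedef struct SketchBuffer
{
    size_t      size;           /* bytes held by all transactions */
} SketchBuffer;

/* Add or remove sz bytes from the per-txn, buffer and top-level counters. */
static void
sketch_memory_update(SketchBuffer *rb, SketchTxn *txn, bool addition,
                     size_t sz)
{
    /* A top-level transaction acts as its own top transaction. */
    SketchTxn  *toptxn = (txn->toptxn != NULL) ? txn->toptxn : txn;

    if (addition)
    {
        txn->size += sz;
        rb->size += sz;
        toptxn->total_size += sz;
    }
    else
    {
        assert(rb->size >= sz && txn->size >= sz);
        txn->size -= sz;
        rb->size -= sz;
        toptxn->total_size -= sz;
    }

    assert(txn->size <= rb->size);
}
```

Adding 100 bytes to a top-level transaction and 40 bytes to one of its subtransactions leaves the buffer counter and the top-level `total_size` at 140, while each per-transaction `size` reflects only its own changes.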
    3152             : 
    3153             : /*
    3154             :  * Add new (relfilenode, tid) -> (cmin, cmax) mappings.
    3155             :  *
    3156             :  * We do not include this change type in memory accounting, because we
    3157             :  * keep CIDs in a separate list and do not evict them when reaching
    3158             :  * the memory limit.
    3159             :  */
    3160             : void
    3161       33312 : ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
    3162             :                              XLogRecPtr lsn, RelFileNode node,
    3163             :                              ItemPointerData tid, CommandId cmin,
    3164             :                              CommandId cmax, CommandId combocid)
    3165             : {
    3166       33312 :     ReorderBufferChange *change = ReorderBufferGetChange(rb);
    3167             :     ReorderBufferTXN *txn;
    3168             : 
    3169       33312 :     txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    3170             : 
    3171       33312 :     change->data.tuplecid.node = node;
    3172       33312 :     change->data.tuplecid.tid = tid;
    3173       33312 :     change->data.tuplecid.cmin = cmin;
    3174       33312 :     change->data.tuplecid.cmax = cmax;
    3175       33312 :     change->data.tuplecid.combocid = combocid;
    3176       33312 :     change->lsn = lsn;
    3177       33312 :     change->txn = txn;
    3178       33312 :     change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
    3179             : 
    3180       33312 :     dlist_push_tail(&txn->tuplecids, &change->node);
    3181       33312 :     txn->ntuplecids++;
    3182       33312 : }
    3183             : 
    3184             : /*
    3185             :  * Setup the invalidation of the toplevel transaction.
    3186             :  *
    3187             :  * This needs to be called for each XLOG_XACT_INVALIDATIONS message. It
    3188             :  * accumulates all the invalidation messages in the toplevel transaction and
    3189             :  * also queues them as a change in the reorder buffer.  Recording them as a
    3190             :  * change lets us execute only the required invalidations instead of
    3191             :  * executing all the invalidations on each CommandId increment.  We also
    3192             :  * need to accumulate them in the toplevel transaction because in some cases
    3193             :  * we skip processing the transaction (see ReorderBufferForget) and then need
    3194             :  * to execute all the invalidations together.
    3195             :  */
    3196             : void
    3197        6342 : ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
    3198             :                               XLogRecPtr lsn, Size nmsgs,
    3199             :                               SharedInvalidationMessage *msgs)
    3200             : {
    3201             :     ReorderBufferTXN *txn;
    3202             :     MemoryContext oldcontext;
    3203             :     ReorderBufferChange *change;
    3204             : 
    3205        6342 :     txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    3206             : 
    3207        6342 :     oldcontext = MemoryContextSwitchTo(rb->context);
    3208             : 
    3209             :     /*
    3210             :      * Collect all the invalidations under the top transaction so that we can
    3211             :      * execute them all together.  See the comment atop this function.
    3212             :      */
    3213        6342 :     if (txn->toptxn)
    3214         326 :         txn = txn->toptxn;
    3215             : 
    3216             :     Assert(nmsgs > 0);
    3217             : 
    3218             :     /* Accumulate invalidations. */
    3219        6342 :     if (txn->ninvalidations == 0)
    3220             :     {
    3221        1356 :         txn->ninvalidations = nmsgs;
    3222        1356 :         txn->invalidations = (SharedInvalidationMessage *)
    3223        1356 :             palloc(sizeof(SharedInvalidationMessage) * nmsgs);
    3224        1356 :         memcpy(txn->invalidations, msgs,
    3225             :                sizeof(SharedInvalidationMessage) * nmsgs);
    3226             :     }
    3227             :     else
    3228             :     {
    3229        4986 :         txn->invalidations = (SharedInvalidationMessage *)
    3230        4986 :             repalloc(txn->invalidations, sizeof(SharedInvalidationMessage) *
    3231        4986 :                      (txn->ninvalidations + nmsgs));
    3232             : 
    3233        4986 :         memcpy(txn->invalidations + txn->ninvalidations, msgs,
    3234             :                nmsgs * sizeof(SharedInvalidationMessage));
    3235        4986 :         txn->ninvalidations += nmsgs;
    3236             :     }
    3237             : 
    3238        6342 :     change = ReorderBufferGetChange(rb);
    3239        6342 :     change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
    3240        6342 :     change->data.inval.ninvalidations = nmsgs;
    3241        6342 :     change->data.inval.invalidations = (SharedInvalidationMessage *)
    3242        6342 :         palloc(sizeof(SharedInvalidationMessage) * nmsgs);
    3243        6342 :     memcpy(change->data.inval.invalidations, msgs,
    3244             :            sizeof(SharedInvalidationMessage) * nmsgs);
    3245             : 
    3246        6342 :     ReorderBufferQueueChange(rb, xid, lsn, change, false);
    3247             : 
    3248        6342 :     MemoryContextSwitchTo(oldcontext);
    3249        6342 : }
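
The accumulate step (allocate on first use, otherwise grow and append) can be sketched with the standard allocator. This is a hedged illustration only: `SketchMsg`, `SketchInvalList` and `sketch_append_invalidations` are hypothetical names, and `realloc` replaces the `palloc`/`repalloc` pair. Unlike the code above, it needs no separate first-use branch because `realloc(NULL, n)` behaves like `malloc(n)`, and it reports allocation failure through a return code instead of raising an error.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for SharedInvalidationMessage. */
typedef struct SketchMsg
{
    int         id;
} SketchMsg;

/* Hypothetical stand-in for the (ninvalidations, invalidations) pair. */
typedef struct SketchInvalList
{
    size_t      nmsgs;
    SketchMsg  *msgs;
} SketchInvalList;

/*
 * Append nmsgs messages to the list.
 * Returns 0 on success, -1 if allocation fails (the list is left unchanged).
 */
static int
sketch_append_invalidations(SketchInvalList *list, const SketchMsg *msgs,
                            size_t nmsgs)
{
    SketchMsg  *grown;

    assert(nmsgs > 0);

    /* realloc(NULL, n) acts like malloc(n), covering the empty-list case. */
    grown = realloc(list->msgs, sizeof(SketchMsg) * (list->nmsgs + nmsgs));
    if (grown == NULL)
        return -1;

    /* Copy the new messages behind the ones already accumulated. */
    memcpy(grown + list->nmsgs, msgs, sizeof(SketchMsg) * nmsgs);
    list->msgs = grown;
    list->nmsgs += nmsgs;

    return 0;
}
```

The caller owns `list->msgs` and releases it with `free` once the accumulated messages have been executed.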
    3250             : 
    3251             : /*
    3252             :  * Apply all invalidations we know. Possibly we only need parts at this point
    3253             :  * in the changestream but we don't know which those are.
    3254             :  */
    3255             : static void
    3256        4536 : ReorderBufferExecuteInvalidations(uint32 nmsgs, SharedInvalidationMessage *msgs)
    3257             : {
    3258             :     int         i;
    3259             : 
    3260       53404 :     for (i = 0; i < nmsgs; i++)
    3261       48868 :         LocalExecuteInvalidationMessage(&msgs[i]);
    3262        4536 : }
    3263             : 
    3264             : /*
    3265             :  * Mark a transaction as containing catalog changes
    3266             :  */
    3267             : void
    3268       41210 : ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
    3269             :                                   XLogRecPtr lsn)
    3270             : {
    3271             :     ReorderBufferTXN *txn;
    3272             : 
    3273       41210 :     txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
    3274             : 
    3275       41210 :     txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
    3276             : 
    3277             :     /*
    3278             :      * Mark the top-level transaction as having catalog changes too if one of
    3279             :      * its children has them, so that ReorderBufferBuildTupleCidHash can
    3280             :      * conveniently check just the top-level transaction when deciding whether
    3281             :      * to build the hash table.
    3282             :      */
    3283       41210 :     if (txn->toptxn != NULL)
    3284        1792 :         txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
    3285       41210 : }
    3286             : 
    3287             : /*
    3288             :  * Query whether a transaction is already *known* to contain catalog
    3289             :  * changes. This can be wrong until directly before the commit!
    3290             :  */
    3291             : bool
    3292        5868 : ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
    3293             : {
    3294             :     ReorderBufferTXN *txn;
    3295             : 
    3296        5868 :     txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
    3297             :                                 false);
    3298        5868 :     if (txn == NULL)
    3299        1284 :         return false;
    3300             : 
    3301        4584 :     return rbtxn_has_catalog_changes(txn);
    3302             : }
    3303             : 
    3304             : /*
    3305             :  * ReorderBufferXidHasBaseSnapshot
    3306             :  *      Have we already set the base snapshot for the given txn/subtxn?
    3307             :  */
    3308             : bool
    3309     2799092 : ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
    3310             : {
    3311             :     ReorderBufferTXN *txn;
    3312             : 
    3313     2799092 :     txn = ReorderBufferTXNByXid(rb, xid, false,
    3314             :                                 NULL, InvalidXLogRecPtr, false);
    3315             : 
    3316             :     /* transaction isn't known yet, ergo no snapshot */
    3317     2799092 :     if (txn == NULL)
    3318           0 :         return false;
    3319             : 
    3320             :     /* a known subtxn? operate on top-level txn instead */
    3321     2799092 :     if (rbtxn_is_known_subxact(txn))
    3322     1011222 :         txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
    3323             :                                     NULL, InvalidXLogRecPtr, false);
    3324             : 
    3325     2799092 :     return txn->base_snapshot != NULL;
    3326             : }
    3327             : 
    3328             : 
    3329             : /*
    3330             :  * ---------------------------------------
    3331             :  * Disk serialization support
    3332             :  * ---------------------------------------
    3333             :  */
    3334             : 
    3335             : /*
    3336             :  * Ensure the IO buffer is >= sz.
    3337             :  */
    3338             : static void
    3339     5385702 : ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
    3340             : {
    3341     5385702 :     if (!rb->outbufsize)
    3342             :     {
    3343          80 :         rb->outbuf = MemoryContextAlloc(rb->context, sz);
    3344          80 :         rb->outbufsize = sz;
    3345             :     }
    3346     5385622 :     else if (rb->outbufsize < sz)
    3347             :     {
    3348         576 :         rb->outbuf = repalloc(rb->outbuf, sz);
    3349         576 :         rb->outbufsize = sz;
    3350             :     }
    3351     5385702 : }
    3352             : 
    3353             : /*
    3354             :  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
    3355             :  *
    3356             :  * XXX With many subtransactions this might be quite slow, because we'll have
    3357             :  * to walk through all of them. There are several ways we could improve
    3358             :  * that: (a) maintain some secondary structure with transactions sorted by
    3359             :  * amount of changes, (b) look not for the very largest transaction, but
    3360             :  * e.g. for a transaction using at least some fraction of the memory limit,
    3361             :  * and (c) evict multiple transactions at once, e.g. to free a given portion
    3362             :  * of the memory limit (e.g. 50%).
    3363             :  */
    3364             : static ReorderBufferTXN *
    3365        5644 : ReorderBufferLargestTXN(ReorderBuffer *rb)
    3366             : {
    3367             :     HASH_SEQ_STATUS hash_seq;
    3368             :     ReorderBufferTXNByIdEnt *ent;
    3369        5644 :     ReorderBufferTXN *largest = NULL;
    3370             : 
    3371        5644 :     hash_seq_init(&hash_seq, rb->by_txn);
    3372       14900 :     while ((ent = hash_seq_search(&hash_seq)) != NULL)
    3373             :     {
    3374        9256 :         ReorderBufferTXN *txn = ent->txn;
    3375             : 
    3376             :         /* if the current transaction is larger, remember it */
    3377        9256 :         if ((!largest) || (txn->size > largest->size))
    3378        7702 :             largest = txn;
    3379             :     }
    3380             : 
    3381             :     Assert(largest);
    3382             :     Assert(largest->size > 0);
    3383             :     Assert(largest->size <= rb->size);
    3384             : 
    3385        5644 :     return largest;
    3386             : }
    3387             : 
    3388             : /*
    3389             :  * Find the largest toplevel transaction to evict (by streaming).
    3390             :  *
    3391             :  * This can be seen as an optimized version of ReorderBufferLargestTXN, which
    3392             :  * should give us the same transaction (because we don't update the memory
    3393             :  * accounting for subtransactions when streaming, so it's always 0). But we
    3394             :  * can simply iterate over the limited number of toplevel transactions that
    3395             :  * have a base snapshot. There is no point in selecting a transaction that
    3396             :  * doesn't have a base snapshot because we don't decode such transactions.
    3397             :  *
    3398             :  * Note that we skip transactions that contain incomplete changes. There
    3399             :  * is scope for optimization here: we could select the largest transaction
    3400             :  * even if it has incomplete changes.  But that would make the code and
    3401             :  * design quite complex, which might not be worth the benefit.  If we planned
    3402             :  * to stream transactions that contain incomplete changes, we would need to
    3403             :  * find a way to partially stream/truncate the transaction changes in-memory
    3404             :  * and build a mechanism to partially truncate the spilled files.
    3405             :  * Additionally, whenever we partially streamed a transaction we would need
    3406             :  * to maintain the last streamed lsn, and next time restore from that
    3407             :  * segment and offset in WAL.  As we stream the changes from the top
    3408             :  * transaction and restore them subtransaction-wise, we would even need to
    3409             :  * remember the subxact from which we streamed the last change.
    3410             :  */
    3411             : static ReorderBufferTXN *
    3412         850 : ReorderBufferLargestTopTXN(ReorderBuffer *rb)
    3413             : {
    3414             :     dlist_iter  iter;
    3415         850 :     Size        largest_size = 0;
    3416         850 :     ReorderBufferTXN *largest = NULL;
    3417             : 
    3418             :     /* Find the largest top-level transaction having a base snapshot. */
    3419        1888 :     dlist_foreach(iter, &rb->txns_by_base_snapshot_lsn)
    3420             :     {
    3421             :         ReorderBufferTXN *txn;
    3422             : 
    3423        1038 :         txn = dlist_container(ReorderBufferTXN, base_snapshot_node, iter.cur);
    3424             : 
    3425             :         /* must not be a subtxn */
    3426             :         Assert(!rbtxn_is_known_subxact(txn));
    3427             :         /* base_snapshot must be set */
    3428             :         Assert(txn->base_snapshot != NULL);
    3429             : 
    3430        1038 :         if ((largest == NULL || txn->total_size > largest_size) &&
    3431         998 :             (txn->total_size > 0) && !(rbtxn_has_partial_change(txn)))
    3432             :         {
    3433         822 :             largest = txn;
    3434         822 :             largest_size = txn->total_size;
    3435             :         }
    3436             :     }
    3437             : 
    3438         850 :     return largest;
    3439             : }
    3440             : 
    3441             : /*
    3442             :  * Check whether the logical_decoding_work_mem limit was reached, and if so
    3443             :  * pick the largest (sub)transaction, one at a time, to evict and spill its
    3444             :  * changes to disk until we are under the memory limit.
    3445             :  *
    3446             :  * XXX At this point we select transactions until we are under the memory
    3447             :  * limit, but we might also adopt a more elaborate eviction strategy - for
    3448             :  * example evicting enough transactions to free a certain fraction (e.g. 50%)
    3449             :  * of the memory limit.
    3450             :  */
    3451             : static void
    3452     2823746 : ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
    3453             : {
    3454             :     ReorderBufferTXN *txn;
    3455             : 
    3456             :     /* bail out if we haven't exceeded the memory limit */
    3457     2823746 :     if (rb->size < logical_decoding_work_mem * 1024L)
    3458     2817340 :         return;
    3459             : 
    3460             :     /*
    3461             :      * Loop until we are under the memory limit.  One might think that just
    3462             :      * by evicting the largest (sub)transaction we would come under the memory
    3463             :      * limit, on the assumption that the selected transaction is at least as
    3464             :      * large as the most recent change (which caused us to go over the memory
    3465             :      * limit). However, that is not true because a user can reduce
    3466             :      * logical_decoding_work_mem to a smaller value before the most recent
    3467             :      * change.
    3468             :      */
    3469       12808 :     while (rb->size >= logical_decoding_work_mem * 1024L)
    3470             :     {
    3471             :         /*
    3472             :          * Pick the largest transaction (or subtransaction) and evict it from
    3473             :          * memory by streaming, if possible.  Otherwise, spill to disk.
    3474             :          */
    3475        7256 :         if (ReorderBufferCanStartStreaming(rb) &&
    3476         850 :             (txn = ReorderBufferLargestTopTXN(rb)) != NULL)
    3477             :         {
    3478             :             /* we know there has to be one, because the size is not zero */
    3479             :             Assert(txn && !txn->toptxn);
    3480             :             Assert(txn->total_size > 0);
    3481             :             Assert(rb->size >= txn->total_size);
    3482             : 
    3483         762 :             ReorderBufferStreamTXN(rb, txn);
    3484             :         }
    3485             :         else
    3486             :         {
    3487             :             /*
    3488             :              * Pick the largest transaction (or subtransaction) and evict it
    3489             :              * from memory by serializing it to disk.
    3490             :              */
    3491        5644 :             txn = ReorderBufferLargestTXN(rb);
    3492             : 
    3493             :             /* we know there has to be one, because the size is not zero */
    3494             :             Assert(txn);
    3495             :             Assert(txn->size > 0);
    3496             :             Assert(rb->size >= txn->size);
    3497             : 
    3498        5644 :             ReorderBufferSerializeTXN(rb, txn);
    3499             :         }
    3500             : 
    3501             :         /*
    3502             :          * After eviction, the transaction should have no entries in memory,
    3503             :          * and should use 0 bytes for changes.
    3504             :          */
    3505             :         Assert(txn->size == 0);
    3506             :         Assert(txn->nentries_mem == 0);
    3507             :     }
    3508             : 
    3509             :     /* We must be under the memory limit now. */
    3510             :     Assert(rb->size < logical_decoding_work_mem * 1024L);
    3511             : }
    3512             : 
    3513             : /*
    3514             :  * Spill data of a large transaction (and its subtransactions) to disk.
    3515             :  */
    3516             : static void
    3517        6246 : ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
    3518             : {
    3519             :     dlist_iter  subtxn_i;
    3520             :     dlist_mutable_iter change_i;
    3521        6246 :     int         fd = -1;
    3522        6246 :     XLogSegNo   curOpenSegNo = 0;
    3523        6246 :     Size        spilled = 0;
    3524        6246 :     Size        size = txn->size;
    3525             : 
    3526        6246 :     elog(DEBUG2, "spill %u changes in XID %u to disk",
    3527             :          (uint32) txn->nentries_mem, txn->xid);
    3528             : 
    3529             :     /* do the same to all child TXs */
    3530        6782 :     dlist_foreach(subtxn_i, &txn->subtxns)
    3531             :     {
    3532             :         ReorderBufferTXN *subtxn;
    3533             : 
    3534         536 :         subtxn = dlist_container(ReorderBufferTXN, node, subtxn_i.cur);
    3535         536 :         ReorderBufferSerializeTXN(rb, subtxn);
    3536             :     }
    3537             : 
    3538             :     /* serialize changestream */
    3539     2409156 :     dlist_foreach_modify(change_i, &txn->changes)
    3540             :     {
    3541             :         ReorderBufferChange *change;
    3542             : 
    3543     2402910 :         change = dlist_container(ReorderBufferChange, node, change_i.cur);
    3544             : 
    3545             :         /*
    3546             :          * store the change in the segment it belongs to by start lsn;
    3547             :          * don't split it over multiple segments, though
    3548             :          */
    3549     2402910 :         if (fd == -1 ||
    3550     2397166 :             !XLByteInSeg(change->lsn, curOpenSegNo, wal_segment_size))
    3551             :         {
    3552             :             char        path[MAXPGPATH];
    3553             : 
    3554        5750 :             if (fd != -1)
    3555           6 :                 CloseTransientFile(fd);
    3556             : 
    3557        5750 :             XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size);
    3558             : 
    3559             :             /*
    3560             :              * No need to care about TLIs here, only used during a single run,
    3561             :              * so each LSN only maps to a specific WAL record.
    3562             :              */
    3563        5750 :             ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid,
    3564             :                                         curOpenSegNo);
    3565             : 
    3566             :             /* open segment, create it if necessary */
    3567        5750 :             fd = OpenTransientFile(path,
    3568             :                                    O_CREAT | O_WRONLY | O_APPEND | PG_BINARY);
    3569             : 
    3570        5750 :             if (fd < 0)
    3571           0 :                 ereport(ERROR,
    3572             :                         (errcode_for_file_access(),
    3573             :                          errmsg("could not open file \"%s\": %m", path)));
    3574             :         }
    3575             : 
    3576     2402910 :         ReorderBufferSerializeChange(rb, txn, fd, change);
    3577     2402910 :         dlist_delete(&change->node);
    3578     2402910 :         ReorderBufferReturnChange(rb, change, true);
    3579             : 
    3580     2402910 :         spilled++;
    3581             :     }
    3582             : 
    3583             :     /* update the statistics iff we have spilled anything */
    3584        6246 :     if (spilled)
    3585             :     {
    3586        5744 :         rb->spillCount += 1;
    3587        5744 :         rb->spillBytes += size;
    3588             : 
    3589             :         /* don't consider already serialized transactions */
    3590        5744 :         rb->spillTxns += (rbtxn_is_serialized(txn) || rbtxn_is_serialized_clear(txn)) ? 0 : 1;
    3591             : 
    3592             :         /* update the decoding stats */
    3593        5744 :         UpdateDecodingStats((LogicalDecodingContext *) rb->private_data);
    3594             :     }
    3595             : 
    3596             :     Assert(spilled == txn->nentries_mem);
    3597             :     Assert(dlist_is_empty(&txn->changes));
    3598        6246 :     txn->nentries_mem = 0;
    3599        6246 :     txn->txn_flags |= RBTXN_IS_SERIALIZED;
    3600             : 
    3601        6246 :     if (fd != -1)
    3602        5744 :         CloseTransientFile(fd);
    3603        6246 : }
    3604             : 
    3605             : /*
    3606             :  * Serialize individual change to disk.
    3607             :  */
    3608             : static void
    3609     2402910 : ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
    3610             :                              int fd, ReorderBufferChange *change)
    3611             : {
    3612             :     ReorderBufferDiskChange *ondisk;
    3613     2402910 :     Size        sz = sizeof(ReorderBufferDiskChange);
    3614             : 
    3615     2402910 :     ReorderBufferSerializeReserve(rb, sz);
    3616             : 
    3617     2402910 :     ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    3618     2402910 :     memcpy(&ondisk->change, change, sizeof(ReorderBufferChange));
    3619             : 
    3620     2402910 :     switch (change->action)
    3621             :     {
    3622             :             /* fall through these, they're all similar enough */
    3623     2368436 :         case REORDER_BUFFER_CHANGE_INSERT:
    3624             :         case REORDER_BUFFER_CHANGE_UPDATE:
    3625             :         case REORDER_BUFFER_CHANGE_DELETE:
    3626             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    3627             :             {
    3628             :                 char       *data;
    3629             :                 ReorderBufferTupleBuf *oldtup,
    3630             :                            *newtup;
    3631     2368436 :                 Size        oldlen = 0;
    3632     2368436 :                 Size        newlen = 0;
    3633             : 
    3634     2368436 :                 oldtup = change->data.tp.oldtuple;
    3635     2368436 :                 newtup = change->data.tp.newtuple;
    3636             : 
    3637     2368436 :                 if (oldtup)
    3638             :                 {
    3639      144674 :                     sz += sizeof(HeapTupleData);
    3640      144674 :                     oldlen = oldtup->tuple.t_len;
    3641      144674 :                     sz += oldlen;
    3642             :                 }
    3643             : 
    3644     2368436 :                 if (newtup)
    3645             :                 {
    3646     2116358 :                     sz += sizeof(HeapTupleData);
    3647     2116358 :                     newlen = newtup->tuple.t_len;
    3648     2116358 :                     sz += newlen;
    3649             :                 }
    3650             : 
    3651             :                 /* make sure we have enough space */
    3652     2368436 :                 ReorderBufferSerializeReserve(rb, sz);
    3653             : 
    3654     2368436 :                 data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
    3655             :                 /* might have been reallocated above */
    3656     2368436 :                 ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    3657             : 
    3658     2368436 :                 if (oldlen)
    3659             :                 {
    3660      144674 :                     memcpy(data, &oldtup->tuple, sizeof(HeapTupleData));
    3661      144674 :                     data += sizeof(HeapTupleData);
    3662             : 
    3663      144674 :                     memcpy(data, oldtup->tuple.t_data, oldlen);
    3664      144674 :                     data += oldlen;
    3665             :                 }
    3666             : 
    3667     2368436 :                 if (newlen)
    3668             :                 {
    3669     2116358 :                     memcpy(data, &newtup->tuple, sizeof(HeapTupleData));
    3670     2116358 :                     data += sizeof(HeapTupleData);
    3671             : 
    3672     2116358 :                     memcpy(data, newtup->tuple.t_data, newlen);
    3673     2116358 :                     data += newlen;
    3674             :                 }
    3675     2368436 :                 break;
    3676             :             }
    3677          38 :         case REORDER_BUFFER_CHANGE_MESSAGE:
    3678             :             {
    3679             :                 char       *data;
    3680          38 :                 Size        prefix_size = strlen(change->data.msg.prefix) + 1;
    3681             : 
    3682          38 :                 sz += prefix_size + change->data.msg.message_size +
    3683             :                     sizeof(Size) + sizeof(Size);
    3684          38 :                 ReorderBufferSerializeReserve(rb, sz);
    3685             : 
    3686          38 :                 data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
    3687             : 
    3688             :                 /* might have been reallocated above */
    3689          38 :                 ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    3690             : 
    3691             :                 /* write the prefix including the size */
    3692          38 :                 memcpy(data, &prefix_size, sizeof(Size));
    3693          38 :                 data += sizeof(Size);
    3694          38 :                 memcpy(data, change->data.msg.prefix,
    3695             :                        prefix_size);
    3696          38 :                 data += prefix_size;
    3697             : 
    3698             :                 /* write the message including the size */
    3699          38 :                 memcpy(data, &change->data.msg.message_size, sizeof(Size));
    3700          38 :                 data += sizeof(Size);
    3701          38 :                 memcpy(data, change->data.msg.message,
    3702             :                        change->data.msg.message_size);
    3703          38 :                 data += change->data.msg.message_size;
    3704             : 
    3705          38 :                 break;
    3706             :             }
    3707         210 :         case REORDER_BUFFER_CHANGE_INVALIDATION:
    3708             :             {
    3709             :                 char       *data;
    3710         210 :                 Size        inval_size = sizeof(SharedInvalidationMessage) *
    3711         210 :                 change->data.inval.ninvalidations;
    3712             : 
    3713         210 :                 sz += inval_size;
    3714             : 
    3715         210 :                 ReorderBufferSerializeReserve(rb, sz);
    3716         210 :                 data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
    3717             : 
    3718             :                 /* might have been reallocated above */
    3719         210 :                 ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    3720         210 :                 memcpy(data, change->data.inval.invalidations, inval_size);
    3721         210 :                 data += inval_size;
    3722             : 
    3723         210 :                 break;
    3724             :             }
    3725           4 :         case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
    3726             :             {
    3727             :                 Snapshot    snap;
    3728             :                 char       *data;
    3729             : 
    3730           4 :                 snap = change->data.snapshot;
    3731             : 
    3732           4 :                 sz += sizeof(SnapshotData) +
    3733           4 :                     sizeof(TransactionId) * snap->xcnt +
    3734           4 :                     sizeof(TransactionId) * snap->subxcnt;
    3735             : 
    3736             :                 /* make sure we have enough space */
    3737           4 :                 ReorderBufferSerializeReserve(rb, sz);
    3738           4 :                 data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
    3739             :                 /* might have been reallocated above */
    3740           4 :                 ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    3741             : 
    3742           4 :                 memcpy(data, snap, sizeof(SnapshotData));
    3743           4 :                 data += sizeof(SnapshotData);
    3744             : 
    3745           4 :                 if (snap->xcnt)
    3746             :                 {
    3747           4 :                     memcpy(data, snap->xip,
    3748           4 :                            sizeof(TransactionId) * snap->xcnt);
    3749           4 :                     data += sizeof(TransactionId) * snap->xcnt;
    3750             :                 }
    3751             : 
    3752           4 :                 if (snap->subxcnt)
    3753             :                 {
    3754           0 :                     memcpy(data, snap->subxip,
    3755           0 :                            sizeof(TransactionId) * snap->subxcnt);
    3756           0 :                     data += sizeof(TransactionId) * snap->subxcnt;
    3757             :                 }
    3758           4 :                 break;
    3759             :             }
    3760           0 :         case REORDER_BUFFER_CHANGE_TRUNCATE:
    3761             :             {
    3762             :                 Size        size;
    3763             :                 char       *data;
    3764             : 
    3765             :                 /* account for the OIDs of truncated relations */
    3766           0 :                 size = sizeof(Oid) * change->data.truncate.nrelids;
    3767           0 :                 sz += size;
    3768             : 
    3769             :                 /* make sure we have enough space */
    3770           0 :                 ReorderBufferSerializeReserve(rb, sz);
    3771             : 
    3772           0 :                 data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
    3773             :                 /* might have been reallocated above */
    3774           0 :                 ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    3775             : 
    3776           0 :                 memcpy(data, change->data.truncate.relids, size);
    3777           0 :                 data += size;
    3778             : 
    3779           0 :                 break;
    3780             :             }
    3781       34222 :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    3782             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
    3783             :         case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
    3784             :         case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
    3785             :             /* ReorderBufferChange contains everything important */
    3786       34222 :             break;
    3787             :     }
    3788             : 
    3789     2402910 :     ondisk->size = sz;
    3790             : 
    3791     2402910 :     errno = 0;
    3792     2402910 :     pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE);
    3793     2402910 :     if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
    3794             :     {
    3795           0 :         int         save_errno = errno;
    3796             : 
    3797           0 :         CloseTransientFile(fd);
    3798             : 
    3799             :         /* if write didn't set errno, assume problem is no disk space */
    3800           0 :         errno = save_errno ? save_errno : ENOSPC;
    3801           0 :         ereport(ERROR,
    3802             :                 (errcode_for_file_access(),
    3803             :                  errmsg("could not write to data file for XID %u: %m",
    3804             :                         txn->xid)));
    3805             :     }
    3806     2402910 :     pgstat_report_wait_end();
    3807             : 
    3808             :     /*
    3809             :      * Keep the transaction's final_lsn up to date with each change we send to
    3810             :      * disk, so that ReorderBufferRestoreCleanup works correctly.  (We used to
    3811             :      * only do this on commit and abort records, but that doesn't work if a
    3812             :      * system crash leaves a transaction without its abort record).
    3813             :      *
    3814             :      * Make sure not to move it backwards.
    3815             :      */
    3816     2402910 :     if (txn->final_lsn < change->lsn)
    3817     2394024 :         txn->final_lsn = change->lsn;
    3818             : 
    3819             :     Assert(ondisk->change.action == change->action);
    3820     2402910 : }
    3821             : 
    3822             : /* Returns true if the output plugin supports streaming, false otherwise. */
    3823             : static inline bool
    3824     3222914 : ReorderBufferCanStream(ReorderBuffer *rb)
    3825             : {
    3826     3222914 :     LogicalDecodingContext *ctx = rb->private_data;
    3827             : 
    3828     3222914 :     return ctx->streaming;
    3829             : }
    3830             : 
    3831             : /* Returns true if streaming can be started now, false otherwise. */
    3832             : static inline bool
    3833      399168 : ReorderBufferCanStartStreaming(ReorderBuffer *rb)
    3834             : {
    3835      399168 :     LogicalDecodingContext *ctx = rb->private_data;
    3836      399168 :     SnapBuild  *builder = ctx->snapshot_builder;
    3837             : 
    3838             :     /* We can't start streaming unless a consistent state is reached. */
    3839      399168 :     if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
    3840           0 :         return false;
    3841             : 
    3842             :     /*
    3843             :      * Even if streaming is enabled, we can't start streaming immediately
    3844             :      * if we previously decoded this transaction and are now just
    3845             :      * restarting.
    3846             :      */
    3847      399168 :     if (ReorderBufferCanStream(rb) &&
    3848      393848 :         !SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
    3849      310436 :         return true;
    3850             : 
    3851       88732 :     return false;
    3852             : }
    3853             : 
    3854             : /*
    3855             :  * Send data of a large transaction (and its subtransactions) to the
    3856             :  * output plugin, but using the stream API.
    3857             :  */
    3858             : static void
    3859         822 : ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
    3860             : {
    3861             :     Snapshot    snapshot_now;
    3862             :     CommandId   command_id;
    3863             :     Size        stream_bytes;
    3864             :     bool        txn_is_streamed;
    3865             : 
    3866             :     /* We can never reach here for a subtransaction. */
    3867             :     Assert(txn->toptxn == NULL);
    3868             : 
    3869             :     /*
    3870             :      * We can't make any assumptions about base snapshot here, similar to what
    3871             :      * ReorderBufferCommit() does. That relies on base_snapshot getting
    3872             :      * transferred from subxact in ReorderBufferCommitChild(), but that was
    3873             :      * not yet called as the transaction is in-progress.
    3874             :      *
    3875             :      * So just walk the subxacts and use the same logic here. But we only need
    3876             :      * to do that once, when the transaction is streamed for the first time.
    3877             :      * After that we need to reuse the snapshot from the previous run.
    3878             :      *
    3879             :      * Unlike DecodeCommit, which adds the xids of all the subtransactions to
    3880             :      * the snapshot's xip array via SnapBuildCommittedTxn, we can't do that
    3881             :      * here; instead we add them to the subxip array via ReorderBufferCopySnap.
    3882             :      * This makes the catalog changes made in subtransactions decoded so far
    3883             :      * visible.
    3884             :      */
    3885         822 :     if (txn->snapshot_now == NULL)
    3886             :     {
    3887             :         dlist_iter  subxact_i;
    3888             : 
    3889             :         /* make sure this transaction is streamed for the first time */
    3890             :         Assert(!rbtxn_is_streamed(txn));
    3891             : 
    3892             :         /* at the beginning we should have invalid command ID */
    3893             :         Assert(txn->command_id == InvalidCommandId);
    3894             : 
    3895          72 :         dlist_foreach(subxact_i, &txn->subtxns)
    3896             :         {
    3897             :             ReorderBufferTXN *subtxn;
    3898             : 
    3899           8 :             subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
    3900           8 :             ReorderBufferTransferSnapToParent(txn, subtxn);
    3901             :         }
    3902             : 
    3903             :         /*
    3904             :          * If this transaction has no snapshot, it didn't make any changes to
    3905             :          * the database till now, so there's nothing to decode.
    3906             :          */
    3907          64 :         if (txn->base_snapshot == NULL)
    3908             :         {
    3909             :             Assert(txn->ninvalidations == 0);
    3910           0 :             return;
    3911             :         }
    3912             : 
    3913          64 :         command_id = FirstCommandId;
    3914          64 :         snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
    3915             :                                              txn, command_id);
    3916             :     }
    3917             :     else
    3918             :     {
    3919             :         /* the transaction must have been already streamed */
    3920             :         Assert(rbtxn_is_streamed(txn));
    3921             : 
    3922             :         /*
    3923             :          * We already have a snapshot from the previous streaming run. We
    3924             :          * assume new subxacts can't move the LSN backwards, and so can't beat
    3925             :          * the LSN condition in the previous branch (so no need to walk
    3926             :          * through subxacts again). In fact, we must not do that as we may be
    3927             :          * using a snapshot half-way through the subxact.
    3928             :          */
    3929         758 :         command_id = txn->command_id;
    3930             : 
    3931             :         /*
    3932             :          * We can't use txn->snapshot_now directly because after the last
    3933             :          * streaming run, we might have got some new sub-transactions. So we
    3934             :          * need to add them to the snapshot.
    3935             :          */
    3936         758 :         snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
    3937             :                                              txn, command_id);
    3938             : 
    3939             :         /* Free the previously copied snapshot. */
    3940             :         Assert(txn->snapshot_now->copied);
    3941         758 :         ReorderBufferFreeSnap(rb, txn->snapshot_now);
    3942         758 :         txn->snapshot_now = NULL;
    3943             :     }
    3944             : 
    3945             :     /*
    3946             :      * Remember this information to be used later to update stats. We can't
    3947             :      * update the stats here as an error while processing the changes would
    3948             :      * lead to the accumulation of stats even though we haven't streamed all
    3949             :      * the changes.
    3950             :      */
    3951         822 :     txn_is_streamed = rbtxn_is_streamed(txn);
    3952         822 :     stream_bytes = txn->total_size;
    3953             : 
    3954             :     /* Process and send the changes to output plugin. */
    3955         822 :     ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
    3956             :                             command_id, true);
    3957             : 
    3958         818 :     rb->streamCount += 1;
    3959         818 :     rb->streamBytes += stream_bytes;
    3960             : 
    3961             :     /* Don't consider already streamed transaction. */
    3962         818 :     rb->streamTxns += (txn_is_streamed) ? 0 : 1;
    3963             : 
    3964             :     /* update the decoding stats */
    3965         818 :     UpdateDecodingStats((LogicalDecodingContext *) rb->private_data);
    3966             : 
    3967             :     Assert(dlist_is_empty(&txn->changes));
    3968             :     Assert(txn->nentries == 0);
    3969             :     Assert(txn->nentries_mem == 0);
    3970             : }
    3971             : 
    3972             : /*
    3973             :  * Size of a change in memory.
    3974             :  */
    3975             : static Size
    3976     6293184 : ReorderBufferChangeSize(ReorderBufferChange *change)
    3977             : {
    3978     6293184 :     Size        sz = sizeof(ReorderBufferChange);
    3979             : 
    3980     6293184 :     switch (change->action)
    3981             :     {
    3982             :             /* fall through these, they're all similar enough */
    3983     6098834 :         case REORDER_BUFFER_CHANGE_INSERT:
    3984             :         case REORDER_BUFFER_CHANGE_UPDATE:
    3985             :         case REORDER_BUFFER_CHANGE_DELETE:
    3986             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    3987             :             {
    3988             :                 ReorderBufferTupleBuf *oldtup,
    3989             :                            *newtup;
    3990     6098834 :                 Size        oldlen = 0;
    3991     6098834 :                 Size        newlen = 0;
    3992             : 
    3993     6098834 :                 oldtup = change->data.tp.oldtuple;
    3994     6098834 :                 newtup = change->data.tp.newtuple;
    3995             : 
    3996     6098834 :                 if (oldtup)
    3997             :                 {
    3998      460550 :                     sz += sizeof(HeapTupleData);
    3999      460550 :                     oldlen = oldtup->tuple.t_len;
    4000      460550 :                     sz += oldlen;
    4001             :                 }
    4002             : 
    4003     6098834 :                 if (newtup)
    4004             :                 {
    4005     5374428 :                     sz += sizeof(HeapTupleData);
    4006     5374428 :                     newlen = newtup->tuple.t_len;
    4007     5374428 :                     sz += newlen;
    4008             :                 }
    4009             : 
    4010     6098834 :                 break;
    4011             :             }
    4012         156 :         case REORDER_BUFFER_CHANGE_MESSAGE:
    4013             :             {
    4014         156 :                 Size        prefix_size = strlen(change->data.msg.prefix) + 1;
    4015             : 
    4016         156 :                 sz += prefix_size + change->data.msg.message_size +
    4017             :                     sizeof(Size) + sizeof(Size);
    4018             : 
    4019         156 :                 break;
    4020             :             }
    4021       12684 :         case REORDER_BUFFER_CHANGE_INVALIDATION:
    4022             :             {
    4023       12684 :                 sz += sizeof(SharedInvalidationMessage) *
    4024       12684 :                     change->data.inval.ninvalidations;
    4025       12684 :                 break;
    4026             :             }
    4027        2694 :         case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
    4028             :             {
    4029             :                 Snapshot    snap;
    4030             : 
    4031        2694 :                 snap = change->data.snapshot;
    4032             : 
    4033        2694 :                 sz += sizeof(SnapshotData) +
    4034        2694 :                     sizeof(TransactionId) * snap->xcnt +
    4035        2694 :                     sizeof(TransactionId) * snap->subxcnt;
    4036             : 
    4037        2694 :                 break;
    4038             :             }
    4039          68 :         case REORDER_BUFFER_CHANGE_TRUNCATE:
    4040             :             {
    4041          68 :                 sz += sizeof(Oid) * change->data.truncate.nrelids;
    4042             : 
    4043          68 :                 break;
    4044             :             }
    4045      178748 :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    4046             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
    4047             :         case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
    4048             :         case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
    4049             :             /* ReorderBufferChange contains everything important */
    4050      178748 :             break;
    4051             :     }
    4052             : 
    4053     6293184 :     return sz;
    4054             : }
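
The accounting rule in ReorderBufferChangeSize() is "fixed struct size plus whatever variable-length payload the action carries". The following is a minimal sketch of that rule, not the PostgreSQL implementation: the Sketch* types and sketch_change_size() are hypothetical, simplified stand-ins that cover only the tuple, message and "nothing extra" cases.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the real reorder buffer types. */
typedef enum
{
    SKETCH_TUPLE,               /* insert/update/delete style change */
    SKETCH_MESSAGE,             /* logical message with a prefix */
    SKETCH_PLAIN                /* struct alone holds everything */
} SketchAction;

typedef struct SketchTuple
{
    size_t      len;            /* length of the tuple payload in bytes */
} SketchTuple;

typedef struct SketchChange
{
    SketchAction action;
    union
    {
        struct
        {
            SketchTuple *oldtuple;
            SketchTuple *newtuple;
        }           tp;
        struct
        {
            const char *prefix;
            size_t      message_size;
        }           msg;
    }           data;
} SketchChange;

/* Fixed struct size plus the variable-length payload of this action. */
static size_t
sketch_change_size(const SketchChange *change)
{
    size_t      sz = sizeof(SketchChange);

    switch (change->action)
    {
        case SKETCH_TUPLE:
            /* each tuple that is present adds its header and its payload */
            if (change->data.tp.oldtuple)
                sz += sizeof(SketchTuple) + change->data.tp.oldtuple->len;
            if (change->data.tp.newtuple)
                sz += sizeof(SketchTuple) + change->data.tp.newtuple->len;
            break;
        case SKETCH_MESSAGE:
            /* prefix with its terminator, the message, two length fields */
            sz += strlen(change->data.msg.prefix) + 1 +
                change->data.msg.message_size +
                sizeof(size_t) + sizeof(size_t);
            break;
        case SKETCH_PLAIN:
            /* nothing beyond the struct itself */
            break;
    }

    return sz;
}
```

A change carrying only a new tuple of 100 bytes would be counted as sizeof(SketchChange) + sizeof(SketchTuple) + 100 under this sketch.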
    4055             : 
    4056             : 
    4057             : /*
    4058             :  * Restore a number of changes spilled to disk back into memory.
    4059             :  */
    4060             : static Size
    4061         178 : ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
    4062             :                             TXNEntryFile *file, XLogSegNo *segno)
    4063             : {
    4064         178 :     Size        restored = 0;
    4065             :     XLogSegNo   last_segno;
    4066             :     dlist_mutable_iter cleanup_iter;
    4067         178 :     File       *fd = &file->vfd;
    4068             : 
    4069             :     Assert(txn->first_lsn != InvalidXLogRecPtr);
    4070             :     Assert(txn->final_lsn != InvalidXLogRecPtr);
    4071             : 
    4072             :     /* free current entries, so we have memory for more */
    4073      300176 :     dlist_foreach_modify(cleanup_iter, &txn->changes)
    4074             :     {
    4075      299998 :         ReorderBufferChange *cleanup =
    4076      299998 :         dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
    4077             : 
    4078      299998 :         dlist_delete(&cleanup->node);
    4079      299998 :         ReorderBufferReturnChange(rb, cleanup, true);
    4080             :     }
    4081         178 :     txn->nentries_mem = 0;
    4082             :     Assert(dlist_is_empty(&txn->changes));
    4083             : 
    4084         178 :     XLByteToSeg(txn->final_lsn, last_segno, wal_segment_size);
    4085             : 
    4086      307264 :     while (restored < max_changes_in_memory && *segno <= last_segno)
    4087             :     {
    4088             :         int         readBytes;
    4089             :         ReorderBufferDiskChange *ondisk;
    4090             : 
    4091      307086 :         if (*fd == -1)
    4092             :         {
    4093             :             char        path[MAXPGPATH];
    4094             : 
    4095             :             /* first time in */
    4096          68 :             if (*segno == 0)
    4097          66 :                 XLByteToSeg(txn->first_lsn, *segno, wal_segment_size);
    4098             : 
    4099             :             Assert(*segno != 0 || dlist_is_empty(&txn->changes));
    4100             : 
    4101             :             /*
    4102             :              * No need to care about TLIs here, only used during a single run,
    4103             :              * so each LSN only maps to a specific WAL record.
    4104             :              */
    4105          68 :             ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid,
    4106             :                                         *segno);
    4107             : 
    4108          68 :             *fd = PathNameOpenFile(path, O_RDONLY | PG_BINARY);
    4109             : 
    4110             :             /* No harm in resetting the offset even in case of failure */
    4111          68 :             file->curOffset = 0;
    4112             : 
    4113          68 :             if (*fd < 0 && errno == ENOENT)
    4114             :             {
    4115           0 :                 *fd = -1;
    4116           0 :                 (*segno)++;
    4117           0 :                 continue;
    4118             :             }
    4119          68 :             else if (*fd < 0)
    4120           0 :                 ereport(ERROR,
    4121             :                         (errcode_for_file_access(),
    4122             :                          errmsg("could not open file \"%s\": %m",
    4123             :                                 path)));
    4124             :         }
    4125             : 
    4126             :         /*
    4127             :          * Read the statically sized part of a change which has information
    4128             :          * about the total size. If we couldn't read a record, we're at the
    4129             :          * end of this file.
    4130             :          */
    4131      307086 :         ReorderBufferSerializeReserve(rb, sizeof(ReorderBufferDiskChange));
    4132      307086 :         readBytes = FileRead(file->vfd, rb->outbuf,
    4133             :                              sizeof(ReorderBufferDiskChange),
    4134             :                              file->curOffset, WAIT_EVENT_REORDER_BUFFER_READ);
    4135             : 
    4136             :         /* eof */
    4137      307086 :         if (readBytes == 0)
    4138             :         {
    4139          68 :             FileClose(*fd);
    4140          68 :             *fd = -1;
    4141          68 :             (*segno)++;
    4142          68 :             continue;
    4143             :         }
    4144      307018 :         else if (readBytes < 0)
    4145           0 :             ereport(ERROR,
    4146             :                     (errcode_for_file_access(),
    4147             :                      errmsg("could not read from reorderbuffer spill file: %m")));
    4148      307018 :         else if (readBytes != sizeof(ReorderBufferDiskChange))
    4149           0 :             ereport(ERROR,
    4150             :                     (errcode_for_file_access(),
    4151             :                      errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes",
    4152             :                             readBytes,
    4153             :                             (uint32) sizeof(ReorderBufferDiskChange))));
    4154             : 
    4155      307018 :         file->curOffset += readBytes;
    4156             : 
    4157      307018 :         ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    4158             : 
    4159      307018 :         ReorderBufferSerializeReserve(rb,
    4160      307018 :                                       sizeof(ReorderBufferDiskChange) + ondisk->size);
    4161      307018 :         ondisk = (ReorderBufferDiskChange *) rb->outbuf;
    4162             : 
    4163      307018 :         readBytes = FileRead(file->vfd,
    4164      307018 :                              rb->outbuf + sizeof(ReorderBufferDiskChange),
    4165      307018 :                              ondisk->size - sizeof(ReorderBufferDiskChange),
    4166             :                              file->curOffset,
    4167             :                              WAIT_EVENT_REORDER_BUFFER_READ);
    4168             : 
    4169      307018 :         if (readBytes < 0)
    4170           0 :             ereport(ERROR,
    4171             :                     (errcode_for_file_access(),
    4172             :                      errmsg("could not read from reorderbuffer spill file: %m")));
    4173      307018 :         else if (readBytes != ondisk->size - sizeof(ReorderBufferDiskChange))
    4174           0 :             ereport(ERROR,
    4175             :                     (errcode_for_file_access(),
    4176             :                      errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes",
    4177             :                             readBytes,
    4178             :                             (uint32) (ondisk->size - sizeof(ReorderBufferDiskChange)))));
    4179             : 
    4180      307018 :         file->curOffset += readBytes;
    4181             : 
    4182             :         /*
    4183             :          * ok, read a full change from disk, now restore it into proper
    4184             :          * in-memory format
    4185             :          */
    4186      307018 :         ReorderBufferRestoreChange(rb, txn, rb->outbuf);
    4187      307018 :         restored++;
    4188             :     }
    4189             : 
    4190         178 :     return restored;
    4191             : }
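
The restore loop above reads each spilled change in two steps: first the statically sized header, which carries the total size, then the remaining bytes. The sketch below shows the same pattern with plain stdio instead of PostgreSQL's virtual file descriptors. SketchDiskHeader and sketch_read_record() are hypothetical names, and the header layout is an assumption made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical on-disk header: 'size' counts the header plus the payload. */
typedef struct SketchDiskHeader
{
    uint32_t    size;
} SketchDiskHeader;

/*
 * Read one length-prefixed record from fp.
 *
 * Returns 1 on success (the caller must free *out), 0 on a clean end of
 * file, and -1 on a read error, a short read, or an allocation failure.
 */
static int
sketch_read_record(FILE *fp, char **out, uint32_t *payload_len)
{
    SketchDiskHeader hdr;
    size_t      nread;
    uint32_t    len;
    char       *buf;

    /* step 1: the fixed-size part, which tells us the total size */
    nread = fread(&hdr, 1, sizeof(hdr), fp);
    if (nread == 0)
        return ferror(fp) ? -1 : 0;
    if (nread != sizeof(hdr) || hdr.size < sizeof(hdr))
        return -1;

    /* step 2: the variable-size remainder */
    len = hdr.size - (uint32_t) sizeof(hdr);
    buf = malloc(len > 0 ? len : 1);
    if (buf == NULL)
        return -1;
    if (fread(buf, 1, len, fp) != len)
    {
        free(buf);
        return -1;
    }

    *out = buf;
    *payload_len = len;
    return 1;
}
```

As in the listing, a zero-byte read of the header is treated as end of file, while any other short read is an error.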
    4192             : 
    4193             : /*
    4194             :  * Convert change from its on-disk format to in-memory format and queue it onto
    4195             :  * the TXN's ->changes list.
    4196             :  *
    4197             :  * Note: although "data" is declared char*, at entry it points to a
    4198             :  * maxalign'd buffer, making it safe in most of this function to assume
    4199             :  * that the pointed-to data is suitably aligned for direct access.
    4200             :  */
    4201             : static void
    4202      307018 : ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
    4203             :                            char *data)
    4204             : {
    4205             :     ReorderBufferDiskChange *ondisk;
    4206             :     ReorderBufferChange *change;
    4207             : 
    4208      307018 :     ondisk = (ReorderBufferDiskChange *) data;
    4209             : 
    4210      307018 :     change = ReorderBufferGetChange(rb);
    4211             : 
    4212             :     /* copy static part */
    4213      307018 :     memcpy(change, &ondisk->change, sizeof(ReorderBufferChange));
    4214             : 
    4215      307018 :     data += sizeof(ReorderBufferDiskChange);
    4216             : 
    4217             :     /* restore individual stuff */
    4218      307018 :     switch (change->action)
    4219             :     {
    4220             :             /* fall through these, they're all similar enough */
    4221      303234 :         case REORDER_BUFFER_CHANGE_INSERT:
    4222             :         case REORDER_BUFFER_CHANGE_UPDATE:
    4223             :         case REORDER_BUFFER_CHANGE_DELETE:
    4224             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
    4225      303234 :             if (change->data.tp.oldtuple)
    4226             :             {
    4227       10012 :                 uint32      tuplelen = ((HeapTuple) data)->t_len;
    4228             : 
    4229       10012 :                 change->data.tp.oldtuple =
    4230       10012 :                     ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader);
    4231             : 
    4232             :                 /* restore ->tuple */
    4233       10012 :                 memcpy(&change->data.tp.oldtuple->tuple, data,
    4234             :                        sizeof(HeapTupleData));
    4235       10012 :                 data += sizeof(HeapTupleData);
    4236             : 
    4237             :                 /* reset t_data pointer into the new tuplebuf */
    4238       10012 :                 change->data.tp.oldtuple->tuple.t_data =
    4239       10012 :                     ReorderBufferTupleBufData(change->data.tp.oldtuple);
    4240             : 
    4241             :                 /* restore tuple data itself */
    4242       10012 :                 memcpy(change->data.tp.oldtuple->tuple.t_data, data, tuplelen);
    4243       10012 :                 data += tuplelen;
    4244             :             }
    4245             : 
    4246      303234 :             if (change->data.tp.newtuple)
    4247             :             {
    4248             :                 /* here, data might not be suitably aligned! */
    4249             :                 uint32      tuplelen;
    4250             : 
    4251      282794 :                 memcpy(&tuplelen, data + offsetof(HeapTupleData, t_len),
    4252             :                        sizeof(uint32));
    4253             : 
    4254      282794 :                 change->data.tp.newtuple =
    4255      282794 :                     ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader);
    4256             : 
    4257             :                 /* restore ->tuple */
    4258      282794 :                 memcpy(&change->data.tp.newtuple->tuple, data,
    4259             :                        sizeof(HeapTupleData));
    4260      282794 :                 data += sizeof(HeapTupleData);
    4261             : 
    4262             :                 /* reset t_data pointer into the new tuplebuf */
    4263      282794 :                 change->data.tp.newtuple->tuple.t_data =
    4264      282794 :                     ReorderBufferTupleBufData(change->data.tp.newtuple);
    4265             : 
    4266             :                 /* restore tuple data itself */
    4267      282794 :                 memcpy(change->data.tp.newtuple->tuple.t_data, data, tuplelen);
    4268      282794 :                 data += tuplelen;
    4269             :             }
    4270             : 
    4271      303234 :             break;
    4272           2 :         case REORDER_BUFFER_CHANGE_MESSAGE:
    4273             :             {
    4274             :                 Size        prefix_size;
    4275             : 
    4276             :                 /* read prefix */
    4277           2 :                 memcpy(&prefix_size, data, sizeof(Size));
    4278           2 :                 data += sizeof(Size);
    4279           2 :                 change->data.msg.prefix = MemoryContextAlloc(rb->context,
    4280             :                                                              prefix_size);
    4281           2 :                 memcpy(change->data.msg.prefix, data, prefix_size);
    4282             :                 Assert(change->data.msg.prefix[prefix_size - 1] == '\0');
    4283           2 :                 data += prefix_size;
    4284             : 
    4285             :                 /* read the message */
    4286           2 :                 memcpy(&change->data.msg.message_size, data, sizeof(Size));
    4287           2 :                 data += sizeof(Size);
    4288           2 :                 change->data.msg.message = MemoryContextAlloc(rb->context,
    4289             :                                                               change->data.msg.message_size);
    4290           2 :                 memcpy(change->data.msg.message, data,
    4291             :                        change->data.msg.message_size);
    4292           2 :                 data += change->data.msg.message_size;
    4293             : 
    4294           2 :                 break;
    4295             :             }
    4296          36 :         case REORDER_BUFFER_CHANGE_INVALIDATION:
    4297             :             {
    4298          36 :                 Size        inval_size = sizeof(SharedInvalidationMessage) *
    4299          36 :                 change->data.inval.ninvalidations;
    4300             : 
    4301          36 :                 change->data.inval.invalidations =
    4302          36 :                     MemoryContextAlloc(rb->context, inval_size);
    4303             : 
    4304             :                 /* read the invalidation messages */
    4305          36 :                 memcpy(change->data.inval.invalidations, data, inval_size);
    4306             : 
    4307          36 :                 break;
    4308             :             }
    4309           4 :         case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
    4310             :             {
    4311             :                 Snapshot    oldsnap;
    4312             :                 Snapshot    newsnap;
    4313             :                 Size        size;
    4314             : 
    4315           4 :                 oldsnap = (Snapshot) data;
    4316             : 
    4317           4 :                 size = sizeof(SnapshotData) +
    4318           4 :                     sizeof(TransactionId) * oldsnap->xcnt +
    4319           4 :                     sizeof(TransactionId) * (oldsnap->subxcnt + 0);
    4320             : 
    4321           4 :                 change->data.snapshot = MemoryContextAllocZero(rb->context, size);
    4322             : 
    4323           4 :                 newsnap = change->data.snapshot;
    4324             : 
    4325           4 :                 memcpy(newsnap, data, size);
    4326           4 :                 newsnap->xip = (TransactionId *)
    4327             :                     (((char *) newsnap) + sizeof(SnapshotData));
    4328           4 :                 newsnap->subxip = newsnap->xip + newsnap->xcnt;
    4329           4 :                 newsnap->copied = true;
    4330           4 :                 break;
    4331             :             }
    4332             :             /* restore the array of truncated relation OIDs */
    4333           0 :         case REORDER_BUFFER_CHANGE_TRUNCATE:
    4334             :             {
    4335             :                 Oid        *relids;
    4336             : 
    4337           0 :                 relids = ReorderBufferGetRelids(rb,
    4338           0 :                                                 change->data.truncate.nrelids);
    4339           0 :                 memcpy(relids, data, change->data.truncate.nrelids * sizeof(Oid));
    4340           0 :                 change->data.truncate.relids = relids;
    4341             : 
    4342           0 :                 break;
    4343             :             }
    4344        3742 :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
    4345             :         case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
    4346             :         case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
    4347             :         case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
    4348        3742 :             break;
    4349             :     }
    4350             : 
    4351      307018 :     dlist_push_tail(&txn->changes, &change->node);
    4352      307018 :     txn->nentries_mem++;
    4353             : 
    4354             :     /*
    4355             :      * Update memory accounting for the restored change.  We need to do this
    4356             :      * although we don't check the memory limit when restoring the changes in
    4357             :      * this branch (we only do that when initially queueing the changes after
    4358             :      * decoding), because we will release the changes later, and that will
    4359             :      * update the accounting too (subtracting the size from the counters). And
    4360             :      * we don't want to underflow there.
    4361             :      */
    4362      307018 :     ReorderBufferChangeMemoryUpdate(rb, change, true,
    4363             :                                     ReorderBufferChangeSize(change));
    4364      307018 : }
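
In the newtuple branch above, "data" may no longer be suitably aligned, so the tuple length is fetched with memcpy() and offsetof() rather than by dereferencing a cast pointer. A minimal sketch of that technique follows. SketchTupleHeader and sketch_peek_len() are hypothetical, and the header layout is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical serialized header; only t_len matters for this sketch. */
typedef struct SketchTupleHeader
{
    uint32_t    t_flags;
    uint32_t    t_len;
} SketchTupleHeader;

/*
 * Fetch t_len from a serialized header that may start at any byte offset.
 * Copying the bytes out avoids an unaligned load through a cast pointer.
 */
static uint32_t
sketch_peek_len(const char *data)
{
    uint32_t    len;

    memcpy(&len, data + offsetof(SketchTupleHeader, t_len), sizeof(len));
    return len;
}
```

The copy costs a few bytes of traffic, but it stays well defined on platforms that trap or misbehave on unaligned loads.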
    4365             : 
    4366             : /*
    4367             :  * Remove all on-disk data stored for the passed-in transaction.
    4368             :  */
    4369             : static void
    4370         424 : ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
    4371             : {
    4372             :     XLogSegNo   first;
    4373             :     XLogSegNo   cur;
    4374             :     XLogSegNo   last;
    4375             : 
    4376             :     Assert(txn->first_lsn != InvalidXLogRecPtr);
    4377             :     Assert(txn->final_lsn != InvalidXLogRecPtr);
    4378             : 
    4379         424 :     XLByteToSeg(txn->first_lsn, first, wal_segment_size);
    4380         424 :     XLByteToSeg(txn->final_lsn, last, wal_segment_size);
    4381             : 
    4382             :     /* iterate over all possible filenames, and delete them */
    4383         854 :     for (cur = first; cur <= last; cur++)
    4384             :     {
    4385             :         char        path[MAXPGPATH];
    4386             : 
    4387         430 :         ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, cur);
    4388         430 :         if (unlink(path) != 0 && errno != ENOENT)
    4389           0 :             ereport(ERROR,
    4390             :                     (errcode_for_file_access(),
    4391             :                      errmsg("could not remove file \"%s\": %m", path)));
    4392             :     }
    4393         424 : }
    4394             : 
    4395             : /*
    4396             :  * Remove any leftover serialized reorder buffers from a slot directory after a
    4397             :  * prior crash or decoding session exit.
    4398             :  */
    4399             : static void
    4400        2040 : ReorderBufferCleanupSerializedTXNs(const char *slotname)
    4401             : {
    4402             :     DIR        *spill_dir;
    4403             :     struct dirent *spill_de;
    4404             :     struct stat statbuf;
    4405             :     char        path[MAXPGPATH * 2 + 12];
    4406             : 
    4407        2040 :     sprintf(path, "pg_replslot/%s", slotname);
    4408             : 
    4409             :     /* we're only handling directories here, skip if it's not ours */
    4410        2040 :     if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
    4411           0 :         return;
    4412             : 
    4413        2040 :     spill_dir = AllocateDir(path);
    4414        8160 :     while ((spill_de = ReadDirExtended(spill_dir, path, INFO)) != NULL)
    4415             :     {
    4416             :         /* only look at names that can be ours */
    4417        6120 :         if (strncmp(spill_de->d_name, "xid", 3) == 0)
    4418             :         {
    4419           0 :             snprintf(path, sizeof(path),
    4420             :                      "pg_replslot/%s/%s", slotname,
    4421           0 :                      spill_de->d_name);
    4422             : 
    4423           0 :             if (unlink(path) != 0)
    4424           0 :                 ereport(ERROR,
    4425             :                         (errcode_for_file_access(),
    4426             :                          errmsg("could not remove file \"%s\" during removal of pg_replslot/%s/xid*: %m",
    4427             :                                 path, slotname)));
    4428             :         }
    4429             :     }
    4430        2040 :     FreeDir(spill_dir);
    4431             : }
    4432             : 
    4433             : /*
    4434             :  * Given a replication slot, transaction ID and segment number, write the
    4435             :  * path of the corresponding spill file into 'path', which is a caller-owned
    4436             :  * buffer of size at least MAXPGPATH.
    4437             :  */
    4438             : static void
    4439        6248 : ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid,
    4440             :                             XLogSegNo segno)
    4441             : {
    4442             :     XLogRecPtr  recptr;
    4443             : 
    4444        6248 :     XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, recptr);
    4445             : 
    4446        6248 :     snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.spill",
    4447        6248 :              NameStr(MyReplicationSlot->data.name),
    4448        6248 :              xid, LSN_FORMAT_ARGS(recptr));
    4449        6248 : }
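
The spill file name encodes the transaction ID and the LSN at which the WAL segment starts, printed as two 32-bit hexadecimal halves. The sketch below reproduces that naming outside the server. It assumes the segment start is segno * wal_segment_size and that the LSN is split into its high and low 32 bits, which is how the macros used above are understood to behave. sketch_spill_path() and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "pg_replslot/<slot>/xid-<xid>-lsn-<hi>-<lo>.spill" for the WAL
 * segment 'segno', assuming segments of 'wal_segment_size' bytes.
 */
static void
sketch_spill_path(char *path, size_t pathlen, const char *slotname,
                  uint32_t xid, uint64_t segno, uint64_t wal_segment_size)
{
    /* LSN of the first byte of the segment (assumed: offset 0) */
    uint64_t    recptr = segno * wal_segment_size;

    snprintf(path, pathlen, "pg_replslot/%s/xid-%u-lsn-%X-%X.spill",
             slotname,
             (unsigned int) xid,
             (unsigned int) (recptr >> 32),
             (unsigned int) (recptr & 0xFFFFFFFFu));
}
```

With 16 MB segments, slot "myslot", xid 742 and segment 3, the sketch yields pg_replslot/myslot/xid-742-lsn-0-3000000.spill.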
    4450             : 
    4451             : /*
    4452             :  * Delete all data spilled to disk after we've restarted/crashed. It will be
    4453             :  * recreated when the respective slots are reused.
    4454             :  */
    4455             : void
    4456        1860 : StartupReorderBuffer(void)
    4457             : {
    4458             :     DIR        *logical_dir;
    4459             :     struct dirent *logical_de;
    4460             : 
    4461        1860 :     logical_dir = AllocateDir("pg_replslot");
    4462        5626 :     while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
    4463             :     {
    4464        3766 :         if (strcmp(logical_de->d_name, ".") == 0 ||
    4465        1906 :             strcmp(logical_de->d_name, "..") == 0)
    4466        3720 :             continue;
    4467             : 
    4468             :         /* if it cannot be a slot, skip the directory */
    4469          46 :         if (!ReplicationSlotValidateName(logical_de->d_name, DEBUG2))
    4470           0 :             continue;
    4471             : 
    4472             :         /*
    4473             :          * ok, has to be a surviving logical slot, iterate and delete
    4474             :          * everything starting with xid-*
    4475             :          */
    4476          46 :         ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
    4477             :     }
    4478        1860 :     FreeDir(logical_dir);
    4479        1860 : }
    4480             : 
    4481             : /* ---------------------------------------
    4482             :  * toast reassembly support
    4483             :  * ---------------------------------------
    4484             :  */
    4485             : 
    4486             : /*
    4487             :  * Initialize per tuple toast reconstruction support.
    4488             :  */
    4489             : static void
    4490          66 : ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
    4491             : {
    4492             :     HASHCTL     hash_ctl;
    4493             : 
    4494             :     Assert(txn->toast_hash == NULL);
    4495             : 
    4496          66 :     hash_ctl.keysize = sizeof(Oid);
    4497          66 :     hash_ctl.entrysize = sizeof(ReorderBufferToastEnt);
    4498          66 :     hash_ctl.hcxt = rb->context;
    4499          66 :     txn->toast_hash = hash_create("ReorderBufferToastHash", 5, &hash_ctl,
    4500             :                                   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
    4501          66 : }
    4502             : 
    4503             : /*
    4504             :  * Per toast-chunk handling for toast reconstruction
    4505             :  *
    4506             :  * Appends a toast chunk so we can reconstruct it when the tuple "owning" the
    4507             :  * toasted Datum comes along.
    4508             :  */
    4509             : static void
    4510        3440 : ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
    4511             :                               Relation relation, ReorderBufferChange *change)
    4512             : {
    4513             :     ReorderBufferToastEnt *ent;
    4514             :     ReorderBufferTupleBuf *newtup;
    4515             :     bool        found;
    4516             :     int32       chunksize;
    4517             :     bool        isnull;
    4518             :     Pointer     chunk;
    4519        3440 :     TupleDesc   desc = RelationGetDescr(relation);
    4520             :     Oid         chunk_id;
    4521             :     int32       chunk_seq;
    4522             : 
    4523        3440 :     if (txn->toast_hash == NULL)
    4524          66 :         ReorderBufferToastInitHash(rb, txn);
    4525             : 
    4526             :     Assert(IsToastRelation(relation));
    4527             : 
    4528        3440 :     newtup = change->data.tp.newtuple;
    4529        3440 :     chunk_id = DatumGetObjectId(fastgetattr(&newtup->tuple, 1, desc, &isnull));
    4530             :     Assert(!isnull);
    4531        3440 :     chunk_seq = DatumGetInt32(fastgetattr(&newtup->tuple, 2, desc, &isnull));
    4532             :     Assert(!isnull);
    4533             : 
    4534             :     ent = (ReorderBufferToastEnt *)
    4535        3440 :         hash_search(txn->toast_hash,
    4536             :                     (void *) &chunk_id,
    4537             :                     HASH_ENTER,
    4538             :                     &found);
    4539             : 
    4540        3440 :     if (!found)
    4541             :     {
    4542             :         Assert(ent->chunk_id == chunk_id);
    4543          94 :         ent->num_chunks = 0;
    4544          94 :         ent->last_chunk_seq = 0;
    4545          94 :         ent->size = 0;
    4546          94 :         ent->reconstructed = NULL;
    4547          94 :         dlist_init(&ent->chunks);
    4548             : 
    4549          94 :         if (chunk_seq != 0)
    4550           0 :             elog(ERROR, "got sequence entry %d for toast chunk %u instead of seq 0",
    4551             :                  chunk_seq, chunk_id);
    4552             :     }
    4553        3346 :     else if (found && chunk_seq != ent->last_chunk_seq + 1)
    4554           0 :         elog(ERROR, "got sequence entry %d for toast chunk %u instead of seq %d",
    4555             :              chunk_seq, chunk_id, ent->last_chunk_seq + 1);
    4556             : 
    4557        3440 :     chunk = DatumGetPointer(fastgetattr(&newtup->tuple, 3, desc, &isnull));
    4558             :     Assert(!isnull);
    4559             : 
    4560             :     /* calculate size so we can allocate the right size at once later */
    4561        3440 :     if (!VARATT_IS_EXTENDED(chunk))
    4562        3440 :         chunksize = VARSIZE(chunk) - VARHDRSZ;
    4563           0 :     else if (VARATT_IS_SHORT(chunk))
    4564             :         /* could happen due to heap_form_tuple doing its thing */
    4565           0 :         chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
    4566             :     else
    4567           0 :         elog(ERROR, "unexpected type of toast chunk");
    4568             : 
    4569        3440 :     ent->size += chunksize;
    4570        3440 :     ent->last_chunk_seq = chunk_seq;
    4571        3440 :     ent->num_chunks++;
    4572        3440 :     dlist_push_tail(&ent->chunks, &change->node);
    4573        3440 : }
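The sequence check in ReorderBufferToastAppendChunk() boils down to: the first chunk seen for a value must have sequence number 0, and every later chunk must follow its predecessor directly. The following is a minimal sketch of that bookkeeping under simplifying assumptions: ToastEntSketch and toast_append_chunk() are hypothetical names, a "seen" flag replaces the hash table's "found" result, and a mismatch is reported through the return value instead of elog(ERROR).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified counterpart of ReorderBufferToastEnt */
typedef struct ToastEntSketch
{
    bool        seen;           /* false until the first chunk arrives */
    int32_t     last_chunk_seq; /* sequence number of the latest chunk */
    int32_t     num_chunks;     /* chunks accepted so far */
    size_t      size;           /* combined payload size of those chunks */
} ToastEntSketch;

/*
 * Accept a chunk only if it is the next one in sequence. Returns false and
 * leaves the entry untouched when the sequence number is unexpected.
 */
static bool
toast_append_chunk(ToastEntSketch *ent, int32_t chunk_seq, size_t chunksize)
{
    int32_t     expected = ent->seen ? ent->last_chunk_seq + 1 : 0;

    if (chunk_seq != expected)
        return false;

    ent->seen = true;
    ent->last_chunk_seq = chunk_seq;
    ent->num_chunks++;
    ent->size += chunksize;     /* lets the caller allocate once, later */
    return true;
}
```

Summing the sizes while the chunks arrive is what allows the reconstruction step to allocate the whole datum in one go.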
    4574             : 
    4575             : /*
    4576             :  * Rejigger change->newtuple to point to in-memory toast tuples instead of
    4577             :  * on-disk toast tuples that may no longer exist (think DROP TABLE or VACUUM).
    4578             :  *
    4579             :  * We cannot replace unchanged toast tuples though, so those will still point
    4580             :  * to on-disk toast data.
    4581             :  *
    4582             :  * While updating the existing change with detoasted tuple data, we need to
    4583             :  * update the memory accounting info, because the change size will differ.
    4584             :  * Otherwise the accounting may get out of sync, triggering serialization
    4585             :  * at unexpected times.
    4586             :  *
    4587             :  * We simply subtract the size of the change before rejiggering the tuple,
    4588             :  * and then add the new size. This makes it look like the change was removed
    4589             :  * and then added back, except it only tweaks the accounting info.
    4590             :  *
    4591             :  * In particular it can't trigger serialization, which would be pointless
    4592             :  * anyway as it happens during commit processing right before handing
    4593             :  * the change to the output plugin.
    4594             :  */
    4595             : static void
    4596      606114 : ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
    4597             :                           Relation relation, ReorderBufferChange *change)
    4598             : {
    4599             :     TupleDesc   desc;
    4600             :     int         natt;
    4601             :     Datum      *attrs;
    4602             :     bool       *isnull;
    4603             :     bool       *free;
    4604             :     HeapTuple   tmphtup;
    4605             :     Relation    toast_rel;
    4606             :     TupleDesc   toast_desc;
    4607             :     MemoryContext oldcontext;
    4608             :     ReorderBufferTupleBuf *newtup;
    4609             :     Size        old_size;
    4610             : 
    4611             :     /* no toast tuples changed */
    4612      606114 :     if (txn->toast_hash == NULL)
    4613      605624 :         return;
    4614             : 
    4615             :     /*
    4616             :      * We're going to modify the size of the change. So, to make sure the
    4617             :      * accounting is correct we record the current change size and then after
    4618             :      * re-computing the change we'll subtract the recorded size and then
    4619             :      * re-add the new change size at the end. We don't immediately subtract
    4620             :      * the old size because if there is any error before we add the new size,
    4621             :      * we will release the changes and that will update the accounting info
    4622             :      * (subtracting the size from the counters). And we don't want to
    4623             :      * underflow there.
    4624             :      */
    4625         490 :     old_size = ReorderBufferChangeSize(change);
    4626             : 
    4627         490 :     oldcontext = MemoryContextSwitchTo(rb->context);
    4628             : 
    4629             :     /* we should only have toast tuples in an INSERT or UPDATE */
    4630             :     Assert(change->data.tp.newtuple);
    4631             : 
    4632         490 :     desc = RelationGetDescr(relation);
    4633             : 
    4634         490 :     toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
    4635         490 :     if (!RelationIsValid(toast_rel))
    4636           0 :         elog(ERROR, "could not open toast relation with OID %u (base relation \"%s\")",
    4637             :              relation->rd_rel->reltoastrelid, RelationGetRelationName(relation));
    4638             : 
    4639         490 :     toast_desc = RelationGetDescr(toast_rel);
    4640             : 
    4641             :     /* should we allocate from stack instead? */
    4642         490 :     attrs = palloc0(sizeof(Datum) * desc->natts);
    4643         490 :     isnull = palloc0(sizeof(bool) * desc->natts);
    4644         490 :     free = palloc0(sizeof(bool) * desc->natts);
    4645             : 
    4646         490 :     newtup = change->data.tp.newtuple;
    4647             : 
    4648         490 :     heap_deform_tuple(&newtup->tuple, desc, attrs, isnull);
    4649             : 
    4650        1510 :     for (natt = 0; natt < desc->natts; natt++)
    4651             :     {
    4652        1020 :         Form_pg_attribute attr = TupleDescAttr(desc, natt);
    4653             :         ReorderBufferToastEnt *ent;
    4654             :         struct varlena *varlena;
    4655             : 
    4656             :         /* va_rawsize is the size of the original datum -- including header */
    4657             :         struct varatt_external toast_pointer;
    4658             :         struct varatt_indirect redirect_pointer;
    4659        1020 :         struct varlena *new_datum = NULL;
    4660             :         struct varlena *reconstructed;
    4661             :         dlist_iter  it;
    4662        1020 :         Size        data_done = 0;
    4663             : 
    4664             :         /* system columns aren't toasted */
    4665        1020 :         if (attr->attnum < 0)
    4666         926 :             continue;
    4667             : 
    4668        1020 :         if (attr->attisdropped)
    4669           0 :             continue;
    4670             : 
    4671             :         /* not a varlena datatype */
    4672        1020 :         if (attr->attlen != -1)
    4673         482 :             continue;
    4674             : 
    4675             :         /* no data */
    4676         538 :         if (isnull[natt])
    4677          24 :             continue;
    4678             : 
    4679             :         /* ok, we know we have a toast datum */
    4680         514 :         varlena = (struct varlena *) DatumGetPointer(attrs[natt]);
    4681             : 
    4682             :         /* no need to do anything if the datum isn't external */
    4683         514 :         if (!VARATT_IS_EXTERNAL(varlena))
    4684         404 :             continue;
    4685             : 
    4686         110 :         VARATT_EXTERNAL_GET_POINTER(toast_pointer, varlena);
    4687             : 
    4688             :         /*
    4689             :          * Check whether the toast tuple changed, replace if so.
    4690             :          */
    4691             :         ent = (ReorderBufferToastEnt *)
    4692         110 :             hash_search(txn->toast_hash,
    4693             :                         (void *) &toast_pointer.va_valueid,
    4694             :                         HASH_FIND,
    4695             :                         NULL);
    4696         110 :         if (ent == NULL)
    4697          16 :             continue;
    4698             : 
    4699             :         new_datum =
    4700          94 :             (struct varlena *) palloc0(INDIRECT_POINTER_SIZE);
    4701             : 
    4702          94 :         free[natt] = true;
    4703             : 
    4704          94 :         reconstructed = palloc0(toast_pointer.va_rawsize);
    4705             : 
    4706          94 :         ent->reconstructed = reconstructed;
    4707             : 
    4708             :         /* stitch toast tuple back together from its parts */
    4709        3534 :         dlist_foreach(it, &ent->chunks)
    4710             :         {
    4711             :             bool        isnull;
    4712             :             ReorderBufferChange *cchange;
    4713             :             ReorderBufferTupleBuf *ctup;
    4714             :             Pointer     chunk;
    4715             : 
    4716        3440 :             cchange = dlist_container(ReorderBufferChange, node, it.cur);
    4717        3440 :             ctup = cchange->data.tp.newtuple;
    4718        3440 :             chunk = DatumGetPointer(fastgetattr(&ctup->tuple, 3, toast_desc, &isnull));
    4719             : 
    4720             :             Assert(!isnull);
    4721             :             Assert(!VARATT_IS_EXTERNAL(chunk));
    4722             :             Assert(!VARATT_IS_SHORT(chunk));
    4723             : 
    4724        3440 :             memcpy(VARDATA(reconstructed) + data_done,
    4725        3440 :                    VARDATA(chunk),
    4726        3440 :                    VARSIZE(chunk) - VARHDRSZ);
    4727        3440 :             data_done += VARSIZE(chunk) - VARHDRSZ;
    4728             :         }
    4729             :         Assert(data_done == VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer));
    4730             : 
    4731             :         /* make sure it's marked as compressed or not */
    4732          94 :         if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
    4733          10 :             SET_VARSIZE_COMPRESSED(reconstructed, data_done + VARHDRSZ);
    4734             :         else
    4735          84 :             SET_VARSIZE(reconstructed, data_done + VARHDRSZ);
    4736             : 
    4737          94 :         memset(&redirect_pointer, 0, sizeof(redirect_pointer));
    4738          94 :         redirect_pointer.pointer = reconstructed;
    4739             : 
    4740          94 :         SET_VARTAG_EXTERNAL(new_datum, VARTAG_INDIRECT);
    4741          94 :         memcpy(VARDATA_EXTERNAL(new_datum), &redirect_pointer,
    4742             :                sizeof(redirect_pointer));
    4743             : 
    4744          94 :         attrs[natt] = PointerGetDatum(new_datum);
    4745             :     }
    4746             : 
    4747             :     /*
    4748             :      * Build tuple in separate memory & copy tuple back into the tuplebuf
    4749             :      * passed to the output plugin. We can't directly heap_fill_tuple() into
    4750             :      * the tuplebuf because attrs[] will point back into the current content.
    4751             :      */
    4752         490 :     tmphtup = heap_form_tuple(desc, attrs, isnull);
    4753             :     Assert(newtup->tuple.t_len <= MaxHeapTupleSize);
    4754             :     Assert(ReorderBufferTupleBufData(newtup) == newtup->tuple.t_data);
    4755             : 
    4756         490 :     memcpy(newtup->tuple.t_data, tmphtup->t_data, tmphtup->t_len);
    4757         490 :     newtup->tuple.t_len = tmphtup->t_len;
    4758             : 
    4759             :     /*
    4760             :      * free resources we no longer need; more persistent stuff will be
    4761             :      * freed in ReorderBufferToastReset().
    4762             :      */
    4763         490 :     RelationClose(toast_rel);
    4764         490 :     pfree(tmphtup);
    4765        1510 :     for (natt = 0; natt < desc->natts; natt++)
    4766             :     {
    4767        1020 :         if (free[natt])
    4768          94 :             pfree(DatumGetPointer(attrs[natt]));
    4769             :     }
    4770         490 :     pfree(attrs);
    4771         490 :     pfree(free);
    4772         490 :     pfree(isnull);
    4773             : 
    4774         490 :     MemoryContextSwitchTo(oldcontext);
    4775             : 
    4776             :     /* subtract the old change size */
    4777         490 :     ReorderBufferChangeMemoryUpdate(rb, change, false, old_size);
    4778             :     /* now add the change back, with the correct size */
    4779         490 :     ReorderBufferChangeMemoryUpdate(rb, change, true,
    4780             :                                     ReorderBufferChangeSize(change));
    4781             : }
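The stitching loop in ReorderBufferToastReplace() copies each chunk payload to the running offset data_done and then checks that the total matches the size recorded in the toast pointer. The following is a minimal sketch of that loop with the varlena headers left out: ChunkSketch and stitch_chunks() are hypothetical names, chunks are given as a plain array instead of a dlist, and a size mismatch is returned to the caller where the backend uses an Assert.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk descriptor: payload pointer plus payload length */
typedef struct ChunkSketch
{
    const char *data;
    size_t      len;
} ChunkSketch;

/*
 * Copy the chunks back to back into dst, which must have room for
 * 'expected' bytes. Returns true only if the chunks add up to exactly
 * 'expected' bytes.
 */
static bool
stitch_chunks(char *dst, size_t expected,
              const ChunkSketch *chunks, size_t nchunks)
{
    size_t      data_done = 0;

    for (size_t i = 0; i < nchunks; i++)
    {
        /* refuse to write past the end of the destination buffer */
        if (chunks[i].len > expected - data_done)
            return false;

        memcpy(dst + data_done, chunks[i].data, chunks[i].len);
        data_done += chunks[i].len;
    }

    return data_done == expected;
}
```

The bounds check inside the loop is an addition of this sketch; the backend relies on the chunk sizes having been validated when the chunks were appended.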
    4782             : 
    4783             : /*
    4784             :  * Free all resources allocated for toast reconstruction.
    4785             :  */
    4786             : static void
    4787      610426 : ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
    4788             : {
    4789             :     HASH_SEQ_STATUS hstat;
    4790             :     ReorderBufferToastEnt *ent;
    4791             : 
    4792      610426 :     if (txn->toast_hash == NULL)
    4793      610360 :         return;
    4794             : 
    4795             :     /* sequentially walk over the hash and free everything */
    4796          66 :     hash_seq_init(&hstat, txn->toast_hash);
    4797         160 :     while ((ent = (ReorderBufferToastEnt *) hash_seq_search(&hstat)) != NULL)
    4798             :     {
    4799             :         dlist_mutable_iter it;
    4800             : 
    4801          94 :         if (ent->reconstructed != NULL)
    4802          94 :             pfree(ent->reconstructed);
    4803             : 
    4804        3534 :         dlist_foreach_modify(it, &ent->chunks)
    4805             :         {
    4806        3440 :             ReorderBufferChange *change =
    4807        3440 :             dlist_container(ReorderBufferChange, node, it.cur);
    4808             : 
    4809        3440 :             dlist_delete(&change->node);
    4810        3440 :             ReorderBufferReturnChange(rb, change, true);
    4811             :         }
    4812             :     }
    4813             : 
    4814          66 :     hash_destroy(txn->toast_hash);
    4815          66 :     txn->toast_hash = NULL;
    4816             : }
    4817             : 
    4818             : 
    4819             : /* ---------------------------------------
    4820             :  * Visibility support for logical decoding
    4821             :  *
    4822             :  *
    4823             :  * Lookup actual cmin/cmax values when using decoding snapshot. We can't
    4824             :  * always rely on stored cmin/cmax values because of two scenarios:
    4825             :  *
    4826             :  * * A tuple got changed multiple times during a single transaction and thus
    4827             :  *   has got a combo CID. Combo CIDs are only valid for the duration of a
    4828             :  *   single transaction.
    4829             :  * * A tuple with a cmin but no cmax (and thus no combo CID) got
    4830             :  *   deleted/updated in a transaction other than the one that created it,
    4831             :  *   namely the one we are looking at right now. As only one of cmin, cmax
    4832             :  *   or combo CID is actually stored in the heap, we don't have access to
    4833             :  *   the value we need anymore.
    4834             :  *
    4835             :  * To resolve those problems we have a per-transaction hash of (cmin,
    4836             :  * cmax) tuples keyed by (relfilenode, ctid) which contains the actual
    4837             :  * (cmin, cmax) values. That also takes care of combo CIDs by simply
    4838             :  * not caring about them at all. As we have the real cmin/cmax values
    4839             :  * combo CIDs aren't interesting.
    4840             :  *
    4841             :  * As we only care about catalog tuples here the overhead of this
    4842             :  * hashtable should be acceptable.
    4843             :  *
    4844             :  * Heap rewrites complicate this a bit, check rewriteheap.c for
    4845             :  * details.
    4846             :  * -------------------------------------------------------------------------
    4847             :  */
    4848             : 
    4849             : /* struct for sorting mapping files by LSN efficiently */
    4850             : typedef struct RewriteMappingFile
    4851             : {
    4852             :     XLogRecPtr  lsn;
    4853             :     char        fname[MAXPGPATH];
    4854             : } RewriteMappingFile;
    4855             : 
    4856             : #ifdef NOT_USED
    4857             : static void
    4858             : DisplayMapping(HTAB *tuplecid_data)
    4859             : {
    4860             :     HASH_SEQ_STATUS hstat;
    4861             :     ReorderBufferTupleCidEnt *ent;
    4862             : 
    4863             :     hash_seq_init(&hstat, tuplecid_data);
    4864             :     while ((ent = (ReorderBufferTupleCidEnt *) hash_seq_search(&hstat)) != NULL)
    4865             :     {
    4866             :         elog(DEBUG3, "mapping: node: %u/%u/%u tid: %u/%u cmin: %u, cmax: %u",
    4867             :              ent->key.relnode.dbNode,
    4868             :              ent->key.relnode.spcNode,
    4869             :              ent->key.relnode.relNode,
    4870             :              ItemPointerGetBlockNumber(&ent->key.tid),
    4871             :              ItemPointerGetOffsetNumber(&ent->key.tid),
    4872             :              ent->cmin,
    4873             :              ent->cmax
    4874             :             );
    4875             :     }
    4876             : }
    4877             : #endif
    4878             : 
    4879             : /*
    4880             :  * Apply a single mapping file to tuplecid_data.
    4881             :  *
    4882             :  * The mapping file has to have been verified to be a) committed b) for our
    4883             :  * transaction c) applied in LSN order.
    4884             :  */
    4885             : static void
    4886          44 : ApplyLogicalMappingFile(HTAB *tuplecid_data, Oid relid, const char *fname)
    4887             : {
    4888             :     char        path[MAXPGPATH];
    4889             :     int         fd;
    4890             :     int         readBytes;
    4891             :     LogicalRewriteMappingData map;
    4892             : 
    4893          44 :     sprintf(path, "pg_logical/mappings/%s", fname);
    4894          44 :     fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
    4895          44 :     if (fd < 0)
    4896           0 :         ereport(ERROR,
    4897             :                 (errcode_for_file_access(),
    4898             :                  errmsg("could not open file \"%s\": %m", path)));
    4899             : 
    4900             :     while (true)
    4901         238 :     {
    4902             :         ReorderBufferTupleCidKey key;
    4903             :         ReorderBufferTupleCidEnt *ent;
    4904             :         ReorderBufferTupleCidEnt *new_ent;
    4905             :         bool        found;
    4906             : 
    4907             :         /* be careful about padding */
    4908         282 :         memset(&key, 0, sizeof(ReorderBufferTupleCidKey));
    4909             : 
    4910             :         /* read all mappings till the end of the file */
    4911         282 :         pgstat_report_wait_start(WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ);
    4912         282 :         readBytes = read(fd, &map, sizeof(LogicalRewriteMappingData));
    4913         282 :         pgstat_report_wait_end();
    4914             : 
    4915         282 :         if (readBytes < 0)
    4916           0 :             ereport(ERROR,
    4917             :                     (errcode_for_file_access(),
    4918             :                      errmsg("could not read file \"%s\": %m",
    4919             :                             path)));
    4920         282 :         else if (readBytes == 0)    /* EOF */
    4921          44 :             break;
    4922         238 :         else if (readBytes != sizeof(LogicalRewriteMappingData))
    4923           0 :             ereport(ERROR,
    4924             :                     (errcode_for_file_access(),
    4925             :                      errmsg("could not read from file \"%s\": read %d instead of %d bytes",
    4926             :                             path, readBytes,
    4927             :                             (int32) sizeof(LogicalRewriteMappingData))));
    4928             : 
    4929         238 :         key.relnode = map.old_node;
    4930         238 :         ItemPointerCopy(&map.old_tid,
    4931             :                         &key.tid);
    4932             : 
    4933             : 
    4934             :         ent = (ReorderBufferTupleCidEnt *)
    4935         238 :             hash_search(tuplecid_data,
    4936             :                         (void *) &key,
    4937             :                         HASH_FIND,
    4938             :                         NULL);
    4939             : 
    4940             :         /* no existing mapping, no need to update */
    4941         238 :         if (!ent)
    4942           0 :             continue;
    4943             : 
    4944         238 :         key.relnode = map.new_node;
    4945         238 :         ItemPointerCopy(&map.new_tid,
    4946             :                         &key.tid);
    4947             : 
    4948             :         new_ent = (ReorderBufferTupleCidEnt *)
    4949         238 :             hash_search(tuplecid_data,
    4950             :                         (void *) &key,
    4951             :                         HASH_ENTER,
    4952             :                         &found);
    4953             : 
    4954         238 :         if (found)
    4955             :         {
    4956             :             /*
    4957             :              * Make sure the existing mapping makes sense. We sometimes update
    4958             :              * old records that did not yet have a cmax (e.g. pg_class' own
    4959             :              * entry while rewriting it) during rewrites, so allow that.
    4960             :              */
    4961             :             Assert(ent->cmin == InvalidCommandId || ent->cmin == new_ent->cmin);
    4962             :             Assert(ent->cmax == InvalidCommandId || ent->cmax == new_ent->cmax);
    4963             :         }
    4964             :         else
    4965             :         {
    4966             :             /* update mapping */
    4967         226 :             new_ent->cmin = ent->cmin;
    4968         226 :             new_ent->cmax = ent->cmax;
    4969         226 :             new_ent->combocid = ent->combocid;
    4970             :         }
    4971             :     }
    4972             : 
    4973          44 :     if (CloseTransientFile(fd) != 0)
    4974           0 :         ereport(ERROR,
    4975             :                 (errcode_for_file_access(),
    4976             :                  errmsg("could not close file \"%s\": %m", path)));
    4977          44 : }
    4978             : 
    4979             : 
    4980             : /*
    4981             :  * Check whether the TransactionId 'xid' is in the pre-sorted array 'xip'.
    4982             :  */
    4983             : static bool
    4984         580 : TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
    4985             : {
    4986         580 :     return bsearch(&xid, xip, num,
    4987         580 :                    sizeof(TransactionId), xidComparator) != NULL;
    4988             : }
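TransactionIdInArray() is a thin wrapper around the C library's bsearch(), so it only gives correct answers when the array is sorted with the same ordering the comparator implements. The following is a minimal standalone sketch: xid_cmp() and xid_in_sorted_array() are hypothetical names, uint32_t stands in for TransactionId, and the comparator is a plain numeric comparison of unsigned values.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Plain numeric three-way comparison of two 32-bit transaction IDs */
static int
xid_cmp(const void *a, const void *b)
{
    uint32_t    xa = *(const uint32_t *) a;
    uint32_t    xb = *(const uint32_t *) b;

    if (xa < xb)
        return -1;
    if (xa > xb)
        return 1;
    return 0;
}

/* 'xip' must already be sorted in ascending order according to xid_cmp() */
static bool
xid_in_sorted_array(uint32_t xid, const uint32_t *xip, size_t num)
{
    return bsearch(&xid, xip, num, sizeof(uint32_t), xid_cmp) != NULL;
}
```

The lookup costs O(log num) comparisons, which matters because it runs once per candidate mapping file.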
    4989             : 
    4990             : /*
    4991             :  * list_sort() comparator for sorting RewriteMappingFiles in LSN order.
    4992             :  */
    4993             : static int
    4994          58 : file_sort_by_lsn(const ListCell *a_p, const ListCell *b_p)
    4995             : {
    4996          58 :     RewriteMappingFile *a = (RewriteMappingFile *) lfirst(a_p);
    4997          58 :     RewriteMappingFile *b = (RewriteMappingFile *) lfirst(b_p);
    4998             : 
    4999          58 :     if (a->lsn < b->lsn)
    5000          24 :         return -1;
    5001          34 :     else if (a->lsn > b->lsn)
    5002          34 :         return 1;
    5003           0 :     return 0;
    5004             : }
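file_sort_by_lsn() spells out the three-way comparison instead of returning a difference. That matters for 64-bit LSNs: a subtraction truncated to int can yield the wrong sign. The following is a minimal sketch of the same ordering using qsort() on a plain array instead of list_sort() on a List; MappingFileSketch, cmp_by_lsn() and sort_by_lsn() are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified counterpart of RewriteMappingFile */
typedef struct MappingFileSketch
{
    uint64_t    lsn;
    const char *fname;
} MappingFileSketch;

/* qsort() comparator: ascending LSN, written without subtraction */
static int
cmp_by_lsn(const void *a_p, const void *b_p)
{
    const MappingFileSketch *a = (const MappingFileSketch *) a_p;
    const MappingFileSketch *b = (const MappingFileSketch *) b_p;

    if (a->lsn < b->lsn)
        return -1;
    if (a->lsn > b->lsn)
        return 1;
    return 0;
}

/* Sort the files in place so they can be applied in LSN order */
static void
sort_by_lsn(MappingFileSketch *files, size_t nfiles)
{
    qsort(files, nfiles, sizeof(MappingFileSketch), cmp_by_lsn);
}
```

Neither qsort() nor this comparator says anything about the relative order of entries with equal LSNs, so callers should not depend on it.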
    5005             : 
    5006             : /*
    5007             :  * Apply any existing logical remapping files if there are any targeted at our
    5008             :  * transaction for relid.
    5009             :  */
    5010             : static void
    5011          10 : UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
    5012             : {
    5013             :     DIR        *mapping_dir;
    5014             :     struct dirent *mapping_de;
    5015          10 :     List       *files = NIL;
    5016             :     ListCell   *file;
    5017          10 :     Oid         dboid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
    5018             : 
    5019          10 :     mapping_dir = AllocateDir("pg_logical/mappings");
    5020         920 :     while ((mapping_de = ReadDir(mapping_dir, "pg_logical/mappings")) != NULL)
    5021             :     {
    5022             :         Oid         f_dboid;
    5023             :         Oid         f_relid;
    5024             :         TransactionId f_mapped_xid;
    5025             :         TransactionId f_create_xid;
    5026             :         XLogRecPtr  f_lsn;
    5027             :         uint32      f_hi,
    5028             :                     f_lo;
    5029             :         RewriteMappingFile *f;
    5030             : 
    5031         910 :         if (strcmp(mapping_de->d_name, ".") == 0 ||
    5032         900 :             strcmp(mapping_de->d_name, "..") == 0)
    5033         866 :             continue;
    5034             : 
    5035             :         /* Ignore files that aren't ours */
    5036         890 :         if (strncmp(mapping_de->d_name, "map-", 4) != 0)
    5037           0 :             continue;
    5038             : 
    5039         890 :         if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
    5040             :                    &f_dboid, &f_relid, &f_hi, &f_lo,
    5041             :                    &f_mapped_xid, &f_create_xid) != 6)
    5042           0 :             elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
    5043             : 
    5044         890 :         f_lsn = ((uint64) f_hi) << 32 | f_lo;
    5045             : 
    5046             :         /* mapping for another database */
    5047         890 :         if (f_dboid != dboid)
    5048           0 :             continue;
    5049             : 
    5050             :         /* mapping for another relation */
    5051         890 :         if (f_relid != relid)
    5052          90 :             continue;
    5053             : 
    5054             :         /* did the creating transaction abort? */
    5055         800 :         if (!TransactionIdDidCommit(f_create_xid))
    5056         220 :             continue;
    5057             : 
    5058             :         /* not for our transaction */
    5059         580 :         if (!TransactionIdInArray(f_mapped_xid, snapshot->subxip, snapshot->subxcnt))
    5060         536 :             continue;
    5061             : 
    5062             :         /* ok, relevant, queue for apply */
    5063          44 :         f = palloc(sizeof(RewriteMappingFile));
    5064          44 :         f->lsn = f_lsn;
    5065          44 :         strcpy(f->fname, mapping_de->d_name);
    5066          44 :         files = lappend(files, f);
    5067             :     }
    5068          10 :     FreeDir(mapping_dir);
    5069             : 
    5070             :     /* sort files so we apply them in LSN order */
    5071          10 :     list_sort(files, file_sort_by_lsn);
    5072             : 
    5073          54 :     foreach(file, files)
    5074             :     {
    5075          44 :         RewriteMappingFile *f = (RewriteMappingFile *) lfirst(file);
    5076             : 
    5077          44 :         elog(DEBUG1, "applying mapping: \"%s\" in %u", f->fname,
    5078             :              snapshot->subxip[0]);
    5079          44 :         ApplyLogicalMappingFile(tuplecid_data, relid, f->fname);
    5080          44 :         pfree(f);
    5081             :     }
    5082          10 : }
    5083             : 
    5084             : /*
    5085             :  * Look up cmin/cmax of a tuple during logical decoding, where we can't rely
    5086             :  * on combo CIDs.
    5087             :  */
    5088             : bool
    5089        1066 : ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
    5090             :                               Snapshot snapshot,
    5091             :                               HeapTuple htup, Buffer buffer,
    5092             :                               CommandId *cmin, CommandId *cmax)
    5093             : {
    5094             :     ReorderBufferTupleCidKey key;
    5095             :     ReorderBufferTupleCidEnt *ent;
    5096             :     ForkNumber  forkno;
    5097             :     BlockNumber blockno;
    5098        1066 :     bool        updated_mapping = false;
    5099             : 
    5100             :     /*
    5101             :      * Return unresolved if tuplecid_data is not valid.  That's because when
    5102             :      * streaming in-progress transactions we may run into tuples with the CID
    5103             :      * before actually decoding them.  Think e.g. about INSERT followed by
    5104             :      * TRUNCATE, where the TRUNCATE may not be decoded yet when applying the
    5105             :      * INSERT.  So in such cases, we assume the CID is from a future
    5106             :      * command.
    5107             :      */
    5108        1066 :     if (tuplecid_data == NULL)
    5109          18 :         return false;
    5110             : 
    5111             :     /* be careful about padding */
    5112        1048 :     memset(&key, 0, sizeof(key));
    5113             : 
    5114             :     Assert(!BufferIsLocal(buffer));
    5115             : 
    5116             :     /*
    5117             :      * Get the relfilenode from the buffer; there is no other convenient way
    5118             :      * to access it.
    5119             :      */
    5120        1048 :     BufferGetTag(buffer, &key.relnode, &forkno, &blockno);
    5121             : 
    5122             :     /* tuples can only be in the main fork */
    5123             :     Assert(forkno == MAIN_FORKNUM);
    5124             :     Assert(blockno == ItemPointerGetBlockNumber(&htup->t_self));
    5125             : 
    5126        1048 :     ItemPointerCopy(&htup->t_self,
    5127             :                     &key.tid);
    5128             : 
    5129        1058 : restart:
    5130             :     ent = (ReorderBufferTupleCidEnt *)
    5131        1058 :         hash_search(tuplecid_data,
    5132             :                     (void *) &key,
    5133             :                     HASH_FIND,
    5134             :                     NULL);
    5135             : 
    5136             :     /*
    5137             :      * Failed to find a mapping. Check whether the table was rewritten and
    5138             :      * apply the mappings if so, but only do that once: there can be no new
    5139             :      * mappings while we are in here, since we have to hold a lock on the
    5140             :      * relation.
    5141             :      */
    5142        1058 :     if (ent == NULL && !updated_mapping)
    5143             :     {
    5144          10 :         UpdateLogicalMappings(tuplecid_data, htup->t_tableOid, snapshot);
    5145             :         /* now check but don't update for a mapping again */
    5146          10 :         updated_mapping = true;
    5147          10 :         goto restart;
    5148             :     }
    5149        1048 :     else if (ent == NULL)
    5150           0 :         return false;
    5151             : 
    5152        1048 :     if (cmin)
    5153        1048 :         *cmin = ent->cmin;
    5154        1048 :     if (cmax)
    5155        1048 :         *cmax = ent->cmax;
    5156        1048 :     return true;
    5157             : }