偷天换日 | PostgreSQL伪造事务日志

vlambda
2022-04-30

偷天换日 | PostgreSQL伪造事务日志

PostgreSQL高可用实战

京东

前言

今天一位朋友找到我，查询报错了，提示

could not access status of transaction xxx

could not open file "pg_commit_ts/68F9"：No such file of directory

光看报错，和事务提交日志有关，可能度娘会告诉你使用dd去抹掉，在不少书籍中也有类似教程：

但是光会用dd就可以了吗？没那么简单，接下来让我来带各位深入浅出事务提交日志，以及如何偷天换日欺骗PostgreSQL。

分析

首先，报错中提示了pg_commit_ts，这个目录的作用是用于存储事务提交的时间戳

Subdirectory containing transaction commit timestamp data

默认情况下该目录下是没有内容的

[postgres@xiongcc ~]$ ll pgdata/pg_commit_ts/
total 0

需要开启 track_commit_timestamp 这个参数，才会记录事务提交时的时间戳，看个例子：

[postgres@xiongcc ~]$ psql
psql (14.2)
Type "help" for help.

postgres=# show track_commit_timestamp ;
 track_commit_timestamp 
------------------------
 on
(1 row)

postgres=# begin;
BEGIN
postgres=*# insert into t1 values(1);
INSERT 0 1
postgres=*# select txid_current();
 txid_current 
--------------
         3534
(1 row)

postgres=*# commit ;
COMMIT
postgres=# select pg_xact_commit_timestamp('3534');
   pg_xact_commit_timestamp    
-------------------------------
 2022-04-26 20:02:17.363029+08
(1 row)

可以看到，通过这个参数，我们可以得到事务的提交时间。不过类似于buffer pool，脏数据不会立即写出到磁盘上

[postgres@xiongcc pg_commit_ts]$ stat 0000 
  File: ‘0000’
  Size: 40960           Blocks: 16         IO Block: 4096   regular file
Device: fd02h/64770d    Inode: 82932       Links: 1
Access: (0600/-rw-------)  Uid: ( 1000/postgres)   Gid: ( 1000/postgres)
Access: 2022-04-26 20:01:43.810000000 +0800
Modify: 2022-04-26 20:01:28.639000000 +0800
Change: 2022-04-26 20:01:28.639000000 +0800
 Birth: -

做一个checkpoint之后，数据才会落盘

[postgres@xiongcc pg_commit_ts]$ stat 0000 
  File: ‘0000’
  Size: 40960           Blocks: 16         IO Block: 4096   regular file
Device: fd02h/64770d    Inode: 82932       Links: 1
Access: (0600/-rw-------)  Uid: ( 1000/postgres)   Gid: ( 1000/postgres)
Access: 2022-04-26 20:01:43.810000000 +0800
Modify: 2022-04-26 20:03:38.730000000 +0800
Change: 2022-04-26 20:03:38.730000000 +0800
 Birth: -
 
[postgres@xiongcc pg_commit_ts]$ hexdump -C 0000
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00008a10  00 00 00 00 55 4e 9d 3b  8c 80 02 00 00 00 00 00  |....UN.;........|
00008a20  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0000a000

如何存储

通过hexdump看到一串不知所以然的字符，让我们十分抓狂，我们需要了解每一个字节是什么意思。

关于commit timestamp（以下简称commit_ts）的源码位于src\backend\access\transam\commit_ts.c处。

首先，我们需要了解commit_ts是如何定义的，才能知道是如何存储的

/*
 * Defines for CommitTs page sizes.  A page is the same BLCKSZ as is used
 * everywhere else in Postgres.
 *
 * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
 * CommitTs page numbering also wraps around at
 * 0xFFFFFFFF/COMMIT_TS_XACTS_PER_PAGE, and CommitTs segment numbering at
 * 0xFFFFFFFF/COMMIT_TS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
 * explicit notice of that fact in this module, except when comparing segment
 * and page numbers in TruncateCommitTs (see CommitTsPagePrecedes).
 */

/*
 * We need 8+2 bytes per xact.  Note that enlarging this struct might mean
 * the largest possible file name is more than 5 chars long; see
 * SlruScanDirectory.
 */
typedef struct CommitTimestampEntry
{
    TimestampTz time;
    RepOriginId nodeid;
} CommitTimestampEntry;

源码里写的很清楚，总共需要 8+2，10个字节来存储，其中时间戳字段TimestampTz是typedef int64 TimestampTz，是64 bit存储的，我简单翻了下源码，发现各个时间类型存储的格式有些差异：

• timestamp with time zone 和 timestamp without time zone 均是64bit，8字节的整形
• time with time zone 有一个额外的4字节，用于表示时区的偏移量，总共12字节
• time without time zone 也是8字节的整形

typedef int64 Timestamp;
typedef int64 TimestampTz;

typedef int32 DateADT;

typedef int64 TimeADT;

typedef struct
{
    TimeADT  time;   /* all time units other than months and years */
    int32  zone;   /* numeric time zone, in seconds */
} TimeTzADT;

当然，直接查询pg_type系统表也可以得到字节数

postgres=# select typname,typlen from pg_type where typname in ('timestamp','timestamptz','time','timetz');
   typname   | typlen 
-------------+--------
 time        |      8
 timestamp   |      8
 timestamptz |      8
 timetz      |     12
(4 rows)

OK，到这里我们知道了，前8个字节表示的是时间戳，参照我们这个例子，55 4e 9d 3b 8c 80 02 00，另外由于我的机器是小端序，因此要"倒着"看

[postgres@xiongcc pg_commit_ts]$ echo $((0x0002808c3b9d4e55))
704289737363029

换算成十进制是704289737363029这个数字，在官网上还有这么一段说明，所有的时间类型内部均转化为 UTC 存放

All timezone-aware dates and times are stored internally in UTC. They are converted to local time in the zone specified by the TimeZone configuration parameter before being displayed to the client.

同时这个数字表示的是从2000年1月1号，直至该时间戳的微秒数，因此让我们换算一下

postgres=# select timestamp with time zone '2000-01-01 00:00:00' + 704289737363029/1000/1000 * interval '1 second' as commit_ts;
       commit_ts        
------------------------
 2022-04-26 12:02:17+08
(1 row)

因为是UTC的时区，和北京时间相差8小时，还要加上8小时

postgres=# select timestamp with time zone '2000-01-01 00:00:00' + 704289737363029/1000/1000 * interval '1 second' + interval '8 h' as commit_ts;
       commit_ts        
------------------------
 2022-04-26 20:02:17+08
(1 row)

postgres=# select pg_xact_commit_timestamp('3534');
   pg_xact_commit_timestamp    
-------------------------------
 2022-04-26 20:02:17.363029+08
(1 row)

可以看到，我们自己计算出来的值和通过函数获取出来的时间完全吻合。时间推算出来了，那么第二个RepOriginId nodeid 是什么东西？

postgres=# select * from pg_xact_commit_timestamp_origin('3534');
           timestamp           | roident 
-------------------------------+---------
 2022-04-26 20:02:17.363029+08 |       0
(1 row)

这个其实和之前 logical decoding 类似，用于标识数据的来源

/*  
 * Replay progress of a single remote node.  
 */  
typedef struct ReplicationState  
{  
        /*  
         * Local identifier for the remote node.  
         */  
        RepOriginId roident;  

        /*  
         * Location of the latest commit from the remote side.  
         */  
        XLogRecPtr      remote_lsn;  

        /*  
         * Remember the local lsn of the commit record so we can XLogFlush() to it  
         * during a checkpoint so we know the commit record actually is safe on  
         * disk.  
         */  
        XLogRecPtr      local_lsn;  

        /*  
         * PID of backend that's acquired slot, or 0 if none.  
         */  
        int                     acquired_by;  

        /*  
         * Lock protecting remote_lsn and local_lsn.  
         */  
        LWLock          lock;  
} ReplicationState;

通过SQL查询到 roident是0，因此也就是剩下的0x0000了。底层存储了解了之后，我们再看看是commit_ts，首先是大小

/*
 * Number of shared CommitTS buffers.
 *
 * We use a very similar logic as for the number of CLOG buffers; see comments
 * in CLOGShmemBuffers.
 */
Size
CommitTsShmemBuffers(void)
{
    return Min(16, Max(4, NBuffers / 1024));
}

和CLOG buffer十分类似，NBuffers就是配置的shared buffers，默认128MB，因此默认分配的commit_ts大小是 Min(16, Max(4, 128MB / 1024))

postgres=# select * from pg_shmem_allocations where name ilike '%commit%';
      name       |   off   |  size  | allocated_size 
-----------------+---------+--------+----------------
 CommitTs        | 4791040 | 133568 |         133632
 CommitTs shared | 4924672 |     32 |            128
(2 rows)

然后会根据track_commit_timestamp参数，决定是否启用该模块

/*
 * Activate or deactivate CommitTs' upon reception of a XLOG_PARAMETER_CHANGE
 * XLog record during recovery.
 */
void
CommitTsParameterChange(bool newvalue, bool oldvalue)
{
    /*
     * If the commit_ts module is disabled in this server and we get word from
     * the primary server that it is enabled there, activate it so that we can
     * replay future WAL records involving it; also mark it as active on
     * pg_control.  If the old value was already set, we already did this, so
     * don't do anything.
     *
     * If the module is disabled in the primary, disable it here too, unless
     * the module is enabled locally.
     *
     * Note this only runs in the recovery process, so an unlocked read is
     * fine.
     */
    if (newvalue)
    {
        if (!commitTsShared->commitTsActive)
            ActivateCommitTs();
    }
    else if (commitTsShared->commitTsActive)
        DeactivateCommitTs();
}


/*
 * Activate this module whenever necessary.
 *  This must happen during postmaster or standalone-backend startup,
 *  or during WAL replay anytime the track_commit_timestamp setting is
 *  changed in the primary.
 *
 * The reason why this SLRU needs separate activation/deactivation functions is
 * that it can be enabled/disabled during start and the activation/deactivation
 * on the primary is propagated to the standby via replay. Other SLRUs don't
 * have this property and they can be just initialized during normal startup.
 *
 * This is in charge of creating the currently active segment, if it's not
 * already there.  The reason for this is that the server might have been
 * running with this module disabled for a while and thus might have skipped
 * the normal creation point.
 */
static void
ActivateCommitTs(void)
{
    TransactionId xid;
    int   pageno;
...
...

何时清理

那么不妨再思考一下，开启了track_commit_timestamp这个参数，会对每一个事务记录commit_ts。因此不难想象，倘若一直记录，也会将磁盘撑爆，用脚趾都能想到，PostgreSQL肯定有自己的清理机制。

简单用pgbench压测一下，SQL很简单，目的是为了不断推进事务号

[postgres@xiongcc ~]$ cat bench.sql 
insert into t1 values(1);
[postgres@xiongcc ~]$ pgbench -f bench.sql -T 3000 -c 10 -j 20
pgbench (14.2)
starting vacuum...end.

不一会儿，pg_commit_ts目录下的文件数量就涨上来了

[postgres@xiongcc ~]$ ll pgdata/pg_commit_ts/ | wc -l
75
[postgres@xiongcc ~]$ ll pgdata/pg_commit_ts/ | wc -l
80
[postgres@xiongcc ~]$ ll pgdata/pg_commit_ts/ | wc -l
94

但是又过了一会之后，文件数量突然就降下来了

[postgres@xiongcc pg_commit_ts]$ ls -l | wc -l
5
[postgres@xiongcc pg_commit_ts]$ ll
total 1008
-rw------- 1 postgres postgres 262144 Apr 26 22:33 0071
-rw------- 1 postgres postgres 262144 Apr 26 22:33 0072
-rw------- 1 postgres postgres 262144 Apr 26 22:33 0073
-rw------- 1 postgres postgres 229376 Apr 26 22:35 0074

再去查看事务状态，也看不到了

postgres=# select pg_xact_commit_timestamp('3534');
   pg_xact_commit_timestamp    
-------------------------------
 2022-04-26 20:02:17.363029+08
(1 row)

postgres=# select pg_xact_commit_timestamp('3534');
 pg_xact_commit_timestamp 
--------------------------
 
(1 row)

那么是何时清理的这个文件呢？

track_commit_timestamp

翻了一下源码，首先第一个函数是 DeactivateCommitTs，逻辑很简单，当关闭了track_commit_timestamp之后，便会清理shared buffers中的commit_ts buffer，以及移除pg_commit_ts目录下的所有文件

/*
 * Deactivate this module.
 *
 * This must be called when the track_commit_timestamp parameter is turned off.
 * This happens during postmaster or standalone-backend startup, or during WAL
 * replay.
 *
 * Resets CommitTs into invalid state to make sure we don't hand back
 * possibly-invalid data; also removes segments of old data.
 */
static void
DeactivateCommitTs(void)
{
    /*
     * Cleanup the status in the shared memory.
     *
     * We reset everything in the commitTsShared record to prevent user from
     * getting confusing data about last committed transaction on the standby
     * when the module was activated repeatedly on the primary.
     */
    LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);

    commitTsShared->commitTsActive = false;
    commitTsShared->xidLastCommit = InvalidTransactionId;
    TIMESTAMP_NOBEGIN(commitTsShared->dataLastCommit.time);
    commitTsShared->dataLastCommit.nodeid = InvalidRepOriginId;

    ShmemVariableCache->oldestCommitTsXid = InvalidTransactionId;
    ShmemVariableCache->newestCommitTsXid = InvalidTransactionId;

    LWLockRelease(CommitTsLock);

    /*
     * Remove *all* files.  This is necessary so that there are no leftover
     * files; in the case where this feature is later enabled after running
     * with it disabled for some time there may be a gap in the file sequence.
     * (We can probably tolerate out-of-sequence files, as they are going to
     * be overwritten anyway when we wrap around, but it seems better to be
     * tidy.)
     */
    LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE);
    (void) SlruScanDirectory(CommitTsCtl, SlruScanDirCbDeleteAll, NULL);
    LWLockRelease(CommitTsSLRULock);
}

关闭track_commit_timestamp，再重启一下

[postgres@xiongcc ~]$ ll pgdata/pg_commit_ts/
total 1024
-rw------- 1 postgres postgres 262144 Apr 26 23:05 00EA
-rw------- 1 postgres postgres 262144 Apr 26 23:05 00EB
-rw------- 1 postgres postgres 262144 Apr 26 23:05 00EC
-rw------- 1 postgres postgres 245760 Apr 26 23:10 00ED

[postgres@xiongcc ~]$ pg_ctl -D pgdata/ restart
waiting for server to shut down.... done
server stopped
waiting for server to start....2022-04-27 09:26:59.741 CST [20461] LOG:  redirecting log output to logging collector process
2022-04-27 09:26:59.741 CST [20461] HINT:  Future log output will appear in directory "log".
 done
server started
[postgres@xiongcc ~]$ ll pgdata/pg_commit_ts/
total 0

TruncateCommitTs

在commit_ts.c里面，还有一个函数是TruncateCommitTs，会根据传入的事务ID，移除相应的commit_ts segment。

/*
 * Remove all CommitTs segments before the one holding the passed
 * transaction ID.
 *
 * Note that we don't need to flush XLOG here.
 */
void
TruncateCommitTs(TransactionId oldestXact)
{
    int   cutoffPage;

    /*
     * The cutoff point is the start of the segment containing oldestXact. We
     * pass the *page* containing oldestXact to SimpleLruTruncate.
     */
    cutoffPage = TransactionIdToCTsPage(oldestXact);

    /* Check to see if there's any files that could be removed */
    if (!SlruScanDirectory(CommitTsCtl, SlruScanDirCbReportPresence,
                           &cutoffPage))
        return;     /* nothing to remove */

    /* Write XLOG record */
    WriteTruncateXlogRec(cutoffPage, oldestXact);

    /* Now we can remove the old CommitTs segment(s) */
    SimpleLruTruncate(CommitTsCtl, cutoffPage);
}

这个函数是由vac_truncate_clog()所调用的，调用链 vac_update_datfrozenxid() -> vac_truncate_clog() -> TruncateCommitTs()，可以看到，和冻结相关，更新pg_database中的datfrozenxid

/*
 * vac_truncate_clog() -- attempt to truncate the commit log
 *
 *  Scan pg_database to determine the system-wide oldest datfrozenxid,
 *  and use it to truncate the transaction commit log (pg_xact).
 *  Also update the XID wrap limit info maintained by varsup.c.
 *  Likewise for datminmxid.
 *
 *  The passed frozenXID and minMulti are the updated values for my own
 *  pg_database entry. They're used to initialize the "min" calculations.
 *  The caller also passes the "last sane" XID and MXID, since it has
 *  those at hand already.
 *
 *  This routine is only invoked when we've managed to change our
 *  DB's datfrozenxid/datminmxid values, or we found that the shared
 *  XID-wrap-limit info is stale.
 */
static void
vac_truncate_clog(TransactionId frozenXID,
                  MultiXactId minMulti,
                  TransactionId lastSaneFrozenXid,
                  MultiXactId lastSaneMinMulti)
    
 ...
      * Truncate CLOG, multixact and CommitTs to the oldest computed value.
     */
    TruncateCLOG(frozenXID, oldestxid_datoid);
    TruncateCommitTs(frozenXID);
    TruncateMultiXact(minMulti, minmulti_datoid);

同时，在 https://www.postgresql.org/docs/current/routine-vacuuming.html官网也提及到这一点，也是和freeze有关

The sole disadvantage of increasing autovacuum_freeze_max_age (and vacuum_freeze_table_age along with it) is that the pg_xact and pg_commit_ts subdirectories of the database cluster will take more space, because it must store the commit status and (if track_commit_timestamp is enabled) timestamp of all transactions back to the autovacuum_freeze_max_age horizon. The commit status uses two bits per transaction, so if autovacuum_freeze_max_age is set to its maximum allowed value of two billion, pg_xact can be expected to grow to about half a gigabyte and pg_commit_ts to about 20GB. If this is trivial compared to your total database size, setting autovacuum_freeze_max_age to its maximum allowed value is recommended. Otherwise, set it depending on what you are willing to allow for pg_xact and pg_commit_ts storage. (The default, 200 million transactions, translates to about 50MB of pg_xact storage and about 2GB of pg_commit_ts storage.)

如果把autovacuum_freeze_max_age设置为20亿，pg_xact大约占500MB，pg_commit_ts大约是20GB，而默认情况下是2亿，pg_xact大约占50MB，pg_commit_ts大约是2GB，对于存储敏感的用户，也要考虑进去。

因此根据代码和文档说明，数据库在做freeze的时候，会择机去清理CLOG、CommitTS。让我们看个例子，为了观察，将autovacuum_freeze_max_age参数设到最小，以更频繁地触发freeze：

postgres=# show autovacuum_freeze_max_age ;
 autovacuum_freeze_max_age 
---------------------------
 100000
(1 row)

再使用pgbench进行压测，同时另外一个窗口观察pg_stat_activity

postgres= select * from pg_stat_activity where query ilike '%vacuum%' and pid <> pg_backend_pid()\watch 1;
 datid | datname  |  pid  | leader_pid | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type | wait_event  | state  | backend_xid | backend_xmin | query_id |                            query                             |   backend_type    
-------+----------+-------+------------+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+-------------+--------+-------------+--------------+----------+--------------------------------------------------------------+-------------------
 13892 | postgres | 21349 |            |          |         |                  |             |                 |             | 2022-04-27 09:51:22.711741+08 | 2022-04-27 09:51:22.912726+08 | 2022-04-27 09:51:22.912726+08 | 2022-04-27 09:51:22.912726+08 | Timeout         | VacuumDelay | active |             |      6637029 |          | autovacuum: VACUUM ANALYZE public.t1 (to prevent wraparound) | autovacuum worker
(1 row)

可以看到，在 09:51:22 的时候，对表t1自动进行了freeze，同时观察pg_commit_ts，在09:51:26，00F3之前的六个数据文件被移除，因此证实了我们的想法，在freeze的时候会去回收commit_ts。

[postgres@xiongcc ~]$ ll pgdata/pg_commit_ts/;date
total 4004
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00ED
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00EE
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00EF
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F0
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F1
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F2
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F3
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F4
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F5
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F6
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00F7
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00F8
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00F9
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00FA
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00FB
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00FC
-rw------- 1 postgres postgres  73728 Apr 27 09:51 00FD
Wed Apr 27 09:51:26 CST 2022
[postgres@xiongcc ~]$ ll pgdata/pg_commit_ts/;date
total 2772
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F3
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F4
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F5
-rw------- 1 postgres postgres 262144 Apr 27 09:50 00F6
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00F7
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00F8
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00F9
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00FA
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00FB
-rw------- 1 postgres postgres 262144 Apr 27 09:51 00FC
-rw------- 1 postgres postgres 172032 Apr 27 09:51 00FD
Wed Apr 27 09:51:28 CST 2022

深入理解

搞清楚原理之后，我们再回过头来看这个错

could not access status of transaction xxx

could not open file "pg_commit_ts/68F9"：No such file of directory

我们在判断可见性的时候需要去获取事务状态，CLOG也存在BUFFER，同时还有infomask的存在，所以假如CLOG BUFFER没有获取到事务状态，就要去加载具体的文件，由于一个事务的提交状态占2位，因此一个block可以存放 8 * 1024 * 8 /2 = 32768，所以，光是BUFFER便可以存很多事务了，再加上infomask，获取事务状态的效率还是很高的。

postgres=# select * from pg_shmem_allocations where name ilike '%XACT%';
          name          |   off   |  size  | allocated_size 
------------------------+---------+--------+----------------
 Shared MultiXact State |  696448 |   1052 |           1152
 MultiXactMember        |  562816 | 133568 |         133632
 PredXactList           | 4014208 |     88 |            128
 Xact                   |  129024 |  66240 |          66304
 MultiXactOffset        |  496000 |  66816 |          66816
(5 rows)

让我们简单看一下CLOG的代码

/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)

#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE) 
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE) )
#define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)

是不是十分眼熟？与通过 LSN 定位 WAL 日志的流程太像了，LSN -1 除以16M 除以256，16M等于2的24次方，除下来就是LSN的高32位对应于logid，也就是斜杠左边的，logseg是LSN -1 除以16M除以256的余数，对应于logseg，即FF，剩下的后6位，也就是偏移量。也就是说商就是logseg，余数就是WAL的偏移量。

偷天换日 | PostgreSQL伪造事务日志 — Fig. 9.6. Transaction log and WAL segment files

而CLOG也是如此，典型地将文件当哈希用。每当需要获取事务状态的时候，就直接根据当前事务ID计算得到对应的CLOG PAGE。

• #define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)：事务ID存在于哪页，xid/block*4
• #define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)：事务ID在页中的偏移量，xid%(block*4)
• #define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE) )：事务ID在页内的第几个字节，xid%(block*4)/4
• #define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)：字节中的第几个比特，xid%4

每个事务状态需要两个比特位存放

#define TRANSACTION_STATUS_IN_PROGRESS  0x00
#define TRANSACTION_STATUS_COMMITTED  0x01
#define TRANSACTION_STATUS_ABORTED   0x02
#define TRANSACTION_STATUS_SUB_COMMITTED  0x03

为了演示，我将autovacuum_freeze_max_age设置到最大20亿，同时重新初始化一个实例，让事务ID重新开始，开启四个事务均不提交

postgres=# begin;
BEGIN
postgres=*# select txid_current();
 txid_current 
--------------
          733
(1 row)

postgres=# begin;
BEGIN
postgres=*# select txid_current();
 txid_current 
--------------
          734
(1 row)

postgres=# begin;
BEGIN
postgres=*# select txid_current();
 txid_current 
--------------
          735
(1 row)

postgres=# begin;
BEGIN
postgres=*# select txid_current();
 txid_current 
--------------
          736
(1 row)

然后按照方式计算一下

postgres=# select 733/8192*4,734/8192*4,735/8192*4,736/8192*4; ---四个事务都在第0页
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
        0 |        0 |        0 |        0
(1 row)

postgres=# select 733%(8192*4),734%(8192*4),735%(8192*4),736%(8192*4); ---四个事务的偏移量
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
      733 |      734 |      735 |      736
(1 row)

postgres=# select 733%(8192*4)/4,734%(8192*4)/4,735%(8192*4)/4,736%(8192*4)/4; ---页内第几个字节
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
      183 |      183 |      183 |      184
(1 row)

postgres=# select 733%4,734%4,735%4,736%4; ---字节内偏移量
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
        1 |        2 |        3 |        0
(1 row)

postgres=# select 732%4,733%4,734%4,735%4,736%4;
 ?column? | ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------+----------
        0 |        1 |        2 |        3 |        0
(1 row)

然后做个checkpoint，让CLOG刷盘。

[postgres@xiongcc pg_xact]$ hexdump -C 0000 -s 183 -n 2
000000b7  01 00                                             |..|
000000b9

地址	00	00	00	01
事务ID	735	734	733	732
事务状态	0x00，IN_PROGRESS	0x00，IN_PROGRESS	0x00，IN_PROGRESS	0x01，COMMITTED

postgres=# select txid_status('732');
 txid_status 
-------------
 committed
(1 row)

postgres=# select txid_status('733');
 txid_status 
-------------
 in progress
(1 row)

postgres=# select txid_status('734');
 txid_status 
-------------
 in progress
(1 row)

根据字节，表示733和734事务都是IN_PROGRESS的状态，即还没有提交。

让我们回滚733事务，提交734事务再看一下（记得做个checkpoint）

postgres=# checkpoint ;
CHECKPOINT
postgres=# select txid_status('733');
 txid_status 
-------------
 aborted
(1 row)

postgres=# select txid_status('734');
 txid_status 
-------------
 committed
(1 row)

[postgres@xiongcc pg_xact]$ hexdump -C 0000 -s 183 -n 2
000000b7  19 00                                             |..|
000000b9

可以看到，这次变成了19，换算成二进制是00011001

地址	00	01	10	01
事务ID	735	734	733	732
事务状态	0x00，IN_PROGRESS	0x01，COMMITTED	0x02，ABORTED	0x01，COMMITTED

符合预期和操作结果。同理，前面分析了commit_ts需要10个字节，总共734个事务，所以偏移量是7340

[postgres@xiongcc pg_commit_ts]$ hexdump -C 0000 -s 7340 -n 10
00001cac  aa a0 59 73 9c 80 02 00  00 00                    |..Ys......|
00001cb6

也就是

[postgres@xiongcc pg_commit_ts]$ echo $((0x0002809c7359a0aa))
704359391928490

所以，转化成时间戳就是

postgres=# select timestamp with time zone '2000-01-01 00:00:00' + 704359391928490/1000/1000 * interval '1 second' + interval '8 h' as commit_ts;
       commit_ts        
------------------------
 2022-04-27 15:23:11+08
(1 row)

postgres=# select pg_xact_commit_timestamp('734');
   pg_xact_commit_timestamp   
------------------------------
 2022-04-27 15:23:11.92849+08
(1 row)

至此，到这里我们搞清楚clog和commit_ts每一个字节是什么意思了。是不是小有成就？

HACK

修改CLOG

都到这一步了，现在不妨让我们来做一点骚操作，偷天换日：

postgres=# create table t1(id int);
CREATE TABLE
postgres=# begin;
BEGIN
postgres=*# select txid_current();
 txid_current 
--------------
          734
(1 row)

postgres=*# insert into t1 values(generate_series(1,5));
INSERT 0 5
postgres=*# commit ;
COMMIT
postgres=# insert into t1 values(6);
INSERT 0 1
postgres=# select xmin,id from t1;
 xmin | id 
------+----
  734 |  1
  734 |  2
  734 |  3
  734 |  4
  734 |  5
  735 |  6
(6 rows)

postgres=# select 734%(8192*4)/4,735%(8192*4)/4;;
 ?column? | ?column? 
----------+----------
      183 |      183
(1 row)
postgres=# select 734%4,735%4;
 ?column? | ?column? 
----------+----------
        2 |        3
(1 row)

734事务插入了5条数据，735事务插入了1条数据，老样子看一下

[postgres@xiongcc pg_xact]$ hexdump -C 0000 -s 183 -n 1
000000b7  55                                                |U|
000000b8

55换算成二进制就是01010101，最高位的0101，分别表示734和735事务，因此均提交了

postgres=# select txid_status('736');
ERROR:  transaction ID 736 is in the future
postgres=# select txid_status('735');
 txid_status 
-------------
 committed
(1 row)

postgres=# select txid_status('734');
 txid_status 
-------------
 committed
(1 row)

另外前文还提及了infomask

当查询一条数据的时候，需要去判断行的可见性，需要去查询相应事务的提交状态，我们只能从CLOG中或者PGXACT内存结构中(未结束的或未清除的事务信息内存)得知该tuple对应的事务提交状态，显然如果每条tuple都要查询pg_clog的话，性能一定会很差，当然还要根据隔离级别、事务快照来综合判断行的可见性，在此不再赘述，在PostgreSQL中提供了TransactionIdIsInProgress、TransactionIdDidCommit和TransactionIdDidAbort用于获取事务的状态，这些函数被设计为尽可能减少对CLOG的频繁访问 (假如把freeze相关参数设置为20亿的话，那么clog最多可能达到500多MB，每一个事务占2bit) 。尽管如此，如果在检查每条元组时都执行这些函数，也可能会成为瓶颈。所以，为了解决这个问题，PostgreSQL在t_infomask中使用了相关标志位。

使用pageinspect看一下

postgres=# select lp, t_xmin, t_xmax, t_ctid,
       infomask(t_infomask, 1) as infomask,
       infomask(t_infomask2, 2) as infomask2
from heap_page_items(get_raw_page('t1', 0));
 lp | t_xmin | t_xmax | t_ctid |          infomask           | infomask2 
----+--------+--------+--------+-----------------------------+-----------
  1 |    734 |      0 | (0,1)  | XMAX_INVALID|XMIN_COMMITTED | 
  2 |    734 |      0 | (0,2)  | XMAX_INVALID|XMIN_COMMITTED | 
  3 |    734 |      0 | (0,3)  | XMAX_INVALID|XMIN_COMMITTED | 
  4 |    734 |      0 | (0,4)  | XMAX_INVALID|XMIN_COMMITTED | 
  5 |    734 |      0 | (0,5)  | XMAX_INVALID|XMIN_COMMITTED | 
  6 |    735 |      0 | (0,6)  | XMAX_INVALID|XMIN_COMMITTED | 
(6 rows)

postgres=# SELECT t_ctid, raw_flags, combined_flags
         FROM heap_page_items(get_raw_page('t1', 0)),
           LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2)
         WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL;
 t_ctid |                raw_flags                | combined_flags 
--------+-----------------------------------------+----------------
 (0,1)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,2)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,3)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,4)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,5)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,6)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(6 rows)

postgres=# select lp, t_data from heap_page_items(get_raw_page('t1', 0));
 lp |   t_data   
----+------------
  1 | \x01000000
  2 | \x02000000
  3 | \x03000000
  4 | \x04000000
  5 | \x05000000
  6 | \x06000000
(6 rows)

可以看到，每一行都是XMAX_INVALID|XMIN_COMMITTED，表示xmin都提交了并且没有被删除，因此总是可见。值得注意的是，每一行还进行了对齐（int4字节，剩下的4字节进行了填充，按照8字节进行padding）

然后找一下具体的磁盘文件

postgres=# select pg_relation_filepath('t1');
 pg_relation_filepath 
----------------------
 base/13892/16384
(1 row)

[postgres@xiongcc 13892]$ hexdump -C 16384
00000000  00 00 00 00 90 f1 6f 01  00 00 00 00 30 00 40 1f  |......o.....0.@.|
00000010  00 20 04 20 00 00 00 00  e0 9f 38 00 c0 9f 38 00  |. . ......8...8.|
00000020  a0 9f 38 00 80 9f 38 00  60 9f 38 00 40 9f 38 00  |..8...8.`[email protected].|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001f40  df 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001f50  06 00 01 00 00 08 18 00  06 00 00 00 00 00 00 00  |................|
00001f60  de 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001f70  05 00 01 00 00 08 18 00  05 00 00 00 00 00 00 00  |................|
00001f80  de 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001f90  04 00 01 00 00 08 18 00  04 00 00 00 00 00 00 00  |................|
00001fa0  de 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001fb0  03 00 01 00 00 08 18 00  03 00 00 00 00 00 00 00  |................|
00001fc0  de 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001fd0  02 00 01 00 00 08 18 00  02 00 00 00 00 00 00 00  |................|
00001fe0  de 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001ff0  01 00 01 00 00 08 18 00  01 00 00 00 00 00 00 00  |................|
00002000

要分析这些乱七八糟的字符，需要一点数据块结构的知识

数据块的源码定义在src/⁨include/⁨storage⁩/bufpage.h中，不熟悉的可以去补习了。

typedef struct PageHeaderData
{
    /* XXX LSN is member of *any* block, not only page-organized ones */
    PageXLogRecPtr pd_lsn;  /* LSN: next byte after last byte of xlog
                                 * record for last change to this page */
    uint16  pd_checksum; /* checksum */
    uint16  pd_flags;  /* flag bits, see below */
    LocationIndex pd_lower;  /* offset to start of free space */
    LocationIndex pd_upper;  /* offset to end of free space */
    LocationIndex pd_special; /* offset to start of special space */
    uint16  pd_pagesize_version;
    TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
    ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;

偷天换日 | PostgreSQL伪造事务日志

而Tuple的结构在src/include/access/htup_detail.h

struct HeapTupleHeaderData
{
    union
    {
        HeapTupleFields t_heap;
        DatumTupleFields t_datum;
    }   t_choice;

    ItemPointerData t_ctid;  /* current TID of this or newer tuple (or a
                                 * speculative insertion token) */

    /* Fields below here must match MinimalTupleData! */

#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK2 2
    uint16  t_infomask2; /* number of attributes + various flags */

#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK 3
    uint16  t_infomask;  /* various flag bits, see below */

#define FIELDNO_HEAPTUPLEHEADERDATA_HOFF 4
    uint8  t_hoff;   /* sizeof header incl. bitmap, padding */

    /* ^ - 23 bytes - ^ */

#define FIELDNO_HEAPTUPLEHEADERDATA_BITS 5
    bits8  t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */

    /* MORE DATA FOLLOWS AT END OF STRUCT */
};

偷天换日 | PostgreSQL伪造事务日志

Tuple结构总共包括如下几个域：

• 固定字段 (HeapTupleHeaderData)，23byte固定大小
• NULL 位图 (若 t_infomask 设置了 HEAP_HASNULL )
• padding对齐 (必须使得用户数据对齐)，按照alignment，一般是8字节
• Object ID (若 t_infomask 设置了 HEAP_HASOID )，才12以后，被取消了
• 用户数据字段

我们的例子没有NULL，因此按照24字节对齐，所以数据的偏移量是24+8（存在4字节对齐）= 32，并且数据是倒着放的，因此第一行在8192 - 32 = 8160，所以从8160开始就是第一条数据。当然实际情况可能复杂一些，比如还含有NULL，每一列有一个比特。而且12以前的版本可能还有OID，可以去看看我之前的infomask文章

从v12开始default_with_oids参数就没了，The parameter default_with_oids is gone, it had been disabled by default since after PostgreSQL 8.0，并且the default_with_oids parameter cannot be changed to 'on'，可能也是为了性能吧，毕竟每次都去检验重试一遍还是十分耗时的。

[postgres@xiongcc 13892]$ hexdump -C 16384 -s 8160
00001fe0  de 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001ff0  01 00 01 00 00 08 18 00  01 00 00 00 00 00 00 00  |................|
00002000

这就是第一条数据，从前往后看一下每一个字节是什么意思

• 02 de就是xmin，换算成10进制就是734
• 00 00是xmax，表示xmax = 0
• 00 00是t_cid或者t_xvac，感兴趣的可以参考一下我之前的文章，cmin、cmax和xvac是一个共用体，不同的生命周期，会有不同的表示
• ctid占了6个字节，00 00 00 00 01 00，表示ctid = (0,1)
• 00 01是infomask2
• 09 00是t_infomask，0900正是XMAX_INVALID|XMIN_COMMITTED，0x0100 | 0x0800
• 0018是t_hoff，10进制就是24，t_hoff 是header的大小，用户数据从t_hoff开始

至此，分析清楚了。

现在让我们偷天换日一下，记住之前的状态，数据是6条

postgres=# select xmin,id from t1;
 xmin | id 
------+----
  734 |  1
  734 |  2
  734 |  3
  734 |  4
  734 |  5
  735 |  6
(6 rows)

改一下infomask，使起事务变成回滚状态，XMAX_INVALID|XMIN_INVALID，也就是0x0800 | 0x0200 = 0x0a00，十进制的2560

[postgres@xiongcc 13892]$ hexdump -C 16387 -s 8160
00001fe0  de 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001ff0  01 00 01 00 00 09 18 00  01 00 00 00 00 00 00 00  |................|
00002000

改成
[postgres@xiongcc 13892]$ hexdump -C 16387 -s 8160
00001fe0  de 02 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001ff0  01 00 01 00 00 0a 18 00  01 00 00 00 00 00 00 00  |................|
00002000  0a                                                |.|
00002001

然后改一下CLOG，改成回滚ABORT，0x02，01010101 -> 01100101，0x65

[postgres@xiongcc pg_xact]$ hexdump -C 0000 -s 183 -n 1
000000b7  55                                                |U|
000000b8

改成

[postgres@xiongcc pg_xact]$ hexdump -C 0000 -s 183 -n 1
000000b7  65                                                |e|
000000b8

重启数据库，再次查看数据

postgres=# select xmin,id from t1;
 xmin | id 
------+----
  734 |  2
  734 |  3
  734 |  4
  734 |  5
  735 |  6
(5 rows)

postgres=# select txid_status('734');
 txid_status 
-------------
 aborted
(1 row)

postgres=# SELECT t_ctid, raw_flags, combined_flags
         FROM heap_page_items(get_raw_page('t1', 0)),
           LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2)
         WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL;
 t_ctid |                raw_flags                | combined_flags 
--------+-----------------------------------------+----------------
 (0,1)  | {HEAP_XMIN_INVALID,HEAP_XMAX_INVALID}   | {}
 (0,2)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,3)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,4)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,5)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,6)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(6 rows)

可以看到，1这条数据就被我成功"抹掉了"！并且vacuum会正常移除这一条数据

postgres=# vacuum verbose t1;
INFO:  vacuuming "public.t1"
INFO:  table "t1": removed 1 dead item identifiers in 1 pages
INFO:  table "t1": found 1 removable, 5 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 740
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
postgres=# SELECT t_ctid, raw_flags, combined_flags
         FROM heap_page_items(get_raw_page('t1', 0)),
           LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2)
         WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL;
 t_ctid |                raw_flags                | combined_flags 
--------+-----------------------------------------+----------------
 (0,2)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,3)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,4)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,5)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,6)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(5 rows)

修改commit_ts

怎么样？是不是十分惊喜，还可以这么玩！移除了一条数据之后，我们再来试着改一下commit_ts，为了方便，我们把数据插入回去，同时再做一次查询，设置标志位infomask

postgres=# insert into t1 values(1);
INSERT 0 1
postgres=# SELECT t_ctid, raw_flags, combined_flags
         FROM heap_page_items(get_raw_page('t1', 0)),
           LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2)
         WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL;
 t_ctid |                raw_flags                | combined_flags 
--------+-----------------------------------------+----------------
 (0,1)  | {HEAP_XMAX_INVALID}                     | {}
 (0,2)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,3)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,4)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,5)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,6)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(6 rows)

postgres=# select xmin,id from t1;
 xmin | id 
------+----
  740 |  1
  734 |  2
  734 |  3
  734 |  4
  734 |  5
  735 |  6
(6 rows)

postgres=# SELECT t_ctid, raw_flags, combined_flags
         FROM heap_page_items(get_raw_page('t1', 0)),
           LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2)
         WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL;
 t_ctid |                raw_flags                | combined_flags 
--------+-----------------------------------------+----------------
 (0,1)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,2)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,3)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,4)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,5)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,6)  | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
(6 rows)

可以看到，查询了一次之后，添加了HEAP_XMIN_COMMITTED的标志位。看一下commit_ts

postgres=# select pg_xact_commit_timestamp('740');
   pg_xact_commit_timestamp    
-------------------------------
 2022-04-27 16:22:01.888337+08
(1 row)

前面说了，10个字节，所以总偏移量是7400

[postgres@xiongcc pg_commit_ts]$ hexdump -C 0000 -s 7400 -n 10
00001ce8  51 8a c0 45 9d 80 02 00  00 00                    |Q..E......|
00001cf2

[postgres@xiongcc pg_commit_ts]$ echo $((0x0002809d45c08a51))
704362921888337

postgres=# select timestamp with time zone '2000-01-01 00:00:00' + 704362921888337/1000/1000 * interval '1 second' + interval '8 h' as commit_ts;
       commit_ts        
------------------------
 2022-04-27 16:22:01+08
(1 row)

让我们改一个未来的值看看PostgreSQL会如何处理？给时间加两位数字，77704362921888337，变成十六进制就是0x1140fb4d5750a50，也就是 50 0a 50 57 4d fb 40 11，修改一下文件

[postgres@xiongcc pg_commit_ts]$ hexdump -C 0000 -s 7400 -n 10
00001ce8  50 0a 50 47 4d fb 40 11  00 00                    |P.PGM.@...|
00001cf2

😂😂😂，不知道改成一个什么时间了。

postgres=# select pg_xact_commit_timestamp('740');
   pg_xact_commit_timestamp    
-------------------------------
 41397-08-26 04:01:21.73832+08
(1 row)

复现

现在，让我们复现一下这个错误，我们把autovacuum_freeze_max_age设到最大20亿，这样的话CLOG文件会足够多，CLOG在shared buffer中占128个page，因此可以缓存 128 * 32K = 4,194,304个事务。另外每隔32K个事务，要扩展一个CLOG PAGE，每次扩展需要填充0，同时需要调用PG_FSYNC。

让我们压测一会，多产生点CLOG

[postgres@xiongcc ~]$ pgbench -f bench.sql -T 3000 -c 10 -j 100
pgbench (14.2)
starting vacuum...end.

[postgres@xiongcc ~]$ ls -lth pgdata/pg_xact/
total 424K
-rw------- 1 postgres postgres 160K Apr 27 16:51 0001
-rw------- 1 postgres postgres 256K Apr 27 16:49 0000

[postgres@xiongcc ~]$ ls -lth pgdata/pg_xact/
total 528K
-rw------- 1 postgres postgres 8.0K Apr 27 16:53 0002
-rw------- 1 postgres postgres 256K Apr 27 16:52 0001
-rw------- 1 postgres postgres 256K Apr 27 16:49 0000

然后把文件改成bak

[postgres@xiongcc pg_xact]$ ll
total 1340
-rw------- 1 postgres postgres 262144 Apr 27 16:49 0000_bak
-rw------- 1 postgres postgres 262144 Apr 27 16:52 0001_bak
-rw------- 1 postgres postgres 262144 Apr 27 16:56 0002_bak
-rw------- 1 postgres postgres 262144 Apr 27 17:15 0003
-rw------- 1 postgres postgres 262144 Apr 27 17:19 0004
-rw------- 1 postgres postgres  40960 Apr 27 17:22 0005

然后将t1表里的infomask部分行置为空，这样我就要去实际判断数据的可见性了

 (0,219) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,220) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,221) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,222) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,223) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,224) | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (0,225) | {}                                      | {}
 (0,226) | {}                                      | {}
(226 rows)

再进行查询就报错了

postgres=# select txid_status('959');
ERROR:  could not access status of transaction 959
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.

postgres=# select count(*) from t1;
ERROR:  could not access status of transaction 959
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.

postgres=# select * from t1 where ctid='(0,225)';
ERROR:  could not access status of transaction 958
DETAIL:  Could not open file "pg_xact/0000": No such file or directory.

小结

所以，再次回到这个错误中来，假如碰到这种错误，我们是可以通过hack的方式，比如某个事务不存在，来伪造事务文件。

比如现在有个查询报错了

postgres=# select * from t1;
ERROR:  could not access status of transaction xxx
DETAIL:  Could not open file "pg_clog/xxxx": No such file or directory.

你该怎么处理？最简单全部构造一个all commited的CLOG文件，但是数据是否正确与否就无法拿捏了，因为你也不知道事务是提交了还是回滚了，假如WAL还在的话，或许你可以解析一下看看事务状态，但是工程量太大了，最简单粗暴的方式还是恢复吧。

通过前面的分析，再思考一下，为什么会碰到这种错误？我们可以合理怀疑

• freeze做冻结的时候，回收了不该回收的CLOG，尤其是改了MVCC模型的数据库（BUG，😎😎）
• 人为不小心删除

另外，唐成老师写的 pg_fix 工具也是这样的原理，将CLOG中的事务状态从“TRANSACTION_STATUS_COMMITTED”改成“TRANSACTION_STATUS_ABORTED”，原先的事务就会做废，以此找回数据，但是前提是没有被vacuum掉，所以碰到这种故障，先做一个alter table xxx set (autovacuum_enabled = off)。

写完这篇，我突然萌生了开发一个工具的想法，类似的，修改CLOG和infomask，来做天马行空的事情，看看精力是否允许吧，拭目以待拭目以待。

vlambda博客
学习文章列表