This section walks through RelationGetBufferForTuple, the buffer-related function PostgreSQL calls while executing an INSERT. It returns a page in the given relation with free space >= the requested length; the buffer holding that page is returned pinned and exclusive-locked.
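For context, the typical caller is heap_insert; in the PostgreSQL 11-era source examined here the call looks roughly like this (a condensed excerpt, not a complete listing):

/* inside heap_insert (heapam.c): pick a page for the new tuple */
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
&vmbuffer, NULL);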
BufferDesc
Shared descriptor/state data for a single shared buffer.
/*
 * Flags for buffer descriptors
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED               (1U << 22)  /* buffer header is locked */
#define BM_DIRTY                (1U << 23)  /* data needs writing */
#define BM_VALID                (1U << 24)  /* data is valid */
#define BM_TAG_VALID            (1U << 25)  /* tag is assigned */
#define BM_IO_IN_PROGRESS       (1U << 26)  /* read or write in progress */
#define BM_IO_ERROR             (1U << 27)  /* previous I/O failed */
#define BM_JUST_DIRTIED         (1U << 28)  /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1U << 29)  /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1U << 30)  /* must write for checkpoint */
#define BM_PERMANENT            (1U << 31)  /* permanent buffer (not unlogged,
                                             * or init fork) */
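These flag bits occupy the top bits of the single atomic state word described below. For reference, the packing (reference count, usage count and flags in one uint32) is defined next to them in buf_internals.h, roughly as follows:

//lower 18 bits: reference count; next 4 bits: usage count; top 10 bits: flag bits
#define BUF_REFCOUNT_MASK ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK 0x003C0000U
#define BUF_USAGECOUNT_SHIFT 18
#define BUF_FLAG_MASK 0xFFC00000U
//extract the packed fields from a state value
#define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)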
/*
 * BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either.  To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
BufferTag tag; /* ID of page contained in buffer */
int buf_id; /* buffer's index number (from 0) */
/* state of the tag, containing flags, refcount and usagecount */
pg_atomic_uint32 state;
int wait_backend_pid; /* backend PID of pin-count waiter */
int freeNext; /* link in freelist chain */
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
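A minimal sketch of the "one-time read of the flags" pattern mentioned in the comment above (bufHdr is assumed to point at a valid BufferDesc): the packed state is read once with pg_atomic_read_u32() and the flag bits are then tested on the local copy, without taking the buffer header lock:

uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
if (buf_state & BM_DIRTY)
{
/* the buffered page has been modified and will need to be written out */
}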
BufferTag
A buffer tag identifies which disk block the buffer contains.
/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries.  It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel).  The storage manager must be able to cope anyway.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
RelFileNode rnode; /* physical relation identifier */
ForkNumber forkNum;
BlockNumber blockNum; /* blknum relative to begin of reln */
} BufferTag;
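To illustrate how the tag doubles as a hash key, buf_internals.h also provides macros for filling and comparing tags; a short sketch (rnode, forkNum, blockNum and bufHdr are assumed to be in scope):

BufferTag newTag;
//identify the block we are after: relation file node, fork and block number
INIT_BUFFERTAG(newTag, rnode, forkNum, blockNum);
//with the buffer pinned, its tag may be examined without the header lock
if (BUFFERTAGS_EQUAL(newTag, bufHdr->tag))
{
/* this buffer already holds the block we are looking for */
}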
RelationGetBufferForTuple returns a page in the given relation with free space >= the given length; the buffer holding that page is returned pinned and exclusive-locked.
Inputs:
relation - the target table
len - the amount of space required
otherBuffer - for the update case, the previously pinned buffer
options - processing options
bistate - bulk-insert state
vmbuffer - the first visibility map (vm) buffer
vmbuffer_other - for the update case, the visibility map buffer matching otherBuffer
Notes:
The otherBuffer argument looks confusing at first; it follows from how PostgreSQL updates work.
An UPDATE is not done in place: the old tuple is kept (its xmax is set) and the new version is inserted.
If the old and new versions land in different blocks, locking those blocks can deadlock.
For example: session A updates row 1 of table T, which lives in block 0, and its new version goes to block 2;
session B updates row 2 of table T, which also lives in block 0, and its new version also goes to block 2.
Both block 0 and block 2 must be locked to complete either UPDATE:
if session A locks block 2 first and session B locks block 0 first,
then A tries to lock block 0 while B tries to lock block 2, and we have a deadlock.
To avoid this, PostgreSQL requires that blocks of the same relation be locked in increasing block-number order:
to lock blocks 0 and 2, block 0 must be locked before block 2 (see the condensed sketch below).
Output:
the buffer allocated for the tuple
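Condensed from the function body below (buffer holds the target block, otherBuffer the previously pinned one; the case where both are the same block is handled separately in the real code):

//always lock the lower-numbered block first so that two concurrent
//heap_update calls cannot deadlock
if (otherBlock < targetBlock)
{
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else
{
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
}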
Its main logic is as follows:
1. Initialize the relevant variables.
2. Compute the space to keep free on each page (fillfactor).
3. For an update, get the block of the previously pinned buffer.
4. Determine the target page: targetBlock.
5. If targetBlock is invalid and the FSM may be used, ask the FSM for a block.
6. While targetBlock is valid, loop over candidate blocks:
6.1. read and exclusive-lock the target block, plus otherBuffer if one was given;
6.2. pin the visibility map page(s) as needed;
6.3. read the page and check whether it has enough free space; if so, return the buffer;
6.4. otherwise call RecordAndGetPageWithFreeSpace to get the next targetBlock and loop again.
7. If the loop ends without finding a block, extend the relation.
8. After extending, read a new buffer with P_NEW and lock it.
9. Get the page behind that buffer and run sanity checks.
10. Report an error if the checks fail; otherwise return the buffer.
/*
 * RelationGetBufferForTuple
 *
 * Returns pinned and exclusive-locked buffer of a page in given relation
 * with free space >= given len.
 *
 * If otherBuffer is not InvalidBuffer, then it references a previously
 * pinned buffer of another page in the same relation; on return, this
 * buffer will also be exclusive-locked.  (This case is used by heap_update;
 * the otherBuffer contains the tuple being updated.)
 *
 * The reason for passing otherBuffer is that if two backends are doing
 * concurrent heap_update operations, a deadlock could occur if they try
 * to lock the same two buffers in opposite orders.  To ensure that this
 * can't happen, we impose the rule that buffers of a relation must be
 * locked in increasing page number order.  This is most conveniently done
 * by having RelationGetBufferForTuple lock them both, with suitable care
 * for ordering.
 *
 * NOTE: it is unlikely, but not quite impossible, for otherBuffer to be the
 * same buffer we select for insertion of the new tuple (this could only
 * happen if space is freed in that page after heap_update finds there's not
 * enough there).  In that case, the page will be pinned and locked only once.
 *
 * For the vmbuffer and vmbuffer_other arguments, we avoid deadlock by
 * locking them only after locking the corresponding heap page, and taking
 * no further lwlocks while they are locked.
 *
 * We normally use FSM to help us find free space.  However,
 * if HEAP_INSERT_SKIP_FSM is specified, we just append a new empty page to
 * the end of the relation if the tuple won't fit on the current target page.
 * This can save some cycles when we know the relation is new and doesn't
 * contain useful amounts of free space.
 *
 * HEAP_INSERT_SKIP_FSM is also useful for non-WAL-logged additions to a
 * relation, if the caller holds exclusive lock and is careful to invalidate
 * relation's smgr_targblock before the first insertion --- that ensures that
 * all insertions will occur into newly added pages and not be intermixed
 * with tuples from other transactions.  That way, a crash can't risk losing
 * any committed data of other transactions.  (See heap_insert's comments
 * for additional constraints needed for safe usage of this behavior.)
 *
 * The caller can also provide a BulkInsertState object to optimize many
 * insertions into the same relation.  This keeps a pin on the current
 * insertion target page (to save pin/unpin cycles) and also passes a
 * BULKWRITE buffer selection strategy object to the buffer manager.
 * Passing NULL for bistate selects the default behavior.
 *
 * We always try to avoid filling existing pages further than the fillfactor.
 * This is OK since this routine is not consulted when updating a tuple and
 * keeping it on the same page, which is the scenario fillfactor is meant
 * to reserve space for.
 *
 * ereport(ERROR) is allowed here, so this routine *must* be called
 * before any (unlogged) changes are made in buffer pool.
 */
/*
 * Inputs, notes and output: see the description given above the function
 * comment.
 *
 * Note: a "pinned" buffer is one that is currently being used and therefore
 * must not be flushed out of the buffer pool.
 */
Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
Buffer *vmbuffer, Buffer *vmbuffer_other)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);//may the FSM be used to look for free space?
Buffer buffer = InvalidBuffer;
Page page;
Size pageFreeSpace = 0,//free space on the candidate page
saveFreeSpace = 0;//space to keep free on the page (fillfactor)
BlockNumber targetBlock,//candidate target block
otherBlock;//block of the previously pinned buffer
bool needLock;//need the relation-extension lock?
//round the requested size up to a MAXALIGN boundary
len = MAXALIGN(len); /* be conservative */
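//For example, on a typical 64-bit build MAXALIGN rounds up to a multiple of 8,
//so a 30-byte tuple is treated as needing 32 bytes.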
/* Bulk insert is not supported for updates, only inserts. */
//a valid otherBuffer means this is an update; bulk insert (bistate)
//is only supported for plain inserts
Assert(otherBuffer == InvalidBuffer || !bistate);
/*
 * If we're gonna fail for oversize tuple, do it right away
 */
//#define MaxHeapTupleSize (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData)))
//#define MinHeapTupleSize MAXALIGN(SizeofHeapTupleHeader)
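//With the default 8 kB block size, MaxHeapTupleSize works out to roughly 8160
//bytes (8192 minus the MAXALIGN'd page header and one line pointer).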
if (len > MaxHeapTupleSize)
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("row is too big: size %zu, maximum size %zu",
len, MaxHeapTupleSize)));
/* Compute desired extra freespace due to fillfactor option */
//#define RelationGetTargetPageFreeSpace(relation, defaultff) \
//    (BLCKSZ * (100 - RelationGetFillFactor(relation, defaultff)) / 100)
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
HEAP_DEFAULT_FILLFACTOR);
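//For example, with the default 8 kB block size and fillfactor = 90 this
//reserves 8192 * (100 - 90) / 100 = 819 bytes per page; with the heap
//default fillfactor of 100, saveFreeSpace is 0.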
//for an update, get the block number of the previously pinned buffer
if (otherBuffer != InvalidBuffer)
otherBlock = BufferGetBlockNumber(otherBuffer);
else
otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */
/*
 * We first try to put the tuple on the same page we last inserted a tuple
 * on, as cached in the BulkInsertState or relcache entry.  If that
 * doesn't work, we ask the Free Space Map to locate a suitable page.
 * Since the FSM's info might be out of date, we have to be prepared to
 * loop around and retry multiple times. (To insure this isn't an infinite
 * loop, we must update the FSM with the correct amount of free space on
 * each page that proves not to be suitable.)  If the FSM has no record of
 * a page with enough free space, we give up and extend the relation.
 *
 * When use_fsm is false, we either put the tuple onto the existing target
 * page or extend the relation.
 */
if (len + saveFreeSpace > MaxHeapTupleSize)
{
//the requested size plus the reserved space exceeds the largest possible
//heap tuple: don't bother with the FSM, extend the relation instead
/* can't fit, don't bother asking FSM */
targetBlock = InvalidBlockNumber;
use_fsm = false;
}
else if (bistate && bistate->current_buf != InvalidBuffer)//bulk-insert mode
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
targetBlock = RelationGetTargetBlock(relation);//plain insert mode
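/*
 * For reference (paraphrased, not verbatim): RelationGetTargetBlock simply
 * returns the block number cached in the relation's smgr entry, roughly
 *   (relation)->rd_smgr ? (relation)->rd_smgr->smgr_targblock : InvalidBlockNumber
 */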
if (targetBlock == InvalidBlockNumber && use_fsm)
{
//no suitable block is cached yet and the FSM may be consulted
/*
 * We have no cached target page, so ask the FSM for an initial
 * target.
 */
//ask the FSM for a block with at least len + saveFreeSpace bytes free
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
/*
 * If the FSM knows nothing of the rel, try the last page before we
 * give up and extend.  This avoids one-tuple-per-page syndrome during
 * bootstrapping or in a recently-started system.
 */
//nothing found: try the relation's last page before extending or giving up
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
if (nblocks > 0)
targetBlock = nblocks - 1;
}
}
loop:
while (targetBlock != InvalidBlockNumber)
{
//---------- loop until a block with enough free space is found (or we run out of candidates)
/*
 * Read and exclusive-lock the target block, as well as the other
 * block if one was given, taking suitable care with lock ordering and
 * the possibility they are the same block.
 *
 * If the page-level all-visible flag is set, caller will need to
 * clear both that and the corresponding visibility map bit.  However,
 * by the time we return, we'll have x-locked the buffer, and we don't
 * want to do any I/O while in that state.  So we check the bit here
 * before taking the lock, and pin the page if it appears necessary.
 * Checking without the lock creates a risk of getting the wrong
 * answer, so we'll have to recheck after acquiring the lock.
 */
if (otherBuffer == InvalidBuffer)
{
//----------- not an update: the simple case
/* easy case */
buffer = ReadBufferBI(relation, targetBlock, bistate);
if (PageIsAllVisible(BufferGetPage(buffer)))
//page is marked all-visible: pin the corresponding visibility-map page
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);//lock the buffer
}
else if (otherBlock == targetBlock)
{
//----------- update, and the new tuple goes into the same block as the old one
/* also easy case */
buffer = otherBuffer;
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else if (otherBlock < targetBlock)
{
//----------- update, and the old tuple's block number is lower than the new one's
/* lock other buffer first */
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
//lock the lower-numbered block first
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
else
{
//------------ update, and the old tuple's block number is higher than the new one's
/* lock target buffer first */
buffer = ReadBuffer(relation, targetBlock);
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
//lock the lower-numbered block first
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
}
/*
 * We now have the target page (and the other buffer, if any) pinned
 * and locked.  However, since our initial PageIsAllVisible checks
 * were performed before acquiring the lock, the results might now be
 * out of date, either for the selected victim buffer, or for the
 * other buffer passed by the caller.  In that case, we'll need to
 * give up our locks, go get the pin(s) we failed to get earlier, and
 * re-lock.  That's pretty painful, but hopefully shouldn't happen
 * often.
 *
 * Note that there's a small possibility that we didn't pin the page
 * above but still have the correct page pinned anyway, either because
 * we've already made a previous pass through this loop, or because
 * caller passed us the right page anyway.
 *
 * Note also that it's possible that by the time we get the pin and
 * retake the buffer locks, the visibility map bit will have been
 * cleared by some other backend anyway.  In that case, we'll have
 * done a bit of extra work for no gain, but there's no real harm
 * done.
 */
if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
GetVisibilityMapPins(relation, buffer, otherBuffer,
targetBlock, otherBlock, vmbuffer,
vmbuffer_other);//pin the visibility-map pages
else
GetVisibilityMapPins(relation, otherBuffer, buffer,
otherBlock, targetBlock, vmbuffer_other,
vmbuffer);//pin the visibility-map pages
/*
 * Now we can check to see if there's enough free space here. If so,
 * we're done.
 */
page = BufferGetPage(buffer);
pageFreeSpace = PageGetHeapFreeSpace(page);
if (len + saveFreeSpace <= pageFreeSpace)
{
//enough free space: remember this page and return its buffer
/* use this page as future insert target, too */
/*
#define RelationSetTargetBlock(relation, targblock) \
do { \
RelationOpenSmgr(relation); \
(relation)->rd_smgr->smgr_targblock = (targblock); \
} while (0)
*/
RelationSetTargetBlock(relation, targetBlock);
return buffer;
}
/*
 * Not enough space, so we must give up our page locks and pin (if
 * any) and prepare to look elsewhere.  We don't care which order we
 * unlock the two buffers in, so this can be slightly simpler than the
 * code above.
 */
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
if (otherBuffer == InvalidBuffer)
ReleaseBuffer(buffer);
else if (otherBlock != targetBlock)
{
LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buffer);
}
/* Without FSM, always fall out of the loop and extend */
//not using the FSM: leave the loop and extend the relation
if (!use_fsm)
break;
/*
 * Update FSM as to condition of this page, and ask for another page
 * to try.
 */
//ask the FSM for the next candidate block; if no block qualifies,
//targetBlock becomes InvalidBlockNumber and the loop ends
targetBlock = RecordAndGetPageWithFreeSpace(relation,
targetBlock,
pageFreeSpace,
len + saveFreeSpace);
}
//--------- no suitable block was found: extend the relation
/*
 * Have to extend the relation.
 *
 * We have to use a lock to ensure no one else is extending the rel at the
 * same time, else we will both try to initialize the same new page.  We
 * can skip locking for new or temp relations, however, since no one else
 * could be accessing them.
 */
//newly created tables and temporary tables need no lock
needLock = !RELATION_IS_LOCAL(relation);
/*
 * If we need the lock but are not able to acquire it immediately, we'll
 * consider extending the relation by multiple blocks at a time to manage
 * contention on the relation extension lock.  However, this only makes
 * sense if we're using the FSM; otherwise, there's no point.
 */
if (needLock)//the relation-extension lock is required
{
if (!use_fsm)
//not using the FSM: take the extension lock unconditionally
LockRelationForExtension(relation, ExclusiveLock);
else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
{
/* Couldn't get the lock immediately; wait for it. */
LockRelationForExtension(relation, ExclusiveLock);
/*
 * Check if some other backend has extended a block for us while
 * we were waiting on the lock.
 */
//if another backend extended the relation while we waited,
//a suitable targetBlock may now be available
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
/*
 * If some other waiter has already extended the relation, we
 * don't need to do so; just use the existing freespace.
 */
if (targetBlock != InvalidBlockNumber)
{
UnlockRelationForExtension(relation, ExclusiveLock);
goto loop;
}
/* Time to bulk-extend. */
//nobody else extended it, so extend the relation ourselves
RelationAddExtraBlocks(relation, bistate);
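//Note: RelationAddExtraBlocks (also in hio.c) decides how far to extend based
//on the number of waiters on the extension lock -- in the version studied here
//roughly Min(512, lockWaiters * 20) extra blocks -- and records the new pages
//in the FSM.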
}
}
/*
 * In addition to whatever extension we performed above, we always add at
 * least one block to satisfy our own request.
 *
 * XXX This does an lseek - rather expensive - but at the moment it is the
 * only way to accurately determine how many blocks are in a relation.  Is
 * it worth keeping an accurate file length in shared memory someplace,
 * rather than relying on the kernel to do it for us?
 */
//after extending, read a brand-new page (P_NEW)
buffer = ReadBufferBI(relation, P_NEW, bistate);
/*
 * We can be certain that locking the otherBuffer first is OK, since it
 * must have a lower page number.
 */
if (otherBuffer != InvalidBuffer)
//otherBuffer necessarily precedes the newly added block, so lock it first
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
/*
 * Now acquire lock on the new page.
 */
//lock the new page
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
/*
 * Release the file-extension lock; it's now OK for someone else to extend
 * the relation some more.  Note that we cannot release this lock before
 * we have buffer lock on the new page, or we risk a race condition
 * against vacuumlazy.c --- see comments therein.
 */
if (needLock)
//release the extension lock
UnlockRelationForExtension(relation, ExclusiveLock);
/*
 * We need to initialize the empty new page. Double-check that it really
 * is empty (this should never happen, but if it does we don't want to
 * risk wiping out valid data).
 */
//get the page behind the buffer
page = BufferGetPage(buffer);
if (!PageIsNew(page))
//not a new page: something has gone badly wrong somewhere
elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
BufferGetBlockNumber(buffer),
RelationGetRelationName(relation));
//initialize the new page
PageInit(page, BufferGetPageSize(buffer), 0);
//even a brand-new page cannot hold the tuple: report an error
if (len > PageGetHeapFreeSpace(page))
{
/* We should not get here given the test at the top */
elog(PANIC, "tuple is too big: size %zu", len);
}
/*
 * Remember the new page as our target for future insertions.
 *
 * XXX should we enter the new page into the free space map immediately,
 * or just keep it for this backend's exclusive use in the short run
 * (until VACUUM sees it)?  Seems to depend on whether you expect the
 * current backend to make more insertions or not, which is probably a
 * good bet most of the time.  So for now, don't add it to FSM yet.
 */
//we finally have a block that can hold the new tuple
RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
return buffer;
}
Test script
15:54:13 (xdb@[local]:5432)testdb=# insert into t1 values (1,'1','1');
Call stack
(gdb) b RelationGetBufferForTuple
Breakpoint 1 at 0x4ef179: file hio.c, line 318.
(gdb) c
Continuing.
Breakpoint 1, RelationGetBufferForTuple (relation=0x7f4f51fe39b8, len=32, otherBuffer=0, options=0, bistate=0x0,
vmbuffer=0x7ffea95dbf6c, vmbuffer_other=0x0) at hio.c:318
318 bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
(gdb) bt
#0 RelationGetBufferForTuple (relation=0x7f4f51fe39b8, len=32, otherBuffer=0, options=0, bistate=0x0,
vmbuffer=0x7ffea95dbf6c, vmbuffer_other=0x0) at hio.c:318
#1 0x00000000004df1f8 in heap_insert (relation=0x7f4f51fe39b8, tup=0x178a478, cid=0, options=0, bistate=0x0)
at heapam.c:2468
#2 0x0000000000709dda in ExecInsert (mtstate=0x178a220, slot=0x178a680, planSlot=0x178a680, estate=0x1789eb8,
canSetTag=true) at nodeModifyTable.c:529
#3 0x000000000070c475 in ExecModifyTable (pstate=0x178a220) at nodeModifyTable.c:2159
#4 0x00000000006e05cb in ExecProcNodeFirst (node=0x178a220) at execProcnode.c:445
#5 0x00000000006d552e in ExecProcNode (node=0x178a220) at ../../../src/include/executor/executor.h:247
#6 0x00000000006d7d66 in ExecutePlan (estate=0x1789eb8, planstate=0x178a220, use_parallel_mode=false,
operation=CMD_INSERT, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x17a7688,
execute_once=true) at execMain.c:1723
#7 0x00000000006d5af8 in standard_ExecutorRun (queryDesc=0x178e458, direction=ForwardScanDirection, count=0,
execute_once=true) at execMain.c:364
#8 0x00000000006d5920 in ExecutorRun (queryDesc=0x178e458, direction=ForwardScanDirection, count=0, execute_once=true)
at execMain.c:307
#9 0x00000000008c1092 in ProcessQuery (plan=0x16b3ac0, sourceText=0x16b1ec8 "insert into t1 values (1,'1','1');",
params=0x0, queryEnv=0x0, dest=0x17a7688, completionTag=0x7ffea95dc500 "") at pquery.c:161
#10 0x00000000008c29a1 in PortalRunMulti (portal=0x1717488, isTopLevel=true, setHoldSnapshot=false, dest=0x17a7688,
altdest=0x17a7688, completionTag=0x7ffea95dc500 "") at pquery.c:1286
#11 0x00000000008c1f7a in PortalRun (portal=0x1717488, count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x17a7688, altdest=0x17a7688, completionTag=0x7ffea95dc500 "") at pquery.c:799
#12 0x00000000008bbf16 in exec_simple_query (query_string=0x16b1ec8 "insert into t1 values (1,'1','1');") at postgres.c:1145
#13 0x00000000008c01a1 in PostgresMain (argc=1, argv=0x16dbaf8, dbname=0x16db960 "testdb", username=0x16aeba8 "xdb")
at postgres.c:4182
#14 0x000000000081e07c in BackendRun (port=0x16d3940) at postmaster.c:4361
#15 0x000000000081d7ef in BackendStartup (port=0x16d3940) at postmaster.c:4033
#16 0x0000000000819be9 in ServerLoop () at postmaster.c:1706
#17 0x000000000081949f in PostmasterMain (argc=1, argv=0x16acb60) at postmaster.c:1379
#18 0x0000000000742941 in main (argc=1, argv=0x16acb60) at main.c:228
(gdb)
Original post: http://blog.itpub.net/6906/viewspace-2637322/