This article explains how to understand the Clock Sweep algorithm in PostgreSQL, a topic that trips up quite a few people in practice. It walks through the relevant buffer-manager data structures and source code; read carefully and you should come away with a working understanding.
BufferDesc
Shared descriptor (state) data for a single shared buffer.
/*
 * Flags for buffer descriptors
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED               (1U << 22)  /* buffer header is locked */
#define BM_DIRTY                (1U << 23)  /* data needs writing */
#define BM_VALID                (1U << 24)  /* data is valid */
#define BM_TAG_VALID            (1U << 25)  /* tag is assigned */
#define BM_IO_IN_PROGRESS       (1U << 26)  /* read or write in progress */
#define BM_IO_ERROR             (1U << 27)  /* previous I/O failed */
#define BM_JUST_DIRTIED         (1U << 28)  /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1U << 29)  /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1U << 30)  /* must write for checkpoint */
#define BM_PERMANENT            (1U << 31)  /* permanent buffer (not unlogged,
                                             * or init fork) */
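To see how these flag bits are meant to be used, here is a small illustrative sketch (not an excerpt from bufmgr.c; example_mark_for_checkpoint is a made-up name, and the PostgreSQL buffer-manager headers are assumed to be included). It follows the pattern used by the real code, which appears in the StrategyGetBuffer fragment further down: take the buffer header lock, examine and update a local copy of the state word, and write it back when releasing the lock.
/* Illustrative sketch only, not from bufmgr.c */
static void
example_mark_for_checkpoint(BufferDesc *buf)
{
    uint32      buf_state;

    buf_state = LockBufHdr(buf);            /* spin until BM_LOCKED is acquired; returns the state word */
    if (buf_state & BM_DIRTY)               /* flag bits live in the high bits of the state word */
        buf_state |= BM_CHECKPOINT_NEEDED;  /* example: dirty page must be flushed at the next checkpoint */
    UnlockBufHdr(buf, buf_state);           /* writes the state back with BM_LOCKED cleared */
}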
/*
* BufferDesc -- shared descriptor/state data for a single shared buffer.
*
* Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
* the tag, state or wait_backend_pid fields. In general, buffer header lock
* is a spinlock which is combined with flags, refcount and usagecount into
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
* after initialization, so does not need locking. freeNext is protected by
* the buffer_strategy_lock not buffer header lock. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
* It's assumed that nobody changes the state field while buffer header lock
* is held. Thus buffer header lock holder can do complex updates of the
* state variable in single write, simultaneously with lock release (cleaning
* BM_LOCKED flag). On the other hand, updating of state without holding
* buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
* is not set. Atomic increment/decrement, OR/AND etc. are not allowed.
*
* An exception is that if we have the buffer pinned, its tag can't change
* underneath us, so we can examine the tag without locking the buffer header.
* Also, in places we do one-time reads of the flags without bothering to
* lock the buffer header; this is generally for situations where we don't
* expect the flag bit being tested to be changing.
*
* We can't physically remove items from a disk page if another backend has
* the buffer pinned. Hence, a backend may need to wait for all other pins
* to go away. This is signaled by storing its own PID into
* wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER. At present,
* there can be only one such waiter per buffer.
*
* We use this same struct for local buffer headers, but the locks are not
* used and not all of the flag bits are useful either. To avoid unnecessary
* overhead, manipulations of the state field should be done without actual
* atomic operations (i.e. only pg_atomic_read_u32() and
* pg_atomic_unlocked_write_u32()).
*
* Be careful to avoid increasing the size of the struct when adding or
* reordering members. Keeping it below 64 bytes (the most common CPU
* cache line size) is fairly important for performance.
*/
typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0); identifies
                                 * the corresponding buffer pool slot */

    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;

    int         wait_backend_pid;  /* backend PID of pin-count waiter */
    int         freeNext;       /* link in freelist chain (next free buffer) */

    LWLock      content_lock;   /* to lock access to buffer contents */
} BufferDesc;
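The state field packs the flag bits shown earlier together with the pin count (refcount) and the usage count into a single 32-bit atomic, which is what allows refcount changes without taking the spinlock. The sketch below shows one way the three parts can be unpacked; the 18-bit refcount / 4-bit usage-count split and the MY_* names are assumptions made for illustration (chosen to be consistent with the flag bits starting at 1U << 22 above), while the real accessors, BUF_STATE_GET_REFCOUNT() and BUF_STATE_GET_USAGECOUNT(), are defined next to BufferDesc and are used by the StrategyGetBuffer fragment later in this article.
/* Illustrative layout assumption: bits 0..17 = refcount, bits 18..21 = usage
 * count, bits 22..31 = the BM_* flag bits defined above. */
#define MY_REFCOUNT_MASK        ((1U << 18) - 1)
#define MY_USAGECOUNT_SHIFT     18
#define MY_USAGECOUNT_MASK      (0xFU << MY_USAGECOUNT_SHIFT)

static inline uint32
my_state_get_refcount(uint32 buf_state)
{
    return buf_state & MY_REFCOUNT_MASK;                                /* number of pins */
}

static inline uint32
my_state_get_usagecount(uint32 buf_state)
{
    return (buf_state & MY_USAGECOUNT_MASK) >> MY_USAGECOUNT_SHIFT;    /* clock-sweep popularity counter */
}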
BufferTag
The buffer tag identifies which disk block the buffer contains.
/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries. It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel). The storage manager must be able to cope anyway.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;        /* which fork of the relation */
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;
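Because the tag is used directly as a hash key, any padding bytes would have to be zeroed before hashing, which is exactly what the note about INIT_BUFFERTAG above warns about. A hypothetical, defensive way to build a tag is sketched below (my_init_buffertag is a made-up helper, not the PostgreSQL macro; per the note, the real INIT_BUFFERTAG can rely on the struct having no padding):
/* Hypothetical helper: zero the whole struct first so that padding, if any
 * ever appeared, could never leak into the hash key. */
static void
my_init_buffertag(BufferTag *tag, RelFileNode rnode,
                  ForkNumber forkNum, BlockNumber blockNum)
{
    memset(tag, 0, sizeof(BufferTag));
    tag->rnode = rnode;
    tag->forkNum = forkNum;
    tag->blockNum = blockNum;
}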
The idea behind the algorithm has already been described in detail in the Buffer Manager chapter; it is excerpted below:
WHILE true
    (1) Obtain the candidate buffer descriptor pointed to by nextVictimBuffer
    (2) IF the candidate descriptor is unpinned THEN
    (3)     IF the candidate descriptor's usage_count == 0 THEN
                BREAK WHILE LOOP    /* the corresponding slot of this descriptor is the victim slot */
            ELSE
                Decrease the candidate descriptor's usage_count by 1
            END IF
        END IF
    (4) Advance nextVictimBuffer to the next one
END WHILE
(5) RETURN buffer_id of the victim
The idea is plain and simple, far simpler than a full-blown LRU algorithm.
nextVictimBuffer can be seen as the hand of a clock and the buffer pool as a ring that the hand sweeps over, round after round. If the buffer under the hand is unpinned (refcount == 0) and its usage_count == 0, that buffer is chosen as the victim; otherwise its usage_count is decremented by one and the sweep continues until a buffer satisfying the condition is found. The buffer chosen this way is always one of the less frequently used ones (clock sweep approximates LRU rather than implementing it exactly).
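To make the sweep concrete, here is a small self-contained simulation of the same idea over a toy descriptor array. It is plain C, independent of the PostgreSQL structures above; ToyBuffer, clock_sweep_victim and the other names are made up for illustration.
#include <stdio.h>
#include <stdint.h>

#define N_TOY_BUFFERS 4

typedef struct ToyBuffer
{
    int     refcount;       /* "pinned" while > 0 */
    int     usage_count;    /* bumped on every access */
} ToyBuffer;

static uint32_t next_victim = 0;    /* the clock hand */

/* Sweep the ring until an unpinned buffer with usage_count == 0 is found. */
static int
clock_sweep_victim(ToyBuffer *buffers)
{
    for (;;)
    {
        int         victim = next_victim % N_TOY_BUFFERS;
        ToyBuffer  *buf = &buffers[victim];

        next_victim++;              /* advance the hand */

        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return victim;      /* this slot is the victim */
            buf->usage_count--;     /* give it another chance on the next pass */
        }
    }
}

int
main(void)
{
    ToyBuffer   buffers[N_TOY_BUFFERS] = {
        {.refcount = 1, .usage_count = 3},  /* pinned: never chosen */
        {.refcount = 0, .usage_count = 2},
        {.refcount = 0, .usage_count = 0},  /* immediate victim candidate */
        {.refcount = 0, .usage_count = 1},
    };

    printf("victim = %d\n", clock_sweep_victim(buffers));  /* prints 2 */
    return 0;
}
Unlike this toy loop, the real StrategyGetBuffer gives up with an error after NBuffers fruitless iterations rather than spinning forever when every buffer is pinned, as the fragment below shows.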
A fragment of the code in StrategyGetBuffer:
...
    /* Nothing on the freelist, so run the "clock sweep" algorithm */
    /* NBuffers is the total number of shared buffers (declared elsewhere as int NBuffers = 1000;) */
    trycounter = NBuffers;      /* number of attempts before giving up */
    for (;;)
    {
        /* fetch the descriptor the clock hand now points at */
        buf = GetBufferDescriptor(ClockSweepTick());

        /*
         * If the buffer is pinned or has a nonzero usage_count, we cannot use
         * it; decrement the usage_count (unless pinned) and keep scanning.
         */
        local_buf_state = LockBufHdr(buf);

        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
        {
            /* refcount == 0: the buffer is not pinned */
            if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
            {
                /* usage_count != 0: decrement it and keep sweeping */
                local_buf_state -= BUF_USAGECOUNT_ONE;
                /* reset the try counter */
                trycounter = NBuffers;
            }
            else
            {
                /* usage_count == 0 */
                /* Found a usable buffer */
                if (strategy != NULL)
                    /* add it to this strategy's ring buffer */
                    AddBufferToRing(strategy, buf);
                /* set the output parameter and return the buffer */
                *buf_state = local_buf_state;
                return buf;
            }
        }
        else if (--trycounter == 0)
        {
            /* refcount != 0 (pinned) and the try counter has run out */
            /*
             * We've scanned all the buffers without making any state changes,
             * so all the buffers are pinned (or were when we looked at them).
             * We could hope that someone will free one eventually, but it's
             * probably better to fail than to risk getting stuck in an
             * infinite loop.
             */
            UnlockBufHdr(buf, local_buf_state);
            elog(ERROR, "no unpinned buffers available");
        }
        /* release the buffer header lock and sweep on */
        UnlockBufHdr(buf, local_buf_state);
    }
ClockSweepTick() is a helper routine for StrategyGetBuffer(): it moves the clock hand one buffer ahead of its current position and returns the id of the buffer now under the hand.
Its implementation is as follows:
/*
 * ClockSweepTick - Helper routine for StrategyGetBuffer()
 *
 * Move the clock hand one buffer ahead of its current position and return the
 * id of the buffer now under the hand.
 */
static inline uint32
ClockSweepTick(void)
{
    uint32      victim;

    /*
     * Atomically move hand ahead one buffer - if there's several processes
     * doing this, this can lead to buffers being returned slightly out of
     * apparent order.
     */
    victim =
        pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    if (victim >= NBuffers)
    {
        /* the counter has run past the end of the buffer array */
        /* remember the original (unwrapped) value */
        uint32      originalVictim = victim;

        /* always wrap what we look up in BufferDescriptors */
        victim = victim % NBuffers;

        /*
         * If we're the one that just caused a wraparound, force
         * completePasses to be incremented while holding the spinlock. We
         * need the spinlock so StrategySyncStart() can return a consistent
         * value consisting of nextVictimBuffer and completePasses.
         */
        if (victim == 0)
        {
            /* we caused the wraparound */
            uint32      expected;   /* value we expect the counter to hold */
            uint32      wrapped;    /* value after wrapping */
            bool        success = false;

            expected = originalVictim + 1;

            while (!success)
            {
                /*
                 * Acquire the spinlock while increasing completePasses. That
                 * allows other readers to read nextVictimBuffer and
                 * completePasses in a consistent manner which is required for
                 * StrategySyncStart(). In theory delaying the increment
                 * could lead to an overflow of nextVictimBuffers, but that's
                 * highly unlikely and wouldn't be particularly harmful.
                 */
                SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

                /* compute the wrapped value */
                wrapped = expected % NBuffers;

                /* atomically compare and swap the counter */
                success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
                                                         &expected, wrapped);
                if (success)
                    /* on success, count one more complete pass over the pool */
                    StrategyControl->completePasses++;

                /* release the spinlock */
                SpinLockRelease(&StrategyControl->buffer_strategy_lock);
            }
        }
    }
    /* return the victim buffer id */
    return victim;
}
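A worked example of the wraparound arithmetic, assuming NBuffers = 1000 just for the numbers: the backend whose fetch-and-add returns victim = 1000 is the one that pushed the counter past the end of the array. It wraps its own lookup index to 1000 % 1000 = 0, and because that wrapped value is 0 it attempts to CAS nextVictimBuffer from expected = 1001 (originalVictim + 1) down to wrapped = 1001 % 1000 = 1, incrementing completePasses if the swap succeeds. If some other backend has advanced the counter in the meantime, the CAS fails, expected is refreshed with the counter's current value, and the loop retries with a newly computed wrapped value.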
/*
 * pg_atomic_compare_exchange_u32 - CAS operation
 *
 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.
 *
 * Return true if values have been exchanged, false otherwise.
 *
 * Full barrier semantics.
 */
static inline bool
pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
                               uint32 *expected, uint32 newval)
{
    AssertPointerAlignment(ptr, 4);
    AssertPointerAlignment(expected, 4);

    return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
}
bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                    uint32 *expected, uint32 newval)
{
    bool        ret;

    /*
     * Do atomic op under a spinlock. It might look like we could just skip
     * the cmpxchg if the lock isn't available, but that'd just emulate a
     * 'weak' compare and swap. I.e. one that allows spurious failures. Since
     * several algorithms rely on a strong variant and that is efficiently
     * implementable on most major architectures let's emulate it here as
     * well.
     */
    SpinLockAcquire((slock_t *) &ptr->sema);

    /* perform compare/exchange logic */
    ret = ptr->value == *expected;  /* does ptr currently hold *expected? */
    *expected = ptr->value;         /* report ptr's current value back through *expected */
    if (ret)
        ptr->value = newval;        /* values matched, so store the new value */

    /* and release lock */
    SpinLockRelease((slock_t *) &ptr->sema);

    return ret;
}
Original post: http://blog.itpub.net/6906/viewspace-2636667/