
Raft in Practice: A Source-Code Walkthrough of Leader Election in Redis Sentinel

http://weizijun.cn/2015/04/30/Raft%E5%8D%8F%E8%AE%AE%E5%AE%9E%E6%88%98%E4%B9%8BRedis%20Sentinel%E7%9A%84%E9%80%89%E4%B8%BELeader%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90/

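Raft is a protocol for solving the consensus problem in distributed systems. For a long time, Paxos was treated as a synonym for distributed consensus. But Paxos is hard to understand and even harder to implement; even Chubby, the distributed lock service built by Google's engineers, ran into many pitfalls. Raft was designed from the start to be easy to implement and comfortable for a broad audience to understand. It also has to let people build an intuition for it, so that system builders can make the extensions that real-world systems inevitably require.

This article uses the concrete flow and source code of Leader selection in a Redis Sentinel cluster to describe the Leader election algorithm in Raft. For an introduction to Redis Sentinel, see my other article, 《redis sentinel設計與實現》 (Redis Sentinel Design and Implementation).

When a Sentinel in the cluster determines that the master is objectively down, the failover process begins. The first step of failover is to elect a Leader among the Sentinels; the Leader then carries out the failover.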

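Raft election process

A few concepts are needed before describing the Raft election process.

Node states

Raft defines three node states: Leader, Follower, and Candidate. When the system is running normally, only Leader and Follower nodes exist: one node is the Leader and all the others are Followers. Candidate is an intermediate state that appears while the system is unstable. When a Follower stops receiving heartbeats from the Leader, it becomes a Candidate and runs for Leader by sending vote requests to the other nodes. If a majority of nodes vote for it, it replaces the old Leader and becomes the new Leader, and the old Leader is demoted to Follower.

term

In a distributed system, keeping clocks synchronized across nodes is a hard problem, yet some notion of time is indispensable for recognizing stale information. Raft addresses this by introducing the term. Raft divides time into consecutive terms, which can be thought of as a kind of "logical time".

RPC

During the election phase, Raft uses two kinds of RPC: RequestVote and AppendEntries.

RequestVote is sent to the other nodes to request their votes.
AppendEntries is sent by a node that has won a majority of the votes and become Leader, to confirm its leadership to the other nodes.

Election process

Raft uses a heartbeat mechanism to trigger Leader election. When the system starts, every node is initialized as a Follower with term 0. A node that receives RequestVote or AppendEntries stays a Follower. If it receives no AppendEntries until its election timeout expires, no Leader has made itself known within that node's timeout, so the Follower converts to a Candidate and starts running for Leader. As soon as it becomes a Candidate, the node does the following:

1. Increments its own term.
2. Starts a new timer.
3. Votes for itself.
4. Sends RequestVote to all other nodes and waits for their replies.

If it receives AppendEntries from another node during this process, a Leader has already been elected, so it converts back to Follower and the election ends.

If the node receives votes from a majority of nodes before its timer expires, it becomes Leader and sends AppendEntries to all other nodes to announce its leadership.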
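The Candidate steps and the two possible outcomes described above can be sketched in C as follows. This is only a minimal illustration, not code from any Raft implementation: the raft_node struct and all function names are hypothetical, sending RequestVote over the network is left to the caller, and the term comparison in raft_on_append_entries follows the Raft paper rather than anything stated in this article.

```c
#include <stdint.h>

/* Hypothetical types and names, for illustration only. */
typedef enum { FOLLOWER, CANDIDATE, LEADER } raft_role;

typedef struct {
    raft_role role;
    int       id;
    uint64_t  current_term;
    int       voted_for;            /* -1 when no vote was cast in current_term */
    int       votes_received;
    uint64_t  election_deadline_ms;
} raft_node;

/* Every node starts as a Follower with term 0. */
void raft_init(raft_node *n, int id) {
    n->role = FOLLOWER;
    n->id = id;
    n->current_term = 0;
    n->voted_for = -1;
    n->votes_received = 0;
    n->election_deadline_ms = 0;
}

/* Called when the election timeout expires without hearing from a Leader. */
void raft_become_candidate(raft_node *n, uint64_t now_ms, uint64_t timeout_ms) {
    n->role = CANDIDATE;
    n->current_term++;                              /* 1. increment the term   */
    n->election_deadline_ms = now_ms + timeout_ms;  /* 2. start a new timer    */
    n->voted_for = n->id;                           /* 3. vote for itself      */
    n->votes_received = 1;
    /* 4. the caller now sends RequestVote to all other nodes */
}

/* AppendEntries from a Leader whose term is not older: step down. */
void raft_on_append_entries(raft_node *n, uint64_t leader_term) {
    if (leader_term >= n->current_term) {
        n->current_term = leader_term;
        n->role = FOLLOWER;
    }
}

/* One more granted vote; a strict majority of the cluster makes a Leader. */
void raft_on_vote_granted(raft_node *n, int cluster_size) {
    if (n->role != CANDIDATE) return;
    n->votes_received++;
    if (n->votes_received > cluster_size / 2) n->role = LEADER;
}
```

In a five-node cluster, a Candidate that already holds its own vote needs two more granted votes before raft_on_vote_granted promotes it to Leader.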

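Each node can cast only one vote per term, on a first-come, first-served basis. A Candidate, as noted above, has already voted for itself; a Follower votes for the node whose RequestVote it receives first. Each Follower has a timer; if the timer expires without a heartbeat RPC from the Leader, the Follower converts to Candidate and starts requesting votes, following the steps for running for Leader described above.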
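The one-vote-per-term, first-come-first-served rule can be sketched like this. The voter struct and handle_request_vote are hypothetical names, and the sketch deliberately ignores the log comparison that the full Raft protocol also performs before granting a vote.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical voter state, for illustration only. */
typedef struct {
    uint64_t current_term;
    int      voted_for;   /* -1 when no vote was cast in current_term */
} voter;

/* Returns true if the vote is granted to cand_id for cand_term. */
bool handle_request_vote(voter *v, uint64_t cand_term, int cand_id) {
    if (cand_term < v->current_term) return false;   /* stale request */
    if (cand_term > v->current_term) {               /* newer term: forget old vote */
        v->current_term = cand_term;
        v->voted_for = -1;
    }
    if (v->voted_for == -1 || v->voted_for == cand_id) {
        v->voted_for = cand_id;                      /* first request in this term wins */
        return true;
    }
    return false;                                    /* already voted for someone else */
}
```

A second Candidate asking in the same term is refused, while a request carrying a higher term resets the vote and can be granted.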

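If several Candidates request votes at once and none of them obtains a majority (a split vote), each waits for its timer to expire, becomes a Candidate again, and repeats the steps above.

Raft's timers use randomized timeouts, which is the key to electing a Leader. Each node picks its timeout at random between 1x and 2x the configured value. Because of this randomization, Followers generally become Candidates at different times; within a term, the node that becomes a Candidate first requests votes first and so wins a majority. The chance of several nodes becoming Candidates simultaneously is small. Even if several Candidates do request votes at the same time and tie within a term, the only consequence is that this term elects no Leader. Since every node's timeout is generated randomly, the node that enters the next term first has the better chance of becoming Leader. Repeated ties across many consecutive terms are very unlikely in theory and can be treated as practically impossible. A Leader is usually elected within one or two terms.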
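A randomized timeout in the 1x to 2x range could be drawn as below. This is a sketch under simple assumptions: random_election_timeout and base_ms are hypothetical names, and C's rand() stands in for whatever random source a real implementation would use.

```c
#include <stdlib.h>

/* Returns a timeout in [base_ms, 2 * base_ms), i.e. between 1x and 2x
 * the configured value. base_ms is a hypothetical configuration parameter. */
unsigned random_election_timeout(unsigned base_ms) {
    if (base_ms == 0) return 0;                 /* avoid modulo by zero */
    return base_ms + (unsigned)rand() % base_ms;
}
```

With a base of 150 ms, each node waits somewhere between 150 and 299 ms, so two Followers rarely time out in the same instant.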

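Sentinel election process

When a Sentinel cluster is running normally, every node has the same epoch. When a failover is needed, the cluster elects a Leader to perform it. Sentinel implements Leader election among Sentinels based on Raft, although it does not follow the steps in the paper exactly. Once the failover completes, all Sentinels are equal again; the Leader role exists only for the duration of the failover.

Election process

1. After a Sentinel determines that the master is objectively down, it first checks whether it has already voted. If it has already voted for another Sentinel, it will not become Leader within 2x the failover timeout. In effect it is a Follower.
2. If the Sentinel has not yet voted, it becomes a Candidate.
3. As in Raft, on becoming a Candidate the Sentinel does several things:
- 1) Sets the failover state to start.
- 2) Increments the current epoch by 1, which amounts to entering a new term; the epoch in Sentinel is the term in Raft.
- 3) Sets its timeout to the current time plus a random number of milliseconds below 1s.
- 4) Sends the is-master-down-by-addr command to the other nodes to request votes. The command carries its epoch.
- 5) Votes for itself. In Sentinel, voting means setting leader and leader_epoch in its own master structure to the Sentinel it votes for and that Sentinel's epoch.
4. The other Sentinels receive the Candidate's is-master-down-by-addr command. If a Sentinel's current epoch equals the epoch the Candidate sent, it has already set leader and leader_epoch in its master structure to another Candidate, which means it has already voted for that other Candidate. Once it has voted for another Sentinel, it can only be a Follower for the rest of the current epoch.
5. The Candidate keeps counting its votes until the number of Sentinels that accept it as Leader is a majority and also reaches its configured quorum (for quorum, see 《redis sentinel設計與實現》). Sentinel adds quorum on top of Raft, so whether a Sentinel can be elected Leader also depends on its configured quorum.
6. If the Candidate does not obtain a majority that also reaches its configured quorum within one election timeout, its election attempt fails.
7. If no Candidate obtains enough votes within an epoch, then after waiting more than 2x the failover timeout, a Candidate increments the epoch and the vote is held again.
8. If a Candidate obtains a majority that also reaches its configured quorum, it becomes Leader.
9. Unlike Raft, the Leader does not announce its leadership to the other Sentinels. Once the Leader has promoted a slave to master and the other Sentinels detect that the new master is working, they clear the objectively-down flag and so have no need to enter the failover process.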
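The winning condition in steps 5 to 8 combines two thresholds. A minimal sketch follows; sentinel_wins is a hypothetical helper name, and the arithmetic mirrors the voters/2+1 and master->quorum comparison in sentinelGetLeader, so a Candidate needs at least as many votes as the quorum, not strictly more.

```c
#include <stdbool.h>

/* votes:  votes the Candidate has collected, including its own
 * voters: all Sentinels monitoring the master, including this one
 * quorum: the quorum configured for the master */
bool sentinel_wins(unsigned votes, unsigned voters, unsigned quorum) {
    unsigned voters_quorum = voters / 2 + 1;   /* absolute majority */
    return votes >= voters_quorum && votes >= quorum;
}
```

With five Sentinels and quorum 2, three votes are enough; with quorum 4, the same three votes are not, even though they form a majority.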

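Notes on Sentinel timeouts

Sentinel's timeout mechanism involves several time-related fields.

failover_start_time: the time at which the next election starts. It is set to the current time plus a random number of milliseconds below 1s.
failover_state_change_time: the time of the last state change during failover.
failover_timeout: the failover timeout. The default is 3 minutes.
election_timeout: the election timeout, which is the minimum of the default election timeout and failover_timeout. The default is 10s.

After a Follower becomes a Candidate, it sets failover_start_time to the current time plus a random number of milliseconds below 1s, and sets failover_state_change_time to the current time.

If the Candidate's current time minus failover_start_time exceeds election_timeout, the Candidate has not obtained enough votes and the election for this epoch has timed out, so it reverts to Follower. As long as mstime() - failover_start_time < failover_timeout*2 holds, it cannot start another failover; it gets its next chance to become a Candidate only after twice failover_timeout has elapsed.

If a Follower sets some Candidate as the Leader it recognizes, its failover_start_time is set to the current time plus a random number of milliseconds below 1s. It then falls under the same rule: while mstime() - failover_start_time < failover_timeout*2 holds, it cannot become a Candidate.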
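The three timing rules can be written out as small predicates. This is a sketch with hypothetical function names; the comparisons follow the checks in sentinelFailoverWaitStart and sentinelStartFailoverIfNeeded, and the 10s constant is the default election timeout quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;                  /* milliseconds */
#define DEFAULT_ELECTION_TIMEOUT_MS 10000  /* the 10s default quoted above */

/* election_timeout = min(default election timeout, failover_timeout) */
mstime_t election_timeout_ms(mstime_t failover_timeout) {
    return failover_timeout < DEFAULT_ELECTION_TIMEOUT_MS
         ? failover_timeout : DEFAULT_ELECTION_TIMEOUT_MS;
}

/* A Candidate gives up once election_timeout has passed since failover_start_time. */
bool election_expired(mstime_t now, mstime_t failover_start_time,
                      mstime_t failover_timeout) {
    return now - failover_start_time > election_timeout_ms(failover_timeout);
}

/* A new attempt is blocked while now - failover_start_time < 2 * failover_timeout. */
bool may_start_failover(mstime_t now, mstime_t failover_start_time,
                        mstime_t failover_timeout) {
    return now - failover_start_time >= failover_timeout * 2;
}
```

With the default failover_timeout of 3 minutes (180000 ms), an unsuccessful Candidate gives up after 10 seconds and may try again 6 minutes after its failover_start_time.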

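Sentinels do not all detect the objectively-down condition at the same moment; there is usually an order, so the Sentinel that starts first has a better chance of winning more votes. In addition, failover_start_time includes a random offset of up to 1s, which spreads the nodes' timeouts apart. In my many trials, Leader election among Sentinels essentially always completed within one epoch.

Sentinel election source code walkthrough

Almost all of the code for Sentinel's election process is in sentinel.c. The following explains the election process alongside the source.

The periodic task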

    void sentinelHandleRedisInstance(sentinelRedisInstance *ri) {

        ...

        // Check whether the instance has entered the SDOWN state
        sentinelCheckSubjectivelyDown(ri);

        /* Masters and slaves */
        if (ri->flags & (SRI_MASTER|SRI_SLAVE)) {
            /* Nothing so far. */
        }

        if (ri->flags & SRI_MASTER) {

            // Check whether the master has entered the ODOWN state
            sentinelCheckObjectivelyDown(ri);

            // Check whether a failover needs to start
            if (sentinelStartFailoverIfNeeded(ri))
                // Send SENTINEL is-master-down-by-addr to the other Sentinels
                // and refresh their view of the master's state
                sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);

            // Run the failover
            sentinelFailoverStateMachine(ri);

            // This call only collects the other Sentinels' judgment of whether
            // the master is alive, for the next ODOWN check
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
        }
    }

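Sentinel runs sentinelHandleRedisInstance every 100ms. It checks whether the master has entered SDOWN, then whether it has entered ODOWN, then whether a failover needs to start; if a failover starts, it solicits votes from the other nodes. Next comes the failover state machine, which performs different operations depending on failover_state; in normal operation failover_state is SENTINEL_FAILOVER_STATE_NONE.

Requesting votes, or the master's liveness status, from other Sentinels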

    #define SENTINEL_ASK_FORCED (1<<0)
    void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
        dictIterator *di;
        dictEntry *de;

        // Iterate over all Sentinels monitoring the same master
        // and send them the SENTINEL is-master-down-by-addr command
        di = dictGetIterator(master->sentinels);
        while((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *ri = dictGetVal(de);

            // Time since this Sentinel last replied to
            // SENTINEL is-master-down-by-addr
            mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
            char port[32];
            int retval;

            /* If the master state from other sentinel is too old, we clear it. */
            if (elapsed > SENTINEL_ASK_PERIOD*5) {
                ri->flags &= ~SRI_MASTER_DOWN;
                sdsfree(ri->leader);
                ri->leader = NULL;
            }

            /* Only ask if master is down to other sentinels if:
             *
             * 1) We believe it is down, or there is a failover in progress.
             * 2) Sentinel is connected.
             * 3) We did not receive the info within SENTINEL_ASK_PERIOD ms.
             * 4) (Annotation) Conditions 1 and 2 hold but 3 does not, and
             *    the flags argument includes SENTINEL_ASK_FORCED. */
            if ((master->flags & SRI_S_DOWN) == 0) continue;
            if (ri->flags & SRI_DISCONNECTED) continue;
            if (!(flags & SENTINEL_ASK_FORCED) &&
                mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
                continue;

            /* Ask */
            ll2string(port,sizeof(port),master->addr->port);
            retval = redisAsyncCommand(ri->cc,
                        sentinelReceiveIsMasterDownReply, NULL,
                        "SENTINEL is-master-down-by-addr %s %s %llu %s",
                        master->addr->ip, port,
                        sentinel.current_epoch,
                        // If this Sentinel has detected ODOWN and is starting
                        // a failover, it sends its own run ID so that the peer
                        // votes for it (if the peer has not yet voted in this epoch)
                        (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                        server.runid : "*");
            if (retval == REDIS_OK) ri->pending_commands++;
        }
        dictReleaseIterator(di);
    }

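For every instance, Sentinel checks whether it is SDOWN; for a master it also checks ODOWN. sentinelAskMasterStateToOtherSentinels sends the sentinel is-master-down-by-addr command once the master is in SDOWN or ODOWN. In SDOWN, the command is used to obtain the other Sentinels' view of whether the master is alive; in ODOWN, it is used to request votes from the other nodes. In SDOWN, flags is SENTINEL_NO_FLAGS; when a failover starts after ODOWN, flags is SENTINEL_ASK_FORCED.

Checking whether to start a failover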

    /* This function checks if there are the conditions to start the failover,
     * that is:
     *
     * 1) Master must be in ODOWN condition.
     * 2) No failover already in progress.
     * 3) No failover already attempted recently.
     *    (Annotation: presumably to keep failovers from running too often.)
     *
     * We still don't know if we'll win the election so it is possible that we
     * start the failover but that we'll not be able to act.
     *
     * Return non-zero if a failover was started. */
    int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) {
        /* We can't failover if the master is not in O_DOWN state. */
        if (!(master->flags & SRI_O_DOWN)) return 0;

        /* Failover already in progress? */
        if (master->flags & SRI_FAILOVER_IN_PROGRESS) return 0;

        /* Last failover attempt started too little time ago? */
        if (mstime() - master->failover_start_time <
            master->failover_timeout*2)
        {
            if (master->failover_delay_logged != master->failover_start_time) {
                time_t clock = (master->failover_start_time +
                                master->failover_timeout*2) / 1000;
                char ctimebuf[26];

                ctime_r(&clock,ctimebuf);
                ctimebuf[24] = '\0'; /* Remove newline. */
                master->failover_delay_logged = master->failover_start_time;
                redisLog(REDIS_WARNING,
                    "Next failover delay: I will not start a failover before %s",
                    ctimebuf);
            }
            return 0;
        }

        sentinelStartFailover(master);
        return 1;
    }

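sentinelStartFailoverIfNeeded checks that the master is in the ODOWN state. Because the periodic task reaches this function on every run, it also has to check whether a failover is already under way (SRI_FAILOVER_IN_PROGRESS). It then checks whether the failover delay has elapsed. A failover can start only when all of these conditions hold. On the delay check: failover_start_time defaults to 0 and is modified in two places, once when a failover starts and once when the Sentinel votes for another Sentinel in response to a vote request. In both cases it is set to mstime()+rand()%SENTINEL_MAX_DESYNC, which is the random factor Raft describes. SENTINEL_MAX_DESYNC is 1000, so failover_start_time is the current time plus a random value below 1s, which ensures that after a timeout different Sentinels make their next bid for Leader at randomly different times. Starting a failover therefore sets failover_start_time, just as Raft describes with "start a new timer". Because voting also sets failover_start_time, a node that votes first and then passes the ODOWN and SRI_FAILOVER_IN_PROGRESS checks fails the delay check. It is the equivalent of a Raft Follower and does not compete for Leader.

Becoming a Candidate and running for Leader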

    /* Setup the master state to start a failover. */
    void sentinelStartFailover(sentinelRedisInstance *master) {
        redisAssert(master->flags & SRI_MASTER);

        // Update the failover state
        master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;

        // Mark the master as having a failover in progress
        master->flags |= SRI_FAILOVER_IN_PROGRESS;

        // Advance the epoch
        master->failover_epoch = ++sentinel.current_epoch;

        sentinelEvent(REDIS_WARNING,"+new-epoch",master,"%llu",
            (unsigned long long) sentinel.current_epoch);

        sentinelEvent(REDIS_WARNING,"+try-failover",master,"%@");

        // Record the failover start time (with a random offset)
        // and the time of this state change
        master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
        master->failover_state_change_time = mstime();
    }

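If a Sentinel passes the three checks and enters sentinelStartFailover, it has effectively become a Candidate. It does the following:

1. Sets failover_state to SENTINEL_FAILOVER_STATE_WAIT_START.
2. Sets the master's SRI_FAILOVER_IN_PROGRESS flag to mark a failover in progress.
3. Increments the Sentinel's current_epoch and assigns the result to the master's failover_epoch.
4. Sets failover_start_time to mstime()+rand()%SENTINEL_MAX_DESYNC.
5. Sets failover_state_change_time to mstime().

sentinelStartFailover completes the first two steps of becoming a Candidate. Control then returns to the periodic task, sentinelHandleRedisInstance. Because sentinelStartFailoverIfNeeded returned 1, the if branch runs sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED); and starts soliciting votes from the other Sentinels. After that comes sentinelFailoverStateMachine.

Follower voting

First, the source code for voting.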

    /* Vote for the sentinel with 'req_runid' or return the old vote if already
     * voted for the specified 'req_epoch' or one greater.
     *
     * (Annotation) Two extra cases:
     * 1) If this Sentinel already voted in req_epoch, the earlier vote is returned.
     * 2) If it already voted in an epoch greater than req_epoch, that vote is returned.
     *
     * If a vote is not available returns NULL, otherwise return the Sentinel
     * runid and populate the leader_epoch with the epoch of the vote. */
    char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
        if (req_epoch > sentinel.current_epoch) {
            sentinel.current_epoch = req_epoch;
            sentinelFlushConfig();
            sentinelEvent(REDIS_WARNING,"+new-epoch",master,"%llu",
                (unsigned long long) sentinel.current_epoch);
        }

        if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
        {
            sdsfree(master->leader);
            master->leader = sdsnew(req_runid);
            master->leader_epoch = sentinel.current_epoch;
            sentinelFlushConfig();
            sentinelEvent(REDIS_WARNING,"+vote-for-leader",master,"%s %llu",
                master->leader, (unsigned long long) master->leader_epoch);
            /* If we did not vote for ourselves, set the master failover start
             * time to now, in order to force a delay before we can start a
             * failover for the same master. */
            if (strcasecmp(master->leader,server.runid))
                master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
        }

        *leader_epoch = master->leader_epoch;
        return master->leader ? sdsnew(master->leader) : NULL;
    }

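As noted earlier, a Candidate increments the current epoch by 1 when it starts running, which puts it one ahead of the Followers. When a Follower receives the first Candidate's vote request, its own current epoch is lower than the Candidate's, so it raises its current epoch to the Candidate's and sets that Candidate as the Leader it recognizes. When other Candidates later request this Follower's vote, their epoch equals the epoch of the Leader it has already chosen, so it does not change its choice. Thus, within one epoch, a Follower casts only one vote, for the first Candidate whose request it receives. The final if (strcasecmp(master->leader,server.runid)) sets failover_start_time so that the Follower cannot become a Candidate in the current epoch.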
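The epoch logic of sentinelVoteLeader can be tried out in isolation with a stripped-down model. The mini_sentinel struct and mini_vote function below are hypothetical; they keep only the two epoch comparisons from the source and drop config flushing, event logging, and the failover_start_time update.

```c
#include <stdint.h>
#include <string.h>

#define RUNID_LEN 40

/* Hypothetical, reduced state: one Sentinel's view of a single master. */
typedef struct {
    uint64_t current_epoch;
    uint64_t leader_epoch;
    char     leader[RUNID_LEN + 1];   /* empty string = no vote yet */
} mini_sentinel;

/* Same two comparisons as sentinelVoteLeader; returns the run ID this
 * Sentinel has voted for, or NULL if it has not voted at all. */
const char *mini_vote(mini_sentinel *s, uint64_t req_epoch, const char *req_runid) {
    if (req_epoch > s->current_epoch)
        s->current_epoch = req_epoch;             /* catch up to the newer epoch */

    if (s->leader_epoch < req_epoch && s->current_epoch <= req_epoch) {
        strncpy(s->leader, req_runid, RUNID_LEN); /* first request in this epoch wins */
        s->leader[RUNID_LEN] = '\0';
        s->leader_epoch = s->current_epoch;
    }
    return s->leader[0] ? s->leader : NULL;
}
```

Two requests carrying the same epoch both get the first requester's run ID back; only a request with a higher epoch changes the vote.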

Sentinel's failover state machine

    void sentinelFailoverStateMachine(sentinelRedisInstance *ri) {
        redisAssert(ri->flags & SRI_MASTER);

        if (!(ri->flags & SRI_FAILOVER_IN_PROGRESS)) return;

        switch(ri->failover_state) {
            case SENTINEL_FAILOVER_STATE_WAIT_START:
                // Count the votes and check whether we became leader
                sentinelFailoverWaitStart(ri);
                break;
            case SENTINEL_FAILOVER_STATE_SELECT_SLAVE:
                // Pick the best slave from the slave list
                sentinelFailoverSelectSlave(ri);
                break;
            case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
                // Promote the selected slave to master
                sentinelFailoverSendSlaveOfNoOne(ri);
                break;
            case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
                // Wait for the promotion to take effect; if it times out,
                // select a new master again
                sentinelFailoverWaitPromotion(ri);
                break;
            case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
                // Send SLAVEOF to the slaves so they replicate from the new master
                sentinelFailoverReconfNextSlave(ri);
                break;
        }
    }

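Sentinel handles failover with a state machine: each state handles a different task, and when the task completes the state advances to the next one. sentinelFailoverStateMachine uses failover_state to decide which step to run. The Leader is elected in sentinelFailoverWaitStart; the remaining states are the steps the Leader takes to perform the failover.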
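The success path through these states can be pictured with a small transition function. This is a simplified sketch: the FS_ enum and next_state are hypothetical, cover only the states handled in the switch of sentinelFailoverStateMachine, and ignore timeouts and aborts.

```c
/* Hypothetical enum mirroring only the states shown in the article. */
typedef enum {
    FS_NONE,
    FS_WAIT_START,
    FS_SELECT_SLAVE,
    FS_SEND_SLAVEOF_NOONE,
    FS_WAIT_PROMOTION,
    FS_RECONF_SLAVES
} failover_state;

/* Advance one step on the success path. task_done tells whether the task
 * belonging to the current state has finished; if not, the state is kept. */
failover_state next_state(failover_state s, int task_done) {
    if (!task_done) return s;
    switch (s) {
    case FS_WAIT_START:         return FS_SELECT_SLAVE;       /* elected leader     */
    case FS_SELECT_SLAVE:       return FS_SEND_SLAVEOF_NOONE; /* slave chosen       */
    case FS_SEND_SLAVEOF_NOONE: return FS_WAIT_PROMOTION;     /* command sent       */
    case FS_WAIT_PROMOTION:     return FS_RECONF_SLAVES;      /* promotion observed */
    default:                    return s;                     /* no further step in this sketch */
    }
}
```

Each run of the periodic task would call the handler for the current state and then apply a transition like this one when that handler's task completes.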

Checking whether this Sentinel has become Leader

    void sentinelFailoverWaitStart(sentinelRedisInstance *ri) {
        char *leader;
        int isleader;

        /* Check if we are the leader for the failover epoch. */
        leader = sentinelGetLeader(ri, ri->failover_epoch);
        isleader = leader && strcasecmp(leader,server.runid) == 0;
        sdsfree(leader);

        /* If I'm not the leader, and it is not a forced failover via
         * SENTINEL FAILOVER, then I can't continue with the failover. */
        if (!isleader && !(ri->flags & SRI_FORCE_FAILOVER)) {
            int election_timeout = SENTINEL_ELECTION_TIMEOUT;

            /* The election timeout is the MIN between SENTINEL_ELECTION_TIMEOUT
             * and the configured failover timeout. */
            if (election_timeout > ri->failover_timeout)
                election_timeout = ri->failover_timeout;

            /* Abort the failover if I'm not the leader after some time. */
            if (mstime() - ri->failover_start_time > election_timeout) {
                sentinelEvent(REDIS_WARNING,"-failover-abort-not-elected",ri,"%@");
                sentinelAbortFailover(ri);
            }
            return;
        }

        // This Sentinel is the leader: start performing the failover
        sentinelEvent(REDIS_WARNING,"+elected-leader",ri,"%@");

        // Move to the slave-selection state
        ri->failover_state = SENTINEL_FAILOVER_STATE_SELECT_SLAVE;
        ri->failover_state_change_time = mstime();

        sentinelEvent(REDIS_WARNING,"+failover-state-select-slave",ri,"%@");
    }

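As described above, sentinelStartFailover sets failover_state to SENTINEL_FAILOVER_STATE_WAIT_START, so the state machine enters sentinelFailoverWaitStart.

sentinelFailoverWaitStart first checks whether a Leader has been elected. If the Leader is this Sentinel, or this is a forced failover, failover_state is set to SENTINEL_FAILOVER_STATE_SELECT_SLAVE. A forced failover is triggered by Sentinel's SENTINEL FAILOVER <master-name> command and is not discussed here.

If this Sentinel is elected Leader, it moves to the next state and starts the failover. If it has not been elected within election_timeout, the Candidate has lost in this epoch. It then either waits out the failover delay (2x failover_timeout) before running again or, if a Leader was elected in this epoch, goes back to being a Follower.

Counting votes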

    /* Scan all the Sentinels attached to this master to check if there
     * is a leader for the specified epoch.
     *
     * To be a leader for a given epoch, we should have the majority of
     * the Sentinels we know that reported the same instance as
     * leader for the same epoch. */
    char *sentinelGetLeader(sentinelRedisInstance *master, uint64_t epoch) {
        dict *counters;
        dictIterator *di;
        dictEntry *de;
        unsigned int voters = 0, voters_quorum;
        char *myvote;
        char *winner = NULL;
        uint64_t leader_epoch;
        uint64_t max_votes = 0;

        redisAssert(master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS));

        // Vote counters, keyed by run ID
        counters = dictCreate(&leaderVotesDictType,NULL);

        /* Count other sentinels votes */
        di = dictGetIterator(master->sentinels);
        while((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *ri = dictGetVal(de);

            // Add one vote for the leader this Sentinel reported
            if (ri->leader != NULL && ri->leader_epoch == sentinel.current_epoch)
                sentinelLeaderIncr(counters,ri->leader);

            // Count the voters
            voters++;
        }
        dictReleaseIterator(di);

        /* Check what's the winner. For the winner to win, it needs two conditions:
         * 1) Absolute majority between voters (50% + 1).
         * 2) And anyway at least master->quorum votes. */
        di = dictGetIterator(counters);
        while((de = dictNext(di)) != NULL) {
            uint64_t votes = dictGetUnsignedIntegerVal(de);

            // Keep the run ID with the most votes
            if (votes > max_votes) {
                max_votes = votes;
                winner = dictGetKey(de);
            }
        }
        dictReleaseIterator(di);

        /* Count this Sentinel vote:
         * if this Sentinel did not vote yet, either vote for the most
         * common voted sentinel, or for itself if no vote exists at all. */
        if (winner)
            myvote = sentinelVoteLeader(master,epoch,winner,&leader_epoch);
        else
            myvote = sentinelVoteLeader(master,epoch,server.runid,&leader_epoch);

        // A vote exists and it belongs to the requested epoch
        if (myvote && leader_epoch == epoch) {
            // Add this Sentinel's own vote
            uint64_t votes = sentinelLeaderIncr(counters,myvote);

            // If that makes it the highest count, it becomes the winner
            if (votes > max_votes) {
                max_votes = votes;
                winner = myvote;
            }
        }
        voters++; /* Anyway, count me as one of the voters. */

        // The election is void if the winner lacks a majority
        // or has fewer votes than the quorum configured for the master
        voters_quorum = voters/2+1;
        if (winner && (max_votes < voters_quorum || max_votes < master->quorum))
            winner = NULL;

        // Return the leader's run ID, or NULL
        winner = winner ? sdsnew(winner) : NULL;
        sdsfree(myvote);
        dictRelease(counters);
        return winner;
    }

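sentinelGetLeader tallies the votes reported by all the other Sentinels. If some Sentinel has obtained a majority of the votes and at least the master's quorum, the Leader has been elected.

The first time a Candidate enters sentinelGetLeader, it has not yet requested votes from the other Sentinels, so winner is NULL and it votes for itself. This is the "3. Votes for itself" step from the Raft description above, which completes the four steps that precede the election. On later calls, sentinelGetLeader can count the votes from the other Sentinels. When some Sentinel's vote count reaches a majority and the quorum, that Sentinel is returned, and sentinelFailoverWaitStart checks whether it is this Sentinel. If it is, this Sentinel has become Leader and starts the failover; if not, it waits for the election to time out and becomes a Follower.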
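The tally itself can be modelled without Redis's dict type. The sketch below is an assumption-laden simplification: tally_leader is a hypothetical function, the votes are passed in as a plain array that already includes this Sentinel's own vote, and a NULL entry stands for a Sentinel that has not reported a vote for the current epoch.

```c
#include <stddef.h>
#include <string.h>

/* votes[i] is the run ID voter i reported for the current epoch, or NULL.
 * n counts every voter, including this Sentinel. Returns the winning run ID,
 * or NULL if nobody reaches both the majority and the quorum. */
const char *tally_leader(const char *votes[], size_t n, unsigned quorum) {
    const char *winner = NULL;
    unsigned max_votes = 0;

    for (size_t i = 0; i < n; i++) {
        if (votes[i] == NULL) continue;
        unsigned count = 0;                       /* votes for votes[i] */
        for (size_t j = 0; j < n; j++)
            if (votes[j] != NULL && strcmp(votes[i], votes[j]) == 0) count++;
        if (count > max_votes) {                  /* keep the most voted run ID */
            max_votes = count;
            winner = votes[i];
        }
    }

    unsigned voters_quorum = (unsigned)(n / 2 + 1);
    if (winner && (max_votes < voters_quorum || max_votes < quorum))
        winner = NULL;                            /* no valid leader this epoch */
    return winner;
}
```

The quadratic counting loop is fine for the handful of Sentinels in a typical deployment; the real code uses a dictionary of counters instead.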

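How the Leader informs the other Sentinels that it has become Leader

Sentinel has no dedicated logic for the Leader to announce its election victory to the other Sentinels. Once a Sentinel becomes Leader, it quietly gets to work. During failover, when the Leader sees from the INFO output of the selected slave that the slave has confirmed its master role, the Leader sets config_epoch to the latest epoch.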

    /* Process the INFO output from masters. */
    void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
        ...
        /* Handle slave -> master role switch. */
        if ((ri->flags & SRI_SLAVE) && role == SRI_MASTER) {
            /* If this is a promoted slave we can change state to the
             * failover state machine. */
            if ((ri->master->flags & SRI_FAILOVER_IN_PROGRESS) &&
                (ri->master->failover_state ==
                    SENTINEL_FAILOVER_STATE_WAIT_PROMOTION))
            {
                /* Now that we are sure the slave was reconfigured as a master
                 * set the master configuration epoch to the epoch we won the
                 * election to perform this failover. This will force the other
                 * Sentinels to update their config (assuming there is not
                 * a newer one already available). */
                ri->master->config_epoch = ri->master->failover_epoch;
                ri->master->failover_state = SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
                ri->master->failover_state_change_time = mstime();
                sentinelFlushConfig();
                sentinelEvent(REDIS_WARNING,"+promoted-slave",ri,"%@");
                sentinelEvent(REDIS_WARNING,"+failover-state-reconf-slaves",
                    ri->master,"%@");
                sentinelCallClientReconfScript(ri->master,SENTINEL_LEADER,
                    "start",ri->master->addr,ri->addr);
                sentinelForceHelloUpdateForMaster(ri->master);
            }
            ...
        }
        ...
    }

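config_epoch is sent to the other Sentinels over the hello channel. When the other Sentinels see a newer config_epoch, they update to the latest master address and config_epoch. This is effectively how the Leader informs the other Sentinels of its election.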

    /* Process an hello message received via Pub/Sub in master or slave instance,
     * or sent directly to this sentinel via the (fake) PUBLISH command of Sentinel.
     *
     * If the master name specified in the message is not known, the message is
     * discarded. */
    void sentinelProcessHelloMessage(char *hello, int hello_len) {
        ...
            /* Update master info if received configuration is newer. */
            if (master->config_epoch < master_config_epoch) {
                master->config_epoch = master_config_epoch;
                if (master_port != master->addr->port ||
                    strcmp(master->addr->ip, token[5]))
                {
                    sentinelAddr *old_addr;

                    sentinelEvent(REDIS_WARNING,"+config-update-from",si,"%@");
                    sentinelEvent(REDIS_WARNING,"+switch-master",
                        master,"%s %s %d %s %d",
                        master->name,
                        master->addr->ip, master->addr->port,
                        token[5], master_port);

                    old_addr = dupSentinelAddr(master->addr);
                    sentinelResetMasterAndChangeAddress(master, token[5], master_port);
                    sentinelCallClientReconfScript(master,
                        SENTINEL_OBSERVER,"start",
                        old_addr,master->addr);
                    releaseSentinelAddr(old_addr);
                }
            }
        ...
    }
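The update rule in sentinelProcessHelloMessage comes down to "accept only a newer config_epoch, and switch address only if it differs". A reduced sketch of that rule follows; master_view and apply_hello are hypothetical names, and events, scripts, and the instance reset are left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, reduced view of a monitored master. */
typedef struct {
    uint64_t config_epoch;
    char     ip[46];
    int      port;
} master_view;

/* Applies a hello message. Returns true only when the master address was
 * switched; an older or equal config_epoch leaves the view untouched. */
bool apply_hello(master_view *m, uint64_t hello_epoch,
                 const char *hello_ip, int hello_port) {
    if (m->config_epoch >= hello_epoch) return false;   /* not newer: ignore */
    m->config_epoch = hello_epoch;
    if (hello_port == m->port && strcmp(m->ip, hello_ip) == 0)
        return false;                                   /* same address, newer epoch */
    snprintf(m->ip, sizeof(m->ip), "%s", hello_ip);     /* switch to the new master */
    m->port = hello_port;
    return true;
}
```

A Sentinel that was never Leader therefore learns about the promoted master simply by seeing a hello message whose config_epoch is higher than its own.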

References:

Redis 2.8.19 source code

http://redis.io/topics/sentinel

"In Search of an Understandable Consensus Algorithm", Diego Ongaro and John Ousterhout, Stanford University

《Redis設計與實現》 (Redis Design and Implementation), 黃健宏 (Huang Jianhong), China Machine Press

Posted on 2016-12-14 18:33 by jinfeng_wang. Views (2085), comments (0). Category: 2016-REDIS
