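Tech Share | A Source Code Walkthrough of OceanBase Write Throttling

2023-05-08

Author: 陈慧明

A member of the ActionTech (爱可生) testing team, mainly involved in the dmp and dble automated testing projects.

Source: original contribution

* Produced by the ActionTech open source community. Original content may not be used without authorization; to repost, please contact the editor and credit the source.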

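I. Introduction

The write throttling mechanism in OceanBase controls the rate of write operations (typically inserts, updates, and deletes) in order to keep the database system stable. This article explains how write throttling is implemented through the following two parameters.

writing_throttling_trigger_percentage: the Memstore memory usage threshold, as a percentage. When Memstore usage reaches this threshold, write throttling is triggered. The default is 60 and the valid range is 1 to 100 (100 disables write throttling).
writing_throttling_maximum_duration: the time it should take to allocate the remaining Memstore memory once write throttling has been triggered. The default is 2 hours. This parameter normally does not need to be changed.

Note that this mechanism is supported only in OceanBase 2.2.30 and later.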

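II. How It Works

1. Entering the throttling logic

When a write operation requests memory, the throttling condition is checked: if the Memstore memory already in use exceeds the configured ratio (set by writing_throttling_trigger_percentage), the system enters the throttling logic.

2. Throttling in multiple rounds

The throttling logic splits the wait for the current memory request into multiple rounds. Each round sleeps for at most 20 milliseconds.

3. Throttling loop

The rounds run in a loop, and on each iteration the system checks the following conditions:

Memory released: enough memory has been freed that the condition for entering throttling no longer holds, so throttling is no longer needed.
SQL execution time limit: the SQL statement has reached its execution time limit.
Sleep time: the accumulated sleep time has reached 20 seconds.

4. Finishing throttling

If any one of the conditions above is met, the system exits the throttling loop and sends the SQL completion message to the client. This ensures that the SQL statement can complete successfully while the system stays stable.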
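The loop described in steps 2 to 4 can be sketched roughly as follows. This is a simplified model, not OceanBase code: `throttle`, `ThrottleResult`, `ExitReason`, and the `still_over_threshold` callback are hypothetical names introduced for illustration, and the sleep is only accounted for rather than performed. The 20 ms and 20 s constants come from the description above.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Hypothetical result type for this sketch: why the loop stopped and how long it "slept".
enum class ExitReason { MemoryReleased, Timeout, MaxSleepReached };

struct ThrottleResult {
  ExitReason reason;
  int64_t slept_us;  // total simulated sleep time in microseconds
  int rounds;        // number of sleep rounds
};

// Values taken from the description above: 20 ms per round, 20 s in total.
constexpr int64_t kSleepPerRoundUs = 20 * 1000;
constexpr int64_t kMaxSleepUs = 20 * 1000 * 1000;

// time_left_us: time remaining before the statement times out.
// still_over_threshold(slept_us): stand-in for "is Memstore usage still above the trigger?".
ThrottleResult throttle(int64_t time_left_us,
                        const std::function<bool(int64_t)>& still_over_threshold) {
  int64_t budget = std::min(kMaxSleepUs, time_left_us);
  const bool capped_by_timeout = time_left_us < kMaxSleepUs;
  ThrottleResult r{ExitReason::MemoryReleased, 0, 0};
  while (still_over_threshold(r.slept_us)) {
    if (budget <= 0) {
      r.reason = capped_by_timeout ? ExitReason::Timeout : ExitReason::MaxSleepReached;
      return r;
    }
    const int64_t step = std::min(budget, kSleepPerRoundUs);
    // A real implementation would sleep for `step` microseconds here.
    r.slept_us += step;
    budget -= step;
    ++r.rounds;
  }
  return r;  // memory was released, so throttling is no longer needed
}
```

For example, if memory is released after 50 ms, the sketch stops after three 20 ms rounds; if only 30 ms remain before the timeout, it stops after 30 ms regardless of memory usage.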

5. Flow reference

[Figure: flowchart of the write throttling process]

III. Source Code Walkthrough

The following uses part of the call stack of an insert statement to show how writing_throttling_trigger_percentage and writing_throttling_maximum_duration affect the throttling logic.

ObTablet::insert_row_without_rowkey_check()

int ObTablet::insert_row_without_rowkey_check(
    ObRelativeTable &relative_table,
    ObStoreCtx &store_ctx,
    const common::ObIArray<share::schema::ObColDesc> &col_descs,
    const storage::ObStoreRow &row)
{
  int ret = OB_SUCCESS;
  {
    // When this scope exits, the destructor of ObStorageTableGuard runs and performs the throttling
    ObStorageTableGuard guard(this, store_ctx, true);
    ObMemtable *write_memtable = nullptr;
    ...
    // write_memtable->set() calls ObFifoArena::alloc() to request memory; the throttling check happens during allocation
    else if (OB_FAIL(write_memtable->set(store_ctx, relative_table.get_table_id(),
    full_read_info_, col_descs, row)))
    ...
  }
  return ret;
}

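This method instantiates the ObStorageTableGuard class. The throttling itself is executed in that class's destructor, so the program throttles only after write_memtable->set() has finished. The memtable write path that follows is not covered in detail here; the call stack is roughly as follows: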

| > oceanbase::storage::ObTablet::insert_row_without_rowkey_check(...) (/src/storage/tablet/ob_tablet.cpp:1425)
| + > oceanbase::memtable::ObMemtable::set(...) (/src/storage/memtable/ob_memtable.cpp:339)
| + - > oceanbase::memtable::ObMemtable::set_(...) (/src/storage/memtable/ob_memtable.cpp:2538)
| + - x > oceanbase::memtable::ObMemtable::mvcc_write_(...) (/src/storage/memtable/ob_memtable.cpp:2655)
| + - x = > oceanbase::memtable::ObMvccEngine::create_kv(...) (/src/storage/memtable/mvcc/ob_mvcc_engine.cpp:275)
| + - x = | > oceanbase::memtable::ObMTKVBuilder::dup_key(...) (/src/storage/memtable/ob_memtable.h:77)
| + - x = | + > oceanbase::common::ObGMemstoreAllocator::AllocHandle::alloc(...) (/src/share/allocator/ob_gmemstore_allocator.h:84)
| + - x = | + - > oceanbase::common::ObGMemstoreAllocator::alloc(...) (/src/share/allocator/ob_gmemstore_allocator.cpp:125)
| + - x = | + - x > oceanbase::common::ObFifoArena::alloc(...) (/src/share/allocator/ob_fifo_arena.cpp:157)
| + - x = | + - x = > oceanbase::common::ObFifoArena::speed_limit(...)(/src/share/allocator/ob_fifo_arena.cpp:301)
| + - x = | + - x = | > oceanbase::common::ObFifoArena::ObWriteThrottleInfo::check_and_calc_decay_factor(...)(/src/share/allocator/ob_fifo_arena.cpp:75)
| + > oceanbase::storage::ObStorageTableGuard::~ObStorageTableGuard(...) (/src/storage/ob_storage_table_guard.cpp:53)

ObFifoArena::alloc()

Writing to the memtable requests memory, and this is where the need for throttling is checked.

void* ObFifoArena::alloc(int64_t adv_idx, Handle& handle, int64_t size)
{
  int ret = OB_SUCCESS;
  void* ptr = NULL;
  int64_t rsize = size + sizeof(Page) + sizeof(Ref);
  // Call speed_limit() to decide whether throttling is needed
  speed_limit(ATOMIC_LOAD(&hold_), size);
  ...
}

ObFifoArena::speed_limit()

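This method decides whether throttling is needed and, based on the configured writing_throttling_maximum_duration, computes a decay factor that is later used to calculate the wait time.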

void ObFifoArena::speed_limit(const int64_t cur_mem_hold, const int64_t alloc_size)
{
  int ret = OB_SUCCESS;
  // Get the tenant's writing_throttling_trigger_percentage value
  int64_t trigger_percentage = get_writing_throttling_trigger_percentage_();
  int64_t trigger_mem_limit = 0;
  bool need_speed_limit = false;
  int64_t seq = 0;
  int64_t throttling_interval = 0;
  // trigger_percentage < 100 means throttling is enabled; then check whether memory usage has reached the trigger threshold
  if (trigger_percentage < 100) {
    if (OB_UNLIKELY(cur_mem_hold < 0 || alloc_size <= 0 || lastest_memstore_threshold_ <= 0 || trigger_percentage <= 0)) {
      COMMON_LOG(ERROR, "invalid arguments", K(cur_mem_hold), K(alloc_size), K(lastest_memstore_threshold_), K(trigger_percentage));
    } else if (cur_mem_hold > (trigger_mem_limit = lastest_memstore_threshold_ * trigger_percentage / 100)) {
      // Memory in use exceeds the trigger threshold, so throttling is needed: set need_speed_limit to true
      need_speed_limit = true;
      // Get the value of writing_throttling_maximum_duration (2h by default)
      int64_t alloc_duration = get_writing_throttling_maximum_duration_();
      // Compute the decay factor used to calculate the sleep time
      if (OB_FAIL(throttle_info_.check_and_calc_decay_factor(lastest_memstore_threshold_, trigger_percentage, alloc_duration))) {
        COMMON_LOG(WARN, "failed to check_and_calc_decay_factor", K(cur_mem_hold), K(alloc_size), K(throttle_info_));
      }
    }

    // Advance the clock and give this allocation a sequence number (seq);
    // the guard's destructor later checks the clock against seq (check_clock_over_seq)
    advance_clock();
    seq = ATOMIC_AAF(&max_seq_, alloc_size);
    get_seq() = seq;

    // Store need_speed_limit in the thread-local variable tl_need_speed_limit
    tl_need_speed_limit() = need_speed_limit;
    // Log the throttling information
    if (need_speed_limit && REACH_TIME_INTERVAL(1 * 1000 * 1000L)) {
      COMMON_LOG(INFO, "report write throttle info", K(alloc_size), K(attr_), K(throttling_interval),
                  "max_seq_", ATOMIC_LOAD(&max_seq_), K(clock_),
                  K(cur_mem_hold), K(throttle_info_), K(seq));
    }
  }
}
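The trigger check in speed_limit() boils down to one integer comparison. The sketch below isolates it, assuming only what the excerpt shows; `trigger_mem_limit` and `needs_speed_limit` are hypothetical helper names, not OceanBase functions. The numbers in the usage note are taken from the sample log line shown later in this article (memstore_threshold_ 2147483600, trigger_percentage_ 21, cur_mem_hold 524288000).

```cpp
#include <cstdint>

// Memory level above which throttling starts (integer arithmetic, as in the excerpt).
int64_t trigger_mem_limit(int64_t memstore_threshold, int64_t trigger_percentage) {
  return memstore_threshold * trigger_percentage / 100;
}

// Returns true when the held Memstore memory exceeds the trigger limit.
// A percentage of 100 or more means throttling is disabled; invalid inputs never throttle.
bool needs_speed_limit(int64_t cur_mem_hold, int64_t memstore_threshold, int64_t trigger_percentage) {
  if (trigger_percentage >= 100) {
    return false;
  }
  if (cur_mem_hold < 0 || memstore_threshold <= 0 || trigger_percentage <= 0) {
    return false;
  }
  return cur_mem_hold > trigger_mem_limit(memstore_threshold, trigger_percentage);
}
```

With the values from the log sample, the limit is 2147483600 * 21 / 100 = 450971556 bytes, and 524288000 bytes held is above it, so the write is throttled. With the default percentage of 60 the same usage would not trigger throttling.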

ObFifoArena::ObWriteThrottleInfo::check_and_calc_decay_factor()

Computing the decay factor

int ObFifoArena::ObWriteThrottleInfo::check_and_calc_decay_factor(int64_t memstore_threshold,
                                                                  int64_t trigger_percentage,
                                                                  int64_t alloc_duration)
{
  int ret = OB_SUCCESS;
  if (memstore_threshold != memstore_threshold_
      || trigger_percentage != trigger_percentage_
      || alloc_duration != alloc_duration_
      || decay_factor_ <= 0) {
    memstore_threshold_ = memstore_threshold;
    trigger_percentage_ = trigger_percentage;
    alloc_duration_ = alloc_duration;
    int64_t available_mem = (100 - trigger_percentage_) * memstore_threshold_ / 100;
    double N =  static_cast<double>(available_mem) / static_cast<double>(MEM_SLICE_SIZE);
    decay_factor_ = (static_cast<double>(alloc_duration) - N * static_cast<double>(MIN_INTERVAL))/ static_cast<double>((((N*(N+1)*N*(N+1)))/4));
    decay_factor_ = decay_factor_ < 0 ? 0 : decay_factor_;
    COMMON_LOG(INFO, "recalculate decay factor", K(memstore_threshold_), K(trigger_percentage_),
               K(decay_factor_), K(alloc_duration), K(available_mem), K(N));
  }
  return ret;
}

In the decay_factor formula, alloc_duration is the value of writing_throttling_maximum_duration, which is 2h in version 4.0, and MIN_INTERVAL defaults to 20ms.
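Written out, the arithmetic in check_and_calc_decay_factor() is as follows. The symbols are shorthand introduced here: $T$ for alloc_duration, $\Delta$ for MIN_INTERVAL, $M$ for MEM_SLICE_SIZE, $A$ for available_mem, and $N$ for the number of slices still available. The last line is an interpretation rather than something the source states: the denominator equals the sum of the first $N$ cubes (exactly so for integer $N$), which suggests that the $i$-th slice is charged a wait of about $\text{decay\_factor} \cdot i^3 + \Delta$, so that the waits over all $N$ slices add up to $T$.

```latex
\begin{align}
A &= \frac{(100 - \text{trigger\_percentage}) \cdot \text{memstore\_threshold}}{100},
\qquad N = \frac{A}{M} \\
\text{decay\_factor} &= \frac{T - N\,\Delta}{\tfrac{1}{4}\,N^2 (N+1)^2} \\
\sum_{i=1}^{N} i^3 = \frac{N^2 (N+1)^2}{4}
\quad &\Longrightarrow \quad
\sum_{i=1}^{N} \left(\text{decay\_factor} \cdot i^3 + \Delta\right) = T
\end{align}
```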

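Put simply, the decay factor is computed by a polynomial from the memory still available and the value of writing_throttling_maximum_duration. As long as writing_throttling_maximum_duration is left unchanged, each sleep gets gradually longer as the available memory shrinks.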
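As a rough numeric check, the sketch below repeats the arithmetic of check_and_calc_decay_factor() without the caching and logging. Two constants are assumptions, since the excerpt does not show their definitions: MEM_SLICE_SIZE is taken as 2 MiB and MIN_INTERVAL as 20000 microseconds (the 20ms mentioned above). With those values the sketch reproduces the decay_factor_ printed in the sample log line later in this article. `slice_wait_us` is a hypothetical helper that assumes slice i is charged decay_factor * i^3 + MIN_INTERVAL, an interpretation based on the denominator N^2(N+1)^2/4 being the sum of the first N cubes.

```cpp
#include <cstdint>

// Assumed values, not shown in the excerpt: bytes per slice and minimum interval in microseconds.
constexpr int64_t MEM_SLICE_SIZE = 2 * 1024 * 1024;
constexpr int64_t MIN_INTERVAL = 20000;

// Same arithmetic as check_and_calc_decay_factor(); alloc_duration_us is in microseconds.
double calc_decay_factor(int64_t memstore_threshold, int64_t trigger_percentage,
                         int64_t alloc_duration_us) {
  const int64_t available_mem = (100 - trigger_percentage) * memstore_threshold / 100;
  const double N = static_cast<double>(available_mem) / static_cast<double>(MEM_SLICE_SIZE);
  const double decay = (static_cast<double>(alloc_duration_us) - N * static_cast<double>(MIN_INTERVAL))
                       / ((N * (N + 1) * N * (N + 1)) / 4);
  return decay < 0 ? 0 : decay;  // clamp, as the original does
}

// Hypothetical: wait charged to the i-th slice above the trigger point (assumed cubic growth).
double slice_wait_us(double decay_factor, int64_t i) {
  const double x = static_cast<double>(i);
  return decay_factor * x * x * x + static_cast<double>(MIN_INTERVAL);
}
```

Calling calc_decay_factor(2147483600, 21, 7200000000) gives roughly 0.0669, in line with the logged decay_factor_ of 6.693207379708156213e-02. Under the cubic assumption, the wait per slice starts near 20 ms and grows as more memory above the trigger point is consumed.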

ObStorageTableGuard::~ObStorageTableGuard()

Executing the throttling flow

ObStorageTableGuard::~ObStorageTableGuard()
{
  // tl_need_speed_limit is set in ObFifoArena::speed_limit(), which is called from ObFifoArena::alloc()
  bool &need_speed_limit = tl_need_speed_limit();
  // In a write context, need_control_mem_ is set to true when the ObStorageTableGuard instance is created
  if (need_control_mem_ && need_speed_limit) {
    bool need_sleep = true;
    int64_t left_interval = SPEED_LIMIT_MAX_SLEEP_TIME;
    // SPEED_LIMIT_MAX_SLEEP_TIME defaults to 20s and is the maximum total sleep time
    if (!for_replay_) {
        // Not replaying logs:
        // store_ctx_.timeout_ - ObTimeUtility::current_time() is the time left before the transaction times out; a value below 0 means it has already timed out
        // Take the smaller of the two
      left_interval = min(SPEED_LIMIT_MAX_SLEEP_TIME, store_ctx_.timeout_ - ObTimeUtility::current_time());
    }
    // If the memtable is frozen (no longer active), no throttling is needed
    if (NULL != memtable_) {
      need_sleep = memtable_->is_active_memtable();
    }
    uint64_t timeout = 10000;//10s
    // Record a wait event, visible in v$session_event under the event name: memstore memory page alloc wait
    // It can be queried with: select * from v$session_event where EVENT='memstore memory page alloc wait'
    common::ObWaitEventGuard wait_guard(common::ObWaitEventIds::MEMSTORE_MEM_PAGE_ALLOC_WAIT, timeout, 0, 0, left_interval);

    reset();
    int tmp_ret = OB_SUCCESS;
    bool has_sleep = false;
    int64_t sleep_time = 0;
    int time = 0;
    int64_t &seq = get_seq();
    if (store_ctx_.mvcc_acc_ctx_.is_write()) {
      ObGMemstoreAllocator* memstore_allocator = NULL;
      // Get the memstore allocator of the current tenant
      if (OB_SUCCESS != (tmp_ret = ObMemstoreAllocatorMgr::get_instance().get_tenant_memstore_allocator(
          MTL_ID(), memstore_allocator))) {
      } else if (OB_ISNULL(memstore_allocator)) {
        LOG_WARN_RET(OB_ALLOCATE_MEMORY_FAILED, "get_tenant_mutil_allocator failed", K(store_ctx_.tablet_id_), K(tmp_ret));
      } else {
        while (need_sleep &&
               !memstore_allocator->check_clock_over_seq(seq) &&
               (left_interval > 0)) {
          if (for_replay_) {
            // When replaying logs, if a log stream of the current tenant is being frozen, do not sleep; break immediately
            if(MTL(ObTenantFreezer *)->exist_ls_freezing()) {
              break;
            }
          }
          // Compute the expected wait time
          int64_t expected_wait_time = memstore_allocator->expected_wait_time(seq);
          if (expected_wait_time == 0) {
            break;
          }
          // SLEEP_INTERVAL_PER_TIME is the length of a single sleep, 20ms by default
          // The thread sleeps for at most 20ms per iteration
          uint32_t sleep_interval =
            static_cast<uint32_t>(min(min(left_interval, SLEEP_INTERVAL_PER_TIME), expected_wait_time));
          ::usleep(sleep_interval);
          // Accumulate the total sleep time
          sleep_time += sleep_interval;
          // Count the number of sleeps
          time++;
          // After each sleep, subtract the time slept from the remaining budget
          left_interval -= sleep_interval;
          has_sleep = true;
          // After each sleep, re-check whether throttling is still needed: a minor compaction may have freed memory in the meantime, in which case throttling can stop
          need_sleep = memstore_allocator->need_do_writing_throttle();
        }
      }
    }
    // Log the details of this throttling run
    if (REACH_TIME_INTERVAL(100 * 1000L) &&
        sleep_time > 0) {
      int64_t cost_time = ObTimeUtility::current_time() - init_ts_;
      LOG_INFO("throttle situation", K(sleep_time), K(time), K(seq), K(for_replay_), K(cost_time));
    }
 
    if (for_replay_ && has_sleep) {
      get_replay_is_writing_throttling() = true;
    }
  }
  reset();
}

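Summary

OB's write throttling is implemented in the destructor of the ObStorageTableGuard class. Because that destructor is called only after the memtable write has finished, throttling is applied after the fact and affects the next memory allocation. In other words, the throttling decision is acted on only after the current write completes; if throttling is needed, the next memory allocation is delayed. This design keeps throttling from interfering with the current write while still controlling how memory is allocated and consumed.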
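The "decide during allocation, act on scope exit" pattern can be illustrated with a minimal RAII sketch. Everything here is hypothetical and greatly simplified: `WriteGuard`, `alloc_for_write`, `write_row`, and the thread-local counters are illustrative names, and the destructor only counts a throttle run where real code would sleep.

```cpp
#include <cstdint>

// Hypothetical thread-local state standing in for the flag set by the allocator.
thread_local bool tl_need_throttle = false;
thread_local int tl_throttle_runs = 0;
thread_local int tl_writes_done = 0;

// Stand-in for the allocator: it only records the decision and never blocks.
void alloc_for_write(int64_t mem_hold, int64_t trigger_limit) {
  tl_need_throttle = mem_hold > trigger_limit;
}

class WriteGuard {
public:
  explicit WriteGuard(bool need_control_mem) : need_control_mem_(need_control_mem) {}
  ~WriteGuard() {
    // Runs when the enclosing scope ends, i.e. after the write itself has completed.
    if (need_control_mem_ && tl_need_throttle) {
      ++tl_throttle_runs;  // a real guard would run its sleep loop here
    }
  }
private:
  bool need_control_mem_;
};

void write_row(int64_t mem_hold, int64_t trigger_limit) {
  {
    WriteGuard guard(true);
    alloc_for_write(mem_hold, trigger_limit);  // decision is recorded here
    ++tl_writes_done;                          // the write finishes first
  }                                            // guard destroyed: throttling happens now
}
```

A write that allocates below the limit leaves tl_throttle_runs unchanged, while one above the limit still completes its own write and then pays the delay when the guard goes out of scope.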

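IV. Usage

Both parameters are tenant-level. They can be set from a tenant administrator account, or from the sys tenant by naming the target tenant. The example below starts throttling when Memstore usage reaches 80% and sizes the throttling so that the remaining memory lasts for 2h of writes: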

obclient> ALTER SYSTEM SET writing_throttling_trigger_percentage = 80;
Query OK, 0 rows affected
obclient> ALTER SYSTEM SET writing_throttling_maximum_duration = '2h';
Query OK, 0 rows affected
  
Or, from the sys tenant, specify the tenant:
obclient> ALTER SYSTEM SET writing_throttling_trigger_percentage = 80 tenant=<tenant_name>;

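V. Use Cases

1. Use it when creating a tenant. Under heavy write pressure, for example during data imports, limiting the write rate is a simple and effective remedy. Although OceanBase's LSM-Tree storage engine can freeze memtables and release memory promptly, the Memstore can still be exhausted when the write rate exceeds the minor compaction rate. The latest version, 4.0, enables this setting by default; combined with the minor compaction settings, it keeps Memstore consumption under control.

2. When QPS drops abnormally, especially under a write-heavy workload, the following methods can confirm whether write throttling is the cause.

System views

If the QPS drop is caused by throttling, the code analysis above shows that it is recorded in v$session_event under the event name "memstore memory page alloc wait".

select * from v$session_event where EVENT='memstore memory page alloc wait' \G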
*************************** 94. row ***************************
           CON_ID: 1
           SVR_IP: 10.186.64.124
         SVR_PORT: 22882
              SID: 3221487713
            EVENT: memstore memory page alloc wait
      TOTAL_WAITS: 182673
   TOTAL_TIMEOUTS: 0
      TIME_WAITED: 1004.4099
     AVERAGE_WAIT: 0.005498403704981032
         MAX_WAIT: 12.3022
TIME_WAITED_MICRO: 10044099
              CPU: NULL
         EVENT_ID: 11015
    WAIT_CLASS_ID: 109
      WAIT_CLASS#: 9
       WAIT_CLASS: SYSTEM_IO
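To read the numbers in this row, it helps to convert units. The sketch below assumes that TIME_WAITED, AVERAGE_WAIT, and MAX_WAIT are in hundredths of a second; that assumption is inferred from the sample row itself, where TIME_WAITED_MICRO / 10000 equals TIME_WAITED. The helper names are hypothetical.

```cpp
#include <cstdint>

// Assumption inferred from the sample row: 1 unit of TIME_WAITED = 10000 microseconds.
constexpr double kMicrosPerCentisecond = 10000.0;

// Convert a value in hundredths of a second to milliseconds.
double centiseconds_to_millis(double cs) {
  return cs * 10.0;
}

// Average wait per event, in hundredths of a second, from the microsecond total.
double average_wait_cs(int64_t time_waited_micro, int64_t total_waits) {
  if (total_waits <= 0) {
    return 0.0;
  }
  return static_cast<double>(time_waited_micro) / kMicrosPerCentisecond
         / static_cast<double>(total_waits);
}
```

Under that assumption, this session waited about 10 seconds in total across 182673 throttling waits, roughly 55 microseconds per wait on average, with the longest single wait at about 123 ms.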

Logs

Run grep 'report write throttle info' observer.log; if it prints a log entry like the following, throttling is confirmed as the cause.

[2023-04-17 17:17:30.695621] INFO  [COMMON] speed_limit (ob_fifo_arena.cpp:319) [26466][T1_L0_G0][T1][Y59620ABA407C-0005F9818D1BFE06-0-0] [lt=2] report write throttle info(alloc_size=32, attr_=tenant_id=1, label=Memstore, ctx_id=1, prio=0, throttling_interval=0, max_seq_=11045142952, clock_=11045143112, cur_mem_hold=524288000, throttle_info_={decay_factor_:"6.693207379708156213e-02", alloc_duration_:7200000000, trigger_percentage_:21, memstore_threshold_:2147483600, period_throttled_count_:0, period_throttled_time_:0, total_throttled_count_:0, total_throttled_time_:0}, seq=11045142952)

Likewise, grep 'throttle situation' observer.log shows the details of this throttling run.

[2023-04-17 17:17:31.006880] INFO  [STORAGE] ~ObStorageTableGuard (ob_storage_table_guard.cpp:109) [26466][T1_L0_G0][T1][Y59620ABA407C-0005F9818D1BFE06-0-0] [lt=85] throttle situation(sleep_time=4, time=1, seq=11048795064, for_replay_=false, cost_time=7025)

Keywords: #OceanBase# #write throttling#

This article was shared from the WeChat official account 爱可生开源社区 (ActiontechOSS).
Original link: https://my.oschina.net/actiontechoss/blog/8747006
