Skip to content

Conversation

@limowang
Copy link
Collaborator

@limowang limowang commented Mar 24, 2025

#2215

During rolling upgrades and duplication data migration, there are cases where a replica first learns the checkpoint from the primary replica and then loads the partition. However, if the primary replica crashes during the execution of ::dsn::error_code pegasus_server_impl::copy_checkpoint_to_dir_unsafe(const char *checkpoint_dir, int64_t *checkpoint_decree, bool flush_memtable) while generating the checkpoint, it may skip the final block of code:

std::vector<rocksdb::ColumnFamilyDescriptor> column_families(
    {{DATA_COLUMN_FAMILY_NAME, rocksdb::ColumnFamilyOptions()},
     {META_COLUMN_FAMILY_NAME, rocksdb::ColumnFamilyOptions()}});
status = rocksdb::DB::OpenForReadOnly(
    rocksdb::DBOptions(), checkpoint_dir, column_families, &handles_opened, &snapshot_db);
if (!status.ok()) {
    derror_replica(
        "OpenForReadOnly from {} failed, error = {}", checkpoint_dir, status.ToString());
    snapshot_db = nullptr;
    cleanup(true);
    return ::dsn::ERR_LOCAL_APP_FAILURE;
}
dcheck_eq_replica(handles_opened.size(), 2);
dcheck_eq_replica(handles_opened[1]->GetName(), META_COLUMN_FAMILY_NAME);

As a result, the META_COLUMN_FAMILY_NAME column family might be missing from the checkpoint. When other potential follower replicas request checkpoint data, the primary replica does not validate whether the checkpoint data is complete, which could lead to triggering the !missing_meta_cf assertion failure.

Therefore, after the node crash and restart, the completeness of the checkpoint needs to be verified.

@github-actions github-actions bot added the cpp label Mar 24, 2025
@limowang limowang changed the title bug: When loading the replica in the replica_server, the assertion !missing_meta_cf failed bug(duplication): When loading the replica in the replica_server, the assertion !missing_meta_cf failed Mar 24, 2025
@limowang limowang changed the title bug(duplication): When loading the replica in the replica_server, the assertion !missing_meta_cf failed fix(checkpoint): When loading the replica in the replica_server, the assertion !missing_meta_cf failed Mar 26, 2025
@limowang limowang force-pushed the wangguangshuo_missing_meta_cf branch 2 times, most recently from b06400c to 435c533 Compare March 27, 2025 03:58
@limowang limowang requested review from acelyc111 and empiredan March 27, 2025 07:27
LOG_ERROR_PREFIX(
"OpenForReadOnly from {} failed, error = {}", checkpoint_dir, status.ToString());
snapshot_db = nullptr;
cleanup();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use defer to execute cleanup() automatically

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thank you for your advice !

@limowang limowang force-pushed the wangguangshuo_missing_meta_cf branch from 99d4a19 to 6354cb3 Compare April 8, 2025 08:31
@github-actions github-actions bot added the github label Apr 8, 2025
.clang-tidy Outdated
# Disable some checks that are not useful for us now.
# They are sorted by names, and should be consistent to build_tools/clang_tidy.py.
Checks: 'abseil-*,boost-*,bugprone-*,cert-*,clang-analyzer-*,concurrency-*,cppcoreguidelines-*,darwin-*,fuchsia-*,google-*,hicpp-*,linuxkernel-*,llvm-*,misc-*,modernize-*,performance-*,portability-*,readability-*,-bugprone-easily-swappable-parameters,-bugprone-lambda-function-name,-bugprone-macro-parentheses,-bugprone-sizeof-expression,-cert-err58-cpp,-concurrency-mt-unsafe,-cppcoreguidelines-avoid-c-arrays,-cppcoreguidelines-avoid-magic-numbers,-cppcoreguidelines-avoid-non-const-global-variables,-cppcoreguidelines-macro-usage,-cppcoreguidelines-non-private-member-variables-in-classes,-cppcoreguidelines-owning-memory,-cppcoreguidelines-pro-bounds-array-to-pointer-decay,-cppcoreguidelines-pro-bounds-pointer-arithmetic,-cppcoreguidelines-pro-type-const-cast,-cppcoreguidelines-pro-type-union-access,-fuchsia-default-arguments-calls,-fuchsia-multiple-inheritance,-fuchsia-overloaded-operator,-fuchsia-statically-constructed-objects,-google-readability-avoid-underscore-in-googletest-name,-hicpp-avoid-c-arrays,-hicpp-named-parameter,-hicpp-no-array-decay,-llvm-include-order,-misc-definitions-in-headers,-misc-non-private-member-variables-in-classes,-misc-unused-parameters,-modernize-avoid-bind,-modernize-avoid-c-arrays,-modernize-replace-disallow-copy-and-assign-macro,-modernize-use-trailing-return-type,-performance-unnecessary-value-param,-readability-function-cognitive-complexity,-readability-identifier-length,-readability-magic-numbers,-readability-named-parameter,-readability-suspicious-call-argument'
Checks: 'abseil-*,boost-*,bugprone-*,cert-*,clang-analyzer-*,concurrency-*,cppcoreguidelines-*,darwin-*,fuchsia-*,google-*,hicpp-*,linuxkernel-*,llvm-*,misc-*,modernize-*,performance-*,portability-*,readability-*,-bugprone-easily-swappable-parameters,-bugprone-lambda-function-name,-bugprone-macro-parentheses,-bugprone-sizeof-expression,-cert-err58-cpp,-concurrency-mt-unsafe,-cppcoreguidelines-avoid-c-arrays,-cppcoreguidelines-avoid-magic-numbers,-cppcoreguidelines-avoid-non-const-global-variables,-cppcoreguidelines-macro-usage,-cppcoreguidelines-non-private-member-variables-in-classes,-cppcoreguidelines-owning-memory,-cppcoreguidelines-pro-bounds-array-to-pointer-decay,-cppcoreguidelines-pro-bounds-pointer-arithmetic,-cppcoreguidelines-pro-type-const-cast,-cppcoreguidelines-pro-type-union-access,-fuchsia-default-arguments-calls,-fuchsia-multiple-inheritance,-fuchsia-overloaded-operator,-fuchsia-statically-constructed-objects,-google-readability-avoid-underscore-in-googletest-name,-hicpp-avoid-c-arrays,-hicpp-named-parameter,-hicpp-no-array-decay,-llvm-include-order,-misc-definitions-in-headers,-misc-non-private-member-variables-in-classes,-misc-unused-parameters,-modernize-avoid-bind,-modernize-avoid-c-arrays,-modernize-replace-disallow-copy-and-assign-macro,-modernize-use-trailing-return-type,-performance-unnecessary-value-param,-readability-function-cognitive-complexity,-readability-identifier-length,-readability-magic-numbers,-readability-named-parameter,-readability-suspicious-call-argument,-abseil-string-find-str-contains'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which rules are disabled ? Should also be consistent with build_tools/clang_tidy.py.

Copy link
Collaborator Author

@limowang limowang Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter -abseil-string-find-str-contains was added, otherwise the line d1.find("checkpoint") != std::string::npos would not pass the format check (but I didn't modify it).I will synchronize the modification in /build_tools/clang_tidy.py.

@limowang limowang force-pushed the wangguangshuo_missing_meta_cf branch from 4df3751 to 58bbbb5 Compare April 9, 2025 07:31
@limowang limowang force-pushed the wangguangshuo_missing_meta_cf branch from 58bbbb5 to 02324f8 Compare April 9, 2025 08:09
@github-actions github-actions bot removed the github label Apr 9, 2025
@limowang limowang force-pushed the wangguangshuo_missing_meta_cf branch from ed7ca12 to 6bb5e1f Compare April 9, 2025 09:35
@limowang limowang force-pushed the wangguangshuo_missing_meta_cf branch from d3e7b3e to 91f05fa Compare April 9, 2025 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants