[Bug] Multiple replicas of Doris tablet are cold backed up to HDFS, with some replicas experiencing cold backup anomalies #47056

Open
2 of 3 tasks
lcy999 opened this issue Jan 16, 2025 · 0 comments

Search before asking

  • I had searched in the issues and found no similar issues.

Version

doris 2.1.7
hadoop 3.1.4

What's Wrong?

When a tablet has multiple replicas and is cold backed up (cooled down) to HDFS, some of its replicas frequently fail to cool down, while other tablets have all replicas cooled down successfully. The issue does not occur when the partition replica count is set to 1. The errors reported during cooldown to HDFS are mainly 'Blocklist for /data/10108/10110.0.meta has changed!' and 'cannot read cooldown meta: [INTERNAL_ERROR] malformed tablet meta'.

Below is the specific information:

  1. create table info:

CREATE TABLE IF NOT EXISTS example_tbl_by_default_t01
(
timestamp DATETIME NOT NULL COMMENT "log time",
type INT NOT NULL COMMENT "log type",
error_code INT COMMENT "error code",
error_msg VARCHAR(1024) COMMENT "detailed error message",
op_id BIGINT COMMENT "owner id",
op_time DATETIME COMMENT "processing time"
)
auto partition by list(error_msg)()
DISTRIBUTED BY HASH(type) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 2"
);

  2. storage policy and resource info:

CREATE RESOURCE "remote_hdfs_t01" PROPERTIES (
"type"="hdfs",
"fs.defaultFS"="qione01:9000"
);

CREATE STORAGE POLICY policy_hdfs_t01
PROPERTIES(
"storage_resource" = "remote_hdfs_t01",
"cooldown_ttl" = "60"
);

ALTER TABLE example_tbl_by_default_t01 set ("storage_policy" = "policy_hdfs_t01");

  3. error details:
    The meta files named in the errors have been confirmed to exist on HDFS and to be in a normal state.

[hdfs_builder.cpp:60] java.io.IOException: Blocklist for /data/10108/10110.0.meta has changed!
at org.apache.hadoop.hdfs.DFSInputStream.fetchAndCheckLocatedBlocks(DFSInputStream.java:302)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:238)
at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1012)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:952)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:930)
at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1128)
at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1496)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1705)
at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:259)

[tablet.cpp:2451] cannot read cooldown meta: [INTERNAL_ERROR]malformed tablet meta, path=/data/24763/24765.0.meta
0# doris::Tablet::_read_cooldown_meta(std::shared_ptr<doris::io::RemoteFileSystem> const&, doris::TabletMetaPB*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:120
1# doris::Tablet::_follow_cooldowned_data() at /root/doris/be/src/common/status.h:491
2# doris::Tablet::cooldown(std::shared_ptr<doris::Rowset>) at /root/doris/be/src/common/status.h:491
3# std::_Function_handler<void (), doris::StorageEngine::_cooldown_tasks_producer_callback()::$_1>::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:701
4# doris::WorkThreadPool::work_thread(int) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:646
5# execute_native_thread_routine at /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/unique_ptr.h:85
6# start_thread
7# __clone
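
To see which tablets are affected and by which of the two errors, the BE log can be tallied per tablet. The following is only a rough sketch, not part of Doris: the function name `tally_cooldown_errors` is hypothetical, the matched substrings are copied from the two log excerpts above, and reading the meta path as /data/<tablet_id>/<replica_id>.<term>.meta is an assumption inferred from the paths in this report.

```python
import re
from collections import defaultdict

# Assumption: cooldown meta paths look like /data/<tablet_id>/<replica_id>.<term>.meta,
# inferred from the two paths quoted in this report.
META_PATH = re.compile(r"/data/(\d+)/(\d+)\.(\d+)\.meta")

# Substrings copied from the two errors quoted in this report.
ERROR_KINDS = {
    "blocklist_changed": "Blocklist for",
    "malformed_meta": "malformed tablet meta",
}


def tally_cooldown_errors(lines):
    """Return {tablet_id: {error_kind: count}} for the two errors above."""
    counts = defaultdict(lambda: defaultdict(int))
    pending = None  # error kind seen on a line that carried no path
    for line in lines:
        kind = next((k for k, s in ERROR_KINDS.items() if s in line), None)
        match = META_PATH.search(line)
        if kind and match:
            counts[match.group(1)][kind] += 1
            pending = None
        elif kind:
            pending = kind  # the path may follow on the next line
        elif pending and match:
            counts[match.group(1)][pending] += 1
            pending = None
        else:
            pending = None
    return {tablet: dict(kinds) for tablet, kinds in counts.items()}


if __name__ == "__main__":
    sample = [
        "[hdfs_builder.cpp:60] java.io.IOException: Blocklist for /data/10108/10110.0.meta has changed!",
        "[tablet.cpp:2451] cannot read cooldown meta: [INTERNAL_ERROR]malformed tablet meta",
        ", path=/data/24763/24765.0.meta",
    ]
    print(tally_cooldown_errors(sample))
```

The sketch accepts the path either on the same line as the error or on the line right after it, since the "malformed tablet meta" message above appears with its path on a separate line. Running it over the logs of each BE that hosts a replica would show whether the failures are confined to particular backends.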

The screenshot below shows the tablet information of another table after cold backup. The failure of some replicas to cold back up persists until the BE is restarted; after a restart, cooldown returns to normal and the errors above no longer occur.
[Image: screenshot not preserved]

What You Expected?

All replicas of a multi-replica tablet can be successfully cooled down to HDFS.

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
