StarRocks Best Practices on Server type, Data Modeling, Query Performance, Data Loading (WIP) #22765
Closed
Replies: 1 comment
-
6/2/23 @ss892714028 is in process to update this. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Selecting the optimal server type
For the Frontend (FE), we recommend a total CPU core count of 8 and over 16GB of memory. For the Backend (BE), a total CPU core count of 16, with over 64GB of memory, will work best. It's important to note that the CPU of the BE server needs to support the AVX2 instruction set to leverage StarRocks' vectorization technology.
In addition, your hardware should include a 10 Gigabit Ethernet card and a compatible 10 Gigabit switch for smooth networking.
Picking the right type of tables
First off, let's talk about the duplicated key table. This is StarRocks' default table and one of the most popular ones among users. Picture this: you've got a table creation statement on your right. See the DUPLICATE KEY keyword? After this, you can declare your sorting columns. It's a great idea to choose columns that are frequently used in your filter conditions.
Next up, we have the Aggregation table. This one is ideal when you have business scenarios involving aggregated indicator columns. Imagine you've created a data table using this table. What happens is that your indicator columns get neatly aggregated based on your definitions.
Finally, let's talk about the Primary Key table. This table shines when you're dealing with real-time or frequent updates. The Primary Key table ensures that for any given primary key, there's just one record, eliminating the need for Merge operations. This greatly improves query performance at the cost of a slight dip in write performance and memory usage.
Things to think about for query performance
Let's start with partition bucketing. This clever technique is all about slicing and dicing your data into digestible pieces to minimize data scanned. Imagine cutting your data into parts by range, then dividing these parts into tablets using a hash.
Next up, we have sorting columns, these bad boys can dramatically enhance your query efficiency. Let's say, your SQL query often uses 'site_id' and 'city_code' as filter conditions, why not use them as sorting columns? Makes sense, right?
Also enable the query cache. Typically it's not enabled by default. In certain situations, you can get a 10X in performance for your queries.
Picking the optimal data loading methodology
Now let's talk about the different import methods StarRocks offers. Got local CSV or JSON files? Use the stream load method to bring them in. HDFS files? No problem, Broker Load's got you covered. What about Kafka text or json files? Routine Load and StarRocks Kafka Connector is your friend. And if you have large-scale files with a Spark cluster, consider using SparkLoad to import data. Got data in Apache Iceberg, Apache Hive or Apache Hudi, connect to that external catalog and then use the select and insert method. But that's not all. StarRocks even allows for setting up external tables, and you can store external table data in StarRocks using the Insert into select method. This can even import MySQL tables into StarRock.
See more at https://github.com/StarRocks/starrocks/discussions/24660
Additional Content
https://www.bilibili.com/video/BV1SX4y1c7i4/?spm_id_from=333.337.search-card.all.click
Beta Was this translation helpful? Give feedback.
All reactions