In the wake of rapid advances in artificial intelligence and the Internet of Things, big data has become one of the most influential means of production, and interest in time-series data in particular keeps growing. The industry therefore needs to work out how to make better use of time-series data and how to build a database robust enough for complex scenarios.
Initiated by Tsinghua University, Apache IoTDB is an open-source project that serves as a platform to integrate IoT time-series data collection, storage, querying, and analysis. According to a benchmarking test conducted by the China Software Evaluation Center and Renmin University of China, IoTDB's performance indicators are considerably better than many international time-series databases currently in use.
IoTDB supports "end-edge-cloud" deployment and suits data management scenarios such as high-end equipment, factories, and high-speed networks. It addresses many common pain points and is widely used in industries such as energy, rail transportation, and the Internet of Vehicles (IoV).
In this article, we invited Dr. Qiao Jialin, assistant researcher in the School of Software at Tsinghua University, to introduce the IoT-native database IoTDB and share his insights on the rapid growth of open-source database projects, open-source governance, the development of time-series databases, and how to help enterprises enhance productivity.
The first university-based top-level Apache project in China
Q: IoTDB is the first university-based top-level Apache project in China. Can you tell us how it all began?
A: Getting into the Apache Software Foundation (ASF) involves two major phases. The first is applying for admission to the Apache Incubator. At this stage, it is essential to present the project's value and significance in the application proposal, which Apache members use to assess the project.
A project is judged primarily on whether it tackles a reasonably widespread pain point and whether it benefits society. ASF members also assess whether a new project overlaps with an existing one. The ASF usually incubates one project per direction, so a project with extensive overlap may be rejected.
In this regard, IoTDB focuses on data management problems in the industrial Internet of Things. In 2011, our lab took on industrial IoT projects to help enterprises manage the large volumes of time-series data generated by their engineering machinery. At the time, we built on existing open-source platforms, but they were not designed for IoT scenarios, and we ran into performance bottlenecks such as slow reads and writes and low compression ratios. Ultimately, we decided to start from scratch to resolve these issues. IoTDB therefore grew out of practical work and solves real problems, which gives it a stronger market base.
The second phase is meeting Apache's requirements and building a community so that the project can graduate as a top-level project.
The incubator phase focuses on project compliance and community building. Project compliance covers whether the Apache License declaration appears in the code, whether the open-source components the project depends on are compatible with the Apache License, whether releases follow the Apache release process, and so on. Community building covers community activity, the number of discussions on the mailing list, whether there are external committers and PMC members, and so on.
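As a rough illustration of the first compliance item, the sketch below checks whether source files begin with the standard ASF license header. The function names and the 30-line window are assumptions made for this example; real projects typically rely on a dedicated audit tool such as Apache RAT rather than a hand-written script.

```python
from typing import Dict, List

# Opening words of the standard ASF source header.
APACHE_HEADER_MARKER = "Licensed to the Apache Software Foundation (ASF) under one"


def has_license_header(source_text: str, max_lines: int = 30) -> bool:
    """Return True if the marker appears within the first max_lines lines.

    The 30-line window is an arbitrary choice for this sketch.
    """
    head = "\n".join(source_text.splitlines()[:max_lines])
    return APACHE_HEADER_MARKER in head


def files_missing_header(files: Dict[str, str]) -> List[str]:
    """Given {file name: file content}, list the names that lack the header."""
    return sorted(name for name, text in files.items()
                  if not has_license_header(text))


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = {
        "Good.java": "/*\n * " + APACHE_HEADER_MARKER + "\n */\nclass Good {}",
        "Bad.java": "class Bad {}",
    }
    print(files_missing_header(sample))
```

A check like this only covers the header; license compatibility of dependencies and release procedure still need separate review.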
To achieve this, IoTDB committed to building an open-source community. We do not track the community against a fixed set of metrics; we simply want to see it grow and develop organically. The community is very open to external contributors. IoTDB was also founded in a university lab, which welcomes new students every year, and the community runs a mentorship program that pairs existing members with newcomers. Finally, developing the community and the backend involves writing a lot of documentation to help newcomers get started and find their way.
Q: What are the differences between universities and other organizations initiating open source projects? What factors contribute to maintaining a high activity level and stable community participation for university-based projects?
A: Most university-initiated open-source projects do not have a dedicated community operation team. Instead, developers or students manage the community, and users can directly contact the developers for assistance.
Meanwhile, university-based projects have a higher staff turnover rate. Most graduate students are involved in the community for two years, whereas undergraduate students are generally engaged in the community part-time because they have more courses, exams, or internships to complete.
This is, however, somewhat in line with the style of open-source communities, where everyone contributes in their spare time, and communication is established through e-mails and documents. Accordingly, university open-source projects should interact more with the community and synchronize ideas and design thoughts with the community. Students may devote much time and energy to building promising projects, but they need to pay more attention to publicity.
Operations after becoming a top-level project
Q: What improvements did IoTDB's contributors, community users, etc., experience as it grew into a top open-source project? Have its operations changed in any way?
A: In fact, most of our changes began once we joined the Apache Software Foundation. Before that, we just discussed the work among a few students, and only our labmates were aware of it, so the project lacked publicity.
After entering the ASF, we began writing up every project discussion in a document and sending it to the community. Changes are made only once most people agree with them. In addition, we have organized meetups, maintained the project's public website, and set up channels such as WeChat groups, QQ groups, and Slack. That said, business users will not choose your software just because it is a top-level project. They are rational enough to test it before making a purchase decision.
This is also true for contributors. Contributors to Apache top-level projects are often users themselves, so their interest in a project as users may be what motivates them to contribute. After completing their evaluation, many users and companies invest in research and development and become full-time members of the community. In addition, the community's working model has changed from being dominated by one organization to being built by the community as a whole.
Stability comes first, always
Q: Which do you consider more critical for database projects: performance, stability, or maintainability? What are the crucial factors for a time-series database?
A: In short, stability comes first, maintainability second, and performance third.
Stability in production means the system behaves as it did in the last test, so that everyone can rely on it and no significant problems occur.
Maintainability is also a guarantee of stability, and we have added many O&M-friendly features in the system design process.
As for performance, sometimes it's not necessary to pursue extreme performance. Adding more hardware might suffice. However, the performance must be predictable; otherwise, the system will not be helpful.
Time-series databases are usually deployed in factories or on equipment, where machine configurations and network conditions are not as good as those of Internet services. Moreover, the data load is larger and more complex, with network fluctuations and data quality issues, all of which must be taken into account.
Q: In terms of open-source time-series databases, there is a range of popular ones in use, including InfluxDB, OpenTSDB, TDengine, etc. What are the advantages of IoTDB?
A: The advantages of IoTDB primarily lie in two areas.
The first is the technical advantage. Due to our early involvement with the IoT scene, we discovered more relevant problems in the process, and thus we can develop a design that meets the needs of a wider audience without various restrictions. Additionally, this project comes from Tsinghua University, allowing us to benefit from its rich research and innovation resources.
Next, we have the community advantage. The Apache Software Foundation has helped us develop a more open community, and many of the developers come from the time-series database teams of Internet companies. The community atmosphere is excellent, and we often hold in-depth discussions and sharing sessions. This has inspired more students to join, and in the long term the community will be the strongest driving force behind the development of this piece of basic software.
IoT data models
Q: Many industries have utilized IoTDB, such as wind power, engineering machinery, meteorological big data platforms, etc. Using a power plant as an example, how does IoTDB enable enterprises to manage data more efficiently?
A: We created our own time-series data file format, TsFile. TsFile is based on IoT data models, making it a better solution for storing and indexing time-series data.
Additionally, we improved the efficiency of data queries by optimizing the read/write process of the database engine, organizing and processing the data into rows and columns, and designing different granularities of pre-aggregated information for queries, which can significantly enhance the performance.
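The idea of pre-aggregated information can be sketched as follows. This is a simplified illustration under stated assumptions, not TsFile's actual layout: the `Chunk` class, the chunk size, and the `range_sum` function are hypothetical, and only one level of granularity is shown, whereas a real engine may keep such statistics at several levels.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp, value)


@dataclass
class Chunk:
    """A run of time-ordered points plus statistics computed once at write time."""
    points: List[Point]
    start: int = field(init=False)
    end: int = field(init=False)
    count: int = field(init=False)
    total: float = field(init=False)

    def __post_init__(self) -> None:
        self.start = self.points[0][0]
        self.end = self.points[-1][0]
        self.count = len(self.points)
        self.total = sum(value for _, value in self.points)


def build_chunks(points: List[Point], chunk_size: int) -> List[Chunk]:
    """Split time-ordered points into fixed-size chunks."""
    return [Chunk(points[i:i + chunk_size])
            for i in range(0, len(points), chunk_size)]


def range_sum(chunks: List[Chunk], t_from: int, t_to: int) -> Tuple[float, int]:
    """Sum values with t_from <= timestamp <= t_to.

    Returns (sum, number of raw points scanned). Chunks fully inside the
    range are answered from their stored total without reading any points.
    """
    total, scanned = 0.0, 0
    for chunk in chunks:
        if chunk.end < t_from or chunk.start > t_to:
            continue  # no overlap with the query range
        if t_from <= chunk.start and chunk.end <= t_to:
            total += chunk.total  # use the pre-aggregated statistic
        else:
            for ts, value in chunk.points:  # partial overlap: scan raw data
                scanned += 1
                if t_from <= ts <= t_to:
                    total += value
    return total, scanned


if __name__ == "__main__":
    data = [(t, float(t)) for t in range(100)]
    print(range_sum(build_chunks(data, 10), 5, 94))
```

In this sketch a query touching 90 points reads raw data only from the two chunks at the edges of the range; the chunks in between are answered from their stored totals.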
In a power plant, a large generating unit has thousands of measurement points. A traditional relational database cannot store all of these measurements in a single table, which is typically limited to about a thousand measurement columns, and splitting tables manually adds complexity. IoTDB's IoT data model supports an arbitrary number of measurement points while maintaining consistent performance, and its multilayer indexes speed up locating both series and data.
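A minimal sketch of why a layered, path-based model avoids the wide-table limit is shown below. It is an illustration only, not IoTDB's implementation: the `SeriesIndex` class and the example paths such as `root.plant.unit1.s42` are hypothetical, and the two dictionary layers stand in for a device-level index and a measurement-level index.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (timestamp, value)


class SeriesIndex:
    """Two-layer index: device path -> measurement name -> points."""

    def __init__(self) -> None:
        self._devices: Dict[str, Dict[str, List[Point]]] = defaultdict(dict)

    @staticmethod
    def _split(series_path: str) -> Tuple[str, str]:
        # The last path segment is the measurement; the rest is the device.
        device, _, measurement = series_path.rpartition(".")
        return device, measurement

    def write(self, series_path: str, ts: int, value: float) -> None:
        device, measurement = self._split(series_path)
        self._devices[device].setdefault(measurement, []).append((ts, value))

    def measurements(self, device: str) -> List[str]:
        """List the measurement names registered under one device."""
        return sorted(self._devices.get(device, {}))

    def read(self, series_path: str) -> List[Point]:
        device, measurement = self._split(series_path)
        return list(self._devices.get(device, {}).get(measurement, []))


if __name__ == "__main__":
    index = SeriesIndex()
    # One generating unit with 3000 measurement points; no table to split.
    for i in range(3000):
        index.write(f"root.plant.unit1.s{i}", ts=1, value=float(i))
    print(len(index.measurements("root.plant.unit1")))
    print(index.read("root.plant.unit1.s42"))
```

Because each measurement is its own series under the device, adding the 3001st measurement point is just another write, and a lookup goes through the device layer and then the measurement layer instead of scanning a wide row.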
When selecting models, what matters for international projects?
Q: Germany and the United States are also promoting and using IoTDB, so what are the differences in the needs and focus of these companies regarding accepting and using a product like IoTDB?
A: Renowned international clients such as Siemens and Bosch conduct extensive testing when choosing models. They would, for example, evaluate the technology and product maturity of 15 traditional real-time databases, such as the PI System from the United States, Emerson's DeltaV, ABB, and Aspen, and compare them with IoTDB.
They would also make a brief comparison of more than 20 time-series databases using DB-Engines and select several for more detailed testing. Furthermore, they would compare our test results with results collected from other users of PI and SQL Server.
They are very strict and will consider similar products. Typically, they use a production load, or a load similar to production, and then increase the pressure over that, rather than experimenting with very high pressures because those are very uncommon in practice and are not particularly relevant to a product line.
Moreover, multinational companies pay close attention to how internationalized the project is. As an indicator of the health of the community, internationalization is of vital importance, and earning stars for your projects by cheating will not enhance your reputation among international users.
In addition, these companies pay close attention to the community participants, such as whether the community maintainers are from the same organization and whether a variety of project managers are involved.
The future of open-source databases
Q: How has the IoTDB project been progressing to date?
A: IoTDB maintains a fast development pace. In April, version 0.13 was released, adding support for univariate and multivariate sequences, triggers, and other features. Moreover, it supports continuous queries, nested expressions, etc., optimizes the process of writing data, and improves the performance of merging system files. Meanwhile, it enhances compatibility with external systems by adding Grafana add-ons, REST APIs, etc. We are currently working to optimize the distributed version, which should be available by August.
Q: In recent years, open-source databases, especially domestic ones, have become increasingly popular. What are your views regarding these databases, and where do you think they will go?
A: In China, open-source database projects are valuable for training database talent. Universities offer database courses; however, most focus on SQL, so it's difficult for students to comprehend how to create a database.
By participating in open source projects, individuals can gain direct experience in database development, which in turn helps China cultivate fundamental software expertise. In the future, there will likely be new kinds of databases, such as time series and graph databases. Databases will also be more targeted at specific application areas. Furthermore, the combination of databases with AI, analysis systems, streaming systems, etc., is also an innovative direction.
Q: Can you give some advice to developers working on open-source databases and students considering joining IoTDB?
A: Open-source databases are popular today, but they are still system software with high complexity and a high barrier to entry. You should therefore set reasonable expectations when working on databases. Developing an open-source database may not yield results in a week or two. It requires understanding database and architecture design concepts and then finding one point to study and optimize, which is a relatively long process.
Guest Introduction
Dr. Qiao Jialin is currently a postdoc and assistant researcher in the School of Software at Tsinghua University. Dr. Qiao is a PMC member and Chief Committer of Apache IoTDB and runs a personal WeChat subscription account, "Tie Tou Qiao". He is a Silver Lecturer of the OpenAtom Foundation and a winner of the first prize of the Beijing Science and Technology Progress Award. Dr. Qiao specializes in databases, including file structures, indexes, and replication management. He has been a member of the IoTDB team from the beginning and has kept working on it since it graduated as an Apache Software Foundation top-level project.