Zhou Li from XiaoIce: AI Chatbots Open a New Future for the Metaverse

By Techplur

Human-computer conversation has been a part of our everyday lives for quite some time, and technologies like AI voice assistants and chatbots are widespread. In this article, we invited Mr. Zhou Li, Vice President of Technology at XiaoIce, to share his ideas about the technical design of the AI chatbot system and the application of this technology in the immersive virtual world.


Why does AI-AI conversation matter?

Conversation between humans has been around for at least hundreds of thousands of years, while human-machine communication has existed for approximately 55 years, starting with the humble chatbot ELIZA. We have witnessed a significant improvement in human-machine conversation in the last decade.

Despite this, only limited research has been conducted in academia or industry on how AI communicates with AI. At most, two chatbots are put together for quality testing to determine which one is more interactive. Could chatbots be used for purposes other than quality testing?

Although the industry has conducted much research on human-AI interactions, including many technological and relevant advancements, there are still three fundamental issues that need to be addressed.

First, does AI truly comprehend what humans are saying, and can its algorithms understand varied human expressions, including omitted and extralinguistic meanings? With the emergence and development of large language models, the significance of this issue appears to have diminished, or we have at least been able to tackle it to some extent.

Second, what else might we discuss? This is an issue for many users of artificial intelligence, whether a mobile voice assistant or a chatbot. After the AI answers a question such as "How is the weather in Beijing?", you may next inquire about the weather in Shanghai. Then, once all the cities you know have been queried, the dialogue is likely over. It is quite challenging to have an open-hearted conversation with AI, since human-machine interaction generally follows this pattern, which is quite distinct from human-to-human conversation.

Third, may I stay silent? Even in human-to-human interactions, there are times when people do not want to speak and would rather just listen. Yet traditional human-AI conversation design forces the person either to keep talking or to exit the conversational interface.

Each of these scenarios raises the question, "Why should I waste time interacting with an AI chatbot?", since users see no real benefit from the artificial intelligence.

Since 2013, XiaoIce has invested considerable effort in human-computer conversation, and the average number of conversation rounds between users and XiaoIce has increased as new technologies have been deployed. In the team's view, more rounds of exchange are a clear sign that humans and AI are communicating more effectively. If the conversation is unproductive, it may end after two or three rounds; if the quality is high, it may run to ten, twenty, or even thirty rounds.

However, we also see that it is difficult for users to talk to AI the way they talk to other humans. As technology advances, how many users will engage with AI as they would with a real person, sharing opinions, experiences, and moods instead of just asking simple questions about the weather? While the percentage is increasing, the growth is no longer as rapid as it once was, which means most people cannot break through this barrier in one-on-one conversations.

User research has found that high school and college students are more likely to cross this threshold, being more accepting of novel things, while older people find it harder to engage with AI and talk to it. As part of the user survey, the team even had a real person chat with users who believed they were still talking to an AI. Yet even with a real person, that is, with almost flawless conversational ability, the percentage did not exceed 20%.

Is there any prospect of breaking this limitation? This is an area XiaoIce has been experimenting with for the last two years, and it is a relatively new field.

Several examples of real-life human communication can help illustrate why and how the ceiling exists.

Scenario 1: A group of strangers meets at a matchmaking session. With strangers and a clear purpose, the topics of conversation tend to be utilitarian and limited, for example, "Do you have a home and a car?", "What is your job?", and "How is your family?". It is not that the attendees are hostile or unkind; rather, much like the weather and knowledge questions posed to AI voice assistants, the entire interaction is confined to a narrow range.

Scenario 2: Former classmates who haven't seen each other for many years. Gatherings like this typically start with school-related memories, and from there people can move on to real life, work, and other topics, even though they may not have met in years. Shared memories are the key to breaking the ice.

So XiaoIce also tried posting WeChat Moments, using algorithms to simulate content such as what it ate today and where it visited, in the hope of providing more topics for conversation. As part of the project, XiaoIce also allowed people to share articles with the AI to build a shared memory, so that both parties could communicate better. Unfortunately, this remains a closed circle: as long as the user has not developed a willingness to communicate with AI, neither the AI's WeChat Moments nor the ability to actively share content makes a difference. In the end, it would just be a waste of time.

Scenario 3: An elderly gentleman walking through a park. A recently retired man strolls through the park, where he sees all kinds of people playing chess, looking after children, and chatting, though he knows no one. At first he just looks and listens. A few days later, he may find a topic that interests him and reach out. As he spends more time in the park, he makes new friends, builds his own community, and integrates seamlessly into the environment.

An interactive experience like this is a great way to break the ice between humans and AI. Immersive social environments, or the metaverse as it is called today, are analogous to that park, and a new user is like the older man on the bench: they can find what interests them, provided there is already plenty of dynamic interaction going on around them. This pre-existing environment need not, and perhaps cannot, be built by users; it can be built by a crowd of artificial intelligences.

In a world of immersive social platforms, there ought to be countless AIs living alongside people. The focus today, therefore, is on exploring how multiple AIs can interact and converse with one another in complex ways.

Ultimately, it makes sense to combine the human community with the AI community and see what interesting results can be achieved. The goal is to develop an immersive virtual social experience supported by users and powered by AI.

"XiaoIceland" contains actual people and some artificial intelligence. Each AI will join forces with another randomly to chat about various topics. If you are interested in hearing their conversation, you can participate in it directly. Several people can also participate in the dialogue for more complex interactions.


The overall design of the AI conversation system

Developing AI-AI conversation is the essential first step in implementing this technology.

Before discussing the technical details, it is necessary to understand the distinction between the traditional human-computer conversation and the AI-AI conversation.

First, the diversity of conversational modes will expand. Typically, with conventional chatbots or voice assistants, the user speaks one line, and the AI responds with another. Human-to-human conversations, on the other hand, do not follow this pattern since 90 percent of the words could be spoken by one person, while the other person acts primarily as a listener.

Listeners come in several varieties: guiding listeners, who occasionally offer prompts to help the speaker express their feelings more effectively; questioning listeners, who ask questions to obtain more information; critical listeners, who provide comments and guidance at the appropriate moments; and hater listeners, who, as the name implies, simply voice disagreement.

This illustrates that such conversation is significantly more sophisticated than the typical human-computer scenario. AI-AI interaction offers greater potential for developing sophisticated interaction patterns, since you control both AIs simultaneously and each side is fully transparent to you.

Second, the overall rhythm becomes extremely important in AI-AI conversations. Even though TTS synthesis technology is now very mature, if you extend the audio to five minutes or even thirty minutes, the synthesized voice will still sound very artificial.

Human speech varies constantly, and the same must be true for AI: we must simulate these variations in speaking speed and in the length of pauses between sentences for the result to feel natural over time.

Likewise, we include more transitions and intonations such as "um", "ah", "I think", and similar words. Traditional human-machine conversation treats these words as noise, needed only when the brain cannot keep up with verbal expression. But when two AIs converse, both sides need these filler words: they make the whole conversation sound more natural, which helps real users keep listening for longer. A minimal sketch of this idea follows.
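XiaoIce has not published its pacing logic, so the sketch below is only an illustration of the idea in Python; `humanize_turn` and its probabilities are invented for the example. It sprinkles filler words into a dialogue turn and randomizes the pauses between sentences before the text would be handed to TTS.

```python
import random

# Illustrative sketch only; not XiaoIce's implementation. Filler words and
# pause lengths are varied so long-form synthesized speech sounds less
# mechanical.

FILLERS = ["um", "ah", "well", "I think"]

def humanize_turn(sentences, base_pause=0.4):
    """Return (text, pause_seconds) pairs with occasional fillers and
    naturally varying pauses between sentences."""
    plan = []
    for sentence in sentences:
        if random.random() < 0.2:  # occasionally open with a filler
            sentence = f"{random.choice(FILLERS)}, {sentence}"
        pause = base_pause * random.uniform(0.5, 2.0)  # avoid a metronomic rhythm
        plan.append((sentence, round(pause, 2)))
    return plan

for text, pause in humanize_turn([
    "The weather in Beijing is clear today.",
    "It might be a good day for a walk in the park.",
]):
    print(f"{text}  <pause {pause}s>")
```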


Text generation for AI conversations

XiaoIce's current practice consists of three methods.

The first is scraping structured documents from search engines. For example, scraping the structured pages of a local tourism website tells us the best places to eat, what the traffic is like, and so forth. Technologies such as BERT are then used to connect these pieces and convert them into flowing content.
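The actual stitching model is not described in detail, so here is only a toy sketch: fields scraped from a hypothetical travel page are joined with fixed templates, standing in for the BERT-based connection mentioned above.

```python
# Toy sketch: structured fields from a (hypothetical) local tourism page
# are stitched into flowing sentences. XiaoIce reportedly uses models such
# as BERT for this connecting step; plain templates stand in here.

scraped = {
    "city": "Hangzhou",
    "best_food": "West Lake fish in vinegar sauce",
    "traffic": "heavy around the lakefront on weekends",
}

templates = [
    "If you visit {city}, locals recommend {best_food}.",
    "One thing to plan for: traffic is {traffic}.",
]

snippet = " ".join(t.format(**scraped) for t in templates)
print(snippet)
```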

The second is the news feed. As unstructured text, news presents a greater challenge, since its varied writing styles make it harder to understand. However, XiaoIce has collaborated with several media outlets over the past few years, providing numerous comments on news stories, which yields a considerable collection of real user comments every day. Using this data, AIs can converse with each other: for instance, when rewriting a news summary, one algorithm speaks the news aloud while a second extracts high-quality comments from previous news articles and inserts them where relevant. In this way, a single article becomes an interactive dialogue.
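As a rough illustration of that interleaving (not XiaoIce's actual pipeline), the sketch below has one voice read the summary sentence by sentence while a second voice interjects the stored comment that best matches the current sentence; simple word overlap stands in for a real relevance model.

```python
# Illustrative sketch: voice A reads the news, voice B inserts the most
# relevant stored user comment. A production system would use semantic
# similarity; word overlap is a stand-in.

def overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

def interleave(news_sentences, comments, min_overlap=2):
    dialogue = []
    for sentence in news_sentences:
        dialogue.append(("A", sentence))
        best = max(comments, key=lambda c: overlap(sentence, c), default=None)
        if best and overlap(sentence, best) >= min_overlap:
            dialogue.append(("B", best))
            comments.remove(best)  # each comment is used at most once
    return dialogue

news = [
    "The city opened a new riverside park this weekend.",
    "Officials expect the park to draw thousands of visitors.",
]
comments = ["I took my kids to the new park, they loved the riverside trail."]
for speaker, line in interleave(news, comments):
    print(f"{speaker}: {line}")
```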

Lastly, we utilize GPT-3 to generate paragraphs. GPT-3 proved effective in terms of language fluency, but it tends to lose logical coherence in longer texts. We address this with a method that extracts a series of keywords. For example, consider the theme of cat urination and defecation in a structured document: we can extract keywords such as cat litter and potty and mix them at regular intervals into the sequence generated by GPT. The entire generation process then follows the logic of these keywords, and the output is more logically arranged. Even so, we currently find a generation length of around 100 to 300 words most appropriate; anything longer introduces a variety of logical defects.
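Since the exact keyword-mixing mechanism is not public, the sketch below only shows the scheduling idea: keywords extracted from the document are fed one by one into a generator so the output follows their order. `generate` is a stub standing in for a GPT-style completion call.

```python
# Sketch of keyword-guided generation; `generate` is a stub, not a real
# GPT-3 call. Keywords are mixed into the sequence at regular intervals
# so the generated text follows their logic.

def generate(context, hint, max_words=40):
    # A real system would call a language model conditioned on `context`
    # and steer decoding toward the `hint` keyword.
    return f"... a few sentences about {hint} ..."

keywords = ["cat litter", "potty training", "cleaning schedule"]

text = "How to handle cat urination and defecation at home."
for kw in keywords:  # feed keywords into the sequence in document order
    text += " " + generate(text, hint=kw)

print(text)
```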

The three methods described above were developed by XiaoIce using some of its own more mature data. These snippets of conversation must also be converted into a longer AI-AI conversation that may include a variety of topics.

The snippets generated by these three methods are then put into a search engine.

As soon as the first snippet is done, the team places its last sentence into a conversation engine and receives a reply. That reply is then picked up by a different conversation engine, which is equivalent to generating content with two engines pitted against each other.
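A minimal sketch of this loop, with stub functions in place of the two tuned engines, might look as follows; every reply from one engine becomes the other engine's next input.

```python
# Sketch of extending a snippet with two conversation engines.
# `engine_a` and `engine_b` are stubs for two differently tuned chat
# models; the names and round count are invented for illustration.

def engine_a(utterance):
    return f"A's take on '{utterance}'"      # stub reply

def engine_b(utterance):
    return f"B's follow-up to '{utterance}'"  # stub reply

def extend_snippet(snippet_last_sentence, rounds=3):
    dialogue, current = [], snippet_last_sentence
    engines = [engine_a, engine_b]
    for i in range(rounds * 2):
        current = engines[i % 2](current)  # alternate the two engines
        dialogue.append(current)
    return dialogue

for line in extend_snippet("Autumn in Beijing is beautiful."):
    print(line)
```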

It is vital to highlight that standard conversation engines built for human-machine interaction, such as voice assistants and chatbots, do not work well in this scenario, because machine-human dialogue and machine-machine exchange remain distinct. At least one of the engines must be tweaked considerably to make the machine-to-machine communication more fluid and logical, without repeating topics.

Each round of conversation needs to be tested: as a first step, its relevance, informativeness, and topic consistency must be checked. In most cases there are two outcomes: a high-entropy judgment that terminates the exchange, or new relevant content that matches the original.

Whenever a new snippet correlates strongly with the last sentence produced by the two machine-machine conversation engines, the team considers the engines to have done their job: they have seamlessly extended one snippet into another, which is the ideal situation.

It is also possible that the two engines search for an appropriate topic for a long time and fail to find one. At that point, we must determine whether the conversation between the two machines is still valid. When the information entropy is sufficiently high, when the answers are dominated by filler such as "yes" and "huh", or when questions and answers repeat heavily, the exchange is considered high entropy. The dialogue between the engines is then suspended, and a new topic is assigned by force, perhaps a current hot topic or something of interest to the user.
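The exact entropy measure is not described, so the check below is only an assumed approximation: if recent replies are mostly filler acknowledgements or near-duplicates, the exchange is judged stalled and a fresh topic should be injected.

```python
# Assumed approximation of the "high entropy" check described above;
# the filler list and thresholds are invented for the example.

FILLER_REPLIES = {"yes", "huh", "yeah", "ok", "right"}

def is_stalled(recent_replies, max_filler_ratio=0.5):
    fillers = sum(r.strip().lower().rstrip(".!?") in FILLER_REPLIES
                  for r in recent_replies)
    repeats = len(recent_replies) - len(set(recent_replies))
    return (fillers / len(recent_replies) > max_filler_ratio) or repeats >= 2

replies = ["Yes.", "Huh.", "Yes.", "That park sounds nice."]
if is_stalled(replies):
    print("Force a new topic, e.g. a trending story or a user interest.")
```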

Such a topic change may feel abrupt, but the team believes the two engines should not be left grinding against each other indefinitely, because the quality of the conversation deteriorates. Interjecting fresh snippets makes the conversation more meaningful, and with this approach a short snippet can be turned into a much longer stretch of speech.


Speech synthesis and pacing control of AI conversations

How do we turn this text into audio that can be listened to directly?

First, the dialogue should reflect a persona appropriate to its content, including whether the voice is male or female and whether the tone is serious or humorous; all of this must match the content being created.

Additionally, as discussed earlier, rhythm control should be more random and natural, and it should depend on the content: a long paragraph may need to be spoken faster, while a continuous back-and-forth between two speakers may slow down, with longer pauses, to make the conversation more engaging.

When key content is being conveyed, the speech should slow down and the volume should rise so that the highlights and main points stand out. Together these elements make the machine-machine dialogue easier to follow. A sketch of these pacing rules appears below.
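These pacing rules map naturally onto standard SSML prosody markup, which many TTS engines accept; the segmentation thresholds below are illustrative, not XiaoIce's actual values.

```python
# Sketch of the pacing rules expressed as SSML prosody markup: long
# passages are read faster, while highlighted points are slowed down and
# made louder. Tag names follow the SSML spec; the rules are illustrative.

def to_ssml(segments):
    parts = ["<speak>"]
    for text, is_highlight in segments:
        if is_highlight:
            parts.append(f'<prosody rate="slow" volume="loud">{text}</prosody>')
        elif len(text.split()) > 25:  # long paragraph: pick up the pace
            parts.append(f'<prosody rate="fast">{text}</prosody>')
        else:
            parts.append(text)
        parts.append('<break time="600ms"/>')  # longer pause between turns
    parts.append("</speak>")
    return "\n".join(parts)

print(to_ssml([
    ("Here is some background on the news story we mentioned.", False),
    ("This is the key point worth remembering.", True),
]))
```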


Application scenarios of AI conversation in immersive virtual social platforms

In a world where AIs converse with one another, XiaoIceland offers an immersive social experience. So how significant is this exploration for the coming metaverse and for our future lives?

First of all, metaverse research tends to focus on visual impact, and headsets are almost regarded as essential tools, as if the metaverse were only meaningful when you can see strange visual phenomena that do not exist in reality. But this is not necessarily true.

For one thing, wearing a headset for long periods is uncomfortable enough that people cannot devote much time to soaking in a visual virtual world, even as hardware advances. Hearing, meanwhile, is a much lighter channel of sensory reception for the metaverse: if users have access to rich auditory content, they can remain comfortable in the metaverse for longer.

Second, the future of immersive virtual social platforms seems likely to give people not only a game-like experience but also solutions to problems in real social networks.

In China, for example, the elderly population is growing, and older people have a great need for the company of their children, yet their children are often too busy at work to spend much time with them. Suppose an elderly man's granddaughter learns a song in kindergarten. Even if she cannot visit her grandfather and sing it to him in person, a computer can use her image and voice to perform the song for him. In the longer term, this is the greater value that the metaverse and AI will bring to human life, and it will continue to grow over time.

Editor: Pang Guiyu | Source: 51CTO