Human-Robot Commensality: Bite Timing Prediction for Robot-Assisted Feeding in Groups

Link: 2207.03348 (arxiv.org)

Problems solved

The authors first build their own video dataset and then train a model on it to predict the best moment for a robot to feed a user. The requirement is that feeding must not disrupt the social dynamics of the meal; the ideal is seamless interaction during robot-assisted feeding in a social dining setting. Ultimately, people with disabilities who rely on assisted feeding can also enjoy a pleasant meal with friends.

Contributions

  1. Collect a Human-Human Commensality Dataset (HHCD) containing 30 groups of three people eating together.
  2. Use this dataset to analyze human-human commensality behaviors and develop bite timing prediction models for social dining scenarios.
  3. Transfer these models to human-robot commensality scenarios.

Methodologies

Problems with existing methods:

  1. Several automated feeding systems are on the market, but they require the user to trigger each bite manually, which is challenging for users with cognitive disabilities and inconvenient in social settings.

  2. Current robotic feeding systems are not designed with the experience of eating together in mind.

Challenge:

  1. Inferring appropriate bite timing in a social dining setting requires attuning not only to the user’s eating behavior but also to the complex social dynamics of the group. For example, a robot should not attempt to feed a user who is actively engaged in conversation.

Problem Definition:

  1. For a single-person meal, the robot captures a signal U from the user, e.g., voice, body gestures, head movements, or speaking status. The right time to feed is defined as the moment the user intends to take a bite. The model takes the history U(t0 : t) as input and outputs y(t + h), a Boolean indicating whether the user wants to take a bite at time t + h.
    “The objective of the bite timing prediction problem in robot-assisted feeding with a single diner is to predict the timing of when this user will take a bite of food by capturing their signals U such as voice, body gestures, head movements or speaking status. We define the proper timing for when a robot should feed as when the user intends to take a bite of food. It takes input signals U(t0 : t) from time t0 to time t and learns a function F(U) to predict a Boolean y(t + h) = F (U(t0 : t)), ”

  2. (This paper) The user dines with two friends, so the model receives two additional social signals, L and R, from the co-diners on the user’s left and right. The inputs are therefore L(t0 : t), R(t0 : t), and the user’s own signal history U(t0 : t). From these three inputs the model outputs y, again a Boolean indicating whether the user wants to take a bite at the predicted moment.
    “In this paper, we consider a social variant of the bite timing prediction problem where a user is interacting with two co-diners. Our goal is to predict the timing of a user to take a bite of food based on the social cues within the interaction. From an initial time t0 to time t, the user receives social signals L(t0 : t) and R(t0 : t) from their left and right conversational co-diners, respectively. Given these external social signals and the target user’s own history of signals U(t0 : t), we aim to predict y. ”
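The two problem definitions above can be sketched as a sliding-window data construction. This is a minimal illustration, not the authors' pipeline: the function name `make_windows`, the per-frame feature layout `(T, d)`, and the parameters `window` and `horizon` (the offset h, counted in frames) are all assumptions made for the example.

```python
import numpy as np

def make_windows(L, R, U, bites, window, horizon):
    """Build (input, label) pairs for the social bite-timing problem.

    L, R, U: arrays of shape (T, d) holding per-frame features of the left
             co-diner, the right co-diner and the target user (assumed layout).
    bites:   boolean array of shape (T,), True where the user takes a bite.
    window:  number of past frames given to the model, i.e. t - t0 + 1.
    horizon: prediction offset h, in frames.
    """
    T = len(bites)
    inputs, labels = [], []
    for t in range(window - 1, T - horizon):
        t0 = t - window + 1
        # Stack the three histories L(t0:t), R(t0:t), U(t0:t): shape (3, window, d).
        x = np.stack([L[t0:t + 1], R[t0:t + 1], U[t0:t + 1]])
        inputs.append(x)
        # The label is y(t + h): does the user take a bite h frames ahead?
        labels.append(bool(bites[t + horizon]))
    return np.array(inputs), np.array(labels)
```

Keeping only the U slice of each input recovers the single-diner formulation; a learned function F would then map each stacked window to its Boolean label.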

Model:

  1. Triplet-SoNNET: The three signals L, R, and U are fed in separately and then combined with one another (similar in spirit to residual connections); e.g., the output of the L branch after a convolution is added into the R branch and then convolved again together with the R input.

  2. Problem with Triplet-SoNNET: In the training dataset, the people producing the social signals are all able-bodied, so their sitting posture, social signals, etc. differ from those of the intended users, which can make the model’s predictions inaccurate.

  3. Improvement: Couplet-SoNNET. Only the bite features of the user’s signal U are retained; the target user’s other social signals are ignored.

    Couplet-SoNNET, where we ignore most social signals from the target user by removing the last channel in Triplet-SoNNET.
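The cascading idea and the Couplet-style input described above can be illustrated with a toy NumPy sketch. It is not the architecture from the paper: the fixed smoothing kernel stands in for learned convolutions, and the names `smooth`, `cascade`, `keep_bite_only`, and the `bite_col` index are hypothetical.

```python
import numpy as np

def smooth(x, kernel):
    """Temporal 1-D convolution applied independently to each feature column.

    x has shape (T, d); T is assumed to be at least len(kernel) so that
    mode="same" keeps the time dimension at length T.
    """
    return np.stack(
        [np.convolve(x[:, j], kernel, mode="same") for j in range(x.shape[1])],
        axis=1,
    )

def cascade(L, R, U, kernel):
    """Toy cascade: each stream's convolved output is added to the next input."""
    h_left = smooth(L, kernel)             # convolve the left co-diner's signal
    h_right = smooth(R + h_left, kernel)   # add it into R, then convolve again
    h_user = smooth(U + h_right, kernel)   # same step for the target user
    return h_user

def keep_bite_only(U, bite_col):
    """Couplet-style input: zero every user feature except the bite column."""
    masked = np.zeros_like(U)
    masked[:, bite_col] = U[:, bite_col]
    return masked
```

Calling `cascade(L, R, keep_bite_only(U, bite_col), kernel)` mimics the Couplet variant in this toy setting: the co-diners' signals pass through unchanged, while only the user's bite history contributes from U.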

Limitations

  1. The experiment rests on many assumptions, e.g., that users will not change their dining habits because of the presence of the robotic arm.

  2. More social factors need to be taken into account, e.g., the user’s culture and who the user is dining with (a more precise dining scenario: with classmates? with teachers? a business meal? a relaxed one?). More such factors could be included under the topic of the symbiotic relationship between humans and robots.

Other

  1. Some English sentences from the original paper

  2. A socially-aware robot (I prefer this term because human socialization is a very complex system that includes many emergent phenomena)