πŸ›‹οΈ Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities

Consistent Multilingual Frame of Reference Test (COMFORT)

Pluralistic Alignment @ NeurIPS 2024

1University of Michigan 2University of Waterloo
3Vector Institute, Canada CIFAR AI Chair 4Michigan State University

*Denotes Equal Contribution

An evaluation protocol to systematically assess the spatial reasoning capabilities of VLMs.

Abstract

Spatial expressions in situated communication can be ambiguous, as their meanings vary depending on the frames of reference (FoR) adopted by speakers and listeners. While spatial language understanding and reasoning by vision-language models (VLMs) have gained increasing attention, potential ambiguities in these models are still under-explored. To address this issue, we present the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to systematically assess the spatial reasoning capabilities of VLMs. We evaluate nine state-of-the-art VLMs using COMFORT. Despite showing some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning.


In situated communication, spatial language understanding and reasoning are often ambiguous, leading to varying interpretations among people from different cultural backgrounds. Specifically: (a) different frames of reference can result in different interpretations of the same spatial expression; (b) speakers of different languages may use distinct coordinate frames for non-fronted reference objects; and (c) spatial relations extend beyond exact axes to include acceptable regions.

COMFORT-BALL

In COMFORT-BALL, the relatum is non-fronted, so we focus on the ambiguity of FoR conventions associated with different languages. The split involves an observer's egocentric perception of a referent (e.g., a red ball) and a non-fronted relatum (e.g., a blue ball). We further randomize the dataset with object-level variations (colors, sizes, and shapes) and scene-level variations (camera positions and distractors) to cover more diverse yet reasonable settings.

[Sample COMFORT-BALL renderings: default scene, plus color, size, camera-position, and distractor variations.]


COMFORT-CAR

In COMFORT-CAR, the relatum is fronted, so multiple FoRs can be explicitly adopted to interpret the scene. A COMFORT-CAR image therefore involves the egocentric perception of a referent, a fronted relatum, and an additional human addressee. One can interpret the spatial relations using the Camera, the Addressee, or the Relatum (C/A/R) as the origin to resolve the reference-frame ambiguity. COMFORT-CAR uses a set of 10 realistic objects in typical household or outdoor scenes, including a horse, car, bench, laptop, rubber duck, chair, dog, sofa, bed, and bicycle, all of which have a clear semantic front. We use a basketball as the referent and vary the relatum. In addition to these objects, we include a human addressee in the scene. To disentangle the different FoRs as much as possible, the addressee faces right, and the relatum faces either left or right in the rendered images, as seen from the rendering camera's perspective.

[Sample COMFORT-CAR renderings: default scene and variants with a bicycle, bed, laptop, and dog as the relatum.]
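
To make the C/A/R ambiguity concrete, here is a minimal sketch (not code from COMFORT) of how a single expression such as "to the left of the car" resolves to different world-space directions depending on which origin is adopted. The facing vectors and the helper left_direction are illustrative assumptions; for the two relative FoRs, the final mapping also depends on the coordinate transformation convention examined later.

```python
import numpy as np

def left_direction(forward: np.ndarray) -> np.ndarray:
    """World-space 'left' for an observer facing along `forward` (a unit
    vector in the ground plane): rotate forward 90 degrees counter-clockwise."""
    fx, fy = forward
    return np.array([-fy, fx])

# Facing directions loosely matching the scene description above: the camera
# looks into the scene (+y), the addressee faces right (+x), and the relatum
# faces left (-x). These exact vectors are illustrative, not the dataset's.
origins = {
    "camera (egocentric relative FoR)":   np.array([0.0, 1.0]),
    "addressee (addressee relative FoR)": np.array([1.0, 0.0]),
    "relatum (intrinsic FoR)":            np.array([-1.0, 0.0]),
}

for name, forward in origins.items():
    print(f"'to the left of the car' w.r.t. {name}: {left_direction(forward)}")
```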

Consistent Evaluation on Vision-Language Models


The COMFORT framework enables us to investigate whether the internal representations of vision-language models (VLMs) encode spatial relations and, if so, which underlying coordinate systems these representations capture. Sample plots are shown below, with the raw probability p(θ) in gray, the normalized probability p̂(θ) in black, and the reference probability λcos(θ) in red.

COMFORT-BALL

[Sample probability plots on COMFORT-BALL for LLaVA-1.5-7B, XComposer2, and GPT-4o.]



COMFORT-CAR

[Sample probability plots on COMFORT-CAR for LLaVA-1.5-7B, XComposer2, and GPT-4o.]
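
The plots above compare each model's yes-probability curve against a cosine reference as the referent is rotated around the relatum. Below is a hedged sketch of that comparison; the min-max normalization, the clipped-cosine acceptability region, and the least-squares fit for λ are illustrative assumptions rather than the paper's exact definitions.

```python
import numpy as np

def normalize(p: np.ndarray) -> np.ndarray:
    """Min-max normalize raw yes-probabilities p(theta) to [0, 1]."""
    return (p - p.min()) / (p.max() - p.min() + 1e-9)

def cosine_region_error(p_hat: np.ndarray, theta: np.ndarray) -> float:
    """Distance between the normalized curve and a scaled cosine reference
    lambda * cos(theta); lambda is fit by least squares and the cosine is
    clipped at zero to represent the acceptability region (an assumption)."""
    ref = np.clip(np.cos(theta), 0.0, None)
    lam = float(p_hat @ ref) / float(ref @ ref + 1e-9)
    return float(np.mean(np.abs(p_hat - lam * ref)))

# Toy example: a model whose "yes" probability follows the reference closely.
theta = np.linspace(-np.pi, np.pi, 73)   # referent rotated around the relatum
p_raw = 0.4 + 0.3 * np.clip(np.cos(theta), 0.0, None)
p_hat = normalize(p_raw)
print("eps_cos ~", round(cosine_region_error(p_hat, theta), 3))  # close to 0
```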

Empirical Experiments and Main Findings

Most VLMs Prefer Reflected Coordinate Transformation Convention

Model Back εcos (↓) Front εcos (↓) Left εcos (↓) Right εcos (↓) Aggregated εcos (↓) Preferred
      Same Rev.     Same Rev.      Same Rev.     Same Rev.      Tran. Rot. Ref.
InstructBLIP-7B 45.6 39.0 31.6 52.0 37.2 48.0 47.5 37.8 40.5 44.2 43.9 -
InstructBLIP-13B 40.9 45.5 46.0 37.4 43.4 44.9 45.6 41.6 44.0 42.3 43.0 -
mBLIP 51.2 53.7 51.2 47.9 52.4 53.5 54.6 46.8 52.3 50.5 52.1 -
GLaMM 58.3 33.3 43.9 42.9 38.3 51.8 17.3 63.7 39.5 47.9 33.0 Ref.
LLaVA-1.5-7B 54.0 32.9 59.1 24.8 11.9 70.0 13.0 68.5 34.5 49.0 20.7 Ref.
LLaVA-1.5-13B 61.8 19.2 56.0 27.7 31.7 61.8 24.3 64.3 43.4 43.2 25.7 Ref.
XComposer2 73.2 17.9 74.5 20.7 20.1 80.9 21.3 81.1 47.3 50.1 20.0 Ref.
MiniCPM-V 70.9 21.9 64.3 26.9 19.7 74.1 21.1 73.3 44.0 49.1 22.4 Ref.
GPT-4o 75.7 28.2 73.6 32.0 24.3 80.8 25.1 80.8 49.7 55.5 27.4 Ref.

Preferred coordinate transformation mapping from the egocentric viewer (camera) to the relatum in the relative FoR. The cosine region parsing errors εcos are computed against both the Same and Reversed directions relative to the egocentric viewer's coordinate system. For example, native English speakers typically prefer a Reflected transformation, which maintains the lateral (left/right) axis but reverses the sagittal (front/back) axis relative to the viewer. We determine the preferred transformation based on the aggregated performance, with “–” for no significant preference.
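
The three transformation conventions in the table admit a compact description: translated copies the viewer's axes onto the relatum unchanged, rotated flips both the lateral and sagittal axes (a 180° turn), and reflected keeps the lateral axis but flips the sagittal axis. The sketch below illustrates this with 2-D axis matrices; the matrix encoding is our own illustration, not code from the paper.

```python
import numpy as np

# Viewer (camera) axes in the ground plane: column 0 is "right" (lateral),
# column 1 is "front" (sagittal).
viewer_axes = np.array([[1.0, 0.0],
                        [0.0, 1.0]])

TRANSFORMS = {
    "translated": np.diag([ 1.0,  1.0]),  # keep right and front as-is
    "rotated":    np.diag([-1.0, -1.0]),  # flip both axes (180-degree turn)
    "reflected":  np.diag([ 1.0, -1.0]),  # keep right, flip front (English default)
}

for name, scale in TRANSFORMS.items():
    relatum_axes = viewer_axes @ scale
    print(f"{name:>10}: right -> {relatum_axes[:, 0]}, front -> {relatum_axes[:, 1]}")
```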

Most VLMs Prefer Egocentric Relative Frame of Reference

Model Back εcos (↓) Front εcos (↓) Left εcos (↓) Right εcos (↓) Aggregated εcos (↓) Prefer
      Ego. Int. Add. Ego. Int. Add. Ego. Int. Add. Ego. Int. Add. Ego. Int. Add.
InstructBLIP-7B 41.0 38.6 38.6 40.9 46.9 46.9 45.6 32.5 51.9 39.6 51.2 31.8 41.8 42.3 42.3 -
InstructBLIP-13B 32.9 34.4 34.4 52.5 48.5 48.5 47.8 56.2 27.8 40.6 27.6 56.6 43.5 41.7 41.8 -
mBLIP-BLOOMZ 52.2 53.2 53.2 45.3 44.6 44.6 47.8 47.6 48.1 45.4 48.4 42.4 47.7 48.4 47.1 -
GLaMM 28.0 49.1 49.1 30.0 40.0 40.0 14.0 56.8 41.5 13.7 53.0 46.6 21.4 49.8 44.4 Ego.
LLaVA-1.5-7B 20.9 43.0 43.0 34.5 32.6 32.6 13.4 53.5 47.4 14.3 53.6 49.3 20.8 45.7 43.1 Ego.
LLaVA-1.5-13B 31.9 38.8 38.8 24.8 57.1 57.1 11.7 51.1 51.1 27.5 57.4 48.7 24.0 51.1 48.9 Ego.
XComposer2 12.7 49.3 49.3 15.2 48.3 48.3 18.8 61.2 53.7 16.5 58.4 15.8 15.8 54.5 51.4 Ego.
MiniCPM-V 34.2 40.7 40.7 35.5 53.4 53.4 18.0 53.9 53.9 19.0 58.1 26.7 26.7 51.5 51.3 Ego.
GPT-4o 38.3 36.7 36.7 43.1 50.2 50.2 34.7 59.3 56.5 24.3 57.3 61.7 35.1 50.9 51.3 Ego.

Preferred frame of reference in VLMs. Models' Cosine Region Parsing Errors εcos are computed against the Intrinsic FoR (relatum origin), Egocentric relative FoR (camera origin), and Addressee-centric relative FoR (addressee origin). English typically prefers an egocentric relative FoR. We determine the preferred FoR based on the aggregated performance, with “-” indicating no significant preference.

VLMs Fail to Adopt Alternative Frames of Reference Flexibly

Model Egocentric Intrinsic Addressee Aggregated
      Acc% (↑) εcos ×10² (↓) Acc% (↑) εcos ×10² (↓) Acc% (↑) εcos ×10² (↓) Acc% (↑) εcos ×10² (↓)
InstructBLIP-7B 47.2(+0.0) 43.5(+1.7) 47.2(+0.0) 42.3(+0.0) 47.2(+0.0) 43.6(+1.3) 47.2(+0.0) 43.1(+1.0)
InstructBLIP-13B 47.2(+0.0) 43.8(+0.3) 47.2(+0.0) 43.2(+1.5) 47.2(+0.0) 42.9(+1.1) 47.2(+0.0) 43.3(+1.0)
mBLIP-BLOOMZ 51.9(−0.9) 55.4(+7.7) 49.8(−3.0) 54.2(+5.8) 49.6(−3.2) 55.8(+8.7) 50.4(−2.4) 55.1(+7.4)
GLaMM 47.2(−10.6) 23.3(−0.7) 47.2(+0.8) 44.2(−6.9) 47.2(−2.8) 42.8(−6.1) 47.2(−4.2) 36.8(−4.6)
LLaVA-1.5-7B 55.2(−2.6) 18.4(−3.0) 48.3(+4.7) 45.7(−4.1) 48.2(−5.0) 43.4(−1.0) 50.6(−1.0) 35.8(−2.7)
LLaVA-1.5-13B 51.6(−15.0) 23.9(+3.1) 47.3(+0.8) 45.0(−7.0) 47.5(−3.8) 38.9(−4.2) 48.8(−6.0) 35.9(−0.6)
XComposer2 85.6(−7.0) 18.8(+3.0) 51.0(+0.5) 51.0(−3.3) 53.2(−0.6) 49.8(−1.6) 63.3(−2.4) 39.9(−0.6)
MiniCPM-V 72.4(−4.8) 24.6(−2.1) 49.9(−2.6) 47.8(−3.7) 52.9(−0.5) 45.1(−6.2) 58.4(−2.6) 39.2(−4.0)
GPT-4o 78.3(+4.6) 28.1(−7.0) 53.4(−1.9) 44.6(−6.3) 49.1(−5.7) 44.9(−6.4) 60.3(−1.0) 39.2(−6.6)

Accuracy and cosine region parsing errors of VLMs when explicitly prompted to follow each frame of reference (cam/rel/add). The values in parentheses indicate the performance change relative to the no-perspective (nop) prompting condition.
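
For context, the four prompting conditions differ only in a perspective prefix prepended to the spatial question. The strings below are illustrative examples of such prefixes (the exact wording used in COMFORT may differ), shown as a small Python helper.

```python
# Illustrative perspective prefixes for the four prompting conditions in the
# table (nop / cam / rel / add). These strings are assumptions for
# demonstration only; the paper's exact prompts may differ.
PERSPECTIVE_PREFIX = {
    "nop": "",                                      # no perspective prompting
    "cam": "From the camera's point of view, ",     # egocentric relative FoR
    "rel": "From the {relatum}'s point of view, ",  # intrinsic FoR
    "add": "From the addressee's point of view, ",  # addressee-centric FoR
}

def build_question(perspective: str, relatum: str = "car") -> str:
    """Compose a yes/no spatial question under the given perspective prompt."""
    prefix = PERSPECTIVE_PREFIX[perspective].format(relatum=relatum)
    question = prefix + f"is the basketball to the left of the {relatum}?"
    return question[0].upper() + question[1:]

for key in PERSPECTIVE_PREFIX:
    print(f"{key}: {build_question(key)}")
```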

Spatial Representations in VLMs are Not Robust and Consistent

Model Obj F1 (↑) Acc% (↑) εcos ×10² (↓) εhemi ×10² (↓) σ ×10² (↓) η ×10² (↓) csym ×10² (↓) copp ×10² (↓)
BALL CAR BALL CAR BALL CAR BALL CAR BALL CAR BALL CAR BALL CAR BALL CAR
InstructBLIP-7B 66.7 66.7 47.2 47.2 43.9 43.5 57.8 56.4 26.7 30.5 48.4 43.4 17.2 16.9 16.6 22.6
InstructBLIP-13B 67.3 50.3 47.2 47.2 43.0 43.8 55.5 55.9 27.1 36.8 48.2 46.4 17.3 17.0 21.0 21.9
mBLIP-BLOOMZ 99.1 33.3 47.5 51.9 52.1 55.4 62.1 65.6 43.7 48.6 54.1 60.7 29.1 30.1 33.8 42.0
GLaMM 100.0 99.8 47.2 47.2 33.0 23.3 45.2 37.6 29.9 23.4 45.0 28.4 10.1 9.4 13.7 14.6
LLaVA-1.5-7B 100.0 88.6 63.2 55.2 20.7 18.4 33.7 32.5 25.2 20.0 23.5 21.8 5.8 5.4 8.3 10.7
LLaVA-1.5-13B 100.0 98.6 55.3 51.6 25.7 23.8 37.6 37.1 19.3 20.8 24.9 29.9 7.0 5.8 9.3 10.8
XComposer2 100.0 95.3 92.4 85.6 20.0 18.8 21.1 26.3 19.2 15.3 13.7 22.9 9.0 6.5 10.5 12.0
MiniCPM-V 66.8 81.5 81.0 72.4 32.4 24.6 32.8 35.8 19.2 19.2 29.8 22.7 10.1 9.2 12.4 14.9
GPT-4o 100.0 94.5 89.2 78.3 27.4 28.1 27.5 35.0 20.9 24.0 43.1 38.8 14.1 13.3 14.2 16.7
Random (30 trials) 50.0 50.9 46.3 58.7 28.3 26.6 42.5 44.2
Always “Yes” 50.0 47.2 61.2 68.7 0.0 0.0 0.0 100.0

A comprehensive evaluation of VLMs under the egocentric relative FoR with the reflected transformation, using an explicit camera-perspective (cam) prompt. The metrics include object hallucination (F1 score), accuracy (Acc), region parsing error (ε), prediction noise (η), standard deviation (σ), and consistency (c).
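
The consistency metrics named in the caption can be illustrated with a short sketch. Assumptions: we take csym to measure deviation of the normalized curve from mirror symmetry about the relation axis, and copp to measure the gap between a relation's curve and its opposite relation's curve shifted by 180°; the paper's exact definitions may differ.

```python
import numpy as np

def c_sym(p_hat: np.ndarray) -> float:
    """Mirror-symmetry deviation, with p_hat sampled on an evenly spaced
    theta grid from -pi to pi (endpoint excluded)."""
    mirrored = np.roll(p_hat[::-1], 1)   # maps the sample at theta to -theta
    return float(np.mean(np.abs(p_hat - mirrored)))

def c_opp(p_rel: np.ndarray, p_opposite: np.ndarray) -> float:
    """Deviation between a relation (e.g. 'left') and its opposite ('right')
    after shifting the opposite curve by 180 degrees."""
    shifted = np.roll(p_opposite, len(p_opposite) // 2)
    return float(np.mean(np.abs(p_rel - shifted)))

# Ideal curves: both checks return 0 for a perfectly consistent model.
theta = np.linspace(-np.pi, np.pi, 36, endpoint=False)
p_left = np.clip(np.cos(theta), 0.0, None)
p_right = np.clip(-np.cos(theta), 0.0, None)
print(c_sym(p_left), c_opp(p_left, p_right))
```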

Multilingual VLMs do not faithfully follow the preferences and conventions associated with different languages when selecting the FoR.

Language English Tamil Hausa
Intrinsic 50.9 52.0 54.0
Ego-Rel Ref. 35.8 40.4 41.0
Rot. 57.3 55.2 56.1
Tran. 53.7 51.1 53.0
Add-Rel Ref. 58.8 52.2 52.8
Rot. 51.3 52.9 55.3
Tran. 56.1 56.1 56.1
GPT-4o Prefer Ego-Ref. Ego-Ref. Ego-Ref.
Human Prefer Ego-Ref. Ego-Rot. Ego-Trans.
[World map: regional preference for the intrinsic vs. relative FoR across languages.]

A world map visualizing each region's preference for the intrinsic FoR over the relative FoR. The plot is based on the top three spoken languages in each region, as reported by The World Factbook (Central Intelligence Agency, 2009), and averages the cosine parsing error (εcos, ↓) weighted by the speaking population. The table above gives a quantitative comparison of English, Tamil, and Hausa, together with the FoR preferred by GPT-4o and by human speakers of each language.
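
The population-weighted aggregation described above can be sketched as follows; the language names, error values, and speaker counts are placeholders, not data from the paper or The World Factbook.

```python
# For each region, average the per-language cosine parsing error eps_cos over
# its top spoken languages, weighted by speaker population.
def weighted_region_error(lang_errors: dict, lang_speakers: dict) -> float:
    """Population-weighted mean of eps_cos over the languages of one region."""
    total = sum(lang_speakers[lang] for lang in lang_errors)
    return sum(lang_errors[lang] * lang_speakers[lang] / total
               for lang in lang_errors)

# Hypothetical region with three dominant languages (placeholder values).
errors   = {"lang_a": 0.41, "lang_b": 0.47, "lang_c": 0.52}
speakers = {"lang_a": 60e6, "lang_b": 25e6, "lang_c": 15e6}
print(weighted_region_error(errors, speakers))  # weighted eps_cos for the region
```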

BibTeX

@misc{zhang2024visionlanguagemodelsrepresentspace,
  title={Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities},
  author={Zheyuan Zhang and Fengyuan Hu and Jayjun Lee and Freda Shi and Parisa Kordjamshidi and Joyce Chai and Ziqiao Ma},
  year={2024},
  eprint={2410.17385},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.17385},
}