Array Telepresence DX Camera vs. Cisco and Polycom’s Voice Tracking Camera Technology

At a cursory glance, one might think that the Array DX Camera module and Image Processor compete with Cisco’s SpeakerTrack 60 and Polycom’s EagleEye Director. Those products attempt to solve the problem of the participants farthest from the camera having tiny, indecipherable features and poor human factors by using machine vision and voice tracking to zoom in on and frame the active speakers.

The reality is Array’s approach solves the “postage stamp-sized head” problem in a vastly superior manner and then further improves the scene and the experience in a variety of ways that will delight your end-users and everyone you conference with.

Immersive Telepresence vs. the Observant Videoconferencing Experience

First, let’s look at “the goal” vs. “the outcome” of the two solutions:

Array’s Goal: Create an immersive telepresence environment in which remote participants appear lifesize, everyone appears to be in the same physical space, the technology is invisible, and the brain can become immersed in the experience and focus on “The Message” vs. “The Medium”.

Voice-tracking Camera Goal: Zoom in on the active speaker (or speakers), frame the shot so that the speaker(s) is/are centered in a head-and-shoulders view, and then cut to that shot so the “postage stamp-sized head” people are now visible.

Outcomes

Voice-Tracking Cameras Destroy Immersion - The voice-tracking cameras achieve their goal of zooming in on the participants, but the constant panning, tilting, and zooming of visible cameras destroys any ability to become “immersed” in the experience. Human beings are biologically attuned to motion, and every time the visible camera moves at the front of the room the brain, consciously or unconsciously, registers the motion, creating a focus on “The Medium” and distracting from “The Message”. On the remote end of the call, participants are treated to a constantly changing perspective, destroying any sense of immersion on their end as well.

Visible Cameras Are Undesirable - Visible cameras remind participants of the artificiality of the experience and cause many to worry about their appearance and about the possibility of the meeting being recorded, which can impact the candidness of the discussions.

The Delay in Voice Tracking Makes It Worse - With all voice-tracking cameras there is a delay of at least 2-3 seconds while a new speaker is identified and acquired, the camera is zoomed, the shot is framed, and the cut-over is made. It doesn’t take long in a voice-activated, camera-produced meeting before your brain begins anticipating the cut-over, causing an additional focus on “the medium” vs. “the message”. Occasionally the shot is not framed ideally and/or auto-corrects, which also brings focus back to “the medium” and distracts from “the message”. The threshold at which lag in spoken conversation becomes annoying has been measured at 200ms. Glass-to-glass latency for compression, network transmission, and decompression adds 75-90ms from the start, so an additional 2-3+ seconds makes it considerably worse.

In comparison, Array’s dedicated hardware approach, with 9 dedicated processor chips, adds only 10ms, an imperceptible delay that does not affect natural conversation.
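To put those numbers side by side, here is a minimal Python sketch of the delay arithmetic. It simply adds each system’s extra delay to the 75-90ms glass-to-glass figure and compares the result with the 200ms threshold, following the framing above that treats cut-over time as added delay. The function and constant names are our own illustrative choices and are not part of any product.

```python
# Figures quoted in this article, in milliseconds.
ANNOYANCE_THRESHOLD_MS = 200      # threshold for annoying conversational lag
GLASS_TO_GLASS_MS = (75, 90)      # compression + network + decompression (low, high)

def total_delay_ms(added_ms):
    """Add a (low, high) range of extra delay to the glass-to-glass range."""
    return (GLASS_TO_GLASS_MS[0] + added_ms[0],
            GLASS_TO_GLASS_MS[1] + added_ms[1])

def stays_under_threshold(delay_range_ms):
    """True only if even the worst case stays below the threshold."""
    return delay_range_ms[1] < ANNOYANCE_THRESHOLD_MS

voice_tracking = total_delay_ms((2000, 3000))   # 2-3 second cut-over
array_processor = total_delay_ms((10, 10))      # 10ms image processing

print(voice_tracking, stays_under_threshold(voice_tracking))
print(array_processor, stays_under_threshold(array_processor))
```

Under these assumptions the voice-tracking total lands at roughly 2.1 to 3.1 seconds, while the Array total stays at 85-100ms, well under the 200ms threshold.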

Cisco & Polycom Demonstration Videos Highlight the Problems with Voice Tracking Delay

Top: Cisco SpeakerTrack 60 Demonstration - Bottom: Polycom EagleEye Director Demonstration

In these two demonstration videos for Cisco SpeakerTrack and Polycom EagleEye Director you can see the lag of 2-3+ seconds while a speaker is identified and acquired, the camera is physically moved and zoomed, and the cut-over is made. The cut-over at the 58-second mark of the Polycom EagleEye Director demo also appears to have had a second or so cut out in post-production editing, since it is markedly shorter than the other cut-overs in the video. Using an iPhone stopwatch, we measured the cut-overs in the Cisco video, which was an uninterrupted live shot with no apparent post-production editing, at between 3.1 and 3.4 seconds.

An actual YouTube comment on Polycom’s EagleEye Director video demonstrates that humans have innate expectations with respect to interpersonal communication. Do people want multiple, visible pan-tilt cameras constantly moving in front of them, OR do they want to feel like they are in the same room with lifesize people?

Cisco & Polycom Demonstration Videos Not Representative of Actual Experience

Actual YouTube Comment on Polycom’s EagleEye Director Video

You can’t hold it against Cisco and Polycom for trying to showcase their products in the best possible light, but the demonstration videos don’t really demonstrate actual use because the participants are unnaturally looking directly into the camera. In real videoconferencing and telepresence meetings the participants would, quite naturally, be looking at the other participants’ facial features on the screen, 14 degrees lower than represented in the videos. So if these highly trained product managers, sales representatives, and demonstrators were not consciously staring at the camera, you would be getting an entirely different experience altogether, with their eye-line off by around 14 degrees.
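The eye-line offset is simple trigonometry: it is the angle between where a participant looks (the faces on the screen) and where the camera sits. The Python sketch below shows the calculation. The 0.5 m camera offset and 2 m viewing distance are hypothetical values we picked for illustration, not measurements from either demonstration video, and the function name is our own.

```python
import math

def eye_line_offset_degrees(camera_offset_m, viewing_distance_m):
    """Angle between the gaze line to on-screen faces and the line to the camera."""
    return math.degrees(math.atan2(camera_offset_m, viewing_distance_m))

# Hypothetical room: camera 0.5 m above the on-screen faces, viewer seated 2 m away.
angle = eye_line_offset_degrees(0.5, 2.0)
print(f"{angle:.1f} degrees")  # about 14.0 degrees
```

With a camera offset of zero, which is the case when the camera sits at eye level, the same function returns 0 degrees.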

Voice Tracking Is Dependent on Acoustical Quality and Full-on Facial Capture - To operate effectively, the voice-activated camera systems require a certain level of acoustical quality, which is not always feasible and can impact system performance. Furthermore, the systems use machine vision to track facial features to prevent extraneous noise from triggering a camera movement. This means that the person talking needs to be facing the camera with both eyes, the nose, and the mouth visible or the system will not make the switch. If you are talking to a local colleague across the table and not facing the camera, there is no switch even though the participants are now anticipating one, which brings more focus to “The Medium” vs. “The Message”.

**The Array Approach Solves the “Postage Stamp-Sized Head” issue, the Eye-line issue AND Improves the Experience in Numerous Additional Ways**

Standard View from a Pan-Tilt-Zoom camera. This view is essentially the same “establishing shot” view that would be provided by a voice-activated camera system showing all the participants in the meeting before one of the cameras began tracking, panning, tilting, and zooming to focus on the active speaker.

The Equal-i view from the exact same perspective as the PTZ view above. Array has not only solved the “Postage Stamp-sized Head” problem but has also improved the eye-line to perfect vertical, concealed the camera, improved the meeting format, and created a wide-format, super-panoramic view of the remote scene that pulls in the eye’s peripheral vision to improve the sense of immersion.

“Postage Stamp-sized Heads” Brought to Life Size - The Array approach solves the problem of the farthest participants having tiny heads by concealing the DX Dual Camera at perfect vertical eye-line between dual displays, centered with the table. The Array Image Processor sits between the camera and the customer’s videoconferencing codec and improves the scene before sending the image to the codec for the trip across the network.

The Array Image Processor (IP) solves the problem of the “Postage Stamp-sized Heads” by bringing the farthest participants “Up Close and Personal”. The IP “Equal-i-zes” the size of the farthest person to the size of the closest person so all remote participants appear to be the same size. This increases the size of the farthest participant to that of the closest participant and at the same time improves the format of the meeting to one where all the remote participants appear to be sitting right across the table.

Most importantly, Array solved the problem in a vastly superior way by making the technology disappear and creating an environment where the participants can become immersed in the experience and focus on “the message” instead of “the medium”.

Equal-i Improves Eye-contact Dramatically for All Participants

The Equal-i approach places the camera at perfect vertical eye-line in the exact center of the table, with perfect vertical and horizontal gaze angle for the two participants at the head of the table, who are usually the most important people in the meeting. The other participants still have perfect vertical eye-line, and the horizontal gaze angle increases for the participants farther away from the camera. The horizontal gaze angle of Equal-i rooms is dramatically better than the horizontal gaze angle of the $250K group telepresence systems from Cisco and Polycom, whose participants on the far right and far left are captured from the side by the center-mounted camera. When the participants on the far right or far left are talking to remote participants directly across from them in the far right or far left positions, they are essentially talking to the side of that person’s head while the person on the remote side of the call is looking at the side of the speaker’s head.

Equal-i delivers a measurably better vertical eye-line (perfect vertical) and measurably better horizontal eye-gaze angles, and it keeps the participants situated directly in front of the cameras for a superb aspect ratio from every seat!

The Array Approach Provides Additional Improvements to the Scene and End-User Acceptance/Satisfaction - In addition to solving the problem of the “Postage Stamp-sized Head”, the Array approach offers significant additional improvements to the experience including:

Perfect Vertical Eye-line - Eye-line is improved by about 14 degrees to perfect vertical eye-line.
Concealed Camera - The DX Dual Camera is concealed at perfect vertical eye-line between dual displays, requiring a space of only 4 mm. The camera sits on and is concealed by the display bezels.
Improved Meeting Format - When the farthest participants are brought “Up Close and Personal” the meeting format is improved to one where everyone appears across the table.
Powers Dual Displays Using a Single Videoconferencing Codec with No Impact on Network Bandwidth in a Single Codec Call - After the Image Processor improves the scene, it takes the two images of the right- and left-hand sides of the table and, using our patent-pending non-linear scaling technology, places them into a single image frame that it hands to the videoconferencing codec for the trip across the network. On the remote side of an Equal-i call, the remote Image Processor sits between the videoconferencing codec and the displays. The Image Processor catches the incoming “squeezed” image and splits the single image into the right- and left-hand sides of the scene for each display. This unique capability allows Array to power wide-format dual displays offering a super-panoramic view into the remote scene using a single videoconferencing codec and bitstream, with no impact on bandwidth or video network infrastructure.
Creates a Wide-Format Super-Panoramic View Into the Remote Scene, Further Improving Immersion - The unique aspect ratio of the camera placed at perfect vertical eye-line and delivered to dual displays (or a video wall, array of panoramic projectors, or other display technology) creates a “wide-format, super-panoramic” view into the remote scene. This wide-format view brings in more of the eye’s peripheral vision vs. the observant experience of a single display. Since peripheral vision is one of the brain’s immersion cues, the sense of immersion is increased for participants.
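The squeeze-and-split idea behind the single-codec, dual-display item in the list above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: it uses plain nearest-neighbor (linear) horizontal scaling as a stand-in, because the details of Array’s patent-pending non-linear scaling are not described here, and it models a frame as a list of pixel rows. All function names are hypothetical.

```python
def resize_width(frame, out_width):
    """Nearest-neighbor horizontal rescale of a frame (a list of pixel rows)."""
    in_width = len(frame[0])
    return [[row[x * in_width // out_width] for x in range(out_width)]
            for row in frame]

def pack(left, right):
    """Sending side: squeeze both halves of the scene into one codec-sized frame."""
    half = len(left[0]) // 2
    squeezed_left = resize_width(left, half)
    squeezed_right = resize_width(right, half)
    return [l + r for l, r in zip(squeezed_left, squeezed_right)]

def unpack(packed, display_width):
    """Receiving side: split the frame and stretch each half back out per display."""
    half = len(packed[0]) // 2
    left = [row[:half] for row in packed]
    right = [row[half:] for row in packed]
    return resize_width(left, display_width), resize_width(right, display_width)

# Tiny one-row "images" standing in for the left and right camera views.
left_view = [[0, 1, 2, 3]]
right_view = [[4, 5, 6, 7]]

packed = pack(left_view, right_view)          # same width as a single view
left_display, right_display = unpack(packed, 4)
print(packed, left_display, right_display)
```

The point of the sketch is that the packed frame is no wider than a single camera view, which is why the codec and the network see one ordinary video stream even though two displays are being fed.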
