SOBA Developer Blog

October 7, 2022

Trying Face Recognition (Facial Expression Recognition) on WebRTC Video

Hello. This is Hanasaki from the SOBA Project.

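I wondered whether real-time facial expression recognition on video was possible, so I looked into it.
My knowledge in this area has not been updated in about ten years, so I do not really know the current trends.
At first I assumed emotion detection meant OpenCV, but the barrier to entry is just too high.

Ideally I want to use this in WebRTC-based web conferencing. That is, I want to run real-time analysis in JavaScript on the video from a MediaStream obtained with getUserMedia and passed to a video tag.

A bit of searching suggests that a face tracking library called clmtrackr is widely used.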

https://github.com/auduno/clmtrackr

That is an impressive number of stars.

What is clmtrackr?

I have no idea how to pronounce it. "CLM tracker", perhaps?

https://github.com/auduno/clmtrackr#clmtrackr

clmtrackr is a javascript library for fitting facial models to faces in videos or images.
It currently is an implementation of constrained local models fitted by regularized landmark mean-shift, as described in Jason M. Saragih’s paper.
clmtrackr tracks a face and outputs the coordinate positions of the face model as an array, following the numbering of the model below:

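In other words, clmtrackr is a JavaScript library that fits a face model to faces in video or images.
It implements constrained local models fitted by regularized landmark mean-shift, following Jason M. Saragih's paper.
It tracks a face and outputs the coordinates of the face model points as an array, numbered as in the model diagram in the README.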
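
To make "coordinate positions as an array" a little more concrete, here is a minimal sketch. It assumes the array holds one [x, y] pair per model point, which is the shape the clmtrackr README examples use, and that a falsy value means no face was found. `faceBoundingBox` is a hypothetical helper of mine, not part of clmtrackr.

```javascript
// Compute the axis-aligned bounding box of the tracked face points.
// Assumption: `positions` is an array of [x, y] pairs, or a falsy value
// (or an empty array) when no face is currently tracked.
function faceBoundingBox(positions) {
  if (!positions || positions.length === 0) {
    return null;
  }
  let minX = Infinity;
  let minY = Infinity;
  let maxX = -Infinity;
  let maxY = -Infinity;
  for (const [x, y] of positions) {
    minX = Math.min(minX, x);
    minY = Math.min(minY, y);
    maxX = Math.max(maxX, x);
    maxY = Math.max(maxY, y);
  }
  return { x: minX, y: minY, width: maxX - minX, height: maxY - minY };
}
```

A box like this could be used, for example, to crop the face region or to check how much of the frame the face occupies.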

So what is a constrained local model?

https://en.wikipedia.org/wiki/Constrained_conditional_model

A constrained conditional model (CCM) is a machine learning and inference framework that augments the learning of conditional (probabilistic or discriminative) models with declarative constraints. The constraint can be used as a way to incorporate expressive[clarification needed] prior knowledge into the model and bias the assignments made by the learned model to satisfy these constraints. The framework can be used to support decisions in an expressive output space while maintaining modularity and tractability of training and inference.

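Roughly: a constrained conditional model (CCM) is a machine learning and inference framework that augments conditional (probabilistic or discriminative) models with declarative constraints. The constraints bring prior knowledge into the model and bias its output toward assignments that satisfy them, while training and inference stay modular and tractable.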

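All I could gather is that it is a machine learning term. (Strictly speaking, this article is about constrained conditional models, which are a different technique from the constrained local models that clmtrackr implements.)

I cannot make much sense of the explanation, so let's run the demo.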

https://auduno.github.io/clmtrackr/examples/clm_emotiondetection.html

It tracks the face model in real time, and it also scores facial expressions. That is handy.

Running it

I will reproduce the demo by hand in Rails.
haml:

#blog2.blog2.relative
  %video#video.absolute{"autoplay": "true", "playsinline": "true"}
  %canvas#overlay.overlay.absolute
  #emotions.relative
    %p
      Happy:
      %span#happy.emotion_icon

-# clmtrackr
= javascript_include_tag "clmtrackr/clmtrackr.js"
= javascript_include_tag "clmtrackr/models/model_pca_20_svm.js"

-# clmtrackr emotion_classifier
= javascript_include_tag "clmtrackr/emotion_classifier.js"
= javascript_include_tag "clmtrackr/emotionmodel.js"

JavaScript:

(() => {

  const mediaConstraints = {
    audio: false,
    video: true
  };

  // Face tracker and emotion classifier provided by the clmtrackr scripts
  const track = new clm.tracker({
    useWebGL: true
  });
  const ec = new emotionClassifier();

  const overlay = document.querySelector('#overlay');
  const overlayCC = overlay.getContext('2d');
  const video = document.getElementById('video');
  // Keep the element itself (not its text) so the score can be updated later
  const happy = document.getElementById('happy');

  window.onload = () => {
    init();
  };

  // Attach the camera stream to the video element and start tracking
  async function localGUM() {
    const stream = await navigator.mediaDevices.getUserMedia(mediaConstraints);

    // Register the handler before playback starts so the event is not missed
    video.onloadedmetadata = () => {
      // Scale the video and the overlay canvas to the window width
      const aspRatio = window.innerWidth / video.videoWidth;
      overlay.width = video.width = window.innerWidth;
      overlay.height = video.height = video.videoHeight * aspRatio;

      track.start(video);
      drawLoop();
    };

    video.srcObject = stream;
    video.play();
  }

  async function init() {
    happy.textContent = '0';

    // Initialize the tracker and the classifier before the camera starts,
    // so that track.start() never runs on an uninitialized tracker.
    // Eigenvectors 9 and 11 are left unregularized, as in the original demo.
    pModel.shapeModel.nonRegularizedVectors.push(9);
    pModel.shapeModel.nonRegularizedVectors.push(11);
    track.init(pModel);
    ec.init(emotionModel);

    await localGUM().catch((err) => {
      console.log('err:%o', err);
    });
  }

  // Draw the current frame and the face model, then update the score
  function drawLoop() {
    overlayCC.clearRect(0, 0, overlay.width, overlay.height);
    overlayCC.drawImage(video, 0, 0, overlay.width, overlay.height);

    if (track.getCurrentPosition()) {
      track.draw(overlay);

      const currentParam = track.getCurrentParameters();
      const emotionResult = ec.meanPredict(currentParam);

      if (emotionResult) {
        // Index 5 is the entry this post displays as "happy"
        happy.textContent = emotionResult[5].value.toFixed(2);
      }
    }

    requestAnimationFrame(drawLoop);
  }

})();
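
The sizing step in the onloadedmetadata handler above scales the video to the window width while keeping its aspect ratio. Pulled out as a pure function, it might look like the sketch below. `fitToWidth` is a hypothetical name, and rounding the height to an integer is my own choice, not something the demo does.

```javascript
// Scale a video frame to a target width while keeping its aspect ratio.
// Returns null while the video metadata is not available yet
// (videoWidth and videoHeight are 0 until then).
function fitToWidth(videoWidth, videoHeight, targetWidth) {
  if (!(videoWidth > 0) || !(videoHeight > 0)) {
    return null;
  }
  const scale = targetWidth / videoWidth;
  return {
    width: targetWidth,
    height: Math.round(videoHeight * scale)
  };
}
```

Inside the handler it would be called as fitToWidth(video.videoWidth, video.videoHeight, window.innerWidth), and the result applied to both the video and the overlay canvas.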

It worked. Amazing. (So much for my vocabulary.)
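
The code above reads the score with emotionResult[5], which breaks silently if the emotion model's entries change order. The sketch below looks the score up by name instead. It assumes the classifier returns an array of { emotion, value } objects (which is what emotionResult[5].value suggests) or a falsy value when no prediction is available. `emotionScore` and `topEmotion` are hypothetical helpers, not part of clmtrackr.

```javascript
// Look up one emotion's score by name instead of by array position.
// Assumption: `result` is an array of { emotion, value } objects,
// or a falsy value when there is no prediction.
function emotionScore(result, name) {
  if (!result) {
    return null;
  }
  const entry = result.find((e) => e.emotion === name);
  return entry ? entry.value : null;
}

// Return the entry with the highest score, or null when there is none.
function topEmotion(result) {
  if (!result || result.length === 0) {
    return null;
  }
  return result.reduce((best, e) => (e.value > best.value ? e : best));
}
```

With this, the display line could become document.getElementById('happy').textContent = (emotionScore(emotionResult, 'happy') ?? 0).toFixed(2), assuming the model labels that entry 'happy'.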

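Accuracy of facial expression recognition

model_pca_20_svm.js appears to be the trained model used for face tracking. I looked at the file, but I could not make sense of its contents at all.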

The library provides some generic face models that were trained on the MUCT database and some additional self-annotated images. Check out clmtools for building your own models.

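In other words, the library ships generic face models that were trained on the MUCT database plus some additional self-annotated images, and clmtools can be used to build your own models.

So what is the MUCT database?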

The MUCT Face Database
The MUCT database consists of 3755 faces with 76 manual landmarks. The database was created to provide more diversity of lighting, age, and ethnicity than currently available landmarked 2D face databases. Some examples are shown on the right (in low-resolution). A full resolution image may be found here.

In short, the MUCT database consists of 3755 faces with 76 manual landmarks. It was created to offer more diversity in lighting, age, and ethnicity than the landmarked 2D face databases available at the time.

I see.

It appears to be published on GitHub.

https://github.com/StephenMilborrow/muct

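I cloned it out of curiosity, but you may want to skip that: it is over 500 MB.
I only skimmed it, but it is basically a huge collection of face photos of people from abroad.

From my testing, which was limited to a single person, face tracking accuracy seems reasonably high.
You can apparently build your own model with something called clmtools, but I doubt it is necessary to go that far.

Expression recognition, however, is hit-or-miss.
I still do not know how to improve its accuracy.

clmtrackr has a low barrier to entry, but mastering it looks like quite a challenge.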

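SOBA Framework Cloud is a framework specialized for visual communication.
It provides the parts needed for visual communication, such as video and audio parts, chat parts, and whiteboard parts.
These parts are provided as HTML tags, and by placing these tags on your own web page,
you can add visual communication features to it very easily.

We have prepared runnable sample pages on each part's customization and options page, so please take a look.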