okkyの銀河制圧奇譚: Classifying counters by correlation coefficient

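Now that a correlation has been computed for every counter, the next step is to classify the counters based on it.

Before that, though: the machine name stuck onto each counter is a nuisance.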

"\\A\Processor(_Total)\% Idle Time" "0.99999999999999558689" ....

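It is this "\\A\" prefix that gets in the way. Later on we will need to take correlations between machines, and if the counter names contain the machine name, it becomes tedious to tell which entries are "the same counter". So let's strip it off right here.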
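As a rough sketch of that stripping step, the substitution can be tried on a single line first. This assumes the counter name is the first quoted field and that machine names consist only of uppercase letters and digits; the `strip_machine` function name is made up for illustration.

```shell
# Hypothetical helper: remove the leading \\MACHINE from a quoted counter name.
# \{1,\} is the POSIX spelling of "one or more"; GNU sed also accepts \+.
strip_machine() {
  sed -e 's/^"\\\\[[:upper:][:digit:]]\{1,\}\\/"\\/'
}

# The name field becomes "\Processor(_Total)\% Idle Time";
# the value field is left untouched.
printf '%s\t%s\n' '"\\A\Processor(_Total)\% Idle Time"' '"0.99999999999999558689"' \
  | strip_machine
```

Only the start of each line is anchored, so a backslash sequence appearing later in the line is not affected.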

And while we are at it, let's merge the files for different dates of the same machine into a single file.
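A small sketch of the merge, assuming the per-date results of one machine live in files matching `MACHINE.*.tsv`. The `merge_machine` helper, the file names, and the two-column sample rows are all invented for illustration (the real files carry more columns).

```shell
# Hypothetical helper: concatenate every per-date file of one machine and
# sort the lines so that rows of the same counter end up next to each other.
#   $1 = directory holding the per-date TSV files
#   $2 = machine name
merge_machine() {
  cat "$1"/"$2".*.tsv | LC_ALL=C sort
}

# Demo with two made-up dates in a temporary directory.
tmp=$(mktemp -d)
printf '"\\Memory\\Pages/sec"\t"0.61"\n"\\System\\Threads"\t"0.12"\n' > "$tmp/A.day1.tsv"
printf '"\\Memory\\Pages/sec"\t"0.58"\n"\\System\\Threads"\t"0.40"\n' > "$tmp/A.day2.tsv"
merge_machine "$tmp" A
rm -rf "$tmp"
```

Sorting is not strictly needed for the classification itself, but it makes the merged file much easier to read by eye.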

Next, classify each TSV file. Here is the script that does the classification:

sort.pl

# Each counter is classified by comparing its r2 values against this threshold.
$r2_limitation = 0.25;

# Usage: sort.pl INPUT HI_OUT BORDER_OUT LOW_OUT
open( INFILE, "<", $ARGV[0] ) or die "can't open file $ARGV[0] as read\n";
open( OHI, ">", $ARGV[1] ) or die "can't open file $ARGV[1] as write\n";
open( OBO, ">", $ARGV[2] ) or die "can't open file $ARGV[2] as write\n";
open( OLO, ">", $ARGV[3] ) or die "can't open file $ARGV[3] as write\n";

while (<INFILE>) {
  chomp;
  my @cells = split /\t/;
  my ( $i, $n );

  # Strip the surrounding double quotes from every cell.
  for ( $i = 0; $i <= $#cells; $i++ ) {
    $cells[$i] =~ s/"(.*)".*$/$1/g;
  }

  $n  = $cells[0];
  # Skip the header row.
  if ( $n eq "name" ) {
    next;
  }
  # First time this counter shows up: remember its name.
  if ( !$counter{$n} ) {
    push @name, $n;
  }

  $i  = $counter{$n};
  $counter{$n}++;
  $r{$n}[$i]   = $cells[1];
  $r2{$n}[$i]  = $cells[2];
  $a{$n}[$i]   = $cells[3];
  $b{$n}[$i]  = $cells[4];
}

while ( $n = pop @name ) {
  my $allbig = 1;
  my $allsmall  = 1;
  my $findNaN = 0;

  for ( $i = 0; $i < $counter{$n}; $i++ ) {
    if ( $r2{$n}[$i] < $r2_limitation ) {
      $allbig  = 0;
    }
    if ( $r2{$n}[$i] >= $r2_limitation ) {
      $allsmall = 0;
    }
    if ( $r2{$n}[$i] eq "NaN" ) {
      $findNaN  = 1;
    }
  }

  if (( $allbig )&&( !$allsmall )) {
    print OHI "\"$n\"\n";
    for ( $i = 0; $i < $counter{$n}; $i++ ) {
      print OHI "# \"$r{$n}[$i]\"\t\"$r2{$n}[$i]\"\t\"$a{$n}[$i]\"\t\"$b{$n}[$i]\"\n";
    }
  }

  if (( !$allbig )&&( $allsmall )) {
    print OLO "\"$n\"\n";
    for ( $i = 0; $i < $counter{$n}; $i++ ) {
      print OLO "# \"$r{$n}[$i]\"\t\"$r2{$n}[$i]\"\t\"$a{$n}[$i]\"\t\"$b{$n}[$i]\"\n";
    }
  }

  if (( !$allbig )&&( !$allsmall )) {
    print OBO "\"$n\"\n";
    for ( $i = 0; $i < $counter{$n}; $i++ ) {
      print OBO "# \"$r{$n}[$i]\"\t\"$r2{$n}[$i]\"\t\"$a{$n}[$i]\"\t\"$b{$n}[$i]\"\n";
    }
  }
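  # Both flags survive only when no r2 value compared as either above or
  # below the threshold: the values were NaN, or there were no rows at all.
  if (( $allbig )&&( $allsmall )) {
    if ( $findNaN ) {
      print STDERR "\"$n\" seems to have NaN as its r2 value\n";
    } else {
      die "\"$n\" did not have any element in $ARGV[0]?!\n";
    }
  }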
}

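The script takes four arguments.
The first is the name of the input TSV file.
The second is the name of the output file that collects only the counters whose r² values are all at or above $r2_limitation.
The third is the name of the output file that collects the counters whose r² values waver, sometimes at or above $r2_limitation and sometimes below it.
The fourth is the name of the output file that collects only the counters whose r² values are all below $r2_limitation.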

For now, $r2_limitation is set to 0.25, the value that STATISTICS HACK recommends as the mark of a "strong correlation" (an r² of 0.25 corresponds to |r| = 0.5).
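The three-way rule can be illustrated on its own with a tiny awk sketch, separate from sort.pl. The function name `classify_r2`, the input format (one r² value per line for a single counter), and the printed labels are assumptions for illustration, and NaN handling is deliberately left out.

```shell
# Hypothetical helper: read the r2 values of one counter (one per line)
# and print HI, BORDER, or LOW using the same threshold rule as sort.pl.
#   $1 = threshold (optional, defaults to 0.25)
classify_r2() {
  awk -v limit="${1:-0.25}" '
    $1 + 0 >= limit { big++ }     # value at or above the threshold
    $1 + 0 <  limit { small++ }   # value below the threshold
    END {
      if (big && !small)      print "HI"
      else if (small && !big) print "LOW"
      else if (big && small)  print "BORDER"
      else                    print "EMPTY"   # no input rows at all
    }'
}

printf '0.9\n0.3\n' | classify_r2   # prints HI
printf '0.1\n0.2\n' | classify_r2   # prints LOW
printf '0.9\n0.1\n' | classify_r2   # prints BORDER
```

A value exactly equal to the threshold counts as "at or above", which matches the `>=` comparison in the script.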

% cd $TOP
% cd $TOP/05sort
% cat filter.sh
#!/bin/sh

onebefore="../04Colleration";
for i in A B C D; do
  cat $onebefore/$i.*.tsv | sed -e 's/^"\\\\[[:upper:][:digit:]]\+\\/"\\/g' | sort > $i.lst
done

for i in A B C D; do
  ./sort.pl $i.lst $i.HI $i.BORDER $i.LOW
done
% bash filter.sh

This produces A.HI, A.BORDER, and A.LOW, B.HI, B.BORDER, and B.LOW, and likewise for C and D.
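For a quick sanity check of the result, one could count how many counters landed in each file. This is only a sketch: `count_counters` is a made-up helper, and it relies on the output layout of sort.pl, where each counter-name line starts with a double quote and each data line starts with "# ".

```shell
# Hypothetical helper: count the counters listed in one output file.
count_counters() {
  # grep -c exits non-zero when the count is 0, so ignore the exit status.
  grep -c '^"' "$1" || true
}

# Demo on a made-up file in the format that sort.pl writes.
sample=$(mktemp)
printf '"\\Memory\\Pages/sec"\n# "0.7"\t"0.49"\t"1.2"\t"0.3"\n' > "$sample"
count_counters "$sample"   # prints 1
rm -f "$sample"
```

Against the real files this would be run as, for example, `for f in A.HI A.BORDER A.LOW; do count_counters "$f"; done`.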

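That is the end of the part that eats the most computing power. From here on, there is nothing for it but for a human to agonize over the results...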

August 12, 2008
