PostgreSQL pg_restore 「符号化方式"UTF8"で無効なバイトシーケンスです」からの逃げ方

古い PostgreSQL で pg_dump したダンプを、最近の PostgreSQL で pg_restore しようとしたときに、「UTF-8のバイトシーケンスがおかしい」と怒られた場合の逃げ方。

$ pg_restore -U postgres -d db3 db3_dump.pgdmp
pg_restore: [アーカイバ(db)] TOC処理中にエラーがありました:
pg_restore: [アーカイバ(db)] TOCエントリ194; 0 48277 TABLE DATA attachments postgresのエラーです
pg_restore: [アーカイバ(db)] COPY failed for table "attachments": ERROR:  符号化方式"UTF8"で無効なバイトシーケンスです: 0x90
HINT:  サーバが想定する符号化方式に合わないバイトシーケンスが存在した場合にもこのエラーが発生します。これは"client_encoding"で制御されます。
CONTEXT:  attachmentsのCOPY。行番号 3740
警告: リストアにてエラーを無視しました: 1

とか

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2279; 0 253991651 TABLE DATA currencies postgres
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe76169
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY currencies, line 1590
WARNING: errors ignored on restore: 1

な場合。

SQLインジェクション対策で UTF-8 の扱いが変わったらしく、Centos 4.8 (PostgreSQL 7.4) で pg_dump したダンプを CentOS6.3 (PostgreSQL 8.4) 上で pg_restore しようとしたら上のエラーにぶつかった。

調べていくと、ここにある手順で逃げられるとのことなので、逃げてみた。

大体次の手順。

エラー確認
ダンプ内の内容物一覧出力
問題部分の抽出
問題部分の修正(UPDATE)
修正済みダンプの取得

ダンプ元のDBが生きているor同バージョンのDBが使えることが前提なので注意。

まず、リストア時に出るエラーを確認。太字部分が大事。

# su - postgres
bash-4.1$ pg_restore -U postgres -d db3 db3_dump.pgdmp
pg_restore: [アーカイバ(db)] TOC処理中にエラーがありました:
pg_restore: [アーカイバ(db)] TOCエントリ194; 0 48277 TABLE DATA attachments postgresのエラーです
pg_restore: [アーカイバ(db)] COPY failed for table "attachments": ERROR:  符号化方式"UTF8"で無効なバイトシーケンスです: 0x90
HINT:  サーバが想定する符号化方式に合わないバイトシーケンスが存在した場合にもこのエラーが発生します。これは"client_encoding"で制御されます。
CONTEXT:  attachmentsのCOPY。行番号 3740
警告: リストアにてエラーを無視しました: 1

次に pg_restoreに -l オプションを付けて、ダンプファイル内のオブジェクトの一覧を出力する。

$ pg_restore --disable-triggers -U postgres -Fc db3_dump.pgdmp -l  > somedb_object.txt

出てきたリストを一旦全部コメントアウト(perl 使ってますが、行頭に ; が付けられれば何でも可)する。

$ perl -nle 'print ";$_"' somedb_object.txt > somedb_object.txt_

リストファイルを開いて、最初のエラーに出ていた問題部分 "194; 0 48277 TABLE DATA attachments postgres" のコメントアウト( ; )を外す

$ vi somedb_object.txt_

問題部分のコメントを外したリストファイルを使い、ダンプファイルから問題部分をファイル(broken.txt)に抽出する。

$ pg_restore --disable-triggers -U postgres -Fc db3_dump.pgdmp -L  somedb_object.txt_ > broken.txt

UTF-8 を正しい形式に変換し直す。

$ iconv -c -f utf8 -t utf8 < broken.txt > fixed.txt

元ファイルと再変換後のファイルの差分をみることで UTF-8 がおかしい部分が見つかる。

$ diff broken.txt fixed.txt
3765c3765
< 68398 4942    6976    4941    \N      text/plain      none   \n-*-*-\nほげほげ\n        Content-Type: text/plain; charset="UTF-8"\nContent-Transfer-Encoding: 8bit\n  26
---
> 68398 4942    6976    4941    \N      text/plain      none   \n-*-*-\n F \n     Content-Type: text/plain; charset="UTF-8"\nContent-Transfer-Encoding: 8bit\n  26

この場合、「ほげほげ」が問題になってるのが見ればわかるので、psql を使って元DBからその部分を修正するか除去する(テーブルスキーマ要確認)。

（元DBが無い場合は、元DBと同じバージョンの PostgreSQL 環境を作ってそこにリストアし、それに対して操作してやる）

$ psql db3
db3=# update attachments set content='' where id=4942;

以上の操作で不具合が修正された元DBを dump して、不具合修正版のダンプファイルを取得する。

$ pg_dump -b -Fc db3 > fixed.pgdump

不具合修正版のダンプファイルを、新DB環境に持っていって pg_restore する。

$ pg_restore -U postgres -d db3 fixed.pgdump

これで問題なくリストアされるはず。