表題

文字列ライブラリ

著者

Olin Shivers

状態

この SRFI は現在「確定」の状態である。 SRFI の各状態の説明についてはここを参照せよ。この SRFI に関する議論についてはメーリングリストのアーカイブを参照せよ。

受理: 1999/10/17
草案: 1999/10/18-1999/12/16
改訂: 1999/10/31
改訂: 1999/11/13
改訂: 1999/11/22
改訂: 2000/04/30
改訂: 2000/06/09
改訂: 2000/12/23

概要

R5RS Scheme には貧弱な文字列処理ユーティリティしかなく、移植性のあるコードを書く場合に障害となる。この SRFI では、一貫性がある包括的な文字列処理手続きを提案する。また、ここで述べる仕様の参照実装を与える。参照実装は次の特徴をもつ。

移植性がある。
効率的である。
オープンソースである。

この SRFI の各ルーチンは、 R5RS の文字列処理ルーチンと後方互換性がある。

手続きの索引

以下に string-lib と string-lib-internals パッケージで提供される手続きの一覧を示す。 R5RS の手続きは太字で示し、 R5RS を拡張した手続きは太字の斜体で示す。

述語

string? string-null?
string-every string-any

構築子

make-string string string-tabulate

リストと文字列の変換

string->list list->string
reverse-list->string string-join

選択

string-length
string-ref
string-copy
substring/shared
string-copy!
string-take string-take-right
string-drop string-drop-right
string-pad  string-pad-right
string-trim string-trim-right string-trim-both

変更

string-set! string-fill!

比較

string-compare string-compare-ci
string<>     string=    string<    string>    string<=    string>=
string-ci<>  string-ci= string-ci< string-ci> string-ci<= string-ci>=
string-hash  string-hash-ci

プレフィックスとサフィックス

string-prefix-length    string-suffix-length
string-prefix-length-ci string-suffix-length-ci

string-prefix?    string-suffix?
string-prefix-ci? string-suffix-ci?

検索

string-index string-index-right
string-skip  string-skip-right
string-count
string-contains string-contains-ci

アルファベットの大小文字変換

string-titlecase  string-upcase  string-downcase
string-titlecase! string-upcase! string-downcase!

反転と追加

string-reverse string-reverse!
string-append
string-concatenate
string-concatenate/shared string-append/shared
string-concatenate-reverse string-concatenate-reverse/shared

畳み込み、逆畳み込み、マップ

string-map      string-map!
string-fold     string-fold-right
string-unfold   string-unfold-right
string-for-each string-for-each-index

複製とローテート

xsubstring string-xcopy!

その他: 挿入、解析

string-replace string-tokenize

フィルタと削除

string-filter string-delete

低レベルの手続き

string-parse-start+end
string-parse-final-start+end
let-string-start+end

check-substring-spec
substring-spec-ok?

make-kmp-restart-vector kmp-step string-kmp-partial-search

論拠

この SRFI では 2 つのライブラリを定義しており、文字列処理のための豊富な手続きを提供している。これらは、スクリプトを書くときやテキスト処理アプリケーションを書くときに便利である。このライブラリの設計は、 MIT Scheme, Gambit, RScheme, MzScheme, slib, Common Lisp, Bigloo, guile, Chez, APL, Java, the SML standard basis の文字列ライブラリの影響を受けている。

文字の比較を伴う手続きでは、大小文字を区別する手続きと区別しない手続きが使用できる。

すべての機能において、部分文字列に作用する手続きと、文字列全体に作用する手続きが使用できる。

文字列はコードポイントのシーケンスである

この SRFI では、文字列を「コードポイント」(文字エンコーディング) のシーケンスとして捉える。文字列の比較や反転などの操作は、コードポイントごとに行われる。 ASCII 以外の文字タイプに関しては、以下の解説を参照せよ。

有効な文字列としては、意味のある「テキスト」シーケンスでなくてもよい。たとえば、ベースとなる文字が存在しない、幅が 0 の Unicode アクセント文字を考えてみよう。これは有効な文字列であるが、自然言語のテキストとして解釈される場合は意味をなさない。この SRFI のルーチンは、このような「テキスト」に関する事柄は扱わず、文字列を単に「コードポイント」のシーケンスと見なした低レベルな処理を行うだけである。

文字列操作はロケール非依存およびコンテキスト非依存である

この SRFI ではロケール非依存およびコンテキスト非依存の文字列操作を定義する。テキスト処理を行う場合、ロケールに依存する比較や順序付けの手続きは確かに重要ではあるが、基本的な文字列操作においては、ロケールに依存しない一連の操作が用意されていることも重要である。そうでなければ、ロケールを変更すると、ハッシュテーブルや B-ツリー、シンボルテーブル、ディレクトリなどのデータ構造が壊れてしまうことになる。

順序付けなどのロケール依存、コンテキスト依存のテキスト操作に関しては、別の SRFI で定義されるだろう。

国際化と ASCII 以外の文字タイプ

この SRFI では、8 ビット Latin-1 や 16 ビットおよび 32 ビット Unicode のような、 ASCII 以外の文字エンコーディングをどのように扱うかについて正面から取り組んだ。少なくともこれら 3 つの標準的なエンコーディングの文字列実装に対して移植性があることが、この SRFI で定義している API の設計目標である。残念なことに、この目標は API の設計に大きな制限を課すことになる。 ASCII 以外の文字エンコーディングを扱うことは非常に複雑な問題であることに注意していただきたい。これら多くの問題の簡単な解決法はないのである。

大小文字のマッピングと case-folding

ASCII 以外のエンコーディングにおける大文字・小文字の区別は複雑である。

大小文字の変換を行う際、1 つの文字に対して複数の文字が対応することがある。たとえば、Latin-1 ドイツ語の eszet 文字は、大文字では "SS" となる。
- したがって、 R5RS の関数 char-upcase は十分に定義されているとは言えない。この関数は 1 文字しか生成しないからである。
- また、バッファを直接書き換える string-upcase! 手続きもきちんと定義することができない。元の文字列が結果の文字列を含むことができるほど十分な長さを持っていないかも知れないからだ。 N 文字の文字列を大文字に変換すると 2N 文字の文字列になることもあるのである。
- このことから、大小文字を区別しない文字列マッチングや検索はとても複雑になる。たとえば、n 文字の文字列 s は 2N 文字の文字列 s' にマッチする可能性がある。
大小文字の変換を行う際、周囲のコンテキストによって変換が異なる場合がある。たとえば、Unicode ギリシア語の大文字のシグマは、その文字が単語の末尾にあるかどうかによって対応する小文字が異なる文字になる。この場合も、R5RS の char-downcase 関数では十分な変換が行えない。
Unicode では、小文字、大文字、タイトル文字の 3 種類を定義している。大文字とタイトル文字の違いは、Unicode 複合文字が存在するときに現れてくる。たとえば、Unicode には複合ペア "dz" を表現するための 1 文字が存在するが、この "dz" 文字を大文字にすると複合文字 "DZ" になるのに対し、タイトル文字 (つまり、アメリカ人がいうところのキャピタル化) にすると、複合文字 "Dz" になる。
トルコ語では、他の言語とは異なる大小文字変換の規則をもつ。

Unicode Consortium のウェブサイト

http://www.unicode.org/

では、この問題に対して詳細な議論を展開している。大小文字変換に関するテクニカルレポート 21

http://www.unicode.org/unicode/reports/tr21/

を参照せよ。

SRFI 13 では、これらの問題に取り組むことはせず、ロケール非依存、コンテキスト非依存の単純な 1 対 1 変換を使用している。特に、

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

で規定されている Unicode 1 対 1 変換を使用している。

このファイルのフォーマットについては

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html

で説明されている。

したがって、ドイツ語の eszet の大文字はそれ自身に変換され、 "SS" には変換されないことに注意せよ。

SRFI 13 の case-mapping と case-folding はロケールに依存しないので、ロケールを変更してもハッシュテーブルや B-ツリー、シンボルテーブルなどを破壊することはないだろう。

文字列の等値性と正規化

2 つの文字列が等値であるか比較することは複雑な処理を必要とする。なぜなら、Unicode は「同じ」文字に対して複数のエンコーディングを提供する場合があるし、我々が普通 1 文字だと考えるものを Unicode では複数のコードポイントの シーケンスとして表現することがあるからである。たとえば、アキュートアクセントをもつ文字 "e" を考えてみる。 Unicode ではこれを表現する文字が 1 つ存在するが、この文字を 2 文字のシーケンスとして表現することも許可している。つまり、"e" 文字に続けてゼロ幅文字であるアキュートアクセント文字を配置することでも表現できるのである。別の例としては、Unicode は、アジア地域のいくつかの文字に対して、狭い幅の文字 (in narrow width) と広い幅の文字 (in full width) を提供している。 (訳注：半角文字、全角文字のことか)

文字列の等値性を判定するためには、いくつかの方法があるだろう。以下では、おおよそ正確な方法から順に記述する。

実際のエンコーディングを正確に比較する方法。この方法の場合、<e-acute> は <e, acute> と等値にはならない。
「正規化された (normalized)」文字列を比較する方法。この方法の場合、上記の 2 つの文字列は等値になる。
より許容範囲の広い正規化による方法。この方法では、「同じ」文字だが見た目が異なる文字は、等値となる。たとえば、アジア地域の狭い幅／広い幅の文字は、等値とみなされる。
アクセントや発音区別符 (diacritical mark) を区別しない比較方法があるとよいかもしれない。
大小文字を区別しない比較方法があるとよいかもしれない。
上記のすべての区別を行わない比較方法があるとよいかもしれない。
さまざまな正規化の方法があるとよいかもしれない。

この SRFI で定めるライブラリでは、これらの複雑な問題に取り組むことはしない。 SRFI 13 における文字列の等値性とは、単純に文字のエンコーディング値が等値であることを意味する。アクセントを区別しない比較方法や、その他の比較方法は、このライブラリでは提供されない。次の URL で定義されている 1 対 1 の大小文字マッピングを使用した、大小文字を区別しない比較方法のみ提供する。

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

この比較方法は、プログラムやシステムで文字列を使用する分には十分であろう。 (たとえば、プログラムの識別子やオペレーティングシステムのファイル名を処理する場合)。

文字列の不等値性

文字列の等値性の問題に加えて、文字列を順序付けする場合に出てくる問題がある。

フランス語では、アクセントを右から左の順に順序付ける。これは文字の順序とは逆の順序になる。
大小文字を区別しない順序付けは、「コードポイント」だけでは定義し切れない。この問題は ASCII コードだけの場合にも当てはまる。 ASCII コードの大文字と小文字の間には、区切り文字 (左角括弧、バックスラッシュ、右角括弧、キャレット、アンダーバー、バッククォート) があるが、大小文字を区別しない比較方法において、左角括弧は "a" よりも小さいのだろうか、それとも、大きいのだろうか？
ドイツ語の eszet 文字は、それが "ss" という連続した文字であるかのように順序付けしなければならない。

Unicode では文字列の順序付け (collation) を行うための、複雑な仕組みを定義している。その方法では、文字列をマルチバイトのソートキーにマッピングし、そのキーに基づいて単純な辞書式のソートを行う。これらの規則は、ドメイン固有の、または、言語固有の規則によって上書きできる。これらの問題に対しても、この SRFI では何の取り組みを行わない。 SRFI 13 における文字列の順序付けは、その文字列を構成している値を 1 つづつ比較することによって行う。

命名規則

このライブラリには数多くの手続きがあるが、すべて一貫した命名規則に従っており、 SRFI 1 で述べられている慣習にも従っている。名前はいくつかの語彙素 (lexeme) から構成され、それらは各手続きの構造や関係を表す。これにより、プログラマがコードを書くときに手続きの名前を思い出しやすくなるようにしている。特に、以下の規則に従っている。

名前が "-ci" で終わる手続きは、大小文字を区別しない。
名前が "!" で終わる手続きは、副作用をもつ。このような手続きの戻り値は不定値である。
どの手続きでも、よく使われる引数の順序は同じである。
左方向/右方向/両方向について。方向をもつ手続きは次の規則に従う。

方向サフィックス

左から右なし

右から左 -right

両方向 -both

この規則は、SRFI 1 で導入された一般的な規則である。慣習は広く使われるほど価値が出てくる。

方向		サフィックス
左から右		なし
右から左		`-right`
両方向		`-both`

ストレージの共有

いくつかの Scheme 実装 (たとえば guile や T) では、他の文字列とストレージを共有する部分文字列を作成することができる。この機能は「共有テキスト部分文字列」(shared-text substrings) と呼ばれている。共有テキスト部分文字列は、部分文字列を作成するために必要なメモリ割り当てやコピー時間を削減することができる。これは、ある種のアプリケーションでは多くの節約になり、線形時間を必要とする処理を定数時間に抑えることができる。さらに、これらの部分文字列が共有されているという事実に依存しているアルゴリズムもある。つまり、あるストレージが変更されると、そのストレージを共有しているすべての文字列も変更されることを前提としているのである。しかしながら、共有テキスト部分文字列はあまり一般的な機能ではない。ほとんどの Scheme 処理系はこの機能を提供していない。

SRFI 13 では、共有テキスト部分文字列に関しては、中立の立場をとる。特に、この SRFI を実装するためには、共有テキスト部分文字列は必要ない。

共有テキスト部分文字列ほどの利点はないが、 SRFI 13 によって別のストレージ共有がもたらされることがある。 SRFI 13 の手続きの中には、引数として渡された文字列をそのまま返すことが許されているものがある。たとえば、 substring/shared 手続きで部分文字列を構築する場合、要求された部分文字列が文字列全体であれば、この手続きは引数に渡された文字列をそのまま返してよいことになっている。すなわち、次の式が満たされる。

(eq? s (substring/shared s 0 (string-length s))) => true or false

反対に、R5RS の substring 関数は、新しくメモリを割り当てて返すことになっている。

(eq? s (substring s 0 (string-length s))) => false.

共有テキストに対する SRFI 13 の基本スタンスを踏まえた上で、このような共有を行っても良いが、必ずしも共有する必要はない。したがって、このような共有の挙動に依存してはならない。

戻り値が引数とストレージを共有することが許されている手続きのほとんどは、新しいストレージを割り当てることが保証されている別バージョンが存在する。文字列を新しく割り当てることを保証したければ、これらの「純粋な」手続きを使えばよい。

文字列を新しく割り当てることが保証されている手続き	文字列を共有することが許されている手続き
`string-copy`	`substring/shared`
`string-copy`	`string-take string-take-right`
`string-copy`	`string-drop string-drop-right`
`string-concatenate`	`string-concatenate/shared`
`string-append`	`string-append/shared`
`string-concatenate-reverse`	`string-concatenate-reverse/shared`
	`string-pad string-pad-right`
	`string-trim string-trim-right`
	`string-trim-both`
	`string-filter string-delete`

一方で、この機能は、共有テキスト部分文字列がなくても効率的なコードを書けるようにするためのものである。共有テキスト部分文字列を作成しなくても、文字列の開始位置と終了位置を渡すだけで効率的なコードを書くことができる。もし、簡単な実装でもいいから共有テキスト部分文字列があれば、開始位置と終了位置を指定するための引数はすべて不要になり、 API は随分とシンプルになるだろう。しかし、SRFI 13 は共有テキスト部分文字列を提供するための実装を要求しないので、代わりとなる拡張 API を提供するのである。

R4RS/R5RS の手続き

R4RS と R5RS では 22 個の文字列手続きを定義している。ここで定義する string-lib パッケージでは、そのうちの 8 個はそのまま含まれており、 3 個は後方互換な形で拡張されており、残りの 11 個は削除している (それらの機能は他の手続きにより実現可能である)。

R4RS や R5RS からそのまま流用した 8 個の手続きは次のとおり。
string?, make-string, string, string-length, string-ref, string-set!, string-append, list->string

削除された 11 個の手続きは次のとおり。
string=?, string-ci=?, string<?, string-ci<?, string>?, string-ci>?, string<=?, string-ci<=?, string>=?, string-ci>=?, substring
string-lib パッケージは、これらの手続きに代わる別の手続きを提供しており、それらは拡張された機能をもつ。

3 個の拡張された手続きは次のとおり。

string-fill! s char [start end] -> unspecified
string->list s [start end] -> char-list
string-copy  s [start end] -> string

これらはすべて start/end オプション引数をとり、部分文字列の範囲を指定できる。

この SRFI 以外の推奨事項

この SRFI では以下のことを推奨する。

共有テキスト部分文字列を導入し、共有ストレージに依存したプログラムを書くことができるようにする SRFI を定義すること。
Unicode テキストを操作するための SRFI を定義すること。様々な正規化、順序付け、検索、などの操作を定義する。順序付けは、特定のロケールや言語の順序付け規則を表現した collation 構造体によりパラメータ化するとよいだろう。別の設計としては、順序付け規則を表現する構造体を特別な手続きにより動的スコープに設定する方法がある。その構造体を、文字列の順序付けを必要とする手続きのオプション引数として渡せば、動的スコープの設定をオーバーライドできるようにする。 たとえば、
```
(with-locale* denmark-locale
  (lambda ()
    (f x)
    (g 42)))

(with-locale taiwan-locale
  (f x)
  (h denmark-locale)
  (g 42))

(set-locale! denmark-locale)
```
少なくとも ASCII, Latin-1, Unicode の間で移植性がある文字操作を行うための SRFI を定義すること。
- 後方互換性のために char-upcase と char-downcase を定義し、Unicode の UnicodeData.txt テーブルで定められているロケール非依存でコンテキスト非依存な１対１の大小文字マッピングを使えるようにする。
- Unicode/Latin-1/ASCII の変換を行うために、文字と整数とを変換する標準関数を定義する。これにより移植性のあるコードを書くことができる。
- char-upcase と char-downcase に加えて char-titlecase を追加する。
- char-upcase? と char-downcase? に加えて char-titlecase? を追加する
- 文字処理関数に、タイトル文字変換、大文字変換、小文字変換を行う関数を追加する。変更不可な文字列、マルチ文字も可能な文字列を返すことで、１対多の文字マッピングを行えるようにする。これらの文字マッピングは、ロケール依存やコンテキスト依存である必要はない。

これらの推奨事項は SRFI 13 の仕様の一部ではない。また、Unicode/Latin-1/ASCII に対して文字と整数は変換できることを要求してはいるが、このことは実装の内部エンコーディングについては何も要求していないことに注意せよ。

手続きの仕様

後述する手続きの仕様では、以下のことを前提としている。

引数 s は文字列である。
引数 char は文字である。
引数 start と end は、文字列引数の部分文字列を表すインデックスであり、インデックス start からインデックス end の 1 つ前までを部分文字列とする。オプションとする場合は、start のデフォルトは 0 であり、 end のデフォルトは文字列の長さとなる。これらの値を引数に指定する場合は、対応する文字列 s に対して、 0 <= start <= end <= (string-length s) を満たさなければならない。これらの引数の一般的な使われ方としては、手続きの処理対象を部分文字列に制限する働きをする。
引数 pred は 1 つの文字を受け取り、真偽値を返す、述語手続きである。
引数 char/char-set/pred は、文字列内の文字を選択または検索するために使われる。文字 (char) の場合は、その文字に等値かどうかの判定に使われる。文字セット (char-set) の場合は、その文字セットの中に文字が存在するかどうかの判定に使われる。手続き (pred) の場合は、各文字に対してその手続きが呼び出され、その文字が選択されたかどうかを判定する。
引数 i は、正確数である 0 以上の整数で、文字列のインデックスを指定する。
引数 len と nchars は正確数である 0 以上の整数で、文字列の長さ、または、文字の個数を指定する。
引数 obj は任意の値である。

以上の引数に対して間違った型の値を指定すると、エラーとなる。

角括弧で囲まれている引数はオプション引数である。仕様に明示されていない限り、オプション引数は 0 個、最初の 1 個、最初の 2 個 ... のように任意に指定できる。手続きが多重値を返す場合は、それらも角括弧の中に記述する。たとえば、次のシグニチャを持つ手続きの場合、

halts? f [x init-store] -> [boolean integer]

１つ (f)、２つ (f, x) または３つ (f, x, init-store) の引数を受け取り、ブール値と整数値の２つの値を返す。

後ろに "..." が付いている引数は、0 個以上の引数を表す。たとえば、次のシグニチャを持つ手続きの場合、

sum-squares x ...  -> number

0 個以上の引数 (x ...) を取る。次のシグニチャを持つ手続きの場合、

spell-check doc dict₁ dict₂ ... -> string-list

２つの必須引数 (doc と dict₁) と 0 以上のオプション引数 (dict₂ ...) を取る。

手続きの戻り値が不定 (unspecified) であると言う場合、これは、手続きが何を返すかについては規定していない、という意味である。そのような手続きの戻り値は、呼出しごとに異なっていても構わない。単に、コマンド継続 (command continuation) に渡されるかも知れない値 (または多重値) を返すことが要求されているだけである。 たとえば、 begin 式の非終端の部分フォームの値が返されるかも知れない。 R5RS では、戻り値が不定である手続きは１つの値を返すことが要求されていることに注意しよう。非-R5RS な処理系では、この制限を満たしていないかも知れない。

主要な手続き

モジュールシステムやパッケージシステムをもつ Scheme システムでは、以下の手続きはモジュール名 "string-lib" に含まれるべきある。

述語

string? obj -> boolean

[R5RS] obj が文字列なら #t を返す。そうでなければ #f を返す。

string-null? s -> boolean

s が空文字列かどうか判定する。

string-every char/char-set/pred s [start end] -> value

string-any char/char-set/pred s [start end] -> value

文字列 s の中のすべての文字 (または任意の文字) が、指定された基準を満たすかどうかを確認する。文字列の左 (インデックス start) から右 (インデックス end) の順に確認する。

char/char-set/pred が文字の場合、 s の各文字がこの文字に一致するか確認する。

char/char-set/pred が文字セットの場合、 s の各文字がこの文字セットに含まれているか確認する。

char/char-set/pred が述語手続きの場合、 s の各文字に適用される。 The predicate is "witness-generating:"

string-any が真を返す場合、その戻り値は、述語の戻り値である。
string-every が真を返す場合、その戻り値は、述語を最後の文字 s[end-1] に適用した戻り値である。 string-every を空文字列に適用した場合は、 #t を返す。

string-every や string-any が述語を最後の文字 (すなわち, s[end-1]) に適用する場合、その最後の適用は末尾呼び出しである。

これらの手続きの名前は、末尾に疑問符が付いていない。このことは、述語を適用した場合、単純な真偽値 (#t または #f) を返すのではなく、一般の値を返すことを示唆している。

コンストラクタ

make-string len [char] -> string: [R5RS] make-string は長さ len の新しい文字列を割り当てて返す。 char を指定した場合は、文字列のすべての要素は char に初期化される。指定しなかった場合は、文字列の内容は不定になる。
string char₁ ... -> string: [R5RS] 引数で指定された文字で構成される新しい文字列を割り当てて返す。
string-tabulate proc len -> string: proc は integer->char の手続きである。長さ len の文字列を作成し、その各インデックスを引数にして proc を呼び出して、その戻り値をそのインデックスの要素として初期化する。 proc をインデックスに適用する順序は不定である。

リストと文字列の変換

string->list s [start end] -> char-list

list->string char-list -> string

[R5RS+] string->list は、文字列を構成する各文字のリストを新しく割り当てて返す。 list->string は、文字のリスト char-list 内の文字で構成される文字列を新しく割り当てて返す。 string->list と list->string は equal? に関して逆変換の関係にある。

string->list は R5RS の定義を拡張しており、オプション引数 start/end を指定できる。

reverse-list->string char-list -> string

この手続きは (compose list->string reverse) の効率的な実装である。

(reverse-list->string '(#\a #\B #\c)) -> "cBa"

この処理は、ループ内で文字列を逆順リストとして構築していく文字列処理の最後の処理としてよく使われる (この手続きの「チャンク化」バージョンについては、 string-concatenate-reverse を参照せよ)。

string-join string-list [delimiter grammar] -> string

この手続きは、区切り文字列を用いて複数の文字列を結合する。

grammar 引数は、区切り文字列の使い方をシンボルで指定する。デフォルトは 'infix である。

'infix は中置区切りを表す: 区切り文字列をリストの各要素の間に挿入する。空リストを指定した場合は、空文字列が生成される。しかし、空文字列を中置区切り (セパレータ) で分解する場合、空リストが生成されることもあるだろうし、空文字列を含む１要素のリストが生成されることもあるだろうから、曖昧さがあることに注意せよ。
'strict-infix は 'infix と同じだが、空リストを与えられた場合は、エラーを出す。
'suffix は後置区切りを表す: 区切り文字列をすべてのリスト要素の後ろに挿入する。この処理には曖昧さはない。
'prefix は前置区切りを表す。: 区切り文字列をすべてのリスト要素の前に挿入する。この処理には曖昧さはない。

delimiter 引数は区切り文字列であり、デフォルトは１個のスペース " " である。

(string-join '("foo" "bar" "baz") ":")         => "foo:bar:baz"
(string-join '("foo" "bar" "baz") ":" 'suffix) => "foo:bar:baz:"

;; Infix grammar is ambiguous wrt empty list vs. empty string,
(string-join '()   ":") => ""
(string-join '("") ":") => ""

;; but suffix & prefix grammars are not.
(string-join '()   ":" 'suffix) => ""
(string-join '("") ":" 'suffix) => ":"

選択

string-length s -> integer

[R5RS] 文字列 s に含まれる文字数を返す。

string-ref s i -> char

[R5RS] 文字 s[i] を返す。インデックス i は 0-based であり、 s の有効なインデックスでなければならない。

string-copy s [start end] -> string

substring/shared s start [end] -> string

[R5RS+] substring/shared returns a string whose contents are the characters of s beginning with index start (inclusive) and ending with index end (exclusive). It differs from the R5RS substring in two ways:

The end parameter is optional, not required.
substring/shared may return a value that shares memory with s or is eq? to s.

string-copy is extended from its R5RS definition by the addition of its optional start/end parameters. In contrast to substring/shared, it is guaranteed to produce a freshly-allocated string.

Use string-copy when you want to indicate explicitly in your code that you wish to allocate new storage; use substring/shared when you don't care if you get a fresh copy or share storage with the original string.

(string-copy "Beta substitution") => "Beta substitution"
(string-copy "Beta substitution" 1 10)
    => "eta subst"
(string-copy "Beta substitution" 5) => "substitution"

string-copy! target tstart s [start end] -> unspecified

Copy the sequence of characters from index range [start,end) in string s to string target, beginning at index tstart. The characters are copied left-to-right or right-to-left as needed -- the copy is guaranteed to work, even if target and s are the same string.

It is an error if the copy operation runs off the end of the target string, e.g.

(string-copy! (string-copy "Microsoft") 0
              "Regional Microsoft Operating Companies") => error

string-take s nchars -> string

string-drop s nchars -> string

string-take-right s nchars -> string

string-drop-right s nchars -> string

string-take returns the first nchars of s; string-drop returns all but the first nchars of s. string-take-right returns the last nchars of s; string-drop-right returns all but the last nchars of s. If these procedures produce the entire string, they may return either s or a copy of s; in some implementations, proper substrings may share memory with s.

(string-take "Pete Szilagyi" 6) => "Pete S"
(string-drop "Pete Szilagyi" 6) => "zilagyi"

(string-take-right "Beta rules" 5) => "rules"
(string-drop-right "Beta rules" 5) => "Beta "

It is an error to take or drop more characters than are in the string:

(string-take "foo" 37) => error

string-pad s len [char start end] -> string

string-pad-right s len [char start end] -> string

Build a string of length len comprised of s padded on the left (right) by as many occurrences of the character char as needed. If s has more than len chars, it is truncated on the left (right) to length len. Char defaults to #\space.

If len <= end-start, the returned value is allowed to share storage with s, or be exactly s (if len = end-start).

(string-pad     "325" 5) => "  325"
(string-pad   "71325" 5) => "71325"
(string-pad "8871325" 5) => "71325"

string-trim s [char/char-set/pred start end] -> string

string-trim-right s [char/char-set/pred start end] -> string

string-trim-both s [char/char-set/pred start end] -> string

Trim s by skipping over all characters on the left / on the right / on both sides that satisfy the second parameter char/char-set/pred:

if it is a character char, characters equal to char are trimmed;
if it is a char set cs, characters contained in cs are trimmed;
if it is a predicate pred, it is a test predicate that is applied to the characters in s; a character causing it to return true is skipped.

Char/char-set/pred defaults to the character set char-set:whitespace defined in SRFI 14.

If no trimming occurs, these functions may return either s or a copy of s; in some implementations, proper substrings may share memory with s.

(string-trim-both "  The outlook wasn't brilliant,  \n\r")
    => "The outlook wasn't brilliant,"

変更

string-set! s i char -> unspecified

[R5RS] i は s の有効なインデックスでなければならない。 string-set! は s のインデックス i の要素として char を格納する。コード中に現れる文字列リテラルは変更不可であり、それを string-set!. で変更することはエラーである。

(define (f) (make-string 3 #\*))
(define (g) "***")
(string-set! (f) 0 #\?)                ==>  unspecified
(string-set! (g) 0 #\?)                ==>  error
(string-set! (symbol->string 'immutable)
             3
             #\?)                      ==>  error

string-fill! s char [start end] -> unspecified

[R5RS+] s のすべての要素を char で埋める。

string-fill は R5RS の定義を拡張しており、オプション引数 start/end を指定することができる。

比較

string-compare s1 s2 proc< proc= proc> [start1 end1 start2 end2] -> values

string-compare-ci s1 s2 proc< proc= proc> [start1 end1 start2 end2] -> values

Apply proc<, proc=, or proc> to the mismatch index, depending upon whether s1 is less than, equal to, or greater than s2. The "mismatch index" is the largest index i such that for every 0 <= j < i, s1[j] = s2[j] -- that is, i is the first position that doesn't match.

string-compare-ci is the case-insensitive variant. Case-insensitive comparison is done by case-folding characters with the operation

(char-downcase (char-upcase c))

where the two case-mapping operations are assumed to be 1-1, locale- and context-insensitive, and compatible with the 1-1 case mappings specified by Unicode's UnicodeData.txt table:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

The optional start/end indices restrict the comparison to the indicated substrings of s1 and s2. The mismatch index is always an index into s1; in the case of proc=, it is always end1; we observe the protocol in this redundant case for uniformity.

(string-compare "The cat in the hat" "abcdefgh"
                values values values
                4 6         ; Select "ca"
                2 4)        ; & "cd"
    => 5    ; Index of S1's "a"

Comparison is simply done on individual code-points of the string. True text collation is not handled by this SRFI.

string= s1 s2 [start1 end1 start2 end2] -> boolean

string<> s1 s2 [start1 end1 start2 end2] -> boolean

string< s1 s2 [start1 end1 start2 end2] -> boolean

string> s1 s2 [start1 end1 start2 end2] -> boolean

string<= s1 s2 [start1 end1 start2 end2] -> boolean

string>= s1 s2 [start1 end1 start2 end2] -> boolean

These procedures are the lexicographic extensions to strings of the corresponding orderings on characters. For example, string< is the lexicographic ordering on strings induced by the ordering char<? on characters. If two strings differ in length but are the same up to the length of the shorter string, the shorter string is considered to be lexicographically less than the longer string.

The optional start/end indices restrict the comparison to the indicated substrings of s1 and s2.

Comparison is simply done on individual code-points of the string. True text collation is not handled by this SRFI.

string-ci= s1 s2 [start1 end1 start2 end2] -> boolean

string-ci<> s1 s2 [start1 end1 start2 end2] -> boolean

string-ci< s1 s2 [start1 end1 start2 end2] -> boolean

string-ci> s1 s2 [start1 end1 start2 end2] -> boolean

string-ci<= s1 s2 [start1 end1 start2 end2] -> boolean

string-ci>= s1 s2 [start1 end1 start2 end2] -> boolean

Case-insensitive variants.

Case-insensitive comparison is done by case-folding characters with the operation

(char-downcase (char-upcase c))

where the two case-mapping operations are assumed to be 1-1, locale- and context-insensitive, and compatible with the 1-1 case mappings specified by Unicode's UnicodeData.txt table:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

string-hash s [bound start end] -> integer

string-hash-ci s [bound start end] -> integer

Compute a hash value for the string s. Bound is a non-negative exact integer specifying the range of the hash function. A positive value restricts the return value to the range [0,bound).

If bound is either zero or not given, the implementation may use an implementation-specific default value, chosen to be as large as is efficiently practical. For instance, the default range might be chosen for a given implementation to map all strings into the range of integers that can be represented with a single machine word.

The optional start/end indices restrict the hash operation to the indicated substring of s.

string-hash-ci is the case-insensitive variant. Case-insensitive comparison is done by case-folding characters with the operation

(char-downcase (char-upcase c))

where the two case-mapping operations are assumed to be 1-1, locale- and context-insensitive, and compatible with the 1-1 case mappings specified by Unicode's UnicodeData.txt table:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

Invariants:

(<= 0 (string-hash s b) (- b 1)) ; When B > 0.
(string=    s1 s2)  =>  (= (string-hash s1 b)    (string-hash s2 b))
(string-ci= s1 s2)  =>  (= (string-hash-ci s1 b) (string-hash-ci s2 b))

A legal but nonetheless discouraged implementation:

(define (string-hash    s . other-args) 1)
(define (string-hash-ci s . other-args) 1)

Rationale: allowing the user to specify an explicit bound simplifies user code by removing the mod operation that typically accompanies every hash computation, and also may allow the implementation of the hash function to exploit a reduced range to efficiently compute the hash value. E.g., for small bounds, the hash function may be computed in a fashion such that intermediate values never overflow into bignum integers, allowing the implementor to provide a fixnum-specific "fast path" for computing the common cases very rapidly.

プレフィックスとサフィックス

string-prefix-length s1 s2 [start1 end1 start2 end2] -> integer

string-suffix-length s1 s2 [start1 end1 start2 end2] -> integer

string-prefix-length-ci s1 s2 [start1 end1 start2 end2] -> integer

string-suffix-length-ci s1 s2 [start1 end1 start2 end2] -> integer

２つの文字列の最長の共通プレフィックス／共通サフィックスの長さを返す。プレフィックスに関しては、この値は「マッチしないインデックス」に等しい (ただしstart インデックスのオフセット分は差し引く)。

オプションの start/end インデックスを指定すると、 s1 と s2 をその部分文字列に制限した上で比較する。

string-prefix-length-ci と string-suffix-length-ci は大小文字を区別しないバージョンである。大小文字を区別しない比較では、次式により変換された case-folding 文字によって比較される。

(char-downcase (char-upcase c))

ここで、この２つの大小文字変換は１対１対応であり、ロケール非依存、コンテキスト非依存であり、また、 Unicode の UnicodeData.txt テーブルで定義されている１対１の大小文字変換と互換性があることを仮定している。

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

比較は単に文字列の個々のコードポイントごとに行われる。

string-prefix? s1 s2 [start1 end1 start2 end2] -> boolean

string-suffix? s1 s2 [start1 end1 start2 end2] -> boolean

string-prefix-ci? s1 s2 [start1 end1 start2 end2] -> boolean

string-suffix-ci? s1 s2 [start1 end1 start2 end2] -> boolean

s1 が s2 のプレフィックス／サフィックスであるか確認します。

オプションの start/end インデックスを指定すると、 s1 と s2 をその部分文字列に制限した上で比較する。

string-prefix-ci? と string-suffix-ci? は大小文字を区別しないバージョンである。大小文字を区別しない比較では、次式により変換された case-folding 文字によって比較される。

(char-downcase (char-upcase c))

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

比較は単に文字列の個々のコードポイントごとに行われる。

検索

string-index s char/char-set/pred [start end] -> integer or #f

string-index-right s char/char-set/pred [start end] -> integer or #f

string-skip s char/char-set/pred [start end] -> integer or #f

string-skip-right s char/char-set/pred [start end] -> integer or #f

string-index (string-index-right) は文字列を左から (右から) 検索し、次のいずれかの条件を満たす最初の文字のインデックスを返す。

char/char-set/pred に等しい (これが文字の場合)。
char/char-set/pred に含まれる (これが文字セットの場合)。
述語 char/char-set/pred を満たす (これが手続きの場合)。

何もマッチしなければ、これらの関数は偽を返す。

start と end 引数は、検索の開始位置と終了位置を指定する。検索は start インデックスを含むが、end インデックスは含まない。この範囲指定には注意せよ。右から左に検索する場合、最初に注目されるインデックスは

end-1

であるのに対し、左から右に検索する場合は、最初に注目されるインデックスは

start

である。つまり、SRFI-13 の他の手続きと同じように、 start/end インデックスは半開区間 [start,end) を指定する。

skip 系の関数も同様であるが、条件が反転する。つまり、上記の条件を満たさない最初の文字を検索する。 たとえば、先頭の空白をスキップするには、次のようにする。

(cond ((string-skip s char-set:whitespace) =>
       (lambda (i) ...)) ; s[i] is not whitespace.
      ...)

string-count s char/char-set/pred [start end] -> integer

文字列 s の中で char/char-set/pred 引数の条件を満たす文字の個数を返す。この引数が手続きであれば、述語として各文字に適用される。この引数が文字セットであれば、各文字がその中に含まれるかどうか検査される。この引数が文字であれば、各文字がその文字に一致するかどうか検査される。

string-contains s1 s2 [start1 end1 start2 end2] -> integer or false

string-contains-ci s1 s2 [start1 end1 start2 end2] -> integer or false

文字列 s1 が文字列 s2 を含むかどうか確認する。

文字列 s1 の中で部分文字列 s2 が出現するインデックスを返す。出現しなければ偽を返す。オプションの start/end インデックスを指定すると、各文字列をその部分文字列に制限した上で検索を行う。

返されるインデックスは [start1,end1) の範囲の値になる。検索に成功すると、その部分文字列は s1 の [start1,end1) の範囲内に含まれている。

(string-contains "eek -- what a geek." "ee"
                 12 18) ; Searches "a geek"
    => 15

string-contains-ci は大小文字を区別しないバージョンである。大小文字を区別しない比較では、次式により変換された case-folding 文字によって比較される。

(char-downcase (char-upcase c))

ここで、この２つの大小文字変換は１対１対応であり、ロケール非依存、コンテキスト非依存であり、また、Unicode の UnicodeData.txt テーブルで定義されている１対１の大小文字変換と互換性があることを仮定している。

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

比較は単に文字列の個々のコードポイントごとに行われる。

これらの手続きの名前の末尾は疑問符ではないことに注意せよ。これは、単純なブール値 (#t または #f) を返すのではなく、偽 (#f) または 0 以上の正確整数を返すことを示唆している。

アルファベットの大小文字変換

string-titlecase s [start end] -> string

string-titlecase! s [start end] -> unspecified

文字列 s の指定された範囲内のすべての文字 c に対して、c の前に大小文字 (cased character) があれば小文字に変換され、そうでなければタイトル文字に変換される。

string-titlecase は結果の文字列を返すが、引数の文字列 s は破壊しない。 string-titlecase! は引数を破壊するバージョンである。

(string-titlecase "--capitalize tHIS sentence.") =>
  "--Capitalize This Sentence."

(string-titlecase "see Spot run. see Nix run.") =>
  "See Spot Run. See Nix Run."

(string-titlecase "3com makes routers.") =>
  "3Com Makes Routers."

start インデックスを指定した場合は、 s[start] より前の文字は s[start] の変換には影響しないことに注意せよ。

(string-titlecase "greasy fried chicken" 2) => "Easy Fried Chicken"

タイトル文字と大小文字は Unicode の仕様と互換性がなければならない。

string-upcase s [start end] -> string

string-upcase! s [start end] -> unspecified

string-downcase s [start end] -> string

string-downcase! s [start end] -> unspecified

文字列中の各アルファベット文字を大文字または小文字に変換する。

string-upcase と string-downcase は結果の文字列を返すが、引数の文字列 s は破壊しない。 string-upcase! と string-downcase! は引数を破壊するバージョンである。

これらの手続きは、 Unicode の UnicodeData.txt テーブルで定義されている、ロケール非依存、コンテキスト非依存の１対１の変換を使用する。

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

反転と追加

string-reverse s [start end] -> string

string-reverse! s [start end] -> unspecified

文字列を反転する。

string-reverse は反転した文字列を返すが、引数として渡された文字列 s は変更しない。 string-reverse! は引数を破壊的に変更する可能性がある。

(string-reverse "Able was I ere I saw elba.")
    => ".able was I ere I saw elbA"

;;; In-place rotate-left, the Bell Labs way:
(lambda (s i)
  (let ((i (modulo i (string-length s))))
    (string-reverse! s 0 i)
    (string-reverse! s i)
    (string-reverse! s)))

Unicode のためのノート: 文字列を反転するというのは、単にコードポイントのシーケンスを反転するということである。そのため、文字列 s 内のベース文字 b の後に配置されたゼロ幅のアクセント文字 a は、反転すると b の前に配置されることになる。

string-append s₁ ... -> string

[R5RS] 指定された文字列を連結した、新しい文字列を割り当てて返す。

string-concatenate string-list -> string

文字列リスト string-list の要素を連結して、新しく割り当てられた１つの文字列を返す。

この手続きと同じ処理をするために (apply string-append string-list) という式を使うこともできるが、 Scheme 処理系の中には、手続きの引数の個数に制限を課しているものもあるため、文字列リストが長い場合には、正常に動作しない可能性がある。

string-concatenate/shared string-list -> string

string-append/shared s₁ ... -> string

この２つの手続きは string-concatenate と string-append の変種であり、引数として渡された文字列とストレージを共有している文字列を返すことが許されている。特に、string-append/shared を１つの引数に対して適用した場合、その引数をそのまま返す可能性がある。一方、string-append は常に新しい文字列を割り当てて返すことが保証されている。

string-concatenate-reverse string-list [final-string end] -> string

string-concatenate-reverse/shared string-list [final-string end] -> string

オプション引数を指定しなければ、これらの関数はそれぞれ

(string-concatenate (reverse string-list))

と

(string-concatenate/shared (reverse string-list))

に同じである。

オプション引数 final-string を指定した場合、 list-reverse と string-concatenate を実行する前に、この文字列が string-list の先頭に追加される。

オプション引数 end を指定した場合、文字列 final-string のうちの最初の end 個の文字だけが文字列リストの先頭に追加されます。したがって、次の式に同じである。

(string-concatenate
  (reverse (cons (substring/shared final-string 0 end)
                 string-list)))

例

(string-concatenate-reverse '(" must be" "Hello, I") " going.XXXX" 7)
  => "Hello, I must be going."

文字データを文字列バッファとして使うリストに蓄積し、最後にそのバッファを連結して１つの文字列を構築するようば処理を行う場合に、この手続きは便利である。

Unicode のためのノート: 文字列を反転するというのは、単にコードポイントのシーケンスを反転するということである。そのため、文字列 s 内のベース文字 bc の後に配置されたゼロ幅のアクセント文字 ac は、反転すると bc の前に配置されることになる。

畳み込み、逆畳み込み、マップ

string-map proc s [start end] -> string

string-map! proc s [start end] -> unspecified

proc は char->char の手続きであり、この手続きが文字列 s の各文字を変換する。

string-map は結果の文字列を返すが、引数の文字列 s を破壊しない。 string-map! は引数を破壊するバージョンである。

ノート: 手続き proc を文字列 s の各文字に適用する順序は不定である。

string-fold kons knil s [start end] -> value

string-fold-right kons knil s [start end] -> value

これらは文字列に対する基本的なイテレータである。

左畳み込み演算子 (string-fold) は、kons 手続きを文字列の左から右に適用して、各文字を変換する。

(... (kons s[2] (kons s[1] (kons s[0] knil))))

言い換えると、string-fold は次の (末尾) 再帰に従う。

(string-fold kons knil s start end) =
    (string-fold kons (kons s[start] knil) start+1 end)

右畳み込み演算子 (string-fold-right) は、kons 手続きを文字列の右から左に適用して、各文字を変換する。

(kons s[0] (... (kons s[end-3] (kons s[end-2] (kons s[end-1] knil)))))

この場合は次の (末尾) 再帰に従う。

(string-fold-right kons knil s start end) =
    (string-fold-right kons (kons s[end-1] knil) start end-1)

例:

;;; 文字列を文字のリストに変換する。
(string-fold-right cons '() s)

;;; 文字列中の小文字を数える。
(string-fold (lambda (c count)
               (if (char-lower-case? c)
                   (+ count 1)
                   count))
             0
             s)

;;; 文字列 s の中のバックスラッシュ文字を二重化する。
(let* ((ans-len (string-fold (lambda (c sum)
                               (+ sum (if (char=? c #\\) 2 1)))
                             0 s))
       (ans (make-string ans-len)))
  (string-fold (lambda (c i)
                 (let ((i (if (char=? c #\\)
                              (begin (string-set! ans i #\\) (+ i 1))
                              i)))
                   (string-set! ans i c)
                   (+ i 1)))
               0 s)
  ans)

右畳み込みコンビネータは、catamorphism と呼ばれることがある。

string-unfold p f g seed [base make-final] -> string

これは文字列の基本的なコンストラクタである。

手続き g は「シード」(seed) の初期値に繰り返し適用することで、シード列を生成するために使われる：
seed, (g seed), (g² seed), (g³ seed), ...
手続き p は停止条件を与える。この手続きをシード値に適用したときに真を返すなら、そこで文字列の生成が停止する。
手続き f はシード値を文字に変換する。結果として生成される文字列は、この文字を左から右に並べたものとなる。
オプション引数 base は、生成される文字列のプレフィックスを指定する。デフォルトは空文字列 "" である。
オプション手続き make-final は最後のシード値 (p が真を返すシード値) を受け取り、結果として生成される文字列のサフィックスを返す。デフォルトは (lambda (x) "") である。

もっと正確に言うと、以下の (シンプルだが効率的ではない) 定義になる。

;;; 繰り返しによる定義
(define (string-unfold p f g seed base make-final)
  (let lp ((seed seed) (ans base))
    (if (p seed)
        (string-append ans (make-final seed))
        (lp (g seed) (string-append ans (string (f seed)))))))

;;; 再帰による定義
(define (string-unfold p f g seed base make-final)
  (string-append base
                 (let recur ((seed seed))
                   (if (p seed) (make-final seed)
                       (string-append (string (f seed))
                                      (recur (g seed)))))))

string-unfold は非常に強力な文字列コンストラクタである。これを使えば、リストを文字列に変換する、ポートから読み取って文字列を構築する、文字列を反転する、文字列をコピーする、などの操作が実現できる。たとえば：

(port->string p) = (string-unfold eof-object? values
                                  (lambda (x) (read-char p))
                                  (read-char p))

(list->string lis) = (string-unfold null? car cdr lis)

(string-tabulate f size) = (string-unfold (lambda (i) (= i size)) f add1 0)

リスト lis の要素に手続き f を適用することで文字列を生成するには：

(string-unfold null? (compose f car) cdr lis)

好奇心のある関数プログラマは、 string-fold-right と string-unfold はある意味で逆変換の関係にあると主張するかも知れない。すなわち、手続き knull?, kar, kdr, kons, knil が

(kons (kar x) (kdr x)) = x  and (knull? knil) = #t

を満たすなら、

(string-fold-right kons knil (string-unfold knull? kar kdr x)) = x

と

(string-unfold knull? kar kdr (string-fold-right kons knil s)) = s.

が成り立つ。

最終的に生成される文字列は、文字列 base や手続き make-final が返す文字列とは、ストレージを共有しない。

このコンビネータは anamorphism と呼ばれることがある。

ノート: この手続きの実装では、巨大な (たとえば メガバイト単位の) 文字列に string-unfold を適用した場合に、実行時のスタック制限をオーバーフローしないように注意すべきである。

string-unfold-right p f g seed [base make-final] -> string

This is a fundamental constructor for strings.

G is used to generate a series of "seed" values from the initial seed: seed, (g seed), (g² seed), (g³ seed), ...
P tells us when to stop -- when it returns true when applied to one of these seed values.
F maps each seed value to the corresponding character in the result string. These chars are assembled into the string in a right-to-left order.
Base is the optional initial/rightmost portion of the constructed string; it defaults to the empty string "".
Make-final is applied to the terminal seed value (on which P returns true) to produce the final/leftmost portion of the constructed string. It defaults to (lambda (x) "").

More precisely, the following (simple, inefficient) definitions hold:

;;; Iterative
(define (string-unfold-right p f g seed base make-final)
  (let lp ((seed seed) (ans base))
    (if (p seed)
        (string-append (make-final seed) ans)
        (lp (g seed) (string-append (string (f seed)) ans)))))

;;; Recursive
(define (string-unfold-right p f g seed base make-final)
  (string-append (let recur ((seed seed))
                   (if (p seed) (make-final seed)
                       (string-append (recur (g seed))
                                      (string (f seed)))))
                 base))

Interested functional programmers may enjoy noting that string-fold and string-unfold-right are in some sense inverses. That is, given operations knull?, kar, kdr, kons, and knil satisfying

(kons (kar x) (kdr x)) = x and (knull? knil) = #t

then

(string-fold kons knil (string-unfold-right knull? kar kdr x)) = x

and

(string-unfold-right knull? kar kdr (string-fold kons knil s)) = s.

The final string constructed does not share storage with either base or the value produced by make-final.

Note: implementations should take care that runtime stack limits do not cause overflow when constructing large (e.g., megabyte) strings with string-unfold-right.

string-for-each proc s [start end] -> unspecified

Apply proc to each character in s. string-for-each is required to iterate from start to end in increasing order.

string-for-each-index proc s [start end] -> unspecified

Apply proc to each index of s, in order. The optional start/end pairs restrict the endpoints of the loop. This is simply a method of looping over a string that is guaranteed to be safe and correct. Example:

(let* ((len (string-length s))
       (ans (make-string len)))
  (string-for-each-index
      (lambda (i) (string-set! ans (- len i) (string-ref s i)))
      s)
  ans)

置換とローテート

xsubstring s from [to start end] -> string

これは「拡張部分文字列」手続きであり、文字列が仮想的に反復された文字列から、その部分文字列を切り出す処理をする。

s は文字列である。start と end は文字列 s の部分文字列を指定するためのオプション引数であり、デフォルトは 0 と s の長さである (つまり文字列全体が使われる)。この部分文字列が、インデックスが正と負の方向に仮想的に反復されていると見なされる。たとえば、s = "abcdefg", start=3, end=6, の場合、次のような双方向に無限に連なる文字列を考える。

...

-9

-8

-7

-6

-5

-4

-3

-2

-1

...

xsubstring は、この無限長の文字列のうち、インデックス from から to (デフォルトは from+(end-start)) の範囲の部分文字列を返す。

この xsubstring 手続きは、いろいろな処理に使うことができる：

文字列を左にローテートする： (xsubstring "abcdef" 2) => "cdefab"
文字列を右にローテートする： (xsubstring "abcdef" -2) => "efabcd"
文字列を反復する： (xsubstring "abc" 0 7) => "abcabca"

次のことに注意せよ。

インデックス from/to は半開区間を指定する。これはインデックス from から to の範囲を指定するが、インデックス to は含まれない。
インデックス from/to は文字列 s のインデックスを指定するのではなく、 start, end で制限された s の部分文字列が無限に反復された文字列におけるインデックスを指定する。

start=end の場合はエラーである。ただし、from=to であるならエラーにはならない。

string-xcopy! target tstart s sfrom [sto start end] -> unspecified

xsubstring と同様だが、抽出された文字列は、文字列 target のインデックス tstart から始まる位置に書き込まれる。 (eq? target s) である場合、あるいは、この２つの文字列がストレージを共有する場合は、その結果は未定義である。文字列を自分自身にコピーすることはできないからである。

その他: 挿入、解析

string-replace s1 s2 start1 end1 [start2 end2] -> string

Returns

(string-append (substring/shared s1 0 start1)
               (substring/shared s2 start2 end2)
               (substring/shared s1 end1 (string-length s1)))

That is, the segment of characters in s1 from start1 to end1 is replaced by the segment of characters in s2 from start2 to end2. If start1=end1, this simply splices the s2 characters into s1 at the specified index.

Examples:

(string-replace "The TCL programmer endured daily ridicule."
                "another miserable perl drone" 4 7 8 22 ) =>
    "The miserable perl programmer endured daily ridicule."

(string-replace "It's easy to code it up in Scheme." "lots of fun" 5 9) =>
    "It's lots of fun to code it up in Scheme."

(define (string-insert s i t) (string-replace s t i i))

(string-insert "It's easy to code it up in Scheme." 5 "really ") =>
    "It's really easy to code it up in Scheme."

string-tokenize s [token-set start end] -> list

Split the string s into a list of substrings, where each substring is a maximal non-empty contiguous sequence of characters from the character set token-set.

token-set defaults to char-set:graphic (see SRFI 14 for more on character sets and char-set:graphic).
If start or end indices are provided, they restrict string-tokenize to operating on the indicated substring of s.

This function provides a minimal parsing facility for simple applications. More sophisticated parsers that handle quoting and backslash effects can easily be constructed using regular-expression systems; be careful not to use string-tokenize in contexts where more serious parsing is needed.

(string-tokenize "Help make programs run, run, RUN!") =>
  ("Help" "make" "programs" "run," "run," "RUN!")

フィルタと削除

string-filter char/char-set/pred s [start end] -> string
string-delete har/char-set/pred s [start end] -> string: 文字列 s をフィルタする。引数 char/char-set/pred を満足する (または満足しない) 文字だけを残して返す。この引数が手続きである場合、述語として適用される。この引数が文字セットである場合、そこで指定された文字であるか判定される。この引数が文字である場合、その文字に等しいかどうか判定される。
その文字列がフィルタ (または削除) 操作により変更されない場合、この関数は s をそのまま返してもよいし、 s のコピーを返してもよい。

低レベルの手続き

以下の手続きは、他の文字列処理関数を実装するときに役に立つ。モジュールシステムやパッケージシステムをもつ Scheme システムでは、これらの手続きは "string-lib-internals" という名前のモジュールに含まれるべきである。

オプション引数 start/end の解析と検証を行うユーティリティ

string-parse-start+end proc s args -> [rest start end]

string-parse-final-start+end proc s args -> [start end]

string-parse-start+end は引数リストからオプション引数 start/end を解析するために使用し、それらのデフォルト値を 0 と文字列 s の長さとする。文字列 s の長さを slen とすると、以下の処理を行う。

もし args = () ならば、この関数は (values '() 0 slen) を返す。
もし args = (i) ならば、 i が正確数の整数であり、 0 <= i <= slen であるかを検査する。 (values (cdr args) i slen) を返す。
もし args = (i j ...) ならば、 i と j が正確数の整数であり、 0 <= i <= j <= slen であるか検査する。 (values (cddr args) i j) を返す。

もし、上記の検査が失敗した場合、エラー状態が発生し、 proc はエラー状態の一部として使用される。 -- it should be the client procedure whose argument list string-parse-start+end is parsing.

string-parse-final-start+end の動作も同様だが、引数リスト (args) の長さは 2 以下でなければならない。それより長い引数リストを渡すと、エラー状態が発生する。この手続きは、オプションの start/end パラメータが手続きの最後の引数である場合に使用されるだろう。

Note that in all cases, these functions ensure that s is a string (by necessity, since all cases apply string-length to s either to default end or to bounds-check it).

let-string-start+end (start end [rest]) proc-exp s-exp args-exp body ... -> value(s)

[構文] string-parse-start+end や string-parse-final-start+end を適用するための構文シュガーである。

rest 変数が指定された場合、このフォームは次の式に等しくなる。

(call-with-values
    (lambda () (string-parse-start+end proc-exp s-exp args-exp))
  (lambda (rest start end) body ...))

rest 変数が指定されない場合、このフォームは次の式に等しくなる。

(call-with-values
    (lambda () (string-parse-final-start+end proc-exp s-exp args-exp))
  (lambda (start end) body ...))

check-substring-spec proc s start end -> unspecified

substring-spec-ok? s start end -> boolean

引数 s, start, end の値を検査し、それらが正しい部分文字列を示しているか確認する。つまり、s が文字列であり、 start と end が正確数であり、かつ、 0 <= start <= end <= (string-length s) を満足するか検査する。

これらの値が適切でない場合、

check-substring-spec はエラー状態を発生する。 proc はそのエラー状態の一部として使用され、その引数には検査対象の値を受け取る。
substring-spec-ok? は偽値を返す。

検査に成功した場合、substring-spec-ok? は真値を返し、 check-substring-spec は何もせずに戻る (戻り値は規定されません)。

Knuth-Morris-Pratt 検索

Knuth-Morris-Pratt 検索アルゴリズムは、テキストの中から何らかの文字列をすばやく検索するための方法である。この検索方法は、バックトラッキングを必要としないという利点がある。そのため、文字列検索だけでなく、入力ポートのようなバックトラッキングやランダムアクセスをサポートしない他の文字シーケンスの検索にも有用である。これらのルーチンは、汎用的に使えるように、アルゴリズムの初期化フェーズや検索フェーズを一つにまとめている。また、中間検索状態 (intermediate search state) をアプリケーション間で受け渡し可能なので、バッファリングされたチャンク上のテキスト検索もサポートしている。

KMP 検索の 2 つめの重要な性質は、パターンの長さに比例する補助的なメモリの割り当てを必要とするが、文字型のサイズに対しては一定のメモリしか必要としないということである。他の検索アルゴリズムでは、すべての可能な文字のエントリをもつテーブルを作成する必要があることが多いが、その場合 16 ビットや 32 ビットで表現される文字に対しては、非常に高価な処理となり、現実的ではない。

make-kmp-restart-vector s [c= start end] -> integer-vector

Knuth-Morris-Pratt の「再開始ベクタ」(restart vector) を作成する。これは、文字列 s (または start/end オプション引数を指定した場合は s の部分文字列) を高速に検索するために使われる。 c= は文字の等値性を判定する関数で、再開始ベクタを作成するために使われる。そのデフォルト値は char=? である。大小文字を区別しない検索であれば char-ci=? を使用する。

The definition of the restart vector rv for string s is: If we have matched chars 0..i-1 of s against some search string ss, and s[i] doesn't match ss[k], then reset i := rv[i], and try again to match ss[k]. If rv[i] = -1, then punt ss[k] completely, and move on to ss[k+1] and s[0].

In other words, if you have matched the first i chars of s, but the i+1'th char doesn't match, rv[i] tells you what the next-longest prefix of s is that you have matched.

The following string-search function shows how a restart vector is used to search. Note the attractive feature of the search process: it is "on line," that is, it never needs to back up and reconsider previously seen data. It simply consumes characters one-at-a-time until declaring a complete match or reaching the end of the sequence. Thus, it can be easily adapted to search other character sequences (such as ports) that do not provide random access to their contents.

(define (find-substring pattern source start end)
  (let ((plen (string-length pattern))
        (rv (make-kmp-restart-vector pattern)))

    ;; The search loop. SJ & PJ are redundant state.
    (let lp ((si start) (pi 0)
             (sj (- end start))     ; (- end si)  -- how many chars left.
             (pj plen))             ; (- plen pi) -- how many chars left.

      (if (= pi plen) (- si plen)                   ; Win.

          (and (<= pj sj)                           ; Lose.

               (if (char=? (string-ref source si)           ; Test.
                           (string-ref pattern pi))
                   (lp (+ 1 si) (+ 1 pi) (- sj 1) (- pj 1)) ; Advance.

                   (let ((pi (vector-ref rv pi)))           ; Retreat.
                     (if (= pi -1)
                         (lp (+ si 1)  0   (- sj 1)  plen)  ; Punt.
                         (lp si        pi  sj        (- plen pi))))))))))

The optional start/end parameters restrict the restart vector to the indicated substring of pat; rv is end - start elements long. If start > 0, then rv is offset by start elements from pat. That is, rv[i] describes pattern element pat[i + start]. Elements of rv are themselves indices that range just over [0, end-start), not [start, end).

Rationale: the actual value of rv is "position independent" -- it does not depend on where in the pat string the pattern occurs, but only on the actual characters comprising the pattern.

kmp-step pat rv c i c= p-start -> integer

This function encapsulates the work performed by one step of the KMP string search; it can be used to scan strings, input ports, or other on-line character sources for fixed strings.

Pat is the non-empty string specifying the text for which we are searching. Rv is the Knuth-Morris-Pratt restart vector for the pattern, as constructed by make-kmp-restart-vector. The pattern begins at pat[p-start], and is (string-length rv) characters long. C= is the character-equality function used to construct the restart vector, typically char=? or char-ci=?.

Suppose the pattern is N characters in length: pat[p-start, p-start + n). We have already matched i characters: pat[p-start, p-start + i). (P-start is typically zero.) C is the next character in the input stream. kmp-step returns the new i value -- that is, how much of the pattern we have matched, including character c. When i reaches n, the entire pattern has been matched.

Thus a typical search loop looks like this:

(let lp ((i 0))
  (or (= i n)                           ; Win -- #t
      (and (not (end-of-stream))        ; Lose -- #f
           (lp (kmp-step pat rv (get-next-character) i char=? 0)))))

Example:

;; Read chars from IPORT until we find string PAT or hit EOF.
(define (port-skip pat iport)
  (let* ((rv (make-kmp-restart-vector pat))
         (patlen (string-length pat)))
    (let lp ((i 0) (nchars 0))
      (if (= i patlen) nchars                    ; Win -- nchars skipped
          (let ((c (read-char iport)))
            (if (eof-object? c) c                ; Fail -- EOF
                (lp (kmp-step pat rv c i char=? 0) ; Continue
                    (+ nchars 1))))))))

This procedure could be defined as follows:

(define (kmp-step pat rv c i c= p-start)
  (let lp ((i i))
    (if (c= c (string-ref pat (+ i p-start)))     ; Match =>
        (+ i 1)                                   ;   Done.
        (let ((i (vector-ref rv i)))              ; Back up in PAT.
          (if (= i -1) 0                          ; Can't back up more.
              (lp i)))))))                        ; Keep going.

Rationale: this procedure takes no optional arguments because it is intended as an inner-loop primitive and we do not want any run-time penalty for optional-argument parsing and defaulting, nor do we wish barriers to procedure integration/inlining.

string-kmp-partial-search pat rv s i [c= p-start s-start s-end] -> integer

Applies kmp-step across s; optional s-start/s-end bounds parameters restrict search to a substring of s. The pattern is (vector-length rv) characters long; optional p-start index indicates non-zero start of pattern in pat.

Suppose plen = (vector-length rv) is the length of the pattern. I is an integer index into the pattern (that is, 0 <= i < plen) indicating how much of the pattern has already been matched. (This means the pattern must be non-empty -- plen > 0.)

On success, returns -j, where j is the index in s bounding the end of the pattern -- e.g., a value that could be used as the end parameter in a call to substring/shared.
On continue, returns the current search state i' (an index into rv) when the search reached the end of the string. This is a non-negative integer.

Hence:

A negative return value indicates success, and says where in the string the match occured.
A non-negative return value provides the i to use for continued search in a following string.

This utility is designed to allow searching for occurrences of a fixed string that might extend across multiple buffers of text. This is why, for example, we do not provide the index of the start of the match on success -- it may have occurred in a previous buffer.

To search a character sequence that arrives in "chunks," write a loop of this form:

(let lp ((i 0))
  (and (not (end-of-data?))             ; Lose -- return #f.
       (let* ((buf (get-next-chunk))    ; Get or fill up the buffer.
              (i (string-kmp-partial-search pat rv buf i)))
         (if (< i 0) (- i)              ; Win -- return end index.
             (lp i)))))                 ; Keep looking.

Modulo start/end optional-argument parsing, this procedure could be defined as follows:

(define (string-kmp-partial-search pat rv s i c= p-start s-start s-end)
  (let ((patlen (vector-length rv)))
    (let lp ((si s-start)       ; An index into S.
             (vi i))            ; An index into RV.
      (cond ((= vi patlen) (- si))      ; Win.
            ((= si end) vi)             ; Ran off the end.
            (else (lp (+ si 1)          ; Match s[si] & loop.
                      (kmp-step pat rv (string-ref s si)
                                vi c= p-start)))))))

参照実装

この SRFI には参照実装が付属しており、以下のサイトからダウンロードできる。

http://srfi.schemers.org/srfi-13/srfi-13.scm

私はこのソースコードをネット上に置き、簡単で「オープンな」著作権表示とした。このソースコードのプレフィックス／サフィックスおよび比較のルーチンは、 (非常に遠い昔の) MIT Scheme の文字列ライブラリを元にしており、実質的に私によって書き直されている。 MIT Scheme の著作権表示にしたがい、このソースコードを一般的な BSD スタイルのオープンソースの著作権としている。詳細はソースコードを参照せよ。

KMP 文字列検索コードは Stephen Bevan, Brian Denheyer, Will Fitzgerald により書かれた実装の影響を受けている。しかし、このバージョンは、私が一から書いたものである。

このコードの残りの部分は、scsh のため、もしくは、この SRFI のために私が書いた。私はこのコードを scsh の著作権表示に従う形で提供するが、 scsh もまた一般的な BSD スタイルのオープンソース著作権である。

このコードはポータビリティを重視して書かれているので、どんな Scheme 処理系にもすぐに移植できるだろう。このソースコードのコメントには、 R5RS 以外に依存する部分について、詳細な注意書きが書かれている。

このライブラリは明快に書かれており、コメントもよく書かれている。現状のソースコードはだいたい 1000 行のコード行と 1000 行のコメントおよび空白で書かれている。このコードは、効率も重視している。可能性が高いケースに対しては、高速な実行パスを通るように実装している。しかし、特定の Scheme 処理系に特化した最適化ができないと言っているわけではない。参照実装を最適化して性能を高めるための方法を、コメントに書いておいた。

要するに、私はライブラリ実装者にできるだけ苦痛を与えないように参照実装を書いたのである。そうすれば、このライブラリを採用してもらえるし、このライブラリを使用することで満足のいく結果を得ることができるだろうからだ。

謝辞

The design of this library benefited greatly from the feedback provided during the SRFI discussion phase. Among those contributing thoughtful commentary and suggestions, both on the mailing list and by private discussion, were Paolo Amoroso, Lars Arvestad, Alan Bawden, Jim Bender, Dan Bornstein, Per Bothner, Will Clinger, Brian Denheyer, Mikael Djurfeldt, Kent Dybvig, Sergei Egorov, Marc Feeley, Matthias Felleisen, Will Fitzgerald, Matthew Flatt, Arthur A. Gleckler, Ben Goetter, Sven Hartrumpf, Erik Hilsdale, Richard Kelsey, Oleg Kiselyov, Bengt Kleberg, Donovan Kolbly, Bruce Korb, Shriram Krishnamurthi, Bruce Lewis, Tom Lord, Brad Lucier, Dave Mason, David Rush, Klaus Schilling, Jonathan Sobel, Mike Sperber, Mikael Staldal, Vladimir Tsyshevsky, Donald Welsh, and Mike Wilson. I am grateful to them for their assistance.

I am also grateful the authors, implementors and documentors of all the systems mentioned in the introduction. Aubrey Jaffer and Kent Pitman should be noted for their work in producing Web-accessible versions of the R5RS and Common Lisp spec, which was a tremendous aid.

This is not to imply that these individuals necessarily endorse the final results, of course.

During this document's long development period, great patience was exhibited by Mike Sperber, who is the editor for the SRFI, and by Hillary Sullivan, who is not.

参考文献とリンク

[Case-map]

Case mappings.
Unicode Technical Report 21.
http://www.unicode.org/unicode/reports/tr21/

[CommonLisp]

Common Lisp: the Language.
Guy L. Steele Jr. (editor).
Digital Press, Maynard, Mass., second edition 1990.
Available at http://www.elwood.com/alu/table/references.htm#cltl2.

The Common Lisp "HyperSpec," produced by Kent Pitman, is essentially the ANSI spec for Common Lisp: http://www.harlequin.com/education/books/HyperSpec/.

[Java]

The following URLs provide documentation on relevant Java classes.
http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html
http://java.sun.com/products/jdk/1.2/docs/api/java/lang/String.html
http://java.sun.com/products/jdk/1.2/docs/api/java/lang/StringBuffer.html
http://java.sun.com/products/jdk/1.2/docs/api/java/text/Collator.html
http://java.sun.com/products/jdk/1.2/docs/api/java/text/package-summary.html

[MIT-Scheme]

http://www.swiss.ai.mit.edu/projects/scheme/

[R5RS]

Revised⁵ report on the algorithmic language Scheme.
R. Kelsey, W. Clinger, J. Rees (editors).
Higher-Order and Symbolic Computation, Vol. 11, No. 1, September, 1998.
and ACM SIGPLAN Notices, Vol. 33, No. 9, October, 1998.
Available at http://www.schemers.org/Documents/Standards/.

[SRFI]

The SRFI web site.
http://srfi.schemers.org/

[SRFI-13]

SRFI-13: String libraries.
http://srfi.schemers.org/srfi-13/

This document, in HTML:: http://srfi.schemers.org/srfi-13/srfi-13.html
This document, in plain text format:: http://srfi.schemers.org/srfi-13/srfi-13.txt
Source code for the reference implementation:: http://srfi.schemers.org/srfi-13/srfi-13.scm
Scheme 48 module specification, with typings:: http://srfi.schemers.org/srfi-13/srfi-13-s48-module.scm

[SRFI-14]

SRFI-14: Character-set library.
http://srfi.schemers.org/srfi-14/
The SRFI 14 char-set library defines a character-set data type, which is used by some procedures in this library.

[Unicode]

http://www.unicode.org/

[UnicodeData]

The Unicode character database.
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html

著作権

Certain portions of this document -- the specific, marked segments of text describing the R5RS procedures -- were adapted with permission from the R5RS report.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

表題

著者

状態

目次