User:加藤勝憲/Berkeley RISC

Berkeley RISC is one of two seminal research projects into reduced instruction set computer (RISC) based microprocessor design that took place under the Defense Advanced Research Projects Agency VLSI Project. RISC was led by David Patterson (who coined the term RISC) at the University of California, Berkeley between 1980 and 1984.^[1] The other project took place a short distance away at Stanford University under their MIPS effort, starting in 1981 and running until 1984.

Berkeley's project was so successful that it became the name for all similar designs to follow; even the MIPS would become known as a "RISC processor". The Berkeley RISC design was later commercialized by Sun Microsystems as the SPARC architecture, and inspired the ARM architecture.^[2]

The RISC concept

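Both RISC and MIPS were developed from the realization that the vast majority of programs used only a small fraction of a processor's available instruction set. In a famous 1978 paper, Andrew Tanenbaum demonstrated that a complex 10,000-line high-level program could be represented using a simplified instruction set architecture with an 8-bit fixed-length opcode.^[1] This was roughly the same conclusion reached at IBM, whose studies of its own code running on mainframes like the IBM 360 found that only a small subset of all the available instructions was used. Both studies suggested that a much simpler CPU could still run most real-world code. Another finding, not fully explored at the time, was Tanenbaum's observation that 81% of the constants were either 0, 1, or 2.^[1]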

These realizations were taking place as the microprocessor market was moving from 8-bit to 16-bit, with 32-bit designs about to appear. Those designs were premised on the goal of replicating some of the more well-respected existing ISAs from the mainframe and minicomputer world. For instance, the National Semiconductor NS32000 started out as an effort to produce a single-chip implementation of the VAX-11, which had a rich instruction set with a wide variety of addressing modes. The Motorola 68000 was similar in general layout. To provide this rich set of instructions, CPUs used microcode to decode each user-visible instruction into a series of internal operations. This microcode represented perhaps one quarter to one third of the transistors of the overall design.

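If, as these papers suggested, most of these opcodes would never be used in practice, then this significant resource was being wasted. Conversely, if those transistors were used to improve performance instead of decoding instructions that would never be used, a faster processor was possible. The RISC concept was to take advantage of both observations, producing a CPU with the same level of complexity as the 68000 but much faster.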

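To do this, RISC concentrated on adding many more registers, small pieces of memory that hold temporary values and can be accessed very rapidly. This contrasts with normal main memory, which might take several cycles to access. By providing more registers, and making sure the compilers actually used them, programs should run much faster. Additionally, the speed of the processor would be more closely defined by its clock speed, because less of its time would be spent waiting for memory accesses. Transistor for transistor, a RISC design would outperform a conventional CPU.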

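On the downside, the instructions being removed were generally performing several "sub-instructions". For instance, the ADD instruction of a traditional design would generally come in several flavors: one that added the numbers in two registers and placed the result in a third, another that added numbers found in main memory and put the result in a register, and so on. The RISC designs, on the other hand, included only a single flavor of any particular instruction; ADD, for instance, would always use registers for all operands. This forced the programmer to write additional instructions to load the values from memory when needed, making a RISC program "less dense".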

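In the era of expensive memory this was a real concern. Since a RISC design's ADD would actually require four instructions (two loads, an add, and a save), the machine would have to perform many more memory accesses to read the extra instructions, potentially slowing it down considerably. This was offset to some degree by the fact that the new designs used what was then a very large instruction word of 32 bits, allowing small constants to be folded directly into the instruction instead of having to be loaded separately. Additionally, the result of one operation is often used soon after by another, so by skipping the write to memory and keeping the result in a register, the program did not end up much larger and could in theory run much faster. For instance, a string of instructions carrying out a series of mathematical operations might require only a few loads from memory, while most of the numbers being used would be either constants in the instructions or intermediate values left in registers by prior calculations. In a sense, this technique uses some registers to shadow memory locations: the registers act as proxies for those locations until the final values have been determined after a group of instructions.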
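The trade-off described above can be made concrete with a toy count. The sketch below is a minimal illustration, not a model of any real machine: the function names are hypothetical, and it assumes a naive memory-to-memory style in which every ADD reads two operands from memory and writes its result back, compared with a load/store style that loads each operand once and keeps intermediates in registers.

```python
# Toy comparison for summing n values that start out in main memory.
# All names are hypothetical; instruction fetches are not counted as accesses.

def memory_to_memory_instructions(n_operands: int) -> int:
    """One ADD per pair; each ADD works directly on memory operands."""
    return n_operands - 1


def memory_to_memory_accesses(n_operands: int) -> int:
    """Each ADD performs two memory reads and one memory write."""
    return (n_operands - 1) * 3


def load_store_instructions(n_operands: int) -> int:
    """One load per operand, one register ADD per pair, one final store."""
    loads = n_operands
    adds = n_operands - 1
    stores = 1
    return loads + adds + stores


def load_store_accesses(n_operands: int) -> int:
    """Only the loads and the final store touch main memory."""
    return n_operands + 1


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(
            n,
            memory_to_memory_instructions(n),
            memory_to_memory_accesses(n),
            load_store_instructions(n),
            load_store_accesses(n),
        )
```

Under these assumptions a single two-operand add costs four load/store instructions, matching the count in the text, while a chain of four operands needs 8 instructions instead of 3 but only 5 data accesses instead of 9. The program is less dense, yet it touches data memory less often.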

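To the casual observer, it was not clear that the RISC concept would improve performance, and it might even make it worse. The only way to be sure was to simulate it. In test after test, every simulation showed an enormous overall performance benefit from this design.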

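Where the two projects, RISC and MIPS, differed was in the handling of the registers. MIPS simply added lots of registers and left it to the compilers (or assembly-language programmers) to make use of them. RISC, on the other hand, added circuitry to the CPU to assist the compiler. RISC used the concept of register windows, in which the entire "register file" was broken down into blocks, allowing the compiler to "see" one block for global variables and another for local variables.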

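The idea was to make one particularly common instruction, the procedure call, extremely easy to implement. Almost all programming languages use a system known as an activation record or stack frame for each procedure, containing the address from which the procedure was called, the data (parameters) that were passed in, and space for any result values that need to be returned. In the vast majority of cases these frames are small, with three or fewer inputs and one or no outputs (and sometimes an input is reused as an output). In the Berkeley design, then, a register window was a set of several registers, enough of them that the entire procedure stack frame would most likely fit entirely within the register window.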

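In this case the call into and return from a procedure is simple and extremely fast. A single instruction is called to set up a new block of registers (a new register window), and then, with operands passed into the procedure on the "low end" of the new window, the program jumps into the procedure. On return, the results are placed in the window at the same end, and the procedure exits. The register windows are set up to overlap at the ends, so that the results from the call simply "appear" in the window of the caller, with no data having to be copied. Thus the common procedure call does not have to interact with main memory, which greatly accelerates it.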
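The overlap mechanism can be sketched in a few lines. This is an illustrative model only: it borrows the register counts quoted for RISC I in this article (18 globals, 6 windows of 14 registers, 4 of them overlapping), but the logical register numbering, the class and method names, and the wrap-around behavior are assumptions made for the example. Spilling windows to memory when calls nest too deeply is deliberately omitted.

```python
from dataclasses import dataclass, field
from typing import List

NUM_GLOBALS = 18                 # figures quoted for RISC I in this article
NUM_WINDOWS = 6
WINDOW_SIZE = 14
OVERLAP = 4
STRIDE = WINDOW_SIZE - OVERLAP   # registers unique to each window
NUM_WINDOWED = NUM_WINDOWS * STRIDE
NUM_PHYSICAL = NUM_GLOBALS + NUM_WINDOWED

# Assumed logical numbering: 0..17 are globals, 18..31 are the current window.
LOW_END = NUM_GLOBALS            # first register of the window (incoming side)
HIGH_END = NUM_GLOBALS + STRIDE  # first overlapping register (outgoing side)


@dataclass
class RegisterFile:
    regs: List[int] = field(default_factory=lambda: [0] * NUM_PHYSICAL)
    window: int = 0              # current window pointer

    def _physical(self, r: int) -> int:
        """Map a logical register number to a physical register index."""
        if not 0 <= r < NUM_GLOBALS + WINDOW_SIZE:
            raise IndexError(f"no such logical register: {r}")
        if r < NUM_GLOBALS:
            return r             # globals are shared by every window
        offset = (self.window * STRIDE + (r - NUM_GLOBALS)) % NUM_WINDOWED
        return NUM_GLOBALS + offset

    def read(self, r: int) -> int:
        return self.regs[self._physical(r)]

    def write(self, r: int, value: int) -> None:
        self.regs[self._physical(r)] = value

    def call(self) -> None:
        """Enter a procedure: slide the window; no data is copied."""
        self.window = (self.window + 1) % NUM_WINDOWS

    def ret(self) -> None:
        """Leave a procedure: slide the window back."""
        self.window = (self.window - 1) % NUM_WINDOWS


if __name__ == "__main__":
    rf = RegisterFile()
    rf.write(HIGH_END, 7)                        # caller places an argument
    rf.call()
    rf.write(LOW_END, rf.read(LOW_END) * 2)      # callee sees it at its low end
    rf.ret()
    print(rf.read(HIGH_END))                     # caller finds the result
```

In this layout the caller's four highest window registers are the same physical registers as the callee's four lowest, so arguments and results pass between procedures purely by moving the window pointer. With these constants the model contains 78 physical registers, the same total given for RISC I.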

RISC I

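The first attempt to implement the RISC concept was originally named Gold. Work on the design started in 1980 as part of a VLSI design course, but the then-complicated design crashed almost all existing design tools. The team had to spend a considerable amount of time improving or rewriting the tools, and even with these new tools it took just under an hour to extract the design on a VAX-11/780.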

The final design, named RISC I, was published at the Association for Computing Machinery (ACM) International Symposium on Computer Architecture (ISCA) in 1981. It had 44,500 transistors implementing 31 instructions, and a register file containing 78 32-bit registers. This allowed for six register windows of 14 registers each. Of those 14 registers, 4 were overlapped from the prior window. The total is therefore 10 × 6 registers in windows + 18 globals = 78 registers. The control and instruction decode section occupied only 6% of the die, whereas the typical design of the era used about 50% for the same role. The register file took up most of that space.

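RISC I also featured a two-stage instruction pipeline for additional speed, but without the complex instruction reordering of more modern designs. This made conditional branches a problem, because the compiler had to fill the instruction following a conditional branch (the so-called branch delay slot) with something "safe" (that is, not dependent on the outcome of the conditional). Sometimes the only suitable instruction in this case is NOP. A notable number of later RISC-style designs still require consideration of branch delay.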
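A compiler's job of filling the branch delay slot can be sketched as follows. This is a deliberately simplified illustration, not the Berkeley compiler's algorithm: the `Instr` record and `fill_delay_slots` function are hypothetical, the pass only considers the single instruction immediately before each branch, it assumes branch conditions are read from the named registers in `srcs`, and it assumes no branch targets the instruction being moved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str] = None       # register written, if any
    srcs: Tuple[str, ...] = ()       # registers read
    is_branch: bool = False


NOP = Instr("nop")


def fill_delay_slots(program: List[Instr]) -> List[Instr]:
    """Place one instruction after every branch: a safe one if possible, else NOP."""
    out: List[Instr] = []
    movable = False  # True when out[-1] is an ordinary instruction we may move
    for ins in program:
        if not ins.is_branch:
            out.append(ins)
            movable = True
            continue
        if movable and out[-1].dest not in ins.srcs:
            # The branch does not read this instruction's result, and it would
            # have executed on both paths anyway, so it can run in the slot.
            filler = out.pop()
        else:
            filler = NOP
        out.append(ins)
        out.append(filler)
        movable = False  # the slot instruction now belongs to this branch
    return out


if __name__ == "__main__":
    program = [
        Instr("add", "r1", ("r2", "r3")),
        Instr("sub", "r4", ("r5", "r6")),
        Instr("beq", None, ("r1", "r0"), is_branch=True),
    ]
    for ins in fill_delay_slots(program):
        print(ins.op)
```

In the example the `sub` does not feed the branch condition, so it is moved into the slot; if the instruction before the branch computes a register the branch tests, the pass falls back to NOP, which is the case the text describes.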

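After a period of validation and debugging, the design was sent to the innovative MOSIS service on June 22, 1981 for production using a 2 μm (2,000 nm) process. A variety of delays forced the team to abandon their masks four separate times, and wafers with working examples did not arrive back at Berkeley until May 1982. The first working RISC I "computer" (actually a checkout board) ran on June 11. In testing, the chips proved to have lower performance than expected. In general, an instruction took 2 μs to complete, while the original design allotted about 0.4 μs (five times as fast). The precise reasons for this problem were never fully explained. However, throughout testing it was clear that certain instructions did run at the expected speed, suggesting the problem was physical rather than logical.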

Had the design worked at full speed, performance would have been excellent. Simulations using a variety of small programs, comparing the 4 MHz RISC I to the 5 MHz 32-bit VAX 11/780 and the 5 MHz 16-bit Zilog Z8000, showed this clearly. Program size was about 30% larger than on the VAX but very close to that of the Z8000, validating the argument that the higher code density of CISC designs was not actually all that impressive in reality. In terms of overall performance, the RISC I was twice as fast as the VAX, and about four times as fast as the Z8000. The programs ended up performing about the same overall number of memory accesses because the large register file dramatically improved the odds that the needed operand was already on-chip.

It is important to put this performance in context. Even though the RISC design had run slower than the VAX, it made no difference to the importance of the design. RISC allowed for the production of a true 32-bit processor on a real chip die using what was already an older fab. Traditional designs simply could not do this; with so much of the chip surface dedicated to decoder logic, a true 32-bit design like the Motorola 68020 required newer fabs before becoming practical. Using the same fabs, RISC I could have largely outperformed the competition.

On February 12, 2015, IEEE installed a plaque at UC Berkeley to commemorate the contribution of RISC-I.^[2] The plaque reads:

UC Berkeley students designed and built the first VLSI reduced instruction set computer in 1981. The simplified instructions of RISC-I reduced the hardware for instruction decode and control, which enabled a flat 32-bit address space, a large set of registers, and pipelined execution. A good match to C programs and the Unix operating system, RISC-I influenced instruction sets widely used today, including those for game consoles, smartphones and tablets.

RISC II

While the RISC I design ran into delays, work at Berkeley had already turned to the new Blue design. Work on Blue progressed more slowly than on Gold, due both to the lack of a pressing need now that Gold was going to fab, and to changeovers in the classes and students staffing the effort. This pace also allowed the team to add several new features that would end up improving the design considerably.

The key difference was simpler cache circuitry that eliminated one line per bit (from three to two), dramatically shrinking the register file size. The change also required much tighter bus timing, but this was a small price to pay, and several other parts of the design were sped up as well in order to meet it.

The savings due to the new design were tremendous. Whereas Gold contained a total of 78 registers in 6 windows, Blue contained 138 registers broken into 8 windows of 16 registers each, with another 10 globals. This expansion of the register file increases the chance that a given procedure can fit all of its local storage in registers, and increases the nesting depth. Nevertheless, the larger register file required fewer transistors, and the final Blue design, fabbed as RISC II, implemented all of the RISC instruction set with only 40,760 transistors.^[3]

The other major change was to include an instruction-format expander, which invisibly "up-converted" 16-bit instructions into a 32-bit format.^[citation needed] This allowed smaller instructions, typically things with one or no operands, like NOP, to be stored in memory in a smaller 16-bit format, and for two such instructions to be packed into a single machine word. The instructions would be invisibly expanded back to 32-bit versions before they reached the arithmetic logic unit (ALU), meaning that no changes were needed in the core logic. This simple technique yielded a surprising 30% improvement in code density, making an otherwise identical program on Blue run faster than on Gold due to the decreased number of memory accesses.

RISC II proved to be much more successful in silicon and in testing outperformed almost all minicomputers on almost all tasks. For instance, performance ranged from 85% of VAX speed to 256% on a variety of loads. RISC II was also benched against the famous Motorola 68000, then considered to be the best commercial chip implementation, and outperformed it by 140% to 420%.

Follow-ons

Work on the original RISC designs ended with RISC II, but the concept lived on at Berkeley. The basic core was re-used in SOAR in 1984, basically a RISC converted to run Smalltalk (in the same way that it could be claimed RISC ran C), and later in the similar VLSI-BAM that ran Prolog instead of Smalltalk. Another effort was SPUR, which was a full set of chips needed to build a full 32-bit workstation.

RISC is less famous, but more influential, for being the basis of the commercial SPARC processor design from Sun Microsystems. It was the SPARC that first clearly demonstrated the power of the RISC concept; when SPARC processors shipped in the first Sun-4s they outperformed anything on the market. This led to virtually every Unix vendor hurrying for a RISC design of their own, leading to designs like the DEC Alpha and PA-RISC, while Silicon Graphics (SGI) purchased MIPS Computer Systems. By 1986, most large chip vendors followed, working on efforts like the Motorola 88000, Fairchild Clipper, AMD 29000 and the PowerPC. On February 13, 2015, IEEE installed a plaque at Oracle Corporation in Santa Clara.^[4] It reads:

Sun Microsystems introduced the Scalable Processor Architecture (SPARC) RISC in 1987. Building on UC Berkeley RISC and Sun compiler and operating system developments, SPARC architecture was highly adaptable to evolving semiconductor, software, and system technology and user needs. The architecture delivered the highest performance, scalable workstations and servers, for engineering, business, Internet, and cloud computing uses.

Techniques developed for and alongside the idea of the reduced instruction set have also been adopted in successively more powerful implementations and extensions of the traditional "complex" x86 architecture. Much of a modern microprocessor's transistor count is devoted to large caches, many pipeline stages, superscalar instruction dispatch, branch prediction and other modern techniques which are applicable regardless of instruction architecture. The amount of silicon dedicated to instruction decoding on a modern x86 implementation is proportionately quite small, so the distinction between "complex" and RISC processor implementations has become blurred.

Notes and references

Notes

^ ^a ^b Tanenbaum, Andrew (March 1978). "Implications of Structured Programming for Machine Architecture". Communications of the ACM 21 (3): 237–246. doi:10.1145/359361.359454.
^ "memorabilia [RISC-I Reunion]". risc.berkeley.edu. Retrieved March 19, 2020.
^ "Berkeley Hardware Prototypes". people.eecs.berkeley.edu. Retrieved November 6, 2021.
^ Gee. "Oracle to Receive IEEE Milestone Award for SPARC RISC Architecture". blogs.oracle.com. Retrieved March 19, 2020.

References

[1] Reilly, Edwin D. (2003). Milestones in Computer Science and Information Technology. p. 50. ISBN 1573565210.
[2] Chisnall, David (August 23, 2010). "Understanding ARM Architectures". Informit. Retrieved October 13, 2015.

[[Category:命令処理]] [[Category:CPU]] [[Category:未査読の翻訳があるページ]]