windows deep internals

пятница, 4 апреля 2025 г.

ptx instructions emitting by nvidia compiler. part 2

Part 1 described v10

And today let's check cicc v12. The first thing that catches your eye is its size - almost 76Mb! And it also contains at least 5 different decryptors - Nvidia really wants to hide something from its grateful clients

Why it is so fat?

Bcs it contains at least 4 code generators: for arm32, aarch64, x86 & nvptx

+ at least 27 llvm bytecode blobs (signature 0x42 0x43 0xc0 0xde) - they contains mostly bodies of intrinsic functions like nvvm_mulq/nvvm_divq but on some llvm-dis just crashes:

#0 0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339 1339 unsigned TableVal = IIT_Table[id-1]; >>> where #0 0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339 #1 0x000055abcc5fe41f in UpgradeIntrinsicFunction1 (F=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/include/llvm/IR/Function.h:204 #2 0x000055abcc60111a in llvm::UpgradeIntrinsicFunction (F=F@entry=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/lib/IR/AutoUpgrade.cpp:1226 #3 0x000055abcc584778 in (anonymous namespace)::BitcodeReader::globalCleanup (this=0x55abce608e30) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:3696 #4 0x000055abcc5856cc in (anonymous namespace)::BitcodeReader::parseModule (this=<optimized out>, ResumeBit=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:4385 #5 0x000055abcc5959ca in (anonymous namespace)::BitcodeReader::parseBitcodeInto (Callbacks=..., IsImporting=false, ShouldLazyLoadMetadata=false, M=0x55abce5f3d80, this=0x55abce608e30) at /usr/include/c++/9/bits/std_function.h:564 #6 llvm::BitcodeModule::getModuleImpl (this=<optimized out>, Context=..., MaterializeAll=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:7981 #7 0x000055abcc596070 in llvm::BitcodeModule::getLazyModule (this=0x7ffc60415d10, Context=..., ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /usr/include/c++/9/bits/std_function.h:263 #8 0x000055abcc550e49 in main (argc=<optimized out>, argv=<optimized out>) at /home/redp/disc/src/llvm-project/llvm/include/llvm/Support/CommandLine.h:1399

>>> p id $1 = 0

At least they should check that index can become negative, no? Who would doubt that llvm is very reliable and secure

So if they process llvm ByteCode then they also must link half of llvm run-time to do it, but they also use

Дальше »

воскресенье, 30 марта 2025 г.

ptx instructions emitting by nvidia compiler

I recently became curious what exactly ptx instructions can produce nvidia compiler - like if it uses something totally undocumented or vice versa - some official ptx instructions are never generated during compilation

The first thing is where those compiler located - no, it's not nvcc. Real compiler is cicc from packet cuda-nvvm. cicc from v10 has size 21Mb. The strings utility shows many interesting things, like

Portions Copyright (c) 1988-2016 Edison Design Group, Inc.
Portions Copyright (c) 2007-2016 University of Illinois at Urbana-Champaign.
Based on Edison Design Group C/C++ Front End

So they use front-end from Edison Design Group and llvm as back-end

Then I extracted several tables:

compiler errors
list of built-in functions with prototypes
list of llvm attributes - as you can see they are mostly correspond to nvvm LLVM dialect
and finally what I looked for - list with internal instruction names & their bodies to place into PTX file

As you can see mapping is very straightforward - for example for instruction BFE_S32rii (index 0x27) generating PTX bfe.s32

Results

I found only minor PTX instructions not presented in their official documentations: suq.xxx - perhaps should mean surface qword or something like this

среда, 26 марта 2025 г.

nvidia sass disassembler, part 5

Previous parts: 1, 2, 3 & 4

I've finally add native rendering for instructions - actually just rewrite from perl terrible function make_inst. Because in output typically rendering only small fraction of instructions data for formats are filling by demand via std::call_once. Results to compare with genuine nvdisasm:

mine	nvdisasm
LDC R1,c:[0][0x37C] LDCU.64 UR8,c:[0][URZ+0x440] LDC R16,c:[0][0x3B8] LDCU.64 UR12,c:[0][URZ+0x448] LDCU UR4,c:[0][URZ+0x3AC] LDC._64 R4,c:[0][0x450] LDCU.64 UR14,c:[0][URZ+0x380] LDCU.64 UR10,c:[0][URZ+0x358] HFMA2 R13,-RZ,RZ, 1.875000, 0.000000 ISETP.NE.S64.AND P2,PT,RZ,UR8,PT	LDC R1, c[0x0][0x37c] LDCU.64 UR8, c[0x0][0x440] LDC R16, c[0x0][0x3b8] LDCU.64 UR12, c[0x0][0x448] LDCU UR4, c[0x0][0x3ac] LDC.64 R4, c[0x0][0x450] LDCU.64 UR14, c[0x0][0x380] LDCU.64 UR10, c[0x0][0x358] HFMA2 R13, -RZ, RZ, 1.875, 0 ISETP.NE.S64.AND P2, PT, RZ, UR8, PT

IMHO very similar, has some minor problems with formatting of floating point values (I used FP16 to extract 16bit values but don't know what means E8M7Imm in format descriptor)

So the next thing to show is

labels for branches

As I mentioned you can identify instruction as branches via it's PROPERTIES, get value in BRANCH_TARGET_INDEX and render it as label address. There are two problems:

size of branch offset vary in size - it can be 58bit for sm_90, 50bit for sm_75, 24 for sm_3 and so on
branch offset is signed value, so we need some method to detect that some value of known bit size is negative

Дальше »

пятница, 21 марта 2025 г.

nvidia sass disassembler, part 4

I've made native sass disasm - just adding c++ codegen (can be produced by ead.pl with -C option). It works via dynamic loading of right disasm module - see list of supported architectures in map s_sms. For now it supports only operands dump with -O option - not rendered yet (bcs rewriting bunch of perl code with duck-types to C++ is boring and tedious work). Also you can dump attributes with -e option. You can make those modules with something like "make sm90.so". Btw dumb gcc allocates for local vars ~600kb on stack and with -Os option it compiles each module for 10 minutes with stack consumption shrink to normal values)

Tests show zero unrecognized instructions (and I am truly proud of this), however if you will find such - I also add option -N to dump it's content to bit-mask, which you then can pass to ead.pl with the same -N option to see what happened

On the other side it seems that nvidia trying to hide something important from us - let's check libcublas.so from v12 - we can notice lots of sections

.nv.merc.nv.info - genuine nvdiasm unable to show their content
.nv.capmerc.text - however, the instructions they contain are clearly in some other format and cannot be disassembled - I add -s option to disasm single section by it's index, so you can try it by yourself
and they obviously has corresponding relocs in sections .nv.merc.rela.text
and even .nv.merc.rela.debug_frame & .nv.merc.symtab

Known problems

Дальше »

пятница, 14 марта 2025 г.

nvidia sass disassembler, part 3

It looks like this rabbit hole goes much deeper

Some const banks does not have ConstBankAddressX:

CX:Sb[UniformRegister:URb][UImm(16)*:Sb_offset]
BITS_6_37_32_Ra_URb=URb
BITS_14_53_40_Sb_offset=Sb_offset SCALE 4

Btw there is no encoding for field Sb

Next they have desc memory:
DESC:memoryDescriptor[UniformRegister:Ra_URb][Register:Ra /ONLY64:input_reg_sz_64_dist + SImm(24/0)*:Ra_offset]

genuine nvdisasm shows them like LDG.E.U16.CONSTANT R10, desc[UR8][R2.64], my disasm as LDG.E.U16.CONSTANT ,R10,desc[UR8][R2.64 + 0x0]

And finally we also have:

A:srcAttr[ UniformRegister:URa + SImm(11/0)*:URa_offset ]

GMMA:gdesc[ UniformRegister:URb ]

TMA:desc[ UniformRegister:URe ]

TTU:ttuAddr[ UImm(16)*:ImmU16 ]

RF:indexURb[UniformRegister:URb] ','UImm(4/0xf)*:PixMaskU04

TMEMA:tmemA[ UniformRegister:URa ]

TMEM perhaps means "tensor memory" and I have no idea about the rest of the prefixes

понедельник, 10 марта 2025 г.

nvidia sass disassembler, part 2

Lets continue explore "machine descriptions" - in this time try to understand how to make format output more similar to genuine nvdisasm

For example format for one of variant I2F looks like:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /Float64:dstfmt /SRCFMT_U16_S16:srcfmt /Round1("RN"):rnd
Register:Rd
','C:Sb[UImm(5/0*):Sb_bank]*   [SImm(17)*:Sb_addr] /HSEL("H0"):hsel
$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$ ;

...

ENCODING
!i2f_Rd64__Cb_16b_unused;
BITS_3_14_12_Pg = Pg;
BITS_1_15_15_Pg_not = Pg@not;
BITS_13_91_91_11_0_opcode=Opcode;
BITS_3_77_75_sz=*dstfmt;
BITS_3_85_84_74_74_srcfmt=*srcfmt;
BITS_2_79_78_stride=rnd;
BITS_8_23_16_Rd=Rd;

BITS_5_58_54_Sb_bank,BITS_14_53_40_Sb_offset =  ConstBankAddress2(Sb_bank,Sb_addr);
BITS_2_61_60_hsel=hsel;
BITS_6_121_116_req_bit_set=req_bit_set;
BITS_3_115_113_src_rel_sb=VarLatOperandEnc(src_rel_sb);
BITS_3_112_110_dst_wr_sb=VarLatOperandEnc(dst_wr_sb);
BITS_2_103_102_pm_pred=pm_pred;
BITS_8_124_122_109_105_opex=TABLES_opex_0(batch_t,usched_info);

Дальше »

четверг, 6 марта 2025 г.

nvidia sass disassembler

Couple weeks ago I made decryptor to extract from nvdisasm so called "machine descriptions" (MD) (btw nvdisasm v12 uses lz4 compression library, so I made yet another decryptor + results). And after that I became extremely curious whether it was possible to make full SASS disassembler - sure format of those MDs even more undocumented than syntax of PTX - but anyway it's much better than having no documentation about ISA at all

First and most important thing to check is width of instruction - it can be

64bit for sm37 (Kepler) and more old
88bit for sm5x (Maxwell) until sm70
128bit since sm70 (Volta, Turing, Ampere, Ada, Hopper & Blackwell)

I have a very limited imagination so I couldn't imagine how hardware could support alignment for 11 byte instruction just bcs 11 is not power of 2. So after some magic with debugger I found following code snippet:
lea eax, [r9+r9*4] ; eax = r9 * 5 lea ecx, [r9+rax*4] ; ecx = r9 * 21 mov eax, 1FFFFFh ; 17bit mask
and then some search in google revealed this document:

On Kepler there is 1 control instruction for every 7 operational instructions. Maxwell added additional control capabilities and so has 1 control for every 3 instructions

So 88bit became 4 64bit qwords where first is Control qword and 3 remaining are instructions, then 21 + 64 = 85bit - very close to 88

Note: such martian architecture makes it impossible to create IDA Pro processor module for Maxwell and more old GPUs - bcs IDA expects that instruction at any properly aligned address should be valid, and you just don't know there Control qword is located for block of instructions

Lets check how looks description of each instruction (from here onwards I will refer to sm90 MD)

Дальше »

пятница, 4 апреля 2025 г.

ptx instructions emitting by nvidia compiler. part 2

Why it is so fat?

воскресенье, 30 марта 2025 г.

ptx instructions emitting by nvidia compiler

Results

среда, 26 марта 2025 г.

nvidia sass disassembler, part 5

labels for branches

пятница, 21 марта 2025 г.

nvidia sass disassembler, part 4

Known problems

пятница, 14 марта 2025 г.

nvidia sass disassembler, part 3

понедельник, 10 марта 2025 г.

nvidia sass disassembler, part 2

четверг, 6 марта 2025 г.

nvidia sass disassembler

пятница, 4 апреля 2025 г.

воскресенье, 30 марта 2025 г.

среда, 26 марта 2025 г.

пятница, 21 марта 2025 г.

пятница, 14 марта 2025 г.

понедельник, 10 марта 2025 г.

четверг, 6 марта 2025 г.