R Allan Barker science | technology | history | philosophy | curiosity |
Last updated
April 28, 2023 |
GCN GCN (Graphics Core Next} is AMD's newest
processor architecture used in GPUs and integrated
CPU/graphics processors. For anyone who enjoys the power of
lower level programming, GCN ia all new and amazingly
powerful.
GCN C compiler and assembler, extending OPENCLAt the assembly level, programming is similar to traditional CPUs but uses two interwoven instruction sets, one for vector operations and one for scalars. The two run in parallel and access common resources. Both have rich and powerful instructions like single cycle floating point transcendentals and a wide range of bit operations. Forget twiddling bits, now you can bash them. The problem is you cannot access GCN's full power. Currently the only way to program GCN gpus is
with OpenCL, which is hardware architecture independent
leaving much of GCN's power inaccessible. OpenCL doesn't
even support global synchronization, probably because it
roots come from graphics programming.
GCNC implements a traditional C/assembler programming environment that compiles directly to GCN machine code. C and/or assembly (ISA) programs are compiled to binary and run from the user's host program like OpenCL, using the OpenCL environment. Source code files can contain sections with GCNC C, ISA assembly, and AMD's IL code. Compilation is directed by simple #define statements in the host program, for which I use mingw gcc C. GCNC is based on the wonderful LCC compiler and is still under development. LCC provides very good optimization for its age but works on a different paradigm than modern compilers like AMD's openCL. While modern compilers often fully rewrite programs, GCNC optimizes the code the programmer writes. Optimization is a dual effort between programmer and compiler allowing the programmer to control the output, just like the good old days. GCNC fully opens the GCN architecture by including low level built in functions, instructions, register access, and inline assembly. The assembler implements the full GCN instruction set and does assembly level optimization to find the best GCN instruction formats. It also assembles AMD's GCN assembly code. Like C, GCNC started as a way to avoid coding in assembly yet maintain a strong link to hardware. As a project, it has been a daunting task to adapt to the full scope of a parallel environment like GCN. Currently, the pace of development is based mostly on need, although whim is often considered a need. While it implements C, complex syntax can still uncover areas that need further attention. I have now used GCNC for several substantial projects and the results have better than expected. Although AMD's openCL may write better code, the programmer's control over the output seems more important to producing very high performance code. I was never able to share this development as I had hoped but it still serves as an example of how to access GCN's low level power. See example code in below file. The GCN C compiler was developed to run on a 24 GPU system The system uses 12 water cooled AMD R295X2
dual GPU cards for 6 GPUs per computer. Computers
share 80gb RDMA links allowing one program to use all
24 GPUs.
The file below shows many examples of
GCNC's syntax.
|
#include "../gcnc.h" // The GCNC header file //------------------------------------------------------------------------ // Skeleton program compiled by the OCL compiler // Synchronizes the opencl system with the GCNC compiler //------------------------------------------------------------------------ #ifndef _MIXCODE_LCC kernel void lcctest( global double *p, global double *q, global int *prb, const int a, const int b) { int x = get_global_id(0), y = get_global_id(1); p[y] = q[prb[x]]; } #endif //------------------------------------------------------------------------ // A GCNC C program with some example syntax //------------------------------------------------------------------------ #ifdef _MIXCODE_LCC #define WAVE_BAR_0 1 #define N_WAVES 31 int real_function(int,int); // forward delaration of a callable functions extern sregister unsigned long exec,vcc; // access to special register 'exec' , 'vcc' extern sregister unsigned int scc; // access to special register 'scc' extern sregister unsigned int m0; // access to special register 'm0' //BUILT IN FUNCTIONS, INSTRUCTIONS, and ASM() builtin float sin(float); // builtin function sin(x), units of 2*PI builtin float sin_2pi(float); // builtin function 1 insn sin(x), units of 1.0 builtin float v_sin_f32(float); // builtin 'instruction' uses GCN insn names asm("v_sin_f32 #0 #1",x,y); // inline assembly code builtin float v_sin_f32(float); // by default, 'builtin' return vgprs sbuiltin unsigned long mem_time(void); // get gpu clock, 'sbuiltin' returns sgprs builtin double fma64(double,double,double); // many builtin functions builtin uint bitcount(uint, uint); // builtin GCN bit op functions builtin void ds_max_f64(double *,double,...); // lds fucntions, variable args for offsets builtin void gs_max_f64(double *,double); // gds fucntions can use gd_ prefix builtin int atomic_cmpswap(int,int); // builtin global atomics builtin void tbuf_load(buf, yy, 3*8); // global read/write builtin functions builtin void tbuf_load4_glc(buf, yy, 3*8); // global read/write functions use glc bit #define CON_INT 123 // constant forms #define CON_UINT 123U #define CON_LONG 123L #define CON_ULONG 123LU #define CON_FLOAT 1.234f #define CON_DOUBLE 1.234 struct tree{ // structures and pointers struct tree *l,*r; // pointers to structures int a:8,b:12,c:12; int f; }; //------------------------------------------------------------------------------------------ kernel void lcctest( // global double * buf, global double * out, global int * _printf_, // buffer for fast printf const int arg2, const int arg3 ) { register int gx,gy,gid; // private variables are type 'register' int a,j,k,vgpr; // 'register' is assumed uint ui,ret; // supports int, uint register long longtime; // long, ulong register char ch; // char, uchar register short sh; // short, ushort register float x,y,z; // float register double xx,yy,zz; // double register int array2d[3][2]; // constant indexed register arrays sregister int sgpr1,sgpr2; // SGPRS are defined as 'sregister' sregister long sgpr64,t2; // long SGPRS, s[22:23] are even aligned local double localdouble; // local memory variable (lds) local int localbuf[1024]; // local memory buffer (lds) local int *lp; // pointer to local memory local struct tree trees[4],*tp; // local structures and struct pointers _gds int gdsval; // gds memory buffer and vars, _gds long gdslong; // gds seen by all kernels _gds int gds_buf[512]; // gds arrays _gds double gdsdouble; // double floating point _gds value _gds int *gp; // pointer to gds memory global sregister double *sp; // pointer to global memory with SGPR global register double *vp; // pointer to global memory with VGPR register double *vp; // pointers assumed to point to global memory gx = get_global_id(0); // basic opencl functions gy = get_global_id(1); // gid = 256 * gy + gx; // calculate global id #define USE_GWS_BARRIER #ifdef USE_GWS_BARRIER ret=atomic_inc((uint *)buf,MAXU); // first addr = 0, ret=0 indicates first wave if(ret == 0) // first wave initializes a gws global barrier. gws_init(N_WAVES, WAVE_BAR_0); // .. for all 128 waves, used below #endif if(gid==3) //fast printf, runs near full speed printf("hello amd = %d\n",gid); //prints to screen, uses global arg _printf_ sgpr64 = vcc; // read/write registers vcc, m0, exec, etc. sgpr64 = exec; //save exec register exec = 0xffffffffffffffffUL; // this is dangerous. exec = sgpr64; // restore exec register a = m0; // read the the m0 register. yy = fma64(yy, xx, yy); // builtin math functions, eg, double fma ui = 0xBEEFBABE; // builtin bit functions j = bitcount(bitrev(ui), ffbitl(ui)); // no. of bits in a + value of first bit in a localdouble = 1.7777; // 64 bit local memory ds_max_f64(&localdouble, 3.3); // double atomic max in local memory #define CFLOAT 1.2345678f // float constant #define CDBL 1.2345678912345678 // double constant ds_min_f64(&localdouble, CDBL, 4); // local atomics with optional offset y = v_cos_f32(x); // builtin instructions use gcn assembler names a = v_mad_i32_i24(j, a, 12); // most hardware instructions are available z = v_add_f32(abs(x), -y); // gcn optimiztions for abs() and sign k = v_subb_u32(vcc, a, j, vcc); // BIs use registers vcc, scc, etc k = v_add_i32(sgpr1, a, j); // BIs use user's SGPR registers precise24(); //int mul/div precision set to 24bits k = a * j + d; // v_mad_i32_i24 1 clock. precise32(); //precision returns to 32 bits k = a * j + d; // v_mad_i32 4 clocks. // inline assembly gcn instructions asm("v_add_i32 #0, vcc, #1, 123",a,j); // referencing C register variables asm("v_mul_f32 #0, abs(#2), -#1",x,y,x);// optimization available via the assembler asm("v_nop"); asm(".message: here I am!"); // Expression optimization y = abs(x) * y - abs(z); // this compiles to one instruction. ui = (a & ~k) | (j & k); // Compiles to one instruction d = bfi(k, j, a); // Compute z = sin(x)^2 + cos(x)^2 y = sin2pi(x); // sin2pi and cos2pi are native instructions x = cos2pi(x); // both have domains in units of 2*PI z = x*x + y*y; // Full computation takes 4 instructions // global memory, yy = buf[3]; // compiler generates waits for 'C' expressions sp = &buf[3]; // pointer to global memory yy = *sp; // read global memory tbuf_load2(buf, yy, 3*8); // read using builtin function tbuf_load2_glc(buf, yy, 3*8); // force the glc bit to bypass the cache //... vm_wait(0); // user added wait() for 'builtin' instructions if(y > x + 3.3f)y=0.0; else y=x; // simple conditionals use v_cndmask instruction if(y==x && i>3) // most conditionals use v_cmpx_(op), short code for(a = 0; a < 7; a++){ // do something in a loop for(j = 0; j < 7; j++)a = a + j; // do more } lp = &localbuf[3]; // lds and gds access, pointer to local memory localbuf[4] = *(lp + 2); // localbuf[4] = localbuf[5] gp = &gds_buf[3]; // pointer to _gds memory gds_buf[4] = *(gp + 2); // gds_buf[4] = gds_buf[5] //LDS/GDS builtin functionss ds_atom_inc((uint *)&localbuf[4]); // atomic inc of global data store ds_atom_inc((uint *)&gds_buf[8]); // ds functionss use lds or gds ds_cmpst_f64(&localdouble, 1.22); // LDS low level instructions with gcn names gs_cmpst_f64(&gdsdouble, 3.44, 8); // GDS instruction with optional offset a = vgpr + sgpr1; // mixing SGPRs with VGPRs is allowed j = gx & 0x3f; // readlane 1) get the local thread id in wave sgpr1 = 8; // readlane 2) select lane no. 8 in sgpr1 sgpr2 = v_readlane_b32(j, sgpr1); // readlane 3) read j from thread 8 (lane 8) j = sgpr2; // readlane 4) assign to j for all threads tp=&trees[0]; // point to a structure if(gx < 0)tp = (tp->l)->r; // structure pointer to pointer to ...... yy = localdouble; // read local double set at the top yy = xx / yy; // full precision double divide sgpr64 = mem_time(); // using the gpu clock to time insturctions y = cos2pi(x); // measure clocks for this instruction t2 = mem_time(); // most insns are 4 clocks, others are 6 or 8 ds_wait(0); // wait for timer to return longtime = t2 - sgpr64; // most insns are 4 clocks, others are 6 or 8 #ifdef USE_GWS_BARRIER gws_barrier(WAVE_BAR_0); // global wave barrier, everyone wait here! #endif j = real_function(a,0); // real stack function call out[gid] = yy; // write something before leaving } int real_function(int a,int b) // a non-inline function, can be reentrant { int tmp; // local variables are temporary registers tmp = a * b; // do something return tmp; // return } #endif //------------------------------------------------------------------------------------------ // Alternatively, programs are written using GCN ISA code // Option 1, first compile a shell program using the OCL compiler and paste here // Option 2, compile a shell program using the GCNC compiler and paste here // Option 3, Write your own ISA program, requires in depth knowlege opencl internals //------------------------------------------------------------------------------------------ #ifdef _MIXCODE_GCN x_set_nvgpr 63 x_set_nsgpr 40 x_set_ldsmax 0x7fff s_mov_b32 m0, 0x00008000 s_buffer_load_dwordx2 s[0:1], s[4:7], 0x04 s_waitcnt lgkmcnt(0) s_mul_i32 s0, s12, s0 s_mul_i32 s1, s13, s1 v_add_i32 v0, vcc, s0, v0 v_add_i32 v1, vcc, s1, v1 s_load_dwordx4 s[20:23], s[2:3], 0x50 s_load_dwordx4 s[16:19], s[2:3], 0x58 s_buffer_load_dword s24, s[8:11], 0x00 s_buffer_load_dword s25, s[8:11], 0x04 s_buffer_load_dword s26, s[8:11], 0x08 s_buffer_load_dword s27, s[8:11], 0x0c s_buffer_load_dword s28, s[8:11], 0x10 s_waitcnt lgkmcnt(0) v_mov_b32 v5, v0 v_mov_b32 v6, v1 v_lshlrev_b32 v11, 8, v6 v_add_i32 v7, vcc, v5, v11 v_mov_b32 v11, 0x3333 v_mov_b32 v2, 0 s_waitcnt lgkmcnt(0) v_lshlrev_b32 v11, 2, v7 v_add_i32 v11, vcc, s25, v11 s_waitcnt lgkmcnt(0) v_mov_b32 v0,1.111 v_mov_b32 v1,7.0 v_mov_b32 v9,10.0 v_mac_f32 v9,v1,2.222 tb_store_fmt_x v9, v11, s[20:23], 0 offen format:TF_X label_1: s_endpgm end .end lcctest #endif //_MIXCODE_GCN //------------------------------------------------------------------------------------------ |