General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.
In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.
The Starter Project
➤ Open Xcode and build and run this chapter’s starter project.
The scene contains a lonely garden gnome. The renderer is a simplified forward renderer with no shadows.
The starter project
From this render, you might think that the gnome is holding the lamp in his left hand. Depending on how you render him, he can be ambidextrous.
➤ Press 1 on your keyboard.
The view changes to the front view. However, the gnome faces towards positive z instead of toward the camera.
Facing backwards
The way the gnome renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. This USD file expects a right-handed coordinate system.
If you want a right-handed gnome, there are a few ways to solve this issue:
1. Rewrite all of your coordinate positioning.
2. In vertex_main, invert position.z when rendering the model.
3. On loading the model, invert position.z.
If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.
Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.
Winding Order and Culling
Inverting the z position flips the winding order of the vertices, so you'll need to account for that. When Model I/O reads in the model, the vertices are in clockwise winding order.
➤ To demonstrate this, open ForwardRenderPass.swift.
➤ In draw(commandBuffer:scene:uniforms:params:), add this code after renderEncoder.setRenderPipelineState(pipelineState):
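The code block itself isn't reproduced in this excerpt. Based on the explanation that follows, it presumably sets the winding order and the culling mode — a sketch using the standard MTLRenderCommandEncoder API:

```swift
// Expect counterclockwise front-facing vertices (the default is clockwise)
renderEncoder.setFrontFacing(.counterClockwise)
// Cull faces that point away from the camera
renderEncoder.setCullMode(.back)
```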
Here, you tell the GPU to expect vertices in counterclockwise order. The default is clockwise. You also tell the GPU to cull all faces that face away from the camera. As a general rule, you should cull back faces since they're usually hidden, and rendering them isn't necessary.
➤ Build and run the app.
Rendering with incorrect winding order
Because the winding order of the mesh is currently clockwise, the GPU is culling the wrong faces, and the model appears to be inside-out. Rotate the model to see this more clearly. Inverting the z coordinates will correct the winding order.
Reversing the Model on the CPU
Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the gnome on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.
➤ In the Geometry group, open VertexDescriptor.swift. Take a moment to refresh your memory about the layout in which Model I/O loads the model buffers in defaultLayout.
There are more vertex buffers here than the vertex descriptor loads into your project. You're currently only interested in the first vertex buffer layout, VertexBuffer. It consists of a float3 for Position and a float3 for Normal. You don't need to consider UVs because they're in the last layout.
➤ In the Shaders group, open Common.h, and add a new structure:
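The structure isn't shown in this excerpt. Given the float3 Position and float3 Normal layout described above, it's presumably along these lines (the name VertexLayout is an assumption):

```c
#include <simd/simd.h>

// Mirrors the first vertex buffer layout: a position and a normal
typedef struct {
  vector_float3 position;
  vector_float3 normal;
} VertexLayout;
```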
➤ Add this code to the end of init() to call the new method:
convertMesh(gnome)
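The convertMesh(_:) method itself isn't shown in this excerpt. A CPU-side sketch, assuming the VertexLayout structure above and a model whose first vertex buffer uses that layout, might look like:

```swift
// Hypothetical sketch: flip position.z for every vertex on the CPU
func convertMesh(_ model: Model) {
  let startTime = CFAbsoluteTimeGetCurrent()
  for mesh in model.meshes {
    // Assumes the first buffer holds VertexLayout elements
    let vertexBuffer = mesh.vertexBuffers[0]
    let vertexCount =
      vertexBuffer.length / MemoryLayout<VertexLayout>.stride
    let pointer = vertexBuffer.contents()
      .bindMemory(to: VertexLayout.self, capacity: vertexCount)
    for index in 0..<vertexCount {
      pointer[index].position.z = -pointer[index].position.z
    }
  }
  print("CPU conversion time:", CFAbsoluteTimeGetCurrent() - startTime)
}
```

The pointer work here is the key idea: you bind the raw MTLBuffer contents to the VertexLayout type, then read and write vertices through that typed pointer.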
➤ Build and run the app and press 1 for the front view.
E wuzxt-xehter ccuji
The gnome is now right-handed. On the M1 MacBook Pro, the printed conversion time is a small fraction of a second. That's pretty fast, but the gnome is a small model with only several thousand vertices.
You could run those for loop iterations in parallel and process them with a GPU kernel. Within the for loop, you perform the same operation on every vertex independently, so it's a good candidate for GPU compute. Independently of the graphics work, GPU threads perform operations separately from each other.
Compute Processing
In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.
Threads and Threadgroups
To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.
The grid is defined in three dimensions: width, height and depth. But often, especially when you're processing images, you'll only work with a 1D or 2D grid. Every point in the grid runs one instance of the kernel function, each on a separate thread.
➤ Look at the following example image:
Threads and threadgroups
The image is 512×384 pixels. You need to tell the GPU the number of threads per grid and the number of threads per threadgroup.
Threads per grid: In this example, the grid is two-dimensional, and the number of threads per grid is the image size of 512 by 384.
Threads per threadgroup: Specific to the device, the pipeline state's threadExecutionWidth suggests the best width for performance, and maxTotalThreadsPerThreadgroup specifies the maximum number of threads in a threadgroup. On a device with 512 as the maximum number of threads, and a thread execution width of 32, the optimal 2D threadgroup size would have a width of 32 and a height of 512 / 32 = 16. So the threads per threadgroup will be 32 by 16.
In this case, the compute dispatch code looks something like this:
let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(
  width: width,
  height: pipelineState.maxTotalThreadsPerThreadgroup / width,
  depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
You specify the threads per grid and let the pipeline state work out the optimal threads per threadgroup.
Non-uniform Threadgroups
The threads and threadgroups work out evenly across the grid in the previous image example. However, if the grid size isn’t a multiple of the threadgroup size, Metal provides non-uniform threadgroups.
Non-uniform threadgroups
Threadgroups per Grid
You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.
In the following image, the grid is split first into threadgroups, and each threadgroup is then divided into its individual threads.
Threadgroups in a 2D grid
In the kernel function, you can locate each pixel in the grid by its unique position. You can also uniquely identify each thread within its threadgroup, as well as each threadgroup's own position within the grid.
You have control over the number of threadgroups. However, when the grid size isn't an even multiple of the threadgroup size, you need to add an extra threadgroup to the size of the grid to make sure every thread executes.
Using the same image example, you would choose to set up the threadgroups in the compute commands like this:
let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
  width: (gridWidth + width - 1) / width,
  height: (gridHeight + height - 1) / height,
  depth: 1)
computeEncoder.dispatchThreadgroups(
  threadGroupCount,
  threadsPerThreadgroup: threadsPerThreadgroup)
You specify the threads per threadgroup. In this case, the threadgroup will be 32 threads wide, 16 threads high and 1 thread deep.
If the size of your data doesn't match the size of the grid, you may have to perform boundary checks in the kernel function.
In the following example, the threadgroup size doesn't divide evenly into the image size, so the threadgroups necessary to process the whole image extend past its edges. You'd have to check that the threadgroup isn't using threads that are off the edge of the image.
Underutilized threads
The threads that are off the edge are underutilized. That is, they're threads that you dispatched, but there was no work for them to do.
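A boundary check like the one described might look like this in a kernel (the kernel name and texture parameter are hypothetical):

```metal
#include <metal_stdlib>
using namespace metal;

kernel void process_image(
  texture2d<float, access::write> output [[texture(0)]],
  uint2 id [[thread_position_in_grid]])
{
  // Skip threads dispatched beyond the edge of the image
  if (id.x >= output.get_width() || id.y >= output.get_height()) {
    return;
  }
  output.write(float4(1, 0, 0, 1), id);
}
```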
Reversing the Gnome Using GPU Compute Processing
The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The gnome problem acts on an array in a buffer and will require a one-dimensional grid.
➤ In the Geometry group, open Model.swift, and add a new method to Model:
func convertMesh() {
  // 1
  guard let commandBuffer =
    Renderer.commandQueue.makeCommandBuffer(),
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()
  else { return }
  // 2
  let startTime = CFAbsoluteTimeGetCurrent()
  // 3
  let pipelineState: MTLComputePipelineState
  do {
    // 4
    guard let kernelFunction =
      Renderer.library.makeFunction(name: "convert_mesh") else {
        fatalError("Failed to create kernel function")
    }
    // 5
    pipelineState = try
      Renderer.device.makeComputePipelineState(
        function: kernelFunction)
  } catch {
    fatalError(error.localizedDescription)
  }
  computeEncoder.setComputePipelineState(pipelineState)
}
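The remainder of convertMesh() — binding the vertex data, dispatching the threads, timing and committing — isn't reproduced in this excerpt. A sketch, assuming the model's first mesh and a VertexLayout vertex buffer at index 0:

```swift
// Hypothetical continuation of convertMesh()
let vertexBuffer = meshes[0].vertexBuffers[0]
let vertexCount =
  vertexBuffer.length / MemoryLayout<VertexLayout>.stride
computeEncoder.setBuffer(vertexBuffer, offset: 0, index: 0)

// One thread per vertex on a 1D grid
let threadsPerGrid = MTLSize(width: vertexCount, height: 1, depth: 1)
let threadsPerThreadgroup = MTLSize(
  width: pipelineState.threadExecutionWidth, height: 1, depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
computeEncoder.endEncoding()

commandBuffer.addCompletedHandler { _ in
  print("GPU conversion time:",
    CFAbsoluteTimeGetCurrent() - startTime)
}
commandBuffer.commit()
```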
You supply a closure that calculates the amount of time the procedure takes and prints it out. You then commit the command buffer to the GPU.
The Kernel Function
That completes the Swift setup. You simply specify the kernel function to the pipeline state and create an encoder using that pipeline state. With that, it’s only necessary to give the thread information to the encoder. The rest of the action takes place inside the kernel function.
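The kernel itself isn't reproduced in this excerpt. Based on the surrounding description, a sketch — the function name convert_mesh comes from the Swift setup, while the VertexLayout parameter is an assumption:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void convert_mesh(
  device VertexLayout *vertices [[buffer(0)]],
  uint id [[thread_position_in_grid]])
{
  // Each thread inverts the z position of one vertex
  vertices[id].position.z = -vertices[id].position.z;
}
```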
A kernel function can't have a return value. You pass in the vertex buffer, and you identify the current thread using the thread_position_in_grid attribute. You then invert the vertex's z position.
This function will execute for every vertex in the model.
➤ Open GameScene.swift. In init(), replace convertMesh(gnome) with:
gnome.convertMesh()
➤ Build and run the app. Press the 1 key for the front view of the model.
A right-handed gnome
The console prints out the time of the GPU processing. You've now had your first experience with data-parallel processing, and the gnome is now right-handed and faces toward the camera.
Compare the time with the CPU conversion. Always check the comparative times, as setting up a GPU pipeline has a time cost. For small operations, it may take more time to perform the work on the GPU than on the CPU.
Atomic Functions
Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.
Your kernel function operates on each thread independently, but those threads may update shared memory simultaneously. If you give the kernel function a variable in a buffer to store the total, each thread can increment the total, but other threads will be doing the same thing simultaneously. Therefore, you won't get the correct total.
An atomic operation works on shared memory and is visible to other threads.
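Metal Shading Language expresses this with atomic types and functions such as atomic_fetch_add_explicit. A minimal sketch of a kernel that counts its own threads safely (names are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

kernel void count_threads(
  device atomic_int &total [[buffer(0)]],
  uint id [[thread_position_in_grid]])
{
  // Every thread increments the shared counter exactly once;
  // the atomic add prevents simultaneous updates from being lost
  atomic_fetch_add_explicit(&total, 1, memory_order_relaxed);
}
```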
➤ Open Model.swift. In convertMesh(), add the following code before the call to commit():
Here, you create a buffer to hold the total number of vertices. You bind the buffer contents to a pointer and set the contents to zero. You then pass the buffer to the GPU.
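The code for this step isn't shown here; a sketch consistent with that description, with the names and buffer index assumed:

```swift
// Hypothetical: a one-element buffer for the running vertex total
guard let vertexTotalBuffer = Renderer.device.makeBuffer(
  length: MemoryLayout<Int32>.stride,
  options: []) else { return }
let totalPointer = vertexTotalBuffer.contents()
  .bindMemory(to: Int32.self, capacity: 1)
totalPointer.pointee = 0
computeEncoder.setBuffer(vertexTotalBuffer, offset: 0, index: 1)
```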
➤ Still in convertMesh(), add this code to the command buffer's completion handler:
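The handler code isn't shown; presumably it reads the counter back once the GPU finishes — a sketch, assuming the pointer from the previous step:

```swift
// Inside commandBuffer.addCompletedHandler { _ in ... }
// Safe to read here: the GPU has finished incrementing the counter
print("Total vertices:", totalPointer.pointee)
```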
Key Points
GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.
You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.
GPU memory is good at simple parallel operations, and with Apple silicon, you can keep chained operations in tile memory instead of moving them back to system memory.
Compute processing uses a compute pipeline with a kernel function.
The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.
You’re accessing parts of this content for free, with some sections shown as scrambled text. Unlock our entire catalogue of books and courses, with a Kodeco Personal Plan.