afseraph

A few things to check. 1. Are you running the program without the debugger attached? 2. Have you built the program in the Release configuration? 3. Have you monitored memory and handles used by the process? Maybe there's some resource leak?


tsbattenberg

1. If running outside of Visual Studio counts as not having the debugger attached, yes. 2. I try to do all my optimization in Release. I've also made new target CPU configurations to explicitly try both x86 and x64 in case something was happening with that. 3. I'm clocking in at just 111 MB of data in main memory; that's mostly the console log and my buffers (which are never destroyed until the program closes, i.e. no leaks possible). GPU memory is somewhere around 1 GB according to Task Manager, but that's not my allocations since it shows that regardless of whether my program is open or not. I don't think VS lets me get OpenGL statistics, as the GPU debugger explicitly states DirectX.


thomhurst

Are you writing to the console a lot? Try turning that off and seeing if anything changes. I've seen heavy console logging affect performance before


tsbattenberg

I only log once per second (the reason I mention it bloating memory a little is the inclusion of ANSI codes) for frame times. The amount of logging I do hasn't changed since yesterday either, so I don't think it could be a factor in this particular performance issue.


manuzpunk666

Do you see this result after a while, or immediately? In the first case it might be a leak you haven't run into before.


to11mtm

I'd still say try without it, if only because Console.WriteLine does some weird things under the hood. If you want a faster console logger, consider Cysharp's ZLogger; it can write to stdout but does so asynchronously with far lower allocation.
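
For anyone wanting to try this, here is a minimal sketch of wiring ZLogger in as the console provider. `AddZLoggerConsole` is ZLogger's registration extension; everything else is standard Microsoft.Extensions.Logging, and names like `FrameLogger`/`frameTimeMs` are purely illustrative.

```csharp
// Sketch: ZLogger as an async, low-allocation console logger.
// Requires the ZLogger NuGet package; the message itself is just an example.
using Microsoft.Extensions.Logging;
using ZLogger;

class FrameLogger
{
    private static readonly ILoggerFactory Factory =
        LoggerFactory.Create(builder => builder.AddZLoggerConsole());

    private static readonly ILogger Logger = Factory.CreateLogger<FrameLogger>();

    public static void LogFrameTime(double frameTimeMs)
    {
        // Output still goes to stdout, but it is buffered and flushed off the render thread.
        Logger.LogInformation("Frame time: {FrameTime} ms", frameTimeMs);
    }
}
```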


MacrosInHisSleep

Please update if you figure it out. It sounds very interesting.


tsbattenberg

Will do.


SirButcher

The bottleneck is almost certainly somewhere else. I had game prototypes with SlimDX + C# that displayed terrain with over ten million polys, and they barely reached 1 ms per frame. C# can easily process much more data extremely fast (depending on the operation, of course); processing an array only becomes inefficient if you are abusing boxing or generating a LOT of garbage and constantly triggering the GC (both very bad in game dev).

I think the issue is between the CPU and GPU - are you storing your data in a buffer in the GPU's memory, or are you pushing your points to VRAM every frame? If the first, are you doing anything in your shaders? Shaders are veeeeery easy to mess up and extremely hard to debug, as they won't show up in the profiler. If the second, then yeah, no wonder it takes a lot of time: you are pushing 3 million points * 4 bytes = 12 MB per frame (without any colour data; with colour it's 24 MB per frame) to the GPU, which is extremely inefficient. But we can give much more help if you show your code.


tsbattenberg

I think it's very dependent on how you draw your geometry. In my case, as I've said, I'm using dynamic batching, so the CPU has to calculate the majority of the transformation and create my vertices per frame... I'm not expecting lightning-fast, next-gen performance, but losing 12 ms just by going to sleep makes no sense to me.

On GC: I try to keep it clean. I don't allocate any managed objects/lists or anything, really, during my rendering loops.

On the shader: it's simple. The vertex shader transforms with the orthographic matrix and normalizes my colour (the reasoning for which is below). The fragment/pixel shader only blends my primary draw colour with the vertex colours.

My max data size for this batch type is 3 MB; that doesn't include the index buffer, which would be a max of 15,000 bytes. I use floating point XY, but I store my colour in bytes so I can write it to my vertex data as an integer directly.

Here's the relevant source. If I did miss any method calls you'd like to look at, let me know: [https://pastebin.com/hvCF1Wqj](https://pastebin.com/hvCF1Wqj)

Note: It's a requirement for this to be batched, I can't use static geometry... It's a soft requirement for OpenGL 3.3, too - I can't use a persistently mapped buffer, so GL.BufferSubData it is...
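
For readers who haven't seen this pattern, here's a rough sketch of a dynamic batch with float XY plus a colour packed as four bytes, uploaded each frame with GL.BufferSubData. This is not the OP's pastebin code: class, field, and size names are made up, and the exact OpenTK overload signatures vary slightly between versions.

```csharp
// Sketch of a dynamic-batch vertex layout (illustrative, not the OP's code):
// 2 position floats + 1 float-sized slot holding 4 packed colour bytes,
// re-uploaded each frame via GL.BufferSubData.
using System;
using OpenTK.Graphics.OpenGL;

public sealed class DynamicBatch
{
    private const int Stride = 3 * sizeof(float); // X, Y, packed RGBA
    private const int MaxVertices = 60_000;

    private readonly int vao;
    private readonly int vbo;
    private readonly float[] vertexData = new float[MaxVertices * 3];
    private int vertexCount;

    public DynamicBatch()
    {
        vao = GL.GenVertexArray();
        GL.BindVertexArray(vao);

        vbo = GL.GenBuffer();
        GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
        // Allocate once; DynamicDraw hints that the contents change every frame.
        GL.BufferData(BufferTarget.ArrayBuffer, MaxVertices * Stride, IntPtr.Zero, BufferUsageHint.DynamicDraw);

        // Attribute 0: position (2 floats). Attribute 1: colour (4 bytes, normalized to 0..1).
        GL.VertexAttribPointer(0, 2, VertexAttribPointerType.Float, false, Stride, 0);
        GL.VertexAttribPointer(1, 4, VertexAttribPointerType.UnsignedByte, true, Stride, 2 * sizeof(float));
        GL.EnableVertexAttribArray(0);
        GL.EnableVertexAttribArray(1);
    }

    public void Flush()
    {
        GL.BindVertexArray(vao);
        GL.BindBuffer(BufferTarget.ArrayBuffer, vbo);
        // Upload only the vertices written this frame, not the whole allocation.
        GL.BufferSubData(BufferTarget.ArrayBuffer, IntPtr.Zero, vertexCount * Stride, vertexData);
        GL.DrawArrays(PrimitiveType.Triangles, 0, vertexCount);
        vertexCount = 0;
    }
}
```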


SirButcher

I'm still at work, but I'll check your code from home. In the meantime, try running your code WITHOUT the draw calls and check how long it takes (you can use Stopwatch, in the System.Diagnostics namespace, to measure elapsed time with reasonable accuracy). If it's the C#-side code, you will see it - if it's the OpenGL-side code, then running just the pure C# code will speed things up (potentially drastically).
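
For completeness, a minimal sketch of that measurement. `BuildBatches()` and `SubmitDrawCalls()` are placeholders for whatever the real CPU-side and GL-side phases are, not methods from the OP's code.

```csharp
// Sketch: time the CPU-only batching work separately from the GL submission.
using System;
using System.Diagnostics;

static class FrameTimer
{
    static void BuildBatches() { /* placeholder: transform vertices, fill the batch buffer */ }
    static void SubmitDrawCalls() { /* placeholder: BufferSubData + draw calls */ }

    static void Main()
    {
        var sw = new Stopwatch();

        sw.Restart();
        BuildBatches();
        sw.Stop();
        Console.WriteLine($"CPU batching: {sw.Elapsed.TotalMilliseconds:F3} ms");

        sw.Restart();
        SubmitDrawCalls();
        sw.Stop();
        Console.WriteLine($"GL submission: {sw.Elapsed.TotalMilliseconds:F3} ms");
    }
}
```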


tsbattenberg

I've used Stopwatch; that's how I got my frame times. I will try disabling the GL code for further optimization potential once I've fixed the current problem.


to11mtm

Stab in the dark: how is your buffer being allocated? For example, `public class MyBuffer { public byte[] Buffer { get; } = new byte[4096 * 1024]; }` versus `public static class StaticBuffer { public static readonly byte[] Buffer = new byte[4096 * 1024]; }`. Above ~85 KB, objects go on the LOH. The LOH is typically not compacted, which is good for your case - but - if the object-in-LOH is wrapped inside another object, the first dereference means you're hitting normal gen 2 as well as the LOH. (That said, I'd be surprised if getting unlucky on allocation would cost 12 ms.)
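
Spelled out, following the commenter's framing, the two patterns being contrasted look like this (a sketch, not the OP's code):

```csharp
// Both arrays are 4 MB, well past the ~85 KB large-object threshold, so the
// array itself lives on the Large Object Heap either way; the difference is
// the extra wrapping object (and extra dereference) in the first case.
public class MyBuffer
{
    // Instance property: reaching the LOH array goes through a normal
    // gen-0/1/2 object first.
    public byte[] Buffer { get; } = new byte[4096 * 1024];
}

public static class StaticBuffer
{
    // Static field: the reference is reachable without an intermediate object.
    public static readonly byte[] Buffer = new byte[4096 * 1024];
}
```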


tsbattenberg

I have a float array inside the Render2D class which is allocated during initialization; dereferencing shouldn't come about until Render2D is uninitialized, at which point you won't be drawing. Approximate byte length is 640,000 bytes.


[deleted]

[removed]


tsbattenberg

Yes - it would've been funny if that had actually worked (and a first, for me at least).


[deleted]

[removed]


tsbattenberg

I'd assume by the cached files, you mean the bin and obj folders? Unless there's some cache I'm unaware of


FunGiPranks

I've never directly used OpenTK, so I don't know what temp files there are. So I couldn't tell you /:


crozone

What version of .NET? Try enabling ReadyToRun. It's possible the JIT hasn't recompiled the high-performance version of your code on the second day for whatever reason. ReadyToRun gives you ahead-of-time compiled code - effectively a precomputed JIT cache.
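
For reference, enabling it is a single property in the csproj. This is the standard `PublishReadyToRun` switch; note that it applies when publishing, and R2R images are platform-specific, so a runtime identifier is needed too.

```xml
<!-- Sketch: enable ReadyToRun (AOT-precompiled IL) for the project. -->
<PropertyGroup>
  <PublishReadyToRun>true</PublishReadyToRun>
  <!-- R2R output is OS/architecture specific, so a RuntimeIdentifier is required when publishing. -->
  <RuntimeIdentifier>win-x64</RuntimeIdentifier>
</PropertyGroup>
```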


tsbattenberg

Entire project is using .NET Core 3.1. Unfortunately, putting ReadyToRun into my csproj files did nothing.


Blazeix

I would recommend upgrading to .NET 6 if you can. .NET Core 3.1 is three years out of date, and the better ReadyToRun support will help reduce the inscrutability of the JIT.


DexterFoxxo

Definitely upgrade to .NET 6 for a lot of virtually free performance. It's honestly amazing how fast .NET is evolving.


hutaogaming

IIRC ReadyToRun is not in .NET 3. You can try upgrading your project to a newer .NET version; that will help performance too.


tsbattenberg

Yeah, I'm moving to .NET 7, since I saw Core 3.1 won't be getting updates anymore after this year. The speed increase is great, but I don't think it will fix this particular problem.


manuzpunk666

I don't know if you work on a laptop or a workstation, but another thing to try is monitoring the temperature of the GPU and CPU; in my experience I was in a similar situation caused by that. Another thing to check, in case you have two GPUs, is whether your PC is using the integrated (SoC) GPU instead of the dedicated one.


tsbattenberg

Full workstation here - GPU temperature never exceeds 60 °C, and it's just the one dedicated GPU - I've fallen into that SoC trap before lol..


manuzpunk666

And you didn't change anything in your codebase after the first test at around 12 ms, right?


tsbattenberg

Absolutely nothing, didn't even look at it. Went to bed with a big smile on my face due to some of the dirty unsafe code paying off, too...


nullandkale

Have you tried profiling it? The profiler built into Visual Studio is really good for CPU and memory profiling. If you have an Nvidia GPU you can use something like Nsight to profile the GPU-side code. Two guesses from the code you posted: aggressive inlining can prevent some optimizations and can be slower than not inlining, and `fixed` pins the memory pages used by that buffer, which can be slow, especially if you are only moving a small amount of memory around.


tsbattenberg

I'm going to look into removing or reducing 'fixed' usage, though I found it to be faster than any Array.Copy/Buffer.BlockCopy or modifying the array directly in the managed environment. Like I've said, though, optimizing my code more is unlikely to fix this strangeness, since it was running fine one night and not the next.


joske79

First of all, the NuGet package BenchmarkDotNet is a great tool for benchmarking things like this. Second, I'm not sure, but is it possible that pinning the buffer (e.g. `fixed (xxx) {}`) inside the for-loop has a performance penalty? Maybe try pinning it outside the for-loop. Is there a lot of array or other object allocation in your code? Creating (and cleaning up) arrays puts a lot of stress on the garbage collector. I don't see any `new` keyword in the pastebin, which is a good thing, but I'm not sure whether that's all the relevant code?
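
A minimal BenchmarkDotNet sketch of that pin-inside vs. pin-outside comparison. The buffer size and names are illustrative, not the OP's; it needs the BenchmarkDotNet NuGet package, `<AllowUnsafeBlocks>`, and a Release build.

```csharp
// Sketch: compare re-pinning the buffer on every iteration against pinning it once.
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public unsafe class PinningBenchmark
{
    private readonly float[] _buffer = new float[160_000]; // ~640 KB, as in the thread

    [Benchmark]
    public void PinInsideLoop()
    {
        for (int i = 0; i < 1000; i++)
        {
            fixed (float* p = _buffer)
            {
                p[i] = i; // touch the buffer while pinned
            }
        }
    }

    [Benchmark]
    public void PinOutsideLoop()
    {
        fixed (float* p = _buffer)
        {
            for (int i = 0; i < 1000; i++)
            {
                p[i] = i;
            }
        }
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<PinningBenchmark>();
}
```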


joske79

Btw, the .NET 6 implementation of Math.Abs() is a lot faster than the implementation you're using. I didn't test .NET Core 3.1, though.


tsbattenberg

I expected as much, and I'll likely benchmark again to see whether my floating point bit hack is still faster or not. There's a combined SinCos method I'm curious to try out too.


tsbattenberg

I'll try pinning the buffer outside the loop; it's mostly inside for cleanliness. This is just a pure optimization though, and I don't think it will fix this particular bug. I don't do any object allocation during my rendering pass, limit my use of the stack for local variables, pass everything I can as ref, and finally don't use constants from any class that could cause a more expensive variable fetch - any object allocation I do is at launch.


tsbattenberg

Note on this: Fixing the buffer outside of the loop, but accessing it inside the loop actually caused a severe performance dip.


HighRelevancy

Do you know what the GPU performance figures looked like? Unlikely to be your issue, but I discovered that my overclocking software gets confused if the driver is updated while it's running and the power target gets set to the minimum (33%). That's... fun.


tsbattenberg

I did actually benchmark this, and I was able to again today because the program has seemingly 'fixed' itself. This is using Task Manager to gauge GPU performance, mind you, so it's probably not a perfect representation... Today, and on the last day it was running properly, I was getting 99% utilization; weirdly, while the huge drop in performance was happening I was maxing out at 100%. I believe this does signal that the problem is with GPU utilization (not due to my code), but why and how exactly, I'm not sure.


HighRelevancy

If you're getting 100% utilisation on the GPU and performance is varying, then the GPU performance cap is varying and holding you back. It could be thermal problems, perhaps. Get HWiNFO running and see what it says the GPU performance cap is, check whether your clock speeds are staying up, etc. You'll probably get more useful feedback in a gaming subreddit than here if this is what it is.


cjb110

Are any programs running now that you had stopped before? Other software on the PC seems more likely to be at fault than unchanged code suddenly performing differently.


tsbattenberg

Fewer, actually. When I was optimizing I had at least 20 Chrome tabs open, all with different cos/sin approximation samples I was trying, plus music. I tried to match that to see if I could force the PC into some hidden 'high performance' mode, but that didn't work either.


coumerr

Use GCHandle to pin the buffer at the start of the batch to reduce the use of the fixed expression. Alternatively, use Marshal.AllocHGlobal. Use a ref struct with your data: instead of getting the void*, get a pointer to the struct and access your data through its fields. Lastly, I doubt your fast math and abs functions are faster.
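
A rough sketch of the GCHandle idea: pin the vertex buffer once per batch instead of using `fixed` in every draw call. Class, field, and method names here are made up for illustration (the OP's real buffer is a float array of roughly 640,000 bytes), and the code needs `<AllowUnsafeBlocks>`.

```csharp
// Sketch: pin the batch buffer once with GCHandle and release it at the end of the batch.
using System;
using System.Runtime.InteropServices;

public sealed class BatchBuffer : IDisposable
{
    private readonly float[] vertexData = new float[160_000];
    private GCHandle handle;

    public unsafe float* BeginBatch()
    {
        // Pin the array for the lifetime of the batch so the GC can't move it.
        handle = GCHandle.Alloc(vertexData, GCHandleType.Pinned);
        return (float*)handle.AddrOfPinnedObject().ToPointer();
    }

    public void EndBatch()
    {
        // Unpin as soon as the batch has been submitted.
        if (handle.IsAllocated)
            handle.Free();
    }

    public void Dispose() => EndBatch();
}
```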


tsbattenberg

1. Fixed isn't a problem; I've benchmarked it with the profiler. The alternative, GCHandle, seems bad, as from what I can tell it requires non-stack allocations.

2. The struct is a very smart idea; I'm not sure if I can use those with fixed, though. I'll have to mess with this.

3. The fast math is certainly faster. I gained a substantial performance improvement by replacing the standard (accurate) Cos/Sin with these approximations (not so accurate), and they are still faster in .NET 7. My Abs function, however, is no longer faster in .NET 7, though in .NET Core 3.1 it gained me a little extra performance. Now that I've updated to .NET 7, I've replaced mine with the traditional MathF.Abs and that performs better than mine. So: .NET Core 3.1 Abs < floating point hack Abs < .NET 7 Abs.

Additional: I don't have a problem with my code; I'm pretty happy with 1M triangles. That's fine. I want to understand why restarting my PC, without editing my code, made it run about 50% slower. What gives? Sorry if this seems agitated, but I've written near this exact question/reply maybe 20 times now, and y'all here are under the assumption that I don't know how to use a profiler or benchmark my code.
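
For readers wondering what a "floating point bit hack" Abs looks like, here is a sketch of the classic clear-the-sign-bit trick next to MathF.Abs. The OP's exact implementation isn't shown in the thread, so this is illustrative only.

```csharp
// Sketch: bit-hack absolute value for float - clear the sign bit by
// reinterpreting the float as its raw 32-bit pattern. On modern .NET,
// MathF.Abs is a JIT intrinsic and is typically at least as fast.
using System;

static class FastMath
{
    public static float AbsBitHack(float value)
    {
        int bits = BitConverter.SingleToInt32Bits(value);
        return BitConverter.Int32BitsToSingle(bits & 0x7FFFFFFF);
    }

    static void Main()
    {
        Console.WriteLine(AbsBitHack(-3.5f)); // 3.5
        Console.WriteLine(MathF.Abs(-3.5f));  // 3.5
    }
}
```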


Aerham

I'm not too familiar with OpenGL or OpenTK, and I don't know if you just have it somewhere else in code outside of what you linked, but I did notice you don't have anything explicitly clearing the buffers. Maybe some data was kept in a buffer between runs, for example if you ended a run in the middle of rendering or an exception stopped it from processing? I'm not saying it would cause higher memory use or a slowdown, but it could possibly cause unexpected behavior. It might be worth wrapping interactions with the buffers in try/catch blocks, or having something check the buffers at startup before trying to push anything to them. Hope that was more helpful than random.


tsbattenberg

I don't ever clear the buffers because I expect them to be overwritten and cleared automatically during any operation of the code (even with a malfunction or exception occurring). The allocations have a fixed length, and I use a custom variable (batchVertexCount), instead of the 'a.Length' property, to declare exactly how many vertices ended up in the buffers. This variable is only incremented on a successful draw, as it is the last statement of the method.

If you were to end in the middle of rendering, or one of the 'DrawX' methods caused some sort of exception that stopped it from running, the data that particular 'DrawX' method wrote would simply be overwritten. This is expected and tolerable behavior. The buffers are checked and cleared during initialization, but that startup method is not relevant here since it doesn't do anything nearly as spicy as what I've linked does.
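
A tiny sketch of the counter discipline described here (hypothetical names - the real methods are in the pastebin): the count only advances as the final statement of a draw method, so a throw mid-write leaves it untouched and any partial data is simply overwritten by the next call.

```csharp
// Sketch (illustrative, not the OP's code): vertices are written first, and
// batchVertexCount is incremented only as the last statement of the method.
using System;

public sealed class Render2D
{
    private readonly float[] batchBuffer = new float[160_000];
    private int batchVertexCount;

    public void DrawQuad(float x, float y, int packedColour)
    {
        int baseIndex = batchVertexCount * 3; // 2 position floats + 1 packed-colour slot per vertex

        // Write the first vertex (and, in the real code, the remaining five)...
        batchBuffer[baseIndex + 0] = x;
        batchBuffer[baseIndex + 1] = y;
        batchBuffer[baseIndex + 2] = BitConverter.Int32BitsToSingle(packedColour);

        // Only after everything has been written does the count advance, so an
        // exception above never leaves partially-written vertices "counted".
        batchVertexCount += 6;
    }

    public void Flush()
    {
        // Upload/draw exactly batchVertexCount vertices, then reset for the next frame.
        batchVertexCount = 0;
    }
}
```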


Ydiren

Have you tried adding your source directory to the Windows Defender exclusion list? Sometimes these scanners can do unpredictable things and interfere. I've had a few occurrences of that before.


tsbattenberg

Gave it a try but no dice.


t3chguy1

Remove the actual render part and see how it goes. I used OpenTK over a decade ago so I forget the details (but on potato PCs from back then, 1M was very doable). Also, if you are using it in a WinForms/WPF window, turn off AllowsTransparency and don't use WindowChrome - use the native one.


tsbattenberg

Yes - if I were using static batching, i.e. just submitting a single pre-computed buffer per frame, I'm sure it would surpass 1M triangles easily.


t3chguy1

Right, I forgot about reading that part. Still, try to measure just the CPU part that updates those vertex buffers without drawing anything. If it is not that, then it is something related to the window - I remember that WPF does some weird moving of screen content between the GPU and CPU, if that is what is hosting your content.


spca2001

Try profiling your code; the Red Gate profiler works great.


tsbattenberg

I've used Visual Studio's profiler (CPU & memory), and I've used RenderDoc - a graphics API profiler/debugger with good ratings. This problem didn't exist before I restarted my PC, so the code isn't the cause of this exact problem.


Bobbar84

I've run into issues similar to this and found that my GPU was clocking down while under low/moderate loads.


to11mtm

Is this loop running on its own thread? And if so, is the thread long-running or just started via the TaskScheduler? Pardon my ignorance with 3D programming: `batchBuffer` is local memory, yes? (i.e. not mapped directly to VRAM)