Open-Source Tool for RK3399 CPU/GPU Full Load Testing

by unikeyic in Circuits > Electronics

图片2.jpg

"Yunhan Grill Master" is an open-source program on GitHub for RK3399 chips. It stress-tests the CPU and GPU using Pthread, OpenMP, and OpenCL and displays real-time data. Testing was done on a Leez P710, and this article explains the environment configuration, the compilation steps, and how the program works. The results showed the temperature stabilizing at around 73°C, with CPU downclocking and a slight performance impact.

Project Background

图片1.jpg

I recently released an open-source program on GitHub named "Yunhan Grill Master". The repository address is shown in the picture.

This "grilling" (stress-testing) software is designed around the RK3399, covering both the big and little core clusters of its CPU and its OpenCL-capable GPU. Using Pthread, OpenMP, and OpenCL, the program keeps the processor's CPU and GPU fully loaded for extended periods. Every 5 seconds it prints the processor temperature, which CPU cores are online, the big-core frequency, the little-core frequency, and the CPU and GPU computation times (a measure of computing speed) to the terminal. You can also comment out parts of the code to adapt it to processors that lack OpenCL support or have fewer cores.
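On a processor with fewer cores, one line to adapt is the OpenMP pragma in the CPU load function, which hard-codes num_threads(6) for the RK3399's six cores. A minimal sketch, assuming Linux with glibc, where sysconf(_SC_NPROCESSORS_ONLN) reports the number of online CPUs; cpu_load_threads is a hypothetical helper, not part of the repository:

```c
#include <unistd.h>

/* Hypothetical helper: how many OpenMP threads the CPU load should use.
   Uses the number of CPUs currently online, falling back to 1 if the
   query fails. */
static int cpu_load_threads(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    return online > 0 ? (int)online : 1;
}
```

The pragma could then read #pragma omp parallel for num_threads(cpu_load_threads()), since the num_threads clause accepts any positive integer expression.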


Testbed

图片2.jpg
图片3.jpg
图片4.jpg

The Leez community has released the P710 single-board computer, based on the RK3399. A fellow member of our group has already reviewed its hardware configuration; see "Leez-P710: Firing on all cylinders."

I was also fortunate enough to try out the Leez P710. The pictures show my trial unit; note the cooling ducts I designed specifically for it.

The case is one I designed and 3D printed: a small box with ports and air ducts left open in the side walls.

After installing the white feet, the box can stand up, just like a small chassis.

Today's test will be done on the Leez P710.

Environment Configuration

The Debian firmware released by the Leez community already comes with GCC and the OpenMP library.

One thing to note, however: the Leez P710's GPU is a Mali-T860, but the Debian firmware released by the Leez community does not include the OpenCL header files, so the code fails to compile. The Mali GPU support library, which contains these headers, can be downloaded from rockchip-linux's GitHub repository with:

git clone https://github.com/rockchip-linux/libmali.git

Then copy the repository's include folder into /usr, merging it with the existing /usr/include folder.

Compile and Run

图片5.jpg
图片6.jpg

Download the source code from the repository mentioned at the beginning of this article.

Copy ickey.c and test.cl from the "GrillMaster" folder into a convenient directory, such as /home/test.

Compile with the following command:

gcc -o ickey ickey.c -lpthread -lOpenCL -fopenmp

Run it from that same directory, since the program loads test.cl by relative path:

./ickey

Once the program is running, the information it prints to the terminal in real time is shown in the screenshots. The temperature is approximately 70°C. Six CPU cores (0 to 5) are online, with the A53 little cores at 1.42 GHz and the A72 big cores at 1.8 GHz. Every core is working: there is no case of one core struggling while the others sit idle.

The CPU core frequency changes dynamically and sometimes drops to 1.42 GHz. Because all processor cores are in use, the temperature runs somewhat high, which triggers automatic downclocking.

The "cpu used = xxx" and "ocl used = xxx" lines in the screenshot are the times taken to run different algorithms on the CPU and GPU (the algorithms are given in the Program Principle Explanation section). They show whether the CPU and GPU slow down significantly when the temperature gets too high. The CPU uses floating-point multiplication and division to calculate pi, while the GPU uses bubble sort to sort integer data.
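The pi calculation is the Leibniz series, as the loop in thread 2 shows: the sign flips on every term and the denominator grows by 2, so with that loop's N = 100000000 iterations the estimate is

```latex
\pi \approx 4\sum_{k=0}^{N-1}\frac{(-1)^k}{2k+1}
    = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right),
\qquad N = 10^{8}
```

Each term costs a floating-point division, which is what makes the loop a useful CPU load; only the time taken is printed, not the value of pi.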

As the stress test continued, the temperature finally stabilized at about 73°C with slight fluctuations: the processor downclocked to shed heat, the frequency then climbed back up, the temperature rose again, and downclocking kicked in once more. The A72 core frequency dropped as low as 1.2 GHz, and the trigger temperature was approximately 72.8°C. This also affected computation speed: after downclocking, the CPU's floating-point calculations were approximately 15% slower.

Program Principle Explanation

The main function creates three threads.

Thread 1 reads and prints the processor temperature, the core frequencies, and which cores are online every 5 seconds (to check whether one core ends up doing all the work while the others sit idle). It also prints the run times measured by threads 2 and 3.

Thread 2 fully loads the CPU by running a pi calculation inside a while(1) loop.

Thread 3 fully loads the GPU by running bubble sort code inside a while(1) loop.

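    /* Headers needed by the excerpts in this section */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <pthread.h>
    #include <CL/cl.h>

    void *thread_term(void *arg);
    void *thread_cpu(void *arg);
    void *thread_ocl(void *arg);

    int main(void)
    {
        /* Thread 1: periodically print temperature, frequencies and timings */
        pthread_t tid_term;
        pthread_create(&tid_term, NULL, thread_term, NULL);

        /* Thread 2: load the CPU with OpenMP */
        pthread_t tid_cpu;
        pthread_create(&tid_cpu, NULL, thread_cpu, NULL);

        /* Thread 3: load the GPU with OpenCL */
        pthread_t tid_ocl;
        pthread_create(&tid_ocl, NULL, thread_ocl, NULL);

        sleep(1);

        /* The threads loop forever, so these joins keep main() alive */
        pthread_join(tid_term, NULL);
        pthread_join(tid_cpu, NULL);
        pthread_join(tid_ocl, NULL);
        return 0;
    }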
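Thread 1 learns about new timings through the globals timecpu/iscpu and timeocl/isocl, whose declarations are not shown. If they are plain variables, one thread writing while another reads is a data race under the C11 memory model. A minimal sketch of a race-free alternative, assuming a C11 compiler with <stdatomic.h>; timing_slot, timing_slot_init, publish_time and take_time are hypothetical names:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-value mailbox between a load thread (producer)
   and the printing thread (consumer). Both fields are atomic, so
   concurrent access from different threads is well defined. */
typedef struct {
    _Atomic double seconds;   /* most recently published run time */
    atomic_bool ready;        /* true while an unread value is waiting */
} timing_slot;

static void timing_slot_init(timing_slot *slot)
{
    atomic_init(&slot->seconds, 0.0);
    atomic_init(&slot->ready, false);
}

/* Producer: store the time first, then raise the flag. */
static void publish_time(timing_slot *slot, double seconds)
{
    atomic_store(&slot->seconds, seconds);
    atomic_store(&slot->ready, true);
}

/* Consumer: clear the flag and, if it was set, read the latest time.
   Returns false when nothing new has been published. */
static bool take_time(timing_slot *slot, double *out)
{
    if (!atomic_exchange(&slot->ready, false))
        return false;
    *out = atomic_load(&slot->seconds);
    return true;
}
```

In this scheme thread 2 would call publish_time() with its measured dt instead of setting timecpu and iscpu, and thread 1 would print only when take_time() returns true.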

Function run by thread 1 to read and print the temperature and other status information

    /* Thread 1: every 5 s, print temperature, online cores, core frequencies
       and the latest CPU/GPU timings. TEMP_PATH, CPU0_PATH, CPU4_PATH and the
       globals iscpu, isocl, timecpu, timeocl are defined elsewhere in ickey.c. */
    void *thread_term(void *arg)
    {
        int fd;
        double temp;
        while (1)
        {
            /* Temperature, reported by sysfs in millidegrees Celsius */
            char buf[20] = {0};              /* zeroed so the string stays terminated */
            fd = open(TEMP_PATH, O_RDONLY);
            read(fd, buf, sizeof(buf) - 1);
            close(fd);
            temp = atoi(buf) / 1000.0;
            printf("temperature: %.1lf\n", temp);

            /* Which CPU cores are online, e.g. "0-5" */
            system("cat /sys/devices/system/cpu/online");

            /* A53 (little core) frequency, reported in kHz, printed in GHz */
            char buf1[20] = {0};
            fd = open(CPU0_PATH, O_RDONLY);
            read(fd, buf1, sizeof(buf1) - 1);
            close(fd);
            temp = atoi(buf1) / 1000000.0;
            printf("A53 Freq: %.2lf\n", temp);

            /* A72 (big core) frequency, reported in kHz, printed in GHz */
            char buf2[20] = {0};
            fd = open(CPU4_PATH, O_RDONLY);
            read(fd, buf2, sizeof(buf2) - 1);
            close(fd);
            temp = atoi(buf2) / 1000000.0;
            printf("A72 Freq: %.2lf\n", temp);
            printf("\n");

            /* Print timings published by threads 2 and 3, then clear the flags */
            if (iscpu == 1)
            {
                printf("cpu used = %lf s\n", timecpu);
                iscpu = 0;
            }
            if (isocl == 1)
            {
                printf("ocl used = %lf s\n", timeocl);
                isocl = 0;
            }
            sleep(5);
        }
        return NULL;
    }
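Thread 1 repeats the same open/read/convert steps for each sysfs file. A minimal sketch of factoring them into one helper that also checks for errors, assuming each file holds a single decimal integer, as the temperature (millidegrees Celsius) and frequency (kHz) files do; read_sysfs_long is a hypothetical name, not part of the repository:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read one decimal integer from a sysfs-style file.
   Returns 0 and stores the value in *out on success, -1 on any error. */
static int read_sysfs_long(const char *path, long *out)
{
    char buf[32];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';                 /* read() does not terminate the string */

    char *end;
    long value = strtol(buf, &end, 10);
    if (end == buf)                /* no digits at all */
        return -1;
    *out = value;
    return 0;
}
```

For example, the temperature read would become: if (read_sysfs_long(TEMP_PATH, &v) == 0) printf("temperature: %.1f\n", v / 1000.0);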

OpenMP grill function run by thread 2 (to fully load the CPU)

    /* Thread 2: keep all six cores busy with a floating-point pi series.
       microtime() and the globals timecpu/iscpu are defined elsewhere in ickey.c. */
    void *thread_cpu(void *arg)
    {
        while (1)
        {
            double pi = 0;
            double dt;
            double start_time;
            int cnt;
            start_time = microtime();
            /* Leibniz series: pi/4 = 1 - 1/3 + 1/5 - ...
               Each term depends only on cnt, so iterations are independent;
               reduction(+:pi) gives each thread a private partial sum that
               OpenMP adds up at the end (a shared running sign and
               denominator would race between threads). */
            #pragma omp parallel for num_threads(6) reduction(+:pi)
            for (cnt = 0; cnt < 100000000; cnt++)
            {
                double s = (cnt % 2 == 0) ? 1.0 : -1.0;
                pi += s / (2.0 * cnt + 1.0);
            }
            pi = 4 * pi;
            dt = microtime() - start_time;
            timecpu = dt;
            iscpu = 1;
        }
        return NULL;
    }
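thread_cpu and thread_ocl time each round with microtime(), whose definition is not shown in these excerpts. Since the results are printed as "cpu used = %lf s", it presumably returns seconds as a double. A minimal sketch under that assumption, using POSIX clock_gettime with CLOCK_MONOTONIC so that wall-clock adjustments cannot skew the intervals (the repository's version may differ):

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Assumed behaviour: return a monotonic timestamp in seconds as a double.
   Only the difference between two calls is meaningful. */
static double microtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```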

OpenCL Grill Function Run by Thread 3 (to Fully Load the GPU)

图片7.jpg

    /* Thread 3: keep the Mali GPU busy by running the "test" kernel from
       test.cl in a loop. microtime(), get_ocl_string() and the globals
       timeocl/isocl are defined elsewhere in ickey.c. */
    void *thread_ocl(void *arg)
    {
        int array_a[20] = { 0, 1, 8, 7, 6, 2, 3, 5, 4, 9,
                            17, 19, 15, 10, 18, 16, 14, 13, 12, 11 };
        int array_b[20] = { 0 };
        size_t datasize = 20 * sizeof(int);
        size_t ocl_string_size;
        char *ocl_string;
        double start_time, dt, dt_err;

        /* Overhead of one microtime() call, subtracted from each measurement */
        start_time = microtime();
        dt_err = microtime() - start_time;

        /* Buffer that get_ocl_string() fills with the kernel source */
        ocl_string = (char *)malloc(400);

        cl_platform_id platform_id;
        cl_device_id device_id;
        cl_context context;
        cl_command_queue command_queue;
        cl_mem buffer_a, buffer_b;
        cl_program program;
        cl_kernel kernel;
        cl_event kernelEvent;

        /* First platform, first GPU device */
        clGetPlatformIDs(1, &platform_id, NULL);
        clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);

        context = clCreateContext(NULL, 1, &device_id, NULL, NULL, NULL);
        command_queue = clCreateCommandQueue(context, device_id, 0, NULL);

        /* a is input only; b receives the kernel's output, so it must be writable */
        buffer_a = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, NULL);
        buffer_b = clCreateBuffer(context, CL_MEM_READ_WRITE, datasize, NULL, NULL);

        ocl_string_size = get_ocl_string("test.cl", ocl_string);
        clEnqueueWriteBuffer(command_queue, buffer_a, CL_FALSE, 0,
                             datasize, array_a, 0, NULL, NULL);
        clEnqueueWriteBuffer(command_queue, buffer_b, CL_FALSE, 0,
                             datasize, array_b, 0, NULL, NULL);
        program = clCreateProgramWithSource(context, 1, (const char **)&ocl_string,
                                            &ocl_string_size, NULL);

        clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
        kernel = clCreateKernel(program, "test", NULL);

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_a);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_b);

        size_t global_work_size[1] = { 20 };
        while (1)
        {
            start_time = microtime();
            clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                   global_work_size, NULL, 0, NULL, &kernelEvent);
            clWaitForEvents(1, &kernelEvent);
            clReleaseEvent(kernelEvent);   /* one event per launch; free it */
            clEnqueueReadBuffer(command_queue, buffer_b, CL_TRUE, 0,
                                datasize, array_b, 0, NULL, NULL);
            dt = microtime() - start_time - dt_err;
            timeocl = dt;
            isocl = 1;
        }

        /* Not reached while the loop runs forever; kept for completeness */
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(command_queue);
        clReleaseMemObject(buffer_a);
        clReleaseMemObject(buffer_b);
        clReleaseContext(context);
        free(ocl_string);
        return NULL;
    }
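get_ocl_string() is called with the file name and the 400-byte buffer and returns the source length, but its body is not shown. A minimal sketch of a bounded equivalent, assuming it should copy the whole file into the caller's buffer and return the number of bytes read; the name load_kernel_source and the explicit capacity argument are hypothetical, added so the buffer cannot be overrun if test.cl grows:

```c
#include <stdio.h>

/* Hypothetical bounded loader: copy the file at `path` into `buf`
   (capacity `cap`), NUL-terminate it, and return the byte count.
   Returns 0 if the file cannot be opened, is empty, or does not fit. */
static size_t load_kernel_source(const char *path, char *buf, size_t cap)
{
    if (cap == 0)
        return 0;
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;
    size_t n = fread(buf, 1, cap - 1, fp);
    /* Buffer filled completely and more data remains: file too large */
    int too_big = (n == cap - 1) && fgetc(fp) != EOF;
    fclose(fp);
    if (too_big)
        return 0;
    buf[n] = '\0';
    return n;
}
```

The returned count can be passed as the length argument of clCreateProgramWithSource, just as ocl_string_size is in thread_ocl.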

Because the program uses OpenCL, it also needs an OpenCL kernel source file, test.cl, which thread 3 loads and builds into the kernel named "test".
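The kernel in test.cl is not reproduced in the text. Since the GPU side bubble-sorts the 20-element integer array, a plain C bubble sort makes a convenient host-side reference: after clEnqueueReadBuffer returns, array_b can be compared against a CPU-sorted copy of array_a. This sketch assumes the kernel writes array_a sorted in ascending order into buffer_b; bubble_sort and matches_cpu_sort are hypothetical helpers:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Classic bubble sort, ascending: repeatedly swap adjacent out-of-order
   elements, stopping early once a full pass makes no swaps. */
static void bubble_sort(int *a, size_t n)
{
    for (size_t pass = 0; pass + 1 < n; pass++)
    {
        bool swapped = false;
        for (size_t j = 0; j + 1 < n - pass; j++)
        {
            if (a[j] > a[j + 1])
            {
                int tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
                swapped = true;
            }
        }
        if (!swapped)
            break;
    }
}

/* Hypothetical check: does gpu_out equal `input` sorted on the CPU?
   n must not exceed 64, the size of the scratch copy. */
static bool matches_cpu_sort(const int *input, const int *gpu_out, size_t n)
{
    int ref[64];
    if (n > 64)
        return false;
    memcpy(ref, input, n * sizeof(int));
    bubble_sort(ref, n);
    return memcmp(ref, gpu_out, n * sizeof(int)) == 0;
}
```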


Summary

The group member's review cited earlier ("Leez-P710: Firing on all cylinders") also shows screenshots of a stress test run with StressAppTest.

Those screenshots show the frequency steps ("gears") that the RK3399's A72 big cores pass through when downclocking: 1.8 GHz, 1.6 GHz, 1.4 GHz, 1.2 GHz, and so on.

In this test, once the RK3399 reached approximately 73°C, the A72 core frequency dropped three gears to 1.2 GHz and then remained stable.
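The "gears" are the discrete frequencies cpufreq can select. A small sketch of counting how far the current frequency sits below the top gear, assuming a space-separated list of available frequencies in kHz such as Linux cpufreq exposes in scaling_available_frequencies; parse_freq_list and gears_below_max are hypothetical names. Counting the listed frequencies above the current one makes the result independent of the list's order: with an illustrative list ending in 1200000 1416000 1608000 1800000, a current value of 1200000 is three gears below the maximum.

```c
#include <stddef.h>
#include <stdlib.h>

/* Parse a space-separated list of integers (e.g. frequencies in kHz)
   into out[], storing at most max entries. Returns the count parsed. */
static size_t parse_freq_list(const char *text, long *out, size_t max)
{
    size_t count = 0;
    char *end;
    while (count < max)
    {
        long v = strtol(text, &end, 10);
        if (end == text)
            break;               /* no more numbers */
        out[count++] = v;
        text = end;
    }
    return count;
}

/* Number of listed frequencies above cur_khz, i.e. how many gears it is
   below the maximum; -1 if cur_khz is not one of the listed values. */
static int gears_below_max(const long *freqs_khz, size_t count, long cur_khz)
{
    int found = 0;
    int higher = 0;
    for (size_t k = 0; k < count; k++)
    {
        if (freqs_khz[k] == cur_khz)
            found = 1;
        else if (freqs_khz[k] > cur_khz)
            higher++;
    }
    return found ? higher : -1;
}
```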

Overall, the "Yunhan Grill Master" results did not differ significantly from expectations. Some overheating-induced downclocking did occur, but not to the same degree as on the Raspberry Pi 3.

The performance impact of running at 1.2 GHz is tolerable. In practice, the CPU and GPU are rarely both fully loaded at the same time, so the A72 cores can usually hold 1.8 GHz for extended periods.