Given the lack of answers, I thought I would share what ended up working very well for me.
I made mexFunction() itself do almost nothing, just some basic validation of the inputs and outputs. I put all of the C functionality I cared about in separate myAlgorithm.h / myAlgorithm.c files, with one public function declared in the header and a prototype similar to:
mxArray* my_algorithm(mxArray *arg1, mxArray *arg2);
After validating the arguments, mexFunction() calls my_algorithm().
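As a rough sketch of what such a thin wrapper might look like: the specific checks (two real double inputs, at most one output) and the error identifiers are assumptions made up for illustration, not the validation I actually used. It also assumes myAlgorithm.h declares my_algorithm() with the prototype shown above.

```c
/* myAlgorithmMex.c (hypothetical file name): thin MEX gateway only. */
#include "mex.h"
#include "myAlgorithm.h"   /* declares: mxArray* my_algorithm(mxArray*, mxArray*); */

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /* Illustrative validation only; adapt to what the algorithm really needs. */
    if (nrhs != 2) {
        mexErrMsgIdAndTxt("myAlgorithm:nrhs", "Two inputs are required.");
    }
    if (nlhs > 1) {
        mexErrMsgIdAndTxt("myAlgorithm:nlhs", "At most one output is supported.");
    }
    if (!mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) ||
        !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1])) {
        mexErrMsgIdAndTxt("myAlgorithm:type", "Inputs must be real double arrays.");
    }

    /* prhs is const, while the prototype above takes non-const pointers,
       so a cast is needed; the algorithm is assumed not to modify its inputs. */
    plhs[0] = my_algorithm((mxArray *)prhs[0], (mxArray *)prhs[1]);
}
```

With this split, the mex* calls stay in the wrapper and the shared code only needs the mx* functions, which is what lets the same myAlgorithm.c be linked into a standalone program.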
For profiling, I created a completely separate testMyAlgorithm.c file. It uses MATLAB Engine to load a *.mat file containing the input data and to copy that data, as mxArray pointers, into the testMyAlgorithm process. It then calls the my_algorithm() function containing the common code. This profilable C file is compiled with:
mex -g -client engine testMyAlgorithm.c myAlgorithm.c
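For reference, a minimal sketch of what a testMyAlgorithm.c along these lines could look like. The file name testInput.mat and the variable names arg1 and arg2 are hypothetical placeholders, and the sketch assumes my_algorithm() returns a newly allocated mxArray that the caller owns.

```c
/* testMyAlgorithm.c: standalone driver, sketch only. */
#include <stdio.h>
#include <stdlib.h>
#include "engine.h"        /* MATLAB Engine API; also pulls in matrix.h */
#include "myAlgorithm.h"   /* declares: mxArray* my_algorithm(mxArray*, mxArray*); */

int main(void)
{
    /* Start a MATLAB session with the default command. */
    Engine *ep = engOpen(NULL);
    if (ep == NULL) {
        fprintf(stderr, "Could not start MATLAB Engine\n");
        return EXIT_FAILURE;
    }

    /* Load the test data into the engine workspace
       (testInput.mat is a hypothetical file name). */
    engEvalString(ep, "load('testInput.mat');");

    /* Copy the variables into this process as mxArray pointers
       (arg1 and arg2 are hypothetical variable names). */
    mxArray *arg1 = engGetVariable(ep, "arg1");
    mxArray *arg2 = engGetVariable(ep, "arg2");
    if (arg1 == NULL || arg2 == NULL) {
        fprintf(stderr, "Input variables not found in workspace\n");
        if (arg1 != NULL) mxDestroyArray(arg1);
        if (arg2 != NULL) mxDestroyArray(arg2);
        engClose(ep);
        return EXIT_FAILURE;
    }

    /* The code under test: the same function the MEX wrapper calls. */
    mxArray *result = my_algorithm(arg1, arg2);
    if (result != NULL) {
        printf("Result has %zu elements\n", mxGetNumberOfElements(result));
        mxDestroyArray(result);  /* assumes the result is a fresh array */
    }

    mxDestroyArray(arg1);
    mxDestroyArray(arg2);
    engClose(ep);
    return EXIT_SUCCESS;
}
```

Once the data has been copied out, the engine is only needed for loading, so the profiler output for my_algorithm() reflects the C code itself rather than MATLAB.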
Then to profile the code I used something similar to:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/MATLAB/R2016a/bin/glnxa64/ valgrind --tool=callgrind ./testMyAlgorithm
This approach let me run the code on its own as well as under gdb, valgrind, callgrind, cachegrind, etc.
Essentially, it is just C code. MATLAB Engine and the mx library are used for convenience, so that common code can be shared between a MEX function and a pure C program, and so that both can read input test data from MATLAB in the same way.
If anyone else has an easier way of doing this, I would appreciate learning about it.