How to fix GCC compilation error when compiling >2 GB of code?
I have a huge number of functions totaling around 2.8 GB of object code (unfortunately there's no way around it, scientific computing ...)
When I try to link them, I get the (expected) relocation truncated to fit: R_X86_64_32S errors, which I hoped to circumvent by specifying the compiler flag -mcmodel=medium. All additional libraries that I link in and have control of are compiled with the -fpic flag.
Still, the error persists, and I assume that some libraries I link to are not compiled with PIC.
Here's the error:
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini' defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x19): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_init' defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o: In function `call_gmon_start':
(.text+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol `__gmon_start__'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbegin.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0xb): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x13): relocation truncated to fit: R_X86_64_32 against symbol `__DTOR_END__' defined in .dtors section in /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
crtstuff.c:(.text+0x19): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x28): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x38): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x3f): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x46): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x51): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
make: *** [testsme] Error 1
And system libraries I link against:
-lgfortran -lm -lrt -lpthread
Any clues where to look for the problem?
EDIT:
First of all, thank you for the discussion...
To clarify a bit, I have hundreds of functions (each approx. 1 MB in size, in separate object files) like this:
double func1(std::tr1::unordered_map<int, double> & csc,
std::vector<EvaluationNode::Ptr> & ti,
ProcessVars & s)
{
double sum, prefactor, expr;
prefactor = +s.ds8*s.ds10*ti[0]->value();
expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -
27/10.*s.x14*s.x15*csc[49304] + 12/5.*s.x14*s.x15*csc[49305] -
3/10.*s.x14*s.x15*csc[49306] - 4/5.*s.x14*s.x15*csc[49307] +
21/10.*s.x14*s.x15*csc[49308] + 1/10.*s.x14*s.x15*csc[49309] -
s.x14*s.x15*csc[51370] - 9/10.*s.x14*s.x15*csc[51371] -
1/10.*s.x14*s.x15*csc[51372] + 3/5.*s.x14*s.x15*csc[51373] +
27/10.*s.x14*s.x15*csc[51374] - 12/5.*s.x14*s.x15*csc[51375] +
3/10.*s.x14*s.x15*csc[51376] + 4/5.*s.x14*s.x15*csc[51377] -
21/10.*s.x14*s.x15*csc[51378] - 1/10.*s.x14*s.x15*csc[51379] -
2*s.x14*s.x15*csc[55100] - 9/5.*s.x14*s.x15*csc[55101] -
1/5.*s.x14*s.x15*csc[55102] + 6/5.*s.x14*s.x15*csc[55103] +
27/5.*s.x14*s.x15*csc[55104] - 24/5.*s.x14*s.x15*csc[55105] +
3/5.*s.x14*s.x15*csc[55106] + 8/5.*s.x14*s.x15*csc[55107] -
21/5.*s.x14*s.x15*csc[55108] - 1/5.*s.x14*s.x15*csc[55109] -
2*s.x14*s.x15*csc[55170] - 9/5.*s.x14*s.x15*csc[55171] -
1/5.*s.x14*s.x15*csc[55172] + 6/5.*s.x14*s.x15*csc[55173] +
27/5.*s.x14*s.x15*csc[55174] - 24/5.*s.x14*s.x15*csc[55175] +
// ...
;
sum += prefactor*expr;
// ...
return sum;
}
The object s is relatively small and keeps the needed constants x14, x15, ..., ds0, ..., etc., while ti just returns a double from an external library. As you can see, csc[] is a precomputed map of values which is also evaluated in separate object files (again hundreds, each about 1 MB in size) of the following form:
void cscs132(std::tr1::unordered_map<int,double> & csc, ProcessVars & s)
{
{
double csc19295 = + s.ds0*s.ds1*s.ds2 * ( -
32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -
32*s.x12pow2*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x25*s.x35*s.x45*s.mWpowinv2 +
32*s.x12pow2*s.x34*s.mbpow4*s.mWpowinv2 +
32*s.x12pow2*s.x34*s.x35*s.mbpow2*s.mWpowinv2 +
32*s.x12pow2*s.x34*s.x45*s.mbpow2*s.mWpowinv2 +
32*s.x12pow2*s.x35*s.mbpow4*s.mWpowinv2 +
32*s.x12pow2*s.x35pow2*s.mbpow2*s.mWpowinv2 +
32*s.x12pow2*s.x35pow2*s.x45*s.mWpowinv2 +
64*s.x12pow2*s.x35*s.x45*s.mbpow2*s.mWpowinv2 +
32*s.x12pow2*s.x35*s.x45pow2*s.mWpowinv2 -
64*s.x12*s.p1p3*s.x15*s.mbpow4*s.mWpowinv2 +
64*s.x12*s.p1p3*s.x15pow2*s.mbpow2*s.mWpowinv2 +
96*s.x12*s.p1p3*s.x15*s.x25*s.mbpow2*s.mWpowinv2 -
64*s.x12*s.p1p3*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
64*s.x12*s.p1p3*s.x15*s.x45*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.p1p3*s.x25*s.mbpow4*s.mWpowinv2 +
32*s.x12*s.p1p3*s.x25pow2*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.p1p3*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.p1p3*s.x25*s.x45*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.p1p3*s.x45*s.mbpow2 +
64*s.x12*s.x14*s.x15pow2*s.x35*s.mWpowinv2 +
96*s.x12*s.x14*s.x15*s.x25*s.x35*s.mWpowinv2 +
32*s.x12*s.x14*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.x14*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
64*s.x12*s.x14*s.x15*s.x35pow2*s.mWpowinv2 -
32*s.x12*s.x14*s.x15*s.x35*s.x45*s.mWpowinv2 +
32*s.x12*s.x14*s.x25pow2*s.x35*s.mWpowinv2 +
32*s.x12*s.x14*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12*s.x14*s.x25*s.x35pow2*s.mWpowinv2 -
// ...
csc.insert(cscMap::value_type(192953, csc19295));
}
{
double csc19296 = // ... ;
csc.insert(cscMap::value_type(192956, csc19296));
}
// ...
}
That's about it. The final step then just consists of calling all those func[i] and summing up the results.
Concerning the fact that this is a rather special and unusual case: Yes, it is. This is what people have to cope with when trying to do high precision computations for particle physics.
EDIT2:
I should also add that x12, x13, etc. are not really constants. They are set to specific values, all those functions are run and the result returned, and then a new set of x12, x13, etc. is chosen to produce the next value. And this has to be done 10^5 to 10^6 times...
EDIT3:
Thank you for the suggestions and the discussion so far... I'll try to roll the loops up upon code generation somehow. To be honest, I'm not sure how to do this exactly, but it is the best bet.
BTW, I didn't try to hide behind "this is scientific computing -- no way to optimize".
It's just that the basis for this code is something that comes out of a "black box" to which I have no real access. Moreover, the whole thing worked great with simple examples, and I mainly feel overwhelmed by what happens in a real-world application...
EDIT4:
So, I have managed to reduce the code size of the csc definitions by about one fourth by simplifying expressions in a computer algebra system (Mathematica). I now also see a way to reduce it by another order of magnitude or so by applying some other tricks before generating the code (which would bring this part down to about 100 MB), and I hope this idea works.
Now related to your answers:
I'm trying to roll the loops back up again in the funcs, where a CAS won't help much, but I already have some ideas. For instance, sorting the expressions by variables like x12, x13, ..., parsing the cscs with Python and generating tables that relate them to each other. Then I can at least generate these parts as loops. As this seems to be the best solution so far, I mark this as the best answer.
However, I'd also like to give credit to VJo. GCC 4.6 indeed works much better, produces smaller code and is faster. Using the large model works with the code as-is. So technically this is the correct answer, but changing the whole concept is a much better approach.
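For reference, a hedged sketch of what switching code models might look like in a Makefile. According to the GCC x86 options documentation, -mcmodel=medium still links the program's code into the lower 2 GB of the address space and only allows large data objects to be placed above it, whereas -mcmodel=large makes no assumptions about addresses and section sizes. The variable names below are the conventional make ones and are an assumption, not taken from the original build.

```make
# Hypothetical Makefile fragment: compile and link with the large code model.
# CXXFLAGS/LDFLAGS are the conventional variable names (assumed here).
CXXFLAGS += -mcmodel=large
LDFLAGS  += -mcmodel=large
```

Note that this only removes the 2 GB addressing limit; it does nothing about the compile time or the size of the binary itself.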
Thank you all for your suggestions and help. If anyone is interested, I'm going to post the final outcome as soon as I am ready.
REMARKS:
Just some remarks on some of the other answers: The code I'm trying to run does not originate in an expansion of simple functions/algorithms and stupid unnecessary unrolling. What actually happens is that the stuff we start with is pretty complicated mathematical objects, and bringing them to a numerically computable form generates these expressions. The problem actually lies in the underlying physical theory. The complexity of intermediate expressions scales factorially, which is well known, but when combining all of this stuff into something physically measurable -- an observable -- it just boils down to only a handful of very small functions that form the basis of the expressions. (There is definitely something "wrong" in this respect with the general and only available ansatz, which is called "perturbation theory".) We try to bring this ansatz to another level, which is not feasible analytically anymore and where the basis of needed functions is not known. So we try to brute-force it like this. Not the best way, but hopefully one that helps with our understanding of the physics at hand in the end...
LAST EDIT:
Thanks to all your suggestions, I've managed to reduce the code size considerably, using Mathematica and a modification of the code generator for the funcs somewhat along the lines of the top answer :)
I have simplified the csc functions with Mathematica, bringing them down to 92 MB. This is the irreducible part. The first attempts took forever, but after some optimizations this now runs through in about 10 minutes on a single CPU.
The effect on the funcs was dramatic: The whole code size for them is down to approximately 9 MB, so the code now totals in the 100 MB range. Now it makes sense to turn optimizations on and the execution is quite fast.
Again, thank you all for your suggestions, I've learned a lot.
SOLUTION:
So, you already have a program that produces this text:
prefactor = +s.ds8*s.ds10*ti[0]->value();
expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -...
and
double csc19295 = + s.ds0*s.ds1*s.ds2 * ( -
32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -...
right?
If all your functions have a similar "format" (multiply n numbers m times and add the results - or something similar) then I think you can do this:
- change the generator program to output offsets instead of strings (i.e. instead of the string "s.ds0" it will produce offsetof(ProcessVars, ds0))
- create an array of such offsets
- write an evaluator which accepts the array above and the base addresses of the structure pointers and produces a result
The array+evaluator will represent the same logic as one of your functions, but only the evaluator will be code. The array is "data" and can be either generated at runtime or saved on disk and read in chunks or with a memory-mapped file.
For your particular example in func1, imagine how you would rewrite the function via an evaluator if you had access to the base addresses of s and csc, and also a vector-like representation of the constants and the offsets you need to add to the base addresses to get to x14, ds8 and csc[51370].
You need to create a new form of "data" that will describe how to process the actual data you pass to your huge number of functions.