Yesterday’s post on space-saving techniques generated a few comments. It seemed worthwhile to highlight some of them for a wider audience.
- Various people have pointed out that clang and GCC support a -Wpadded option to warn when padding is necessary inside of a structure. Visual C++ supports warning C4820, which does the same thing. That warning is off by default; you can enable it by passing /w14820 on the compiler command line (/we4820 also turns it on, but reports it as an error). I’m fairly certain this warning would generate a lot of output, but it might be worthwhile to comb through the output and see if anything interesting turns up.
- David Major pointed out the /d1reportAllClassLayout switch for Visual C++, which prints the class layout of all the classes in the compilation unit. If you’re only interested in a single class, you can use /d1reportSingleClassLayout$NAME to narrow the report down to the class with $NAME. GCC used to have something similar in -fdump-class-hierarchy, but that option has been removed.
- Benoit Girard asked if he could see a list of the 50 largest things on a Linux build. Forthwith, I present the 50 largest objects and the 50 largest functions in libxul for an x86-64 Linux optimized build. One thing to note about the objects is that they’re not all in the same section; the seventh field in readelf output tells you the section index. So for the list of objects linked above, section 15 is .rodata (read-only, shareable data), section 22 is .data (read-write, non-shareable data), section 27 is .data.rel.ro (data that needs to have relocations applied at load time, but can be read-only thereafter, e.g. virtual function tables), and section 29 is .bss (zero-initialized memory). Unsurprisingly, string encoding/decoding tables are the bulk of the large objects, with various bits from WebRTC, JavaScript, and the Gecko profiler also making an appearance. Among the functions, media codec-related code appears to take up a large amount of space, along with some JavaScript functions and a few things related to the HTML parser.
- A commenter by the name of “AnimalFriend” correctly pointed out that what you really want to know is which structures both have a lot of instances hanging around and have holes that you could fill. I don’t know of a good way to answer the first part without adding a lot of instrumentation (though perhaps you could catch a lot of cases by augmenting the MOZ_COUNT_CTOR macro to tell you which structures get allocated a lot). The second part can be answered by something like pahole.
- Alternatively, you could use something like access_profiler to tell you which fields in your objects get accessed and how often, and then carefully pack those fields into the same cache line. The techniques access_profiler uses are also applicable to counting allocations of individual objects. Maybe we should start using something more access_profiler-like instead of MOZ_COUNT_CTOR and friends! It would definitely be more C++-ish and more flexible, and it would eliminate the need to write the corresponding MOZ_COUNT_DTOR.
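
To make the kind of hole that -Wpadded, C4820, and pahole report a bit more concrete, here is a minimal sketch. The two structures and the helper PaddingBytes are hypothetical and are not taken from Gecko; the exact sizes depend on the ABI, so the comments only describe what a typical x86-64 build does.

```cpp
#include <cstddef>

// Hypothetical structure (not from Gecko): the declaration order forces
// the compiler to insert padding before and after the double.
struct Padded {
  bool mFlag;     // 1 byte, typically followed by 7 bytes of padding
  double mValue;  // usually 8-byte aligned on x86-64
  bool mOther;    // 1 byte, typically followed by 7 bytes of tail padding
};

// Same members with the most strictly aligned one first, so the two
// bools can sit next to each other.
struct Reordered {
  double mValue;
  bool mFlag;
  bool mOther;
};

// Bytes in a structure that are not accounted for by its members.
constexpr std::size_t PaddingBytes(std::size_t aStructSize,
                                   std::size_t aMemberBytes) {
  return aStructSize - aMemberBytes;
}

constexpr std::size_t kMemberBytes = sizeof(double) + 2 * sizeof(bool);
constexpr std::size_t kPaddedHoles =
    PaddingBytes(sizeof(Padded), kMemberBytes);
constexpr std::size_t kReorderedHoles =
    PaddingBytes(sizeof(Reordered), kMemberBytes);

// Reordering alone should shrink the structure on common ABIs.
static_assert(sizeof(Reordered) < sizeof(Padded),
              "expected reordering to remove padding");
```

Compiling this with -Wpadded on clang or GCC should report the padding in Padded. Reordered still carries some tail padding, so it will likely be flagged as well, which hints at how noisy the warning could be on a large codebase.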
So the table you posted kind of proves my suspicion that the method can only find static objects?
That’s correct.
Hmm. So looking at the largest functions, mozilla::dom::Register is basically registering all the WebIDL bindings. We could try to convert this from a bunch of function calls to a data table and a single function call. Worthwhile?
Hm. Register is ~27K of code. There are ~500 calls to RegisterDefineDOMInterface in that function, which means that each call is ~50 bytes. A data table would be 24 bytes/entry (8 bytes for the DefineInterface hook, 8 bytes for the ConstructorEnabled pointer, 8 full bytes for the name pointer or index into a global table). Then with a relocation for the DefineInterface hook and the ConstructorEnabled pointer (24 bytes each), you’re at 24 + 2 × 24 = 72 bytes/entry, already over 50 bytes/entry. I haven’t looked at ARM, but the considerations are probably similar…though relocations might be discounted somewhat because of elfhack.
You might be able to save some space by sticking all the strings in a table and just indexing out of that; that would save you some space on ARM (and x86?) because you’d only have one pointer to refer to for the PC-relative load to get at the strings instead of N pointers for individual strings. Maybe?
(I realize that not all the calls are for RegisterDefineDOMInterface. But the considerations for RegisterNavigatorDOMConstructor are similar.)
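
The size estimate in this reply can be sketched in code. Everything named here (BindingEntry, the hook typedefs, RegisterOne, RegisterAll, the sample entries) is a hypothetical stand-in, since the real signatures are not shown in this thread; the only figures taken from the discussion are the three 8-byte fields and the 24-byte relocation, which matches the size of an Elf64_Rela record.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the real hook types.
using DefineInterfaceHook = void (*)();
using ConstructorEnabledHook = bool (*)();

// One table entry: two function pointers plus a name pointer,
// i.e. 24 bytes on an LP64 target such as x86-64 Linux.
struct BindingEntry {
  DefineInterfaceHook mDefine;
  ConstructorEnabledHook mEnabled;
  const char16_t* mName;
};

// Relocation size quoted in the comment above (an Elf64_Rela is 24 bytes).
constexpr std::size_t kRelocationBytes = 24;

// Entry size plus one relocation per pointer that needs fixing up at load.
constexpr std::size_t EstimatedBytesPerEntry(std::size_t aRelocatedPointers) {
  return sizeof(BindingEntry) + aRelocatedPointers * kRelocationBytes;
}

// Made-up hooks and entries, purely for illustration.
inline void DefineNothing() {}
inline bool AlwaysEnabled() { return true; }

constexpr BindingEntry kBindings[] = {
    {DefineNothing, AlwaysEnabled, u"Foo"},
    {DefineNothing, nullptr, u"Bar"},
    {DefineNothing, AlwaysEnabled, u"Baz"},
};
constexpr std::size_t kBindingCount = sizeof(kBindings) / sizeof(kBindings[0]);

// Hypothetical registration function; here it only counts its calls.
std::size_t gRegistered = 0;
inline void RegisterOne(const char16_t*, DefineInterfaceHook,
                        ConstructorEnabledHook) {
  ++gRegistered;
}

// A single loop over the table replaces one call site per interface.
inline std::size_t RegisterAll() {
  for (const BindingEntry& entry : kBindings) {
    RegisterOne(entry.mName, entry.mDefine, entry.mEnabled);
  }
  return gRegistered;
}
```

On a 64-bit target, EstimatedBytesPerEntry(2) comes to 72 bytes, which is the reason the table does not obviously beat ~50 bytes of code per call. Counting a third relocation for the name pointer would only make the table look worse, which is where the idea of indexing into a single string table comes from.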
It looks like we are calling nsString’s constructor and destructor for every RegisterDefineDOMInterface call in that function…surely we can do better than that!
You could pass const char16_t* strings instead of const nsString& strings there; that would save you some constructions and destructions, assuming that all of the strings in question are really literals.
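
As a rough illustration of where those constructions and destructions come from, here is a sketch that uses a hypothetical CountingString class in place of nsString. It does not model the real string classes; it only shows that passing a literal to a const-reference parameter builds and destroys a temporary, while passing it as a pointer does not. The function names are made up as well.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counters for the hypothetical string class below.
int gConstructions = 0;
int gDestructions = 0;

// Hypothetical stand-in for nsString; it is implicitly constructible
// from a literal so that it can bind to a const reference parameter.
class CountingString {
 public:
  CountingString(const char16_t* aLiteral) : mData(aLiteral) {
    ++gConstructions;
  }
  ~CountingString() { ++gDestructions; }
  std::size_t Length() const { return mData.size(); }

 private:
  std::u16string mData;
};

// Taking a string object: a literal argument creates a temporary.
inline std::size_t RegisterByString(const CountingString& aName) {
  return aName.Length();
}

// Taking a raw pointer: no string object is created for a literal.
inline std::size_t RegisterByPointer(const char16_t* aName) {
  return std::char_traits<char16_t>::length(aName);
}
```

Calling RegisterByString(u"Window") bumps both counters once per call, whereas RegisterByPointer(u"Window") leaves them untouched. With ~500 call sites, that is ~500 constructor and destructor calls that the pointer version would avoid, assuming the callee does not need a string object anyway.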
Indeed, this is https://bugzilla.mozilla.org/show_bug.cgi?id=965634
This might be a stupid comment, but don’t modern x86 CPUs have full speed unaligned access? (except when crossing cache lines or page boundaries or somesuch)
The really fast decompressors (lz4, recent lzo) use unaligned reads on both x86 and ARM with excellent results.
What would be the performance impact for disabling alignment throughout?
x86 supports unaligned reads for integer and scalar floating-point operands; for our purposes, vector operands must be aligned properly.
The situation is more complicated for ARM; your system has to be properly set up so unaligned accesses don’t trap. Not all memory instructions support unaligned accesses, and unaligned accesses are not guaranteed to run at full speed on ARMv6 and ARMv7. I also believe ARM prior to ARMv7 was not required to support unaligned accesses.
Disabling alignment throughout would be complicated: you’d have to ensure that the alignment requirements of system data structures are not altered. Some CPU architectures may support unaligned access for integer data, but not floating-point data, so you have to watch out for that. (I’m thinking particularly of the PowerPC; while it’s not a Tier 1 platform, it doesn’t make much sense to gratuitously break support for it or similar processors.) The GCC versions we use on ARM are not always so clever about supporting unaligned data access, partly because of the above discussion, so performance might get quite a bit worse and code size might increase for Android and B2G. Some CPU architectures don’t support unaligned access efficiently at all; gratuitously breaking support for them doesn’t seem very friendly. (And we might wind up supporting one of them one day officially and I sure don’t want to have to do the work of making sure everything is re-aligned properly…)
In short, it’s just not worth it. You can alter alignment requirements of individual data structures with #pragma pack or similar, but given all the above caveats, it’s probably best not to do that unless you have a really compelling case otherwise.
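
For reference, this is roughly what per-structure packing looks like, together with the memcpy idiom that is commonly used to read unaligned data portably. It is only a sketch: the structures and the helpers LoadU64 and RoundTrip are hypothetical, and the sizes in the comments assume a typical ABI where a 64-bit integer is 8-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Default layout: the payload is aligned, at the cost of padding
// after the tag (16 bytes in total on a typical x86-64 ABI).
struct Natural {
  std::uint8_t mTag;
  std::uint64_t mPayload;
};

// Packed layout: no padding, but mPayload now lives at offset 1 and
// every access to it is an unaligned access.
#pragma pack(push, 1)
struct Packed {
  std::uint8_t mTag;
  std::uint64_t mPayload;
};
#pragma pack(pop)

// Portable unaligned load: memcpy lets the compiler choose whatever
// instruction sequence is legal on the target.
inline std::uint64_t LoadU64(const void* aSource) {
  std::uint64_t value;
  std::memcpy(&value, aSource, sizeof(value));
  return value;
}

// Stores a value at an odd offset in a byte buffer and reads it back.
inline std::uint64_t RoundTrip(std::uint64_t aValue) {
  unsigned char buffer[sizeof(std::uint64_t) + 1] = {};
  std::memcpy(buffer + 1, &aValue, sizeof(aValue));
  return LoadU64(buffer + 1);
}
```

The packed structure saves space, but on targets without cheap unaligned access the compiler may have to assemble mPayload from several narrower loads, which is the code size and performance cost described above.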
Whoa, that Java->C++ conversion produces some really big “static” initialisation functions. Good thing they only have to run once…
FWIW, this is https://bugzilla.mozilla.org/show_bug.cgi?id=959331