summaryrefslogtreecommitdiff
path: root/src/string/i386
AgeCommit message (Collapse)AuthorFilesLines
2015-04-18remove the last of possible-textrels from i386 asmRich Felker2-1/+5
none of these are actual textrels because of ld-time binding performed by -Bsymbolic-functions, but I'm changing them with the goal of making ld-time binding purely an optimization rather than relying on it for semantic purposes. in the case of memmove's call to memcpy, making it explicit that the memmove asm is assuming the forward-copying behavior of the memcpy asm is desirable anyway; in case memcpy is ever changed, the semantic mismatch would be apparent while editing memmcpy.s.
2015-02-26overhaul optimized i386 memset asmRich Felker1-32/+61
on most cpu models, "rep stosl" has high overhead that makes it undesirable for small memset sizes. the new code extends the minimal-branch fast path for short memsets from size 15 up to size 62, and shrink-wraps this code path. in addition, "rep stosl" is very sensitive to misalignment. the cost varies with size and with cpu model, but it has been observed performing 1.5 to 4 times slower when the destination address is not aligned mod 16. the new code thus ensures alignment mod 16, but also preserves any existing additional alignment, in case there are cpu models where it is beneficial. this version is based in part on changes to the x86_64 memset asm proposed by Denys Vlasenko.
2013-08-01optimized memset asm for i386 and x86_64Rich Felker1-0/+47
the concept of both versions is the same; they differ only in details. for long runs, they use "rep movsl" or "rep movsq", and for small runs, they use a trick, writing from both ends towards the middle, that reduces the number of branches needed. in addition, if memset is called multiple times with the same length, all branches will be predicted; there are no loops. for larger runs, there are likely faster approaches than "rep", at least on some cpu models. for 32-bit, it's unlikely that there is any faster approach that does not require non-baseline instructions; doing anything fancier would require inspecting cpu capabilities. for 64-bit, there may very well be faster versions that work on all models; further optimization could be explored in the future. with these changes, memset is anywhere between 50% faster and 6 times faster, depending on the cpu model and the length and alignment of the destination buffer.
2012-09-10asm for memmove on i386 and x86_64Rich Felker1-0/+21
for the sake of simplicity, I've only used rep movsb rather than breaking up the copy for using rep movsd/q. on all modern cpus, this seems to be fine, but if there are performance problems, there might be a need to go back and add support for rep movsd/q.
2012-08-11memcpy asm for i386 and x86_64Rich Felker1-0/+29