Opened 8 years ago

Closed 8 years ago

#15045 closed defect (fixed)

ATLAS: multiple definition of `ATL_SetAtomicCount'

Reported by: jdemeyer Owned by:
Priority: blocker Milestone: sage-5.12
Component: packages: standard Keywords:
Cc: jpflori Merged in: sage-5.12.beta5
Authors: Volker Braun Reviewers: Nils Bruin
Report Upstream: Reported upstream. No feedback yet. Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by vbraun)

This non-reproducible problem which can occur during the ATLAS build (with atlas-3.10.1.p3.spkg from #14754) was supposed to be fixed but it actually is not fixed:

make[5]: Entering directory `/home/buildbot/build/sage/lena-1/lena_full/build/sage-5.12.beta1/spkg/build/atlas-3.10.1.p3/ATLAS-build/lib'
gcc -fPIC -m64 -shared -o libsatlas.so  \
           -Wl,"rpath-link /home/buildbot/build/sage/lena-1/lena_full/build/sage-5.12.beta1/local/lib" \
           -Wl,--whole-archive liblapack.a libf77blas.a libcblas.a libatlas.a -Wl,--no-whole-archive -L/usr/local/gcc-4.7.0/x86_64-Linux-k10-fc/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../lib64 -lgfortran  -lc -lpthread -lm -lgcc
/usr/local/binutils-2.22/x86_64-Linux-k10-fc-gcc-4.6.2-rh/bin/ld: cannot find rpath-link /home/buildbot/build/sage/lena-1/lena_full/build/sage-5.12.beta1/local/lib: No such file or directory
libatlas.a(ATL_SetAtomicCount_mut.o): In function `ATL_SetAtomicCount':
ATL_SetAtomicCount_mut.c:(.text+0x0): multiple definition of `ATL_SetAtomicCount'
libatlas.a(ATL_SetAtomicCount_arch.o):ATL_SetAtomicCount_arch.c:(.text+0x0): first defined here
libatlas.a(ATL_ResetAtomicCount_mut.o): In function `ATL_ResetAtomicCount':
ATL_ResetAtomicCount_mut.c:(.text+0x0): multiple definition of `ATL_ResetAtomicCount'
libatlas.a(ATL_ResetAtomicCount_amd64.o):(.text+0x0): first defined here
libatlas.a(ATL_DecAtomicCount_mut.o): In function `ATL_DecAtomicCount':
ATL_DecAtomicCount_mut.c:(.text+0x0): multiple definition of `ATL_DecAtomicCount'
libatlas.a(ATL_DecAtomicCount_amd64.o):(.text+0x0): first defined here
libatlas.a(ATL_FreeAtomicCount_mut.o): In function `ATL_FreeAtomicCount':
ATL_FreeAtomicCount_mut.c:(.text+0x0): multiple definition of `ATL_FreeAtomicCount'
libatlas.a(ATL_FreeAtomicCount_arch.o):ATL_FreeAtomicCount_arch.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status
make[5]: *** [GCCTRY] Error 1

full ATLAS build log

Upstream: http://sourceforge.net/p/math-atlas/support-requests/907/

New spkg: http://boxen.math.washington.edu/home/vbraun/spkg/atlas-3.10.1.p5.spkg

Attachments (2)

atlas-p3-p4.diff (1.5 KB) - added by vbraun 8 years ago.
diff for review only
atlas-p4-p5.diff (2.6 KB) - added by vbraun 8 years ago.
diff for review only

Download all attachments as: .zip

Change History (34)

comment:2 follow-ups: Changed 8 years ago by vbraun

  • Authors set to Volker Braun
  • Description modified (diff)
  • Status changed from new to needs_review

Its probably not something that is reproducable on modern hardware, so progress upstream has been slow.

For now, I propose we just plow ahead and fall back to static libraries.

Changed 8 years ago by vbraun

diff for review only

comment:3 Changed 8 years ago by jdemeyer

  • Description modified (diff)
  • Report Upstream changed from N/A to Reported upstream. No feedback yet.

comment:4 in reply to: ↑ 2 Changed 8 years ago by jdemeyer

Replying to vbraun:

Its probably not something that is reproducable on modern hardware

2009 might not meet your definition of "modern", but it's far from ancient either.

comment:5 in reply to: ↑ 2 Changed 8 years ago by jdemeyer

Replying to vbraun:

For now, I propose we just plow ahead and fall back to static libraries.

If static libraries always work, why don't we always only use static libraries?

Conversely, if static libraries don't always work, wouldn't this "fall back" cause other problems?

comment:6 follow-up: Changed 8 years ago by vbraun

Because static libraries suck.

It won't cause other problems for now. If we want to use the upstream shared libraries (with a different naming convention than the static libraries) then we'll run into troubles. But by then we hopefully have found a way to not hardcode atlas/blas/lapack library names in the Sage build system.

comment:7 in reply to: ↑ 6 Changed 8 years ago by jdemeyer

It still feels bad to build the shared libraries "sometimes" depending on some non-deterministic condition. If static libraries work (even if they don't work so well), I think it's better to stick with them in all cases.

Last edited 8 years ago by jdemeyer (previous) (diff)

comment:8 follow-up: Changed 8 years ago by vbraun

I would agree with you if we would want to stick with a hard-coded ATLAS in Sage until the end of time. But IMHO we should work towards a more configurable solution that will just fall back to reference / openblas as appropriate if atlas doesn't build. And perhaps not build atlas by default if it takes too long. And then static libraries would be a huge pain in the butt.

comment:9 in reply to: ↑ 8 Changed 8 years ago by jdemeyer

Replying to vbraun:

But IMHO we should work towards a more configurable solution that will just fall back to reference / openblas as appropriate if atlas doesn't build. And perhaps not build atlas by default if it takes too long. And then static libraries would be a huge pain in the butt.

If you want to work towards that goal, then the problem on this ticket must be fixed in a proper way anyway. I don't see how using shared libraries "sometimes" is closer to the stated goal than "never" using shared libraries.

I am particularly bothered that the choice is non-deterministic and difficult to predict, so Sage builds on the same machine will be substantially different (one with shared ATLAS and one with static ATLAS).

comment:10 follow-up: Changed 8 years ago by vbraun

Once we have an alternative to ATLAS that we can use, we disable the static library fallback for ATLAS. Its easy. Until then, we have this crutch. I agree that its ugly as sin but not shipping the new ATLAS would be a mistake imho.

I just don't want to work on changing the blas build system in Sage until we have switched to git. It'll be much easier if I don't have to copy tarballs around just for a change to the build script.

comment:11 in reply to: ↑ 10 ; follow-up: Changed 8 years ago by jdemeyer

Replying to vbraun:

Once we have an alternative to ATLAS that we can use, we disable the static library fallback for ATLAS. Its easy. Until then, we have this crutch. I agree that its ugly as sin but not shipping the new ATLAS would be a mistake imho.

I am not saying that we shouldn't ship the new ATLAS. I am proposing to always use the static ATLAS libraries always instead of only when the linker problem appears.

comment:12 in reply to: ↑ 11 ; follow-ups: Changed 8 years ago by vbraun

Replying to jdemeyer:

I am not saying that we shouldn't ship the new ATLAS. I am proposing to always use the static ATLAS libraries always instead of only when the linker problem appears.

I think that just means more changes to undo later. And mind you the shared libraries build apparently fine on most systems, its just some AMD platforms where presumably the performance impact of holding mutexes is different.

comment:13 in reply to: ↑ 12 Changed 8 years ago by jdemeyer

Replying to vbraun:

I think that just means more changes to undo later.

Really? If you comment out the current approach, then it's essentially just removing # characters. That together with hg or git should work easily.

comment:14 in reply to: ↑ 12 Changed 8 years ago by jdemeyer

Replying to vbraun:

And mind you the shared libraries build apparently fine on most systems, its just some AMD platforms where presumably the performance impact of holding mutexes is different.

So far, Sage has worked equally well for all CPUs of a given architecture, let's keep it that way.

Last edited 8 years ago by jdemeyer (previous) (diff)

comment:15 follow-up: Changed 8 years ago by vbraun

As long as the library names are the same (which is currently the case) it does not make a difference for the build system if the library is shared or static. Deliberately disabling shared libraries would just mean to make it suck for everybody, and not just for the small percentage where its the only option. In other words, I'm against it ;-)

comment:16 in reply to: ↑ 15 Changed 8 years ago by jdemeyer

Replying to vbraun:

In other words, I'm against it ;-)

And I'm against more special cases (especially non-deterministic) in the Sage build system ;-)

comment:17 follow-up: Changed 8 years ago by vbraun

Noted, but the ATLAS spkg has always been trying different ways to build and falling back if they fail. Which always sucked, but the problem is that we don't have a deterministic alternative to deploy.

comment:18 in reply to: ↑ 17 Changed 8 years ago by jdemeyer

  • Status changed from needs_review to needs_work

Replying to vbraun:

Noted, but the ATLAS spkg has always been trying different ways to build and falling back if they fail.

Those "different ways" only were about tuning parameters and CPU instruction sets, right? Which is far less fundamental than static vs. dynamic libraries.

Imagine we go with your solution and in the future somebody decides to change the Sage build system or some package such that it only works for a dynamic ATLAS library. Initial testing might reveal that everything works, even though it doesn't work in case your ATLAS spkg decides to use static libraries. This is what I want to avoid.

comment:19 follow-up: Changed 8 years ago by vbraun

I disagree. Pretty much the only way to achieve your scenario would be to hardcode the shared library file name somewhere. But those already differ on Linux vs. OSX.

What I'm mainly objecting to is the basic premise that we should keep this monstrosity alive for any longer than necessary. Things have to be fixed either in ATLAS or by fixing the Sage BLAS build system. As soon as that is done, we can switch off the static library fallback again.

comment:20 in reply to: ↑ 19 Changed 8 years ago by jdemeyer

Replying to vbraun:

Pretty much the only way to achieve your scenario would be to hardcode the shared library file name somewhere.

If there is no observable difference between shared and static libraries, then why do you care so much about shared libraries?

What I'm mainly objecting to is the basic premise that we should keep this monstrosity alive for any longer than necessary.

The problem is that it needs to be kept alive in any case precisely because of this ticket.

comment:21 Changed 8 years ago by vbraun

There is no difference at the linker command line. There is of course a price to be paid later in that you can never benefit from blas/atlas upgrades if you have linked statically.

This ticket is just a workaround until we have a better fix. Unless you think we don't need a workaround for now, in which case I'm happy to wait for either upstream to fix it or we have switched to git...

comment:22 Changed 8 years ago by jdemeyer

I asked sage-devel about what to do, since we're clearly not going to agree...

comment:23 follow-up: Changed 8 years ago by nbruin

For atlas issue 907, the relevant code would be:

ATLAS/tune/threads/tune_count.c:242

      if (tmut < tldec*1.02)
      {
         printf("\nNO REAL ADVANTAGE TO ASSEMBLY, FORCING USE OF MUTEX\n");
         ATL_assert(!system("make iForceUseMutex"));
      }

Looking at ATLAS/makes/Make.ttune:159

iFind_atomic_arch :
        if $(MAKE) xprobe_atomic_amd64 ; then \
           cp $(myTHRdir)/ATL_*AtomicCount_arch.c . ; \
           cp $(myTHRdir)/ATL_ResetAtomicCount_amd64.S \
              ATL_ResetAtomicCount_arch.S ; \
           cp $(myTHRdir)/ATL_DecAtomicCount_amd64.S \
              ATL_DecAtomicCount_arch.S ; \
           rm $(BLDdir)/src/threads/atomic.inc ; \
           echo "aobj = ATL_SetAtomicCount_arch.o ATL_ResetAtomicCount_amd64.o ATL_DecAtomicCount_amd64.o ATL_FreeAtomicCount_arch.o" > $(BLDdir)/src/threads/atomic.inc ;
           .... lines elided ....
        fi
        - rm -f $(BLDdir)/src/threads/lib.grd
        $(MAKE) tlib

ATLAS/makes/Make.ttune:205

iForceUseMutex:
        rm $(BLDdir)/src/threads/atomic.inc
        echo "aobj = ATL_SetAtomicCount_mut.o ATL_ResetAtomicCount_mut.o ATL_DecAtomicCount_mut.o ATL_FreeAtomicCount_mut.o" > $(BLDdir)/src/threads/atomic.inc

The file atomic.inc is only touched in Make.ttune, always in the style above. The file is only used in ATLAS/makes/Make.thr, where it is included to get aobj defined, which is then included in obj to build a library.

I suspect the problem is that make iForceUseMutex doesn't actually rebuild the library, which does happen in make iFind_atomic_arch. So, when we do

iTune_atomic :
        if $(MAKE) mutcnt ; then \
           if $(MAKE) iFind_atomic_arch ; then \
              $(MAKE) xtune_count ; ./xtune_count -r 1000000 -o yes; \
           fi ; \
        else \
           $(MAKE) iFind_atomic_arch ; \
        fi

hich presumably runs the code in tune_count.c, we end up with a built library that has the architecture-specific bits in it, but an aobj that reflects otherwise. Probably, the make system runs into these things another time and then tries to include the same definitions another time in the library.

Perhaps just patch the system?

ATLAS/tune/threads/tune_count.c:244
-         printf("\nNO REAL ADVANTAGE TO ASSEMBLY, FORCING USE OF MUTEX\n");
-         ATL_assert(!system("make iForceUseMutex"));
+         printf("\nNO REAL ADVANTAGE TO ASSEMBLY, BUT WE LEAVE IT IN ANYWAY\n");

I think there is some hope that this will eliminate the (rare) build fails.

comment:24 Changed 8 years ago by jpflori

I definitely cannot work on that at the moment but this is reallyt reminiscent of http://trac.sagemath.org/ticket/10508#comment:455

Ok, I just read the ticket description, and our zorkaround was working because of our custoñ way of building shared libraries.

Now that we also build the upstream shared libraries, our workaround is not enough anymore....

comment:25 Changed 8 years ago by jpflori

From what I remember the main problem is that ATLAS builds a (static) lib for tuning, and then just update it to get the final (static) library so it ends up including two objects files with the same symbols (because of the atoñic.inc ñagic which does notinclude the same pieces of code in tyhe tuning and in the final phases, and because old object files of the tuning phase are still lurking around when the final phase is going on).

It presumably does so because updating the static archive is faster than rebuilding everything from scratch.

So another solution would be to erase the static lib built for tuning (and the corresponding object files) when the tuning phase is finished and let everything be built again when the real building phase starts.

comment:26 Changed 8 years ago by jpflori

See http://trac.sagemath.org/ticket/10508#comment:446 which might be clearer than the above :)

comment:27 in reply to: ↑ 23 Changed 8 years ago by nbruin

Replying to nbruin:

ATLAS/tune/threads/tune_count.c:242

      if (tmut < tldec*1.02)
      {
         printf("\nNO REAL ADVANTAGE TO ASSEMBLY, FORCING USE OF MUTEX\n");
         ATL_assert(!system("make iForceUseMutex"));
      }

So would it be a plan to

  • check that this check is the culprit of (at least some) failure by patching it to be tmut < tldec*1000 and verifying that on relevant hardware ALTAS now reliably fails to build
  • patch atlas to NOT run make iForceUseMutex at this point and ship with that. Perhaps patch it as:
    printf("\nASSEMBLY/MUTEX ratio is %d, but we'll stick with assembly anyway\n",tldec/tmut);
    

so that our logfiles at least tells us what the performance penalty is (well, there's none because ATLAS doesn't build dynamic libs without).

comment:28 Changed 8 years ago by vbraun

  • Description modified (diff)
  • Status changed from needs_work to needs_review

I've made the change to force iForceUseMutex and that indeed reproduces the bug reliably. I made the change that Nils suggested and always use the assembly version (never iForceUseMutex), this fixes it for me.

Last edited 8 years ago by vbraun (previous) (diff)

Changed 8 years ago by vbraun

diff for review only

comment:29 Changed 8 years ago by nbruin

Congratulations on getting this fixed.

Just for the record: I think the mutex code can still get built on platforms where no assembly alternative is available. In that case, I don't think we should particularly expect conflicting symbols either, because there are no other implementations available that the mutex code can conflict with in that case.

This is why I figured it was best to patch the code that takes action depending on the tuning result and not the target in the makefile.

The comment in the SPKG.txt doesn't strictly contradict this, but I had to read the comment a second time to convince myself of that.

comment:30 follow-up: Changed 8 years ago by jdemeyer

Nils: do you want to formally review this ticket?

comment:31 in reply to: ↑ 30 Changed 8 years ago by nbruin

  • Reviewers set to Nils Bruin
  • Status changed from needs_review to positive_review

I have no idea what p4 is about but the change from p4 to p5 as posted on this ticket looks pretty reasonable. I haven't built or downloaded this new package, but other people have and it's a blocker, so a positive review helps things along.

comment:32 Changed 8 years ago by jdemeyer

  • Merged in set to sage-5.12.beta5
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.