3.4 GHz Ryzen 5 slower to diagonalise large matrix than Intel i5-6300U 2.4 GHz












5












$begingroup$


Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:



I was expecting that diagonalising these would be much quicker on my desktop than laptop, however there seems to be very little performance difference and in fact my more powerful desktop - every aspect of the system is significantly superior to my laptop - is about 15% slower.



The simplest code that produces the difference is:



AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]


This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.



Any idea what's going on and if there's anyway to solve it? I've read that Mathematica can be a slower on AMD chips than Intel but I seem to get faster performance in almost every task on my desktop than on my laptop... apart from this specific one.



Full specs:



Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)



Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)



Edit:
I notice that the CPU usage in Task Manager is slightly different - running at around 60-65% for the laptop but at 50% for the desktop. Is there perhaps a different implementation of Arnoldi for Intel that can take advantage of multiple cores? Laptop CPU usageDesktop CPU usage










share|improve this question









New contributor




Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$












  • $begingroup$
    I am just curious: Does this happen also for positive-definite input matrices?
    $endgroup$
    – Henrik Schumacher
    5 hours ago










  • $begingroup$
    Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
    $endgroup$
    – Ram Problems
    5 hours ago










  • $begingroup$
    Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
    $endgroup$
    – Henrik Schumacher
    3 hours ago
















5












$begingroup$


Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:



I was expecting that diagonalising these would be much quicker on my desktop than laptop, however there seems to be very little performance difference and in fact my more powerful desktop - every aspect of the system is significantly superior to my laptop - is about 15% slower.



The simplest code that produces the difference is:



AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]


This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.



Any idea what's going on and if there's anyway to solve it? I've read that Mathematica can be a slower on AMD chips than Intel but I seem to get faster performance in almost every task on my desktop than on my laptop... apart from this specific one.



Full specs:



Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)



Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)



Edit:
I notice that the CPU usage in Task Manager is slightly different - running at around 60-65% for the laptop but at 50% for the desktop. Is there perhaps a different implementation of Arnoldi for Intel that can take advantage of multiple cores? Laptop CPU usageDesktop CPU usage










share|improve this question









New contributor




Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$












  • $begingroup$
    I am just curious: Does this happen also for positive-definite input matrices?
    $endgroup$
    – Henrik Schumacher
    5 hours ago










  • $begingroup$
    Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
    $endgroup$
    – Ram Problems
    5 hours ago










  • $begingroup$
    Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
    $endgroup$
    – Henrik Schumacher
    3 hours ago














5












5








5


1



$begingroup$


Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:



I was expecting that diagonalising these would be much quicker on my desktop than laptop, however there seems to be very little performance difference and in fact my more powerful desktop - every aspect of the system is significantly superior to my laptop - is about 15% slower.



The simplest code that produces the difference is:



AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]


This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.



Any idea what's going on and if there's anyway to solve it? I've read that Mathematica can be a slower on AMD chips than Intel but I seem to get faster performance in almost every task on my desktop than on my laptop... apart from this specific one.



Full specs:



Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)



Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)



Edit:
I notice that the CPU usage in Task Manager is slightly different - running at around 60-65% for the laptop but at 50% for the desktop. Is there perhaps a different implementation of Arnoldi for Intel that can take advantage of multiple cores? Laptop CPU usageDesktop CPU usage










share|improve this question









New contributor




Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$




Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:



I was expecting that diagonalising these would be much quicker on my desktop than laptop, however there seems to be very little performance difference and in fact my more powerful desktop - every aspect of the system is significantly superior to my laptop - is about 15% slower.



The simplest code that produces the difference is:



AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]


This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.



Any idea what's going on and if there's anyway to solve it? I've read that Mathematica can be a slower on AMD chips than Intel but I seem to get faster performance in almost every task on my desktop than on my laptop... apart from this specific one.



Full specs:



Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)



Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)



Edit:
I notice that the CPU usage in Task Manager is slightly different - running at around 60-65% for the laptop but at 50% for the desktop. Is there perhaps a different implementation of Arnoldi for Intel that can take advantage of multiple cores? Laptop CPU usageDesktop CPU usage







matrix linear-algebra system system-performance






share|improve this question









New contributor




Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|improve this question









New contributor




Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this question




share|improve this question








edited 5 hours ago







Ram Problems













New contributor




Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked 6 hours ago









Ram ProblemsRam Problems

263




263




New contributor




Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.












  • $begingroup$
    I am just curious: Does this happen also for positive-definite input matrices?
    $endgroup$
    – Henrik Schumacher
    5 hours ago










  • $begingroup$
    Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
    $endgroup$
    – Ram Problems
    5 hours ago










  • $begingroup$
    Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
    $endgroup$
    – Henrik Schumacher
    3 hours ago


















  • $begingroup$
    I am just curious: Does this happen also for positive-definite input matrices?
    $endgroup$
    – Henrik Schumacher
    5 hours ago










  • $begingroup$
    Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
    $endgroup$
    – Ram Problems
    5 hours ago










  • $begingroup$
    Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
    $endgroup$
    – Henrik Schumacher
    3 hours ago
















$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago




$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago












$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago




$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago












$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago




$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago










1 Answer
1






active

oldest

votes


















4












$begingroup$

Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.




  • Probably Mathematica uses a non-parallelized implementation of
    Anoldi's method. In single-core performance, these processors are not
    that much different:
    https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...


  • Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...


  • Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.







share|improve this answer











$endgroup$













  • $begingroup$
    Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
    $endgroup$
    – Ram Problems
    5 hours ago










  • $begingroup$
    This is my CPU.
    $endgroup$
    – Henrik Schumacher
    5 hours ago










  • $begingroup$
    @RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
    $endgroup$
    – Peter Cordes
    10 mins ago












  • $begingroup$
    There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
    $endgroup$
    – Peter Cordes
    9 mins ago













Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "387"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});






Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.










draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f190411%2f3-4-ghz-ryzen-5-slower-to-diagonalise-large-matrix-than-intel-i5-6300u-2-4-ghz%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









4












$begingroup$

Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.




  • Probably Mathematica uses a non-parallelized implementation of
    Anoldi's method. In single-core performance, these processors are not
    that much different:
    https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...


  • Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...


  • Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.







share|improve this answer











$endgroup$













  • $begingroup$
    Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
    $endgroup$
    – Ram Problems
    5 hours ago










  • $begingroup$
    This is my CPU.
    $endgroup$
    – Henrik Schumacher
    5 hours ago










  • $begingroup$
    @RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
    $endgroup$
    – Peter Cordes
    10 mins ago












  • $begingroup$
    There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
    $endgroup$
    – Peter Cordes
    9 mins ago


















4












$begingroup$

Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.




  • Probably Mathematica uses a non-parallelized implementation of
    Anoldi's method. In single-core performance, these processors are not
    that much different:
    https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...


  • Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...


  • Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.







share|improve this answer











$endgroup$













  • $begingroup$
    Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
    $endgroup$
    – Ram Problems
    5 hours ago










  • $begingroup$
    This is my CPU.
    $endgroup$
    – Henrik Schumacher
    5 hours ago










  • $begingroup$
    @RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
    $endgroup$
    – Peter Cordes
    10 mins ago












  • $begingroup$
    There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
    $endgroup$
    – Peter Cordes
    9 mins ago
















4












4








4





$begingroup$

Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.




  • Probably Mathematica uses a non-parallelized implementation of
    Anoldi's method. In single-core performance, these processors are not
    that much different:
    https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...


  • Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...


  • Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.







share|improve this answer











$endgroup$



Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.




  • Probably Mathematica uses a non-parallelized implementation of
    Anoldi's method. In single-core performance, these processors are not
    that much different:
    https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...


  • Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...


  • Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.








share|improve this answer














share|improve this answer



share|improve this answer








edited 5 hours ago

























answered 5 hours ago









Henrik SchumacherHenrik Schumacher

51.4k469146




51.4k469146












  • $begingroup$
    Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
    $endgroup$
    – Ram Problems
    5 hours ago










  • $begingroup$
    This is my CPU.
    $endgroup$
    – Henrik Schumacher
    5 hours ago










  • $begingroup$
    @RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
    $endgroup$
    – Peter Cordes
    10 mins ago












  • $begingroup$
    There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
    $endgroup$
    – Peter Cordes
    9 mins ago




















  • $begingroup$
    Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
    $endgroup$
    – Ram Problems
    5 hours ago










  • $begingroup$
    This is my CPU.
    $endgroup$
    – Henrik Schumacher
    5 hours ago










  • $begingroup$
    @RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
    $endgroup$
    – Peter Cordes
    10 mins ago












  • $begingroup$
    There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
    $endgroup$
    – Peter Cordes
    9 mins ago


















$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago




$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago












$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago




$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago












$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago






$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago














$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago






$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago












Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.










draft saved

draft discarded


















Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.













Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.












Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.
















Thanks for contributing an answer to Mathematica Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f190411%2f3-4-ghz-ryzen-5-slower-to-diagonalise-large-matrix-than-intel-i5-6300u-2-4-ghz%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Callistus I

Tabula Rosettana

How to label and detect the document text images