3.4 GHz Ryzen 5 slower to diagonalise large matrix than Intel i5-6300U 2.4 GHz
$begingroup$
Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:
I was expecting that diagonalising these would be much quicker on my desktop than laptop, however there seems to be very little performance difference and in fact my more powerful desktop - every aspect of the system is significantly superior to my laptop - is about 15% slower.
The simplest code that produces the difference is:
AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]
This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.
Any idea what's going on and if there's anyway to solve it? I've read that Mathematica can be a slower on AMD chips than Intel but I seem to get faster performance in almost every task on my desktop than on my laptop... apart from this specific one.
Full specs:
Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)
Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)
Edit:
I notice that the CPU usage in Task Manager is slightly different - running at around 60-65% for the laptop but at 50% for the desktop. Is there perhaps a different implementation of Arnoldi for Intel that can take advantage of multiple cores? 

matrix linear-algebra system system-performance
New contributor
Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
add a comment |
$begingroup$
Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:
I was expecting that diagonalising these would be much quicker on my desktop than laptop, however there seems to be very little performance difference and in fact my more powerful desktop - every aspect of the system is significantly superior to my laptop - is about 15% slower.
The simplest code that produces the difference is:
AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]
This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.
Any idea what's going on and if there's anyway to solve it? I've read that Mathematica can be a slower on AMD chips than Intel but I seem to get faster performance in almost every task on my desktop than on my laptop... apart from this specific one.
Full specs:
Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)
Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)
Edit:
I notice that the CPU usage in Task Manager is slightly different - running at around 60-65% for the laptop but at 50% for the desktop. Is there perhaps a different implementation of Arnoldi for Intel that can take advantage of multiple cores? 

matrix linear-algebra system system-performance
New contributor
Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago
add a comment |
$begingroup$
Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:
I was expecting that diagonalising these would be much quicker on my desktop than laptop, however there seems to be very little performance difference and in fact my more powerful desktop - every aspect of the system is significantly superior to my laptop - is about 15% slower.
The simplest code that produces the difference is:
AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]
This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.
Any idea what's going on and if there's anyway to solve it? I've read that Mathematica can be a slower on AMD chips than Intel but I seem to get faster performance in almost every task on my desktop than on my laptop... apart from this specific one.
Full specs:
Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)
Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)
Edit:
I notice that the CPU usage in Task Manager is slightly different - running at around 60-65% for the laptop but at 50% for the desktop. Is there perhaps a different implementation of Arnoldi for Intel that can take advantage of multiple cores? 

matrix linear-algebra system system-performance
New contributor
Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:
I was expecting that diagonalising these would be much quicker on my desktop than laptop, however there seems to be very little performance difference and in fact my more powerful desktop - every aspect of the system is significantly superior to my laptop - is about 15% slower.
The simplest code that produces the difference is:
AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]
This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.
Any idea what's going on and if there's anyway to solve it? I've read that Mathematica can be a slower on AMD chips than Intel but I seem to get faster performance in almost every task on my desktop than on my laptop... apart from this specific one.
Full specs:
Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)
Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)
Edit:
I notice that the CPU usage in Task Manager is slightly different - running at around 60-65% for the laptop but at 50% for the desktop. Is there perhaps a different implementation of Arnoldi for Intel that can take advantage of multiple cores? 

matrix linear-algebra system system-performance
matrix linear-algebra system system-performance
New contributor
Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
New contributor
Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
edited 5 hours ago
Ram Problems
New contributor
Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
asked 6 hours ago
Ram ProblemsRam Problems
263
263
New contributor
Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
New contributor
Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago
add a comment |
$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago
$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago
$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago
add a comment |
1 Answer
1
active
oldest
votes
$begingroup$
Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.
Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes thisEigensystemat 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...
Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.
$endgroup$
$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago
$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "387"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f190411%2f3-4-ghz-ryzen-5-slower-to-diagonalise-large-matrix-than-intel-i5-6300u-2-4-ghz%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.
Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes thisEigensystemat 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...
Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.
$endgroup$
$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago
$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago
add a comment |
$begingroup$
Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.
Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes thisEigensystemat 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...
Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.
$endgroup$
$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago
$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago
add a comment |
$begingroup$
Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.
Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes thisEigensystemat 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...
Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.
$endgroup$
Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.
Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes thisEigensystemat 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...
Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.
edited 5 hours ago
answered 5 hours ago
Henrik SchumacherHenrik Schumacher
51.4k469146
51.4k469146
$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago
$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago
add a comment |
$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago
$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago
$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago
$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago
$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago
$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago
add a comment |
Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.
Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.
Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.
Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Mathematica Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f190411%2f3-4-ghz-ryzen-5-slower-to-diagonalise-large-matrix-than-intel-i5-6300u-2-4-ghz%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago
$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago
$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago