3.4 GHz Ryzen 5 slower to diagonalise large matrix than Intel i5-6300U 2.4 GHz

Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:

I was expecting that diagonalising these would be much quicker on my desktop than laptop, however there seems to be very little performance difference and in fact my more powerful desktop - every aspect of the system is significantly superior to my laptop - is about 15% slower.

The simplest code that produces the difference is:

AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]

This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.

Any idea what's going on and if there's anyway to solve it? I've read that Mathematica can be a slower on AMD chips than Intel but I seem to get faster performance in almost every task on my desktop than on my laptop... apart from this specific one.

Full specs:

Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)

Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)

Edit:
I notice that the CPU usage in Task Manager is slightly different - running at around 60-65% for the laptop but at 50% for the desktop. Is there perhaps a different implementation of Arnoldi for Intel that can take advantage of multiple cores? Laptop CPU usage Desktop CPU usage

edited 5 hours ago

asked 6 hours ago

Ram Problems

263

New contributor

$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago

$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago

$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago

add a comment |

Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:

The simplest code that produces the difference is:

AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]

This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.

Full specs:

Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)

Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)

edited 5 hours ago

asked 6 hours ago

Ram Problems

263

New contributor

$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago

$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago

$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago

add a comment |

Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:

The simplest code that produces the difference is:

AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]

This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.

Full specs:

Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)

Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)

edited 5 hours ago

asked 6 hours ago

Ram Problems

263

New contributor

Sorry that this is quite a specific question but I need to diagonalise large matrices for the problem I'm trying to solve and can't for the life of me work out what's going on:

The simplest code that produces the difference is:

AbsoluteTiming[Eigensystem[Table[RandomReal[{0, 1}], {i, 10000}, {j, 10000}],-10, Method -> "Arnoldi"];]

This takes on average about 64 seconds on my desktop but around 58 secs on my laptop. For the specific task I'm trying to solve that difference actually seems larger too.

Full specs:

Desktop: AMD Ryzen 5 2600 3.4 GHz 6-core (3.7 GHz Boost),
32 GB 3000 MT/s (2x16 GB) RAM, Intel EVO 970(R 3500, W 2200)

Laptop (Surface Book 1): Intel i5-6300U 2.4 GHz Dual Core (3.0 GHz Boost),
8 GB 1866 MT/s (2x4 GB) RAM, SSD (R 1500, W 600)

matrix linear-algebra system system-performance

edited 5 hours ago

asked 6 hours ago

Ram Problems

263

New contributor

edited 5 hours ago

asked 6 hours ago

Ram Problems

263

New contributor

edited 5 hours ago

asked 6 hours ago

Ram Problems

263

New contributor

asked 6 hours ago

Ram Problems

263

asked 6 hours ago

Ram Problems

263

New contributor

Ram Problems is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago

$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago

$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago

add a comment |

$begingroup$
I am just curious: Does this happen also for positive-definite input matrices?
$endgroup$
– Henrik Schumacher
5 hours ago

$begingroup$
Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.
$endgroup$
– Ram Problems
5 hours ago

$begingroup$
Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...
$endgroup$
– Henrik Schumacher
3 hours ago

I am just curious: Does this happen also for positive-definite input matrices?

– Henrik Schumacher
5 hours ago

Nope the specifics of the matrix seems to make little difference to the difference in timings between the computers.

– Ram Problems
5 hours ago

Hm. Okay. I've just read this morning that ARPACK++ implements its own data type for machine precision complex numbers; compared to FORTRAN's native complex data type, this data type is said to severe performance issues...

– Henrik Schumacher
3 hours ago

add a comment |

1 Answer
1

active

oldest

votes

Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.

Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...

Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...

Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.

edited 5 hours ago

answered 5 hours ago

Henrik Schumacher

51.4k469146

$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago

$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago

$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago

$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "387"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f190411%2f3-4-ghz-ryzen-5-slower-to-diagonalise-large-matrix-than-intel-i5-6300u-2-4-ghz%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.

Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...

Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...

Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.

edited 5 hours ago

answered 5 hours ago

Henrik Schumacher

51.4k469146

$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago

$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago

$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago

$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago

add a comment |

Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.

Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...

Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...

Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.

edited 5 hours ago

answered 5 hours ago

Henrik Schumacher

51.4k469146

$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago

$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago

$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago

$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago

add a comment |

Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.

Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...

Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...

Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.

edited 5 hours ago

answered 5 hours ago

Henrik Schumacher

51.4k469146

Here are some possible explanations that came to my mind. But after further thinking, none of those should really apply to OP's situation. However, I leave this for documentation reasons.

Probably Mathematica uses a non-parallelized implementation of
Anoldi's method. In single-core performance, these processors are not
that much different:
https://cpu.userbenchmark.com/Compare/Intel-Core-i5-6300U-vs-AMD-Ryzen-5-2600/m27864vs3955. But thinking of it: By Haswell Quad Core executes this Eigensystem at 400 % CPU speed (that's macOS' way of telling me that it runs on all cores without hyperthreading) and it only takes about 29 seconds...

Mathematica uses the Intel MKL for many numerical tasks and these libraries are best optimized for - guess what - Intel CPUs. Thinking of it, this is probably not the case for Arnoldi's method; I heard once that Mathematica uses ARPACK...

Arnoldi's method uses mostly matrix-vector multiplications. Those are usually far from operating at peak floating point performance in case of sparse arrays because they are somewhat memory bound (nearly random access of memory is required). For dense matrices however, these are usually highly optimized. So this is probably also not the reason for the Ryzen not profiting from its peak performance... -- unless the BLAS that Mathematica uses on your system is really shabby.

edited 5 hours ago

answered 5 hours ago

Henrik Schumacher

51.4k469146

edited 5 hours ago

answered 5 hours ago

Henrik Schumacher

51.4k469146

answered 5 hours ago

Henrik Schumacher

51.4k469146

answered 5 hours ago

Henrik Schumacher

51.4k469146

$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago

$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago

$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago

$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago

add a comment |

$begingroup$
Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?
$endgroup$
– Ram Problems
5 hours ago

$begingroup$
This is my CPU.
$endgroup$
– Henrik Schumacher
5 hours ago

$begingroup$
@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.
$endgroup$
– Peter Cordes
10 mins ago

$begingroup$
There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.
$endgroup$
– Peter Cordes
9 mins ago

Thanks for the response. It does seem it might be related to the Arnoldi algorithm. On both machines not specifying the algorithm slows the computation down overall but gives about 68 seconds for the Ryzen and 72 secs for the Intel (although does it just decide to chose Arnoldi anyway?) What speed is your Haswell?

– Ram Problems
5 hours ago

This is my CPU.

– Henrik Schumacher
5 hours ago

@RamProblems: Intel Haswell and Skylake (i.e. your i3-6300U) have twice the raw FMA throughput per core per clock vs. Ryzen. (2x 256-bit vector FMA units, vs. 2x 128-bit in Ryzen). If this workload is bottlenecking on ALU throughput like a dense matmul with efficient cache-blocking would, then that explains a good fraction of the difference. See Why is an AMD Ryzen 2700x 2x slower than a 3-year-old laptop Intel i7-6820HQ with Python? (SKL quad-core vs. Zen 8-core) where FMA throughput explains ~half of the difference.

– Peter Cordes
10 mins ago

There are also differences in memory / cache, but those don't obviously favour the dual-core Skylake CPU. Ryzen uses clusters of 4 cores sharing an L3 cache (with higher inter-core latency across clusters than within one cluster), so maybe inter-core data movement is a problem too, using all 6 cores. SKL does have excellent L1d and L2 cache bandwidth, though, and pretty good bandwidth to its shared L3. It's still a surprising result, and yeah might be due to software that's much better tuned for Intel, possibly with some cripple-AMD thrown in for good measure in Intel's libraries.

– Peter Cordes
9 mins ago

add a comment |

Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Ram Problems is a new contributor. Be nice, and check out our Code of Conduct.

Thanks for contributing an answer to Mathematica Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Htydjtk